RELATIONAL MANAGEMENT and DISPLAY of SITE ENVIRONMENTAL DATA

David W. Rich, Ph.D.
LEWIS PUBLISHERS A CRC Press Company Boca Raton London New York Washington, D.C.
Library of Congress Cataloging-in-Publication Data

Rich, David William, 1952–
  Relational management and display of site environmental data / David W. Rich.
    p. cm.
  Includes bibliographical references and index.
  ISBN 1-56670-591-6 (alk. paper)
  1. Pollution—Measurement—Data processing. 2. Environmental monitoring—Data processing. 3. Database management. I. Title.
  TD193 .R53 2002
  628.5′028′7—dc21
2002019441
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher. The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying. Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.
Visit the CRC Press Web site at www.crcpress.com © 2002 by CRC Press LLC Lewis Publishers is an imprint of CRC Press LLC No claim to original U.S. Government works International Standard Book Number 1-56670-591-6 Library of Congress Card Number 2002019441 Printed in the United States of America 1 2 3 4 5 6 7 8 9 0 Printed on acid-free paper
PREFACE
The environmental industry is changing, along with the way it manages data. Many projects are making a transition from investigation through remediation to ongoing monitoring. Data management is evolving from individual custom systems for each project to standardized, centralized databases, and many organizations are starting to realize the cost savings of this approach. The objective of Relational Management and Display of Site Environmental Data is to bring together in one place the information necessary to manage the data well, so everyone, from students to project managers, can learn how to benefit from better data management.

This book has come from many sources. It started out as a set of course notes to help transfer knowledge about earth science computing, and especially environmental data management, to our clients as part of our software and consulting practice. While it is still used for that purpose, it has evolved into a synthesis of theory and experience in working with site environmental data. It is not intended to be the last word on the way things are or should be done, but rather to help people learn from the experience of others, and avoid mistakes whenever possible.

The book has six main parts plus appendices. Part One provides an overview of the subject and some general concepts, including a discussion of system data content. Part Two covers system design and implementation, including database elements, user interface issues, and implementation and operation of the system. Part Three addresses gathering the data, starting with an overview of site investigation and remediation, progressing through gathering samples in the field, and ending with laboratory analysis. Part Four covers the data management process, including importing, editing, maintaining data quality, and managing multiple projects. Part Five is about using the data once it is in the database. It starts with selecting data, and then covers various aspects of data output and analysis including reporting and display; graphs; cross sections and similar displays; a large chapter on mapping and GIS; statistical analysis; and integration with other programs. Part Six discusses problems, benefits, and successes with implementing a site environmental data management system, along with an attempt to look into the future of data management and environmental projects. Appendices include examples of a needs assessment, a data model, a data transfer standard, typical constituent parameters, some exercises, a glossary, and a bibliography.

A number of people have contributed directly and indirectly to this book, including my parents, Dr. Robert and Audrey Rich; Dr. William Fairley, my uncle and professor of geology at the University of Notre Dame; and Dr. Albert Carozzi, my advisor and friend at the University of Illinois. Numerous coworkers and friends at Texaco, Inc., Shell Oil Company, Sabine Corporation, Grant Environmental, and Geotech Computer Systems, Inc. helped bring me to the point professionally where I could write this book. These include Larry Ratliff, Jim Thomson, Dr. James L. Grant, Neil Geitner, Steve Wampler, Jim Quin, Cathryn Stewart, Bill Thoen, Judy Mitchell, Dr. Mike Wiley, and other Geotech staff members who helped with the book in various ways. Friends in other organizations have also helped me greatly in this process, including Jim Reed of RockWare, Tom Bresnahan of Golden Software, and other early members of the Computer Oriented Geological Society. Thanks also go to Dr. William Ganus, Roy Widmann, Sherron Hendricks, and Frank Schultz of Kerr-McGee for their guidance. I would also like to specifically thank those who reviewed all or part of the book, including Cathryn Stewart (AquAeTer), Bill Thoen (GISNet), Mike Keester (Oklahoma State University), Bill Ganus and Roy Widmann (Kerr-McGee), Mike Wiley (The Consulting Operation), and Sue Stefanosky and Steve Clough (Roy F. Weston). The improvements are theirs. The errors are still mine.

Finally, my wife, business partner, and best friend, Toni Rich, has supported me throughout my career, hanging in there through the good times and bad, and has always done what she could to make our enterprise successful. She’s also a great proofreader.

Throughout this book a number of trademarks and registered trademarks are used. The registered trademarks are registered in the United States, and may be registered in other countries. Any omissions are unintentional and will be remedied in later editions. Enviro Data and Spase are registered trademarks of Geotech Computer Systems, Incorporated. Microsoft, Office, Windows, NT, Access, SQL Server, Visual Basic, Excel, and FoxPro are trademarks or registered trademarks of Microsoft Corporation. Oracle is a registered trademark of Oracle Corporation. Paradox and dBase are registered trademarks of Borland International, Incorporated. IBM and DB2 are registered trademarks of International Business Machines Corporation. AutoCAD and AutoCAD Map are registered trademarks of Autodesk, Incorporated. ArcView is a registered trademark of Environmental Systems Research Institute, Incorporated. Norton Ghost is a trademark of Symantec Corporation. Apple and Macintosh are registered trademarks of Apple Computer, Incorporated. Sun is a registered trademark and Sparcstation is a trademark of Sun Microsystems. Capability Maturity Model and CMM are registered trademarks of The Software Engineering Institute of Carnegie Mellon University. Adobe and Acrobat are registered trademarks of Adobe Systems. Grapher is a trademark and Surfer is a registered trademark of Golden Software, Inc. RockWare is a registered trademark and RockWorks and Gridzo are trademarks of RockWare, Inc. Intergraph and GeoMedia are trademarks of Intergraph Corporation. Corel is a trademark and Corel Draw is a registered trademark of Corel Corporation. UNIX is a registered trademark of The Open Group. Linux is a trademark of Linus Torvalds. Use of these products is for illustration only, and does not signify endorsement by the author.

A Web site has been established for updates, exercises, and other information related to this book. It is located at www.geotech.com/relman. I welcome your comments and questions. I can be reached by email at [email protected].

David W. Rich
AUTHOR
David W. Rich is founder and president of Geotech Computer Systems, Inc. in Englewood, CO. Geotech provides off-the-shelf and custom software and consulting services for environmental data management, GIS, and other technical computing projects.

Dr. Rich received his B.S. in Geology from the University of Notre Dame in 1974, and his M.S. and Ph.D. in Geology from the University of Illinois in 1977 and 1979, with his dissertation on “Porosity in Oolitic Limestones.” He worked for Texaco, Inc. in Tulsa, OK and Shell Oil Company in Houston, TX, exploring for oil and gas in Illinois and Oklahoma. He then moved to Sabine Corporation in Denver, CO as part of a team that successfully explored for oil in the Minnelusa Formation in the Powder River Basin of Wyoming. He directed the data management and graphics groups at Grant Environmental in Englewood, CO where he worked on several projects involving soil and groundwater contaminated with metals, organics, and radiologic constituents. His team created automated systems for mapping and cross section generation directly from a database. In 1986 he founded Geotech Computer Systems, Inc., where he has developed and supervised the development of custom and commercial software for data management, GIS, statistics, and Web data access.

Environmental projects with which Dr. Rich has been directly involved include two Superfund wood treating sites, three radioactive material processing facilities, two hazardous waste disposal facilities, many municipal solid waste landfills, two petroleum refineries, and several mining and petroleum production and transportation projects. He has been the lead developer on three public health projects involving blood lead and related data, including detailed residential environmental measurements. In addition he has been involved in many projects outside of the environmental field, including a real-time Web-based weather mapping system, an agricultural GIS analysis tool, and database systems for petroleum exploration and production data, paleontological data, land ownership, health care tracking, parts inventory and invoice printing, and GPS data capture.

Dr. Rich has been using computers since 1970, and has been applying them to earth science problems since 1975. He was a co-founder and president of the Computer Oriented Geological Society in the early 1980s, and has authored or co-authored more than a dozen technical papers, book chapters, and journal articles on environmental and petroleum data management, geology, and computer applications. He has taught many short courses on geological and environmental computing in several countries, and has given dozens of talks at various industry conventions and other events. When he is not working, Dr. Rich enjoys spending time with his family and riding his motorcycle in the mountains, and often both at the same time.
CONTENTS
PART ONE - OVERVIEW AND CONCEPTS ..... 1
  CHAPTER 1 - OVERVIEW OF ENVIRONMENTAL DATA MANAGEMENT ..... 3
    Concern for the environment ..... 3
    The computer revolution ..... 5
    Convergence - Environmental data management ..... 7
    Concept of data vs. information ..... 8
    EMS vs. EMIS vs. EDMS ..... 8
  CHAPTER 2 - SITE DATA MANAGEMENT CONCEPTS ..... 11
    Purpose of data management ..... 11
    Types of data storage ..... 12
    Responsibility for data management ..... 18
    Understanding the data ..... 19
  CHAPTER 3 - RELATIONAL DATA MANAGEMENT THEORY ..... 21
    What is relational data management? ..... 21
    History of relational data management ..... 21
    Data normalization ..... 22
    Structured Query Language ..... 26
    Benefits of normalization ..... 30
    Automated normalization ..... 31
  CHAPTER 4 - DATA CONTENT ..... 35
    Data content overview ..... 35
    Project technical data ..... 36
    Project administrative data ..... 39
    Project document data ..... 41
    Reference data ..... 42
    Document management ..... 43
PART TWO - SYSTEM DESIGN AND IMPLEMENTATION ..... 47
  CHAPTER 5 - GENERAL DESIGN ISSUES ..... 49
    Database management software ..... 49
    Database location options ..... 50
    Distributed vs. centralized databases ..... 56
    The data model ..... 59
    Data access requirements ..... 61
    Government EDMS systems ..... 63
    Other issues ..... 64
  CHAPTER 6 - DATABASE ELEMENTS ..... 69
    Hardware and software components ..... 69
    Units of data storage ..... 75
    Databases and files ..... 76
    Tables (“databases”) ..... 76
    Fields (columns) ..... 78
    Records (rows) ..... 79
    Queries (views) ..... 79
    Other database objects ..... 80
  CHAPTER 7 - THE USER INTERFACE ..... 85
    General user interface issues ..... 85
    Conceptual guidelines ..... 86
    Guidelines for specific elements ..... 90
    Documentation ..... 91
  CHAPTER 8 - IMPLEMENTING THE DATABASE SYSTEM ..... 93
    Designing the system ..... 93
    Buy or build? ..... 97
    Implementing the system ..... 99
    Managing the system ..... 103
  CHAPTER 9 - ONGOING DATA MANAGEMENT ACTIVITIES ..... 107
    Managing the workflow ..... 107
    Managing the data ..... 109
    Administering the system ..... 110
PART THREE - GATHERING ENVIRONMENTAL DATA ..... 115
  CHAPTER 10 - SITE INVESTIGATION AND REMEDIATION ..... 117
    Overview of environmental regulations ..... 117
    The investigation and remediation process ..... 119
    Environmental Assessments and Environmental Impact Statements ..... 121
  CHAPTER 11 - GATHERING SAMPLES AND DATA IN THE FIELD ..... 123
    General sampling issues ..... 123
    Soil ..... 126
    Sediment ..... 127
    Groundwater ..... 127
    Surface water ..... 130
    Decontamination of equipment ..... 131
    Shipping of samples ..... 131
    Air ..... 131
    Other media ..... 132
    Overview of parameters ..... 133
  CHAPTER 12 - ENVIRONMENTAL LABORATORY ANALYSIS ..... 139
    Laboratory workflow ..... 139
    Sample preparation ..... 140
    Analytical methods ..... 141
    Other analysis issues ..... 145
PART FOUR - MAINTAINING THE DATA ..... 149
  CHAPTER 13 - IMPORTING DATA ..... 151
    Manual entry ..... 151
    Electronic import ..... 153
    Tracking imports ..... 163
    Undoing an import ..... 164
    Tracking quality ..... 165
  CHAPTER 14 - EDITING DATA ..... 167
    Manual editing ..... 167
    Automated editing ..... 168
  CHAPTER 15 - MAINTAINING AND TRACKING DATA QUALITY ..... 173
    QA vs. QC ..... 173
    The QAPP ..... 173
    QC samples and analyses ..... 175
    Data quality procedures ..... 181
    Database support for data quality and usability ..... 186
    Precision vs. accuracy ..... 187
    Protection from loss ..... 188
  CHAPTER 16 - DATA VERIFICATION AND VALIDATION ..... 191
    Types of data review ..... 191
    Meaning of verification ..... 191
    Meaning of validation ..... 193
    The verification and validation process ..... 193
    Verification and validation checks ..... 194
    Software assistance with verification and validation ..... 195
  CHAPTER 17 - MANAGING MULTIPLE PROJECTS AND DATABASES ..... 199
    One file or many? ..... 199
    Sharing data elements ..... 201
    Moving between databases ..... 201
    Limiting site access ..... 202
PART FIVE - USING THE DATA ..... 203
  CHAPTER 18 - DATA SELECTION ..... 205
    Text-based queries ..... 205
    Graphical selection ..... 207
    Query-by-form ..... 210
  CHAPTER 19 - REPORTING AND DISPLAY ..... 213
    Text output ..... 213
    Formatted reports ..... 214
    Formatting the result ..... 216
    Interactive output ..... 223
    Electronic distribution of data ..... 224
  CHAPTER 20 - GRAPHS ..... 225
    Graph overview ..... 225
    General concepts ..... 226
    Types of graphs ..... 227
    Graph examples ..... 228
    Curve fitting ..... 232
    Graph theory ..... 233
  CHAPTER 21 - CROSS SECTIONS, FENCE DIAGRAMS, AND 3-D DISPLAYS ..... 235
    Lithologic and wireline logs ..... 235
    Cross sections ..... 237
    Profiles ..... 238
    Fence diagrams and stick displays ..... 239
    Block diagrams and 3-D displays ..... 240
  CHAPTER 22 - MAPPING AND GIS ..... 243
    Mapping concepts ..... 243
    Mapping software ..... 251
    Displaying data ..... 254
    Contouring and modeling ..... 256
    Specialized displays ..... 262
  CHAPTER 23 - STATISTICS AND ENVIRONMENTAL DATA ..... 269
    Statistical concepts ..... 269
    Types of statistical analyses ..... 273
    Outliers and comparison with limits ..... 275
    Toxicology and risk assessment ..... 277
  CHAPTER 24 - INTEGRATION WITH OTHER PROGRAMS ..... 279
    Export-import ..... 279
    Digital output ..... 282
    Export-import advantages and disadvantages ..... 282
    Direct connection ..... 283
    Data warehousing and data mining ..... 285
    Data integration ..... 286
PART SIX - PROBLEMS, BENEFITS, AND SUCCESSES ..... 287
  CHAPTER 25 - AVOIDING PROBLEMS ..... 289
    Manage expectations ..... 289
    Use the right tool ..... 290
    Prepare for problems with the data ..... 291
    Plan project administration ..... 292
    Increasing the chance of a positive outcome ..... 292
  CHAPTER 26 - SUCCESS STORIES ..... 293
    Financial benefits ..... 293
    Technical benefits ..... 295
    Subjective benefits ..... 296
  CHAPTER 27 - THE FUTURE OF ENVIRONMENTAL DATA MANAGEMENT ..... 299
PART SEVEN - APPENDICES ..... 301
  APPENDIX A - NEEDS ASSESSMENT EXAMPLE ..... 303
  APPENDIX B - DATA MODEL EXAMPLE ..... 307
    Introduction ..... 307
    Conventions ..... 307
    Primary tables ..... 308
    Lookup tables ..... 312
    Reference tables ..... 321
    Utility tables ..... 324
  APPENDIX C - DATA TRANSFER STANDARD ..... 327
    Purpose ..... 327
    Database background information ..... 327
    Data content ..... 328
    Acceptable file formats ..... 332
    Submittal requirements ..... 334
    Non-conforming data ..... 335
  APPENDIX D - THE PARAMETERS ..... 337
    Overview ..... 337
    Inorganic parameters ..... 338
    Organic parameters ..... 340
    Other parameters ..... 347
    Method reference ..... 348
  APPENDIX E - EXERCISES ..... 357
    Database redesign exercise ..... 357
    Data normalization exercise ..... 359
    Group discussion - data management and your organization ..... 360
    Database redesign exercise solution ..... 360
    Data normalization exercise solution ..... 361
    Database software exercises ..... 361
  APPENDIX F - GLOSSARY ..... 363
  APPENDIX G - BIBLIOGRAPHY ..... 407
INDEX ..... 419
PART ONE - OVERVIEW AND CONCEPTS
CHAPTER 1 OVERVIEW OF ENVIRONMENTAL DATA MANAGEMENT
Concern for our environment has been on the rise for many years, and rightly so. At many industrial facilities and other locations toxic or potentially toxic materials have been released into the environment in large amounts. While the health impact of these releases has been quite variable and, in some cases, controversial, it clearly is important to understand the impact or potential impact of these releases on the public, as well as on the natural environment. This has led to increased study of the facilities and the areas around them, which has generated a large amount of data. More and more, people are looking to sophisticated database management technology, together with related technologies such as geographic information systems and statistical analysis packages, to make sense of this data. This chapter discusses this increasing concern for the environment, the growth of computer technology to support environmental data management, and then some general thoughts on environmental data management in an organization.
CONCERN FOR THE ENVIRONMENT

The United States federal government has been regulating human impact on the environment for over a century. Section 13 of the River and Harbor Act of 1899 made it unlawful (with some exceptions) to put any refuse matter into navigable waters (Mackenthun, 1998, p. 20). Since then hundreds of additional laws have been enacted to protect the environment. This regulation occurs at all levels of government, from international treaties, through federal and state governments, to individual municipalities. Often this situation of multiple regulatory oversight results in a maze of regulations that makes even legitimate efforts to improve the situation difficult, but it has definitely increased the effort to clean up the environment and keep it clean.

Through the 1950s the general public had very little awareness or concern about environmental issues. In the 1960s concern for the environment began to grow, helped in part by the book Silent Spring by Rachel Carson (Carson, 1962). The ongoing significance of this book is highlighted by the fact that a 1994 edition has a foreword by then-Vice President Al Gore. In this book Ms. Carson brought attention to the widespread and sometimes indiscriminate use of DDT and other chlorinated hydrocarbons, organic phosphates, arsenic, and other materials, and the impact of this use on ground and surface water, soil, plants, and animals. She cites examples of workers overcome by exposure to large doses of chemicals, and changes in animal populations after use of these chemicals, to build the case that widespread use of these materials is harmful. She also discusses the link between these chemicals and cancer.
Rachel Carson’s message about concern for the environment came at a time, the 1960s, when America was ready for a “back-to-the-earth” philosophy. With the youth of America and others organizing to oppose the war in Vietnam, the two causes fit well together and encouraged each other’s growth. This was reflected in the music of the time, with many songs in the sixties and seventies discussing environmental issues, often combined with sentiments against the war and nuclear power. The war in Vietnam ended, but the environmental movement lives on.

There are many examples of rock songs of the sixties and seventies discussing environmental issues. In 1968 the rock musical Hair warned about the health effects of sulfur dioxide and carbon monoxide. Zager and Evans in their 1969 song In The Year 2525 talked about taking from the earth and not giving back, and in 1970 the Temptations discussed air pollution and many other social issues in Ball of Confusion. Three Dog Night also warned about air pollution in their 1970 songs Cowboy and Out in the Country. Perhaps the best example of a song about the environment is Marvin Gaye’s 1971 song Mercy Mercy Me (The Ecology), in which he talked about oil polluting the ocean, mercury in fish, and radiation in the air and underground. In 1975 Joni Mitchell told farmers not to use DDT in her song Big Yellow Taxi, and the incomparable songwriter Bob Dylan got into the act with his 1976 song A Hard Rain’s A-gonna Fall, warning about poison in water and global hunger. It’s not a coincidence that this time frame overlaps all of the significant early environmental regulations.

A good example of an organized environmental effort that started in those days and continues today is Earth Day. Organized by Senator Gaylord Nelson and patterned after teach-ins against the war in Vietnam, the first Earth Day was held on April 22, 1970, and an estimated 20 million people around the country attended, according to television anchor Walter Cronkite. In the 10 years after the first Earth Day, 28 significant pieces of federal environmental legislation were passed, along with the establishment of the U.S. Environmental Protection Agency (EPA) in December of 1970.

The first major environmental act, the National Environmental Policy Act of 1969 (NEPA), predated Earth Day, and had the stated purposes (Yost, 1997) of establishing harmony between man and the environment; preventing or eliminating damage to the environment; stimulating the health and welfare of man; enriching the understanding of ecological systems; and establishing the Council on Environmental Quality. Since that act, many laws protecting the environment have been passed at the national, state, and local levels.

Evidence that public interest in environmental issues is still high can be found in the public reaction to the book A Civil Action (Harr, 1995). This book describes the experience of people in the town of Woburn, Massachusetts. A number of people in the town became ill and some died due to contamination of groundwater with TCE, an industrial solvent. This book made the New York Times bestseller list, and was later made into a movie starring John Travolta. More recently, the movie Erin Brockovich starring Julia Roberts covered a similar issue in California with Pacific Gas and Electric and problems with hexavalent chromium in groundwater causing serious health issues.

Public interest in the environment is exemplified by the various watchdog organizations that track environmental issues in great detail. A good example of this is Scorecard.org (Environmental Defense, 2001), a Web site that provides a very large amount of information on environmental current events, releases of toxic substances, environmental justice, and similar topics. For example, on this site you can find the largest releasers of pollutants near your residence. Sites like this definitely raise public awareness of environmental issues.

It’s also important to point out that the environmental industry is big business. According to reports by the U.S. Department of Commerce and Environmental Business International (as quoted in Diener, Terkla, and Cooke, 2000), the environmental industry in the U.S. in 1998 had $188.7 billion in sales, up 1.6% from the previous year. It employed 1,354,100 people in 115,850 companies. The worldwide market for environmental goods and services for the same period was estimated to be $484 billion.
Figure 1 - The author (front row center) examining state-of-the-art punch card technology in 1959
THE COMPUTER REVOLUTION

In parallel with growing public concern for the environment has been growth of technology to support a better understanding of environmental conditions. While people have been using computing devices of some sort for over a thousand years and mainframe computers since the 1950s (see Environmental Computing History Timeline sidebar), the advent of personal computers in the 1980s made it possible to use them effectively on environmental projects. For more information on the history of computers, see Augarten (1984) and Evans (1981). Discussions of the history of geological use of computers are contained in Merriam (1983, 1985). With the advent of Windows-based, consumer-oriented database management programs in the 1990s, the tools were in place to create an environmental data management system (EDMS) to store data for one or more facilities and use it to improve project management.

Computers have assumed an increasingly important role in our lives, both at work and at home. The average American home contains more computers than bathtubs. From electronic watches to microwave ovens, we are using computers of one type or another a significant percentage of our waking hours. In the workplace, computers have changed from big number crunchers cloistered somewhere in a climate-controlled environment to something that sits on our desk (or our lap). No longer are computers used only for massive computing jobs which could not be done by hand, but they are now replacing the manual way of doing our daily work. This is as true in the earth science disciplines as anywhere else. Consequently, industry sages have suggested that those who do not have computer skills will be left behind in the next wave of automation of the workplace. At the least, those who are computer aware will be in a better position to evaluate how computers can help them in their work.
Environmental Computing History Timeline

1000 BC – The abacus was invented (still in use).
1623 – The first mechanical calculator was invented by German professor Wilhelm Schickard.
1834 – Charles Babbage began work on the Analytical Engine, which was never completed.
1850 – Charles Lyell was the first person to use statistics in geology.
1876 – Alexander Graham Bell patented the telephone.
1890 – Herman Hollerith built the Tabulating Machine, which was the first successful mechanical calculating machine.
1899 – The River and Harbor Act was the first environmental law passed in the United States.
1943 – The Mark 1, an electromechanical calculator, was developed.
1946 – ENIAC (Electronic Numerical Integrator and Computer) was completed. (Dick Tracy’s wrist radio also debuted in the comic strip.)
1947 – The transistor was invented by Bardeen, Brattain, and Shockley at Bell Labs.
1951 – UNIVAC, the first commercial computer, became available.
1952 – Digital plotters were introduced.
1958 – The integrated circuit was invented by Jack Kilby at Texas Instruments.
1962 – Rachel Carson’s Silent Spring was published, starting the environmental movement.
1965 – An IBM white paper on computerized contouring appeared.
1969 – The National Environmental Policy Act (NEPA) was enacted.
1970 – The first Earth Day was held.
1970 – Relational data management was described by Edgar Codd.
1971 – The first microprocessor, the Intel 4004, was introduced.
1973 – SQL was introduced by Boyce and Chamberlin.
1977 – The Apple II, the first widely accepted personal computer, was introduced.
1981 – IBM released its Personal Computer, the computer that legitimized small computers for business use.
1984 – The Macintosh computer was introduced, the first significant use of a graphical user interface on a personal computer.
1985 – Windows 1.0 was released.
1990 – Microsoft shipped Windows 3.0, the first widely accepted version.
1994 – Netscape Navigator was released by Mosaic Communications, leading to widespread use of the World Wide Web.

The growth that we have seen in computer processing power is related to Moore’s law (Moore, 1965; see also Schaller, 1996), which states that the capacity of semiconductor memory doubles every 18 months. The price-performance ratio of most computer components meets or exceeds this law over time. For example, I bought a 10 megabyte hard drive in 1984 for $800. In 2001 I could buy a 20 gigabyte hard drive for $200, a price-performance increase of 8000 times in 17 years. This averages to a doubling about every 16 months. Over the same time, PC processing speed has increased from 4 megahertz for $5000 to 1000 megahertz for $1000, an increase of 1250 times, a doubling every 20 months. These average to 18 months. So computers become twice as powerful every year and a half, obeying Moore’s law.

Unlike 10 or especially 20 years ago, it is now usual in industrial and engineering companies for each employee to have a suitable computer on his or her desk, and for that computer to be networked to other people’s computers and often a server. This computing environment is a good base on which to build a data management system.
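The doubling-time arithmetic above is easy to reproduce. The following is a minimal sketch in Python, using the numbers from the example; the function name and structure are mine, for illustration only.

    import math

    def doubling_period_months(improvement_ratio, years):
        # Months per doubling implied by an overall price-performance
        # improvement ratio achieved over a given number of years.
        doublings = math.log2(improvement_ratio)  # doublings needed to reach the ratio
        return years * 12 / doublings

    # Hard drives: 10 MB for $800 (1984) vs. 20 GB for $200 (2001)
    drive_ratio = (20000 / 10) * (800 / 200)  # 2000x the capacity at 1/4 the price = 8000x
    print(round(doubling_period_months(drive_ratio, 17)))  # prints 16 (months)

    # Processors: 4 MHz for $5000 vs. 1000 MHz for $1000
    cpu_ratio = (1000 / 4) * (5000 / 1000)    # 250x the speed at 1/5 the price = 1250x
    print(round(doubling_period_months(cpu_ratio, 17)))    # prints 20 (months)

The two results bracket the 18-month figure quoted in the text.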
As the hardware has developed, so has the data management software. It is now possible to outfit an organization with the software for a client-server data management system starting at $1,000 or $2,000 a seat. Users probably already have the hardware. Adding software customization, training, support, and other costs still allows a powerful data management system to be put in place for a cost that is acceptable for many projects.

In general, computers perform best when problem solving calls for either deductive or inductive reasoning, and poorly when using subjective reasoning. For example, calculating a series of stratigraphic horizon elevations where the ground level elevation and the depth to the formation are known is an example of deductive reasoning. Computers perform optimally on problems requiring deductive reasoning because the answers are precise, requiring explicit computations. Estimating the volume of contamination or contouring a surface is an example of inductive reasoning. Inductive reasoning is less precise, and requires a skilled geoscientist to critique and modify the interpretation. Lastly, the feeling that carbonate aquifers may be more complex than clastic aquifers is an example of subjective reasoning. Subjective reasoning uses qualitative data and is the most difficult of all for computer analysis. In such instances, the analytical potential of the computer is secondary to its ability to store and graphically portray large amounts of information. Graphic capabilities are essential if earth scientists are to make qualitative data usable for interpretation.

Another example of appropriate use of computers relative to types of reasoning is the distinction between verification and validation, which is discussed in detail in Chapter 16. Verification, which checks compliance of data with project requirements, is an example of deductive logic. Either a continuous calibration verification sample was run every ten samples or it wasn’t. Validation, on the other hand, which determines the suitability of the data for use, is very subjective, requiring an understanding of sampling conditions, analytical procedures, and the expected use of the data. Verification is easily done with computer software. How far software can go toward complete automation of the validation process remains to be seen.
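Both deductive examples here are simple enough to state in code. Below is a minimal sketch in Python; the function names are mine, and the every-ten-samples rule is implemented as one reasonable reading of the requirement (at least one CCV in each consecutive block of ten runs).

    def horizon_elevation(ground_elevation, depth_to_formation):
        # Deductive calculation: horizon elevation from ground level
        # elevation and depth to the formation (same units for both).
        return ground_elevation - depth_to_formation

    def ccv_frequency_ok(run_sequence, interval=10):
        # Deductive check: was a continuous calibration verification
        # (CCV) sample run in every block of `interval` runs?
        for start in range(0, len(run_sequence), interval):
            if "CCV" not in run_sequence[start:start + interval]:
                return False
        return True

    print(horizon_elevation(5230.0, 42.5))   # 5187.5
    run_log = ["sample"] * 9 + ["CCV"] + ["sample"] * 10
    print(ccv_frequency_ok(run_log))         # False: second block of ten lacks a CCV

Validation, by contrast, resists this kind of mechanical test, which is why software can verify far more readily than it can validate.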
CONVERGENCE - ENVIRONMENTAL DATA MANAGEMENT

Efficient data management is taking on increased importance in many organizations, and yours is probably no exception. In the words of one author (Diamondstone, 1990, p. 3):

    Automated measuring equipment has provided rapidly increasing amounts of data. Now, the challenge before us is to assure sufficient data uniformity and compatibility and to implement data quality measures so that these data will be useful for integrative environmental problem solving.

This is particularly true in organizations where many different types of site investigation and monitoring data are coming from a variety of different sources. Fortunately, software tools are now available which allow off-the-shelf programs to be used by people who are not computer experts to retrieve this data in a format that is meaningful to them. According to Finkelstein (1989, p. 3):

    Management is on the threshold of an explosive boom in the use of computers. A boom initiated by simplicity and ease of use. Managers and staff at all levels of an organization will be able to design and implement their own systems, thereby dramatically reducing their dependence on the data processing (DP) department, while still ensuring that DP maintains central control, so that application systems and their data can be used by others in the business.

With the advent of relatively easy to use software tools such as Microsoft Windows and Microsoft Access, it is even more true now that individuals can have a much greater role in satisfying their own data management needs.
It is important to develop a data management approach that makes efficient use of these tools to solve the business problem of managing data within the organization. The environmental data management system that will result from implementation of a plan based on this approach will provide users with access to the organization’s environmental data to satisfy their business needs. It will allow them to expand their data retrievals as their needs change and as their skills develop.

As with most business decisions, the decision to implement a data management system should be based on an analysis of the expected return on the time and money invested. In the case of an office automation project, some of the return is tangible and can be expressed in dollar savings, and some is intangible savings in efficiency in everyday operations. In general, the best approach for system implementation is to look for leverage points in the system where a great return can be had for a small cost. The question becomes: How do you improve the process to get the greatest savings?

Often some examples of tangible returns can be identified within the organization. The benefits can best be seen from analyzing the impact of the data management system on the whole site investigation and remediation process. For example, during remediation you might be able, by more careful tracking and modeling of the contamination, to decrease the amount of waste to be removed or water to be processed. You may also be able to decrease the time required to complete the project and save many person-years of cost by making quality data available in a standardized format and in a timely fashion. For smaller sites, automating the data management process can provide savings by repetition. Once the system has been set up for one site and people trained to use it, that effort can be re-used on the next site.

The intangible benefits of a data management system are difficult to quantify, but subjectively can include increased job satisfaction of project workers, a higher quality work product, and better decision making. The cumulative financial and intangible return on investment of these various benefits can easily justify reasonable expenditures for a data management system.
CONCEPT OF DATA VS. INFORMATION

It is important to recognize that there is a difference between numbers and letters stored in a computer and useful information. Numbers stored in a computer, or printed out onto a sheet of paper, may not themselves be of any value. It is only when those numbers are presented in a form that is useful to the intended audience that they become useful information. The keys to making the transition from data to information are organization and access. It doesn't matter if you have a file of all the monitoring wells ever drilled; if you can't get the information you want out of the file, it is useless.

Before any database is created, careful attention should be paid to how the data is going to be used, to ensure that the maximum use can be received from the effort. Statistics and graphics can be tremendously helpful in perceiving relationships among different variables contained in the data. As the power and ease-of-use of both general business programs and technical programs for statistics and graphics improves, it will become common to take a good look at the data as a set before working with individual members of the set.

The next step is to move from information to knowledge. The difference between the two is understanding. Once you have processed the information and understand it, it becomes knowledge. This transition is a human activity, not a computer activity, but the computer can help by presenting the information in an understandable manner.
EMS VS. EMIS VS. EDMS

A final overview issue to discuss is the relationship between EMS (environmental management systems), EMIS (environmental management information systems), and site EDMS (environmental data management systems). An EMS is a set of policies and procedures for managing environmental issues for an organization or a facility.
Data is or Data are?
Is “data” singular or plural? In this book the word data is used as a singular noun. Depending on your background, you may not like this. Many engineers and scientists think of data as the plural of “datum,” so they consider the word plural. Computer people view data as a chunk of stuff, and, like “chunk,” consider it singular. In one dictionary I consulted (Webster, 1984), data as the plural of datum was the third definition, with the first two being synonyms for “information,” which is always treated as singular. It also states that common usage at this time is singular rather than plural, and that “data can now be used as a singular form in English.” In Strunk and White (1935), a style manual that I use, the discussion of singular vs. plural nouns uses an example of the contents of a jar. If the jar contains marbles, its contents are plural. If it contains jam, its content is singular. You decide: Is data jam or marbles?

An EMIS is a software system implemented to support the administration of the EMS (see Gilbert, 1999). An EMIS usually has a focus on record keeping and reporting, and is implemented with the hope of improving business processes and practices. A site environmental data management system (EDMS) is a software system for managing data regarding the environmental impact of current or former operations. The EDMS overlaps partially with the EMIS. For an operating facility, the EDMS is a part of the EMIS. For a facility no longer in operation, there may be no formal EMS or EMIS, but the EDMS is necessary to facilitate monitoring and cleanup.
CHAPTER 2 SITE DATA MANAGEMENT CONCEPTS
The size and complexity of environmental investigation and monitoring programs at industrial facilities continue to increase. Consequently, the amount of environmental data, both at operating facilities and orphan sites, is growing as well. The volume of data often exceeds the capacity of simple tools like paper reports and spreadsheets. When that happens, it is appropriate to implement a more powerful data management system, and often the system of choice is a relational database manager.

This section provides a top-down discussion of the management of environmental data. It focuses on the purpose and goals of environmental data management, and on the types and locations of data storage. These issues should always be resolved before an electronic (or in fact any) data management system is implemented.
PURPOSE OF DATA MANAGEMENT

Environmental problems are complex problems. Complex problems have simple, easy-to-understand wrong answers.
From Environmental Humor by Gerald Rich (1996), reprinted with permission

Why manage data electronically? Or why even manage it at all? Clear answers to these questions are critical before a successful system can be implemented. This section addresses some of the issues related to the purpose of data management. It all comes down to planning. If you understand the goal to be accomplished, you have a better chance of accomplishing it.

There is only one real purpose of data management: to support the goals of the organization. These goals are discussed in detail in Chapter 8. No data management system should be built unless it satisfies one or more significant business or technical goals. Identification of these goals should be done prior to designing and implementing the system for two reasons. One reason is that the achievement of these goals provides the economic justification for the effort of building the system. The other reason is that the system is more likely to generate satisfactory results if those results are understood, at least to a first approximation, before the system is implemented and functionality is frozen.

Different organizations have different things that make them tick. For some organizations, internal considerations such as cost and efficiency are most important. For others, outside appearances are equally or more important. The goals of the organization must be taken into consideration in the design of the system so that the greatest benefit can be achieved. Typical goals include:

Improve efficiency – Environmental site investigation and remediation projects can involve an enormous amount of data. Computerized methods, if they are well designed and implemented, can be a great help in improving the flow of data through the project. They can also be a great sink of time and effort if poorly managed.

Maximize quality – Because of the great importance of the results derived from environmental investigation and remediation, it is critical that the quality of the output be maximized relative to the cost. This is not trivial, and careful data storage, and annotation of data with quality information, can be a great help in achieving data quality objectives.

Minimize cost – No organization has an unlimited amount of money, and even those with a high level of commitment to environmental quality must spend their money wisely to receive the greatest return on their investment. This means that unnecessary costs, whether in time or money, must be minimized. Electronic data management can help contain costs by saving time and minimizing lost data.

People tend to start working on a database without giving a lot of thought to what a database really is. It is more than an accumulation of numbers and letters. It is a special way to help us understand information. Here are some general thoughts about databases:

A database is a model of reality – In many cases, the data that we have for a facility is the only representation that we have for conditions at that facility. This is especially true in the subsurface, and for chemical constituents that are not visible, either because of their physical condition or their location.

The model helps us understand the reality – In general, conditions at sites are nearly infinitely complex. The total combination of geological, hydrological, and engineering factors usually exceeds our ability to understand it without some simplification. Our model of the site, based on the data that we have, helps us to perform this simplification in a meaningful way.

This understanding helps us make decisions – Our simplified understanding of the site allows us to make decisions about actions to be taken to improve the situation at the site. Our model lets us propose and test solutions based on the data that we have, identify additional data that we need, and then choose from the alternative solutions.

The clearer the model, the better the decisions – Since our decisions are based on our data-based model, it follows that we will make better decisions if we have a clear, accurate, up-to-date model. The purpose of a database management system for environmental data is to provide us the information to build accurate models and keep them current.

Clearly information technology, including data management, is important to organizations. Linderholm (2001) reports the results of a study that asked business executives about the importance of information technology (IT) to their business. 70% reported that it was absolutely essential, and 20% said it was extremely valuable. The net increase in revenue attributable to IT, after accounting for IT costs, was estimated to be 20%, clearly a good return. 70% said that the role of IT in business strategy is increasing. In the environmental business the story must be similar, but perhaps not as strong.
If you were to survey project managers today about the importance of data management on their projects, the percentage saying it was essential or extremely valuable would probably be less than the 90% quoted above, and maybe less than 50%. But as the amount of data for sites continues to grow, this number will surely increase.
TYPES OF DATA STORAGE

Once the purpose of the system has been determined, the next step is to identify the data to be contained in the system and how it is to be stored. Some data must be stored electronically, while
other data might not need to be stored this way. Implementers should first develop a thorough understanding of their existing data and storage methods, and then make decisions about how electronic storage can provide an improvement. This section will cover ways of storing site environmental data. The content of an EDMS will be discussed in Chapter 4.
Hard copy

Hard copy data storage has been the lifeblood of the environmental industry since its inception. Many organizations have thousands of boxes of paper related to their projects. The importance of this data varies greatly, but in many organizations, it is not well understood.

A data management system for hard copy data is different from a data management system for digital data such as laboratory analytical results. The former is really a document management system, and many vendors offer software and other tools to build this type of system. The latter is more of a technical database issue, and can be addressed by in-house generated solutions or off-the-shelf or semi-custom solutions from environmental software vendors.
LAB REPORTS

Laboratory analyses can generate a large volume of paper. Programs like the U.S.E.P.A. Contract Lab Program (CLP) specify deliverables that can be hundreds of pages for one sampling event. This paper is important as backup for the data, but these hundreds of pages can cause a storage and retrieval problem for many organizations. Often the usable data from the lab event, that is, the data actually used to make site decisions, may be only a small fraction of the paper, with the rest being quality assurance and other backup information.
DERIVED REPORTS

Evaluation of the results of laboratory analysis and other investigation efforts usually results in a printed report. These reports contain a large amount of useful information, but over time can also become a storage and retrieval problem.
Electronic

There are many different ways of organizing data for digital storage. There is no “right” or “wrong” way, but there are approaches that provide greater benefits than others in specific situations. People store environmental data in a lot of different ways, both in database systems and in other file types. Here we will discuss two non-database ways of storing data, and several different database system designs for storing data.
TEXT FILES AND WORD PROCESSOR FILES

The simplest way to manage environmental data is in text files. These files contain just the information of interest, with no formatting or information about the data structure or relationships between different data elements. Usually these files are encoded in ASCII, which stands for American Standard Code for Information Interchange and is pronounced like as′-kee. For this reason they are sometimes called ASCII files. Text files can be effectively used for storing and transferring small amounts of data. Because they lack “intelligence” they are not usually practical for large data sets. For example, in order to search for one piece of data in a text file you must look at every word until you find the one you are looking for, rather than using a more efficient method such as indexed searching used by data management programs.

A variation on text files is word processor files, which contain some formatting and structure resulting from the word processing program that created them. An example of this would be the data in a table in a report. Again this works well only for small amounts of data.
SPREADSHEETS

Over the years a large amount of environmental data has been managed in spreadsheets. This approach works for data sets that are small to medium in size, and where the display and retrieval requirements are relatively simple. For large data sets, a database manager program is usually required because spreadsheets have a limit to the number of rows and columns that they can contain, and these limits can easily be exceeded by a large data set. For example, Lotus 1-2-3 has a limit of about 16,000 rows of data, and Excel 97 has a limit of 65,536 rows.

Spreadsheets do have their place in working with environmental data. They are particularly useful for statistical analysis of data and for graphing in a variety of ways. Spreadsheets are for doing calculations. Database managers are for managing data. As long as both are used appropriately, the two together can be very powerful.

The problem with spreadsheets occurs when they are used in situations where real data management is required. For example, it’s not unusual for organizations to manage quarterly groundwater monitoring data using spreadsheets. They can do statistics on the data and print reports. The problem becomes evident when it becomes necessary to do a historical analysis of the data. It can be very difficult to tie the data together. The format of the spreadsheets may have evolved over time. The file for one quarter may be missing or corrupted. Suddenly it becomes a chore to pull all of the data together to answer a question such as “What is the trend of the sulfate values over the last five years?”
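Once the same data lives in a relational database, a question like this becomes a single query. The following is only a sketch, using hypothetical Samples and Analyses tables like the ones developed in Chapter 3 (the exact date syntax varies between database products):

SELECT Samples.SampDate, Analyses.Value
FROM Samples, Analyses
WHERE Samples.SampleID = Analyses.SampleID   -- relational join
AND Analyses.Param = 'Sulfate'
AND Samples.SampDate >= '1/1/1997'           -- the last five years, for example
ORDER BY Samples.SampDate

The same query works whether the database holds one quarter of data or twenty years of it.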
DATABASE MANAGERS

For storing large amounts of data, and where immediate calculations are not as important, database managers usually do a better job than spreadsheets, although the capabilities of spreadsheets and databases certainly overlap somewhat. The better database managers allow you to store related data in several different tables and to link them together based on the contents of the data.

Many database manager programs have a reputation for not being very easy to use, partly because of the sheer number of options available. This has been improved with the menu-driven interfaces that are now available. These interfaces help with the learning curve, but data management software, especially database server software, can still be very difficult to master.

Many database manager programs provide a programming language, which allows you to automate tasks that you perform often or repeatedly. It also allows you to configure the system for other users. This language provides the tools to develop sophisticated applications programs for nearly any data handling need, and provides the basis for some commercial EDMS software.

Database managers are usually classified by how they store and relate data. The most common types are flat file, hierarchical, network, object-oriented, and relational. Most use the terminology of “record” for each object in the database (such as a well or sample location) and “field” for each type of information on each object (such as county or collection date). For information on database management concepts see Date (1981) and Rumble and Hampel (1984).

Sullivan (2001) quotes a study by the University of California at Berkeley that humans have generated 12 exabytes (an exabyte is over 1 million terabytes, or a million trillion bytes) of data since the start of time, and will double this in the next two and a half years. Currently, about 20% of the world’s data is contained in relational databases, while the rest is in flat files, audio, video, pre-relational, and unstructured formats.
Flat file

A flat file is a two-dimensional array of data organized in rows and columns similar to a spreadsheet. This is the simplest type of database manager. All of the data for a particular type of object is stored in a single file or table, and each record can have one instance of data for each field. A good analogy is a 3"×5" card file, where there is one card (record) for each item being tracked in the database, and one line (field) for each type of information stored.
Flat file database managers are usually the cheapest to buy, and often the easiest to use, but the complexity of real-world data often requires more power than they can provide. In a flat file storage system, each row represents one observation, such as a boring or a sample. Each column contains the same kind of data. An example of a flat file of environmental data is shown in the following table:

Well  Elev  X     Y     SampDate  Sampler  As    AsFlag    Cl    ClFlag    pH
B-1   725   1050  681   2/3/96    JLG      .05   not det                   6.8
B-1   725   1050  681   5/8/96    DWR      .05   not det   .05   not det   6.7
B-2   706   342   880   11/4/95   JAM      3.7   detected  9.1   detected  5.2
B-2   706   342   880   2/3/96    JLG      2.1   detected  8.4   detected  5.3
B-2   706   342   880   5/8/96    DWR      1.4   detected  7.2   detected  5.8
B-3   714   785   1101  2/3/96    JLG      .05   not det                   8.1
B-3   714   785   1101  5/8/96    CRS      .05   not det   .05   not det   7.9

Figure 2 - Flat file of environmental data
In this table, each line is the result of one sampling event for an observation well. Since the wells were sampled more than once, and analyzed for multiple parameters, information specific to the well, such as the elevation and location (X and Y), is repeated. This wastes space and increases the chance for error since the same data element must be entered more than once. The same is true for sampling events, represented here by the date and the initials of the person doing the sampling. Also, since the format for the analysis results requires space for each value, if the value is missing, as it is for some of the chloride measurements, the space for that data is wasted. In general, flat files work acceptably for managing small amounts of data such as individual sampling events. They become less efficient as the size of the database grows. Examples of flat file data management programs are FileMaker Pro (www.filemaker.com) and Web-based database programs such as QuickBase (www.quickbase.com).
Hierarchical

In the hierarchical design, the one-to-many relationship common to many data sets is formalized into the database design. This design works well for situations such as multiple samples for each boring, but has difficulty with other situations such as many-to-many relationships. This type of program is less common than flat files or relational database managers, but is appropriate for some types of data.

In a hierarchical database, data elements can be viewed as branches of an inverted tree. A good example of a hierarchical database might be a database of organisms. At the top would be the kingdom, and underneath that would be the phyla for each kingdom. Each phylum belongs to only one kingdom, but each kingdom can have several phyla. The same continues down the line for class, order, and so on. The most important factor in fitting data into this scheme is that there must be no data element at one level that needs to be under more than one element at a higher level. If a crinoid could be both a plant and an animal at the same time, it could not be classified in a hierarchical database by phylogeny (which biological kingdom it evolved from).

Environmental site data is for the most part hierarchical in nature. Each site can have many monitoring wells. Each well can have many samples, either over time or by depth. Then each sample can be analyzed for multiple constituents. Each constituent analysis comes from one specific sample, which comes from one well, which comes from one site. A data set which is inherently hierarchical can be stored in a relational database manager, and relational database managers are somewhat more flexible, so pure hierarchical database managers are now rare.
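As a sketch of how such a hierarchy maps onto relational storage (the table and field names here are illustrative only, not a recommended data model), each level becomes a table that carries a reference to its parent:

CREATE TABLE Sites    (SiteID INTEGER PRIMARY KEY, SiteName TEXT);
CREATE TABLE Wells    (WellID INTEGER PRIMARY KEY,
                       SiteID INTEGER REFERENCES Sites,      -- each well belongs to one site
                       WellName TEXT);
CREATE TABLE Samples  (SampleID INTEGER PRIMARY KEY,
                       WellID INTEGER REFERENCES Wells,      -- each sample comes from one well
                       SampDate DATE);
CREATE TABLE Analyses (AnalysisID INTEGER PRIMARY KEY,
                       SampleID INTEGER REFERENCES Samples,  -- each analysis is of one sample
                       Param TEXT, Value REAL);

Each child record points to exactly one parent record, which is exactly the one-to-many structure of the hierarchy.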
Network

In the network data model, multiple relationships between different elements at the same level are easy to manage. Hypertext systems (such as the World Wide Web) are examples of managing data this way. Network database managers are not common, but are appropriate in some cases, especially those in which the interrelationships among data are complex.

An example of a network database would be a database of authors and articles. Each author may have written many articles, and each article may have one or more authors. This is called a “many-to-many” relationship. This is a good project for a network database manager. Each author is entered, as is each article. Then the links between authors and articles are established. The data elements are entered, and then the network established. Then an article can be called up, and the information on its authors can be retrieved. Likewise, an author can be named, and his or her articles listed.

A network data topology (geometric configuration) can be stored in a relational database manager. A “join table” is needed to handle the many-to-many relationships. Storing the above article database in a relational system would require three tables: one for authors, one for articles, and a join table with the connections between them.
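A sketch of this in relational terms (again with illustrative names) shows the join table at work:

CREATE TABLE Authors  (AuthorID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Articles (ArticleID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE AuthorArticle (
  AuthorID  INTEGER REFERENCES Authors,   -- the join table: one row
  ArticleID INTEGER REFERENCES Articles,  -- per author-article pairing
  PRIMARY KEY (AuthorID, ArticleID));

Listing all of the articles by one author is then a query across the three tables:

SELECT Articles.Title
FROM Authors, AuthorArticle, Articles
WHERE Authors.AuthorID = AuthorArticle.AuthorID
AND AuthorArticle.ArticleID = Articles.ArticleID
AND Authors.Name = 'Smith'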
Object oriented

This relatively recent invention stores each data element as an object with properties and methods encapsulated (wrapped up) into each object. This is a deviation from the usual separation of code and data, but is being used successfully in many settings. Current object-oriented systems do not provide the data retrieval speed on large data sets provided by relational systems. Using this type of software involves a complete re-education of the user, since different terminology and concepts are used. It is a very powerful way to manipulate data for many purposes, and is likely to see more widespread use. Some of the features of object-oriented databases are described in the next few paragraphs.

Encapsulation – Traditional programming languages focus on what is to be done. This is referred to as “procedural programming.” Object-oriented programming focuses on objects, which are a blend of data and program code (Watterson, 1989). In a procedural paradigm (a paradigm is an approach or model), the data and the programs are separate. In an object-oriented paradigm, the objects consist of data that knows what to do with itself, that is, objects contain methods for performing actions. This is called encapsulation. Thus, instead of applying procedures to passive data, in object-oriented programming systems (OOPS), methods are part of the objects.

Some examples of the difference between procedural systems and OOPS might be helpful. In a procedural system, the data for a well could contain a field for well type, such as monitoring well or soil boring. The program operating on the data would know what symbol to draw on the map based on the contents of that field. In an OOPS the object called “soil boring” would include a method to draw its symbol, based on the data content (properties) of the object. Properties of objects in OOPS are usually loosely typed, which means that the distinction between data types such as integers and characters is not rigorously defined. This can be useful when, as is often the case, a numeric property such as depth to a particular formation needs to be filled with character values such as NP (not present) or NDE (not deep enough).

For another illustration, imagine modeling a rock or soil body subject to chemical and physical processes such as leaching or neutralization using an OOPS. Each mineral phase would be an object of class “mineral,” while each fluid phase would be an object of class “fluid.” Methods known to the objects would include precipitation, dissolution, compaction, and so on. The model is given an initial condition, and then the objects interact via messages triggering methods until some final state is reached.

Inheritance – Objects in an OOPS belong to classes, and members of a particular class share the same methods. Also, similar classes of objects can inherit properties and methods from an existing class. This feature, called inheritance, allows a building-block approach to designing a
database system by first creating simple objects and then building on and combining them into more complex objects. In this way, an object called “site” made up of “well” objects would know how to display itself with no additional programming.

Message Passing – An object-oriented program communicates with objects via messages, and objects can exchange messages as well. For example, an object of class “excavated material” could send a message to an object of class “remediation pit” which would update the property “remaining material” within object “remediation pit.”

Polymorphism – A method is part of an object, and is distinct from messages between objects. The objects “well” and “boring” could both contain the method “draw yourself,” and sending the “draw yourself” message to one or the other object will cause a similar but different result. This is referred to as polymorphism.

Object-oriented programming directly models the application, with messages being passed between objects being the analog of real-world processes (Thomas, 1989). Software written in this way is easier to maintain because programmers other than the author can easily comprehend the program code. Since program code is easily reusable, development of complex applications can be done more quickly and smoothly. Encapsulation, message passing, inheritance, and polymorphism give OOPS developers very different tools from those provided by traditional programming languages. Also, OOPS often use a graphical user interface and large amounts of memory, making them more suitable to high-end computer systems. For these reasons, OOPS have been slow in gaining acceptance, but they are gaining momentum and are considered by many to be the programming system of the future.

Examples of object-oriented programming languages include Smalltalk, developed by Xerox at the Palo Alto Research Center in the 1970s (Goldberg and Robson, 1983); C++, which is a superset of the C programming language (Stroustrup, 1986); and HyperCard for the Macintosh. NeXTSTEP, the programming environment for the NeXT computer, also uses the object-oriented paradigm. There are several database management programs that are designed to be object oriented, which means that their primary data storage design is to store objects. Also, a number of relational database management systems have recently added object data types to allow object-oriented applications to use them as data repositories; these are referred to as object-relational systems.
Relational

Relational database managers and SQL are discussed in much greater detail in Chapter 3, and are described here briefly for comparison with other database manager types. In the relational model, data is stored in one or more tables, and these tables are related, that is, they can be joined together, based on data elements within the tables. This allows storage of data where there may be many pieces of one type of information related to one object (one-to-many relationship), as well as other relationships such as hierarchical and many-to-many. In many cases, this has been found to be the most efficient form of data storage for large, complicated databases, because it provides efficient data storage combined with flexible data retrieval. Currently the most popular type of database manager program is the relational type.

A file of monitoring well data provides a good example of how real-world data can be stored in a relational database manager. One table is created which contains the header data for the well including location, date drilled, elevation, and so on, with one record for each well. For each well, the driller or logger will report a different number of formation tops, so a table of formation tops is created, with one record for each top. A unique identifier such as well ID number relates the two tables to each other. Each well can also have one or more screened intervals, and a table is created for each of those data types, and related by the same ID number. Each screened interval can have multiple sampling events, with a description for each, so another table can be created for these sample events, which can be related by well ID number and sample event number. Very complex
systems can be created this way, but it often will take a program, written in the database language, to keep track of all the relationships and allow data entry, updating, and reporting.

The most popular way of interacting with relational database managers is Structured Query Language (SQL, sometimes incorrectly pronounced “sequel,” see below). SQL provides a powerful, flexible way of retrieving, adding, and changing data in a relational system. A typical SQL query might look like this:

SELECT X_COORD, Y_COORD, COLLAR_ELEV - MUDDY_TOP, SYMBOL_CODE
FROM WELLS, TOPS
WHERE MUDDY_TOP > 100
AND WELLS.WELL_ID = TOPS.WELL_ID

This query would produce a list where the first column is the X coordinate for a well, the second column is the Y coordinate, the third column is the difference between the well elevation and the depth to the top of the Muddy Formation, and the fourth column is the symbol code. Only the wells for which the Muddy Formation is deeper than 100 would be listed. The X and Y coordinates, the elevation, and the symbol code come from the WELLS table and the Muddy Formation top comes from the TOPS table. The last line is the “relational join” that hooks the two tables together for this query, based on WELL_ID, which is a field common to both tables. A field that relates two tables like this is called a “key.” The output from this query could be sent to a mapping program to make a contour map of the Muddy structure.

Most relational database managers have SQL as their native retrieval language. The rest usually have an add-in capability to handle SQL queries.
XML

XML (eXtensible Markup Language) was developed as a data transfer format, and has become increasingly popular for exchanging data on the Internet, especially in business-to-business transactions. The use of XML for transferring data is discussed in Chapter 24. Database management products are now starting to appear that use XML as their data storage format (for example, see Dragan, 2001). As of this writing these products are very immature. One example product costs $50,000 and does not include a report writer module. This is expected with a new product category. What is not clear is whether this data storage approach will catch on and replace more traditional methods, especially relational data management systems. This may be unlikely, especially since relational software vendors are adding the capability to store XML data in their programs, which are much more mature and accepted. Given that XML was intended as a transfer format, it’s not clear that it is a good choice for a storage format. It will be interesting to see if database products with XML as their primary data storage type become popular.
RESPONSIBILITY FOR DATA MANAGEMENT

A significant issue in many organizations is who is responsible for data management. In some organizations, data management is centralized in a data management group. In others, the project team members perform the data management. Some organizations outsource data management to a consultant. Finally, some groups use a combination of these approaches. Each of these options will be discussed briefly, along with its pros and cons.

Dedicated data management group – The thinking in this approach is that the organization can have a group of specialists who manage data for all of the projects. This is often done in conjunction with the group (which may be the same people) that performs validation on the data. The advantage of this is that the people develop specialized skills and expertise that allow them to manage the data efficiently. They can respond to requests for custom processing and output, because they have mastered the tools that they use to manage data. The disadvantage is that they may not have hands-on knowledge of the project and its data, which may be necessary to recognize and remedy problems. They need to be kept in the loop, such as when a new well is drilled, or when the laboratory is asked to analyze for a different suite of constituents, so that they can react appropriately.

Data management by the project team – Here the focus is on the benefit of project knowledge rather than data management expertise. The people managing the data are likely to be involved in decisions that affect the data, and should be in a position to respond to changes in the data being gathered and reported. They might have problems, though, when something is asked of them that is beyond their data management expertise, because that is only part of what they are expected to do.

Outsourcing to a consultant – A very common practice is to outsource the data management to a consultant, often the one that is gathering the data. Then the consultant has to decide between one of the previous approaches. This may be the best option when staff time or expertise is not available in-house to do the data management. The price can be loss of control over the process.

A team effort – Sometimes the best solution is a team effort. In this scenario, project team members are primarily responsible for the data management process. They are backed up by data management specialists, either in-house or through a consultant, to help them when needs change or problems occur. Project staff may outsource large data gathering, data entry, or cleanup projects, especially when it is necessary to “catch up,” for example, to bring a lot of historical data from disparate sources into a comprehensive database to satisfy some specific project requirements. The team approach has the potential to be the strongest, because it allows the project team to leverage the strengths of the different available groups. It does require good management skills to keep everyone on the same page, especially as deadlines approach.
UNDERSTANDING THE DATA

It is extremely important that project managers, data administrators, and users of the software have a complete understanding of the data in the database and how it is to be used. It is important for them to understand the data structure. It is even more important for them to understand the content.

The understanding of the structure refers to how data elements from the real world are placed in fields in the database. Many DBMS programs allow comments to be associated with fields, either in the data model, or on editing forms, or both. These comments can assist the user with understanding how the data is to be entered. Additional information on data structure is contained in Chapter 4.

Once you know where different data elements are to go, you must also know what data elements in the database mean. This is true of both primary data elements (data in the main tables) and coded values (stored in lookup tables). The content can be easily misunderstood. A number of issues regarding data content are discussed in Parts Three and Four.
CHAPTER 3 RELATIONAL DATA MANAGEMENT THEORY
The people using an EDMS will often come to the system with little or no formal training or experience in environmental data management. In order to provide a conceptual framework on which they can build their expertise in using the system, this section provides an overview of the fundamentals of relational management of environmental data. This section starts with a discussion of the meaning and history of relational data management. This is followed by a description of breaking data apart with data normalization, and using SQL to put it back together again.
WHAT IS RELATIONAL DATA MANAGEMENT?

Relational data management means organizing the data based on relationships between items within the database. It involves designing the database so that like data elements are grouped into tables together. This process is called data normalization. Then the data can be joined back together for retrievals, usually using the SQL data retrieval language. The key elements of relational data storage are:

• Tables – Database objects containing records and fields where the data is stored. Examples: Samples, Analyses.
• Fields – Data elements (columns) within the table. Examples: Parameter Name, Value.
• Records – Items being stored (rows) within the table. Example: Arsenic for last quarter.
Each of these items will be discussed in much greater detail later.
HISTORY OF RELATIONAL DATA MANAGEMENT

Prior to 1970, data was primarily managed in hierarchical and networked storage systems. Edgar Codd, an IBM Fellow working at the San Jose research lab, became concerned about protecting users from needing to understand the internal representation of the data, and published a paper entitled “A Relational Model of Data for Large Shared Data Banks” (Codd, 1970). This set off a debate in industry about which model was the best. IBM developed System R and a team at Berkeley developed INGRES, both prototype relational data management systems built in the mid-1970s.
SQL was the database language for System R, and was first described in Boyce and Chamberlin (1973). SQL is sometimes pronounced “sequel.” This is not really correct. In the early 1970s Edgar Codd and others in IBM’s research center were working on relational data management, and Donald Chamberlin and others developed a language to work with relational databases (Gagnon, 1998). This language was called Structured English Query Language, or SEQUEL. This was later revised and renamed SEQUEL/2. SQL as it is currently implemented is not the same as these early languages, and IBM stopped using the name SEQUEL for legal reasons. So it is not correct to call SQL “sequel.” It’s better to just say the letters, but many call it “sequel” anyway.

In 1977, a contract programming company called Software Development Laboratories (SDL) developed a database system for the Central Intelligence Agency for a project called “Oracle.” The company released the program, based on System R and SQL, in 1979. This became the first commercial relational database management system, and ran on IBM mainframes and Digital VAX and UNIX minicomputers. SDL assumed the name Oracle for both the company and the product. IBM followed in 1981 with the release of SQL/DS, and with DB2 in 1983 (Finkelstein, 1989). Oracle and DB2 are still active products in widespread use. The American National Standards Institute (ANSI) accepted SQL as a standardized fourth generation (4GL) language in 1986, with several revisions since then.
DATA NORMALIZATION

Often the process of storing complex types of data in a relational database manager is improved by a technique called data normalization. Usually the best database designs have undergone this process, and are referred to as “normalized data models.” The normalization process separates data elements into a logical grouping of tables and fields.
Definition

Normalization of a database is a process of grouping the data elements into tables in a way that reduces needless duplication and wasted space while allowing maximum flexibility of data retrieval. The concepts of data normalization were developed by Edgar Codd, and the first three steps of normalization were described in his 1972 paper (Codd, 1972). These steps were expanded and developed by Codd and others over the next few years. A good summary of this work is contained in Date (1981).

Figure 3 shows an example of a simple normalized data model for site environmental data. This figure is known as an entity-relationship diagram, or E-R diagram. In Figure 3, and in other E-R diagrams used in this book, each box represents a table, with the name of the table shown at the top. The other words in each box represent the fields in that table. These are the “entities.” The lines between the boxes represent the relationships between the tables, and the fields used for the relationships. All of the relationships shown are “one-to-many,” signified by the number one and the infinity symbol on the ends of the join lines. That means that one record in one table can be related to many records in the other table.
The Five Normal Forms

Most data analysts recognize five levels of normalization. These forms, called First through Fifth Normal Form, represent increasing levels of organization of the data. Each level will be discussed briefly here, with examples of a simple environmental data set organized into each form.
Figure 3 - Simple normalized data model for site environmental data
Tools are now available to analyze a data set and assist with normalizing it. The Access Table Analyzer Wizard (see Figure 14, later) is one example. These tools usually require that the data be in First Normal Form (rows and columns, no repeating groups) before they can work with it.

In this section we will go through the process of normalizing a data set. We will start with a flat file of site analytical data in a format similar to the way it would be received from a laboratory. This is shown in Figure 4.

Well  Elev  X     Y     SampDate  Sampler  As    AsFlag    Cl    ClFlag    pH
B-1   725   1050  681   2/3/96    JLG      .05   not det                   6.8
B-1   725   1050  681   5/8/96    DWR      .05   not det   .05   not det   6.7
B-2   706   342   880   11/4/95   JAM      3.7   detected  9.1   detected  5.2
B-2   706   342   880   2/3/96    JLG      2.1   detected  8.4   detected  5.3
B-2   706   342   880   5/8/96    DWR      1.4   detected  7.2   detected  5.8
B-3   714   785   1101  2/3/96    JLG      .05   not det                   8.1
B-3   714   785   1101  5/8/96    CRS      .05   not det   .05   not det   7.9

Figure 4 - Environmental data prior to normalization - Problems: Repeating Groups, Redundancy
First Normal Form – First we will convert our flat file to First Normal Form. In this form, data is represented as a two-dimensional array, like a flat file. Unlike some flat files, a First Normal Form table has no repeating groups of fields. In the flat file illustration in Figure 4, there are columns for arsenic (As), chloride (Cl), and pH, and detection flags for arsenic and chloride. This design requires that space be allocated for every constituent for every sample, even though some constituents were not measured. Converting this table to First Normal Form results in the configuration shown in Figure 5.
Well  Elev  X     Y     SampDate  Sampler  Param  Value  Flag
B-1   725   1050  681   2/3/96    JLG      As     .05    not det
B-1   725   1050  681   2/3/96    JLG      pH     6.8
B-1   725   1050  681   5/8/96    DWR      As     .05    not det
B-1   725   1050  681   5/8/96    DWR      Cl     .05    not det
B-1   725   1050  681   5/8/96    DWR      pH     6.7
B-2   706   342   880   11/4/95   JAM      As     3.7    detected
B-2   706   342   880   11/4/95   JAM      Cl     9.1    detected
B-2   706   342   880   11/4/95   JAM      pH     5.2
B-2   706   342   880   2/3/96    JLG      As     2.1    detected
B-2   706   342   880   2/3/96    JLG      Cl     8.4    detected
B-2   706   342   880   2/3/96    JLG      pH     5.3
B-2   706   342   880   5/8/96    DWR      As     1.4    detected
B-2   706   342   880   5/8/96    DWR      Cl     7.2    detected
B-2   706   342   880   5/8/96    DWR      pH     5.8
B-3   714   785   1101  2/3/96    JLG      As     .05    not det
B-3   714   785   1101  2/3/96    JLG      pH     8.1
B-3   714   785   1101  5/8/96    CRS      As     .05    not det
B-3   714   785   1101  5/8/96    CRS      Cl     .05    not det
B-3   714   785   1101  5/8/96    CRS      pH     7.9

Figure 5 - Environmental data in first normal form (no repeating groups) - Problem: Redundancy
In this form, there is a line for each constituent that was measured for each well. There is no wasted space for the chloride measurements for B-1 or B-3 for 2/3/96. There is, however, quite a bit of redundancy. The elevation and location are repeated for each well, and the sampler’s initials are repeated for each sample. This redundancy is addressed in Second Normal Form.

Second Normal Form – In this form, redundant data is moved to separate tables. In the more formal terminology of data modeling, data in non-key columns must be fully dependent on the primary key for the table. Key columns are the columns that uniquely identify a record. In our example, the data that uniquely identifies each row in the analytical table is a combination of the well, the sampling date, and the parameter. The above table has data, such as the elevation and the sampler, which is not dependent on the entire compound key, but on only part of the key. Elevation depends on well but not on sample date, and sampler depends on well and sample date but not on parameter. In order to convert our table to Second Normal Form, we must separate it into three tables as shown in Figure 6.

Third Normal Form – This form requires that the table conform to the rules for First and Second Normal Form, and that all non-key columns of a table be dependent on the table’s primary key and independent of one another. Once our example data set has been modified to fit Second Normal Form, it also meets the criteria for Third Normal Form, since all of the non-key values are dependent on the key fields and not on each other.

Fourth Normal Form – The rule for Fourth Normal Form is that independent data entities cannot be stored in the same table where many-to-many relationships exist between these entities. Many-to-many relationships cannot be expressed as simple relationships between entities, but require another table to express this relationship. Our tables as described above meet the criteria for Fourth Normal Form.
Stations
Well  Elev  X     Y
B-1   725   1050  681
B-2   706   342   880
B-3   714   785   1101

Samples
Well  SampDate  Sampler
B-1   2/3/96    JLG
B-1   5/8/96    DWR
B-2   11/4/95   JAM
B-2   2/3/96    JLG
B-2   5/8/96    DWR
B-3   2/3/96    JLG
B-3   5/8/96    CRS

Analyses
Well  SampDate  Param  Value  Flag
B-1   2/3/96    As     .05    not det
B-1   2/3/96    pH     6.8
B-1   5/8/96    As     .05    not det
B-1   5/8/96    Cl     .05    not det
B-1   5/8/96    pH     6.7
B-2   11/4/95   As     3.7    detected
B-2   11/4/95   Cl     9.1    detected
B-2   11/4/95   pH     5.2
B-2   2/3/96    As     2.1    detected
B-2   2/3/96    Cl     8.4    detected
B-2   2/3/96    pH     5.3
B-2   5/8/96    As     1.4    detected
B-2   5/8/96    Cl     7.2    detected
B-2   5/8/96    pH     5.8
B-3   2/3/96    As     .05    not det
B-3   2/3/96    pH     8.1
B-3   5/8/96    As     .05    not det
B-3   5/8/96    Cl     .05    not det
B-3   5/8/96    pH     7.9

Figure 6 - Environmental data in second normal form - Problems: Compound Keys, Repeated Values
Fifth Normal Form – Fifth Normal Form requires that you be able to recreate exactly the original table from the tables into which it was decomposed. Our tables do not meet this criterion, since we cannot duplicate the repeating groups (value and flag for As, Cl, and pH) from our original table because we don’t know the order in which to display them. In order to overcome this, we need to add a table with the display order for each parameter.

Param  Parameter  Order
1      As         1
2      Cl         2
3      pH         3

Figure 7 - Parameters table for fifth normal form
The Flag field can be handled in a similar way. Note that when we handle the flag as a lookup table (a table containing values and codes), we are required to have values for that field in all of the records, or else records will be lost in a join at retrieval time. We have used “v” for detected value for the pH results, but a different flag could be used if desired to signify that these are different from the chemical values.

Removal of compound keys – In order to minimize redundant data storage and to decrease the opportunity for error, it is often useful to remove compound keys from the tables and replace them with an artificial key that represents the removed values. For example, the Analyses table has a compound key of Well and SampDate. This key can be replaced by a SampleID field, which points to the sample table. Likewise, the parameter name can be removed from the Analyses table and a code inserted in its place. These keys can be AutoNumber fields maintained by the system, and don’t have to have any meaning relative to the data other than that the data is assigned to that key. Keys of this type are called synthetic keys and, because they are not compound, are also called simple keys. In many data management systems, some activities are made much easier by simple synthetic keys. For example, joining records between tables, and especially more complicated operations like pivot tables, can be much easier when there is a single field to use for the operation.

The final result of the conversion to Fifth Normal Form is shown in Figure 8. The SampleID field in the Samples table is called a primary key, and must be unique for each record in the table. The SampleID field in the Analyses table is called a foreign key because it is the primary key in a different (foreign) but related table.
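In SQL data definition terms, the Samples and Analyses tables of Figure 8 might be declared as follows. This is only a sketch with illustrative type names; the AutoNumber mechanics and exact syntax vary between database products, and it assumes the Stations, Parameters, and Flags tables have already been created:

CREATE TABLE Samples (
  SampleID  INTEGER PRIMARY KEY,            -- simple synthetic key
  WellID    INTEGER REFERENCES Stations,    -- foreign key to Stations
  SampDate  DATE,
  Sampler   TEXT);

CREATE TABLE Analyses (
  SampleID  INTEGER REFERENCES Samples,     -- foreign key to Samples
  Param     INTEGER REFERENCES Parameters,  -- coded parameter
  Value     REAL,
  Flag      TEXT REFERENCES Flags);         -- coded flag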
STRUCTURED QUERY LANGUAGE

Normalization of a data model usually breaks the data out into multiple tables. Often, to view the data in a way that is meaningful, it is necessary to put the data back together again. You can view this as “de-normalizing” or reconstituting the data. The tool most commonly used to do this is Structured Query Language, or SQL. The following sections provide an overview of SQL, along with examples of how it is used to provide useful data.
Overview of SQL

The relational data model provides a way to store complex data in a set of relatively simple tables. These tables are then joined by key fields, which are present in more than one table and allow the data in the tables to be related to each other based on the values of these keys. Structured Query Language (SQL) is an industry-standard way of retrieving data from a relational database management system (RDBMS). There are a number of good books available on the basics of SQL, including van der Lans (1988) and Trimble and Chappell (1989).
How SQL is used

In the simplest sense, an SQL query can take the data from the various tables in the relational model and reconstruct them into a flat file again. The benefit is that the format and content of the resulting grid of data can be different each time a retrieval is performed, and the format of the output is somewhat independent of the structure of the underlying tables. In other words, the presentation is separate from the contents.

There are two parts to SQL: the data definition language (DDL) for creating and modifying the structure of tables, and the data manipulation language (DML) for working with the data itself. Access and other graphical data managers replace the SQL DDL with a graphical system for creating and editing table structures. This section will discuss SQL DML, which is used for inserting, changing, deleting, and retrieving data from one or more relational tables.

The SQL keywords for changing data are INSERT, UPDATE, and DELETE. As you would expect, INSERT places records in a table, UPDATE changes values in a table, and DELETE removes records from a table. Data retrieval using SQL is based on the SELECT statement. The SELECT statement is described in a later section on queries.

SQL is a powerful language, but it does take some effort to learn it. In many cases it is appropriate to hide this complexity from users. This can be done using query-by-form and other techniques where users are asked for the necessary information, and then the query is generated for them automatically.
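As a quick sketch of the three data-changing keywords, using the Samples table from Figure 8 (shown next):

INSERT INTO Samples (SampleID, WellID, SampDate, Sampler) VALUES (8, 1, '11/12/96', 'DWR')

UPDATE Samples SET Sampler = 'JLG' WHERE SampleID = 8

DELETE FROM Samples WHERE SampleID = 8

The first statement adds a new sampling event, the second changes a value in that record, and the third removes the record again. As with SELECT (described below), the WHERE clause controls which records are affected. Date and text literal syntax varies between database products.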
Stations
WellID  Well  Elev  X     Y
1       B-1   725   1050  681
2       B-2   706   342   880
3       B-3   714   785   1101

Samples
SampleID  WellID  SampDate  Sampler
1         1       2/3/96    JLG
2         1       5/8/96    DWR
3         2       11/4/95   JAM
4         2       2/3/96    JLG
5         2       5/8/96    DWR
6         3       2/3/96    JLG
7         3       5/8/96    CRS

Parameters
Param  Parameter  Order
1      As         1
2      Cl         2
3      pH         3

Flags
Flag  Name
u     not det
v     detected

Analyses
SampleID  Param  Value  Flag
1         1      .05    u
1         3      6.8    v
2         1      .05    u
2         2      .05    u
2         3      6.7    v
3         1      3.7    v
3         2      9.1    v
3         3      5.2    v
4         1      2.1    v
4         2      8.4    v
4         3      5.3    v
5         1      1.4    v
5         2      7.2    v
5         3      5.8    v
6         1      .05    u
6         3      8.1    v
7         1      .05    u
7         2      .05    u
7         3      7.9    v

Figure 8 - Environmental data in fifth normal form with simple keys and coded values
Data retrieval is based on the SQL SELECT statement. The basic SELECT statement syntax (the way it is written) is:

SELECT field list
FROM table list
WHERE filter expression

An example of this would be an SQL query to extract the names and depths of the borings (stations) deeper than 10 feet:

SELECT StationName, Depth
FROM Stations
WHERE Depth > 10

The SELECT and FROM clauses are required. The WHERE clause is optional. Data retrieval in Access and other relational database managers such as Oracle and SQL Server is based on this language.

Most queries are more complicated than this. The following is an example of a complicated multi-table query created by Enviro Data, a product from Geotech Computer Systems, Inc. This query outputs data to be used for a report of analytical results.
Figure 9 - Data display resulting from a query
SELECT DISTINCTROW Stations.StationName, Stations.StationType,
  Stations.Location_CX, Stations.Location_CY, Stations.GroundElevation,
  Stations.StationDate_D, Samples.SampleType, Samples.SampleTop,
  Samples.SampleBottom, Samples.Sampler, Samples.SampleDate_D,
  Parameters.ParameterNumber, Parameters.LongName, Analytic.Value,
  Analytic.AnalyticMethod, Analytic.AnalyticLevel, Analytic.ReportingUnits,
  Analytic.Flag, Analytic.Detect, Analytic.Problems, Analytic.AnalDate_D,
  Analytic.Lab, Analytic.ReAnalysis AS Expr1
FROM StationTypes INNER JOIN (Stations INNER JOIN (Samples INNER JOIN
  ([Parameters] INNER JOIN Analytic
  ON Parameters.ParameterNumber = Analytic.ParameterNumber)
  ON Samples.SampleNumber = Analytic.SampleNumber)
  ON Stations.StationNumber = Samples.StationNumber)
  ON StationTypes.StationType = Stations.StationType
WHERE (((Stations.StationName) Between [Forms]![AnalyticReport]![StartingStation]
  And [Forms]![AnalyticReport]![EndingStation])
  AND ((Samples.SampleTop) Between [Forms]![AnalyticReport]![LowerElev]
  And [Forms]![AnalyticReport]![UpperElev])
  AND ((Samples.SampleBottom) Between [Forms]![AnalyticReport]![LowerElev]
  And [Forms]![AnalyticReport]![UpperElev])
  AND ((Parameters.ParameterNumber) Between [Forms]![AnalyticReport]![LowerParam]
  And [Forms]![AnalyticReport]![UpperParam]))
ORDER BY Stations.StationName, Samples.SampleTop;

Users are generally not interested in all of this complicated SQL code. They just want to see some data. Figure 9 shows the result of a query in Access.
Figure 10 - Query that generated Figure 9
The SQL statement that generated Figure 9 is shown in Figure 10. It’s beyond most users to type queries like this. Access helps these users create a query with a grid-type display. The user can drag and drop fields from the tables at the top to create the result. Figure 11 shows what the grid display looks like when the user is creating a query.
Figure 11 - Grid display of the query in Figure 10
Figure 12 - Query-by-form in the Enviro Data customized database system (This software screen and others not otherwise noted are courtesy of Geotech Computer Systems)
Even this is too much for some people. The complexity can be hidden even more in a customized system. The example in Figure 12 allows the users to enter selection criteria without having to worry too much about tables, fields, and so on. Then if users click on List they will get a query display similar to the one in Figure 9. In this system, they can also take the results of the query (the code behind this form creates an SQL statement just like the one above) and do other things such as create maps and reports.
BENEFITS OF NORMALIZATION

The data normalization process can take a lot of effort. Why bother? The answer is that normalizing your data can provide several benefits, including greater data integrity, increased storage efficiency, and more flexible data retrieval.
Data integrity

A normalized data model improves data integrity in several ways. The most important is referential integrity. This is an aspect of managing the data that is provided by the data management software. With referential integrity, relationships can be defined which require that a record in one table must be present before a record in a related table can be added. Usually the required record is a parent record (the “one” side of the one-to-many), which is necessary before a child record (the “many” side) can be entered. An example would be preventing you from entering a sample from a well before you have entered the well. More importantly, referential integrity can prevent deletion of a record when there are related records that depend on it, and this can be a great tool to prevent “orphan” records in the database. A good introduction to referential integrity in Access can be found in Harkins (2001b).
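In SQL terms, referential integrity is usually declared with a foreign key constraint. A minimal sketch, using illustrative Wells and Samples tables:

CREATE TABLE Wells (
  WellID    INTEGER PRIMARY KEY,
  WellName  TEXT);

CREATE TABLE Samples (
  SampleID  INTEGER PRIMARY KEY,
  WellID    INTEGER NOT NULL,
  SampDate  DATE,
  FOREIGN KEY (WellID) REFERENCES Wells (WellID));

With these declarations in place, the system rejects a sample whose WellID does not match an existing well, and can block (or cascade) the deletion of a well that still has samples, depending on how the constraint is configured.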
A related concept is entity integrity. This requires that every row must be uniquely identified, meaning that each primary key value in the table must be unique and non-null. It is always a good idea for tables to have a primary key field, and for entity integrity to be enforced.

Another contributor to data integrity is the use of lookup tables, which is a capability of the relational database design. Lookup tables are tables where values that are repeated in the database are replaced by codes in the main data tables, and another table contains the codes and the lookups. An example is types of stations in the Stations table. Values for this field might include “soil boring,” “monitoring well,” and so on. These could be replaced by codes like “s” and “m,” respectively, in the main table, and the lookup table would contain the translation between the code and the value. It is much more likely that a user will mistype “detected” than “2.” Errors are even less likely if they are picking the lookup value off a list. They can still pick the wrong one, but at least errors from misspelling will be minimized.
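A sketch of the station type lookup described above (table and field names are illustrative):

CREATE TABLE StationTypes (
  StationType  CHAR(1) PRIMARY KEY,  -- the code, such as 's' or 'm'
  Description  TEXT);                -- the lookup value

INSERT INTO StationTypes VALUES ('s', 'soil boring');
INSERT INTO StationTypes VALUES ('m', 'monitoring well');

A join then translates the codes back into readable values at retrieval time:

SELECT Stations.StationName, StationTypes.Description
FROM Stations, StationTypes
WHERE Stations.StationType = StationTypes.StationType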
Efficient storage

Despite the increase in organization, the data after normalization as shown in Figure 8 actually takes less room than the original flat file. The data normalization process helps remove redundant data from the database. Because of this it may take less space to store the data. The original flat file contained 293 characters (not including field names) while the Fifth Normal Form version has only 254 characters, a decrease of 13% with no decrease in information content. The decrease in size can be quite a bit more dramatic with larger data sets. For example, in the new normalized model adding data for another analysis would take up only about 6 characters, where in the old model it would be about 34 characters, a savings in size of 82%.
Flexible retrieval

When the data is stored in a word processor or spreadsheet, the output is usually a direct representation of the way the data is stored in the file. In a relational database manager, the output is usually based on a query, which provides an opportunity to organize and format the data a different way for each retrieval need.
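For example, a single query against the Figure 8 tables can reconstitute a flat view resembling the original file, without changing how the data is stored (a sketch; join syntax details vary between products):

SELECT Stations.Well, Samples.SampDate, Samples.Sampler,
  Parameters.Parameter, Analyses.Value, Flags.Name
FROM Stations, Samples, Analyses, Parameters, Flags
WHERE Stations.WellID = Samples.WellID
AND Samples.SampleID = Analyses.SampleID
AND Analyses.Param = Parameters.Param
AND Analyses.Flag = Flags.Flag
ORDER BY Stations.Well, Samples.SampDate

A different query against the same tables could just as easily produce a crosstab-style report, a list for a map, or input to a statistics package.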
AUTOMATED NORMALIZATION
Normalization of a database is important enough that some database programs include tools to assist with the process. For example, Microsoft Access has a Table Analyzer Wizard that can help you split a table into multiple normalized tables based on the data content. To illustrate this, Figure 13 shows a very simple flat file of groundwater data with several stations, sample dates, and parameters. This file is already in First Normal Form, with no repeating groups, which is the starting point for the Wizard. The Table Analyzer Wizard was then run on this table, and Figure 14 shows the result. The Wizard has created three tables, one with the value, one with the station and sample date, and one with the parameter, units, and flag. It has also added several key fields to join the tables. The user can then modify the software’s choices using a drag-and-drop interface. The Wizard also steps you through a process for correcting typographical errors to improve the result. For comparison, Figure 15 shows how someone familiar with the data would normalize the tables, creating primary tables for stations, samples, and analyses, and a lookup table for parameters; a sketch of such a design appears after the discussion below. (Actually, units and flags should probably be lookup tables too.)
Figure 13 - Data set for the Table Analyzer Wizard
Figure 14 - Wizard analysis of groundwater data
Figure 15 - Manual normalization of groundwater data
For this data set, the Wizard didn’t do very well, missing the basic hierarchical nature of the data. With other data sets with different variability in the data it might do better. In general the best plan is for people familiar with the data and the project needs to design the fields, tables, and relationships based on a thorough understanding of the natural relationships between the elements being modeled in the database. But the fact that Microsoft has provided a tool to assist with the normalization process highlights the importance of this process in database design.
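As a rough SQL sketch of the manually normalized design in Figure 15, extending the hypothetical Stations and Samples tables shown earlier (all names are illustrative, not a prescribed schema):

    CREATE TABLE Parameters (                 -- lookup table
        ParameterCode VARCHAR(10) PRIMARY KEY,
        ParameterName VARCHAR(40) NOT NULL,
        Units         VARCHAR(10)
    );

    CREATE TABLE Analyses (                   -- one row per analytical result
        AnalysisID    INTEGER PRIMARY KEY,
        SampleID      INTEGER REFERENCES Samples (SampleID),
        ParameterCode VARCHAR(10) REFERENCES Parameters (ParameterCode),
        Value         FLOAT,
        Flag          VARCHAR(2)
    );

The stations-samples-analyses hierarchy is explicit here, which is exactly what the Wizard missed.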
CHAPTER 4 DATA CONTENT
A database can contain just about anything. The previous chapter discussed what a database is and how the data can be stored. A bigger issue is what data the database will contain. It is also important to make sure that database users understand the data and use it properly.
DATA CONTENT OVERVIEW
The data contained in an EDMS can come from a wide variety of sources and can be used in many different ways. For some published examples, see Sara (1974), especially chapters 7 and 11. There is a wide range of data needs for the different projects performed by the typical environmental organization. With projects ranging from large sites with hundreds of wells and borings to small service stations with minimal data, and staffing levels from dozens of people on one project to many projects for one person, the variability of the data content needed for environmental projects is very great. Even once a data management system is in place, additional data items may be identified over time, as other groups and individuals become involved in the data management system, and as different project needs arise.
Since different groups have differing data storage needs, a primary design goal of the data management system must be to accommodate this diversity while fostering standardization where possible. All of these sets of needs must be considered during design and development to avoid unnecessary limitations for any group. While there is always a trade-off between flexibility, ease of use, and development cost in creating a data management system, a well-designed data model for information storage can be worth the initial expense.
Typical data components of an EDMS include four areas. The first three are project related: Project Technical Data, Project Administrative Data, and Project Document Data. The fourth, Reference Data, is not specific to an individual project, but in many cases has relationships to project-specific data. Often project data is the same as site data, so the terms are used interchangeably here. Figure 16 shows the main data components for the EDMS. This is just one way of organizing data; there are probably other equally valid ways of assigning a hierarchy to the data.
Different data components are covered here in different levels of detail. The primary focus of this book is on project technical data, and especially data related to site investigation and remediation. This data is described briefly here, and in greater detail throughout the book. Other data components are covered briefly in this section, and have been well discussed in other books and articles. For each category of data, the major data components will be listed, along with, in many cases, comments about data management aspects of those components.
Figure 16 - Overview of the EDMS data model. (The diagram shows enterprise data divided into project data, comprising technical, administrative, and document data, and reference data.)
PROJECT TECHNICAL DATA
This category covers the technical information used by project managers, engineers, and geoscientists in managing, investigating, and remediating sites of environmental concern. Many acronyms are used in this section. The environmental business loves acronyms. You can find the meanings of these acronyms in Appendix A.
Product information and hazardous materials
In an operating facility, tracking of materials within and moving through the facility can be very important from a health and safety perspective. This ranges from purchasing of materials through storage, use, and disposal. There are a number of aspects of this process where data management technology can be of value. The challenge is that in many cases the data management process must be integrated with the process management within the facility.
Material information – The information to be stored about the materials themselves is roughly the same for all materials in the facility. Examples include general product information; materials management data; FIFRA labeling; customer usage; chemical information; and MSDS management, including creation, access, and updating/maintenance.
Material usage – Another set of data is related to the usage of the materials, including shelf life tracking; recycling and waste minimization and reporting; and allegation and incident reports related to materials. Also included would be information related to keeping the materials under control, including source inventory information; LDR; pollution prevention; TSCA and asbestos management; and exception reports.
Hazardous wastes
When hazardous materials move from something useful in the facility to something to be disposed of, they become hazardous wastes. This ranges from things like used batteries to toxic mixed wastes (affectionately referred to with names like “trimethyl death”).
Waste handling – Safe handling and disposal of hazardous chemicals and other substances involves careful tracking of inventories and the shipping process. There is a significant recordkeeping component to working with this material. Data items include waste facility permit applications; waste accumulation and storage information (usually for discrete items like batteries); and waste stream information (more for materials that are part of a continuous process).
If you put a drop of wine in a gallon of hazardous waste, you get a gallon of hazardous waste. If you put a drop of hazardous waste in a gallon of wine, you get a gallon of hazardous waste. Rich (1996)
Waste disposal – If the waste is to be disposed of onsite, then detailed records must be kept of the process, and data management can be an important component. If the waste is to be shipped offsite, then waste manifesting; waste shipping; NFPA/HMIS waste labels; and hazardous waste reports are important issues.
Environmental releases
Inevitably, undesirable materials make it from the facility into the environment, and the EDMS can help track and report these releases.
Types of releases – Water quality and drinking water management cover issues related to supposedly clean water. Wastewater management and pretreatment management cover water that is to be released beyond the facility. Once the material makes it into the groundwater, then groundwater management becomes important, as discussed in the next section.
Release issues – Regarding the releases themselves, issues include emissions management; air emissions inventory; NPDES permits; discharges and stormwater monitoring; stormwater runoff management; leak detection and LDAR; toxic chemical releases and TRI; and exceedance monitoring and reporting.
Site investigation and remediation
When control equipment fails, it will do so during inspection. Rich (1996)
The data gathered for site investigation and remediation is the major topic of this book. This data will be discussed briefly here, and in much more detail elsewhere in the book, especially Chapter 11 and Appendix B.
Site data – Site data covers general information about the project such as location; type (organics, metals, radiologic, mixed, etc.); ownership; status; and so on. Other data to be stored at the site level includes QA and sampling plans; surveys; field data status; various other components of project status information; and much other information below the site level.
Geology and hydrogeology – Geology and hydrogeology data includes surface geology information such as geologic maps, as well as subsurface information from wells, borings, and perhaps geophysical surveys. There are two kinds of geology/hydrogeology data for a site. The first could be considered micro data. This data is related to a particular boring or surface sample, and includes physical, lithological, and stratigraphic information. This information can be considered part of the sampling process, and can be included in the chemistry portion of the EDMS. The other type of data can be considered macro. This information covers more than an individual sample. It would include unit distribution and thickness maps; outcrop geology; facies maps (where appropriate); hydraulic head maps; and other data that is site-wide or nearly so.
Stations – This is information related to locations of sampling such as monitoring wells; borings; surface water and air monitoring locations; and other sampling locations. In the EDMS, station data is separated into sample-related data and other data. The sample-related data is discussed below. Other station-related data includes surveys; boring logs; wellbore construction; and stratigraphy and lithology on a boring basis (as opposed to by sample). Some primary data elements for stations can cause confusion, including station coordinates and depths. For example, the XY coordinates used to describe station locations are based on some projection from latitude-longitude, or measurements on the surface of the earth from some (hopefully) known point (see Chapter 22). Either way, the meaning of the coordinates depends on reference information about the projection system or the known points. Another example is depths, which are sometimes stored as measured depths, and other times as elevations above sea level. With both of these examples, interpretation of numbers in fields in the database depends on knowledge of some other information (see the sketch at the end of this section).
Samples – Information gathered for each sample includes sample date; frequency; depth; matrix; and many other things. This information is gathered from primary documents generated as part of the sampling process, such as the sample report and the Chain of Custody (COC). Quality control (QC) samples are an integral part of the sampling process for most environmental projects. An area where data management systems have great potential is in the interface between the field sampling event, the laboratory, and the centralized database; and for event planning; field management; and data checking.
Analyses – The samples are then analyzed in the field and/or in the laboratory for chemical, physical, geotechnical, and sometimes geophysical parameters. QC data is important at the analyses level as well.
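To illustrate the depths example above, here is a hedged sketch of a query that derives sample elevations from measured depths. The GroundElev and SampleDepth fields are assumptions added to the earlier hypothetical tables, not part of any particular system:

    -- Derive sample elevations from measured depths; both the ground
    -- elevation and the depths must share known reference information.
    SELECT st.StationID,
           sa.SampleID,
           st.GroundElev - sa.SampleDepth AS SampleElevation
    FROM   Stations st
           INNER JOIN Samples sa ON sa.StationID = st.StationID;

The arithmetic is trivial; the point is that it is only valid if the datum behind GroundElev and the measurement convention behind SampleDepth are known and consistent.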
Cartography
This category includes map-related data such as site maps; coordinates; air and satellite photos; and topography. It is a broad category. For example, site maps can include detailed maps of the site as well as regional maps showing the relationship of the site to geographic or other features. Implementation of this category in a way that is accessible to all the EDMS users requires providing some map display and perhaps editing capabilities as part of the system. This can be done by integrating an external mapping program such as ArcView, MapInfo, or Enviro Spase, or by inserting a map control (software object) like GeoObjects or MapObjects into the Access database. Data displayed on maps presented this way could include not only the base map information such as the site outline; buildings; etc. but also data from the EDMS, including well locations; sample counts; analytical values; and physical measurements like hydraulic head.
Coverages
Some site data is discrete, that is, it is gathered at and represents a specific location in space. Other data represents a continuous variable across all or part of the site, and this type of data is referred to as a coverage. Examples of coverage data include surface elevation or the result of geophysical surveys such as gravity or resistivity. Often a single variable is sampled at discrete locations selected to represent a surface or volume. These locations can be on a regular grid or at irregular locations to facilitate sampling. In the first case, the locations can be stored implicitly by specifying the grid geometry and then listing the values. In the second case the coordinates need to be stored with each observation.
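One hedged sketch of these two storage designs, with invented table names:

    -- Regular grid: geometry stored once, locations implied by row/column.
    CREATE TABLE GridGeometry (
        GridID    INTEGER PRIMARY KEY,
        OriginX   FLOAT, OriginY   FLOAT,
        CellSizeX FLOAT, CellSizeY FLOAT
    );
    CREATE TABLE GridValues (
        GridID  INTEGER REFERENCES GridGeometry (GridID),
        GridRow INTEGER,
        GridCol INTEGER,
        Value   FLOAT,
        PRIMARY KEY (GridID, GridRow, GridCol)
    );

    -- Irregular locations: coordinates stored with each observation.
    CREATE TABLE CoveragePoints (
        PointID INTEGER PRIMARY KEY,
        X FLOAT, Y FLOAT,
        Value   FLOAT
    );

The gridded design trades a little indirection for much more compact storage when the coverage has many points.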
Models
Models are spatial data that result from calculations. The EDMS can work with models in two ways. The first is to store and output data to be used in the modeling process. An example would be to select data and export an XYZ file of the current concentration of some constituent for use in a contouring program. This feature should be a standard part of the EDMS. The other model component of the EDMS is storage and display of the results of modeling. Most modeling projects result in a graphical output such as a contour map or three-dimensional display, or a numerical result. Usually these results can be saved as images in a file in vector (lines,
polygons, etc.) or raster (pixels) format, or in numeric fields in tables. These graphics files or other results can be imported into the EDMS and displayed upon request. Examples of models often used in environmental projects include air dispersion; surface water and watershed configuration and flow; conceptual hydrologic; groundwater fate and transport; subsurface (2-D and 3-D) contouring; 3-D modeling and visualization; statistics, geostatistics, and sampling plan analysis; cross sections; volume estimates; and resource thicknesses and stripping ratios. Some of these are covered in more detail in later chapters.
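The export side might be served by a query along the following lines. This is a sketch with hypothetical names (XCoord, YCoord, and a benzene parameter code), continuing the tables sketched earlier, not a prescribed design; it selects the most recent result for each station:

    -- X, Y from the station survey; Z is the most recent benzene result.
    SELECT st.XCoord AS X,
           st.YCoord AS Y,
           an.Value  AS Z
    FROM   Stations st
           INNER JOIN Samples  sa ON sa.StationID = st.StationID
           INNER JOIN Analyses an ON an.SampleID  = sa.SampleID
    WHERE  an.ParameterCode = 'benzene'
      AND  sa.SampleDate = (SELECT MAX(s2.SampleDate)
                            FROM   Samples s2
                            WHERE  s2.StationID = st.StationID);

The result set can then be written to a text file and handed to the contouring program.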
Other issues
Other data elements – The list of possible data elements to store for remediation projects is almost endless. Other data elements and capabilities include toxicity studies and toxicology; risk assessment; risk analysis and risk management; biology and ecology; SCADA; remediation planning and remedial plans; design calculations; and geotechnical designs.
Summary information – Often it is useful to be able to see how many items of a particular kind of data are stored in the database, such as how many wells there are at a particular site, or how many arsenic analyses in the last year exceeded regulatory limits. There are two ways to provide this information: live or canned. In live count generation, the records are counted in real time as the result of a query. This is the most flexible approach, since the counts can be the result of nearly any query, and usually the most accurate, since the results are generated each time from the base data. With canned count generation, the counting is done separately from the display, and the results are stored in a table. This is useful when the counts will be used often, and doing the counting takes a long time. It has the disadvantage that the counts can become “stale” when the primary data changes and the counts are not updated. Some systems use a combination of both approaches.
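A hedged sketch of the two approaches in SQL, assuming a hypothetical SiteID field on the stations table and a StationCounts summary table:

    -- Live: counted on demand, always current.
    SELECT SiteID, COUNT(*) AS WellCount
    FROM   Stations
    WHERE  StationType = 'm'
    GROUP  BY SiteID;

    -- Canned: the same counts stored in a summary table and refreshed
    -- periodically; they can go stale if the base data changes first.
    DELETE FROM StationCounts;
    INSERT INTO StationCounts (SiteID, WellCount)
    SELECT SiteID, COUNT(*)
    FROM   Stations
    WHERE  StationType = 'm'
    GROUP  BY SiteID;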
PROJECT ADMINISTRATIVE DATA
Project administrative data covers a wide variety of types of data related to managing and administering projects. Examples of project administrative and management data elements include general site information; project management data; health and safety; and employee information. Some of this data consists of numbers and text items that should be stored using database management software. Other parts of this data are more document oriented, and might be a better fit in a document management system as discussed in a later section.
Site information
Project overview – This could include site profile (overview) reports and descriptions; status reports and other general project information; and process descriptions and history.
Ownership and permits – Items in this category include ownership and related administrative information; property transfers and related contact information; and information about permitting, permit management, and tracking.
Infrastructure – Possible infrastructure data elements include building and equipment information such as type; location; and operating history.
Operations and maintenance – This is a broad area with much opportunity for data management assistance. It includes data related to requirements management; certifications, inspections, and audits; QA planning and documentation; continuous improvement and performance management; lockout/tagout surveys for equipment (where equipment that is not working properly is removed from service); energy analysis; emergency response; hazard analysis and tracking; collection systems; and residuals and biosolids management.
Laboratory operations – If the site has one or more onsite laboratories, this would include data related to the LIMS; lab data checking; EDDs; and Web data delivery.
Project management
Project management information can be stored as part of the general site database, or in specialized project management or accounting software.
Budgets – Typical information in this category includes planned vs. actual expenditures; progress (% complete); costs on an actual and a unit basis, including EH&S (employee health and safety) expenses; sampling and analysis costs; and fee reports.
Schedules – Schedule items can be regulatory items such as report due dates and other deadlines, or engineering items like work breakdown and critical path items.
Reimbursement and fund recovery – For some projects it is possible to obtain partial reimbursement for remediation expenses from various funds such as state trust funds. The database can help track this process.
Emission reduction credits – Facilities are required to meet certain emission criteria, and to encourage them to meet or even exceed these criteria they can be issued emission reduction credits, which they can sell to facilities that are having trouble meeting their criteria. The purchasing facility can then use these credits to minimize penalties.
Customers and vendors – This would include information such as customers’ and vendors’ names and contact information; purchasing data; and electronic procurement of products and services.
Other issues – There are many other project management data items that could be stored in the database, including project review reports, records of status meetings, and so on.
Health and safety
Tracking employee health and training is another fruitful area for a computerized data management approach. Because the health and safety of people is at stake, great care must be taken in gathering, manipulating, and reporting health and safety information.
Facility information – Information to be tracked at the facility level includes process safety; hazard assessments; workplace safety plans and data; fire control and alarm systems; safety inspections and audits; and OSHA reports.
Employee information – There is a large amount of information in this category that needs to be tracked, including safety, confined space, and other specialized training; accident/illness/injury tracking and reporting; individual exposure monitoring; and workers’ compensation and disability coverage and claims.
General health information – This category includes industrial hygiene; occupational medicine; environmental risk factors; toxicology and epidemiology; and monitoring of the health exposure and status of onsite and offsite individuals.
Personnel/project team
Project staff – Data categories for project staff include the health and safety information discussed above, as well as organization information; personnel information; recruiting, hiring, and human resource information; demographics; training management and records; and certifications and fit tests.
Others – Often information similar to that stored for project staff must be stored for others, such as contractor and perhaps laboratory personnel.
Incident tracking and reporting
Despite the best efforts of project personnel, incidents of various types occur. The database system can help minimize incidents, and then track them when they do occur.
Planning – Data to be stored prior to an incident includes emergency plans and information on emergency resources.
Responding – This item covers things such as emergency management information and mutual aid (cooperation on emergency response).
Tracking – This includes incident tracking; investigation and notification; “near miss” tracking; spill tracking; agency reports; and complaint tracking and response.
Regulatory status
Regulatory status information for a facility or project includes quality assurance program plans; corrective action plans and corrective actions; rulings; limits (target values) and reporting times; Phase 1, Form R, and SARA (Superfund) reporting; Clean Air Act management; right to know management; project oversight and procedures; ROD and other limits; and certifications (such as ISO 9000 and 14000).
Multi-plant management
Large organizations often have many facilities with various environmental issues, each with a different level of urgency. Even big companies have limited financial and personnel resources, and are unable to pay full attention to all of the needs of all of the facilities. Unfortunately, this can lead to a “brush-fire” approach to project management, where the projects generating the most heat get the most attention. Methods exist to evaluate facilities and prioritize those needing the most attention, with the intention of dealing with issues before they become serious problems.
Public affairs, lobbying, legislative activities
Public response to site activities is growing in importance, with activities such as environmental justice lawsuits requiring more and more attention from site and corporate staff. A proactive approach of providing unbiased information to the public can, in some cases, minimize public outcry. A data management system can help organize these activities. Likewise, contacts with legislators and regulators can be of value, and can be managed using the database system.
PROJECT DOCUMENT DATA
In terms of total volume, project document data is probably the largest body of data that must be managed within any organization. It may also be the most diverse in its current physical configuration and location. Options for storing document data are discussed in a later section of this chapter. The distinction between this category and the previous one about administrative data is that the previous data elements are part of the operations process, while items in this category are records of what happened. These items include compliance plans and reports; investigative reports (results); agreements; actual vs. planned activities; correspondence; drawings; and document management, control, and change tracking.
REFERENCE DATA
Often it is desirable to provide access to technical reference information, both project-specific and general, on a timely basis. This includes boilerplate letters and standardized report templates. Reference data looks similar to project document data in its physical appearance, but is not associated with a specific project. Storing this type of information in a centralized location can be a great time saver in getting the work out. The storage options for this data are mostly the same as those for project document data. One difference is that some of the components of this data type may be links to external data sources, rather than data stored within the organization. An example of this might be a reference source for updated government regulations accessed from the Internet via the company intranet. Other than that, the reference data will usually be handled in the same way as project document data.
Administrative
General administrative data can be similar to project data, but for the enterprise rather than for a specific project. This data includes timesheets; expense reports; project and task numbers; policies; correspondence numbers; and test reports. This category can also include information about shared resources, both internal and external, such as personnel and equipment availability; equipment service records; contractors; consultants; and rental equipment.
Regulatory
Regulatory compliance information can be either general or project-specific. The project-specific information was discussed above. In many cases it is helpful to store general regulatory compliance data in the database, including regulatory limits; reporting time guidelines; analyte suites; and regulatory databases (federal, state, local), including copies of regulations and decisions, like the Code of Federal Regulations. Regulatory alerts and regulatory issue tracking can also be helped with the use of a database.
Documentation
Documentation at the enterprise level includes the information needed to run projects and the organization itself, from both the technical and administrative perspective.
Technical information – Reference information in this category includes design information and formulas; materials engineering information; engineering guidelines; and other reference data.
Enterprise financial information – Just as it is important to track schedules and budgets for each project, it is important to track this information for the enterprise. This is especially true for finances, with databases important for tracking accounting information; requests for proposals and/or bids; purchase orders and AFEs; employee expenses; and so on.
Document resources – Sometimes a significant amount of time can be saved by maintaining a library of template documents that can be modified for use on specific projects. Examples include boilerplate letters and standardized reports.
QA data – This category includes manuals; standard operating procedures for office, facility, and field use; and other quality documents. Quality management information can be general or specific. General information can include Standard Operating Procedures and other enterprise quality documents. Specific information can be project-wide, such as a Quality Assurance Project Plan, or more specific, down to the level of an individual analysis. Specific quality control information is covered in more detail in Chapter 15.
News and information – It is important for staff members to have access to industry information such as industry news and events. Sometimes this is maintained in a database, but more commonly nowadays it is provided through links to external resources, especially Web pages managed by industry publications and organizations.
DOCUMENT MANAGEMENT
In all of the above categories, much of the data may be contained in existing documents, and often not in a structured database system. The biggest concern with this data is to provide a homogeneous view of very diverse data. The current physical format of this data ranges from digital files in a variety of formats (various generations of word processing, spreadsheet, and other files) through reports and oversized drawings. There are two primary issues that must be addressed in the system design: storage format and organization.
Storage options
There are many different options for storing document data. There may be a part of this data that should be stored using database management software, but it usually consists largely of text documents, and to a lesser degree, diagrams and drawings. The text documents can be handled in five ways, depending on their current physical format and their importance. These options are presented in decreasing order of digital accessibility: the first options can be easily accessed, searched, and edited, while these capabilities become more limited with the later options. Many document management systems use a combination of these approaches for different types of documents.
The first storage option, which applies to documents currently available in electronic form, is to keep them in digital form, either in their original format or in a portable document format such as Adobe Acrobat .pdf files. Digital storage provides the greatest degree of flexibility in working with the documents, and provides relatively compact data storage (a few thousand bytes per page).
The remaining four storage options apply to documents currently available only in hard copy form. The second and third options involve scanning the documents into digital form. In the second option, each document is scanned and then submitted to optical character recognition (OCR) to convert it to editable text. Current OCR software has an error rate of 1% or less, which sounds pretty good until you apply it to the 2000 or so characters on an average text page, which yields about 20 errors per page. Running the document through a spell-checker can catch many of these errors, but to produce a usable document it really should be manually edited as well. This can be very time-consuming if there are a large number of documents to be managed. Also, some people are good at this kind of work and some aren’t, so managing a conversion project of this type can be difficult. This option is most appropriate for documents that are available only in hard copy, but must be stored in editable form for updates, changes, and so on. It also provides compact storage (a few thousand bytes per page) if the scanned images are discarded after the conversion process.
The third storage option, the second involving scanning, omits all or most of the OCR step. Instead of converting the documents to text and saving the text, the scanned images of the documents are saved and made available to users. To make the documents searchable, the images are usually associated with an index of keywords in the document. These keywords can be provided manually by reviewers, or automatically using OCR and specialized indexing software. When users want to find a particular document, they can search the keyword index, locate the document, and then view it on the screen or print it. This option is most appropriate when the documents must be available online, but will be used for reference and not edited. It requires a large amount of storage. The theoretical storage size for an 8½ by 11 inch black and
white page at 300 dots per inch is about one megabyte, but with compression the actual size is usually a few hundred thousand bytes, which is about a hundred times larger than the corresponding text document.
The fourth storage option is to retain the documents in hard copy, and provide a keyword index similar to that in the third option. Creating the keyword index requires manually reviewing the documents and entering the keywords into a database. This may be less time-consuming than scanning and indexing. When users want a document, they can search the keyword index to identify it, and then go and physically find it to work with it. This option requires very little digital storage, but requires a large amount of physical storage for the hard copies, and is the least efficient for users to retrieve documents. It is the best choice for documents that must be available, but will be accessed rarely.
There is a final option for documents that do not justify any of the four previous options: throw them away. On a practical level this is the most attractive option for many documents, but for legal and other reasons it is difficult to implement. A formal document retention policy with time limits on document retention can be a great help in determining which documents to destroy and when.
For oversized documents, the options for storage are similar to those for page-sized documents, but the tools for managing them are more limited. For example, for a document in Word format, anyone in the organization can open it in Word, view it, and print it. For a drawing in AutoCAD format, the user must have AutoCAD, or at least a specialized CAD viewing program, in order to access it. To further complicate the issue, that drawing is probably a figure that should be associated with a text document. Part of the decision process for selecting document storage tools should include an evaluation of the ability of those tools to handle both regular and oversized documents, if both types are used in the organization.
Organization and display
Once the decision has been made about the storage format(s) for the documents, then the organization of the documents must be addressed. Organization of the documents covers how they are grouped, where they are stored, and how they are presented. One possibility for grouping the documents is to organize them by type of document. Another is to divide them into quality-related and quality-unrelated documents. Documents that are in a digital format should be stored on a server in a location visible to all users with a legitimate need to see them. They can be stored in individual files or in a document management system, depending on the presentation system selected.
The presentation of the data depends mostly on the software chosen. The options available for the type of data considered here include a structured database management system, a specialized document management system, the operating system file system, or a hypertext system such as the hypertext markup language (HTML) used for the Internet and the company intranet.
A database management system can be used for document storage if an add-in is used to help manage the images of scanned pages. This approach is best when very flexible searching of the keywords and other data is the primary consideration.
Specialized document management systems are programs designed to overcome many of the limitations of digital document storage. They usually have data capture interfaces designed for efficient scanning of large numbers of hard-copy documents. They may store the documents in a proprietary data structure or in a standard format such as .tif (tagged image file format). Some can handle oversized documents such as drawings and maps. They provide indexing capabilities to assist users in finding documents, although they vary in the way the indices work and in the amount of OCR that is done as part of the indexing. Some provide manual keyword entry only, while others will OCR some or all of the text and create or at least start the index for you. Document management systems work best when there is a large number of source documents with a similar
physical configuration, such as multi-page printed reports. They do less well when the source documents are very diverse (printed reports, hard copy and digital drawings and maps, word processor and spreadsheet digital documents, etc.), as in many environmental organizations. This software might be useful if the decision is made to scan and index a large volume of printed documents, in which case it might be used to assist with the data capture.
The file system built into the operating system provides a way of physically storing documents of all kinds. With modern operating systems, it provides at least some multi-user access to the data via network file sharing. This is the system currently used in most organizations for storing and retrieving digital files, and many may wish to continue with this approach for file storage. (For example, an intranet approach relies heavily on the operating system file storage capabilities.) Where this system breaks down is in retrieving documents from a large group, since the retrieval depends on traversing the directory and file structure and locating documents by a (hopefully meaningful) file name. In this case, a better system is needed for organizing and relating documents.
Hypertext systems have been growing in popularity as a powerful, flexible way of presenting large amounts of data where there may be great variation in the types of relationships between data elements. Some examples of hypertext systems include the HyperCard program for the Apple Macintosh and the Help system for Microsoft Windows. A more recent and widely known example of hypertext systems is the HTML system used by the World Wide Web (and its close relative, the intranet). All hypertext systems have as a common element the hyperlink, which is a pointer from one document into another, usually based on highlighted words or images. The principal advantage of this approach is that it facilitates easy movement of the user from one resource to another. The main disadvantage is the effort required to convert documents to HTML, and the time that it takes to set up an intuitive system of links. This approach is rapidly becoming the primary document storage system within many companies.
PART TWO - SYSTEM DESIGN AND IMPLEMENTATION
CHAPTER 5 GENERAL DESIGN ISSUES
The success of a data management task usually depends on the tool used for that task. As the old saying, often attributed to Abraham Maslow, goes, “When all you have is a hammer, everything looks like a nail.” This is as true in data management as in anything else. People who like to use a word processor or a spreadsheet program are likely to use the tool they are familiar with to manage their data. But just as a hammer is not the right tool to tighten a screw, a spreadsheet is not the right tool to manage a large and complicated database. A database management program should be used instead. This section discusses the design of the database management tool, and how the design can influence the success of the project.
DATABASE MANAGEMENT SOFTWARE
Database management programs fall into two categories, desktop and client-server. The use of the two different types and decisions about where the data will be located are discussed in the next section. This section will discuss the database applications themselves and briefly discuss the features and benefits of the programs. The major database software vendors put a large amount of effort into expanding and improving their products, so these descriptions are a snapshot in time. For an overview of desktop and Web-based database software, see Ross et al. (2001).
Older database systems were, for the most part, based on dBase, or at least the dBase file format. dBase started in the early days of DOS, and was originally released as dBase II because that sounded more mature than calling it 1.0. If anyone tells you that they have been doing databases since dBase 1, you know they are bluffing. dBase was an interpreted application, meaning that the code was translated into machine instructions each time it was run rather than compiled in advance, which was slow on those early computers. This created a market for dBase compilers, of which FoxPro was the most popular. Both used a similar data format in which each data table was called a database file or .dbf. Relationships were defined in code, rather than as part of the data model. Much data has been, and in some cases still is, managed in this format. These files were designed for single-user desktop use, although locking capabilities were added in later versions of the software to allow shared use.
Nowadays Microsoft Access dominates the desktop database market. This program provides a good combination of ease of use for beginners and power for experts. It is widely available, either as a stand-alone product or as part of the Office desktop suite. Additional information on Access can be found in books by Dunn (1994), Jennings (1995), and others, and especially in journals such as PC Magazine and Access/Visual Basic Advisor. Access has a feature that is common to almost all successful database programs, which is a programming language that allows users to
automate tasks, or even build complete programs using the database software. In the case of Access, there are actually two programming models: a macro language that is falling out of favor, and a programming language called Visual Basic for Applications (VBA), which is a fairly complete development environment.
Since Access is a desktop database, it has limitations relative to larger systems. Experience has shown that for practical use, the software starts to have performance problems when the largest table in a database reaches a half million to a million records. Access allows multiple users to share a database, and no programming is required to implement this, but a dozen or so concurrent users is an upper limit for heavy-usage database scenarios.
An alternative to Access is Paradox from Corel (www.corel.com). This is a programmable, relational database system, and is available as part of the Corel Office Suite. Paradox is a capable tool suitable for a complex database project, but the greater acceptance of Access makes Paradox an unlikely choice in the environmental business, where file sharing is common and Access is widespread.
The next step up from Access for many organizations is Microsoft SQL Server. This is a full-scale client-server system with robust security and a larger capacity than Access. It is moderately priced and relatively easy to use (for enterprise software), and increases the capacity to several million records. It is easy to attach an Access front end (user interface) to a SQL Server back end (data storage), so the transition to SQL Server is relatively easy when the data outgrows Access. This connection can be done using ODBC (Open DataBase Connectivity) or other connection methods.
For even larger installations, Oracle or IBM’s DB2 offer industrial-strength data storage, but with a price and learning curve to match. These can also be connected to the desktop using connection methods like ODBC, and one front-end application can be set up to talk to data in these databases, as well as to data in Access. Using this approach it is possible to create one user interface that can work with data in all of the different database systems.
A new category of database software that is beginning to appear is open-source software. Open-source software refers to programs where the source code for the software is freely available, and the software itself is often free as well. This type of software is popular for Internet applications, and includes such popular programs as the Linux operating system and the Apache Web server. Two open-source database programs are PostgreSQL and MySQL (Jepson, 2001). These programs are not yet as robust as commercial database systems, but are improving rapidly. They are available in commercial, supported versions as well as the free open-source versions, so they are starting to become options for enterprise database storage. And you can’t beat the price.
Another new category of database software is Web-based programs. These programs run in a browser rather than on the desktop, and are paid for with a monthly fee. Current versions of these programs are limited to a flat-file design, which makes them unsuitable for the complex, relational nature of most environmental data, but they might have application in some aspects of Web data delivery.
Examples of this type of software include QuickBase from the authors of the popular Quicken and QuickBooks financial software (www.quickbase.com), and Caspio Bridge (www.caspio.com).
DATABASE LOCATION OPTIONS
A key decision in designing a data management system is where the data will reside. Related to this are a variety of issues, including what hardware and software will provide the necessary functionality, who will be responsible for data entry and editing, and who will be responsible for backup of the database.
Stand-alone
The simplest design for a database location is stand-alone. In this design, the data and the software to manage it reside on the computer of one user. That computer may or may not be on a network, but all of the functionality of the database system is local to that machine. The hardware and software requirements of a system like this are modest, requiring only one computer and one license for the database management software. The software does not need to provide access for more than one user at a time. One person is in control of the computer, software, and data. For small projects, especially one-person projects, this type of design is often adequate. For larger projects where many people need access to the data, the single individual keeping the data can become a bottleneck. This is particularly true when the retrievals required are large or complicated. The person responsible for the data can end up spending most or all of his or her time responding to data requests from users. When the data management requirements grow beyond that point, the stand-alone system no longer meets the needs of the project team, and a better design is required.
Shared file
Generally the next step beyond a stand-alone system is a shared file system. In a shared file system, the server (or any computer) stores the database on its hard drive like any other file. Clients access the file using database software on their computers the same way they would open any other file on the server. The operating system on the server makes the file available, and the database software on the client computer is responsible for handling access to the database by multiple users. An example of this design would be a system in which multiple users have Microsoft Access on their computers, and the database file, which has an extension of .mdb, resides on a server, which could be running Windows 95/98/ME or NT/2000/XP. When one or more users are working in the database file, their copies of Access maintain a second file on the server called a lock file. This file, which has an extension of .ldb, keeps track of which users are using the database and what objects in the database may be locked at any particular time. This design works well for a modest number of users in the database at once, providing adequate performance for a dozen or so users at any given time, and for databases up to a few hundred thousand records.
Client-server
When the load on the database increases to the point where a shared file system no longer provides adequate performance, the next step is a client-server system. In this design, a data manager program runs on the server, providing access to the data through a system process. One computer is designated the server, and it holds the data management software and the data itself. This system may also be used as the workstation of an individual user, but in high-volume situations this is not recommended. More commonly, the server computer is placed on a network with the computers of the users, which are referred to as clients. The software on the server works with software on the client computers to provide access to the data.
The following diagram covers the internal workings of a client-server EDMS. It contains two parts, the Access component at the top and the SQL Server part at the bottom. In discussing the EDMS, the Access component is sometimes called the user interface, since that is the part of the system that users see, but in fact both Access and SQL Server have user interfaces. The Access user interface has been customized to make it easy for the EDMS users to manage the data in ways useful to them. The SQL Server user interface is used without modifications (as provided by Microsoft) for data administration tasks. Between these user interfaces are a number of pieces that work together to provide the data management capabilities of the system.
Figure 17 - Client-server EDMS data flow diagram. (The diagram shows the Access user interface on the client side, with components for lookup table maintenance, electronic import, manual entry, table view, data review, record counts, maps, graphs, formatted reports, subset creation, and file export, connected through selection screens, Access queries and modules, and Access attachments; a security system granting read/write or read-only access to the server tables; and, on the server side, the SQL Server/Oracle user interface with volume maintenance, subset database creation, and backup/restore.)
Discussion of this diagram will start at the bottom and work toward the top, since this order starts with the least complicated parts (from the user’s perspective, not the complexity of the software code) and moves to the more complicated parts. That means starting on the SQL Server side and working toward the client side. This sequence provides the most orderly view of the system. In this diagram, the part with a gray background represents SQL Server, and the rest of the box is Access.
The basic foundation of the server side is the SQL Server volume, which is actually a file on the server hard drive that contains the data. The size of this volume is set when the database is set up, and can be changed by the administrator as necessary. Unlike many computer files, it will not grow automatically as data is added. Someone needs to monitor the volume size and the amount of data and keep them in synch. The software works this way because the location and structure of the file on the hard drive is carefully managed by the SQL Server software to provide maximum performance (speed of query execution).
The database tables are within the SQL Server volume. These tables are similar in function and appearance to the tables in Access. They contain all of the data in the system, usually in normalized database form. The data in the tables is manipulated through SQL queries passed to the SQL Server software via the ODBC link from the clients. Also stored in the SQL Server volume can be triggers and stored procedures to increase performance in the client-server system and to enforce referential integrity. If they wish, users can see the tables in the SQL Server volume through the database window in Access, but their ability to manipulate the data depends on the privileges that they have through the security system. A System Administrator should back up data in the SQL Server tables on a regular basis (at least daily).
The interface between the EDMS and the SQL Server tables is through a security system that is partly implemented in SQL Server and partly in Access. Most users should have read-only permission for working with the data; that is, they will be able to view data but not change it. A small group of users, called data administrators, should be able to modify data, which will include importing and entering data, changing data, and deleting data.
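In SQL Server, the two permission levels described above might be granted along these lines (a sketch only; the role and table names are hypothetical, not part of any particular system):

    -- Read-only users see the data but cannot change it.
    GRANT SELECT ON Analyses TO Readers;

    -- Data administrators can import, enter, change, and delete data.
    GRANT SELECT, INSERT, UPDATE, DELETE ON Analyses TO DataAdministrators;

Similar statements would be issued for each table, or the permissions would be assigned at the database level.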
The actual connection between Access and SQL Server is made through attachments, which allow Access to see data located somewhere other than the current .mdb file as if it were in the current database. This is the usual way of connecting to SQL Server, and it also provides the flexibility of attaching to other data sources. Once the attachments are in place, the client interacts with the database through Access queries, either alone or in combination with modules, which are programs written in VBA. Various queries provide access to different subsets of the data for different purposes. Modules are used when it is necessary to work with the data procedurally, that is, line by line, instead of treating the whole query as a set.
Distributed
The database designs described above were geared toward an organization where the data management is centralized, that is, where the data management activities are performed in one central location, usually on a local area network (LAN). With environmental data management, this is not always the case. Often the data for a particular facility must be available both at the facility and at the central office. The situation becomes even more complicated when the central office must manage and share data with multiple remote sites. This requires that some or all of the data be made available at multiple locations. The following sections describe three ways to do this: wide-area networks, distributed databases with replication, and remote communication with subsets. The factors that determine which solution is best for an organization include the amount of data to be managed, how fresh the data needs to be at the remote locations, and whether full-time communication between the facilities is available, and at what speed.
Wide-area networks – In situations where a full-time, high-speed communication link is or can be made available (at a cost which is reasonable for the project), a wide-area network (WAN) is often the best choice. From the point of view of the people using it, the WAN hardware and software connect the computers just as with a LAN. The difference is that instead of all of the computers being connected directly through a local Ethernet or Token Ring network, some are connected through long-distance data lines of some sort. Often there are LANs at the different locations connected together over the WAN. The connection between the LANs is usually done with routers on either end of the long-distance line. The router looks at the data traffic passing over the network, and data packets which have a destination on the other LAN are routed across the WAN to that LAN, where they continue on to their destination.
There are several options for the long-distance lines forming the WAN between the LANs. This discussion covers some popular existing and emerging technologies; this is a rapidly changing industry, and new technologies are appearing regularly. At the high end of connectivity options are dedicated digital lines such as T1 (or, in cases of very high data volume, T3) and frame relay services. These services are connected full-time, and provide moderate to high speeds ranging from 56 kilobits per second (kbps) to 1 megabit per second (mbps) or more. They can cost $1000 per month or more. This is proven technology, and is available, at a cost, to nearly any facility. More recently, other digital services have become available in some areas. Integrated Services Digital Network (ISDN) provides 128 kbps for around $100 per month. Digital Subscriber Line (DSL) provides connectivity ranging in speed from 256 kbps to 1.5 mbps or more. Prices can be as low as $40 per month, so this service can be a real bargain, but service is limited to a fairly short distance from the telephone company central office, so it’s not available in many locations. Cable modems promise to give DSL a run for its money. Cable service is not widely available right now, especially for business locations, and when it is, the focus will likely remain more residential than business, since that is where cable is currently connected.
Another option is standard telephone lines and analog modems, sometimes called POTS (plain old telephone service). This provides 56 kbps, or more with modem pooling, and the connection is made on demand. The cost is relatively low ($15 to $30 per month) and the service is available everywhere. In order to have WAN-level connectivity, you should have a full-time connection of about 1 mbps or faster. If the available connection speed is less than this, another approach should be used.
Distributed databases with replication – There are several situations where a client-server connection over a WAN is not the best solution. One is where the connection speed is too low for real-time access. The second is where the data volume is extremely high, and local copies make more sense. In these situations, distributed databases can make sense. In this design, copies of the database are placed on local servers at each facility, and users work with the local copies of the data. This is an efficient use of computer resources, but raises the very important issue of currency of the data. When data is entered or changed in one copy, the other copy is no longer the most current, and, at some point, the changes must be transferred between the copies. Most high-end database programs, and now some low-end ones, can do this automatically at specified intervals. This is called replication. Generally, the database manager software is smart enough to move only the changed data (sometimes called “dirty” records) rather than the whole database. Problems can occur when users make simultaneous changes to the same records at different locations, so this approach rapidly becomes complicated.
Remote communication with subsets – Often it is valuable for users to be able to work with part of the database remotely. This is particularly useful when the communication line is slow. In this scenario, users call in with their remote computers and attach to the main database, and either work with it that way or download subsets for local use. In some software this is as easy as selecting data using the standard selection screen, then instructing the EDMS to create a subset. This subset can be created on the user’s computer; the user can then hang up, attach to the local subset, and use the EDMS in the usual way, working with the subset. This works for retrieving data, but not as well for data entry or editing, unless a way is provided for data entered into the subset to be uploaded to the main database.
Internet/intranet

Tools are now available to provide access to data stored in relational database managers using a Web browser interface. At the time of this writing, in the opinion of the author, these tools are not yet ready for use as the primary interface in a sophisticated EDMS package. Specifically, an intuitive user interface with real-time feedback is too costly or impossible to build with current Web development tools. Vendors are working on implementing that capability, but the technology is not currently ready for prime time, at least for everyday users, and the current technology of choice is client-server.

It is now feasible, however, to provide a Web interface with limited functionality for specific applications. For example, it is not difficult to provide a public page with summaries of environmental data for plants. The more "canned" the retrieval is, the easier it is to implement in a browser interface, although allowing some user selection is not difficult. In the near future, tools like Dynamic HTML, Active Server Pages, and better Java applets, combined with universal high-speed connections, will make it much easier to provide an interactive user interface hosted in a Web browser. At that time, EDMS vendors will certainly provide this type of interface to their clients' databases.

The following figure shows a view of three different spectra provided by the Internet and related technologies. There are probably other ways of looking at it, but this view provides a framework for discussing products and services and their presentation to users. In these days of multi-tiered applications, this diagram is somewhat of an over-simplification, but it serves the purpose here.
[Figure: three parallel spectra running from Local to Global — Applications: Stand-Alone, Shared Files, Client-Server, Web-Enabled, Web-Based; Data: Proprietary, Commercial, Public Domain; Users: Desktops, Laptops, PDAs, etc., Public Portals.]

Figure 18 - The Internet spectrum
The overall range of the diagram in Figure 18 is from Local on the left to Global on the right. This range is divided into three spectra for this discussion. The three spectra, which are separate but not unrelated, are applications, data, and users.

Applications – Desktop computer usage started with stand-alone applications. A program and the data that it used were installed on one computer, which was not attached to others. With the advent of local area networks (LANs), and in some organizations wide-area networks (WANs), it became possible to share files, with the application running on the local desktop and the data residing on a file server. As software evolved and data volumes grew, software was installed on both the local machine (client) and the server, with the user interface operating locally, and data storage, retrieval, and sometimes business logic operating on the server. With the advent of the Internet and the World Wide Web, sharing on a much broader scale is possible. The application can either reside on the client computer and communicate with the Web, or it can run on a Web server. The first type of application can be called Web-enabled. An example of this is an email program that resides locally, but talks through the Web. Another example would be a virus-scanning program that operates locally but goes to the Web to update its virus signature files. The second type of application can be called Web-based. An example of this would be a browser-based stock trading application. Many commercial applications still operate in the range between stand-alone and client-server. There is now a movement of commercial software to the right in this spectrum, to Web-enabled or Web-based applications, probably starting with Web-enabling current programs, and then perhaps evolving to a thin-client, browser-based model over time. This migration can be done with various parts of the applications at different rates depending on the costs and benefits at each stage. New technologies like Microsoft's .NET initiative are helping accelerate this trend.

Data – Most environmental database users currently work mostly with data that they generate themselves. Their base map data is usually based on CAD drawings that they create, and the rest of their project data comes from field measurements and laboratory analyses, which the data manager (or their client) pays for and owns. This puts them to the left of the spectrum in the above figure. Many vendors now offer both base map and other data, either on the Web or on CD-ROM, which might be of value to many users. Likewise, government agencies are making more and more data, spatial and non-spatial, available, often for free. As vendors evolve their software and Web presence, they can work toward integrating this data into their offerings. For example, software could be used to load a USGS or Census Bureau base map, and then display sites of environmental concern obtained from the EPA. Several software companies provide tools to make it possible to
serve up this type of data from a modified Web server. Revenue can be obtained from purchase or rental of the application, as well as from access to the data.

Users – The World Wide Web has opened up a whole new world of options for computing platforms. These range from the traditional desktop computers and laptops, through personal digital assistants (PDAs), which may be connected via wireless modem, to Web portals and other public access devices. Desktops and laptops can run present and future software, and, as most are becoming connected to the Internet, will be able to support any of the computing models discussed above. PDAs and other portable devices promise to provide a high level of portability and connectivity, which may require re-thinking data delivery and display. Already there are companies that integrate global positioning systems (GPS) with PDAs and map data to show you where you are. Other possible applications include field data gathering and delivery, and a number of organizations provide this capability. Web portals include public Internet access (such as in libraries and coffee shops) as well as other Internet-enabled devices like public phones. This brings up the possibility that applications (and data) may run on a device not owned or controlled by the client, and argues for a thin-client approach.

This is all food for thought as we try to envision the evolution of environmental software products and services (see Chapter 27). What is clear is that the options for delivery of applications and data have broadened significantly, and must be considered in planning for future needs.
Multi-tiered

The evolution of the Internet and distributed computing has led to a new deployment model called "multi-tiered." The three most common tiers are the presentation level, the business logic level, and the data storage level. Each level might run on a different computer. For example, the presentation level displayed to the user might run on a client computer, using either client-server software or a Web browser. The business logic level might enforce the data integrity and other rules of the database, and could reside on a server or Web server computer. Finally, the data itself could reside on a database server computer. Separating the tiers can provide benefits for both the design and operation of the system.
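To make the separation concrete, here is a minimal sketch in Python with each tier reduced to a function. All table, field, and function names are hypothetical illustrations, not part of any particular EDMS; in a real deployment each tier could run on a different computer.

```python
import sqlite3

# Data storage tier: knows only how to fetch rows.
def fetch_station_results(conn, station_id):
    return conn.execute(
        "SELECT ParameterCode, Value FROM Analyses "
        "WHERE SampleNumber IN "
        "(SELECT SampleNumber FROM Samples WHERE StationNumber = ?)",
        (station_id,)).fetchall()

# Business logic tier: enforces rules of the database, here by
# dropping results that have no reported value.
def reportable_results(conn, station_id):
    return [(code, value)
            for code, value in fetch_station_results(conn, station_id)
            if value is not None]

# Presentation tier: formats for display; in practice this would be
# a client-server form or a Web page rather than printed text.
def show_results(conn, station_id):
    for code, value in reportable_results(conn, station_id):
        print(f"{code}: {value}")
```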
DISTRIBUTED VS. CENTRALIZED DATABASES

An important decision in implementing a data management system for an organization performing environmental projects for multiple sites is whether the databases should be distributed or centralized. This is particularly true when the requirements for various uses of the data are taken into consideration. This issue will be discussed here from two perspectives: first that of the data, and second that of the organization.

From the perspective of the data and the applications, the options of distributed vs. centralized databases are illustrated in Figures 19 and 20. Clearly it is easier for an application to be connected to a centralized, open database than to a diverse assortment of data sources. The downside is the effort required to set up and maintain a centralized data repository.
[Figure: applications and activities such as mapping, project management, statistics, reporting, validation, graphing, planning, and Web access attempting, with uncertainty, to connect to diverse data sources — GIS coverages, lab deliverables, spreadsheets, CAD files, legacy systems, ASCII files, word processing files, chain of custody forms, field notebooks, hard copy files, and regulatory reports.]

Figure 19 - Connection to diverse, distributed data sources
[Figure: the same applications — mapping, validation, reporting, statistics, graphing, project management, Web access, and planning — all connecting directly to a single centralized open database.]

Figure 20 - Connection to a centralized open database
[Figure: on the left, "The Dilemma" — a client's sites are each managed by different consultants and consultant offices (Consultant 1, 2A, 2B, 3A, 3B, 4), each using a different tool: Access, spreadsheets, do-it-yourself databases, and various EDMS packages. On the right, "The Solution" — the sites, consultants, labs, and the Web all feed a single centralized database.]

Figure 21 - Distributed vs. centralized databases
The choice of distributed vs. centralized databases can also be viewed from the perspective of the organization. This is illustrated in Figure 21. The left side of the diagram shows the way the data for environmental projects has traditionally been managed. The client, such as an industrial company, owns several sites with environmental issues. One or more consultants, labeled C1, C2, etc., manage each site, and each consultant may manage the project from various offices, such as C2A, C2B, etc. Each consultant office might use a different tool to manage the data. For example, for Site 1, consultant C1 may use an Excel spreadsheet. Consultant C2, working on a different part of the project, or on the same issues at a different time, may use a home-built database. Other consultants working on different sites use a wide variety of different tools. If people in the client organization, either in the head office or at one of the sites, want some data from one monitoring event, it is very difficult for them to know where to look.

Contrast this with the right side of the diagram. In this scenario, all of the client's data is managed in a centralized, open database. The data may be managed by the client, or by a consultant, but the data can be accessed by anyone given permission to do so. There are huge savings in efficiency, because everyone knows where the data is and how to get at it. The difficult challenge is getting the data into the centralized database before the benefits can be realized.
Figure 22 - Example of a simplified logical data model
THE DATA MODEL

The data model for a data management system is the structure of the tables and fields that contain the data. Creating a robust data model is one of the most important steps in building a successful data management system (Walls, 1999). If you are building a data management system from scratch, you need to get this part right first, as best you can, before you proceed with the user interface design and system construction.

Many software designers work with data models at two levels. The logical data model (Figure 22) describes, at a conceptual level, the data content for the system. The lines between the boxes represent the relationships in the model. The physical data model (Figure 23) describes in detail exactly how the data will be stored, with names, data types, and sizes for all of the fields in each table, along with the relationships (key fields which join the tables) between the tables.

The overall scope of the logical data model should be identified as early in the design process as possible. This is particularly true when the project is to be implemented in stages. This allows identification of the interactions between the different parts of the system so that dependencies can be planned for as part of the detailed design for each subset of the data as that subset is implemented. Then the physical data model for the subset can be designed along with the user interface for that subset.

The following sections describe the data structure and content for a relational EDMS. This structure and content is based on a commercial system developed and marketed by Geotech Computer Systems, Inc. called Enviro Data. Because this is a working system that has managed hundreds of databases and millions of records of site environmental investigation and monitoring data, it seems like a good starting point for discussing the issues related to a data model for storing this type of data.
Figure 23 - Table and field display from a physical data model
Data structure

The structure of a relational EDMS, or of any database for that matter, should, as closely as possible, reflect the physical realities of the data being stored. For environmental site data, samples are taken at specific locations, at certain times, depths, and/or heights, and then analyzed for certain physical and chemical parameters. This section describes the tables and relationships used to model this usage pattern. The section after this describes in some detail the data elements and exactly how they are used so that the data accurately reflects what happened.

Tables – The data model for storing site environmental data consists of three types of tables: primary tables, lookup tables, and utility tables. The primary tables contain the data of interest. The lookup tables contain codes and their expanded values that are used in the primary tables to save space and encourage consistency. Sometimes the lookups contain other useful information for the data elements that are represented by the coded values. The utility tables provide a place to store various data items, often related to the operation and maintenance of the system. Often these tables are not related directly to the primary tables.

For the most part, the primary data being stored in the EDMS has a series of one-to-many (also known as parent-child or hierarchical) relationships. It is particularly fortunate that these relationships are one-to-many rather than many-to-many, since one-to-many relationships are handled well by the relational data model, and many-to-many are not. (Many-to-many relationships can be handled in the relational data model. They require adding another table to track the links between the two tables. This table is sometimes called a join table. We don't have to worry about that here.)
The primary tables in this system are called Sites, Stations, Samples, and Analyses. The detailed content of these tables is described below. Sites contains information about each facility being managed in the system. Stations stores data for each location where samples are taken, such as monitoring wells and soil borings. (Note that what is called a station in this discussion is called a site in some system designs.) Samples represents each physical sample or monitoring event at specific stations, and Analyses contains specific observed values or analytical results from the samples.

Relationships – The hierarchical relationships between the tables are obvious. Each site can have one or more stations, each station has one or more samples, and each sample is analyzed for one or more, often many, constituents. But each sulfate measurement corresponds to one specific sampling event for one specific location for one specific site. The lookup relationships are one-to-many also, with the "one" side being the lookup table and the "many" side being the primary table. For example, there is one entry in the StationTypes table for monitoring wells, with a code of "mw," but there can be (and usually are) many monitoring wells in the Stations table.
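To make the hierarchy concrete, here is a minimal physical data model expressing these one-to-many relationships, executed through Python's standard sqlite3 module. The table names follow the discussion above, but the specific fields are simplified assumptions for illustration, not the actual Enviro Data schema.

```python
import sqlite3

conn = sqlite3.connect("edms.db")
conn.executescript("""
-- Lookup table: one row per code, e.g., 'mw' for monitoring wells.
CREATE TABLE StationTypes (
    StationTypeCode TEXT PRIMARY KEY,
    Description     TEXT);

-- Primary tables, one-to-many from top to bottom.
CREATE TABLE Sites (
    SiteNumber INTEGER PRIMARY KEY,
    SiteName   TEXT);

CREATE TABLE Stations (
    StationNumber   INTEGER PRIMARY KEY,
    SiteNumber      INTEGER REFERENCES Sites,      -- many stations per site
    StationTypeCode TEXT REFERENCES StationTypes,  -- lookup code
    StationName     TEXT);

CREATE TABLE Samples (
    SampleNumber  INTEGER PRIMARY KEY,
    StationNumber INTEGER REFERENCES Stations,     -- many samples per station
    SampleDate    TEXT,
    SampleTop     REAL);

CREATE TABLE Analyses (
    AnalysisNumber INTEGER PRIMARY KEY,
    SampleNumber   INTEGER REFERENCES Samples,     -- many analyses per sample
    ParameterCode  TEXT,
    Value          REAL);
""")
```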
Data content

This section will discuss briefly the data content of the example EDMS. This material will be covered in greater detail in Appendix B.

Sites – A Site is a facility or project that will be treated as a unit. Some projects may be treated as more than one site, and sometimes a site can be more than one facility, but the use of the site terminology should be consistent within the database, or at least for each project. Some people refer to a sampling location as a site, but in this discussion we will call that a station.

Stations – A Station is a location of observation. Examples of stations include soil borings, monitoring wells, surface water monitoring stations, soil and stream sediment sample locations, air monitoring stations, and weather stations. A station can be a location that is persistent, such as a monitoring well which is sampled regularly, or can be the location of a single sampling event. For stations that are sampled at different elevations (such as a soil boring), the location of the station is the surface location for the boring, and the elevation or depth component is part of the sampling event.

Samples – A Sample is a unique sampling event or observation for a station. Each station can be sampled at various depths (such as with a soil boring), at various dates and times (such as with a monitoring well), or, less commonly, both. Observations, which may or may not accompany a physical sample, can be taken at a station at a particular time, and in this model would be considered part of a sample event.

Analyses – An Analysis is the observed value of a parameter related to a sample. This term is intended to be interpreted broadly, and not to be limited to chemical analyses. For example, field parameters such as pH, temperature, and turbidity also are considered analyses. This would also include operating parameters of environmental concern such as flow, volume, and so on.

Lookups – A lookup table is a table that contains codes that are used in the main data tables, and the expanded values of those codes that are used for selection and display.

Utilities – The system may contain tables for tracking internal information not directly related to the primary tables. These utility tables are important to the software developers and maybe the system and data administrators, but can usually be ignored by the users.
DATA ACCESS REQUIREMENTS

The user interface provides a number of data manipulation functions, some of which are read/write and the rest of which are read-only.
Read-write

The functions that require read/write access to the database are:

Electronic import – This function allows data administrators to import analytical and other data. Initially the data formats supported will be the three formats defined in the Data Transfer Standard. Other import formats may be added as needed. This is shown in Figure 17 as a single-headed arrow going into the database, but in reality there is also a small flow of data the other way as the module checks for valid data.

Manual entry – The hope is that the majority of the data that will be put in the system will be in digital format that can be imported without retyping. However, there will probably be some data that will need to be manually entered and edited, and this function will allow data administrators to make those entries and changes.

Editing – Data administrators will sometimes need to change data in the database. Such changes must be done with great care and be fully documented.

Lookup table maintenance – One of the purposes of the lookup tables is to standardize the entries to a limited number of choices, but there will certainly be a need for those tables to evolve over time. This feature allows the data administrators to edit those tables. A procedure will be developed for reviewing and approving those changes before entry.

Verification and validation – Either as part of the import process or separately, data validators will need to add or change validation flags based on their work.

Data review – Data review should accompany data import and entry, and can be done independently as well. This function allows data administrators to look at data and modify its data review flag as appropriate, such as after validation.
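The "small flow of data the other way" during import — checking incoming values against the database before they are accepted — can be sketched as follows. This is an illustrative fragment with hypothetical table and field names, not the import module of any particular EDMS.

```python
def import_stations(conn, rows):
    """Insert incoming station rows, rejecting any whose station type
    code is not in the StationTypes lookup table. A real import would
    check many more codes and fields, and log its activity."""
    valid_codes = {code for (code,) in
                   conn.execute("SELECT StationTypeCode FROM StationTypes")}
    accepted = [r for r in rows if r["StationTypeCode"] in valid_codes]
    rejected = [r for r in rows if r["StationTypeCode"] not in valid_codes]
    conn.executemany(
        "INSERT INTO Stations "
        "(StationNumber, SiteNumber, StationTypeCode, StationName) "
        "VALUES (:StationNumber, :SiteNumber, :StationTypeCode, :StationName)",
        accepted)
    conn.commit()
    return rejected  # handed back to the data administrator for review
```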
Read-only

The functions that require read-only access to the database are:

Record counts – This function is a useful guide in making selections. It should provide the number of selected items whenever a selection criterion is changed.

Table view – This generalized display capability allows users to view the data that they have selected. This might be all of the output they need, or they might wish to proceed to another output option, once they have confirmed that they have selected correctly. They can also use this screen to copy the data to the clipboard or save it to a file for use in another application.

Formatted reports – Reports suitable for printing can be generated from the selection screen. Different reports could be displayed depending on the data element selected.

Maps – The results of the selection can be displayed on a map, perhaps with the value of a constituent for each station drawn next to that station and a colored dot representing the value. See Chapter 22 for more information on mapping.

Graphs – The most basic implementation of this feature allows users to draw a graph of constituent values as a function of time for the selected data. They should be able to graph multiple constituents for one station or one constituent for several stations. More advanced graphing is also possible as described in Chapter 20.

Subset creation – Users should be able to select a subset of the main database and export it to an Access database. This might be useful for providing the data to others, or for working with the subset when a network connection to the database is unavailable or slow.

File export – This function allows users to export data in a format suitable for use in other software needing data from the EDMS. Formats need to be provided for the data needs of the other software. Direct connection without export-import is also possible.
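Mechanically, subset creation amounts to copying selected rows into a new, empty database file. The example system described here exports to an Access database; the following sketch shows the same idea with Python's sqlite3 module and hypothetical names, attaching a new file and copying one site's stations into it.

```python
import sqlite3

def create_subset(main_db, subset_db, site_number):
    """Copy one site's stations from the main database into a new
    local file so a remote user can work with it offline."""
    conn = sqlite3.connect(main_db)
    conn.execute("ATTACH DATABASE ? AS subset", (subset_db,))
    # Create an empty copy of the table structure, then fill it
    # with just the selected site's records.
    conn.execute("CREATE TABLE subset.Stations AS "
                 "SELECT * FROM main.Stations WHERE 0")
    conn.execute("INSERT INTO subset.Stations "
                 "SELECT * FROM main.Stations WHERE SiteNumber = ?",
                 (site_number,))
    conn.commit()
    conn.close()
```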
GOVERNMENT EDMS SYSTEMS

A number of government agencies have developed systems for managing site environmental data. This section describes some of the systems that are most widely used.

STORET (www.epa.gov/storet) – STORET (short for STOrage and RETrieval) is EPA's repository for water quality, biological, and physical data. It is used by EPA and other federal agencies, state environmental agencies, universities, private citizens, and others. It is one of EPA's two data management systems containing water quality information for the nation's waters. The other system, the Legacy Data Center, or LDC, contains historical water quality data dating back to the early part of the 20th century and collected up to the end of 1998. It is being phased out in favor of STORET. STORET contains data collected beginning in 1999, along with older data that has been properly documented and migrated from the LDC. Both the LDC and STORET contain raw biological, chemical, and physical data for surface and groundwater collected by federal, state, and local agencies, Indian tribes, volunteer groups, academics, and others. All 50 states, territories, and jurisdictions of the U.S., along with portions of Canada and Mexico, are represented in these systems. Each sampling result is accompanied by information on where the sample was taken, when the sample was gathered, the medium sampled, the name of the organization that sponsored the monitoring, why the data was gathered, and much other information. The LDC and STORET are Web-enabled, so users can browse both systems interactively or create files to be downloaded to their computer for further use.

CERCLIS (www.epa.gov/superfund/sites/cursites) – CERCLIS is a database that contains the official inventory of Superfund hazardous waste sites. It contains information on hazardous waste sites, site inspections, preliminary assessments, and remediation of hazardous waste sites. The EPA provides online access to CERCLIS data. Additionally, standard CERCLIS site reports can be downloaded to a personal computer. CERCLIS is a database and not an EDMS, but can be of value in EDMS projects.

IRIS (www.epa.gov/iriswebp/iris/index.html) – The Integrated Risk Information System, prepared and maintained by the EPA, is an electronic database containing information on human health effects that may result from exposure to various chemicals in the environment. The IRIS system is primarily a collection of computer files covering individual chemicals. These chemical files contain descriptive and quantitative information on oral reference doses and inhalation reference concentrations for chronic non-carcinogenic health effects, and hazard identification, oral slope factors, and oral and inhalation unit risks for carcinogenic effects. It is a database and not an EDMS, but can be of value in EDMS projects.

ERPIMS (www.afcee.brooks.af.mil/ms/msc_irp.htm) – The Environmental Resources Program Information Management System (ERPIMS, formerly IRPIMS) is the U.S. Air Force system for validation and management of data from environmental projects at all Air Force bases. The project is managed by the Air Force Center for Environmental Excellence (AFCEE) at Brooks Air Force Base in Texas. ERPIMS contains analytical chemistry samples, tests, and results as well as hydrogeological information, site/location descriptions, and monitoring well characteristics.
AFCEE maintains ERPTools/PC, a Windows-based software package developed to help Air Force contractors with the collection, entry, validation, and quality control of their data. Many ERPIMS data fields are filled by codes that have been assigned by AFCEE. These codes are compiled into lists, and each list is the set of legal values for a certain field in the database. Air Force contractors use ERPTools/PC to prepare the data, including comparing data to these lists, and then submit it to the main ERPIMS database at Brooks.

IRDMIS (aec.army.mil/prod/usaec/rmd/im/imass.htm) – The Installation Restoration Data Management Information System (IRDMIS) supports the technical and managerial requirements of the Army's Installation Restoration Program (IRP) and other environmental efforts of the U.S. Army Environmental Center (USAEC, formerly the U.S. Toxic and Hazardous Materials Agency). (Don't confuse this AEC with the Atomic Energy Commission, which is now the Department of
Energy.) Since 1975, more than 15 million data records have been collected and stored in IRDMIS, with information from over 100 Army installations. IRDMIS users can enter, validate, store, and retrieve the Army's geographic; geological and hydrological; sampling; chemical; and physical analysis information. The system covers all aspects of the data life cycle, including complete data entry and validation software using USAEC and CLP QA/QC methods; a Web site for data submission and distribution; and an Oracle RDBMS with a menu-driven user interface for standardized reports, geographical plots, and plume modeling. It provides a fully integrated information network of data status and disposition for USAEC project officers, chemists, geologists, contracted laboratories, and other parties, and supports Geographical Information Systems and other third-party software.

USGS Water Resources (http://water.usgs.gov/nwis) – This is a set of Web pages that provides access to water resources data collected at about 1.5 million locations in all 50 states, the District of Columbia, and Puerto Rico. The U.S. Geological Survey investigates the occurrence, quantity, quality, distribution, and movement of surface and groundwater, and provides the data to the public. Online access to data on this site includes real-time data for selected surface water, groundwater, and water quality sites; descriptive site information for all sites with links to all available water data for individual sites; water flow and levels in streams, lakes, and springs; water levels in wells; and chemical and physical data for streams, lakes, springs, and wells. Site visitors can easily select data and retrieve it for on-screen display or save it to a file for further processing.
OTHER ISSUES

Creating and maintaining an environmental database is a serious undertaking. In addition to the activities directly related to maintaining the data itself, there are a number of issues related to the database system that must be considered.
Scalability

Databases grow with time. You should make sure that the tool you select for managing your environmental data can grow with your needs. If you store your data in a spreadsheet program, when the number of lines of data exceeds the capacity of the spreadsheet, you will need to start another file, and then you can't easily work with all of your data. If you store your data in a stand-alone database manager program like Access, when your data grows you can relatively easily migrate to a more powerful database manager like SQL Server or Oracle. The ability of software and hardware to handle tasks of different sizes is called scalability, and this requirement should be part of your planning if there is any chance your project will grow over time.
Security

The cost of building a large environmental database can be hundreds of thousands of dollars or more. Protect this investment from loss. Ensure that only authorized individuals can get access to the database. Make adequate backups frequently. Be sure that the people who are working with the database are adequately trained so that they do a good job of getting clean data into the database, and that the data stays there and stays clean. Instill an attitude of protecting the database and keeping its quality up so that people can feel comfortable using it.
Access and permissions

Most database manager programs provide a system for limiting who can use a database, and what actions they can perform. Some have more than one way of doing this. Be sure to set up and
use an access control system that fits the needs of your organization. This may not be easy. You will have to walk a thin line between protecting your data and letting people do what they need to do. Sometimes it’s better to start off more restrictive than you think you need to, and then grant more permissions over time, than to be lenient and then need to tighten up, since people react better to getting more power rather than less. Also be aware that security and access limitations are easier to implement and manage in a client-server system than in a stand-alone system, so if you want high security, choose SQL Server or Oracle over Access for the back-end.
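In a client-server back-end, access control typically amounts to defining roles and granting each one only the rights it needs. The statements below are a hedged sketch in standard SQL, wrapped in Python for consistency with the other examples; the role and table names are hypothetical, and the exact syntax for creating roles and granting rights varies between SQL Server, Oracle, and other products.

```python
# Hypothetical role-based permissions, expressed as standard SQL:
# a read-only role for general users and a broader role for data
# administrators. Exact GRANT syntax varies by database product.
PERMISSION_STATEMENTS = [
    "GRANT SELECT ON Analyses TO data_viewer",
    "GRANT SELECT ON Samples  TO data_viewer",
    "GRANT SELECT, INSERT, UPDATE ON Analyses TO data_admin",
    "GRANT SELECT, INSERT, UPDATE ON Samples  TO data_admin",
]

def apply_permissions(conn):
    """Run the permission statements on a server connection
    (e.g., one opened with pyodbc against SQL Server)."""
    cursor = conn.cursor()
    for statement in PERMISSION_STATEMENTS:
        cursor.execute(statement)
    conn.commit()
```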
Activity tracking

To guarantee the quality of the data in the database, it is important to track what changes are made to the data, when they are made, who made them, and why they were made. A simple activity tracking system would include an ActivityLog table in the database to allow data administrators to track data modifications. On exit from any of the data modification screens, including importing, editing, or reviewing, an activity log screen will appear. The program reports the name of the data administrator and the activity date. The data administrator must enter a description of the activity, and the name of the site that was modified. The screen should not close until an entry has been made. Figure 24 shows an example of a screen for this type of simple system.

The system should also provide a way to select and display the activity log. Figure 25 shows an example of a selection screen and report of activity data. In this example, the log can be filtered on Administrator name, Activity Date, or Site. If no filters are entered, the entire log is displayed.

Another option is a more elaborate system that keeps copies of any data that is changed. This is sometimes called a shadow system or audit log. In this type of system, when someone changes a record in a table, a copy of the unchanged record is stored in a shadow table, and then the change is made in the main table. Since most EDMS activity usually does not involve a lot of changes, this does not increase the storage as much as it might appear, but it does significantly increase the complexity of the software.
Figure 24 - Simple screen for tracking database activity
Figure 25 - Output of activity log data
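A minimal version of the simple activity tracking described above takes one table and a pair of functions, sketched here with hypothetical names using Python's sqlite3 module:

```python
import sqlite3
from datetime import datetime

def log_activity(conn, administrator, site, description):
    """Record who changed what, when, and why. Refuses an empty
    description, mirroring a screen that will not close without one."""
    if not description:
        raise ValueError("A description of the activity is required")
    conn.execute(
        "INSERT INTO ActivityLog (Administrator, ActivityDate, Site, "
        "Description) VALUES (?, ?, ?, ?)",
        (administrator, datetime.now().isoformat(), site, description))
    conn.commit()

def show_log(conn, administrator=None):
    """Display the log, optionally filtered on Administrator name,
    as in the selection screen of Figure 25."""
    sql = ("SELECT ActivityDate, Administrator, Site, Description "
           "FROM ActivityLog")
    args = ()
    if administrator:
        sql += " WHERE Administrator = ?"
        args = (administrator,)
    for row in conn.execute(sql, args):
        print(row)
```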
Database maintenance

There are a number of activities that must be performed on an ongoing or at least occasional basis to keep an EDMS up and running. These include:

Backup – Backing up data in the database is discussed in Chapter 15, but must be kept in mind as part of ongoing database maintenance.

Upgrades – Both commercial and custom software should be upgraded on a regular basis. These upgrades may be required due to a change in the software platform (operating system, database software) or to add features and fix bugs. A system should be implemented so that all users of the EDMS receive the latest version of the software in a timely fashion. For large enterprises with a large number of users, automated tools are available to assist the system administrator with distributing upgrades to all of the appropriate computers without having to visit each one. Web-based tools are beginning to appear that provide the same functionality for all users of software programs that support this feature. Either of these approaches can be a great time saver for a large enterprise system.

Other maintenance – Other maintenance activities are required, both on the client side and the server side. For example, on the client side, Access databases grow in size with use. You should occasionally compact your database files. You can do this on some set schedule, such as monthly, or when you notice that a file has grown large, such as larger than 5 megabytes (5000 KB). Occasionally problems will occur with Access databases due to power failures, system crashes, etc.
When this happens, first exit Access, then shut down Windows, power down the computer, and restart. If you get errors in the database after that, you can have Access repair and compact the database. In the worst case (if repairing does not work), you should obtain a new copy of the database program from the original source, and restore your data file from a backup.

System maintenance will be required on the server database as well, and will generally be performed by the system administrator with assistance from the vendor if necessary. These procedures include general maintenance of the server computer, user administration, database maintenance, and system backup. The database is expected to grow as new data is received for sites currently in the database, and as new sites are added. At some point in the future it will be necessary to expand the size of the device and the database to accommodate the increased volume of data. The system administrator should monitor the system to determine when the database size needs to be increased.
CHAPTER 6

DATABASE ELEMENTS
A number of elements make up an EDMS. These elements include the computer on the user's desk, the software on that computer, the network hardware and software, and the database server computer. They also include the components of the database management system itself, such as files, tables, fields, and so on. This chapter covers the important elements from these two categories. This presentation focuses on how these objects are implemented in Access (for stand-alone use) and SQL Server (for client-server), two popular database products from Microsoft. A good overview of Access for both new and experienced database users can be found in Jennings (1995). More advanced users might be interested in Dunn (1994). More information on SQL Server can be found in Nath (1995); England and Stanley (1996); and England (1997). More information on database elements can be found in Dragan (2001), Gagnon (1998), Harkins (2001a, 2001b), Jepson (2001), and Ross et al. (2001).
HARDWARE AND SOFTWARE COMPONENTS

A modern networked data management system consists of a number of hardware and software components. These items, which often come from different manufacturers and vendors, all must work together for the system to function properly.
The desktop computer

In order to run a data management system, either client-server or stand-alone, you must have a computer, and the computer resources must be sufficient to run the software. Data management programs can be relatively large applications, so the computer must be capable of running the appropriate operating system, such as Windows. This section describes the desktop hardware and software requirements for either a client-server or stand-alone database management system. Other than the network connection, the hardware requirements are the same.
DESKTOP HARDWARE

The computer should have a large enough hard drive and enough random access memory (RAM) to be able to load the software and run it with adequate performance, and data management software can have relatively high requirements. For example, Microsoft Access has the greatest resource requirements of any of the Microsoft Office programs. At the time of this writing, the minimum and recommended computer specifications for adequate performance using the data management system are as shown in Figure 26.
Item              | Minimum                                   | Recommended
Computer          | 200 megahertz Pentium processor           | 500 to 1000 megahertz Pentium processor
Hard drive        | Adequate for software and local data storage, at least 1 gigabyte | Adequate for software and local data storage, at least 1 gigabyte
Memory            | 64 megabytes RAM                          | 128 megabytes RAM
Removable storage | 3.5" floppy, CD-ROM                       | 3.5" floppy, CD-RW, Zip drive
Display           | VGA 800x600                               | XGA 1024x768 or better
Network           | 10 megabits per second                    | 100 megabits per second
Peripherals       | Printer                                   | High-speed printer, scanner

Figure 26 - Suggested hardware specifications
Probably the most important requirement is adequate random access memory (RAM), the chips that provide short-term storage of data. The amount of RAM should be increased on machines that are not providing acceptable performance. If increasing the RAM does not increase the performance to a level appropriate for that user's specific needs, then replacing the computer with a faster one may be required.

It is important to note that the hardware requirements to run the latest software, and the computer processing power of standard systems available at the store, both become greater over time. Computers that are more than three years or so old may be inadequate for running the latest version of the database software. A brand-new, powerful computer including a monitor and printer sells for $1000 or less, so it doesn't make sense to limp along on an underpowered, flaky computer. Don't be penny-wise and pound-foolish. Be sure that everyone has adequate computers for the work they do. It will save money in the long run.

An important distinction to keep in mind is the difference between memory and storage. A computer has a certain amount of system memory or RAM. It also has a storage device such as a hard drive. Often people confuse the two, and say that their computer has 10 gigabytes of memory when they mean disk storage.
DESKTOP SOFTWARE

Several software components are required in order to run a relational database management system. These include the operating system, networking software (where appropriate), database management software, and the application.
Operating system

Most systems used for data management run one of the Microsoft operating systems: Windows 95, 98, ME, or NT/2000/XP. All of these systems can run the same client data management software and perform pretty much the same. Apple Macintosh systems are present in some places, but are used mostly for graphic design and education, and have limited application for data management due to poor software availability. UNIX systems (including the popular open-source version, Linux) are becoming an increasingly viable possibility, with serious database systems like Oracle and DB2 now available for various types of UNIX.
Networking software

If the data is to be managed with a shared-file or client-server system, or if the files containing a single-user database are to be stored on a file server computer, the client computer will need to run networking software to make the network interface card work. In some cases the networking software is part of the operating system. This is the case with a Windows network. In other cases the networking will be done with a separate software package. Examples include Novell Netware
and Banyan Vines. Either way, the networking software will generally be loaded during system startup, and after that it can pretty much be ignored; it simply makes network file server and network database server resources available. This networking software is described in more detail in the next section.
Database management software

The next software element in the database system is the database management software itself. Examples of this software are Microsoft Access, FoxPro, and Paradox. This software can be used by itself to manage the data, or with the help of a customized application as described in the next section. The database application provides the user interface (the menus and forms that the user sees) and can, in the case of a stand-alone or single-user system, also provide the data storage. In a client-server system, the database software on the client computer provides the user interface, and some or all of the data is stored on the database server computer somewhere else on the network.

If the data to be managed is relatively simple, the database management software by itself is adequate for managing it. For example, a simple table of names and addresses can be created and data entered into it with a minimum of effort. As the data model becomes more complicated, and as the interaction between the database and external data sources becomes more involved, it can become increasingly difficult to perform the required activities using the tools of the software by itself. At that point a specialized application may be required.
Application

When the complexity of the database or its interactions exceeds the capability of the general-purpose database manager program, it is necessary to move to a specialized vertical market application. This refers to software specialized for a particular industry segment. An EDMS represents software of this type. This type of system is also referred to as COTS (commercial off-the-shelf) software. Usually the vertical market application will provide pre-configured tables and fields to store the data, import routines for data formats common in the industry, forms for editing the data, reports for printing selected data, and export formats for specific needs. Using off-the-shelf EDMS software can give you a great head start in building and managing your database, relative to designing and building your own system.
The network

Often the EDMS will run on a network so people can share the data. The network has hardware and software components, which are discussed in the following sections.
NETWORK HARDWARE

The network on which the EDMS operates has three basic components, in addition to the computers themselves: network adapters, wiring, and hubs. These network hardware components are shown in Figure 27. The network adapters are printed circuit boards that are placed in slots in the client and server computers and provide the electronic connection between the computer and the network. The type of adapter card used depends on the kind of computer in which it is placed, and the type of network being used.
[Figure: client computers, each with a network adapter, connect through a network hub to the server, which has its own network adapter.]

Figure 27 - The EDMS network hardware diagram
The wiring also depends on the type of network being used. The two most common types of wiring are twisted pair and coaxial, usually thin Ethernet. Twisted pair is becoming more common over time due to lower cost. Most twisted pair networks use Category 5 (sometimes called Cat5) cable, which is similar to standard telephone wiring, but of higher quality. There is usually a short cable that runs between the computer and a wall plate, wiring in the walls from the client's or server's office to a wiring closet, and then another cable from the wall plate or switch block in the wiring closet to the hub.

The hub is a piece of hardware that takes the cables carrying data from the computers on the network and connects them together physically. Depending on the type of network and the number of computers, other hardware may be used in place of or in addition to the hub. This might include network switches or routers.

The network can run at different speeds depending on the capability of the computers, network cards, hubs, wiring, and so on. Until recently 10 megabits per second was standard for local area networks (LANs), and 56 kilobits per second was common for wide-area networks (WANs). Increasingly, 100 megabits per second is being installed for LANs and 1 megabit per second or faster is used for WANs.
EDMS NETWORK SOFTWARE

There are a number of software components required on both the client and server computers in order for the EDMS to operate. Included in this category are the operating system, transport protocols, and other software required just to make the computer and network work. The operating system and network software should be up and running before the EDMS is installed.
[Figure: each client computer runs an Access front-end and an ODBC driver. SQL queries flow in to the server and query results flow out to the clients; on the server, the SQL Server process manages data storage along with backup and restore.]

Figure 28 - The EDMS network software components
The major networked data management software components of the EDMS are discussed in this section from an external perspective, that is, looking at the various pieces and what they do, but not at the detailed internal workings of each. The important parts of the internal view, especially of the data management system, will be provided in later sections. On the client computers in a client-server system, the important components for data management provide the user interface and communication with the server. On the server, the software completes the communication and provides storage and manipulation of the data. For a stand-alone system, both parts run on the client computer. The diagram in Figure 28 shows the major data management software components for a client-server system, based on Access as a front-end and SQL Server as a back-end.

On the client computers, the user interface for the EDMS can be provided by a database such as Microsoft Access, or can be written in a programming language like Visual Basic, PowerBuilder, Java, or C++. The advantage of using a database language is ease of development and flexibility. The advantage of a compiled language is code security, and perhaps speed, although speed is less of a distinguishing factor than it used to be.

The main user interface components are forms and menus for soliciting user input and forms and reports for displaying output. Also provided by Access on the desktop are queries to manipulate data and macros and modules (both of which are types of programs) to control program
operation and perform various tasks. Customized components specific to the EDMS, if any, are contained in an Access .mdb file which is placed on the client computer during setup and which can be updated on a regular basis as modifications are made to the software. Through this interface, the user should be able to (with appropriate privileges) import and check data, select subsets of the data, and generate output, including tables, reports, graphs, and maps.

To communicate data with the server, the Access software works with a driver, which is a specialized piece of software with specific capabilities. In a typical EDMS this driver uses a data transfer protocol called Open DataBase Connectivity (ODBC). The driver for communicating with SQL Server is provided by Microsoft as part of the Access software installation, although it may not be installed as part of the standard installation. Drivers for other server databases are available from various sources, often the vendor of the database software.

There are two parts to the ODBC system in Windows. One part is ODBC administration, which can be accessed through the ODBC icon in Control Panel. This part provides central management of ODBC connections for all of the drivers that are installed. The second part consists of individual drivers for specific data sources. There are two kinds of ODBC drivers, single-tier and multi-tier. The single-tier drivers provide both the communication and data manipulation capabilities, and the data management software for that specific format itself is not required. Examples of single-tier drivers include the drivers for Access, dBase, and FoxPro data files. Multi-tier drivers provide the communication between the client and server, and work with the database management software on the server to provide data access. Examples of multi-tier drivers include the drivers for SQL Server and Oracle.

The server side of the ODBC communication link is provided by software that runs on the server as an NT/2000/XP process. The SQL Server process listens for data requests from clients across the network via the ODBC link, executes queries locally on the server, and sends the results back to the requesting client. This step is very important, because the traffic across the network is minimized. The requests for data are in the form of SQL queries, which are a few hundred to a few thousand characters, and the data returned is whatever was asked for. In this way the user can query a small amount of data from a database with millions of records, and the network traffic would be just a few thousand characters.

Some EDMS software packages can work in either stand-alone or client-server mode. In the first case, the EDMS uses a direct link to the Jet database engine when working with an Access database. In the second case, the EDMS uses the SQL Server multi-tier driver to communicate between the user interface in Access and SQL Server on the server. When users are attached to a local Access database, all of the processing and data flow occurs on the client computer. When connected to the server database the data comes from the server.
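From the client side, this exchange — a short SQL string going out, and only the matching rows coming back — looks like the following sketch. It uses the third-party pyodbc package rather than Access; the data source name, credentials, and schema are hypothetical.

```python
import pyodbc  # third-party ODBC bridge for Python

# The connection string names an ODBC data source configured through
# the ODBC administrator in Control Panel (DSN and login hypothetical).
conn = pyodbc.connect("DSN=EnviroData;UID=reader;PWD=secret")
cursor = conn.cursor()

# Only this short SQL text crosses the network; the server runs the
# query and returns just the matching rows, not the whole database.
cursor.execute(
    "SELECT st.StationName, a.Value "
    "FROM Stations st, Samples sa, Analyses a "
    "WHERE st.StationNumber = sa.StationNumber "
    "AND sa.SampleNumber = a.SampleNumber "
    "AND a.ParameterCode = ?", "SO4")

for station_name, value in cursor.fetchall():
    print(station_name, value)
```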
The server

SERVER HARDWARE

The third hardware component of the EDMS, besides client computers and the network, is the database server. This is a computer, usually a relatively powerful one, which contains the data and runs the server component of the data management software. Usually it runs an enterprise-grade operating system such as Windows NT/2000/XP or UNIX. In large organizations the server will be provided or operated by an Information Technology (IT) or similar group, while in smaller organizations data administrators or power users in the group will run it.

The range of hardware used for servers, especially those running Windows NT/2000/XP, is great. NT/2000/XP can run on a standard PC of the type purchased at discount or office supply stores. This is actually a good solution for small groups, especially when the application is not mission-critical, meaning that if the database becomes unavailable for short periods of time the company won't be shut down.
Figure 29 - Example administrative screen from Microsoft SQL Server
For an organization where the amount of use of the system is greater, or full-time availability is very important, a computer designed as a server, with redundant and hot-swappable (can be replaced without turning off the computer) components, is a better solution. This can increase the cost of the computer by a factor of two to ten or more, but may be justified depending on the cost of loss of availability.
SERVER SOFTWARE

The client-based software components described above are those that users interact with. System administrators also interact with the server database user interface, which is software running on the server computer that allows maintenance of the database. These maintenance activities include regular backup of the data and occasional other maintenance activities including user and volume administration. Software is also available which allows many of these maintenance activities to be performed from computers remote from the server, if this is more convenient. An example screen from SQL Server is shown in Figure 29.
UNITS OF DATA STORAGE

The smallest unit of information used by computers is the binary bit (short for BInary digiT). A bit is made up of one piece of data consisting of either a zero or a one, or more precisely, the electrical charge is on or off at that location in memory. All other types of data are composed of one or more bits.

The next larger common unit of storage is the byte, which contains eight bits. One byte can represent one of 256 different possibilities (two raised to the eighth power). This allows a byte to represent any one of the characters of the alphabet, the numbers and punctuation symbols, or a large number of other characters. For example, the letter A (capital A) can be represented by the byte 01000001. How each character is coded depends on the coding convention used. The two most common are ASCII (American Standard Code for Information Interchange) used on personal
computers and workstations, and EBCDIC (Extended Binary Coded Decimal Interchange Code) used on some mainframes.

The largest single piece of data that can be handled directly by a given processor is called a word. For an 8-bit machine, a word is the same as a byte. For a 16-bit system, a word is 16 bits long, and so on. A 32-bit processor is faster than a 16-bit processor of the same clock speed because it can process more data at once, since the word size is twice as big.

For larger amounts of data, the amount of storage is generally referred to in terms of the number of bytes, usually in factors of a thousand (actually 1024, or 2^10). Thus one thousand bytes would be one kilobyte, one million would be one megabyte, one billion is one gigabyte, and one trillion is one terabyte. As memory, mass storage devices, and databases become larger, the last two terms are becoming increasingly important.
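These relationships are easy to verify interactively; the following Python lines restate the character coding and storage factors just described.

```python
# One byte (eight bits) can hold one of 2**8 = 256 values.
assert 2 ** 8 == 256

# In ASCII, the letter A is the byte 01000001, decimal 65.
assert ord("A") == 0b01000001 == 65

# Storage units grow by factors of 1024 (2**10), not exactly 1000.
kilobyte = 2 ** 10        # 1,024 bytes
megabyte = kilobyte ** 2  # 1,048,576 bytes
gigabyte = kilobyte ** 3
terabyte = kilobyte ** 4
```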
DATABASES AND FILES

As discussed in Chapter 5, databases can be described by their logical data model, which focuses on data and relationships, and their physical data model, which is how the data is stored in the computer. All data in a modern computer is stored in files. Files are chunks of related data stored together on a disk drive such as a hard disk or floppy disk. The operating system takes care of managing the details of the files, such as where they are located on the disk. Files have names, and files in DOS and Windows usually have a base name and an extension separated by a period, such as Mydata.dbf. The extension usually tells you what type of file it is.

Older database systems often stored their data in the format of dBase, with an extension of .dbf. Access stores its data and programs in files with the extension of .mdb (for Microsoft DataBase), and can store many tables and other objects in one file. Most Access developers build their applications with one .mdb file for the program information (queries, forms, reports, etc.) and another for the data (tables). Larger database applications have their data in an external database manager such as Oracle or SQL Server. The user does not see this data as files, but rather as a data source available across the network. If the front end is running in Access, they will still have the program .mdb either on their local hard drive or available on a network drive. If their user interface is a compiled program written in Visual Basic, C, or a similar language, it will have an extension of .exe.

We will now look at the remaining parts of a database system from the point of view of a stand-alone Access database. The concepts are about the same for other database software packages. Access databases contain six primary objects: tables, queries, forms, reports, macros, and modules. These objects are described in the following sections.
TABLES (“DATABASES”)

The basic element of storage in a relational database system is the table. Each table is a homogeneous set of rows of data describing one type of real-world object. In some older systems like dBase, each table was referred to as a database file. Current usage tends more toward considering the database as the set of related tables, rather than calling one table a database. Tables contain the following parts:

Records – Each line in a table is called a record, row, entity, or tuple. For example, each boring or analysis would be a record in the appropriate table. Records are described in more detail below.

Fields – Each data element within a record is called a field, column, or attribute. This represents a significant attribute of a real-world object, such as the elevation of a boring or the measured value of a constituent. Fields are also described in more detail below.
Figure 30 - Join Properties form in Microsoft Access
Relationships – Data in different tables can be related to each other. For example, each analysis is related to a specific sample, which in turn is related to a specific boring. Relationships are usually based on key fields. The database manager can help enforce relationships using referential integrity, which requires that defined relationships be fulfilled according to the join type. Using this capability, it would be impossible to have an analysis for which there is no sample.

Join types – A relationship between two tables is defined by a join. There are two kinds of joins, inner joins and outer joins. In an inner join, matching records must be present on both sides of the join. That means that if one of the tables has records with no matching records in the other, they are not displayed. An outer join allows unmatched records to be displayed. It can be a left join or a right join, depending on which table will have unmatched records displayed. Figure 30 shows an example of defining an outer join in Access. In this example, a query has been created with the Sites and Stations tables. The join based on the SiteNumber field has been defined as an outer join, with all records from the Sites table being displayed, even if there are no corresponding records in the Stations table. This outer join is a left join. Figure 31 shows the result of this query. There are stations for Rad Industries and Forest Products Co., but none for Refining, Inc. Because of the outer join, a record is displayed for Refining, Inc. even though there are no stations.
Figure 31 - Result of an outer join query
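The SQL underlying such a query would look something like this sketch. The Sites, Stations, and SiteNumber names come from the example above; SiteName and StationName are assumed for illustration.

' Open the outer join query; LEFT JOIN keeps every Sites record, even
' those with no matching Stations records.
Public Sub ShowOuterJoin()
    Dim rs As DAO.Recordset
    Set rs = CurrentDb.OpenRecordset( _
        "SELECT Sites.SiteName, Stations.StationName " & _
        "FROM Sites LEFT JOIN Stations " & _
        "ON Sites.SiteNumber = Stations.SiteNumber;")
    Do While Not rs.EOF
        ' StationName is Null for sites with no stations, such as Refining, Inc.
        Debug.Print rs!SiteName, Nz(rs!StationName, "(no stations)")
        rs.MoveNext
    Loop
    rs.Close
End Sub

Changing LEFT JOIN to INNER JOIN would drop the Refining, Inc. row from the result.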
FIELDS (COLUMNS)

The fields within each record contain the data of each specific kind within that record. These are analogous to the way columns are often used in a spreadsheet, or the blanks to be filled out on a paper form.

Data types – Each field has a data type, such as numeric (several possible types), character, date/time, yes/no, object, etc. The data type limits the content of the field to just that kind of data, although character fields can contain numbers and dates. You shouldn’t store numbers in a character field, though, if you want to treat them as numbers, such as performing arithmetic calculations on them.

Character fields are the most common type of field. They may include letters, numbers, punctuation marks, and any other printable characters. Some typical character fields would be SiteName, SampleType, and so on.

Numeric is for numbers on which calculations will be performed. They may be either positive or negative, and may include a decimal point. Numeric fields that might be found in an EDMS are GroundElevation, SampleTop, etc. Some systems break numbers down further into integer and floating point numbers of various degrees of precision. Generally this is only important if you are writing software, and less important if you are using commercial programs. It is important to note that Microsoft programs such as Excel and Access have an annoying feature (bug): they refuse to save trailing zeros, which are very important in tracking precision. If you open a new spreadsheet in Excel, type in 3.10, and press Enter, the zero will go away. You can change the formatting to get it back, but it’s not stored with the number. The best way around this is to store the number of decimals with each result value, and then format the number when it is displayed (a code sketch appears at the end of this section).

Date is pretty obvious. Arithmetic calculations can often be performed on dates. For example, the fields SampleDate and AnalysisDate could be included in a table, and could be subtracted from each other to find the holding time. Date fields in older systems are often 8 characters long (MM/DD/YY), while more modern, year 2000 compliant systems use 10 characters (MM/DD/YYYY). There is some variability in the way that time is handled in data management systems. In some database files, such as dBase and FoxPro .dbf files, date and time are stored in separate fields. In others, such as Access .mdb files, both can be stored in one field, with the whole number representing the date and the decimal component containing the time. Dates in Access are stored as the number of days since 12/30/1899, and times as the fraction of the day starting at midnight, such that .5 is noon. The way dates are displayed traditionally varies from one part of the world to another, so as we go global, be careful. On Windows computers, the date display format is set in the operating system under Start/Settings/Control Panel/Regional Settings.

Logical represents a yes/no (true/false) value. Logical fields are one byte long (although it actually takes only one bit to store a logical value). ConvertedValue could be a logical field that is true or false based on whether or not a value in the database has been converted from its original units.

Data domain – Data within each field can be limited to a certain range. For example, pH could be limited to the range of 0 to 14.
Comprehensive domain checking can be difficult to implement effectively, since in a normalized data model, pH is not stored in its own field, but in the same Value field that stores sulfate and benzene, which certainly can exceed 14. That means that this type of domain checking usually requires programming.

Value – Each field has a value, which can be some measured amount, some text attribute, etc. It is also possible that the value may be unknown or not exist, in which case the value can be set to Null. Be aware, however, that Null is not the same as zero, and is treated differently by the software.
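As suggested under Numeric above, the workaround for the trailing-zero problem is to store the number of decimals with each result and apply it only when the value is displayed. A minimal VBA sketch, with hypothetical arguments:

' Format a stored value using the stored decimal count, so that a value
' entered as 3.10 with two decimals displays as "3.10", not "3.1".
Public Function FormatResult(dblValue As Double, intDecimals As Integer) As String
    If intDecimals <= 0 Then
        FormatResult = Format(dblValue, "0")
    Else
        FormatResult = Format(dblValue, "0." & String(intDecimals, "0"))
    End If
End Function

FormatResult(3.1, 2) returns “3.10”, restoring the trailing zero for display while the stored number stays untouched.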
Figure 32 - Oracle screen for setting field properties
Key fields – Within each table there should be one or more fields that make each record in the table unique. This might be some real-world attribute (such as laboratory sample number) or a synthetic key such as a counter assigned by the data management system. A primary key has a unique value for each record in the table. A field in one table that is a primary key in another table is called a foreign key, and need not be unique, such as on the “many” side of a one-to-many relationship. Simple keys, which are made up of one field, are usually preferable to compound keys made up of more than one field. Compound keys, and in fact any keys based on real data, are usually poor choices because they depend on the data, which may change. Figure 32 shows an Oracle screen for setting field properties.
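In SQL terms, primary and foreign keys are declared when the table is created. The sketch below uses Jet SQL DDL with assumed table and field names; StationID is a synthetic, system-assigned key, and the REFERENCES clause ties each station to an existing site:

' Sketch of key definitions in Jet SQL DDL (table and field names assumed).
' StationID is a synthetic primary key assigned by the system; SiteNumber
' is a foreign key back to the Sites table, so a station cannot reference
' a site that does not exist.
Public Sub CreateStationsTable()
    CurrentDb.Execute _
        "CREATE TABLE Stations (" & _
        "StationID COUNTER CONSTRAINT pkStation PRIMARY KEY, " & _
        "SiteNumber LONG CONSTRAINT fkSite REFERENCES Sites (SiteNumber), " & _
        "StationName TEXT(50));"
End Sub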
RECORDS (ROWS)

Once the tables and fields have been defined, the data is usually entered one record at a time. Each well in the Stations table or groundwater sample in the Samples table is a record. Often the size of a database is described by the number of records in its tables.
QUERIES (VIEWS)

In Access, data manipulation is done using queries. Queries are based on SQL, and are given names and stored as objects, just like tables. The output of a query can be viewed directly in an editable, spreadsheet-like view, or can be used as the basis of a form or a report. Access has six types of queries:

Select – This is the basic data retrieval query.

Cross-tab – This is a specialized query for summarizing data.
Figure 33 - Simple data editing form
Make table – This query is used to retrieve data and place it into a new table.

Update – This query changes data in an existing table.

Append – This query type adds records to an existing table.

Delete – These queries remove records from a table, and should be used with great care!
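As an illustration, an update query is just an SQL UPDATE statement, and can be run from code as well as saved as a query object. A sketch with assumed table and field names:

' Sketch of an update query run from code (table and field names assumed).
' Select, make-table, append, and delete queries differ only in the verb.
Public Sub FlagOldSamples()
    ' dbFailOnError rolls the change back if any record cannot be updated
    CurrentDb.Execute _
        "UPDATE Samples SET Archived = True " & _
        "WHERE SampleDate < #1/1/1995#;", dbFailOnError
End Sub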
OTHER DATABASE OBJECTS

The other types of database objects in an Access system are forms, reports, macros, and modules. Forms and reports are for entering and displaying data, while macros and modules are for automating operations.
Forms

Forms in data management programs such as Access are generally used for entering, editing, or selecting data, although they can also be used as menus for selecting an activity. Forms for working with data use a table or a query as a data source.
Figure 34 - Advanced data editing form
Figure 35 - Example of a navigation form
Figure 33 shows an example of a simple form for editing data from one table. Data editing forms can be much more complicated than this. Figure 34 shows a data editing form with many fields and a subform, which allows many records in the related table to be displayed for each record in the main table. Figure 35 shows a form used for navigation. Users click on one of the gray rectangles with their mouse to open the appropriate form for what they want to do. Sometimes data entry forms can be combined with navigation capabilities. The form in Figure 36 is mostly a data entry form, with data fields and a subform, but it also allows users to navigate to a specific record. They do this by opening a combo box and selecting an item from the list. The form then takes them to that specific record (a code sketch follows the figure). Forms are a very important part of a database system, since they are usually the main way users interact with the system.
Figure 36 - A form combining data entry and navigation
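The combo box navigation in Figure 36 is commonly implemented with a few lines of code behind the form. A sketch, with assumed control and field names:

' Runs after the user picks a station from the combo box: find the
' matching record in the form's recordset and move the form to it.
Private Sub cboFindStation_AfterUpdate()
    Dim rs As DAO.Recordset
    Set rs = Me.RecordsetClone
    rs.FindFirst "StationID = " & Me!cboFindStation
    If Not rs.NoMatch Then Me.Bookmark = rs.Bookmark
End Sub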
Figure 37 - Report of analytical data
Reports

Reports are used for displaying data, usually for printing. Reports use a table or query as a data source, the same way that forms do. The main differences are that the data on reports cannot be edited, and reports can better handle large volumes of data using multiple pages. Figure 37 shows a typical report of analytical data. Reports will be covered in much more detail in Chapter 19.
Macros, modules, subroutines, and functions

Much of the power in modern data management programs comes from the ability to program them for specific needs. Some programs, like Access, provide more than one way to tell the program what to do. The two ways in Access are macros and modules. Macros are like stored keystrokes, and are used to automate procedures. Modules are more like programs, and can also be used to automate activities. Modules have some advantages over macros, and many Access developers prefer them, but since macros are easier to learn, they are frequently used, especially by beginners. Microsoft encourages the use of modules instead of macros for programming applications, and suggests that support for macros may be removed in future versions. Figure 38 shows an Access macro in the macro-editing screen. This macro minimizes the current window and displays the STARTUP form.
Figure 38 - Access macro example
Modules provide the programming power behind Access. They are written in Access Basic, a variety of Visual Basic for Applications (VBA). VBA is a complete, powerful programming language that can do nearly anything that any programming language can do. VBA is customized to work with Access data, which makes it easy to write sophisticated applications that work with data in Access tables. Figure 39 shows the Access screen for editing a module. The subroutine shown re-spaces the print order in the Parameters table with an increment of five so new parameters can be inserted in between; a sketch of such a routine follows the figure. A module can have two kinds of code in it, subroutines (also known as “subs”) and functions, both of which are referred to as procedures. Both are written in VBA. The difference is that a function returns a value, and a sub does not. Otherwise they can do exactly the same thing.
Figure 39 - Access module example
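A routine of the kind shown in Figure 39 might look something like this sketch (the Parameters table is from the text; the PrintOrder field name is assumed):

' Walk the Parameters table in print order and re-space PrintOrder in
' increments of five, leaving room to insert new parameters in between.
Public Sub RespacePrintOrder()
    Dim rs As DAO.Recordset
    Dim lngOrder As Long
    Set rs = CurrentDb.OpenRecordset( _
        "SELECT PrintOrder FROM Parameters ORDER BY PrintOrder;")
    lngOrder = 5
    Do While Not rs.EOF
        rs.Edit
        rs!PrintOrder = lngOrder
        rs.Update
        lngOrder = lngOrder + 5
        rs.MoveNext
    Loop
    rs.Close
End Sub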
Figure 40 - SQL Server screen for editing a trigger
Triggers and stored procedures

There is another kind of automation that a database manager can provide: associating activity with specific events and data changes. Access does not provide this functionality, but SQL Server and Oracle do. You can associate a trigger with an event, such as changing a data value, and the software will run that action when the event happens. Entering and editing triggers can be done in one of two ways. The programs provide a way to create and modify triggers using the SQL Data Definition Language by entering commands interactively. They also provide a user interface for drilling down to the trigger as part of the table object model and entering and editing triggers there. This interface for SQL Server is shown in Figure 40. A stored procedure is similar, except that it is called explicitly rather than being associated with an event.
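As a sketch of the idea, the statement below creates a SQL Server trigger that stamps a modification date on each changed row; the connection string, table, and field names are all assumptions, and the command is sent from VBA over ADO:

' Create a SQL Server trigger from VBA (all names here are assumed).
Public Sub CreateUpdateTrigger()
    Dim cn As Object
    Set cn = CreateObject("ADODB.Connection")   ' late-bound ADO connection
    cn.Open "Provider=SQLOLEDB;Data Source=MyServer;" & _
            "Initial Catalog=EDMS;Integrated Security=SSPI;"
    cn.Execute _
        "CREATE TRIGGER trgAnalysesUpdated ON Analyses AFTER UPDATE AS " & _
        "UPDATE Analyses SET ModifiedDate = GETDATE() " & _
        "FROM Analyses INNER JOIN inserted " & _
        "ON Analyses.AnalysisID = inserted.AnalysisID"
    cn.Close
End Sub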
Calculated fields

One of the things that computers do very well is perform calculations, and often retrieving data from a database involves a significant amount of calculation. People who use spreadsheets are accustomed to storing formulas in cells, and letting the spreadsheet display the result. It is tempting to store calculated results in the database as well. In general, this is a bad idea (Harkins, 2001a), despite the fact that it is easy to do using the programmability of the database software. There are several reasons why. First, it violates good database design. In a well-designed database, changing one field in a table should have no effect on other fields in the table. If one field is calculated from one or more others, this will not be the case. The second and main reason is the risk of error due to redundant data storage. If you change one data element and forget to change the calculated data, the database will be inconsistent. Finally, there are lots of other ways to achieve the same thing. Usually the best way is to perform the calculation in the query that retrieves the data. Calculated controls can also display the result on the fly. There are exceptions, of course. A data warehouse contains extracted and often calculated data for performance purposes. In deeply nested queries in Access, memory limitations sometimes require storing intermediate calculations in a table, and then performing more queries on the intermediate results. For the most part, however, if your design involves storing calculated results, you will want to take a hard look at whether this is the best way to proceed.
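For example, rather than storing a holding time field, the sketch below computes it in the retrieval query each time, so it can never disagree with the stored dates (table and field names are assumed):

' Holding time is calculated on the fly; in Jet SQL, subtracting two
' date fields yields the number of days between them.
Public Sub ListHoldingTimes()
    Dim rs As DAO.Recordset
    Set rs = CurrentDb.OpenRecordset( _
        "SELECT SampleDate, AnalysisDate, " & _
        "AnalysisDate - SampleDate AS HoldingDays " & _
        "FROM Samples INNER JOIN Analyses " & _
        "ON Samples.SampleID = Analyses.SampleID;")
    Do While Not rs.EOF
        Debug.Print rs!HoldingDays
        rs.MoveNext
    Loop
    rs.Close
End Sub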
CHAPTER 7

THE USER INTERFACE
The user interface for the software defines the interaction between the user and the computer for the tasks to be performed by the software. Good user interface design requires that the software be presented from the point of view of what the user wants to do rather than what the computer needs to know to do it. Also, with user interfaces, less is more. The more the software can figure out and do for the user without asking, the better. This chapter provides information that may be helpful in designing a good user interface, or in evaluating a user interface designed by others.
GENERAL USER INTERFACE ISSUES

The user interface should be a modern graphical user interface with ample guidance so that users can make decisions about the actions required of them and perform the tasks offered by the system. In addition, online help should be provided to assist them should they need information beyond what is presented on the software screens. The primary method for navigating through the system should be a menu, forms, or a combination of both. These should have options visible on the screen to take users to the areas of the program where they can perform the various activities. An example of a main menu form for an EDMS is shown in Figure 41. Users select actions by pressing labeled buttons on forms. Pressing a button will usually bring up another form for data input or viewing. Each form should have a button to return users to the previous form, so they can go back when they are finished or if they go to the wrong place.

Forms for data entry and editing should have labeled fields, so it is clear from looking at the screen what information goes where. Data entry screens should have two levels of undo. Pressing the Escape key once will undo changes to the current field. Pressing it again will undo all changes to the current record. Multiple levels of undo can significantly increase users’ confidence that they can recover from problems.

The user interface should provide guidance to users in two ways. The first is in the arrangement and labeling of controls on forms. Users should be able to look at a form and get enough visual clues to know what to do. The second is tool tips, little windows that pop up when the cursor moves across a control, providing the user with guidance on using that control. Consistency and clarity are critical in the user interface, especially for people who use the software on a part-time or even occasional basis.
The illusion of simplicity comes from focusing on only one variable. Rich (1996)
Figure 41 - Example of a menu form with navigation buttons
Figure 42 - Tool tip
CONCEPTUAL GUIDELINES

An environmental data management program is intended to have a useful life of many years. During both the initial development and ongoing maintenance stages, it is likely that many different people will make additions and modifications to the software. This section is intended to provide guidance to those individuals so that the resulting user interface is as seamless as possible. The primary focus of these guidelines is on ease of use and consistency. These two factors, combined with a high level of functionality, will lead to a positive user experience with the software and acceptance of the system in the organization. A number of questions to be answered regarding user interface design are listed in Cooper (1995, p. 20). His advice can be broken down into two premises. The first is that the software should be presented from the point of view of what the user wants to do (the manifest model) rather than what the computer needs to know to do it (the implementation model). The second is that with user interfaces, less is more. The more the software can figure out and do for the user without asking, the better.
This section uses Cooper’s questions as a framework for discussing user interface issues for an EDMS. The target data management system is a client-server system with Microsoft Access as the front end and SQL Server as the back end, but the guidelines apply equally well to other designs. Answers are provided for a typical system design, but of course these answers will vary depending on the implementation details. It is important to note that the tools used in addressing user interface issues must be those of the system in which it is operating. Some of this material is based on interviews with users in the early stages of system implementation, and some on discussions with designers and experienced users, so it represents several perspectives.

What should be the form of the program? – The data management system consists of tables, queries, forms, reports, macros, and modules in Microsoft Access and SQL Server to store and manipulate the data and present the user interface. For the most part, the user will see forms containing action buttons and other controls. The results of their actions will be presented as an Access form or report window. A recurring theme with users is that the system must be easy to learn and use if people are going to embrace it. This theme must be kept in mind during system design and maintenance. Every attempt should be made to ensure that the software provides users with the guidance that they need to make decisions and perform actions. The software should help them get their work done efficiently.

How will the user interact with the program? – The primary interaction with the user is through screen forms with buttons for navigation and various data controls for selection and data entry. In general, users should be able to make selections from drop-down lists rather than having to type in selections, and as much as possible the software should remember their answers to questions so that those answers can be suggested next time. The most common comment from users related to the user interface is that the system should be easy to use. People feel that they don’t have much time to learn a new system. In order to gain acceptance, the new system will need to save time, or at least provide benefits that outweigh the costs of setup and use. Another way of saying this is that the software should be discoverable. The user should be able to obtain clues about what to do by looking at the screen. Figure 43 shows a screen from a character-mode DOS interface. This interface is not discoverable. The user is expected to know what to type in to make the program do something. In this example, what the user tried didn’t work. The next example, Figure 44, shows a major improvement. Users need no a priori knowledge of what to do. They can look at the screen and figure out what to do by reading their options. Of course, even a good idea can have flaws. In this example, the flow is a little illogical, expecting users to click on Start to stop (shut down) their computer, but the general idea is a great improvement. The transition to a discoverable interface, especially at the operating system level, which was originally popularized by the Apple Macintosh computer and later by the Microsoft Windows operating system, has made computer use accessible to a much wider audience.

How can the program’s function be most effectively organized? – The functions of the program are organized by the tasks to be performed.
In most cases, users will start their session by selecting a project, and perhaps other selection criteria, and then select an action to perform on the selected data. Where a set of answers is required, the questions should be presented in a clear, logical sequence, either on a single form or as a series of related “wizard”-like screens.

How will the program introduce itself to first-time users? – An example of program introduction would be for the program to display an introductory (splash) screen followed by the main menu. An on-screen “tour” or tutorial screen, as shown in Figure 45, can be very helpful in getting a new user up and running fast.
Figure 43 - Example of an interface that is not “discoverable”
Figure 44 - Example of a “discoverable” interface
Figure 45 - On-screen “tour” or tutorial
A printed tutorial can perform this function also, but experience has shown that people are more likely to go through the tutorial if it is presented on-screen. This satisfies the “instant gratification” requirement of someone who has just acquired a new software program and wants to see it do something right away (the “out of box” experience). After that, users can take the time to learn the program in detail, get their data loaded, and use the software to perform useful work.

How can the program put an understandable and controllable face on technology? – The software must make the user comfortable in working with the data. The user interface must make it easy to experiment with different approaches to retrieving and displaying data. This allows people to find the data selection and presentation approach that helps them best communicate the desired message. Figures 46 and 47 provide interesting examples of the good and the bad of using multiple display windows to help the user feel comfortable working with data. The multiple windows showing different ways of looking at the data would confuse some users. Others would be thrilled to be able to look at their data in all of these different ways. There is certainly a personal preference issue regarding how the software presents data and lets the user work with it. The software should provide the option of multiple on-screen displays, and users can open as many as they are comfortable with.
Figure 46 - Database software display with many data elements
Figure 47 - GIS software display with many data elements
Several things can be done in the user interface to support these usability objectives. As discussed above, the software should be discoverable. It should also be recoverable, so that users can back out of any selection or edit that they make incorrectly. Users should be provided with information so that they can predict the consequences of their actions. For example, the software should give clues about how long a process will take, and show the progress of execution. Any process that will take a long time should have a cancel button or other way to terminate processing. This goal can sometimes be difficult to accomplish, but it’s almost always worth the effort.

How can the program deal with problems? – When an error occurs, the program should trap the error. If possible, the software should deal with the error without user intervention. If this is not possible, then the program should present a description of the error to the user, along with options for recovery. The software designers should try to anticipate error conditions and prepare the software and the users to handle them.

How will the program help infrequent users become more expert? – Users should be able to determine how to perform their desired action by looking at the screen. Tool tips should be provided for all controls to assist in learning, and context-sensitive help at the form level should be provided to make more information available should the user require it.

How can the program provide sufficient depth for expert users? – The forms-based menu system should provide users with the bulk of the functionality needed to perform their work. Those wishing to go beyond this can be trained to use the object creation capabilities of the EDMS to make their own queries, forms, and reports for their specific needs. This is a major benefit of using a database program like Access rather than a compiled language like Visual Basic to build the EDMS front end. Training is important at this stage so users can be steered away from pitfalls that can be detrimental to the quality of their output. For example, a section in Chapter 15 discusses data retrieval and output quality issues. Another way of looking at this issue is that the program should be capable of growing with the user.
GUIDELINES FOR SPECIFIC ELEMENTS

Automated import, data review, manual data entry, lookup table maintenance, and related administrative activities should be done in forms, queries, and modules called up from menu choices. The rest of the interaction with the system is normally done through the selection screen. This screen is a standardized way for the user to select a subset of the data to work with. Users should be able to select a subset based on a variety of data elements. The selection process starts with a base query for each of the data elements. The selection screen then appends to the SQL “WHERE” clause based on the items selected on the screen. This query is then saved, and can be used as the basis for further processing such as retrieving or editing data. The system should make it easy, or at least possible, to add new functions as specific needs are identified.

The user interface is the component of the EDMS that interacts with the people using the software. The user interface for a data management system consists of five principal parts: Input, Editing, Selection, Output, and Maintenance.

Input – This section of the user interface allows data to be put into the system. In an EDMS, this involves file import and manual input. File import allows data in one or more specified formats to be brought into the system. The user interface component of file import should let the user select the location and name of the file, along with the format of the file being imported. Users should be able to specify how various import options like parameter and unit conversion will be handled. Manual input allows data to be typed into the system. The procedures for file import and manual input must provide the necessary level of quality assurance before the data is considered ready for use.

Editing – It is necessary to provide a way to change data in the system. The user interface component of data editing consists of presenting the various data components and allowing them to
be changed. It is also critical that the process for changing data be highly controlled, to prevent accidental or intentional corruption of the data. Data editing procedures should provide a process to assure that the changes are valid. Components of the data management software, such as referential integrity, lookup tables, and selections from drop-down menus, can help with this.

Selection – The two most important parts of an EDMS are getting the data in and getting the data out. Especially in larger organizations, getting the data in is done by data administrators who have been trained and have experience in carefully putting data into the system. Getting the data out, however, is often done by project personnel who may not be as computer literate, or at least not database experts. The user interface must address the fact that there will be a range of types of users. At one extreme is the type of user who is not familiar or comfortable with computers, and may never be. In the middle are people who may not have had much experience with data management, but will learn more over time. At the other extreme are power users who come in knowing a lot about data management and want to roll up their sleeves and dig in. The software should make all of these types of users comfortable in selecting and outputting data. A query by form (QBF) selection screen is one way to accomplish this (see the sketch at the end of this section).

Output – Once a subset of the data has been selected, the software should allow the data to be output in a variety of formats. For a system like this, almost all of the output follows a selection step. The selection for output involves choosing the data content (tables and fields), record subset (certain sites, stations, sample dates, etc.), and output format. The software should provide a set of standard (canned) output formats that can be chosen from the QBF selection screen. These can range from relatively unformatted lists to formalized reports, along with graphs and maps, and perhaps output to a file for further processing.

Maintenance – All databases require maintenance, and the larger the database (number of records, number of users, etc.), the more maintenance is required. For an EDMS, the maintenance involves the server and the clients. The largest item requiring maintenance in the server component of an EDMS is data backup, which is discussed in more detail in Chapter 15. Another server task is maintenance of the list of users and passwords. This must be done whenever people or assignments change. Also, the disk volumes holding the data in the database manager may need to be changed occasionally as the amount of data increases. This is usually done by a computer professional in IS. Some maintenance of the client component of the EDMS is usually required. Access databases (.mdb files) grow over time because temporary objects such as queries are not automatically removed. Consequently, maintenance (compacting) of the .mdb files must be performed on an occasional basis, which can vary from weekly to monthly depending on the level of use. Also, occasionally .mdb files become corrupted and need to be repaired. This can be automated as well. Finally, as improvements are made to the EDMS, new versions of the program file containing the front end will need to be distributed to users, and a simple process should be developed to perform this distribution to a large number of users with a minimum of effort.
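As an illustration of the selection screen described above, a query by form screen might assemble its WHERE clause along these lines. This is only a sketch; all form, control, and field names are hypothetical.

' Build a WHERE clause from whatever the user filled in on the
' selection form, skipping any control that was left blank.
Public Function BuildWhereClause() As String
    Dim strWhere As String
    With Forms!frmSelect
        If Not IsNull(!cboSite) Then
            strWhere = strWhere & " AND SiteNumber = " & !cboSite
        End If
        If Not IsNull(!txtStartDate) Then
            strWhere = strWhere & " AND SampleDate >= #" & !txtStartDate & "#"
        End If
    End With
    If Len(strWhere) > 0 Then
        BuildWhereClause = "WHERE " & Mid$(strWhere, 6)   ' drop the leading " AND "
    End If
End Function

The resulting clause can be appended to the base query for each data element and saved for further processing, as described above.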
DOCUMENTATION

The three main factors that lead to a satisfactory user experience with software are an intuitive user interface, accessible and clear documentation, and effective training. The data management system should have two primary documentation types, hard copy and online. The online documentation consists of two parts, the user interface and the help file. The hard copy documentation and help file will be described in the following sections.
Hard copy

The hard copy documentation should consist of two parts, a tutorial section and a reference section. The tutorial section should take the user through the process of using the software and
working with data. The reference section should cover the various aspects of working with the system, and have a goal of anticipating and answering any questions the user might have in working with the software. Both sections should have appropriate illustrations to help the user understand what is being described. In addition to covering the day-to-day operation of the system, the documentation should also cover maintenance procedures for both the client and server sides of the system.
Help file

The data management system should be installed with a complete online help file using the standard Windows Help System. It can be based on the hard copy documentation, with any modifications necessary to adapt that documentation to the online format. Help screens should be provided for each form in the system (form-level context-sensitivity).
CHAPTER 8

IMPLEMENTING THE DATABASE SYSTEM
This chapter addresses the process of getting a database management system up and running. Topics covered include designing the system, installing the system, and other issues related to the transition from design and construction to ongoing use.
DESIGNING THE SYSTEM

The design of the database system is important if the goals of the system are to be satisfied. This section covers a number of issues related to designing the system.
General design goals

Before designing a database system, the goals and needs for the system should be clearly identified. To design a usable system, both general design goals such as those described in the literature and goals specific to the organization must be taken into account. This section presents the design goals for an EDMS from a general perspective. The objectives of database design are generally the same for all organizations and all database systems. Stating these goals can help with implementation decisions, because the choice that furthers the greatest number of goals is usually the right choice. Some of the material in this section is discussed in Jennings (1995, pp. 828-829) and Yourdon (1996). The most important aspect of designing a database management system is determining the business problem to be solved. Once this has been done, many of the design decisions are much easier to make. In designing and implementing this system, the following data management goals should be considered.

Organization goals that should be addressed include:
• Fulfilling the needs of the organization for information in a timely, consistent, and economical manner.
• Working within the hardware and software standards of the organization, and, as much as is practical, using existing resources to minimize cost.
• Accommodating expansion of the database to adapt to changing organizational needs.
Planning for flexibility must be a part of the system design. Business rules have longevity, but processes change. The technology that models the business process must be able to change in order to have any longevity.

Data management goals that should be considered include:
• Providing rapid access to data required by each user category. This includes providing a user interface that is easy to learn and easy to use, yet flexible and powerful enough to allow any kind of data retrieval.
• Eliminating or minimizing the duplication of data across the organization.
• Easing the creation or integration of data entry, data review, editing, display, and reporting applications that efficiently serve the needs of the users of the database.
• Preserving data for future use. Any significant data available in digital form should be stored in the database, and backup and recovery procedures should be in place to preserve the content.
Database security goals include:
• Maintaining the integrity of the database so that it contains only validated, auditable information.
• Preventing access to the database by unauthorized people.
• Permitting access only to those elements of the database information that individual users or categories of users need in the course of their work.
• Allowing only authorized people to add or edit information in the database.
• Tracking modifications to the data.
Quality goals for the system could be:
• Designation of responsibility for decisions about data included in the database.
• Designation of responsibilities for decisions about data gathering.
• Use of approved data collection procedures.
Database project management issues include:
• Responsibilities for data management should be clearly defined.
• The system should provide the container for data. However, project managers should decide how to use it based on the business model for their project, since the level of detail that is appropriate may vary from project to project.
• Potential uses for the data should be identified so that the quality of the data gathered will match the intended use for that data.
• Objectives of the data gathering effort should be clearly and unambiguously established.
• Where several organizations are separately collecting data for multimedia assessments, it is essential that efforts be made to identify and discuss the needs of the principal users of the data. This includes establishment of minimum data accuracy.
• Once the intended uses for the data are defined, a quality control program should be designed and implemented.
• Data needs should be periodically reviewed in the light of increased understanding of environmental processes, or changes in toxicological, climatic, or other environmental conditions.
• To get the full use of a database system, correct data collection procedures should be used to achieve the highest possible quality for the data being entered into the database.
• Accepted measurement and data handling methodologies should be used whenever possible.
• Well-tested measurement practices and standard reference materials should be developed if not already in use. This will allow adequate quality control practices to be implemented in measurement programs.
• Protocols for data measurement and management should be periodically reviewed.
• The quality of the collected data should be carefully reviewed and documented before the data is made available for general use.
• When data-gathering procedures are changed, care should be taken to assure that the old data can be correlated with the new set with no loss in continuity.
• Information on existing data programs and data and measurement standards should be disseminated widely to the data user community.
Determine and satisfy needs

It is possible to develop a standard procedure for completing the project on time and on budget. This procedure has several steps, which will be described here. Many of these steps are discussed in more detail in later sections, but are included here to make sure that they are considered in the planning process.

Assess the needs of the organization – This is probably the most important step in the process, and it should continue on into implementation. One good approach is to select a cross section of potential users and other interested parties within the organization and interview them from a prepared questionnaire. (See Appendix A.) This questionnaire should be prepared for each project based on a template from previous similar projects. The questions on the form progress from general to specific in order to elicit each user’s needs and interests for the system. An important factor in selecting technology is the attitude of your organization and the individuals in it toward the adoption of new technology. Moore (1991), in his popular book Crossing the Chasm, describes the Technology Adoption Life Cycle, and groups people by their response to technology. These groups are Innovators, Early Adopters, Early Majority, Late Majority, and Laggards. The chasm lies between the Early Adopters and the Early Majority, and it is difficult to move technology across it. Gilbert (1999) has related these groups to software implementation in environmental organizations. This concept is important in selling and then managing a technology project in your organization. You should analyze the decision makers and prospective users of the technology, and choose technology at the appropriate level of innovation that will be comfortable (or at least not too uncomfortable) for them.

Create a plan – Based on the results of the questionnaire, the implementation team should work with a small group of individuals within the target user group to develop a data management plan. This plan serves several major purposes. It provides a road map for the long-term direction of the project. It serves as a design document for the initial release of the system. And it helps facilitate discussion of the various issues to be addressed in developing and implementing the system. This often-overlooked step is critical to project success. Fail to plan, plan to fail.

Develop the data model – This is also a critical step in the process. In this step, which can be done in parallel with or subsequent to creation of the plan, the implementation team and the users work to make sure that the data content of the system meets the needs of the users.

Perform software modifications – If the needs assessment or the data model design identifies changes that need to be made in the software, they are performed in this step. If the database software is based on an open system using standard tools, these changes are usually quite straightforward. Of course, the level of effort required by this step is proportional to the number and scope of the changes to be made.

A key action that must be carried out throughout all of the above steps is to communicate with the future users. Too often, the team writing the software creates what they think the users want, only to find out weeks or months later that the users’ needs are totally different. Frequent communication between the developers and users can help prevent frustration and project failure.
Test, then test again – Once the data model and software functionality have been implemented, the system must be fully tested. The installation team should test the software during and after the modifications and remedy any difficulties encountered at that time. Once the team
members are satisfied that they have found all of the problems that they can, the software must be tested in the different and often more varied environment of the client site. It is better to start with a small group of knowledgeable users, and then expand the user base as the number of problems encountered per unit of use time decreases. When the client and the implementation team agree that the problem rate is at an acceptable level, the software can be released for general use.

Document – Good documentation is important for a successful user experience. Some users prefer and will read a written manual. Others prefer online help. Both must be provided.

Train – Most users learn best from a combination of formally presented material and hands-on use. It is useful to have materials to teach classes for a variety of different types of users, and these materials can be modified prior to presentation to reflect the anticipated use pattern of the target users. A facility suitable for hands-on training must be provided in a location that is convenient for the students.

Support – When users have a problem, which they will despite the best development and testing efforts, there must be a mechanism in place for them to get help.

The actual execution of these steps varies from client to client, but by following this process, the project has the greatest chance for success.
Prepare a plan

A data management plan is intended to provide guidance during the detailed design and implementation of an EDMS. Design of a computerized data management system should begin with a survey of the data management needs of the organization. The plan should integrate knowledge about the group’s information management needs gathered during the needs assessment phase, together with the necessary hardware and software components, to satisfy as many of the data management needs as possible. It is intended that this plan be revised on a regular basis, perhaps semi-annually, as long as the data management system is in use. The expected life of a typical data management system is five to ten years. After that period, it can be anticipated that technology will have changed sufficiently that it will be appropriate to replace the system with a new tool. It is reasonable to expect that the data from the old system can be transported into the new system. Even during the life of the software it will be necessary to make changes to the system to accommodate changes in computer technology and changes in data needs. This is particularly true now, as the computer industry undergoes a change from traditional client-server systems to browser-based Internet and intranet systems. Since the data management plan often involves an incremental development process where functionality is added over time, you should expect that lessons learned from early deployment will be incorporated into later development. Finally, the level of detail provided in the plan may vary in the discussion of the different data types to be stored. Additional detail will be added as software development progresses. You should allow the plan to evolve to address these and other, perhaps unanticipated, system changes.

A typical data management plan might contain the following sections:
Section 1 – System Requirements
Section 2 – System Design
Section 3 – Implementation Plan
Section 4 – Resource Requirements
Appendix A – Database Fundamentals
Appendix B – User Interface Guidelines
Appendix C – Preliminary Data Model
Appendix D – Preliminary System Diagrams
Appendix E – Data Transfer Standard
Appendix F – Coded Values
Appendix G – Data Management Survey Results
Appendix H – Other Issues and Enhancements
Appendix I – References

The plan should contain a discussion of all of the important issues. The level of detail may vary depending on the particular item and its urgency. For example, at the planning stage the data content of the system may be outlined in broad terms. After identification of the data components that are the most significant to potential users, the data content can be filled in with more detail. The next step toward implementation of the database management system is the detailed design.
Design the system in detail

The next step after finalizing the data management plan will usually be a detailed system design. This detailed design will identify and respond in detail to the various issues related to each data type being addressed. The plan should be viewed as a national road map. The national map provides an overview of the whole country, showing the relationships between different areas and the high-level connections between them, such as the Interstate highways. Just as a national map may have more detailed sub-maps of some metropolitan areas, with adequate detail to take you wherever you want to go in those areas, the detailed system design covers this greater detail. In many cases, the detailed design is not prepared entirely in one step; parts of the system are designed in detail prior to implementation, in an evolving process. Figure 48 shows an example of this iterative process. This example shows a preliminary data model that was designed and then submitted to prospective users for comment. The result of the meeting to discuss the data model was the notes on the original data model shown in the figure. After several sessions of this type, the final data model was completed, which is shown at a reduced scale in Figure 49. This figure illustrates the complexity that resulted from the feedback process. It’s important to catch as many errors as possible at this stage of the design process. Conventional wisdom in the software development business is that the cost of fixing an error found at the end of the implementation process is a factor of 80 to 100 greater than fixing the same error early in the design process. It’s worthwhile to beat the design process to death before you proceed with development, despite the natural desire to move on. One or two extra design sessions after all involved think they are happy with the design will usually pay for themselves many times over.
BUY OR BUILD?

After the needs have been determined and a system design developed, you will need to decide whether to buy existing software or write your own (or have it written for you). A number of factors enter into this decision. The biggest, of course, is whether there is software that you can buy that does what you want. The closer existing software functionality matches your needs, the easier the decision is. Usually, it is more cost-effective to buy rather than build, for several reasons, mostly related to the number of features relative to the cost of acquiring the system. There is a cultural component to the decision, with some organizations preferring to write software, while others, perhaps with less interest or confidence in their development capabilities, opt to buy when possible. There may be times when you have to bite the bullet and write software, when there is no viable alternative and the benefits justify the cost.
Figure 48 - Intermediate step in the detailed design process
Confidence is the feeling you have before you understand the situation. Rich (1996)
Figure 49 - Reduced data model illustrating the complexity resulting from the detailed design process in the previous figure
There is a definite trend in the environmental consulting business away from custom database software development. This is due to two primary reasons. The first is that commercial software has improved to the point that it often satisfies all or most project needs out of the box. The second is greater scrutiny of project budgets, with clients less willing to pay for software development unless it is absolutely necessary. Abbott (2001) has stated that 31% of software development projects are canceled before they are completed, and 53% ultimately cost 189% or more of their original budgets. Software projects completed by large companies typically retain about 42% of the features originally proposed. Buying off-the-shelf software decreases the chance of one of these types of project failure. It would be helpful to be able to estimate the cost of implementing a database system. Vizard (2001) states that $7 of every $10 spent on software goes into installing and integrating the software once it is purchased. Turning this around, for every dollar spent on purchasing or developing the software, roughly two more are spent getting it up and running. So for a rough estimate of the total project cost, take the cost of the software and triple it.
IMPLEMENTING THE SYSTEM

Once the system is selected or designed, there are a number of tasks that must be completed to get it up and running. These basic tasks are the same whether you are buying or building the software.
Acquire and install the hardware and software

The process of selecting, purchasing, and installing a data management system, or of writing one, can be quite involved. You should be sure that the software selected or developed fits the needs of the individuals and the organization, and that it will run on existing hardware. It may be necessary to negotiate license and support agreements with the vendor. Then it will be necessary to install the software on users’ computers and perhaps on one or more servers.
Special considerations for developing software

If you are building the software instead of buying it, it is important to follow good software development practices. Writing quality software is a difficult, error-prone process. For an overview of software quality assurance, see Wallace (2000). Here are a few tips that may help:

Start with a requirements plan – Developing good software starts with carefully identifying the requirements for the finished system, as described above (Abbott, 2001). According to Abbott, 40 to 60% of software defects and failures result from bad requirements. Getting everyone involved in the project to focus on developing requirements, and getting developers to follow them, can be very difficult, but the result is definitely worthwhile. Abbott quotes statistics that changes that occur in the development stage cost five times as much as those that occur during requirements development, and once the product is in production, the cost impact is a factor of 100.

Use the best tool for the job – Choose the development environment so that it fits project needs. If code security is important, a compiled language such as Visual Basic may be a good choice. If flexibility and ease of change are important, a database language like Access is best.

Manage change during development – Even with the best plan, changes will occur during development, and managing these changes is critical. On all but the smallest projects, a formal change order process should be used, where all changes are documented in writing, and all stakeholders sign off on every change before the developer changes code. A good guideline is ANVO (accept no verbal orders).

Use prototypes and incremental versions – Developers should provide working examples to users early and throughout the process, and solicit feedback on a regular basis, to identify problems as early as possible in the development process. Then the change order process can be used to implement modifications.

Manage the source code – Use source code management software or a formal process for source code management to minimize the chance of conflicting development activities and lost work, especially on projects with multiple developers.

Implement a quality program – There are many different types of quality programs for software development. ISO 9000 for quality management and ISO 14000 for environmental management can be applied to EDMS software development and use. TQM, QFD, SQFD, and CMM are other examples of quality programs that can be used for any software development. Which program you choose is less important than that you choose one and stick to it.

Write modular code and reuse when possible – Writing software in small pieces, testing these pieces thoroughly, and then assembling the pieces with further testing is a good way to build reliable code. Where possible, centralize calculations in functions rather than spreading them across forms, reports, and code. Each module should have one entry and one exit point.

Use syntax checking – Take advantage of the syntax checking of your development environment. Most languages provide a way for the program to make sure your syntax is valid before you move on, either by displaying a dialog box, or underlining the text, or both.

Format your code – Use indentation, blank lines, and any other formatting tools you can think of to make your code easier to read.

Explicitly declare variables – Always require that variables be explicitly declared in the declarations section of the code.
In Visual Basic and VBA you can require this with the statement Option Explicit at the top of each module, and you can have this entered automatically by turning on Require Variable Declaration in the program options.

Watch variable scope – The scope of variables (where in the code they are active) can be tricky. Where possible, declare the scope explicitly, and avoid reusing variable names in subroutines in case you get it wrong. Keeping the scope as local as possible is usually a good idea. Be very careful about modifying global variables within a function.

Design tables carefully – Every primary table should have a unique, system-assigned ID and a last-updated date to track changes.
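As a minimal VBA sketch of the last few tips above (the function, its name, and the conversion it performs are illustrative only, not part of any particular EDMS):

    Option Explicit   ' force explicit declaration of all variables in this module

    ' Convert a concentration from ug/L to mg/L.
    ' The calculation lives in one function rather than being repeated
    ' in forms and reports, and all variables have explicit types
    ' and local scope.
    Public Function UgPerLToMgPerL(dblResult As Double) As Double
        Dim dblConverted As Double   ' local scope: visible only inside this function
        dblConverted = dblResult / 1000#
        UgPerLToMgPerL = dblConverted
    End Function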
Be careful about error handling – This may be the one item that separates amateur code from professional code. A good programmer will anticipate error conditions that could arise and provide error trapping that helps the user understand what the problem is. For example, if the detection limit is needed for a calculation but is not present, an EDMS error message like “Please enter the detection limit” is much more helpful than a system message like “Invalid use of Null.” (A sketch of this approach follows this list.)

Use consistent names – The more consistent you are in naming variables and other objects, the easier it will be to develop and support the code. There are several naming conventions to choose from, and which one you use is not as important as using it consistently. An example of one such system can be found in Microsoft (2001). For example, it is a good idea to name variables based on their data type and the data they contain, such as txtStationName for the text field that contains station names. Avoid field and variable names that are keywords in the development language or in any other place where they may be used.

Document your code – Document code internally and provide programmer documentation for the finished product. Each procedure should start with a comment that describes what it does, as well as its input and output variables. Inline comments in the code should be plentiful and clear. Don’t assume that it will be obvious what you are doing (or, more importantly, why you are doing it this way instead of some other way), especially when you try to get clever.

Don’t get out of your depth – Many advanced computer users have an exaggerated view of their programming skills. If you are building a system that has a lot riding on it, be sure the person doing the development is up to the task.

Don’t forget to communicate – On a regular basis (such as weekly or monthly, depending on the length of the project, or after each new section of the system is completed), talk to the users. Ask them if the direction you are taking is what they need. If yes, proceed. If not, stop work and talk about directions, expectations, solutions, and so on before you write one more line of code. This could make the difference between project success and failure.
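Here is a hedged sketch of this kind of error trapping in VBA; the function name, message wording, and the decision to return False on a missing limit are all illustrative. Note the var prefix marking the Variant parameters, following the naming convention just described:

    ' Screen a result against its detection limit.
    ' varDetLimit comes from the database and may be Null.
    Public Function BelowDetectionLimit(varResult As Variant, _
                                        varDetLimit As Variant) As Boolean
        On Error GoTo ErrHandler
        If IsNull(varDetLimit) Then
            ' EDMS-specific message instead of "Invalid use of Null"
            MsgBox "Please enter the detection limit for this result.", _
                   vbExclamation, "EDMS"
            Exit Function
        End If
        BelowDetectionLimit = (varResult < varDetLimit)
        Exit Function
    ErrHandler:
        ' last-resort trap for conditions not anticipated above
        MsgBox "Unexpected error " & Err.Number & ": " & Err.Description, _
               vbCritical, "EDMS"
    End Function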
Test

The system must be thoroughly tested prior to release to users. This is usually done in two parts, alpha testing and beta testing, and both are quite valuable. Each type of testing should have a test plan that is carefully documented and methodically followed.

Alpha testing – Alpha testing is testing of the software by the software developer after it has been written, but before it is delivered to users. Alpha testing is usually performed incrementally during development, and then comprehensively immediately before the software is released for beta testing. For commercial software purchased off the shelf, this should already have been done before the software is released to the general public. For custom software being made to order, this stage is very important and, unfortunately, in the heat of project deadlines, may not get the attention it deserves. The test plan for alpha testing should exercise all features of the software, and when a change is made, all previous tests should be re-run (regression testing). Test items should include logic tests, coverage (the software works for all cases), boundary tests (if appropriate), satisfaction of requirements, inputs vs. outputs, and user interface performance. It is also important to test the program in all of the environments where the software will be deployed. If some users will be running Access 97 under Windows 95, and others Access 2000 under Windows 2000, then a test environment should be set up for each one. A program like Norton Ghost, which lets you easily restore hard drive images, can be a great time saver in setting up test systems. A test machine with removable drives for different operating system versions can be helpful as well, and is not expensive to set up. Don’t assume that a feature that does what you want in one environment will work the same (or at all) in a different one.

Beta testing – After the functionality is implemented and tested by the software author, the software should be provided to willing users and tested on selected computers. There is nothing
like real users and real data to expose the flaws in the software. The feedback from the beta testers should then be used to improve the software, and the process repeated as necessary until the list of problem reports each time is very short, ideally zero. The test plan for beta testing is usually less formal than that for alpha testing. Beta testers should use the software in a way similar to how they would use the final product. A caveat here is that since the software is not yet fully certified it is likely to contain bugs. In some cases these bugs can result in lost or corrupted data. Beta testers should be very careful about how they use the results obtained by the software. It is best to consider beta testing as a separate process from their ordinary workflow, especially early in the beta test cycle.
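Part of the alpha test plan can be automated even in a VBA environment. A minimal sketch, reusing the two functions sketched earlier in this chapter (the tests and their expected values are illustrative only); rerunning a procedure like this after every change is a simple form of the regression testing described above:

    ' Run the test suite and report how many tests failed.
    Public Sub RunRegressionTests()
        Dim lngFailures As Long
        If Not TestUnitConversion() Then lngFailures = lngFailures + 1
        If Not TestDetectionLimit() Then lngFailures = lngFailures + 1
        Debug.Print lngFailures & " test(s) failed."
    End Sub

    Private Function TestUnitConversion() As Boolean
        ' 1,000 ug/L should convert to exactly 1 mg/L
        TestUnitConversion = (UgPerLToMgPerL(1000#) = 1#)
    End Function

    Private Function TestDetectionLimit() As Boolean
        ' a result of 5 with a detection limit of 10 is below detection
        TestDetectionLimit = (BelowDetectionLimit(5#, 10#) = True)
    End Function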
Document

An important part of a successful user experience is for users to be able to figure out how to get the software to do the things they need to do. Making the software discoverable, so that visual cues on the screen help users figure out what to do, can help with this. Unfortunately this is not always possible, so users should be provided with documentation to help them over the rough spots. Often a combination of printed and online documentation provides the best source of information for the user in need.

Response to documentation varies tremendously from user to user. Some people, when faced with a new software product, will take the book home and read it before they start using the program. Other users (perhaps most) will refuse to read the manual and will call technical support with questions that are clearly answered in both the manual and the help file. This is a personality issue, and both types of help must be provided.
Train

Environmental professionals are usually already very busy with their existing workloads. The prospect of finding time to learn new software is daunting to many of them. The implementation phase must provide adequate training on use of the system, while not requiring a great time commitment from those learning it. These somewhat contradictory goals must be accommodated in order for the system to be accepted by users. New training tools on the horizon, such as Web-based, self-paced training, show promise in helping people learn with a minimum impact on their time and the organization’s budget.

Training should be provided for three classes of users. System administrators manage the system itself. Data administrators are responsible for the data in the system. Users access the data to assist them in their work. Usually the training of administrators involves the greatest time commitment, since they require the broadest knowledge of the system, but since the system is likely to be an important part of their job, they are usually willing to take the time to learn the software.

System administrators – Training for system administrators usually includes an overview of the database tools such as Access, SQL Server, or Oracle; the implementation of the data management system using these tools; and operation of the system. It should cover installation of the client software, and maintenance of the server system including volume management, user administration, and backup and restoration of data.

Data administrators – Training for data administrators should cover primarily data import and editing, including management of the quality tracking system. Particular emphasis should be placed on building a thorough understanding of the data and how it is reported by the laboratories, since this is where many of the problems occur. The data administrator training should also cover making enhancements to the system, such as customized reports for specific user needs.

Users – Users should be trained on operation of the system. This should include some of the theory behind the design of the system, but should mostly focus on how they can accomplish the
tasks to make their jobs easier, and how to get the most out of the system, especially in the area of data selection and retrieval. It should also include instructions for maintenance of files that may be placed on the user’s system by the software.
MANAGING THE SYSTEM

In implementing a data management system, there are many other issues to consider, both within and outside of the organization implementing the system. A few of these issues are addressed here.
Licensing the software

Software licensing describes the relationship between the owner of the software and the users. Generally you don’t buy software; you pay for the right to use it, and the agreement that covers this is called the software license. This license agreement may be a signed document, or it may be a shrink-wrap agreement, where using the software implies agreement to the terms. Software licenses can take many forms, a few of which are described briefly here. Organizations implementing any software should pay attention to exactly what rights they are getting for the money they are spending.

Computer license – When software is licensed by computer, the software may be installed and used on one computer by whichever user is sitting at that computer. This was a useful approach when many organizations had fewer computers than users. Implicit in this licensing is that users cannot take the software home, or to another computer.

User license – Software can be licensed by user. In this situation, the software is licensed to a particular person, who can take the software to whichever computer he or she is using at any particular time, including a desktop, laptop, and, in some license agreements, home computer. This type and the previous one are sometimes called licensing the software by seat.

Concurrent user license – If the software is licensed by concurrent user, then licenses are purchased for the maximum number of users who will be using the software at any one time. If an organization had users working three non-overlapping shifts, it could buy one-third as many licenses as it has users, since only one-third would be using the software at once. Software licensed this way should have a license manager, a program that tracks and reports usage patterns. If not, the organization or software vendor should perform occasional audits of use to ensure that the right number of licenses is in force.

Server license – When a large component of the software runs on a server rather than on client computers, it can be licensed to each server on which it runs, with no attention paid to the number of users.

Site license – This is similar to a server license, except that the software can be run on any computers at one specific facility, or site.

Company-wide license – This extends the concept of a site license to the whole company.

Periodic lease – In this approach, there is usually no up-front cost, but there is a periodic payment, such as monthly or annually. Usually this is used for very expensive software, especially if it requires a lot of maintenance and support, because these costs can all be rolled into the periodic fee.

Pay per use – A variation on the periodic lease is pay per use, where you pay for the amount of time the software is used, to perform a specific calculation, look up a piece of reference information, and so on. As with the periodic lease, this is most popular for very expensive software.

Application server – With the advent of the Internet and the World Wide Web, a new option for software licensing has become available. In this model, you pay little or no up-front fee, and pay as you go, as in the previous example. The difference is that most or all of the program runs on a server at the service provider, and your computer runs only a Web browser, perhaps
with some add-ins. The company providing this service is called an application service provider, or ASP.

There are many variations and combinations of these license types to fit the needs of the customer and the software vendor. Purchasers of software can sometimes negotiate better license fees by suggesting a licensing model that better fits their needs.
Interfaces with other organizations

A variety of groups outside of the environmental organization can be expected to interact with the environmental database in different ways. During the detailed design phase of software implementation, there should be a focused effort to identify all such groups and to determine their requirements for interaction with the system. The following sections describe some of the related organizations that might have some involvement in the EDMS.

Information Technology – In many organizations, especially larger ones, an information technology group (IT, sometimes called IS for Information Services) is charged with servicing the computer needs of the company. This group is usually responsible for a variety of areas ranging from desktop computers and networking systems to company mainframes and accounting software. In discussions with IT, it is often clear that the group has resources that could be useful in building an EDMS. The network that it manages often connects the workstations on which the user interface for the software will run. It may have computers running data management software such as Oracle and Microsoft SQL Server, which could provide the back-end for data storage. Finally, it has expertise in many areas of networking and data management that could be useful in implementing and supporting the data management system. It is also often the case that IT personnel are very busy with their current activities and do not, in general, have much time available to take on a new project. On the other hand, their help will be needed in a variety of areas, the largest being support in implementing the back-end database. The people responsible for implementing the EDMS should arrange for a liaison from IT to be assigned to the project, with his or her time funded by the environmental organization if necessary, to provide guidance as the project moves ahead. Help will probably be needed from others during design, implementation, and ongoing operations, and a mechanism should be put in place to provide access to these people. In implementing data management systems in organizations where IT should be involved, one thing stands out as most important in smoothing the relationship: early and ongoing involvement of IT in the selection, design, and implementation process. When this happens, the implementation is more likely to go smoothly. The reasons for lack of communication are often complicated, involving politics, cultural differences, and other non-technical factors, but in general, effort spent encouraging cooperation in this area is well rewarded.

Operating divisions – Different organizations have different reporting structures and responsibilities for environmental and other data management. Often the operating divisions are the source of the data, or at least have some interest in the gathering and interpretation of the data. Once again, the most important key is early and frequent communication.

Remote facilities – Remote facilities can have data management needs, but can also present a challenge because they may not be connected directly to the company network, or the connection may not be fast enough to support direct database connections. These issues must be part of the planning and design process if these people are to have satisfactory access to the data.

Laboratories – Laboratories provide a unique set of problems and opportunities in implementing an EDMS.
In many cases, the laboratories are the primary source of the data. If the data can be made to flow efficiently from the laboratory into the database, a great step has been taken toward an effective system. Unfortunately, many laboratories have limited resources for responding to requests for digital data, especially if the format of the digital file is different from what they are used to providing.
There are several things that can be done to improve the cooperation obtained from the labs:

• Communicate clearly with the laboratory.
• Provide clear and specific instructions on how to deliver data.
• Be consistent in your data requirements and delivery formats.
• Be an important customer. Many companies have cut down on the number of laboratories that they use so that the amount of work going to the remaining labs is of sufficient volume to justify compliance by the lab with data transfer standards.
• Choose a lab or labs that are able to comply with your needs in a cost-effective way.
• Be constantly on guard for data format and content changes. Just because they got it right last quarter does not ensure that they will get it right this time.
Appendix C contains an example of a Data Transfer Standard document that can be used to facilitate communication with the lab about data structure and content.

A great time-saving technique is to build a feedback loop between the data administrator and the lab, and the EDMS software can be used to help with this. The data administrator and the lab should be provided with the same version of the software. The data administrator maintains the database with updated information on wells, parameter name spellings, units, and so on, and provides a copy of the database to the lab. Before issuing a data deliverable, the lab imports the deliverable into the database and remedies any problems indicated by the import process. Once the data imports cleanly, it can be sent to the data administrator for import into the main database. This can be a great time-saver for the busy data administrator, because most data deliverables will then import cleanly on the first try. This capability is described in more detail in Chapter 13, and a sketch of one such check appears at the end of this section.

Consultants – Consultants provide a special challenge in implementing the database system. In some cases they can act like, or actually be, laboratories, generating data. In other cases they may be involved in quality assurance aspects of the data management projects. It is often helpful for the consultants working on a project to use the same software as the rest of the project team to facilitate transfer of data. If that can’t be done, at least make sure that there is a format common to the two programs for transferring the data.

Regulators – In many (perhaps most) cases, regulators are the true “customers” of the data management project. Usually they need to receive the data after it has been gathered, entered, and undergone quality assurance procedures. The format in which the data is delivered is usually determined by the requirements of the regulators, and can vary from a specific format tailored to their needs to the native format in which it is stored in the EDMS. In the latter case they may request a copy of the EDMS software so that they can easily work with the data. Companies managing the monitoring or cleanup project may view this with mixed feelings. It is usually to their benefit to keep the regulators happy and to provide them with all of the data that they request. They are often concerned, however, that by providing the regulators with the data and powerful software for exploring it, the regulators may uncover issues before the company is ready to respond. On issues like this, the lawyers will usually need to be involved in the decision process.

Auditors – Quality assurance auditors can work with the data directly, with subsets of the data, or with reports generated from the system. Sometimes their interest is more in the process used in working with the data than in the content of the data itself. In other cases they will want to dive into the details to make sure that the numbers and other information made it into the system correctly. The details of how this is done are usually spelled out in the QAPP (quality assurance program plan) and should be followed scrupulously.
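As an illustration of the kind of check the lab (or the data administrator) can run before a deliverable is issued, here is a hedged sketch in Access VBA with DAO; the staging and reference tables (ImportStaging, RefParameters) and their field names are hypothetical:

    ' List parameter names in a staged deliverable that do not
    ' match the reference list, so they can be fixed before delivery.
    Public Sub CheckParameterNames()
        Dim db As DAO.Database
        Dim rs As DAO.Recordset
        Set db = CurrentDb()
        Set rs = db.OpenRecordset( _
            "SELECT DISTINCT d.ParameterName " & _
            "FROM ImportStaging AS d LEFT JOIN RefParameters AS r " & _
            "ON d.ParameterName = r.ParameterName " & _
            "WHERE r.ParameterName Is Null")
        Do While Not rs.EOF
            Debug.Print "Unknown parameter: " & rs!ParameterName
            rs.MoveNext
        Loop
        rs.Close
    End Sub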
Managing the implementation project

Implementing a database system, whether you are buying or building, is a complex project. Good project management techniques are important if you are going to complete the project successfully. It is important to develop schedules and budgets early in the implementation project,
and track performance relative to plan at regular intervals during the project. Particular attention should be paid to programming, if the system is being built, and to data cleanup and loading, as these are areas where time and cost overruns are common.

For managing the programming component of the project, it is important to maintain an ongoing balance between features and time/cost. If work is ahead of schedule, you might consider adding more features. If, as is more often the case, the schedule is in trouble, you might look for features to eliminate or at least postpone. To control the data cleanup time, pay close attention to where the time overruns are occurring. Repetitive tasks such as fixing systematic data problems can often be automated, either external to the database using a spreadsheet or custom programs, or using the EDMS itself to help with the cleanup (see the sketch below). If that is not enough, and the project is still behind, it may be necessary to prioritize the order of data import, importing the cleanest or most urgently needed data first and delaying import of other data to stay on schedule.
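To make the idea of automating systematic cleanup concrete, here is a minimal sketch in Access VBA; the table, field, and the particular unit spellings being corrected are hypothetical:

    ' Fix one systematic problem in legacy data:
    ' nonstandard unit spellings are collapsed to 'mg/L'.
    Public Sub NormalizeUnits()
        Dim db As DAO.Database
        Set db = CurrentDb()
        db.Execute "UPDATE Results SET Units = 'mg/L' " & _
                   "WHERE Units IN ('mgpl', 'mg per l', 'milligrams per liter')", _
                   dbFailOnError
        Debug.Print db.RecordsAffected & " rows corrected."
    End Sub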
Preparing for change

It is important to remember that any organization can be expected to undergo reorganizations, which change the administrative structure and the required activities of the people using the system. Likewise, projects undergo changes in management philosophy, regulatory goals, and so on. Such changes should be expected to occur during the life of the system, and the design of the system must allow it to be easily modified to accommodate them.

The key design component in preparing for change is flexibility in how the database is implemented. It should be relatively easy to change the organization of the data: moving sites to different databases, moving stations between sites, changing how values and flags are displayed, and so on. Effort expended early in database design and implementation may be repaid manyfold later during the stressful period resulting from a major change, especially in the face of deadlines. The Boy Scouts were right: “Be Prepared.”

All of the issues above should be considered and dealt with prior to and during the implementation project. Many will remain once the system is up and running.
CHAPTER 9 ONGOING DATA MANAGEMENT ACTIVITIES
Once the data management system has been implemented, the work is just starting. The cost of ongoing data management activities, in time and/or financial outlay, will usually exceed the implementation cost of the system, at least if the system is used for any significant period of time. These activities should be taken into account in calculating the total cost of ownership over the lifetime of the system. When this calculation is made, it may turn out that a feature in the software that appears expensive up-front may actually cost less over time relative to the labor required by not having the feature. Also, these ongoing activities must be both planned for and then performed as part of the process if the system is expected to be a success. These activities include managing the workflow, managing the data, and administering the system. Many of these activities are described in more detail later in this book.
MANAGING THE WORKFLOW

For large projects, managing the workflow efficiently can be critical to project success. Data flow diagrams and workflow automation can help with this.
Data flow diagrams

Usually the process of working with the data involves more than one person, often within several organizations. In many cases those individuals are working on several projects, and it can be easy to lose track of who is supposed to do what. A useful tool to keep track of this is to create and maintain a data flow diagram for each project (if this wasn’t done during the design phase). These flow diagrams can be distributed in hard copy, or made available via an intranet page so people can refer to them when necessary. This can be particularly helpful when a contact person becomes unavailable due to illness or another reason. If the flowchart has backup names at each position, it is easier to keep the work flowing. An example of a data flow diagram is shown in Figure 50.
Workflow automation Tools are now becoming available that can have the software help with various aspects of moving the data through the data management process. Workflow automation software takes responsibility for knowing the flow of the data through the process, and makes it happen.
[Figure 50 - Data flow diagram for a project. The diagram traces the flow from the sampling plan through sample collection, field measurements, laboratory analysis, and test import to the main database, and on to reporting and interpretation, with a named responsible person (data administrator, field technician, lab, project manager, hydrologist reviewer) at each step.]
In a workflow automated setting, after the laboratory finishes its data deliverable the workflow automation system sends the data to the data administrator, who checks and imports the data. The project manager is automatically notified and the appropriate data review process initiated. After this the data management software generates the necessary reports, which, after appropriate review, are sent to the regulators. At the present time, bits and pieces are available to put a system together that acts this way, but the level of integration and automation can be expected to improve in the near future.
MANAGING THE DATA

There are a number of tasks that must be performed on an ongoing basis as the data management system is being used. These activities cover getting the data in, maintaining it, and getting it out. Since it costs money to gather, manage, and interpret data, project managers and data managers should spend that money wisely on activities that will provide the greatest return on that investment. Sara (1994, p. 1-11) has presented the “Six Laws of Environmental Data”:

1. The most important data is that which is used in making decisions; therefore, only collect data that is part of the decision-making process.
2. The cost of collecting, interpreting, and reporting data is about equal to the cost of analytical services.
3. About 90% of data does not pass from the first generation level (of data use and interpretation) to the next (meaning that it is never really used).
4. There is significant operational and interpretive skill required in moving up the generation ladder.
5. Data interpretation is no better than the quality control used to generate the original data.
6. Significant environmental data should be apparent.
Convert historical data

The historical data for each project must be loaded before the system can be used on that project for trend and similar analyses. After that, current data must be loaded for that project on an ongoing basis. For both of these activities, data review must be performed on the data. An important issue in the use of the EDMS is developing a process for determining which data is to be loaded. This includes the choice of which projects will have their data loaded and how much historical data will be loaded for each of those projects. It also includes decisions about the timing of data loading for each project. Most organizations have limited resources for performing these activities, and the needs of the various projects must be balanced against resource availability in order to load the data that will provide the greatest return on the data loading investment.

For historical data loading, project personnel will need to identify the data to be loaded for each project, and make the data available to whoever is loading it. This will require anywhere from a few hours to several weeks or more per project, depending on the amount of the data and the difficulty in locating it. Significant data loading projects can cost in the hundreds of thousands of dollars.
Import data from laboratories or other sources

For most projects this is the largest part of managing the EDMS, and adequate personnel must be assigned to it. The work includes working with the laboratories to ensure that clean data is delivered in a useful format, and importing the data into the database. There must be enough well-trained people that the work goes smoothly and does not back up to the point where it affects the projects using the system. Importing data is covered in more detail in Chapter 13; a minimal sketch of the first step follows.
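As a hedged sketch of the mechanics (assuming an Access-based EDMS; the staging table name and the use of a simple comma-delimited deliverable file are assumptions):

    ' Load a comma-delimited lab deliverable into a staging table
    ' for checking before it is merged into the main database.
    Public Sub ImportDeliverable(strFile As String)
        Dim db As DAO.Database
        Set db = CurrentDb()
        ' clear the staging table from the previous deliverable
        db.Execute "DELETE FROM ImportStaging", dbFailOnError
        ' True = the first row of the file contains field names
        DoCmd.TransferText acImportDelim, , "ImportStaging", strFile, True
        Debug.Print db.OpenRecordset( _
            "SELECT Count(*) FROM ImportStaging")(0) & " rows staged."
    End Sub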
Murphy’s law of thermodynamics: Things get worse under pressure. Rich (1996)
Manage the review status of all data

Knowing where the data came from and what has happened to it is very important in order to know how much you can trust it. The EDMS can help significantly with this process. Management of the review status is discussed in depth in Chapter 15.
Select data for display or editing

Many benefits can be obtained by moving enterprise environmental data to a centralized, open database. With all of the data in one place, project personnel will need to work with the data, and it will be unusual for them to need to look at all of it at once. They will want to select parts of the data for analysis and display, so the selection system becomes more important as the data becomes more centralized. See Chapter 18 for more information on selecting data.
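A typical selection, sketched as an Access VBA function returning a recordset; the station, parameter, date range, and field names are placeholders for whatever the user chooses:

    ' Pull one parameter for one station over a date range -
    ' the kind of subset needed for a trend graph or report.
    Public Function SelectResults() As DAO.Recordset
        Set SelectResults = CurrentDb().OpenRecordset( _
            "SELECT SampleDate, ResultValue, Units FROM Results " & _
            "WHERE StationName = 'MW-1' " & _
            "AND ParameterName = 'Benzene' " & _
            "AND SampleDate BETWEEN #1/1/1999# AND #12/31/2001# " & _
            "ORDER BY SampleDate")
    End Function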
Analyze the data

Once the data is organized and easy to find, the next step is to analyze it to get a better understanding of what it is telling you about the site. Organizations that implement data management systems often find that, because they spend less time managing the data, they can spend more time analyzing it. This can provide great benefits to the project. Map analysis of the data is discussed in Chapter 22, and statistical analysis in Chapter 23.
Generate graphs, maps, and reports

Building the database is usually not the goal of the project. The goal is using the data to make decisions. This means that the benefits will be derived by generating output, either using the EDMS directly or with other applications. Several chapters in Part Five cover various aspects of using the data to achieve project benefits. The important point here is that, during the planning and implementation phases, as much or more attention should be paid to how the data will be used as to how it will be gathered; often this is not done.
Use the data in other applications

Modern EDMS products usually provide a broad suite of data analysis and display tools, but they can’t do everything. The primary purpose of the EDMS is to provide a central repository for the data, and the tools to get the data in and out. It should be easy to use the data in other applications, and software interface tools like ODBC are making this much easier. Integration of the database with other programs is covered in Chapter 24.
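For example, another application can read the EDMS through an ODBC data source. A minimal late-bound ADO sketch; the DSN name (EDMS) and the table and field names are assumptions:

    ' Read EDMS data from another application through ODBC.
    Public Sub ReadViaODBC()
        Dim cn As Object
        Dim rs As Object
        Set cn = CreateObject("ADODB.Connection")
        cn.Open "DSN=EDMS"   ' ODBC data source configured in Windows
        Set rs = cn.Execute("SELECT StationName, ResultValue FROM Results")
        Do While Not rs.EOF
            Debug.Print rs!StationName, rs!ResultValue
            rs.MoveNext
        Loop
        rs.Close
        cn.Close
    End Sub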
ADMINISTERING THE SYSTEM

During and after installation of the EDMS, there are a number of activities that must be performed, mostly on an ongoing basis, to keep the system operational. Some of the most
important of these activities are discussed here. A significant amount of staff time may be required for these tasks. In some cases, consultants can be substituted for internal staff in addressing these issues. Time estimates for these items can be difficult to generate because they are so dependent on the volume of data being processed, but are important for planning and allocating resources, especially when the project is facing a deadline.
System maintenance

The EDMS will require ongoing maintenance to keep it operational. This is true to a small degree of the client component and to a larger degree of the server part.

Client computer maintenance – For the client computers, this might include installing new versions of the software and, for Access-based systems, compacting and perhaps repairing the database files (a sketch of the compacting step appears at the end of this section). As new versions of the program file are released containing bug fixes and enhancements, they will need to be installed on each client computer, and, especially for large installations, an efficient way of doing this must be established. Compacting should be done on a regular basis. Repairing the data files on the client computers may occasionally be required should one of them become corrupted. This process is not usually difficult or time-intensive. System maintenance will require that each user spend an hour or less each month maintaining his or her database.

Server maintenance – For the server, there are several maintenance activities that must be performed on a regular basis. The most important and time-consuming is backing up system data, as discussed below and in Chapter 15. Also, the user database for the system must be maintained as users are added and removed, or as their activities and data access needs change. Finally, with most server programs the database volume must be resized occasionally as the data content increases.

System administrators and data administrators will need to spend more time on their maintenance tasks than users. System administrators should expect to spend at least several hours each week on system maintenance. The time requirements for data administrators will depend to a large extent on the amount of data that they are responsible for maintaining. Importing data from a laboratory can take a few minutes or several hours, depending on the complexity of the data and the number of problems that must be overcome. Likewise, data review time for each data set can vary widely. Gathering and inputting existing data from hard copy can be very time-consuming, and then reviewing that data requires additional time. A complete data search and entry project for a large site could take several people weeks or more. Project or management personnel will need to decide which sites will be imported, in which order, and at what rate, based on the personnel available.
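For an Access-based client, the compacting step can be as simple as the following sketch; the paths are hypothetical, and the file must be closed while it is compacted:

    ' Compact an Access data file as part of routine maintenance.
    ' The destination file must not already exist.
    Public Sub CompactDataFile()
        DBEngine.CompactDatabase "C:\EDMS\SiteData.mdb", _
                                 "C:\EDMS\SiteData_compact.mdb"
    End Sub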
Software and user support

A significant contributor to the success of an EDMS is the availability of support on the system after installation. Even the best software requires good support in order to provide a satisfactory user experience. As the system is used, people will need help in using it efficiently. This will include hot-line support so they have someone to call when they have a problem. Sometimes their problem will be that they can’t figure out how to use all the features of the system. Other times they will identify problems with the software that must be fixed. Still other times they will identify enhancements that might be added in a later release of the software. The organization should implement a system for obtaining software support, whether through power users, a dedicated support staff, the software vendor, or consultants.

There are two primary software support needs for an EDMS: system issues and feature issues. System issues involve running the software itself, including network problems, printing problems, and other similar items. These usually occur at the start of using the product, and usually require an hour or less per user, for both the user and the support technician, to overcome. The recurrence rate, after the preliminary shakedown, is relatively low.
The second type of support is on software features. This ranges from usage issues like import problems to questions about how to make enhancements. This support usually peaks in each user’s first week or two of intense use and tapers off after that, but never goes away entirely. The amount of support depends greatly on the way the software is used and the computer literacy of the user. Some users have required an hour of support or less to be up and running. Other users need four hours or more in the first couple of months. Adequate training on the theoretical and hands-on aspects of the software can cut the support load by about half. Users and the support organization should expect that the greatest amount of support will be required shortly after each user starts using the software. Staging the implementation so that not all of the users are new users at once can help with the load on the support line. It also allows early users to help their neighbors, which can be an efficient way of providing support in some situations.

There are a number of different types of support required for each user.

Initial hands-on support – After the software is delivered to the users, the development staff or support personnel should be onsite for a period of time to assist with any difficulties that are encountered. Technical personnel with access to reference resources should back up these people to assist with overcoming any obstacles.

Telephone/email support – Once the system is up and running, problems will almost certainly be encountered. Often the resolution of these problems is simple for someone with a good understanding of how the system operates. These problems usually fall into two categories: user error (misuse of the software) and software error. In the case of user error, two steps are required. The first is to help the user overcome the current obstacle. The second is to analyze the user error to determine whether it could be avoided in the future with changes to the user interface, written documentation, help system, or training program. If the problem is due to software error, it should be analyzed for seriousness. Based on that analysis, the problem should either be addressed immediately (if there is the potential for data loss or the problem leads to very inefficient use of the software) or added to a list of corrections to be performed as part of a future release.

Troubleshooting – If software problems are identified, qualified personnel must be available to address them. This usually involves duplicating the problem if possible, identifying the cause, determining the solution, scheduling and performing the fix, and distributing the modified code.
Power user development

Once the system is operational, some people may express an interest in becoming more knowledgeable about the software, and perhaps in learning to expand and customize the system. It is often to the organization’s advantage to encourage these people and support their learning more about the system, because this greater knowledge reduces dependence on support staff, and perhaps consultants. A system should be put in place for developing the advanced capabilities of these people, often referred to as power users. This might include specialized training for their advanced needs, and perhaps specialized software support as well. Development of power users is usually done individually as people identify themselves as this type of user. It will require their time for formal and informal training to expand their knowledge of the system.
Enhancements and customization

After the EDMS has been installed and people are using the system, it is likely that they will have ideas for improving it. A keystone of several popular management approaches to quality and productivity is continuous improvement, and an EDMS can certainly benefit from this process. Users should be encouraged to provide feedback on the system, and, assuming that the software has the flexibility and configurability to accommodate it, changes should be made on an ongoing basis so that over time the system becomes a better and better fit to users’ needs. The
organization should implement a system for gathering users’ suggestions, ranking them by the return on any cost that they may entail, and then implementing those that make good business sense. There is a conflict between the need for continuous improvement and the need to control the versions of the software in use. Too many small improvements can lead to differing and sometimes incompatible versions of the software. Too few revisions can result in an unacceptably long time until improvements are made available to users. A compromise must be found based on the needs of individual users, the development team, and the organization.
Backup of data to protect from loss

The importance of backing up the database cannot be overemphasized. This should be done at least daily. In some organizations, Information Services staff will do this as part of their services. Loss of data can be extremely costly. More information on the backup task can be found in Chapter 15.
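For a small Access-based system, even a simple dated file copy is better than nothing. A minimal sketch follows; the paths are hypothetical, the database must be closed when it is copied, and most organizations will rely on network backup software instead, as discussed in Chapter 15:

    ' Copy the data file to a backup named for today's date.
    Public Sub BackupDataFile()
        Dim strStamp As String
        strStamp = Format(Date, "yyyymmdd")
        FileCopy "C:\EDMS\SiteData.mdb", _
                 "D:\Backups\SiteData_" & strStamp & ".mdb"
    End Sub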
PART THREE - GATHERING ENVIRONMENTAL DATA
CHAPTER 10 SITE INVESTIGATION AND REMEDIATION
The site investigation and remediation process is usually the reason for site environmental data management. The results of the data management process can provide vital input to the decision-making process. This chapter provides an overview of the regulations that drive the site investigation and remediation process, some information on how the process works under the major environmental regulations, and a look at how data management and display is involved in the different parts of the process. Related processes are environmental assessments and environmental impact statements, which can also be aided by an EDMS.
OVERVIEW OF ENVIRONMENTAL REGULATIONS

The environmental industry is driven by government regulations. These regulations have been enacted at the national, state, or local level. Nearly all environmental investigation and remediation activity is performed to satisfy regulatory requirements. A good overview of environmental regulations can be found in Mackenthun (1998). The following are some of the most significant environmental regulations:

National Environmental Policy Act of 1969 (NEPA) – Requires federal agencies to consider potentially significant environmental impacts of major federal actions prior to taking the action. The NEPA process contains three levels of possible documentation: 1) Categorical Exclusion (CATEX), where no significant effects are found; 2) Environmental Assessment (EA), which addresses various aspects of the project including alternatives, potential impacts, and mitigation measures; and 3) Environmental Impact Statement (EIS), which covers topics similar to an EA, but in more detail.

Clean Air Act of 1970 (CAA) – Provides for the designation of air quality control regions, and requires National Ambient Air Quality Standards (NAAQS) for six criteria pollutants (particulate matter, sulfur dioxide, carbon monoxide, ozone, nitrogen dioxide, and lead). Also requires National Emission Standards for Hazardous Air Pollutants (NESHAPs) for 189 hazardous air pollutants. The act requires states to implement NAAQS, and requires that source performance standards be developed and attained by new sources of air pollution.

Occupational Safety and Health Act of 1970 – Requires private employers to provide a place of employment safe from recognized hazards. The act is administered by the Occupational Safety and Health Administration (OSHA).
Bad regulations are more likely to be supplemented than repealed. Rich (1996)

Endangered Species Act of 1973 (ESA) – Provides for the listing of threatened or endangered species. Any federal actions must be evaluated for their impact on endangered species, and the act makes it illegal to harm, pursue, kill, etc. a listed endangered or threatened species.

Safe Drinking Water Act of 1974 (SDWA) – Protects groundwater aquifers and provides standards to ensure safe drinking water at the tap. It makes drinking water standards applicable to all public water systems with at least 15 service connections serving at least 25 individuals. Requires primary drinking water standards that specify maximum contamination at the tap, and prohibits certain activities that may adversely affect water quality.

Resource Conservation and Recovery Act of 1976 (RCRA) – Regulates hazardous wastes from their generation through disposal, and protects groundwater from land disposal of hazardous waste. It requires criteria for identifying and listing hazardous waste, and covers transportation and handling of hazardous materials in operating facilities. The act also covers construction, management of, and releases from underground storage tanks (USTs). In 1999, 20,000 hazardous waste generators regulated by RCRA produced over 40 million tons of hazardous waste (EPA, 2001b). RCRA was amended in 1984 with the Hazardous and Solid Waste Amendments (HSWA), which required phasing out land disposal of hazardous waste.

Toxic Substances Control Act of 1976 (TSCA) – Requires testing of any substance that may present an unreasonable risk of injury to health or the environment, and gives the EPA authority to regulate these substances. Covers the more than 60,000 substances manufactured or processed, but excludes nuclear materials, firearms and ammunition, pesticides, tobacco, food additives, drugs, and cosmetics.

Clean Water Act of 1977 (CWA) – Based on the Federal Water Pollution Control Act of 1972 and several other acts, and amended significantly in 1987. This act, which seeks to eliminate the discharge of pollutants into navigable waterways, has provisions for managing water quality and permitting of treatment technology. Development of water quality standards is left to the states, which must set standards at least as stringent as federal water quality standards.

Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (CERCLA, Superfund) – Enacted to clean up abandoned and inactive hazardous waste sites. Creates a tax on the manufacture of certain chemicals to create a trust fund called the Superfund. Sites to be cleaned up are prioritized on a National Priority List (NPL) by the EPA. Procedures and cleanup criteria are specified by a National Contingency Plan. The NPL originally contained 408 sites, and now contains over 1300. Another 30,000 sites are being evaluated for addition to the list.

Emergency Planning and Community Right-to-Know Act of 1986 (EPCRA) – Enacted after the Union Carbide plant disaster in Bhopal, India in 1984, in which a release of methyl isocyanate from a chemical plant killed 2,000 people and impacted the health of 170,000 survivors, this law requires industrial facilities to disclose information about chemicals stored onsite.

Pollution Prevention Act of 1990 (PPA) – Requires collection of information on source reduction, recycling, and treatment of listed hazardous chemicals. Resulted in a Toxic Release Inventory for facilities, including amounts disposed of onsite and sent offsite, recycled, and used for energy recovery.

These regulations have contributed significantly to the improvement of our environment. They have also resulted in a huge amount of paperwork and other expenses for many organizations, and explain why environmental coordinators stay very busy.
THE INVESTIGATION AND REMEDIATION PROCESS

The details of the site investigation and remediation process vary depending on the regulation under which the work is being done. Superfund was designed to remedy mistakes in hazardous waste management made in the past at sites that have been abandoned or where a sole responsible party cannot be determined. RCRA deals with sites that have viable operators and ongoing operations. The majority of sites fall into one of these two categories. The rest operate under a range of regulations administered by various regulatory bodies, many of which are state agencies.
CERCLA

CERCLA (Superfund) gives the EPA the authority to respond to releases or threatened releases of hazardous substances that may endanger human health and the environment. The three major areas of enforcement at Superfund sites are: achieving site investigations and cleanups led by the potentially responsible party (PRP) or parties (PRP lead cleanups, meaning the lead party on the project is the PRP); overseeing PRP investigation and cleanup activities; and recovering from PRPs the costs spent by EPA at Superfund cleanups (Fund lead cleanups).

The National Contingency Plan of CERCLA describes the procedures for identification, evaluation, and remediation of past hazardous waste disposal sites. These procedures are preliminary assessment and site inspection; Hazard Ranking System (HRS) scoring and National Priority List (NPL) site listing; remedial investigation and feasibility studies; record of decision; remedial design and remedial action; construction completion; operation and maintenance; and NPL site deletion. Site environmental data can be generated at various steps in the process. Additional information on Superfund enforcement can be found in EPA (2001a).

Preliminary assessment and site inspection – The process starts with investigations of site conditions. A preliminary assessment (PA) is a limited scope investigation performed at each site. Its purpose is to gather readily available information about the site and surrounding area to determine the threat posed by the site. The site inspection (SI) provides the data needed for the hazard ranking system, and identifies sites that enter the NPL site listing process (see below). SIs typically involve environmental and waste sampling that can be managed using the EDMS.

HRS scoring and NPL site listing – The hazard ranking system (HRS) is a numerically based screening system that uses information from initial, limited investigations to assess the relative potential of sites to pose a threat to human health or the environment. The HRS assigns a numerical score to factors that relate to risk based on conditions at the site. The four risk pathways scored by HRS are groundwater migration; surface water migration; soil exposure; and air migration. HRS is the principal mechanism EPA uses to place uncontrolled waste sites on the National Priorities List (NPL). Identification of a site for the NPL helps the EPA determine which sites warrant further investigation, make funding decisions, notify the public, and serve notice to PRPs that EPA may begin remedial action.

Remedial investigation and feasibility studies – Once a site is on the NPL, a remedial investigation/feasibility study (RI/FS) is conducted at the site. The remedial investigation involves collection of data to characterize site conditions, determine the nature of the waste, assess the risk to human health and the environment, and conduct treatability testing to evaluate the potential performance and cost of the treatment technologies that are being considered. The feasibility study is then used for the development, screening, and detailed evaluation of alternative remedial actions. The RI/FS has five phases: scoping; site characterization; development and screening of alternatives; treatability investigations; and detailed analyses. The EDMS can make a significant contribution to the site characterization component of the RI/FS, which often involves a significant amount of sampling of soil, water, and air at the site. The EDMS serves as a repository of the data,
as well as a tool for data selection and analysis to support the decision-making process. Part of the site characterization process is to develop a baseline risk assessment to identify the existing or potential risks that may be posed to human health and environment at the site. The EDMS can be very useful in this process by helping screen the data for exceedences that may represent risk factors (a sketch of such a screening query appears at the end of this section).

Record of decision – Once the RI/FS has been completed, a record of decision (ROD) is issued that explains which of the cleanup alternatives will be used to clean up the site. This public document can be significant for data management activities because it often sets target levels for contaminants that will be used in the EDMS for filtering, comparison, and so on.

Remedial design and remedial action – In the remedial design (RD), the technical specifications for cleanup remedies and technologies are designed. The remedial action (RA) follows the remedial design and involves the construction or implementation phase of the site cleanup. The RD/RA is based on specifications described in the ROD. The EDMS can assist greatly with tracking the progress of the RA and determining when ROD limits have been met.

Construction completion – A construction completion list (CCL) helps identify successful completion of cleanup activities. Sites qualify for construction completion when any physical construction is complete (whether or not cleanup levels have been met), EPA has determined that construction is not required, or the site qualifies for deletion from the NPL.

Operation and maintenance – Operation and maintenance (O&M) activities protect the integrity of the selected remedy for a site, and are initiated by the state after the site has achieved the actions and goals outlined in the ROD. The site is then determined to be operational and functional (O&F) based on state and federal agreement when the remedy for the site is functioning properly and performing as designed, or has been in place for one year. O&M monitoring involves inspection; sampling and analysis; routine maintenance; and reporting. The EDMS is used heavily in this stage of the process.

NPL site deletion – In this final step, sites are removed from the NPL once they are judged to no longer be a significant threat to human health and the environment. To date, not many sites have been delisted.
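Several of the steps above come down to comparing results against the target levels set in the ROD. A hedged sketch of such a screening query in Access VBA; the Results and ActionLevels tables and their fields are hypothetical:

    ' List results that exceed the action levels set for the site.
    Public Sub ListExceedences()
        Dim rs As DAO.Recordset
        Set rs = CurrentDb().OpenRecordset( _
            "SELECT r.StationName, r.ParameterName, r.ResultValue " & _
            "FROM Results AS r INNER JOIN ActionLevels AS a " & _
            "ON r.ParameterName = a.ParameterName " & _
            "WHERE r.ResultValue > a.ActionLevel")
        Do While Not rs.EOF
            Debug.Print rs!StationName, rs!ParameterName, rs!ResultValue
            rs.MoveNext
        Loop
        rs.Close
    End Sub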
RCRA

The EPA’s Office of Solid Waste (OSW) is responsible for ensuring that currently generated solid waste is managed properly, and that currently operating facilities address any contaminant releases from their operations. In some cases, accidents or other activities at RCRA facilities have released hazardous materials into the environment, and the RCRA Corrective Action Program covers the investigation and cleanup of these facilities. Additional information on RCRA enforcement can be found in EPA (2001b).

As a condition of receiving a RCRA operating permit, active facilities are required to clean up contaminants that are being released or have been released in the past. EPA, in cooperation with the states, verifies compliance through compliance monitoring, educational activities, voluntary incentive programs, and a strong enforcement program. The EDMS is heavily involved in compliance monitoring and to some degree in enforcement actions.

Compliance monitoring – EPA and the states determine a waste handler’s compliance with RCRA requirements using inspections, record reviews, sampling, and other activities. The EDMS can generate reports comparing sampling results to regulatory limits to save time in the compliance monitoring process.

Enforcement actions – The compliance monitoring process can turn up violations, and enforcement actions are taken to bring the waste handler into compliance and deter further violations. These actions can include administrative actions, civil judicial actions, and criminal actions. In addition, citizens can file suit to bring enforcement actions against violators or potential violators.
One important distinction from a data management perspective between CERCLA and RCRA projects is that CERCLA projects deal with past processes, while RCRA projects deal with both past and present processes. This means that the EDMS for both projects needs to store information on soil, groundwater, etc., while the RCRA EDMS also might store information on ongoing processes such as effluent concentrations and volumes, and even production and other operational information.
Other regulatory oversight
While many sites are investigated and remediated under CERCLA or RCRA, other regulatory oversight is also possible. The EPA has certified some states to oversee cleanup within their boundaries. In some cases, other government agencies, including the armed forces, oversee their own cleanup efforts. In general, the technical activities performed are essentially the same regardless of the type of oversight, and the functional requirements for the EDMS are also the same. The main exception is that some of these agencies require the use of specific reporting tools as described in Chapter 5.
ENVIRONMENTAL ASSESSMENTS AND ENVIRONMENTAL IMPACT STATEMENTS
The National Environmental Policy Act of 1969 (NEPA), along with various supplemental laws and legal decisions, requires federal agencies to consider the environmental impacts and possible alternatives of any federal actions that significantly affect the environment (Mackenthun, 1998, p. 15; Yost, 1997, p. 1-11). This usually starts with an environmental assessment (EA). The EA can result in a determination that an environmental impact statement (EIS) is required, or in a finding of no significant impact (FONSI). The EIS is a document that is prepared to assist with decision making based on the environmental consequences and reasonable alternatives of the action. The format of an EIS is recommended in 40 CFR 1502.10, and the document is normally limited to 150 pages. Often there is considerable public involvement in this process. One important use of environmental assessments is in real estate transactions. The seller and especially the buyer want to be aware of any environmental liabilities related to the property being transferred. These assessments are broken into phases. The data management requirements of EAs and EISs vary considerably, depending on the nature of the project and the amount and type of data available.
Phase 1 Environmental Assessment – This process involves evaluation of existing data about a site, along with a visual inspection, followed by a written report, similar to a preliminary assessment and site inspection under CERCLA, and can satisfy some CERCLA requirements such as the innocent landowner defense. The Phase 1 assessment process is well defined, and guidelines such as Practice E-1527-00 from the American Society for Testing and Materials (ASTM, 2001a, 2001b) are used for the assessment and reporting process. There are four parts to this process: gathering information about past and present activities and uses at the site and adjoining properties; reviewing environmental files maintained by the site owner and regulatory agencies; inspection of the site by an environmental professional; and preparation of a report identifying existing and potential sources of contamination on the property. The work involves document searches and review of air photos and site maps. Often the source materials are in hard copy not amenable to data management. Public and private databases are available to search ownership, toxic substance release, and other information, but this data is usually managed by its providers and not by the person performing the search. Phase 1 assessments for a small property are generally not long or complicated, and can cost as little as $1,000.
Phase 2 Investigation – If a Phase 1 assessment determines that the presence of contamination is likely, the next step is a Phase 2 assessment. The primary differences are that Phase 1 relies on existing data, while in Phase 2 new data is gathered, usually in an intrusive manner, and the Phase 2 process is less well defined. This can involve sampling soil, sediment, and sludge and installation of wells for sampling groundwater. This is similar to remedial investigation and feasibility studies under CERCLA. If the assessment progresses to the point where samples are being taken and analyzed, then the in-house data management system can be of value.
Phase 3 Site Remediation and Decommissioning – The final step of the assessment process, if necessary, is to perform the cleanup and assess the results. Motivation for the remediation might include the need to improve conditions prior to a property transfer, to prevent contamination from migrating off the property, to improve the value of the property, or to avoid future liability. Monitoring the cleanup process, which can involve ongoing sampling and analysis, will usually involve the EDMS.
CHAPTER 11 GATHERING SAMPLES AND DATA IN THE FIELD
Environmental monitoring at industrial and other facilities can involve one or more different media. The most common are soil, sediment, groundwater, surface water, and air. Other media of concern in specific situations might include dust, paint, waste, sludge, plants and animals, and blood and tissue. Each medium has its own data requirements and special problems. Generating site environmental data starts with preparing sampling plans and gathering the samples and related data in the field. There are a number of aspects of this process that can have a significant impact on the resulting data quality. Because the sampling process is specific to the medium being sampled, this chapter is organized by medium. Only the major media are discussed in detail.
GENERAL SAMPLING ISSUES
The process of gathering data in the field, sending samples to the laboratory, analyzing the data, and reporting the results is complicated and error-prone. The people doing the work are often overworked and underpaid (who isn’t?), and the requirements to do a good job are stringent. Problems that can lead to questionable or unusable data can occur at any step of the way. The exercise (and in some cases, requirement) of preparing sampling plans can help minimize field data problems. Field sampling activities must be fully documented in conformance with project quality guidelines. Those guidelines should be carefully thought out and followed methodically. A few general issues are covered here. The purpose of this section is not to teach field personnel to perform the sampling, but to help data management staff understand where the samples and data come from in order to use them properly. In all cases, regulations and project plans should be followed in preference to statements made here. Additional information on these issues can be found in ASTM (1997), DOE/HWP (1990a), and Lapham, Wilde, and Koterba (1985).
Taking representative samples
Joseph (1998) points out that the basic premise of sampling is that the sample must represent the whole population, and quotes the law of statistical regularity as stating that “a set of subjects taken at random from a large group tends to reproduce the characteristics of that large group.” But the sample is only valid if the errors introduced in the sampling process do not invalidate the results for the purpose intended for the samples. Analysis of the samples should result in no bias and minimum random errors.
Figure 51 - Types of sampling patterns: simple random sampling, judgment sampling, grid (systematic) sampling, stratified sampling, random grid sampling, and two-stage sampling
The size of the sample set is directly related to the precision of the result. More samples cost more money, but give a more reliable result. If you start with the precision required, then the number of samples required can be calculated:
n = Ns² / (N(B²/4) + s²)
where n is the number of samples, N is the size of the population, s is the standard deviation of the sample, and B is the desired precision, expressed as an acceptable error bound (often at the 95% confidence level). According to Joseph (1998), the standard deviation can be estimated by taking the largest value of the data minus the smallest and dividing by four.
There are several strategies for laying out a sampling program. Figure 51, modified after Adolfo and Rosecrance (1993), shows six possibilities. Sampling strategies are also discussed in Sara (1994, p. 10-49). In simple random sampling, the chance of selecting any particular location is the same. With judgment sampling, sampling points are selected based on previous knowledge of the system to be sampled. Grid sampling provides uniform coverage of the area to be studied. Stratified sampling bases the sample locations on the existence of discrete areas of interest, such as aquifers and confining layers, or disposal ponds and the areas between them. Random grid sampling combines uniform coverage of the study area with a degree of random selection of each location, which can be useful when access to some locations is difficult. With two-stage sampling, secondary sampling locations are based on results of primary stage samples. In the example shown, primary sample A had elevated values, so additional samples were taken nearby, while primary sample B was clean, so no follow-up samples were taken. Care should be taken so that the sample locations are as representative as possible of the conditions being investigated. For example, well and sample locations near roadways may be influenced by salting and weed spraying activities. Also, cross-contamination from dirty samples must be avoided by using procedures like sampling first from areas expected to have the least contamination, then progressing to areas expected to have more.
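A small sketch of this calculation, assuming the range rule of thumb above for estimating s (the numbers are hypothetical):

    # Number of samples for a desired precision, using the formula above.
    def sample_count(N, s, B):
        # N = population size, s = standard deviation, B = desired error bound
        return N * s**2 / (N * (B**2 / 4) + s**2)

    s = (48.0 - 8.0) / 4   # range rule: (largest - smallest) / 4
    print(round(sample_count(N=200, s=s, B=5.0)))   # about 15 samples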
Logbooks and record forms
Field activities must be fully documented using site and field logbooks. The site logbook stores information on all field investigative activities, and is the master record of those activities. The field logbook covers the same activities, but in more detail. The laboratory also should keep a logbook for tracking the samples after they receive them. The field logbook should be kept up-to-date at all times. It should include information such as well identification; date and time of sampling; depth; fluid levels; yield; purge volume, pumping rate, and time; collection methods; evacuation procedures; sampling sequence; container types and sample identification numbers; preservation; requested parameters; field analysis data; sample distribution and transportation plans; name of collector; and sampling conditions. Several field record forms are used as part of the sampling process. These include Sample Identification and Chain of Custody forms. Also important are sample seals to preserve the integrity of the sample between sampling and when it is opened in the laboratory. These are legal documents, and should be created and handled with great care. Sample Identification forms are usually a label or tag so that they stay with the sample. Labels must be waterproof and completed in permanent ink. These forms should contain such information as site name; unique field identification of sample, such as station number; date and time of sample collection; type of sample (matrix) and method of collection; name of person taking the sample; sample preservation; and type of analyses to be conducted. Chain of Custody (COC) forms make it possible to trace a sample from the sampling event through transport and analysis. The COC must contain the following information: project name; signature of sampler; identification of sampling stations; unique sample numbers; date and time of collection and of sample possession; grab or composite designation; matrix; number of containers; parameters requested for analysis; preservatives and shipping temperatures; and signatures of individuals involved in sample transfer.
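In an EDMS, these COC fields translate naturally into a record structure. The sketch below is hypothetical; the field names and types are illustrative, not from any actual data model.

    from dataclasses import dataclass, field

    @dataclass
    class CustodyTransfer:
        released_by: str
        received_by: str
        date_time: str              # date and time of the transfer

    @dataclass
    class ChainOfCustody:
        project: str
        sampler: str                # signature of sampler
        station: str
        sample_number: str          # unique sample number
        collected: str              # date and time of collection
        grab_or_composite: str
        matrix: str
        containers: int
        parameters: list = field(default_factory=list)   # requested analyses
        transfers: list = field(default_factory=list)    # CustodyTransfer records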
Velilind’s Laws of Experimentation:
1. If reproducibility may be a problem, conduct the test only once.
2. If a straight line fit is required, obtain only two data points.
McMullen (1996)
COC forms should be enclosed in a plastic cover and placed in the shipping container with the samples. When the samples are given to the shipping company, the shipping receipt number should be recorded on the COC and in the site logbook. All transfers should be documented with the signature, date, and time on the form. A sample must remain under custody at all times. A sample is under custody if it is in the sampler’s possession; it is in the sampler’s view after being in possession; it is in the possession of a traceable carrier; it is in the possession of another responsible party such as a laboratory; or it is in a designated secure area.
SOIL
Taking soil samples requires consideration of the fact that soil is a very complex physical material. The solid component of soil is a mix of inorganic and organic materials. In place in the ground, soil can contain one or more liquid phases and a gas phase, and these can be absorbed or adsorbed in various ways. The sampling, transportation, and analysis processes must be managed carefully so that analytical results accurately represent the true composition of the sample.
Soil sampling issues
Before a soil sample can be taken, the material to be sampled must be exposed. For surface or shallow subsurface soil samples this is generally not an issue, but for subsurface samples this usually requires digging. This can be done using either drilling or drive methods. For unconsolidated formations, drilling can be done using an auger (hollow-stem, solid flight, or bucket), drilling (rotary, sonic, directional), or jetting methods. For consolidated formations, rotary drilling (rotary bit, downhole hammer, or diamond drill) or cable tools can be used. Drive methods include cone penetrometers or direct push samplers. Sometimes it is useful to do a borehole geophysical survey after the hole is drilled. Examples of typical measurements include spontaneous potential, resistivity, gamma and neutron surveys, acoustic velocity, caliper, temperature, fluid flow, and electromagnetic induction. Soil samples are gathered with a variety of tools, including spoons, scoops, shovels, tubes, and cores. The samples are then sealed and sent to the laboratory. Duplicates should be taken as required by the QAPP (quality assurance project plan). Sometimes soil samples are taken as a boring is made, and then the boring is converted to a monitoring well for groundwater, so both soil and water samples may come from the same hole in the ground. Typical requirements for soil samples are as follows. The collection points should be surveyed relative to a permanent reference point, located on a site map, and referenced in the field logbook. A clean, decontaminated auger, spoon, or trowel should be used for each sample collected. Surface or air contact should be minimized by placing the sample in an airtight container immediately after collection. The sampling information should be recorded in the field logbook and any other appropriate forms. For subsurface samples, the process for verifying depth of sampling, the depth control tolerance, and the devices used to capture the samples should be as specified in the work plan. Care must be taken to prevent cross-contamination or misidentification of samples. Sometimes the gas content of soil is of concern, and special sampling techniques must be used. These include static soil gas sampling, soil gas probes, and air sampling devices.
Groundwater or Ground Water?
Is “groundwater” one word or two? When used by itself, groundwater as one word looks fine, and many people write it this way. The problem comes in when it is written along with surface water, which is always two words, and ground water as two words looks better. Some individuals and organizations prefer it one way, some the other, so apparently neither is right or wrong. For any one writer, just as with “data is” vs. “data are,” the most important thing is to pick one and be consistent.
Special consideration should be given for soil and sediment samples to be analyzed for volatile organics (VOAs). The samples should be taken with the least disturbance possible, such as using California tubes. Use Teflon or stainless steel equipment. If preservatives are required, they should be added to the bottle before sampling. Samples for VOA analysis should not be split. Air bubbles cannot be present in the sample. The sample should never be frozen.
Soil data issues
Soil data is usually gathered in discrete samples, either as surface samples or as part of a soil boring or well installation process. Then the sample is sent to the laboratory for analysis, which can be chemical, physical, or both. Each sample has a specific concentration of each constituent of concern. Sometimes it is useful to know not only the concentration of a toxin, but also its mobility in groundwater. Useful information can be provided by a leach test such as TCLP (toxicity characteristic leaching procedure), in which a liquid is passed through the soil sample and the concentration in the leachate is measured. This process is described in more detail in Chapter 12. Key data fields for soil samples include the site and station name; the sample date and depth; COC and other field sample identification numbers; how the sample was taken and who took it; transportation information; and any sample QC information such as duplicate and other designations. For surface soil samples, the map coordinates are usually important. For subsurface soil samples, the map coordinates of the well or boring, along with the depth, often as a range (top and bottom), should be recorded. Often a description of the soil or rock material is recorded as the sample is taken, along with stratigraphic or other geologic information, and this should be stored in the EDMS as well.
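A hedged sketch of how these key fields might be laid out as a relational table, using Python’s built-in sqlite3 module; the table and column names are illustrative only.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE soil_sample (
            sample_id     TEXT PRIMARY KEY,   -- COC / field sample number
            site          TEXT,
            station       TEXT,
            sample_date   TEXT,
            depth_top     REAL,               -- top of sampled interval
            depth_bottom  REAL,               -- bottom of sampled interval
            x_coord       REAL,               -- map coordinates of boring/well
            y_coord       REAL,
            sampled_by    TEXT,
            qc_code       TEXT                -- e.g., duplicate designation
        )""")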
SEDIMENT
Procedures for taking sediment samples are similar to those for soil samples. Samples should be collected from areas of least to greatest contamination, and from upstream to downstream. Sediment plumes and density currents should be avoided during sample collection.
GROUNDWATER
Groundwater is an important resource, and much environmental work involves protecting and remediating groundwater. A good overview of groundwater and its protection can be found in Conservation Technology Resource Center (2001). Groundwater accounts for more than 95% of all fresh water available for use, and about 40% of river flow depends on groundwater. About 50% of Americans (and 95% of rural residents) obtain all or part of their drinking water from groundwater. Groundwater samples are usually taken at a location such as a monitoring well, for an extended period of time such as quarterly, for many years. Additional information on groundwater sampling can be found in NWWA (1986) and Triplett (1997).
Figure 52 - Submersible sampling pump (Courtesy of Geotech Environmental Equipment)
Groundwater sampling issues
The first step in groundwater sampling is to select the location and drill the hole. Drilling methods are similar to those described above in the section on soil sampling, and soil samples can be taken when a groundwater well is drilled. Then the wellbore equipment such as tubing, screens, and annular material is placed in the hole to make the well. The tubing closes off part of the hole and the screens open the other part so water can enter the wellbore. Screening must be at the correct depth so the right interval is being sampled. Prior to the first sampling event, the well is developed. For each subsequent event it is purged and then sampled. The following discussion is intended to generally cover the issues of groundwater sampling. The details of the sampling process should be covered in the project work plan. Appropriate physical measurements of the groundwater are taken in the field. The sample is placed in a bottle, preserved as appropriate, chilled and placed in a cooler, and sent to the laboratory.
Well development begins sometime after the well is installed. A 24-hour delay is typical. Water is removed from the well by pumping or bailing, and development usually continues until the water produced is clear and free of suspended solids and is representative of the geologic formation being sampled. Development should be documented on the Well Development Log Form and in the site and field logbooks. Upgradient and background wells should be developed before downgradient wells to reduce the risk of cross-contamination.
Measurement of water levels should be done according to the sampling plan, which may specify measurement prior to purging or prior to sampling. Groundwater level should be measured to a specific accuracy (such as 0.05 ft) and with a specific precision (such as 0.01 ft). Measurements should be made relative to a known, surveyed datum. Measurements are taken with a steel tape or an electronic device such as a manometer or acoustical sounder. Some wells have a pressure transducer installed so water levels can be obtained more easily.
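Because readings are taken as depths below a surveyed datum, the elevation stored in the database is usually derived by subtraction. A minimal illustration, with hypothetical values:

    # Convert a depth-to-water reading to a groundwater elevation.
    def water_elevation(datum_elevation_ft, depth_to_water_ft):
        return round(datum_elevation_ft - depth_to_water_ft, 2)  # 0.01 ft precision

    print(water_elevation(5280.42, 23.17))   # -> 5257.25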
Figure 53 - Multi-parameter field meter (Courtesy of Geotech Environmental Equipment)
Some wells contain immiscible fluid layers, called non-aqueous phase liquids (NAPLs). There can be up to three layers, which include the water and lighter and heavier fluids. The lighter fluids, called light non-aqueous phase liquids (LNAPLs) or floaters, accumulate above the water. The heavier fluids, called dense non-aqueous phase liquids (DNAPLs) or sinkers, accumulate below the water. For example, LNAPLs like gasoline float on water, while DNAPLs such as chlorinated hydrocarbons (TCE, TCA, and PCE) sink. NAPLs can have their own flow regime in the subsurface separate from the groundwater. The amount of these fluids should be measured separately, and the fluids collected, prior to purging. For information on measurement of DNAPL, see Sara (1994, p. 10-75).
Purging is done to remove stagnant water in the casing and filter pack so that the water sampled is “fresh” formation water. A certain number of water column volumes (such as three) are purged, and temperature, pH, and conductivity must be monitored during purging to ensure that these parameters have stabilized prior to sampling. Upgradient and background wells should be purged and sampled before downgradient wells to reduce the risk of cross-contamination. Information concerning well purging should be documented in the Field Sampling Log. Sampling should be done within a specific time period (such as three hours) of purging, if recharge is sufficient, otherwise as soon as recharge allows. The construction materials of the sampling equipment should be compatible with known and suspected contaminants. Groundwater sampling is done using various types of pumps including bladder, gear, submersible rotor, centrifugal, suction, or inertial lift; or with a bailer on a rope. Pumping is usually preferred over bailing because it takes less effort and causes less disturbance in the wellbore. An example of a submersible pump is shown in Figure 52.
Field measurements should be taken at the time of sampling. These measurements, such as temperature, pH, and specific conductance, should be taken before and after the sample is collected to check on the stability of the water during sampling. Figure 53 shows a field meter for taking these measurements. The field data (also known as the field parameters) is entered on the COC, and should be entered into the EDMS along with the laboratory analysis data. Sometimes the laboratory will enter this data for the client and deliver it with the electronic data deliverable.
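The purge-volume arithmetic implied above is straightforward; a minimal sketch with assumed well dimensions:

    import math

    # Purge volume = n casing volumes of the standing water column.
    def purge_volume_gal(casing_diameter_in, water_column_ft, volumes=3):
        radius_ft = (casing_diameter_in / 12.0) / 2.0
        one_volume_cuft = math.pi * radius_ft**2 * water_column_ft
        return volumes * one_volume_cuft * 7.48    # cubic feet to gallons

    # 2-inch casing with 30 ft of standing water -> about 14.7 gallons to purge.
    print(round(purge_volume_gal(2.0, 30.0), 1))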
At all stages of the sampling process it is critical that contamination of the samples be prevented. Contamination can be minimized by properly storing and transporting sampling equipment, keeping equipment and bottles away from sources of contamination, using clean hands and gloves to handle equipment and bottles, and carefully cleaning the purging and sampling equipment after use. If sampling is for VOAs (volatile organic analysis), then equipment or processes that can agitate and potentially volatilize samples should be avoided. Sampling methods such as bottom-filling bailers of stainless steel or Teflon and/or Teflon bladder pumps should be used.
Powell and Puls (1997) have expressed a concern that traditional groundwater sampling techniques, which are largely based on methods developed for water supply investigations, may not correctly represent the true values or extent of a plume. For example, the turbidity of a sample is often related to the concentration of constituents measured in the sample, and sometimes this may be due to sampling methods that cause turbulence during sampling, resulting in high concentrations not representative of in-situ conditions. Filtering the sample can help with this, but a better approach may be to use sampling techniques that cause less disturbance of materials in the wellbore. Small diameter wells, short screened intervals, careful device insertion (or the use of permanently installed devices), and low pump rates (also known as low-flow samples) are examples of techniques that may lead to more representative samples.
Preservation and handling of the samples is critical for obtaining reliable analytical results. Groundwater samples are usually treated with a preservative such as nitric, sulfuric, or hydrochloric acid or sodium hydroxide (depending on the parameter) to stabilize the analytes, and then cooled (typically to 4°C) and shipped to the laboratory. The shipping method is important because the analyses must be performed within a certain period (holding time) after sampling. The preservation and shipping process varies for different groups of analytes. See Appendix D for more information about this.
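Holding-time compliance is easy to check automatically once sample and analysis dates are in the database. A minimal sketch; the holding times shown are illustrative, and the governing values come from the analytical method and the project plan:

    from datetime import datetime, timedelta

    holding_times = {"VOCs": timedelta(days=14), "mercury": timedelta(days=28)}

    def within_holding_time(parameter, sampled, analyzed):
        return analyzed - sampled <= holding_times[parameter]

    sampled = datetime(2001, 6, 4, 10, 30)
    analyzed = datetime(2001, 6, 15, 9, 0)
    print(within_holding_time("VOCs", sampled, analyzed))   # True (11 days)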
Groundwater data issues
The sample is taken and often some parameters are measured in the field such as temperature, pH, and turbidity. Then the sample is sent to the laboratory for analysis. When the field and laboratory data are sent to the data administrator, the software should help the data administrator tie the field data to the laboratory data for each sampling event. Key data fields for groundwater data include the site and station name, the sample date and perhaps time, COC and other field sample identification numbers, how the sample was taken and who took it, transportation information, and any sample QC information such as duplicate and other designations. All of this data should be entered into the EDMS.
SURFACE WATER
Surface water samples have no purging requirements, but are otherwise sampled and transferred the same as groundwater samples. Surface water samples may be easier to acquire, so they may be taken more often than groundwater samples.
Surface water sampling issues
Surface water samples can be taken either at a specific map location, or at an appropriate location and depth. The location of the sample should be identified on a site map and described in the field logbook. Samples should progress from areas of least contamination to worst contamination and generally from upstream to downstream. The sample container should be submerged with the mouth facing upstream (to prevent bubbles in the sample), and sample
information should be recorded in the field logbook and any other appropriate forms. The devices used and the process for verifying depth and depth control tolerance should be as specified in the project work plan.
Surface water data issues
The data requirements for surface water samples are similar to groundwater samples. For samples taken in tidal areas, the status of the tide (high or low) should be noted.
DECONTAMINATION OF EQUIPMENT
Equipment must be decontaminated prior to use and re-use. The standard operating procedure for decontamination should be in the project work plan. The decontamination process is usually different for different equipment. The following are examples of equipment decontamination procedures (DOE/HWP 1990a). For nonsampling equipment such as rigs, backhoes, augers, drill pipe, casing, and screen, decontaminate with high pressure steam, and if necessary scrub with laboratory-grade detergent and rinse with tap water. For sampling equipment used in inorganic sampling, scrub with laboratory-grade detergent, rinse with tap water, rinse with ASTM Type II water, air-dry, and cover with plastic sheeting. For sampling equipment used in organic sampling, scrub with laboratory-grade detergent, rinse with tap water, rinse with ASTM Type II water, rinse with methanol (followed by a hexane rinse if testing for pesticides, PCBs, or fuels), air-dry, and wrap with aluminum foil.
SHIPPING OF SAMPLES
Samples should be shipped in insulated carriers with either freezer packs (“blue ice”) or wet ice. If wet ice is used, it should be placed in leak-proof plastic bags. Shipping containers should be secured with nylon reinforced strapping tape. Custody seals should be placed on the containers to verify that samples are not disturbed during transport. Shipping should be via overnight express within 24 hours of collection so the laboratory can meet holding time requirements.
AIR
The data management requirements for air sampling are somewhat different from those of soil and water because the sampling process is quite different. While both types of data can (and often should) be stored in the same database system, different data fields may be used, or the same fields used differently, for the different types of data. As an example, soil samples will have a depth below ground, while air samples may have an elevation above ground. Typical air quality parameters include sulfur dioxide, nitrogen dioxide, carbon monoxide, ozone, and lead. Other constituents of concern can be measured in specific situations. Sources of air pollution include transportation, stationary fuel combustion, industrial processes, solid waste disposal, and others. For an overview of air sampling and analysis, see Patnaik (1997). For details on several air sampling methods, see ASTM (1997).
Air sampling issues
Concentrations of contaminants in air vary greatly from time to time due to weather conditions, topography, and changes in source input. Weather conditions that may be important
include wind conditions, temperature, humidity, barometric pressure, and amount of solar radiation. The taking and analysis of air samples vary widely depending on project requirements. Air samples can represent either outdoor air or indoor air, and can be acquired by a variety of means, both manual and automated. Some samples represent a specific volume of air, while others represent a large volume of air passed through a filter or similar device. Some air measurements are taken by capturing the air in a container for analysis, while others are done without taking samples, using real-time sensors. In all cases, a sampling plan should be established and followed carefully. Physical samples can be taken in a Tedlar bag, metal (Summa) canister, or glass bulb. The air may be concentrated using a cryogenic trap, or compressed using a pump. For organic analysis, adsorbent tubes may be used. Adsorbent materials typically used include activated charcoal, Tenax (a porous polymer), or silica gel. Particulate matter such as dust, silica, metal, and carbon particles is collected using membrane filters. A measured volume of air is pumped through the filter, and the suspended particles are deposited on the filter. For water-soluble analytes such as acid, alkali, and some organic vapors, samples can be taken using an impinger, where the air is bubbled through water, and then the water is analyzed. Toxic gases and vapors can be measured using colorimetric dosimeters, where a tube containing a paper strip or granulated material is exposed to the air on one end, and the gas diffuses into the tube and causes the material to change color. The amount of color change reflects the concentration of the constituent being measured. Automated samples can be taken using ambient air analyzers. Care should be taken that the air being analyzed is representative of the area under investigation, and standards should be used to calibrate the analyzer.
Air data issues
Often air samples are taken at relatively short time intervals, sometimes as short as minutes apart. This results in a large amount of data to store and manipulate, and an increased focus on time information rather than just date information. It also increases the importance of data aggregation and summarization features in the EDMS so that the large volume of data can be presented in an informative way. Key data fields for air data include the site and station name, the sample date and time (or, for a sample composited over time, the start and end dates and times), how the sample was taken and who took it, transportation information if any, and any sample QC information such as duplicate and other designations.
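As a sketch of the kind of aggregation this implies, the following collapses short-interval readings to daily averages; the records and values are hypothetical:

    from collections import defaultdict

    readings = [
        ("2001-07-01 09:00", "ozone", 0.061),
        ("2001-07-01 09:15", "ozone", 0.067),
        ("2001-07-02 09:00", "ozone", 0.055),
    ]

    daily = defaultdict(list)
    for timestamp, parameter, value in readings:
        daily[(timestamp[:10], parameter)].append(value)   # key on the date portion

    for (day, parameter), values in sorted(daily.items()):
        print(day, parameter, round(sum(values) / len(values), 3))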
OTHER MEDIA
The variety of media that can be analyzed to gather environmental information is almost unlimited. This section covers just a few of the possible media. There certainly are many others routinely being analyzed, and more being added over time.
Human tissue and fluids
Exposure to toxic materials can result in the buildup of these materials in the body. Tracking this exposure involves measuring the concentration of these materials in tissue and body fluids. For example, hair samples can provide a recent history of exposure, and blood and urine analyses are widely used to track exposure to metals such as lead, arsenic, and cadmium. Often this type of data is gathered under patient confidentiality rules, and maintaining this confidentiality must be considered in implementing and operating the system for managing the data. Lead exposure in
children (and pregnant and nursing mothers) is of special interest since it appears to be correlated with developmental problems, and monitoring and remediating elevated blood lead is receiving much attention in some communities. The data management system should be capable of managing both the blood lead data and the residential environmental data (soil, paint, water, and dust) for the children. It should also be capable of relating the two even if the blood data is within the patient confidentiality umbrella and the residential environmental data is not.
Organisms
Because each level of the food chain can concentrate pollutants by one or more orders of magnitude, the concentration of toxins in biologic material can be a key indicator of those toxins in the environment. In addition, some organisms themselves can pose a health hazard. Both kinds of information might need to be stored in a database. Sampling procedures vary depending on the size of the organisms and whether they are benthic (living on or attached to the bottom), planktonic (move by floating), or nektonic (move by swimming). For more information, see ASTM (1997).
Residential and workplace media
Increasingly, the environmental quality in homes and offices is becoming a concern. From toxic materials to “sick building syndrome” and infectious diseases like anthrax and Legionnaires’ disease, the quality of the indoor environment is coming into question, and it is logical to track this information in a database. Other environmental issues relate to exposure to toxic materials such as lead. Lead can occur in paint, dust, plumbing, soil, and in household accessories, including such common objects as plastic mini-blinds from the local discount store. Concentration information can be stored in a database, and the EDMS can be used to correlate human exposure information with residential or workplace media information to assist with source identification and remediation.
Plant operating and other data
Often information on plant operations is important in management of the environmental issues at a facility. The relationship between releases and production volume, or chemical composition vs. physical properties, can best be investigated if some plant operating information is captured and stored in the EDMS. One issue to keep in mind is that the volume of this information can be great, and care should be taken to prevent degradation of the performance of the system due to the potentially large volume of this data. Sometimes it makes sense to store true operating data in one database and environmental data in another. However, due to the potential overlap of retrieval requirements, a combined database or duplicated storage for the data with dual uses is sometimes necessary. Also, the reporting of plant operating data and its relationship to environmental factors often involves deriving results through calculations. For example, what is measured may be the volume of effluent and the concentration of some material in the effluent, but perhaps the operating permit sets limits on the total amount of material. The reporting process needs to multiply the volume times the concentration to get the amount, and perhaps scale that to different units. Figure 54 shows a screen for defining calculated parameters.
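A minimal sketch of such a calculated parameter, using the common wastewater convention that pounds = concentration (mg/L) × flow (million gallons) × 8.34, where 8.34 is the weight in pounds of a gallon of water; the input values are hypothetical:

    # Calculated parameter: total mass discharged from concentration and volume.
    def discharge_lbs(concentration_mg_per_l, volume_million_gal):
        return concentration_mg_per_l * volume_million_gal * 8.34

    print(round(discharge_lbs(2.5, 1.2), 2))   # -> 25.02 pounds of material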
OVERVIEW OF PARAMETERS
The EDMS is used in environmental investigation and remediation projects to track and understand the amount and location of hazardous materials in the environment. In addition, other parameters may be tracked to help better understand the amount and distribution of the contaminants.
Figure 54 - Screen for defining calculated parameters
This section briefly covers some of the environmental and related parameters that are likely to be managed in an EDMS. Other sources cover this material in much more detail. Useful reference books include Manahan (2000, 2001), Patnaik (1997), and Weiner (2000). Many Web sites include reference information on parameters. Examples include EPA (2000a), EXTOXNET (2001), Spectrum Labs (2001), NCDWQ (2001), SKC (2001), and Cambridge Software (2001). A Web search will turn up many more. This section covers inorganic parameters, organic parameters, and various other parameters commonly found in an EDMS.
Inorganic parameters
Inorganic compounds include common metals, heavy metals, nutrients, inorganic nonmetals, and radiologic parameters. Common metals include calcium, iron, magnesium, potassium, sodium, and others. These metals are generally not toxic, but can cause a variety of water quality problems if present in large quantities.
Heavy metals include arsenic, cadmium, chromium, lead, mercury, selenium, sulfur, and several others. These metals vary significantly in their toxicity. For example, arsenic is quite poisonous, but sulfur is not. Lead is toxic in large amounts, and in much lower amounts is thought to cause developmental problems in small children. The toxicity of some metals depends on their chemical state. Mercury is much more toxic in organic compounds than as a pure metal. Many of us have chased little balls of mercury around after a thermometer broke, and suffered no ill effects, while some organic mercury compounds are so toxic that one small drop on the skin can cause almost instantaneous death. Hexavalent chromium (Cr6+) is extremely toxic, while trivalent chromium (Cr3+) is much less so. (See the movie Erin Brockovich.)
Nutrients include nitrogen and phosphorus (and the metal potassium). Nitrogen is present as nitrate (NO3-) and nitrite (NO2-). Nitrates can cause problems with drinking water, and phosphorus can pollute surface waters.
Inorganic nonmetals include ammonia, bicarbonate, carbonate, chloride, cyanide, fluoride, iodide, nitrite, nitrate, phosphate, sulfate, and sulfide. With the exception of cyanide, these are not particularly toxic, but can contribute to low water quality if present in sufficient quantities.
Asbestos is an inorganic pollutant that is somewhat different from the others. Most toxic substances are toxic due to their chemical activity. Asbestos is toxic, at least in air, because of its physical properties. The small fibers of asbestos (a silicate mineral) can cause cancer when breathed into the lungs. It has not been established whether it is toxic in drinking water.
Radiologic parameters such as plutonium, radium, thorium, and uranium consist of both natural materials and man-made products. These materials were produced in large quantities for use in weapons and nuclear reactors, and many other uses (for example, lantern mantles and smoke detectors). They can cause health hazards through either chemical or radiologic exposure. Some radioactive materials, such as plutonium, are extremely toxic, while others such as uranium are less so. High levels of radioactivity (alpha and beta particles and gamma rays) can cause acute health problems, while long exposure to lower levels can lead to cancer. Radiologic parameters are differentiated by isotope, identified by mass number, which varies for the same element depending on the number of neutrons in the nucleus of each atom. For example, radium-224 and radium-226 have atomic weights of 224 and 226, respectively, but are the same element and participate in chemical reactions the same way. Different isotopes have different levels of radioactivity and different half-lives (how long it takes half of the material to undergo radioactive decay), so they are often tracked separately.
Inorganic pollutants in air include gaseous oxides such as carbon dioxide, sulfur dioxide, and the oxides of nitrogen, which cause acid rain and may contribute to atmospheric warming (the greenhouse effect). Chlorine atoms in the atmosphere can damage the ozone layer, which protects us from harmful ultraviolet radiation from the sun. Particulate matter is also significant, some of which, for example, sea salt, is natural, but much of which is man-made. Colloidal-sized particles formed by physical processes (dispersion aerosols) or chemical processes (condensation aerosols) can cause smog and health problems if inhaled.
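The half-life arithmetic mentioned above is simple exponential decay; a small sketch:

    # Fraction of a radionuclide remaining after t years, given its half-life.
    def fraction_remaining(t_years, half_life_years):
        return 0.5 ** (t_years / half_life_years)

    # Radium-226 has a half-life of about 1,600 years.
    print(round(fraction_remaining(100, 1600), 3))   # -> 0.958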
Organic parameters
Organic compounds are compounds that contain carbon, usually with hydrogen and often with oxygen. Organics may contain other atoms as well, such as halogens, nitrogen, and phosphorus. They are usually segregated into volatile organic compounds (VOCs) and semivolatile organic compounds (SVOCs). Hydrocarbons, chlorinated hydrocarbons, pesticides, and herbicides are also organic compounds.
The delineation between volatiles and semivolatiles is not as easy as it sounds. SW-846, the guidance document from EPA for analytical methods (EPA, 1980), describes volatile compounds as “compounds which have boiling points below 200°C and that are insoluble or slightly soluble in water.” Other references describe volatiles as those compounds that can be adequately analyzed by a purge and trap procedure. Unfortunately, semivolatiles are described altogether differently. SW-846 describes semivolatiles in their procedures as “neutral, acidic and basic organic compounds that are soluble in methylene chloride and capable of being eluted without derivatization.” No mention is made of the boiling points of semivolatile compounds, although it’s probably implicit that their boiling points are higher than volatile compounds (Roy Widmann, pers. comm., 2001).
VOCs are organic compounds with a high vapor pressure, meaning that they evaporate easily. Examples include benzene, toluene, ethylbenzene, and xylene (collectively referred to as BTEX), acetone, carbon tetrachloride, chloroform, ethylene glycol, and various alcohols. Many of these compounds are used as industrial solvents and cleaning fluids.
SVOCs are organic compounds with a low vapor pressure, so they resist evaporation. They also have a higher boiling point than VOCs, greater than 200°C. Examples include anthracene, dibenzofuran, fluorene (not to be confused with the halogen fluorine), pentachlorophenol (PCP), phenol, polycyclic aromatic hydrocarbons (PAHs), polychlorinated biphenyls (PCBs, Aroclor), and pyrene. Some of these substances are used in manufacture of a wide variety of materials such as plastics and medicine. Others are degradation products resulting from exposure of other organics to the environment.
Halogenated compounds are organic compounds that have one or more of the hydrogens replaced with a halogen like fluorine, chlorine, or bromine. For example, 1,2-dibromoethane has the first and second hydrogen replaced by bromine. One category of halogenated SVOCs, polychlorinated biphenyls, were widely used in industry in applications such as cooling of
transformers until banned by TSCA in 1976. They have high chemical, thermal, and biological stability, which makes them very persistent in the environment.
Hydrocarbons consist of crude oil and various refined products. They have a wide range of physical and chemical properties, and are widely dispersed in the environment. In some situations, hydrocarbons are exempt from hazardous materials regulation. For example, crude oil is not currently considered hazardous if spilled at the wellhead, but it is if spilled during transportation.
Chlorinated hydrocarbons can pose a significant health risk by contaminating drinking water (Cheremisinoff, 2001). Three of these substances, trichloroethylene (TCE), trichloroethane (TCA), and tetrachloroethylene (perchloroethylene or PCE), are widely used industrial solvents, and are highly soluble in water, so a small quantity of material can contaminate a large volume of groundwater.
Pesticides (insecticides) are widely distributed in the environment due to agricultural and other uses, especially since World War II. Some pesticides such as nicotine, rotenone, and pyrethrins are naturally occurring substances, are biodegradable, and pose little pollution risk. Organochlorine insecticides such as DDT, dieldrin, and endrin were widely used in the 1960s, but are for the most part banned due to their toxicity and persistence in the food chain. These have been largely replaced by organophosphates such as malathion, and carbamates such as carbaryl and carbofuran.
Herbicides are also widely used in agriculture, and too often show up in drinking water. There are several groups of herbicides, including bipyridilium compounds (diquat and paraquat), heterocyclic nitrogen compounds (atrazine and metribuzin), chlorophenoxyls (2,4-D and 2,4,5-T), substituted amides (propanil and alachlor), nitroanilines (trifluralin), and others. By-products from the manufacture of pesticides and herbicides and the degradation products of these materials are also significant problems in the environment.
Organic pollutants in the air can be a significant problem, including direct effects such as cancer caused by inhalation of vinyl chloride, or the formation of secondary pollutants such as photochemical smog.
Other parameters
There are a number of other parameters that may be tracked in an EDMS. Some are pollutants, while others describe physical and chemical parameters that help better understand the site geology, chemistry, or engineering. Examples include biologic parameters, field parameters, geophysical measurements, operating parameters, and miscellaneous other parameters.
Biologic parameters (also called microbes, or pathogens if they are toxic) include fungi, protozoa, bacteria, and viruses. Pathogens such as Cryptosporidium parvum and Giardia cause a significant percentage (more than half) of waterborne disease. However, not all microbes are bad. For example, bacteria such as Micrococcus, Pseudomonas, Mycobacterium, and Nocardia can degrade hydrocarbons in the environment, both naturally and with human assistance, as a site cleanup method.
Field parameters fall into two categories. The first is parameters measured at the time that samples, such as water samples, are taken. These include pH, conductivity, turbidity, groundwater elevation, and presence and thickness of sinkers and floaters. In some cases, such as field pH, multiple observations may be taken for each sample, and this must be taken into consideration in the design of the database system and reporting formats. The other category of field parameters is items measured or observed without taking a sample. A variety of chemical and other measurements can be taken in the field, especially for air monitoring, and increasingly for groundwater monitoring, as sensors continue to improve.
Groundwater elevation is a special type of field observation. It can be observed with or without sampling, and obtaining accurate and consistent water level elevations can be very important in managing a groundwater project. This is because many other parameters react in
various ways to the level of the water table. Issues such as the time of year, amount of recent precipitation, tidal influences, and many other factors can influence groundwater elevation. The EDMS should contain sufficient field parameter information to assist with interpretation of the groundwater elevation data.
Geophysical measurements are generally used for site characterization, and can be done either on the surface or in a borehole. For some projects this data may be stored in the EDMS, while for others it is stored outside the database system in specialized geophysical software. Examples of surface geophysical measurements include gravity, resistivity, spontaneous potential (SP), and magnetotellurics. Borehole geophysical measurements include SP, resistivity, density, sonic velocity, and radiologic parameters such as gamma ray and neutron surveys.
Operating parameters describe various aspects of facility operation that might have an influence on environmental issues. Options for storage of this data are discussed in a previous section, and the parameters include production volume, fluid levels, flow rates, and so on. In some cases it is important to track this information along with the chemical data because permits and other regulatory issues may correlate pollutant discharges to production volumes. Managing operating parameters in the EDMS may require that the system be able to display calculated parameters, such as calculating the volume of pollutant discharge by multiplying the concentration times the effluent volume, as shown in Figure 54.
Miscellaneous other parameters include the RCRA characteristic parameters (corrosivity, ignitability, reactivity, and toxicity) as well as other parameters that might be measured in the field or lab such as color, odor, total dissolved solids (TDS), total organic carbon (TOC), and total suspended solids (TSS). In addition, there are many other parameters that might be of importance for a specific project, and any of these could be stored in the EDMS. The design of the EDMS must be flexible enough to handle any parameter necessary for the various projects on which the software is used.
CHAPTER 12 ENVIRONMENTAL LABORATORY ANALYSIS
Once the samples have been gathered, they are usually sent to the laboratory for analysis. How well the laboratory does its work, from sample intake and tracking through the analysis and reporting process, has a major impact on the quality of the resulting data. This chapter discusses the procedures carried out by the laboratory, and some of the major laboratory analytical techniques. A basic understanding of this information can be useful in working effectively with the data that the laboratory generates. The laboratory business is a tough business. In the current market, the amount of analytical work is decreasing, meaning that the laboratories are having to compete more aggressively for what work there is. They have cut their profit margins to the bone, but are still expected to promptly provide quality results. Unfortunately this has caused some laboratories to close, and others to cut corners to the point of criminal activities to make ends meet. Project managers must be ever vigilant to make sure that the laboratories are doing their job with an adequate level of quality. The good news is that there are still many laboratories performing quality work.
LABORATORY WORKFLOW
Because laboratories process a large volume of samples and must maintain a high level of quality, the processes and procedures in the laboratory must be well defined and rigorously followed. Each step must be documented in a permanent record. Steps in the process include sample intake, sample tracking, preparation, analysis, reporting, and quality control.
Sample intake – The laboratory receives the samples from the shipper and updates the chain of custody. The client is notified that the samples have arrived, the shipping container temperature is noted, and the samples are scheduled for analysis.
Sample tracking – It is critical that samples and results be tracked throughout the processes performed in the laboratory. Usually the laboratory uses specialized software called a laboratory information management system or LIMS to assist with sample tracking and reporting.
Sample preparation – For most types of samples and analyses, the samples must be prepared before they can be analyzed. This is discussed in the following section.
Analysis – Hundreds of different analytical techniques are available for analyzing different parameters in various media. The most common techniques are described below.
Reporting – After the analyses have been performed, the results must be output in a format suitable for use by the client and others. An electronic file created by the laboratory for delivery to
the client is called an electronic data deliverable (EDD). Creation of the EDD should be done using the LIMS, but too often there is a data reformatting step, or even worse a manual transcription process, to get some or all of the data into the EDD, and these manipulation steps can be prone to errors.
Quality control – The need to maintain quality permeates the laboratory process. Laboratory quality control is discussed below, and aspects of it are also discussed in Chapter 15. Laboratory QC procedures are determined by the level required for each site. For example, Superfund projects generally require the highest level of QC and the most extensive documentation.
SAMPLE PREPARATION
Most samples must be prepared before they can be analyzed. The preparation process varies depending on the sample matrix, the material to be analyzed, and the analytical method. The most important processes include extraction and cleanup, digestion, leaching, dilution, and filtering. Depending on the sample matrix, other procedures such as grinding and chemical manipulations may be required.
Extraction and cleanup – Organic analytes are extracted to bring them into the appropriate solvent prior to analysis (Patnaik, 1997). The extraction method varies depending on whether the sample is liquid or solid. Extraction techniques for aqueous samples include liquid-liquid (separatory funnel or continuous) and solid-phase. For solid samples, the methods include Soxhlet, supercritical fluid, and sonication. Some extraction processes are repeated multiple times (such as three) to improve the efficiency of extraction. Samples may undergo a cleanup process to improve the analysis process and generate more reliable results. Cleanup methods include acid-base partitioning, alumina column, silica gel, Florisil, gel-permeation, sulfur, and permanganate-sulfuric acid.
Digestion – Samples analyzed for metals are usually digested. The digestion process uses strong acids and heat to increase the precision and accuracy of the measurement by providing a homogeneous solution for analysis by removing metals adsorbed to particles and breaking down metal complexes. Different digestion techniques are used depending on the analytical method and target accuracy levels.
Leaching – Sometimes in addition to the concentration of a toxic substance in a sample, the mobility of the substance is also of concern, especially for material headed for disposal in a landfill. A leach test is used to determine this. Techniques used for this are the toxicity characteristic leaching procedure (TCLP), the synthetic precipitation leaching procedure (SPLP), and the EP toxicity test (EPTOX). In all three methods, fluids are passed through the solid material such as soil, and the quantity of the toxic substance leached by the fluid is measured. TCLP uses a highly buffered and mildly acidic aqueous fluid. In SPLP the fluid is slightly more acidic, varies by geographic area (east or west of the Mississippi River), and is intended to more accurately represent the properties of groundwater. EPTOX takes longer, tests for fewer parameters, and is no longer widely used. The concentration of an analyte after leaching is not comparable to the total concentration in the sample, so leached analyses should be marked as such in the EDMS.
Dilution – Sometimes it is necessary to dilute the sample prior to analysis. Reasons for this include that the concentration of the analyte may be outside the concentration range where the analytical technique is linear, or other substances in the sample may interfere with the analysis (matrix interference). A record of the dilution factor should be kept with the result. Dilution affects the result itself as well as the detection limit for the result (Sara 1994, p. 11-11). For non-detected results, the reported result based on the detection limit will be increased proportionately to the dilution, and this needs to be considered in interpreting the results.
Filtering – The sample may or may not be filtered, either in the field or in the laboratory.
If the sample is not filtered, the resulting measurement is referred to as a total measurement, while if it is filtered, it is considered a dissolved result. For filtered samples, the size of the openings in the
filter (such as 1 micron) should be included with the result. Commonly, once a sample has been filtered it is preserved. This information should also be noted with the result.
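To make the dilution bookkeeping described above concrete, here is a minimal sketch in Python (the field names are hypothetical, not from any particular EDMS) of storing a result with its dilution factor and scaling the reported detection limit accordingly:

    # Sketch of dilution bookkeeping; field names are hypothetical.
    def reported_detection_limit(method_dl, dilution_factor):
        """A non-detect in a diluted sample is reported at the method
        detection limit scaled up by the dilution factor."""
        return method_dl * dilution_factor

    # Example: a 1:50 dilution raises a 0.01 mg/L detection limit to 0.5 mg/L.
    result = {
        "parameter": "Benzene",
        "value": None,            # not detected
        "flag": "u",              # non-detect qualifier
        "method_dl": 0.01,        # mg/L, undiluted
        "dilution_factor": 50,
    }
    result["reported_dl"] = reported_detection_limit(
        result["method_dl"], result["dilution_factor"])
    print(result["reported_dl"])  # 0.5 -- interpret non-detects accordingly

As the sketch suggests, a non-detect from a diluted sample can look alarmingly high if the dilution factor is not carried along with the result.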
ANALYTICAL METHODS
Laboratories use many different methods to analyze for different constituents. Most of the guidance on this comes from the EPA. The main reference for analytical methods in the environmental field is the EPA’s SW-846 (EPA, 1980). SW-846 is the official compendium of analytical and sampling methods that have been evaluated and approved for use in complying with the RCRA regulations. It was first issued in 1980, and is updated regularly. The tables in Appendix D show the recommended analytical methods for various parameters. This section provides a general description of the methods themselves, starting with methods used mostly for inorganic constituents, followed by methods for organic analysis. Additional information on analytical methods can be found in many sources, including EPA (1980, 2000a), Extoxnet (2001), Spectrum Labs (2001), SKC (2001), Cambridge Software (2001), NCDWQ (2001), Scorecard.org (Environmental Defense, 2001), Manahan (2000, 2001), Patnaik (1997), and Weiner (2000).
Inorganic methods
Many different methods are used for analysis of inorganic constituents. The most common include titration, colorimetric, atomic absorption and emission spectrometry, ion-selective electrodes, ion chromatography, transmission electron microscopy, gravimetry, nephelometric, and radiochemical methods.
Titration is one of the oldest and most commonly used of the wet chemistry techniques. It is used to measure hardness, acidity and alkalinity, chemical oxygen demand, non-metals such as chlorine and chloride, iodide, cyanide, nitrogen and ammonia, sulfide and sulfite, and some metals and metal ions such as calcium, magnesium, bromate, and bromide. Titration can be used on wastewater, potable water, and aqueous extracts of soil and other materials. The method works by slowly adding a standard solution of a known concentration to a solution of an unknown concentration until the chemical reaction between the solutions stops, which occurs when the analyte of concern in the second solution has fully reacted. Then the amount of the first solution required to complete the reaction is used to calculate the concentration in the second. The completion of the reaction is monitored using an indicator chemical such as phenolphthalein that changes color when the reaction is complete, or with an electrode and meter. Titration methods commonly used in environmental analyses include acid-base, redox, iodometric, argentometric, and complexometric. Titration is relatively easy and quick to perform, but other techniques often have lower detection limits, and so are more useful for environmental analyses.
Colorimetric methods are also widely used in environmental analysis. Hardness, alkalinity, chemical oxygen demand, cyanide, chloride, fluoride, ammonia, nitrite, nitrogen and ammonia, phosphorus, phosphate and orthophosphate, silica, sulfate and sulfite, phenolics, most metals, and ozone are among the parameters amenable to colorimetric analysis. For the most part, colorimetric methods are fast and inexpensive. Aqueous substances absorb light at specific wavelengths depending on their physical properties. The amount of monochromatic light absorbed is proportional to the concentration, for relatively low concentrations, according to Beer’s law. First the analyte is extracted, often into an organic solvent, and then a color-forming reagent is added. Filtered light is passed through the solution, and the amount of light transmitted is measured using a photometer. The result is compared to a calibration curve based on standard solutions to derive the concentration of the analyte.
Atomic absorption spectrometry (AA) is a fast and accurate method for determining the concentration of metals in solution. Aqueous and non-aqueous samples are first digested in nitric
acid, sometimes along with other acids, so that all of the metals are in solution as metal nitrate salts. Then a heat source is used to vaporize the sample and convert the metal ions to atoms; light from a monochromatic source is passed through the vapor; and the amount of light absorbed is proportional to the concentration. A photoelectric detector is used to measure the remaining light, and digital processing is used to calculate the concentration. There are several AA methods, including direct (flame), graphite furnace, and platform techniques. Calibration is performed using the method of standard addition, in which standard solutions of various strengths are added to the sample, and the results are used to create a calibration curve. A lower detection limit can be obtained for many metals, such as cadmium, chromium, cobalt, copper, iron, lead, manganese, nickel, silver, and zinc, by using the chelation-extraction method prior to analysis. A chelating agent such as ammonium pyrrolidine dithiocarbamate (APDC) reacts with the metal, and the resulting metal chelate is extracted with methyl isobutyl ketone (MIBK) and then analyzed with AA. Analysis of arsenic and selenium can be enhanced using the hydride generation method, in which the metals in HCl solution are treated with sodium borohydride, then purged with nitrogen or argon and atomized for analysis. Cold vapor is a specialized AA technique for mercury, in which mercury and its salts are converted to mercury nitrate with nitric acid, then reduced to elemental form with stannous chloride. The mercury forms a vapor, which is carried in air into the absorption cell for analysis.
Inductively coupled plasma/atomic emission spectroscopy (ICP/AES) is a technique for simultaneous or sequential multi-element determination of elements in solution, allowing for the analysis of several metals at once. The ICP source is a high-powered radio frequency (RF) generator together with a quartz torch, water-cooled coil, nebulizer, spray chamber, and drain. An argon gas stream is ionized in the RF field, which is inductively coupled to the ionized gas by the coil. The sample is converted to an aerosol in the nebulizer and injected into the plasma, where the analytes are ionized. The light emitted by the ions is then analyzed with a polychromatic or scanning monochromatic detector, and the results are compared to a curve based on standards to generate concentrations of the target analytes. Inductively coupled plasma/mass spectrometry (ICP/MS) is similar to ICP/AES except that after the analytes have been ionized, their mass spectra are used to identify and quantify them.
Ion-selective electrodes are useful for analysis of metals and anions, as well as dissolved gases such as oxygen, carbon dioxide, ammonia, and oxides of nitrogen. A sensing electrode specific to the analyte of interest is immersed in the sample solution, resulting in an electrical potential, which is compared to the potential of a reference electrode using a voltmeter. The voltage is proportional to the concentration of the analyte for which the electrode is designed. Solid samples must be extracted before analysis. Calibration is performed using the standard calibration method, standard addition method, or sample addition method.
Ion chromatography is an inorganic technique that can analyze for multiple parameters sequentially in one procedure.
Many common anions, including nitrate, nitrite, phosphate, sulfide, sulfate, fluoride, chloride, bromide, and iodide can be analyzed, as well as oxyhalides such as perchlorate and hypochlorite, weak organic acids, metal ions, and alkyl amines. The analytes are mixed with an elutant and separated chromatographically in an ion exchanger, and then measured with a conductivity detector based on their retention times and peak areas and heights. The samples are compared to calibration standards to calculate the concentrations. The most common elutant is a mixture of sodium carbonate and sodium bicarbonate, but others, such as sodium hydroxide, can be used to analyze for different constituents. In some cases an electrical potential is applied to improve sensitivity and detection levels (auto suppression).
Transmission electron microscopy (TEM) is used for analysis of asbestos. The prepared sample is placed in the transmission electron microscope. An electron beam passes through the sample and generates a pattern based on the crystal structure of the sample. This pattern is then analyzed for the presence of the pattern for asbestos.
Gravimetry can be used for various analytes, including chloride, silica, sulfate, oil and grease, and total dissolved solids. The procedure varies from analyte to analyte, but the general process is to react the analyte of interest with one or more other substances, purify the resulting precipitate, and weigh the remaining material. For oil and grease, the sample is extracted, a known volume of fluid is separated, and the oil component weighed. For TDS, the sample is evaporated and the remaining solid is weighed.
Nephelometric analysis is used to measure turbidity (cloudiness). The sample is placed in a turbidimeter, and the instrument light source illuminates the sample. The intensity of scattered light is measured at right angles (or a range of angles) to the path of the incident light. The system is calibrated with four reference standards and a blank. The scattering of light increases with a greater suspended load. Turbidity is commonly measured in nephelometric turbidity units (NTU), which have replaced the older Jackson turbidity units (JTU).
Radiochemical methods cover a range of different techniques. These include radiochemical methodology (EPA 900 series), alpha spectroscopy (isotopic identification by measurement of alpha particle energy), gamma ray spectrometry (nuclide identification by measurement of gamma ray energy), gross alpha-beta counting (semiquantitative, or quantitative after wet chemical separation), gross alpha by co-precipitation, extraction chromatography, chelating resin, liquid scintillation counting, neutron activation followed by delayed neutron counting, electret ionization chambers, alpha track detectors, the radon emanation technique, and fluorometric methodology.
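Several of the methods above (colorimetric, AA, ICP/AES, ion chromatography) share the same quantitation step: an instrument response is compared to a calibration curve built from standard solutions. A minimal sketch of that calculation in Python, assuming a linear response (as Beer’s law predicts at low concentrations) and using illustrative, made-up standard values:

    import numpy as np

    # Hypothetical calibration standards: concentration (mg/L) vs. response.
    std_conc = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
    std_resp = np.array([0.002, 0.051, 0.103, 0.248, 0.497])

    # Fit a straight line to the standards (response vs. concentration).
    slope, intercept = np.polyfit(std_conc, std_resp, 1)

    def concentration(response):
        """Invert the calibration curve: instrument response back to concentration."""
        return (response - intercept) / slope

    print(concentration(0.150))  # sample response -> about 3.0 mg/L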
Organic methods
Analytical methods for organic constituents include gas chromatography, gas chromatography/mass spectrometry, high performance liquid chromatography, enzyme immunoassay, and infrared spectroscopy.
Gas chromatography (GC) is the most widely used technique for determining organics in environmental samples. Many EPA organic methods employ this technique. The method can be optimized for compounds with different physical and chemical properties by selecting from various columns and detectors. The sample is first concentrated using a purge and trap technique. Then it is passed through a capillary (or, less commonly, packed) column. The capillary column is made of fused silica, glass, or stainless steel of various inside diameters. The smaller the diameter, the higher the resolution, but the smaller the sample size. Also, the longer the column, the higher the resolution. Then one of a number of different types of detectors is used to determine the concentration. General-purpose detectors include flame ionization detectors (FID) and thermal conductivity detectors (TCD). Other detectors are specific to certain analytes. Electron capture detectors (ECD) and Hall electrolytic conductivity detectors (HECD) are specific to halogen compounds. Nitrogen-phosphorus detectors (NPD) can be used for nitrogen-containing organics or organophosphorus compounds, depending on the detection mode used. Flame photometric detectors (FPD) are used for sulfur-containing organics, and less commonly for phosphorus compounds. Photoionization detectors (PID) are used for substances containing the carbon-carbon double bond, such as aromatics and olefins. The equipment must be calibrated prior to running samples, using either an external standard or an internal standard. The external standard method uses just the standard solution, while the internal standard method uses one or more standard solutions added to equal volumes of sample extracts and calibration standards. Internal standards are more reliable, but require more effort.
Gas chromatography/mass spectrometry (GC/MS) is one of the most versatile techniques for analysis of organics. Examples of GC/MS methods include the EPA methods 624, 8240, and 8260 for volatiles; and 625 and 8270 for semivolatiles. The analysis process is similar for volatiles and semivolatiles, but the sample extraction and concentration steps are different. The analytical approach is to use chromatography to separate the components of a mixture, then mass spectrometry to identify the compounds. The sample is concentrated using purge and trap or thermal desorption. Next, the
chromatograph column is used as above to separate the compounds. Then the components are eluted from the column and ionized using electron-impact or chemical ionization. Finally, the mass spectra of the molecules are used to identify the compounds, based on their primary and secondary ions and retention times.
High performance liquid chromatography (HPLC) can be used to analyze more than two-thirds of all organic compounds. It is widely used in the chemical industry, but is relatively new to environmental analyses, and just a few EPA methods, such as some methods for pesticides and PAHs, employ it. The chromatograph equipment consists of a constant-flow pump, a high-pressure injection valve, the chromatograph column, a detector, and a chart recorder or digital interface to gather the data. A mobile liquid phase transports the sample through the column, where individual compounds are separated when they are selectively retained on a stationary liquid phase that is bonded to a support. There are several types of detectors, including ultraviolet, fluorescence, conductivity, and electrochemical. Calibration standards at various concentrations are compared against the data for the samples to determine the concentration.
Enzyme immunoassay analysis can be used to screen for various constituents such as pesticides, herbicides, PAHs, PCBs, PCP, nitro-organics, and other compounds. In this method, polyclonal antibodies specific to the desired analyte bind to the analyte in an analysis tube, competing with an analyte-enzyme conjugate also added to the tube, resulting in a color change of the material on the wall of the tube. The color change is inversely proportional to the concentration of the analyte due to competition with the conjugate. The color change can be compared visually to standards for qualitative determination, or analyzed with a spectrometer, which makes the result semiquantitative. Unlike most of the techniques described here, this one is suitable for use in the field as well as in the laboratory.
Infrared spectroscopy (IR) is used to analyze total petroleum hydrocarbons (TPH). The samples are extracted with a fluorocarbon solvent, dried with anhydrous Na2SO4, then analyzed with the IR spectrometer.
RCRA characterization
The Code of Federal Regulations, Title 40, Section 261.20 provides a definition of a waste as hazardous based on four criteria: corrosivity, ignitability, reactivity, and toxicity.
Corrosivity – A waste is considered corrosive if it is aqueous and has a pH less than or equal to 2 or greater than or equal to 12.5 measured with a pH meter, or has a corrosivity to steel of more than 6.35 mm/yr.
Ignitability – A liquid waste is ignitable if its flash point is less than 60°C; non-liquid wastes are ignitable if they are capable of causing fire through friction, absorption of moisture, or spontaneous chemical change, or if their vapors are likely to ignite in the presence of ignition sources.
Reactivity – A waste is reactive if it is normally unstable and readily undergoes violent change, reacts violently with water, or forms potentially explosive mixtures or toxic gases, vapors, or fumes when combined with water. It is also reactive if it is a cyanide- or sulfide-bearing waste that can generate toxic gases, vapors, or fumes when exposed to pH conditions between 2 and 12.5.
Toxicity – A poisonous or hazardous substance is referred to as toxic. The toxicity of a waste is determined by the leaching methods described above.
In addition, the EPA lists over 450 listed wastes, which are specific substances or classes of substances known to be hazardous.
Air analysis
Air analysis requires different sample preparation and may require different analysis methods from those used for soil and water. The material for analysis is received in sorbent tubes, Tedlar bags, or Summa canisters. The samples can then be analyzed using specialized equipment, often with a pre-concentration step, or using standard analytical techniques such as GC/FID or GC/MS.
Different techniques are used for different analytes such as halogenated organics (GC with ECD), phosphorus and sulfur (FPD), nitrogen (NPD), aromatics and olefins (PID), and light hydrocarbons such as methane, ethane, and ethylene (TCD). Other compounds can be measured with HPLC, IR, UV or visible spectrophotometry, and gravimetry.
OTHER ANALYSIS ISSUES
There are a number of other issues related to laboratory analysis that can impact the storage and analysis of the data.
Levels of analysis
The EPA has defined five levels for laboratory analyses. The levels, also called Data Quality Objective (DQO) levels, are based on the type of site being investigated, the level of accuracy and precision required, and the intended use of the data. The following table, based on DOE/HWP (1990b), lists these five levels:

Level I
  Analysis example: Qualitative or semiquantitative analysis; indicator parameters; immediate response in the field.
  Typical data use: Site characterization; monitoring during implementation; field screening.

Level II
  Analysis example: Semiquantitative or quantitative analysis; compound specific; rapid turnaround in the field.
  Typical data use: Site characterization; evaluation of alternatives; engineering design; monitoring during implementation; field screening.

Level III
  Analysis example: Quantitative analysis; technically defensible data; sites near populated areas; major sites.
  Typical data use: Risk assessment; site characterization; evaluation of alternatives; engineering design; monitoring during implementation.

Level IV
  Analysis example: Quantitative analysis; legally defensible data; National Priorities List sites.
  Typical data use: Risk assessment; site characterization; evaluation of alternatives; engineering design.

Level V
  Analysis example: Qualitative to quantitative analysis; method specific; unique matrices (e.g., pure water, biota, explosives, etc.).
  Typical data use: Risk assessment; evaluation of alternatives; engineering design.
Holding times
Some analytes can degrade (change in concentration) after the sample is taken. For that reason, the analysis must be performed within a certain time period after sampling. This time is referred to as the holding time. Meeting holding time requirements is an important component of laboratory data quality. Some analytes have a holding time from sampling to analysis, while others will have one holding time before extraction and another before analysis. Samples for which holding time requirements are not met should be flagged, and in some cases the station must be re-sampled, depending on project requirements. Holding times for some common analytes are listed in Appendix D.
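A holding-time check is easy to automate during import. A minimal sketch in Python, with a hypothetical holding-time table (the actual limits come from the analytical method and the project plan, such as those listed in Appendix D):

    from datetime import datetime, timedelta

    # Illustrative holding times from sampling to analysis.
    HOLDING_TIMES = {
        "Mercury": timedelta(days=28),
        "Nitrate": timedelta(hours=48),
    }

    def check_holding_time(parameter, sample_date, analysis_date):
        """Return a qualifier flag if the holding time was exceeded."""
        limit = HOLDING_TIMES.get(parameter)
        if limit is not None and analysis_date - sample_date > limit:
            return "H"  # hypothetical "holding time exceeded" flag
        return None

    flag = check_holding_time("Nitrate",
                              datetime(2001, 8, 1, 10, 30),
                              datetime(2001, 8, 6, 9, 0))
    print(flag)  # "H" -- flag the result, and consider re-sampling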
Detection limits
Analytical methods cannot analyze infinitely small amounts of target parameters. For each method, analyte, and even each specific sample, the lowest amount that can be detected will vary, and this is called the detection limit. There are actually a number of different detection limits determined in different ways (Core Labs, 1996) that overlap somewhat in meaning. The detection limit (DL) in general means that the concentration is distinctly detectable above the concentration of a blank. The limit of detection (LOD) is the lowest concentration statistically different from the blank. The instrument detection limit (IDL) represents the smallest measurable signal above background noise. The method detection limit (MDL) is the minimum concentration detectable with 99% confidence. The reliable detection limit (RDL) is the lowest level for reliable decisions. The limit of quantitation (LOQ) is the level above which quantitative results have a specified degree of confidence. The reliable quantitation limit (RQL) is the lowest level for quantitative decisions. The practical quantitation limit (PQL) is the lowest level that can be reliably determined within specific limits of precision and accuracy. The contract required quantitation limit (CRQL) or the reporting limit (RL) is the level at which the laboratory routinely reports analytical results, and the contract required detection limit (CRDL) is the detection limit required by the laboratory contract. Some limits, such as the MDL, PQL, and RL, are used regularly, while others are used less frequently. The limits to be used for each project are defined in the project work plan.
The laboratory will report one or more detection limits, depending on project requirements. When an analyte is not detected in a sample, there is really no value for that analyte to be reported, so the lab will report the detection limit and provide a flag indicating that the analyte was not detected. Sometimes the detection limit will also be placed in the value field. When you want to use the data for a non-detected analyte, you will need to decide how to display the result, or how to use it numerically for graphing, statistics, mapping, and so on. For reporting, the best approach is usually to display the detection limit and the flag, such as “.01 u,” or the detection limit with a “less than” sign in front of it, such as “< .01.” To use the number numerically, it is common to use one-half of the detection limit, although other multipliers are used in some situations. When computing averages and other statistical measures, it is particularly important to be aware of how non-detected results are handled.
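The half-detection-limit convention is simple to apply when retrieving data for statistics. A sketch in Python (the one-half multiplier is a common convention, not a rule; the project plan governs):

    def numeric_value(value, detection_limit, detected, multiplier=0.5):
        """Return a number usable for statistics: the measured value if
        detected, otherwise a fraction of the detection limit."""
        return value if detected else multiplier * detection_limit

    # (value, detection limit, detected?) for three results
    results = [(1.2, 0.01, True), (None, 0.01, False), (0.8, 0.01, True)]
    values = [numeric_value(v, dl, det) for v, dl, det in results]
    print(sum(values) / len(values))  # mean with non-detects at DL/2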
Significant figures
Significant figures, also called significant digits, should reflect the precision of the analytical method used (Sara, 1994, p. 11-13), and should be maintained through data reporting and storage. Unfortunately, some software, especially Microsoft programs such as Access and Excel, doesn’t maintain trailing zeros, making it difficult to preserve significant digits. In this case, the number of significant figures or decimal places should be stored in a separate field. If you display fewer digits than were measured, then you are rounding the number. This is most obvious if the digits you are dropping are not zero, such as rounding 3.21 to 3.2, but it is equally true if you are dropping zeros. Sometimes this is appropriate and sometimes it is not. Samuel Johnson, the English writer of the 1700s, said, “Round numbers are always false.”
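One workaround, as suggested above, is to store the digit count in its own field and apply it only at display time. A sketch in Python, with hypothetical field names:

    import math

    def display_value(value, sig_figs):
        """Format a value with a fixed number of significant figures,
        keeping trailing zeros (which general-purpose numeric
        formatting tends to strip)."""
        if value == 0:
            return f"{0:.{max(sig_figs - 1, 0)}f}"
        exponent = math.floor(math.log10(abs(value)))
        ndigits = sig_figs - 1 - exponent   # decimal places needed
        return f"{round(value, ndigits):.{max(ndigits, 0)}f}"

    # The lab reported 1.20 (three significant figures); the database
    # stores 1.2, so the digit count is kept in a separate field.
    row = {"value": 1.2, "sig_figs": 3}
    print(display_value(row["value"], row["sig_figs"]))  # "1.20"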
Data qualifiers
When the laboratory encounters a problem with an analysis, it should flag or qualify the data so that users of the data are aware of the problems. Different agencies require different flagging schemes, which can conflict with one another, and this can cause problems for managing the data. Some typical qualifier flags are shown in the following table:

Code  Flag
*     Surrogate outside QC limits
a     Not available
b     Analyte detected in blank and sample
c     Coelute
d     Diluted
e     Exceeds calibration range
f     Calculated from higher dilution
g     Concentration > value reported
h     Result reported elsewhere
i     Insufficient sample
j     Est. value; conc. < quan. limit
l     Less than detection limit
m     Matrix interference
n     Not measured
q     Uncertain value
r     Unusable data
s     Surrogate
t     Trace amount
u     Not detected
v     Detected value
w     Between CRDL/IDL
x     Determined by associated method
z     Unknown
Reporting units
The units of measure or reporting units for each analysis are extremely important, because an analytical result without units is pretty much meaningless. The units reported by laboratories for one parameter can change from time to time or from laboratory to laboratory. There aren’t any “standard” units that can be depended upon (with some exceptions, like pH, which is a dimensionless number), so the units must be carried along with the analytical value. The EDMS should provide the capability to convert to consistent units, either during import or, preferably, during data retrieval, since some output such as graphing and statistics requires consistent units. Some types of parameters have specific issues with reporting units. For example, the amount of a radioactive constituent in a sample can be reported in either activity (amount of radioactivity from the substance) or concentration (amount of the substance by weight), and determining the conversion between them involves factors, such as the isotope ratios, which are usually site-specific. Data managers should take great care in preserving reporting units, and if conversions are performed, the conversions should be documented.
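A sketch in Python of retrieval-time conversion using a simple factor table (the factors shown are for mass-per-volume water units; in practice the table would live in the database):

    # Conversion factors to a common unit (mg/L) for water results.
    TO_MG_L = {"mg/L": 1.0, "ug/L": 0.001, "ng/L": 0.000001}

    def convert(value, unit, target="mg/L"):
        """Convert a result at retrieval time, leaving the stored
        value and reporting unit untouched."""
        return value * TO_MG_L[unit] / TO_MG_L[target]

    print(convert(250.0, "ug/L"))  # 0.25 mg/L, ready for graphing or statistics

Converting on retrieval rather than on import leaves the values exactly as the laboratory reported them, a point discussed further in Chapter 13.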
Laboratory quality control
Laboratories operate under a Laboratory Quality Assurance Plan (LQAP), which they submit to their regulators, usually through the contractor operating the project. This plan includes policies and procedures, some of which are required by regulations, while others are determined by the laboratory. Some of the elements of laboratory quality control (QC) are described in Chapter 15. The laboratories are audited on some regular basis, such as annually, to ensure that they are conforming to the LQAP.
Edwards and Mills (2000) have identified a number of problems with laboratory generation and delivery of electronic data. One issue is systematic problems in the laboratory, such as misunderstanding of analysis criteria, scheduling and resource difficulties, and other internal process problems. Another problem area is that laboratories usually generate two deliverables, a hard copy and an electronic file. The hard copy is usually generated by the LIMS, while the EDD may be generated using other methods as described above, leading to errors. A third problem area is the lack of standardization in data delivery. No standard EDD format has been accepted across the industry (despite some vendor claims to the contrary), so a lot of effort is expended in
satisfying various custom format descriptions, which can result in poor data quality when the transfer specification has not been accurately met. Finally, lack of universal access to digital data in many organizations can lead to delays in data access and delivery, which can result in project delays and staff frustration. A centralized EDMS with universal data access is the answer to this, but many organizations have not yet implemented this type of system. Sara (1994, p. 11-8) has pointed out that many laboratories have a problem with persistent contamination by low levels of organic parameters such as methylene chloride and acetone. These are common laboratory chemicals that are used in the extraction process, and can show up in analytical results as low background levels. The laboratory might subtract these values from the reported results, or report the values without correction, perhaps with a comment about possible contamination. Users of the data should be made aware of these issues before they try to interpret the data for these parameters.
PART FOUR - MAINTAINING THE DATA
CHAPTER 13 IMPORTING DATA
The most important component of any data management system is the data in it. Manual and automated entry, importing, and careful checking are critical components in ensuring that the data in the system can be trusted, at least to the level of its intended use. For many data management projects, the bulk of the work is finding, organizing, and inputting the data, and then keeping up with importing new data as it comes along. The cost of implementing the technology to store the data should be secondary. The EDMS can be a great time-saver, and should more than pay for its cost in the time saved and greater quality achieved using an organized database system. The time savings and quality improvement will be much greater if the EDMS facilitates efficient data importing and checking.
MANUAL ENTRY
Sometimes there is no way to get data into the system other than transcribing it from hard copy, usually by typing it in. This process is slow and error prone, but if it’s the only way, and if the data is important enough to justify it, then it must be done. The challenge is to do the entry cost-effectively while maintaining a sufficient level of data quality.
Historical entry
Often the bulk of manual entry is for historical data. Usually this is data in hard-copy files. It can be found in old laboratory reports, reports submitted to regulators, and many other places.
DATA SELECTION - WHAT’S REALLY IMPORTANT?
Before embarking on a manual entry project, it is important to place a value on the data to be entered. The importance of the data and the cost to enter it must be balanced. It is not unusual for a data entry project for a large site, where an effort is made to locate and input a comprehensive set of data for the life of the facility, to cost tens or hundreds of thousands of dollars. The decision to proceed should not be taken lightly.
LOCATING AND ORGANIZING DATA
The next step, and often the most difficult, is to find the data. This is often complicated by the fact that over time many different people or even different organizations may have worked on the
project, and the data may be scattered across many different locations. It may even be difficult to locate people who know or can find out what happened in the past. It is important to locate as much of this historical data as possible, and then the portion selected as described in the previous section can be inventoried and input. Once the data has been found, it should be inventoried. On small projects this can be done in word processor or spreadsheet files. For larger projects it is appropriate to build a database just to track documents and other items containing the data, or include this information in the EDMS. Either way, a list should be made of all of the data that might be entered. This list should be updated as decisions are made about what data is to be entered, and then updated again as the data is entered and checked. If the data inventory is stored in the EDMS, it should be set up so that after the data is imported it can be tracked back to the original source documents to help answer questions about the origin of the data.
TOOLS TO HELP WITH CORRECT ENTRY
There are a number of ways to enter the data, and these options provide various levels of assistance in getting clean data into the system.
Entry and review process – Probably the most common approach used in the environmental industry is manual entry followed by visual review. In this process, someone types in the data, and then it is printed out in a format similar to the one that was used for input. Then a second person compares every piece of data between the two pieces of paper, and marks any inconsistencies. These are then remedied in the database, and the corrections checked. The end result, if done conscientiously, is reliable data. The process is tedious for those involved, and care should be taken that those doing it keep up their attention to detail, or quality goes down. Often it is best to mix this work with other work, since it is hard to do this accurately for days on end. Some people are better at it than others, and some like it more than others. (Most don’t like it very much.)
Double entry – Another approach is to have the data entered twice, by two different people, and then have special software compare the two copies. Data that does not match is then entered again. This technique is not as widely used in the environmental industry as the previous one, perhaps because existing EDMS software does not make it easy to do, and maybe also because the human checking in the previous approach sounds more reliable. A sketch of the comparison step appears after this list.
Scanning and OCR – Hardware and software are widely available to scan hard copy documents into digital format and then convert them into editable text using optical character recognition (OCR). The tools to do this have improved immensely over the last few years, such that error rates are down to just a few errors per page. Unfortunately, the highest error rates are with older documents and with numbers, both of which are important in historical entry of environmental data. Also, because the formats of old documents are widely variable, it is difficult to fit the data into a database structure after it has been scanned. These problems are most likely to be overcome, from the point of view of environmental data entry, when there is a large amount of data in a consistent format, with the pages in good condition. Unless you have this situation, scanning probably won’t work. However, this approach has been known to work on some projects. After scanning, a checking step is required to maintain quality.
Voice entry – As with scanning, voice recognition has taken great strides in recent years. Systems are available that do a reasonable job of converting a continuous stream of spoken words into a word processing document. Voice recognition is also starting to be used for on-screen navigation, especially for the handicapped. It is probably too soon to tell whether this technology will have a large impact on data entry.
Offshore entry – There are a number of organizations in countries outside the United States, especially Mexico and India, that specialize in high-volume data entry. They have been very successful in some industries, such as processing loan applications. Again, the availability of a large number of documents in the same format seems to be the key to success in this approach, and a post-entry checking step is required.
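Here is the double-entry comparison sketch mentioned above, in Python, with hypothetical record keys (real software would also handle records present in one copy but not the other):

    def compare_entries(copy_a, copy_b):
        """Return (key, field, value_a, value_b) for every mismatch
        between two independently entered copies of the same records."""
        mismatches = []
        for key, record_a in copy_a.items():
            record_b = copy_b.get(key, {})
            for field, value_a in record_a.items():
                value_b = record_b.get(field)
                if value_a != value_b:
                    mismatches.append((key, field, value_a, value_b))
        return mismatches

    a = {("MW-1", "8/1/2000", "pH"): {"value": 7.2}}
    b = {("MW-1", "8/1/2000", "pH"): {"value": 2.7}}
    print(compare_entries(a, b))  # the transposition is caught for re-entry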
Figure 55 - Form entry of analysis data
Form entry vs. spreadsheet entry – EDMS programs usually provide a form-based system for entering data, and the form usually has fields for all the data at each level, such as site, station, sample, and analysis. Figure 55 shows an example of this type of form. This is usually best for entering a small amount of data. For larger data entry projects, it may be useful to make a customized form that matches the source documents to simplify input. Another common approach is to enter the data into a spreadsheet, and then use the import tool of the EDMS to check and import the data. Figure 56 shows this approach. This has two benefits. The EDMS may have better data checking and cleanup tools as part of the import process than it does for form entry. Also, the person entering the data into the spreadsheet doesn’t necessarily need a license for the EDMS software, which can save the project money. Sometimes it is helpful to create spreadsheet templates, filling in things like station names, dates, and parameter lists using cut and paste in one step, and then entering the results in a second step.
Ongoing entry
There may be situations where data needs to be manually entered on an ongoing basis. This is becoming less common as most sources of data involve a computerized step, so there is usually a way to import the data electronically. If not, the approaches described above can be used.
ELECTRONIC IMPORT
The majority of data placed into the EDMS is usually in digital format in some form or other before it is brought into the system. The implementers of the system should provide a data transfer standard (DTS) so that the electronic data deliverables (EDDs) created by the laboratory for the EDMS contain the appropriate data elements in a format suitable for easy import. An example DTS is shown in Appendix C.
Figure 56 - Spreadsheet entry of analysis data
Automated import routines should be provided in the EDMS so that data in the specified format (or formats if the system supports more than one) can be easily brought into the system and checked for consistency. Data review tracking options and procedures must be provided. In addition, if it is found that a significant amount of digital data exists in other formats, then imports for those formats should be provided. In some cases, importing those files may require operator involvement if, for example, the file is a spreadsheet file of sample and analytical data but does not contain site or station information. These situations usually must be addressed on a case-by-case basis.
Historical entry
Electronic entry of historical data involves several issues, including selecting, locating, and organizing data, and format and content issues.
Data selection, location, and organization – The same issues exist here as in manual input in terms of prioritizing what data will be brought into the EDMS. Then it is necessary to locate and catalog the data, whatever format it is in, such as on a hard drive or on diskettes.
Format issues – Importing historical data in digital format involves figuring out what is in the files and how it is formatted, and then finding a way to import it, either interactively using queries or automatically with a menu-driven system. Most modern data management programs can read a variety of file formats including text files, spreadsheets, word processing documents, and so on. Usually the data needs to be organized and reformatted before it can be merged with other data already in the EDMS. This can be done either in its native format, such as in a spreadsheet, or imported into the database program and organized there. If each file is in a different format, then there can be a big manual component to this. If there are a lot of data files in the same format, it may be possible to automate the process to a large degree.
Content issues – It is very important that the people responsible for importing the data have a detailed understanding of the content of the data being imported. This includes knowing where the data was acquired and when, how it is organized, and other details like detection limits, flags, and units, if they are not in the data files. Great care must be exercised here, because often details like these change over time, often with little or no documentation, and are important in interpreting the data.
Ongoing entry
The EDMS should provide the capability to import analytical data in the format(s) specified in the data transfer standard. This import capability must be robust and complete, and the software and import procedures must address data selection, format, and content issues, and special issues such as field data, along with consistency checking as described in a later section.
Data selection – For current data in a standard format, importing may not be very time-consuming, but it may still be necessary to prioritize data import for various projects. The return on the time invested is the key factor.
Format and content issues – It may be necessary to provide other import formats in addition to those in the data transfer standard. The identification of the need to implement other data formats will be made by project staff members. The content issues for ongoing entry may be less than for historical data, since the people involved in creating the files are more likely to be available to provide guidance, but care must still be taken to understand the data in order to get it in right.
Field data – In the sampling process for environmental data there is often a field component and a laboratory component. More and more the data is being gathered in the field electronically. It is sometimes possible to move this data digitally into the EDMS. Some hard copy information is usually still required, such as a chain of custody to accompany the samples, but this can be generated in the field and printed there. The EDMS needs to be able to associate the field data arriving from one route with the laboratory data from another route so both types of data are assigned to the correct sample.
Understanding duplicated and superseded data
Environmental projects generate duplicated data in a variety of ways. Particular care should be taken with duplicated data at the Samples and Analyses levels. Duplicate samples are usually the result of the quality assurance process, where a certain number of duplicates of various types are taken and analyzed to check the quality of the sampling and analysis processes. QC samples are described in more detail in Chapter 15. A sample can also be reanalyzed, resulting in duplicated results at the Analyses level. These results can be represented in two ways, either as the original result plus the reanalysis, or as a superseded (replaced) original result plus the new, unsuperseded result. The latter is more useful for selection purposes, because the user can easily choose to see just the most current (unsuperseded) data, whereas selecting reanalyzed data is not as helpful because not all samples will have been reanalyzed. Examples of data at these two levels and the various fields that can be involved in the duplications at the levels are shown in Figure 57.
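The selection benefit of the superseded convention shows up clearly in code. A sketch in Python, assuming a simple boolean superseded flag (an actual EDMS, as in Figure 57, may use a counter that is part of the unique index instead):

    # Hypothetical rows: (parameter, value, superseded)
    analyses = [
        ("Naphthalene", 12.0, True),   # original result, later replaced
        ("Naphthalene", 10.5, False),  # reanalysis, current
        ("Field pH", 7.1, False),      # never reanalyzed
    ]

    # Current data is one filter on the superseded flag; selecting
    # "reanalyses" instead would miss samples never reanalyzed.
    current = [row for row in analyses if not row[2]]
    print(current)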
Obtaining clean data from laboratories
Having an accurate, comprehensive, historical database for a facility provides a variety of benefits, but requires that consistency be enforced when data is being added to the database. Matching analytical data coming from laboratories with previous data in a database can be a time-consuming process.
Duplicate - Samples Level

Sample No.  Station*  Sample Date*  Matrix*  Filtered*  Duplicate*  QC Code     Lab ID
1           MW-1      8/1/2000      Water    Total      0           Original    2000-001
2           MW-1      8/1/2000      Water    Total      1           Field Dup.  2000-002
3           MW-1      8/1/2000      Water    Total      2           Split       2000-003

* Unique Index - For Water Samples

Superseded - Analysis Level

Sample No.*  Parameter Name*  Leach Method*  Basis*  Superseded*  Value Code  Dilution Factor  Reportable Result
1            Field pH         None           None    0            None
1            Field pH         None           None    1            None
1            Field pH         None           None    2            None
1            Field pH         None           None    3            None
1            Naphthalene      None           None    0            Original    1                N
1            Naphthalene      None           None    1            DL1         50               Y
1            Naphthalene      None           None    2            DL2         10               N

* Unique Index - For Water Analyses (Sample No. combines Station, Date, Matrix, Filt., Dup.)
Figure 57 - Duplicate and superseded data
Variation in station names, spelling of constituent names, abbreviation of units, and problems with other data elements can result in data that does not tie in with historical data, or, even worse, does not get imported at all because of referential integrity constraints. An alternative is a time-consuming data checking and cleanup process with each data deliverable, which is standard operating procedure for many projects.
WORKING WITH LABS - STANDARDIZING DELIVERABLES
The process of getting the data from the laboratory in a consistent, usable format is a key element of a successful data management system. Appendix C contains a data transfer standard (DTS) that can be used to inform the lab how to deliver data. EDDs should be in the same format every time, with all of the information necessary to successfully import the data into the database and tie it with field samples, if they are already there. Problems with EDDs fall into two general areas: 1) data format problems and 2) data content problems. In addition, if data is gathered in the field (pH, turbidity, water level, etc.) then that data must be tied to the laboratory data once the data administrator has received both data sets.
Data format problems fall into two areas: 1) file format and 2) data organization. The DTS can help with both of these by defining the formats (text file, Excel spreadsheet, etc.) acceptable to the data management system, and the columns of data in the file (data elements, order, width, etc.). Data content problems are more difficult, because they involve consistency between what the lab is generating and what is already in the database. Variation in station names (is it “MW1” or “MW-1”?), spelling of constituent names, abbreviation of units, and problems with other data elements can result in data that does not tie in with historical data. Even worse, the data may not get imported at all because of referential integrity constraints defined in the data management system.
Importing Data
157
Figure 58 - Export laboratory reference file
USING REFERENCE FILES AND A CLOSED-LOOP SYSTEM
While project managers expect their laboratories to provide them with “clean” data, on most projects it is difficult for the laboratory to deliver data that is consistent with data already in the database. What is needed is a way for the project personnel to keep the laboratory updated with information on the various data elements that must be matched in order for the data to import properly. Then the laboratory needs a way to efficiently check its electronic data deliverable (EDD) against this information prior to delivering it to the user. When this is done, project personnel can import the data cleanly, with minimal impact on the data generation process at the laboratory. It is possible to implement a system that cuts the time to import a laboratory deliverable by a factor of five to ten over traditional methods. The process involves a DTS as described in Appendix C to define how the data is to be delivered, and a closed-loop reference file system where the laboratory compares the data it is about to deliver to a reference file provided by the database user.
Users employ their database software to create the reference file. This reference file is then sent to the laboratory. The laboratory prepares the EDD in the usual way, following the DTS, and then uses the database software to do a test import against the reference file. If the EDD imports successfully, the laboratory sends it to the user. If it does not, the laboratory can make changes to the file, test it again, and once successful, send it to the user. Users can then import this file with a minimum of effort because consistency problems have been eliminated before they receive it. This results in significant time-savings over the life of a project.
If the database tracks which laboratories are associated with which sites, then the creation of the reference file can start with selection of the laboratory. An example screen to start the process is shown in Figure 58. In this example, the software knows which sites are associated with the laboratory, and also knows the name to be used for the reference file. The user selects the laboratory, confirms the file name, and clicks on Create File. The file can then be sent to the laboratory via email or on a disk. This process is done any time there are significant changes to the database that might affect the laboratory, such as installation of new stations (for that laboratory’s sites) or changes to the lookup tables.
There are many benefits to having a centralized, open database available to project personnel. In order to have this work effectively the data in the database must be accurate and consistent. Achieving this consistency can be a time-consuming process. By using a comprehensive data transfer standard and the closed-loop system described above, this time can be minimized. In one organization the average time to import a laboratory deliverable was reduced from 30 minutes down to 5 minutes using this process. Another major benefit of this process is higher data quality. This increase in quality comes from two sources. The first is that there will be fewer errors in the data deliverable, and consequently fewer errors in the database, because a whole class of errors
related to data mismatches has been completely eliminated. A second increase in quality is a consequence of the increased efficiency of the import process. The data administrator has more time to scrutinize the data during and after import, making it easier to eliminate many other errors that would have been missed without this scrutiny.
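The laboratory-side test import is conceptually simple. A minimal sketch in Python (the file layouts and column names are illustrative; a real DTS and reference file would specify them):

    import csv

    def test_edd(edd_path, reference_path):
        """Check an EDD's stations, parameters, and units against the
        reference file exported from the user's database."""
        ref = {"station": set(), "parameter": set(), "units": set()}
        with open(reference_path, newline="") as f:
            for row in csv.DictReader(f):   # columns: element, value
                ref[row["element"]].add(row["value"])
        problems = []
        with open(edd_path, newline="") as f:
            for line_no, row in enumerate(csv.DictReader(f), start=2):
                for element in ("station", "parameter", "units"):
                    if row[element] not in ref[element]:
                        problems.append((line_no, element, row[element]))
        return problems  # empty means the EDD is safe to send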
Automated checking
Effective importing of laboratory and other data should include data checking prior to import to identify errors and to assist with the resolution of those errors prior to placing the data in the system. Data checking spans a range of activities from consistency checking through verification and validation. Performing all of the checks won’t ensure that no bad data ever gets into the database, but it will cut down significantly on the number of errors. The verification and validation components are discussed in more detail in Chapter 16. The consistency checks should include evaluation of key data elements, including referential integrity (existence of parents); valid site (project) and station (well); valid parameters, units, and flags; handling of duplicate results (same station, sample date and depth, and parameter); reasonable values for each parameter; comparison with like data; and comparison with previous data.
The software importing the data should perform all of the data checks and report on the results before importing the data. It’s not helpful to have it give up after finding one error, since there may well be more, and it might as well find and flag all of them so you can fix them all at once. Unfortunately, this is not always possible. For example, valid station names are associated with a specific site, so if the site in the import file is wrong, or hasn’t been entered in the sites table, then the program can’t check the station names. Once the program has a valid site, though, it should be able to perform the rest of the checks before stopping. Of course, all of this assumes that the file being imported is in a format that matches what the software is looking for. If the site name is in the column where the result values should be, the import should fail, unless the software is smart enough to straighten it out for you. Figure 59 shows an example of a screen where the user is being asked what software-assisted data checking they want performed, and how to handle specific situations resulting from the checking.
Figure 59 - Screen for software-assisted data checking
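As noted above, the import checker should accumulate every problem it can find rather than stopping at the first error. A sketch of that pattern in Python (the column names and the particular list of checks are illustrative):

    def check_import(rows, valid_stations, valid_params, valid_units):
        """Run all consistency checks and return the full error list,
        so everything can be fixed in one pass."""
        errors = []
        seen = set()
        for line_no, row in enumerate(rows, start=1):
            if row["station"] not in valid_stations:
                errors.append((line_no, "unknown station " + row["station"]))
            if row["parameter"] not in valid_params:
                errors.append((line_no, "unknown parameter " + row["parameter"]))
            if row["units"] not in valid_units:
                errors.append((line_no, "unknown units " + row["units"]))
            key = (row["station"], row["sample_date"], row["parameter"])
            if key in seen:
                errors.append((line_no, "duplicate result for " + str(key)))
            seen.add(key)
        return errors  # the import proceeds only if this comes back empty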
Figure 60 - Screen for editing data prior to import
You might want to look at the data prior to importing it. Figure 60 shows an example of a screen to help you do this. If edits are made to the laboratory deliverable, it is important that a record be kept of these changes for future reference.
REFERENTIAL INTEGRITY CHECKING A properly designed EDMS program based on the relational model should require that a parent entry exist before related child entries can be imported. (Surprisingly, not all do.) This means that a site must exist before stations for that site can be entered, and so on through stations, samples, and analyses. Relationships with lookups should also be enforced, meaning that values related to a lookup, such as sample matrix, must be present and match entries in the lookup table. This helps ensure that “orphan” data does not exist in the tables. Unfortunately, the database system itself, such as Access, usually doesn’t give you much help when referential integrity problems occur. It fails to import the record(s), and provides an error message that may, or may not, give you some useful information about what happened. Usually it is the job of the application software running within the database system to check the data and provide more detailed information about problems.
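The parent-before-child rule is easy to demonstrate with a small sketch using SQLite from Python (the two-table schema is a simplified stand-in for a real EDMS data model):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    con.execute("CREATE TABLE sites (site_id TEXT PRIMARY KEY)")
    con.execute("""CREATE TABLE stations (
        station_id TEXT PRIMARY KEY,
        site_id TEXT NOT NULL REFERENCES sites(site_id))""")

    con.execute("INSERT INTO sites VALUES ('PLANT-A')")
    con.execute("INSERT INTO stations VALUES ('MW-1', 'PLANT-A')")  # parent exists

    try:
        # No such site, so this orphan station is rejected outright.
        con.execute("INSERT INTO stations VALUES ('MW-9', 'PLANT-B')")
    except sqlite3.IntegrityError as e:
        print("rejected:", e)  # FOREIGN KEY constraint failed

As the text notes, the terse error message is all the database engine offers; it is up to the application layer to translate it into something the data administrator can act on.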
CHECKING SITES AND STATIONS
When data is obtained from the lab it must contain information about the sites and samples associated with the data. It is usually not a good idea to add this data to the main data tables automatically based on the lab data file. This is because it is too easy to get bad records in these two tables and then have the data being imported associated with those bad records. In our experience, it is more likely that the lab has misspelled the station name than that you really drilled a new well, although obviously this is not always the case. It is better to enter the sites and stations first, and then associate the samples and analyses with that data during import. Then the import should check to make sure the sites and stations are there, and tell you if they aren’t, so you can do something about it.
On many projects the sample information follows two paths. The samples and field data are gathered in the field. The samples go to the laboratory for analysis, and that data arrives in the electronic data deliverable (EDD) from the laboratory. The field data may arrive directly from the field, or may be input by the laboratory.
Figure 61 - Helper screen for checking station names
If the field data arrives separately from the laboratory data, it can be entered into the EDMS prior to arrival of the EDD from the laboratory. This entry can be done in the field in a portable computer or PDA, in a field office at the site, or in the main office. Then the EDMS needs to be able to associate the field information with the laboratory information when the EDD is imported. Another approach is to enter the sample information prior to the sampling event. Then the EDD can check the field data and laboratory data as it arrives for completeness. The process needs to be flexible enough to accommodate legitimate changes resulting from field activities (well MW1 was dry), but also notify the data administrator of data that should be there but is missing. This checking can be performed on data at both the sample and analyses levels. The screen shown in Figure 61 shows the software helping with the data checking process. The user has imported a laboratory data file that has some problems with station names. The program is showing the names of the stations that don’t match entries already in the database, and providing a list of valid stations to choose from. The user can step through the problem stations, choosing the correct names. If they are able to correctly match all of the stations, the import can proceed. If not, they will need to put this import aside while they research the station names that have problems. The import routine may provide an option to convert the data to consistent units, and this is useful for some projects. For other projects (perhaps most), data is imported as it was reported by the laboratory, and conversion to consistent units is done, if at all, at retrieval time. This is discussed in Chapter 19. The decision about whether to convert to consistent units during import should be made on a project-by-project basis, based on the needs of the data users. In general, if the data will be used entirely for site analysis, it probably makes sense to convert to consistent units so retrieval errors due to mixed units are eliminated. If the data will be used for regulatory and litigation purposes, it is better to import the data as-is, and do conversion on output.
CHECKING PARAMETERS, UNITS, AND FLAGS
After the import routine is happy with the sites and stations in the file, it should check the other data, as much as possible, to try to eliminate inconsistent data. Data in the import file should be compared to lookup tables in the database to weed out errors. Parameter names in particular provide a great opportunity for error, as do reporting units, flags, and other data.
Figure 62 - Screen for entering defaults for required values
The system should provide screens similar to Figure 61 to help fix bad values, and flag records that have issues that can’t be resolved so that they can be researched and fixed. Note that comparing values against existing data like sites and stations, or against lookups, only makes sure that the data makes sense, not that it is really right. A value can pass a comparison test against a lookup and still be wrong. After a successful test of the import file, it is critical that the actual data values be checked to an appropriate level before the data is used. Sometimes the data being imported may not contain all of the data necessary to satisfy referential integrity constraints. For example, historical data being imported may not have information on sample filtration or measurement basis, or even the sample matrix, if all of the data in the file has the same matrix. The records going into the tables need to have values in these fields because of their relationships to the lookup tables, and also so that the data is useful. It is helpful if the software provides a way to set reasonable defaults for these values, as shown in Figure 62, so the data can be imported without a lot of manual editing. Obviously, this feature should be used with care, based on good knowledge of the data being imported, to avoid assigning incorrect values.
OTHER CHECKS

There are a number of other checks that the software can perform to improve the quality of the data being imported.

Checking for repeated import – In the confusion of importing data, it is easy to accidentally import, or at least try to import, the same data more than once. The software should look for this, tell you about it, and give you the opportunity to stop the import. It is also helpful if the software gives you a way to undo an import later if a file shouldn’t have been imported for one reason or another.

Parameter-specific reasonableness – Going beyond checking names, codes, etc., the software should check the data for reasonableness of values on a parameter-by-parameter basis. For example, if a pH value comes in outside the range of 0 to 14, the software could notice and complain. Setting up and managing a process like this takes a considerable amount of effort, but results in better data quality.

Comparison with like data – Sometimes there are comparisons that can be made within the data set to help identify incorrect values. One example is comparing total dissolved solids reported by the lab with the sum of all of the individual constituents, and flagging the data if the difference exceeds a certain amount. Another is to do a charge balance comparison. Again, this is not easy to set up and operate, but results in better data quality.

Comparison with previous data – In situations where data is being gathered on a regular basis, new data can be compared to historical data, and data that differs from previous data by more than a certain amount (usually some number of standard deviations from the mean) is suspect. These data points are often referred to as outliers. The data point can then be researched for error, re-sampled, or excluded, depending on the regulations for that specific project. The field of statistical quality control has various tools for performing this analysis, including Shewhart and Cumulative Sum control charts and other graphical and non-graphical techniques. See Chapters 20 and 23 for more information.
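A parameter-by-parameter reasonableness check like the pH example can be driven by a lookup table of acceptable ranges rather than hard-coded limits. A sketch, assuming a hypothetical ParameterLimits table with MinValue and MaxValue columns:

  -- Incoming values outside the configured range for their parameter
  SELECT i.StationName, i.ParameterName, i.Value
  FROM ImportRows i
  INNER JOIN ParameterLimits p ON i.ParameterName = p.ParameterName
  WHERE i.Value < p.MinValue OR i.Value > p.MaxValue;
  -- the ParameterLimits row for pH would carry MinValue 0 and MaxValue 14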
CONTENT-SPECIFIC FILTERING

At times there will be specific data content that needs to be handled in a special way during import. For example, one project that we worked on had various problems over time with phenols. At different times the laboratory reported phenols in different ways, so any file that contained any variety of phenol required specific attention. In another case, the procedure for a project specified that tentatively identified compounds (TICs) should not be imported at all. The database software should be able to handle both situations, allowing records with specific data content to be either flagged or not imported. Figure 63 shows an example of a screen to help with this.

Some projects allow the data administrator to manually select which data will be imported. This sounds strange to many people, but we have worked on projects where each line in the EDD is inspected to make sure that it should be imported. If a particular constituent is not required by the project plan, and the laboratory delivered it anyway, that line is deleted prior to import. In Figure 60 the checkbox near the top of the screen is used for this purpose. The software should allow deleted records to be saved to a file for later reference if necessary.
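A table-driven version of the filtering in Figure 63 keeps the project-specific rules out of the program code. A sketch, assuming a hypothetical ContentFilters table listing parameter names with an action code ('F' for flag, 'X' for exclude):

  -- Flag records (e.g., any variety of phenol) for special attention
  UPDATE ImportRows SET ReviewFlag = 'F'
  WHERE ParameterName IN
    (SELECT ParameterName FROM ContentFilters WHERE ActionCode = 'F');

  -- Exclude records (e.g., TICs), ideally after saving them to a file
  DELETE FROM ImportRows
  WHERE ParameterName IN
    (SELECT ParameterName FROM ContentFilters WHERE ActionCode = 'X');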
Figure 63 - Screen to configure content-specific filtering
Figure 64 - Screens showing results of a successful and an unsuccessful import
TRACKING IMPORTS

Part of the administration of the data management task should include keeping records of the import process. After trying an import, the software should notify you of the result. Records should be kept of both unsuccessful and successful imports. Figures 65 and 66 are examples of reports that can be printed and saved for this purpose.
Figure 65 - Report showing an unsuccessful import
Figure 66 - Report showing a successful import
The report from the unsuccessful import can be used to resolve problems prior to trying the import again. At this stage it is helpful for the software to be able to summarize the errors so that an error that occurs many times is shown only once. Then each type of error can be fixed generically and the report re-run to make sure all of the errors have been remedied so you can proceed with the import.

The report from the successful import provides a permanent record of what was imported. This report can be used for another purpose as well. In the upper left corner is a panel (shown larger in Figure 67) showing the data review steps that may apply to this data. The report can be circulated among the project team members and used to track which review steps have been performed. After all of the appropriate steps have been performed, the report can be returned to the data administrator to enter the upgraded review status for the analyses.
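If the import routine logs its messages to a table, the summarization described above is a simple aggregate query. A sketch, assuming a hypothetical ImportErrors table keyed by import:

  SELECT ErrorMessage, COUNT(*) AS Occurrences
  FROM ImportErrors
  WHERE ImportID = 123          -- the import being reviewed
  GROUP BY ErrorMessage
  ORDER BY Occurrences DESC;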
UNDOING AN IMPORT

Despite your best efforts, sometimes data is imported that either should not have been imported or is incorrect. You might need to undo an import if you find out that a particular file that you imported has errors and is being replaced, or if you accidentally imported a file twice. The database software should track the data that you import so that an import can be undone automatically when necessary. An undo import feature should be easy to use but sophisticated, leaving in the database samples that have analyses from a different import, and decrementing superseded values that were incremented by the import of the file being undone. Figure 68 shows a form to help you select an import for deletion.
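If every imported record carries the key of the import that created it, the undo reduces to a pair of deletes, taking care not to remove samples that still have analyses from other imports. A minimal sketch, assuming a hypothetical ImportID column on both tables (the renumbering of superseded values described above is left out):

  DELETE FROM Analyses WHERE ImportID = 123;

  -- Remove only those samples from this import that now have no analyses
  DELETE FROM Samples
  WHERE ImportID = 123
    AND NOT EXISTS (SELECT * FROM Analyses a
                    WHERE a.SampleNumber = Samples.SampleNumber);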
Figure 67 - Section of successful import report used for data review
Figure 68 - Form to select an import for deletion
TRACKING QUALITY

A constant focus on quality should be maintained during the import process. Each result in the database should be marked with flags regarding lab and other problems, and should also be marked with the level of data review that has been applied to that result. An example of a screen to assist with maintaining data review status is shown in Figure 75 in Chapter 15.

If the import process is managed properly using software with a sufficiently sophisticated import tool, and if the data is checked properly after import, then the resulting data will be of sufficient quality to be useful to the project. The old axiom of “garbage in, garbage out” holds true with environmental data. Another old axiom says “a job worth doing is worth doing well,” or in other words, “If you don’t have time to do it right the first time, how are you ever going to find time to do it again?” These old saws reinforce the point that the time invested in implementing a robust checking system and using it properly will be rewarded by producing data that people can trust.
CHAPTER 14 EDITING DATA
Once the data is in the database it is sometimes necessary to modify it. This can be done manually or using automated tools, depending on the task to be accomplished. These two processes are described here. Due to the focus on data integrity, a log of all changes to the data should be maintained, either by the software or manually in a logbook.
MANUAL EDITING

Sometimes it is necessary to go into the database and change specific pieces of data content. Actually, modification of data in an EDMS is not as common as an outsider might expect. For the most part, the data comes from elsewhere, such as the field or the laboratory, and once it is in it stays the way it is. Data editing is mostly limited to correcting errors (which, if the process is working correctly, should be minimal) and modifying data qualifiers such as review status and validation flags. The data management system will usually provide at least one way to manually edit data. Sometimes the user interface will provide more than one way to view and edit data. Two examples include form view (Figure 69) and datasheet view (Figure 70).
Figure 69 - Site data editing screen in form view
Figure 70 - Site data editing screen in datasheet view
AUTOMATED EDITING

If the changes involve more than one record at a time, then it probably makes sense to use an automated approach. For specific types of changes that are a standard part of data maintenance, this should be programmed into the system. Other changes might be a one-time action, but involve multiple records with the same change, so a bulk update approach using ad hoc queries is better.
Standardized tasks

Some data editing activities are relatively common. For these activities, especially if they involve a lot of records to be changed or a complicated change process, the software should provide an automated or semi-automated process to assist the data administrator with making the changes. The examples given here include both a simple process and a complicated one to show how the system can provide this type of capability.
UPDATING REVIEW STATUS

It’s important to track the review status of the data, that is, what review steps have been performed on the data. An automated editing step can help update the data as review steps are completed. Automated queries should allow the data administrators to update the review status flags after appropriate data checks have been made. An example of a screen to assist with maintaining data review status is shown in Figure 75 in Chapter 15.
REMOVAL OF DUPLICATED ENTRIES

Repeated records can enter the database in several ways. The laboratory may deliver data that has already been delivered, either a whole EDD or part of one. Data administrators may import the same file twice without noticing. (The EDMS should notify them if they try to do this.) Data that has been imported from the lab may also be imported from a data validator with partial or complete overlap. The lab may include field data, which has already been imported, along with its data. However it gets in, this repeated data provides no value and should be removed, with records kept of the changes that were made to the database. However, duplicated data resulting from the quality control process usually is of value to the project, and should not be removed.

Repeated information can be present in the database at the samples level, the analyses level, or both. The removal of duplicated records should address both levels, starting at the samples level, and then moving down to the analyses level. This order is important because removing repeated samples can result in more repeated analyses, which will then need to be removed. The samples component of the duplicated record removal process is complicated by the fact that samples have analyses underneath them, and when a duplicate sample is removed, the analyses should probably not be lost, but rather moved to the remaining sample. The software should help you do this by letting you pick the sample to which you want to move the analyses. Then the software should modify the superseded value of the affected analyses, if necessary, and assign them to the other sample.
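The search for candidate duplicate samples, using the matching criteria discussed below, can be sketched as a grouping query; the column names here are assumptions about the data model rather than the book’s schema:

  -- Samples that agree on all of the identifying fields
  SELECT Site, Station, Matrix, SampleDate, TopDepth, BaseDepth, LabSampleID,
         COUNT(*) AS Copies
  FROM Samples
  GROUP BY Site, Station, Matrix, SampleDate, TopDepth, BaseDepth, LabSampleID
  HAVING COUNT(*) > 1;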
Figure 71 - Form for moving analyses from a duplicate sample
The analyses being moved may in fact represent duplicated data themselves, and the duplicated record removal at the analyses level can be used to remove these results. The analyses component of the duplicated record removal process must deal with the situation that, in some cases, redundant data is desirable. The best example is groundwater samples, where four independent observations of pH are often taken, and should all be saved. The database should allow you to specify for each parameter and each site and matrix how many observations should be allowed before the data is considered redundant.

The first step in the duplicated record removal process is to select the data for the duplicate removal process. Normally you will want to work with all of the data for a sampling event. Once you have selected the set of data to work on, the program should look for samples that might be repeated information. It should do this by determining samples that have the same site, station, matrix, sample date, top and base, and lab sample ID. Once the software has made its recommendations for samples you might want to remove, the data should be displayed for you to confirm the action. Before removing any samples, you should print a report showing the samples that are candidates for removal. You should then make notes on this report about any actions taken regarding removal of duplicated sample records, and save the printed report in the project file.

If a sample to be removed has related analyses, then the analyses must be moved to another sample before the candidate sample can be deleted. This might be the case if somehow some analyses were associated with one sample in the database and other analyses with another, and in fact only one sample was taken. In that case, the analyses should be moved to the sample with a duplicate value of zero from the one with a higher duplicate value, and then the sample with a higher duplicate value should be deleted. The software should display the sample with the higher duplicate value first, as this is the one most likely to be removed, and display a sample that is a likely target to move the analyses to. A screen for a sample with analyses to be moved might look like Figure 71. The screen has a notation that the sample has analyses, and provides a combo box, in gray, for you to select a sample to move the analyses to. If the sample being displayed does not have analyses, or once they have been moved to another sample, then it can be deleted. In this case, the screen might look like Figure 72.

Once you have moved analyses as necessary and deleted improper duplicates, the program should look for analyses that might contain repeated information. It can do this using the following process:
1) Determine all of the parameters in the selection set.
2) Determine the number of desired observations for each parameter. Use site-specific information if it is present. If it is not, use global information. If observation data is not available, either site-specific or global, for one or more parameters, the software should notify you, and provide the option of stopping or proceeding.
3) Determine which analyses for each parameter exceed the observations count.
Figure 72 - Form for deleting duplicate samples for a sample without analyses
Next, the software should recommend analyses for removal. The goal of this process is to remove duplicated information, while, for each sample, retaining the records with the most data. The program can use the following process:
1) Group all analyses where the sample and parameter are the same.
2) If all of the data is exactly the same in all of the fields (except for AnalysisNumber and Superseded), mark all but one for deletion.
3) If all of the data is not exactly the same, look at the Value, AnalyticMethod, AnalDate_D, Lab, DilutionFactor, QCAnalysisCode, and AnalysisLabID fields. If the records are different in any of these fields, keep them. For records that are the same in all of these fields, mark all but one for deletion. (The user should be able to modify the program’s selections prior to deletion.) If the data in all of these fields is the same, then keep the record with the greatest number of other data fields populated, and mark the others for removal.

Once the software has made its recommendations for analyses to be removed, the data should be displayed in a form such as that shown in Figure 73. In this example, the software has selected several analyses for removal. Visible on the screen are two Arsenic and two Chloride analyses, and one of each has been selected for removal. In this case, this appears appropriate, since the data is exactly duplicated. The information on this screen should be reviewed carefully by someone very familiar with the site. You should look at each analysis and the recommendation to confirm that the software has selected the correct action. After selecting analyses for removal, but before removing any analyses, you should print a report showing the analyses that have been selected for removal. You should save the printed report in the project file.

There are two parts to the duplicated record removal process for analyses. The first part is the actual removal of the analytical records. This can be done with a simple delete query, after users are asked to confirm that they really want to delete the records. The second part is to modify the superseded values as necessary to remove any gaps caused by the removal process. This should be done automatically after the removal has been performed.
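The exact-duplicate case in step 2 above can be found with a self-join that keeps the lowest AnalysisNumber in each group and marks the rest. A sketch using the field names from the text (SampleNumber and Parameter are assumptions):

  -- Analyses identical to an earlier record in every compared field
  SELECT a.AnalysisNumber
  FROM Analyses a
  WHERE EXISTS
    (SELECT * FROM Analyses b
     WHERE b.SampleNumber = a.SampleNumber
       AND b.Parameter = a.Parameter
       AND b.Value = a.Value
       AND b.AnalyticMethod = a.AnalyticMethod
       AND b.AnalDate_D = a.AnalDate_D
       AND b.Lab = a.Lab
       AND b.DilutionFactor = a.DilutionFactor
       AND b.QCAnalysisCode = a.QCAnalysisCode
       AND b.AnalysisLabID = a.AnalysisLabID
       AND b.AnalysisNumber < a.AnalysisNumber);

The rows returned are the candidates to mark for deletion, subject to the manual review described above.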
PARAMETER PRINT REORDERING

This task is an example of a relatively simple process that the software can automate. It has to do with the order in which results appear on reports. A query or report may display the results in alphabetical order by parameter name. The data user may not want to see it this way. A more useful order may be to see the data grouped by category, such as all of the metals followed by all of the organics. Or perhaps the user wants to enter some specific order, and have the system remember it and use it.
Figure 73 - Form for deleting duplicated analyses
A good way to implement a specific order is to have a field somewhere in the database, such as in the Parameters table, that can be used in queries to display the data in the desired order. For the case where users want the results in a specific order, they can manually edit this field until the order is the way they want it. For the case of putting the parameters in order by category, the software can also help. A tool can be provided to do the reordering automatically. The program needs to open a query of the parameters in order by category and name, and then assign print orders in increasing numbers from the first to the last. If the software is set up to skip some increment between each, then the user can slip a new one in the middle without needing to redo the reordering process. The software can also be set up to allow you to specify an order for the categories themselves that is different from alphabetical, in case you want the organics first instead of the metals.
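A renumbering tool like the one described can be sketched as a single update that counts each parameter’s predecessors in category-then-name order, leaving gaps of ten for later insertions. The PrintOrder field name is an assumption; the Parameters table is mentioned above. (Not every database engine allows a correlated subquery in an UPDATE; this is a sketch of the logic, not a portable implementation.)

  UPDATE Parameters
  SET PrintOrder = 10 *
    (SELECT COUNT(*) FROM Parameters p2
     WHERE p2.Category < Parameters.Category
        OR (p2.Category = Parameters.Category
            AND p2.ParameterName <= Parameters.ParameterName));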
Ad hoc queries

Where the change to be made affects multiple records, but will be performed only once, or a small number of times over the life of the database, it doesn’t make sense to provide an automated tool, but manual entry is too tedious. An example of this is shown in Figure 74. The problem is that when the stations were entered, their current status was set to “z” for “Unknown,” even though only active wells were entered at that time. Now that some inactive wells are to be entered, the status needs to be set to “s” for “In service.” Figure 74 shows an example of an update query to do this. The left panel shows the query in design view, and the right panel in SQL view. The data administrator has told the software to update the Stations table, setting the CurrentStatusCode field to “s” where it is currently “z.” The query will then make this change for all of the appropriate records in one step, instead of the data administrator having to make the change to each record individually.

This type of ad hoc query can be a great time saver in the hands of a knowledgeable user. It should be used with great care, though, because of the potential to cause great damage to the database. Changes made in this way should be fully documented in the activity log, and backup copies of the database maintained in case a change is made incorrectly.
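In SQL view, the update shown in the right panel of Figure 74 would read something like:

  UPDATE Stations
  SET CurrentStatusCode = 's'
  WHERE CurrentStatusCode = 'z';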
Figure 74 - Ad hoc query showing a change to the CurrentStatusCode field
CHAPTER 15 MAINTAINING AND TRACKING DATA QUALITY
If the data in your database is not of sufficient quality, people won’t (and shouldn’t) use it. Managing the quality of the data is just as important as managing the data itself. This chapter and the next cover a variety of issues related to quality terminology, QA/QC samples, data quality procedures and standards, database software support for quality analysis and tracking, and protection from loss. General data quality issues are contained in this chapter, and issues specific to data verification and validation in the next.
QA VS. QC

Quality assurance (QA) is an integrated system of activities involving planning, quality control, quality assessment, reporting, and quality improvement to ensure that a product or service meets defined standards of quality with a stated level of confidence. Quality control (QC) is the overall system of technical activities whose purpose is to measure and control the quality of a product or service so that it meets the needs of users. The aim is to provide quality that is satisfactory, adequate, dependable, and economical (EPA, 1997a). In an over-generalization, QA talks about it and QC does it. Since the EDMS involves primarily the technical data and activities that surround it, including quantification of the quality of the data, it comes under QC more than QA. An EMS and the related EMIS (see Chapter 1), on the other hand, cover the QA component.
THE QAPP

The quality assurance project plan (QAPP) provides guidance to the project to maintain the quality of the data gathered for the project. The following are typical minimum requirements for a QAPP for EPA projects:
Project management
• Title and approval sheet.
• Table of Contents – Document control format.
• Distribution List – Distribution list for the QAPP revisions and final guidance.
• Project/Task Organization – Identify individuals or organizations participating in the project and discuss their roles, responsibilities, and organization.
• Problem Definition/Background – 1) State the specific problem to be solved or the decision to be made. 2) Identify the decision maker and the principal customer for the results.
• Project/Task Description – 1) Hypothesis test, 2) expected measurements, 3) ARARs or other appropriate standards, 4) assessment tools (technical audits), 5) work schedule and required reports.
• Data Quality Objectives for Measurement – Data decision(s), population parameter of interest, action level, summary statistics, and acceptable limits on decision errors. Also, scope of the project (domain or geographical locale).
• Special Training Requirements/Certification – Identify special training that personnel will need.
• Documentation and Records – Itemize the information and records that must be included in a data report package, including report format and requirements for storage, etc.

Measurement/data acquisition
• Sampling Process Designs (Experimental Design) – Outline the experimental design, including sampling design and rationale, sampling frequencies, matrices, and measurement parameter of interest.
• Sampling Methods Requirements – Sample collection method and approach.
• Sample Handling and Custody Requirements – Describe the provisions for sample labeling, shipment, chain of custody forms, procedures for transferring and maintaining custody of samples.
• Analytical Methods Requirements – Identify analytical method(s) and equipment for the study, including method performance requirements.
• Quality Control Requirements – Describe routine (real-time) QC procedures that should be associated with each sampling and measurement technique. List required QC checks and corrective action procedures.
• Instrument/Equipment Testing, Inspection, and Maintenance Requirements – Discuss how inspection and acceptance testing, including the use of QC samples, must be performed to ensure their intended use as specified by the design.
• Instrument Calibration and Frequency – Identify tools, gauges and instruments, and other sampling or measurement devices that need calibration. Describe how the calibration should be done.
• Inspection/Acceptance Requirements for Supplies and Consumables – Define how and by whom the sampling supplies and other consumables will be accepted for use in the project.
• Data Acquisition Requirements (Non-direct Measurements) – Define the criteria for the use of non-measurement data such as data that comes from databases or literature.
• Data Management – Outline the data management scheme including the path and storage of the data and the data record-keeping system. Identify all data handling equipment and procedures that will be used to process, compile, and analyze the data.

Assessment/oversight
• Assessments and Response Actions – Describe the assessment activities for this project.
• Reports to Management – Identify the frequency, content, and distribution of reports issued to keep management informed.

Data validation and usability
• Data Review, Validation, and Verification Requirements – State the criteria used to accept or reject the data based on quality.
• Validation and Verification Methods – Describe the process to be used for validating and verifying data, including the chain of custody for data throughout the lifetime of the project.
• Reconciliation with Data Quality Objectives – Describe how results will be evaluated to determine if DQOs have been satisfied.

What Is Quality?

Take a few minutes, put this book down, get a paper and pencil, and write a concise answer to: “What is quality in data management?” It’s harder than it sounds.

“Quality … you know what it is, yet you don’t know what it is. But that’s self-contradictory. But some things are better than others, that is, they have more quality. But when you try to say what the quality is, apart from the things that have it, it all goes poof! There’s nothing to talk about. But if you can’t say what Quality is, how do you know what it is, or how do you know that it even exists? If no one knows what it is, then for all practical purposes it doesn’t exist at all. But for all practical purposes, it really does exist. … So round and round you go, spinning mental wheels and nowhere finding anyplace to get traction. What the h___ is Quality? What is it?”
Robert M. Pirsig, 1974 - Zen and the Art of Motorcycle Maintenance

Quality is a little like beauty. We know it when we see it, but it’s hard to say how we know. When you are talking about data quality, be sure the person you are talking to has the same meaning for quality that you do. As an aside, a whole discipline has grown up around Pirsig’s work, called the Metaphysics of Quality. For more information, visit www.moq.org.
There are many sources of information on how to write a QAPP. The EPA Web site (www.epa.gov) is a good place to start. It is usually not necessary to create the QAPP from scratch. Templates for QAPPs are available from a number of sources, and one of these templates can be modified for the needs of each specific project.
QC SAMPLES AND ANALYSES

Over time, project personnel, laboratories, and regulators have developed a set of procedures to help maintain data quality through the sampling, transportation, analysis, and reporting process. This section describes these procedures and their impact on environmental data management. An attempt has been made to keep the discussion general, but some of the issues discussed here apply to some types of samples more than others. Material in this section is based on information in EPA (1997a); DOE/HWP (1990a, 1990b); and Core Laboratories (1996). This section covers four parts of the process: field samples, field QC samples, lab sample analysis, and lab calibration.

There are several aspects of handling QC data that impact the way it should be handled in the EDMS. The basic purpose of QC samples and analyses is to confirm that the sampling and analysis process is generating results that accurately represent conditions at the site. If a QC sample produces an improper result, it calls into question a suite of results associated with that QC sample. The scope of the questionable suite of results depends on the samples associated with that QC sample. The scope might be a shipping cooler of samples, a sampling event, a laboratory batch, and so on. The questionable results must then be investigated further to determine whether they are still usable.

Another issue is the amount and type of QC data to store. The right answer is to store the data necessary to support the use of the data, and no more or less. The problem is that different projects and uses have different requirements, and different parts of the data handling process can be done either inside or outside the database system. Once the range of data handling processes has been defined for the anticipated project tasks that will be using the system, a decision must be made regarding the role of the EDMS in the whole process. Then the scope of storage of QC data should be apparent.

Another issue is the amount of QC data that the laboratory can deliver. There is quite a bit of variability in the ability of laboratories and their LIMS systems to place QC information in the EDD. This is a conversation that you should have with the laboratory prior to selecting the laboratory and finalizing the project QC plan.

A number of QC items involve analyses of samples which are either not from a specific station, or do not represent conditions at that station. Because the relational data model requires that each sample be associated with a station, stations must be entered for each site for each type of QC sample to be stored in the database. Multiple stations can be added where multiple samples must be distinguished. Examples of these stations include:
• Trip Blank 1
• Trip Blank 2
• Field Blank
• Rinseate Sample
• Equipment Blank
• Laboratory Control Sample
• Matrix Spike
• Matrix Spike Duplicate
For each site only a small number of these would normally be used. These stations can be excluded from normal data retrieval and display using QC codes stored with the data at the station level.

A technical issue regarding QC sample data storage is whether to store the QC data in the same or separate tables from the main samples and analyses. Systems have been built both ways. As with the decision of which QC data to store, the answer to separate vs. integrated table design should be based on the intended uses of the data and the data management system.

The following table summarizes some common QC sample types and the scope of samples over which they have QC influence. Where a term has multiple synonyms, only one is shown in the table. Some QC sample types are not included in the table because they are felt to primarily serve laboratory calibration rather than QC purposes, but this decision is admittedly subjective. Also, some types of QC samples can be generated either in the field or the laboratory, and only one is shown here.

Source                Sample type                   QC scope
Field Samples         Field duplicates              Sample event or batch
                      Split samples                 Sample event or batch
                      Referee duplicates            Sample event or batch
                      Field sample spikes           Analytical batch
Field QC Samples      Trip blank                    Single cooler
                      Field blank                   Sample event or batch
                      Rinseate blank                Sample event or batch
                      Sampling equipment blank      Sample event or batch
Lab Sample Analyses   Matrix spike                  Analytical batch
                      Matrix spike duplicate        Analytical batch
                      Surrogate spikes              Analytical batch
                      Internal standard             Analytical batch
                      Laboratory duplicates         Analytical batch
                      Laboratory reanalyses         Analytical batch
Lab Calibration       Blank spike                   Analytical batch
                      Method blank                  Analytical batch
                      Instrument blank              Analytical batch
                      Instrument carryover blank    Analytical batch
                      Reagent blank                 Analytical batch
                      Check sample                  Analytical batch
                      Calibration blank             Analytical batch
                      Storage blank                 Analytical batch
                      Blind sample                  Analytical batch
                      Dynamic blank                 Analytical batch
                      Calibration standard          Analytical batch
                      Reference standard            Analytical batch
                      Measurement standard          Analytical batch
Each of these QC items has some (at least potential) impact on the database system.
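Excluding the QC stations from routine retrieval, as mentioned above, can then be done with one extra predicate. A sketch, assuming a hypothetical QCCode column on the Stations table that is null for normal stations:

  -- Routine retrieval: real stations only
  SELECT st.StationName, sa.SampleDate
  FROM Stations st
  INNER JOIN Samples sa ON sa.StationNumber = st.StationNumber
  WHERE st.QCCode IS NULL;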
FIELD SAMPLES

Field samples are the starting point for gathering both the primary data of interest as well as the QC data. This section covers QC items related to these samples themselves.

Samples – Material gathered in the field for analysis from a specific location at a specific time.

Field duplicates, Duplicate samples, Replicate samples – Two or more samples of the same material taken in separate containers, and carried through all steps of the sampling and analytical procedures in an identical manner. Duplicate samples are used to assess variance of the total method, including sampling and analysis. The frequency of field duplicates is project specific. One complication is that a major reason for sending duplicates to the laboratory is to check the laboratory’s performance. Marking the samples as duplicates might alert the laboratory that it is being checked, and cause it to deviate from its usual process. Using synthetic station names can overcome this, at the expense of making it more difficult to associate the data with the real station name in the database. A problem with a field duplicate impacts the samples from that day or that batch of field data.

Split samples – Two or more representative portions taken from a sample or subsample and analyzed by different analysts or laboratories. Split samples are used to replicate the measurement of the variable(s) of interest to measure the reproducibility of the analysis. The difference between field duplicates and split samples is that splits start out as one sample, and duplicates as two. The frequency of split samples is project specific, with one for each 20 samples being common. Samples for VOC analysis should not be split. A problem with a split sample impacts the samples from that day or that batch of field data.

Referee duplicates – Duplicate samples sent to a referee QA laboratory, if one is specified for the project. A problem with a referee duplicate impacts the samples from that day or that batch of field data.

Field sample spikes, Field matrix spikes – A sample prepared by adding a known mass of target analyte to a specified amount of a field sample for which an independent estimate of target analyte concentration is available. Spiked samples are used, for example, to determine the effect of matrix interference on a method’s recovery efficiency. The frequency of these spikes is project specific. A problem with a matrix spike impacts the samples from that analytical batch.
FIELD QC SAMPLES

Usually some samples are brought in from the field which do not represent in-situ conditions at the site, but are used in the QC process to help avoid bad data from contamination and other sources not directly related to actual site conditions.

Trip blank – A clean sample of matrix that is carried to the sampling site and transported to the laboratory for analysis without having been unsealed and exposed to sampling procedures (as opposed to a field blank, which is opened or even prepared in the field). They measure the contamination, usually by volatile organics, from laboratory water, sample containers, site handling, transit, and storage. There is usually one trip blank per shipment batch (cooler). Trip blanks are usually used for water samples, but may also be sent with soil samples, in which case they are analyzed and reported as water samples. A problem with a trip blank impacts the contents of one cooler of samples.

Field blank, Blank sample, Medium blank, Field reagent blank, Site blank – A clean sample (e.g., distilled water), carried to the sampling site, exposed to sampling conditions (e.g., bottle caps removed, preservatives added) and returned to the laboratory and treated as an environmental sample. This is different from trip blanks, which are transported to but not opened in the field. Field blanks are used to check for analytical artifacts and/or background introduced by sampling and analytical procedures. The frequency of field blanks is project specific. The term is also used for samples of source water used for decontamination and steam cleaning (DOE/HWP, 1990a, p. 17). A problem with a field blank impacts the samples from that day or that batch of field data.

Rinseate blank, Equipment rinseates – A clean sample (e.g., distilled water or ASTM Type II water) passed through decontaminated sampling equipment before sampling, and returned to the laboratory as a sample. Sampling equipment blanks are used to check the cleanliness of sampling devices. Usually one rinseate sample is collected for each 10 samples of each matrix for each piece of equipment. A problem with a rinseate blank impacts the samples from that day or that batch of field data.

Sampling equipment blank, Decontamination blank – A clean sample that is collected in a sample container with the sample-collection device after or between samples, and returned to the laboratory as a sample. Sampling equipment blanks are used to check the cleanliness of sampling devices. A problem with a sampling equipment blank impacts the contents of one sample event or batch.
LAB SAMPLE ANALYSES

This section covers QC procedures performed in the laboratory, specifically involving the samples from the field.

Matrix spike, Spiked sample, Laboratory spiked sample – A sample prepared by adding a known mass of target analyte to a specified amount of a field sample for which an independent estimate of target analyte concentration is available. Spiked samples are used, for example, to determine the effect of matrix interference on a method’s recovery efficiency. Usually one matrix spike is analyzed per sample batch, or 5 to 10% of the samples. Matrix spikes and matrix spike duplicates are associated with a specific analytical batch, and are not intended to represent a specific site or station. The matrix spike for a batch may be from a different laboratory client than some of the samples in the batch. A problem with a matrix spike impacts the samples from that analytical batch.

Matrix spike duplicate – A duplicate of a matrix spike, used to measure the laboratory precision between samples. Usually one matrix spike duplicate is analyzed per sample batch. Percent differences between matrix spikes and matrix spike duplicates can be calculated. A problem with a matrix spike duplicate impacts the samples from that analytical batch.

Surrogate spikes – Non-target analytes of known concentration that are added to organic samples prior to sample preparation and instrument analysis. They measure the efficiency of all steps of the sample preparation and analytical method in recovering target analytes from the sample matrix, based on the assumption that non-target surrogate compounds behave the same as the target analytes. They are run with all samples, standards, and associated quality control. Spike recoveries can be calculated from spike concentrations. A problem with this type of sample impacts the samples from that analytical batch.

Internal standard – Non-target analytes of known concentration that are added to organic samples following sample preparation but prior to instrument analysis (as opposed to surrogate spikes, which are added before sample preparation). They are used to determine the efficiency of the instrumentation in quantifying target analytes and for performing calculations by relative response factors. They are run with all samples, standards, and associated quality control. A problem with this type of sample impacts the samples from that analytical batch.

Laboratory duplicates, Duplicate analyses or measurements, Replicate analyses or measurements, Laboratory replicates – The analyses or measurements of the variable of interest performed identically on two or more subsamples of the same sample. The results from duplicate analyses are used to evaluate analytical or measurement precision, including non-homogeneous sample matrix effects but not the precision of sampling, preservation, or storage internal to the laboratory. Typically lab duplicate analysis is performed on 5 to 10% of the samples. These terms are also used for laboratory reanalyses. A problem with this type of sample impacts the samples from that analytical batch.

Laboratory reanalyses, Laboratory replicates – Repeated analyses of a single field sample aliquot that has been prepared by the same sample preparation procedure to measure the repeatability of the sample analysis. A problem with this type of analysis impacts the samples from that analytical batch.

Maximum holding time – The length of time a sample can be kept under specified conditions without undergoing significant degradation of the analyte(s) or property of interest. Problems with holding times impact just those specific samples.

Recovery efficiency – In an analytical method, the fraction or percentage of a target analyte extracted from a sample containing a known amount of the analyte. A problem with recovery efficiency impacts the samples from that analytical batch.

Dilution factor – The numerical value obtained from dividing the new volume of a diluted sample by its original volume. This is a value to be tracked for the analysis, rather than a separate QC sample, although often one or more analytes are reported at more than one dilution. Professional judgment is then required to determine which result is most representative of the true concentration in the sample.

Method of standard addition, MSA – Analysis of a series of field samples which are spiked at increasing concentrations of the target analytes. This provides a mathematical approach for quantifying analyte concentrations of the target analyte. It is used when spike recoveries are outside the QC acceptance limits specified by the method. This is more a lab calibration technique than a QC sample.
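Of the items above, maximum holding time is one that the database can police directly. A sketch, assuming a hypothetical Methods lookup with a MaxHoldingDays column and a database engine that supports date-plus-days arithmetic:

  -- Analyses performed after the allowed holding period
  SELECT a.AnalysisNumber, s.SampleDate, a.AnalDate_D
  FROM Samples s
  INNER JOIN Analyses a ON a.SampleNumber = s.SampleNumber
  INNER JOIN Methods m ON m.AnalyticMethod = a.AnalyticMethod
  WHERE a.AnalDate_D > s.SampleDate + m.MaxHoldingDays;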
LAB CALIBRATION

A number of procedures are carried out in the laboratory to support the QC process, but do not directly involve the samples from the field.

Blank spike, Spike – A known mass of target analyte added to a blank sample or subsample; used to determine recovery efficiency or for other quality control purposes. Blank spikes are used when sufficient field sample volumes are not provided for matrix spiking, or as per method specifications. A problem with this type of sample impacts the samples from that analytical batch.

Method blank – A clean sample containing all of the method reagents, processed simultaneously with and under the same conditions as samples containing an analyte of interest through all steps of the analytical procedure. They measure the combined contamination from reagent water or solid material, method reagents, and the sample preparation and analysis procedures. The concentration of each analyte in the method blank should be less than the detection limit for that analyte. Method blanks are analyzed once per batch of samples, or 5 to 10% of the sample population, depending on the method specifications. A problem with this type of sample impacts the samples from that analytical batch.

Instrument blank – A clean sample processed through the instrumental steps of the measurement process at the beginning of an analytical run, during, and at the end of the run. They are used to determine instrument contamination and indicate if corrective action is needed prior to proceeding with sample analysis. Normally one blank is analyzed per analytical batch, or as needed. A problem with this type of sample impacts the samples from that analytical batch.

Instrument carryover blank – Laboratory reagent water samples which are analyzed after a high-level sample. They measure instrument contamination after analyzing highly concentrated samples, and are analyzed as needed when high-level samples are analyzed. A problem with this type of sample impacts the samples from that analytical batch.

Reagent blank, Analytical blank, Laboratory blank, Medium blank – A sample consisting of reagent(s) (without the color forming reagent), without the target analyte or sample matrix, introduced into the analytical procedure at the appropriate point and carried through all subsequent steps. They are used to determine the contribution of the reagents and of the involved analytical steps to error in the observed value, to zero instruments, and to correct for blank values. They are usually run one per batch. A problem with this type of sample impacts the samples from that analytical batch.

Check sample, QC check sample, Quality control sample, Control sample, Laboratory control sample, Laboratory control standard, Synthetic sample, LCS – An uncontaminated sample matrix spiked with known amounts of analytes usually from the same source as the calibration standards. It is generally used to establish the stability of the analytical system but may also be used to assess the performance of all or a portion of the measurement system. They are usually analyzed once per analytical batch or as per method specifications, although LCS duplicates are also sometimes run. A problem with this type of sample impacts the samples from that analytical batch.

Calibration blank – Laboratory reagent water samples analyzed at the beginning of an analytical run, during, and at the end of the run. They verify the calibration of the system and measure instrument contamination or carry-over. A problem with this type of sample impacts the samples from that analytical batch.

Storage blank – Laboratory reagent water samples stored in the same type of sample containers and in the same storage units as field samples. They are prepared, stored for a defined period of time, and then analyzed to monitor volatile organic contamination derived from sample storage units. Typically one blank is used for each sample batch, or as per method specifications. A problem with this type of sample impacts the samples from that analytical batch.

Blind sample, Double-blind sample – A subsample submitted for analysis with a composition and identity known to the submitter but unknown to the analyst and used to test the analyst’s or laboratory’s proficiency in the execution of the measurement process. A problem with this type of sample impacts the samples from that analytical batch.
Dynamic blank – A sample-collection material or device (e.g., filter or reagent solution) that is not exposed to the material to be selectively captured but is transported and processed in the same manner as the sample. A problem with this type of sample impacts the samples from that analytical batch.

Calibration standard, Calibration-check standard – A substance or reference material containing a known concentration of the target analytes used to calibrate an instrument. They define the working range and linearity of the analytical method and establish the relationship between instrument response and concentration. They are used according to method specifications. A problem with this type of sample impacts the samples from that analytical batch. The process of using these standards on an ongoing basis through the analytical run is called Continuous Calibration Verification or CCV, and CCV samples are run at a project-specific frequency such as one per ten samples.

Reference standard – A standard of known analytes and concentration obtained from a source independent of the standards used for instrument calibration. They are used to verify the accuracy of the calibration standards, and are analyzed after each initial calibration or as per method specifications. A problem with this type of sample impacts the samples from that analytical batch.

Measurement standard – A standard added to the prepared test portion of a sample (e.g., to the concentrated extract or the digestate) as a reference for calibrating and controlling measurement or instrumental precision and bias.

Clean sample – A sample of a natural or synthetic matrix containing no detectable amount of the analyte of interest and no interfering material. This is more a lab material than a QC sample.

Laboratory performance check solution – A solution of method and surrogate analytes and internal standards; used to evaluate the performance of the instrument system against defined performance criteria. This is more a lab material than a QC sample.

Performance evaluation sample (PE sample), Audit sample – A sample, the composition of which is unknown to the analyst and is provided to test whether the analyst/laboratory can produce analytical results within specified performance limits. This is more a lab material than a QC sample.

Spiked laboratory blank, Method check sample, Spiked reagent blank, Laboratory spiked blank – A specified amount of reagent blank fortified with a known mass of the target analyte; usually used to determine the recovery efficiency of the method. This is more a lab material than a QC sample.

“If the outcome of a test is likely to change, conduct the test only once.” Rich (1996)
METHOD-SPECIFIC QC

Some analytical methods have specific QC requirements or procedures. For example, gas chromatograph analysis, especially when mass spectrometry is not used, can use two columns, generate two independent results, and the results from the second column confirmation can be used for comparison. In many metals analyses, replicate injections, in which the results are run several times and averaged, can provide precision information and improve precision. Sample dilutions are another QC tool to help generate reproducible results. The sample is re-analyzed at one or more dilution levels to bring the target analyte into the analytical range of the instrument.
DATA QUALITY PROCEDURES

For laboratory-generated analytical data, data quality is assured through data verification and validation procedures (Rosecrance, 1993). Data validation procedures developed by the EPA and state agencies provide data validation standards for specific program and project requirements. These data validation procedures usually refer to QA/QC activities conducted by a Contract Laboratory Program (CLP) EPA laboratory. They can occur at and be documented by the laboratory performing the analyses, or they can be performed by a third party independent from the laboratory and the client.

Subsequent to receiving the data from the laboratory, the largest component of quality assurance as it applies to a data management system involves data import, data entry, and data editing. In all three cases, the data being placed in the system must be reviewed to determine if the data fulfills the project requirements and project specifications. Data review is a systematic process consisting of data check-in (chain of custody forms, sampling dates, entire data set received), data entry into the database, checking the imported data against the import file, and querying and comparing the data to the current data set and to historical data. A data review flag should be provided in the database to allow users to track the progress of the data review, and to use the data appropriately depending on the status of the review. The details of how data review is accomplished and by whom must be worked out on a project-by-project basis.

The software should provide a data storage location and manipulation routines for information about the data review status of each analytical value in the system. The system should allow the storage of data with different levels of data checking. A method should be provided for upgrading (and in rare cases, perhaps downgrading) the data checking status of each data item as additional checking is done, along with a history of the various checking steps performed. This process can be used for importing or entering data with a low level of checking, then updating the database after the review has been performed to the required level.
Levels of data review

Data review activities determine if analytical data fulfills the project requirements and project specifications. The extent of data review required for analyses will likely vary by project, and even by constituent or sample type within the project. The project managers should document the actual level of effort required for each data review step. Data review flags indicate that the project-specific requirements have been met for each step. Some projects may require third-party validation of the laboratory data. Flags in the data review table can be used to indicate if this procedure has been performed, while still tracking the data import and review steps required for entry into the data management system. It is also possible for various levels of data review to be performed prior to importing the data. Laboratories and consultants could be asked to provide data with a specified level of checking, and this information brought in with the data. The following table shows some typical review codes that might be associated with analytical data:

Data Review Code    Data Review Status
0                   Imported
1                   Vintage (historical) data
2                   Data entry checked
3                   Sampler error checked
4                   Laboratory error checked
5                   Consistent with like data
6                   Consistent with previous data
7                   In-house validation
8                   Third-party validation
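One way to store these codes is a lookup table referenced by a flag on each analytical record. A minimal sketch, with names that are assumptions rather than the book’s data model:

  CREATE TABLE ReviewStatus (
    ReviewCode   INTEGER PRIMARY KEY,  -- 0 through 8 as in the table above
    Description  VARCHAR(50)
  );

  -- each analytical result carries its current review status
  ALTER TABLE Analyses ADD COLUMN ReviewCode INTEGER REFERENCES ReviewStatus;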
Tracking review status

The system should provide the capability to update a data review flag for subsets of analytical data that have undergone a specific step of data review, such as consistency checking, verification, or validation. The data review flag will alert users to the status of the data relative to the review process, and help them determine the appropriate uses for the data. A tool should be provided to allow the data administrators to update the review status codes after the appropriate data checks have been made to the analyses.
Figure 75 - An interface for managing the review status of EDMS data
Figure 75 shows an example of an interface for performing this update. Data that had previously been checked for internal consistency has now undergone a third-party validation process, and the data administrator has selected this data set and is updating the review status code. In this figure, the review status is being modified for all of the records in the set of selected data. It is helpful to be able to select data by batch or by chain of custody, and then modify the review status of that whole set at once.
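Behind a screen like Figure 75 is, in essence, a bulk update on the selected set. A sketch, assuming the selection is made by a hypothetical chain-of-custody field:

  UPDATE Analyses
  SET ReviewCode = 8                 -- third-party validation
  WHERE COCNumber = 'COC-1234';      -- hypothetical chain-of-custody key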
Documentation and audits

In addition to data checking, a method of tracking database changes should also be instituted. The system should maintain a log of activity for each project. This log should permanently record all changes to the database for that project, including imports, data edits, and changes to the review status of data. It should include the date of the change, the name of the person making the change, and a brief description of the change.

Occasional audits should be performed on the system and its use to help identify deficiencies in the process of managing the data. When deficiencies are identified, they should be followed by a remedy that adequately addresses their extent. The details of the operation of the tracking and auditing procedures should be specified in the Quality Assurance Project Plan for each project.
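The activity log described here maps naturally onto a small table. The layout below is a sketch with assumed names, capturing the date, person, and description called for above:

  CREATE TABLE ActivityLog (
    LogID       INTEGER PRIMARY KEY,
    ProjectID   INTEGER,
    ChangeDate  TIMESTAMP,
    ChangedBy   VARCHAR(50),
    Description VARCHAR(255)
  );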
Data quality standards

Quality has always been very important in organizations working with environmental data. Standards like NQA-1 and BS 7750 have made a major contribution to the quality of the results of site investigation and remediation. Recent developments in international standards can have an impact on the design and implementation of an EDMS, and the EDMS can contribute to successful implementation of these systems.

ISO 9000 and ISO 14000 are families of international standards. Both families consist of standards and guidelines relating to management systems, and related supporting standards on terminology and specific tools. ISO 9000 is primarily concerned with “quality management,” while ISO 14000 is primarily concerned with “environmental management.” ISO 9000 and ISO 14000 both have a greater chance of success in an organization with good data management practices. A good source of information on ISO 9000 and ISO 14000 is www.iso.ch/9000e/9k14ke.htm.

Another international standard that applies to environmental data gathering and management is ISO 17025, which focuses on improvements in testing laboratories (Edwards and Mills, 2000). It provides for preventive and corrective programs that facilitate client communication to resolve problems.
ISO 9000

ISO 9000 addresses the management of quality in the organization. The focus of ISO 9000 is on documentation, procedures, and training. A data management system can assist by providing procedures that increase the chance of generating a quality result. A derivative of ISO 9000 called QS 9000 adds modifications for specific industries, particularly the automotive industry.

The definition of "quality" in ISO 9000 refers to those features of a product or service that are required by the customer. "Quality management" is what the organization does to ensure that its products conform to the customer’s requirements; by analogy, "environmental management" (the subject of ISO 14000, below) is what the organization does to minimize the harmful effects of its activities on the environment.
ISO 14000

ISO 14000 encourages organizations to perform their operations in a way that has a positive, or at least not a negative, impact on the environment. An important component of this is tracking environmental performance, and an EDMS can help with this. ISO 14000 contains ten principles for organizations implementing environmental management systems (Sayre, 1996). Several of these principles relate to environmental data management. Below, each principle is followed by a brief discussion of how it relates to environmental data management and the EDMS software.

Recognize that environmental management is one of the highest priorities of any organization – This means that adequate resources should be allocated to the management of the environmental aspects of the business, including environmental monitoring data. Every organization whose operations have potential environmental impacts should have an efficient system for storing and accessing its environmental data. An EDMS is a powerful tool for this.

Establish and maintain communications with both internal and external interested parties – From the point of view of environmental data, the ability to quickly retrieve and communicate data relevant to issues that arise is critical, on both a routine and an emergency basis. Communication between the EDMS and regulators is a good example of the former; the ad hoc query capability of an EDMS is a good example of the latter.

Determine legislative requirements and those environmental aspects associated with your activities, products, and services – Satisfaction of regulatory requirements is one of the primary purposes of an environmental data management system. An EDMS can store information
like the sampling intervals for monitoring locations and target concentration limits to assist with tracking and satisfying these requirements.

Develop commitment, by everyone in the organization, to environmental protection and clearly assign responsibilities and accountability – This is just as true of environmental data as of any other aspect of the management process.

Promote environmental planning throughout the life cycle of the product and the process – Planning through the life cycle is important. Tracking is equally important, and an EDMS can help with that.

Establish a management discipline for achieving targeted performance – A key to achieving targeted performance is tracking performance, and using a database is critical for efficient tracking.

Provide the right resources and sufficient training to attain performance targets – Again, tracking is an important part of this: you can’t attain targets if you don’t know where you are. Implementing an environmental data management system with the power and flexibility to get answers to important questions is critical. That system should also be easy to use, so that the resources expended on training generate the greatest return. A good EDMS will provide these features – power, flexibility, and ease of use.

Evaluate performance against policy, environmental objectives, and targets, and make improvements where possible – Again, tracking is the important part, along with improvements, as discussed in the next principle.

Establish a process to review, monitor, and audit the environmental management system to identify opportunities for improvement in performance – This is a reflection of the Deming/Japanese/incremental-improvement approach to quality, which has been popular in the management press. The key to this from a data management perspective is to implement open systems, where small improvements are not rejected because of the effort required to implement them. In an EDMS, a knowledgeable user should be able to create new functionality, like a specific kind of report, without going through a complicated process involving formal interaction with an Information Technology group or the software vendor.

Encourage vendors to also establish environmental management systems – Propagating the environmental quality process through the industry is encouraged, and the ability to transfer data effectively can be important in this. This is even truer when looked at from the overall quality perspective. For example, a reference file system used to check laboratory data prior to delivery encourages the movement of quality data from the vendor (the lab) to the user within the organization.
ISO 17025

A newer ISO standard, ISO 17025, covers laboratory procedures for the management of technical and quality records (Edwards and Mills, 2000). This standard requires laboratories to establish and maintain procedures for the identification, collection, indexing, access, storage, maintenance, and disposal of quality and technical records. These include original observations and derived data, along with sufficient information to establish an audit trail, calibration records, staff records, and a copy of each report issued, for a defined period. The focus of the standard is on internal improvements within the laboratory, as well as on corrective and preventive action programs that require client communication to satisfactorily resolve problems.
EPA GUIDANCE

The U.S. EPA has a strong focus on quality in its data gathering and analysis process. The EPA provides a number of guidance documents covering various quality procedures, as shown in the following table. Many are available on its Web site at www.epa.gov.
QA/G-0 – EPA Quality System Overview
QA/G-3 – Management Systems Review Process
QA/G-4 – Data Quality Objectives Process
QA/G-4D – Decision Error Feasibility Trials (DEFT)
QA/G-4H – DQO Process for Hazardous Waste Sites
QA/G-5 – Quality Assurance Project Plans
QA/G-5S – Sampling Designs to Support QA Project Plans
QA/G-6 – Preparation of SOPs
QA/G-7 – Technical Assessments for Data Operations
QA/G-8 – Environmental Data Verification & Validation
QA/G-9 – DQA: Practical Methods for Data Analysis
QA/G-9D – DQA Assessment Statistical Toolbox (DataQUEST)
QA/G-10 – Developing a QA Training Program
QA/R-1 – Requirements for Environmental Programs
QA/R-1ER – Extramural Research Grants
QA/R-2 – Quality Management Plans
QA/R-5 – Quality Assurance Project Plans
DATABASE SUPPORT FOR DATA QUALITY AND USABILITY

Data review in an EDMS is a systematic process that consists of data check-in, data entry, screening, querying, and reviewing (comparing data to established criteria to ensure that the data is adequate for its intended use), following specific written procedures for each project. The design of the software and system constraints can also make a great contribution to data quality. An important part of the design process and implementation plan for an EDMS is the detailed specification of the procedures that will be implemented to assure the quality of the data in the database. This part of the design affects all of the other parts of the detailed design, especially the data model, user interface, and import and export.

System tools – Data management systems provide tools to assist with maintaining data quality. For example, systems that implement transaction processing can help data quality by passing the ACID test (Greenspun, 1998); a minimal sketch of transaction processing appears at the end of this section:

Atomicity – The results of a transaction’s execution are either all committed or all rolled back.
Consistency – The database is transformed from one valid state to another, and at no time is in an invalid state.
Isolation – The results of one transaction are invisible to other transactions until the transaction is complete.
Durability – Once committed, the results of a transaction are permanent and survive future system and media failures.

These features work together to help ensure that changes to data are made consistently and reliably. Most commercial data management programs provide these features, at least to some degree.

Data model – A well-designed, normalized data model will go a long way toward enforcing data integrity and improving quality. Referential integrity can prevent some types of data errors from occurring in the database, as discussed in Chapter 3. A place should also be provided in the data model for storage of data review information, along with standard flags and other QC information. During the detailed design process, data elements may be identified that would help with tracking quality-related information, and those fields should be included in the system.

User interface – A user interface must be provided, with appropriate security, for maintaining quality information in the database. In addition, all data entry and modification screens should support a process of reviewing data in keeping with the quality procedures of the organization and projects. These procedures should be identified and software routines specified as part of the EDMS implementation.

Import and export – The quality component of data import involves verifying that data is imported correctly and associated correctly with data already in the database. This means that checks should be performed as part of the import process to identify any problems or unanticipated changes in the data format or content that could result in data being imported improperly or not at all. After the data has been imported, one or more data review steps should be performed that are appropriate to the quality level required for the data in that particular project and its expected uses. Once the data is in the database with an appropriate level of quality, retrieving and exporting quality data requires that the selection process be robust enough to ensure that the results are representative of the data in the database, relative to the question being asked. This is described in a later section.
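The transaction processing described under System tools above is available from virtually any database library. A minimal sketch in Python with SQLite, using hypothetical samples and analyses tables; a parent sample and its child analyses are committed or rolled back together:

import sqlite3

conn = sqlite3.connect("edms.db")
conn.isolation_level = None  # manage the transaction explicitly
cur = conn.cursor()
try:
    cur.execute("BEGIN")
    # A sample and its analyses must succeed or fail together (atomicity):
    cur.execute("INSERT INTO samples (sample_id, station) VALUES (?, ?)",
                ("S-1001", "MW-1"))
    cur.execute("INSERT INTO analyses (sample_id, parameter, value, units) "
                "VALUES (?, ?, ?, ?)", ("S-1001", "Sulfate", 1250.0, "mg/L"))
    cur.execute("COMMIT")    # both rows become permanent (durability)
except Exception:
    cur.execute("ROLLBACK")  # neither row is saved; the database stays valid
    raise
finally:
    conn.close()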
Data retrieval and quality

Another significant component of quality control is data retrieval. The system should have safeguards to assure that the data being retrieved is internally consistent and provides the best representation of the data contained in the system. This involves attention to the use of flags and especially units, so that the data delivered from the system is complete and internally consistent. This is easy to enforce for canned queries, but more difficult for ad hoc queries. Adequate training must be provided in the use of the software so that errors in data retrieval are minimized. Another way to say this is that the system should provide the answer that best addresses the intent of the question. This is a difficult issue, since the data delivery requirements can vary widely between projects.
PRECISION VS. ACCURACY

Many people confuse precision, accuracy, and bias. Precision is the degree to which a set of observations or measurements of the same property, usually obtained under similar conditions, conform to themselves; it is also thought of as reproducibility or repeatability. Precision is usually expressed as the standard deviation, variance, or range of a set of data, in either absolute or relative terms. Accuracy is the degree of agreement between an observed value and an accepted reference value. Bias is the systematic or persistent distortion of a measurement process that deprives the result of representativeness (i.e., the expected sample measurement is different from the sample’s true value). For a good discussion of precision and accuracy, see Patnaik (1997, p. 6).

Accuracy as most people think of it includes a combination of random error (precision) and systematic error (bias) components due to sampling and analytical operations. EPA recommends that the term accuracy not be used, and that precision and bias be used instead to convey the information usually associated with accuracy. Figure 76, based on ENCO (1998), illustrates graphically the difference between precision and accuracy.

From the perspective of the laboratory QC process, method accuracy is based on the percent recovery of a known spike concentration from a sample matrix. Precision is based on the relative percent difference between duplicate samples or duplicate spike samples.
Figure 76 - Illustration of precision vs. accuracy (four panels: inaccurate and imprecise; accurate and imprecise; inaccurate and precise; accurate and precise)
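The two laboratory QC measures just mentioned are simple to compute. A minimal sketch, assuming replicate results and spike concentrations are already in consistent units; the values are hypothetical:

from statistics import mean, stdev

def relative_std_dev(values):
    """Precision: relative standard deviation (%) of replicate measurements."""
    return 100.0 * stdev(values) / mean(values)

def percent_recovery(measured, spiked):
    """Bias: recovery (%) of a known spike concentration from the matrix."""
    return 100.0 * measured / spiked

replicates = [10.2, 9.8, 10.1, 10.4]   # hypothetical replicate results
print(f"RSD: {relative_std_dev(replicates):.1f}%")
print(f"Recovery: {percent_recovery(9.6, 10.0):.0f}%")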
PROTECTION FROM LOSS

The final quality assurance component is protection of data from loss. Once data has been entered and reviewed, it should be available forever, or at least until a conscious decision is made that it is no longer needed. This protection involves physical measures, such as a process for regular and reliable backups, and verification of those backups. It also involves an ongoing process of checking data contained in the database to assure that the data content remains as intended. This can involve checking data against previous reports or other methods designed to identify any improper changes to data, whether those changes were intentional or not.
Data security

In order to protect the integrity and quality of the database, opening the database and performing actions once it is open should be restricted to those with a legitimate business need for that access. Security methods should be developed for the EDMS that designate what data each user is allowed to access.

A desktop data management system such as Microsoft Access typically provides two methods of securing a database: setting a password for opening the database, and user-level security. The password option will protect against unauthorized user access. However, once the database is open, all database objects are available to the user. This level of security does not provide adequate protection of sensitive data, prevent users from inadvertently breaking an application by changing code or objects on which the application depends, or prevent them from inadvertently changing reviewed data.

User-level security provides the ability to secure different objects in a database at different levels. Users identify themselves by logging into the database when the program is started. Permissions can be granted to groups and users to regulate database usage. Some database users, designated as data administrators, are granted permission to view, enter, or modify data. The data import and edit screens should be accessible only to these staff members. Other users should be restricted to just viewing the data. Access to tables can be restricted by group or user.

In a client-server database system, such as one with Microsoft SQL Server or Oracle as a back-end, the server security model can provide an additional level of security. This is a fairly complicated subject, but implementing this type of security can provide a very high level of protection.
Figure 77 - SQL Server screen for modifying permissions on tables
For example, SQL Server provides a variety of security options to protect the server and the data stored on that server. SQL Server security determines who can log on to the server, the administrative tasks each user is allowed to perform, and which databases, database objects (tables, indexes, views, defaults, procedures, etc.), and data are available to each user. Figure 77 shows the SQL Server screen for editing permissions on table objects. SQL Server login security can be configured for one of three security modes:

- SQL Server’s own login validation process
- Windows NT/2000/XP authentication services
- A mixture of the two login types
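Server-side permissions are usually administered through the console shown in Figure 77, but they can equally be applied with T-SQL GRANT and DENY statements. A minimal sketch from Python, assuming hypothetical DataAdmins and DataViewers roles and an Analyses table; the driver name and connection details are illustrative:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=edms-srv;"
    "DATABASE=EDMS;Trusted_Connection=yes;", autocommit=True)
cur = conn.cursor()
# Data administrators may view and modify analytical results...
cur.execute("GRANT SELECT, INSERT, UPDATE ON dbo.Analyses TO DataAdmins")
# ...while ordinary users are read-only.
cur.execute("GRANT SELECT ON dbo.Analyses TO DataViewers")
cur.execute("DENY INSERT, UPDATE, DELETE ON dbo.Analyses TO DataViewers")
conn.close()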
The system administrator and a backup system administrator should be trained on how to add users with specific permissions, and how to change existing permissions if necessary.

To further track database additions and modifications, a dialog box can pop up when a database administrator ends a session in which data might have been changed. The box should contain the user name, site, and date, which are not editable, and a memo field where the user can enter a description of the data modifications that were made. This information should be stored in a table to allow tracking of database activities. This table should be protected from edits so that the tracking information it contains is reliable. An example of this type of tracking is shown in Chapter 5.

Some enterprise EDMS databases will be set up with multiple sites in one database. This is a good design for some organizations, because it can simplify the data management process and allow for comparison across sites. However, it may not be necessary for all users to have access to all sites. A user interface can be provided to limit the access of users to only specific sites. Figure 78 shows an example of a screen (in form and datasheet view) for assigning users to sites, and the software can then use this information to limit access as appropriate.
Figure 78 - Security screen for assigning users to sites
Backup

Backing up the database is probably the most important maintenance activity. The basic rule of thumb: when you have done enough work in the database that you would not want to redo it, back up. Backup programs offer various options for scheduled vs. manual backups, backup media, compression, and so on, so think through your backup strategy, implement it, and then be sure to stay with it.

The process for backing up depends on the type of database that you have. If you are running a stand-alone program like Access, you can use consumer-oriented backup tools, such as the one that comes with the operating system, or a third-party backup utility, to back up your database file. Most of these programs will not back up a file that is open, so be sure everyone who might have the database file open closes it before the backup program runs. If people keep the database open all the time, the file may not be backed up even though a backup program is running regularly.

Backing up a client-server database is more complicated. For example, there are several options for backing up the SQL Server database file. The Windows NT/2000/XP backup program can be used to back up the SQL Server database if the database file is closed during the backup process. SQL Server also contains a backup program, located under Tools in SQL Enterprise Manager, that will back up the database files while they are open. The format of the SQL Server backup program is not compatible with NT/2000/XP-format backups, and separate tapes must be used if both backup options are used. It is recommended that frequent backups of the database be scheduled using this program. Third-party backup programs are also available, and may provide more functionality, but at an additional cost.

On a regular basis, probably daily, a backup of the data should be made to a removable medium such as tape, Zip disk, or writeable CD or DVD. The tapes or other media should be rotated according to a formal schedule so that data can be recovered if necessary from a previous date. A reliable staff member should be assigned responsibility for the backup process, which should include occasional testing to make sure that the backup tapes can be read if necessary. This task might be performed by someone in IS if that group maintains the server.
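SQL Server’s own backup can also be scripted, which makes scheduling easy. A minimal sketch, assuming a database named EDMS and a local backup folder; the connection string, driver name, and path are illustrative, and BACKUP DATABASE must run outside a transaction (hence autocommit):

import pyodbc
from datetime import date

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=edms-srv;"
    "DATABASE=master;Trusted_Connection=yes;", autocommit=True)
target = f"D:\\backups\\EDMS_{date.today():%Y%m%d}.bak"
cur = conn.cursor()
cur.execute(f"BACKUP DATABASE EDMS TO DISK = N'{target}' WITH INIT")
while cur.nextset():   # drain server messages so the backup completes
    pass
conn.close()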
CHAPTER 16 DATA VERIFICATION AND VALIDATION
For many projects, data verification and validation are significant components of the data management effort. A variety of related tasks are performed on the values in the database so that they are as accurate as possible, and so that their accuracy (or lack thereof) is documented. This maximizes the chance that data is useful for its intended purpose. The move toward structured data validation has been driven by the EPA Contract Laboratory Program (CLP), but the process is certainly performed on non-EPA projects as well. Verification and validation is not “one size fits all.” Different projects have different data quality objectives, and consequently different data checking activities. The purpose of this section is not to teach you to be a data validator, but rather to make you aware of some of the issues that are addressed in data validation. The details of these issues for any project are contained in the project’s quality assurance project plan (QAPP). The validation procedures are generally different for different categories of data, such as organic water analyses, inorganic water analyses, and air analyses.
TYPES OF DATA REVIEW

Data review refers to the process of assessing and reporting data quality, and includes verification, validation, and data quality assessment. Data quality terms have different meanings to different people in the industry, and people often hold different but strongly held beliefs about the definitions of terms like validation, verification, data review, and checking. Hopefully, when you use any of these terms, the person you are talking to is hearing what you mean, and not just what you say.
MEANING OF VERIFICATION

EPA (2001c) defines verification as:

Confirmation by examination and provision of objective evidence that specified requirements have been fulfilled. Data verification is the process of evaluating the completeness, correctness, and conformance/compliance of a specific data set against the method, procedural, or contractual requirements.

Core Laboratories (1996) provides the following somewhat simpler definition of verification:

Verification is the process of determining the compliance of data with method and project requirements, including both documentation and technical criteria.
[Figure 79 - Data verification and validation components in the project life cycle (after EPA, 2001c). The flowchart components include project planning; field activities and field documentation review; sample collection, sample management, and sample preparation; sample analysis, sample receipt, and the LIMS; data verification, with its documentation and verified data; laboratory documentation review; focused data validation (as requested) and the focused data validation report; data validation of field and analytical laboratory data; the data validation report and validated data; and data quality assessment.]
Verification is sometimes used informally to refer to the process of checking values for consistency, reasonableness, spelling, etc., especially when this is done automatically by software. This checking is particularly important in a relational data management system, because referential integrity depends strongly on data consistency for the data to be imported and retrieved successfully.
MEANING OF VALIDATION

EPA (2001c) defines validation as:

Confirmation by examination and provision of objective evidence that the particular requirements for a specific use have been fulfilled. Data validation is an analyte- and sample-specific process that extends the evaluation of data beyond method, procedural, or contractual compliance (i.e., data verification) to determine the analytical quality of a specific data set.

Core Laboratories (1996) provides the following definition of validation:

Validation is the process of determining the usability of data for its intended use, including qualification of any non-compliant data.

EPA (1998c) provides the following levels of data validation:

Level 0 Data Validation – Conversion of instrument output voltages to their scaled scientific units using nominal calibrations. May incorporate flags inserted by the data logger.
Level 1 Data Validation – Observations have received quantitative and qualitative reviews for accuracy, completeness, and internal consistency. Final audit reviews are required.
Level 2 Data Validation – Measurements are compared for external consistency against other independent data sets (e.g., comparing surface ozone concentrations from nearby sites, intercomparing raw windsonde and radar profiler winds, etc.).
Level 3 Data Validation – Observations are determined to be physically, spatially, and temporally consistent when interpretive analyses are performed during data analysis.

Validation contains a subjective component, while verification is objective.
THE VERIFICATION AND VALIDATION PROCESS

While it is possible to find definitions of verification and validation, it is not easy to draw the line between the two. Verification is the evaluation of performance against predetermined requirements, while validation focuses on the data needs of the project. The data must satisfy the compliance requirement (verification) in order to satisfy the usability requirement (validation). From a software perspective, perhaps it is best to view verification as something that software can do, and validation as something that people need to do; that is where the data validator comes in. Data validators are professionals who spend years learning their trade. Some would say that validation is as much an art as a science, and that a good validator has a “feel” for the data that takes a long time to develop. It is certainly true that an experienced person is more likely than an inexperienced person to identify problems based on minimal evidence.

EPA (1996) describes validation as identifying the analytical error associated with a data set. This is then combined with the sampling error to determine the measurement error. The measurement error is in turn combined with the sampling variability (spatial variability, etc.) to determine the total error or uncertainty, which is then used to evaluate the usability of the data. The validator is responsible for determining the analytical error and the sampling error. The end user is responsible for combining these with the sampling variability for the final assessment of data usability.

Data validation is a decision-making process in which established quality control criteria are applied to the data. An overview of the process is illustrated in Figure 79. The validator should review the data package for completeness, assess the results of QC checks and procedures, and examine the raw data in detail to verify the accuracy of the information. The validation process involves a series of checks of the data, as described in the next section. Each sample is accepted, rejected, or qualified based on these checks. Individual sample results that fail any of the checks are not thrown away, but are marked with qualifier codes so the user is aware of the problems. Accepted data can be used for any purpose. Rejected data, usually given a flag of “R,” should never be used. Qualified data, such as data that is determined to be estimated and given a “J” flag, can be used as long as it is felt to satisfy the data quality objectives, but should not be used
indiscriminately. The goal is to generate data that is technically valid, legally defensible, of known quality, and ultimately usable in making site decisions. Verification and validation requirements apply to both field and laboratory data.

If you’re right 90% of the time, why quibble about the remaining 4%? – Rich (1996)

Recently EPA (2001c) has been emphasizing a third part of the process, data quality assessment (DQA), which determines the credibility of the data. This has become increasingly important as more and more cases of laboratory and other fraud in generating data are being found, and in fact some laboratory operators have been successfully prosecuted for fraud. The integrity of the laboratory and of others in the process can no longer simply be assumed. The DQA process involves looking at the data for clues that shortcuts have been taken or other things done that would result in invalid data. Examples of improper laboratory practices include failure to analyze samples and then fabricating the results (drylabbing); failure to conduct the required analytical steps; manipulating the sample prior to analysis, such as by fortification with additional analyte (juicing); manipulating the results during analysis, such as by reshaping a peak that is subtly out of specification (peak shaving or peak enhancement); and post-analysis alteration of results. EPA guidance documents provide warning signs to help validators detect these activities.

Data verification consists of two steps. The first is identifying the project requirements for records, documentation, and technical specifications for data generation, and determining the location and source of these documents. The second is verifying that the data records that are produced or reported satisfy the method, procedural, or contractual requirements for the field and analytical operations, including sample collection, sample receipt, sample preparation, sample analysis, and data verification documentation review. The two outputs of data verification are the verified data and the data verification documentation.

Data validation involves inspection of the verified data and verification documentation, a review of the verified data to determine the analytical quality of the data set, and production of a data validation report and qualified data. Documentation input to validation includes project-specific planning documents such as a QAPP or SAP, generic planning documents, field and laboratory SOPs, and published sampling and analytical methods. Data validation includes, as much as possible, the reasons for failure to meet the requirements, and the impact that each failure has on the usability of the data set.
VERIFICATION AND VALIDATION CHECKS

This section describes a few typical validation checks, and comes from a variety of sources, including EPA (1996).

Data completeness – The validator should review the data package for completeness to ensure that it contains the required documents and forms.

Field QC samples – Field QC samples such as trip blanks, equipment blanks, and field duplicates are typically collected at a rate of about one per 20 field samples. The validator should confirm compliance, and use the QC results to identify the sampling error.

Holding times – For various matrices and types of analyses, the time from collection to extraction and analysis must not exceed a certain period. Exceedance of holding times is a common reason for qualifying data. Information on holding times for some analytes is given in Appendix D.

Equipment calibration – The appropriate project-specific or analysis-specific procedure must be used, both for initial and ongoing calibration, and the validator should check for this.

LCS and duplicates – The correct number of QC samples must be run at various stages of the process, and the results should agree with the primary samples within specific tolerances.
Figure 80 - Data entry screen for setting up validation parameters
Blanks – Blank samples should be run at appropriate intervals to identify contamination, and this should be confirmed by the validator.

Surrogates – Surrogates are compounds not expected in the sample, but expected to react similarly to the target analytes in analysis. Surrogates are added in known concentrations to assist with calibration. The recovery of the surrogate must be within certain ranges.

Matrix effects – The validator should examine matrix spikes, matrix spike duplicates, surrogate spike recoveries, and internal standard responses to identify any unusual matrix effects.

Performance evaluation samples – Blind PE samples may be included in each sample group to help evaluate the laboratory’s ability to identify and measure values during the sample analysis. The validator should compare the analyzed results to the known concentrations to evaluate the laboratory’s performance.

Detection limits – The laboratory should be able to perform the analyses to the required detection limits; if it cannot, this may require qualification of the data.

In addition, there are a number of checks that should be performed for specific analytical techniques, such as furnace AA and ICP.
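Several of these checks lend themselves to automation in the EDMS. A minimal sketch of a holding-time check; the limits, parameter names, and “H” qualifier flag here are hypothetical, since the real limits are method- and matrix-specific and belong in the project’s QAPP (see Appendix D):

from datetime import datetime

# Hypothetical holding-time limits (days from collection to analysis).
HOLDING_DAYS = {"Mercury": 28, "Nitrate": 2, "VOCs": 14}

def check_holding_time(parameter, collected, analyzed):
    """Return a qualifier flag if the holding time was exceeded."""
    limit = HOLDING_DAYS.get(parameter)
    if limit is None:
        return None          # no limit on file for this parameter
    held = (analyzed - collected).days
    return "H" if held > limit else None

flag = check_holding_time("Nitrate",
                          datetime(2001, 6, 1, 10, 30),
                          datetime(2001, 6, 5, 9, 0))
print(flag)  # "H" - holding time exceeded, so the result should be qualified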
SOFTWARE ASSISTANCE WITH VERIFICATION AND VALIDATION

Data verification and validation is such an important operation for many projects that those projects must have tools and procedures to accomplish it. With the current focus on keeping down the cost of environmental projects, it is increasingly important that these tools and procedures allow the validation process to be performed efficiently. The EDMS software can help with this.
Figure 81 - Software screen for configuring and running validation statistics reports
Prior to validation

Usually the data is verified before it is validated. The EDMS can be a big help with the verification process, providing checks for consistency, missing data, and referential integrity issues. Chapter 13 contains information on software assistance with checking during import. One useful approach is for the consistency component of the verification to be done as part of the standard import, and then to provide an option, after consistency checking, to import the data either directly into the database or into a validation subsystem. The validation subsystem contains tables, forms, and reports to support the verification and validation process. After validation, the data can then be moved into the database. To validate data already in the database, the data selection screen for the main database can provide a way to move the data to be validated into the validation table. Once the validation has been performed, information resulting from the validation, such as validation flags, can be added to those records in the main database. The validation system can provide Save and Restore options that allow users to move between multiple data sets.

In the validation table, flagging edits and QC notations are made prior to entry into the main database. Once validation is completed, the associated analytical and field duplicate data can be imported into the main database, and the validation table can be saved in a file or printed as documentation of the data validation process. Validation involves comparison of the data to a number of standard values, limits, and counts. These values vary by QC type, matrix, and site. Figure 80 shows a program screen for setting some of these values.
Visual validation

Visual validation is the process of looking at the data and the supporting information from the laboratory and determining whether the validation criteria have been met. The EDMS can help by organizing the data in various ways so that the visual validation can be done
efficiently. This is a combination of calculations and reporting. Figure 81 shows a software screen for configuring and running validation statistics reports.
VALIDATION CALCULATIONS

The software can provide a set of calculations to assist the data validator. Examples include calculation of the relative percent difference (RPD) between lab and field duplicates and the primary samples, and calculation of the adequacy of the number of lab QC samples, such as calibration check verification (CCV) and laboratory control sample (LCS) analyses.
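Both calculations are simple enough to sketch. A minimal example, with hypothetical duplicate results and an assumed one-QC-sample-per-20 frequency requirement; actual frequencies come from the project’s QAPP:

def rpd(primary, duplicate):
    """Relative percent difference between a primary and duplicate result."""
    return 100.0 * abs(primary - duplicate) / ((primary + duplicate) / 2.0)

def qc_frequency_ok(n_samples, n_qc, required_ratio=20):
    """Check that at least one QC sample was run per required_ratio samples."""
    needed = -(-n_samples // required_ratio)  # ceiling division
    return n_qc >= needed

print(f"RPD: {rpd(1250.0, 1190.0):.1f}%")    # e.g., a lab duplicate pair
print(qc_frequency_ok(n_samples=45, n_qc=3)) # True: 3 >= ceil(45/20)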
VALIDATION REPORTING

The key component of visual validation is inspection of the data, and the software should provide reports to help with this. These reports include QC exceedance reports; summaries of sample and QC control completeness; QC data summaries by QC type (field duplicate, lab duplicate, LCS, CCV); and reports of the quality control parameters used in RPD and recovery calculations.
STATISTICS

Some of the reports used in the validation process are based on statistical calculations on the data, and in some cases on other data in the database from previous sampling events. Examples of these reports include:

Basic Statistics Report – Calculates basic statistics, such as minimum, maximum, mean (arithmetic and/or geometric), median, standard deviation, mean plus 2 standard deviations, and upper 95th percentile, by parameter for a selected date range.

Ion Balance Report – Calculates the cation/anion percent difference. It can also calculate the percent differences of field parameters analyzed in both the field and the lab, and might also add up the major constituents and compare the result to the amount of total dissolved solids (TDS) reported by the laboratory.

Trend Report – Compares statistics from a range of data in the database with the current data set in the validation table, and reports the percent difference by parameter.

Comparison Report – Flags data in the validation data set that is higher or lower than any analyses in the comparison data set.

Figure 82 shows an example of one type of validation statistics report.
Figure 82 - Example of a validation statistics report (the L to the right means less than the mean)
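The ion balance check behind the Ion Balance Report is a one-line formula once concentrations are converted to milliequivalents per liter. A minimal sketch with hypothetical values; the common practice of flagging samples whose difference exceeds about 5% is an assumption here, and project criteria govern:

def ion_balance_pct_diff(cations_meq, anions_meq):
    """Cation/anion percent difference; both inputs in meq/L."""
    total_cat, total_an = sum(cations_meq), sum(anions_meq)
    return 100.0 * (total_cat - total_an) / (total_cat + total_an)

# Hypothetical sample, concentrations already converted to meq/L
cations = [4.2, 1.9, 0.8]   # e.g., Ca, Mg, Na
anions = [3.9, 2.1, 1.1]    # e.g., HCO3, SO4, Cl
diff = ion_balance_pct_diff(cations, anions)
print(f"Ion balance difference: {diff:+.1f}%")  # flag if |diff| exceeds limit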
Autoflagging

The autoflagging process uses the results of calculations to set preliminary flags, which are then inspected by the validator and either accepted or modified. For example, the validation option can compare QC calculations against user-supplied project control limits. The data is then flagged based on user-configurable data flagging options. The flagging can then be reviewed and revised using edit screens included in the validation menu system.
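A minimal sketch of autoflagging against a project control limit; the 20% RPD limit and the “J” (estimated) flag assignment are illustrative assumptions, and the validator reviews every preliminary flag:

def rpd(a, b):
    """Relative percent difference between duplicate results."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)

def autoflag(records, rpd_limit=20.0):
    """Set preliminary flags by comparing duplicate RPDs to a project
    control limit; the validator then accepts or revises each flag."""
    flagged = []
    for rec in records:
        r = rpd(rec["primary"], rec["duplicate"])
        flag = "J" if r > rpd_limit else None   # "J" = estimated value
        flagged.append({**rec, "rpd": round(r, 1), "flag": flag})
    return flagged

pairs = [{"parameter": "Sulfate", "primary": 1250.0, "duplicate": 1190.0},
         {"parameter": "Chloride", "primary": 88.0, "duplicate": 132.0}]
for row in autoflag(pairs):
    print(row)  # the chloride pair exceeds the limit and gets a "J"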
CHAPTER 17 MANAGING MULTIPLE PROJECTS AND DATABASES
Often people managing data are working on several facilities at once, especially over the time frame of months or years. This raises several issues related to the site data, lookup tables and other related data, and ease of moving between databases. These issues include whether the data should be stored in one database or many, sharing data elements such as codes and lookups, moving between databases, and multi-site security.
ONE FILE OR MANY?

If the data management system allows for storing multiple sites in the same database, then some decisions need to be made about how many databases to have, and how many and which sites to store in each. Even the concept of what constitutes a site can be difficult to define in some cases.
What is a site?

While the usage of the term “site” in this book has been pretty much synonymous with “facility” or “project,” it is often not that simple. First, some people use “site” to mean a sample location, or what we are calling a “station,” including monitoring wells, soil borings, and so on. This is a difference in terminology, not concepts; it is just personal preference, and will be ignored here, with our apologies to those who use the term that way.

A bigger issue, when “site” is used to mean “facility,” is what constitutes a “site”? The problem can be illustrated with several examples. One example is a facility that has various different operations, administrative units, or environmental problems within it. Some large facilities have dozens of different environmental issues being dealt with relatively independently. For example, we have a client investigating and remediating a refinery site. A number of ponds and slag piles are being managed one way, and each has its own set of soil borings and monitoring wells. The operating facility itself has another set of problems and data associated with it. This project can be viewed as one site with multiple sub-parts, which is how the client is managing it, or as several separate sites.
Why does DC have the most lawyers per capita and New Jersey the most toxic waste dumps? New Jersey had first choice. – Rich (1996)

The second case is where there are several nearby, related projects. One of our clients is remediating a nuclear processing facility. The facility itself has a number of ponds, railway areas, and building locations that must be excavated and shipped away. Over the years, some tailings from the facility were used throughout the neighboring residential area for flower gardens and yards (unfortunately, the tailings made good topsoil). And some material from the facility and the residential area made it into the local creek, which now requires remediation. Each of these is being managed differently, with different investigative techniques, supervision, and regulatory oversight. In this case the client has chosen to treat each area as a separate site from a data management perspective, because it views them as different projects.

Another example is a municipality that is using an EDMS to manage several solid waste landfills, which are, for the most part, managed separately. Two of the landfills are near each other. Each landfill has its own monitoring wells, and these can easily be assigned to the proper site. Some monitoring wells are used as background wells for both landfills, so they don’t apply uniquely to either. In this case the client has elected to define a third “site” for the combined wells. At data selection time it can choose one landfill site plus the site containing the combined wells to obtain the data it needs.

There is no “right” or “wrong” way to define a site. The decision should be made based on which approach provides the greatest utility for working with the data.
To lump or to split?

Once you have decided what to call a site, you still have the problem of which sites to put in which databases, if your EDMS supports multiple sites. Part of the answer comes from what type of organization is managing the data, and who the data belongs to. The answer may be different for an industrial user managing data for its own sites than for a consulting company managing data for multiple clients. However, the most important issue is usually whether you need to make comparisons between sites. If you do, it will be easier if the sites are in the same database. If not, there may be no benefit to having them in the same database.

In one case, a landfill company used its database system to manage groundwater data from multiple landfills. Its hydrogeologist thought that there might be a relationship between the turbidity of samples taken in the field and the concentration of contaminants reported by the lab. This company had chosen to “lump” its sites, and the hydrogeologist was able to perform some queries to compare values across dozens of landfills to see if there was in fact a correlation (there was). In this case, having all of the data in one database provided a big benefit.

Consultants managing data for multiple clients have a different issue. It is unlikely that they will want to compare and report on data belonging to more than one client at a time. It is also likely that, at some point, a copy of a client’s database will have to be provided to others, either voluntarily or through litigation, and the consultant certainly should not deliver data that the client doesn’t own. In this case it makes sense to have a different database for each client. Then, within that client’s data, a decision should be made whether to have one large database or multiple smaller ones. We recently visited a consulting company with 38 different SQL Server databases, one for each of its active clients.

Another factor that often enters into the database grouping decision is geographic proximity. If several sites are nearby, there is a good chance that at some point they will be viewed as a unit, and there would be an advantage to having them in the same database. If they are in different parts of the country, it is less likely that, for example, you would want to put their values on the same map.
Figure 83 - A screen to help the user attach to different server databases.
The size of the resulting database can also have an impact on the decision. If the amount of data for each site will be very large, then combining several sites might not be a good idea because the size of the database might become unmanageable. A good understanding of the capacity of the database tool is important before making the decision.
SHARING DATA ELEMENTS

It is useful to view the data in the database as consisting of the site data and the supporting data. The site data consists of the site, station, sample, and analysis information. The supporting data includes the lookup tables, like station types, units and unit conversions, parameter names, and so on. If you are managing several similar sites, then the supporting data may be similar. In the case of sites with a long list of constituents of concern, just managing the parameter table can take a lot of effort. If the sites are in one database, then sharing the supporting data between them should not be an issue. If the sites are managed in separate databases, it may be desirable to have a way to move the supporting data between databases, or to propagate changes made in one database into other similar databases.

One project we worked on involved multiple pipeline pumping stations across several states. The facilities were managed by different consultants, but the client wanted the data managed consistently across projects. In this case, the decision was made to store the data for each site in a different database because of the different companies responsible for each project. However, in order to keep the data management consistent, one data manager was assigned management of the parameter list and other lookups, and kept the other data administrators updated with changes so the databases stayed consistent.
MOVING BETWEEN DATABASES

Depending on the computing experience of the people using the databases, it might be valuable to provide an easy way for users to move between databases. This is even more important in the case of client-server systems, where connecting to the server database involves complicated command strings. Figure 83 shows an example of a screen that does this.

It is often helpful to be able to move data between databases as well. If the EDMS has a good data transfer system, such as one using a formalized data transfer standard, then moving data from one database to another should not be difficult.
Figure 84 - Screen for assigning users to sites
LIMITING SITE ACCESS

One of the issues that may need to be addressed if multiple sites are stored in one database is that not all users need to have access to all sites. Some users may need to work on one site, while others may need access to several or all of the sites. In this scenario, the software must provide a way for users to be assigned to one or more sites, and then be limited to working with only those sites. Figure 84 shows a screen for assigning users to sites based on their Windows login ID. The software will then filter the data that each user sees to the sites to which that user has been assigned.
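A minimal sketch of this kind of filtering, assuming a hypothetical user_sites assignment table like the one shown in Figure 84; all table and column names are illustrative:

import sqlite3

def stations_visible_to(conn, windows_login):
    """Return only the stations at sites assigned to this user."""
    return conn.execute(
        "SELECT si.site_name, st.station_name "
        "FROM stations st "
        "JOIN sites si ON si.site_id = st.site_id "
        "JOIN user_sites us ON us.site_id = si.site_id "
        "WHERE us.windows_login = ?",
        (windows_login,)).fetchall()

conn = sqlite3.connect("edms.db")
for site, station in stations_visible_to(conn, "ACME\\jsmith"):
    print(site, station)
conn.close()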
PART FIVE - USING THE DATA
CHAPTER 18 DATA SELECTION
An important key to successful use of an EDMS is to allow users to easily find the data they need. There are two ways for the software to assist the user with data selection: text-based and graphical. With text-based queries, the user describes the data to be retrieved using words, generally in the query language of the software. Graphical queries involve selecting data from a graphical display such as a graph or a map. Query-by-form is a hybrid technique that uses a graphical interface to make text-based selections.
TEXT-BASED QUERIES

There are two types of text-based queries: canned and ad hoc. The trade-off is ease of use vs. flexibility.
Canned queries

Canned queries are procedures where the query is prepared ahead of time, and the retrieval is done the same way each time. An example would be a specific report for management or regulators, which is routinely generated from a menu selection screen. The advantage of canned selections is that they can be made very easy to use, since they involve a minimum of choices for the user. The goal is to make it easy to quickly generate the output that will be required most of the time by most of the users. The EDMS should make it easy to add new canned queries, and to connect to external data selection tools if required.

Figure 85 shows an example of a screen from Access from which users can select pre-made queries. The different icons next to the queries represent the different query types, including select, insert, update, and delete. The user can execute a query by double-clicking on it. Queries that modify data (action queries), such as insert, update, and delete, display a warning dialog box before performing the action. Other than with the icons, this screen does not separate selection queries from action queries, which creates some risk in the hands of inexperienced or careless users.
Figure 85 - Access database window showing the Queries tab
Ad hoc queries

Sometimes it is necessary to generate output with a format or data content that was not anticipated in the system design. Text selections of this type are called ad hoc queries (“ad hoc” is a Latin term meaning “for this”). These are queries that are created when they are needed for a particular use. This type of selection is more difficult to provide in a way that users, especially casual users, can work with comfortably. Performing ad hoc text-based queries usually requires that users have a good understanding of the structure and content of the database, as well as a medium to high level of expertise with the software. The data model should be included with the system documentation to assist them in doing this.

Unfortunately, ad hoc queries also carry a high risk that the data retrieved may not be valid. For example, the user may not include the units for analyses, and the database may contain different units for a single parameter sampled at different times. The data retrieved will be invalid if the units are assumed to be the same, and there is no visible indication of the problem. This is particularly dangerous when the user is not seeing the result of the query directly, but is using the data indirectly to generate some other result, such as statistics or a contour map. In general, it is desirable to formalize and add to the menu as wide a variety of correctly formatted retrievals as possible. Then casual users are likely to get valid results, and “power users” can use ad hoc queries only as necessary.

Figure 86 shows an example of the creation of an ad hoc text-based query. The user has created a new query, selected the tables for display, dragged the fields from the tables to the grid, and entered selection criteria. In this case, the user has asked for all “Sulfate” results for the site “Rad Industries” where the value is > 1000. Access has translated this into SQL, which is shown in the second panel, and the user can toggle between the two. The third panel shows the query in datasheet view, which displays the selected data. The design and SQL views contain the same information, although in Access it is possible to write a query, such as a union query, that can’t be displayed in design view and must be shown in SQL. Some advanced users prefer to type in the SQL rather than use design view, but even for them the drag and drop can save typing and minimize errors.
Figure 86 - A text-based query in design, SQL, and datasheet views
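The statement Access builds in Figure 86 would look roughly like the following, written here as portable SQL run from Python against a hypothetical Sites/Stations/Samples/Analyses hierarchy; the table and field names are illustrative, not the book's actual data model:

import sqlite3

SQL = """
SELECT st.station_name, sa.sample_date, an.parameter, an.value, an.units
FROM sites si
JOIN stations st ON st.site_id = si.site_id
JOIN samples sa ON sa.station_id = st.station_id
JOIN analyses an ON an.sample_id = sa.sample_id
WHERE si.site_name = 'Rad Industries'
  AND an.parameter = 'Sulfate'
  AND an.value > 1000
ORDER BY sa.sample_date
"""
conn = sqlite3.connect("edms.db")
for row in conn.execute(SQL):
    print(row)
conn.close()

Note that this query returns the units column explicitly, which guards against the mixed-units pitfall described above.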
GRAPHICAL SELECTION

A second selection type is graphical selection. In this case, the user generates a graphical display, such as a map, of a given site, selects the stations (monitoring wells, borings, etc.), and then retrieves the associated analytical data from the database.
Figure 87 - Interactive graphical data selection
Figure 88 - Editing a well selected graphically
Figure 89 - Batch-mode graphical data selection
Geographic Information System (GIS) programs such as ArcView, MapInfo, and Enviro Spase provide various types of graphical selection capability. Some map add-ins that can be integrated with database management and other programs, such as MapObjects and GeoObjects, also offer this feature. There are two ways of graphically selecting data: interactive and batch.

In Figure 87 the user has opened a map window and a list window showing a site and some monitoring wells. The user then double-clicked on one of the wells on the map, and the list window scrolled to show some additional information on the well. In Figure 88 a well was selected graphically, then the user called up an editing screen to view and possibly change data for that well. The capability of working with data in its spatial context can be a valuable addition to an EDMS.

In Figure 89 the user wanted to work with wells in or near two ponds. The user dragged a rectangle to select a group of wells, and then individually selected another. Then the user asked the software to create a list of information about those wells, which is shown on the bottom part of the screen. In this case the spatial component was a critical part of the selection process. Selection based on distance from a point can also be valuable. The point can be a specific object, such as a well, or any other location on the ground, such as a proposed construction location. The GIS can help you perform these selections.

Other types of graphical selection include selections from graphs and from cross sections. Some graphics and statistics programs allow you to create a graph, and then click on a point on the graph and bring up information about that point, which may represent a station, sample, or analysis. GIS programs that support cross section displays can provide a similar feature, where a user can click on a soil boring in a cross section and then call up data from that boring, or from a specific sample in that boring.
Figure 90 - Example of query-by-form
QUERY-BY-FORM

A technique that works well for systems with a variety of user skill levels is query-by-form, or QBF. In this technique, a form is presented to the user with fields for some of the data elements that are most likely to be used for selection. The user fills out as many of the fields as needed to select the subset of interest. The software then creates a query based on the selection criteria. This query can then be used as the basis for a variety of different lists, reports, graphs, maps, or file exports. Figure 90 shows an example of this method.
Figure 91 - Query-by-form screen showing selection criteria for different data levels
In this example, the user has selected Analyses in the upper right corner. Along the left side, the user selected “Rad Industries” as the site and “MW-1” as the station name. In the center of the screen, the user has selected a sample date range of greater than 1/1/1985, and “Sulfate” as the parameter. The lower left of the screen indicates that there are 16 records that match these criteria, meaning that there are 16 sulfate measurements for this well for this time period. When the user selected List, the form at the bottom of the screen was displayed, showing the results.

To be effective, the form for querying should represent the data model, but in a way that feels comfortable to the user. The screen should also allow the user to see the selection options available. Figure 91 shows four different versions of a screen allowing users to make selections at four different levels of the data hierarchy. The more defined the data model, the easier it is to provide advanced, user-friendly selection. The Access query editor is very flexible, and will work with any tables and fields that might be in the database. However, the user has to know the values to enter into the selection criteria. If the fields are well defined and won’t change, then a screen like that shown in Figures 90 and 91 can provide selection lists to choose values from. Figure 92 shows an example of a screen presenting the user with a list of parameter names to choose from.
Figure 92 - Query-by-form screen showing data choices
One final point to be emphasized is that the quality of the retrieved data depends on good selection practices. This was discussed above and in Chapter 15. Improper selection and display can result in data that is easy to misinterpret. Great care must be taken in system design, implementation, and user training so that the data retrieved accurately represents the answer to the question the user intended to ask.
CHAPTER 19 REPORTING AND DISPLAY
It takes a lot of work to build a good database. Because of this, it makes sense to get as much benefit from the data as possible. This means providing data in formats that are useful to as many aspects of the project as possible, and printed reports and other displays are one of the primary output goals of most data management projects. This chapter covers a variety of issues for reports and other displays. Graph displays are described in Chapter 20. Cross sections are discussed in Chapter 21, and maps and GIS displays in Chapter 22. Chapter 23 covers statistical analysis and display, and using the EDMS as a data source for other programs is described in Chapter 24.
TEXT OUTPUT

Whether the user has performed a canned or ad hoc query, the desired result might be a tabular display. This display can be viewed on the screen, printed, saved to a file, or copied to the clipboard for use in other applications. Figure 93 is an example of this type of display. This is the most basic type of retrieval, and is considered unformatted output, meaning that the data is there, but there is no particular presentation associated with it.
Figure 93 - Tabular display of output from the selection screen
Figure 94 - Banded report for printing
FORMATTED REPORTS

Once a selection has been made, another option is formatted output. The data can be sent to a formatted report for printing or electronic distribution. A formatted report is a template designed for a specific purpose and saved in the program. The report is based on a query or table that provides the data, and the report form provides the formatting.
Standard (banded) reports

Figure 94 is an example of a report formatted for printing. This example shows a standard banded report, where the data at different parent-child levels is displayed in horizontal bands across the page. This is the easiest type of report to create in many database systems, and is most useful when there is a large amount of information to present for each data element, because one or more lines can be dedicated to each result.
Cross-tab reports

The next figure, Figure 95, shows a different organization called a cross-tab or pivot table report. In this layout, one element of the data is used to create the headers for columns. In this example, the sample event information is used as column headers.
Figure 95 - Cross-tab report with samples across and parameters down
Figure 96 - Cross-tab report with parameters across and samples down
Figure 96 is a cross-tab pivoted the other way, with parameters across and sample events down. In general, cross-tab reports are more compact than banded reports because multiple results can be shown on one line.
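In an Access-based EDMS, a cross-tab like Figure 95 can be produced with Access's crosstab SQL (TRANSFORM ... PIVOT). The following is a hedged sketch, assembled as a VBA string; the query and field names (qryResults, ParameterName, SampleDate, ResultText) are hypothetical stand-ins for the output of the selection step:

    ' Create a saved crosstab query with parameters down the side and
    ' sample dates across the top (the layout of Figure 95).
    Dim sql As String
    sql = "TRANSFORM First(ResultText) " & _
          "SELECT ParameterName " & _
          "FROM qryResults " & _
          "GROUP BY ParameterName " & _
          "PIVOT Format(SampleDate, 'mm/dd/yyyy');"
    CurrentDb.CreateQueryDef "qryCrosstab", sql

Pivoting the report the other way, as in Figure 96, just swaps which field appears in the SELECT and GROUP BY clauses and which appears in the PIVOT clause.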
Figure 97 - Data display options
Cross-tab reports provide a challenge regarding the display of field data when multiple field observations must be displayed with the analytical data. Typically there will be one result for each analyte (ignoring dilutions and reanalyses), but several observations of pH for each sample. In a cross-tab, the additional pH values can be displayed either as additional columns or additional rows. Adding rows usually takes less space than additional columns, so this may be preferred, but either way the software needs to address this issue.
FORMATTING THE RESULT

There are a number of options that can affect how the user sees the data. Figure 97 shows a panel with some of these options. The user can select which regulatory limit or regulatory limit group to use for comparison, how to handle non-detected values, how to display graphs and handle field data, whether to include calculated parameters, how to display the values and flags, how to format the date and time, and whether to convert to consistent units and display regulatory limits.
Regulatory limit comparison

For investigation and remediation projects, an important issue is comparison of analytical results to regulatory limits or target levels. These limits might be based on national regulations such as federal drinking water standards, state or local government regulations, or site-specific goals based on an operating permit or Record of Decision (ROD). Project requirements might be to display all data with exceedences highlighted, or to create a report with only the exceedences. For most constituents, the comparison is against a maximum value. For others, such as pH, both an upper and a lower limit must be met.

The first step in using regulatory limits is to define the limit types that will be used. Figure 98 shows a software screen for doing this. The user enters the regulatory limit types to be used, along with a code for each type. The next step is to enter the limits themselves. Figure 99 shows a form for doing this. Limits can be entered as either site-specific or for all sites. For each limit, the matrix, parameter, and limit type are entered, along with the upper and lower limits and units. The regulatory limit units are particularly important: they must be considered in later comparison, and should be taken into account in conversion to consistent units as described below.

There is one complication that must be addressed for limit comparison to be useful for many project requirements. Often the requirement is for different parameters, or groups of parameters, to be compared to different limit types on the same report. For example, the major ions might be compared to federal drinking water standards, but the organics may be compared to more stringent local or site-specific criteria. This requires that the software provide a feature to allow the use of different limits for different parameters. Figure 100 shows a screen for doing this. The user enters a name for the group, and then selects limits from the various limit types to use in that group.
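To illustrate the comparison logic itself (a sketch with hypothetical table and field names, not the screens shown in the figures), an exceedence query must honor an upper limit, a lower limit, or both, and ignore whichever side is not defined:

    ' List results outside their regulatory limits. A result exceeds if
    ' it is above a defined upper limit or below a defined lower limit
    ' (the two-sided case, such as pH).
    Dim sql As String
    sql = "SELECT R.StationName, R.ParameterName, R.Value, " & _
          "L.LowerLimit, L.UpperLimit " & _
          "FROM tblResults AS R INNER JOIN tblRegLimits AS L " & _
          "ON R.ParameterCode = L.ParameterCode AND R.Matrix = L.Matrix " & _
          "WHERE (L.UpperLimit Is Not Null AND R.Value > L.UpperLimit) " & _
          "OR (L.LowerLimit Is Not Null AND R.Value < L.LowerLimit);"

A real implementation would also convert the result and the limit to the same units before comparing, as discussed under consistent units later in this chapter.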
Figure 98 - Form for defining regulatory limit types
Figure 99 - Form for entering regulatory limits
Figure 100 - Form for defining regulatory limit groups
Figure 101 - Selection of regulatory limit or group for reporting
After the limits and groups have been defined, they can be used in reporting. Figure 101 shows a panel from the selection screen where the user is selecting the limit type or group for comparison. The list contains both the regulatory limit types and the regulatory limit groups, so either one can be used at report time. The software code should be set up to determine which type of limit has been selected, and then retrieve the proper data for comparison.
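One hedged way to code that decision, again with hypothetical table, field, and control names (tblLimitGroups, tblGroupMembers, tblRegLimits, cboLimitSelection):

    ' Decide whether the user picked a limit group or a single limit
    ' type, then fetch the appropriate set of limits for the report.
    Dim pick As String, rs As DAO.Recordset
    pick = Me.cboLimitSelection
    If DCount("*", "tblLimitGroups", "GroupCode = '" & pick & "'") > 0 Then
        ' A group: gather the member limits, which may mix limit types
        Set rs = CurrentDb.OpenRecordset( _
            "SELECT * FROM tblGroupMembers WHERE GroupCode = '" & pick & "'")
    Else
        ' A single limit type: use it for every parameter on the report
        Set rs = CurrentDb.OpenRecordset( _
            "SELECT * FROM tblRegLimits WHERE LimitType = '" & pick & "'")
    End If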
Value and flag

Analytical results contain much more information than just the measured value. A laboratory deliverable file may contain 30 or more fields of data for each analysis. In a banded report there is room to display all of this data. When the result is displayed in a cross-tab report, there is only one field for each result, but it is still useful to display some of this additional information. The items most commonly involved in this are the value, the analytical flag, and the detection limit. Different EDMS programs handle this in different ways, but one way to do it is using fields for reporting factor and reporting basis that are based on the analytical flag. Another way to do it is to have a text field for each analysis containing exactly the formatting desired. Examples of reporting factor and reporting basis values, and how each result might look, are shown in the following table:

Basis code  Reporting basis                                    Reporting factor  Value  Flag  Detection limit  Result
v           Value only                                         1                 3.7    v     0.1              3.7
f           Flag only                                          1                 3.7    v     0.1              v
b           Both value and flag                                1                 3.7    v     0.1              3.7 v
l           Less than sign (<) and detection limit or value    1                 3.7    u     0.1              < 0.1
g           Greater than sign (>) and detection limit or value 1                 3.7    u     0.1              > 0.1
d           Detection limit (times factor) and flag            1                 3.7    u     0.1              0.1 u
d           Detection limit (times factor) and flag            .5                3.7    u     0.1              0.05 u
a           Average of values                                  1                 3.7    v     0.1              1.9
m           Dash (-) only                                      1                 3.7    v     0.1              -
The next table shows examples of some analytical flags and how the reporting factor and reporting basis might be assigned to each.
Flag code  Flag                                   Reporting factor  Reporting basis
b          Analyte detected in blank and sample   1                 v
c          Coelute                                1                 v
d          Diluted                                1                 v
e          Exceeds calibration range              1                 v
f          Calculated from higher dilution        1                 v
g          Concentration > value reported         1                 g
h          Result reported elsewhere              1                 f
i          Insufficient sample                    0                 v
j          Est. value; conc. < quan. limit        1                 b
l          Less than detection limit              1                 l
m          Matrix interference                    1                 v
n          Not measured                           0                 v
q          Uncertain value                        1                 v
r          Unusable data                          0                 f
s          Surrogate                              1                 v
t          Trace amount                           1                 d
u          Not detected                           0.5               l
v          Detected value                         1                 v
w          Between CRDL/IDL                       1                 v
x          Determined by associated method        1                 v
y          Calculated value                       0                 v
z          Unknown                                1                 v
Finally, analyses can often have multiple flags, for example "uj," but the result can only be displayed one way. The software needs to have an established priority for the reporting basis so that the display is based on the highest priority format. Based on the previous basis code values, an example of the priority might be: f, l, g, b, d, v, a, and m. This means that for a flag of "bj" the basis codes would be "v" (from the "b" flag) and "b" (from the "j" flag). The "b" basis would have preference, so both the value and the flag would be displayed.
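A sketch of that priority logic in VBA follows; the helper function BasisForFlag, which would look each flag's basis up from a table like the one above, is hypothetical:

    ' Return the highest-priority reporting basis for a multi-flag
    ' result, using the priority order f, l, g, b, d, v, a, m.
    Function BasisForFlags(flags As String) As String
        Dim priority As String, basis As String
        Dim i As Integer, pos As Integer, best As Integer
        priority = "flgbdvam"          ' leftmost = highest priority
        best = Len(priority) + 1
        For i = 1 To Len(flags)
            ' BasisForFlag is a hypothetical lookup against the flag table
            basis = BasisForFlag(Mid$(flags, i, 1))
            If Len(basis) > 0 Then
                pos = InStr(priority, basis)
                If pos > 0 And pos < best Then best = pos
            End If
        Next i
        If best <= Len(priority) Then BasisForFlags = Mid$(priority, best, 1)
    End Function

For the "bj" example, the lookups return "v" and "b"; "b" sits further left in the priority string, so "b" wins.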
Non-detects

When laboratories analyze for a constituent, it may or may not be found. If it is not found, it is referred to as not detected, or a non-detect. The various different detection limits used by laboratories are discussed in Chapter 12. If the result is not detected at the appropriate limit, the lab should flag (qualify) the data with a flag such as "u" for "undetected." It should also report the detection limit and the limit type. It may or may not place the detection limit in the value field.

In a full, banded report, the value, flag, detection limit, and detection limit type can all be reported. In a cross-tab report, or an export such as an XYZ file for contouring, there is no room for that. In those cases there are several ways to handle non-detects, and often a combination of them is used:

Ignore them – Analyses for which the constituent was not detected can be excluded. This is generally not a good idea, since the fact that the constituent wasn't detected is useful information.

Display the value – The software can display the value provided by the laboratory, but this is risky, because the laboratory may or may not place the detection limit in the value field. It has the advantage of being easy to implement, because the report can be based on only one field.
Figure 102 - Form for defining calculated parameters
Display the detection limit – It makes sense to display the detection limit for non-detected values and the value if there was a detection. This is more complicated to program than just basing the report on the value field, because the software has to look at the analysis record and determine which field to display, either using an IF statement (or more likely the slightly different immediate IIF) or using program code (see the sketch after this list).

Display the limit and qualify it – If the limit is displayed, it is helpful to qualify it in the report, either by displaying a less than sign (<) or the flag. To do this only for the non-detects requires special handling in the software.

Apply a factor to the limit – Sometimes a numerical factor is applied to the detection limit before it is displayed. A common factor is one half, although others are sometimes used. The thinking is that the true value is somewhere between the detection limit and zero, so one half is a good guess. This can be useful for estimating volumes of a material, or for other statistical calculations.

Display a zero – A variation on using a factor is to use a zero for non-detects. This is usually not correct technically, but can be useful in some applications like contour mapping. If you do use a zero value in contouring, be sure to do so with care. The value is not really a zero, but is less than a specific value (the detection limit), and setting it to zero could be misleading, especially if the detection limit is highly elevated, and the real value could be different enough from zero to affect the surface. Another option for contouring is to set the value to the indeterminate value, which is the value (such as -99999) that the contouring program ignores in calculating the surface, but then you are throwing away the useful information that the value is low. Some, but not many, contouring programs allow you to specify that the value is less than a certain amount, and then the software constrains the surface based on that information. That is the best solution if it is available.

Which approach is best for displaying non-detects depends on the use of the data. It is important that data users be aware of how the result is being displayed.
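For example, the combination of the third and fourth options might appear as a calculated column in an Access query, using the immediate IIf mentioned above; the field names (Flags, DetectionLimit, Value) are hypothetical:

    Display: IIf(InStr([Flags], "u") > 0, "< " & [DetectionLimit], [Value])

Here any result whose flags contain "u" is shown as a qualified detection limit, and everything else is shown as the reported value.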
Calculated fields

Sometimes it is helpful to display data that is based on calculations using data that is in the database. These results, referred to as calculated fields or derived values, are not stored in the database, but are generated "on the fly" at retrieval time. The software can provide a system for defining and calculating these results. Figure 102 shows an example of how this might be presented.
In this screen, the user has specified that the software is to calculate the mass of the total dissolved solids for a sample. The input parameters have been selected as the total dissolved solids concentration times the effluent volume. The result must then be scaled to the output units of kilograms by dividing by one million. The screen is also asking for a nesting order, which determines the order in which multiple calculations are to be performed, allowing complicated multi-step calculations with many parameters if necessary. There is also a checkbox to enable and disable the calculated field, so that a particular calculation can be turned off and on without deleting it.
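The arithmetic behind that example is worth making explicit (a sketch; the variable names and sample numbers are illustrative only, not from the screen shown):

    ' mass (kg) = concentration (mg/l) x volume (l) / 1,000,000 (mg per kg)
    ' e.g., 450 mg/l TDS x 20,000 l of effluent = 9,000,000 mg = 9 kg
    massKg = tdsMgPerL * volumeLiters / 1000000#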
Consistent units

It is possible that different results for the same parameter in the database might be in different units. This can be avoided at import time, as described in Chapter 13, but that is not always desirable. When the data is displayed in a banded report with one or more lines per result, and the units displayed, then multiple units may not be a problem, since a unit is shown with each value. In a cross-tab report, or if only the numbers (and not the units) are being retrieved for use in statistics, graphing, or mapping, then it is mandatory to convert to consistent units.

A good approach is to define in the software the target units for each parameter and matrix. Matrix is important because the units for different matrices usually should be different. For example, in water the concentration of a constituent like a metal is reported as mass per unit volume, such as milligrams per liter, while for a solid such as soil, it is in mass per mass, such as milligrams per kilogram or parts per million. A screen for defining target units for each parameter is shown in Figure 103. The next step is to define all of the conversion factors necessary to do the conversions. This is also shown in Figure 103.

Conversion of different units of the same scale, such as from milligrams per liter to micrograms per liter, is pretty straightforward. Not all conversions are this simple, however, and great care must be taken in converting between different types of measure. For example, the laboratory may express measurements of radioactive materials like radium-226 in activity, such as picocuries per gram. In order to determine how much material is there, it is useful to have the data in mass units, such as milligrams per kilogram. This conversion, however, depends on a number of factors, such as the isotopic mix, physical properties of the sample, and so on, and consequently is at best site-specific, and at worst involves complicated statistical calculations. Be sure you know what you are doing before you go too far with unit conversions.

Once the target units and conversion factors have been defined, the software can perform the conversion. It is obvious that the value should be converted, but usually you will also want to convert other related information, such as the detection limit, regulatory limits used for comparison, and so on.
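A minimal sketch of a conversion routine, assuming a hypothetical factor table tblUnitConvert with fields FromUnits, ToUnits, and Factor:

    ' Convert a value to the target units for its parameter and matrix.
    ' Fails rather than guesses if no factor has been defined.
    Function ConvertValue(val As Double, fromUnits As String, _
                          toUnits As String) As Double
        Dim f As Variant
        If fromUnits = toUnits Then
            ConvertValue = val
        Else
            f = DLookup("Factor", "tblUnitConvert", "FromUnits = '" & _
                fromUnits & "' AND ToUnits = '" & toUnits & "'")
            If IsNull(f) Then Err.Raise 5, , "No conversion factor defined"
            ConvertValue = val * f
        End If
    End Function

Given a factor of 0.001 from "ug/l" to "mg/l", ConvertValue(125, "ug/l", "mg/l") would return 0.125; the same routine should be applied to detection limits and regulatory limits so the whole report stays consistent.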
Other issues

There are a number of other issues that arise in formatting the data to satisfy project needs. These include handling of decimal places and date and time formatting. Handling of decimal places, or significant figures, is an issue that is not done well in many software programs. Try this experiment. Open a new worksheet in Excel. In one of the cells, type in 3.00, and press Enter. The zeros go away. Access and other programs lose trailing zeros the same way. This results in lost information. If the analysis was to two decimal places, then those zeros should be displayed. There are two ways to handle this in an Access-based database. One is to store the value as a text string, rather than as a number. The other is to store the number of decimal places in a separate field, and combine the two if necessary at retrieval time using a user-defined function.
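A sketch of such a user-defined function, which rebuilds the display string from the stored value and its decimal-place count (the function name and approach are illustrative, not a standard library routine):

    ' Rebuild a result string with its trailing zeros intact.
    ' FormatResult(3, 2) returns "3.00"; FormatResult(3.456, 1) returns "3.5".
    Function FormatResult(val As Double, decimals As Integer) As String
        If decimals <= 0 Then
            FormatResult = Format$(val, "0")
        Else
            FormatResult = Format$(val, "0." & String$(decimals, "0"))
        End If
    End Function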
Figure 103 - Forms for defining units by parameter and matrix, and conversion between units
The issue of date and time formatting is related to the way that the data management software stores dates and times, and how you want them displayed. For example, Access combines dates and times into one field. This field is a numeric field, with the whole number (left of the decimal point) representing the date. Internally this is stored as the number of days since Dec. 30, 1899, so a value of 1 is Dec. 31, 1899, and Jan. 1, 2002 is 37257. The decimal portion of the date number (right of the decimal point) represents the time, starting at midnight. For example, a value of .5 is 12:00 PM (noon) and 8:30 AM is .3541666667. This combination of date and time storage is different from some other systems, such as dBase and FoxPro, where the date and time are stored in separate fields. For environmental projects, the date is nearly always important, but the time may or may not be. For example, for soil samples taken once, the time during the day that they were taken may not be important, but for air samples taken every hour, it certainly would be. For systems like Access that combine the date and time, it is useful to have a feature to turn the display of the time on and off as appropriate for the data being displayed. Reports can be formatted to display the date and the time in separate fields if desired.
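A few lines in the VBA Immediate window illustrate the storage scheme and how the two parts can be separated and formatted (a sketch of standard Access/VBA behavior):

    Dim dt As Date
    dt = #1/1/2002 8:30:00 AM#
    Debug.Print CDbl(dt)          ' 37257.3541666667 - days.fraction
    Debug.Print Int(CDbl(dt))     ' 37257 - the date part (1/1/2002)
    Debug.Print Format$(dt, "mm/dd/yyyy")         ' date only
    Debug.Print Format$(dt, "mm/dd/yyyy hh:nn")   ' date and time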
Formatted panel (top):

Sample Point ->                          MW-1       MW-1
Sample Date ->                           2/26/1981  4/20/1981
Matrix: Water
Parameters        Reg. Limit   Units
Field pH                       s.u.      7.8        7.9
Iron (Ferrous)                 mg/l      0.35       0.10 bj
Nitrate                        mg/l      1.7        < 1.0
Potassium                      mg/l      6.9        6.6
Sulfate           DW 400       mg/l      1255       1400
Reg. Limits: DW - Federal drinking water standards

Unformatted panel (bottom):

Sample Point ->             MW-1       MW-1
Sample Date ->              2/26/1981  4/20/1981
Matrix: Water
Parameters        Units
Field pH          s.u.      7.8        7.9
Iron (Ferrous)    mg/l      0.35       0.1
Nitrate           mg/l      1.7        1
Potassium         mg/l      6.9        6.6
Sulfate           mg/l      1255       1400
Figure 104 - Reports with different levels of formatting for performance comparison
Formatting and performance

Keep in mind that asking the software to perform sophisticated formatting comes at a cost. In Figure 104, the panel on the top has formatted values and comparison to regulatory limits. Notice that a regulatory limit is displayed for sulfate, and both sulfate values are bolded and underlined because they exceed this limit. Also, for 4/20/1981 the value for iron shows the value and analytical flags, and the value for nitrate shows "<" and the detection limit. This retrieval for 315 records takes 17 seconds. The panel on the bottom displays only the numbers, with no comparison to limits, and takes 1.5 seconds. In data management (as in most everything else) nothing is free.
INTERACTIVE OUTPUT

In the past, nearly all of the focus of data management has been on generating printed reports. As data management software evolves, it is now becoming possible to work interactively with the data in ways that before were either not possible or not time-effective. Figure 105 shows an example of this type of interactive display. The software is showing the environmental data in a TreeView display. This display, which is similar to the Windows Explorer display, shows sites at the highest level, then stations, samples, and analyses. At each level, the most pertinent data is displayed. This type of display lets the user "drill down" to find a particular result quickly, even in a large database.
Figure 105 - TreeView display of site data
ELECTRONIC DISTRIBUTION OF DATA

Often the person managing the data is not the person using it. The best approach is for everyone who needs the data to have direct access to it through the EDMS. For various reasons, such as cost and location, this is not always possible. There are several ways to overcome this. One is to make the data available more generally, such as through Web access. Another is through electronic distribution of reports. The Adobe Portable Document Format (PDF) and the free PDF reader are a convenient way to distribute reports. Users create the report that they want in the EDMS, and then print it to the PDF format using Acrobat for distribution. Recipients of the report can use the free Acrobat reader to see it, formatted the way the database user intended.
CHAPTER 20 GRAPHS
There’s an old saying that a picture is worth a thousand words. In many situations, presenting data in a graphical display makes the information much more understandable. A well-designed graph of the data in a table can be many times more informative than the table alone. This chapter and the next two describe and show a variety of graphic displays that can be used to present environmental data. This chapter discusses traditional graphs. Other graphic displays, such as maps and cross sections, are discussed in the following two chapters.
GRAPH OVERVIEW

There's a good and a bad side to graphs. They can be used to display data in a format conducive to greater understanding. They can also be confusing, misleading, or even dishonest. An excellent book by Tufte (1983) provides a wealth of information on various aspects of graphical data display, including graphs and maps. According to Tufte, graphical displays should:

• Show the data
• Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else
• Avoid distorting what the data has to say
• Present many numbers in a small space
• Make large data sets coherent
• Encourage the eye to compare different pieces of data
• Reveal the data at several levels of detail, from a broad overview to fine structure
• Serve a reasonably clear purpose: description, exploration, tabulation, or decoration
• Be closely integrated with the statistical and verbal description of a data set

In addition, Tufte provides the following six principles of graphical integrity:

• The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities expressed.
• Clear, detailed, and thorough labeling should be used to defeat graphical distortions and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.
• Show data variation, not design variation.
• In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units.
• The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.
• Graphics must not quote data out of context.
Following these two sets of guidelines will greatly increase your chance of creating good graphical displays. Additional general information on graphs can be found in Milne (1992), and information specific to environmental graphing in Sara (1994, pp. 11-19 to 11-28).
GENERAL CONCEPTS

Because graphing software is so accessible and easy to use, there is a tendency to throw together a graph of a bunch of data and be done with it. If you try to follow Tufte's guidelines above, then clearly there is more to it than that, from making sure the data is amenable to the graphing technique you will be using to confirming at the end that the graph communicates the correct message. If you keep in mind the key concepts of creating a graph, rather than take them for granted, your graphs will be much more effective.

Generally graphs present data with one data element graphed as a function of another. Commonly the independent variable, which is often presented against the X (horizontal) axis, is time, and the dependent variable, presented against the Y (vertical) axis, is the measured value. It is also possible to plot one observed value against another. Sometimes the X-axis is called the abscissa and the Y-axis is called the ordinate.
Data issues

Back in the day when graphs were created by hand, the person creating the graph was forced to look at each data point, because he or she scaled it off and drew it on the graph. With automated programs like Microsoft Excel and Golden Software's Grapher, it is easy to create a graph without giving it much thought. This can result in a graph that looks great, but, in the worst case, is totally meaningless. For example, if you take a data set like the one graphed in Figure 106, and set the scale to logarithmic as discussed below, Grapher will complain if some of the data has a zero value and can't be graphed, but Excel won't. Those values may be important, and won't be displayed in either case, but with Excel you might not even know they are gone.

There are a number of other data issues that can trip you up in creating graphs. Chapter 19 discussed the importance of checking units during data retrieval. Use of non-detects and flagged data must be done carefully. Duplicate data can also be a problem. A good policy is to take a hard look at the data after it has been retrieved from the EDMS, but before it is graphed. Look at every number, or if there is too much data to do that, sort in various ways to understand the data ranges, relationships between different values, and so on. Time spent doing this will be rewarded by better graphs, ones that you are more likely to be able to trust.
Coordinate systems

Graphing involves taking values and plotting them relative to some coordinate system. For most graphs this is a Cartesian XY system, but other systems, such as polar and radial plots, are possible. Think about which system will work best with your data and the message you are trying to get across, rather than just using the default provided by the software.
Graph scales

The scales of the graph determine the spacing of the points relative to each axis. In the simple case of an X-Y graph of two constituents against each other, the value range for each constituent will be used as the scale for each axis. In the case of a time-sequence graph, one of the axes (usually the horizontal one) is the time or date range, and the other is the value or values.
Whenever presenting a forecast, give a number and a date, but never both. Rich (1996)
[Figure 106 consists of two "Parameter Comparison" scatter plots of Ra 226 against U Tot for the same data set: one with linear axes (0 to 1200) and one with logarithmic axes (0.1 to 1000).]
Figure 106 - Comparison of linear vs. logarithmic scales
For the case where the data has a large dynamic range, or where the data is lognormally distributed, a logarithmic scale on one or both axes may be appropriate. A graph with a logarithmic scale on one axis and a linear scale on the other is called a semi-log plot, and one with both axes logarithmic is called a log-log plot. The graph on the right side of Figure 106 shows a log-log plot. The goal is to see the relationship between the two constituents in each sample. The left graph shows the data graphed on a linear scale. Most of the data is clustered in the lower left, and it is difficult to say what the relationship is. The right graph shows a logarithmic scale for both constituents, and it is possible to see that there is a rough correlation between the two, and a sample with a high value in one is likely to have a high value in the other. In fact, it appears that there may be several populations with different linear relationships between the constituents, perhaps representing different sources of the material. This was not at all apparent from the linear graph.
Labels and annotations

There are two basic types of labels and annotations, those associated directly with graph elements, and those not. Examples of the first type are the scale labels and scale titles. Scale labels identify positions along a scale axis. Usually there will be one set of labels per axis, such as the numbers annotating the tic marks and the text label for the axis. Labels not associated with graph elements include the graph title, legends, comments, and so on.
TYPES OF GRAPHS

Because graphics are so useful, people have developed many different types of graphs to best represent their data. This section describes some of the most popular types of graphs, and the following one shows some examples.

Line graphs – Line graphs are often used to represent data in a series. A grid is drawn, and then one or more series of data are drawn on the grid. Lines are used to connect the points to highlight trends and patterns. Often the horizontal axis (abscissa) is time, and the vertical axis (ordinate) is the value being compared, but this is not required. Line graphs are probably the most common type of technical graph.

Bar graphs – Bar graphs, also called column graphs, are good for displaying increases and decreases in quantity over a period of time. They work best when the amount of data to be displayed is not large. As with line graphs, the horizontal axis is often time.

Area graphs – Area graphs are similar to line graphs, except the areas under the curve(s) are filled.

Stacked graphs – A stacked graph is a variety of bar or (more commonly) area graph where the values are stacked cumulatively rather than each starting at zero.

Scatter plots – A scatter plot is used for displaying two variables for each point against each other. Scatter plots are very popular for technical data.

Box plots – Box plots are special bar graphs that show the minimum, maximum, mean, and lower and upper quartiles for each data group.

Picture graphs – In picture graphs, the data is displayed with symbols rather than lines or bars. These are sometimes used for business presentations, but are not commonly used for displays of technical data.

Pie charts – A pie chart is a type of graph used to display the fractional parts of a whole like slices of a pie, where the size, or more accurately the angular displacement, of each slice is based on the percentage of the whole contributed by each value.

Surface plots – Surface plots are used to show one variable as a function of two others. They are similar to contour displays used on maps, but the two independent variables can be something other than map coordinates.

Rose diagram – A rose diagram is a circular graph of angular data. Angular measurements, such as joint or cross-bed directions, are grouped by an angle range, such as 10° or 30°, and the number of observations in each range is shown as a distance from the center. Before designing a rose diagram, you should examine the variability in the data and set the increments (angle range) to be graphed appropriately. If the increment is too small for the data, then only "noise" is displayed. If too coarse, the real variability is lost. An alternative way of drawing the rose diagram is to start at the outer edge and increase the values toward the center. This often helps to define trends in multi-modal data sets better than the more conventional approach (Mike Wiley, pers. comm., 2002).

Polar plot – A polar plot is also a circular graph of angular data. Values as a function of angle are shown as distances from the center, creating a line graph within a circle.

Maps – It's important to remember that maps are a type of graph. Because maps have so many special issues to discuss, they will be covered separately in Chapter 22. There are also many opportunities for combining maps with traditional graphs to create visually rich and informative displays.
GRAPH EXAMPLES

The following examples show graphs created by several different programs. Figure 107 shows a number of graphs created with Microsoft Excel. Figure 108 shows some more technical graph types created with Grapher from Golden Software, and Figure 109 shows additional examples. These examples were created with programs outside the EDMS. Figure 110 shows a fairly typical graph of one parameter (sulfate) from two wells plotted as a function of time within an EDMS program. Figure 111 shows a variation on the time sequence graph where data from several years is folded onto one 12-month graph. This was done to help identify seasonality in the data.
[Figure 107 consists of eight panels graphing sodium and sulfate data from 1981 through 1994: a line (time sequence) graph, an X-Y scatter plot, a scatter plot with lines showing change with time, a bar graph, a 3-D bar graph, a 3-D bar graph with too much data, a 3-D area graph, and a stacked area graph.]
Figure 107 - Examples of several graph types created with Microsoft Excel
[Figure 108 consists of four panels: a rose diagram, a polar plot, a box plot of values by station, and a trilinear (Quartz-Feldspar-Lithics) plot.]
Figure 108 - Examples of several graph types created with Grapher from Golden Software
[Figure 109 consists of two panels: a 3-D surface plot created with Surfer, labeled with cation concentrations (K, Na, Al, Fe, Mn, Mg), pH, and total dissolved solids, and a pie chart created with Excel.]
Figure 109 - Additional graph examples
Figure 110 - Formatted graph of selected parameter
Figure 111 - A graph of a constituent (blood lead) by month created by an EDMS
Sometimes it is useful to view graph data in its spatial context. Figure 112 shows an example of this type of display. A map with an airphoto backdrop is displayed, along with symbols for the well locations. Time-sequence graphs are shown for five of the monitoring wells, with leader lines to the wells from which the samples were taken. This type of display shows the time sequence data, along with the spatial context of the wells, so inferences can be made about the progression of values over time for different parts of the facility. Graphing in spatial context is of greatest value where the variation is expected to relate to geographic position. For example, in addition to the water quality parameters shown, parameters such as water level elevation and temperature often benefit from being displayed in this manner. Figure 113 shows an enlarged view of part of Figure 112. The graphs show the value of the constituents of interest, along with the vertical scales of the graph. It also shows horizontal lines for the mean value for bicarbonate, along with lines located three standard deviations above and below the mean, and a line for the regulatory limit. Points that are outside the limit lines are displayed in a different color. These points deserve additional scrutiny to determine if they are erroneous or real.
Figure 112 - Graphs displayed with leader lines to their map locations
Figure 113 - Enlarged graph showing control chart limits and outliers
CURVE FITTING

Often a graph, especially a time-sequence graph, will expose a trend in the data. Many graphing programs provide a way to fit a curve to the data to help understand the trend by smoothing out irregularities and variations in the data.
[Figure 114 consists of two "Concentration Over Time" panels graphing the same concentration values (0 to 40) by month (January through June), one fitted with a 2nd order polynomial trend line and one with a 3rd order polynomial trend line.]
Figure 114 - Graphs showing trend lines
Curve fitting must be used with caution, however. Figure 114 shows an example of two graphs of the same data set, with trend lines suggesting two very different conclusions. The data set consists of six monthly observations: 10, 25, 30, 20, 25, and 25. The question is whether the data is trending up or down. Fitting a second-order polynomial suggests that the data is trending down. Changing to a third-order polynomial suggests an upward trend. Which is correct? A scarier question is: Which will you use to prove your point? Because graphing software makes it so easy to use high-order polynomial fitting, it is tempting to use high orders to improve the fit. However, a third-order polynomial is the lowest order that can produce both concave and convex curves on the same plot. This may be the highest order appropriate for many data sets.
GRAPH THEORY

Graph theory is a topic that might be confused with the theory of creating graphs, but actually covers a different subject. It is discussed here to make the point that graph theory and the theoretical basis for graphing data or functions are different issues. Some of the theoretical issues related to creating graphs, such as data issues, scales, etc., are discussed above. The basic material of graph theory is spatial connectivity or topology. In graph theory, "graph" is used to denote a set of vertices possibly connected by edges, as opposed to graphing data or values. Graphs in graph theory consist of points connected by lines (vertices connected by edges), and then various kinds of studies are performed on these graphs. Unlike geometry, topology ignores spatial issues, and addresses only issues that don't change when objects are deformed. An example of the type of problem studied by graph theory is the Four-Color Problem (Figure 115), the theorem that any map can be colored using four colors in such a way that adjacent regions (those sharing a common boundary segment, not just a point) receive different colors. Graph theory may have application for environmental projects by analyzing the relationships between different areas of interest, or similar area-based studies.
Figure 115 - Example of the Four-Color Problem in graph theory
CHAPTER 21 CROSS SECTIONS, FENCE DIAGRAMS, AND 3-D DISPLAYS
Environmental data, and geologic data in general, is inherently three dimensional. A number of graphical tools have been developed to assist with visualizing the 3-D configuration and relationships contained in the data. These range from logs through cross sections and fence diagrams to block diagrams.
LITHOLOGIC AND WIRELINE LOGS

Rock or soil samples and geophysical measurements from boreholes and from outcrops make up the basic data for many geologic projects. Displaying this data as a function of depth is the first step in interpretation. Before the advent of personal computers, lithologic logs of samples were prepared by manual drafting onto strip-log paper. Wireline geophysical logs were drawn with analog recorders on special chart paper. Digital wireline logs arrived long before personal computers. They were recorded on tape and plotted with pen plotters attached to mainframe or minicomputers. Now both lithologic and wireline logs, including combinations of both in one display, can be easily created using a computer program on a personal computer.

For drill cuttings and outcrop samples, the plot usually consists of patterns for lithology types along with a text description of the rock, both plotted against depth on the vertical axis. Curves for other factors, either measured or interpreted, may also be included. Measured factors that can be plotted might include grain size, porosity, or oil saturation, while interpretive factors might include depositional energy or diagenetic alteration. Figure 116 shows an example of a typical lithologic log for an environmental project.

Geophysical measurements from boreholes (or less commonly from outcrops) are widely used for determining rock properties, and are also very valuable for stratigraphic correlation. Displays of two or more geophysical curves, such as spontaneous potential (SP) along with resistivity, or gamma ray plotted with neutron density or sonic travel time, are widely used for stratigraphic and structural interpretation of subsurface rocks. Figure 117 shows an example of a small portable geophysical logging device.
Figure 116 - Lithologic log for an environmental project (Courtesy of RockWare)
Figure 117 - Gamma ray logger system (Courtesy of Geotech Environmental Equipment)
Figure 118 - Cross section created from relational data
CROSS SECTIONS

Several lithologic or geophysical logs can be displayed side by side to form a cross section. The use of cross sections on environmental projects is discussed in Sara (1994, pp. 7-17 to 7-21). The manual approach is to tape several logs onto a big sheet of graph paper (cross section paper), with the vertical position based on elevation (structural cross section) or on a stratigraphic horizon (stratigraphic cross section). This type of display is used to interpret the spatial position of rock units or the lateral variation in lithologies. This is particularly useful to assist with correlation of lithologic and stratigraphic units. Contamination values can be added to increase the information content of the cross sections. The vertical scale can be changed relative to the horizontal scale to adjust the vertical exaggeration for cross sections, block diagrams, and other displays. This is important because many geological features are tabular in shape, and vertical exaggeration is necessary to be able to see the features.

Computers can be used to create cross section displays once the basic data on lithology, chemistry, or log values has been entered. The user specifies which logs are to be used, how each log is to be displayed, and other information such as how the cross section is to be hung (structural or stratigraphic datum) and how the logs are to be labeled. Most cross section programs allow correlation lines to be drawn from log to log to display stratigraphic and structural relationships, and some allow the user to interactively pick formation tops from the logs for entry into a database.

Figure 118 shows a cross section display of the concentration of uranium and radium in soil. It was generated to show the part of the site that will need to be excavated. It includes a combination of laboratory data from soil samples along with downhole data from gamma logs. Uranium values are to the left of each log, and radium values to the right. The shaded rectangles represent soil samples, and the continuous lines show the downhole gamma surveys. Both the boxes and the lines are truncated at the excavation cutoff. The elevations of geologic units, as well as the ground surface and water table, have been added to aid in interpretation. A series of parallel cross sections of this type can be used to calculate the volume to be excavated.
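As a worked illustration of vertical exaggeration (the numbers here are illustrative, not taken from the figures): a section covering 1,000 m horizontally and drawn 20 cm wide has a horizontal scale of 1:5,000; if a 10 m thick unit is drawn 1 cm tall, the vertical scale is 1:1,000, and the vertical exaggeration is 5,000 / 1,000 = 5 times.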
Figure 119 - Electric log cross section
Figure 120 - Map and cross section views of model results
Figure 119 shows a cross section of electric log data. This type of display is very useful for performing subsurface correlations. Figure 120 shows two graphic displays from the same project, one a map and one a cross section. The values from the borings were used to create a 3-D geostatistical block model, which was then displayed along with the map and cross section. Cross sections can also be used to demonstrate changes in water chemistry or contaminant distribution over time. If the same wells were resampled over time, then cross sections of each sampling period will show the changes. This adds a fourth dimension to the information obtained.
PROFILES

A type of display that is similar to a cross section is a profile, which is like a slice through a surface. The surface is usually a grid created by a contouring program. The profile represents the values of that surface along the line of the profile. Sometimes profiles and log cross sections are combined to show what the surface does between control points.
Figure 121 - Fence diagram (Courtesy of RockWare, Inc.)
FENCE DIAGRAMS AND STICK DISPLAYS

Extending from a one-dimensional lithologic or geophysical log or a two-dimensional cross section to a three-dimensional display is very difficult with hand drafting techniques (see Tearpock and Bischke, 1991, pp. 182-194), but can be easily done with a computer. The user first specifies which wells are to be used, what data elements are to be displayed, and how they are to be shown. The software then uses the X-Y coordinates of the well locations to project them onto a three-dimensional perspective view. The logs can be shown with no connections between them (stick diagram) or the formations can be connected from well to well (fence diagram). This type of display can show three-dimensional relationships that are difficult to discern using other methods.

Figure 121 shows an example of a fence diagram. In this example, the geology from the borings, which are shown as curved lines, has been interpolated across the site, and then profiles drawn at regularly spaced intervals. Unless the wells are regularly spaced, which they usually aren't, either the lines of the fence diagram must be crooked, or the lines drawn straight and the data interpolated at the intersections. This requires considerable confidence in your understanding of the data and the spatial relationships at the site.
Figure 122 - Block diagram created automatically from relational data
BLOCK DIAGRAMS AND 3-D DISPLAYS

Another type of three-dimensional display is the block diagram. Block diagrams can be made from two- or three-dimensional grid models of a particular volume of rock. Some block diagram software allows certain stratigraphic or lithologic units to be made transparent so the user can see into the block diagram. Generating block diagrams for large grid models is computationally intensive and requires a powerful computer to produce results in a reasonable amount of time. Fortunately, current high-end personal computers have the power to do this for all but the largest projects.

Figure 122 shows an example of a block diagram created from data extracted from an EDMS. In Figure 123 the low concentration material has been removed (made transparent) to show only the higher concentration material. Also, logs for the boreholes have been added. Figures 124 and 125 are more complicated figures with 3-D surface features and a contaminant plume. Figure 125 adds the depth to bedrock.

Although block diagrams such as those shown in Figures 122 to 125 are quite computationally intensive and can take several minutes or more to create, the benefits can far outweigh the inconvenience. Block diagrams with this kind of detail are very difficult to produce manually. Accurate rendering of 3-D objects that are faithful to the data, as shown in these figures, is virtually impossible without using a computer. Because many people, especially non-technical people, find it difficult or impossible to visualize objects in three dimensions, block and 3-D diagrams can be a powerful tool in illustrating and proving your case. Providing an understanding of spatial relationships for regulators, attorneys, and environmental activists can be greatly aided by these displays.
Figure 123 - Deviated boreholes and plume display (Courtesy of RockWare, Inc.)
Figure 124 - 3-D facility display created with a mapping program (Courtesy of RockWare, Inc.)
Figure 125 - 3-D display of contamination under a refinery created with a GIS (Courtesy of Dan Heidenreich, HSI Geotrans)
CHAPTER 22 MAPPING AND GIS
Most environmental data is inherently spatial. That means that the observation was taken at a specific location in map coordinates (X and Y) and depth (or elevation). Often seeing the data in its spatial context imparts more information content than seeing it as a text-only presentation. Using computerized mapping to help understand this spatial context makes sense for many projects. This chapter covers issues related to computerized mapping, including software for creating maps, displaying your data, contouring and modeling, and specialized map displays.
MAPPING CONCEPTS

Since earth scientists often spend a large amount of their time working with maps, it is logical to consider computerization of the map generation and manipulation process. Computerized mapping covers a wide variety of activities, and programs are available to help with most of them. There are some advantages and disadvantages to consider, however.
Advantages and disadvantages of computerized mapping

Before making a commitment to computerized mapping, a thorough appraisal should be made of the timesaving and other benefits that will be provided. The problem is similar to the one encountered with computer-aided design (CAD) software. Making the first map with the computer will take as much or more time than doing it by hand. This is especially true if the learning curve for the mapping software is taken into consideration. The time savings will come later, when the map needs to be redone or changes need to be made. Then the computer eliminates re-drafting, which can improve accuracy as well as speed in generating the second map.

Another advantage of the computerized mapping process is that it allows the earth scientist to make maps and diagrams that either could not or would not have been made by hand. Good examples are trend surface and residual maps, other derived maps, block diagrams, and maps of data that may have previously been considered unimportant. Other examples of maps more likely to be made are multiple maps of different time periods. Having the computer generate the maps makes it more likely that these maps will be made. Some of these experimental maps will not be useful and will be thrown away. Others may provide surprising insight into the data and the geology behind it, and could prove tremendously valuable. Computer-generated maps are, for the most part, unbiased, which can be of value in many situations. Finally, computerized mapping can greatly improve the ease and accuracy of volumetric calculations.
Often the decision on whether computerized mapping is appropriate for a project depends on the number of maps to be made and the amount of data to be mapped. For small projects, hand mapping is often better. For large projects with thousands (or millions) of data points, the computer may be the only way to do it.
Types of maps

Since maps are so widely used, there are hundreds of different kinds of maps. A few of those types of maps will be discussed here. In some cases, the final map display is a combination of several map types.

Base maps – The most fundamental type of map is the base map. Whether derived from a topographic map or commercial or proprietary data, a base map usually must be constructed before any other type of map can be made. Geographic, cultural, and sample location data must all be collected and related together in the right spatial positions. The importance of this step must not be underestimated, and this subject is discussed in more detail below.

Posted data – The next step after creating a base map is often to post data, creating a posted data map. In many cases, one of the primary goals of organizing a database is to create this type of map. Retrieving data and posting it on maps is described in a later section.

Bubble maps – A bubble map, also called a dot map or pin map, expands on data posting by using the symbol on the map to represent the value being displayed. The symbol's size, color, or shape can reflect the value being featured. Figures 129 and 130 later in the chapter show examples of this type of map.

Thematic maps – Thematic maps use the display of various map elements to communicate data, usually numeric values. For example, each county on a regional map could be color coded to represent the value of some variable, such as economic or environmental parameters. Sometimes this is done in perspective view with the polygon of each county extruded to a height that represents the value.

Contour maps – Contour maps use contour lines, color fill, and other graphical displays to communicate numeric information, usually of a continuous or nearly continuous surface (such as one broken by faults). The many issues related to creating and displaying contour maps are discussed below.

Surface geology – It is often useful to make geologic maps of surface or subsurface geology and/or other features. As the use of computers for image processing and analysis increases, more surface geology projects are being done on the computer using airphotos and satellite photos. Software exists that allows the user to move interpreted information from images onto line drawings (maps) for output to plotters and for integration with other types of maps.

Airphotos and satellite images – Aerial imagery, whether taken from an airplane or satellite, can be of great value in environmental mapping. Often airphotos are available for various time periods, which can assist with documenting the history of the site. In order to be used for mapping, images must be ortho-rectified to remove any spatial distortion caused by the imaging process, and to allow them to be registered to a particular map geometry. Once this has been done, both types of images can be used for map backdrops to illustrate a variety of points about site data.
Base map creation If the EDMS will include a map component, then base map information must be loaded into either the database or the geographic information system (GIS) before a map can be displayed. Then the analytical and other information can be overlaid on the base map. Loading the base map data involves two steps. The first is to create a base map. Often this is done using a computer-aided drafting program such as AutoCAD or the digitizing capabilities of the GIS. This base map should provide sufficient locational information as a reference for the data being displayed while keeping
the file size small for quick display. The second step is to import this map file into the EDMS relational data model or the GIS. A drafter will need to create the base map for each site, and then a data administrator or GIS operator will need to import the map into the system. Cultural and drainage information can be captured either within a CAD program or with a mapping program, and then displayed along with chemical, geological, and other data. A number of public agencies and private companies have developed base map files that can be purchased or downloaded for free from the Web. Base maps with well or sample locations can be generated either by a general-purpose CAD program or by a dedicated mapping program. Using a CAD program, locations can be digitized or posted from a database (with some tinkering). Most mapping programs operate from a database, and mapping coordinates can be digitized or calculated from legal descriptions or surveyed locations. Map creation should be done in parallel with data loading. When a site is selected for data import, a site map or drawing should be located or created. This will involve effort from data management staff to locate appropriate drawings and to import them into the data management system. If personnel outside of the environmental organization maintain the drawings, then their assistance will be required as well. A level of effort of two hours or less for a small site to a day or more for a large site should be anticipated to obtain a suitable base map. An important issue in creating a base map is the accuracy of the coordinates. The map accuracy refers to the spatial accuracy of locations on the map. This usually depends on the scale at which the map data elements were captured. Locations on a map digitized at 1:200 will be more accurate than those digitized at 1:2,000,000.
Coordinate systems
One area of site data management that provides an ongoing challenge is to describe the locations of the stations in the database. In many older environmental data sets, station coordinates did not receive much attention because the primary goal was to generate text reports. With the current trend to get more information from the data in the database, coordinates are becoming critical, and are required if the data is to be used for mapping. The three most difficult issues regarding site coordinates are finding locations at all, site vs. world coordinates, and coordinate projections. The first applies mostly to stations (wells, borings, etc.), while the second and third apply both to stations and base map information.
FINDING COORDINATES
In order to draw a map using information obtained at stations, you need to have coordinates for the stations. The primary data source for this information is survey information, which should have been gathered when the station was installed, or at a later date. If this data is not available, the next best information usually comes from engineering drawings showing the location. Sometimes it is possible to export this information from a CAD program, or to point to the location in the software and write down the coordinates. If all you have are hard copy drawings, you can measure coordinates, or, if you have a large number of stations, use a digitizer and digitizing software. You can do the same thing with airphotos, if the station can be seen. If no coordinates are available, but the stations can still be located on the ground, you can survey them, either with traditional survey techniques, or more commonly nowadays with global positioning system (GPS) equipment. If you go the GPS route, a big factor in determining the cost is the accuracy required for the final result. If you can live with uncertainty of a few meters in location, an inexpensive GPS receiver will do the job. If your project requires sub-meter accuracy, you will need more expensive equipment using differential GPS technology.
One issue that often arises is what to do with locations that are questionable. You may have locations that are estimated, or were determined with a low level of precision. Assuming that these are the best locations available, you will need to determine whether it is better to use them or to
ignore them. This can be a difficult decision. It should be based on the types of decisions that will be made using the information from those stations. For general background information, or as part of building a historical picture, less accurate locations may be acceptable, but for more critical decisions you may need to do more, perhaps by installing a new station. A field should be provided in the EDMS for the precision and/or source of the station coordinates.
SITE VS. WORLD COORDINATES
Station locations can be described in any of three coordinate systems: latitude-longitude, world Cartesian, and site Cartesian. Most available commercial and government data, e.g., land grid, culture, imagery, and geophysical data, are referenced to latitude-longitude, world Cartesian coordinates, or both. Station locations for a facility are usually described in Cartesian coordinates, either world or site, although latitude-longitude is sometimes used. The choice of whether the primary coordinates used for a project are Cartesian or latitude-longitude is usually a function of the scale of the project. A leaking underground tank at a gas station is best handled in Cartesian coordinates, whereas the contamination at a large facility such as the Idaho National Engineering and Environmental Laboratory, which covers 890 square miles, would probably be better described in latitude-longitude. Any project with an extent of more than about 10 miles should be surveyed in latitude-longitude to avoid problems due to the curvature of the earth.
Latitude-longitude – The earth is roughly spherical, and the latitude-longitude system is used to describe locations in spherical coordinates. Latitude is measured in degrees (1/360th of the sphere) north or south of the equator. Longitude is measured in degrees east (positive) or west (negative) of the prime meridian, which passes through Greenwich, England. Note that longitudes in North America are usually described as west longitude, and must be represented as negative numbers in systems that use signed longitude.
World Cartesian – The representation of locations on the roughly spherical earth on a flat map is referred to as coordinate projection. A system in which coordinates have been projected onto a flat plane is called a Cartesian coordinate system (named after the French mathematician René Descartes). Some of the most popular coordinate projection systems are described below.
Site Cartesian – Some site location surveys are not tied to any “real world” coordinate system. Many small environmental monitoring and remediation sites have been surveyed on an orthogonal grid, oriented to true north, but using an arbitrary grid origin. In other words, they are not referenced to the real world. For example, one site that the author worked on had the origin (0,0) where the railroad crossed the creek. Stories abound of surveys where the origin is “where the tree used to be.” This is fine as long as no other data, surveyed in a different system, is added to the database. It is preferable to determine offsets to the site coordinates and place the site on the real-world system for the area; then the data can be related to other spatial data. Often the site survey systems were based on topographic maps of the area, so they are already in the projection of that topographic map, and only the offsets are required to convert to world units.
Most GIS programs can store both Cartesian and latitude-longitude coordinates and can convert from one to the other “on the fly.” Map limits, station locations, roads, and all other data that can be expressed on a map can have their locations stored in either or both coordinate systems. Some GIS programs, like Enviro Spase from Geotech Computer Systems, Inc., can handle all three coordinate systems, site Cartesian, world Cartesian, and latitude-longitude, in one database.
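Where a site grid differs from the world system only by an origin offset and a rotation, the conversion is a simple transformation. The following is a minimal sketch in Python; the origin coordinates and rotation angle are hypothetical values that would come from surveying a site monument in both systems.

    import math

    def site_to_world(x_site, y_site, x0_world, y0_world, rotation_deg=0.0):
        """Convert site-grid coordinates to world (projected) coordinates.

        x0_world, y0_world: world coordinates of the site grid origin (0,0).
        rotation_deg: counterclockwise angle from the world east axis to the
        site grid's east axis (0 if the site grid is oriented to true north).
        """
        theta = math.radians(rotation_deg)
        x_world = x0_world + x_site * math.cos(theta) - y_site * math.sin(theta)
        y_world = y0_world + x_site * math.sin(theta) + y_site * math.cos(theta)
        return x_world, y_world

    # Hypothetical example: site origin surveyed at state plane (1250430.0, 498210.0)
    print(site_to_world(150.0, 220.0, 1250430.0, 498210.0))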
COORDINATE PROJECTION SYSTEMS
An integral part of creating maps is the ability to handle transformations between Cartesian coordinate systems and latitude-longitude. This capability is present in a number of mapping programs and most GIS programs. This section will provide a brief description of the problem, some popular coordinate systems, limitations imposed on the use of projections, and other projection-related issues.
Figure 126 - Projection of the earth onto the three major surfaces: regular cylindrical, transverse cylindrical, and oblique cylindrical; regular conic; and polar and oblique azimuthal (plane) (from Snyder, 1987)
Description of the coordinate systems problem
The projection process involves inherent error because of the distortion required to stretch or shrink the surface during projection. Anyone who has tried to gift-wrap a basketball is familiar with this distortion, which is illustrated by the wrinkles in the paper. A variety of projection systems have been developed to minimize this error for maps of different scales and orientations. This concept is illustrated in Figure 126. The following are some of the more popular projections: Flat, Mercator, Transverse Mercator, Universal Transverse Mercator (UTM), Oblique Mercator, Lambert Conformal Conic, Albers Equal Area, American Polyconic, Miller Cylindrical, and Gnomonic. Although there are about two dozen other major projections and hundreds of minor ones, this group provides a good level of functionality for most types of maps.
It should also be mentioned that in the United States a group of projections customized for each state is referred to as the state plane coordinate systems. Each state, or each state plane zone for larger states, actually uses one of the standard projection systems with the constants selected to accurately portray that state or part of the state with positive coordinates. With one exception, all state plane coordinate systems are based on either Transverse Mercator or Lambert projections. The exception is the Alaska panhandle, which uses an Oblique Mercator system called Hotine Oblique.
Another important factor in projection calculations is the reference ellipsoid, which is the shape that the earth is assumed to have for calculating the projection. This is a sphere that is flattened along the polar axis into an oblate spheroid. The spheroid can be described by its eccentricity (or commonly the eccentricity squared) and the length of the semimajor (equatorial) axis. Some spheroids have been given names such as Clarke 1866. The ellipsoidal shape is only an approximation, and when the actual deviations are taken into consideration, the shape is called the geoid. A reference ellipsoid fitted to a local area is referred to as a datum, and locations for the same point referred to different datums will not coincide. In the United States, two common datums are the North American Datum of 1927 (often called NAD 27) and a revised version, the North American Datum of 1983 (NAD 83). Locations relative to these two datums can vary by as much as 100 meters, with the greatest divergence occurring in parts of California. The discrepancy between locations in these two systems is a common problem, which can usually be resolved by consulting the topographic maps used in determining the locations. The transformation between the two systems is mathematically complex but well understood, and program code to perform the transformation is available from government agencies and private software companies. The NAD27-NAD83 problem is essentially a local problem limited to North America. Other parts of the earth have various other datums that have been used at different times. Coordinates referenced to one datum must not be mixed with those referenced to another.
Descriptions of coordinate projections
A brief description of some of the popular projections is included here. This material is based mostly on Snyder (1987). Parameters required for each are also listed.
Flat – Latitude and longitude are treated as if they were Cartesian. This is not a very accurate system, but it can be useful in some special cases, especially near the equator, where the distortion is at a minimum.
Mercator – Used for equatorial regions. Particularly useful for navigation because a course direction is a straight line. The projection is mathematically based on a cylinder tangent to the equator. Meridian spacing is equal, and the parallel spacing increases away from the equator. Meridians and parallels are straight and parallel. Scale is true along the equator only. Areal enlargement increases with distance from the equator and becomes extreme at latitudes more than 40° north or south.
Transverse Mercator – Used where the north-south dimension is greater than the east-west dimension. The projection is mathematically based on a cylinder tangent to a meridian. Shape is true only within a small area. Areal enlargement increases away from the tangent meridian. A reasonably accurate projection within a 15° band along the line of tangency. Sheets cannot be edge-joined in an east-west direction if each sheet has its own central meridian. Used as the base for USGS 1:250,000 and some 7½- and 15-minute topo maps, and for state plane coordinate systems in which the state or zone is predominantly of north-south extent.
Universal Transverse Mercator (UTM) – An ellipsoidal Transverse Mercator to which specific parameters, such as central meridians, have been applied. The earth between latitudes 84° north and 80° south is divided into 60 zones, each 6° wide in longitude. Bounding meridians are evenly divisible by 6°, and zones are numbered from 1 to 60 proceeding east from the 180th meridian from Greenwich, with minor exceptions. There are also letter designations from south to north. Cartesian coordinates in UTM systems are almost always given in meters. Beyond latitudes 84° north and 80° south the Universal Polar Stereographic projection is used instead. The reference
ellipsoid varies for specific parts of the earth, and the Clarke 1866 ellipsoid is used in the United States. Used for areas with mostly north-south extent, and UTM tick marks are widely used on USGS topo maps and many other maps.
Oblique Mercator – Based on a cylinder tangent along any great circle other than the equator or a meridian. Shape is true only within a small area. Linear scale is true along the line of tangency or along two lines equidistant from the line of tangency. Areal enlargement increases away from the line of tangency. A reasonably accurate projection within a 15° band along the line of tangency. Used for plotting linear configurations that are situated along a line oblique to the earth's equator. Used for the Alaska panhandle and for some displays of satellite imagery.
Lambert Conformal Conic – Mathematically based on a cone that is tangent at one parallel or, more commonly, secant on two parallels. Areal distortion is minimal but increases away from the standard parallels. Great circle lines are approximately straight. Retains its properties at various scales. Sheets can be joined along their edges. Used for large countries in mid-latitudes and state plane coordinate systems having an east-west orientation.
Albers Equal Area – Based on a cone that is secant on two parallels. No areal deformation. The North or South Pole is represented by an arc. Linear scale is true on the standard parallels. Retains its properties at various scales. Individual sheets can be joined along their edges. Used for thematic maps, mostly for large countries with an east-west orientation.
American Polyconic – Based on an infinite number of cones tangent to an infinite number of parallels. Distortion increases away from the central meridian. Has both areal and angular deformation. Linear scale is true along each parallel and along the central meridian. Maximum scale error is 7% on a map of the 48 states. Used for areas with a north-south orientation. Formerly used for 7½- and 15-minute USGS topo maps. Only along the central meridian does it portray true shape, area, distance, and direction. Individual sheets can be edge-joined if they are drawn with straight meridians, but they cannot be mosaicked beyond a few sheets.
Miller Cylindrical – Based on a simple mathematical modification of the Mercator projection. Neither equal area nor conformal. Meridians and parallels are straight lines intersecting at right angles. Meridians are equidistant, and parallels are spaced farther apart away from the equator. Used for world maps.
Gnomonic – A geometric projection of the earth onto a plane. The point of projection is the center of the earth. Useful for polar regions. The meridians are straight lines radiating from the point of tangency for polar projections, and the parallels are concentric circles. Linear scale and angular distortion are extreme.
Limitations on the use of projections
Translating between latitude-longitude and one projection system is relatively easy. Code exists in a number of programs to do this. The problem becomes more difficult when more than one projection system is involved in a single data set. Here is an example of how this might arise. The user needs to work with data in an area in Wyoming and Colorado (Figure 127). The area covers part of the Wyoming Central and Wyoming East state plane zones, as well as part of the Colorado North zone. If the data were retrieved and the Cartesian coordinates treated as a homogeneous data set, the resulting map would be meaningless because the points from the different zones would be in the wrong relative positions. It is therefore necessary that the user select a common coordinate system for viewing this area. The user might select a Transverse Mercator projection with the meridian shown. That projection could then be applied to each subset of the data to generate a homogeneous coordinate system. UTM would also work well in this situation, as long as those working on the project are comfortable working in meters rather than in feet.
Figure 127 - Example of mixed coordinate systems
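As an illustration of reprojecting mixed-zone data, the following sketch uses the third-party pyproj library (an assumption; the text does not prescribe software) to bring points from two different state plane zones into one common system. The EPSG codes and coordinate values are illustrative assumptions and should be verified for the actual zones and datum before use.

    from pyproj import Transformer

    # Common target system; NAD83 / UTM zone 13N covers the example area
    COMMON_CRS = "EPSG:26913"

    def to_common(points, source_crs):
        """Reproject (easting, northing) pairs from source_crs to the common CRS."""
        t = Transformer.from_crs(source_crs, COMMON_CRS, always_xy=True)
        return [t.transform(x, y) for x, y in points]

    # EPSG codes below are assumed for illustration; confirm before use
    wyoming_pts = to_common([(800000.0, 200000.0)], "EPSG:32155")   # Wyoming East (assumed)
    colorado_pts = to_common([(950000.0, 550000.0)], "EPSG:26953")  # Colorado North (assumed)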
Other map elements
There are a number of additional issues that must be addressed in creating maps. These include map scale and reference information.
Map scale – The map scale is the relationship between distances on the map and the corresponding distances in the real world. For a printed map, the scale is well defined, and can be described as a verbal scale, like 1" = 200', or a ratio scale, like 1:2400 (which is the same scale). The scale on a computer screen is harder to deal with. For a map displayed with the same display parameters, the scale would be different on a 15" monitor, on a 21" monitor, or when using a digital projector and filling a wall. This is less of an issue than it sounds, though, because users can zoom at will to see the features that they want, and the software may provide a tool for measuring distances. A better way to display scale information on-screen is to use a scale bar, which scales with the map and remains correct at any zoom or screen magnification. Printed maps should always include a scale bar because of the possibility of reproducing the map at a different scale. A scale bar is shown in Figure 128.
Reference information – Other information can make the map more meaningful. A title block can display a map title, date, and other information about the map. A legend shows the meaning of symbols and other map elements. A north arrow helps the user orient the map, and should always be included. Regarding orientation, most people in the Northern Hemisphere expect the top of the map to be north, and you should draw it that way unless you have a really good reason not to. Finally, a border helps the map look finished. A legend, north arrow, and scale bar are shown in Figure 128.
Figure 128 - Legend, north arrow, and scale bar
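Converting between the two scale notations is simple arithmetic; here is a minimal sketch (the function name is illustrative):

    def verbal_to_ratio(map_inches, ground_feet):
        """Convert a verbal scale such as 1" = 200' to a ratio scale denominator."""
        return ground_feet * 12 / map_inches  # 12 inches per foot

    print(f"1:{verbal_to_ratio(1, 200):.0f}")  # prints 1:2400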
Figure 129 - Map of a selected parameter displayed by the EDMS (Enviro Data)
MAPPING SOFTWARE
There are a number of different kinds of programs that can create maps. Some EDMS programs can display maps within the database software. Outside the EDMS, mapping and contouring programs, computer-aided drafting (CAD) programs, and geographic information system (GIS) programs can all create maps of various kinds with various levels of functionality.
Mapping within the database
Some EDMS programs provide mapping capability as part of their basic functionality. Figure 129 shows a map created by an EDMS. This map shows values from the database color-coded by the amount of a constituent at each location. This type of map is sometimes called a bubble or choropleth (amount shown by color) map. Mapping within the database is convenient, but the mapping tool may not provide as many features and options as a dedicated mapping program.
Mapping and contouring programs
Before GIS programs became widely available, there were mapping programs. Some were free programs like GSMap and AllMap from the U.S. Geological Survey, and some were commercial programs like Gridzo from RockWare. They provided base mapping and contouring capabilities. This type of program is still popular today, with Surfer from Golden Software the most popular, at least for environmental projects, and RockWorks from RockWare and others also filling this niche. The main differences between a mapping program and a GIS are in two areas: data integration and spatial analysis. GIS programs provide a level of integration with the data not usually found in mapping programs. A GIS can be used to build coverages, including topological structuring of the features. It also usually provides more integration with the data model, so that the GIS can be used as an interface into the database. GIS programs provide more spatial analysis
tools, such as proximity and buffer zone calculations, not usually found in mapping programs. On the other hand, mapping programs focus on surface modeling, with contour and related displays, and are generally easier to use than GIS programs.
CAD programs
Probably more maps have been printed using computer-aided drafting (CAD) programs than any other type of software. The ubiquitous use of CAD programs in engineering and architecture, along with the flexibility that they provide in graphical display, has led to widespread use in mapping. The disadvantage of most CAD systems is that they usually don’t have tools specific to mapping such as coordinate projections, spatial analysis, and surface modeling. Some CAD programs, such as AutoCAD Map from Autodesk and GeoMedia from Intergraph, expand drafting programs by adding map features. CAD programs are often used as the final step in map production. Maps created in other tools are exported in a CAD format such as a Drawing Exchange File (DXF) and then imported into the CAD program for final layout, title blocks and borders, and so on.
Geographic information systems
A geographic information system is a computer-based system for collecting, managing, analyzing, and displaying spatial data, as well as attributes associated with that spatial data (Guptill et al., 1988). GIS programs are often used in site investigation, planning, and management, and have applications as well in excavation and transportation. Examples of GIS usage in environmental and related projects can be found in Lang (1998) and Singhroy et al. (1996), and URISA (1998) contains a very basic discussion of GIS concepts. Developed initially for working on geographic problems, GIS programs are being used increasingly in geologic and engineering projects, and are amenable to use on any problems involving data with XY (horizontal) or XYZ (horizontal and vertical) spatial locations. GIS has been particularly well received by government agencies (Reports Working Group, 1988; Barnwell, 2000) and is being used increasingly in education and industry as well.
There are two types of GIS programs, vector-based and raster-based. Vector-based systems work with points, lines, and polygons, and work best with discrete data elements. Raster-based systems work with images, and are most useful with continuous data. Some GIS packages can handle both vector-based and raster-based data. Figure 130 shows an example of data from the EDMS that has been moved into a vector GIS and then displayed. This screen shows an index map in the upper right displaying the whole site, and a detail map to the left of part of the site at a larger scale. This map includes a bubble display. At the bottom is a list of stations and analyses, “hot-linked” to the map so the user can click on a station and the list will scroll to display the results from that station. Figure 131 shows another GIS example where data from an ODBC data source in the EDMS has been used to place data on the GIS map.
GIS DATA
The strength of a GIS is its ability to integrate spatial data from different sources, including maps, charts, remotely sensed images, and tables of text or numbers. The data elements may be in different scales or coordinate systems, may have different levels of accuracy, and may cover different or partly overlapping areas. The GIS software allows the user to compile the data elements into a coherent data set and work with it more efficiently than would be possible in a hard copy system. Usually the data handled by the GIS consists of two parts, the geographic location of a data element and the descriptive attributes associated with it.
Figure 130 - Example of EDMS data displayed in a GIS (Enviro Spase)
Figure 131 - Another example of EDMS data in a GIS (ArcView)
Geographic data elements can be points, lines, or polygons, and in 3-D GIS programs, polyhedra. Other data sets may be based on a two- or three-dimensional array of cells with particular attributes or values. The first type of data element (points, etc.) is handled by vector-based GIS programs and the second type by raster-based systems. Examples of environmental data that might be handled by a GIS, and how they apply to GIS data objects, include the following: Points – sample locations; Lines – geophysical surveys, trenches, or geochemical traverses; and Polygons – property boundaries, pit outlines, and outcrop areas. These three spatial data types can be conveniently handled by a vector-based GIS. Lines are sometimes called arcs, which may or may not be curved. Continuous data that might be manipulated by a raster-based GIS might include climate or soils data, mineralization patterns, contaminant plumes, or water saturation in an oil reservoir. Examples of three-dimensional GIS use in environmental geology would include calculation of the intersections of a fault with a geologic horizon, determination of the true location of deviated wellbores, interaction between aquifers and wells, and unraveling structural relationships in highly faulted areas.
Two other important points about GIS data are coverages and metadata. A coverage is a type of data (also called a theme or layer) together with the area for which that data layer was captured. A description of the content of a data set is called the metadata for that data set.
3-D GIS
Early GIS programs, either vector-based or raster-based, worked primarily in two dimensions. This was satisfactory for analysis of surface-oriented problems such as land use planning, but was inadequate for volume-oriented problems such as atmospheric and subsurface investigations. Many GIS programs now add the third dimension (Lang, 1989). Initially this was done by adding the Z (vertical) axis data as another layer in a standard GIS, but a more sophisticated approach is to modify the data structure so that XYZ coordinates are stored together for all data elements. Using this method, it is possible to calculate intersections, proximity, and common volumes between true three-dimensional objects. It is also becoming possible, using 3-D GIS, to do realistic visual simulations of surfaces, such as synthetic landscapes, subsea and subsurface topography, and complex three-dimensional relationships between geologic units and contamination. Figure 125 is an example of this type of display.
DISPLAYING DATA
One of the main purposes of creating maps, whether manually or using a mapping program, a CAD program, or a GIS, is to display data, and often this data is contained in the EDMS. Once the base maps and coverages have been built, then the next step is to get the data from the EDMS and display it on the map. Setting up the connection between the database and other software is discussed in Chapter 24. This connection can be one-directional, if the purpose is to display the data on a map for printing, as shown in Figure 132, or two-directional if the map display is used as a window into the database, as shown in Figure 133. Either way, the use of the GIS for data display adds a significant amount of utility to the EDMS.
Figure 132 - Data streamers displayed in a GIS program (Enviro Spase)
Figure 133 - Integrated display of maps and data from an EDMS in a GIS (Enviro Spase)
CONTOURING AND MODELING
One of the first things many earth scientists want to do with their computers is generate contour maps and other models. This is in spite of the fact that computers cannot, for the most part, generate contours based on a geological idea. However, there are many applications of computer-generated contour maps, such as regional maps, work maps, and maps of derived surfaces, where a computer-generated map is the best option. A useful technique for working with computer maps is to let the computer generate a contour map based on the existing data, then edit the contours to add some geological interpretation. The edited contours can then be used to correct the underlying grid, so that volumetrics and grid arithmetic can be done on the corrected grid. Some of the better contouring programs allow this approach.
Computerized contouring goes back quite a ways. IBM published one of the first papers on computerized contouring in 1965. For good discussions of contouring of geological data, see Davis (1986) and Jones et al. (1986). An excellent reference on the details of gridding and contouring of spatial data is Krajewski and Gibbs (2001), upon which the following contouring rules are based:
Rule 1: Contour lines cannot cross.
Rule 2: Contour lines cannot merge.
Rule 3: Contour lines must close (although not necessarily on the map).
Rule 4: Contour lines are repeated when the slope changes direction.
Rule 5: Contour lines are separated by a constant vertical interval.
Rule 6: Contour lines should be drawn smooth and not undulating.
Rule 7: Contour intervals should be small enough to see features. The interval must also be appropriate to the accuracy, precision, density, and distribution of the data.
Rule 8: Contours are drawn based on experience, data, and aesthetics.
While these rules should not be followed blindly, and many can have exceptions, they provide good guidelines for creating better contour maps. An area of concern in using computers to generate contours is contouring artifacts. These are features of the contours that are not present in the data. Figure 136 shows an example of artifacts from poor point selection, but artifacts can creep in at any stage in the process. Krajewski and Gibbs (2001) show examples of many different types of contouring artifacts and describe where they come from and how to avoid them.
Triangulation contouring
Computers can generate contour maps of randomly or irregularly spaced data in one of two ways. One approach is to create triangles (or other polygons) between the data points, and then use the triangles to create the contours. This is much faster than the gridding techniques discussed below, and the contours always honor the data points. Since there is no grid, it is difficult to do volumetric calculations or manipulations between surfaces with different amounts of control, and the appearance of the contours may not be what many users would desire. Also, most triangulation algorithms cannot project the surface above or below the maximum and minimum data points. Because the calculations are quick to perform, triangulation is often used for a first look at a data set to identify data errors, which show up as “bull’s eyes” of closely spaced contours around the erroneous data point.
Figure 134 shows the process of creating contours using triangulation. The first box shows the location of the data points. The second shows the triangles created by the software. The third shows the contours drawn in each triangle. The contours created with triangulation can be smoothed to improve their appearance, but care must be taken that the smoothing does not interfere with honoring the data points, and smoothing can decrease performance.
Figure 134 - Triangulation contouring example showing data point locations, triangles, and contours
A surface created by triangulation is often called a triangulated irregular network or TIN. This is a common way of representing topography and other surfaces in a GIS.
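As an illustration of triangulation contouring, here is a minimal sketch using Python's matplotlib library, which provides Delaunay triangulation and triangle-based contouring; the data values are synthetic.

    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.tri as mtri

    # Synthetic, irregularly spaced sample data
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 100, 50)
    y = rng.uniform(0, 100, 50)
    z = np.sin(x / 20) + np.cos(y / 25)

    tri = mtri.Triangulation(x, y)                     # build triangles between data points
    plt.tricontour(tri, z, levels=10, colors="black")  # thread contours through the triangles
    plt.plot(x, y, "k.", markersize=3)                 # post the data point locations
    plt.show()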
Gridded contouring
The other, more common way of generating contour maps with a computer is to first create a grid of evenly spaced data points based on the original data, and then create contours based on the grid. This has the advantage that the grid can be used for other types of calculations, but the disadvantage is that the grid does not directly represent the original data. Figure 135 shows an example of a contour map and the values of the grid nodes used to create the contours. Several steps are required to make a gridded contour map. These steps include specifying the spacing and orientation of the grid, selecting the points to be used for calculating each grid point (usually called a grid node), choosing the algorithm to be used to estimate the values at the grid points, selecting contouring parameters, and so on. The following paragraphs discuss these steps in detail.
Figure 135 - Example of gridded data and contours based on the grid
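A parallel sketch of the gridded approach, here using the SciPy library (an assumption; the text does not prescribe software) to estimate node values and matplotlib to thread the contours:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.interpolate import griddata

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 100, 50)
    y = rng.uniform(0, 100, 50)
    z = np.sin(x / 20) + np.cos(y / 25)

    # Build a 50 x 50 grid of evenly spaced nodes over the data extent
    xi = np.linspace(x.min(), x.max(), 50)
    yi = np.linspace(y.min(), y.max(), 50)
    XI, YI = np.meshgrid(xi, yi)

    ZI = griddata((x, y), z, (XI, YI), method="cubic")  # estimate values at the grid nodes
    plt.contour(XI, YI, ZI, levels=10)
    plt.plot(x, y, "k.", markersize=3)
    plt.show()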
GRID PARAMETERS
Once a data set has been selected for contouring, the parameters for the grid must be selected. The maximum and minimum values for the outside of the grid network must be specified. This area can be larger or smaller than the outline of the data points. The name given to this outline is the convex hull of the data, which is the smallest convex polygon that can be circumscribed around the data points. The grid can be made larger than the area covered by the data points to cover a specific geographical area, but the edges of the map are likely to be unreliable. It is better if data points are available outside of the area of the grid (often called overage) to ensure that the edges of the map have a realistic appearance.
Some contouring packages allow the user to specify the orientation of the grid, while others require that the lines defining the grid be north-south and east-west. If it is possible to rotate the grid, it is usually best to orient the grid in the same direction as the grain of the geologic features to be mapped, such as the trend of sand bars for an isopach or the strike direction for a structure map. A disadvantage of adjusting the grid orientation is that it puts a bias into the map that must be documented, and that may be difficult to assess. Other techniques such as spatial filtering of a nonrotated grid may produce better results.
The grid cell spacing in one or both directions must be specified either in data units or in number of cells across the grid. Some packages allow the spacing in one direction to be different from the other, while others require them to be the same. By changing the relative spacing of the grid nodes, especially if the grid orientation can also be varied, it is possible to emphasize or minimize certain trends or components of the data. The choice of the grid cell spacing is important because it determines the detail that can be resolved in the surface being mapped, but there are two sides to this decision. On the one hand, if sufficient data coverage is available, a fine grid spacing allows definition of small features, providing a detailed map. On the other hand, a coarse grid can be calculated much faster and takes less space for storage. The spacing selected should match the computational resources available and the intended use for the map. Also, too fine a grid spacing relative to the data being gridded can cause unsightly artifacts such as unsupported small closures, sometimes called “donuts,” and various other features. Krajewski and Gibbs (2001) quote the Nyquist rule, which states that for most algorithms, the average grid block size should allow a data point every two to three cells. Also, the grid size and origin should be selected so that no more than one data point is present in any grid cell.
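A rough sketch of how that spacing rule might be applied; the average-spacing estimate assumes fairly even data coverage, so treat the result only as a starting point.

    import numpy as np

    def suggest_cell_size(x, y, cells_per_point=2.5):
        """Suggest a grid cell size so a data point falls every 2-3 cells."""
        extent_area = (x.max() - x.min()) * (y.max() - y.min())
        avg_spacing = np.sqrt(extent_area / len(x))  # crude mean point spacing
        return avg_spacing / cells_per_point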
POINT SELECTION
Once the grid parameters have been selected, it is necessary to specify how points are to be selected for estimating grid nodes. One factor is the search radius distance, or area of influence. Points whose distance from the grid node being estimated is greater than this distance will have no effect on that particular grid node. Choosing a distance equal to the diagonal of the grid area will ensure that all points are considered. Choosing a smaller distance will speed up the computation, and is a good thing to do if the data distribution is even across the area.
A second factor in point selection for grid estimation is the number of data points (often called nearest neighbors) to be considered for each grid node. Again, the best number depends on the distribution of the data, with a larger number (10 to 20) being better for irregular or clustered data and a smaller number (5 to 10) being better for regularly spaced data points.
The third point selection factor provided by many contouring packages is quadrant searching or octant searching. In normal searching, neighbors are searched from the nearest one outward until the maximum specified number of points has been used. In quadrant and octant searching, the gridding program looks farther away in directions where no nearby points are found. Usually a maximum number of points is allowed in each sector, and a weighting factor reduces the influence of more distant points.
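A simplified sketch of octant searching (the parameter names and defaults are illustrative, not from the text): data points are binned by bearing from the node, and the nearest few in each 45° sector are kept.

    import numpy as np

    def octant_search(node_xy, data_xy, max_per_octant=2, radius=np.inf):
        """Return indices of neighbors for one grid node, a few per octant."""
        dx = data_xy[:, 0] - node_xy[0]
        dy = data_xy[:, 1] - node_xy[1]
        dist = np.hypot(dx, dy)
        octant = (np.degrees(np.arctan2(dy, dx)) % 360 // 45).astype(int)
        selected = []
        for o in range(8):                 # examine each 45-degree sector
            idx = np.where((octant == o) & (dist <= radius))[0]
            selected.extend(idx[np.argsort(dist[idx])][:max_per_octant])
        return np.array(selected, dtype=int)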
Figure 136 - Example of contours generated with different search algorithms (inverse distance squared): original data, normal search, quadrant search, and octant search
The directional searching technique is useful in overcoming data distribution problems, including clustered data or data collected along lines, such as geophysical or geochemical traverses. Normal searching of data distributed along a grid of lines can produce maps with the appearance of a waffle. This can be minimized with quadrant or octant searching. Figure 136 shows an example of the results of different search algorithms. An inclined plane was sampled along a grid of lines. That data was then contoured with an inverse distance squared algorithm, using three different search techniques. The difference between the three contoured surfaces is striking, with the octant search resulting in the best representation of the data, at least for the parts of the map that are surrounded by data. Another interesting aspect of this figure is that it illustrates how poorly computerized contouring can work in certain situations. Even the “best” map, the octant search map, represents
the original surface very poorly around the edge of the map. If this is the best we can do with a surface whose configuration is known, how well can we do with a surface that we can’t see? The other side of this is that the gridding and contouring parameters were intentionally selected to introduce an exaggerated artifact. The conclusion to be taken from this example is that even a knowledgeable user must use computerized contouring with care if the results are to be trusted.
GRIDDING ALGORITHMS
While the above factors are important, the variable that usually has the greatest effect on the final map is the choice of the gridding algorithm used to estimate the values of the grid nodes. The gridding algorithm is the mathematical procedure used to calculate the value assigned to each grid node. Dozens of different techniques have been developed for use with different types of data and to meet various data distributions and computational requirements. Some of the most popular are discussed in the following paragraphs. Many contouring programs provide more than one algorithm, and it is often best to experiment with several on each data set to see which provides the most realistic interpretation. All of the better contour programs provide the capability to automatically force the surface to honor all control points, as long as each grid cell is occupied by no more than one control point.
Inverse distance – The inverse distance method, also called weighted moving average, is relatively easy to program and quick to calculate, and is available in many contouring programs. Each grid node is estimated from a weighted average of the selected neighbors. The weighting factor is based on the distance to the neighbor raised to some power, usually two (known as “inverse distance squared”). Powers other than two can also be used. Lower powers lessen the effects of local points, while higher powers decrease the effects of distant points. Since the grid nodes are based on an average, no grid node will be greater than the maximum or less than the minimum data value, and the grid will have a smooth surface. Beyond the convex hull of the data, the surface will not be projected, but will tend toward a regional average. Also, depending on the fineness of the grid and the distribution of the data points, the resulting contours may not honor all of the data points. This is usually a poor choice for all but the simplest surfaces.
Least squares – The least squares technique is probably the most used gridding algorithm. In this technique, a polynomial equation is used to fit a surface to the data points, and this surface is used to estimate the grid nodes. The higher the order of the polynomial (that is, the higher the power of the equation used to define the surface), the more curvature the surface can have and, usually, the better the fit to the data. However, higher order polynomial surfaces can be time-consuming to calculate. As with inverse distance, this is an averaging technique, so the surfaces tend to be smooth and may not honor data points, although grid nodes can have values greater than the maximum or less than the minimum data value. Surfaces are projected beyond the convex hull of the data, and the edges of the map may fluctuate wildly if no overage (data beyond the map) has been used. Some programs have techniques to eliminate the edge effects.
Spline – With the spline method, the program fits a flexible three-dimensional surface to the data points, trying to make the best fit based on the gradients from one data point to the next. Grid nodes are then estimated from the spline surface, with the optimum value for each node being the one that requires the least curvature of the surface (which gives rise to another name for this method: minimum curvature). The most common method of determining the spline surface is the bicubic spline, and matrix algebra is used to fit the surface and analyze the gradients.
This method does a good job of honoring data points with a minimum of averaging, but projection beyond the convex hull of the data may cause the edges of the map to be unreliable. For most randomly distributed data, this is an excellent choice. In some programs it is slower than other algorithms. Edge effects can be controlled by properly constraining the algorithm, and with “helper” control points. The constraining variables and control points can be difficult to set up, but once set up for an area and data distribution they don’t need to be re-determined for other similar data sets.
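As a concrete illustration of the inverse distance squared calculation described above, here is a minimal NumPy sketch (with no search radius or neighbor limit, which production gridding software would add):

    import numpy as np

    def idw_grid(x, y, z, xi, yi, power=2.0):
        """Estimate grid node values as distance-weighted averages of the data."""
        XI, YI = np.meshgrid(xi, yi)
        # distance from every grid node to every data point
        d = np.hypot(XI[..., None] - x, YI[..., None] - y)
        d = np.maximum(d, 1e-12)          # guard against a node exactly on a point
        w = 1.0 / d**power                # inverse distance squared weights
        return (w * z).sum(axis=-1) / w.sum(axis=-1)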
Kriging – The Kriging algorithm, which is based on geostatistics, uses the concept of a regionalized variable. A regionalized variable is one which is continuous (spatially correlated) over a small area, but as the spacing of samples increases, the relationship between points approaches randomness. This situation is common with many types of earth science data in one, two, or three dimensions. The degree of spatial continuity is quantified with a variogram, which shows the variance of the data as a function of distance between data points. This variogram is used along with existing data points to estimate values at grid nodes (or at any other location) away from data points. This technique also has a significant additional benefit in that it provides information about the degree of error or uncertainty of specific points, such as the grid nodes, on the calculated surface. There are two different ways of performing Kriging. The simplest, called punctual Kriging or ordinary Kriging, considers the data points and grid nodes to be dimensionless points. It is a relatively fast process to calculate the surface, but the procedure does not work well when there is a trend or drift to the data. Universal Kriging overcomes this limitation, but requires additional computations, since the drift is first removed, then the data set is Kriged, and then the drift is added back in. Once used mostly in the mining industry, Kriging is becoming more common in other areas of geoscience. For many data sets, Kriging provides a statistically optimum surface for contouring. This statistically optimum surface may or may not resemble a geologic surface. It is probably better suited to contouring values of an analytical parameter than a physical surface.
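A hedged sketch of ordinary Kriging using the third-party PyKrige package (an assumption; the text does not name software). The data values and the variogram model chosen here are illustrative; the model should be fitted to the actual data.

    import numpy as np
    from pykrige.ok import OrdinaryKriging

    # Illustrative data points (x, y, value)
    x = np.array([10.0, 40.0, 70.0, 25.0, 60.0])
    y = np.array([15.0, 30.0, 80.0, 65.0, 45.0])
    z = np.array([1.2, 2.5, 0.8, 1.9, 2.1])

    ok = OrdinaryKriging(x, y, z, variogram_model="spherical")
    gridx = np.arange(0.0, 100.0, 5.0)
    gridy = np.arange(0.0, 100.0, 5.0)
    zhat, ss = ok.execute("grid", gridx, gridy)

The second value returned is the Kriging variance, the uncertainty estimate mentioned above.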
CONTOUR PARAMETERS
Once the grid has been created, it is a relatively simple procedure for the software to thread the contours between the grid nodes. The user must choose the contour interval, which is the vertical difference between the contours, along with the starting and ending contours, if desired. Other items to be specified include which contours are to be bolded and labeled, and how contours are to be labeled. Some programs allow hachures (short lines on one side of the contour lines) to be placed on the downdip side of contours, which is helpful in interpretation. Either all contours or only closed lows can be hachured. It is often possible to smooth the contours after they have been drawn; this differs from smoothing the grid before contouring, which is described below. The choice of the starting contour and the contour interval can bias the output map. This can be a major issue for low-relief surfaces. An advantage of computerized contour mapping is that you can change either or both and almost instantly see the result.
GRID OPERATIONS
In addition to contouring, gridded data sets can be used for several other operations. These include volumetrics, smoothing, spatial filters, and operations between grids.
Volumetrics – Many times, determining the volume of a particular feature, such as an aquifer, pond, or plume, is very important. The traditional way of doing this used to be to hand contour the data, then use a mechanical planimeter to estimate the area within each contour. These areas were then added together to get the volume. The process can be sped up considerably using the calculated grid nodes. This can be done using grid data alone or in conjunction with polygon data such as property outlines to determine the desired volume. Digital volume calculations made from a well-behaved gridded data set are generally more accurate than those made with a planimeter. However, most volumetric algorithms report volumes to many decimal places, thus implying precision beyond that of the input data. Caution must be used in interpreting these reports.
Smoothing – Some gridding techniques generate grids that show some degree of jaggedness or roughness. Statistical techniques can be used to smooth the grid, often in combination with increasing the grid density, to improve the subjective appearance of the final map. This is cosmetic only, however, and does not increase the information content of the map, although it may improve the geologic faithfulness of the final result.
Filtering – A technique related to smoothing is spatial filtering. This involves passing a small matrix across the grid, performing calculations between the matrix and the grid at each location, and saving the result in another grid. This is used for emphasizing or minimizing aspects of the data, such as particular directional trends or features of a certain size. It is also used for enhancing edges between different data levels in a manner similar to the algorithms used in remote sensing.
Trend surfaces and Fourier transforms – In some cases, it is appropriate to make some mathematical transformations on your data before you contour it. For example, it may be desirable to remove a regional trend from the data, or to correct for or emphasize some periodic spacing of geologic features. In the former case, trend surface analysis can be used, and in the latter, Fourier transforms can be helpful. Software exists for performing both of these procedures prior to the contouring process. Similar to these are derivative maps, especially the second vertical derivative. This technique has been used for decades in gravity, magnetic, and other potential field geophysics work to emphasize highs, lows, and trends. Because this method shows the rate of change of the slope of a surface, it can be used to highlight small perturbations in a surface that otherwise would be lost in the contour interval. Several contouring programs contain this feature, or it is relatively easy to program if it is not available.
Operations between grids – The above discussion concerned grids and contours of one data element at a time. It is also possible to perform mathematical or logical operations between two or more different grids. Two examples will be used to illustrate the process. The first example involves generating a structural contour map of a surface that is only sparsely penetrated by wells. A shallower horizon is well defined by penetrations, and the two surfaces are thought to be conformable (roughly parallel). A grid of the elevation of the upper zone is created, using all available data. Then a grid is made of the isopach (thickness) from the lower zone to the upper zone, using the few deep penetrations. The isopach grid is then subtracted from the structure grid of the upper surface to generate an estimate of the structure of the lower surface. If used carefully, this is a very powerful technique. Another example is estimation of the amount of contaminated material in a zone above a certain regulatory cutoff. A grid is created of the thickness of the zone containing the contamination, and another of the concentration of the contaminant in the zone. Next, a logical operation is performed on the grid of the contamination level, setting any nodes less than the regulatory limit to zero and those above it to one. The resulting grid is then multiplied by the thickness grid, and summing the product grid (times the cell area) gives the volume of contaminated material above the cutoff.
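A minimal NumPy sketch of the second example, using two small, made-up aligned grids; the cell size, thickness, and concentration values are illustrative only.

    import numpy as np

    cell_area = 10.0 * 10.0                                # cell size, square feet
    thickness = np.array([[2.0, 3.0], [4.0, 1.0]])         # zone thickness grid, feet
    concentration = np.array([[50.0, 120.0], [300.0, 10.0]])  # contaminant grid
    cutoff = 100.0                                         # regulatory limit

    mask = (concentration >= cutoff).astype(float)         # 1 above the cutoff, 0 below
    volume = (thickness * mask).sum() * cell_area          # contaminated volume, cubic feet
    print(volume)                                          # (3 + 4) * 100 = 700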
SPECIALIZED DISPLAYS
Sometimes, in order to get the maximum information content from the data, it is helpful to create displays specialized for the specific type of data being shown. These specialized displays can be graphs, cross sections, maps, or other formats. Graphs are described in Chapter 20 and cross sections in Chapter 21. Some examples of specialized map displays will be discussed here.
Plume models
Mapping and GIS programs can be used to model the location of a plume in two, three, and even four (including time) dimensions. The data is first retrieved from the EDMS, and then the mapping software is used to create the model, often based on an interpolated grid. Some plume modeling programs provide the capability to project the model forward in time to predict future configurations (Figure 137).
Figure 137 - Plume displayed at various concentration cutoffs (Courtesy of RockWare, Inc.)
Simulations
Software can be used to visualize things that are not directly visible. They may not be visible because they are obscured, such as features in the subsurface, or because they do not exist yet. Figure 138 shows an example of a study performed to determine the visual impact of different configurations of a landfill design. The site was photographed from several vantage points, with 60" balloons floating on cables at several elevations above three locations at the site. Then the Surfer contouring program and the Corel Draw graphics program were used to create three-dimensional representations of the different configurations and superimpose them over the site photographs. This allowed the landfill operator to better understand the impact of its plans on the mountain views of its neighbors, and to document that for most of the neighbors the impact of the proposed configuration would be minimal.
Water chemistry diagrams
Two kinds of diagrams, Piper and Stiff, are commonly used to illustrate and analyze water chemistry. First you input analytical results from laboratory tests, and then the program converts the data from a mass-based concentration (such as milligrams per liter) to electrochemical equivalence (milliequivalents per liter) and draws the plot. The program may also calculate the ion balance and compare total dissolved solids (TDS) values as a check on the lab analysis. Piper diagrams display multiple samples on one diagram, while Stiff diagrams display one sample at a time. A useful approach is to draw a Piper diagram to help identify different water populations, and then draw a Stiff diagram for each sample at its location on a map to analyze the spatial distribution of the water groups.
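A minimal sketch of that mass-to-equivalence conversion; the equivalent weights below are standard values (formula weight divided by charge) for the major ions, and the function name is illustrative.

    # grams per equivalent: formula weight / |charge|
    EQUIV_WEIGHT = {
        "Ca": 40.08 / 2, "Mg": 24.31 / 2, "Na": 22.99 / 1, "K": 39.10 / 1,
        "Cl": 35.45 / 1, "SO4": 96.06 / 2, "HCO3": 61.02 / 1, "CO3": 60.01 / 2,
    }

    def to_meq_per_l(ion, mg_per_l):
        """Convert a concentration in mg/L to milliequivalents per liter."""
        return mg_per_l / EQUIV_WEIGHT[ion]

    print(round(to_meq_per_l("Ca", 80.0), 2))  # 80 mg/L calcium -> about 3.99 meq/L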
Figure 138 - Landscape simulation for proposed landfill expansion, showing the current, permitted, and proposed configurations (Courtesy of Geotech Computer Systems)
PIPER DIAGRAMS
A Piper diagram uses a diamond and two or more triangles to display the chemistry of several samples at a time. This allows quantitative comparison and classification of the samples. Figure 139 (from Lee, 1998) shows an example of a Piper diagram. Different water populations will cluster in different parts of the diagram.
Figure 139 - Piper water quality diagram from Lee (Reprinted from Computers and Geosciences v 24, p. 523-529 by Tien-Chang Lee, LEEGRAM: a program for normalized Stiff diagrams and quantification of grouping hydrochemical data, © 1998, with permission from Elsevier Science and Dr. Lee)
STIFF DIAGRAMS
Stiff water quality diagrams are a very useful tool to help understand groundwater chemistry. Their use is described in Hem (1985). They are generally used to show the most common cations and anions, plotted as a pair of line plots with the cations on the left and the anions on the right. Distance from the centerline is proportional to the concentration (in milliequivalents) of the particular ion. Plots of samples from several different wells can be posted on a map, and relationships between the samples can be viewed qualitatively. Some programs for creating Stiff diagrams draw the diagram for the data from one well, which must then be pasted onto a map. Software that automatically posts the diagrams for all of the wells on the map can eliminate the manual pasting step.
Figure 140 - Screen for entering Stiff diagram parameters
Figure 141 - Stiff diagram scale display
Figure 142 - Stiff water quality diagram from EDMS data
Although the primary use for Stiff diagrams is to display major ion chemistry, they can be used to display any constituent, such as indicator chemicals and constituents of concern. Figure 140 shows a screen for selecting how the diagrams are to be displayed. Figure 141 shows the scale used to interpret the Stiff diagrams. Figure 142 shows an example of Stiff diagrams drawn on a map. In this display, each polygon represents a water sample at a specific location. The color of the polygon fill depends on the total dissolved solids concentration of the water, and the shape reflects the major ion chemistry. This type of map is very useful in analyzing the sources and mixing of waters in the subsurface.
It is possible to draw some interesting conclusions from Figure 142. At the top (north) end of the figure is a station with a relatively high chemical concentration. The software filled the polygon with a dark color because the concentration of total dissolved solids exceeds a user-defined cutoff of 1500 mg/l. To the south are two stations that also have relatively high concentrations. The high concentration value to the north is associated with an industrial facility, while the southern values are near a residential area, off the map to the southeast. At first glance, it might seem reasonable to infer that the high values to the south were caused by pollution from the industrial facility.
Figure 143 - Raw data and calculations accompanying Stiff water quality diagrams
The Stiff diagrams tell a different story. There are two lines of evidence that suggest that the water to the south is unrelated to the water to the north. The first is the shape of the Stiff plots. The water to the north is high in calcium and sulfate, while the waters to the south are lower in those ions, and higher in sodium/potassium and bicarbonate. This can be seen from the shape of the curves, and suggests a different source for the water in the south. Also, it is clear from the Stiffs that the stations in the middle of the figure are relatively low in all ions, and their shape, while subdued due to their low concentration, does not look very much like the others. This is another piece of evidence that the two areas of high concentration are unrelated. All of this information together provides a fairly complete picture of the relationship between the two water populations.
Figure 143 shows the raw data and calculations for the Stiff diagrams for two stations. It is important to carefully review the data used to create the Stiffs before they are used to draw conclusions. Two things in particular to check are the charge balance and the TDS. The charge balance error should be a relatively small percentage of the total of the milliequivalent values. For the TDS, the value reported by the laboratory should be close to the sum of the concentrations of the major ions. A reasonable target for both is ±5 to 10%. For station F-3 the total of the milliequivalents is 66.8 and the charge balance error is 1.68, so the error is about 2.5%. For F-4, the error is -4.3%, so both look reasonable. With the TDS comparison, for F-3 the values are within about 4%, which is fine, but F-4 is off by 24% and might be questionable.
One disadvantage of a standard Stiff diagram is that large variations in concentration can lead to diagrams with greatly varying width. This can obscure the shape relationships that are the goal of the diagrams. The software used to create Figure 140 includes an option to re-scale the diagrams if the TDS value exceeds a certain cutoff, which can help, especially where there is a combination of relatively fresh water and brines. Lee (1998) has described a modification to Stiff diagrams in which the diagrams are scaled continuously based on the TDS value, so all of the diagrams are the same width. The TDS value is then shown with a circular display in the center of the diagram. This type of display, called a Leegram, is shown in Figure 144.
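These two checks are easy to compute. The following Python sketch shows one common form of the calculations; the formulas follow the description above, and the sample numbers are illustrative, chosen to echo station F-3:

# Sketch of the two data checks described above. The charge balance error
# compares total cations to total anions (in meq/L); the TDS check compares
# the lab-reported TDS to the sum of the major ion concentrations (in mg/L).

def charge_balance_error(cation_meq: float, anion_meq: float) -> float:
    """Percent difference between cations and anions, relative to their sum."""
    return 100.0 * (cation_meq - anion_meq) / (cation_meq + anion_meq)

def tds_error(reported_tds: float, sum_of_ions_mg_l: float) -> float:
    """Percent difference between reported TDS and the sum of major ions."""
    return 100.0 * (reported_tds - sum_of_ions_mg_l) / reported_tds

# A result is usually acceptable if both checks are within about +/- 5 to 10%.
print(charge_balance_error(34.2, 32.6))   # about 2.4%
print(tds_error(2100.0, 2010.0))          # about 4.3%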
Figure 144 - Leegram presentation of Stiff diagrams from Lee (Reprinted from Computers and Geosciences v 24, pp. 523-529 by Tien-Chang Lee, LEEGRAM: a program for normalized Stiff diagrams and quantification of grouping hydrochemical data, © 1998, with permission from Elsevier Science and Dr. Lee)
CHAPTER 23 STATISTICS AND ENVIRONMENTAL DATA
Often the data in the EDMS will be used for further processing and analysis, such as statistical calculations. The large data sets used on many environmental projects often require a statistical approach to determine the significant relationships, because these relationships may not be apparent from direct inspection of the large volume of data. There are a number of issues that must be addressed if the statistical results are to be valid. This chapter addresses some of these issues, along with some of the statistical analyses that can be performed to provide a greater understanding of site conditions. Statistics is a very old and much-studied field, and this chapter merely scratches the surface. Still, an understanding of statistics is important in analyzing environmental data. 80% of environmental professionals are afraid of statistics (OK, I made that up, but it’s not too far off), and knowing something about statistics and how it should and should not be applied is a useful skill for those working with environmental data.
STATISTICAL CONCEPTS There are a number of concepts that you should understand before you perform statistical analyses on environmental (or any other) data. For example, all numbers are not created equal, and statistical analyses that are valid on some may not be valid on others. Likewise, you should have a good understanding of the distribution of your data, and the cause of that distribution, before you trust the statistics that you run on that data. One important point that should be made is that statistics, if used improperly, can be misleading at best and dishonest at worst. Huff (1954) wrote a book called How to Lie with Statistics, and there is a lot of truth to the idea. Benjamin Disraeli, Prime Minister of Great Britain in the 1800s, is quoted as saying, “There are three kinds of lies: lies, damned lies, and statistics.” Be sure that you use statistics to expose the truth, not hide it.
Types of numbers
Although it might appear that a number is a number, in fact different types of numbers can have significantly different meanings. Several authors have listed types of numbers (also sometimes referred to as scales of measurement), including Robinson (1982) and Joseph (1998). Number types in this context are different from the numeric types defined in a programming language, which have names like real (floating point) or long integer. What we are talking about here is what the numeric values represent, as opposed to how they are stored.
41% of all statistics are worthless. Rich (1996)
The following are some examples of numeric types:
Nominal – These are numbers that are merely labels, with no relative amount or order. Assigning well types of 1 for a monitoring well and 2 for a soil boring would be an example of a nominal data type. Statistics on this type of number should be limited to counting the number in each class.
Ordinal – This type of number has a relative order, but the interval between items is not constant. A common example of this is the Mohs hardness scale for minerals, where the increase in amount between 9 (corundum) and 10 (diamond) is much greater than the intervals between the other values. In addition to counting values, valid statistics would include a cumulative frequency histogram and other similar distribution analyses.
Interval – In this case, the relative difference between values is constant, but a true zero value is not defined. Temperature in everyday scales is an example of this type of number. The number chosen for zero is somewhat arbitrary, with zero in Celsius and zero in Fahrenheit being different, but the increment between degrees is well defined. Most types of statistics can be performed on these numbers, especially those involving addition and subtraction. Care must be taken in dividing or multiplying these numbers, or working with powers.
Ratio – Ratio scale numbers are ones where both the interval and the zero point are well defined. The concentration of a constituent in a sample is an example of a ratio scale measurement. Any type of statistical analysis can be performed on this type of number, as long as the statistical test is valid for the distribution of the numbers.
Random and stochastic processes
Some processes are based on direct cause and effect. A deterministic process is one for which the model is well understood, and a given input will cause a specific output. On the other hand, many processes in nature are random, or appear to be so. A random process is one for which no rule can be applied. Many processes that appear to be random are in fact stochastic. A stochastic process is one in which the variation can be described by probability theory, and this is a better description of many natural processes. A good example of a stochastic process is the popping of popcorn. At a given temperature, a certain number of kernels will pop each second, but it is impossible to know exactly when each kernel will pop; that can only be described statistically. Global climate change is another example of a process that may never be described deterministically. Most stochastic or even random processes are actually deterministic processes for which the model is so complicated that it is not (and probably never will be) understood. The importance of stochastic processes in the environment is discussed in Ott (1995). Sampling of the results of a stochastic process will result in some type of distribution of the data, which may or may not be a normal distribution.
Distribution - Are you normal? The distribution of numbers, that is, how many of each value are present in a population or a sample, has a large impact on what type of statistics can be performed on the data. The most common distribution is the normal distribution, also called a Gaussian distribution or bell-shaped curve. Many common statistical analysis techniques are valid only for normal distributions, so determining whether the data has this distribution is critical before doing these analyses. A graph of the normal distribution is shown in Figure 145.
Figure 145 - Graph of a normal distribution (probability plotted against value, with bands at 1, 2, and 3 standard deviations containing about 34%, 14%, and 2% of the values on each side of the mean)
Figure 146 - Graph of a lognormal distribution (probability plotted against value, with the geometric and arithmetic means marked)
In this graph, the number of observations of each value of a variable is plotted against the value. The shape of the curve is determined by the mean and the standard deviation. The center of the graph is the (arithmetic) mean of the data, which is determined by taking the sum of the values and dividing by the number of values. The standard deviation is a measure of the spread of the data. In Figure 145, vertical lines have been drawn at 1, 2, and 3 standard deviations from the mean. The percentage of data points contained in the area a certain number of standard deviations above or below the mean can be easily calculated, and these percentages are also shown on the graph. For example, about 34% of the values fall between the mean and 1 standard deviation below it.
Another common distribution of environmental data is the lognormal distribution, in which the logarithms of the numbers are normally distributed. A graph of this distribution is shown in Figure 146. Indicative of a lognormal distribution is a data population that has a lot of low and medium values, and a few very high values. The distribution of contamination often has this characteristic, and plotting such data on a logarithmic scale makes the wide range of values easier to represent. Identification of a lognormal distribution is very important if you will be doing statistical analysis of the data, because a number of statistical techniques are not valid, or at least not useful, for this type of distribution. This is particularly important where a remediation project is trying to achieve concentration goals based on average values. The arithmetic mean of a lognormal distribution will usually be much higher than the geometric mean, which is determined by calculating the mean of the logarithms of the data and then taking the antilog. A corrective action for contamination with a lognormal distribution is likely to be judged successful much sooner if the geometric mean is evaluated rather than the arithmetic mean.
Some physical examples of distributions might be helpful. The size distribution of a sample of well-sorted beach sand tends to be normal. Conglomeratic sand, with many small particles but few large ones, will have a lognormal size distribution. Contaminant concentrations are often lognormal, with many low values and a few very high values.
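The difference between the two means is easy to demonstrate. Here is a minimal Python sketch on invented, lognormal-looking data:

# Contrast the arithmetic and geometric means on data with many low values
# and a few very high ones, as is common for contaminant concentrations.
import math

values = [1.2, 0.8, 2.1, 1.5, 0.9, 1.1, 48.0, 1.7, 0.6, 95.0]

arithmetic_mean = sum(values) / len(values)

# Geometric mean: mean of the logs, then the antilog.
geometric_mean = math.exp(sum(math.log(v) for v in values) / len(values))

print(round(arithmetic_mean, 2))  # about 15.3 - pulled up by the high values
print(round(geometric_mean, 2))   # about 2.6 - closer to the typical value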
Populations and samples Another issue related to the distribution of the data is the relationship between the samples and the underlying population. The population can be either discrete, where the number of possible samples is finite, or continuous, where the number of possible samples is infinite. Either way, the samples taken usually represent a subset of the possible samples. (The facetious exception to this is for an extremely heavily sampled site, where so many samples have been taken and sent to the lab that there is nothing left to remediate.) The subset nature of samples must be taken into consideration in the application of statistical tests.
Problems with environmental data
Yost (1997) has pointed out a number of issues encountered in environmental data that must be taken into consideration when performing statistical analysis on the data:
• Aberrant values are present that deviate from the trend of previous measurements.
• Censored data, meaning that certain values, such as those below the detection limit, are not properly represented.
• Happenstance data is collected for various uses, then used for others for which it was not intended.
• Large measurement errors frequently occur.
• Lurking variables may exist, which are significant variables that are not measured.
• Non-constant variances, which are measurement errors proportional to the magnitude of the measured value rather than a constant, resulting from activities such as sample dilution.
• Serial correlation results when values are taken sequentially in space or time, resulting in values that are not independent measurements.
While these issues don’t make the data invalid, they must be kept in mind in selecting the approach to handling the data and the statistical test applied. The purpose of these statistical tests is to determine whether a particular result is out of compliance. Either the result is in violation or it is in compliance, and the statistical decision says either that it is in violation or that it is in compliance. This results in four possibilities, as illustrated in Figure 147. Two of the outcomes are good decisions; one is a false positive, also known as a Type I error, and the other is a false negative, or Type II error. The probability of a Type I error, where contamination is indicated but is not present, is the “significance level” of the test, and is typically set at 5%, or 1 chance in 20, of a false positive. A Type II error, where contamination is present but not indicated, is more difficult to quantify, and, depending on your perspective, could be more serious.
                                Statistical Decision
True Situation      In Violation                             In Compliance
In Violation        Good Decision                            False Negative Decision (Type II error)
In Compliance       False Positive Decision (Type I error)   Good Decision
Figure 147 - Statistical error in hypothesis testing
Statistics and regulatory requirements
Statistical analysis has an important place in satisfying regulatory requirements for data analysis and reporting. Sara (1994, pp. 11-26 to 11-30) discusses the statistical component of some regulatory requirements. He quotes the following EPA requirements for statistical procedures to be performed for each constituent for RCRA facilities:
• A parametric analysis of variance followed by multiple comparison procedures to identify statistically significant evidence of contamination.
• An analysis of variance based on ranks followed by multiple comparison procedures to identify statistically significant evidence of contamination.
• A tolerance or prediction interval procedure in which an interval for each constituent is established from the distribution of the background data, and the level of each constituent in each compliance well is compared to the upper tolerance or prediction limit.
• A control chart approach that gives control limits for each constituent.
The regulations also discuss how various data issues, such as detection limits, are to be handled in statistical analysis, and require other procedures such as comparison of exceedances to background well data.
TYPES OF STATISTICAL ANALYSES Statistical analysis of earth science data can be broken down into descriptive statistics, multivariate analyses, and spatial or geostatistics. A good, detailed description of the use of statistics in geology can be found in Davis (1986), and a good discussion of statistics on environmental projects is contained in Joseph (1998). Statistical tests are also described as parametric or non-parametric. A parametric test assumes that the data follow a particular distribution, usually the normal distribution, while a non-parametric test makes no such assumption. Parametric tests are based on the magnitude of values, while non-parametric tests are based on the ordering of the values. Sometimes a transformation, such as a log transformation, can be performed on the data to make it suitable for parametric analysis.
Descriptive statistics
Descriptive statistics involve one variable at a time. Included in this group are the four statistical methods specified in RCRA regulations: 1) tests of central tendency, used to compare the mean or median of two or more data sets for similarity; 2) tests of trend, to evaluate a significant increase or decrease in a parameter over time; 3) prediction, tolerance, and confidence intervals, used to set acceptable background values for comparison; and 4) control charts. Control charts represent a graphical display of confidence intervals, and will be described in a later section. Descriptive statistics include the mean, median, mode, percentiles, standard deviation, skewness and kurtosis, and tests such as chi-square, the t-test, the F-test, and one-way ANOVA. A good general discussion of univariate statistics can be found in Bulmer (1979).
Mean – Mean is what most people think of when you say “average.” There are three types of means: arithmetic, geometric, and weighted. The arithmetic mean is the sum of the values divided by the number of values, and works best with a normal distribution. The geometric mean is the antilogarithm of the mean of the logarithms of all the values in a set, and works best with a lognormal distribution. A weighted mean includes a step of weighting some values more than others, and then calculating the mean. It is useful in special situations where some values may be more representative than others of the population being sampled.
Median – The median is the value that falls in the middle of a data set sorted in order of increasing value. The median is a non-parametric measure, and works with non-normal data.
Mode – The mode is the value that occurs with the greatest frequency, either as an individual value or as a group of values.
Percentiles – The nth percentile is the value where n percent of the data have this value or less. This is one way of identifying outliers.
Standard deviation – Standard deviation is a common measure of deviation from the mean. First the variance is calculated by summing the squares of each value’s deviation from the mean and dividing by the number of observations. Then the square root of the variance is taken to give the standard deviation.
Skewness – Skewness describes a distribution that is similar to a normal distribution, but one tail of the curve is longer than the other. These distributions can be said to have a right-hand or left-hand tail. Skewness can be calculated by taking the average of the cubed deviations from the mean, a quantity called the third moment about the mean.
Kurtosis – Kurtosis describes the situation where both tails are longer or shorter than those of a normal distribution. It is calculated by averaging the fourth powers of the deviations from the mean.
Chi-square – The chi-square distribution is a distribution with n degrees of freedom. It starts with a single tail to the right, and approaches a normal distribution as the number of degrees of freedom increases, fitting reasonably well at about 16 degrees of freedom. The chi-square test is used to test the hypothesis that elements of one population are members of another population.
t-test – The t-test is used to evaluate whether two sample sets could have come from the same underlying population, within a specified confidence interval. The application of the t-test to small sample sets is called Student’s t-test, after the pen name of the person who first described the use of the test in brewing beer. This test can be used on lognormally distributed data by taking the logarithm of the data and the remedial standard, and doing the comparison based on the transformed data (see the sketch after this list).
F-test – The F-test is used to test whether two samples have different variances.
ANOVA – Analysis of variance (ANOVA) is performed to determine whether the means of two or more populations are equal, based on a comparison of the variation within and between the groups. A one-way analysis of variance addresses one variable, while multi-factor ANOVA can address multiple variables at once.
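As an illustration of the log-transform approach mentioned under the t-test, here is a minimal sketch in Python. It assumes the SciPy package is available, and the data values are invented:

# Compare two sample sets that look lognormal by t-testing the logs
# of the values rather than the values themselves.
import math
from scipy import stats

background = [1.1, 0.9, 1.4, 2.0, 0.7, 1.3, 1.8, 1.0]
compliance = [2.2, 3.5, 1.9, 4.1, 2.8, 3.0, 2.4, 5.2]

log_bg = [math.log(v) for v in background]
log_cw = [math.log(v) for v in compliance]

# Welch's t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(log_cw, log_bg, equal_var=False)

# A small p-value suggests the compliance well differs from background.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")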
Multivariate analyses
Multivariate analysis compares several variables to determine patterns and relationships. Procedures include multi-factor ANOVA (discussed above), linear regression, correlation coefficients, discriminant function analysis, factor analysis and principal components analysis, correspondence analysis, and cluster analysis. Kachigan (1991) provides useful basic information on multivariate statistical tests.
Linear regression – Linear regression is used to find a straight-line fit to a set of data. A least-squares method is used to calculate the slope and Y intercept of the line. A correlation coefficient can then be calculated to quantify how well the data fit the regression line (see the sketch following this list). Curves of higher order can be fitted as well, but must be used with care, as illustrated in Figure 114.
Correlation coefficients – The correlation coefficient describes how well two variables fit a linear relationship. It ranges from +1 for a perfect linear relationship, through 0 for no relationship, to -1 for a perfect negative linear relationship. Obtaining a valid result requires that any relationship be linear, that the samples are random, and that the two variables together have a normal distribution.
Discriminant function analysis – Discriminant function analysis is similar to regression analysis. In regression analysis a weighted combination of values of predictor variables is used to predict the value of a continuously scaled criterion variable. The discriminant function uses a weighted combination of those variables to classify the value into a criterion variable group.
Factor analysis and principal components analysis – Factor analysis is used to identify from among many different factors those that may be related. Principal components analysis is a type of factor analysis that starts by extracting one factor for each variable, and then scoring the factors. Usually the first two or three factors account for most of the variation.
Correspondence analysis – Correspondence analysis is similar to factor analysis. It is a descriptive technique for analyzing simple two-way and multi-way tables to determine the correspondence between the rows and columns.
Cluster analysis – Cluster analysis is used to partition a set of objects into relatively homogeneous subsets based on the clustering of variables, and does not depend on a homogeneous distribution of the objects.
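For the curious, here is a minimal pure-Python sketch of the least-squares fit and correlation coefficient just described; the data values are invented:

# Least-squares linear regression and correlation coefficient.

def linear_regression(xs, ys):
    """Return slope, intercept, and correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
print(linear_regression(xs, ys))   # slope ~1.95, intercept ~0.15, r near +1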
Spatial statistics Spatial statistics, or geostatistics, relates one dependent variable, such as concentration of a particular element or elevation of a formation, to its spatial coordinates. The degree of variation as a function of distance from control points is used to determine weighting factors for points away from the control points. This is the same as the Kriging technique for contouring described in Chapter 22.
OUTLIERS AND COMPARISON WITH LIMITS Outliers The field of statistical quality control has provided tools for comparing individual points to the data set from which they came to identify anomalous points, also called outliers. The importance of outliers in environmental projects is discussed in Sara (1994, p. 11-13). Outliers can result from sampling errors or field contamination; analytical errors or laboratory contamination; recording or transcription errors; poor sample handling, including exceeding holding times; or true variation in the parameter being measured. Once the outliers have been identified, in some cases, such as transcription errors, they can be corrected, which resolves the issue. In other cases, such as a holding time problem, the cause of the variation can be identified but cannot be corrected. Outlier data that does not accurately represent site conditions should be excluded from further analysis, but should be flagged, kept in the database, and reported along with the qualifier flag.
Figure 148 - Control chart concepts (test results plotted against time, showing the mean, upper and lower control limits, and an out-of-control analysis)
An important issue in outlier analysis is to be sure that the comparison population is the same population from which the sample was taken. For example, on typical hydrology projects the concentrations in wells vary significantly based on where each well is located. Upgradient, onsite, sidegradient, and downgradient wells can be expected to have very different concentration distributions. Outlier analysis should be performed for each well based on the correct reference population. A graphical display method that helps in working with outliers is the control chart, which displays the points along with the reference data set. Two types of control charts are Shewhart and cumulative sum.
Control charts
Control charts are used to demonstrate achievement of quality control objectives (see Patnaik, 1997, p. 15). The test results (values) are plotted on the vertical axis against time on the horizontal axis (see Figure 148). The mean (average) of the analyses is drawn, along with the control limits, based on some number of standard deviations (such as 2, 3, or 4.5) above and below the mean. In the figure, a curve has been drawn along the vertical axis showing the normal distribution of the data. This curve is not usually shown, but provides the basis for the determination of the control limits. Once the chart axes and control limits have been drawn, the analyses are plotted on the chart. Analyses that fall outside of the control limits are identified with some distinctive marking, and are then analyzed further to determine the reason they are out of control. If the reason for the out-of-control situation is related to sampling or analysis, the problem can be identified and addressed. If not, then it is appropriate to investigate whether the data point represents some environmental impact, or is the result of statistical variation. Figure 149 shows a control chart for several years of data for one well. In order for a control chart to be valid, the data should be from a single, uncontaminated well that has been monitored over time. At least eight independent samples over at least a one-year period should be used. The samples must be independent, the values must be normally distributed, and the mean and variance must be fixed. In addition to identifying specific out-of-control points, the chart can be used to identify trends and cycles in the data, including seasonal variations and drift with time. Frequent samples help detect time-related changes more quickly, but at a higher cost. Control charts can also be used to determine the relationships between subgroups, or the relationship between two analytical values, such as a chemical constituent and groundwater elevation.
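The core of the control chart calculation is straightforward. Here is a minimal Python sketch that computes limits from a reference data set and flags out-of-control values; the number of standard deviations and the data are illustrative:

# Compute control limits from reference results and flag new analyses
# that fall outside them.
import statistics

def control_limits(reference, n_sd=3.0):
    mean = statistics.mean(reference)
    sd = statistics.stdev(reference)
    return mean - n_sd * sd, mean + n_sd * sd

reference = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7]   # illustrative history
lower, upper = control_limits(reference)

new_results = [5.0, 5.4, 7.9, 4.9]
for value in new_results:
    flag = "OUT OF CONTROL" if not lower <= value <= upper else "ok"
    print(value, flag)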
2 + 2 = 5 for moderately large values of 2. Rich (1996)
Figure 149 - Shewhart control chart showing limits and outliers
SHEWHART Walter A. Shewhart at Bell Laboratories developed Shewhart control charts for quality control in manufacturing in the 1920s and published the idea in 1931 (Shewhart, 1931). (In the literature the name can be found spelled both Shewart and Shewhart.) Shewhart control charts plot individual values (or averaged groups) as a function of time. Shewhart charts are preferable when there are fewer results, collected on an intermittent basis. A chart with Shewhart control limits is shown in Figure 149.
CUMULATIVE SUM With cumulative sum (CuSum) control charts, the values are plotted as cumulative sums as a function of groups, rather than individual values. CuSum charts are best when there are many results over regular time intervals. The EPA (1989) has described a technique that combines Shewhart and CuSum charts to get the best benefits of both.
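Here is a minimal sketch of one common (one-sided, tabular) form of the CuSum calculation in Python; the target, allowance k, and decision limit h are illustrative, and in practice would be derived from background statistics:

# Accumulate deviations above a target (less an allowance k) and signal
# when the running sum exceeds a decision limit h.

def cusum_high(values, target, k, h):
    s = 0.0
    flags = []
    for x in values:
        s = max(0.0, s + (x - target - k))
        flags.append(s > h)
    return flags

values = [5.0, 5.2, 5.6, 5.9, 6.1, 6.3]
print(cusum_high(values, target=5.0, k=0.1, h=1.0))
# A run of small upward shifts eventually trips the chart.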
Regulatory limits Regulatory limits are a significant part of environmental data analysis. The limits can be general, such as federal drinking water standards, or specific to a site as a result of a record of decision or consent decree. Either way, a comparison between analyzed values and the standard can be a key part of the decision process. The database should allow you to store and work with regulatory limits as part of the retrieval process, and should allow for comparison with and display of regulatory limits. Chapter 19 shows example screens for entering and reporting regulatory limits.
TOXICOLOGY AND RISK ASSESSMENT Much of the focus of environmental data management, and of environmental statistics, is on protecting public health. Two major components of this are toxicology and risk assessment. These two topics will be discussed here briefly, and the references contain much more information.
Toxicology
Toxicology is the study of the adverse effects of chemicals and other influences on living organisms. Many of the substances tracked in an EDMS are toxic, and the concentrations of these substances found in the environment can have a significant impact on human, animal, and plant health. Some information on health risks associated with various parameters can be found in Appendix D. Inorganic compounds, especially heavy metals, as well as many organic compounds can be toxic, and their toxicity varies from causing acute illness and death to requiring long-term exposure to cause damage. Additional information on toxicology can be found in Kamrin (1988), Manahan (2000), Extoxnet (2001), and NLM (2001).
Toxic effects can show up in several ways. Organ toxicity covers a range of effects on organ systems, such as the heart, lungs, kidneys, and liver. In organ toxicity, the severity of the effect is usually related to the amount of exposure. Carcinogenesis is where the result of toxic exposure is the growth of cancer cells. With carcinogenesis, the chance of contracting cancer is related to the severity and frequency of exposure, but the severity of the cancer usually is not. Teratogenesis is the formation of birth defects. Some defects are so severe as to cause death of the offspring before birth. Reproductive toxicity involves the effects of toxic substances on reproductive capacity, including decreases in fertility, decreases in the number of conceptions leading to live birth, and reduced birth weight or size. Mutagenesis is the formation of mutations or changes in genetic material, either in the affected individual or in future generations.
Obviously all of these are severe problems, and the cost of investigation, cleanup, and monitoring is often justified by the possibility of reducing these effects. Unfortunately, assessing the toxicity of substances is very difficult. The temptation is to err on the side of safety and set very low target limits where the data is inconclusive. Unfortunately, this significantly increases the cost of cleanup. There is no easy answer, although a risk-based approach can help.
Risk assessment
Risk assessment is the characterization of potential adverse health effects of human exposure to environmental hazards. The extent to which a group of people has been or may be exposed to a certain chemical is determined, and the extent of exposure is then considered in relation to the hazard posed by the exposure. The important point is that rather than setting limits based on some assumed level at which harm may occur, the pathways, assessed exposures, and dose-response information are taken into consideration on a project-specific basis to come up with realistic cleanup targets. A good introduction to risk assessment and much information on chemicals and their environmental risk can be found in EPA (2001d). Probably the most difficult aspect of public involvement in environmental issues is in attitudes about risk. There is a story that the devil approached mankind with an offer. He would give us something that would allow us to expand our personal world tremendously, increase our individual freedom, allow us to travel anywhere at any time, live and work where we want, and generally live a better life. The price for this would be the death of 50,000 people a year. Presented this way, most people would reject this offer. But this is the story of the automobile. Most benefits come with some risk, and the benefits of our industrialized society come with many. People don’t agree on what level of risk is acceptable for what level of reward. Some feel that the only acceptable level of risk is zero. Another (supposedly true) story is about a woman in Pennsylvania who was afraid that something bad would happen to her if she went outside, so she decided not to leave her bed. A boulder rolled down the hill, through the wall, and killed her while “safe in bed.” For some time there has been a movement in the environmental industry toward risk-based corrective action, where the benefits of actions are weighed against their reduction in risk toward people or the environment. We have a long way to go before the general public understands and approves of acceptable levels of risk.
CHAPTER 24 INTEGRATION WITH OTHER PROGRAMS
A challenging issue is making the applications that need the data (such as statistical analysis or GIS) talk to the database where the data is stored. This can be done either by moving data between applications using intermediate files, or by directly connecting the applications. Integration between the database and other applications is becoming increasingly important and, fortunately, is becoming much easier to accomplish due to new features in the software on both sides of the conversation.
EXPORT-IMPORT Figure 150 shows the two approaches to integration between the database and applications that need the data: export-import and direct connection. Each has advantages and disadvantages, which are discussed below. In situations where data will be moved between applications or locations using the export-import approach, there are a variety of data access methods and file standards. Older formats like ASCII tab-delimited files, spreadsheets, and dBase files are giving way to new standards like XML (eXtensible Markup Language).
Figure 150 - Connection methods (the database can feed an application directly, or through export-import via an intermediate file)
Well  Elev  X     Y    SampDate  Sampler  Param  Value  Flag
B-1   725   1050  681  2/3/96    JLG      As     .05    not det
B-1   725   1050  681  2/3/96    JLG      pH     6.8
B-1   725   1050  681  5/8/96    DWR      As     .05    not det
B-1   725   1050  681  5/8/96    DWR      Cl     .05    not det
B-1   725   1050  681  5/8/96    DWR      pH     6.7
B-2   706   342   880  11/4/95   JAM      As     3.7    detected
B-2   706   342   880  11/4/95   JAM      Cl     9.1    detected
B-2   706   342   880  11/4/95   JAM      pH     5.2
B-2   706   342   880  2/3/96    JLG      As     2.1    detected
B-2   706   342   880  2/3/96    JLG      Cl     8.4    detected
B-2   706   342   880  2/3/96    JLG      pH     5.3
B-2   706   342   880  5/8/96    DWR      As     1.4    detected
B-2   706   342   880  5/8/96    DWR      Cl     7.2    detected
B-2   706   342   880  5/8/96    DWR      pH     5.8
Figure 151 - ASCII file of environmental data
Stations
Well  Elev  X     Y
B-1   725   1050  681
B-2   706   342   880
B-3   714   785   1101
Samples
Well  SampDate  Sampler
B-1   2/3/96    JLG
B-1   5/8/96    DWR
B-2   11/4/95   JAM
B-2   2/3/96    JLG
B-2   5/8/96    DWR
B-3   2/3/96    JLG
B-3   5/8/96    CRS
Analyses
Well  SampDate  Param  Value  Flag
B-1   2/3/96    As     .05    not det
B-1   2/3/96    pH     6.8
B-1   5/8/96    As     .05    not det
B-1   5/8/96    Cl     .05    not det
B-1   5/8/96    pH     6.7
B-2   11/4/95   As     3.7    detected
B-2   11/4/95   Cl     9.1    detected
B-2   11/4/95   pH     5.2
B-2   2/3/96    As     2.1    detected
B-2   2/3/96    Cl     8.4    detected
B-2   2/3/96    pH     5.3
B-2   5/8/96    As     1.4    detected
B-2   5/8/96    Cl     7.2    detected
B-2   5/8/96    pH     5.8
B-3   2/3/96    As     .05    not det
B-3   2/3/96    pH     8.1
B-3   5/8/96    As     .05    not det
B-3   5/8/96    Cl     .05    not det
B-3   5/8/96    pH     7.9
Figure 152 - Normalized environmental data
In an ASCII (American Standard Code for Information Interchange) file, the data is represented with no formatting information (Figure 151). In the case of transferring laboratory data, the usual file structure has the disadvantage that the data is de-normalized, that is, the hierarchical (parent-child) relationships are not represented in the file. Breaking the data into separate tables to represent this structure is called “normalization” (Figure 152).
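Normalization itself is mechanical once the keys are identified. The following Python sketch splits flat rows like those in Figure 151 into the three tables of Figure 152; the field names follow the figures, and the data is abbreviated:

# Split de-normalized rows into Stations, Samples, and Analyses tables,
# keyed by well and by well/sample date.

flat_rows = [
    {"Well": "B-1", "Elev": 725, "X": 1050, "Y": 681,
     "SampDate": "2/3/96", "Sampler": "JLG",
     "Param": "As", "Value": 0.05, "Flag": "not det"},
    {"Well": "B-1", "Elev": 725, "X": 1050, "Y": 681,
     "SampDate": "2/3/96", "Sampler": "JLG",
     "Param": "pH", "Value": 6.8, "Flag": ""},
]

stations, samples, analyses = {}, {}, []
for row in flat_rows:
    # One station record per well.
    stations[row["Well"]] = (row["Elev"], row["X"], row["Y"])
    # One sample record per well/date pair.
    samples[(row["Well"], row["SampDate"])] = row["Sampler"]
    # One analysis record per result, keyed back to its sample.
    analyses.append((row["Well"], row["SampDate"], row["Param"],
                     row["Value"], row["Flag"]))

print(len(stations), len(samples), len(analyses))   # 1 station, 1 sample, 2 analyses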
Figure 153 - XML file of environmental data
The advantage of a newer format like XML is that it can communicate both the data itself and the hierarchical structure of the data in one file (Figure 153). In this file, “tags” are used to define data elements in the file, and the positioning of the tags and the data define the hierarchy. The XML tags are added automatically by the software when it generates the XML format. Using an advanced format like XML allows the data to be rendered (displayed) in a more efficient and understandable way (Figure 154). This rendering can be done in a flexible way using style sheets, which define how each data element in the file is to be displayed. Style sheets used to render XML data use a language called XSL (eXtensible Stylesheet Language). There are many benefits to separation of the data from how it is displayed, allowing the data to be displayed in different ways in different situations. For example, the display requirements are different for a browser running on a computer vs. a Web-enabled portable phone.
Figure 154 - Rendered XML file of environmental data
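As a sketch of how such a file might be produced, the following Python fragment uses the standard xml.etree.ElementTree module to nest a station, sample, and analysis; the tag names are illustrative, not a published transfer standard:

# Write normalized data as hierarchical XML, in the spirit of Figure 153.
import xml.etree.ElementTree as ET

station = ET.Element("Station", Well="B-1", Elev="725", X="1050", Y="681")
sample = ET.SubElement(station, "Sample", SampDate="2/3/96", Sampler="JLG")
analysis = ET.SubElement(sample, "Analysis", Param="As", Flag="not det")
analysis.text = "0.05"

# The nesting of the tags carries the parent-child (hierarchical)
# structure that a flat ASCII file cannot express.
print(ET.tostring(station, encoding="unicode"))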
The “extensible” part of XML and XSL is important. Because the language is extensible, features can be added to handle specific data requirements. For example, it is possible to add extensions to XML to handle the spatial aspects of the data. GML (Geography Markup Language) is an example of this approach. GML schemas (data structures) define how geographic information is put into XML. The GML tags describe content such as the geographic coordinates and properties of a map element. A style sheet can then define the appearance of the map elements. The separation of the data from how it is displayed, in this case, might allow different scale-dependent displays depending on the resolution of the output device. XML is also starting to be used as a data management storage format in place of relational and other designs. This usage of XML is described in Chapter 2.
Two other common intermediate formats are spreadsheet files and database files. A benefit of these formats is that most programs can read and write one or more of them. In both cases, it is necessary that the application creating the file and the one accessing it agree on the program format and version of the file. Spreadsheet files, nowadays usually meaning Microsoft Excel, have the advantage that they can easily be edited. Since spreadsheet files are usually loaded into memory for processing, they usually have file size limits that can be a significant problem. For example, Excel 97 has a limit of 65,536 rows, which, for an investigation of a site with 50 organic constituents and 100 wells, would limit the file to 13 quarters of data. This capacity problem severely limits spreadsheet storage of environmental data. Database files don’t have the file size limitation of spreadsheets. The two most common formats are dBase (and the very similar FoxPro format) and Microsoft Access. dBase is an old format and may not be supported by applications in the future. Access is a modern, flexible, and widely supported format, but the file structure changes every two or three years, so versions can be a problem.
DIGITAL OUTPUT Data export from the database can be provided in two ways. The result of a selection using the QBF screen can be sent to a window from which data can be copied to the clipboard, saved in a variety of file formats, or printed. Alternatively, if certain applications require a specific or specialized format, exports in that format can be provided, also based on the QBF screen. Output in a specialized format often must be addressed on a case-by-case basis. A specialized type of export that may be a part of the EDMS is the creation of a database subset. Creation of a subset starts with selection of data using the standard selection screen. The data selected is then written to an Access .mdb file, which can be sent to a remote location or placed on a portable computer. The EDMS software should provide the capability to attach to one of these subset files instead of the main database, so the analytical capabilities of the software can be used on the subset. Only a specific group or person, such as the Database Administrator, should be able to create subsets, so that the location and distribution of the subsets can be controlled. This is important since the subsets can become “stale” when data in the main database changes.
EXPORT-IMPORT ADVANTAGES AND DISADVANTAGES All export-import approaches have inherent advantages and disadvantages. The primary advantage is in a situation where there is a clear “hand-off” of the data, such as when a laboratory delivers data to the client. Once the laboratory delivers the data, the responsibility for managing the data rests with the client, and any connection back to the LIMS (Laboratory Information Management System) that created the electronic deliverable would be inappropriate. Transfer of the intermediate file breaks this connection, enforcing the hand-off. Another advantage is where the data must be made available even though no direct connection is available, such as at a remote location.
Figure 155 - ODBC data source configuration
There are several disadvantages to the export-import model. One is that it is necessary to define formats that are common to the programs on either side of the process, which can be difficult to do initially and to maintain. A more severe disadvantage is the proliferation of multiple copies of the data, which wastes space and can present a management problem. The biggest disadvantage, however, is lack of concurrency. Once the export has been performed, the data can become “stale.” If the data in the original location is corrected, that correction is not automatically reflected in the intermediate file. The same is true of updates. With the export-import model, great care must be taken to minimize the chance of using old and/or incorrect data.
DIRECT CONNECTION In many cases, direct connection through ODBC (Open DataBase Connectivity) or other protocols can eliminate the step of exporting and importing. These tools can help facilitate movement of data between the database and applications, such as a GIS, that can use the data. ODBC is the most widely used method at this time for connecting applications to data, especially in the client-server setting. There are several parts to setting up ODBC communication. For a client-server system (as opposed to a stand-alone system, where ODBC can also be used), the first step is to set up the database on the server. Typically this is a powerful back-end database program such as Oracle or SQL Server. The next step is to set up the ODBC connection on the client computer using the ODBC Data Source Administrator (Figure 155), which is part of Windows. Then the application needs to be configured to use the ODBC connection. One example (Figure 156) shows a database client being connected to a server database in Microsoft SQL Server. This screen also shows the option to connect to an Access database. Another example (Figure 157) shows a GIS program being connected to a database through ODBC. In this example, the GIS program (Enviro Spase) understands the normalized environmental data model, and once the attachments are made, the SQL language in the GIS can join the tables to use the data. The Feature_Lines and Feature_Text tables, which contain base map data, will be local to the GIS, while Sites, Stations, Samples, and Analyses are having their attachment changed from the EnvDHyb1 to the MyNuData data source.
Figure 156 - Database attachment screen
Figure 157 - Enviro Spase GIS attachment screen
Figure 158 - ArcView GIS attachment screen
In another example (Figure 158) a different GIS program, ArcView, is being attached to tables via an ODBC data source. Here the attachment will result in a denormalized table, which will then be used for display.
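From a scripting language, the same direct connection idea looks like the following sketch, which assumes the Python pyodbc package and an already-configured data source; the DSN, table, and column names are hypothetical:

# Query live data through ODBC; no intermediate file to go stale.
import pyodbc

conn = pyodbc.connect("DSN=MyNuData")   # data source set up in the ODBC Administrator
cursor = conn.cursor()

cursor.execute(
    "SELECT Well, SampDate, Param, Value FROM Analyses WHERE Param = ?", "As")
for row in cursor.fetchall():
    print(row.Well, row.SampDate, row.Param, row.Value)

conn.close()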
Other connection protocols are also available, and appropriate for some situations. COM (Component Object Model), DCOM (Distributed Component Object Model), CORBA (Common Object Request Broker Architecture), EJB (Enterprise Java Beans), and more recently SOAP (Simple Object Access Protocol) are usually used with Internet data communication. COM, DCOM, and SOAP are popular for use with Microsoft products, and CORBA and EJB are popular with the anti-Microsoft camp, especially with advocates of Java (a programming language from Sun Microsystems) and Linux (an open source variant of the UNIX operating system). A new connectivity protocol, Microsoft .NET, will be the communication standard for future versions of Windows, and is now starting to be used in some applications. A direct connection method specific to geographic data is GTTP (Geographic Text Transfer Protocol). This is a transfer protocol at the same level as HTTP (HyperText Transfer Protocol), the protocol on which the bulk of the World Wide Web is based. GTTP is a protocol to request and receive geographic data across the Web. In order for it to work, there must be an application running on the Web server that can accept GTTP requests and then send out the requested geographic data. A thin client such as a Java applet running in a browser would send a request, receive the data, and create the map display. In the implementation of this protocol by Global Geomatics, the server application can combine raster and vector map data, convert it to a common projection, and send it to the client application for display.
DATA WAREHOUSING AND DATA MINING As databases grow, they contain much valuable information, but this information can be hard to assimilate because of the sheer volume of data. It’s kind of the opposite of “You can’t see the forest for the trees”: you can’t easily see either the trends or the data details. Technology has been developed to help with this, and it goes by various names. A data warehouse is a centralized database with all of the data of a particular type for an organization, organized for analysis. Data from the data warehouse can be analyzed using data mining, also known as online analytical processing (OLAP), which involves using specialized tools to look at the data in various ways. The tools for doing this are becoming increasingly popular due to their application in electronic commerce (Ross, 2001), but could have application in the environmental industry as well. Data warehouses can be either static, where the data is extracted from other enterprise databases on a regular basis, or dynamic, where the data in the warehouse is linked directly with the primary databases. A popular way of doing data mining is using a data cube. This is a specialized database structure, often stored in the data warehouse, where the data is organized in multiple dimensions, and then data mining software is used to look at different combinations of dimensions. For example, you could use this software to organize the data by station, matrix, date, analytical method, and parameter, and then combine these dimensions into various displays. A related concept is a data mart, where data is pre-extracted for a particular use. An example would be to extract just the data needed for spatial analysis in a GIS, and store it in a file separate from the main database. That data mart would then be used for your analysis. The advantage of this is the convenience of working with a smaller data set. The disadvantages include the various problems with copies of the data described above. Data marts and similar approaches are used regularly for environmental projects. Data warehousing and data mining are widely used in manufacturing, retailing, and similar applications, but their use in the environmental industry is not widespread, and specialized tools for performing this type of analysis, if available at all, are not well established at the present time.
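A toy version of the data cube idea can be sketched in a few lines of Python; real OLAP tools do this at much larger scale and with more dimensions:

# Organize results by station and parameter, then collapse one
# dimension at a time to summarize.
from collections import defaultdict

results = [
    ("B-1", "As", 0.05), ("B-1", "Cl", 0.05),
    ("B-2", "As", 3.7), ("B-2", "Cl", 9.1), ("B-2", "As", 2.1),
]

cube = defaultdict(list)
for station, param, value in results:
    cube[(station, param)].append(value)

# "Slice" the cube: average by station/parameter combination.
for (station, param), values in sorted(cube.items()):
    print(station, param, sum(values) / len(values))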
DATA INTEGRATION A number of issues must be considered in the integration of spatial and non-spatial data, including map projections, coordinate systems, and map scales. Environmental projects can be particularly troublesome in this area, since for some sites some of the data might be in a local coordinate system, some in a global Cartesian system in one or more projections and scales, and some in latitude-longitude. In some cases it is necessary to maintain three separate coordinate systems, yet still be able to display all of the data on one map. Also, the accuracy and precision of the spatial data used to display the analytical data in its geographic context is every bit as important as the accuracy and precision of the analytical data being displayed. In Figure 159, the user has chosen a station (well) for which the site coordinates are known and selected the location of the well from the map, and the software has calculated the offsets between real-world XY (state plane) and site coordinates. Tying non-spatial data to map locations can be a difficult process on many projects, and items like defining offsets for posting can depend on both the spatial and non-spatial data.
Figure 159 - Multiple coordinate system
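For the simplest case, where the site grid differs from the real-world system only by a translation, the offset calculation looks like the following Python sketch (a rotated site grid would also require a rotation angle; all numbers are invented):

# Derive the translation between site and state plane coordinates from one
# point known in both systems, then apply it to other points.

def offsets(world_xy, site_xy):
    return world_xy[0] - site_xy[0], world_xy[1] - site_xy[1]

def site_to_world(site_xy, dx, dy):
    return site_xy[0] + dx, site_xy[1] + dy

# Well with site coordinates (1050, 681) picked from the map at
# state plane (2103050, 440681) - illustrative numbers.
dx, dy = offsets((2103050, 440681), (1050, 681))
print(site_to_world((342, 880), dx, dy))   # another well, now in state plane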
PART SIX - PROBLEMS, BENEFITS, AND SUCCESSES
CHAPTER 25 AVOIDING PROBLEMS
Implementing an EDMS can be either a positive or a negative experience, depending on how the project goes. When it goes well, it can be very rewarding. When it goes poorly, it can be very frustrating (at best). More often than not, the result is positive, and there are specific things that can be done to increase the chance of a positive outcome. If you really think through the issues that were discussed in Part Two of this book, you will be well along to anticipating and avoiding problems. This chapter discusses some of the possible pitfalls in implementing an EDMS.
MANAGE EXPECTATIONS A very important aspect of implementing an EDMS is to make sure that the people who are designing the solution and those who will be using it are talking to each other. This is true whether the solution is being bought or built, and it goes both ways. The implementers of the system must be aware of the needs of the users so that they can satisfy as many of those needs as possible. The users must be aware of what is actually being implemented so they are not surprised when it arrives on their desk.
Understand the real needs If you ask people what tools they need to do their work, the results can sometimes be surprising. One time a senior vice president of the environmental group at a major company was asked what would make his job easier. He didn’t hesitate for a second before he replied “retirement.” Obviously nothing that could be done with the software would help with that. When you are trying to define functionality, be sure to focus on tools and processes that will make a significant contribution to improving the job performance of the people using the system. Statements that start with “It would be cool if …” should be examined critically. Sometimes they contain real needs. Sometimes they don’t. And sometimes it’s hard to tell the difference. On the other hand, a statement that starts with “My regulators require …” probably contains an important requirement in order for the EDMS to be successful.
Don’t promise the world When you’re in the early stages of implementing an EDMS (or any system for that matter) it is easy for the developer to respond to every request with “Yeah, it’ll do that.” Those statements
usually come back to haunt you. Be sure that when you say the software will do something, you really understand what it involves to provide the functionality being requested, how that functionality fits with the design plan, and whether resources can be provided to fit that need. Also, don’t forget that with software users (as with kids) “no” means “maybe” and “maybe” means “yes.”
Plan adequate time and budget The dismal success record of software implementation projects was discussed in Chapter 8, as were some thoughts on estimating a budget for an implementation project. Nearly any project can be completed given enough time and money. The trick is for the developer to implement the greatest amount of the most useful functionality within the schedule and budget available. Experience has shown that nearly everything takes longer and costs more than was originally anticipated. Experienced project managers have a habit of doubling everything. If this is factored into the planning process, rather than dealt with as it comes up, success is much more likely.
Manage risk Given that the project has a finite, and in fact relatively large, chance of failure, it is important to plan for the risks that might derail the project. You should ask questions like: What happens if a key developer becomes unavailable? What if management changes during the project? What happens if the budget is cut part way through the project? Or, in the worst case, what happens if the project fails completely? Good project managers have the ability to react to change, and can often salvage something, or even pull the whole thing off, when things go poorly. Be sure to have a backup plan for the most likely contingencies. Any but the smallest projects should have a formal Risk Management Plan. This type of thinking is uncomfortable for many people, but really does pay off when things don’t go as planned.
USE THE RIGHT TOOL There is a tendency to use the tools that you have and are familiar with. This is particularly true with software. If people know how to use a spreadsheet or word processor, they will want to use that tool to manage their data, even if it is not appropriate. And even if you choose the right tool, the size you choose must fit the task to be accomplished.
Avoid overkill At first glance it would seem that choosing a more powerful database system than you need might not be too bad. It seems that you can just use what you need, and not worry about the rest. This is not always the case. Overkill in database software hurts you in several ways. First, it can require the expenditure of resources beyond what is necessary, putting a strain on other parts of the organization or project. Second, big software can have a big learning curve, and it may be a challenge to get people up to speed on using it. Finally, and this is the worst, if the software is too big and complicated, people won’t use it, and will go back to their spreadsheets. A common case of overkill in database projects is in choosing the data repository. There is a tendency to use a big server database system even for small databases just because it is there, or because “It is the company standard.” Don’t fall into this trap. Using too big a hammer isn’t good for the nail, or for the fingers holding it. If the database will be small, be smart and use the appropriately sized software to manage it.
Also avoid underkill The problems of underkill are more obvious, but no less serious, than those of overkill. It is a very frustrating feeling to work hard to get a database set up, overcome the learning curve, perhaps put a lot of effort into entering and/or formatting data, and then find out that the software won’t do something that you need. Often the time spent organizing data will not be lost if you have to make a transition to a more powerful system, but you will probably have to repeat the time (and cost) to get up to speed on the new system. Sometimes the path is not too hard, such as in moving from Access to SQL Server, but other times it can be painful. It’s better to plan ahead and take your long-term needs into consideration before you select software.
Buy vs. build This issue was discussed in Chapter 8, but is worth bringing up again here. There is a tendency to feel that your project is so unique or complicated, or your skills so advanced, that the only answer is a custom solution. Sometimes this is the case, and sometimes it is not. Usually, it is much less expensive to buy rather than to build, as long as what you get fits your needs. Put your personal preferences aside and think hard about what is best for the project, given time and budget limitations and technical requirements, and make your decision based on those factors.
PREPARE FOR PROBLEMS WITH THE DATA Getting the software up and running is the easy part. Finding, organizing, and entering data, and keeping it flowing on an ongoing basis, is the bigger challenge. There will almost always be problems with the data, ranging from minor to severe. It takes skill, patience, and perseverance, and often help from others, to be successful.
Where is the data? The first problem is finding the data in the first place. Whether you are dealing with hard copy or digital data, locating disks or documents can be a challenge. Particularly on projects where personnel have changed over time, even finding someone who knows what data there is (or was) can be difficult. The amount of effort spent on locating and organizing data should be commensurate with the value of having the data in digital form.
Structure problems Once you have found the data, you will need to find out how it is organized. With hard copy data this usually isn’t too hard because you can look at it and (hopefully) figure it out, but if people are entering the data, changes to formats over time can be a problem. With digital data it can be even worse, since it’s not unusual for the format of digital data, even from the same source, to change often. It takes a patient person who is knowledgeable about the data to figure out and accommodate changes to the data structure prior to importing it.
Content problems Once you know the structure of the reports or files, you must be sure that you understand the data they contain. Different data creators may have different ideas about how to report data. For example, when a chemical value is less than the instrument detection limit, it can be reported as zero, the detection limit, or half the detection limit. If the detection limit is included in the file, you can figure it out, but it is important to really look closely at your data, and be sure you understand what is there, before you try to use it.
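As an illustration, a screening query along these lines (a sketch in generic SQL, borrowing the Value, Detect, and FlagCode field names from the Analyses table in Appendix B; Access would use an IIf expression instead of CASE) can standardize non-detects to half the detection limit once you know how the lab reported them:

-- Report non-detects (qualifier 'U') at half the detection limit,
-- and detected results at their reported value.
SELECT SampleNumber,
       ParameterNumber,
       CASE WHEN FlagCode = 'U' THEN Detect / 2
            ELSE Value
       END AS ScreeningValue
FROM Analyses;

Whether half the detection limit (or zero, or the full limit) is the right substitution is a project decision, which is exactly why the reporting convention needs to be understood first.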
Delivery problems When you are working with data on an ongoing basis, you want to make sure that the process you have in place for getting the data in a timely fashion is working well. Often the time between a sampling event and the due date for a report can be short enough that any problems in data delivery, import, checking, or reporting can stress the deadline. Be sure to plan ahead and know where your data is and when reports are due so you can minimize timing problems.
PLAN PROJECT ADMINISTRATION The final pitfall to be avoided is inadequate planning for the administration of the project. You should determine up-front who will be in charge of managing the project, who will be responsible for implementing it, and who will run the system after it is operational. Interfaces with outside organizations, such as labs, the IT department, and any other affected personnel, should be mapped out and discussed in detail at the start. Finally, the project must be supported by a realistic schedule and an adequate budget.
INCREASING THE CHANCE OF A POSITIVE OUTCOME If we were to boil this section down to one thing, it would be communication. If you involve all of the people affected by the system from the beginning, and keep them involved throughout the process, the chance of success is greatly increased. Plan and design thoroughly, so the software developer or vendor and the users are always on the same page, and keep the communication up during implementation. Also, be sure to take advantage of peripheral benefits, like the improved communication that can occur in other areas once people start talking about the database. If you do all of the above things well, your EDMS implementation project has an excellent chance of success, and you can reap the benefits of a job well done. Implementing a database system can have a tremendous positive impact on a project and an organization. It's worth the effort on everyone's part to make it successful.
CHAPTER 26 SUCCESS STORIES
Don’t let the previous chapter scare you. Most data management systems, if implemented carefully, are judged a success by those using them, once they are up and running. This section will provide some examples of the benefits of an organized data management system, as recognized by project management personnel. In the following examples, the problems were successfully overcome (or even better, anticipated and avoided) so that everyone was satisfied with the outcome.
FINANCIAL BENEFITS Feng and Easley (1999) provide some examples of benefits that they saw on an environmental management information system they implemented. These included upgrading their system to get off their mainframe, reducing costs by streamlining workflow and reducing compliance costs, improving compliance by handling their data better, and anticipating the future by preparing for future regulations. Often the justification for implementing a data management system is primarily financial. Financial benefits that can be expected by implementing a data management system fall into several areas. These include reduced overhead costs, increased project efficiency, replacing older systems, reduced project operating costs, and increased revenue.
Reduced overhead costs Lower costs can be achieved both on the data management component of projects, as well as by using the data management system to improve other areas of the project. A good example of decreasing overhead costs occurs when the data management work can be transferred to a less expensive employee after implementation of an easy-to-use data management system. For example, one company was able to transfer much of the data management activities for a complex project from a high-priced project manager to less expensive technical and clerical staff members. This resulted in average savings of $25 per hour on about 40 hours per month, or $12,000 per year on this one project alone.
Increased efficiency One of the most obvious areas of financial benefit of an automation project is increased efficiency. One Enviro Data user reported that its time to process electronic deliverables from its laboratories decreased from 30 minutes to 5 minutes per file after it implemented and enforced a data transfer standard and a closed-loop reference file system. This implementation helped the laboratory to deliver clean data. Since the data administrator was handling about 300 files a year, this translates to 125 hours per year saved, for cost savings of almost $5,000 per year just for that one task. Additional savings were realized in increased efficiency in selecting and reporting data.
Replacing expensive older systems Financial benefits from replacing expensive in-house systems (sometimes called “legacy systems”) can result in both software and hardware savings. The data validation manager for a national consulting company was using the company’s in-house database that could no longer be supported, but still needed to perform the same functions. The DOS-based database was old and too difficult and expensive to maintain. The company needed a commercial system that would allow it to do data validation without the cost of maintaining the old system. The company purchased an off-the-shelf solution that was then customized slightly for its specific project needs. The company now has a better tool for doing its work for less money than maintaining the old system. Very significant cost savings can be realized when the data management system can be moved from expensive mainframe hardware to less expensive personal computers and servers. Maintaining a mainframe can cost several thousand dollars a month, compared to several hundred dollars a month to maintain a PC-based database server. It doesn’t take too many months of this to pay for a system upgrade, and then after that the cost savings go straight to the bottom line.
Reducing project operating costs An EDMS can help lower project operating costs in areas other than data management. One large industrial company with many facilities routinely uses its EDMS to review groundwater monitoring wells to identify ones where concentrations are consistently below regulatory limits. With a database of several thousand wells, the company is able to identify at least two wells per quarter that can safely be monitored less often. Each well that can be sampled annually instead of quarterly saves about $3,000, and the database provides the documentation to take its case to the regulators. If the company is successful on half of the requests, it can save $12,000 per year for the four wells, and these savings are cumulative from year to year.
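A hedged sketch of such a review query, in generic SQL using the Stations, Samples, Analyses, and RegLimits tables from the data model in Appendix B (units are assumed consistent, and matrix matching, date filtering, and superseded results are omitted for brevity):

-- Station/parameter combinations whose maximum result stayed
-- below the regulatory limit, making them candidates for less
-- frequent monitoring.
SELECT st.StationName,
       an.ParameterNumber,
       MAX(an.Value)    AS MaxValue,
       MIN(rl.RegLimit) AS LimitValue
FROM Stations AS st
JOIN Samples  AS sa ON sa.StationNumber = st.StationNumber
JOIN Analyses AS an ON an.SampleNumber = sa.SampleNumber
JOIN RegLimits AS rl ON rl.ParameterNumber = an.ParameterNumber
GROUP BY st.StationName, an.ParameterNumber
HAVING MAX(an.Value) < MIN(rl.RegLimit);

The database does not make the regulatory case by itself, but it provides the documentation to support it.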
Increased revenue An environmental engineering company needed to provide a data management system for one of its clients, a pipeline company. The client needed to manage its historical and ongoing data in a standardized way because it was being pressured by its regulators (EPA) to provide more comprehensive reporting of the trends in contamination levels. The engineering company wanted to make money by solving the client’s problem, and to generate a satisfied customer who would come back for more services. By showing its proficiency in using an efficient EDMS, the company landed a $300,000 data management task. Only 20% of that actually went into data management costs, resulting in increased revenue of $240,000 for that project.
One environmental lawyer in a small town can starve to death, but two can make a good living.
Rich (1996)
Taking the financial savings from three of the examples above and adding them up, the total is $27,000 per year. This means that every month that the implementation of a data management system is delayed, $2,250 is lost. Another way of looking at it is that if implementing the system takes $75,000 for software, training, data conversion, etc., then the time to pay out the investment is 33 months, for a return on investment (annualized IRR) of 23% over 5 years. This would be considered a good use of funds in most organizations. To this can be added the technical and intangible benefits described in the next two sections.
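Spelled out, the arithmetic behind these figures is:

\[ \text{payback} = \frac{\$75{,}000}{\$2{,}250/\text{month}} \approx 33\ \text{months} \]

and the quoted return is the rate $r$ at which five years of $27,000 annual savings balance the initial investment:

\[ 75{,}000 = 27{,}000 \cdot \frac{1-(1+r)^{-5}}{r} \quad\Rightarrow\quad r \approx 23\% \]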
TECHNICAL BENEFITS While the dollars usually drive the purchase decision, the technical gains are often the greatest benefit of an EDMS implementation project. Building a comprehensive, centralized, open database can generate improved technical results in a variety of ways. The biggest technical benefit is the improved quality that results from removal of database fragmentation. People are always using the best data available, not an outdated data set, or one that was thrown together to answer one question, and then used later to answer a different question, which might really have different data requirements. Related to this is improved communication on the project, because everyone is looking at the same data. This results in increased confidence in the data and in the decision-making process for the project. The impact of these technical benefits on those outside the project, such as upper management and especially regulators, can be significant. If these people develop confidence that the project team is staying on top of issues at the site, the result can be less scrutiny, and consequently less aggravation, for the project team. If the team is finding and reliably dealing with issues as they come up, the project goes more smoothly for everyone. Increased efficiency was discussed above from a financial perspective, but can also be viewed from a technical perspective. If the project team spends less time on repetitive activities like cleaning up data and moving it around, the members will have more time to spend really working with the data, which will result in a better understanding of the site and better management of site issues.
Real access to data People like to point out that the difference between data and information is whether you can use it or not. A geologist that Geotech worked with had a large amount of base map and well data that he had obtained from a mainframe system when the management of the project changed, and the mainframe became unavailable. He had the well data in thick printed reports and in digital files from a mainframe. For a while he worked with the paper copies, but became frustrated with how hard it was to find things. Even though he was not very computer literate, and had never used a database before, he knew exactly what he wanted, which was to be able to select the data in specific ways, and to post the results on a map. Based on his very specific instructions, it was easy to build a data management system to satisfy his needs. The mainframe data was imported and cleaned up, the base map data was reformatted for his mapping program, and he was trained on how to select data and place it on a map. His frustration went away, because now he could easily use the large amount of data that he had. A mining company had a problem with a relatively small data set, but with very complicated retrieval requirements. It had to create an annual report that was taking many weeks of counting and re-counting the data, with various ways of selecting and grouping it. The data included public health data, and in addition to patient confidentiality issues, its regulator was requiring that decisions be made on relatively small data sets, so accuracy in retrieval was very important. Building a password-protected system to hold the data was relatively easy. Automating the retrieval was more complicated, and ended up requiring 267 SQL queries nested up to six deep with multiple unions to create the 17 output tables that made up the annual report. Now the company can create all of the data for the report in just a few minutes each year when it is due, and has confidence that it has been done right. There are many benefits that can result from better access to the data. More and more, people with a legitimate need for access to data can be provided this access efficiently and cost-effectively using client-server and Web-based tools. Having the data in a centralized, open database is the key to this.
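None of those 267 queries is reproduced here, but the pattern of nesting queries with unions can be illustrated with a minimal hypothetical sketch (the Screening table and its columns are invented for the example):

-- Build report categories with a UNION in a nested query, then
-- summarize the union; deeper reports stack more such layers.
SELECT Category, COUNT(*) AS Cases
FROM (SELECT 'Under 18' AS Category, PatientID
        FROM Screening WHERE Age < 18
      UNION ALL
      SELECT '18 and over' AS Category, PatientID
        FROM Screening WHERE Age >= 18) AS ByAge
GROUP BY Category;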
Better visualization Another technical benefit is the ability to analyze the project better. Having a comprehensive database opens the door for better analysis and visualization tools, which can lead to a better understanding of the project, and a better ability to anticipate and minimize problems before they become critical. This results in a process where projects are managed by the project team, rather than the projects managing the team with a series of crises and fire drills. An example of benefits from better data management and visualization involved public health data. The client was gathering exposure data, and needed to organize it and try to identify the source of the exposure. Geotech implemented a relational data management system to help the client store and check the data. This system integrated the capability to draw the data on a graph as a time sequence and on a map as a function of distance from the suspected source. The displays that were created provided useful insights into the mechanisms leading to exposure and impact so the exposure could be minimized.
SUBJECTIVE BENEFITS Some benefits derived from improved data management are intangible, but still contribute significantly to the overall success of the project. Data management can be the most tedious component of a project. Implementing an efficient system for processing data can significantly improve morale, which results in improved quality of output, less staff dissatisfaction and turnover, and in general a happier and more productive project team.
Minimized drudgery Sometimes the database software takes on work that it is hard to get people to do. One organization had several projects where the routine drafting was verging on drudgery. Project personnel were printing out their data from spreadsheets, and then the drafters were typing it into their CAD program. In another case, they were printing out results and pasting them on drafted maps. By moving the data into a database that allowed sorting and gave good control over creating formatted output, and by integrating a GIS to automate the display of the data, they were able to automate the process just in time to prevent mutiny.
Improved communication A national landfill company client had a problem with where its data was going to live. It had several hydrologists, each of whom worked on many different projects. The client was receiving laboratory and other data on an ongoing basis for all of these facilities. It wanted its hydrologists to be able to call up data on any of the facilities any time it was necessary, and then be able to work on the data locally, even on an airplane, with some confidence that it is up to date. The solution was a software system that was distributed between the main office and each user's laptop. Clerical staff in the main office imported and checked the laboratory data as it arrived. Remote users downloaded subsets of the data for local use, and then discarded them when they were done with them. As long as they were working with a relatively small amount of data, they were very happy with this process. If they needed to work with larger sets of data, they either needed a fast connection, or to have a subset made and sent to them. Either way, they were able to efficiently work with the data they needed. In another example, a project required the combination of environmental data with health data. The environmental data was gathered and entered in one location, and the health data in another. Patient confidentiality made it very difficult to work between the locations, because people on the environmental side could not see all of the data on the health side, and the health people were not entering the environmental data. The problem was that they were dependent on each other, with the health people needing property information from the environmental folks, and the environmental folks needing family and other information from the health side. A system was built that allowed data to be entered at either location, and then transferred, as much as allowed by confidentiality, between the locations. It also allowed the creation of digital output combining the two types of data while still maintaining confidentiality, so that toxicological and other studies could be performed with the greatest amount of data.
Increased confidence A very successful database implementation was for a medium-sized environmental company. The system that was implemented helped the company significantly throughout the entire life cycle of the data from field data gathering to submission of results to regulators. The client had several projects where it needed to gather data in the field and from laboratories, import and edit it, create printed reports, and deliver digital data to the regulators. A system was put in place that could be used on laptops in the field as well as in the office. The software helped the client enter and check historical data from hard copy, import data on an ongoing basis from the laboratory, and perform statistics and create reports. Then the software was sent along with the data to the regulators, so they could work with the data themselves. Both the employees of the environmental company and the regulators worked with the data so intensely that they felt they had worked out all of the problems with it, and that the data in the database could be trusted. This improved confidence and trust between all parties, from which everyone benefited. In another example, the senior vice president of environmental affairs for a major oil company had numerous consultants managing the company’s data, and the oil company was losing data because of consultant turnover. The data was managed remotely and the company had no control over it. The company was paying for data but not getting it. It needed a centralized database so once it paid for analyses, it would always have them. A centralized in-house database was implemented, and the company now has complete control of its data. That database now has over a million records of analytical data in it and is growing daily.
CHAPTER 27 THE FUTURE OF ENVIRONMENTAL DATA MANAGEMENT
Data management is an important part of many environmental projects. Good data management can contribute to good project management in many ways. As changes occur in the environmental industry, it is likely that data management will play an increasingly important role in projects and in the industry at large. Environmental professionals would be wise to continually improve their computer and data management skills so that they can benefit from the movement to more efficient project and data management. The environmental business is a business of ups and downs. A recent survey of trends in the environmental business (Hensel, 2001) suggests that most environmental professionals are overworked, underpaid, and ignored by supervisors, and envision a future of cutbacks, layoffs, and salary slashing in the industry. Some of these negative thoughts seem to stem from a feeling that the current administration in Washington has less of an emphasis on conservation and environmental impacts, so the demand for environmental professionals will decrease. Another perspective (Krukowski, 2001) published the same month states that environmental professionals are better educated and better compensated than in the past. The same article gave an estimate for the average age of the readership of Pollution Engineering at about 51 years, with an average salary of $83,913 per year. It also mentioned that emphasis on the high-tech nature of the environmental industry will be important in attracting new blood. There is no doubt that the environmental industry is not as hot as it once was. Growth in the industry domestically was 1.6% from 1997 to 1998 as opposed to 15 to 20% annually in the late 1980s (Diener et al., 2000). This can be attributed to a slowdown in the growth in environmental regulations, increased experience in compliance, a slowdown in government-sponsored programs, and commoditization of environmental services. There may also be a decline in citizen interest in environmental issues. A result of this has been a decrease in profit margins for companies, which has led to a significant round of industry consolidation through mergers and acquisitions. A reasonable response to this economic climate is for companies to focus on reducing their cost of doing business. Automated tools such as data management systems contribute directly to that efficiency, making them a key component of surviving and thriving in the future.
If the facts are against you, argue the law. If the law is against you, argue the facts.
Rich (1996)
Another trend in the environmental business is that environmental sites are making a transition from investigation and installation of remediation systems to ongoing monitoring. This shifts the focus of the projects from geologic and engineering issues to management and data management issues. Also, outsourcing of environmental activities has increased by 38% in the last two years (Krukowski, 2001), especially of commoditized services such as data management. Some industrial companies are striking deals with environmental companies where the environmental company not only manages the project, but also assumes liability for environmental issues, often for a lump sum payment. This has resulted in a merging of the engineering and business components of the projects. Managing the projects efficiently is just as important as managing them well technically. Once again, good data management is an important component of this. Just as the environmental industry is evolving, so is the computing industry. Computers continue to grow faster and more powerful. Moore's law marches on, and now that everyone has switched to 32-bit processors, the 64-bit processors are starting to take hold. Driven by the Internet, universal connectivity is becoming a reality. The impact of the Internet on the environmental industry has been somewhat different from its impact on the rest of society. In general, environmental data, at least site data, is not public data, or at least its owners don't want it to be so. While the Internet definitely holds promise for data delivery, not many organizations are making their chemical concentration data publicly available on the Internet. One last trend that should be discussed is the impact of the overseas market for the environmental industry. Of the roughly half a trillion dollars spent annually for environmental products and services, 40% is spent in the U.S. The U.S. is not a growth market for environmental services, but the overseas market is. A visit to the Web sites of the major environmental consulting companies will show that most have several offices overseas. If the per-capita spending on environmental issues overseas approaches anything like that spent in the U.S., then clearly there will be a lot of growth in the future. The trends in the environmental industry discussed above can be expected to continue, and probably intensify in the foreseeable future. Changes in technology will continue to contribute to making data management easier and less expensive. Chapter 5 discussed the increasing importance of centralized, open database systems for improving project performance. A different set of skills is required to manage a large corporate database in SQL Server or Oracle than those necessary to manipulate a spreadsheet of one electronic deliverable. Add to this the expected requirements for more data availability through intranets and extranets, and clearly the data management skill level among environmental professionals will need to be much higher than it is now. The hope is that this book will contribute to increasing database literacy in the industry. But consulting companies and their industrial clients need to make a contribution also.
There is a hump to get over in setting up a good data management system, including training staff members, acquiring software, locating and organizing data, building proficiency in using the data to improve environmental management, and knowing when to outsource data management projects. Companies should be prepared to make these investments, and then reap the rewards in the future. The numbers are there to demonstrate the return on investment from improved data management. Forward-thinking companies willing to make the investment in building an efficient, centralized data management system, including developing the personnel to use it well, will be in a good position to benefit from environmental industry trends and prosper well into the future.
PART SEVEN - APPENDICES
APPENDIX A NEEDS ASSESSMENT EXAMPLE
This section contains an example of a needs assessment questionnaire of the type often used in the early stages of implementing a data management system for environmental data. It should be modified for the specific needs of each organization. Note that it starts with general questions, and gets more specific in the later questions.

Staff Member Name ___________________________
Date ______________________
Title _____________________________ Reports to _______________________________
What is your primary job responsibility? _____________________________________________
______________________________________________________________________________
What would make your job easier? __________________________________________________
______________________________________________________________________________
How much of your time do you currently spend using the computer? ________%
How would you describe your computer proficiency?
[ ] Computer Novice   [ ] Windows Novice   [ ] Database Novice   [ ] Database Familiar   [ ] Power User
What computer tools do you currently use in your work? ________________________________ ______________________________________________________________________________ ______________________________________________________________________________ What computer tools would you like to use in your work? _______________________________ ______________________________________________________________________________ ______________________________________________________________________________ What capabilities would you like to see in a data management system? _____________________ ______________________________________________________________________________ ______________________________________________________________________________
What kind of data should the system contain? _________________________________________
______________________________________________________________________________
For your projects, where would you expect the data to come from? ________________________
______________________________________________________________________________
What output would you like the system to generate? ____________________________________
______________________________________________________________________________
Who in your organization would you expect to use the data management system? ____________
______________________________________________________________________________
Do you have someone who could help you with computer work? __________________________
______________________________________________________________________________
Do you have any thoughts on how the system should be implemented? _____________________
______________________________________________________________________________
When would you like to have the system in place?
[ ] Last year   [ ] This year   [ ] Next year   [ ] Later   [ ] Never
Comments? ____________________________________________________________________
______________________________________________________________________________
On a scale of one to five, how would you rank the importance of the following capabilities in a data management system:

Capability                                 Very Important ... Not Important   Don't Know

Data Import
  Manual data entry                        [5] [4] [3] [2] [1]                [ ]
  Conversion from other formats            [5] [4] [3] [2] [1]                [ ]
  Automated import from labs               [5] [4] [3] [2] [1]                [ ]

Data Storage
  Site chemical data                       [5] [4] [3] [2] [1]                [ ]
  Site geologic data                       [5] [4] [3] [2] [1]                [ ]
  Engineering drawings                     [5] [4] [3] [2] [1]                [ ]
  Other site engineering data              [5] [4] [3] [2] [1]                [ ]
  Project plans                            [5] [4] [3] [2] [1]                [ ]
  Reports                                  [5] [4] [3] [2] [1]                [ ]
  _______________________                  [5] [4] [3] [2] [1]                [ ]
  _______________________                  [5] [4] [3] [2] [1]                [ ]

Data Manipulation
  Manual data editing                      [5] [4] [3] [2] [1]                [ ]
  Data quality checking                    [5] [4] [3] [2] [1]                [ ]
  Map-based selection                      [5] [4] [3] [2] [1]                [ ]

Data Output
  Lists/tables                             [5] [4] [3] [2] [1]                [ ]
  Graphs and charts                        [5] [4] [3] [2] [1]                [ ]
  Map displays                             [5] [4] [3] [2] [1]                [ ]
  2-D modeling (contouring)                [5] [4] [3] [2] [1]                [ ]
  3-D modeling                             [5] [4] [3] [2] [1]                [ ]
  Flow modeling                            [5] [4] [3] [2] [1]                [ ]
  Fate and transport modeling              [5] [4] [3] [2] [1]                [ ]
  Cross sections                           [5] [4] [3] [2] [1]                [ ]
  Boring logs                              [5] [4] [3] [2] [1]                [ ]
  Technical displays (e.g., Stiff, Piper)  [5] [4] [3] [2] [1]                [ ]
  Statistics                               [5] [4] [3] [2] [1]                [ ]

Project Management
  Schedules and budgets                    [5] [4] [3] [2] [1]                [ ]
  Regulatory compliance                    [5] [4] [3] [2] [1]                [ ]
  Prioritization                           [5] [4] [3] [2] [1]                [ ]

Other Issues
  Remote access                            [5] [4] [3] [2] [1]                [ ]
  _______________________                  [5] [4] [3] [2] [1]                [ ]
  _______________________                  [5] [4] [3] [2] [1]                [ ]
  _______________________                  [5] [4] [3] [2] [1]                [ ]
  _______________________                  [5] [4] [3] [2] [1]                [ ]
  _______________________                  [5] [4] [3] [2] [1]                [ ]
If you had one or two capabilities that you would like to see implemented in the next six months, what would they be? _____________________________________________________________ ______________________________________________________________________________ ______________________________________________________________________________ Do you have any other comments? __________________________________________________ ______________________________________________________________________________ ______________________________________________________________________________ ______________________________________________________________________________ ______________________________________________________________________________ We ask that respondents not discuss their responses to this survey with others until after everyone has been surveyed.
APPENDIX B DATA MODEL EXAMPLE
This appendix describes the data model for an EDMS. It is based on Microsoft Access, a commercially available database manager, and Enviro Data, a commercially available product for managing site environmental data. However, the intention here is to focus on the structure and contents of the data model, and not on one particular implementation.
INTRODUCTION The data model for a data management system is the structure of the tables and fields that will contain the data. Many software designers work with data models at two levels. The logical data model describes, at a relatively high level, the data content for the system. The physical data model describes in detail exactly how the data will be stored, with names, data types, and sizes for all of the fields in each table, along with the relationships (key fields which join the tables) between the tables. This data model is designed to store and relate site characterization data. It was built primarily for groundwater, surface water, sediment, soil, and air data, although it is not necessarily limited to those data types. It is also possible to store media characterization data such as physical characteristics and geologic units. The design of the tables and relationships is, for the most part, in Fifth Normal Form. This means that data of like types is stored together, and redundant data is separated out, with parent-child relationships and lookup tables used as appropriate. This design provides good performance with a minimum of wasted space.
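A minimal sketch of what this looks like in practice, written as SQL DDL roughly equivalent to what the Access table designer produces (TEXT and COUNTER are Jet data types; the table and field names are taken from the model described below):

-- Lookup (parent) table: one row per station type code.
CREATE TABLE StationTypes (
  StationTypeCode TEXT(2) NOT NULL PRIMARY KEY,
  StationType     TEXT(25)
);

-- Child table: each station stores only the short code, so the
-- full station-type text is never repeated from row to row.
CREATE TABLE Stations (
  StationNumber   COUNTER PRIMARY KEY,
  StationName     TEXT(25) NOT NULL,
  StationTypeCode TEXT(2)
    CONSTRAINT fkStationType REFERENCES StationTypes (StationTypeCode)
);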
CONVENTIONS This data model was created using Microsoft Access, so field types are listed as Access data types. SQL Server and Oracle have similar data types. For coordinate locations and dates, we have added suffixes to the field names, so that other programs reading the tables (through ODBC, for example) can identify the field types. These conventions are useful for spatial data retrieval by GIS, contouring, geostatistics, or other software. For example, coordinate values end in _CX or _CY to identify them as Cartesian X and Y coordinates, and date fields end in _D. In this section certain typefaces are used to refer to different objects in the system. These typefaces are listed in the following table:
Typeface                      Example            Database object
Underlined, bold, upper case  MAIN MENU          Application forms, Access dialog boxes
Underlined, bold                                 Filter Criteria Boxes
Underlined, italic            Edit Samples       Specific buttons or selection fields
Bold                          Samples            Table names
Italic                        SampleTop          Field names
Underlined                    Number of Samples  Other objects on application screens
The data model is divided into four table types: primary tables, lookup tables, reference tables, and utility tables. Primary tables contain data imported from the lab or entered by users. Lookup tables contain expansions of abbreviations and other information that is not stored in the primary tables. They are keys to the data in the primary tables. Some of the lookups (e.g., Parameters and Units) are formally linked to the primary tables through relationships. Others (AnalyticFlags, AnalyticProblems, and ValidationFlags) are not tied with relationships. This is because multiple flags can be entered into the field, so these tables are for use in checking only. Reference and utility tables provide other functions in the database, but are not related to the primary or lookup tables. The data tables for this model, and descriptions of how each table is used, are in the following sections. Listed for each table is the table name with a brief description, followed by all of the fields for the table and some example data. The first column has the name for the field. The second is the field type. Numeric fields have the following types:

Code  Type                     Range
By    Byte integer             0 to 255
In    Integer                  -32,768 to 32,767
Lg    Long integer             -2,147,483,648 to 2,147,483,647
Sg    Single precision real    -3.4x10^38 to 3.4x10^38 with 7 decimal places
Db    Double precision real    -1.797x10^308 to 1.797x10^308 with 15 decimal places
Au    Automatically generated  -2,147,483,648 to 2,147,483,647, generated by system
The “Sz” column lists the amount of storage required for this data element. Note that for numbers this is not the same as the number of digits, since numbers are stored in Access in compressed form. “Description” is a brief summary of what the field represents. “Relationship” lists other tables that depend on this data element, that is, tables that are joined on this field. It also should be noted that fields that refer to lookup tables (such as ReportingUnits in the Analyses table) contain codes, so the field in the primary table does not need to be long enough to hold the whole object. For example, “µg/l” is stored as “ul” in Analyses, and the full text of the data is in the lookup table, in this case the ReportingUnits table. In the example data shown below some of the fields may not be shown so the table will fit in the width of the page. For many of the lookup tables, example values are shown without codes.
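Expanding a code back to its full text at report time is then a simple join of the primary table to its lookup; a sketch (the key field name in ReportingUnits is assumed here, since that table is not detailed in this appendix):

-- Report each result with the full units text instead of the code.
SELECT a.Value, u.ReportingUnits
FROM Analyses AS a
JOIN ReportingUnits AS u ON u.ReportUnitsCode = a.ReportUnitsCode;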
PRIMARY TABLES The tables will be presented from highest level (parent) through lowest (great-grandchild); that is, each table on the one side of a one-to-many relationship will be followed by the table on the many side. Sites – This is a table of sites, tied to the Stations table. Each site corresponds to a facility or project (not to a well, as this term is sometimes used). Each combination of SiteName and State must be unique.
Field             Type        Sz   Description                             Relationships
SiteNumber        AutoNumber  4    Unique site number generated by system  Stations, GeologicUnits, Lithology, SiteStationAlias, SiteLabLink, MultiObservations, ValidDetectLimit, RegLimits
SiteName          Text        50   Name of site
SiteCity          Text        50   City that site is located in or near
State             Text        5    State
County            Text        20   County
Country           Text        20   Country
CoordSystem       Text        50   Coordinate system
SiteType          Text        20   Type of site
Owner             Text        20   Owner of site
Description       Text        255  Description
SiteMinX          Number(Sg)  8    Minimum X-Coordinate (easting)
SiteMaxX          Number(Sg)  8    Maximum X-Coordinate (easting)
SiteMinY          Number(Sg)  8    Minimum Y-Coordinate (northing)
SiteMaxY          Number(Sg)  8    Maximum Y-Coordinate (northing)
SiteMinZ          Number(Sg)  8    Minimum Z-Coordinate (elevation)
SiteMaxZ          Number(Sg)  8    Maximum Z-Coordinate (elevation)
Scale1            Number(Db)  8    Initial scale for map
Scale2            Number(Db)  8    Scale for detail display
BaseMapFile       Text        120  File name and path for base map
SiteUpdateDate_D  Date/Time   10   Date this site was last updated
Some example data for Sites is:

Site #  Site Name             City   State  County     Coordinate System            Type      Owner   Description
1       Rad Industries        Erie   CO     Adams      Site (based on state plane)  Rad       Orphan
2       Refining, Inc.        Hanes  OH     Cleveland  State plane                  Organics  Orphan
3       Forest Products, Co.  Como   CO     Adams      Site (based on state plane)  Organics  Orphan
Stations – This is a data table of locations at which samples have been taken, such as wells, borings, surface water samples, etc. Each station can have multiple samples by date, depth, or both. Records must be unique on SiteNumber and StationName.

Field              Type        Sz   Description                                     Relationships
StationNumber      AutoNumber  4    Unique station number generated by system       Samples, SiteStationAlias, SampEventStations
StationName        Text        25   Station identifier or name
SiteNumber         Number(Lg)  4    Site code                                       Sites
ShortName          Text        10   Abbreviated name for map
OldName            Text        25   Previous station name
StationGroupCode   Text        15   Group to which station belongs                  StationGroup
Location_CX        Number(Db)  8    X-Coordinate (easting)
Location_CY        Number(Db)  8    Y-Coordinate (northing)
Location_LL_LX     Number(Db)  8    Longitude
Location_LL_LY     Number(Db)  8    Latitude
GroundElevation    Number(Sg)  4    Ground elevation
DatumElevation     Number(Sg)  4    Datum elevation
Depth              Number(Sg)  4    Depth of hole
ScreenTop          Number(Sg)  4    Top of screen if present(1)
ScreenBase         Number(Sg)  4    Base of screen if present(1)
StationTypeCode    Text        2    Code for type of station                        StationTypes
StationDate_D      Date/Time   8    Date station was activated
StationText        Text        255  Additional station information
LocationCode       Text        2    Location code relative to gradient              LocationCodes
SamplingFreqCode   Text        3    Sampling frequency code                         SamplingFrequency
StationUnitsCode   Text        2    Depth units for this station                    ReportingUnits
StaUnitGeoCode     Text        5    Geologic code for screened interval             GeologicUnits
PropertyDescrip    Text        50   T/R/S or metes & bounds description of station
RegulatoryID       Text        20   Well identifier issued by regulatory agency
DrillerName        Text        50   Name of company installing station
InstallerName      Text        20   Name of person overseeing well installation
CurrentStatusCode  Text        1    Foreign key to CurrentStatus lookup table       CurrentStatus
QCStationCode      Text        3    Quality control code for this station           QCCodes
StaUpdateDate_D    Date/Time        Date this station last updated
LocationSource     Text        50   Source of station coordinates

(1) For holes with more than one screened interval, enter the top of the highest and bottom of the lowest screened intervals.
The following is some example data from the Stations table:

Sta.#  Station Name  Site #  X Coord.  Y Coord.  Ground  Datum   Depth  Screen Top  Screen Base  Type
807    MW-01         3       602.181   1036.21   751.78  754.26  63.32  53.3        62.3         mw
808    MW-02         3       597.735   889.752   751.59  754.32  63.23  53.2        62.2         mw
809    MW-04         3       891.285   892.284   752.29  754.46  61.91  52.4        61.4         mw
811    MW-07         3       899.841   1219.13   753.23  755.35  65.22  55.2        64.2         mw
Samples – This is a data table of samples or observation events for the stations. Samples must be unique in the combination of StationNumber, SampleDate_D, SampleMatrixCode, SampleTop, SampleBottom, Filtered, and DuplicateSample. Each sample can have many analyses. Not all samples will require that all fields be used. For example, SampleTop and SampleBottom are used for soil samples but may not be used for water samples. Water samples will more likely be unique on their sample date. In order to ensure compatibility between Access and server databases, fields that make up the unique index should not be null. Zeros should be entered if there are no values for SampleTop and SampleBottom.

Field               Type         Sz   Description                                Relationships
SampleNumber        AutoNumber   4    Unique sample number generated by system   Analyses
StationNumber       Number(Lg)   4    Foreign key linking to Stations table      Stations
SampleDate_D        Date/Time    10   Date/time sample was taken or started
SampleEndDate_D     Date/Time    10   Date/time that sample period ended
SampleMatrixCode    Text         1    Sample matrix code                         SampleMatrix
SampleTypeCode      Text         4    Sample collection code                     SampleTypes
SampleTop           Number(Sg)   4    Sample top
SampleBottom        Number(Sg)   4    Sample bottom
GeologicUnitCode    Text         5    Geologic or hydrologic unit code           GeologicUnits
LithologyCode       Text         5    Lithology or soil type code(1)             Lithology
Description         Text         25   Sample brief description
ExtDescription      Text         255  Sample extended description
Sampler             Text         50   Name of person taking sample
SamplePurposeCode   Text         2    Code for purpose of sampling               SamplePurpose
LabSampleID         Text         20   Lab sample ID
AltSampleID         Text         20   Alternate sample ID
DuplicateSample     Number(Int)  2    Duplicate sample designation
CoolerID            Text         20   Cooler ID
FieldSampleID       Text         20   Field sample ID number
FilteredSampleCode  Text         4    Foreign key to Filtered lookup table       Filtered
DeliveryGroup       Text         10   Sample delivery group
QCSequenceID        Text         15   QC sequence identifier
DepthUnitsCode      Text         2    Foreign key to ReportingUnits lookup       ReportingUnits
COCNumber           Text         20   Chain of custody number
TaskNumber          Text         20   Task number under which sampling is done
PrimarySample       Number(Lg)   4    Primary sample to which QC sample applies
SampleResult        Text         255  Result of attempted sampling
QCSampleCode        Text         3    QC code for this sample                    QCCodes
SampleEventID       Number(Lg)   4    Link to SampleEvents table
SampUpdateDate_D    Date/Time    10   Date this sample was last updated

(1) For soils, lithology code should be based on the Unified Soil Classification System (USCS) with up to two combined codes. For rock, use a code that is appropriate for the specific site.
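A hedged sketch of how the uniqueness rule described above might be enforced (the prose says "Filtered," which is taken here to mean the FilteredSampleCode field):

-- Enforce the natural key for samples; duplicate combinations
-- will be rejected on import or entry.
CREATE UNIQUE INDEX SampleNaturalKey ON Samples
  (StationNumber, SampleDate_D, SampleMatrixCode,
   SampleTop, SampleBottom, FilteredSampleCode, DuplicateSample);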
This is some data from the Samples table (note the coded values in Matrix, Sample Type, etc.):

Samp.#  Sta.#  Samp.Date  Matr.  Samp.Type  Samp.Top  Samp.Base  Geol.Unit  Lith.  Descrip.    Sampler  Purp.
995     487    8/28/86    w      g          0         0          C/E        Z                  DWR      z
996     487    11/11/86   w      g          0         0          C/E        Z                  DWR      z
997     487    3/4/87     w      g          0         0          C/E        Z                  JLG      z
998     487    5/13/87    w      g          0         0          C/E        Z                  MAW      z
1007    31                s      c          711.6     712.6      z          Z      Cloudy      CRS      z
1008    31                s      c          713.6     714.6      z          Z      Silty clay  CW       z
1009    31                s      c          715.6     716.6      z          z                  RW       z
1010    31                s      c          716.6     717.6      z          z                  EG       z
Analyses – This data table contains the analytical data (the results) for samples. This is usually the largest table in the database. Analytical values are unique in the combination of SampleNumber, ParameterNumber, LeachMethod, Basis, and Superseded.

Field              Type         Sz  Description                                     Relationships
AnalysisNumber     AutoNumber   4   System-assigned local key
SampleNumber       Number(Lg)   4   Foreign key linking to Samples table            Samples
ParameterNumber    Number(Lg)   4   Foreign key linking to Parameters table         Parameters
Superseded         Number(Int)  2   Analysis superseded by re-analysis?(1)
AnalyticMethod     Text         25  Method for performing analysis
Value              Number(Sg)   4   Value measured during analysis
FlagCode           Text         4   Data qualifier                                  AnalyticFlags(4)
ReportUnitsCode    Text         2   Units of the analysis                           ReportingUnits
Detect             Number(Sg)   4   Detection limit for this analysis(2)
LimitType          Text         4   Type of detection limit
Error              Number(Sg)   4   Error range for this analysis
DataReviewCode     Text         1   Status of data review                           DataReview
DataReviewHistory  Text         10  Historical listing of data review codes         DataReview
ProblemCode        Text         4   Problems encountered during analysis(3)         AnalyticProblems(4)
ValidationCode     Text         4   Data validation qualifier                       ValidationFlags(4)
AnalDate_D         Date/Time    10  Date the analysis was performed
ExtractDate_D      Date/Time    10  Date the constituent was extracted
Lab                Text         10  Name of lab conducting analysis
LabComments        Text         50  Lab comments
AnalysisLabID      Text         20  Lab identification number for analysis
AnalyticalBatch    Text         40  Lab batch ID number
DilutionFactor     Number(Sg)   4   Dilution performed before analysis
ConvertedValue     Yes/No       1   This value was converted from other units
AnalyticLevelCode  Text         1   EPA level of analysis
DetectedResult     Text         1   Was analyte detected
ReportableResult   Text         1   Use this analysis as the reportable result
Detect2            Number(Sg)   4   2nd detection limit for this analysis
LimitType2         Text         4   Type of 2nd detection limit
LeachMethodCode    Text         1   Foreign key to LeachMethod lookup table         LeachMethod
PrepMethod         Text         20  Preparation method
ValueCode          Text         6   Foreign key to ValueCode lookup table           ValueCode
Basis              Text         1   Analyzed wet or dry - report w, d, or n
LabReportDate_D    Date/Time    10  Date of lab report
AliasNumber        Number(Lg)   4   Tracks parameter alias name used by lab         ParameterAlias
NumberDecimals     Number(Int)  2   No. of decimal places displayed at report time
FilteredAnalCode   Text         4   Foreign key to Filtered lookup table            Filtered
RunCode            Text         2   Foreign key to RunCode lookup table             RunCode
QCAnalysisCode     Text         3   QC code for this analysis                       QCCodes
AnalUpdateDate_D   Date/Time    10  Date this analysis last updated

(1) Numbered values for superseded analyses, with 0 for current analysis, increasing by one for each older value.
(2) Represents the method detection limit for this analysis after adjustment for dilution, etc.
(3) Coded values for any problems with this analysis.
(4) These relationships are informational only and are not enforced by the system.
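Given the Superseded numbering in footnote 1, current results are selected with a simple filter; a sketch (a full reporting query would typically also honor ReportableResult):

-- Superseded = 0 marks the current value; re-analyses are
-- numbered 1, 2, ... for progressively older values.
SELECT *
FROM Analyses
WHERE Superseded = 0;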
Here is some example data from the Analyses table (note the coded entries in ParameterNumber, ReportUnitsCode, etc.):

Analysis #  Samp.#  Parm.#  Sup.  An.Meth  Value   Flag  Rep.Units  Detect.  D.L.Ty.  Data Rev.  Prob.  Val.Code
11840       1676    35      0     6010B            U     ml         0.0045   MDL      0          Z      z
11841       1676    40      0     6010B    2.61    B     ml         0.015    MDL      0          Z      z
11842       1676    50      0     6010B    0.515   V     ml         0.0031   MDL      0          Z      z
11843       1676    60      0     6010B    0.0324  J     ml         0.0078   MDL      0          Z      z
11844       1676    85      0     6010B            U     ml         0.0047   MDL      0          Z      z
11845       1676    120     0     6010B    0.026   V     ml         0.009    MDL      0          Z      z
11846       1676    10      0     6010B    0.03    V     ml         0.005    MDL      0          Z      z
11851       1676    135     0     353.2            U     ml         0.03     MDL      0          Z      z
11852       1676    140     0     160.1    1090    @     ml         21       MDL      0          Z      z
11853       1688    20      0     6010B    0.183   B     ml         0.0058   MDL      0          Z      z
11854       1688    30      0     6010B            U     ml         0.0066   MDL      0          Z      z
11855       1688    32      0     6010B            U     ml         0.0058   MDL      0          Z      z
LOOKUP TABLES Some of the tables in the database contain lookup values for codes contained in other tables. This makes it easier to change the descriptions of the codes, minimizes errors due to misspellings, and saves space. The lookup values can be edited while the system is in use, although lookup values for which there are related records in the primary tables cannot be deleted or the codes changed. The following selected tables from the data model are presented in order from data related to the highest level primary data element (Sites) to the most specific data item (Analyses). The following tables are related to Sites:
SiteLabLink – This table is a link table allowing a many-to-many relationship between sites and labs (including contractors and others).

Field          Type        Sz  Description                         Relationships
SiteLabLinkID  AutoNumber  4   ID number for this link
SiteNumber     Number(Lg)  4   Foreign key linking to Sites table  Sites
LabID          Number(Lg)  4   Foreign key linking to Labs table   Labs
LinkType       Text        1   Type of link
Some values for SiteLabLink are:

SiteLabLinkID  SiteNumber  LabID  LinkType
2              1           1      Contractor
3              2           1      Contractor
4              3           1      Contractor
Labs – This table is tied to Sites by the SiteLabLink table. It contains information about labs, contractors, and others who are working with the data. It can be used to create a reference file for testing imports prior to delivering data.

Field          Type        Sz   Description                       Relationships
LabID          AutoNumber  4    Unique lab ID assigned by system  SiteLabLink
CompanyName    Text        50   Name of lab
StreetAddress  Text        100  Address
City           Text        50   City
State          Text        5    State code
Zip            Text        10   Zip
Country        Text        20   Country
Contact        Text        50   Name of contact person
Phone          Text        20   Phone
Fax            Text        20   Fax
EMail          Text        50   Email
Website        Text        50   Website
FileName       Text        50   Name for export file
Some values for Labs are:

LabID  CompanyName  StreetAddress       City       State  FileName
1      Geotech      6535 S. Dayton St.  Englewood  CO     GeotechRef.mdb
2      XYZ Labs     37 Elm St.          Anytown    CO     XYZRef.mdb
RegLimits – This table stores regulatory limit information for projects, and is tied to sites and parameters.

Field             Type        Sz  Description                                   Relationships
RegLimitsID       AutoNumber  4   Unique reg. limit number generated by system
ParameterNumber   Number(Lg)  4   Link to Parameters table                      Parameters
SiteNumber        Number(Lg)  4   Link to Sites table                           Sites(1)
SampleMatrixCode  Text        1   Sample matrix code                            SampleMatrix
RegLimit          Number(Db)  8   Regulatory limit for parameter
RegLowerLimit     Number(Db)  8   Lower limit for parameter
RegUnit           Text        2   Units for regulatory limit                    ReportingUnits
RegTypeCode       Text        2   Foreign key to RegLimitTypes table

(1) This relationship is informational only, and is not enforced by the system.
Some values for RegLimits are:

RegLimitsID  ParameterNumber  SiteNumber  SampleMatrixCode  RegLimit  RegUnit  RegTypeCode
1            35               1           Water             10        ug/l     FM
2            40               2           Water             50        ug/l     z
3            45               2           Water             100       ug/l     z
RegLimitTypes – This lookup table lists the regulatory limit types to which limits can be assigned.

Field        Type  Sz  Description
RegTypeCode  Text  2   Code for regulatory limit type
RegType      Text  50  Regulatory limit type

Some values for RegLimitTypes include Calculate percentile, Federal MCL, Guidance, None, Permit, Primary, State drinking water levels, Safe drinking water standards, Sec. high, Sec. low, Surface water, TCLP, and Unknown.
RegLimitGroups – This lookup table is used to group regulatory limits in user-defined sets for reporting.

Field            Type        Sz  Description
RegLimitGroupID  AutoNumber  4   System-generated unique ID
RegLimitGroup    Text        50  Reg limit group name
Some example values for RegLimitGroups are: Rad Industries, Drinking water all sites, and Standard report group.
The following tables are related to Stations:
CurrentStatus – This lookup table tracks the station's current operating status.

Field              Type  Sz  Description
CurrentStatusCode  Text  1   Code for current status
CurrentStatus      Text  20  Current status
Here are some typical entries in this table: Abandoned, In service, and Unknown.
LocationCodes – This is a table containing the codes for the location of the station. This can be used to describe location relative to hydrodynamic gradient, or changed to better fit your data needs.

Field         Type  Sz  Description                            Relationships
LocationCode  Text  2   Code for station location              Stations
Location      Text  25  Station location relative to gradient
Some typical values for LocationCodes are: Downgradient, Sidegradient, Upgradient, Onsite, Offsite, and Unknown.
SamplingFrequency – This table is a lookup for the frequency of sampling a station.

Field              Type        Sz  Description                             Relationships
SamplingFreqCode   Text        3   Sampling frequency code                 Stations
SamplingFrequency  Text        25  Frequency of sampling for this station
FrequencyInterval  Number(Lg)  4   Days between sampling events

Some values for SamplingFrequency are:

Frequency Code  Sampling Frequency  Frequency Interval
A               Annual              365
A2              Biennial            730
A3              Triennial           1095
M               Monthly             30
N               Abandoned station
Q               Quarterly           91
S               Semi-annual         182
SP              Special one-time
W               Weekly              7
z               Unknown
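Because FrequencyInterval is stored in days, it supports scheduling queries directly; for example, a sketch using Access's DateAdd function (join fields as in the Stations table above):

-- When is each station next due for sampling?
SELECT st.StationName,
       MAX(sa.SampleDate_D) AS LastSampled,
       DateAdd('d', sf.FrequencyInterval, MAX(sa.SampleDate_D)) AS NextDue
FROM (Stations AS st
  INNER JOIN SamplingFrequency AS sf ON sf.SamplingFreqCode = st.SamplingFreqCode)
  INNER JOIN Samples AS sa ON sa.StationNumber = st.StationNumber
GROUP BY st.StationName, sf.FrequencyInterval;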
StationGroup – This lookup table tracks groups of stations. It can be used to identify well clusters, or to distinguish project levels within a selected site.

Field             Type  Sz  Description
StationGroupCode  Text  2   Station group code
StationGroup      Text  20  Station group description
Some values for StationGroup are: Deep, None, Shallow, and Unknown.
StationTypes – This is a lookup table containing codes and descriptions for the type of station. This is tied to the Stations table.

Field            Type  Sz  Description               Relationships
StationTypeCode  Text  2   Code for type of station  Stations
StationType      Text  25  Type of station
Some values for StationTypes are Cone penetrometer, Gas monitoring probe, Geoprobe boring, House well, Monitoring well, Piezometer, Recovery well, Soil boring, Sediment point, Sampling port, Stock well, Surface water point, Water well, and Unknown.
The QCStationCode field in the Stations table is tied to the QCCodes lookup table, which is described below under the Samples table lookups. The StaGeoUnitCode field (station geologic unit code) in the Stations table is tied to the GeologicUnits lookup table, which is described below under the Samples table lookups. The StationUnits field in the Stations table is joined to the ReportingUnits table, which is described below under the Analyses table lookups. The following tables are related to Samples:
GeologicUnits – This is a lookup table containing codes and names of the geologic units for the interval of each sample. You can assign each code to a specific site, or enter 0 for all sites.

Field             Type        Sz  Description                          Relationships
GeologicUnitCode  Text        5   Geologic or hydrologic unit code     Samples
GeologicUnit      Text        25  Name of geologic or hydrologic unit
SiteNumber        Number(Lg)  4   Site number                          Sites(1)

(1) This relationship is informational only, and is not enforced by the system.
Some codes for GeologicUnits are: Geol.Code A Al B Bed Dp Fil n Sh Sil z
Unit Name A-stratum Alluvial B-stratum Bedrock Deep Surface Fill Not applicable Shallow Silurian Unknown
SiteNumber 1 1 1 0 2 1 3 2 1 0
Lithology – This is a lookup table containing codes and lithologic descriptions of the geologic unit of the interval for each sample. It is based on the Unified Soil Classification System (USCS) for soil and a site-specific code for rock. You can assign each code to a specific site, or enter 0 for all sites.

Field          Type        Sz  Description              Relationships
LithologyCode  Text        5   USCS code for lithology  Samples
Lithology      Text        50  Meaning of code
Pattern        Text        18  Fill pattern or color
Fill_Pen       Text        18  Color fill pen
Line_Pen       Text        18  Line pen color
SiteNumber     Number(Lg)  4   Site number              Sites¹
¹ This relationship is informational only, and is not enforced by the system.

Some typical codes for Lithology are:

Lith. Code  Lithology                            SiteNumber
CH          Inorganic clay, high plasticity      2
CL          Inorganic clay, low-med. plasticity  2
GC          Clayey gravel                        2
GP          Poorly-graded gravel                 2
GW          Well-graded gravel                   2
MH          Inorganic silt                       3
ML          Inorganic silt and vfg. sand         3
OL          Organic silt and organic silty clay  3
PT          Peat                                 3
SC          Clayey sand                          3
SM          Silty sand                           1
z           Unknown                              0
QCCodes – This table is used to flag lab and field quality control data at the stations, samples, and analyses levels.

Field           Type         Sz  Description                               Relationships
QCCode          Text         4   QC code for station, sample, or analysis  Stations, Samples, Analyses, RPDControlLimits
QCType          Text         40  QC item type
QCScopeCode     Text         1   Foreign key to QCScope lookup table       QCScope
DuplicateOrder  Number(Int)  2   Import order based on QCCode
QCLocation      Text         15  Identify if QC is from lab or field
QCDataLevel     Text         10  Identify level QCCode applies to
ValidationType  Text         40  Identify QCCode type for Validation code
Some examples for QCCode are:

QCCode  QC Type                           QCScopeCode  DuplicateOrder  QCLocation  QCDataLevel  ValidationType
B       Blank                             n            30              Field       3
BDS     Blind sample                      s            13                          3
BS      Blank spike                       s            14                          3
CB      Calibration blank                 s            15              Lab         3
CCV     Calibration Control Verification  s            8               Lab         3            Continuous Calibration Value (R)
CS      Check sample                      s            16                          3
DB      Dynamic blank                     s            17                          3
DUP     Field duplicates                  s            2               Field       3            Field Duplicate
FB      Field blank                       s            3               Field       3            Field Blank
FS      Field sample spikes               s            18              Field       3
IB      Instrument carryover blank        s            19              Lab         3
IS      Internal standard                 s            20              Lab         3
LCS     Laboratory Control Sample         s            7               Lab         3            Laboratory Control Sample (R)
LCSD    Lab Control Sample Duplicate      s            9               Lab         3            Laboratory Control Sample Duplicate (R)
LD      Laboratory duplicates             s            6               Lab         3            Lab Duplicate
MB      Method blank                      s            10              Lab         3
MS      Matrix spike                      s            11              Lab         3            Matrix Spike
MSD     Matrix spike duplicate            s            12              Lab         3            Matrix Spike Duplicate
N       None                              n            29                          4
O       Original data                     n            1               Field       4            Original Sample
RB      Rinseate blank                    s            21              Field       3
RD      Referee duplicates                s            22                          3
SB      Storage blank                     s            23                          3
SP      Split samples                     s            4               Field       3            Split
SPD     Split-Duplicates                  s            5                           3            Split-Duplicate
SS      Synthetic sample                  s            24              Lab         3            Laboratory Reference Standard (R)
SUR     Surrogate spikes                  a            25              Lab         4
TB      Trip blank                        t            26              Field       2
TIC     Tentatively identified compound   a            27                          4
Z       Unknown                           n            28                          4
QC codes are listed here under samples, but some of the codes are used for stations and analyses as well.

QCScope – This table is used to designate the scope of a quality code in the QCCodes table.

Field               Type  Sz  Description          Relationships
QCScopeCode         Text  1   Code for QCScope     QCCode
QCScopeDescription  Text  20  QCScope description
Some codes for QCScope are Analyses, Samples, and Not applicable.

SampleMatrix – This table has codes and descriptions of the matrix (material) for this sample.

Field             Type  Sz  Description             Relationships
SampleMatrixCode  Text  1   Code for sample matrix  Samples
SampleMatrix      Text  15  Matrix for this sample
Some codes for SampleMatrix are Air, DNAPL, Gas, Leachate, Sediment, Sludge, Other, Petroleum, LNAPL, Reagent, Soil, Water, Waste, and Unknown.

SamplePurpose – This lookup table holds codes for the reason for sampling.

Field              Type  Sz  Description                   Relationships
SamplePurposeCode  Text  2   Code for purpose of sampling  Samples
Stage              Text  25  Stage of investigation
Purpose            Text  25  Purpose
Some codes for SamplePurpose are:

PurposeCode  Stage              Purpose
ac           Assessment         Confirmation
ad           Assessment         Disposal
ai           Assessment         Investigatory
ar           Assessment         Routine
cm           Corrective action  Monitoring to a limit
db           Detection          Background
dr           Detection          Regulated
n            NPDES              Monitoring to a limit
s            Special            Special
u            Due diligence      Due diligence
v            Verification       Detection
z            Unknown            Unknown
SampleStatus – This lookup table holds codes for status of sampling events.

Field             Type  Sz  Description                Relationships
SampleStatusCode  Text  1   Sample status code         SiteStationAlias
SampleStatus      Text  20  Sample status description
Some codes for SampleStatus are Canceled and Complete.

SampleTypes – This table has codes and descriptions of how the sample was taken.

Field           Type  Sz  Description                   Relationships
SampleTypeCode  Text  4   Code for type of sample       Samples
SampleType      Text  25  How the sample was collected
Some codes for SampleTypes are Composite, Disturbed, Grab, Discrete, Undisturbed, and Unknown.

The following tables are related to Analyses:

AnalyticFlags – This is a lookup table containing codes and descriptions for flags associated with analytical values. These flags generally come with the electronic deliverable of data from the laboratory. Factor is a multiplier for statistical analysis, and Basis is the value to be used for statistical analysis, where v is for value, and d is for detection limit. ReportingFactor is a multiplier for analytical reports, and ReportingBasis describes how the result is to be formatted for reporting. This table is a reference table for data in Analyses, but is not formally related to it.
Field            Type        Sz  Description                                          Relationships
FlagCode         Text        1   Code for flag                                        Analyses¹
AnalyticFlag     Text        55  Flag for analytical value
Factor           Number(Sg)  4   Multiplier for statistical analysis
Basis            Text        1   Indicates value to use for statistical analysis
ReportingFactor  Number(Sg)  4   Multiplier for analytical reports
ReportingBasis   Text        1   Indicates number, flag to use in analytical reports
¹ This relationship is not strictly enforced by the system, since multiple flags are allowed.
Some example codes for AnalyticFlags are:

Flag Code  Flag                                  Factor  Basis  ReportingFactor  ReportingBasis
*          Surrogate outside QC limits           0       v      0                v
a          Not available                         0       v      0                v
b          Analyte detected in blank and sample  1       v      1                v
c          Coelute                               1       v      1                v
d          Diluted                               1       v      1                v
e          Exceeds calibration range             1       v      1                v
f          Calculated from higher dilution       1       v      1                v
g          Concentration > value reported        1.43    v      1                g
h          Result reported elsewhere             1       v      1                f
i          Insufficient sample                   0       v      0                v
j          Est. value; conc. < quan. limit       1       v      1                b
l          Less than detection limit             0.5     d      1                l
m          Matrix interference                   1       v      1                v
n          Not measured                          0       v      0                v
q          Uncertain value                       1       v      1                v
t          Trace amount                          0.5     d      1                d
u          Not detected                          0.1     d      1                d
v          Detected value                        1       v      1                v
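The Factor and Basis columns drive how a flagged result enters statistical calculations: the basis picks either the reported value (v) or the detection limit (d), and the factor scales it. A short sketch of that logic in Python (the function and argument names are illustrative, not from the book):

    def statistical_value(value, detect_limit, factor, basis):
        # basis: "v" = use the reported value, "d" = use the detection limit.
        # factor: multiplier from the AnalyticFlags table, e.g., 0.5 for
        # "less than detection limit," so a non-detect enters statistics
        # as half the detection limit.
        base = value if basis == "v" else detect_limit
        return base * factor

    # A result flagged "l" (less than detection limit): factor 0.5, basis "d".
    print(statistical_value(0.0, 10.0, 0.5, "d"))  # 5.0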
AnalyticProblems – This lookup table contains codes and descriptions for problems encountered in transporting the samples or during the analysis. These flags do not generally come with the electronic deliverable of data from the laboratory, but must be taken from the case narrative. This table is a reference table for data in Analyses, but is not formally related to it.

Field            Type  Sz  Description                  Relationships
ProblemCode      Text  1   Code for analytical problem  Analyses¹
AnalyticProblem  Text  40  Problem with analysis
¹ This relationship is not strictly enforced by the system, since multiple flags are allowed.
Some codes for AnalyticProblems are Exceeds holding time, Percent RPD criteria not met, Exceeds extr. holding time, Cooler above 10°C, Interference, Bottle broke, Resample value, Matrix effect, No problems, Spike not in control lim., Zero headspace not achieved, Quality control problem, Meth. of std. additions, Est. because of interference, Multiple problems, and Unknown.

DataReview – This is a lookup table containing codes and descriptions for the status of data review for each analytical result.

Field           Type  Sz  Description                 Relationships
DataReviewCode  Text  4   Code for data review level  Analyses
DataReview      Text  45  Data review level
Some codes for DataReview are Imported, Vintage data, Data entry checked, Sampler error checked, Laboratory error checked, Consistent with like data, Consistent with previous data, Inhouse validation, and Third-party validation.
Filtered – This lookup table keeps track of sample filtering information.

Field            Type  Sz  Description         Relationships
FilteredCode     Text  4   Code for filter     Analyses
FilteredDescrip  Text  20  Filter description
Some values for Filtered are:

FilteredCode  FilteredDescription
DIS           Dissolved
F1            Field - unknown
F45U          Field 0.45u
L1            Lab unknown
L45U          Lab 0.45u
TOT           Total
TRC           Total Recoverable
Z             Unknown
LeachMethod – This table identifies the leach method separate from the parameter name.

Field            Type  Sz  Description            Relationships
LeachMethodCode  Text  1   Code for leach method  Analyses
LeachMethod      Text  20  Leach method
Some values for LeachMethod are SPLP, TCLP, None, and Unknown.

Parameters – This table contains the codes, names, and other information on the analytes of interest for this database. One table is used for all sites. There is one entry in this table for each analyte. The use of different analytical methods for a particular parameter on different projects or in different parts of the country can be tracked by the AnalyticMethod field in the Analyses table.

Field              Type         Sz  Description                                    Relationships
ParameterNumber    AutoNumber   4   Code for parameter                             Analyses
LongName           Text         60  Long name of parameter
ShortName          Text         10  Short name of parameter
CASNumber          Text         20  CAS number for parameter
AltParamNumber     Text         20  Alternate number for parameter
SumCategoryCode    Text         1   Category for summarizing data                  SumCategories
StatTypeCode       Text         1   How number is to be handled statistically      StatisticalTypes
LabTest            Text         25  Laboratory test method
PrintOrder         Number(Lg)   4   Order for printing on reports
Weight             Number(Sg)   4   Ionic weight for this parameter
ParmUpdateDate_D   Date/Time        Date this parameter was last updated
IonicCharge        Number(Int)  2   Parameter charge, + for cations, - for anions
ParameterTypeCode  Text         1   Foreign key to ParameterType lookup table      ParameterType
Note: Data received in units other than the standard units for the parameter can be converted on import, and the Converted Value flag in the Analyses table set to "yes."
The Parameters table will generally contain hundreds of elements, compounds, physical measurements, and other data items. A few example entries are shown here. See Appendix D for a more comprehensive list.

Parm. #  Parameter Name             Abbrev.   CAS Number  Sum. Cat.   Stat. Type  Test Meth.         Print Order
1045     Pyrene                               129-00-0    Semi-VOAs   Unknown     SW 8270;8100;8275  1390
1050     Pyridine                             110-86-1    VOAs        Unknown     SW 8260;8015       1750
1055     Safrole                              94-59-7     Semi-VOAs   Unknown     SW 8270            1400
1060     Sulfotep                             3689-24-5   Pesticides  Unknown     SW 8141            555
1065     Terphenyl-D14 (surrogate)            1718-51-0   Semi-VOAs   Unknown     SW 8270            1405
1075     Thionazin                  Zinophos  297-97-2    Semi-VOAs   Unknown     SW 8270;8141       1410
1090     4,4'-DDT                             50-29-3     Pesticides  Unknown     SW 8080            460
1095     Aldrin                               309-00-2    Pesticides  Unknown     SW 8080            465
1100     Alpha-BHC                            319-84-6    Pesticides  Unknown     SW 8080            470
1105     Alpha-chlordane                      5103-71-9   Pesticides  Unknown     SW 8080            475
The QCAnalysisCode field in the Analyses table is tied to the QCCode lookup table, which is discussed above under the Samples table lookups.

ReportingUnits – This lookup table for analytical reporting units contains codes and unit names such as "ppm" and "deg C." The database does not store milliequivalents for cations and anions, but this can be calculated from the ionic weight stored in the Parameters table.

Field               Type  Sz  Description               Relationships
ReportingUnitsCode  Text  2   Code for reporting units  Analyses and Stations
ReportingUnits      Text  15  Reporting units
Some codes for ReportingUnits are:

Unit code  Report. Units
C          Deg C
F          Deg F
G          fmsl
In         in
M          ppm
Mk         mg/kg
Ms         ms/cm
O          other
Pc         per cent (%)
Pg         pCi/g
Pl         pCi/l
Ub         um/cm
Uk         ug/kg
Ul         ug/l
Z          unknown
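The milliequivalent calculation mentioned above divides the concentration by the equivalent weight (ionic weight over charge). A sketch, using the Weight and IonicCharge fields stored in the Parameters table (the function name is illustrative):

    def meq_per_liter(mg_per_liter, ionic_weight, ionic_charge):
        # Equivalent weight = ionic weight / |charge|;
        # meq/l = (mg/l) / equivalent weight.
        return mg_per_liter / (ionic_weight / abs(ionic_charge))

    # Calcium (Ca++): ionic weight about 40.08, charge +2.
    print(meq_per_liter(100.0, 40.08, 2))  # ~4.99 meq/l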
RunCode – This lookup table tracks GC run data, if provided.

Field           Type  Sz  Description          Relationships
RunCode         Text  2   Code for run code    Analyses
RunDescription  Text  40  RunCode description
Some values for RunCode are First column result, Second column result, None, and Unknown.

ValidationFlags – This lookup table contains codes and descriptions for problems encountered by a validator reviewing the data after delivery by the laboratory. This table is a reference table for data in Analyses, but is not formally related to it.

Field           Type  Sz  Description                Relationships
ValidationCode  Text  1   Code for validator's flag  Analyses¹
ValidationFlag  Text  55  Validator's flag
¹ This relationship is not strictly enforced by the system, since multiple flags are allowed.
Some codes for ValidationFlags are Anomalous, Est. value, conc. < quan. limit, Tentatively identified compound, Rejected data, Not detected, and None.

ValueCode – This lookup table supplies the reason for multiple analytical values. This field is used in conjunction with the superseded value, and tracks whether an analysis is a dilution, a reanalysis, a re-extraction, etc.
Field             Type  Sz  Description            Relationships
ValueCode         Text  6   Code for value code    Analyses
ValueDescription  Text  40  ValueCode description
Some codes for ValueCode are Dilution, Second dilution run, Not applicable, Re-analyzed, Re-extracted and re-analyzed, and Unknown.

The following tables are related to Parameters:

ParameterType – This table designates different parameter types.

Field              Type  Sz  Description                    Relationships
ParameterTypeCode  Text  1   Code for parameter type        Parameters
ParameterTypeDesc  Text  20  Description of parameter type
Some codes for ParameterType are Non-organic and Organic.

ParameterUnits – This is a lookup table related to the Parameters and SampleMatrix tables. This is where you can set preferred units for different matrices for the same parameter.

Field               Type        Sz  Description                                 Relationships
ParameterUnitsID    AutoNumber  4   System generated ID number
ParameterNumber     Number(Lg)  4   Foreign key to Parameters table             Parameters
SampleMatrixCode    Text        1   Foreign key to SampleMatrix lookup table    SampleMatrix
ReportingUnitsCode  Text        2   Foreign key to ReportingUnits lookup table  ReportingUnits
Some example ParameterUnits entries are:

ParameterUnitsID  ParameterNumber  SampleMatrixCode  ReportingUnitsCode
1                 1                w                 ml
9                 154              w                 in
15                400              w                 pi
SumCategories – This is a lookup table for codes governing how the parameters are to be summarized or grouped.

Field            Type  Sz  Description                      Relationships
SumCategoryCode  Text  1   Code for summarization category  Parameters
SumCategory      Text  20  Summarization category
Some codes for SumCategories are Metals, Inorganics, Radiologic, Herbicides, Pesticides, PCBs, Dioxins, Semi-VOAs, VOAs, Hydrocarbon, RCRA Charac., Field Param., and Other.

StatisticalTypes – This is a lookup table for codes indicating how the number should be treated statistically.

Field            Type  Sz  Description                Relationships
StatTypeCode     Text  1   Code for statistical type  Parameters
StatisticalType  Text  10  Statistical type
Some codes for StatisticalTypes are Regular, Log, Nominal, Ordinal, and Unknown.
REFERENCE TABLES

Reference tables are used by the software to perform various calculations, comparisons, and conversions.

HoldingTimes – This table contains holding times for parameters, entered by summary category or by parameter.
Field             Type        Sz  Description                        Relationships
HoldingTimeID     AutoNumber  4   System generated ID number
ParameterNumber   Number(Lg)  4   Foreign key to Parameters table    Parameters
SampleMatrixCode  Text        1   Foreign key to SampleMatrix table  SampleMatrix
HoldingTime       Number(Sg)  4   Holding time
HoldingUnits      Text        2   Holding time reporting units       ReportingUnits
Some values for HoldingTimes are:

HoldingTimeID  ParameterNumber  SampleMatrixCode  HoldingTime  HoldingUnits
8              142              W                 14           D
9              35               W                 6            M
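A table like this supports an automated holding-time check on import: the elapsed time between sampling and analysis is compared against the stored limit. A sketch of such a check, assuming the D and M codes in the example rows mean days and months (that interpretation, and the function names, are assumptions):

    from datetime import date

    # Approximate day counts for holding-time units; the example rows
    # above use D (days) and M (months).
    UNIT_DAYS = {"D": 1, "M": 30}

    def holding_time_exceeded(sample_date, analysis_date,
                              holding_time, holding_units):
        # True if the analysis happened after the holding time expired.
        limit_days = holding_time * UNIT_DAYS[holding_units]
        return (analysis_date - sample_date).days > limit_days

    # Parameter 142 in water: 14 days (HoldingTimeID 8 above).
    print(holding_time_exceeded(date(1997, 6, 1), date(1997, 6, 20),
                                14, "D"))  # True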
SiteStationAlias – This table holds information about sample number assignment by station.

Field               Type         Sz  Description                                      Relationships
SiteStationID       AutoNumber   4   System generated ID number
SiteNumber          Number(Lg)   4   Site number from sites table                     Sites
StationNumber       Number(Lg)   4   Station number from stations table               Stations
SampleMatrixCode    Text         1   Foreign key to SampleMatrix lookup table         SampleMatrix
ExpSampleDate_D     Date/Time    10  Date of expected sampling, use first day of mo.
SampleNumberPrefix  Text         5   Prefix for FieldSampleID
FieldSampleID       Text         20  Client assigned sampling number
SamplesPerStation   Number(Int)  2   Number of samples per station
ExtraNumbers        Number(Int)  2   Number of extra numbers for QC, blanks, etc.
StartingNumber      Number(Int)  2   Starting number for FieldSampleID
RandomOrder         Text         1   Generate sample numbers in random order, y/n
SampleStatusCode    Text         1   Foreign key to SampleStatus lookup table         SampleStatus
SampleEventID       Number(Lg)   4   Link to SampleEvents table
SampEventStations – This table is used by the SampleEvents table to allow users to create lists of stations and parameters, and assign a date range and a unique name.

Field           Type        Sz  Description                        Relationships
EventStationID  AutoNumber  4   Unique internal key
SampleEventID   Number(Lg)  4   Foreign key to SampleEvents table  SampleEvents
StationNumber   Number(Lg)  4   Foreign key to Stations table      Stations
Some values for SampEventStations are:

EventStationID  SampleEventID  StationNumber
1               7              11
2               8              12
SampEventParams – This table is used by the SampleEvents table for the same purpose as the SampEventStations table above.

Field            Type        Sz  Description                        Relationships
EventParamID     AutoNumber  4   Unique internal key
SampleEventID    Number(Lg)  4   Foreign key to SampleEvents table  SampleEvents
ParameterNumber  Number(Lg)  4   Foreign key to Parameters table    Parameters
PrintOrder       Number(Lg)  4   Order for printing on reports
Some values for SampEventParams are:

EventParamID  SampleEventID  Parm. #  Print Order
1             7              10       1
2             7              25       2
3             8              35       3
SampleEvents – This table allows you to assign wells and parameters to a sampling event that is identified by name, and by start date and end date.

Field              Type        Sz   Description                                Relationships
SampleEventName    Text        50   Name of sample event
EventStartDate_D   Date/Time   10   First date and time of event
EventEndDate_D     Date/Time   10   Last date and time of event
SampleMatrixCode   Text        1    Foreign key to SampleMatrix lookup table   SampleMatrix
SamplePurposeCode  Text        2    Foreign key to SamplePurpose lookup table  SamplePurpose
EventTask          Text        50   Administrative task for this event
SampleEventID      AutoNumber  4    System generated ID number
EventDescription   Text        255  Description of sample event
An example for SampleEvents is:

SampleEventName   EventStartDate_D  EventEndDate_D  SampleMatrixCode  SamplePurposeCode
Rad Ind 86 Water  1/1/86            12/31/86        Water             Due diligence
StationParameters – This table lets you define a standard list of parameters to display for each station.

Field            Type        Sz  Description                        Relationships
StationParamID   AutoNumber  4   System generated ID number
StationNumber    Number(Lg)  4   Foreign key to the Stations table  Stations
ParameterNumber  Number(Lg)  4   Foreign key to Parameters table    Parameters
Task             Text        50  Task to use this list for
PrintOrder       Number(Lg)  4   Order for printing reports
Some example entries for StationParameters are:

StationParamID  StationNumber  ParameterNumber  Task  PrintOrder
1               MW-1           Arsenic          0012  1
2               MW-1           Calcium          0012  2
ValidDetectLimit – This table stores project required detection limits.

Field                  Type        Sz  Description                                  Relationships
ValidDetectID          AutoNumber  4   System generated ID number
SiteNumber             Number(Lg)  4   Foreign key to Sites table                   Sites
ParameterNumber        Number(Lg)  4   Foreign key to Parameters table              Parameters
SampleMatrixCode       Text        1   Foreign key to SampleMatrix table            SampleMatrix
ProjectDetectionLimit  Number(Sg)  4   Project detection limit used for validation
ProjectLimitTypes      Text        4   Project detection limit type
ProjectLimitUnits      Text        2   Project detection limit units                ReportingUnits
Some values for ValidDetectLimit are:

ValidDetectID  SiteNumber  ParameterNumber  SampleMatrixCode  ProjectDetectionLimit  ProjectLimitType
1              1           15               W                 10                     z
2              2           35               w                 50                     z
RPDControlLimits – Data validation table to store data validation limits.

Field             Type         Sz  Description                               Relationships
RPDControlID      AutoNumber   4   System generated ID number
SiteNumber        Number(Lg)   4   Foreign key to Sites table                Sites
SampleMatrixCode  Text         1   Foreign key to SampleMatrix lookup table  SampleMatrix
Frequency         Number(Int)  2   Frequency of QC sample
Multiplier        Number(Int)  2   Multiplier
WeightingFactor   Number(Int)  2   Weighting factor
UpperRcvRate      Number(Sg)   4   Upper recoverable rate as % limits
LowerRcvRate      Number(Sg)   4   Lower recoverable rate as % limits
KnownValue        Number(Sg)   4   Known value
QCCode            Text         3   Foreign key to QCCode lookup table        QCCode
RPDLimit          Number(Int)      RPD limit
An example entry in RPDControlLimits is:

RPDControlID  SiteNumber     SampleMatrixCode  Frequency  Multiplier
1             Refining Inc.  Air               4          2
UTILITY TABLES

The following tables are not related to any of the primary tables, but are used by the software for various purposes.

ActivityLog – This table tracks activities which may cause changes to the data in the database.

Field                Type        Sz  Description                     Relationships
ID                   AutoNumber  4   System-assigned ID number
UserName             Text        20  Name of user making the change
ActivityDate         Date/Time   8   Date of activity
SelectedSite         Text        50  Site selected on main menu
ActivityDescription  Memo        -   Description of activity
Here are some typical entries in this table:

User Name  Act. Date  Selected Site         Description
drdave     12/8/97    Forest Products, Co.  Tested
drdave     12/8/97    Rad Industries        Updated review level for MW-1 for 1993
cwoertman  12/9/97    Refining, Inc.        Edited station types
rwendell   12/9/97    Refining, Inc.        Looked at parameters
Control – This table is for storage of database-related (as opposed to station- and sample-related) information. It is used primarily by programmers rather than users. There should be no need to manipulate this table, but data administrators may do so if necessary.

Field           Type        Sz  Description                         Relationships
ControlName     Text        18  Name of control value to be stored
ControlText     Text        50  Text value
ControlMemo     Memo        -   Memo info.
ControlInteger  Number(Lg)  4   Integer value
ControlReal     Number(Db)  8   Real number value
ControlDate_D   Date/Time   8   Date value
ParameterAliases – This table is used to provide alternative names for parameters so that different spellings required for regulatory reasons can be accommodated. It is not intended to make up for errors in parameter spelling.

Field            Type         Sz  Description                              Relationships
AliasNumber      AutoNumber   4   Unique alias number generated by system
SiteNumber       Number(Lg)   4   External key to Sites table              Sites
ParameterNumber  Number(Lg)   4   External key to Parameters table         Parameters
Alias            Text         60  Alias name
PreferredAlias   Number(Int)  2   Preferred alias identifier
Examples of data for this table include:

AliasNumber  SiteNumber  ParameterNumber  Alias                   PreferredAlias
1            1           695              1,2-Benzanthracene      0
7            3           745              Benzyl butyl phthalate  0
8            3           1465             Chlorodibromomethane    0
14           4           1575             Trichloroethylene       0
UnitConversion – This table contains factors for automatic conversion of units.

Field         Type        Sz      Description                     Relationships
Input Unit    Text        2       Unit conversion input
Output Unit   Text        2       Unit conversion result
Factor Add    Number      Single  Add factor for conversion
Factor Mult   Number      Single  Multiply factor for conversion
Index Number  AutoNumber  Long    Index number for SQL Server
Some values for UnitConversion are:

Input Units  Output Units  Additive Factor  Multiplier Factor
Deg C        Deg F         17.78            1.8
Deg F        Deg C         -32              0.556
ft           in            0                12
in           ft            0                0.083
mg/kg        ug/kg         0                1000
mg/l         ug/l          0                1000
us/kg        um/cm         0                1000
ms/cm        uS/cm         0                1000
um/cm        ms/cm         0                0.001
ug/kg        mg/kg         0                0.001
ug/l         mg/l          0                0.001
us/cm        ms/cm         0                0.001
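Judging from the Deg F to Deg C row, the two factors appear to be applied as output = (input + additive) × multiplier. A small sketch of the conversion logic under that assumption (the dictionary holding the factors is illustrative; in the EDMS they would come from the UnitConversion table):

    # (additive factor, multiplier factor) keyed by (input, output) units,
    # as in the UnitConversion table above.
    CONVERSIONS = {
        ("Deg F", "Deg C"): (-32.0, 0.556),
        ("mg/kg", "ug/kg"): (0.0, 1000.0),
        ("ug/l", "mg/l"): (0.0, 0.001),
    }

    def convert(value, in_unit, out_unit):
        # Apply output = (value + additive) * multiplier.
        add, mult = CONVERSIONS[(in_unit, out_unit)]
        return (value + add) * mult

    print(convert(212.0, "Deg F", "Deg C"))  # ~100.1 (0.556 is rounded)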
RELATIONSHIPS

The relationships between the tables are shown in the entity-relationship diagram in Figure 160. In this figure, the relationships between tables are shown with lines connecting corresponding fields, with "one to many" symbols of "1" and "∞," respectively, to indicate either "parent-child" or lookup relationships between the tables. For instance, for each (1) SampleNumber in the Samples table, there will be many (∞) records for this SampleNumber in the Analyses table, and for each (1) StationTypeCode listed in the lookup table StationType, there may be many (∞) records with the StationTypeCode listed in the primary table Stations. This diagram is simplified to fit on the page, and does not contain all of the tables in the data model.
Figure 160 - Entity-relationship diagram
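The one-to-many relationships in the diagram are what queries traverse at reporting time. A minimal, self-contained sketch using Python's built-in SQLite support (the table and column names follow the data model, but the rows are invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE Samples (SampleNumber INTEGER PRIMARY KEY, SampleDate TEXT);
    CREATE TABLE Analyses (
        AnalysisID   INTEGER PRIMARY KEY,
        SampleNumber INTEGER REFERENCES Samples(SampleNumber),  -- "many" side
        Value        REAL);
    INSERT INTO Samples VALUES (1, '1997-06-01');
    INSERT INTO Analyses VALUES (10, 1, 0.005), (11, 1, 12.0);
    """)

    # One Samples record (1) joins to many Analyses records (infinity).
    for row in con.execute(
            "SELECT s.SampleNumber, s.SampleDate, a.Value "
            "FROM Samples s JOIN Analyses a "
            "ON a.SampleNumber = s.SampleNumber"):
        print(row)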
APPENDIX C DATA TRANSFER STANDARD
One of the biggest challenges in implementing an EDMS is getting the data flowing smoothly from the creator of the data, such as a laboratory, to the data users. The Data Transfer Standard (DTS) in this appendix is intended to fully describe a format for an electronic data deliverable (EDD) that should help significantly with this data flow.
PURPOSE

The efficient management of projects requires the use of a wide variety of different types of data. It is not the intention of this data transfer standard to limit the types of data used in projects. Rather, it is intended to facilitate the transfer of data by providing a well-defined format for data delivery. This format is intended to be flexible enough to accommodate the majority of the analytical and other technical evaluation and monitoring data for projects. There will almost certainly be data that will not fit into this standard. In that case, the organization supplying that data should contact the project manager to discuss how data transfer can be accommodated. The outline for this dialogue is contained in a section below entitled Non-Conforming Data.

Creators of digital data use a wide variety of tools in the process of generating their data files. These tools include dedicated laboratory information systems, word processors and spreadsheets, sophisticated relational data management systems, and integrated database and mapping programs. It is not intended to dictate the tools to be used by the data creator. However, whenever possible, data should be transferred in one of the standard formats to simplify the data transfer. A primary design goal of these standard formats is that files in one of these formats can be created relatively easily using software tools available to those creating the data. If data providers anticipate additional costs for providing data in one of the formats presented here, they must provide estimates of these additional costs to their project manager prior to finalization of contract terms, so that this information can be used in the vendor selection process.
DATABASE BACKGROUND INFORMATION

This DTS addresses data generated as part of the site investigation and remediation process. Data of concern for this standard includes Sites (facilities or projects), Stations (observation points), Samples (individual observation events), and Analyses (specific individual values from an event).
Spreadsheet users think of a data table as being made up of rows (records) and columns (fields). In the EDMS database, sites, stations, samples, and analyses are stored in separate Tables. Within the tables, Fields are defined that will hold the Records provided by the data provider. In the EDMS database, a specific data item can be referred to by naming its table and field separated by a period. An example would be Samples.SampleDate, which would refer to the SampleDate field in the Samples table.

A Table contains data for particular physical entities such as samples. Each Record in the table represents a particular instance of that entity, such as a particular sample. The Fields in the table represent different data items for that entity. The data being transmitted in one of the formats of this standard will be placed in two tables in the EDMS. These tables are Samples and Analyses. Some of the entries in these tables must have values that match those in other tables, called lookup tables. Information on how to match these values is included below, and typical coded entries are listed in Appendix B. Note that for the lookup data, in some cases it is the value that is reported and in others the code, based on common industry practice.
DATA CONTENT

This section describes the content of the data being transmitted. The following section covers the format of that data. The content is the same for all three formats. In this document the content is organized by the target table in the database. In the text file and spreadsheet formats, all of the content is in one structure. In the database format, the content is separated into three tables. In the following descriptions, fields are described as "Optional" or "Required." These denote program requirements. Clients should instruct the laboratories if any of the fields considered "Optional" for the EDMS are required for a given project.
General comments on data content

The data provider should report all the data it is contracted to report. Other data elements currently being supplied in electronic format for existing projects, but not included in this standard, should be included in fields following these designated fields. This data will be ignored during the import process.

This standard supports import of duplicate sample and reanalyzed analytical data into the database. Indicate the preferred sample and analysis by entering a 0 in the corresponding DuplicateSample and Superseded fields, respectively. If more than one duplicate sample is being reported, increment the DuplicateSample field, i.e., 0, 1, 2, etc., and enter the appropriate QCSampleCode (see Appendix B). If more than one analysis is being reported, increment the Superseded field, i.e., 0, 1, 2, etc., and enter the appropriate code in the ValueCode field to designate reanalyzed, dilution, reextracted, etc. Important: These are two different things. The DuplicateSample field is used when more than one physical sample is taken in the field from the same station on the same date. The Superseded field is used when more than one result is reported for the same parameter for the same physical sample. All dates should include four-character years.
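In other words, DuplicateSample distinguishes physical samples, while Superseded distinguishes reported results for one physical sample; selecting the preferred data is then a filter on zeros in both fields. A sketch of that selection (the record layout is illustrative):

    # Each analysis row carries both markers; 0 means "preferred."
    rows = [
        {"station": "MW-1", "duplicate_sample": 0, "superseded": 1, "value": 4.1},
        {"station": "MW-1", "duplicate_sample": 0, "superseded": 0, "value": 4.3},
        {"station": "MW-1", "duplicate_sample": 1, "superseded": 0, "value": 4.2},
    ]

    # Preferred data for reporting: the preferred physical sample (0) and
    # the current, non-superseded result (0).
    preferred = [r for r in rows
                 if r["duplicate_sample"] == 0 and r["superseded"] == 0]
    print(preferred)  # the single 4.3 result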
Data elements

The following sections describe the data elements for the electronic deliverable. They are organized by the table in the EDMS in which the data will be placed.
SITES AND STATIONS

SiteName – The name of the site (project, facility, etc.) from which the samples were taken. This field is required, and must match a site in the Sites table in the EDMS. This will be converted to a site identification number in the Stations table during the import. Required.

StationName – The name of the well, boring, etc. from which the sample was taken. The entry must match a station name in the Stations table in the EDMS for the site name provided. It is also converted to an identification number on import. Required.

A Station is a location of observation. Examples of stations include soil borings, monitoring wells, surface water monitoring stations, soil and stream sediment sample locations, air monitoring stations, and weather stations. A station can be a location that is persistent, such as a monitoring well that is sampled regularly, or can be the location of a single sampling event. For stations that are sampled at different elevations (such as a soil boring), the location of the station is the surface location for the boring, and the elevation or depth component is part of the sampling event record.
SAMPLES

A Sample is a unique sampling event for a station. Each station can be sampled at various depths (such as with a soil boring), at various dates (such as with a monitoring well), or, less commonly, both.

SampleDate_D – The date on which the sample was taken. Required.

SampleMatrix – The material of the sample. Provide the full Sample Matrix name, such as "Water," which must match an entry in the SampleMatrix table. Required.

SampleTop and SampleBottom – Soil sample depths or elevations, as instructed by the client. The fields should contain only numeric values. If these fields are not applicable (i.e., water samples) or are unknown to the laboratory, then they should be populated with zeros, for compatibility with ODBC databases. Required.

DepthUnits – Units for sample top and sample bottom. This is a coded field that is linked to the ReportingUnits lookup table. If this information is unavailable to the data provider, "Unknown" should be reported. Required.

DuplicateSample – This field was discussed previously. It should be a zero unless this is a duplicate sample. All analyses must have an entry for this field, with multiple QC samples entered as values incremented from one. Required.

FieldSampleID – The client-assigned field ID number for each sample. Optional.

LabSampleID – The sample identification number used internally by the laboratory. Optional.

AltSampleID – Another sample identification number if needed. Optional.

CoolerID – Number to identify the cooler in which the primary samples and QC samples were shipped. Optional.

Sampler – Person taking the sample. Optional.

Description – Description of the sample, such as its condition. Optional.

COCNumber – Chain of custody tracking number. Optional.

DeliveryGroup – Sample delivery group. This field is provided for use as a lab tracking field. It could be used to define a group of parameters. Optional.

FilteredSample – Filter information at the sample level. Was the sample filtered, and if so, what size filter was used? It could also be used to identify whether the filtering occurred in the field or the lab. Entries are compared to the Filtered lookup table in the database. The lab can supply either the code or the Filter description, whichever is most consistent with its system (i.e., TOT vs. total), but must coordinate this with the client. Required.

QCSequenceID – QC sequence identifier. This field is another lab tracking field, used to relate field samples to lab samples. Optional.

QCSampleCode – Code to identify QC samples. It ties to the QCCodes table, which contains codes for both the sample and analysis levels. The lab should supply the code if available, e.g., DUP for duplicate sample, or O for original sample. If this information is not available to the lab, enter "z" for Unknown. Required.

TaskNumber – The administrative task number under which sampling is done. Optional.

PrimarySample – Stores the FieldSampleID of the primary sample to which the QC sample is tied. This field is blank for original samples, and may be blank for field QC samples that have been submitted blind to the lab. A data administrator can enter this number into the import table. The import routine converts this to the sample number of the primary sample before storing it in the database. Optional.

SampleResult – The result of the sampling process, such as "successful," "dry," "no access." Its primary use is to indicate that obtaining a sample was attempted unsuccessfully. If not available from the lab, this field can be entered into the import table by a data administrator. Optional. If a sample was attempted unsuccessfully, the sample fields should be filled in; however, all fields associated with analyses, including parameter name, CASNumber, and AltParamNumber, should be left blank. The system will then only attempt to import the sample information and will not try to create an analysis record.
ANALYSES

An Analysis, as used in this document, is the observed value of a parameter related to a sample. This term is intended to be interpreted broadly, and not to be limited to chemical analyses. For example, field parameters such as pH, temperature, and turbidity also are considered analyses.

ParameterName, CASNumber, AltParamNumber – Various combinations of these fields are used to identify the correct parameter. ParameterName should always be provided. The system compares the ParameterName to entries in the Parameters and ParameterAlias lookup tables, and assigns a parameter if a match is found. CASNumber and AltParamNumber are not required, but should be provided if possible to help ensure the correct parameter name assignment. If the ParameterName does not match a lookup entry, the system compares either the CASNumber or the AltParamNumber (frequently used for STORET codes) to Parameter table entries. Care should be taken that consistent numbers be provided. If ParameterName is left blank or a match is not found, but a CASNumber or AltParamNumber is provided that does match, the system assigns a parameter from the Parameters table based on the match. Using only CASNumber or AltParamNumber and not a ParameterName to designate the parameter is not recommended, since the program does not request confirmation of the parameter that is assigned.

Superseded – This field was discussed above. It should be a zero unless the analysis is superseded by a later value in the same file, in which case the entry should be 1 or higher, depending on the number of values. This field is used in conjunction with the ValueCode field, discussed later in this section. All analyses should have an entry. Required.

AnalyticMethod – Method used to perform the analysis. Optional.

Value – Measured result of the analysis. Optional, but should almost always be provided. For laboratory control spike and matrix spike samples, the results should be reported in percent recovery, with the units in %. Moisture content should be reported as a separate analytical record, with the units in %. They should be entered on a "by weight" basis, based on total weight.

ReportingUnits – Units of the analysis. The entry provided should be the full abbreviation, such as "mg/l." Entries must match an entry in the ReportingUnits lookup table in the database. Detection limits and radiologic error must be reported in the same units as the value. Required.

FlagCode – One to four coded entries for the analytical flag describing the analysis. Each character in the field must match an entry in the AnalyticFlags lookup table in the database. More than one flag can be entered. For example, if "b" (detected in blank) and "j" (estimated value) are both entered in the lookup table, then "bj" can be entered as an analytical flag (estimated value, detected in blank). If the analysis is considered a usable value, and would not otherwise have a flag, this field should contain the code for detected value (usually a "v"). If the flag is unknown, the field should contain a "z." Required.

ProblemCode – Analytical problems are usually described in the narrative, and not included in the electronic format. If this field data is not provided, the field should contain a "z" for unknown. If the data provider chooses to supply problems in the electronic file, then the codes must match entries in the AnalyticProblems table. As with the FlagCode field, the entry can contain from one to four approved codes. Required.

ValidationCode – One to four flags associated with validation of analyses. The data validation organization usually provides this field, which can contain from one to four of these codes, which must match entries in the ValidationFlags table. Others should place a "z" for Unknown in this field. Required.

DetectedResult – Supplied by the lab, this field should contain either "y" for yes, the analyte was detected or "n" for no, the analyte was not detected. This field overlaps slightly with FlagCode. The purpose of this field is to separate the non-detect flag from other lab qualifiers, such as "j" or "b," for statistical, evaluation, and validation purposes. Optional.

Detect – Detection limit for the analysis. Detection limits must be reported in the same units as the value. Optional.

LimitType – Type of limit contained in the Detect field, such as "MDL," "PQL," "RL," etc. Optional.

Detect2 – A second detection limit. Standards should be set for which type of limit should be entered in each field for a given site, for example: IDL or MDL in the first column, CRDL or PQL in the second. Optional.

LimitType2 – Limit type for Detect2. Optional.

Error – Standard error for radioactivity measurements. Optional.

DilutionFactor – Amount that the sample was diluted prior to analysis. Optional.

Basis – Analyzed wet or dry. Should be "w" for wet or "d" for dry. Can also report "n" for not applicable, or "z" for unknown. Required.

FilteredAnalysis – Filter or measured basis information at the analysis level. Entries are compared to the Filtered lookup table in the database. As with the FilteredSample field, the lab can supply either the code or the description for this field. Required.

LeachMethod – Method used to leach the sample, if any. Entries are compared to the LeachMethod lookup table to maintain consistency. The data provider should supply the full name of the method, e.g., TCLP. If the analysis was not leached, "None" should be reported. Required.

PrepMethod – Method used to prepare sample separate from leaching. Optional.

ReportableResults – Flag for whether the result is to be used in reports. Report "Y" for yes, or "N" for no. Reported by the data provider or selected by the project manager for multiple analyses from a selected sample, such as analyses at multiple dilutions. Optional.

AnalDate_D – Date on which the analysis was performed. Optional.

ExtractDate_D – Date on which the material was extracted for analysis. Optional.

LabReportDate_D – Date on which the lab reported the analysis. Optional.

Lab – Name of the laboratory performing the analysis, or other data provider. Optional.

LabComments – Lab comments about this analysis. Optional.

AnalysisLabID – Lab identification number at the analysis level. LabSampleID tracks lab analyses at the sample level. This field is for identification numbers at the analysis level. Optional.

AnalyticalBatch – Lab batch identification number. Optional.
ValueCode – Parameter value classification. This field identifies the analytical trial, and supplies the reason for a superseded analysis. It is a coded entry enforced by the ValueCode lookup table. The lab should report the code, such as “RE” for re-extracted, “DL” for dilution, etc., or “O” for original analysis. Required.
RunCode – Confirmation run identification. This is a coded entry enforced by the RunCode lookup table. The lab should supply the code, such as "PR" for primary run, "n" for not applicable, or "z" for Unknown. Required.

QCAnalysisCode – QC code at the analysis level. It ties to the QCCodes table, which contains codes for both the sample and analysis levels. The lab should supply the code for this field, such as "TIC" for tentatively identified compound or "O" for original analysis. Required.
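The parameter identification rules described under ParameterName above amount to a lookup cascade: exact name, then alias, then CAS or alternate number. A sketch of that cascade (the dictionary layouts are illustrative, not the EDMS's actual structures):

    def match_parameter(name, cas, alt, parameters, aliases):
        # parameters: {ParameterNumber: {"name": ..., "cas": ..., "alt": ...}}
        # aliases:    {alias name: ParameterNumber}, from ParameterAliases
        for num, p in parameters.items():
            if name and p["name"] == name:
                return num
        if name and name in aliases:
            return aliases[name]
        for num, p in parameters.items():  # fall back to CAS / alt number
            if (cas and p["cas"] == cas) or (alt and p["alt"] == alt):
                return num
        return None  # no match; the row needs manual attention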
ACCEPTABLE FILE FORMATS

This DTS supports three file formats when receiving laboratory data for inclusion in the database. These are a flat ASCII file format, an Excel spreadsheet format, and an Access relational format. Only the ASCII format is described here. The other formats contain the same content, but with a different file format.
Flat ASCII file format

The simplest format for data delivery under this standard is a flat ASCII file with tab delimiters. The file must contain specific data elements as described above (Data Content) in the particular order described below. All modern word processors, spreadsheets, and database manager programs can save data in this format without special programming. There are three components to a text file: encoding, structure, and content. Each of these components is described in the following sections.
ENCODING

ASCII (American Standard Code for Information Interchange, pronounced "as′-kee") is a character-encoding scheme that allows letters, numbers, punctuation, and other characters to be stored in computer files. All modern computer systems can accommodate this format, either directly (personal computers, workstations, and some minicomputers) or via software to transfer from their native format (usually EBCDIC; some minicomputers and mainframes). The first seven bits (128 characters) of this eight-bit code are well defined and are platform-independent. The standard supports ASCII files using this "low bit" character set if the file contains the data elements as described in the following paragraphs. In most cases, if the "Save as ASCII" or "Save as Text" option is used in saving the file, it will be saved with the proper encoding.
STRUCTURE

The file should have each observation on a line in the file followed by a line delimiter (sometimes called a paragraph mark, ASCII 13 followed by ASCII 10). Within each line, the file should have each data element (which corresponds to a field in a database manager or a cell in a spreadsheet) in the order specified below. Each data element should be separated by an ASCII Tab character (09). A text data element can be shorter than the specified length but not longer.
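Under these rules, a deliverable line is simply the field values joined by tab characters and terminated with CR+LF. A sketch of a writer that follows the structure rules (the field values are invented):

    # One observation per line; fields separated by tab (ASCII 09);
    # lines ended with ASCII 13 followed by ASCII 10.
    fields = ["Rad Industries", "MW-1", "03/15/1997", "Water"]
    line = "\t".join(fields) + "\r\n"

    # newline="" keeps Python from translating the explicit CR+LF.
    with open("edd.txt", "w", encoding="ascii", newline="") as f:
        f.write(line)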
CONTENT

The ASCII text file must have the following columns present in the order shown, and the fields marked as required must be populated. The file should have the first line in the file be the first line of data. The file should not have the field names in the first record.

Field Name              Data Type     Record Size⁵  Description                                Table⁸
SiteName¹ (req)         Text          50            Site name                                  Sites
StationName (req)       Text          25            Station identifier or name                 Stations
SampleDate_D (req)      Date/Time     10            Date sample was taken                      Samples
SampleMatrix (req)      Text          15            Sample matrix                              Samples
SampleTop² (req)        Number(Sg)³                 Sample top                                 Samples
SampleBottom (req)      Number(Sg)                  Sample bottom                              Samples
DepthUnits (req)        Text          15            Units for sample top and sample bottom     Samples
DuplicateSample (req)   Number(Int)⁴                Duplicate sample number⁶                   Samples
FieldSampleID           Text          20            Client assigned field sample identifier    Samples
LabSampleID             Text          20            Lab sample identifier                      Samples
AltSampleID             Text          20            Alternate sample identifier                Samples
CoolerID                Text          20            Cooler identifier number - for QA/QC       Samples
Sampler                 Text          50            Name of person taking sample               Samples
Description             Text          25            Sample description                         Samples
COCNumber               Text          20            Chain of custody number                    Samples
DeliveryGroup           Text          10            Sample delivery group                      Samples
FilteredSample (req)    Text          20            Filter size                                Samples
QCSequenceID            Text          15            QC sequence identifier                     Samples
QCSampleCode (req)      Text          4             QC code for this sample                    Samples
TaskNumber              Text          20            Task number under which sampling is done   Samples
PrimarySample           Text          20            Primary sample to which QC sample is tied  Samples
SampleResult            Text          255           Result of attempted sampling               Samples
ParameterName           Text          60            Name of material analyzed for              Analyses
CASNumber               Text          20            CAS number of material analyzed for        Analyses
AltParamNumber          Text          20            Alternative number for parameter           Analyses
Superseded (req)        Number(Int)                 Analysis superseded by re-analysis?⁷       Analyses
AnalyticMethod          Text          25            Method for performing analysis             Analyses
Value                   Number(Sg)                  Value measured during analysis             Analyses
ReportingUnits (req)    Text          15            Units of the analysis                      Analyses
FlagCode (req)          Text          4             Data qualifier                             Analyses
ProblemCode (req)       Text          4             Problems encountered during analysis       Analyses
ValidationCode (req)    Text          4             Code from data validation                  Analyses
DetectedResult          Text          1             Was analyte detected?                      Analyses
Detect                  Number(Sg)                  Detection limit                            Analyses
LimitType               Text          4             Detection limit type                       Analyses
Detect2                 Number(Sg)                  2nd detection limit                        Analyses
LimitType2              Text          4             2nd detection limit type                   Analyses
Error                   Number(Sg)                  Error range for this analysis              Analyses
DilutionFactor          Number(Sg)                  Dilution factor                            Analyses
Basis (req)             Text          1             Analyzed wet or dry?                       Analyses
FilteredAnalysis (req)  Text          20            Filter/measure basis at analytical level   Analyses
LeachMethod (req)       Text          20            Leaching method                            Analyses
PrepMethod              Text          20            Lab preparation method                     Analyses
ReportableResult        Text          1             Designates analysis as reportable result   Analyses
AnalDate_D              Date/Time     10            Date the analysis was performed            Analyses
ExtractDate_D           Date/Time     10            Date the extraction was performed          Analyses
LabReportDate_D         Date/Time     10            Lab analysis reporting date                Analyses
Lab                     Text          10            Name of lab conducting analysis            Analyses
LabComments             Text          50            Lab comments about this analysis           Analyses
AnalysisLabID           Text          20            Lab identification number for analysis     Analyses
AnalyticalBatch         Text          40            Lab batch ID number                        Analyses
ValueCode (req)         Text          6             Differentiates between different results   Analyses
RunCode (req)           Text          5             Run code for GC analyses                   Analyses
QCAnalysisCode (req)    Text          4             QC code for this analysis                  Analyses

¹ Fields marked (req) are required fields. The others may be blank.
² SampleTop and SampleBottom are required. Numbers for depth or elevation should be entered for soil analyses. They should be zero if not applicable.
³ (Sg) A numeric data type that holds single-precision floating point numbers in IEEE format. A Single variable is stored as a 32-bit (4-byte) number that can be reported with up to 7 significant figures.
⁴ (Int) A number ranging from -32,768 to 32,767.
⁵ Character width for text fields. Does not apply directly to numbers.
⁶ Numbered values for duplicate samples, with 0 for preferred sample, increasing by one for each additional value. You must fill in all duplicates or none (in which case the system will assign them based on QC codes).
⁷ Numbered values for superseded analyses, with 0 for current analysis, increasing by one for each older value. You must fill in all duplicates or none (in which case the system will assign them based on QC codes).
⁸ Database table to receive data, either directly or after converting using a lookup table.
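Because the column order is fixed and the required columns are known, an EDD line can be checked mechanically before import. A sketch of such a pre-import check, using a few of the column positions from the table above (the validator itself is not part of the standard):

    # 0-based positions of a few required fields in the fixed column order.
    REQUIRED = {0: "SiteName", 1: "StationName", 2: "SampleDate_D",
                3: "SampleMatrix", 6: "DepthUnits", 7: "DuplicateSample"}
    NUM_COLUMNS = 50  # total fields in the standard

    def check_line(line):
        # Return a list of problems found in one tab-delimited EDD line.
        values = line.rstrip("\r\n").split("\t")
        problems = []
        if len(values) < NUM_COLUMNS:
            problems.append("expected %d columns, got %d"
                            % (NUM_COLUMNS, len(values)))
        for pos, name in REQUIRED.items():
            if pos >= len(values) or not values[pos].strip():
                problems.append("required field %s (column %d) is empty"
                                % (name, pos + 1))
        return problems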
SUBMITTAL REQUIREMENTS

File names

Files submitted for import into the EDMS should be given names that describe the contents and format of the file. The name should include a site name, supplied by the project manager or the consultant, and the date the file is issued. In keeping with the DOS/Windows tradition of using a three-character file extension to describe the file type, we request that the following extensions be used for the three supported file formats:

File Type          Extension
Flat ASCII Files   .txt
Spreadsheet Files  .xls
Database Files     .mdb
Files created with Windows 95, 98, ME, NT, 2000 or XP can use descriptive long file names such as “Rad Industries Sampling March 1997.mdb.” Files created on other systems must limit the file name to eight characters plus the extension. When the data is submitted, documentation about the data content and format of each file should accompany the submitted file, ideally on the disk label or the email accompanying the file.
Delivery media and formats

The client is prepared to receive data in a variety of media and standard formats, and these formats can be expected to change and evolve over time. Submitters should communicate with their project manager prior to delivering data about the best format for the type and volume of data to be delivered. At a minimum, data will be accepted in these media and formats:

• 1.44 megabyte floppy disks in DOS/Windows format. Data that will not fit on one diskette can be compressed and, if necessary, split onto more than one diskette using PKZip 2.04G or compatible software as a file with an extension of .zip containing a file with one of the above formats and extensions. WinZip can be used to create the zip file, as long as the file format is compatible with PKZip 2.04G. All versions of WinZip at the time of this writing (up to version 7.0) create the correct format.
• CD-ROM in ISO 9660 or compatible format.
• Iomega ZIP Drive.
• Delivery via electronic mail, compressed or uncompressed, is acceptable, subject to approval by the project manager.
Consistency of content

It is very important that data submitters be consistent with the data that they submit. Data elements must be entered exactly the same way from submittal to submittal. For example, if a well was called "MW-1" in a previous submittal, then it must be called "MW-1" in all subsequent submittals, not "MW 1" or "Mw-01." Data items such as station names are used to associate the data from the current submittal with data previously submitted. If the spelling is changed, the association will not be successful. In this example, if the laboratory or consultant comes to the conclusion that the sampler may have inadvertently misnamed a well (e.g., Mw-01 or MW 1 instead of MW-1), the laboratory or contractor should contact the sampler and correct the data before submitting the data set.

Another example of consistency of content is the spelling of chemical analytical compounds (parameter names). Data elements must be entered exactly the same way from submittal to submittal. If the spelling is changed without instructions from or notification to the client, the association on import will not be successful. A standardized list of parameter names should be provided to laboratories that supply data to the client, and these are the names that should be used. This system is also designed to promote consistency between the different labs and projects; however, if for project reasons the names cannot be kept consistent, the client has the ability to alias parameter names. This list can also be supplied to the laboratories.
Coded entries

In order to foster consistency in the database, a number of data elements in the database tables are Coded. This means that each of these data items must contain one of a list of values. Examples of coded entries that are supplied by the laboratory include Analyses.ProblemCode, Analyses.FlagCode, and Analyses.ValidationCode. These codes describe problems encountered during the analysis, the data qualifier, and the validation data qualifier, respectively. There are a limited number of analytical problems and flags describing an analysis, so codes are used to represent each choice. Lists of the codes to be used are attached in Appendix B, but this information can be expected to change over time.
Closed-loop reference file system

Data providers are encouraged to use this DTS in conjunction with a closed-loop reference file system. Using this approach, the client prepares and sends a reference file to the data provider that contains the sites, stations, and lookup table information for the data provider to check the EDD for consistency prior to issuing it to the client. This approach has been found to be a great time-saver for both the data provider and the client because it minimizes errors in the EDDs and significantly decreases the need to reissue files.
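A reference-file check can be as simple as set-membership tests on the named and coded columns before the EDD is issued. A sketch of such a pre-issue consistency check (the reference-file layout shown is an assumption):

    def check_edd(edd_rows, reference):
        # edd_rows:  list of dicts with "station" and "parameter" keys.
        # reference: dict of sets from the client's reference file.
        errors = []
        for i, row in enumerate(edd_rows, start=1):
            if row["station"] not in reference["stations"]:
                errors.append("row %d: unknown station %r" % (i, row["station"]))
            if row["parameter"] not in reference["parameters"]:
                errors.append("row %d: unknown parameter %r" % (i, row["parameter"]))
        return errors

    ref = {"stations": {"MW-1", "MW-2"}, "parameters": {"Pyrene", "Aldrin"}}
    print(check_edd([{"station": "MW 1", "parameter": "Pyrene"}], ref))
    # ["row 1: unknown station 'MW 1'"] -- caught before the file is issued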
NON-CONFORMING DATA

The purpose of this DTS is to facilitate the accurate transfer of data by providing a standard format for data delivery. This format should be flexible enough to accommodate the majority of the analytical and other data for most projects. However, over time there may be data that will not fit into this standard. In that case, the organization providing the data should contact its project manager to begin a dialogue about how that data can be accommodated. The outline for this dialogue is contained in this section. When data is identified which does not appear to easily conform to this DTS, there is a four-step process that should be followed to determine how to handle that data:

1. Determine whether the data is really non-conforming. This DTS was designed to accommodate a wide variety of different types of site analytical and other data. Someone knowledgeable about the data to be transferred and someone knowledgeable about the EDMS should jointly try to fit the data into the transfer standard. The effort expended in this dialogue should be commensurate with the value of the data to the project. Any decisions made about necessary compromises, or other changes to make the data fit the standard, should be made with great concern for preserving the quality and integrity of the data.

2. If the data is found to be non-conforming, determine how important it is to have it in the database. If the data is significant to the management of the project, and must be viewed in relationship to other project data or to data in other projects, then it should probably be placed in the data management system. If the data is of a supporting nature, but will not be used in combination with other data, then it should be archived in the format provided and effort should not be expended in fitting it into the database system. Often the answer to these questions will not be a simple "yes" or "no." In that case, the decision on whether to integrate the data into the database will need to take into consideration the cost of integrating the data.

3. Determine the cost to integrate the data. Adding data to the data management system that does not fit into the structure of the existing tables can be costly. Tasks that must be performed in order for this integration to be successful include analysis of the data, modification of the data model, creation of editing screens, queries and reports, and sometimes modification of the menu system and other user interface components. These modifications can in some cases adversely affect other users.

4. Modify the data management system as necessary. If the value of the data to be integrated (or, more precisely, the value of the use of the data in the data management system) exceeds the cost to integrate it, then resources should be allocated to performing the integration, and the integration performed.
APPENDIX D THE PARAMETERS
Every environmental project is different, and each project is likely to have its own suite of physical and chemical parameters that are important to that site. This includes the constituents of concern (pollutants) and other parameters that are important to understanding the physical, geological, and chemical conditions at the site. There are tens of thousands of naturally occurring and man-made compounds, any of which could be important at a particular site, but there is a suite of “bad guys” that tends to appear at site after site. This appendix provides some general information about the most commonly encountered parameters, along with information that might be useful to those managing the data.
OVERVIEW
People working with site environmental data need a basic knowledge of the constituent parameters they are managing. They should have a general idea of the different kinds of constituents, how the compounds behave in the environment and in analysis, and the abbreviations that are commonly used. This last item can be tricky, since some of the abbreviations are not obvious: PCE and PERC, for example, are used interchangeably for tetrachloroethylene (the nickname comes from its alternate chemical name, perchloroethylene), which would not be obvious to the casual observer. The purpose of this appendix is to provide a list of the most commonly encountered parameters, along with some information on their importance and how they are analyzed. The first section lists the constituents by parameter type. (Caution: some parameters fit into more than one category, such as “semivolatile” and “pesticide,” but each is listed in only one category.) The second part covers many of the common analysis methods for the parameters, followed by some additional information about extraction and holding times. The list of parameters is by no means exhaustive; the intention is to include the constituents most likely to show up in a data management project. Likewise, the list of methods includes some of the methods used for the parameters listed, but it too is not exhaustive. The material in this appendix was gathered from many different sources, including EPA (2000a), Extoxnet (2001), Spectrum Labs (2001), SKC (2001), Cambridge Software (2001), NCDWQ (2001), Scorecard.org (Environmental Defense, 2001), Manahan (2000, 2001), Patnaik (1997), and Weiner (2000). The information that is readily available on this subject in books and on the Web is somewhat spotty and inconsistent, and despite some editing effort, this is apparent in
the following tables. In some cases the information is hard to find, and in other cases two different sources contradict each other, possibly due in part to the different vintages of the data.
WARNING: This material is intended as an overview only. It has been gathered from various official and unofficial sources over a period of time, and has not been subject to rigorous quality assurance for accuracy, consistency, or currency. Regulations change, and sources sometimes conflict. This material should not be used for any decisions that may affect public health. Instead, primary, official sources should be consulted for the latest and most correct information for any particular project.
The following information is provided in the parameter tables, using these conventions and abbreviations:
Name – An effort has been made to list each parameter under its most common name, and in some cases nicknames and abbreviations have been included. There is great variability in the capitalization of the organic parameter names and in where to insert spaces between the parts of a name, and again different sources conflict, so do the best you can and adjust the names as necessary to be consistent within a project.
Reg. – This column shows whether the parameter is regulated by the EPA. The codes are P for priority toxic pollutants and N for non-priority toxic pollutants.
High Vol. – An entry in this column indicates a high volume of release to the environment over time, with the cutoff being about one million pounds of toxic equivalent released.
Risk – Health risk is determined by whether the substance is on the list of the highest-risk substances for cancer (C) and non-cancer (N) health risks (top 100 of each), based on amount released and toxicity. The source for volume and risk was Scorecard.org, May 1, 2001.
Holding Time – Holding times present a special challenge, since for some constituents they vary from method to method. Use this information as a general guide only. In some cases the holding time for the most common method has been listed, while in others an entry of Meth. means that the holding time varies by method. Holding times can also vary based on whether a sample was preserved in the field, and on the elapsed time between when the sample was obtained, when it was extracted in the lab, and when it was analyzed.
Analytical Method – Different sources vary in the level of detail for the methods listed, so some parameters have several different methods while others have only one. Again, the analytical methods listed are in no way complete.
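One practical response to the naming variability noted under Name above is a parameter alias table that maps each variant spelling or abbreviation to a single preferred name during import. The statement below is a sketch using Access-style update syntax; the table and column names (ParameterAlias with Alias and PreferredName fields, and an import table EDD_Import) are illustrative assumptions:

UPDATE EDD_Import INNER JOIN ParameterAlias
ON EDD_Import.ParameterName = ParameterAlias.Alias
SET EDD_Import.ParameterName = ParameterAlias.PreferredName;

With PCE, PERC, and tetrachloroethylene all aliased to one preferred name, retrievals and reports see a single, consistent parameter.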
INORGANIC PARAMETERS Metals Name
Reg.
Aluminum Antimony Arsenic Barium Beryllium Boron Cadmium Calcium Chromium VI Cobalt
N P P P, N P N P P P
High Vol. ! ! ! !
Risk
!
C,N
! !
C,N N
N N C,N N C,N
Holding Time 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 24 hours 6 mo.
Analytical Method 200.7, 200.8, 200.9 200.8, 200.9, 3114B 200.7, 200.8, 200.9 200.7, 200.8, 200.9 200.7, 200.8, 200.9 212.3 200.7, 200.8, 200.9 200.7, 3500-Ca D, D511-93A, B 200.7, 200.8, 200.9 219.1, 219.2
Copper Iron Lead Magnesium Manganese Mercury Molybdenum Nickel Palladium Phosphorus Platinum Potassium Selenium Silicon Silver Sodium Strontium Thallium Tin Titanium Vanadium Zinc
P N P
!
N C,N
P
! ! ! ! ! !
N
!
P
! !
N P
P N
N N C,N
N N
P
!
N
P
!
N N
Reg.
High Vol.
6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 28 days 6 mo. 6 mo. 6 mo. 28 days 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo. 6 mo.
200.7, 200.8, 200.9 200.7, 200.9 200.8, 200.9 3500-Mg E 200.7, 200.8, 200.9 200.8, 245.1, 245.2 246.1, 246.2 200.7, 200.8, 200.9 253.1, 253.2 365.1 -.6, 4500-P E, F 255.1, 255.2 2200.6, 258.1, 300.7, 7610 200.9, 3114B, 200.8 366, 370.1 200.7, 200.8, 200.9 200.7 7780 200.8 282.1, 282.2 283.1, 283.2 286.1, 286.2 200.7, 200.8
Holding Time 14 days 14 days 28 days 48 hours 6 mo. 48 hours 28 days 28 days 28 days 6 mo. 14 days 14 days 28 days 28 days
Analytical Method
Other inorganics Name Acidity Alkalinity Ammonia Asbestos Bicarbonate Biochemical oxygen demand (BOD) Bromate Bromide Carbon, organic Carbonate Carbonyl chloride (Phosgene) Carbonyl sulfide Chemical oxygen demand (COD) Chloride
P N N P
Chlorine (free) Chlorine (total)
N
! !
! N
!
Risk
N C
Immed. Immed.
Chlorine dioxide Cyanide
P
Fluoride
P
28 days
Hardness (as CaCO3) Iodide Nitrate Nitrite
N
6 mo. 24 hours 48 hours 48 hours
Nitrogen, Kjeldahl Ortho-phosphate Oxygen, dissolved Ozone
P P
P N
!
! !
Immed. 14 days
28 days 48 hours Immed. Immed.
305.1, 305.2, 310.1, 310.2, 2320B 350.1, 350.2, 350.3 100.1, 100.2 310.1, 310.2, 2320B 405.1 321.8 320.1 415.1, 415.2 310.1, 310.2, 2320B TO-06, 554 TO-15A, 554 403.1, 410.1, 410.2, 410.3, 410.4 300.0, 325.1, 325.2, 325.3, 4500-Cl, D512-89B, 9250, 9251, 9252A, 9253 4500-Cl-B, D, E, F, G, H, I 330.1, .2, .3, .4, .5, 4500-Cl-B, D, E, F, G, H, I 4500-Cl-D, E, F, G 335.1, .2, .3, .4, 350.1, 4500-CN-C, E, F, G 300.0, 340.1, .2, .3, .6, 4500-F-B, C, D, E, 129-71W, 380-75WE 130.1, 130.2 345.1 352.1, 601, B-1011, 4500-NO3-D, E, F 353.1, .2, .3, .4, .6, 354.1, B-1011, 4500-NO2-B, 351.1, 351.2, 351.3, 351.4 365.1, 4500-P E, F 360.1, 360.2 4500-O3 B
Perchlorate Silica Sulfate
P
28 days 28 days
Sulfide
N
7 days
314, 9058 200.7, 4500-Si D, E, F 300.0, 375.1, 375.2, 375.3, 375.4, 4500-SO42-C, D, E, F 376.1, 376.3, 377.1
Radiologic
(No Reg., High Vol., or Risk entries apply to the radiologic parameters.)
Name | Holding Time | Analytical Method
Actinide elements | 6 mo. | 907
Cesium (radioactive) | 6 mo. | 901, Ra-CI-Ces
Gamma emitting radionuclides | 6 mo. | 901.1
Gross alpha | 6 mo. | 00-01, 01-02, 900, 9310, Ra-CI-A&B
Gross beta | 6 mo. | 00-01, 900, 9310, Ra-CI-A&B
Iodine (radioactive) | 6 mo. | 902, Ra-CI-Io
Lead 210 | 6 mo. | 909
Plutonium | 6 mo. | 911, Ra-LV-Pl
Radium 224 | 6 mo. | 903, 9315
Radium 226 | 6 mo. | 903, 903.1, Ra-03, Ra-04, Ra-CI-R2, 9315
Radium 228 | 6 mo. | 903, 904, Ra-05, Ra-CI-R2, 9315, 9320
Radon | 6 mo. | 913, NWQL LC 1369
Strontium 89 | 6 mo. | 905, Ra-CI-Str, RA-LV-Str
Strontium 90 | 6 mo. | 905, RA-LV-Str
Thorium (total) | 6 mo. | 910, Ra-LV-Pl
Thorium 228, 230, 232 | 6 mo. | Ra-LV-Pl
Tritium | 6 mo. | 906, Ra-CI-Trit, Ra-LV-Tri, NWQL LC 1565
Uranium (natural) | 6 mo. | Ra-LV-Pl
Uranium (total) | 6 mo. | 908, 908.1, 908.2, Ra-LV-Pl
Uranium 238 | 6 mo. | Ra-LV-Pl
ORGANIC PARAMETERS Volatile organics (VOAs) Name 1,1,1,2-Tetrachloroethane
C
Holding Time 14 days
1,1,1-Trichloroethane (1,1,1-TCA)
N
14 days
C
14 days
C,N
14 days
C,N
14 days
1,1,2,2-Tetrachloroethane (PCA)
Reg.
High Vol.
P
1,1,2-Trichloroethane (1,1,2-TCA) 1,1-Dichloro-1-fluoroethane 1,1-Dichloroethane (1,1-DCA)
Risk
! P
1,1-Dichloroethene 1,1-Dichloroethylene (DCE) 1,1-Dichloropropane 1,1-Dichloropropene
14 days P
C
14 days 14 days 14 days
Analytical Method 502.2, 524.1, 524.2, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A IR 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.2 502.2, 524.1, 524.2, 8012A
1,1-Dimethyl hydrazine 1,2,3-Trichlorobenzene 1,2,3-Trichloropropane
C C
14 days 14 days 14 days
1,2,4-Trimethylbenzene 1,2-Dichloroethane (1,2-DCA)
P
N C
14 days 14 days
1,2-Dichloroethene 1,2-Dichloropropane
P
C,N
14 days 14 days
C,N C
14 days 14 days 14 days
1,3,5-Trimethylbenzene 1,3-Butadiene 1,3-Dichloropropene 1,4-Dioxane 1-Chloro-1,1-difluoroethane 2,2-Dichloropropane 2-Butanone, methyl ethyl ketone 2-Chloroethylvinyl ether 2-Hexanone 2-Nitropropane 4-Methyl-2-pentanone Acetaldehyde Acetone Acetonitrile Acrolein Acrylamide Acrylic acid Acrylonitrile Allyl chloride, 3-chloropropene Benzene
!
! P
!
C N
!
N
14 days
!
C,N
!
P
! ! !
C,N N C N N
14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days
P
!
C C,N
14 days 14 days
C C
14 days 14 days 14 days
P C
P
Benzoic trichloride Benzyl chloride Bromobenzene Bromochloromethane
14 days
Bromodichloromethane
14 days
Bromoform
P
14 days
Bromomethane
14 days
BTEX (see Benzene, Toluene, Ethylbenzene, and Xylene) Carbon disulfide Carbon tetrachloride
P
Chlorobenzene
P
Chlorodifluoromethane
! !
N C,N
14 days 14 days 14 days
!
14 days 14 days
Chloroethane
P
!
Chloroform
P
!
C,N
14 days
!
C,N
14 days
Chloromethane
FID 502.2, 503.1, 524.2, 8021A 502.2, 504.1, 524.1, 524.2, 8010B, 8021A, 8240B, 8260A 502.2, 503.1, 524.2, 8021A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 8021A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 503.1 524.2, 8021A 8010B, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 8240B, 8260A IR 502.2, 524.2 524.2, 8015A, 8240B, 8260A 601, 624, 8010B, 8240B, 8260A 524.2, 8240B, 8260A 524.2, 8260A 524.2, 8015A, 8240B, 8260A 554, 8315, TO-5, 0030, 8010B 524.2, 8240B, 8260A, 8315 8240B, 8260A 603, 8030A, 8240B, 8260A, 8513 8032A, 8316 TO-15A 524.2, 603, 8030A, 8031, 8240B, 8260A 524.2, 8010B, 8240B, 8260A 502.2, 503.1, 524.1, 524.2, 602, 624, 4032, 8020A, 8021A, 8240B, 8260A 8121 8010B, 8240B, 8260A 502.2, 503.1, 524.1, 524.2, 8010B, 8021A 502.2, 503.1, 524.1, 524.2, 602, 624, 8020A, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 551, 601, 602, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 551, 601, 602, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A
524.2, 8240B, 8260A 502.2, 524.1, 524.2, 551, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 503.1, 524.1, 524.2, 601, 602, 624, 8010B, 8020A, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 551, 601, 602, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A
Chloromethyl methyl ether Chloroprene Chlorotoluene Cumene Cyclohexanol di(2-Ethylhexyl)adipate di(2-Ethylhexyl)phthalate Dibromochloromethane, chlorodibromomethane Dibromomethane, methylene bromide Dichlorobenzene Dichlorobromomethane, methylene bromide Dichlorodifluoromethane
C
14 days 14 days 14 days 14 days 14 days
! !
P
14 days
P
Dichloromethane Dichloropropane Dicyclopentadiene Diisocyanates (2,4-TDI and MDI) Dimethyl sulfate Epichlorohydrin Ethyl acrylate Ethylbenzene
14 days 14 days 14 days
!
C C
14 days 14 days
N
14 days
C,N
14 days
! ! C C,N C P
!
14 days 14 days 14 days 14 days 14 days 14 days 14 days
Ethylene Ethylene dibromide, 1,2dibromoethane (EDB)
!
Ethylene glycol Ethylene oxide Ethyleneimine Ethylmethacrylate Fluorotrichloromethane
!
N C,N C
14 days 14 days 14 days 14 days 14 days
Formaldehyde Glycol ethers Hexachlorobutadiene
! !
C,N C
14 days 14 days 14 days
N N
14 days 14 days 14 days 14 days 14 days
Isobutyl alcohol Isopropylbenzene Methacrylonitrile Methanol Methyl bromide Methyl chloride Methyl iodide Methyl isobutyl ketone Methyl methacrylate Methyl tert-butyl ether Methylene chloride n-Butyl alcohol n-Butylbenzene Nitrosamines n-Propylbenzene Pentachloroethane p-Isopropyltoluene
C,N
P
P
! !
P ! ! !
C N
P ! N
N
14 days 14 days
14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days 14 days
8010B 8010B, 8240B, 8260A 502.2, 503.1, 524.1, 524.2, 8010B, 8021A 502.2, 503.1, 524.2, 8021A, 8260A GC, NIOSH 1402, 1500 506, 525.2 506, 525.2 502.2, 524.1, 524.2, 551, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 8010B, 8021A, 8240B, 8260A 502.2, 524.2 502.2, 504.1, 524.1, 551, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 8010B, 8021A, 8240B, 8260A 502.2, 524.2 GC OSHA 18 TO-15A 8010B, 8240B, 8260A TO-15A 502.2, 503.1, 524.1, 524.2, 601, 602, 624, 8020A, 8021A, 8240B, 8260A D1946 502.2, 503.1, 504, 504.1, 524.1, 524.2, 551, 618, 8010B, 8011, 8021A, 8081, 8240B, 8260A 8430 8240B, 8260A TO-15A 524.2 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 554, TO-5, TO-11, 8315 GC/FID 502.2, 503.1, 524.2, 612, 625, 8021A, 8120A, 8250A, 8260A 8240B, 8260A 502.2, 503.1, 524.2, 8021A, 8260A 524.2, 8240B, 8260A 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A TO-17 524.2, 8010B, 8240B, 8260A 524.2, 8015A, 8240B, 8260A 524.2, 8240B, 8260A 524.2 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 8260A 502.2, 503.1, 524.2, 8021A 607, 8070A 502.2, 503.1, 524.2, 8021A 524.2, 8240B, 8260A 502.2, 503.1, 524.2, 8021A
Propionitrile Propylene oxide, dichloride Pyridine sec-Butyl alcohol sec-Butylbenzene Styrene tert-Butyl alcohol tert-Butylbenzene Tetrachloroethylene (PCE, PERC), tetrachloroethene, perchloroethylene Toluene
! ! !
Triethylamine Trihalomethanes, total Vinyl acetate Vinyl chloride
14 days 14 days 14 days 14 days 14 days 14 days
P
!
C,N
14 days 14 days 14 days
P
!
N
14 days
!
trans-1,3-Dichloropropene trans-1,4-Dichloro-2-butene Trichloroethylene, trichloroethene (TCE) Trichlorofluoromethane
C,N N
C
P
C,N
14 days 14 days
N
14 days
!
N
!
N C,N
14 days 14 days 14 days 14 days
!
N
14 days
High Vol.
Risk
Holding Time
!
P
Xylenes (m, o, p)
524.2, 8240B, 8260A 0030, TO-15A 8240B, 8260A OSHA 7 502.2, 503.1, 524.2, 8021A 502.2, 503.1, 524.1, 524.2, 8021A, 8240B, 8260A OSHA 7 502.2, 503.1, 524.2, 8021A 502.2, 503.1, 524.1, 524.2, 551, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 503.1, 524.1, 524.2, 602, 624, 8020A, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 524.2, 8260A 502.2, 503.1, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A TO-15A 501.3, 502.2, 524.2, 551 8240B, 8260A 502.2, 524.1, 524.2, 601, 624, 8010B, 8021A, 8240B, 8260A 502.2, 503.1, 524.1, 524.2, 602, 624, 8020A, 8021A, 8260A
Semi-VOAs Name
Reg.
1,2,3,4-Tetrachlorobenzene 1,2,4,5-Tetrachlorobenzene 1,2,4-Trichlorobenzene
N P
N
28 days 14 days
1,2-Dichlorobenzene
P
N
14 days
1,2-Diphenyl hydrazine 1,3-Dichlorobenzene
P P
C C
28 days 14 days
1,4-Dichlorobenzene
P
C
14 days
1-Naphthylamine 2,3,4,6-Tetrachlorophenol 2,3,7,8-Tetrachlorodibenzodioxin (TCDD) 2,4,5-Trichlorophenol 2,4,6-Trichlorophenol 2,4-Diaminotoluene 2,4-Dichlorophenol 2,4-Dimethylphenol 2,4-Dinitrophenol 2,4-Dinitrotoluene 2,6-Dichlorophenol 2,6-Dinitrotoluene 2-Chloronaphthalene
P N P
C C
P P P C C P
Analytical Method
28 days 28 days 7 days
8120A 8120A, 8270 502.2, 503.1, 524.2, 551, 612, 625, 8021A, 8120A, 8250A, 8260A 502.2, 503.1, 524.1, 524.2, 601, 602, 612, 624, 625, 8010B, 8020A, 8021A, 8120A, 8250A, 8260A 8250A, 8270 502.2, 503.1, 524.1, 524.2, 601, 602, 612, 624, 625, 8010B, 8020A, 8021A, 8120A, 8250A, 8260A 502.2, 503.1, 524.1, 524.2, 601, 602, 612, 624, 625, 8010B, 8020A, 8021A, 8120A, 8250A, 8260A 625, 8250A 8040A, 8250A 1613
28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days
8040A, 8250A 604, 625, 8040A, 8250A 8270 604, 625, 8040A, 8250A 604, 625, 8040A, 8250A 604, 625, 8040A, 8250A 609, 625, 8250A 8040A, 8250A 609, 625, 8250A 612, 8120A, 8250A, 8270
2-Chlorophenol 2-Methylphenol 2-Naphthylamine 2-Nitroaniline 2-Nitrophenol 2-Picoline 3,3'-Dichlorobenzidine 3,3'-Dimethylbenzidine 3-Methylcholanthrene 3-Nitroaniline 4,4'-Methylenedianiline 4-Aminobiphenyl 4-Bromophenylphenyl ether 4-Chloro-3-methylphenol (p-Chlorom-cresol) 4-Chloroaniline 4-Chlorophenyl phenylether 4-Nitrophenol 7,12-Dimethylbenz(a)anthracene a,a-Dimethylphenethylamine Acenaphthene Acenaphthylene Acetophenone Aniline Anthracene Benzidine Benzo(a)anthracene, 1,2-benzanthracene Benzo(a)pyrene Benzo(b)fluoranthene Benzo(g,h,i)perylene Benzo(k)fluoranthene Benzoic acid Benzyl alcohol bis(2-Chloro-1-methylethyl) bis(2-Chloroethoxy)methane bis(2-Chloroethyl) ether bis(2-Chloroisopropyl) ether bis(2-Ethylhexyl) phthalate bis(Chloromethyl) ether Butyl benzyl phthalate Chlorobenzilate Chrysene Cresols (o, m, p) Dibenz(a,h)acridine Dibenz(a,h)anthracene Dibenz(a,j)acridine Dibenzofuran Diethyl phthalate Diethyl sulfate Dimethyl phthalate di-N-Butyl phthalate di-N-Octyl phthalate Diphenylamine Ethylmethane sulfonate Fluoranthene Fluorene Formic acid Hexachlorobenzene
P
P
28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days
8040A 8250A 8250A 8250A 604, 625, 8040A, 8250A 8240B, 8250A, 8260A, 8270 553, 605, 625, 8270 553, 8270 8250A 8250A NIOSH 5029 8250A 611, 625, 8110, 8250A 604, 625, 8040A, 8250A
P P
28 days 28 days 28 days
8250A 611, 625, 8110, 8250A 515.1, 555, 604, 625, 8040A, 8151, 8250A 8250A 8250A 610, 625, 8100, 8250A, 8310 525, 610, 625, 8100, 8250A, 8310 8250A 8250A 525, 610, 625, 8100, 8250A, 8310 553, 605, 625, 8250A 525, 610, 625, 8100, 8250A, 8310
P P
C
P
! P P P
C,N C
28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days
P
7-14 d.
P P P
28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 7 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days Meth. Meth.
C P P P P N
C C,N
P ! P
P C P P
P P ! P
C,N
525.2, 550, 610, 625, 550.1, 8100, 8250A, 8310 610, 625, 550.1, 8100, 8250A, 8310 525.2, 610, 625, 8100, 8250A, 8310 525, 610, 625, 8100, 8250A, 8310 8250A 8250A 611, 625, 8010B, 8110, 8250A 611, 625, 8010B, 8110, 8250A 611, 625, 8110, 8250A 611, 625, 8010B, 8110, 8250A 506, 525, 606, 625, 8060, 8061, 8250A 8270 506, 525, 606, 625, 8060, 8061, 8250A 404, 508, 8081 525, 610, 625, 8100, 8250A, 8310 8250A 8250A 525, 610, 625, 8100, 8250A, 8310 8250A 8250A 506, 525, 606, 625, 8060, 8061, 8250A 8270 506, 525, 606, 625, 8060, 8061, 8250A 506, 606, 8060, 8061, 8250A 506, 606, 625, 8060, 8061, 8250A 620, 8250A 8250A 610, 625, 8100, 8250A, 8310 525, 610, 625, 8100, 8250A, 8310 NIOSH 2011 505, 508, 508.l, 525.2
Hexachlorocyclohexane Hexachlorocyclopentadiene Hexachloroethane Hexachlorophene Hydroquinone Indeno(1,2,3-cd)pyrene Isophorone Methyl methane sulfonate n,n-Dimethylformamide Naphthalene Nitrobenzene n-Methyl-2-pyrrolidone n-Nitrosodimethylamine n-Nitroso-di-n-butylamine n-Nitroso-di-n-propylamine n-Nitrosodiphenylamine n-Nitrosopiperidine p-Chloroaniline Pentachlorobenzene Pentachlorophenol (PCP) Phenacetin Phenanthrene Phenol Polychlorinated biphenyls (PCBs, Aroclor) Polycyclic aromatic compounds (PAHs) Pyrene Quinoline
N N C,N
C P P
P
! !
N
P ! N P P C N P
P P P
C
28 days 28 days 28 days 28 days 28 days 28 days 14 days 14 days 14 days 28 days 28 days 28 days 28 days 28 days 28 days 28 days 14 days 28 days 28 days 28 days Meth.
! ! !
28 days Meth. 28 days
C
28 days
C
28 days 28 days
P
8081, 8121, 8270 505, 508, 508.1, 525.2 524.2, 612, 625, 8120A, 8250A, 8260A 604.1, 8270 8270 525, 610, 625, 8100, 8250A, 8310 609, 625, 8250A 8250A 8270 502.2, 503.1, 524.2, 610, 625, 8021A, 8100, 8250A, 8260A, 8310 524.2, 609, 625, 8250A, 8260A GC/FID 607, 625, 8070, 8250A 8250A 607, 625, 8070, 8250A 607, 625, 8070, 8250A 8250A 8250A 8250A, 8270 515.1, 515.2, 525, , 555 604, 625, 8040A, 8270, 8540 8250A 525, 610, 625, 8100, 8250A, 8310 604, 625, 8040A, 8250A 505, 508A, 508, 525, 608, 625, 8080, 8270 550, 550.1, 1654, 8275A, 8310, 8310A, TO-13, TO-13A 525, 610, 8100, 8250A, 8310 8270
Herbicides and fungicides Note: Some substances act as both herbicides and pesticides, but are listed in only one category. Name 2,4- Dichlorophenyl acetic acid (2,4D) 2,4,5-TP (Silvex) 2,4,5-Trichlorophenoxyacetic acid (2,4,5-T) 2,4-DB 4-Chloro-2-methylphenoxy acetic acid Acetamide Alachlor Butachlor Carbendazim Chlorothalonil Cyanazine Dalapon Diallate Dinoseb Diquat Glyphosate Metolachlor
Reg.
High Vol.
N
Risk C
N
! !
N N
C C
Holding Time 14 days
Analytical Method 515.2, 515.1, 555
14 days 14 days
515.1, 515.2, 555 515.1, 515.2, 555, 615, 8150B, 8151
14 days 14 days
515.1, 515.2, 555, 615, 8150B, 8151 555, 615, 8150B, 8151
14 days 14 days 14 days 14 days 28 days 14 days 14 days 14 days 14 days 7 days 14 days Meth.
505, 525, 645, 8081 505, 507, 508.1, 525.2, 645, 8081 507, 525.2, 645 402 508, 608.2, 8081 409, 629 551.1, 552.1 8081 406, 407, 515.2,515.1, 555 549.1, 549.2 140, 547, 6651 507, 525.2, 508.1
Metribuzin Paraquat Pentachloronitrobenzene Picloram Pronamide Propachlor Quintozene Simazine Trifluralin
C C,N C C
Meth. 7 days 14 days 14 days 28 days Meth. 14 days 14 days 14 days
507, 525.2, 508.1 549.1, 549.2 617, 8081, 8250A 515.1, 515.2, 555, 644, 8151 8250A 143, 508, 525.2, 508.1 617, 8081, 8250A 505, 507, 508.1, 525.2, 619, 508, 617, 627, 8081
Holding Time 14 days
Analytical Method
Pesticides (insecticides) Name 1,2-Dibromo-3-chloropropane (DBCP) 3-Hydroxycarbofuran 4,4'- Dichloro diphenyl dichloro ethane (4-DDD) 4,4'- Dichloro diphenyl dichloro ethylene (4-DDE) Acephate Aldicarb Aldicarb sulfone Aldicarb sulfoxide Aldrin Atrazine Benomyl Benzene hexachloride (BHC)
Reg.
Endothall Endrin Endrin aldehyde Endrin ketone Ethoprop Famphur
Risk
P
28 days 28 days
P
28 days C
P
C,N C
P
Bifenthrin Bromoxynil Camphechlor Carbaryl (Sevin) Carbofuran Catechol Chlordane Chloropyrifos Demeton Diazinon Dicamba Dichlorodiphenyltrichloroethane (DDT) Dicofol Dieldrin Dimethoate Disulfoton Endosulfan (alpha and beta) Endosulfan sulfate
High Vol.
C C C
28 days 28 days 28 days 28 days Meth. 14 days 28 days 28 days
GC/FPD 531.1, 6610 531.1, 6610 531.1, 6610 505, 508, 525.2, 508.l 505, 507, 508.1, 525.2 402, 631 508, 608, 617, 625, 8080A, 8081, 8120A, 8250A
28 days 28 days
1661, 8270 505, 508, 525, 608, 617, 625, 8080A, 8081, 8250A 531.1, 6610 403, 531.1, 632, 6610
28 days 28 days P N N
C C,N
C? C,N
Meth. 7 days 28 days 7 days 14 days 28 days
P P
28 days Meth. 7 days 7 days 28 days 28 days
P P
7 days Meth. 28 days
P
C
502.2, 504, 524.1, 524.2, 8010B, 8011, 8021A, 8081, 8240B, 8260A 531.1, 6610 508, 608, 617, 625, 8080A, 8081, 8250A 508, 608, 617, 625, 8080A, 8081
28 days 7 days 7 days
505, 508, 508.1, 525.2, 8080, 8270 8141A 8141, 8270 8141A 515.1, 515.2, 555 508, 608, 617, 625, 8080A, 8081, 8250A 617, 8081 505, 508, 525.2, 508.1 8141A 8141A 508, 608, 625, 8080, 8270 508, 608, 617, 625, 8080A, 8081, 8250A 548.1 505, 508, 508.1, 525.2 508, 608, 617, 625, 8080A, 8081, 8250A 8081, 8250A 8141A 622.1, 8141A
gamma-BHC (Lindane)
P
C
7 days
Heptachlor Heptachlor epoxide Isodrin Kepone Lindane Linuron Malathion Methomyl Methoxychlor Methyl parathion Mirex Oxamyl Parathion Permethrin Phorate Sulfotep Thionazin Thiourea Toxaphene Triallate
P P
C
Meth. Meth. 7 days 7 days Meth. 28 days 28 days 28 days Meth. 7 days 28 days 28 days 7 days 7 days 7 days 7 days 7 days
C N N N N C
C P C
Meth. 14 days
508, 608, 617, 625, 8080A, 8081, 8120A, 8250A 505, 508, 508.1, 525.2 505, 508, 508.1, 525.2 617, 8081 8081 505, 508, 508.1, 525.2 553, 632 1618, 8270 408, 531.1, 632, 6610 505, 508, 508.1, 525.2 8141A 8081, 8270 531.1, 632, 6610 8141A 508, 608.2 8141A 8141A 622.1, 8141A 509, 553 505, 508, 525.2 507
Hydrocarbons Name Chlorinated hydrocarbons Creosotes Kerosene C9-C18 n-Hexane No. 6 fuel oil C12-C24 Oil and grease Petroleum hydrocarbons
Reg.
High Vol.
Risk
! !
Holding Time
Analytical Method
14 days 7 days 28 days
612, 8120A, 8121 8270 418.1, 8015A 9071B 418.1, 8015A 413.1, 413.2, 1664, 1664A, 9070, 9071A 418.1, 1664, 4030, 8440, 9074
N 28 days 28 days 28 days
N
OTHER PARAMETERS
Field parameters
(No High Vol. or Risk entries apply to the field parameters.)
Name | Reg. | Holding Time | Analytical Method
Field conductivity | P | On site | 2510 B
Field pH | P | Immed. | 150.1, 150.2
Field turbidity | N | 48 hours | 180.1
Floaters (LNAPLs) |  | Immed. | Visual
Groundwater elevation |  | Immed. | Calculated
Sinkers (DNAPLs) |  | Immed. | Visual

Borehole geophysics
Name: Gamma survey, Neutron survey, Spontaneous potential, Resistivity

Operating parameters
Name: Flow rate, Fluid level, Production volume
Biologic
(No Reg., High Vol., or Risk entries apply to the biologic parameters.)
Name | Holding Time | Analytical Method
Cryptosporidium | 6 hours | 1622, 1623
E. coli | 6 hours | 1103.1, 1104, 1105
Fecal coliform | 6 hours | 9221 E, 9222 D
Giardia lamblia | 6 hours | 1623
Heterotrophic bacteria | 6 hours | 9215 B
Legionella | 6 hours | (none listed)
Total coliform | 6 hours | 9221 A, B, C, D, 9222 B, C, 9131, 9132
Other Name Color Corrosivity pH1 Corrosivity to steel1 Extractable organic halides (EOX) Foaming agents Hexahydro-1,3,5-trinitro-1,3,5triazine (RDX) Hydraulic conductivity Ignitability1 Laboratory conductivity Laboratory pH Laboratory temperature Nitrilotriacetic acid (NTA) Nitroglycerin Odor Percent moisture Phenolics Purgeable organic halides (POX) Reactivity1 Temperature Total dissolved solids (TDS) Total organic carbon (TOC) Total organic halides (TOX) Total suspended solids (TSS) Toxicity1 Tributyltin (TBT) Trinitrotoluene (TNT) 1
Reg.
High Vol.
Risk
N
Holding Time 48 hours Meth. Meth. 28 days 48 hours
Analytical Method 110.1, 110.2, 110.3 4500-H+B, 9040, 9045 1110 9023 4051
Meth. 28 days On arrival On arrival C 24 hours Immed. 7 days
P N
N
28 days Meth. Immed. 7 days 48 hours 28 days 7 days Meth.
1010, 1020A, 1030, 1040 2510 B 150.1, 150.2 170.1, 2550 430.1, 430.2 8332 1250 B 420.1,.2,.3,.4,1653,1653A, 9065, 9066, 9067 9021 1050 2550 2540 C 9060 9020B, 9022 209D, 160.2 1311 282.3 4050, 4655, 4656
RCRA characteristics of hazardous substances
METHOD REFERENCE
The following list contains many of the methods used to analyze for the above constituents. Most are EPA standards, although other organizations such as ASTM also provide standards. Standards that are not from EPA sources are marked in the list, and in some cases two organizations use the same number for different methods. Abbreviations used include AES (atomic emission spectrometry), AA (atomic absorption spectrometry), CCT (capillary column technique), DW (drinking water), ECD (electron capture detector), ELCD (electrolytic conductivity detector), GC (gas chromatography), HPLC (high performance liquid chromatography), IC (ion chromatography), LC (liquid chromatography), LLE (liquid-liquid extraction), LSE (liquid-solid
extraction), MS (mass spectrometry), PID (photoionization detector), PUF (polyurethane foam), TEM (transmission electron microscopy), and WW (wastewater).
Method 00-01 00-02 0030 100.1 100.2 110.1 110.2 110.3 129-71W 130.1 130.2 140 143 150.1 150.2 170.1 180.1 200.6 200.7 200.8 200.9 212.3 219.1 219.2 245.1 245.2 246.1 246.2 253.1 253.2 255.1 255.2 258.1 282.1 282.2 282.3 283.1 283.2 286.1 286.2 300.0 300.7 305.1 305.2 310.1 310.2 314 320.1 321.8 325.1 325.2 325.3
Procedure Gross alpha & gross beta part. radiochem. Gross alpha in DW by co-precipitation SVOCs using a sampling train (VOST) Asbestos by TEM Asbestos by TEM Color - colorimetric, ADMI Color - colorimetric-platinum-cobalt Color - spectrophotometric Fluoride by automated alizarin (non-EPA)
330.1 330.2 330.3 330.4
Hardness, total - colorim., auto. EDTA Hardness, total - titrimetric, EDTA Glyphosate Propachlor pH - electrometric pH - continuous monitoring (electrometric) Temperature - thermometric Turbidity, nephelometric Metals - Ca, Mg, K, and Na Metals and trace elements by ICP/AES Trace elements by ICP/MS Trace elem. by stab. temp. grap. furn. AA Boron - colorimetric, curcumin Cobalt - AA, direct aspiration Cobalt - AA, furnace Mercury by cold vapor AA - manual Mercury by cold vapor AA - automated Molybdenum - AA, direct aspiration Molybdenum - AA, furnace Palladium - AA, direct aspiration Palladium - AA, furnace Platinum - AA, direct aspiration Platinum - AA, furnace Potassium - AA, direct aspiration Tin - AA, direct aspiration Tin - AA, furnace Tributyltin Cl in marine & fresh waters Titanium - AA, direct aspiration Titanium - AA, furnace Vanadium - AA, direct aspiration Vanadium - AA, furnace Inorganic anions by ion chromatography Metals: Na/ammonium/K/Mg/Ca Acidity - titrimetric Acidity - titrimetric (acid rain) Alkalinity - titrimetric, pH 4.5 Alkalinity - colorimetric, automated Perchlorate in DW using IC Bromide - titrimetric Bromate in DW by IC/ICP/MS Chloride-colorimetric, auto. ferricyanide AI Chloride-colorimetric, auto ferricyanide AII Chloride - titrimetric, mercuric nitrate
335.4 340.1 340.2 340.3 340.6
330.5 335.1 335.2 335.3
345.1 350.1 350.1 350.2 350.3 351.1 351.2 351.3 351.4 352.1 353.1 353.2 353.3 353.4 353.6 354.1 360.1 360.2 365.1 365.2 365.3 365.4 365.5 365.6 366 370.1 375.1 375.2 375.3 375.4 376.1 376.2 377.1
Chlorine, total residual - titrimetric Chlorine, total residual - titrimetric, back Chlorine, total resid. - titrimetric, iodomet. Chlorine, total resid. - titrimetric, DPDFAS Chlorine, total residual spectrophotometric Cyanides, amen. to chlorination titrimetric Cyanide, total - titrimetric, spectrophot. Cyanide, total - colorimetric, automated UV Total cyanide by semi-auto. colorimetry Fluoride, total - colorimetric Fluoride - potentiometric Fluoride - colorimetric Fluoride in wet deposition by potentiometric Iodide - titrimetric Total cyanide by semi-auto. colorimetry Nitrogen, ammonia - colorimetric Nitrogen, ammonia - colorim., titrimetric Nitrogen, ammonia - potentiometric Nitrogen, Kjeldahl, total colorimetric/auto Total Kjeldahl nitrogen - semi-auto. colorimetric Nitrogen, Kjeldahl, total - colorim./titrim. Nitrogen, Kjeldahl, total - potentiometric Nitrogen, nitrate - colorimetric, brucine Nitrogen, nitrate-nitrite - colorim./hydra. Nitrate-nitrite by auto. colorimetry/ cadmium Nitrogen, nitrate-nitrite - manual cadmium Nitrate & nitrite by gas segmented CF/CA Nitrate-nitrite by automated colorimetric Nitrogen, nitrite - spectrophotometric Oxygen, dissolved - membrane electrode (probe) Oxygen, dissolved - modified Winkler Phosphorus by automated colorimetry (method for ortho-phosphate) Phosphorus, all forms - colorim./one reag. Phosphorus, all forms - colorim./two reag. Phosphorus, total - colorimetric/automated Orthophosphate by automated colorimet. Orthophosphate in wet deposition Dissolved silicate by gas segmented CF/CA Silica, dissolved - colorimetric Sulfate - colorimetric, auto., chloranilate Sulfate by automated colorimetry Sulfate - gravimetric Sulfate - turbidimetric Sulfide - titrimetric, iodine Sulfide - colorimetric, methylene blue Sulfite - titrimetric
38075WE 401.3 402 0403 404 405.1 406 407 408 409 410.1 410.2 410.3 410.4 413.1 413.2 415.1 415.2 418.1 420.1 420.2 420.3 420.4 430.1 430.2 501.3 502.2 503.1 504 504.1 505 506 507 508 508A 508.1 509 515.1 515.2 515.3
Fluoride by automated electrode (nonEPA) Oxygen, chemical demand - high level Benomyl & carbendazim Carbofuran Chlorobenzilate, profluralin, terbutyn Biochemical oxygen demand (BOD) Dinoseb Dinoseb Methomyl Cyanazine Chem. oxygen demand - titrim., mid-lev. Chem. oxygen demand - titrim., low lev. Chem. oxygen demand - titrim., high lev. Chem. oxygen dem. - semi-auto. colorim. Oil & grease, tot., recoverable - gravimet. Oil & grease, toy. recov. - spectrophotom. Organic C, tot. - combustion or oxidation Organic carbon, total - UV promoted Petroleum hydrocarbons, total recoverable Phenolics, tot. recoverable - spectrophot. Phenolics, total recoverable - colorimetric Phenolics, total recoverable spectrophotometric/MBTH Total rec. phenolics - semi-auto. column NTA - colorimetric, manual, zinc-zincon NTA - colorimetric, automated, zinczincon Trihalomethanes in DW - GC/MS Volatile organic compounds in water by purge and trap capillary col. GC & ECD Volatile aromatic and unsaturated organic compounds in water 1,2-Dibromoethane (EDB) and 1,2dibromo-3-chloropropane (DBCP) in water by microextraction and GC EDB, DBCP, and 1,2,3-trichloropropane by microextraction and GC Organohalide pesticides and commercial polychlorinated biphenyl (PCB) in water by microextraction and GC Phthalate and adipate esters by liquidliquid or LSE by GC with a photoionization detector Nitrogen- and phosphorus-containing pesticides by GC with a nitrogen phosphorus detector Chlorinated pest. in water by GC with ECD Screening for PCBs by perchlorination / GC Chlorinated pesticides, herbicides and organohalides by LSE and GC with ECD Ethylene thiourea (ETU) in water using GC with a nitrogen-phosphorus detector Chlorinated acids in water by GC with ECD Chlorinated acids in water using LSE and GC with ECD Chlorinated acids using LLE, derivatization and GC with ECD
524.1 524.2 525 525.2 531.1
547 548.1 549.1 549.2 550 550.1 551.1 552.1 552.2 553 554 555 601 601 602 603 604 604.1 605 606 607 608 608.2 609 610 611 612 613 615 616
Purgeable organic compounds in water by packed column GC/MS Purgeable organic compounds by capillary column GC/MS Organic compounds in DW by LSE and capillary column GC/MS Organic compounds by LSE and capillary column GC/MS n-Methylcarbamoyloximes and nmethylcarbamates in water by direct aqueous injection HPLC with post column derivatization Glyphosphate by HPLC, post column derivatization, and fluorescence detector Endothall in DW by ion exchange extraction, acidic methanol methylation and GC/MS Diquat and paraquat in DW by LSE and HPLC with ultraviolet detection Diquat and paraquat by LSE and HPLC with a photodiode array ultraviolet detector Polycyclic aromatic hydrocarbons (PAHs) by LLE and HPLC with coupled ultraviolet and fluorescence detection Polycyclic aromatic hydrocarbons (PAHs) by LSE and HPLC Chlorinated disinfection by-products and chlor. solvents by LLE and GC with ECD Haloacetic acids and dalapon in DW by ion-exchange LSE and GC with ECD Haloacetic acids and dalapon by LLE, derivatization and GC with ECD Benzidines and nitrogen-containing pesticides in water by LLE or LSE and reverse phase HPLC/particle beam/MS Carbonyl compounds in DW by dinitrophenylhydrazine deriv. and HPLC Chlorinated acids in water by HPLC with a photodiode array ultraviolet detector Purgeable halocarbons Nitrate by ion selective electrode (nonEPA) Purgeable aromatics Acrolein and acrylonitrile Phenols Hexachlorophene and dichlorophen in industrial and municipal WW Benzidines Phthalate esters Nitrosamines Organochlorine pesticides and PCBs Organochlorine pesticides in WW by GC Nitroaromatics and isophorone Polynuclear aromatic hydrocarbons (PAHs) Haloethers Chlorinated hydrocarbons Dioxin Chlorinated herbicides in industrial and municipal WW C, H, and O pesticide compounds
617 618 619 620 622.1 624 625 627 629 631 632 632.1 633 634 635 636 637 638 639 640 641 643 644 645 646 900 901 901.1 902 903 903.1 904 905 906 907 908 908.1 908.2 909 910 911 913 1010 1020A 1030
Organohalide pesticides and PCBs in industrial and municipal WW Volatile pesticides in municipal and industrial WW by GC Triazine pesticides in industrial and municipal WW Diphenylamine in municipal and indust. WW Thiophosphate pesticides Purgeable organics in waters Semivolatile organics in waters Dinitroaniline pesticides in industrial and municipal WW Cyanazine in industrial and municipal WW Benomyl and carbendazim in industrial and municipal WW Carbamate and urea pesticides in industrial and municipal WW Carbamate and amide pest. in WW by LC Organonitrogen pesticides in industrial and municipal WW Thiocarbamate pesticides in industrial and municipal WW by GC Rotenone in industrial and municipal WW by LC Bensulide in industrial and municipal WW by LC MBTS and TCMTB in municipal and industrial WW by LC Oryzalin in industrial and municipal WW Bendiocarb in municipal and industrial WW by LC Mercaptobenzothiazole in WW by LC Thiabendazole in WW by LC Bentazon in WW by LC Picloram in WW by LC Amine pesticides and lethane in WW by gas Dinitro aromatic pesticides in WW by GC Radioactivity, gross alpha and gross beta Radioactive cesium Radionuclides, gamma emitting Radioactive iodine Radium, alpha-emitting isotopes Radium-226 radon emanation technique Radium-228 Radioactive strontium Tritium Actinide elements Uranium - radiochemical method Uranium - fluorometric method Uranium - laser indirect fluorometry in DW Lead-210 in DW Thorium - DW Plutonium - DW Radon in DW by liquid scint. counting Ignitability - Pensky-Martins closed-cup method Ignitability - Setaflash closed-cup method Ignitability of solids
1103.1 1104 1105 1110 1250 B 1311 1613 1618 1622 1623 1653 1653A 1654 1661 1664 1664A 2120 B 2130 B 2150 B 2320 B 2510 B 2540 C 2550 3111 B 3111 D 3112 B 3113 B 3114 B 3120 B 3500-Ca D 3500Mg E 4030 4032 4050 4051 4110 B 4500-ClB 4500-ClD 4500-Cl E
E. coli & enterococci in water - membrane filter E. coli in DW/EC medium with mug tub E. coli in DW /nutrient agar/mug tub Corrosivity to steel Odor Toxicity characteristic leaching procedure (TCLP) Chlorinated dioxins by isotope dilution high resolution GC/high res. MS (nonEPA) Pesticides, organo-halide/phosphorus malathion Cryptosporidium in water by filtration/IMS/FA Cryptosporidium & Giardia by filtration/IMS/FA Chlorinated phenolics - in Situ acetylation/GCMS Chlorinated phenolics - in Situ acetylation/GCMS PAH content of oil by HPLC/UV Bromoxynil Oil & grease and total petroleum hydrocarbons Oil & grease (HEM/SGT-HEM) by extr. Color by visual comp. meth. (non-EPA) Turbidity, nephelometric (non-EPA) Odor (non-EPA) Alkalinity by titration (non-EPA) Conductivity (non-EPA) Total dissolved solids (TDS) (non-EPA) Temperature (non-EPA) Metals by flame AA, direct air-acetylene flame (non-EPA) Metals by flame AA, direct nitrous oxideacetylene flame (non-EPA) Metals by cold-vapor AA (non-EPA) Metals by electrothermal AA (non-EPA) Metals by hydride generation / AA (nonEPA) Metals by inductively coupled plasma/ atomic emission spectroscopy (non-EPA) Calcium by EDTA titration (non-EPA) Magnesium by complexation titration and calculated difference (non-EPA) Petroleum hydrocarbons soil screen by immunoassay Benzene in water & soil by immunoassay TNT explosives in water and soils by immunoassay RDX in soil and water by immunoassay Inorganic anions by ion chromatography (non-EPA) Chlorine residual by iodometric method (non-EPA) Chlorine residual by amperometric titration (non-EPA) Chlorine residual by low level amperometric titration (non-EPA)
4500-Cl F 4500-Cl G 4500-Cl H 4500-Cl I 4500-ClD 4500ClO2 C 4500ClO2 D 4500ClO2 E 4500CN- C 4500CN- E 4500CN- F 4500CN-G 4500-FB 4500-FC 4500-FD 4500-FE 4500-H+ B 4500NO2-B 4500NO3-D 4500NO3-E 4500NO3-F 4500-O3 B 4500-P E 4500-P F 4500-Si D 4500-Si E 4500-Si F 4500SO42-C 4500SO42-D 4500SO42-E 4500SO42-F
Chlorine residual by DPD ferrous titration (non-EPA) Chlorine residual by DPD colorimetric method (non-EPA) Chlorine residual by syringaldazine (FACTS) method (non-EPA) Chlorine residual by iodometric electrode technique (non-EPA) Chloride by potentiometric method (nonEPA) Chlorine dioxide by the amperometric method I (non-EPA) Chlorine dioxide by the DPD method (non-EPA) Chlorine dioxide by the amperometric method II (non-EPA) Total cyanide after distillation (non-EPA) Cyanide by colorimetric method (nonEPA) Cyanide by the cyanide-selective electrode method (non-EPA) Cyanides amenable to chlorination after distillation (non-EPA) Fluoride - preliminary distillation step (non-EPA) Fluoride by ion selective electrode (nonEPA) Fluoride by SPADNS method (non-EPA) Fluoride by complexion method (nonEPA) pH, electrometric (non-EPA) Nitrite by colorimetric method (non-EPA) Nitrate by ion selective electrode method (non-EPA) Nitrate by cadmium reduction method (non-EPA) Nitrate by automated cadmium reduction method (non-EPA) Ozone residual by indigo colorimetric method (non-EPA) Phosphorus by ascorb. acid meth. (nonEPA) Phosphorus by automated ascorbic acid reduction method (non-EPA) Silica by molybdosilicate method (nonEPA) Silica by heteropoly blue method (nonEPA) Silica by automated method for molybdate-reactive silica (non-EPA) Sulfate by gravimetric method with ignition of residue (non-EPA) Sulfate by gravimetric method with drying of residue (non-EPA) Sulfate by turbidimetric method (nonEPA) Sulfate by turbidimetric method (nonEPA)
4655 4656 5540C 6610 6651 7610 7780 7903 8010B 8011 8015A 8020A 8021A 8030A 8031 8032A 8040A 8060 8061 8070 8070A 8080 8080A 8081 8100 8110 8120A 8121 8141A 8150B 8151 8240B 8250A 8260A 8270 8270B 8275A 8310 8310A 8315 8316 8318 8330 8332 8440
TNT and RDX in water by immunoassay TNT and RDX in water by fluorescent immunoassay Anionic surfactants as methyl blue active substances (MBAS) (non-EPA) Carbamates by HPLC with postcolumn fluorescence detection (non-EPA) Glyphosphate herbicide by LC postcolumn fluorescence (non-EPA) Potassium - AA, direct aspiration Strontium - AA, direct aspiration NIOSH method for acids in air Halogenated volatile organics by GC DBCP & EDB by microextraction & GC Nonhalogenated volatile organics by GC Aromatic volatile organics by GC Halogenated volatiles by GC using PID and electrolytic conductivity detectors in series: CCT Acrolein and acrylonitrile by GC Acrylonitrile by GC Acrylamide by GC Phenols by GC Phthalate esters Phthalate Esters by cap. GC/ECD Nitrosamines by GC Nitrosamines by GC Chlorinated pesticides & PCBs by GC/ECD or GC/ELCD Organochlorine pesticides & PCBs by GC Organochlorine pesticides, halowaxes, and PCBs as aroclors by GC: CCT Polynuclear aromatic hydrocarbons (PAHs) Haloethers by GC Chlorinated hydrocarbons by GC Chlorinated hydrocarbons by GC Organophosphorus comp. by GC: CCT Chlorinated herbicides by GC Chlorinated herbicides by GC using methylation or pentafluorobenzylation derivatization: CCT Determination of volatile organics by GC/MS SVOCs - GC/MS packed column VOCs by GC/MS: CCT SVOCs by high resolution GC/MS: CCT SVOCs by GC/MS: CCT PAHs and PCBs in soils/sludges by TE/GC/MS PAHs by HPLC PAHs by HPLC Carbonyl compounds (formaldehyde, aldehydes & ketones) by HPLC Acrylamide, acrylonitrile, and acrolein by HPLC n-Methylcarbamates by HPLC Nitroaromatic and nitramine explosives by HPLC Nitroglycerin by HPLC Total rec. petroleum hydrocarbons - IS
8540 9020B 9021 9022 9023 9040 9045 9058 9060 9065 9066 9067 9070 9071A 9071B 9074 9131 9132 9215B 9221A 9221B 9221C 9221D 9221E 9222A 9222B 9222C 9222D 9223 9250 9251 9252A* 9253 9310 9315 9320 B-1011 D1946 D51193A D51193B D51289B NWQL LC 1369 NWQL LC 1565 Ra-03 Ra-04 Ra-05 RA-CIA&B
Pentachlorophenol (PCP) - colorimetric field test Total organic halides (TOX) Purgeable organic halides (POX) Total organic halides (TOX) Extractable org. halides (EOX) in solids pH by meter pH by meter Perchlorate by ion chromatography Total organic carbon Phenolics - spectrophot., manual 4-AAP Phenolics - colorimetric automated 4-AAP Phenolics - spectrophotometric, MBTH Oil & grease, total recoverable - gravim. Oil & Grease - extraction for sludge & sed. n-Hexane extractable material (HEM) /oil & grease Petroleum hydrocarbons in soil by turbidimetric Total coliform - multiple tube fermentation Total coliform - membrane filter Heterotrophic plate count - pour plate meth. Multiple-tube fermentation technique for members of the coliform group Standard total coliform fermentation tech. Estimation of bacterial density Presence-absence (P-A) coliform test Fecal coliform procedure Membrane filtration technique for members of the coliform group Standard total coliform memb. filter proc. Delayed incubation tot. coliform proc. Fecal coliform membrane filter procedure Chromogenic substrate coliform test Chloride - colorim., auto ferricyanide AAI Chloride - colorim., auto ferricyanide AAII Chloride - titrimetric, mercuric nitrate Chloride - titrimetric, silver nitrate Alpha & beta particles, gross Alpha-emitting radium isotopes Radium-228 Nitrate and nitrite by IC (non-EPA) Atmospheric gases, ethane, ethylene Calcium by EDTA titration (non-EPA) Calcium by direct aspiration atomic absorption (non-EPA) Chloride by silver nitrate titration (nonEPA) Radon by liquid scintillation Tritium by liquid scintillation Radium-226 radiochemical in water Radium-226 radiochemical de-emanation Radium-228 radiochemical in water Gross alpha & beta in DW
RA-CICes RA-CIIo-D RA-CIIo-P RA-CIR2 RA-CIR2 RA-CIStr RA-CITrit RA-LVPl RA-LVRa RA-LVStr RA-LVTri TO-01 TO-02 TO-03 TO-04 TO-04A TO-05 TO-06 TO-07 TO-08 TO-09 TO-10 TO-10A TO-11 TO-11A TO-12 TO-12 TO-13 TO-13A TO-14 TO-14A TO-15 TO-16 TO-16 TO-17
Radioactive cesium in DW Radioactive iodine in DW - distillation Radioactive iodine in DW - precipitation Radium-226 in DW - radon emanation Radium-228 in DW - sequential method Radioactive strontium in DW Tritium in DW Plutonium, uranium, and thorium / soil, air, tissue Radium-226 & radium-228 / soil, air, tissue Strontium-89 & -90 / vegetation, soil, tissue Tritium / water & biological tissue VOCs in amb. air - Tenax® and GC/MS Carbon molecular sieve adsorption Cryogenic trapping Organochlorine pesticides and PCBs high vol. PUF Pesticides and PCBs by high volume PUF & GC/MD Aldehydes and ketones - liq. impinger samp. Phosgene by HPLC N-Nitrosodimethylamine by Thermosorb/N GC/MS Cresol/phenol by sodium hydroxide LI/HPLC Dioxin by high vol. PUF/HRGC/HRMS Pesticides by low volume PUF Pesticides & PCBs by low volume PUF & GC/MD Formaldehyde by adsorbent cartridge/HPLC Formaldehyde by adsorbent cartridge & HPLC Formaldehyde by adsorbent cartridge/HPLC Organic compounds, non-methane (NMOC)/PDFID Polynuclear aromatic hydrocarbons (PAHs) PAHs by GC/MS Organic compounds, semivolatile and volatile VOCs by canisters and GC VOCs collected in canisters - GC/MS Atmospheric gases by Fourier transform infrared VOCs by long-path/open-path FT/IR monitoring VOCs by active sampling onto sorbent tubes
Preservation and holding times
The following table contains preservation and holding times for EPA-regulated parameters and several common methods.
Parameter/Method
Preservative
Sample Holding Time
Metals (except Hg & Cr VI) Mercury Alkalinity Asbestos Chloride Resid. disinfectant Color Conductivity Cyanide
HNO3 pH<2
6 months
HNO3 pH<2 Cool, 4C Cool, 4C None None Cool, 4C Cool, 4C Cool, 4C, ascorbic acid (if chlorinated), NaOH pH>12 None Cool, 4C Cool, 4C Cool, 4C, H2SO4, pH<2 Cool, 4C Cool, 4C None Filter immed., cool, 4C Cool, 4C Cool, 4C Cool, 4C None Cool, 4C Sodium thiosulfate or ascorbic acid, 4C, HCl pH<2 Sodium thiosulfate, cool, 4C, Sodium thiosulfate, cool, 4C, Sodium thiosulfate cool, 4C, dark Sodium thiosulfate cool, 4C, dark Cool, 4C
28 days 14 days 48 hours 28 days Immediately 48 hours 28 days 14 days
Fluoride Foaming agents Nitrate (chlor.) Nitrate (non chlor.) Nitrite Odor pH Ortho-phosphate Silica Solids (TDS) Sulfate Temperature Turbidity 502.2
504.1 505 506, 507 508 508A 508.1 515.1 515.2 524.2 525.2
Sodium sulfite, HCl pH<2 cool, 4C Sodium thiosulfate cool, 4C, dark Sodium thiosulfate, HCl pH <2, cool, 4C, dark Ascorbic acid HCl pH<2, cool 4C Sodium sulfite, dark, cool, 4C, HCl pH<2
Extract Holding Time
Suggested Sample Size 1l 100 ml 100 ml
28 days 48 hours 28 days 14 days 48 hours 24 hours Immediately 48 hours 28 days 7 days 28 days Immediately 48 hours 14 days
Type of Container Plastic or glass
50 ml 200 ml 50 ml 100 ml 1l
Plastic or glass Plastic or glass Plastic or glass Plastic or glass Plastic or glass Plastic or glass Plastic or glass Plastic or glass
300 ml
Plastic or glass
100 ml 100 ml 50 ml 200 ml 25 ml 50 ml 100 ml 100 ml 50 ml 1l 100 ml 40-120 ml
Plastic or glass Plastic or glass Plastic or glass Glass Plastic or glass Plastic or glass Plastic Plastic or glass Plastic or glass Plastic or glass Plastic or glass Glass with PFTE lined septum Glass with PFTE lined septum Glass with PFTE lined septum Amber glass w/ PFTE lined cap Glass with PFTE lined cap Glass with PFTE lined cap Glass with PFTE lined cap Amber glass w/ PFTE lined cap Amber glass w/ PFTE lined cap Glass with PFTE lined septum Amber glass w/ PFTE lined cap
14 days
24 hours
40 ml
14 days (7 days for heptachlor) 14 days (see meth. for exceptions) 7 days (see meth. for exceptions) 14 days
24 hours
40 ml
Dark, 14 days Dark, 14 days 30 days
1l
14 days (see meth. 30 days for exceptions) 14 days Dark, 28 days 14 days Dark, 14 days 14 days
1l
14 days (see meth. 30 days for exceptions) from collection
1l
1l 1l
1l 1l 40-120 ml
Sodium thiosulfate, monochloroacetic acid, pH<3, cool, 4C Sodium thiosulfate cool, 4C Sod. thio. (HCl pH 1.5-2 if high biol. act.) cool, 4C, dark Sodium thio., (H2SO4 pH <2 if biol. act.), cool, 4C, dark
28 days
60 ml
Glass with PFTE lined septum
14 days (18 mo. frozen) 7 days
60 ml 14 days
>= 250 ml
7 days
21 days
>= 250 ml
550, 550.1
Sodium thiosulfate cool, 4C, HCl pH<2, dark
7 days
550 30d., 1 l 550.1 40d
551
Sodium thiosulfate, sodium sulfite, amm. Cl, or asc. acid, HCl pH 4.5-5.0, cool, 4C Sodium sulfite, HCl, pH <= to 2, dark, cool 4C Sodium thiosulfate, cool, 04C, dark
14 days
>= 40 ml
Glass with PFTE lined septum Amber glass with PFTE lined sept. High dens. amber plast. or Silanized amber glass Amber glass with PFTE lined cap Glass with PFTE lined septum
14 days
>= 100 ml
531.1, 6610
547 548.1
549.1
555 1613B
Rec. 40 days
1l
Glass with PFTE lined cap Amber glass with PFTE lined cap
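Once sampling and analysis dates are stored in the database, holding-time limits like those in this appendix can be checked automatically during import or review. The query below sketches such a check using the Access-style DateDiff function; the table and column names (Samples.SamplingDate, Analytic.AnalysisDate, and a Parameters.HoldingDays limit expressed in days) are assumptions for illustration:

SELECT Samples.SampleID, Analytic.Parameter
FROM (Samples INNER JOIN Analytic ON Samples.SampleID = Analytic.SampleID)
INNER JOIN Parameters ON Analytic.Parameter = Parameters.ParameterName
WHERE DateDiff("d", Samples.SamplingDate, Analytic.AnalysisDate) > Parameters.HoldingDays;

Results that exceed their holding time can then be flagged with a coded qualifier rather than silently accepted.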
APPENDIX E EXERCISES
These exercises are intended to reinforce the material presented in the previous sections.
DATABASE REDESIGN EXERCISE
You receive a database with this design:
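In outline, the supplied design is along the following lines, with each result table repeating the sample description fields. This sketch is in generic SQL DDL; the field names are taken from the solution later in this appendix, and the result field names are assumed:

CREATE TABLE SoilSampleResults (
SampleDate DATE,                  -- repeated in every result table
SampleType VARCHAR(20),           -- repeated
Matrix VARCHAR(20),               -- repeated
BeginningDepth DOUBLE PRECISION,  -- repeated
EndingDepth DOUBLE PRECISION,     -- repeated
Parameter VARCHAR(50),            -- this table's own result fields (names assumed)
ResultValue DOUBLE PRECISION
);

Groundwater General Chemistry Results, QA/QC Results, and Soil Gas Data Results repeat the same five sample fields ahead of their own result fields, as does the Groundwater/Soil Sample table itself.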
Describe the problem with this database: ______________________________________________________________________________ ______________________________________________________________________________ ______________________________________________________________________________ ______________________________________________________________________________ Redraw the design to eliminate the problem:
DATA NORMALIZATION EXERCISE
You need to manage data for satellite accumulation areas for hazardous materials. You receive a database with this design:
Draw an entity-relationship diagram for a normalized version of this database:
Design an SQL statement to display the data in the layout in which it was originally presented: ______________________________________________________________________________ ______________________________________________________________________________ ______________________________________________________________________________ ______________________________________________________________________________
GROUP DISCUSSION - DATA MANAGEMENT AND YOUR ORGANIZATION
As a group, discuss how data is currently managed in your organization. Discuss the pluses and minuses of the current processes, and identify opportunities for improvement.
DATABASE REDESIGN EXERCISE SOLUTION
Description of the problem: There is no reason for the redundant fields, either in the tables or in the relationships. The fields Sample Date, Sample Type, Matrix, Beginning Depth, and Ending Depth need only be in the Groundwater/Soil Sample table, and not in Soil Sample Results, Groundwater General Chemistry Results, QA/QC Results, and Soil Gas Data Results. (It might also be possible to combine these four tables into one, but we can’t tell without seeing whether the other fields are the same or different.)
Revised design:
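In generic SQL DDL, the essence of the revised design is that the shared sample fields are stored once and each result table carries only a key back to the sample. The SampleID key and the column types are assumptions for illustration:

CREATE TABLE GroundwaterSoilSample (
SampleID INTEGER PRIMARY KEY,  -- key name assumed
SampleDate DATE,
SampleType VARCHAR(20),
Matrix VARCHAR(20),
BeginningDepth DOUBLE PRECISION,
EndingDepth DOUBLE PRECISION
);

CREATE TABLE SoilSampleResults (
SampleID INTEGER REFERENCES GroundwaterSoilSample (SampleID),
Parameter VARCHAR(50),         -- result fields, names assumed
ResultValue DOUBLE PRECISION
);

Groundwater General Chemistry Results, QA/QC Results, and Soil Gas Data Results take the same form, each keeping only its own result fields plus the key.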
DATA NORMALIZATION EXERCISE SOLUTION
Entity-relationship diagram:
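The diagram resolves the flat layout into three related tables: areas contain drums, and drums contain content items. Read together with the SQL statement below, the structure can also be written out as generic SQL DDL; the table and field names come from the query, while the column types and the ContentID key are assumptions:

CREATE TABLE Areas (
AreaID INTEGER PRIMARY KEY,
AreaName VARCHAR(50),
Location VARCHAR(50)
);

CREATE TABLE Drums (
DrumID INTEGER PRIMARY KEY,
AreaID INTEGER REFERENCES Areas (AreaID),  -- each drum sits in one area
DrumNumber VARCHAR(20),
DrumSize VARCHAR(20)
);

CREATE TABLE Contents (
ContentID INTEGER PRIMARY KEY,             -- key name assumed
DrumID INTEGER REFERENCES Drums (DrumID),  -- each item is stored in one drum
ItemDescription VARCHAR(100),
ItemAmount DOUBLE PRECISION
);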
SQL Statement:
SELECT Areas.AreaName, Areas.Location, Drums.DrumNumber, Drums.DrumSize,
Contents.ItemDescription, Contents.ItemAmount
FROM (Areas INNER JOIN Drums ON Areas.AreaID = Drums.AreaID)
INNER JOIN Contents ON Drums.DrumID = Contents.DrumID;
Result of SQL Statement:
DATABASE SOFTWARE EXERCISES
Readers of this book are encouraged to visit the Geotech Computer Systems, Inc. Web site to download a learning version of Enviro Data. This version comes with a set of exercises to help you practice many of the concepts in this book. Visit www.geotech.com/relman for more information. We hope these exercises will give you a feel for what can be done with site data management software.
APPENDIX F GLOSSARY
The following are some common terms used in the environmental and computing industries. The information comes from many sources, including EPA (1997a, 2000a).
Numbers
1,1,1-TCA – 1,1,1-Trichloroethane. Volatile organic contaminant.
1,1,2-TCA – 1,1,2-Trichloroethane. Volatile organic contaminant.
1,1-DCA – 1,1-Dichloroethane. Volatile organic contaminant.
1,2-DCA – 1,2-Dichloroethane. Volatile organic contaminant.
2,4,5-T – 2,4,5-Trichlorophenoxyacetic acid. Semivolatile organic contaminant.
2,4-D – 2,4-Dichlorophenyl acetic acid. Semivolatile organic contaminant.
4-DDD – 4,4'-Dichloro diphenyl dichloro ethane. Pesticide.
4-DDE – 4,4'-Dichloro diphenyl dichloro ethylene. Pesticide.
68000/68020/68030 – Processors made by Motorola used in the Apple Macintosh, Commodore Amiga, some UNIX workstations, and other computers.
8088/80286/80386/80486/Pentium – Processors made by Intel and used in IBM-PC compatible computers.
9-track tape – Tape type used most commonly for mainframes, although drives are available for PCs. The most common data densities are 1600 or 6250 bits per inch (BPI). Users must make certain they have software to translate data from the tape format into something usable on the target machine. An example of using 9-track tapes is the downloading of seismic field tapes or well log data onto a PC floppy or hard disk for processing.
A
AA – Atomic absorption. A procedure for inorganic analysis based on the absorption of radiation by mercury vapor (cold vapor), flame, or graphite furnace.
Abscissa – Horizontal or X-axis of a graph.
Absolute coordinates – Coordinates tied to an established reference system such as position on the globe.
Absolute method – A body of procedures and techniques for which measurement is based entirely on physically defined fundamental quantities.
Acceptable quality level – A limit above which quality is considered satisfactory and below which it is not. In sampling inspection, the maximum percentage of defects or failures that can be considered satisfactory as an average.
Acceptable quality range – The interval between specified upper and lower limits of a sequence of values within which the values are considered to be satisfactory.
Acceptable value – An observed or corrected value that falls within the acceptable range. See Corrected value and Observed value.
Acceptance sampling – The procedure of drawing samples from a lot or population to determine whether to accept or reject a sampled lot or population.
Accepted reference value – A numerical quantity that serves as an agreed-upon basis for comparison, and which is derived as 1) a theoretical or established quantity based on scientific principles, 2) an assigned value, based on experimental work of some recognized organization, or 3) a consensus quantity based on collaborative experimental work under the auspices of a scientific or engineering group.
Access time – Hard disk speed is rated by its access time, given in milliseconds (ms). The shorter the access time, the faster data can be manipulated and stored.
Accreditation – A formal recognition that an organization (e.g., laboratory) is competent to carry out specific tasks or specific types of tests. See also Certification.
Accreditation criterion – A requirement that a laboratory must meet to receive authorization and approval to perform a specified task.
Accredited laboratory – A laboratory that has been evaluated and given approval to perform a specified measurement or task, usually for a specific property or analyte and for a specified period of time.
Accuracy – The degree of agreement between an observed value and an accepted reference value. Inaccuracy includes a combination of random error (precision) and systematic error (bias) components which are due to sampling and analytical operations. EPA recommends that this term not be used and that precision and bias be used to convey the information usually associated with accuracy. See Precision and Bias.
ACIL – American Council of Independent Laboratories. Trade group of independent testing laboratories that fosters communication between laboratories.
ACS – American Chemical Society.
Action limit – See Control limit.
Acute toxicity – The effect of high-level, short-term (as opposed to long-term or chronic) exposure to a toxic substance.
Adjusted value – The observed value after adjustment for values of a blank or bias of the measurement system.
ADSL – Asymmetrical digital subscriber line broadband communication connection.
Adsorption – Adhesion of contaminants to liquids or solids.
Aerobic – Having a high oxygen content, or an organism that lives or is active in the presence of oxygen.
AES – Atomic Emission Spectrometry.
AFCEE – Air Force Center for Environmental Excellence. Organization responsible for the ERPIMS environmental data system.
AFE – Authority for Expenditure. A document authorizing expenditure of funds, similar to a purchase order.
Algorithm – A numerical method. For example, in mapping, a common algorithm used for creating grids is weighted moving average.
Aliquant – A subsample derived by a divisor that divides a sample into a number of equal parts but leaves a remainder; a subsample resulting from such a divisor. See Subsample. Aliquot – A subsample derived by a divisor that divides a sample into a number of equal parts and leaves no remainder; a subsample resulting from such a division. In analytical chemistry the term aliquot is generally used to define any representative portion of the sample. Alpha error – See Type I error. Alternate method – Any body of procedures and techniques of sample collection and/or analysis for a characteristic of interest which is not a reference or approved equivalent method but which has been demonstrated in specific cases to produce results comparable to those obtained from a reference method. Anaerobic – Having low oxygen content, or an organism that lives or is active in the absence of oxygen. Analysis (chemical) – The determination of the qualitative and/or quantitative composition of a substance. Analyte – The substance, a property of which is to be measured by chemical analysis. Analytical batch – A group of samples, including quality control samples, which are processed together using the same method, the same lots of reagents, and at the same time or in continuous, sequential time periods. Samples in each batch should be of similar composition and share common internal quality control standards. Analytical blank – See Reagent blank. Analytical limit of discrimination – See Method detection limit. Analytical protocol – See Statement of Work (SOW). Analytical reagent (AR) – The American Chemical Society’s designation for the highest purity of certain chemical reagents and solvents. See Reagent grade. ANOVA – Analysis of variance. Statistical test to determine whether two populations have the same mean. ANSI – American National Standards Institute. This organization develops and publishes standards in a variety of technical areas. ANVO – Accept No Verbal Orders. All project changes should be in writing. AOC – Analytical Operations Data Quality Center. The U.S. EPA Center which directs the national Contract Laboratory Program. Append – To append something is to add a block or file to an existing file without removing the contents of the original file. The new material is “tacked on” the end of the existing file. Application – Software to perform a particular task such as data management or mapping. APPS – Act to Prevent Pollution from Ships. Aquifer – An underground geologic unit that can store water and supply it to wells and springs. Aquitard – An underground geologic unit with a low permeability that inhibits the vertical flow of water. See also Confining layer. ARAR – Applicable or Relevant and Appropriate Requirement. Cleanup or other standards that address problems or situations present at a CERCLA site. They are used to set cleanup goals, select a remedy, and determine how to implement that remedy. Architecture – The internal electronic configuration of the data pathway on the motherboard is known as the architecture. Arithmetic mean – The sum of all the values of a set of measurements divided by the number of values in the set; a measure of central tendency. See Measure of central tendency. Aroclor – Polychlorinated biphenyl (also called PCB).
Aromatics – Organic compounds that contain structures of six carbon rings, such as creosote, toluene, and phenol. Array – Set of values arranged in rows and columns. Also used for a line or grid of sensors, such as geophysical devices. Array processor – Computing device designed to perform calculations on arrays of data. Used in graphic display, geophysical, and other number-intensive applications. ASCII – American Standard Code for Information Interchange (pronounced as′-kee); the most common way of representing data for microcomputers and many larger machines as well. Each character is represented by a number, with the numbers ranging from 0 to 127. IBM has extended the ASCII code from 128 to 255 to allow many additional graphics, foreign language, and special characters. Some examples of ASCII codes (in decimal) and their representative meanings are:

    ASCII Code    Character
    7             (rings bell)
    12            (advances printer page)
    27            (escape)
    38            &
    42            *
    48            0
    49            1
    57            9
    65            A
    90            Z
    97            a
    122           z
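Most programming languages expose this mapping directly; a minimal Python illustration:

    # ASCII codes, as tabulated above, via Python's built-ins.
    print(ord('A'))             # 65
    print(chr(122))             # 'z'
    print(ord('9') - ord('0'))  # 9; digit characters are consecutive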
ASP – Application Service Provider. A company that hosts software that users can operate over the Internet. ASP – Active Server Pages. A Microsoft technology that creates Web pages on the fly based on user requests. Usually the data displayed is coming from a database. The language used is a dialect of Visual Basic. Aspect ratio – Relative scale of the horizontal and vertical axes. Assignable cause – A factor or an experimental variable shown to significantly change the quality of an effect or a result. ASTM – American Society for Testing and Materials. Asynchronous – Asynchronous means not at the same time and refers to communications from a computer to some other computer or peripheral device. For most instances, the term serial can be substituted. Attribute – Textual or numeric information associated with a graphic object such as a well or a block in a drawing. Audit – A systematic evaluation to determine the conformance to quantitative specifications of some operational function or activity. See Audit of data quality, Performance evaluation audit, and Technical systems audit, and also Review and Management systems review. Audit of data quality (ADQ) – A qualitative and quantitative evaluation of the documentation and procedures associated with environmental measurements to verify that the resulting data are of acceptable quality. Audit sample – See Performance evaluation sample. Average – A measure of the most “typical” value in a set of data. Types of averages include arithmetic mean, geometric mean, median, and mode. Axis – A line used for reference, such as the scale lines on a graph.
B

B2B – Business to business. Used mostly to describe Internet commerce transactions. B2C – Business to consumer. Also used mostly to describe Internet commerce transactions. Background level (environmental) – The concentration of substance in a defined control area during a fixed period of time before, during, or after a data gathering operation. Backlit – A display for a portable computer that has a light source behind the screen for increased brightness. Also, some digitizers are backlit, allowing them to work with transparent materials. Backup – When you make copies of a file, diskette, or hard disk, it is called a backup. Bandwidth – Amount of data that can be transferred at one time. The higher the bandwidth, the faster data moves. BASIC – Beginner’s All-Purpose Symbolic Instruction Code. This is a programming language widely used on personal computers. BASICA was an advanced version for IBM brand computers, and GW-BASIC was often found on PC compatibles. The latest version from Microsoft is Visual Basic. Batch – A quantity of material produced or processed in one operation, considered to be a uniform discrete unit. Batch file – A file containing a series of commands that are executed as a group by the operating system. They must have the extension of .bat to be recognized. Batch lot – The samples collected under sufficiently uniform conditions to be processed as a group. See Batch, Batch size. Batch sample – One of the samples drawn from a batch. Batch size – The number of samples in a batch lot. Baud – Transmission rate of serial devices such as modems is given by baud rate. It roughly translates to bits per second. Typical PC modem transmission rates range from 1200 to 56,000 baud. BBS – Electronic Bulletin Board System. See Bulletin board system. Bedrock – Solid rock that underlies the soil. Beer’s law – The amount of monochromatic light absorbed by an aqueous solution is proportional to the concentration. This effect is used in colorimetric analysis methods (see the example below). Benchmark – A benchmark is a test or series of tests that uses standardized data and/or algorithms to rate hardware and software performance. Most benchmarks are rated on time, although some are compared to a known index. Beta error – See Type II error. Beta test – When a commercial hardware or software product is tested by selected individuals or companies with “real world” data outside the office of the manufacturer, the process is known as a beta test, and the users are called beta testers. BHC – Benzene hexachloride. Bias – The systematic or persistent distortion of a measurement process which deprives the result of representativeness (i.e., the expected sample measurement is different from the sample’s true value). A data quality indicator. Binary – Some software code and digital communication is in a format known as binary because the code is a series of 1s and 0s. Binary code is a base-2 system in which every number is expressed as a sum of powers of 2, so each digit is either 0 or 1. When binary data appears on the screen it has the appearance of “garbage.” Bioremediation – A treatment process that uses microorganisms such as bacteria, yeast, and fungi to break down hazardous organic substances.
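To make the Beer’s law relationship above concrete, the sketch below solves the usual form A = e · l · c for concentration. It is a minimal, hypothetical example: the absorptivity and path length values are assumptions for illustration only.

    # Beer's law: absorbance A = e * l * c, where e is the molar
    # absorptivity, l the path length, and c the concentration.
    # All numeric values here are illustrative assumptions.
    e = 250.0          # molar absorptivity, L/(mol*cm), assumed
    l = 1.0            # path length, cm, assumed
    absorbance = 0.5   # measured absorbance
    c = absorbance / (e * l)
    print(c)           # concentration in mol/L (0.002)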
BIOS – Basic Input-Output System; acts as an interface between the hardware and the software, and provides services such as reading and writing to and from memory and disk drives, and so on. It is “burned” into ROM chips using special equipment, and once installed in the computer, it is not usually changed. Bit – A bit (BInary digiT) is the smallest unit of computer information representing the presence or absence of an electrical charge. It is equivalent to yes/no or on/off. Bit map – An image represented as an array of pixels. Blank sample – A clean sample or a sample of matrix processed so as to measure artifacts in the measurement (sampling and analysis) process. Blank spike – See Spike. Blind sample – A subsample submitted for analysis with a composition and identity known to the submitter but unknown to the analyst and used to test the analyst’s or laboratory’s proficiency in the execution of the measurement process. See Double-blind sample. BMP – Best Management Practice. BOD – Biochemical Oxygen Demand. Boot – Start up a computer. Based on an analogy of pulling oneself up by one’s bootstraps. A cold boot is starting the computer by turning it on. A warm boot is restarting the computer with a software or keyboard command. Borehole – A hole dug in the ground by a drilling rig. BPI – Bits per Inch. Usually used in describing tape drive formats, where 1600 and 6250 BPI are common. Broadband – High-speed connection such as DSL or cable modem. Brownfield – Abandoned, idle, or under-used industrial and commercial facilities where expansion or redevelopment is complicated by real or perceived environmental contamination. Remediation levels for brownfields are often based on the expected use rather than arbitrary standards. BTEX – Benzene, Toluene, Ethylbenzene, and Xylenes. Bubble map – Map where values are shown as colors. Usually the colors represent ranges of data values. Also called a choropleth, dot map, or classed post map. Buffer – In computers, buffers are storage areas that hold all or parts of files for printing or plotting. They also serve as holding areas for fast file access in memory or to disk drives. Buffers can either be hardware or software. In chemistry, a buffer is a solution that tends to maintain a constant value of some parameter such as pH. Bulk sample – A sample taken from a larger quantity (lot) for analysis or recording purposes. Bulletin board system (BBS) – Computer configured to answer the telephone and allow users to download and upload software and leave messages. Largely made obsolete by the Internet. Bus – The bus is the path data takes between the motherboard and adapter cards such as the video display or drive controller cards. Byte – A byte is equal to 8 bits. The byte is the unit measure of file size and disk and RAM capacity on personal computers.
C

C – Programming language widely used for software development. C++ – Programming language based on C, but with object-oriented extensions. CAA – Clean Air Act. Cable – Assembly of wires and plugs used to connect two devices. Cable modem – High-speed communication connection.
Cache – An area of memory used to store data from a slow device (such as a hard disk or main memory) to speed up the performance of a faster device (such as the main processor). CAD/CAE/CAM/CIM – Computer-Aided Drafting (or Design), Computer-Aided Engineering, Computer-Aided Manufacturing, Computer-Integrated Manufacturing. CADRE – Computer Aided Data Review and Evaluation. The CADRE system evaluates QC results against data review criteria appropriate for a specified corresponding analytical method or procedure. Calibrant – See Calibration standard. Calibrate – To determine, by measurement or comparison with a standard, the correct value of each scale reading on a meter or other device, or the correct value for each setting of a control knob. The levels of the calibration standards should bracket the range of planned measurements. See Calibration curve. Calibration blank – Laboratory reagent water samples analyzed at the beginning of an analytical run, during the run, and at the end of the run. They verify the calibration of the system and measure instrument contamination or carry-over. Calibration curve – The graphical relationship between the known values for a series of calibration standards and instrument responses. Calibration drift – The difference between the instrument response and a reference value after a period of operation without recalibration. Calibration standard – A substance or reference material used to calibrate an instrument. Calibration-check – See Calibrate. Calibration-check standard – See Calibration standard. Candidate method – A body of procedures and techniques of sample collection and/or analysis that is submitted for approval as a reference method, an equivalent method, or an alternative method. Carbonate – In geology, rock or soil made of calcite (CaCO3) or dolomite (CaMg(CO3)2). In chemistry, the CO3 ion. Carcinogen – A substance that causes cancer. Carrying-agent – Any diluent or matrix used to entrain, dilute, or to act as a vehicle for a compound of interest. Cartesian coordinate system – Coordinates in which the X and Y axes are perpendicular and at the same scale. CAS number – Chemical Abstracts Service registry number of elements, chemical compounds, and certain mixtures. Cassette tape – Digital tape cassettes hold 20 to 40 MB of data or more and are similar to audio tape cassettes but with higher-quality tape. Cat5 – Category 5 cable. Similar to standard telephone wiring, but of higher quality, used for networking computers. CATEX – Categorical Exclusion. Outcome of the NEPA process where an action is found to have no significant effect on the environment. Cause-effect diagram – A graphical representation of an effect and possible causes. A popular one is the Ishikawa “fish bone diagram.” CCL – Construction Completions List. A list developed under CERCLA that helps identify successful completion of cleanup activities. CCS – Contract Compliance Screening. The screening of electronic and hard copy data deliverables for completeness and compliance with the contract. This screening is done under EPA direction by the SMO Contractor.
CCV – See Continuous Calibration Verification. CD-ROM – Compact Disk Read Only Memory. Optical disks which are used to distribute large amounts of data. Can be read but not written to by a personal computer. Ceiling plot – A two-dimensional contour plot that is projected above a three-dimensional mesh or floating contour plot is called a ceiling plot. These plots are usually created in mapping programs. Central line – The line on a control chart that represents the expected value of the control chart statistic; often the mean. See Control chart. CERCLA – Comprehensive Environmental Response, Compensation, and Liability Act. Initiated in December 1980, CERCLA provided broad federal authority to respond directly to the release or possible release of hazardous substances that may endanger human health or the environment. CERCLA also established a trust fund to provide for cleanup when no responsible party could be identified; hence CERCLA is commonly referred to as “Superfund.” This legislation covers environmental issues at non-operating facilities. CERCLIS – Comprehensive Environmental Response, Compensation and Liability Information System. CERCLIS is the official inventory database of Superfund hazardous waste sites. It contains information about planned and actual site activities and financial information entered by EPA regional offices. Certification – The process of testing and evaluation against specifications designed to document, verify, and recognize the competence of a person, organization, or other entity to perform a function or service, usually for a specified time. See also Accreditation. Certification of Data Quality – The real-time attestation that the activities of an environmental data collection operation’s individual elements (e.g., sampling design, sampling, sample handling, chemical analysis, data reduction, etc.) have been carried out in accordance with the operation’s requirements and that the results meet the defined quality criteria. Certified Reference Material (CRM) – A reference material that has one or more of its property values established by a technically valid procedure and is accompanied by or traceable to a certificate or other documentation issued by a certifying body. See Certification and Reference material. Certified value – The reported numerical quantity that appears on a certificate for a property of a reference material. CFC – Chlorofluorocarbon. CFR – Code of Federal Regulations. The final rules of federal agencies published every year. Environmental regulations are codified in 40 CFR, and often other sections apply to environmental projects as well. CGA – Color Graphics Adapter. Video adapter introduced with the original IBM-PC with a maximum resolution of 640x200 pixels. CGI – Common Gateway Interface. A technology that allows access to databases and programming capabilities using Web pages. It is popular on UNIX systems. Chain of custody – An unbroken trail of accountability that ensures the physical security of samples, data, and records. Chance cause – An unpredictable, random determinant of variation of a response in a sampling or measurement operation. Characteristic – See Property. Check sample – An uncontaminated sample matrix spiked with known amounts of analytes, usually from the same source as the calibration standards. It is generally used to establish the stability of the analytical system but may also be used to assess the performance of all or a portion of the measurement system. See also Quality control sample.
Check standard – A substance or reference material obtained from a source independent from the source of the calibration standard used to prepare check samples. Chi-square test – A statistical test of the agreement between the observed frequency of events and the frequency expected according to some hypothesis. Chlorinated – A compound containing chlorine. Choropleth map – Map where values are shown as colors. See Bubble map. Chronic toxicity – The effect of low-level, long-term (as opposed to short-term or acute) exposure to a toxic substance. CLASS – Contract Laboratory Analytical Services Support. Contract that operates the Sample Management Office (SMO) and is awarded and administered by EPA. The contractor-operated SMO provides management, operations, and administrative support to the CLP. The SMO contractor schedules and tracks sample shipments for CLP analytical services requests. Classed post map – Map where values are shown as colors. See Bubble map. Clastic – Rock made of grains, such as sandstone, siltstone, or shale. Clean sample – A sample of a natural or synthetic matrix containing no detectable amount of the analyte of interest and no interfering material. Clock calendar – Device which stores the correct time and date in a computer. This setting is not lost when the computer is turned off. Clock calendars are standard on most computers. Clock speed – The speed of the microprocessor and related components is known as the clock speed, and is usually expressed in megahertz, or millions of cycles per second. Clone – Computers that are compatible with the original IBM architecture are known as clones. Some of the more famous clones are Compaq, Gateway, and Dell; lesser-known brands or generic brands are often called no-name clones. CLP – Contract Laboratory Program. Supports the EPA’s Superfund effort by providing a range of chemical analytical services to produce environmental data of known quality. This program is directed by the Analytical Operations/Data Quality Center of EPA. CMM – Capability Maturity Model. A framework for organizing continuous process improvement into five maturity levels geared toward achieving a mature software process. COC – Chain of Custody. The legal document that accompanies samples to the laboratory. COD – Chemical Oxygen Demand. Coefficient of variation (CV) – A measure of relative dispersion (precision). It is equal to the ratio of the standard deviation to the arithmetic mean (see the example below). See also Relative standard deviation. Cold boot – See Boot. Collaborative testing – The evaluation of an analytical method by typical or representative laboratories using subsamples prepared from a homogeneous standard sample. Collocated sample – One of two or more independent samples collected so that each is equally representative for a given variable at a common space and time. Collocated samplers – Two or more identical sample collection devices, located together in space and operated simultaneously, to supply a series of duplicate or replicate samples for estimating precision of the total measurement system/process. Combo card – Hardware adapter with several different functions. Communications program – Software that allows a computer to talk to another computer. Often it can make the computer look like a specific type of terminal to simplify communication with a larger computer. Comparability – The degree to which different methods, data sets, and/or decisions agree or can be represented as similar; a data quality indicator.
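As a worked illustration of the coefficient of variation defined above, the Python sketch below computes it for a small set of replicate measurements; the numbers are illustrative only.

    # Coefficient of variation: standard deviation divided by the
    # arithmetic mean of the measurements.
    from statistics import mean, stdev

    results = [10.2, 9.8, 10.5, 10.1, 9.9]   # replicate results, mg/L
    cv = stdev(results) / mean(results)
    print(round(cv, 4))                       # about 0.0271, i.e., 2.7%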
Compatibility – How well a hardware device or program mimics another, usually a better-known one. Completeness – The amount of valid data obtained compared to the planned amount, usually expressed as a percentage; a data quality indicator. Component of variance – A part of the total variance associated with a specified source of variation. Composite sample – A sample prepared by physically combining two or more samples having some specific relationship and processed to ensure homogeneity. See Flow-proportioned sample and Time-proportioned sample. Confidence coefficient – The probability statement that accompanies a confidence interval and is equal to unity minus the associated Type I error rate (false positive rate). A confidence coefficient of 0.90 implies that 90% of the intervals resulting from repeated sampling of a population will include the unknown (true) population parameter. See Confidence interval. Confidence interval – The numerical interval constructed around a point estimate of a population parameter, combined with a probability statement (the confidence coefficient) linking it to the population’s true parameter value. If the same confidence interval construction technique and assumptions are used to calculate future intervals, they will include the unknown population parameter with the same specified probability (see the example below). See Confidence coefficient. Confining layer – An underground geologic unit with a low permeability that inhibits the vertical flow of water. See also Aquitard. Connectivity – Connectivity is the communication between computers with different operating systems, such as an IBM PC-compatible accessing data from a mainframe. Continuous calibration verification – A process using laboratory standards to ensure that the analysis process is in calibration. Contour – The shape of a surface. Also a line of equal value on a graph or map. Control chart – A graph of some measurement plotted over time or sequence of sampling, together with control limit(s) and, usually, a central line and warning limit(s). See Central line, Control limit, and Warning limit. Control limit – A specified boundary on a control chart that, if exceeded, indicates a process is out of statistical control; the process must be stopped and corrective action taken before proceeding (e.g., for a Shewhart chart the control limits are the mean plus and minus three standard deviations, i.e., the 99.72% confidence level on either side of the central line). Control sample – See Quality control sample and Check sample. Control standard – See Check standard. Controlled variable – A variable that is set at a pre-selected level when a controlled experiment is conducted. Controller – A controller is a piece of hardware that controls a device such as a disk drive or monitor; it is usually on an interface card or the motherboard. Coordinate conversion – Changing coordinates from one system to another, such as from latitude-longitude to Universal Transverse Mercator. Coordinates – A pair of numbers that defines the location of a point, such as a station on a map. Coordinates are related to a coordinate system, which defines the scale and units of the coordinates, such as Cartesian (linear XY) or spherical (latitude-longitude). Coprocessor – Chip or board that works with the microprocessor to perform some particular function such as mathematical or graphical calculations. Copy protection – System to prevent unauthorized use of software. May involve special diskettes or hardware keys.
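The confidence interval definitions above can be made concrete with a small worked example. The Python sketch below builds a two-sided 90% confidence interval for a mean; the t value of 2.132 (for 4 degrees of freedom) and the data are illustrative assumptions.

    # 90% confidence interval for the mean of five measurements.
    from statistics import mean, stdev
    from math import sqrt

    data = [4.1, 3.9, 4.4, 4.0, 4.2]
    t = 2.132                                  # t for df = 4, two-sided 90%
    half_width = t * stdev(data) / sqrt(len(data))
    print(mean(data) - half_width, mean(data) + half_width)

Repeating this construction on fresh samples would capture the true population mean about 90% of the time, which is exactly the statement the confidence coefficient makes.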
Corrected value – The magnitude of a specific measurement; a variable; a unit of space, time, or quantity; or a datum after correction for a blank value. See Observed value. Correlation – A measure of association between two variables. See also Correlation coefficient. Correlation coefficient – A number between -1 and 1 that indicates the degree of linearity between two variables or sets of numbers. The closer to -1 or +1, the stronger the linear relationship between the two (i.e., the better the correlation). Values close to zero suggest no correlation between the two variables. The most common correlation coefficient is the product-moment, a measure of the degree of linear relationship between two variables (see the example below). Corrosivity – A substance’s ability to corrode metals. Cost recovery – A legal process by which potentially responsible parties who contributed to contamination at a Superfund site can be required to reimburse the Trust Fund for money spent during any cleanup actions by the federal government. Cost recovery request – A request issued by an Authorized Cost Recovery Requester for detailed cost and sample documentation associated with a Superfund site. COTS – Commercial Off-the-Shelf software. Coverage – In GIS, data of a specific type such as transportation or drainage, and the area covered by that data. See also Theme and Layer. CPU – Central Processing Unit, also known as the computer. CRDL – Contract Required Detection Limit. Minimum level of detection acceptable under the contract Statement of Work. Critical-toxicity range – The interval between the highest concentration at which all test organisms survive and the lowest concentration at which all test organisms die within the test period. CRP – Community Relations Plan. CRQL – Contract Required Quantitation Limit. Minimum level of quantitation acceptable under the contract Statement of Work. CRT – Cathode Ray Tube. The TV-like video monitor on a computer. Cursor – A cursor is a pointing device used for digitizers. A cursor is also a place marker on a monitor. Cursor keys – Cursor keys are the arrow keys on the keyboard that move the cursor. CVAA – Cold Vapor Atomic Absorption. CWA – Clean Water Act. Cylinder – Area of a hard disk that can contain data. The more cylinders, the larger the capacity of each platter or disk in the disk drive. CZMA – Coastal Zone Management Act.
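A minimal sketch of the product-moment (Pearson) correlation coefficient described above, with illustrative data:

    # Product-moment correlation: covariance of the two variables
    # divided by the product of their standard deviations.
    from statistics import mean
    from math import sqrt

    def pearson_r(xs, ys):
        mx, my = mean(xs), mean(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = sqrt(sum((x - mx) ** 2 for x in xs)
                   * sum((y - my) ** 2 for y in ys))
        return num / den

    print(pearson_r([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))   # close to +1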
D

Daily standard – See Calibration standard. Daisy wheel printer – Impact printers that use a printwheel to generate output. While they cannot print graphics, daisy wheels produce text output of excellent quality but are slow and noisy. DART – Data Assessment Rapid Transmittal. DART is an active notification system providing up-to-the-minute transmittal of the CCS and CADRE evaluation data to CLP customers. Data – Facts or figures from which conclusions can be inferred. Also a collection of letters, numbers, and other information elements stored in a computer. Data assessment tool – A software-driven process that incorporates CCS, CADRE, and DART designed to produce enhanced CLP deliverables and more usable reports in a standard format.
Data cube – Data structured in multiple dimensions for data mining. Data mart – Data extracted from the main database with a focus on one particular task. Data mining – Analysis of a data warehouse using specialized tools to look at the data in various ways for trends and details. Data quality – The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data. Data quality indicators – Quantitative statistics and qualitative descriptors that are used to interpret the degree of acceptability or utility of data to the user. The principal data quality indicators are bias, precision, accuracy, comparability, completeness, and representativeness. Data Quality Objective (DQO) – Qualitative and quantitative statements of the overall level of uncertainty that a decision maker is willing to accept in results or decisions derived from environmental data. DQOs provide the statistical framework for planning and managing environmental data operations consistent with the data user’s needs. Data reduction – The process of transforming raw data by arithmetic or statistical calculations, standard curves, concentration factors, etc., and collation into a more useful form. Data set – All the observed values for the samples in a test or study; a group of data collected under similar conditions and which, therefore, can be analyzed as a whole. Data turnaround time – The maximum length of time allowed for laboratories to submit analytical data to EPA in order to avoid financial penalties (i.e., liquidated damages). Data turnaround time begins at the validated time of sample receipt (VTSR) at the laboratory. Data validation – Data validation is based on EPA region-defined criteria and limits, professional judgment of the data validator, and (if available) the Quality Assurance Project Plan (QAPP) and Sampling and Analysis Plan (SAP). Data warehouse – A centralized database with all of the data of a particular type for an organization. Database – A collection of related information. Database manager – The software that organizes a database into a usable format. Daughterboard – A small printed circuit board attached to a larger one, usually to add some additional capability. DBCP – 1,2-Dibromo-3-chloropropane. Pesticide. DCE – 1,1-Dichloroethylene. Volatile organic contaminant. DDT – Dichlorodiphenyltrichloroethane. Pesticide. Decontamination blank – See Sampling equipment blank. Default – Value used when no other value is specified. Defensible – The ability to withstand any reasonable challenge related to the veracity or integrity of laboratory documents and derived data. Degrees of freedom – The total number of items in a sample minus the number of independent relationships existing among them; the divisor used to calculate a variance term; in the simplest cases, it is one less than the number of observations. Dependent variable – See Response variable. Desktop publishing – Using a computer to lay out pages and typeset text. Specialized software is used for this along with a high-resolution output device such as a laser printer. Detection limit (DL) – The lowest concentration or amount of the target analyte that can be determined to be different from zero by a single measurement at a stated level of probability. See Method detection limit. 
Determination – The application of the complete analytical process of measuring the property of interest in a sample, from selecting or measuring a test portion to the reporting of results.
Device driver – Device drivers are small programs that set up the communication parameters between the computer or software programs and devices such as disk drives, printers, plotters, or digitizers. Digitizer – Input devices that put graphic information into a computer. They consist of a handheld position sensor (stylus or cursor), a tablet, and sophisticated electronics. Diluent – A substance added to another to reduce the concentration, resulting in a homogeneous end product without chemically altering the compound of interest. Dilution factor – The numerical value obtained from dividing the new volume of a diluted substance by its original volume (see the example below). DIP switch – Dual In-line Package switch. Commonly used to configure hardware. Directory – Group of files kept together by the operating system. Sometimes also called folders. See Subdirectory. Disk cache – See Cache. Diskette – Floppy disk. Diskette controller – Interface between a computer and a diskette drive. Diskette drive – Device for reading and writing diskettes. Display adapter – Circuitry that translates the electronic signal from the computer into the signal that feeds the monitor. These adapters come in different resolutions such as EGA or VGA. DLG – Digital Line Graph is a specialized data format used for geographic, cultural, and other data. DNAPL – Dense Non-Aqueous Phase Liquid. A fluid that doesn’t mix with water, and sinks because it is dense. Examples include chlorinated hydrocarbons such as TCE, TCA, and PCE. Document control – A systematic procedure for indexing documents by number, date, and revision number for archiving, storage, and retrieval. DOE – Department of Energy. Formerly the Energy Research and Development Administration. DOS – Disk Operating System. More specifically refers to PC-DOS and MS-DOS used by IBM-compatible computers. DOT – Department of Transportation. Dot matrix printer – Dot matrix printers are impact printers. They generate output when pins in the printhead strike a ribbon that transfers the information to paper. Dot pitch – The smallest circle that can be drawn around all of the color dots making up one pixel. The smaller the number, the sharper the appearance of the monitor. Double-blind sample – A sample submitted to evaluate performance with concentration and identity unknown to the analyst. See Blind sample. Download – Move data from another computer to yours. Opposite of upload. DPI – Dots per Inch; used for describing printer and scanner resolution. DQA – Data Quality Assessment. The third part of the EPA data verification/validation process that determines the credibility of the data. DQO – Data Quality Objective. Quality targets for a project. DRAM – Dynamic Random Access Memory. The most common type of memory chip used in personal computers. DRO – Diesel Range Organics. Drylabbing – Fraudulent laboratory practice of failing to analyze samples and then fabricating the results. DSL – Digital subscriber line broadband communication connection.
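As a one-line illustration of the dilution factor defined above (volumes are illustrative):

    # Dilution factor: new volume divided by original volume.
    # Example: 5 mL of sample brought up to 50 mL with diluent.
    print(50.0 / 5.0)   # dilution factor of 10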
Duplicate – An adjective describing the taking of a second sample or performance of a second measurement or determination. Often incorrectly used as a noun and substituted for “duplicate sample.” Replicate is to be used if there are more than two items. See Replicate. Duplicate analyses or measurements – The analyses or measurements of the variable of interest performed identically on two subsamples of the same sample. The results from duplicate analyses are used to evaluate analytical or measurement precision but not the precision of sampling, preservation, or storage internal to the laboratory. Duplicate samples – Two samples taken from and representative of the same population and carried through all steps of the sampling and analytical procedures in an identical manner. Duplicate samples are used to assess variance of the total method including sampling and analysis. See Collocated sample. Dvorak – Keyboard layout that is not widely used but which puts the most commonly used keys under the fingers. DW – Drinking Water. DXF – Drawing Exchange Files have a .dxf extension and are used to communicate between graphics programs. Although .dxf files were first used by AutoCAD, many graphics packages will handle .dxf directly or through a converter. These files are ASCII text files. Dynamic blank – A sample-collection material or device (e.g., filter or reagent solution) that is not exposed to the material to be selectively captured but is transported and processed in the same manner as the sample. See Instrument blank and Sampling equipment blank. Dynamic calibration – Standardization of both the measurement and collection systems using a reference material similar to the unknown. For example, a series of air-mixture standards containing sulfur dioxide of known concentrations could be used to calibrate a sulfur dioxide bubbler system.
E

EA – See Environmental Assessment. EBCDIC – Extended Binary Coded Decimal Interchange Code. While most computers speak ASCII, a few, especially IBM mainframes, use EBCDIC. It is a different way of encoding data, and both methods work, but in situations where machines using the two different encoding schemes must be made to talk to each other, the translation required can cause a problem. ECD – Electron Capture Detector. In pesticide/Aroclor analysis, the compounds are detected by an electron capture detector. EDB – Ethylene dibromide. Volatile organic contaminant. EDD – Electronic Data Deliverable. This is the file delivered by the laboratory containing the results of its analyses. Edit – Change the contents of a file such as a document or drawing. EDMS – Environmental Data Management System. A software program for managing environmental data. See EMS and EMIS. EGA – Enhanced Graphics Adapter. Video adapter for IBM-compatible computers with a maximum resolution of 640x350 pixels. EIS – See Environmental Impact Statement. ELCD – Electrolytic Conductivity Detector. EMIS – Environmental Management Information System. A software program to assist with an environmental management system. See EDMS and EMS. EMS – Environmental Management System. An administrative system for managing environmental issues at a facility. See EDMS and EMIS.
Emulator – Hardware or software which is designed to work like or take the place of some other type of device. Entity-relationship diagram – Diagram showing the tables, fields, and relationships in a relational database. Environmental Assessment (EA) – Screening document used to determine whether a full Environmental Impact Statement (EIS) is required. An EA can also result in a Finding of No Significant Impact (FONSI). Environmental Impact Statement (EIS) – A document prepared to assist with decision making based on the environmental consequences of a specific action. Environmental sample – A sample of any material that is collected from an environmental source. Environmentally related measurement – Any assessment of environmental concern generated through or for field, laboratory, or modeling processes; the value obtained from such an assessment. EO – Executive Order. EOX – Extractable Organic Halides. EPA – United States Environmental Protection Agency. EPCRA – Emergency Planning and Community Right-to-Know Act of 1986. This law requires industrial facilities to disclose information about chemicals stored onsite. Equipment rinseates – See Rinseate blank. Equivalent method – Any method of sampling and/or analysis demonstrated to result in data having a consistent and quantitatively known relationship to the results obtained with a reference method under specified conditions, and formally recognized by the EPA. ERPIMS – Environmental Resources Program Information Management System (formerly IRPIMS). The Air Force system for validation and management of data from environmental projects at all Air Force bases. ERPTools – Software used with ERPIMS. Error (measurement) – The difference between an observed or corrected value of a variable and a specified, theoretically correct, or true value. Error function – The mathematical relationship of the results obtained from the measurement of one or more properties and the error of the applied measurement process. See Normal distribution. ESA – Endangered Species Act. Ex situ – Out of place, such as occurring out of the ground. Ex-situ treatment processes involve removal of the contaminated material from the ground prior to treatment. Expansion slots – Places in the computer where adapter cards can be inserted to increase the computer's capabilities. Experimental variable – See Controlled variable. Export – To transfer a file out from one program to another is to export a file. Extension – The extension is the three-letter suffix after the period in a file name. Examples of extensions include .bat for batch files or .dxf for drawing exchange files. External quality control – The activities that are routinely initiated and performed by persons outside of normal operations to assess the capability and performance of a measurement process.
F

False negative decision – See Type II error.
False negative result – Estimating (incorrectly) that an analyte is not present when it actually is present. False positive decision – See Type I error. False positive result – Estimating (incorrectly) that an analyte is present when it is actually not present. FAT – The File Allocation Table is the data area on a floppy or hard disk that tells the operating system where to find each file on the disk. Because of its importance, there are two FATs on every disk. Feasibility study – A description and analysis of potential cleanup alternatives for a site such as one on the National Priorities List. The feasibility study usually recommends selection of a cost-effective alternative. It usually starts as soon as the remedial investigation is under way. Together, they are commonly referred to as the “RI/FS.” FID – Flame Ionization Detector. Field blank – A clean sample (e.g., distilled water), carried to the sampling site, exposed to sampling conditions (e.g., bottle caps removed, preservatives added), and returned to the laboratory and treated as an environmental sample. Field blanks are used to check for analytical artifacts and/or background introduced by sampling and analytical procedures. See Dynamic blank and Sampling equipment blank. Field duplicates – See Duplicate samples. Field reagent blank – See Field blank. Field sample – See Sample. Field sample spikes – Samples that have been spiked with known amounts of target analytes in the field prior to shipment to the laboratory. These are submitted as double-blind quality control samples to measure the recovery of target analytes for both field and laboratory procedures. The frequency of these spikes is project specific. FIFRA – Federal Insecticide, Fungicide, and Rodenticide Act of 1972. Provides federal control of pesticide distribution, sale, and use. EPA is authorized to study the consequences of pesticide usage, and to require users such as farmers and utility companies to register when purchasing pesticides and to take exams for certification as applicators of pesticides. All pesticides used in the U.S. must be registered (licensed) by EPA. File – Cohesive unit of data stored by a computer. Can be either a program or a data file. FireWire – IEEE 1394 high-speed serial interface for connecting peripherals to computers. Fixed disk drive – Hard disk. Floaters – See LNAPL. Floating contours – Floating contours are also known as constant Z plots. A topographic map transformed into three dimensions is an example of a floating contour map. Floating point processor – Math coprocessor. Floppy disk – Flexible disk for storing data. Also called a diskette. Floppy drive – A device inside or attached to a computer for reading and writing floppy disks (diskettes). FLOPS – FLoating point Operations Per Second; a measure of processor speed. Flow rate – The quantity-per-unit-time of a substance passing a point, plane, or space; for example, the volume or mass of gas or liquid emerging from an orifice, pump, or turbine or moving through a point in a conduit or channel. Flow-proportioned sample – A sample or subsample collected from a fluid system at a rate that produces a constant ratio of sample accumulation to matrix flow rate. Folder – Group of files kept together by the operating system. Sometimes also called directories.
FONSI – Finding of No Significant Impact. One of the possible outcomes of an Environmental Assessment. Font – Style of character as printed or viewed on the screen. Courier, Times Roman, Helvetica, and Arial are examples of different fonts. Individual fonts come in a variety of sizes. Form R – Form submitted to the EPA for the Toxics Release Inventory to report toxic materials entering the waste stream, recycled or treated at a facility, source reduction practices, and other items. Format – With disks, preparing the disk to hold data. For data files, the arrangement of data within the file. Fortify – Synonym for spike. FORTRAN – FORmula TRANslator. Programming language popular for number crunching. FPD – Flame Photometric Detector. Fragmentation – When data is broken into parts. Often used to describe files on a hard disk which are not contiguous. Frame relay – Medium-speed communication connection. FS – Feasibility study. FSP – Field Sampling Plan. Full-scale response – The maximum output of a measurement instrument in a given range as displayed on a meter or scale. Function key – The F1, F2, etc. keys on a computer keyboard are the function keys. The functions assigned to these keys vary with software. Functional analysis – A mathematical evaluation of each component of the measurement system (sampling and analysis) in order to quantitate the error for each component. A functional analysis is usually performed prior to a ruggedness test in order to determine which variables should be studied experimentally.
G

Game port – Adapter card which allows a joystick to be added to a computer. Gaussian distribution – See Normal distribution. GC – Gas Chromatography. The instrument used to separate analytes on a stationary phase within a chromatographic column. Gas chromatography is frequently used with other instruments for analyzing organic compounds. GC/MS – Gas Chromatograph/Mass Spectrometry. Geodetic coordinate system – Coordinates where the position is given as location on the globe. Geometric mean – The antilogarithm of the mean of the logarithms of all the values in a set. A type of average (see the example below). Gigabyte – About one billion bytes, abbreviated as GB. GIS – Geographic Information Systems are used for identifying and manipulating spatial or other attributes on a map. GL – Graphics Language. GML – Geography Markup Language. A protocol for transferring geographic data over the Web. Good laboratory practices (GLP) – Either general guidelines or formal regulations for performing basic laboratory operations or activities that are known or believed to influence the quality and integrity of the results. Goodness-of-fit – The measure of agreement between the data in a data set and the expected or hypothesized values.
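A worked example of the geometric mean defined above, with illustrative data:

    # Geometric mean: the antilog of the mean of the logarithms.
    from math import log10
    from statistics import mean

    values = [2.0, 8.0]
    gmean = 10 ** mean([log10(v) for v in values])
    print(gmean)   # 4.0, versus an arithmetic mean of 5.0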
Grab sample – A single sample that is collected at one point in time and place. Graphics adapter – See Video adapter. Graphics program – Software for creating graphics. Types include paint programs, graphing and charting programs, and CAD programs. Gray-scale – Refers to the number of levels of gray which a device can display or print. Grid – A grid is a mathematical representation of a surface based on data point values. Different algorithms create different grids from the same data points. GRO – Gasoline Range Organics. Gross sample – See Bulk sample. GTTP – Geographic Text Transfer Protocol. A transfer protocol at the same level as HTTP (HyperText Transfer Protocol) for transferring geographic data.
H

Hachure – Tic marks on contour lines or a pattern filling an area. Halogen – A group of elements that includes fluorine, chlorine, bromine, and iodine. These elements are commonly found in pairs, such as Cl2. Hard copy – Paper printout of text or graphic data. Hard disk – A hard disk is a storage device that consists of one or more metal platters coated with magnetic oxide. Sizes vary from 10 to 100 GB or more. Hard disk controller – Interface between a computer and a hard disk. Hardware – The physical components of a computer system. HASP – Health and Safety Plan. Hazardous waste site – A site contaminated with substances that can pose a substantial or potential hazard to human health or the environment. HAZMAT – HAZardous MATerials. HAZWRAP – HAZardous Waste Remedial Action Plan. Head – Parts of a disk drive that read and write the data. The more heads, the larger the capacity of the disk drive. Also pressure exerted by groundwater. Head crash – Failure of a disk drive where the head contacts the disk. Usually results in loss of data and damage to the drive. Head parking – The capability of some disk drives to move the heads to an area of the disk that does not contain data when the computer is turned off. Heavy metal – A group of toxic metallic elements that includes arsenic, cadmium, chromium, copper, lead, mercury, silver, and zinc. HECD – Hall Electrolytic Conductivity Detector. Hexadecimal – The hexadecimal number system is a base-16 system used in many computers. Numbers are represented by the digits 0 to 9 and the letters A to F (see the example below). HMIS – Hazardous Materials Identification System. A system of colored bars that provide health warning information. HMIS labels cover, starting at the top, health, flammability, reactivity, and personal protection. Other areas may be included as well. See also NFPA, which is meant primarily for fire fighters and other emergency personnel. Homogeneity – The degree of uniformity of structure or composition. HPGL – Hewlett-Packard Graphics Language is a language to direct pen and paper handling for plotters. It is used by Hewlett-Packard and other manufacturers. HPLC – High Performance Liquid Chromatography.
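Hexadecimal values can be manipulated directly in most languages; a minimal Python illustration:

    # Hexadecimal (base-16) notation.
    print(0x1A)           # 26, i.e., 1*16 + 10
    print(hex(255))       # '0xff'
    print(int('2F', 16))  # 47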
HRS – Hazard Ranking System. A numerically based screening system that uses information from initial, limited investigations to assess the relative potential of sites to pose a threat to human health or the environment. HRS is the principal mechanism EPA uses to place uncontrolled waste sites on the National Priorities List (NPL). HSWA – Hazardous and Solid Waste Amendments. 1984 amendments to RCRA that required phasing out land disposal of hazardous waste. HTML – HyperText Markup Language. Language used by most Web pages on the Internet. HW – Hazardous Waste. Hydrocarbons – Organic compounds containing carbon and hydrogen found in petroleum, natural gas, and coal. Hydrogeology – The science that studies groundwater, including its origin, occurrence, movement, and quality. Hydrology – The science that studies the properties, movement, and effects of water found above and below the earth’s surface. Hz – Hertz or cycles per second. For monitors, used for the vertical scan rate, or the rate at which the screen is redrawn from top to bottom. A larger number means a more stable looking image (less flicker).
I

Icon – Graphical symbol representing a command or choice in a menu system. ICP/AES – Inductively Coupled Plasma/Atomic Emission Spectroscopy. A technique for the simultaneous or sequential multi-element determination of elements in solution. ICP/MS – Inductively Coupled Plasma/Mass Spectrometry. IDL – Instrument Detection Limit. IEEE 1394 – See FireWire. Ignitability – Tendency to cause fires. ILM04.1 – The current inorganic analytical protocol. Import – Transferring a file into a program from another format. In control – A condition indicating that performance of the quality control system is within the specified control limits, i.e., that a stable system of chance causes is operating and resulting in statistical control. See Control chart. In situ – In place. Cleanup of contaminants where they are found, without excavation or pumping. Independent variable – See Controlled variable. Inkjet printer – A device which prints by shooting droplets of ink at the paper. Currently the most popular type of printer. Inorganic – Substances that contain elements other than carbon and hydrogen, such as metals and nutrients. Inspection criterion – The specification(s) and rationale for rejecting and accepting samples in a particular sampling plan. Institutional controls – Legal or institutional measures that limit activities at a property to ensure protection of human health and the environment. Instrument blank – A clean sample processed through the instrumental steps of the measurement process; used to determine instrument contamination. See Dynamic blank. Instrument carryover blank – Laboratory reagent water samples which are analyzed after a high-level sample. They measure instrument contamination after analyzing highly concentrated samples, and are analyzed as needed when high-level samples are analyzed.
Interference – A positive or negative effect on a measurement caused by a variable other than the one being investigated. Interference equivalent – The mass or concentration of a foreign substance which gives the same measurement response as one unit of mass or concentration of the substance being measured. Interlaboratory calibration – The process, procedures, and activities for standardizing a given measurement system to ensure that laboratories participating in the same program can produce comparable data. Interlaboratory method validation study (IMVS) – The formal study of a sampling and/or analytical method, conducted with replicate, representative matrix samples, following a specific study protocol and utilizing a specific written method, by a minimum of seven laboratories, for the purpose of estimating interlaboratory precision, bias, and analytical interferences. Interlaboratory precision – A measure of the variation, usually given as the standard deviation, among the test results from independent laboratories participating in the same test. Interlaboratory test – A test performed by two or more laboratories on the same material for the purpose of assessing the capabilities of an analytical method or for comparing different methods. Interlaced – For video displays, interlacing means that alternate scan lines are drawn on each redraw. This allows higher resolution at a lower bandwidth. For interlaced system memory, alternate banks of RAM are accessed successively to minimize wait states for the microprocessor. Internal quality control – See Intralaboratory quality control. Internal standard – A standard added to a test portion of a sample in a known amount and carried through the entire determination procedure as a reference for calibration and controlling the precision and bias of the applied analytical method. Interpolate – Establish an intermediate value between two points of known value. Intralaboratory precision – A measure of the method/sample specific analytical variation within a laboratory; usually given as the standard deviation estimated from the results of duplicate/replicate analyses. See also Standard deviation and Sample variance. Intralaboratory quality control – The routine activities and checks, such as periodic calibrations, duplicate analyses, and spiked samples, that are included in normal internal procedures to control the accuracy and precision of measurements. IO – Input-Output. An IO card is an adapter card that adds one or more input-output ports to the computer. IO bandwidth – Rate at which a device such as a computer can input and output data. Ion – An atom that has lost or gained one or more electrons, resulting in a positive or negative charge. IPS – Inches per Second is a measure of speed. Usually used for plotters and tape drives. IR – Infrared. Also used for infrared spectrophotometry. IRDMIS – Installation Restoration Data Management Information System. A system that supports the technical and managerial requirements of the Army’s Installation Restoration Program (IRP) and other environmental efforts of the U.S. Army Environmental Center. IRIS – Integrated Risk Information System. An electronic database of EPA regulatory information about chemical constituents. Isopach or Isopleth – A line of equal value (contour) on a graph or map. IT – Information Technology. Often companies have a group with this name (or IS for Information Solutions or Services) to provide technology solutions for the rest of the company.
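As a short sketch of the Interpolate entry above, linear interpolation between two points of known value; the function name and data are illustrative.

    # Linear interpolation between two points of known value.
    def interpolate(x, x0, y0, x1, y1):
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    # Example: estimate a value at depth 12.5 from readings
    # of 4.0 at depth 10 and 6.0 at depth 15.
    print(interpolate(12.5, 10.0, 4.0, 15.0, 6.0))   # 5.0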
J
Joint and several liability – A legal concept that allows responsibility to be placed regardless of the amount of damage caused by each PRP. One PRP therefore can be held liable for the entire cost of cleanup. Joystick – A joystick is an input device usually used for games. It consists of a freely movable stick inside a small electronic device. A game port is required to connect a joystick to a computer. Juicing – Fraudulent laboratory practice of manipulating the sample prior to analysis by fortification with additional analyte.
K
Keyboard – A keyboard is the most common input device for all types of computers. The most popular one has a layout similar to that of a typewriter. kHz – Kilohertz or thousands of cycles per second. For monitors, used for the horizontal scan rate, or the rate at which the horizontal lines of pixels are drawn on the screen. A larger number allows higher resolution. Kilobyte – About 1000 bytes (actually, 1024), abbreviated KB.
L
Laboratory accreditation – See Accredited laboratory and Accreditation. Laboratory blank – See Reagent blank. Laboratory control sample – See Quality control sample. Laboratory control standard – See Quality control sample. Laboratory duplicates – Synonym for duplicate analyses. Laboratory performance check solution – A solution of method and surrogate analytes and internal standards used to evaluate the performance of the instrument system against defined performance criteria. Laboratory reanalyses – See Replicate analysis or measurements. Laboratory replicates – See Replicate analysis or measurements. Laboratory sample – A subsample of a field, bulk, or batch sample selected for laboratory analysis. Laboratory spiked blank – See Spiked reagent blank. Laboratory spiked sample – See Spiked sample. Lambert projection – Method of representing spherical coordinates (latitude-longitude) on a flat map using a cone; the Lambert Conformal Conic projection is widely used for state plane coordinate systems. LAN – Local Area Network. A way of connecting several computers together. Landscape orientation – A device or printout where the horizontal axis is longer than the vertical axis. The opposite of portrait. Landfill – A land disposal site for solid wastes. Laptop – Small, battery-operated computer. Laser printer – Laser printers provide very fast, high-quality output using a laser and a toner cartridge. Layer – In GIS, data of a specific type such as transportation or drainage. See also Coverage and Theme. LC – Liquid Chromatography.
LCS – Laboratory Control Standard or Laboratory Control Sample. LDAR – Leak Detection and Repair. An EPA program under the Clean Air Act that requires refineries to develop and implement a program to monitor for and repair leaks. LDC – Legacy Data Center. Older EPA system for managing water quality information. LDR – Land Disposal Restrictions. The EPA’s LDR program works to minimize potential environmental threats resulting from land disposal of hazardous wastes by establishing hazardous waste protocol and treatment requirements that make the waste safe for land disposal. Leachate – A liquid, possibly containing contaminants, resulting from water moving through material such as hazardous waste or a landfill. Leaching – The chemical and physical process by which chemicals are dissolved and moved through the soil by water or other fluids. Least squares method – A technique for estimating model coefficients which minimizes the sum of the squares of the differences between each observed value and its corresponding predicted value derived from the assumed model (see the formula below). Light pen – Pen-shaped device for pointing to the screen. Used for drawing and menu selection. Limit of detection (LOD) – See Method detection limit. Limit of quantification (LOQ) – The concentration of analyte in a specific matrix for which the probability of producing analytical values above the method detection limit is 99%. LIMS – Laboratory Information Management System. This is software that takes data from laboratory instruments, performs calculations, and creates electronic data deliverables. Linearity – The degree of agreement between the calibration curve of a method and a straight-line assumption. Linux – An open-source (pretty much free) version of the UNIX operating system for PCs. Named after Linus Torvalds, who wrote the original version. Linux is increasing in popularity, especially for Internet applications such as Web servers. LLE – Liquid-Liquid Extraction. LNAPL – Light Non-Aqueous Phase Liquid. A fluid that doesn’t mix with water, and floats because it is less dense. Examples include gasoline and oil. Logarithm – The exponent to which a number (base) must be raised to produce a given number. Typical values for the base are 10 and e, the base of natural logarithms (2.71828…). Lognormal distribution – Distribution of data values where the logarithm of the values forms a normal distribution. Lot – A number of units of an article or a parcel of articles offered as one item; commonly, one of the units, such as a sample of a substance under study. See Batch. Lot size – The number of units in a particular lot. See Batch lot and Batch size. Lower control limit – See Control limit. Lower warning limit – See Warning limit. LQAP – Laboratory Quality Assurance Plan. LSE – Liquid-Solid Extraction. LUFT, LUST – Leaking Underground Fuel Tank, Leaking Underground Storage Tank.
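For reference, the least squares criterion can be stated compactly in standard notation (not from the text): for observations $(x_i, y_i)$, $i = 1, \ldots, n$, and a fitted line $\hat{y} = a + bx$, the coefficients are chosen to minimize

\[ S(a, b) = \sum_{i=1}^{n} \left( y_i - (a + b x_i) \right)^2 \]

which yields the familiar solutions $b = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $a = \bar{y} - b\bar{x}$.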
M
Macintosh – Family of computers from Apple. All feature a graphical user interface using a mouse. Widely used in graphic arts and prepress, but rarely in technical computing. Macro – A few keystrokes that represent many keystrokes and/or operations. Many programs provide a macro language to automate operations.
Mainframe – Large computer designed to handle many users and tasks at once. Maintenance agreement – An agreement with a hardware or software manufacturer where any service is paid for in advance. Management systems review (MSR) – The qualitative assessment of a data collection operation and/or organization(s) to establish whether the prevailing quality management structure, practices, and procedures are adequate for ensuring that the type and quality of data needed and expected are obtained. See Review and Audit. Math coprocessor – A chip that takes over mathematical operations, freeing the microprocessor to run the computer. Matrix – A specific type of medium (e.g., surface water, drinking water) in which the analyte of interest may be contained. See Medium. Matrix spike – See Spiked sample. Matrix spike duplicate – A duplicate of a matrix spike, used to measure the laboratory precision between samples. Usually one matrix spike duplicate is analyzed per sample batch. Percent differences between matrix spikes and matrix spike duplicates can be calculated. Matrix spike duplicate sample analysis – See Matrix, Duplicate analyses, and Spiked sample. Maximum contaminant level – The highest permissible concentration of a pollutant that may be delivered to any receptor. Maximum holding time – The length of time a sample can be kept under specified conditions without undergoing significant degradation of the analyte(s) or property of interest. MCL – Maximum Contaminant Level. MDL – Method Detection Limit. Mean – See Arithmetic mean. Measure of central tendency – A statistic that describes the grouping of values in a data set around some common value (e.g., the median, arithmetic mean, or geometric mean). Measure of dispersion – A statistic that describes the variation of values in a data set around some common value. See Coefficient of variation, Range, Sample variance, and Standard deviation. Measurement range – The range over which the precision and/or recovery of a measurement method are regarded as acceptable. See Acceptable quality range. Measurement standard – A standard added to the prepared test portion of a sample (e.g., to the concentrated extract or the digestate) as a reference for calibrating and controlling measurement or instrumental precision and bias. Median – The middle value for an ordered set of n values represented by the central value when n is odd or by the mean of the two most central values when n is even (see the example below). Medium – A substance (e.g., air, water, soil) which serves as a carrier of the analytes of interest. See Matrix. Medium blank – See Field blank and Reagent blank. Megabyte – About one million bytes, abbreviated MB. Megahertz – Millions of cycles per second. For monitors, computer systems, or microprocessors, the rate at which instructions are executed or data is transferred. A larger number means faster processing, but the processing rate also depends on the type of processor and the number of bits being transferred at one time. Memory – Chips used in a computer for short-term handling of data. See RAM. Memory cache – See Cache. Menu – Screen that presents a list of choices to the user. Mesh plot – Graphical display consisting of three-dimensional plots with constant X value and constant Y value lines at specified intervals.
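The median rule just given translates directly into code; a minimal Python sketch (the values are hypothetical):

    def median(values):
        # Central value when n is odd; mean of the two central values when n is even
        s = sorted(values)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

    print(median([3, 1, 7]))      # 3
    print(median([3, 1, 7, 5]))   # 4.0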
Metadata – Data about data. A description of the content of a data set is the metadata for that data set. Method – A body of procedures and techniques for performing a task (e.g., sampling, characterization, and quantification) systematically presented in the order in which they are to be executed. Method blank – A clean sample processed simultaneously with and under the same conditions as samples containing an analyte of interest through all steps of the analytical procedure. Method check sample – See Spiked reagent blank. Method detection limit (MDL) – The minimum concentration of an analyte that, in a given matrix and with a specific method, has a 99% probability of being identified, qualitatively or quantitatively measured, and reported to be greater than zero. See Detection limit. Method of least squares – See Least squares method. Method of standard addition – Analysis of a series of field samples which are spiked at increasing concentrations of the target analytes. This provides a mathematical approach for quantifying analyte concentrations of the target analyte. It is used when spike recoveries are outside the QC acceptance limits specified by the method. Method performance study – See Interlaboratory method validation study. Method quantification limit (MQL) – See Limit of quantification and also Method detection limit. MFM – Modified Frequency Modulation. One type of hard disk controller. MHz – See Megahertz. Microcomputer – Computer systems like IBM-compatibles, Macintoshes, and other small computers are collectively known as microcomputers. Microprocessor – The “guts” of a PC is the microprocessor. Familiar processor designations include the Intel Pentium and the Motorola 68030. Microsoft Windows – A graphical user interface for Intel processor and compatible computers. Allows several programs to be active at one time and supports a mouse-based point and shoot interaction with the computer. Millisecond – See ms. Minicomputer – Computer systems that are intermediate between mainframes and microcomputers are known as minicomputers. Some models of VAX, Prime, and others are minicomputers. Minimum detectable level – See Method detection limit. MIPS – Million Instructions per Second. A rating of speed for computers. Mode – The most frequent value or values in a data set. Modem – MOdulators-DEModulators are used for communication between different computers over telephone lines. They are rated by their transmission rate in baud (roughly bits per second). The Hayes (modem manufacturer) command set has become the de facto standard for interface between communications software and modems on personal computers. Monitoring – Observation to determine the level of compliance with regulations or to assess pollutant levels. Monitoring well – A well drilled specifically for monitoring purposes. Motherboard – The board inside a personal computer that contains the microprocessor, math coprocessor socket, card slots, and other necessary chips to make the computer work. Mouse – A mouse is a pointing device used for menu selection and moving the cursor around on the screen.
MS – Mass Spectrometry. In volatile and semivolatile analysis, the compounds are detected by a mass spectrometer. MS – Matrix Spike. ms – Millisecond or 1/1000 second. Hard disk access times are rated in milliseconds. MSA – See Method of standard addition. MSD – Matrix Spike Duplicate. MS-DOS – Microsoft’s version of the DOS operating system for PC-compatible computers from manufacturers other than IBM. MSDS – Material Safety Data Sheet. Documents prepared for each substance by its manufacturer describing the properties and safety issues for that substance. Multifunction card – Adapter card with several features such as serial port, parallel port, game port, clock-calendar, etc. Multipoint calibration – The determination of correct scale values by measuring or comparing instrument responses at a series of standardized analyte concentrations; used to define the range for generating quantitative data of acceptable quality (see the example below). Multitasking – The capability of some computers and operating systems to perform more than one activity at once. Multi-user – The capability of some computers and operating systems to be used by more than one person at once. Mutagenesis – The formation of mutations or changes in genetic material, either in the affected individual or in future generations.
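As an illustration of multipoint calibration (a hypothetical sketch with invented standards, not a procedure from the text), a straight-line calibration curve can be fitted to standard concentrations and instrument responses, then inverted to quantify an unknown:

    import statistics

    # Hypothetical calibration standards: concentration (ppm) vs. instrument response
    conc = [0.0, 1.0, 2.0, 5.0, 10.0]
    resp = [0.02, 1.05, 1.98, 5.10, 9.95]

    # Fit response = intercept + slope * concentration by least squares (Python 3.10+)
    slope, intercept = statistics.linear_regression(conc, resp)

    # Invert the curve to estimate the concentration of an unknown sample
    unknown_response = 3.4
    print((unknown_response - intercept) / slope)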
N
N/A – Not applicable. NAA – Non-Attainment Area. NAAQS – National Ambient Air Quality Standards. NAD 27 – The North American Datum of 1927 is a network of triangulation stations and surveys criss-crossing the United States. The purpose of establishing this datum was so surveyors could have access to a system of known reference points for mapping. NAD 83 – As satellite information and distance measurement equipment became available, surveyors became increasingly frustrated with their surveys failing to agree with the reference points established by NAD 1927. To solve this problem, the U.S. Coast and Geodetic Survey completed the North American Datum of 1983. Changing from NAD 27 to NAD 83 will change the latitude, longitude, and elevation value for many points in North America, some by as much as 100 m. Nanosecond – See ns. NAPL – Non-Aqueous Phase Liquid. A fluid that doesn’t mix with water. NAPLs can be light (LNAPL) and float on water, or dense (DNAPL) and sink. NAS – Network Attached Storage. A server appliance that can be attached to a network to provide storage independent of a server. NCP – National Contingency Plan. Criteria and procedures for cleanup of Superfund sites. NEPA – National Environmental Policy Act of 1969. NESHAP – National Emission Standard for Hazardous Air Pollutants. Netware – Popular standard from Novell for connecting several computers together on a network. Network – Method of connecting several computers together to share data and applications.
NFG – National Functional Guidelines. Documents designed to offer guidance on inorganic, organic, and organic low concentration CLP analytical data evaluation and review. NFPA – National Fire Protection Association. A diamond system that denotes firefighting hazards for emergency personnel. Corners of the diamond represent (clockwise from left) health hazard, flammability, reactivity, and special hazards. See HMIS. NGVD – National Geodetic Vertical Datum. Reference system for surveyed elevations. NIOSH – National Institute for Occupational Safety and Health. NIST – National Institute of Standards and Technology. Node – In networking, a computer or terminal on the network. In mapping, the intersections of the X and Y lines of a grid. In GIS and CAD, the intersections of segments of a polyline or polygon. Noise – The sum of random errors in the response of a measuring instrument. Non-interlaced – See Interlaced. Non-parametric test – A statistical test suitable for use on a non-normal distribution. See Parametric test. Non-point source – A source of pollution that does not have a specific point of origin, such as contamination from an agricultural area. Normal distribution – An idealized probability density function that approximates the distribution of many random variables associated with measurements of natural phenomena and takes the form of a symmetric “bell-shaped curve.” Novell – See Netware. NPD – Nitrogen-Phosphorus Detector. NPDES – National Pollutant Discharge Elimination System. A program authorized by the Clean Water Act that controls water pollution by regulating point sources that discharge pollutants into waters of the United States. This is done via an NPDES permit, which is usually administered by authorized states. NPL – National Priorities List. A list of sites for hazardous waste cleanup under the Superfund program. NTP – Normal Temperature and Pressure. NRC – National Response Center. A communications center maintained by the Coast Guard that tracks discharges or releases of hazardous substances into the environment. NRC – Nuclear Regulatory Commission. The independent agency that regulates civilian uses of nuclear materials. ns – Nanosecond or 1/1,000,000,000 second. Backup power supplies are ns-rated for the time it takes for the system to detect and correct a power problem. NTA – NitriloTriacetic Acid. Carcinogenic phosphate substitute banned in the U.S. NTSC – National Television System Committee. Usually refers to a video standard compatible with U.S. broadcast television. Nyquist Rule – Rule in computer gridding and contouring stating that for most algorithms, the average grid block size should allow a data point every 2 to 3 cells.
O
O&F – Operational and Functional. CERCLA status where the remedy for a site is functioning properly and performing as designed, or has been in place for one year. O&M – Operation and Maintenance. CERCLA activities that protect the integrity of the selected remedy for a site. Observation – A fact or occurrence that is recognized and recorded.
Observed value – The magnitude of a specific measurement; a variable; a unit of space, time, or quantity; a datum. The observed value is that reported before correction for a blank value. See Corrected value. ODBC – Open DataBase Connectivity. A Microsoft protocol for communicating between a database and other applications. OEM – Original Equipment Manufacturer. Refers to a company that sells hardware or software from another manufacturer under their own label, often with some customization. OLAP – OnLine Analytical Processing. Analysis of a data warehouse for trends and other intelligence. OLC02 – The current organic low concentration water analytical protocol. OLE – Object Linking and Embedding. A Microsoft protocol that allows programs to share information. OLM04.2 – The current organic analytical protocol. OLTP – OnLine Transaction Processing. Interaction with a database in high volume such as processing credit card transactions. OPA – Oil Pollution Act. Operating system – Software that is required by a computer to perform its tasks, but which is concerned with running the computer rather than some particular application. Examples are Windows, DOS, UNIX, and Linux. Optical disk – Disk drive system and platter for storing data using an optical or optical-magnetic method. Usually have very high storage capacity. Three main types are CD-ROM, WORM, and erasable optical. Ordinate – Vertical or Y-axis of a graph. Organic – Compounds containing carbon, usually with hydrogen and sometimes with oxygen. Origin – The intersection of the axes in a coordinate system, usually with the coordinates of 0,0. OS/2 – Operating System 2; OS/2 is an alternative to DOS and Windows that has not been widely accepted. OSCs – On-Scene Coordinators for the Superfund’s Removal Program. OSHA – Occupational Safety and Health Administration. Administers the Occupational Safety and Health Act of 1970. OSW – EPA’s Office of Solid Waste is responsible for ensuring that currently generated solid waste is managed properly, and that currently operating facilities address any contaminant releases from their operations. OSWER – Office of Solid Waste and Emergency Response. The EPA office that provides policy, guidance, and direction for the EPA’s solid waste and emergency response programs, including Superfund. Outlier – An observed value that appears to be discordant from the other observations in a sample. The declaration of an outlier depends on the significance level of the applied identification test. See also Significance level. Oxidation – To combine with oxygen, or to increase the positive charge of an ion, such as a change from Fe++ to Fe+++. Ozone – A highly corrosive form of oxygen (O3) found naturally and also manufactured for use as a disinfectant.
P
PA – Preliminary Assessment. A limited scope investigation performed under CERCLA at each site. Its purpose is to gather readily available information about the site and its surrounding area to determine the threat posed by the site. PA/SI – Preliminary Assessment and Site Investigation. A process of collecting and reviewing available information about a known or suspected hazardous waste site or release. PAHs – Polycyclic Aromatic Hydrocarbons. Palette – The range of colors available for display or printing. Pan – Change the view of a drawing or map by moving laterally at the same scale. Parallel port – A connection with peripheral equipment where eight bits are sent at one time. Printers are usually connected to parallel ports. Parameter – Any quantity such as a mean or a standard deviation characterizing a population. Also a constituent to be measured. Parametric test – A statistical test based on a normal distribution. See Non-parametric test. PARCC – Precision, Accuracy, Representativeness, Comparability, and Completeness. Partition – Break up into parts, or the parts themselves. Often used to refer to a hard disk that is configured as more than one logical disk drive. PC – Personal Computer. While it technically covers all desktop computers with different operating systems, PC is usually used to refer to IBM Personal Computers and compatibles. PCA – 1,1,2,2-Tetrachloroethane. Volatile organic contaminant. PCBs – Polychlorinated biphenyls (also called Aroclor). A group of toxic, persistent chemicals used in electrical transformers and capacitors for insulating purposes, and in gas pipeline systems as a lubricant. The sale and new use of PCBs was banned by law in 1979. PC-DOS – The version of MS-DOS for IBM brand personal computers. PCE – Tetrachloroethylene. Also called PERC. Volatile organic contaminant. PE – Performance evaluation sample. A sample of known composition provided by EPA for contractor analysis. Used by EPA to evaluate contractor performance. PE sample – See Performance evaluation sample. Peak shaving or Peak enhancement – Fraudulent laboratory practice of manipulating the results during analysis such as by reshaping a peak that is subtly out of specification. PERC – Tetrachloroethylene. Also called PCE. Volatile organic contaminant. Percentage standard deviation – Synonym for Relative standard deviation. Performance evaluation audit – A type of audit in which the quantitative data generated in a measurement system is obtained independently and compared with routinely obtained data to evaluate the proficiency of an analyst or laboratory. Performance evaluation sample – A sample, the composition of which is unknown to the analyst, which is provided to test whether the analyst/laboratory can produce analytical results within specified performance limits. See Blind sample and Performance evaluation audit. Peripheral – Any device that is not necessary for running a computer, or which is outside the system unit, is called a peripheral. Monitors, printers, mice, modems, plotters, and digitizers are all considered peripherals. Permeability – The ability of a rock or other material to transmit fluid. Pesticides – Substances intended to repel, kill, or control any species designated a “pest,” including weeds, insects, rodents, fungi, bacteria, and other organisms. Phenols – A group of oxygen-containing organic compounds that are by-products of petroleum refining and other industrial processes.
PID – PhotoIonization Detector. Pitch – Pitch is the spacing of text letters on a page, as put out by a typewriter or character printer. Common pitches include 10 (Pica), 12 (Elite), and 17 (condensed). Pivot Table – A type of query where the results are turned sideways, and rows become columns. An example would be a report where constituent names are used as column heads and the values displayed in columns, in a database where the results are actually records (rows). Also called a cross-tab (see the example below). Pixel – A pixel, or PICture ELement, is one of the dots that make up an image, and is also the unit of resolution on monitors. Planimeter – Device or software for calculating areas. Plotter – Plotters are devices that use pens, ink droplets, or electrostatic charges to make varying sizes of hard copy output, usually in multiple colors. Output media sizes range from 8.5" x 11" up to 36" x 42" or larger. Plume – An area of contaminant concentration resulting from one or more releases of material. PO – Purchase Order. A document authorizing expenditure of funds, similar to an AFE. Point source – A specific single location from which pollutants are discharged. Polyconic projection – Method of representing spherical coordinates (latitude-longitude) on a flat map using an approximation of cones. Polygon – Closed group of line segments treated as a unit. Polyline – Group of line segments treated as a unit. Population – All possible items or units that possess a variable of interest and from which samples may be drawn. Port – Interface for connecting devices together. Types of ports include serial (RS-232), parallel, and SCSI. Portrait orientation – A device or printout where the horizontal axis is shorter than the vertical axis. The opposite of landscape. POX – Purgeable Organic Halides. PPA – Pollution Prevention Act. PPB – Parts per Billion. PPI – Points per Inch. Used to describe the resolution of digitizers and printers. PPM – Parts per Million. Precision – The degree to which a set of observations or measurements of the same property, usually obtained under similar conditions, conform to themselves; a data quality indicator. Precision is usually expressed as standard deviation, variance, or range, in either absolute or relative terms. See also Standard deviation and Sample variance. Preventative maintenance – An orderly program of activities designed to ensure against equipment failure. Primary reference standard – See Primary standard. Primary standard – A substance or device, with a property or value that is unquestionably accepted (within specified limits) in establishing the value of the same or related property of another substance or device. Print buffer/print spooler – Hardware device or software program that takes output destined for the printer and collects it for output at the rate of the printer. Frees up the computer for other uses while the printing is being done. Printer – Device used for generating hard copy output from a computer. Types include dot-matrix, daisy-wheel, inkjet, thermal, and laser.
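To show the pivot (cross-tab) idea in code, here is a minimal Python sketch (the well names, constituents, and values are hypothetical):

    # Result records (rows): sample, constituent, value
    rows = [
        ("MW-1", "Benzene", 5.0),
        ("MW-1", "Toluene", 12.0),
        ("MW-2", "Benzene", 2.5),
        ("MW-2", "Toluene", 8.0),
    ]

    # Turn rows sideways: one row per sample, one column per constituent
    pivot = {}
    for sample, constituent, value in rows:
        pivot.setdefault(sample, {})[constituent] = value

    for sample, columns in sorted(pivot.items()):
        print(sample, columns)   # e.g., MW-1 {'Benzene': 5.0, 'Toluene': 12.0}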
Probability – A number between zero and one inclusive, reflecting the limiting proportion of the occurrence of an event in an increasingly large number of identical trials, each of which results in either the occurrence or nonoccurrence of the event. Probability sampling – Sampling in which (a) every member of the population has a known probability of being included in the sample; (b) the sample is drawn by some method of random selection consistent with these probabilities; and (c) the known probabilities of inclusion are used in forming estimates from the sample. The probability of selection need not be equal for members of the population. Procedure – A set of systematic instructions for performing an operation. Proficiency testing – A systematic program in which one or more standardized samples is analyzed by one or more laboratories to determine the capability of each participant. Program – Software which permits the computer to perform some desired action. Also the act of writing software. Programming language – A language used to write software. Examples are BASIC, FORTRAN, and C. Projection – As used in mapping, projection is the representation of the three-dimensional earth on a two-dimensional map. Depending on how a surface is wrapped around the earth, different projections can be obtained. Common projections include Mercator, Albers Equal Area, and Lambert Conformal Conic. PROM – Programmable Read Only Memory. Prompt – Message provided by the computer indicating that it is ready for input. A common prompt is the active drive and/or the current subdirectory. Property – A quality or trait belonging to and peculiar to a thing; a response variable is a measure of a property. Synonym for characteristic. Protocol – A protocol is a method of doing something, and in the computer world is the format for transferring files. Also a detailed written procedure for a field and/or laboratory operation (e.g., sampling, analysis) which must be strictly adhered to. PRP – Potentially Responsible Party. PRPs are individuals, companies, or other parties that are potentially liable for payment of Superfund cleanup costs. Public domain software – Public domain software is free software available from a variety of sources including the government and BBSs. Although the software is free, there may be a handling charge or reproduction fee to obtain the software. PUF – Polyurethane Foam. Pump and treat – A remediation method that involves pumping groundwater to the surface for treatment. PVC – Polyvinyl chloride.
Q
QA – Quality Assurance. An integrated system of management activities involving planning, implementation, assessment, reporting, and quality improvement to ensure that a process, item, or service is of the type and quality needed and expected by the customer. QAMS – Quality Assurance Management Staff. QAPP – Quality Assurance Project Plan. QATS – Quality Assurance Technical Support laboratory. A contractor-operated facility operated under the QATS contract, awarded and administered by EPA. QC – Quality Control. The overall system of technical activities that measures the attributes and performance of a process, item, or service against defined standards to verify that they meet the
stated requirements established by the customer; operational techniques and activities that are used to fulfill requirements for quality. QC check sample – See Quality control sample. QFD – Quality Function Deployment. A quality control method for product development. See SQFD. Quality – The sum of features and properties or characteristics of a product or service that bear on its ability to satisfy stated needs. Quality assessment – The evaluation of environmental data to determine if it meets the quality criteria required for a specific application. Quality assurance (QA) – An integrated system of activities involving planning, quality control, quality assessment, reporting, and quality improvement to ensure that a product or service meets defined standards of quality with a stated level of confidence. Quality Assurance Narrative Statement – A description of the quality assurance and quality control activities to be followed for a research project. Quality Assurance Objectives – The limits on bias, precision, comparability, completeness, and representativeness defining the minimal acceptable levels of performance as determined by the data user’s acceptable error bounds. Quality Assurance Program Plan (QAPP) – A formal document describing the management policies, objectives, principles, organizational authority, responsibilities, accountability, and implementation plan of an agency, organization, or laboratory for ensuring quality in its products and utility to its users. Quality Assurance Project Plan (QAPjP) – A formal document describing the detailed quality control procedures by which the quality requirements defined for the data and decisions pertaining to a specific project are to be achieved. Quality circle – A small group of individuals from an organization or unit who have related interests and meet regularly to consider problems or other matters related to the quality of the product or process. Quality control (QC) – The overall system of technical activities whose purpose is to measure and control the quality of a product or service so that it meets the needs of users. The aim is to provide quality that is satisfactory, adequate, dependable, and economical. Quality control chart – See Control chart. Quality control check sample – See Calibration standard. Quality control sample – An uncontaminated sample matrix spiked with known amounts of analytes from a source independent from the calibration standards. It is generally used to establish intralaboratory or analyst-specific precision and bias or to assess the performance of all or a portion of the measurement system. See also Check sample. Quantitation limits – The maximum or minimum levels or quantities of a target variable that can be quantified with the certainty required by the data user. Query – Request for data from a database manager (see the example below). QWERTY – Layout of the keys on a standard keyboard.
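To make the Query entry concrete, here is a minimal sketch using Python’s built-in sqlite3 module (the table and column names are invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE results (sample_id TEXT, constituent TEXT, value REAL)")
    con.execute("INSERT INTO results VALUES ('MW-1', 'Benzene', 5.0)")

    # A query: a request for data from the database manager, expressed in SQL
    for row in con.execute("SELECT sample_id, value FROM results WHERE constituent = 'Benzene'"):
        print(row)   # ('MW-1', 5.0)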
R
RA – Remedial Action. The remedial action follows the remedial design and involves the construction or implementation phase of Superfund site cleanup. RAID – Redundant Array of Inexpensive Drives. This is a system for combining several hard disk drives together to increase capacity and reliability. The systems come in different levels based on the number of drives and how they are configured.
RAM – Random Access Memory; refers to the main memory used by a computer for storage and processing. The main feature of RAM is that it can be both written and read at any time by the system, and any part of the RAM can be used at any time. The amount of RAM is usually specified in kilobytes (KB) for smaller systems, and in megabytes (MB) for larger systems. The more RAM that is present, the larger the programs that can be loaded, and sometimes the larger the data sets that can be processed. See also ROM. RAM disk – A portion of RAM set aside that operates as a disk drive is a RAM disk. Because files written to a RAM disk do not have to access a floppy or hard disk, they run very fast. Any work done on a RAM disk is lost when the power is shut off, unless it has been saved first to a mechanical disk. RAM-resident – Once a RAM-resident program is loaded into memory, it stays there so the program can be invoked anytime the computer is on. Random – Lacking a definite plan, purpose, or pattern; due to chance. Random error – The deviation of an observed value from a true value, which behaves like a variable in that any particular value occurs as though chosen at random from a probability distribution of such errors. The distribution of random error is generally assumed to be normal. Random sample or subsample – A subset of a population or a subset of a sample, selected according to the laws of chance with a randomization procedure (see the example below). Random variable – A quantity that may take any of the values of a specified set with a specified relative frequency or probability. It is defined by a set of possible values, and by an associated probability function giving the relative frequency of occurrence of each possible value. Randomization – The arrangement of a set of objects in a random order; a set of treatments applied to a set of experimental units is said to be randomized when the treatment applied to any given unit is chosen at random from those available and not already allocated. Randomness – A basic statistical concept and property implying an absence of a plan, purpose, or pattern, or of any tendency to favor one outcome rather than another. Range – The difference between the minimum and the maximum of a set of values. RAS – Routine Analytical Service. The standard inorganic, organic, and organic low concentration, high volume, multi-component analyses available through the CLP. Raster – Data represented as an array of points. Raster devices include monitors and printers. Differs from vector data, which uses line segments instead of dots to represent the image. Raw data – Any original factual information from a measurement activity or study recorded in laboratory worksheets, records, memoranda, notes, or exact copies thereof and that are necessary for the reconstruction and evaluation of the report of the activity or study. Raw data may include photographs, microfilm or microfiche copies, computer printouts, magnetic media, including dictated observations, and recorded data from automated instruments. If exact copies of raw data have been prepared (e.g., tapes which have been transcribed verbatim, dated, and verified accurate by signature), the exact copy or exact transcript may be substituted. RCRA – Resource Conservation and Recovery Act. Legislation covering environmental issues at operating facilities. RD – Remedial Design. The phase of Superfund cleanup where the technical specifications for cleanup remedies and technologies are designed. RD/RA – Remedial Design/Remedial Action.
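A random sample as defined above can be drawn with a simple randomization procedure; a minimal Python sketch (the well names are hypothetical):

    import random

    population = ["MW-1", "MW-2", "MW-3", "MW-4", "MW-5", "MW-6"]

    random.seed(42)                             # fixed seed so the draw is reproducible
    subsample = random.sample(population, k=3)  # each member has an equal chance of selection
    print(subsample)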
RDX – Hexahydro-1,3,5-trinitro-1,3,5-triazine. Reactivity – The tendency of a compound to explode or produce toxic fumes. Reagent blank – A sample consisting of reagent(s), without the target analyte or sample matrix, introduced into the analytical procedure at the appropriate point and carried through all subsequent
steps to identify sources of error in the observed value caused by the reagents and the involved analytical steps. Reagent grade – The second highest purity designation for reagents that conform to the current specifications of the American Chemical Society Committee on Analytical Reagents. Real-time clock – See Clock calendar. Records system (or plan) – A written, documented group of procedures describing required records, steps for producing them, storage conditions, retention period, and circumstances for their destruction or other disposition. Recovery efficiency – In an analytical method, the fraction or percentage of a target analyte extracted from a sample containing a known amount of the analyte. Referee duplicates – Duplicate samples sent to the referee QA laboratory, if one is specified for the project. Reference material – A material or substance, one or more properties of which are sufficiently well established to be used for the calibration of an apparatus, the assessment of a measurement method, or assigning values to materials. Reference method – A sampling and/or measurement method which has been officially specified by an organization as meeting its data quality requirements. Reference standard – Standard of known analytes and concentration obtained from an independent source other than the standards used for instrument calibration. They are used to verify the accuracy of the calibration standards, and are analyzed after each initial calibration or as per method specifications. Relational database manager – Software that stores data in several tables and allows the data to be retrieved and related based on the contents of the data file. Relative coordinates – Coordinates tied to a local reference system or based only on internal position rather than a global reference system. Relative standard deviation – The standard deviation expressed as a percentage of the mean recovery, i.e., the coefficient of variation multiplied by 100. Reliability – The likelihood that an instrument or device will function under defined conditions for a specified period of time. Remedial action – The construction or cleanup phase of a Superfund site cleanup. Remedial design – A phase of remedial action that follows the remedial investigation/feasibility study and includes development of engineering drawings and specifications for a site cleanup. Remedial investigation – An in-depth study designed to gather data needed to determine the nature and extent of contamination at a Superfund site, establish site cleanup criteria, identify preliminary alternatives for remedial action, and support technical and cost analyses of alternatives. The remedial investigation is usually done with the feasibility study. Together they are usually referred to as the “RI/FS.” Remedial response – Long-term action that stops or substantially reduces a release or threat of a release of hazardous substances that is serious but not an immediate threat to public health. Remediate – Correct contamination with activities such as destroying the contaminants, capping a waste site, excavating and transporting the contaminants to an approved hazardous landfill, or any other method that reduces risk at a site. Repeatability – The degree of agreement between mutually independent test results produced by the same analyst using the same test method and equipment on random aliquots of the same sample within a short period of time. Replicability – See Repeatability.
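In symbols, the relative standard deviation defined above is

\[ \mathrm{RSD} = 100 \times \frac{s}{\bar{x}} \]

where $s$ is the standard deviation and $\bar{x}$ is the mean; for example, $s = 0.2$ mg/L on a mean of 4.0 mg/L gives an RSD of 5%.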
Replicate – An adjective or verb referring to the taking of more than one sample, or to the performance of more than one analysis. Incorrectly used as a noun in place of replicate analysis. Replicate is to be used when referring to more than two items. See Duplicate. Replicate analyses or measurements – The analyses or measurements of the variable of interest performed identically on two or more subsamples of the same sample within a short time interval. See Duplicate analyses or measurements. Replicate samples – Two or more samples representing the same population characteristic, time, and place, which are independently carried through all steps of the sampling and measurement process in an identical manner. Replicate samples are used to assess total (sampling and analysis) method variance. Often incorrectly used in place of the term “replicate analysis.” See Duplicate samples and Replicate analyses. Representative sample – A sample taken so as to reflect the variable(s) of interest in the population as accurately and precisely as specified. To ensure representativeness, the sample may be either completely random or stratified depending upon the conceptualized population and the sampling objective (i.e., upon the decision to be made). Representativeness – The degree to which data accurately and precisely represent the frequency distribution of a specific variable in the population; a data quality indicator. Reproducibility – The extent to which a method, test, or experiment yields the same or similar results when performed on subsamples of the same sample by different analysts or laboratories. Resolution – For monitors and printers, the number of dots which can be displayed on the screen or printed. For digitizers, the smallest increment of movement that can be detected. Response variable – A variable that is measured when a controlled experiment is conducted. Result – The product of a calculation, test method, test, or experiment. The result may be a value, data set, statistic, tested hypothesis, or an estimated effect. Review – The assessment of management/operational functions or activities to establish their conformance to qualitative specifications or requirements. See Management systems review and Audit. RGB – Red, Green, and Blue. Used to describe a video system where the signal is sent to the monitor using three or more lines, one each for red, green, and blue. With digital RGB systems the intensity of each line, determined by the voltage, is at discrete levels. With analog RGB systems the voltage range is continuous for each color. RI – Remedial Investigation. RI/FS – Remedial Investigation/Feasibility Study. This is the step in the cleanup process that is conducted to gather sufficient information to support the selection of a site remedy. Rinseate blank – A clean sample (e.g., distilled water or ASTM Type II water) passed through decontaminated sampling equipment before sampling, and returned to the laboratory as a sample. Sampling equipment blanks are used to check the cleanliness of sampling devices. Usually one rinseate sample is collected for each 10 samples of each matrix for each piece of equipment. Risk – The probability or likelihood of an adverse effect. Risk (statistical) – The expected loss due to the use of a given decision procedure. Risk assessment – Characterization of the potential adverse health effects of human exposures to environmental hazards. RLL – Run Length Limited. A type of hard disk controller that provides 50% more storage than MFM controllers.
Robustness – (In)sensitivity of a statistical test method to departures from underlying assumptions. See Ruggedness. ROD – Record of Decision. Administrative order that explains the cleanup method that will be used at a site, and often sets target limits for project cleanup.
ROM – Read-Only Memory; differs from RAM in that it takes special equipment to write ROM chips, and once in a computer, they can be read and not written to. They are usually used for low-level program code that does not change, such as the BIOS on IBM-compatible computers. See also RAM. Rounded number – A number, reduced to a specified number of significant digits or decimal places, using defined criteria, that are fewer than those measured. Routine method – A defined plan of procedures and techniques used regularly to perform a specific task. RPM – Remedial Project Manager. The EPA or state official responsible for overseeing onsite remedial action. RSCC – Regional Sample Control Center. The RSCC coordinates regional sampling efforts. RSO – Radiation Safety Officer. Rubber-banding – Indicating an area on the screen by moving a box outlining the area. Ruggedness – The (in)sensitivity of an analytical test method to departures from specified analytical or environmental conditions. See Robustness. Ruggedness testing – The carefully ordered testing of an analytical method while making slight variations in test conditions (as might be expected in routine use) to determine how such variations affect test results. If a variation affects the results significantly, the method restrictions are tightened to minimize this variability.
S
S8 – Molecular sulfur. Sample – Material gathered in the field for analysis from a specific location at a specific time. Also a part of a larger whole or a single item of a group; a finite part or subset of a statistical population. A sample serves to provide data or information concerning the properties of the whole group or population. A single, discrete portion of material to be analyzed, which is contained in single or multiple containers and identified by a unique sample number. Sample data custody – See Chain of custody. Sample variance (statistical) – A measure of the dispersion of a set of values. The sum of the squares of the difference between the individual values of a set and the arithmetic mean of the set, divided by one less than the number of values in the set. (The square of the sample standard deviation.) See also Measure of dispersion. Sampling – The process of obtaining a representative portion of the material of concern. Sampling equipment blank – A clean sample that is collected in a sample container with the sample-collection device and returned to the laboratory as a sample. Sampling equipment blanks are used to check the cleanliness of sampling devices. See Dynamic blank. Sampling error – The difference between an estimate of a population value and its true value. Sampling error is due to observing only a limited number of the total possible values and is distinguished from errors due to imperfect selection, bias in response, errors of observation, measurement, or recording, etc. See also Probability sampling. SAN – Storage Area Network. A hardware and software system that allows storage, usually using hard drives, to be attached to the network independent of a server. SAP – Sampling and Analysis Plan. SARA – Superfund Amendments and Reauthorization Act. The 1986 amendment to CERCLA. Saturated zone – A volume of rock or soil in which the pores are filled with fluid, usually water. SCADA – Supervisory Control and Data Acquisition. Software and instrumentation for monitoring and controlling remote equipment and gathering the resulting data.
Scanner – Optical device that allows input of continuous tone graphic images, line drawings, and text. Scanning frequencies – For monitors, the rate at which the horizontal rows of pixels are drawn on the screen (horizontal scan rate) or the whole screen is redrawn (vertical refresh rate). Scheduled maintenance – See Preventative maintenance. Schema – Description of the structure of a database. Screening test – A quick test for coarsely assessing a variable of interest. SCSI – Small Computer Systems Interface (pronounced “scuzzy”) is a high-speed parallel interface for attaching devices such as disk drives to microcomputers. The standard allows mass storage devices to be designed to work with any computer and allows several peripherals to be attached to one port. SDSL – Symmetrical Digital Subscriber Line. Broadband communication connection. SDWA – Safe Drinking Water Act. Secondary standard – A standard whose value is based on comparison with a primary standard. Selectivity (analytical chemistry) – The capability of a method or instrument to respond to a target substance or constituent in the presence of non-target substances. SEM – Scanning Electron Microscopy. Sensitivity – Capability of method or instrument to discriminate between measurement responses representing different levels of a variable of interest. Serial port – A serial port has a digital signal that is sent one bit at a time, one after another. Peripherals such as mice, modems, digitizers, and plotters are often attached to serial ports. SFE – Supercritical Fluid Extraction. Shareware – Shareware software is distributed free or for a small handling fee. If the user likes the software, the user is supposed to send money in the amount the author asks for. The fee often includes registration, software upgrades, and a manual. Shell – Software that surrounds other software (such as the operating system), usually to make it easier to use. SI – Site Inspection. A step under CERCLA that provides the data needed for HRS, and identifies sites that enter the NPL site listing process. Significance level – The magnitude of the acceptable probability of rejecting a true null hypothesis or of accepting a false null hypothesis. Significant digit – Any of the digits 0 through 9, excepting leading zeros and some trailing zeros, which is used with its place value to denote a numerical quantity to a desired rounded number. See Rounded number. Significant figure – See Significant digit. SIMM – Single Inline Memory Module. A group of RAM chips attached to a carrier and installed as a unit. Single operator precision – The degree of variation among the individual measurements of a series of determinations by the same analyst or operator, all other conditions being equal. Sinkers – See DNAPL. Site – The area within boundaries established for a defined activity. Site blank – See Field blank. SM – Standard Methods. SM&TE – Standard Measuring and Test Equipment. SMO – Sample Management Office. A contractor-operated facility operated under the Contract Laboratory Analytical Services Support (CLASS) contract, awarded and administered by the EPA.
SOAP – Simple Object Access Protocol. This is an XML-based protocol for information interchange in distributed environments such as the Internet. It describes message contents, data types, processing requirements, and remote procedure calls and responses. SOC – Synthetic Organic Chemicals. Software – Program instructions which cause the computer to do something useful. Soil boring – A hole dug in the ground from which a soil sample is extracted for chemical, biological, or analytical testing to determine the level of contamination present. Soil gas – Gaseous elements and compounds in the small spaces between particles of the soil. Solubility – Ability of a substance to dissolve in a particular liquid. Solvent – A liquid capable of dissolving other substances. Water is a solvent, as are many organic compounds. SOP – Standard Operating Procedure. SOW – Statement of Work. A document that specifies how laboratories analyze samples under a particular CLP analytical program. Span-drift – The change in the output of a continuous monitoring instrument over a stated time period during which the instrument is not recalibrated. Span-gas – A gas of known concentration which is used routinely to calibrate the output level of an analyzer. See Calibration standard. SPE – Solid Phase Extraction. Specimen – See Sample. Spike – A known mass of target analyte added to a blank sample or subsample; used to determine recovery efficiency or for other quality control purposes. Spiked laboratory blank – See Spiked reagent blank. Spiked reagent blank – A specified amount of reagent blank fortified with a known mass of the target analyte; usually used to determine the recovery efficiency of the method. Spiked sample – A sample prepared by adding a known mass of target analyte to a specified amount of matrix sample for which an independent estimate of target analyte concentration is available. Spiked samples are used, for example, to determine the effect of the matrix on a method’s recovery efficiency. Spiked sample duplicate analysis – See Duplicate analyses and Spiked sample. Split samples – Two or more representative portions taken from a sample or subsample and analyzed by different analysts or laboratories. Split samples are used to replicate the measurement of the variable(s) of interest. SPLP – Synthetic Precipitation Leaching Procedure. Spooler – Hardware or software used as a buffer for printing and plotting. Spreadsheet – Program for manipulating rows and columns of data. Cells in the spreadsheet can contain text, numbers, or formulas. SQFD – Software Quality Function Deployment. QFD as applied to software development. SQL – Structured Query Language. Widely used for retrieving data from relational database managers. Stakeholder – A party with an interest in a remediation project, such as government officials, community members, PRPs, banks, etc. Standard (measurement) – A substance or material with a property quantified with sufficient accuracy to permit its use to evaluate the same property in a similar substance or material. Standards are generally prepared by placing a reference material in a matrix. See Reference material.
Standard addition – The procedure of adding known increments of the analyte of interest to a sample, to cause increases in detection response. The level of the analyte of interest present in the original sample is subsequently established by extrapolation of the plotted responses. Standard curve – See Calibration curve. Standard deviation – The most common measure of the dispersion or imprecision of observed values expressed as the positive square root of the variance (see the example below). See Sample variance. Standard material – See Standard (measurement) and Reference material. Standard method – An assemblage of techniques and procedures based on consensus or other criteria, often evaluated for its reliability by collaborative testing and receiving organizational approval. Standard operating procedure (SOP) – A written document that details the method of an operation, analysis, or action whose techniques and procedures are thoroughly prescribed and which is accepted as the method for performing certain routine or repetitive tasks. Standard reference material (SRM) – A certified reference material produced by the U.S. National Institute of Standards and Technology and characterized for absolute content independent of analytical method. Standard reference sample – See Secondary standard. Standard solution – A solution containing a known concentration of analytes, prepared and verified by a prescribed method or procedure, and used routinely in an analytical method. Standardization – The process of establishing the quantitative relationship between a known mass of target material (e.g., concentration) and the response variable (e.g., the measurement system or instrument response). See Calibration curve and Multipoint calibration. State plane – Coordinate system within each state for creating a flat map from spherical coordinates (latitude-longitude). The projection and origin used vary from state to state. Statistic – An estimate of a population characteristic calculated from a data set (observed or corrected values), e.g., the mean or standard deviation. Stepper motor – Device for positioning the heads in inexpensive disk drives. Stock solution – A concentrated solution of analyte(s) or reagent(s) prepared and verified by prescribed procedure(s), and used for preparing working standards or standard solutions. Storage – Refers to the ability of computers to hold data. Usually means the amount of disk storage. Storage blank – Laboratory reagent water samples stored in the same type of sample containers and in the same storage units as field samples. They are prepared, stored for a defined period of time, and then analyzed to monitor volatile organic contamination derived from sample storage units. Typically one blank is used for each sample batch, or as per method specifications. STORET – STOrage and RETrieval. EPA’s repository for water quality, biological, and physical data. STP – Standard Temperature and Pressure. Stratification – The division of a target population into subsets or strata which are internally more homogeneous with respect to the characteristic to be studied than the population as a whole. Stratified sampling – The sampling of a population that has been stratified, part of the sample coming from each stratum. See Stratification. Strict liability – A legal concept under CERCLA that allows the federal government to hold PRPs liable without proving that the PRPs were at fault. Stylus – A pointing device for a digitizer that looks like a pen.
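The standard deviation and sample variance entries above map directly onto Python’s standard library; a minimal sketch (the values are hypothetical):

    import statistics

    values = [4.1, 3.9, 4.3, 4.0, 4.2]

    variance = statistics.variance(values)   # squared deviations from the mean, summed and divided by n - 1
    stdev = statistics.stdev(values)         # positive square root of the sample variance
    print(variance, stdev)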
Subdirectory – Part of a disk used for storing related data. Can be created and manipulated by the user. Called folders on a Macintosh and in Windows.
Subsample – A representative portion of a sample. A subsample may be taken from a laboratory sample or a field sample. See Aliquant, Aliquot, Split sample, and Test portion.
Supercomputer – Very powerful computer designed for fast calculations.
Superfund – The program operated under the legislative authority of CERCLA and SARA that funds and carries out EPA removal and remedial activities at hazardous waste sites. These activities include establishing the National Priorities List, investigating sites for inclusion on the list, determining their priority, and conducting and/or supervising cleanup and other remedial actions.
Surface water – Water naturally open to the atmosphere, such as rivers, lakes, and ponds.
Surrogate – See Surrogate analyte.
Surrogate analyte – A pure substance with properties that mimic the analyte of interest. It is unlikely to be found in environmental samples and is added to them for quality control purposes.
Surrogate spikes – Non-target analytes of known concentration that are added to organic samples prior to sample preparation and instrument analysis. They measure the efficiency of all steps of the sample preparation and analytical method in recovering target analytes from the sample matrix, based on the assumption that non-target surrogate compounds behave the same as the target analytes. They are run with all samples, standards, and associated quality control. Spike recoveries can be calculated from spike concentrations (see the recovery formula following this group of entries).
Surveillance – The act of maintaining supervision of or vigilance over a well-specified portion of the environment so that detailed information is provided concerning the state of that portion.
SVGA – Super Video Graphics Array. Popular video adapter on PC-compatible computers. Highest standard resolution is 800×600 pixels.
SVOC – Semivolatile Organic Compound. A compound containing carbon that does not evaporate as readily as a VOC and has a boiling point greater than 200°C.
Switch box – A device that allows multiple peripherals (usually two or four) to be hooked into a single parallel or serial port on one computer. These boxes can also be used to hook multiple computers into one or more peripherals.
SWMU – Solid Waste Management Unit.
Synthetic sample – A manufactured sample. See Quality control sample.
System unit – Box containing the main components of a computer system.
Systematic error – A consistent deviation in the results of sampling and/or analytical processes from the expected or known value. Such error is caused by human and methodological bias.
Systems audit – See Technical systems audit.
Systems error – See Total systems error.
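For the Surrogate spikes entry above, percent recovery is conventionally computed as (a standard quality control formula, not specific to any one method):

$$ \%R = 100 \times \frac{C_{\text{measured}}}{C_{\text{added}}} $$

where C_added is the known spiked concentration. For a matrix spike in a sample containing native analyte, the numerator is commonly replaced by the spiked-sample result minus the unspiked-sample result.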
T
T1 – Medium-speed communication connection.
T3 – High-speed communication connection.
Tape backup – Device for backing up a hard disk.
Tape drive – Device for reading and writing data on magnetic tape.
Target – The chosen object of investigation for which qualitative and/or quantitative data or information is desired, e.g., the analyte of interest.
TB – Trip Blank.
TBT – Tributyltin.
TCD – Thermal conductivity detector.
TCDD – 2,3,7,8-Tetrachlorodibenzodioxin.
TCE – Trichloroethylene.
TCLP – Toxicity Characteristic Leaching Procedure. A method of determining the mobility of toxins in soils.
TDS – Total Dissolved Solids.
Technical systems audit – A thorough, systematic, onsite, qualitative review of facilities, equipment, personnel, training, procedures, record keeping, data validation, data management, and reporting aspects of a total measurement system.
Technique – A principle and/or the procedure of its application for performing an operation.
TEM – Transmission Electron Microscopy.
Teratogenesis – The formation of birth defects.
Test – A procedure used to identify or characterize a substance or constituent. See Method.
Test determination – See Determination.
Test method – See Method.
Test portion – A subsample of the proper amount for analysis and measurement of the property of interest. A test portion may be taken from the bulk sample directly, but often preliminary operations, such as mixing or further reduction in particle size, are necessary. See Subsample.
Test result – A product obtained from performing a test determination. See Determination.
Test sample – See Test portion.
Test specimen – See Test portion.
Test unit – See Test portion.
Text editor – Software that creates and modifies ASCII text files. Windows Notepad is an example.
Theme – In GIS, data of a specific type such as transportation or drainage. See also Coverage and Layer.
Thermal printers – Printers that use a heating element and special paper to place an image on the page.
Thermal transfer printers – Printers that heat a wax-based ink on the ribbon, which then flows onto the paper.
TIC – Tentatively Identified Compound.
Time-proportioned sample – A composite sample produced by combining samples of a specific size, collected at preselected, uniform time intervals.
TM – Transverse Mercator. A map projection in which a cylinder is wrapped around the earth tangent to the sphere (earth) at a chosen meridian.
TNT – Trinitrotoluene.
TOC – Total Organic Carbon.
Total measurement error – The sum of all the errors that occur from the taking of the sample through the reporting of results; the difference between the reported result and the true value of the population that was to have been sampled (see the sketch following this group of entries).
Total Quality Management (TQM) – The process whereby an entire organization, led by senior management, commits to focusing on quality as a first priority in every activity. TQM implementation creates a culture in which everyone in the organization shares the responsibility for continuously improving the quality of products and services (i.e., for “doing the right thing, the right way, the first time, on time”) in order to satisfy the customer.
Total systems error – The combined error due to all components of the system.
TOX – Total organic halides.
Toxic – A poisonous or hazardous substance.
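One way to write the Total measurement error definition above symbolically (an illustrative decomposition, not a formula from any cited method) is:

$$ e_{\text{total}} = x_{\text{reported}} - x_{\text{true}} = e_{\text{sampling}} + e_{\text{preparation}} + e_{\text{analysis}} + e_{\text{data handling}} $$

with both systematic (bias) and random components contributing to each term.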
Toxicity – A measure of the poisonous or harmful nature of a substance.
Toxicology – The study of adverse effects of chemicals on living organisms.
TPH – Total Petroleum Hydrocarbons.
TQM – See Total Quality Management.
Traceability – An unbroken trail of accountability for verifying or validating the chain of custody of samples, data, the documentation of a procedure, or the values of a standard.
Transducer – For digitizers, the device which indicates the position on the tablet. Can be either a stylus or a cursor.
Treatment (experimental) – An experimental procedure whose effect is to be measured and compared with the effect of other treatments.
TRI – Toxics Release Inventory. TRI is published by the U.S. EPA, and is a valuable source of information regarding toxic chemicals that are being used, manufactured, treated, transported, or released into the environment.
Triangulation – In mapping, a method of creating contours using an array of triangles between the data points.
Trip blank – A clean sample of matrix that is carried to the sampling site and transported to the laboratory for analysis without having been exposed to sampling procedures.
Trojan horse – A program which masquerades as something useful but which can damage your data.
TSCA – Toxic Substances Control Act.
TSS – Total Suspended Solids.
Tuning – The process of adjusting a measurement device or instrument, prior to its use, to ensure that it works properly and meets established performance criteria.
Turnkey – Ready to run, or performed for a fixed price. Refers to a complete solution to an operational or computing need.
Type I error (alpha error) – An incorrect decision resulting from the rejection of a true hypothesis (a false positive decision).
Type II error (beta error) – An incorrect decision resulting from acceptance of a false hypothesis (a false negative decision).
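In conventional hypothesis-testing notation (standard statistics, stated for reference), the two error types just defined are:

$$ \alpha = P(\text{reject } H_0 \mid H_0 \text{ true}) \quad \text{(Type I)}, \qquad \beta = P(\text{accept } H_0 \mid H_0 \text{ false}) \quad \text{(Type II)} $$

and 1 − β is the power of the test.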
U
UDDI – Universal Description, Discovery and Integration. A shared, Web-based registry that lets organizations publish and locate each other’s Web services.
Uncertainty – A measure of the total variability associated with sampling and measuring that includes the two major error components: systematic error (bias) and random error.
Universe – See Population.
UNIX – Operating system from Bell Laboratories that runs on a wide variety of computers, including supercomputers, engineering workstations such as Sun Sparcstation computers, and microcomputers. Several different varieties exist. See Linux.
Unsaturated zone – The area of unsaturated soil between the ground surface and the water table (also called the vadose zone).
Upgrade – Increase the capability of hardware or software by changing to a newer or better version.
Upload – Move data from your computer to another. Opposite of download.
Upper control limit – See Control limit.
Upper warning limit – See Warning limit.
USB – Universal Serial Bus. High-speed serial interface for connecting peripherals to computers.
User check – An evaluation of a written procedure (e.g., a chemical analysis method) for clarity and accuracy, in which an independent laboratory analyzes a small number of spiked samples, following the procedure exactly.
User interface – The part of software that interacts with the computer operator.
USGS – United States Geological Survey. The U.S. Geological Survey investigates the occurrence, quantity, quality, distribution, and movement of surface and groundwater, along with much other data, especially geologic and other maps, and provides the data to the public.
USPLS – United States Public Land Survey (also known as Jeffersonian Survey). The USPLS is the township-range location system used in most of the western United States and in Canada. An ideal township is a square consisting of 36 sections of 1 square mile each.
UST – Underground Storage Tank.
Utility program – Software to do something useful with a computer other than an application such as word processing or mapping.
UTM – Universal Transverse Mercator. A coordinate system based in meters, originally used for constructing military maps between 80° South and 84° North, based on the Transverse Mercator projection. (North and south of this zone, the Universal Polar Stereographic coordinate systems are used.) The earth is divided into 60 UTM zones of 6° each. Zones are numbered west to east; a central meridian in each zone serves as the reference point (see the zone sketch below).
UV – Ultraviolet. Also used for ultraviolet spectrophotometry.
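The UTM entry above fixes everything needed to compute a zone number from a longitude. The following is a minimal sketch in Python, assuming the standard 6-degree zone arithmetic and ignoring the official zone-boundary exceptions (such as those around Norway and Svalbard); the function names are illustrative, not from any library:

def utm_zone(lon_deg):
    """Return the UTM zone number (1-60) for a longitude in degrees.

    Zones are 6 degrees wide and numbered west to east starting at 180 W,
    as described in the UTM entry above.
    """
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must be in [-180, 180]")
    zone = int((lon_deg + 180.0) // 6) + 1
    return min(zone, 60)  # 180 E itself falls back into zone 60

def central_meridian(zone):
    """Longitude of a zone's central meridian, the easting reference line."""
    return zone * 6.0 - 183.0

# Example: a site at about 80.1 W is in zone 17, central meridian 81 W.
assert utm_zone(-80.1) == 17
assert central_meridian(17) == -81.0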
V
Vadose zone – The area of unsaturated soil between the ground surface and the water table (also called the unsaturated zone).
Valid study – A study conducted in accordance with accepted scientific methodology, the results of which satisfy predefined criteria.
Validated method – A method that has been determined to meet certain performance criteria for sampling and/or measurement operations.
Validation – Confirmation by examination and provision of objective evidence that the particular requirements for a specific use have been fulfilled. Data validation is an analyte- and sample-specific process that extends the evaluation of data beyond method, procedural, or contractual compliance (i.e., data verification) to determine the analytical quality of a specific data set.
Value – The magnitude of a quantity. A single piece of factual information obtained by observation or measurement and used as a basis of calculation.
Vapor – Gaseous phase of any substance that is liquid or solid at atmospheric pressure and temperature.
Vaporize – To go from the liquid to the gaseous state.
Variable – An entity subject to variation or change.
Variance – See Sample variance.
Vector – A geometric unit that has direction and magnitude. Also refers to graphic images that consist of points and lines rather than raster dots.
Verification – Confirmation by examination and provision of objective evidence that specified requirements have been fulfilled. Data verification is the process of evaluating the completeness, correctness, and conformance/compliance of a specific data set against the method, procedural, or contractual requirements (see the sketch below).
Verifiable – The ability to be proven or substantiated.
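As an illustration of the distinction drawn in the Verification entry above (checking completeness and conformance, as opposed to validation's judgment of analytical quality), here is a minimal sketch in Python; the field names, units list, and function are hypothetical, not part of any system described in this book:

# Hypothetical requirements a deliverable record must conform to.
REQUIRED_FIELDS = {"station_id", "sample_date", "parameter", "value", "units"}
ALLOWED_UNITS = {"mg/L", "ug/L", "pH units"}

def verify_record(record):
    """Return a list of verification failures; an empty list means the record passes."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()  # completeness check
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if record.get("units") not in ALLOWED_UNITS:  # conformance check
        problems.append(f"unexpected units: {record.get('units')!r}")
    return problems

print(verify_record({"station_id": "MW-1", "sample_date": "2001-06-30",
                     "parameter": "Arsenic", "value": 0.005, "units": "mg/L"}))
# [] means the record passes verification; validation would then go further
# and judge the analytical quality of the result itself.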
VGA – Video Graphics Array. Older video adapter on PC-compatible computers. Highest standard resolution is 640×480 pixels.
Video adapter – A card that translates the digital signal from the computer’s motherboard into a signal the monitor can display.
Virtual disk – See RAM disk.
Virus – Software that can damage your system and which can spread from computer to computer.
VOA – Volatile Organic Analysis.
VOC – Volatile Organic Compound.
Voice coil – Device for positioning the heads on high-quality disk drives.
Volatile – A substance that has a high vapor pressure, that is, it evaporates easily.
Volumetrics – Calculating volumes for surfaces generated by the computer.
W
Warm boot – See Boot.
Warning limit – A specified boundary on a control chart that indicates a process may be going out of statistical control and that certain precautions are required (see the limits sketch following this group of entries).
Wastewater – Water from a home, community, farm, or industrial facility that contains dissolved or suspended substances.
Water table – The boundary between the saturated and unsaturated zones beneath the ground surface.
Window – An area of the screen used for output of data. See also Microsoft Windows.
Word processor – Software to enter, edit, format, and print text, and sometimes graphics, on a computer.
Working standard – See Secondary standard.
Workstation – Refers generally to any computer or terminal where work is done. More specifically, a high-powered personal computer with a high-resolution display, usually running UNIX, for engineering and scientific applications.
WORM – Write Once Read Many. WORM drives are used for mass storage of data.
WP – Work Plan.
WQC – Water Quality Criteria.
WQS – Water Quality Standards.
WW – Wastewater.
WYSIWYG – What You See Is What You Get. Refers to word processing and desktop publishing software that displays formatting as well as text.
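For the Warning limit entry above: on a conventional Shewhart control chart (the convention introduced in Shewhart, 1931, cited in the bibliography; stated here as common practice rather than a requirement of any particular program), limits are usually set from the process mean and standard deviation:

$$ \text{warning limits} = \bar{x} \pm 2s, \qquad \text{control limits} = \bar{x} \pm 3s $$

A point beyond a warning limit prompts scrutiny; a point beyond a control limit indicates the process is out of statistical control.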
X
XENIX – A version of the UNIX operating system for PCs.
XGA – eXtended Graphics Adapter. Popular video adapter on PC-compatible computers. Highest standard resolution is 1024×768 pixels, although some go much higher.
Xmodem – Public domain communication protocol used to transfer data between computers.
XRF – X-Ray Fluorescence.
Z
Zero drift – The change in instrument output over a stated time period of nonrecalibrated, continuous operation, when the initial input concentration is zero; usually expressed as a percentage of the full scale response (see the formula following these entries).
ZHE – Zero Headspace Extraction.
Zoom – Change the viewing scale of a map or drawing.
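The Zero drift definition above can be written as a percentage of full scale (an illustrative rendering of the definition, not a formula quoted from any method):

$$ ZD(\%) = 100 \times \frac{R_t - R_0}{R_{fs}} $$

where R₀ and Rₜ are the instrument responses to a zero-concentration input at the start and end of the period, and R_fs is the full-scale response.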
APPENDIX G
BIBLIOGRAPHY
Many books and articles have been written about the various aspects of protecting the environment and computerized data management. The resources listed in this appendix are ones that the author found useful in gathering the information for this book. Most end with a comment about the resource, and these comments represent the author’s opinion and nothing else.
A
Abbott, B., 2001 – Requirements set the mark, InfoWorld, March 5, 2001. Article stressing the importance of developing and maintaining detailed requirements for software projects, along with statistics about software project failure rates.
Adolfo, N. C. and Rosecrance, A., 1993 – QA and environmental sampling, Environmental Testing & Analysis, May/June, 1993. Discusses sampling plans and other sampling issues from a QA perspective.
ASTM, 1997 – ASTM Standards on Environmental Sampling, American Society for Testing and Materials, West Conshohocken, PA. Detailed specifications for sampling of various environmental media.
ASTM, 2001a – Practice E-1527-00 Standard Practice for Environmental Site Assessments: Phase 1 Environmental Site Assessment Process, American Society for Testing and Materials, West Conshohocken, PA. Revised version of the standard approach for Phase 1 environmental assessments.
ASTM, 2001b – Practice E-1528-00 Standard Practice for Environmental Site Assessments: Transaction Screen Process, American Society for Testing and Materials, West Conshohocken, PA. Companion standard to ASTM 2001a.
Augarten, S., 1984 – Bit by Bit - An Illustrated History of Computers, Ticknor and Fields, New York. Chronological discussion of the personalities and machines that led to modern computers. Well-written and informative.
B
Barnwell, C. E., 2000 – The USGS GeoExplorer Project: using new GIS technology for scientific data publishing, in Geographic Information Systems in Petroleum Exploration and Development,
Coburn, T. C., and Yarus, J. M., Eds., AAPG Computer Applications in Geology, No. 4, pp. 249-260. Discussion of a GIS interface to project data.
Boyce, R. F. and Chamberlin, D. D., 1973 – Using a structured English query language as a data definition facility, IBM RJ 1318, December, 1973. First written description of SQL.
Bulmer, M. G., 1979 – Principles of Statistics, Dover Publications, New York. Introductory textbook on probability and statistics.
C
Cambridge Software, 2001 – Chemfinder.com, http://chemfinder.cambridgesoft.com. Search engine for chemical information.
Carson, R., 1962 – Silent Spring, Houghton Mifflin Company, New York. This landmark book by Rachel Carson initiated public awareness of the impact of our activities on the environment.
Carter, G. C. and Diamondstone, B. I., 1990 – Directions for Internationally Compatible Environmental Data, Hemisphere Publishing Corporation, New York. Papers from an international workshop on environmental data gathering and transfer.
Cheremisinoff, N. P., 2001 – Spotlight on chlorinated hydrocarbons, Pollution Engineering, November, 2001. Good overview of chlorinated hydrocarbons (TCE, TCA, PCE) and how they act in the environment.
Codd, E. F., 1970 – A relational model of data for large shared data banks, Communications of the ACM, June, 1970. Reprinted in Stonebraker and Hellerstein, 1998. Original paper on relational data management.
Codd, E. F., 1972 – Further normalization of the data base relational model, in Data Base Systems, Courant Computer Science Symposia Series, v 6, Prentice Hall, Englewood Cliffs, NJ. This article was one of the first on normalization of relational data models. The contents of this article are described in Date (1981).
Conservation Technology Resource Center, 2001 – Groundwater and Surface Water: Understanding the Interaction, www.ctic.purdue.edu/IYW/Brochures/GroundSurface.html. This Web page contains a good overview of groundwater issues.
Cooper, A., 1995 – About Face - The Essentials of User Interface Design, IDG Books, Foster City, CA. This book was written by one of the authors of Microsoft Visual Basic, and is intended to help software authors create better software.
Core Laboratories, 1996 – Environmental Analysis Survival Guide, Core Laboratories, 5295 Hollister Road, Houston, TX 77040 (with handout of slides). This guidebook provides many details on environmental data analysis, validation, and so on.
Csuros, M., 1997 – Environmental Sampling and Analysis Lab Manual, Lewis Publishers, Boca Raton, FL. Overview of lab processes for environmental samples.
D
Date, C. J., 1981 – An Introduction to Database Systems, Addison Wesley, Reading, MA. This book is a very technical discussion of data management principles by an IBM Fellow who collaborated with E. F. Codd on relational database design.
Davis, J. C., 1986 – Statistics and Data Analysis in Geology, 2nd ed., John Wiley & Sons, New York. Classic textbook on statistical techniques and applications in geology. Good theoretical coverage of descriptive and spatial statistics.
Diamondstone, B. I., 1990 – Executive summary, in Environmental Data - Directions for Internationally Compatible Environmental Data, Carter, G. C. and Diamondstone, B. I., Eds., Hemisphere Publishing Corporation, New York. This book contains the proceedings of a
conference on international compatibility of environmental data. The executive summary contains a good overview of environmental data management issues in general.
Diener, B. J., Terkla, D., and Cooke, E., 2000 – The Massachusetts Environmental Industry, University of Massachusetts Donohue Institute, Boston, MA. Interesting summary of the environmental industry in Massachusetts and elsewhere.
DOE/HWP, 1990a – Quality Control Requirements for Field Methods, Rev. 1, U.S. Department of Commerce, National Technical Information Service document DE9108447. Provides quality control requirements for sampling and other field analyses as part of the Hazardous Waste Remedial Actions Program (HAZWRAP).
DOE/HWP, 1990b – Requirements for Quality Control of Analytical Data, Rev. 1, U.S. Department of Commerce, National Technical Information Service document DE92014721. Provides quality control requirements for samples and data from collection through analysis for the Hazardous Waste Remedial Actions Program (HAZWRAP).
Dragan, R. V., 2001 – XML your data, PC Magazine, June 26, 2001. This article describes the status of adoption of XML, including its use as a data management format.
Dunn, M., 1994 – Creating Access Applications - The Advanced Guide to Developing Professional Database Applications, Que Corporation, Indianapolis, IN. This book is geared to someone who wants to learn to build applications and use the programming features in Access.
E
Edwards, P. and Mills, P., 2000 – The state of environmental data management, Environmental Testing & Analysis, September/October, 2000, pp. 22-30. Overview of issues related to environmental data management.
ENCO, 1998 – Environmental Conservation Labs Web site, www.encolabs.com. Good information on lab analysis procedures and QA/QC issues.
England, K. and Stanley, N., 1996 – The SQL Server Handbook, Digital Press, Boston, MA. This book covers the basics of SQL Server, using both the command line and graphical interfaces.
England, K., 1997 – The SQL Server 6.5 Performance Optimization and Tuning Handbook, Digital Press, Boston, MA. This book, a companion to England and Stanley (1996), covers the various parameters in SQL Server that can affect performance.
Environmental Defense, 2001 – Scorecard.org, Environmental Defense, www.scorecard.org. This Web site contains a large amount of synthesized information on environmental issues such as toxic releases.
EPA, 1980 – Test Methods for Evaluating Solid Waste, Physical/Chemical Methods (SW-846), U.S.E.P.A. This is OSW's official compendium of analytical and sampling methods that have been evaluated and approved for use in complying with the RCRA regulations. SW-846 functions primarily as a guidance document setting forth acceptable, although not required, methods for the regulated and regulatory communities to use in responding to RCRA-related sampling and analysis requirements. SW-846 is a multi-volume document that changes over time as new information and data are developed. It has been issued by EPA since 1980 and is currently in its third edition. Advances in analytical instrumentation and techniques are continually reviewed by OSW and incorporated into periodic updates to SW-846 to support changes in the regulatory program and to improve method performance and cost effectiveness. It is available online at www.epa.gov/epaoswer/hazwaste/test/main.htm, on CD-ROM, or in hard copy from the U.S. Government Printing Office or NTIS.
EPA, 1989 – Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance (PB89-151047), Office of Solid Waste Management, Waste Management Division, U.S.E.P.A., 401 M Street SW, Washington, DC, 20460, February, 1989.
EPA, 1996 – EPA-New England Data Validation Functional Guidelines for Evaluating Environmental Analyses, U.S.E.P.A.-New England Region 1, Quality Assurance Unit Staff, Office of Environmental Measurement and Evaluation, July, 1996, revised December, 1996. This is a good overview of data validation guidelines and procedures.
EPA, 1997a – Glossary of QA terms of the Quality Assurance Management Staff, http://www.epa.gov/emap/html/qa_terms.html. This is a Web page with many definitions of quality assurance terms.
EPA, 1997b – Manual for the Certification of Laboratories Analyzing Drinking Water - Criteria and Procedures Quality Assurance, U.S.E.P.A. 815-B-97-001, March 1997, www.epa.gov/OGWDW/certlab/labfront.html.
EPA, 1998a – Guidance for Data Quality Assessment - Practical Methods for Data Analysis, U.S.E.P.A. QA/G-9, QA 97 version, EPA/600/R-96/084, January, 1998. Discussion of the statistical methods for analyzing and displaying environmental data.
EPA, 1998b – EPA Guidance for Quality Assurance Project Plans, U.S.E.P.A. QA/G-5, EPA/600/R-98/018, February, 1998. Discussion of issues related to writing a QAPP.
EPA, 1998c – Enhanced ozone monitoring (PAMS) workbook acronyms and definitions Web page: http://www.epa.gov/airprogm/oar/oaqps/pams/analysis/acronym/actxtsac.html.
EPA, 2000a – Introduction to the Contract Laboratory Program, U.S.E.P.A. 540-R-99-004, February, 2000. www.epa.gov/superfund/programs/clp/index.htm.
EPA, 2000b – EPA Test Methods - Chemical or Name Index to EPA Test Methods, www.epa.gov/epahome/index/nameindx.htm (May 2000).
EPA, 2001a – Superfund (CERCLA) Enforcement, U.S.E.P.A., March 28, 2001. http://es.epa.gov/oeca/osre/superfund.html and linked pages.
EPA, 2001b – RCRA Corrective Action Hazardous Waste Cleanup Program, U.S.E.P.A., July 25, 2001. http://www.epa.gov/epaoswer/hazwaste/ca/backgnd.htm and linked pages.
EPA, 2001c – Guidance on Environmental Data Verification and Validation, U.S.E.P.A. QA/G-8, June, 2001 (peer review draft). Overview of data verification and validation. Available on the Web at www.epa.gov/r10earth/offices/oea/epaqag8.pdf.
EPA, 2001d – IRIS - Integrated Risk Information System, U.S.E.P.A., August 13, 2001. www.epa.gov/iris/intro.htm and linked pages. Good introduction to risk assessment and much information on chemicals and their environmental risk.
Evans, C., 1981 – The Making of the Micro, Van Nostrand Reinhold, New York. Illustrated history of computers and the early days of microcomputers.
EXTOXNET, 2001 – The EXtension TOXicology NETwork, http://ace.ace.orst.edu/info/extoxnet/ghindex.html. This Web site provides a variety of types of information on toxic substances such as pesticides and herbicides.
F
Feng, P. and Easley, M., 1999 – Managing environmental data, Environmental Protection, April, 1999. A discussion of a database implementation project at an industrial facility, with a focus on project benefits.
Finkelstein, C., 1989 – An Introduction to Information Engineering - From Strategic Planning to Information Systems, Addison-Wesley, Sydney, Australia. This book covers the Information Engineering approach to data management. It has a good discussion of the business aspects of data normalization, as well as automated software development.
FIPS, 1993 – Integration Definition for Information Modeling (IDEF1X), Federal Information Processing Standards Publication 184, December 21, 1993, N.I.S.T., Gaithersburg, MD. This document describes the semantics and syntax of this modeling language. It also has good background material on data modeling.
Freiberger, P. and Swaine, M., 1984 – Fire in the Valley - The Making of the Personal Computer, Osborne/McGraw Hill, Berkeley, CA. Inside story of the people and companies who developed the first personal computers. Informative and fun to read.
G
Gagnon, M., 1998 – SQL: the universal database language, PC Magazine, November 3, 1998. This article describes some of the basics of SQL, along with a brief history of SQL and relational database management.
Gilbert, J. B., 1999 – Selecting and implementing an Environmental Management Information System, EM, July, 1999, Air and Waste Management Association. This article describes EMIS and covers various aspects of implementing these systems.
Giles, J. R. A., 1995 – Geological Data Management, The Geological Society, London. This book contains a group of articles on three topics: database design, data management, and case studies, primarily from the U.K.
Goldberg, A. and Robson, D., 1983 – Smalltalk-80: The Language and its Implementation, Addison-Wesley, Reading, MA. This is a good basic guide to this object-oriented language.
Goodchild, M. F., Parks, B. O., and Steyaert, L. T., 1993 – Environmental Modeling with GIS, Oxford University Press, New York. A collection of papers on various aspects of environmental modeling with GIS. Individual papers have much information and many good references.
Government Institutes, Inc., 1991 – Environmental Statutes, Government Institutes, Inc., Rockville, MD. A reference containing the text of U.S. federal statutes.
Greenspun, P., 1998 – Database management systems, www.arsdigita.com/books/panda/databases-choosing. Discussion of data management systems relative to Web interfaces.
Guptill, S. C. et al., 1988 – A Process for Evaluating Geographic Information Systems, USGS Open File Report 88-105. Discussion of features and requirements for GIS systems.
H
Harkins, S. S., 2001a – A calculated approach, PC Magazine, September 4, 2001. A warning about storing calculated data in a database.
Harkins, S. S., 2001b – Keep your data clean, PC Magazine, December 26, 2001. An introduction to referential integrity in Access.
Harmancioglu, N. B., Necdet Alpasian, M., Ozkul, S. D., and Singh, V. P., 1997 – Integrated Approach to Environmental Data Management Systems, Kluwer Academic Publishers, Dordrecht. Proceedings of the NATO Advanced Research Workshop on Integrated Approach to Environmental Data Management Systems, Bornova, Izmir, Turkey, September, 1996. This book contains a collection of papers on environmental data collection and management, primarily outside the United States.
Harr, J., 1995 – A Civil Action, Vintage Books division of Random House, New York. This is a popular recounting of a neighborhood’s struggle with groundwater pollution. It was later made into a movie.
Hem, J. D., 1985 – Study and Interpretation of the Chemical Characteristics of Natural Water, U.S. Geological Survey Water-Supply Paper 2254, 3rd ed., 263 pp. Introduction of Stiff water quality diagrams.
Hensel, J., 2001 – 2001 Salary Survey, Environmental Protection, August, 2001. Discussion of salary and other trends in the environmental industry.
Huff, D., 1954 – How to Lie with Statistics, W. W. Norton and Company, Inc., New York. Nontechnical book about how statistics can be used to slant the truth. The focus is on the use of statistics in advertising and other types of persuasion.
I
IBM, 1965 – Numerical surface techniques and contour map plotting, Data Processing Application, IBM Data Processing Division, White Plains, NY. One of the first papers published on using computers to generate contour maps.
J
Jennings, R., 1995 – Using Access 95, Que Corporation, Indianapolis, IN. This book provides a large amount of information on Access 95 and database concepts for users of all levels.
Jepson, B., 2001 – PostgreSQL vs. MySQL: Building better databases, Web Techniques, September, 2001. Comparison of these two open-source database programs.
Jones, T. A., Hamilton, D. E., and Johnson, C. R., 1986 – Contouring Geological Surfaces with the Computer, Van Nostrand Reinhold, New York. General discussion of using contouring software for making maps. Provides geologic insight into the process of generating realistic maps.
Joseph, A. J., 1998 – Health, Safety and Environmental Data Analysis, Lewis Publishers, Boca Raton, FL. This book provides a basic discussion of data presentation, statistics, and analysis.
K
Kachigan, S. K., 1991 – Multivariate Statistical Analysis, Radius Press, New York. Introductory textbook on multivariate statistics.
Kamrin, M. A., 1988 – Toxicology, Lewis Publishers, Chelsea, MI. Principles of toxicology for indoor and outdoor air, drinking water, food, and the workplace environment.
Karin, S. and Parker Smith, N., 1987 – The Supercomputer Era, Harcourt Brace Jovanovich, Orlando, FL. Non-technical description of supercomputers and their applications.
Kidder, T., 1981 – The Soul of a New Machine, Atlantic-Little Brown, Boston. Very readable look into the development of the Data General MV/8000 computer. Entertaining and educational.
Krajewski, S. A. and Gibbs, B. L., 2001 – Understanding Contouring - A Practical Guide to Spatial Estimation Using a Computer and Variogram Interpretation, Gibbs Associates, PO Box 706, Boulder, CO 80306-0706, 303-444-6032. Thoughtful and thorough discussion of the various aspects of gridding and contouring spatial data.
Krukowski, J., 2001 – From countering Erin Brockovich to attracting new blood, Pollution Engineering, August, 2001. A discussion of the magazine’s annual career survey.
L
Lang, L., 1989 – GIS goes 3D, Computer Graphics World, March, 1989. General description with illustrations of early progress in 3-D GIS.
Lang, L., 1998 – Managing Natural Resources with GIS, Environmental Systems Research Institute, Redlands, CA. Published by ESRI, the developer of the ArcInfo GIS software, this book contains examples of using GIS for natural resource exploration and production, agriculture, environmental investigation, reclamation, and other similar projects.
Lapham, W. W., Wilde, F. D., and Koterba, M. T., 1995 – Ground-Water Data Collection Protocols and Procedures for the National Water-Quality Assessment Program: Selection, Installation, and Documentation of Wells, and Collection of Related Data, U.S. Geological Survey, Open File Report 95-398. Detailed information on well selection, installation of monitoring wells, documentation, and collection of water level and additional hydrogeologic and geologic data.
Lee, T., 1998 – LEEGRM: A program for normalized Stiff diagrams and quantification of grouping hydrochemical data, Computers and Geosciences, v 24, pp. 523-529. Expansion of Stiff diagram displays by normalizing the size based on total dissolved solids.
Linderholm, O., 2001 – Making the case for IT, InfoWorld, May 14, 2001. This article presents the results of a study on the importance of information technology in companies.
Lyall, C., 1995 – QBF, generic blackboxes, and gold-plated trophies, Access/Visual Basic Advisor, February/March, 1995. This article describes a technique for creating an easily modified form that automatically generates SQL statements based on user input.
M
Mackenthun, K. M., 1998 – Basic Concepts in Environmental Management, Lewis Publishers, Boca Raton, FL. Despite its name, more than half of this book is about environmental regulations, especially U.S. federal regulations.
Manahan, S. E., 2000 – Environmental Chemistry, 7th ed., Lewis Publishers, Boca Raton, FL. This textbook is for environmental professionals with a good understanding of chemistry. It is a good source of information on compounds, reactions, and environmental processes.
Manahan, S. E., 2001 – Fundamentals of Environmental Chemistry, 2nd ed., Lewis Publishers, Boca Raton, FL. This textbook is for people with little background in chemistry, so it covers more chemistry theory than his other book.
McConnell, S., 1998 – Software Project Survival Guide, Microsoft Press, Redmond, WA. Processes and practices for software development projects. Contains many useful checklists for progress tracking, along with commonsense ideas on what works and what doesn’t.
McMullen, R., 1996 – Rob’s Quotation Collection, www.ae.utexas.edu/~rwmcm/quotes.html.
Merriam, D., 1983 – The geological contributions of Charles Babbage, The Compass, v 62, n 1, pp. 31-38. Report on research by Dr. Merriam on the relationship between Babbage, an early computer inventor, and the London geological community.
Merriam, D., 1985 – Use of Microcomputers in Geology (workshop notes), Department of Geology, Wichita State University. Articles on many aspects of geological computer use. Contains a time line for significant events in geological computer use.
Microsoft, 2001 – Microsoft Consulting Services Naming Conventions for Visual Basic, Microsoft Knowledge Base article Q110264, http://support.microsoft.com, last modified June 12, 2000. List of conventions for naming programming and database objects.
Milne, P. H., 1992 – Presentation Graphics for Engineering, Science and Business, E & FN Spon, London. Description of graph displays and BASIC programs for creating various types of graphs.
Monmonier, M., 1996 – How to Lie with Maps, The University of Chicago Press, Chicago, IL. This book covers map elements and how they can be used to influence the interpretation of the map.
Moore, G. E., 1965 – Cramming more components onto integrated circuits, Electronics, April 19, 1965, pp. 114-117. Moore’s original observation that the complexity of electronic components doubles every year, and would do so for the next 10 years. In an oral paper in 1975 he revised his estimate to doubling every 18 months, related specifically to the component density of memory chips.
Moore, G. A., 1991 – Crossing the Chasm: Marketing and Selling High-Tech Products to Mainstream Customers, HarperBusiness, New York. Interesting book that discusses adoption of technology, and the process of acceptance of a technology product.
N
Nath, A., 1995 – The Guide to SQL Server, 2nd ed., Addison-Wesley, Reading, MA. This book covers SQL Server, mostly through the command-line interface rather than the newer graphical interface.
NLM, 2001 – Toxicology and Environmental Health Web site, U.S. National Library of Medicine, http://sis.nlm.nih.gov/Tox/ToxMain.html. Databases, tutorials, and other information on environmental toxicology.
NCDWQ, 2001 – Collection and Preservation of Groundwater Samples for the Central Laboratory, North Carolina Division of Water Quality, www.esb.enr.state.nc.us/lab/qa/collpresgw.htm. Web site with information on sample preservation and holding times.
NWWA, 1986 – RCRA Ground Water Monitoring Technical Enforcement Guidance Document (TEGD), National Water Well Association, Dublin, OH. Guidance for preparing and executing sampling and analysis plans for groundwater.
O
Ott, W. R., 1995 – Environmental Statistics and Data Analysis, Lewis Publishers, Boca Raton, FL. Textbook with worked problems covering random processes, probability, Bernoulli processes, Poisson processes, diffusion and dispersion, normal processes, dilution of pollutants, and lognormal processes.
P
Patnaik, P., 1997 – Handbook of Environmental Analysis, Lewis Publishers, Boca Raton, FL. This book covers selected environmental parameters and their analysis. It has a large section on the parameters, especially organic compounds, with a focus on how they are analyzed. It also covers the analytical procedures themselves.
Pirsig, R. M., 1974 – Zen and the Art of Motorcycle Maintenance, Bantam Books, Toronto. Philosophical discussion of the place of appreciation of technology, and especially quality, in society, and the issues that arise when you think deeply about it.
Powell, R. M. and Puls, R. W., 1997 – Hitting the bull’s eye in groundwater sampling, Pollution Engineering, June 1, 1997. This article discusses passive techniques for groundwater sampling to obtain more representative samples.
R
Reports Working Group, 1988 – A Summary of GIS Activities in the Federal Government, Federal Interagency Coordinating Committee on Digital Cartography, August, 1988. Tables showing GIS use within various federal agencies.
Rich, G., 1996 – Environmental Humor, ES&S Publishing Company, Bowling Green, OH. Entertaining book of jokes, one liners, and wisdom on environmental and related topics.
Robinson, J. E., 1982 – Computer Applications in Petroleum Geology, Hutchinson Ross, New York. This book covers a variety of topics related to working with earth science data, including statistics and contouring.
Rosecrance, A., 1993 – EPA method data validation, Environmental Testing & Analysis, September/October, 1993. This article describes the various analytical methods developed by the U.S.E.P.A. and the requirements associated with each method.
Ross, S. S., 2001 – Get smart, PC Magazine, August, 2001. This article covers business intelligence software, including some general information about the process, and reviews of software to perform the analyses.
Ross, S. S., Dyck, T., and Sanders, J., 2001 – Getting to first database, PC Magazine, October 16, 2001. This article provides a current overview of data management software, including desktop relational systems and Web-based flat file systems.
Rumble, J. R. and Hampel, V., Eds., 1984 – Database Management in Science and Technology, Elsevier, Amsterdam, Holland. Very detailed discussion of technical database management. Individual papers have many interesting references.
S
Sara, M., 1994 – Standard Handbook for Solid and Hazardous Waste Facility Assessments, Lewis Publishers, Boca Raton, FL. This large book contains much useful information on many different aspects of facility assessment.
Sayre, D., 1996 – Inside ISO 14000, St. Lucie Press, Delray Beach, FL.
Schaller, B., 1996 – The origin, nature and implications of Moore’s law - the benchmark of progress in semiconductor electronics, www.research.microsoft.com/~Gray/Moore_Law.html. Detailed discussion of Moore’s law and its significance to the computer industry and society at large.
Shewhart, W. A., 1931 – Economic Control of Quality of Manufactured Product, D. Van Nostrand Company, New York. Introduction of Shewhart control charts.
Singhroy, V. H., Nebert, D. D., and Johnson, A. I., 1996 – Remote Sensing and GIS for Site Characterization, ASTM STP 1279, American Society for Testing and Materials, West Conshohocken, PA. Symposium proceedings with various miscellaneous papers on remote sensing, GIS, site characterization, and mapping standards.
SKC, 2001 – SKC Air Sampling Guides (by Chemical), www.skcinc.com/guides.html. Web site with information on EPA and other air sampling methods.
Snyder, J. P., 1987 – Map Projections - A Working Manual, U.S. Geological Survey Professional Paper 1395, U.S. Government Printing Office, Washington, DC. Definitive book on coordinate projections and how they are used for USGS maps.
Spectrum Labs, 2001 – Chemical compounds Web site, www.speclab.com/compound/chemabc.htm. This site has a variety of information on hundreds of chemical compounds.
Stroustrup, B., 1986 – The C++ Programming Language, Addison-Wesley, Reading, MA. Introductory text on the object-oriented version of the C language.
Stonebraker, M. and Hellerstein, J. M., 1998 – Readings in Database Systems, Morgan Kaufmann Publishers, San Francisco, CA. A collection of important data management papers, known in academic circles as “The Red Book.”
Strunk Jr., W., and White, E. B., 1935 – The Elements of Style, republished 1979, Macmillan Publishing Company, New York.
Sullivan, T., 2001 – Exploring the data explosion, InfoWorld, April 16, 2001. Discussion of approaches to data storage.
Swan, A. H. R. and Sandilands, M., 1995 – Introduction to Geological Data Analysis, Blackwell Science, Oxford, England. Textbook covering statistical and geostatistical analysis of geological data, with many worked examples.
T
Tearpock, D. J. and Bischke, R. E., 1991 – Applied Subsurface Geological Mapping, Prentice Hall, Englewood Cliffs, NJ. Comprehensive coverage of subsurface mapping techniques. Most examples are from petroleum exploration and production.
Thomas, D., 1989 – What’s an object?, Byte, March, 1989. Overview of object-oriented programming.
Thompson, T., 1989 – The NeXT step, Byte, March, 1989. Description of object-oriented programming on the NeXT computer.
Time Life, 1986 – Computer Languages, Time Life Books, Chicago, IL. Nicely illustrated but slightly disjointed discussion of programming concepts and languages.
Trimble, Jr., J. H. and Chappell, D., 1989 – A Visual Introduction to SQL, John Wiley & Sons, New York. This book uses a visual diagram approach to teach SQL.
Triplett, L. D., 1997 – Water Quality Program Sampling Procedures for Ground Water Monitoring Wells, Minnesota Pollution Control Agency, July, 1997. www.pca.state.mn.us.
Tsu-der Chou, G., 1987 – dBase IV Tips, Tricks and Traps, Que Corp., Indianapolis, IN. Shortcuts and hints on database organization and programming, although focused on the obsolete dBase language.
Tufte, E. R., 1983 – The Visual Display of Quantitative Information, Graphics Press, Cheshire, CT. This entertaining book covers a variety of aspects of graphical data display, including graphs and maps.
U
URISA, 1998 – GIS Database Concepts, The Urban and Regional Information Systems Association, Park Ridge, IL. This booklet provides a very basic overview of GIS concepts.
URISA, 1999 – Aerial Imagery Guidelines, The Urban and Regional Information Systems Association, Park Ridge, IL. This booklet provides a very basic overview of aerial imaging.
V
van der Lans, R. F., 1988 – Introduction to SQL, Addison-Wesley, Wokingham, England. This book, by a scientist with Shell Oil Company in the Netherlands, teaches the fundamentals of SQL.
Vizard, M., 2001 – Realizing age-old visions of software, InfoWorld, July 13, 2001, p. 8. Brief discussion of the cost and value of implementing software systems.
W
Wallace, D. R., Peng, W. W., and Ippolito, L. M., 2000 – Software Quality Assurance: Documentation and Reviews, NISTIR 4909, U.S. Department of Commerce, http://hissa.ncsl.nist.gov/publications/instir4909. Provides a software quality assurance standard for nuclear applications, which also applies to software quality assurance in general.
Walls, M. D., 1999 – Data Modeling, The Urban and Regional Information Systems Association, 1460 Renaissance Drive, Suite 305, Park Ridge, IL 60068. This booklet describes data modeling, largely from a theoretical perspective.
Watterson, K., 1989 – Objective thinking - exactly what is object-oriented programming?, Data Based Advisor, March, 1989. Good overview of object-oriented programming from a programmer’s point of view.
Webster, 1984 – Webster’s II New Riverside University Dictionary, Houghton Mifflin Company, Boston, MA.
Weiner, E. R., 2000 – Applications of Environmental Chemistry, Lewis Publishers, Boca Raton, FL. Overview and reference materials covering environmental chemistry for non-chemists.
Y
Yost, E., 1997 – Environmental Data Management and Analysis, http://phoenix.liunet.edu/~yost/grad_env/env603.html. Course notes and assembled materials about environmental data management available (at least in part) on the Web.
Yourdon, E., 1996 – When good enough is best, Byte, September, 1996, pp. 85-90. Also located at www.yourdon.com/articles/Byte9609.html. This magazine article discusses the trade-off in software design between features, quality, time, and budget.
INDEX
Numbers
3-D displays, 240
3-D modeling, 39, 305
A
A Civil Action, 4, 411
Abscissa, 226, 227
Access, 7, 23, 26, 27, 28, 29, 31, 38, 49, 50, 51, 52, 53, 62, 64, 65, 66, 69, 71, 73, 74, 76, 77, 78, 79, 80, 82, 83, 84, 87, 90, 91, 100, 101, 102, 111, 146, 159, 188, 190, 205, 206, 211, 221, 222, 282, 283, 285, 291, 307, 308, 310, 332, 364, 375, 394, 399, 409, 412, 413
Accuracy, 187, 188, 364, 390
Acrobat, 43, 224
Action queries, 205
Active Server Pages, 54, 366
ActivityLog, 65, 324
Ad hoc queries, 171, 187, 206
Administration, 9, 51, 67, 74, 75, 102, 163, 292, 299
Air analysis, 144
Air emissions inventory, 37
Air sampling, 126, 131, 415
Airphotos, 244, 245
Albers Equal Area, 247, 249, 392
AllMap, 251
Alpha testing, 101
American Polyconic, 247, 249
Analyses, 21, 25, 26, 38, 61, 155, 176, 178, 211, 219, 276, 283, 308, 310, 311, 312, 315, 316, 317, 318, 319, 320, 321, 325, 327, 328, 330, 333, 410
Analysis, 61, 123, 139, 142, 145, 179, 186, 274, 311, 312, 330, 333, 365, 374, 386, 389, 397, 405, 407, 408, 409, 410, 412, 414, 415, 416, 417
Analytical flags, 146
ANOVA, 274, 365
Application server, 103
Application service provider, 104
Arcs, 254
ArcView, 38, 209, 253, 284
Area of influence, 258
Arithmetic mean, 271
Arsenic, 3, 23, 39, 132, 134, 142, 380
Asbestos, 134, 339, 349, 354
ASCII, 13, 75, 279, 280, 332, 334, 366, 376, 402
Atomic absorption spectrometry, 141
Atomic emission spectroscopy, 142, 351
Atomicity, 186
Auditors, 105
Audits, 39, 40, 103, 174, 183
AutoCAD, 44, 244, 252, 376
Autoflagging, 198
Automated checking, 158
Automated editing, 168
Automated import, 90, 154, 304
B
Backup, 13, 50, 66, 67, 75, 91, 94, 102, 107, 113, 171, 189, 190, 290, 367, 388, 401
Banded report, 214, 215, 218, 219, 221
Base maps, 38, 55, 244, 245, 283, 295, 309
Beta testing, 101
Bias, 188, 364, 367
Biologic parameters, 136, 348
Bit, 24, 31, 75, 76, 78, 126, 300, 332, 333, 368
Blank spike, 177, 179, 316, 368
Block diagrams, 235, 237, 240, 243
Borehole geophysics, 347
Boring logs, 37
Browser, 50, 54, 55, 56, 96, 103, 281, 285
BS 7750, 184
Bubble maps, 244, 251
Budgets, 40
Buy or build, 97, 291
Byte, 75, 76, 78, 333, 368, 398
C
C++, 17, 73, 368, 415
CAA, 117, 368
Cable modems, 53
CAD, 44, 55, 243, 245, 251, 252, 254, 296, 369, 380, 388
Calculated fields, 220
Calculated parameters, 133, 134, 137, 216, 220
Calibration standard, 177, 180, 369, 373, 393
Cancer, 3, 134, 135, 136, 278, 338, 369
Canned queries, 205
Carcinogenesis, 278
Carson, Rachel, 3, 4, 408
Cartesian coordinates, 246, 249
Cartography, 38, 415
Caspio Bridge, 50
CCL, 120, 369
CD-ROM, 55, 70, 334, 370, 389, 409
Censored data, 272
Centralized, 18, 38, 42, 53, 56, 57, 58, 110, 148, 157, 285, 295, 296, 297, 300, 374
CERCLA, 118, 119, 121, 122, 365, 369, 370, 388, 390, 397, 398, 400, 401, 410
CERCLIS, 63, 370
Chain of custody, 38, 125, 183, 371
Character fields, 78
Chi-square, 274, 371
Chlorinated hydrocarbons, 3, 129, 135, 136, 347, 375, 350, 352, 408
Chloropleth map, 251, 368
Clean Air Act, 41, 117, 384, 388
Clean Water Act, 118, 373
Client computer maintenance, 111
Client-server, 7, 49, 50, 51, 52, 54, 55, 56, 65, 69, 70, 71, 73, 74, 87, 96, 188, 190, 201, 283, 296
Closed loop reference file system, 157, 335
Cluster analysis, 275
Codd, Edwin, 21, 22
Coded entries, 312, 328, 330, 335
Coded values, 19, 27, 60, 311
Colorimetric methods, 141
COM, 285
Common metals, 134
Communication, 53, 54, 73, 74, 104, 105, 184, 283, 285, 292, 295, 296, 364, 367, 368, 371, 372, 375, 379, 386, 398, 401, 405
Compacting, 91, 111
Compliance monitoring, 120
Compound keys, 25, 79
Comprehensive Environmental Response, Compensation, and Liability Act, 118, 370
Computer-aided drafting, 251, 252
Consistency, 60, 85, 86, 154, 155, 156, 157, 158, 182, 186, 192, 193, 196, 331, 334, 335, 338
Consistency checks, 158
Consistent units, 221
Construction Completion List, 120
Consultants, 105, 200
Content-specific filtering, 162
Contour interval, 261
Contour maps, 244, 256
Contour parameters, 261
Contouring, 7, 38, 39, 219, 238, 243, 251, 256, 257, 258, 259, 260, 261, 262, 263, 275, 305, 307, 388, 412, 415
Contract Lab Program, 13
Contractors, 42, 58, 63, 312
Control charts, 274, 276
Convex hull, 258, 260
Coordinate projection, 246
Coordinate systems, 226, 245
CORBA, 285
Corel Draw, 263
Correlation coefficients, 275
Correspondence analysis, 275
Cost, 7, 8, 11, 12, 35, 53, 54, 64, 72, 75, 93, 97, 99, 100, 103, 105, 106, 107, 109, 113, 119, 121, 124, 151, 190, 195, 223, 224, 245, 276, 278, 291, 294, 296, 299, 336, 373, 378, 383, 395, 409, 416
Coverage, 38, 40, 101, 125, 254, 258, 408
D Data acquisition.......................................174 Data administrator ......19, 52, 61, 62, 65, 74, 91, 102, 105, 109, 111, 130, 156, 158, 160, 162, 164, 168, 171, 182, 183, 188, 201, 245, 282, 294, 330, 324 Data checking..........................................158 Data domain ..............................................78 Data editing .................................81, 91, 167 Data flow diagrams .................................107 Data hierarchy .........................................211 Data management plan ..................95, 96, 97 Data mart.................................................285 Data mining .....................................285, 373 Data model .........19, 59, 60, 95, 96, 97, 186, 307 Data normalization ..21, 22, 30, 31, 359, 411 Data qualifiers .................................146, 167 Data Quality Objective............................145 Data quality procedures...........................181 Data review ..........62, 90, 94, 109, 111, 154, 164, 165, 166, 168, 181, 182, 186, 187, 191, 311, 318, 369 Data security............................................188 Data Transfer Standard.......3, 153, 155, 156, 157, 294, 327 Data validation limits ..............................323 Data warehouse .................84, 285, 374, 389 Database maintenance ...............................66 Database managers ...........14, 15, 16, 17, 18, 27, 54, 399 Date and time formatting.................221, 222 Date fields .................................................78 Datum......................128, 237, 248, 387, 389 DB2 ...............................................22, 50, 70 dBase.......49, 74, 76, 78, 222, 279, 282, 416 DCOM.....................................................285 DDT ........................3, 4, 136, 319, 346, 374 Decimal places ........146, 221, 308, 311, 397 Declare variables.....................................100 Decommissioning ....................................122
Decontamination..............................131, 178 Derived values.........................................220 Descriptive statistics........................273, 274 Detection limit........ 101, 140, 141, 146, 147, 155, 180, 195, 218, 219, 220, 221, 223, 272, 273, 291, 311, 312, 317, 318, 323, 330, 331, 333, 365, 374, 384, 386 Deterministic process ..............................270 Digestion .................................................140 Digital subscriber line........................53, 398 Dilution............140, 179, 311, 321, 333, 375 Discoverable..........................87, 88, 90, 102 Discriminant function analysis ................275 Distributed databases...........................53, 54 DNAPLs ..........................................129, 347 Document management .............................43 Documentation .........39, 42, 91, 92, 96, 101, 102, 112, 117, 155, 174, 183, 184, 191, 194, 196, 206, 294, 334, 366, 370, 373, 403, 413, 416 DOS.............. 49, 76, 87, 294, 334, 375, 387, 389, 390 Double entry........................................... 152 Drill down .............................................. 223 Driver ................................................74, 375 DSL ...........................................53, 368, 375 DTS .................153, 156, 157, 327, 332, 335 Duplicate samples....155, 177, 376, 395, 396 Dynamic HTML ....................................... 54
E EA ...................................117, 121, 376, 377 Earth Day.....................................................4 Ease of use.................7, 35, 49, 86, 185, 205 EBCDIC ....................................76, 332, 376 EDD ..... 40, 47, 153, 156, 157, 159, 162, 168, 176, 327, 335, 376 Editing .................................62, 90, 167, 208 EDMS.....5, 8, 13, 14, 21, 35, 36, 37, 38, 51, 52, 54, 59, 60, 61, 63, 65, 66, 69, 71, 72, 73, 74, 78, 85, 87, 90, 91, 93, 96, 100, 101, 104, 105, 106, 109, 110, 111, 112, 117, 119, 120, 121, 122, 127, 129, 132, 133, 136, 137, 140, 147, 148, 151, 152, 153, 154, 155, 159, 167, 168, 173, 175, 176, 183, 184, 185, 186, 187, 188, 189, 195, 196, 200, 205, 209, 213, 218, 224, 226, 228, 231, 240, 244, 246, 251, 252, 253, 254, 255, 262, 266, 269, 278, 282, 289, 292, 294, 295, 307, 327, 328, 329, 334, 335, 376
Efficiency .........8, 11, 30, 58, 140, 158, 177, 178, 179, 181, 293, 294, 295, 299, 395, 399, 401 EIS...................................117, 121, 376, 377 EJB..........................................................285 Electronic data deliverable ......................159 Electronic distribution .............................224 Electronic import...............................62, 153 Elevations........7, 38, 61, 136, 263, 329, 388 Emergency Planning and Community Rightto-Know Act .................................118, 377 EMIS ...................................8, 173, 376, 411 Emissions management .............................37 EMS ............................................8, 173, 376 Encapsulation ......................................16, 17 Endangered Species Act..........................118 Enforcement actions ................................120 Entity integrity...........................................31 Entity-relationship diagram .......22, 325, 359 Enviro Data ...................27, 30, 59, 307, 361 Enviro Spase.....38, 209, 246, 253, 255, 283, 284 Environmental assessment......117, 121, 376, 377, 379 Environmental impact statement .............121 Environmental regulations.......................117 Enzyme immunoassay .............................144 EP toxicity test ........................................140 EPA ....55, 63, 118, 119, 120, 121, 134, 135, 141, 143, 144, 145, 173, 175, 181, 185, 186, 188, 191, 192, 193, 194, 273, 277, 278, 294, 311, 337, 338, 348, 349, 350, 351, 352, 353, 354, 363, 364, 365, 369, 370, 371, 374, 375, 377, 378, 379, 381, 382, 384, 389, 390, 392, 397, 398, 400, 401, 403, 409, 410, 415 EPCRA............................................118, 377 EPTOX....................................................140 E-R diagram ..............................................22 Erin Brockovich ..........................4, 134, 412 ERPIMS ....................................63, 364, 377 Error handling .........................................101 ESA .................................................118, 377 Excel..... 14, 58, 78, 146, 156, 221, 226, 228, 229, 230, 282, 332 Exercises .................................................357 Expense reports .........................................42 Expert users...............................................90 Export... 38, 62, 71, 157, 186, 187, 219, 245, 279, 282, 283, 313, 377 Export-import....................................62, 283
F Factor analysis.........................................275 Feasibility study.......................119, 378, 395 Fence diagrams........................................239 Field blank...............176, 178, 316, 378, 385 Field data.... 37, 56, 129, 130, 155, 159, 168, 177, 178, 216, 297, 331 Field duplicates ...............176, 177, 316, 378 Field measurements .................................129 Field parameters ..............................136, 347 Field samples...........................................177 Fields .............................21, 76, 78, 328, 407 File import .................................................90 Filtered ............141, 310, 311, 319, 329, 331 Filtering ...................................130, 140, 262 Financial benefits.....................................293 Finding of No Significant Impact ...........121, 377, 379 First-time users ..........................................87 Flat file ..............14, 15, 23, 26, 31, 332, 415 Floaters............................................129, 136 FONSI .....................................121, 377, 379 Foreign key..........................................26, 79 Formatted reports ....................................214 Forms.........................................................80 Four Color Problem.................................233 Fourier transforms ...................................262 FoxPro .....................49, 71, 74, 78, 222, 282 Frame relay................................................53 FS ............119, 120, 316, 378, 379, 395, 396 F-test........................................................274 Functions .... 61, 62, 82, 83, 87, 90, 100, 233, 294, 308, 371, 379, 396, 409 Fungicides ...............................................345
G Gas chromatography........................143, 379 Geographic information systems ......3, 5, 89, 209, 213, 242, 243, 244, 246, 251, 252, 253, 254, 255, 257, 262, 279, 283, 284, 285, 296, 307, 373, 379, 383, 388, 402, 407, 411, 413, 415, 416 Geography Markup Language .........282, 379 Geologic maps...................................37, 244 Geologic units..........................254, 307, 315 Geology .........................5, 37, 408, 413, 415 GeoMedia................................................252 Geometric mean...............271, 274, 366, 385 GeoObjects........................................38, 209 Geophysical logs .....................................235
Geophysical measurements .............137, 235 Geophysical survey .................................126 Gigabyte ..........................................6, 70, 76 GIS ...3, 5, 89, 209, 213, 242, 243, 244, 246, 251, 252, 253, 254, 255, 257, 262, 279, 283, 284, 285, 296, 307, 373, 379, 383, 388, 402, 407, 411, 413, 415, 416 Global Geomatics....................................285 Global positioning system .......................245 GML................................................282, 379 Gnomonic........................................247, 249 GPS ...............................................5, 56, 245 Graph displays.........................................213 Graph scales ............................................226 Graph theory............................................233 Grapher....................................226, 228, 230 Graphical selection..........................207, 209 Graphs ......3, 62, 74, 91, 110, 209, 210, 216, 225, 226, 227, 228, 231, 232, 233, 262, 305, 414, 416 Gravimetry ..............................................143 Grid operations................................261, 262 Grid parameters.......................................258 Gridded contouring .................................257 Gridding algorithm..................................260 Gridzo .....................................................251 Groundwater....127, 128, 129, 130, 136, 347, 360, 408, 414 Groundwater sampling ....127, 128, 130, 414 GSMap ....................................................251 GTTP ..............................................285, 380
H Hachures..................................................261 Halogenated compounds .........................135 Hard copy...13, 43, 44, 45, 91, 92, 107, 111, 121, 147, 151, 152, 155, 245, 252, 291, 297, 369, 380, 391, 409 Hawking, Stephen .....................................49 Hazard ranking system ............................119 Hazardous wastes ......................................36 Health and safety...........................36, 39, 40 Heavy metals ...........................................134 Help file.......................................91, 92, 102 Herbicides ...............................136, 321, 345 Hierarchical ..........14, 15, 17, 21, 32, 60, 61, 280, 281 Historical data .. 19, 109, 151, 152, 154, 155, 156, 161, 162, 182, 297 Historical entry................................151, 154 History.......................................................21
HMIS.........................................37, 380, 388 Holding times ..........145, 146, 194, 321, 354 Horizontal scale.......................................237 Hotine oblique .........................................248 HRS .........................................119, 381, 398 HTML ...................................44, 45, 54, 381 Hubs ....................................................71, 72 Human tissue ...........................................132 Hydraulic head ....................................37, 38 Hydrocarbons ..135, 136, 347, 381, 390, 403 Hydrodynamic gradient ...........................314 Hydrogeology............................................37 Hypercard ............................................17, 45 Hypertext.............................................16, 45
I Implementation model...............................86 Import......................109, 187, 304, 316, 381 Increased revenue ....................................294 Inductively coupled plasma .....................142 Industrial facilities .................3, 11, 118, 377 Information Technology .....12, 74, 104, 185, 382, 413 Infrared spectroscopy ..............................144 INGRES ....................................................21 Inheritance.................................................16 Inner joins..................................................77 Inorganic compounds ......................134, 278 Inorganic nonmetals ................................134 Inorganic parameters ...............................338 Insecticides ......................................136, 346 Instrument blank......................177, 180, 381 Integrated services digital network ............53 Interactive output.....................................223 Intergraph ................................................252 Internet ...18, 42, 44, 50, 54, 55, 56, 96, 103, 285, 300, 366, 367, 368, 381, 384, 399 Intranet ......................42, 44, 45, 54, 96, 107 Ion balance ..............................................197 Ion chromatography.................................142 Ion-selective electrodes ...........................142 IRDMIS.............................................63, 382 IRIS ...........................................63, 382, 410 ISDN .........................................................53 ISO 14000 ...............................100, 184, 415 ISO 17025 .......................................184, 185 ISO 9000 ...................................41, 100, 184
J Java..............................................54, 73, 285 Join table .............................................16, 60
Join types...................................................77
K Key fields ..................................................79 Kilobyte.....................................................76 Kriging ............................................261, 275 Kurtosis ...................................................274
L Lab calibration ........................................179 Lab reports ................................................13 Laboratories .......40, 64, 102, 104, 105, 109, 139, 141, 147, 148, 155, 156, 157, 175, 177, 182, 184, 185, 191, 193, 219, 277, 292, 294, 297, 304, 312, 328, 335, 364, 371, 374, 382, 392, 396, 399, 403, 408, 410 Laboratory duplicates......176, 179, 316, 383 Laboratory quality control.......................140 Laboratory reanalyses..............176, 179, 383 Lambert Conformal Conic.......247, 249, 392 Landfill ......5, 140, 200, 263, 264, 296, 384, 395 LANs .....................................53, 55, 72, 383 Latitude-longitude ....37, 246, 286, 372, 383, 391, 400 LDAR................................................37, 384 Leaching..................140, 333, 384, 399, 402 Leegram...........................................267, 268 Left join.....................................................77 Legacy systems........................................294 Legionnaires’ disease ..............................133 Levels of analysis ....................................145 LIMS .........40, 139, 140, 147, 176, 282, 384 Linear regression .....................................275 Lines........................................227, 254, 283 Linux .........................................................50 Liquid chromatography ...........143, 144, 348 Lithologic logs.........................................235 Lithology ...37, 235, 237, 309, 310, 311, 315 LNAPLs ..........................................129, 347 Logarithmic scale ....................................227 Logbooks.........................................125, 128 Logical data model ....................................76 Logical fields.............................................78 Login ID ..................................................202 Log-log plot.............................................227 Lognormal distribution....................271, 274 Lookup tables .......19, 25, 31, 60, 61, 62, 90, 157, 159, 160, 161, 199, 201, 307, 308, 310, 311, 312, 313, 314, 315, 316, 317,
318, 319, 320, 321, 322, 323, 324, 325, 328, 329, 330, 331, 332, 333, 335
M Macintosh ........17, 45, 70, 87, 363, 384, 400 Macros.......................................................82 Maintenance ........90, 91, 174, 385, 388, 414 Make table .................................................80 Manifest model..........................................86 Manual editing.........................................167 Manual entry................................62, 90, 151 Manual input......................................90, 154 Map accuracy ..........................................245 Map scale ................................................250 MapInfo.............................................38, 209 MapObjects .......................................38, 209 Maps....... 30, 37, 38, 44, 62, 74, 91, 97, 110, 210, 213, 225, 228, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 254, 255, 256, 257, 259, 262, 296, 404, 412, 414, 416 Mass spectrometry...........142, 143, 181, 349 Matrix spike............. 176, 177, 178, 316, 385 Matrix spike duplicate .............................178 Mean...........19, 70, 162, 191, 197, 199, 228, 231, 271, 274, 276, 365, 366, 370, 371, 372, 379, 385, 390, 395, 397, 400 Measured depths........................................38 Median.............................................274, 385 Megabyte.................................6, 44, 76, 334 Message passing ........................................17 Metals......................321, 338, 349, 351, 354 Method blank...................177, 179, 316, 386 Method of standard addition....179, 386, 387 Method reference.....................................348 Microsoft Office ..................................50, 69 Miller cylindrical.............................247, 249 Mode ...............................................274, 386 Model .......12, 16, 17, 19, 21, 22, 23, 26, 30, 31, 35, 36, 38, 49, 55, 56, 59, 60, 61, 71, 78, 84, 86, 94, 95, 97, 99, 103, 104, 159, 176, 186, 188, 206, 211, 238, 245, 251, 262, 270, 283, 307, 308, 312, 325, 336, 384 Modeling .... 8, 16, 24, 38, 64, 243, 252, 256, 262, 305, 377, 411, 417 Modem ........................54, 56, 367, 368, 386 Modules.........................................53, 82, 83 Monitoring wells ...... 8, 15, 37, 61, 199, 200, 207, 209, 231, 294, 329, 413 Moore’s law.................................6, 300, 415
Multiple flags ..................219, 308, 318, 320 Multiple sites.....................56, 189, 200, 202 Multi-site security....................................199 Multi-tiered .........................................54, 56 Multivariate analysis ...............................274 Music...........................................................4 Mutagenesis.....................................278, 387 MySQL..............................................50, 412
N NAD 27 ...........................................248, 387 NAD 83 ...........................................248, 387 Naming conventions................................101 NAPLs.............................................129, 387 National Contingency Plan......118, 119, 387 National Environmental Policy Act....4, 117, 121, 387 National Priorities List ...119, 145, 378, 381, 388, 401 Needs assessment ....................3, 95, 96, 303 NEPA ..........................4, 117, 121, 369, 387 Nephelometric analysis ...........................143 Network....14, 16, 45, 51, 53, 62, 64, 69, 71, 72, 73, 74, 76, 104, 111, 258, 387, 388, 397 Network adapters.......................................71 Network hardware .....................................71 Network software ................................70, 72 NextStep....................................................17 NFPA ........................................37, 380, 388 Non-aqueous phase liquids......................129 Non-conforming data ..............................335 Non-detects .....................................219, 331 Non-parametric test .................................273 Normal distribution........... 270, 271, 273, 274, 275, 276, 388, 390 Normal Form.......22, 23, 24, 25, 26, 31, 307 North American Datum ...................248, 387 Norton Ghost...........................................101 NPDES ......................................37, 317, 388 NPL .................118, 119, 120, 381, 388, 398 NQA-1.....................................................184 Numeric fields ...................................78, 308 Nutrients..................................................134
O O&M ...............................................120, 388 Object-oriented.....14, 16, 17, 368, 411, 415, 417 Oblique Mercator ....................247, 248, 249 Occupational Safety and Health Act.......117, 389
OCR.............................................43, 44, 152 Octant searching ......................................258 ODBC... 50, 52, 74, 110, 252, 283, 284, 307, 329, 389 Office of Solid Waste ..............120, 389, 410 Offshore entry..........................................152 OLAP ..............................................285, 389 One-to-many.........15, 17, 22, 30, 60, 61, 79, 308 Ongoing monitoring ............................3, 300 Online analytical processing....................285 Open database ......56, 57, 58, 110, 157, 295, 296, 300 Open DataBase Connectivity .......50, 74, 283, 389 Open-source software ................................50 Operating costs ................................293, 294 Operating parameters...............................137 Operating permit......................120, 133, 216 Operating System ......................70, 375, 389 Operation and maintenance .....................120 Optical character recognition ...........43, 152 Oracle ....... 22, 27, 50, 64, 65, 70, 74, 76, 79, 84, 102, 104, 188, 283, 300, 307 Ordinate...........................................226, 228 Organic compounds........135, 350, 353, 366, 381 Organic parameters..........................135, 340 Organic phosphates .....................................3 Organisms................................................133 Orphan data .............................................159 OSHA........................40, 117, 342, 343, 389 OSW........................................120, 389, 409 Outer joins .................................................77 Outliers ............162, 232, 274, 275, 276, 277 Overhead costs ........................................293
P PA....................................119, 390, 407, 415 Paradox................................................50, 71 Parameters ......25, 27, 28, 83, 171, 248, 308, 311, 313, 319, 320, 321, 322, 323, 324, 330, 337 Parametric test .................273, 274, 388, 390 Parent-child ...............60, 214, 280, 307, 325 Passwords ..................................................91 PCE .........129, 136, 337, 343, 375, 390, 408 PDAs .........................................................56 Pentium......................................70, 363, 386 PERC.......................................337, 343, 390 Percentiles ...............................................274
Performance evaluation sample ........181, 366, 390 Permissions .................................64, 65, 189 Personal digital assistants..........................56 Pesticides........ 136, 319, 320, 321, 337, 346, 350, 351, 353, 376, 378, 390 Phase 1 ..............................41, 121, 122, 407 Physical data model...........................76, 307 Piper diagram ..................................263, 264 Pivot table ...............................................214 Plant operating data.................................133 Plume models ..........................................262 Point selection .........................................258 Points...............................231, 254, 258, 391 Pollution Prevention Act .................118, 391 Polygons..................................................254 Polymorphism ...........................................17 Populations..............................................272 PostgreSQL .......................................50, 412 Potentially responsible party ...119, 383, 392 Power users .................74, 91, 111, 112, 206 PPA .................................................118, 391 Precision..........................187, 364, 390, 391 Preliminary assessment....................119, 121 Preservation.............................130, 354, 414 Preservative .............................................130 Primary key ...................................24, 26, 79 Primary tables........31, 60, 61, 308, 312, 324 Principal components analysis.........274, 275 Print reordering .......................................170 Prioritization............................................305 Procedural .................16, 191, 193, 194, 404 Profiles ....................................................238 Programmer documentation ....................101 Project administrative data ........................39 Project management ... 5, 39, 40, 41, 94, 106, 173, 293, 299 Prototypes................................................100 PRP .........................................119, 383, 392
Q QA..........37, 39, 42, 64, 173, 177, 181, 186, 333, 360, 392, 393, 395, 407, 409, 410 QAPP ........41, 105, 126, 173, 191, 194, 374, 392, 393, 410 QBF...................................91, 210, 282, 413 QC ......38, 64, 127, 130, 132, 147, 155, 173, 174, 175, 176, 177, 178, 179, 180, 181, 186, 188, 193, 194, 196, 197, 198, 310, 312, 316, 318, 322, 323, 329, 330, 332, 333, 360, 369, 386, 392, 393, 409
QC Scope ........................................176, 177 Quadrant searching..................................258 Quality........4, 7, 8, 12, 13, 37, 38, 42, 44, 63, 64, 65, 72, 90, 94, 95, 100, 102, 105, 109, 112, 117, 118, 123, 126, 131, 133, 134, 139, 140, 145, 147, 148, 151, 152, 155, 157, 161, 162, 165, 166, 168, 173, 174, 175, 179, 180, 181, 183, 184, 185, 186, 187, 188, 191, 193, 194, 197, 211, 265, 266, 267, 275, 276, 277, 295, 296, 304, 310, 316, 318, 335, 338, 364, 365, 366, 367, 369, 370, 371, 372, 373, 374, 375, 377, 378, 379, 381, 382, 383, 384, 385, 387, 391, 392, 393, 395, 396, 399, 400, 401, 402, 403, 404, 405, 409, 410, 412, 413, 414, 415, 416, 417 Quality assurance.....................................173 Quality assurance project plan........126, 173, 191 Quality control... 42, 139, 173, 365, 382, 393 Queries .....18, 26, 27, 28, 29, 30, 31, 39, 52, 53, 73, 74, 76, 77, 79, 80, 84, 87, 90, 91, 154, 168, 170, 171, 172, 184, 187, 200, 205, 206, 207, 210, 211, 213, 296, 336, 391, 408 Query-by-form.........................................210 QuickBase ...........................................15, 50 QuickBooks...............................................50 Quicken .....................................................50
R RA ...........................120, 340, 353, 393, 394 Radiochemical methods...........................143 Radiologic ...............................135, 321, 340 Radiologic parameters .............................135 RAM...........69, 70, 368, 382, 385, 394, 397, 398, 405 Random process ......................................270 Raster-based ............................................252 RCRA .....118, 119, 120, 121, 137, 141, 144, 273, 321, 348, 381, 394, 409, 410, 414 RCRA characterization....................137, 144 RD ...........................................120, 316, 394 Read-write .................................................62 Real estate transactions............................121 Record of decision...................119, 120, 277 Records..........21, 76, 79, 163, 309, 328, 395 Referee duplicates ...........176, 177, 316, 395 Reference ellipsoid ..........................248, 249 Reference files.........................................157 Reference standard ..................177, 181, 395
Referential integrity........ 30, 52, 77, 91, 156, 158, 159, 161, 192, 196 Regulators ........ 41, 105, 109, 147, 151, 175, 184, 205, 289, 294, 295, 297 Regulatory limit groups...........................216 Regulatory limit types .............................313 Regulatory limits .......... 39, 42, 120, 216, 217, 218, 221, 223, 231, 262, 277, 294, 313 Relational ............................17, 21, 395, 408 Relationships ....8, 13, 15, 16, 17, 18, 21, 22, 24, 30, 32, 35, 45, 59, 60, 61, 76, 77, 97, 159, 161, 226, 227, 235, 237, 239, 254, 265, 267, 269, 274, 276, 280, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 360, 374, 377 Remedial action.......119, 120, 393, 395, 397 Remedial design ......................119, 120, 393 Remedial investigation ....119, 122, 378, 395 Remediation .......3, 8, 11, 12, 17, 35, 37, 39, 40, 63, 117, 119, 122, 133, 184, 200, 216, 246, 271, 300, 327, 392, 399 Remote access .........................................305 Remote communication.............................54 Removal of duplicated entries .................168 Report.........4, 11, 12, 13, 14, 30, 36, 37, 39, 40, 41, 42, 43, 45, 62, 63, 64, 65, 71, 73, 74, 76, 80, 82, 87, 90, 91, 100, 102, 103, 105, 109, 110, 120, 146, 151, 163, 170, 174, 188, 196, 197, 210, 213, 214, 215, 216, 222, 223, 224, 245, 252, 291, 292, 295, 297, 303, 304, 318, 319, 322, 323, 331, 336, 373, 415 Reporting.........139, 147, 213, 218, 219, 320 Reporting units ................147, 160, 320, 322 Requirements plan...................................100 Residential media ....................................133 Residual maps .........................................243 Resource Conservation and Recovery Act ... 118, 119, 120, 121, 137, 141, 144, 273, 321, 348, 381, 394, 409, 410, 414 Review status .........110, 164, 166, 167, 168, 182, 183 RI.............................119, 120, 378, 395, 396 Right join...................................................77 Risk ......39, 40, 84, 118, 119, 120, 128, 129, 136, 205, 206, 277, 278, 290, 338, 395, 410 Risk assessment........ 39, 120, 145, 277, 278, 396, 410 River and Harbor Act of 1899.....................3 RockWorks..............................................251
ROD ..................................41, 120, 216, 396 Rounding .................................................146 Router ........................................................53
S Safe Drinking Water Act .................118, 398 Sample extraction ....................................140 Sample identification...............................125 Sample preparation..........................139, 140 Sample spikes ..................176, 177, 316, 378 Sample tracking .......................................139 Samples ..........21, 25, 26, 27, 28, 38, 61, 79, 123, 127, 130, 131, 140, 145, 155, 176, 177, 178, 280, 283, 308, 309, 310, 311, 315, 316, 317, 320, 325, 327, 328, 329, 332, 333, 365, 378, 414 Sampling strategies..................................125 Satellite photos ..................................38, 244 SCADA .............................................39, 397 Scalability..................................................64 Scanning ................43, 44, 55, 142, 152, 398 Schedules...........................................40, 305 Scorecard.org ......................4, 141, 337, 409 SDWA .............................................118, 398 Search radius ...........................................258 Security......................................64, 188, 190 Sediment samples ....................................127 Select ...............................18, 26, 27, 28, 361 Selection criteria..........30, 87, 206, 210, 211 Semivolatile.....................135, 337, 343, 353 SEQUEL....................................................22 Server hardware.........................................74 Server maintenance .................................111 Server software..........................................75 Shared file .................................................51 Sharing data elements..............................201 Shewhart control charts ...................277, 415 Shipping method......................................130 Shrink-wrap agreement............................103 SI .............................................119, 390, 398 Significant figures....................146, 221, 333 Silent Spring ........................................3, 408 Simple keys ...............................................79 Simulations..............................................263 Sinkers.............................................129, 136 Site deletion.....................................119, 120 Site inspection .................................119, 121 Site remediation.......................................122 Sites ........4, 61, 77, 118, 120, 145, 186, 283, 308, 309, 312, 313, 322, 323, 324, 327, 329, 332
Skewness .................................................274 Smalltalk ...........................................17, 411 Smoothing ...............................................261 SOAP ..............................................285, 399 Software licensing ...................................103 Soil data...................................................127 Soil samples ....126, 127, 128, 131, 178, 222, 235, 237, 310 Source code .......................................50, 100 Spatial statistics .......................................275 Spike.......176, 179, 316, 318, 368, 387, 399, 401 Split samples ...................176, 177, 316, 399 SPLP .......................................140, 319, 399 Spreadsheets...... 11, 14, 84, 154, 279, 282, 290, 296, 327, 332 SQL ....17, 18, 21, 22, 26, 27, 28, 29, 30, 52, 53, 64, 65, 69, 73, 74, 75, 76, 79, 84, 87, 90, 102, 104, 171, 188, 189, 190, 200, 206, 207, 283, 291, 296, 300, 307, 325, 359, 361, 399, 408, 409, 411, 413, 414, 416 SQL Server..... 50, 51, 52, 53, 69, 74, 75, 76, 84, 189, 190, 409, 414 Stand-alone.....49, 50, 51, 55, 64, 65, 69, 71, 73, 74, 76, 190, 283 Standard deviation..........125, 188, 197, 271, 274, 371, 382, 390, 391, 395, 397, 400 State plane coordinate systems ........248, 249 Stations...25, 27, 28, 31, 37, 61, 77, 79, 171, 280, 283, 308, 309, 310, 313, 314, 315, 316, 320, 322, 323, 325, 327, 329, 332 Statistical analysis ........ 3, 14, 110, 213, 270, 271, 272, 273, 279, 317, 318 Statistics .............5, 8, 14, 39, 100, 146, 147, 174, 196, 197, 206, 209, 221, 269, 270, 273, 274, 275, 277, 297, 374, 407, 408, 412, 415 Stick diagram...........................................239 Stiff diagrams ..........................................265 Stochastic process ...................................270 Stored procedure .......................................84 STORET ...................................63, 330, 400 Stormwater ................................................37 Stratigraphy ...............................................37 Subroutines..................................82, 83, 100 Subsets ............... 53, 54, 62, 74, 105, 182, 275, 297, 400 Successful import ............................164, 165 Superfund .......5, 41, 63, 118, 119, 370, 371, 373, 387, 388, 389, 392, 393, 394, 395, 397, 401, 410
Superseded ..... 155, 156, 164, 168, 170, 311, 312, 320, 330, 331, 333 Support .......3, 5, 7, 9, 11, 56, 66, 82, 90, 96, 99, 101, 102, 103, 104, 111, 112, 120, 144, 173, 175, 179, 186, 187, 196, 209, 371, 392, 395, 396, 398, 409, 413 Surface geology.......................................244 Surface water samples .............................130 Surfer.......................................230, 251, 263 Surrogate spikes ..............176, 178, 316, 401 SVOCs.....................................135, 349, 352 Syntax checking.......................................100 Synthetic key .............................................79 Synthetic precipitation leaching procedure....... 140, 319, 399 System administrator ........ 52, 66, 67, 75, 102, 111, 189 System maintenance ..........................67, 111
T T1 ......................................................53, 401 T3 ......................................................53, 401 Table Analyzer Wizard .................23, 31, 32 Tables ....................21, 60, 76, 308, 328, 415 Target levels ....................................120, 216 TCA.................129, 136, 340, 363, 375, 408 TCE .............4, 129, 136, 343, 375, 402, 408 TCLP .......127, 140, 313, 319, 331, 351, 402 Tedlar ......................................................132 Tenax...............................................132, 353 Terabyte.....................................................76 Teratogenesis...................................278, 402 Test............................................................95 Test plan ..........................................101, 102 Text files............................13, 154, 376, 402 Text output ..............................................213 Text-based queries...................................206 Thematic maps ........................................244 TIN ..........................................................257 Titration...................................................141 Tool tips ....................................................85 Topography ...............................38, 131, 254 Toxic Substances Control Act .........118, 403 Toxicity characteristic leaching procedure 127, 140, 313, 319, 331, 351, 402 Toxicology ..39, 40, 277, 278, 403, 412, 414 Tracking imports .....................................163 Tracking quality ......................................165 Trailing zeros.....................78, 146, 221, 398 Training ......7, 21, 40, 91, 96, 102, 112, 174, 184, 185, 187, 211, 295, 300, 402
Transmission electron microscopy ..........142 Transverse Mercator................................248 TreeView.........................................223, 224 Trend surfaces .................................243, 262 TRI ....................................................37, 403 Triangulated irregular network................257 Triangulation contouring.........................256 Trigger.......................................................84 Trip blank........................176, 178, 316, 403 TSCA ................................36, 118, 136, 403 t-test.........................................................274 Tutorial..........................................87, 88, 91 Type I error .....................................272, 403 Type II error ....................................272, 403
U Undo import ............................................164 Unified Soil Classification System..311, 315 Units ...31, 78, 105, 133, 143, 147, 155, 156, 158, 160, 180, 187, 193, 199, 201, 206, 216, 221, 222, 225, 226, 237, 240, 246, 258, 309, 311, 319, 320, 321, 323, 325, 330, 331, 372, 384, 391, 394, 400 Universal Transverse Mercator ........247, 248, 372, 404 UNIX....22, 70, 74, 285, 363, 370, 384, 389, 403, 405 Unsuccessful import ........................163, 164 UPDATE...................................................26 Upgrades ...................................................66 USCS...............................................311, 315 User interface .......17, 50, 51, 54, 55, 59, 61, 64, 71, 73, 74, 75, 76, 84, 85, 86, 87, 88, 90, 91, 94, 96, 101, 104, 112, 167, 186, 187, 189, 336, 384, 386, 408 Users.....7, 19, 28, 56, 62, 90, 102, 112, 148, 188, 363 USGS ..................55, 64, 248, 249, 404, 407 Utility tables ................................60, 61, 308 UTM........................................247, 248, 404
V Validation...7, 18, 62, 63, 64, 158, 167, 173, 174, 175, 181, 182, 186, 189, 191, 192, 193, 194, 195, 196, 197, 198, 294, 311, 316, 318, 320, 323, 331, 333, 335, 374, 375, 377, 382, 386, 402, 404, 408, 410, 415 Validation flags .................................62, 196
Value and flag .........................................218 Variable scope .........................................100 VBA ......................................50, 53, 83, 100 Vector-based............................................252 Verification...........7, 62, 158, 173, 174, 175, 181, 182, 188, 191, 192, 193, 194, 195, 196, 197, 316, 317, 370, 372, 375, 404, 410 Vertical exaggeration...............................237 Vertical market application .......................71 Vertical scale ...........................................237 Visual Basic......49, 73, 76, 83, 90, 100, 366, 367, 408, 413 Visualization......................................39, 296 VOAs...............127, 130, 319, 321, 340, 343 VOCs.......................................135, 352, 353 Voice entry ..............................................152 Volumetrics .....................................261, 405
W WAN .......................................53, 54, 55, 72 Water chemistry diagrams .......................263 Water levels...............................64, 128, 313 Well data ...................................17, 273, 295 Wellbore construction ...............................37 Wide-area network ....................................53 Windows.......5, 7, 45, 51, 63, 67, 69, 70, 71, 74, 76, 78, 87, 92, 101, 189, 190, 202, 223, 283, 303, 334, 386, 389, 400, 402, 405 Wireless modem ........................................56 Wireline logs ...........................................235 Wiring .........................................71, 72, 369 Wizard .................................................31, 87 Word processor files..................................13 Workflow ................102, 107, 109, 139, 293 Workflow automation..............................108 Workplace media.....................................133 World Wide Web ....16, 45, 55, 56, 103, 285
X X axis ......................................................226 XML..................18, 279, 281, 282, 399, 409 XSL .................................................281, 282 XY coordinates..........................................37
Y Y axis ......................................................226