Advanced Mapping of Environmental Data: Geostatistics, Machine Learning and Bayesian Maximum Entropy
Edited by Mikhail Kanevski

Series Editor: Pierre Dumolard
First published in Great Britain and the United States in 2008 by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd
6 Fitzroy Square
London W1T 5DX
UK

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA

www.iste.co.uk
www.wiley.com

© ISTE Ltd, 2008

The rights of Mikhail Kanevski to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Cataloging-in-Publication Data
Advanced mapping of environmental data : geostatistics, machine learning, and Bayesian maximum entropy / edited by Mikhail Kanevski.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-84821-060-8
1. Geology--Statistical methods. 2. Machine learning. 3. Bayesian statistical decision theory. I. Kanevski, Mikhail.
QE33.2.S82A35 2008
550.1'519542--dc22
2008016237

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN: 978-1-84821-060-8

Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire.
Table of Contents
Preface . . . xi

Chapter 1. Advanced Mapping of Environmental Data: Introduction . . . 1
M. KANEVSKI
  1.1. Introduction . . . 1
  1.2. Environmental data analysis: problems and methodology . . . 3
    1.2.1. Spatial data analysis: typical problems . . . 3
    1.2.2. Spatial data analysis: methodology . . . 5
    1.2.3. Model assessment and model selection . . . 8
  1.3. Resources . . . 12
    1.3.1. Books, tutorials . . . 12
    1.3.2. Software . . . 12
  1.4. Conclusion . . . 14
  1.5. References . . . 15

Chapter 2. Environmental Monitoring Network Characterization and Clustering . . . 19
D. TUIA and M. KANEVSKI
  2.1. Introduction . . . 19
  2.2. Spatial clustering and its consequences . . . 20
    2.2.1. Global parameters . . . 21
    2.2.2. Spatial predictions . . . 22
  2.3. Monitoring network quantification . . . 23
    2.3.1. Topological quantification . . . 23
    2.3.2. Global measures of clustering . . . 23
      2.3.2.1. Topological indices . . . 23
      2.3.2.2. Statistical indices . . . 24
    2.3.3. Dimensional resolution: fractal measures of clustering . . . 26
      2.3.3.1. Sandbox method . . . 27
      2.3.3.2. Box-counting method . . . 30
      2.3.3.3. Lacunarity . . . 33
  2.4. Validity domains . . . 34
  2.5. Indoor radon in Switzerland: an example of a real monitoring network . . . 36
    2.5.1. Validity domains . . . 37
    2.5.2. Topological index . . . 37
    2.5.3. Statistical indices . . . 38
      2.5.3.1. Morisita index . . . 38
      2.5.3.2. K-function . . . 39
    2.5.4. Fractal dimension . . . 40
      2.5.4.1. Sandbox and box-counting fractal dimension . . . 40
      2.5.4.2. Lacunarity . . . 42
  2.6. Conclusion . . . 43
  2.7. References . . . 44

Chapter 3. Geostatistics: Spatial Predictions and Simulations . . . 47
E. SAVELIEVA, V. DEMYANOV and M. MAIGNAN
  3.1. Assumptions of geostatistics . . . 47
  3.2. Family of kriging models . . . 49
    3.2.1. Simple kriging . . . 50
    3.2.2. Ordinary kriging . . . 50
    3.2.3. Basic features of kriging estimation . . . 51
    3.2.4. Universal kriging (kriging with trend) . . . 56
    3.2.5. Lognormal kriging . . . 56
  3.3. Family of co-kriging models . . . 58
    3.3.1. Kriging with linear regression . . . 58
    3.3.2. Kriging with external drift . . . 58
    3.3.3. Co-kriging . . . 59
    3.3.4. Collocated co-kriging . . . 60
    3.3.5. Co-kriging application example . . . 61
  3.4. Probability mapping with indicator kriging . . . 64
    3.4.1. Indicator coding . . . 64
    3.4.2. Indicator kriging . . . 66
    3.4.3. Indicator kriging applications . . . 69
      3.4.3.1. Indicator kriging for 241Am analysis . . . 69
      3.4.3.2. Indicator kriging for aquifer layer zonation . . . 71
      3.4.3.3. Indicator kriging for localization of crab crowds . . . 74
  3.5. Description of spatial uncertainty with conditional stochastic simulations . . . 76
    3.5.1. Simulation vs. estimation . . . 76
    3.5.2. Stochastic simulation algorithms . . . 77
    3.5.3. Sequential Gaussian simulation . . . 81
    3.5.4. Sequential indicator simulations . . . 84
    3.5.5. Co-simulations of correlated variables . . . 88
  3.6. References . . . 92

Chapter 4. Spatial Data Analysis and Mapping Using Machine Learning Algorithms . . . 95
F. RATLE, A. POZDNOUKHOV, V. DEMYANOV, V. TIMONIN and E. SAVELIEVA
  4.1. Introduction . . . 95
  4.2. Machine learning: an overview . . . 96
    4.2.1. The three learning problems . . . 96
    4.2.2. Approaches to learning from data . . . 100
    4.2.3. Feature selection . . . 101
    4.2.4. Model selection . . . 103
    4.2.5. Dealing with uncertainties . . . 107
  4.3. Nearest neighbor methods . . . 108
  4.4. Artificial neural network algorithms . . . 109
    4.4.1. Multi-layer perceptron neural network . . . 109
    4.4.2. General Regression Neural Networks . . . 119
    4.4.3. Probabilistic Neural Networks . . . 122
    4.4.4. Self-organizing (Kohonen) maps . . . 124
  4.5. Statistical learning theory for spatial data: concepts and examples . . . 131
    4.5.1. VC dimension and structural risk minimization . . . 131
    4.5.2. Kernels . . . 132
    4.5.3. Support vector machines . . . 133
    4.5.4. Support vector regression . . . 137
    4.5.5. Unsupervised techniques . . . 141
      4.5.5.1. Clustering . . . 142
      4.5.5.2. Nonlinear dimensionality reduction . . . 144
  4.6. Conclusion . . . 146
  4.7. References . . . 146

Chapter 5. Advanced Mapping of Environmental Spatial Data: Case Studies . . . 149
L. FORESTI, A. POZDNOUKHOV, M. KANEVSKI, V. TIMONIN, E. SAVELIEVA, C. KAISER, R. TAPIA and R. PURVES
  5.1. Introduction . . . 149
  5.2. Air temperature modeling with machine learning algorithms and geostatistics . . . 150
    5.2.1. Mean monthly temperature . . . 151
      5.2.1.1. Data description . . . 151
      5.2.1.2. Variography . . . 152
      5.2.1.3. Step-by-step modeling using a neural network . . . 153
      5.2.1.4. Overfitting and undertraining . . . 154
      5.2.1.5. Mean monthly air temperature prediction mapping . . . 156
    5.2.2. Instant temperatures with regionalized linear dependencies . . . 159
      5.2.2.1. The Föhn phenomenon . . . 159
      5.2.2.2. Modeling of instant air temperature influenced by Föhn . . . 160
    5.2.3. Instant temperatures with nonlinear dependencies . . . 163
      5.2.3.1. Temperature inversion phenomenon . . . 163
      5.2.3.2. Terrain feature extraction using Support Vector Machines . . . 164
      5.2.3.3. Temperature inversion modeling with MLP . . . 165
  5.3. Modeling of precipitation with machine learning and geostatistics . . . 168
    5.3.1. Mean monthly precipitation . . . 169
      5.3.1.1. Data description . . . 169
      5.3.1.2. Precipitation modeling with MLP . . . 171
    5.3.2. Modeling daily precipitation with MLP . . . 173
      5.3.2.1. Data description . . . 173
      5.3.2.2. Practical issues of MLP modeling . . . 174
      5.3.2.3. The use of elevation and analysis of the results . . . 177
    5.3.3. Hybrid models: NNRK and NNRS . . . 179
      5.3.3.1. Neural network residual kriging . . . 179
      5.3.3.2. Neural network residual simulations . . . 182
    5.3.4. Conclusions . . . 184
  5.4. Automatic mapping and classification of spatial data using machine learning . . . 185
    5.4.1. k-nearest neighbor algorithm . . . 185
      5.4.1.1. Number of neighbors with cross-validation . . . 187
    5.4.2. Automatic mapping of spatial data . . . 187
      5.4.2.1. KNN modeling . . . 188
      5.4.2.2. GRNN modeling . . . 190
    5.4.3. Automatic classification of spatial data . . . 192
      5.4.3.1. KNN classification . . . 193
      5.4.3.2. PNN classification . . . 194
      5.4.3.3. Indicator kriging classification . . . 197
    5.4.4. Automatic mapping – conclusions . . . 199
  5.5. Self-organizing maps for spatial data – case studies . . . 200
    5.5.1. SOM analysis of sediment contamination . . . 200
    5.5.2. Mapping of socio-economic data with SOM . . . 204
  5.6. Indicator kriging and sequential Gaussian simulations for probability mapping. Indoor radon case study . . . 209
    5.6.1. Indoor radon measurements . . . 209
    5.6.2. Probability mapping . . . 211
    5.6.3. Exploratory data analysis . . . 212
    5.6.4. Radon data variography . . . 216
      5.6.4.1. Variogram for indicators . . . 216
      5.6.4.2. Variogram for Nscores . . . 217
    5.6.5. Neighborhood parameters . . . 218
    5.6.6. Prediction and probability maps . . . 219
      5.6.6.1. Probability maps with IK . . . 219
      5.6.6.2. Probability maps with SGS . . . 220
    5.6.7. Analysis and validation of results . . . 221
      5.6.7.1. Influence of the simulation net and the number of neighbors . . . 221
      5.6.7.2. Decision maps and validation of results . . . 222
    5.6.8. Conclusions . . . 225
  5.7. Natural hazards forecasting with support vector machines – case study: snow avalanches . . . 225
    5.7.1. Decision support systems for natural hazards . . . 227
    5.7.2. Reminder on support vector machines . . . 228
      5.7.2.1. Probabilistic interpretation of SVM . . . 229
    5.7.3. Implementing an SVM for avalanche forecasting . . . 230
    5.7.4. Temporal forecasts . . . 230
      5.7.4.1. Feature selection . . . 231
      5.7.4.2. Training the SVM classifier . . . 232
      5.7.4.3. Adapting SVM forecasts for decision support . . . 233
    5.7.5. Extending the SVM to spatial avalanche predictions . . . 237
      5.7.5.1. Data preparation . . . 237
      5.7.5.2. Spatial avalanche forecasting . . . 239
    5.7.6. Conclusions . . . 241
  5.8. Conclusion . . . 241
  5.9. References . . . 242

Chapter 6. Bayesian Maximum Entropy – BME . . . 247
G. CHRISTAKOS
  6.1. Conceptual framework . . . 247
  6.2. Technical review of BME . . . 251
    6.2.1. The spatiotemporal continuum . . . 251
    6.2.2. Separable metric structures . . . 253
    6.2.3. Composite metric structures . . . 255
    6.2.4. Fractal metric structures . . . 256
  6.3. Spatiotemporal random field theory . . . 257
    6.3.1. Pragmatic S/TRF tools . . . 258
    6.3.2. Space-time lag dependence: ordinary S/TRF . . . 260
    6.3.3. Fractal S/TRF . . . 262
    6.3.4. Space-time heterogenous dependence: generalized S/TRF . . . 264
  6.4. About BME . . . 267
    6.4.1. The fundamental equations . . . 267
    6.4.2. A methodological outline . . . 273
    6.4.3. Implementation of BME: the SEKS-GUI . . . 275
  6.5. A brief review of applications . . . 281
    6.5.1. Earth and atmospheric sciences . . . 282
    6.5.2. Health, human exposure and epidemiology . . . 291
  6.6. References . . . 299

List of Authors . . . 307
Index . . . 309
Preface
This volume is a collection of lectures and seminars given at two workshops organized by the Institute of Geomatics and Analysis of Risk (IGAR) at the Faculty of Geosciences and Environment of the University of Lausanne (www.unil.ch/igar):
– Workshop I, October 2005: "Data analysis and modeling in environmental sciences towards risk assessment and impact on society";
– Workshop II, October 2006 (S4 network modeling tour): "Machine Learning Algorithms for Spatial Data".

During the first workshop many topics related to natural hazards were considered. One of the lectures was given by Professor G. Christakos on the theory and applications of Bayesian Maximum Entropy (BME). The second workshop was organized within the framework of the S4 (Spatial Simulation for Social Sciences, http://s4.parisgeo.cnrs.fr/index.htm) network modeling tour of young researchers. The main topics considered were related to machine learning algorithms (neural networks of different architectures and statistical learning theory) and their applications in geosciences.

Therefore, the book is a composition of three topics concerning the analysis, modeling and presentation of spatiotemporal data: geostatistical methods and models, machine learning algorithms and the Bayesian maximum entropy approach. These three topics rest on quite different theoretical hypotheses and background assumptions, and they are usually published in separate volumes.

Of course, it was not possible to cover both introductory and advanced topics within the limits of the book. Authors were free to select their topics and to present some theoretical concepts along with simulated/illustrative and real case studies. There are some traditional examples of environmental data mapping using different techniques, but also advanced topics which cover recent research activities.

Obviously, this volume is not a textbook on geostatistics, machine learning and BME. Moreover, it does not cover all currently available techniques for environmental data analysis. Nevertheless, it tries to explain the main theoretical concepts and to give an overview of applications for the selected methods and models. We hope that the book will be useful both for professionals and experts interested in environmental data analysis and mapping, and that it can expand the knowledge of tools currently available for the analysis of spatiotemporal data. Let us remember that, in general, the selection of an appropriate method should depend on the quality and quantity of data and on the objectives of the study.

The book consists of six chapters. Chapter 1 is an introduction to the topics of environmental data mapping. Chapter 2 deals with the characterization of monitoring networks and studies monitoring network clustering and its effect on spatial predictions. The main focus is given to global cluster detection methods such as fractal dimension. Integration of the characteristics of the prediction space is also discussed via the concept of validity domain. Chapter 3 is devoted to traditional and recently developed models in geostatistics. Geostatistics is still a dynamically developing discipline; it has contributed to different topics of data analysis during the last 50 years. Chapter 4 gives an introduction to machine learning algorithms and explains some particular models widely used for environmental data: multilayer perceptron, general regression neural networks, probabilistic neural networks, self-organizing maps, support vector machines and support vector regression. Chapter 5 describes real case studies with the application of geostatistical models and machine learning algorithms. The presented case studies cover different topics: topo-climatic modeling, pollution mapping, analysis of socio-economic spatial data, indoor radon risk and natural hazard risk assessment. An interesting section deals with so-called "automatic mapping" (spatial prediction and spatial classification) using general regression and probabilistic neural networks. Such applications can be important in on-line data analysis and environmental decision support systems. Chapter 6 is completely devoted to the Bayesian maximum entropy approach to spatiotemporal data analysis. It is a separate part of the book, presenting BME from a conceptual introduction to recent case studies dealing with environmental and epidemiological applications.
We would like to acknowledge the Faculty of Geosciences and Environment of the University of Lausanne for the financial support of both workshops. The S4 network (Professor Denise Pumain) played an important role in organizing the second workshop. The scientific work resulting in the collection of papers presented in this volume is the result of several projects financed by the Swiss National Science Foundation (105211-107862, 100012-113506, 200021-113944, Scope project IB7310-110915) and the Russian Foundation for Fundamental Research (07-0800257). The support for the preparation of Chapter 6 was provided by a grant from the California Air Resources Board, USA (Grant No. 55245A).

We acknowledge the following institutions and offices that have kindly provided us with data: Swiss Federal Office for Public Health, MeteoSwiss, Swisstopo, Swiss office of statistics, CIPEL (Lausanne), and the sportScotland Avalanche Information Service (SAIS) for the avalanche recordings and meteorological data in the Lochaber region of Scotland, UK.

I would like to acknowledge the authors who have contributed directly to this volume for their interesting works and fruitful collaboration. Finally, all the authors acknowledge Professor P. Dumolard (who initiated this project) and ISTE Ltd. for the collaboration and opportunity to publish this book.

M. Kanevski
Lausanne, April 2008
Chapter 1
Advanced Mapping of Environmental Data: Introduction
1.1. Introduction

In this introductory chapter we describe general problems of spatial environmental data analysis, modeling, validation and visualization. Many of these problems are considered in detail in the following chapters using geostatistical models, machine learning algorithms (MLA) such as neural networks and Support Vector Machines, and the Bayesian Maximum Entropy (BME) approach. The term "mapping" in this book means not only interpolation in two- or three-dimensional geographical space, but, in a more general sense, the estimation of the desired dependencies from empirical data.

The references presented at the end of this chapter cover a range of books and papers important both for beginners and advanced researchers. The list contains both classical textbooks and studies on contemporary cutting-edge research topics in data analysis.

In general, mapping can be considered as:
a) a spatiotemporal classification problem, such as digital soil mapping and geological unit classification;
b) a regression problem, such as pollution mapping and topo-climatic modeling; and
c) a problem of probability density modeling, which is not a mapping of values but a "mapping" of probability density functions, i.e. the local or joint spatial distributions conditioned on data and available expert knowledge.
Chapter written by M. KANEVSKI.
As well as some necessary theoretical introductions to the methods, an important part of the book deals with the presentation of case studies. These are both simulated problems used to illustrate the essential concepts and real life applications. These case studies are important complementary parts of the current volume. They cover a wide range of applications: environmental data analysis, pollution mapping, epidemiological spatiotemporal data analysis, socio-economic data classification and clustering. Several case studies consider multivariate data sets, where variables can be dependent (linearly or nonlinearly correlated) or independent.

Common to all case studies is that the data are geo-referenced, i.e. they are located at least in a geographical space. In a more general sense the geographical space can be enriched with additional information, giving rise to a high dimensional geo-feature space. Geospatial data can be categorical (classes), continuous (fields) or distributions (probability density functions).

Let us remember that one of the simplest problems – the task of spatial interpolation from discrete measurements to continuous fields – has no single solution. Even with a very simple interpolation method, just by changing one or two tuning parameters many different "maps" can be produced. Here we are faced with an extremely important question of model assessment and model selection. The selection of the method for data analysis, modeling and predictions depends on the quantity and quality of data, the expert knowledge available and the objectives of the study.

In general, two fundamental approaches when working with data are possible: deterministic models, including the analysis of data using physical models and deterministic interpolations, or statistical models, which interpret the data as a realization of a random/stochastic process. In both cases models and methods depend on some hypotheses and have some parameters that should be tuned in order to apply the model correctly. In many cases these two groups merge, and deterministic models might have their "statistical" side and vice versa.

Statistical interpretation of spatial environmental data is not trivial, because usually only one realization (one set of measurements) of the phenomenon under study exists; examples are geological data, pollution after an accident, etc. Therefore, some fundamental hypotheses are very important in order to make statistical inferences when only one realization is available: ergodicity, second-order stationarity, intrinsic hypotheses (see Chapter 3 for more detail). While some empirical rules exist, these hypotheses are very difficult to verify rigorously in most cases.

An important aspect of spatial and spatiotemporal data is anisotropy, i.e. the dependence of the spatial variability on the direction. This phenomenon can be detected and characterized with structural analysis such as the variography presented below.

Almost all of the models and algorithms considered in this book (geostatistics, MLA, BME) are based on the statistical interpretation of data. Another general view on environmental data modeling is to consider two major classes of approaches: model-dependent approaches (geostatistical models – Chapter 3 – and BME – Chapter 6) and data-driven adaptive models (machine learning algorithms – Chapter 4). When applied without proper understanding, and because of their limited interpretability, data-driven models have often been considered black or gray box models. Obviously, each data modeling approach has its own advantages and drawbacks. In fact, both approaches can be used as complementary tools, resulting in hybrid models that can overcome some of the problems.

From a machine learning point of view, the problem of spatiotemporal data analysis can be considered as a problem of pattern recognition, pattern modeling and pattern prediction or pattern completion. There are several major classes of learning approaches:
– supervised learning: for example, the problems of classification and regression in the space of geographical coordinates (inputs) based on the set of available measurements (outputs);
– unsupervised learning: problems with no outputs available, where the task is to find structures and dependencies in the input space: probability density modeling, spatiotemporal clustering, dimensionality reduction, ranking, outlier/novelty detection, etc. When the use of these structures can improve predictions based on a small number of available measurements, this setting is called semi-supervised learning.

Other directions, such as reinforcement learning, exist but are rarely used in environmental spatial data analysis and modeling.

1.2. Environmental data analysis: problems and methodology

1.2.1. Spatial data analysis: typical problems

First let us consider some typical problems arising when working with spatial data.
Figure 1.1. Illustration of the problem of environmental data mapping
Given measurements of several variables (see Figure 1.1 for an illustration) and a region of study, typical problems related to environmental data mapping (and beyond, such as risk mapping, decision-oriented mapping, simulations, etc.) can be listed as follows:

– predicting a value at a given point (marked by "?" in Figure 1.1, for example). If it is the only point of interest, perhaps the best way is simply to take a measurement there. If not, a model should be developed. Both deterministic and statistical models can be used;

– building a map using given measurements. In this case a dense grid is usually developed over the region of study, taking into account the validity domain (see Chapter 2), and predictions are performed at each grid node, finally giving rise to a raster model of spatial predictions. After post-processing of this raster model different presentations are possible – isolines, 3D surfaces, etc. Both deterministic and statistical models can be used;

– taking into account measurement errors. Errors can be either independent or spatially correlated. Statistical treatment of data is necessary;

– estimating the prediction error, i.e. predicting both the unknown value and its uncertainty. This is a much more difficult question. Statistical treatment of data is necessary;

– risk mapping, which is concerned with uncertainty quantification for the unknown value. The best approach is to estimate a local probability density function, i.e. to map densities using data measurements and expert knowledge;

– joint prediction of several variables, or prediction of a primary variable using auxiliary data and information. Very often, in addition to the main variable, there are other data (secondary variables, remote sensing images, digital elevation models, etc.) which can contribute to the analysis of the primary variable. Additional information can be "cheaper" and more comprehensive. There are several geostatistical models of co-predictions (co-kriging, kriging with external drift) and co-simulations (e.g. sequential Gaussian co-simulations). As well as being more complete, secondary information usually has better spatial and dimensional resolutions, which can improve the quality of the final analysis and recover missing information in the principal monitoring network. This is an interesting topic of future research;

– optimization of the monitoring network (design/redesign). A fundamental question is always: where to go and what to measure? How can we optimize the monitoring network in order to improve predictions and reduce uncertainties? At present there are several possible approaches: uncertainty/variance-based, the Bayesian approach, space filling, and optimization based on support vectors (see references);

– spatial stochastic conditional simulations, or modeling of spatial uncertainty and variability. The main idea here is to develop a spatial Monte Carlo model which can produce (generate) many realizations of the phenomena under study (random fields) using available measurements, expert knowledge and well defined criteria. In geostatistics there are several parametric and non-parametric models widely used in real applications (Chapter 3 and references therein). Post-processing of these realizations gives rise to different decision-oriented maps. This is the most comprehensive and the most useful information for an intelligent decision making process;

– integration of data/measurements with physical models. In some cases, in addition to data, science-based models – meteorological models, geophysical models, hydrological models, geological models, models of pollution dispersion, etc. – are available. How can we integrate/assimilate models and data if we do not want to use data only for calibration purposes? How can we compare patterns generated from data and models? Are they compatible? How can we improve predictions and models? These fundamental topics can be studied using BME.

1.2.2. Spatial data analysis: methodology

The generic methodology of spatial data analysis and modeling consists of several phases. Let us recall some of the most important.
– Exploratory spatial data analysis (ESDA). Visualization of spatial data using different methods of presentation, even with simple deterministic models, helps to detect data errors and to understand whether there are patterns, their anisotropic structures, etc. An example of sample data visualization using Voronoï polygons and Delaunay triangulation is given in Figure 1.2 (see the sketch after the figure caption below). The presence of spatial structure and the West-East major axis of anisotropy are evident. Geographical Information Systems (GIS) can also be used as tools both for ESDA and for the presentation of the results. ESDA can also be performed within moving/sliding windows. This regionalized ESDA is a helpful tool for the analysis of complex non-stationary data.
Figure 1.2. Visualization of raw data (left) using Voronoï polygons and Delaunay triangulation (right)
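A minimal sketch of this kind of exploratory visualization is given below, using scipy and matplotlib rather than the software packages discussed later in this chapter; the coordinates and values are synthetic placeholders standing in for real measurements.

```python
# Exploratory visualization of scattered measurements with Voronoi polygons
# and Delaunay triangulation (cf. Figure 1.2). Data are synthetic placeholders.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d, Delaunay

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 200)
y = rng.uniform(0, 100, 200)
z = np.sin(x / 20.0) + 0.5 * rng.normal(size=x.size)   # synthetic "measurements"
points = np.column_stack([x, y])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

# Voronoi polygons: one zone of influence per measurement point
vor = Voronoi(points)
voronoi_plot_2d(vor, ax=ax1, show_vertices=False, line_width=0.5)
ax1.scatter(x, y, c=z, s=15, cmap="viridis")
ax1.set_title("Voronoi polygons")

# Delaunay triangulation: natural neighborhood structure of the network
tri = Delaunay(points)
ax2.triplot(x, y, tri.simplices, lw=0.5, color="grey")
ax2.scatter(x, y, c=z, s=15, cmap="viridis")
ax2.set_title("Delaunay triangulation")

plt.tight_layout()
plt.show()
```

Coloring the points by the measured value in both panels gives a first, model-free impression of spatial structure and possible anisotropy before any modeling is attempted.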
– Monitoring network analysis and description. The measuring stations of an environmental monitoring network are usually spatially distributed in an inhomogeneous manner. The problem of network homogeneity (clustering and preferential sampling) is closely connected to global estimations and to the theoretical possibility of detecting phenomena with a monitoring network of the given design. Different topological, statistical and fractal measures are used to quantify the spatial and dimensional resolutions of the networks (see details in Chapter 2).

– Structural analysis (variography). Variography is an extremely important part of the study. Variograms and other functions describing spatial continuity (rodogram, madogram, generalized relative variograms, etc.) can be used in order to characterize the existence of spatial patterns (from a two-point statistical point of view) and to quantify the quality of machine learning modeling using the variography of the residuals. The theoretical formula for the variogram of the random variable Z(x) under the intrinsic hypotheses is given by:
$$\gamma(x, h) = \tfrac{1}{2}\operatorname{Var}\{Z(x) - Z(x+h)\} = \tfrac{1}{2}\,E\{[Z(x) - Z(x+h)]^{2}\} = \gamma(h)$$
where h is a vector separating two points in space. The corresponding empirical estimate of the variogram is given by the following formula
$$\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[ Z(x_i) - Z(x_i + h) \right]^{2}$$

where N(h) is the number of pairs of points separated by the vector h.
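A minimal computational sketch of this estimator follows; it is not taken from any of the software packages discussed later in this chapter, and the synthetic data, array names and binning choices are illustrative assumptions. It computes an omnidirectional estimate by grouping pairs into distance classes; a variogram rose (introduced below) would repeat the same computation within direction sectors.

```python
# Empirical (omnidirectional) variogram estimate for scattered 2D data.
# Implements gamma_hat(h) = 1/(2 N(h)) * sum_i [Z(x_i) - Z(x_i + h)]^2,
# with the lag vector h replaced by distance bins. Names are illustrative.
import numpy as np
from scipy.spatial.distance import pdist

def empirical_variogram(coords, values, n_lags=15, max_dist=None):
    """coords: (n, 2) array of locations; values: (n,) measurements."""
    d = pdist(coords)                                       # pairwise distances
    g = 0.5 * pdist(values.reshape(-1, 1), "sqeuclidean")   # 0.5 * (Z_i - Z_j)^2
    if max_dist is None:
        max_dist = d.max() / 2.0                            # usual practical cut-off
    bins = np.linspace(0.0, max_dist, n_lags + 1)
    lag_centers, gamma = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (d >= lo) & (d < hi)
        if mask.sum() > 0:                                  # N(h) pairs in this lag class
            lag_centers.append(d[mask].mean())
            gamma.append(g[mask].mean())
    return np.array(lag_centers), np.array(gamma)

# usage with synthetic data
rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(300, 2))
values = np.sin(coords[:, 0] / 15.0) + 0.3 * rng.normal(size=300)
lags, gamma = empirical_variogram(coords, values)
print(np.round(lags, 1), np.round(gamma, 3))
```

Averaging the half squared differences within each lag class is equivalent to the factor 1/(2N(h)) in the formula above; a rising curve that levels off at the sample variance indicates spatial correlation up to the corresponding range.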
The variogram has the same importance for spatial data analysis and modeling as the auto-covariance function for time series. Variography should be an integral part of any spatial data analysis, independent of the modeling approach applied (geostatistics or machine learning). In Figure 1.3 the experimental variogram rose for the data shown in Figure 1.2 is presented. A variogram rose is a variogram calculated in several directions and at many lag distances; it is a very useful tool for detecting spatial patterns and their correlation structures. The anisotropy can be clearly seen in Figure 1.3.

– Spatiotemporal predictions/simulations, modeling of spatial variability and uncertainty, risk mapping. The following methods are considered in this book:

- Geostatistics (Chapter 3). Geostatistics is a well known approach developed for spatial and spatiotemporal data. It was established in the middle of the 20th century and has a long successful history of theoretical developments and applications in different fields. Geostatistics treats data as realizations of random functions. The geostatistical family of kriging models provides linear and nonlinear modeling tools for spatial data mapping. Special models (e.g. indicator kriging) were developed to "map" local probability density functions, i.e. to model the uncertainties around unknown values. Geostatistical conditional stochastic simulations are a type of spatial Monte Carlo generator which can produce many equally probable realizations of the phenomena under study based on well defined criteria.

- Machine Learning Algorithms (Chapter 4). Machine Learning Algorithms (MLA) offer several useful information processing capabilities such as nonlinearity, universal input-output mapping and adaptivity to data. MLA are nonlinear universal tools for obtaining and modeling data, and they are excellent exploratory tools. Correct application of MLA demands profound expert knowledge and experience. In this book several architectures widely used for different applications are presented – neural networks: multilayer perceptron (MLP), probabilistic neural network (PNN), general regression neural network (GRNN), self-organizing (Kohonen) maps (SOM); and, from statistical learning theory: Support Vector Machines (SVM), Support Vector Regression (SVR) and other kernel-based methods. At present, conditional stochastic simulation using machine learning remains an open question.
Figure 1.3. Experimental variogram rose for the data from Figure 1.2
- Bayesian Maximum Entropy (Chapter 6). Bayesian Maximum Entropy (BME) is based on recent developments in spatiotemporal data modeling. BME is extremely efficient in the integration of general expert knowledge and specific information (e.g. measurements) for spatiotemporal data analysis, modeling and mapping. Under some conditions BME models reduce to geostatistical models.

– Model assessment/model validation. This is the final phase of the study. The "best" models are selected and justified, and their generalization capabilities are estimated using a validation data set – a completely independent data set never used to develop or select a model.

– Decision-oriented mapping. Geomatics tools such as Geographical Information Systems (GIS) can be used to efficiently visualize the prediction results. The resulting maps may include not only the results of data modeling but also other thematic layers important for the decision making process.

– Conclusions, recommendations, reports, communication of the results.

1.2.3. Model assessment and model selection

Now let us return to the question of data modeling. As has already been mentioned, in general there is no single solution to this problem. Therefore, an extremely important question deals with model selection and model assessment procedures. First we have to choose the "best" model and then estimate its generalization abilities, i.e. its predictions on a validation data set which has never been used for model development.
Model selection and model assessment have two distinct goals [HAS 01]:
– Model selection: estimating the performance of different models in order to choose the best one – the most appropriate, the most adapted to the data, the one best matching some prior knowledge, etc.
– Model assessment: having chosen a model, estimating its prediction error on new independent data (the generalization error).

In practice these problems are solved either using different statistical techniques or empirically by splitting the data into three subsets (Figure 1.4): training data, testing data and validation data. Let us note that in this book the traditional terminology of environmental modeling is used; the machine learning community names the subsets in the order training/validation/testing. The training data subset is used to train the selected model (not necessarily the optimal or best model); the testing data subset is used to tune hyper-parameters and/or for model selection; and the validation data subset is used to assess the ability of the selected model to predict new data. The validation data subset is not used during the training and model selection procedure. It can be considered as a completely independent data set or as additional measurements.

The distribution of percentages between the data subsets is quite free. What is important is that all subsets characterize the phenomenon under study in a similar way. For environmental spatial data, the clustering structure, the global distributions and the variograms should be similar for all subsets (a simple splitting sketch is given after Figure 1.4 below).

Model selection and model assessment procedures are extremely important, especially for data-driven machine learning algorithms, which mainly depend on data quality and quantity and less on expert knowledge and modeling assumptions.
Figure 1.4. Splitting of raw data
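The splitting itself is straightforward; the sketch below illustrates it with scikit-learn (an assumption made only for illustration, not one of the packages discussed in this book) on hypothetical arrays X (coordinates/features) and y (the measured variable). A real study should additionally compare the histograms and variograms of the three subsets, as noted above.

```python
# Splitting raw spatial data into training / testing / validation subsets
# (environmental-modeling naming used in this book). A sketch with
# hypothetical arrays: X holds coordinates/features, y the target variable.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 100, size=(500, 2))          # e.g. spatial coordinates
y = np.sin(X[:, 0] / 15.0) + 0.3 * rng.normal(size=500)

# 1) put aside the validation set: never used for training or model selection
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# 2) split the remainder into training (fit the model) and testing
#    (tune hyper-parameters / select the model)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)   # 0.25 * 0.8 = 20% of all data

# a quick check that the subsets describe the phenomenon in a similar way
for name, subset in [("train", y_train), ("test", y_test), ("validation", y_val)]:
    print(f"{name}: n={subset.size}, mean={subset.mean():.2f}, std={subset.std():.2f}")
```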
A scheme of the generic methodology of using machine learning algorithms for spatial environmental data modeling is given in Figure 1.5. The methodology is similar to any other statistical analysis of data. The first step is to extract useful information (which should be quantified, e.g. as information described by spatial correlations) from noisy data. Then the quality of the modeling has to be controlled by analyzing the residuals: the residuals of the training, testing and validation data should be uncorrelated white noise. Unfortunately, in many applied publications this important step of residual analysis is neglected.

Another important aspect, both in environmental modeling and in environmental data analysis and forecasting, concerns the uncertainties of the corresponding modeling results. Uncertainties are of great importance for intelligent decisions; sometimes they can be even more important than the particular prediction values. In statistical models (geostatistics, BME) this procedure is inherent, and under some hypotheses confidence intervals can be derived. With MLA this is a slightly more difficult problem, but many theoretical and operational solutions have already been proposed.
Figure 1.5. Methodology of MLA application for spatial data analysis
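One simple way to make the residual check concrete is to compute the empirical variogram of the residuals: for a model that has captured the spatially structured part of the data, this variogram should be approximately flat (pure nugget) at the level of the residual variance. The sketch below is an illustration under stated assumptions: it re-uses the empirical_variogram() helper and the X_train/y_train arrays from the earlier sketches and fits a hypothetical k-nearest neighbor regression model, not any model prescribed by this book.

```python
# Checking that model residuals are spatially uncorrelated ("white noise"):
# the empirical variogram of the residuals should be approximately flat.
# Continues the earlier sketches: empirical_variogram(), X_train, y_train.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
residuals = y_train - model.predict(X_train)

lags, gamma_res = empirical_variogram(X_train, residuals)
print("residual variogram:", np.round(gamma_res, 3))
print("residual variance :", round(residuals.var(), 3))
# If gamma_res fluctuates around the residual variance at all lags (no clear
# growth with distance), the model has captured the spatially structured part.
```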
Concerning the mapping and visualization of the results, one possibility to summarize both predictions and uncertainties is to use "thick isolines", which characterize the uncertainty of spatial predictions (see Figure 1.6). For example, under some hypotheses which depend on the applied model, the interpretation is that with a probability of 95% an isoline of the predefined decision level can be found within the thick zone. Correct visualization is important in communicating the results to decision makers. It can also be used for monitoring network optimization procedures by highlighting regions with high or unacceptable uncertainties. Let us note that such a visualization of predictions and uncertainties is quite common in time series analysis.
Figure 1.6. Combining predictions with uncertainties: “thick isolines”
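A possible rendering of such a "thick isoline" is sketched below under the assumption that a prediction grid with a mean and a standard deviation at every node is already available (both are simulated placeholders here): the cells where the decision level lies within the approximate 95% interval (mean plus or minus 1.96 standard deviations) are shaded around the crisp isoline.

```python
# "Thick isoline" sketch: shade the zone where a decision level z0 may lie,
# i.e. where |mean - z0| <= 1.96 * std on the prediction grid.
# The prediction mean and standard deviation grids are simulated placeholders.
import numpy as np
import matplotlib.pyplot as plt

xx, yy = np.meshgrid(np.linspace(0, 100, 200), np.linspace(0, 100, 200))
pred_mean = np.sin(xx / 15.0) + np.cos(yy / 25.0)      # placeholder prediction map
pred_std = 0.3 + 0.2 * np.abs(np.sin(yy / 30.0))       # placeholder uncertainty map

z0 = 0.5                                               # decision level of interest
thick_zone = np.abs(pred_mean - z0) <= 1.96 * pred_std # where the isoline may lie

plt.contourf(xx, yy, pred_mean, levels=20, cmap="viridis")
plt.contour(xx, yy, pred_mean, levels=[z0], colors="k")             # crisp isoline
plt.contourf(xx, yy, thick_zone.astype(float), levels=[0.5, 1.5],
             colors=["red"], alpha=0.3)                              # "thick" band
plt.title("Decision level z0 with its uncertainty band")
plt.show()
```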
In this section some basic problems of spatial data analysis, modeling and visualization were presented. Model-based methods (geostatistics, BME) and data-driven algorithms (MLA) were mentioned as possible modeling approaches to these tasks. Correct application of both of them demands profound expert knowledge of the data, models, algorithms and their applicability. Taking into account the complexity of spatiotemporal data analysis, the availability of good literature (books, tutorials, papers) and of software modules/programs with user-friendly interfaces is important for learning and applications.
In the following section some of the available resources, such as books and software tools, are given. The list is very short and far from complete for this very dynamic research discipline, sometimes called environmental data mining.

1.3. Resources

Some general information, including references to conferences, tutorials and software for the methods considered in this book, can be found on the Internet, in particular on the following sites:
– web resources on geostatistics and spatial statistics: http://www.ai-geostats.org;
– on machine learning: http://www.kernel-machines.org/ and http://www.supportvector.net/; http://mloss.org/about/ – machine learning open source software; http://www.cs.iastate.edu/~honavar/Courses/cs673/machine-learning-courses.html – an index of ML courses; http://www.patternrecognition.co.za/tutorials.html – machine learning tutorials; very good tutorials on statistical data mining can be found on-line at http://www.autonlab.org/tutorials/list.html;
– Bayesian maximum entropy: some resources related to Bayesian maximum entropy (BME) methods; for a more complete list of references see Chapter 6, and see also the BMELab site at http://www.unc.edu/depts/case/BMElab.

1.3.1. Books, tutorials

The list of books given in the reference section below is not complete, but it gives good references on the introductory and advanced topics presented in this book. Some of them are more theoretical, while others concentrate more on applications and case studies. In any case, most of them can be used as textbooks for educational purposes as well as references for research.

1.3.2. Software

Contemporary data analysis and modeling approaches are not feasible without powerful computers and good software tools. This book does not include a CD with software modules (unfortunately). Therefore, below we would like to recommend some cheap and "easy to find" software with short descriptions.

– GSLIB: a geostatistical library with Fortran routines [DEU 97]. The GSLIB library, which first appeared in 1992, was an important step in geostatistics applications and stimulated new developments.
It gave many researchers and students the possibility of starting with geostatistical models and learning the corresponding algorithms while having access to the code. Description: the GSLIB modeling library covers both geostatistical predictions (the family of kriging models) and conditional geostatistical simulations. There is a version of GSLIB with user interfaces which can be found at http://www.statios.com/WinGslib.

– S-GeMS is a piece of software for 3D geostatistical modeling. Description: it implements many of the classical geostatistics algorithms, as well as new developments made at the SCRF lab, Stanford University. It includes a selection of traditional and the most recent geostatistical models: kriging, co-kriging, sequential Gaussian simulation, sequential indicator simulation, multi-variate sequential Gaussian and indicator simulation, multiple-point statistics simulation, as well as standard data analysis tools (histograms, QQ-plots, variograms) and interactive 3D visualization. Open source code is available at http://sgems.sourceforge.net.

– Geostat Office (GSO). An educational version of GSO comes with a book [KAN 04]. The GSO package includes geostatistical tools and models (variography, spatial predictions and simulations) and neural networks (multilayer perceptron, general regression neural networks and probabilistic neural networks).

– Machine Learning Office (MLO) is a collection of machine learning software modules: multilayer perceptron, radial basis functions, general regression and probabilistic neural networks, support vector machines, self-organizing maps. MLO is a set of software tools accompanying the book [KAN 08].

– R (http://www.r-project.org). R is a free software environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment. There are several contributed modules dedicated to geostatistical models and to machine learning algorithms.

– Netlab [NAB 01]. This consists of a toolbox of Matlab® functions and scripts based on the approach and techniques described in "Neural Networks for Pattern Recognition" by Christopher M. Bishop (Oxford University Press, 1995), but also including more recent developments in the field. http://www.ncrg.aston.ac.uk/netlab.

– LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm is quite a popular library for Support Vector Machines.

– TORCH machine learning library (http://www.torch.ch). The tutorial on the library, http://www.torch.ch/matos/tutorial.pdf, presents TORCH as a machine learning library, written in C++ and distributed under a BSD license. The ultimate objective of the library is to include all of the state-of-the-art machine learning algorithms, for both static and dynamic problems. Currently, it contains all sorts of artificial neural networks (including convolutional networks and time-delay neural networks), support vector machines for regression and classification, Gaussian mixture models, hidden Markov models, k-means, k-nearest neighbors and Parzen windows. It can also be used to train a connected word speech recognizer. And last but not least, bagging and adaboost are ready to use.
– Weka: http://www.cs.waikato.ac.nz/~ml/weka. Weka is a collection of machine learning algorithms for data-mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It is also well-suited for developing new machine learning schemes.

– Machine Learning Open Source Software (MLOSS): http://mloss.org/about. The objective of this new interesting project is to support a community creating a comprehensive open source machine learning environment.

– SEKS-GUI (Spatiotemporal Epistematics Knowledge Synthesis software library and Graphic User Interface). Description: advanced techniques for modeling and mapping spatiotemporal systems and their attributes, based on theoretical modes, concepts and methods of evolutionary epistemology and modern cognition technology. The interactive software library of SEKS-GUI explores heterogeneous space-time patterns of natural systems (physical, biological, health, social, financial, etc.); accounts for multi-sourced system uncertainties; expresses the system structure using space-time dependence models (ordinary and generalized); synthesizes core knowledge bases, site-specific information, empirical evidence and uncertain data; and generates meaningful problem solutions that allow an informative representation of the real-world system using space-time varying probability functions and the associated maps (predicted attribute distributions, heterogeneity patterns, accuracy indexes, system risk assessment, etc.). http://geography.sdsu.edu/Research/Projects/SEKS-GUI/SEKS-GUI.html. Manual: Kolovos, A., H-L Yu and Christakos, G., 2006, SEKS-GUI v.0.6 User Manual, Dept. of Geography, San Diego State University, San Diego, CA.

– BMELib (a Matlab® library) and its applications can be found at http://www.unc.edu/depts/case/BMElab/.

1.4. Conclusion

The problem of spatial and spatiotemporal data analysis is becoming more and more important: many monitoring stations around the world are collecting high frequency data on-line, satellites produce a huge amount of information about the Earth on a daily basis, and an immense amount of data is available within GIS. Environmental data are multivariate and noisy; highly variable at many geographical scales – from local variability in hot spots to regional trends; many of them are unique (only one realization of the phenomenon under study); and usually environmental data are spatially non-stationary.
The problem of the reconstruction of random fields from discrete data measurements has no single solution. Several important, and difficult to verify, hypotheses have to be accepted, and tuning of the model-dependent parameters has to be carried out, before arriving at a "unique and in some sense best" solution.

In general, the different data analysis approaches – both model-based and data-driven – can be considered as complementary. For example, MLA can be efficiently used already at the phase of exploratory data analysis or for de-trending in a hybrid scheme. Moreover, there are links between the two groups of methods, such that, under some conditions, kriging (as a Gaussian process) can be considered as a particular neural network and vice versa.

Therefore, in this book, three different approaches are presented as possible solutions to the same problem of analysis and mapping of spatial data. Each of them has its own advantages and drawbacks in comparison with the others. In some cases they have quite unique properties for solving more specific tasks: geostatistical simulations and BME to model joint probability density functions (random fields), MLA when working with high dimensional and multivariate data. Hybrid models based on both approaches can overcome some difficulties and produce better results. We propose to apply different methods and tools in order to produce alternative and complementary results which can improve decision-making processes.

1.5. References

[ABE 05] ABE S., Support Vector Machines for Pattern Classification, Springer, 2005.
[BIS 07] BISHOP C.M., Pattern Recognition and Machine Learning, Springer, 2007.
[CHE 98] CHERKASSKY V. and MULIER F., Learning from Data, John Wiley & Sons, 1998.
[CHI 99] CHILES J.-P. and DELFINER P., Geostatistics: Modelling Spatial Uncertainty, Wiley Series in Probability and Statistics, John Wiley & Sons, 1999.
[CHR 92] CHRISTAKOS G., Random Field Models in Earth Sciences, Academic Press, San Diego, CA, 1992.
[CHR 98] CHRISTAKOS G. and HRISTOPULOS D.T., Spatiotemporal Environmental Health Modelling, Kluwer Academic Publ., Boston, MA, 1998.
[CHR 00a] CHRISTAKOS G., Modern Spatiotemporal Geostatistics, Oxford University Press, New York, 2000.
[CHR 02c] CHRISTAKOS G., BOGAERT P. and SERRE M.L., Temporal GIS, Springer-Verlag, New York, NY, with CD-ROM, 2002.
[CHR 05] CHRISTAKOS G., OLEA R.A., SERRE M.L., YU H.L. and WANG L-L., Interdisciplinary Public Health Reasoning and Epidemic Modelling: The Case of Black Death, Springer-Verlag, New York, NY, 2005. [CRE 93] CRESSIE, N., Statistics for Spatial Data, John Wiley and Sons, NY, 1993. [CRI 00] CRISTIANINI N. and SHAWE-TAYLOR J., Support Vector Machines, Cambridge University Press, 2000. [DAV 88] DAVID M., Handbook of Applied Advanced Geostatistical Ore Reserve Estimation, Elsevier Science Publishers, Amsterdam B.V., 216 p., 1988. [DEU 97] DEUTSCH C.V. and JOURNEL A.G., GSLIB: Geostatistical Software Library and User’s Guide, Oxford University Press, 1997. [DOB 07] DOBESCH H., DUMOLARD P., and DYRAS I (eds.), Spatial Interpolation for Climate Data: The Use of GIS in Climatology and Meteorology, Geographical Information Systems series, ISTE, 2007. [DUB 03] DUBOIS G., MALCZEWSKI J., and DE CORT M. (eds.), Mapping Radioactivity in the Environment, Spatial Interpolation Comparison 97, European Commission, JRC Ispra, EUR 20667, 2003. [DUB 05] DUBOIS G. (ed.), Automatic Mapping Algorithms for Routine and Emergency Data, European Commission, JRC Ispra, EUR 21595, 2005. [DUD 01] DUDA R., HART P. and STORK D., Pattern Classification, 2nd edition, John Wiley & Sons, 2001. [GAN 63] GANDIN L.S., Objective Analysis of Meteorological Fields, Israel program for scientific translations, 1963, Jerusalem. [GOO 97] GOOVAERST P., Geostatistics for Natural Resources Evaluation, Oxford University Press, 1997. [GRU 06] DE GRUIJTER J., BRUS D., BIERKENS M.F.P. and KNOTTERS M., Sampling for Natural Resource Monitoring, Springer, 2006. [GUY 06] GUYON I., GUNN S., NIKRAVESH M., and ZADEH L. (eds.), Feature Extraction: Foundations and Applications, Springer, 2006. [HAS 01] HASTIE T., TIBSHIRANI R., and FRIEDMAN J., The Elements of Statistical Learning, Springer, 2001. [HAY 98] HAYKIN S., Neural Networks: a Comprehensive Foundation, Pearson Higher Education, 2nd edition, 842 p., 1999. [HIG 03] HIGGINS N. A. and JONES J. A., Methods for Interpreting Monitoring Data Following an Accident in Wet Conditions, National Radiological Protection Board, Chilton, Didcot, 2003. [HYV 01] HYVARINEN A., KARHUNEN J., OJA E., Independent Component Analysis, Wiley Interscience, 2001. [ISA 89] ISAAKS E., SHRIVASTAVA M., Applied Geostatistics, Oxford University Press, 1989.
[ISA 90] ISAAKS E. H. and SRIVASTAVA R. M., An Introduction to Applied Geostatistics, Oxford University Press, 1990. [JEB 04] JEBARA T., Machine Learning: Discriminative and Generative, Kluwer Academic Publ., 2004. [JOU 78] JOURNEL A.G. and HUIJBREGTS C.J., Mining Geostatistics, Academic Press, 600 p., London, 1978. [KAN 04] KANEVSKI and MAIGNAN, M., Analysis and Modelling of Spatial Environmental Data, EPFL Press, 2004. [KAN 08] KANEVSKI M., POZDNOUKHOV A. and TIMONIN V., Machine Learning Algorithms for Environmental Spatial Data. Theory, Applications and Software, EPFL Press, Lausanne, 2008. [KOH 00] KOHONEN T., Self-Organising Maps, Springer, NY, 2000. [LEE 07] LEE J and VERLEYSEN M., Nonlinear Dimensionality Reduction, Springer, NY, 2007. [LEN 06] LE N.D. and ZIDEK J.V., Statistical Analysis of Environmental Space-Time Processes, Springer, NY, 2006. [LLO 06] LLOYD C.D., Local Models for Spatial Analysis, CRC Press, 2006. [MAT 63] MATHERON G., Principles of Geostatistics Economic Geology, vol. 58, December 1963, p. 1246-1266. [MUL 07] MULLER W.G., Collecting Spatial Data. Optimum Design of Experiments for Random Fields, 3rd edition, Springer, NY, 2007. [NAB 01] NABNEY I., Netlab: Algorithms for Pattern Recognition, Springer, 2001. [RAS 06] RASMUSSEN C.E. and WILLIAMS C.K.I., Gaussian Processes for Machine Learning, MIT Press, 2006. [SCH 05] SCHABENBERGER O. and GOTWAY C., Statistical Methods for Spatial Data Analysis, Chapman and Hall/CRC, 2005. [SCH 06] SCHÖLKOPF B. et al. (eds.), Semi-Supervised Learning, Springer, 2006. [SCH 98] SCHÖLKOPF B., SMOLA A., and MÜLLER K., “Nonlinear Component Analysis as a Kernel Eigenvalue Problem”, Neural Computation, vol. 10, 1998, p. 1299-1319. [SHA 04] SHAWE-TAYLOR J. and CRISTIANINI N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004. [VAP 06] VAPNIK V., Estimation of Dependences Based on Empirical Data (2nd Edition), Springer, 2006. [VAP 95] VAPNIK V., The Nature of Statistical Learning Theory, Springer, 1995. [VAP 98] VAPNIK V., Statistical Learning Theory, Wiley, 1998. [WAC 95] WACKERNAGEL H., Multivariate Geostatistics, 3rd edition, Springer-Verlag, 387 p., Berlin, 2003.
Chapter 2
Environmental Monitoring Network Characterization and Clustering
2.1. Introduction

The quality of environmental data analysis and propagation of errors are heavily affected by the representativity of the initial sampling design [CRE 93, DEU 97, KAN 04a, LEN 06, MUL 07]. Geostatistical methods such as kriging are related to field samples, whose spatial distribution is crucial for the correct detection of the phenomena. Literature about the design of environmental monitoring networks (MN) is widespread and several interesting books have recently been published [GRU 06, LEN 06, MUL 07] in order to clarify the basic principles of spatial sampling design. In [POZ 06] a new approach for spatial sampling design (monitoring networks optimization) based on Support Vector Machines was proposed.
Nonetheless, modelers often receive real data coming from environmental monitoring networks that suffer from problems of non-homogeneity (clustering). Clustering can be related to preferential sampling or to the impossibility of reaching certain regions. Figure 2.1 shows three examples of real monitoring networks.
Chapter written by D. TUIA and M. KANEVSKI.
Figure 2.1. Examples of clustered MN: (top-left) Cs137 survey in Briansk region (Russia); (top-right) heavy metals survey in Japan; (bottom-right) indoor radon survey in Switzerland
In order to deal with this problem, declustering methods have been developed, to estimate the non-biased global parameters by weighting the distribution function according to the degree of spatial clustering [DEU 97]. Several specific declustering techniques have been proposed, going from simple random and cell methods to Maximum Likelihood-based [ALL 00], two-point declustering [RIC 02] and more complex approaches based on Bayesian Maximum Entropy formalism [KOV 04]. Declustering of clustered preferential sampling for histogram and semivariogram inference was proposed in [OLE 07]. Declustering methods are delicate and are linked to an unavoidable loss of the initial information. In that sense, a rigorous characterization of the MN is necessary in order to understand whether or not these operations are necessary. This chapter deals with exploratory spatial data analysis, paying particular attention to the quantitative characterization of MN, in order to give to the analyst the tools necessary to evaluate the adequacy of a network to detect an environmental phenomenon. 2.2. Spatial clustering and its consequences Spatial clustering of an MN can influence global estimations and spatial predictions and leads to erroneous conclusions about environmental phenomena such as pollution. In this chapter, the term clustering is used in a purely spatial context: only the spatial repartition of samples is considered and clusters in
functional/variable space (such as pollutant concentrations) are not considered (the functional approach, like functional box-counting, can generalize most of the measures considered below [LOV 87]). In this sense, clustering can be defined as the spatial non-homogeneity of measurement points. Figure 2.2 shows two monitoring networks: the first is characterized by a random repartition of samples, while the second is clustered.
Figure 2.2. Example of MN: (left) random distribution of samples; (right) clustered distribution of samples
The measures described in this chapter imply the spatial stationarity of the phenomenon under study. Non-stationary measures will not be discussed here. 2.2.1. Global parameters Clustered monitoring networks often do not represent the true spatial pattern of phenomena and modeling processes based on raw data produce biased results. This non-representativity leads to a risk of over- and under-estimation of global parameters (e.g., mean, variance) and therefore to an erroneous reconstruction of the probability distribution governing the phenomenon. Figure 2.3 shows an example of a random variable (this is a simulated example).
Figure 2.3. Simulation of an environmental phenomenon and sampling schemes used (left: random; right: clustered)
If this phenomenon is sampled with the networks shown in Figure 2.2, the differences in observed mean and variance of the phenomenon are evident (Table 2.1); the clustering of samples in areas characterized by small concentrations decreases the value of the observed mean, implying an incorrect observation of the phenomenon. The histogram that will be used for modeling is therefore biased and does not represent the true phenomenon. Such errors can lead to under- or overestimation of environmental risk and must be avoided.

                     Mean    Variance
Real                 0.26    0.77
Random MN            0.26    0.79
Clustered MN         0.09    0.77
Table 2.1. Observed mean and variance of the artificial phenomenon sampled with both MN shown in Figure 2.2. The first line shows parameters estimated using all data
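Such biased global statistics can be corrected by a declustering weighting of the samples before any modeling. The following is a minimal sketch of a cell-declustering estimate of the global mean, in the spirit of the methods cited above [DEU 97]; the function name and the single fixed-grid weighting scheme are illustrative assumptions, not the exact procedure used to produce Table 2.1.

```python
import numpy as np

def cell_declustering_mean(coords, values, cell_size):
    """Cell-declustering estimate of the global mean: every sample is weighted
    inversely to the number of samples falling in its grid cell, so that
    densely sampled (clustered) areas are down-weighted."""
    ix = np.floor(coords[:, 0] / cell_size).astype(int)
    iy = np.floor(coords[:, 1] / cell_size).astype(int)
    cells = list(zip(ix, iy))
    counts = {c: cells.count(c) for c in set(cells)}
    n_occupied = len(counts)
    weights = np.array([1.0 / (n_occupied * counts[c]) for c in cells])
    return float(weights @ values)   # weights sum to 1 by construction
```

In practice the cell size is usually varied and the value giving the most conservative declustered mean is retained; the same weights can also be reused to build a declustered histogram and variance.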
2.2.2. Spatial predictions The use of a clustered MN for spatial prediction can lead to incorrect spatial conclusions about the extent of a polluted area. Following the example used in the previous section, random (left in Figure 2.2) and clustered (right in Figure 2.2) networks were used to produce a pollution map using a kriging model (see Chapter 3). Figure 2.4 shows that the oversampling in small concentration areas leads to a regional under-estimation of risk and that small contaminated areas (hot spots) are not detected.
Figure 2.4. Spatial interpolation of both networks (left: random; right: clustered) using kriging
2.3. Monitoring network quantification In this section, several clustering measures will be discussed. Particular attention will be paid to the fractal clustering measures. In principle, quantitative clustering measures can be aggregated into topological, statistical and fractal measures [KAN 04a]. 2.3.1. Topological quantification The topological structure of space can be quantified by Euclidean geometry expressed by topological dimension: an object that can be disconnected by another of dimension n has a dimension n+1 (Figure 2.5). The usual representation of space is therefore bounded to integer dimensions. For example, a surface environmental process should be analyzed with a MN covering the entire two-dimensional space (topological dimension of 2).
Figure 2.5. Examples of topological dimensions
2.3.2. Global measures of clustering

Several methods exist in order to highlight clustering [CRE 93, KAN 04a]. Below is a non-exhaustive list of well-known methods useful to quantify departures from the homogenous repartition of samples. Both the simulated and real data considered in this chapter deal with two-dimensional geographical space (longitude-latitude coordinates or corresponding projections).

2.3.2.1. Topological indices

Topological indices evaluate the level of MN clustering by estimating the homogeneity of the two-dimensional space covering provided by the MN. In that
sense, a quasi-quantitative index is the area of Voronoï polygons [THI 11, PRE 85, STO 95, OKA 00]. If the samples are homogenously distributed, the areas of the Voronoï polygons are constant for every polygon associated with every sample (except for the samples located close to the boundaries of the region). If there is some clustering, the surface distribution varies from small areas (clustered areas) to large (regions where only a few samples are available). Therefore, the area/frequency distribution of the polygons can be interpreted as an index of spatial clustering [NIC 00, KAN 04a, PRO 07]. An example of the analysis based on Voronoï polygons is given in Figure 2.6.
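As an illustration of this idea, the sketch below computes the areas of the finite Voronoï polygons of a 2D monitoring network with SciPy; the function name is an assumption, and unbounded border cells are simply skipped, which is a simplification of the boundary treatment mentioned above.

```python
import numpy as np
from scipy.spatial import Voronoi, ConvexHull

def voronoi_areas(points):
    """Areas of the finite Voronoi polygons of a 2D point set; unbounded cells
    (samples near the border of the region) are skipped."""
    vor = Voronoi(points)
    areas = []
    for region_index in vor.point_region:
        region = vor.regions[region_index]
        if len(region) == 0 or -1 in region:   # unbounded cell
            continue
        polygon = vor.vertices[region]
        areas.append(ConvexHull(polygon).volume)  # in 2D, .volume is the area
    return np.array(areas)
```

The spread of the resulting area distribution (e.g. its frequency histogram, as in Figure 2.6) can then be compared between networks: near-constant areas indicate homogeneity, while a heavy-tailed distribution indicates clustering.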
Figure 2.6. Voronoï polygon area for the clustered (left, above) and homogenous (left, below) areas. Frequency/area for the networks (right)
2.3.2.2. Statistical indices

Several statistical indices have been developed to highlight the presence of spatial clustering, the most common probably being Moran’s index [MOR 50], a weighted correlation coefficient used to analyze departures from spatial randomness. Other indices can be used to discover the presence of clusters:
– the Morisita index [MOR 59]: the region is divided into Q identical cells and the number of samples n_i within every cell i is counted. Then, the size of the cells is increased and the process is iterated, returning the size-dependent Morisita index I':

I' = Q \frac{\sum_{i=1}^{Q} n_i (n_i - 1)}{N (N - 1)}   [2.1]
where N is the total number of samples. A homogenous process will show a Morisita index fluctuating around the value of 1 for all scales considered, because of the homogenous distribution of the samples within the boxes at every scale. For the clustered MN, the number of empty cells for small scales increases the value of the index. The index has been used in a wide range of environmental applications, from ecological studies [SHA 04, BON 07] to risk analysis [OUC 86, TUI 07a]. Examples of Morisita diagrams for two simulated networks (homogenous and clustered) are given in Figure 2.7.
Figure 2.7. Morisita index for random (dashed) and clustered (solid) MN
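A minimal sketch of how a Morisita diagram like the one in Figure 2.7 can be produced is given below; it assumes points normalized to the unit square, and the simulated clustered and homogenous networks are purely illustrative.

```python
import numpy as np

def morisita_index(points, n_cells):
    """Morisita index I' (equation [2.1]) for a 2D point set, using an
    n_cells x n_cells grid covering the unit square."""
    x, y = points[:, 0], points[:, 1]
    N = len(points)
    Q = n_cells * n_cells
    counts, _, _ = np.histogram2d(x, y, bins=n_cells, range=[[0, 1], [0, 1]])
    n_i = counts.ravel()                       # samples per cell
    return Q * np.sum(n_i * (n_i - 1)) / (N * (N - 1))

# illustrative networks on the unit square
rng = np.random.default_rng(0)
homogenous = rng.uniform(0, 1, size=(500, 2))
clustered = rng.normal(0.5, 0.05, size=(500, 2)).clip(0, 1)
for n in (2, 4, 8, 16, 32):                    # decreasing cell size
    print(n, morisita_index(homogenous, n), morisita_index(clustered, n))
```

For the homogenous set the printed values stay close to 1 at all grid resolutions, while the clustered set gives values well above 1 for small cells.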
The K-function (or Ripley’s K) [RIP 77, MOE 04] can be used to calculate the degree of spatial randomness in the spatial distribution of samples. The K-function is
K(R) = \lambda^{-1} E\left[ I(d_{ij} \le R) \right]   [2.2]

where λ is the density (number per unit area) of samples and I is an indicator function giving 1 if the considered samples are within a circle of radius R and 0 otherwise.
Figure 2.8. (Solid) K-functions for the homogenous (left) and clustered (right) datasets. Comparison with spatial randomness configuration (dashed)
The iteration for different delta values allows us to describe the properties at different scales (as with the Morisita index). The K-function can then be compared to the K-function associated with a function representing spatial randomness, usually a Poisson process. Here, K_{rnd}(R) \propto R^2. Plotting both functions shows a departure from spatial randomness (dashed line, Figure 2.8), i.e., clustering. It should be noted that the K-function method is comparable to the sandbox counting measure for fractal sets considered below.
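A naive estimate of the K-function can be sketched as follows; the function name and the absence of the border-correction factors discussed later in this chapter are simplifying assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist

def ripley_k(points, radii, area):
    """Naive Ripley's K estimate (equation [2.2]), without edge correction:
    mean number of further samples within distance R, divided by the density."""
    n = len(points)
    lam = n / area                            # density of samples
    d = pdist(points)                         # all pairwise distances
    k = np.empty(len(radii))
    for m, r in enumerate(radii):
        k[m] = 2.0 * np.sum(d <= r) / (n * lam)   # each pair counts for both points
    return k

# comparison with complete spatial randomness: K_rnd(R) = pi * R**2
```

Plotting the returned values against pi * radii**2 reproduces the kind of comparison with the Poisson reference shown in Figure 2.8.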
less than D (in our case D=2). By their fractal nature, clustered monitoring networks have a dimensional resolution lower than 2, and thus they can detect only phenomena of (2-df) dimension [LOV 1986, TUI 2007b]. The determination of the dimensional resolution allows us to analyze the appearance of self-similar structures in the monitoring network, i.e. the repetition of the configuration of points throughout the scale and the non-homogenity of the network. Let us note that clustered MN are not mathematical fractal objects, therefore, they could be self-similar only over a limited number of scales. Several methods exist to evaluate the fractal dimension of networks [MAN 94, SMI 96]. In the following sections, two of these methods (the most widely used in applications), the sandbox and the box-counting algorithm, are presented. The remainder of the section focuses on another interesting measure of fractal objects, the lacunarity. 2.3.3.1. Sandbox method With the sandbox method (also called the radial method) [FED 88], the number of neighbors within a circle of radius R centered on the current point is averaged on the whole dataset (Figure 2.9, left). This average number of neighbors follows a power law:
P(R) \propto R^{d_f^{SAND}}   [2.3]
where dfSAND is the fractal dimension of the network measured with the sandbox method. Using a log-transform of [2.3] it is possible to plot log[P(R)] as a function of log[R] and to derive dfSAND as the slope of the linear regression fitting the data of the plot. The sandbox method is based on local neighborhood measures between samples and can be interpreted as a measure of the density of samples at different scales. Therefore, using the sandbox method allows us to detect the appearance of clustering as a departure from a homogenous situation, for which the fractal dimension is equal to 2 (the number of points for a homogenous set increases with R2).
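A compact sketch of this sandbox estimate follows; fitting the log-log regression with numpy.polyfit and excluding each point from its own neighbour count are implementation choices of the illustration, not prescriptions from the text.

```python
import numpy as np
from scipy.spatial.distance import cdist

def sandbox_dimension(points, radii):
    """Sandbox (radial) estimate of the fractal dimension (equation [2.3]):
    slope of log(mean number of neighbours within R) versus log(R)."""
    d = cdist(points, points)
    mean_neighbours = [np.mean(np.sum(d <= r, axis=1) - 1) for r in radii]  # minus the point itself
    slope, _ = np.polyfit(np.log(radii), np.log(mean_neighbours), 1)
    return slope
```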
Figure 2.9. Calculation of the fractal dimension with the sandbox (left) and box-counting (right) methods
Figure 2.10 shows the results of applying the sandbox method to both networks considered in this section. The slope of the regression function fitting the curves gives the fractal dimension. The homogenous samples are associated with a fractal dimension of 1.832, while the clustered samples have a dimension of 1.197. The value of 2 is not reached by the homogenous network because the distribution of samples is not regular: the value of two can be reached by a regular grid of samples, while a real network is characterized by slight over-densities at small scales resulting in a small level of clustering and a decrease of the fractal dimension. The second effect deals with the finite number of points and ergodic fluctuations around a homogenous measure. Moreover, the points on the boundary of the region produce a boundary effect that results in reducing the mean number of neighbors and thus the true dimension (Figure 2.11). One way to avoid the boundary effect is to introduce correction factors, as is usually done for K-function calculations (see e.g. [DIG 03]).
Figure 2.10. Fractal dimension measured with the sandbox method for the homogenous (dashed) and clustered (solid) samples
Figure 2.11. Boundary effects for peripheral samples (right): the number of neighbors for these samples is less than expected (left)
2.3.3.2. Box-counting method The box-counting method (also called the grid-method) [SMI 89] covers the region under study with a regular grid of N boxes (as in the case of Morisita index calculation) and counts the number of boxes necessary to cover the whole network S(L). The size of the boxes, L, is then gradually decreased, and the number of boxes necessary to cover the samples is counted (Figure 2.9 – second method, right). The scales and the number of boxes follow a power law
S(L) \propto L^{-d_f^{BOX}}   [2.4]
where dfBOX is the fractal dimension of the network measured with the box-counting method. Using a log-transform of [2.4] it is possible to plot log[S(L)] as a function of log[L] and to derive dfBOX as the slope (in absolute value) of a linear regression fitting the data; the least squares technique is used to fit the regression. Contrary to the previously discussed sandbox method, the box-counting method is useful for the calculation of the degree of spatial coverage of space by the network.
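The corresponding box-counting sketch is given below; the grid origin, the extent and the choice of box sizes are assumptions of this illustration.

```python
import numpy as np

def box_counting_dimension(points, box_sizes, extent):
    """Box-counting estimate of the fractal dimension (equation [2.4]):
    S(L) is the number of grid boxes of size L occupied by at least one sample."""
    (xmin, _), (ymin, _) = extent
    s = []
    for L in box_sizes:
        ix = np.floor((points[:, 0] - xmin) / L).astype(int)
        iy = np.floor((points[:, 1] - ymin) / L).astype(int)
        s.append(len(set(zip(ix, iy))))
    slope, _ = np.polyfit(np.log(box_sizes), np.log(s), 1)
    return -slope   # the regression slope is negative; df_BOX is its magnitude
```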
Figure 2.12. Fractal dimension measured with the box-counting method for the homogenous (dashed) and clustered (solid) samples
Figure 2.12 shows the results for the clustered and homogenous networks. Clustering of both networks is particularly distinguishable for small box sizes, where the effect of local clustering can be detected. As both networks are covering the entire space, the box-counting method cannot detect the difference between them at large scales (the curves are very similar for log (box sizes) greater than 4). To sum up, the fractal measures considered in this section show two different approaches to quantify the dimensionality of a monitoring network, i.e. if the network is appropriate to detect a D-dimensional phenomenon in a D dimensional Euclidean space: the first, the sandbox method, calculates a measure of the local densities of samples at different scales, while the second, the box-counting method, is based on an estimate of the spatial covering of the region under study by the network. Both indices are complementary, because they describe different clustering properties. Complementarities of the indices are shown in the two following “toy” examples. Figure 2.13 shows two similar monitoring networks associated with the same numbers of samples: the first presents a cluster of samples in the middle, while the second is homogenously distributed.
Figure 2.13. Artificial monitoring networks: left with a cluster; right – homogenous distribution of measurement points
Table 2.2 shows the values of fractal dimensions calculated with both methods on both MN considered; only the sandbox method is able to detect a spatial clustering, by the strong change in local densities, while the box-counting measure remains almost unchanged, because the 2-dimensional space is covered by both networks in the same way.
                                     Clustered    Homogenous
Sandbox fractal dimension              1.67         1.98
Box-counting fractal dimension         1.88         1.91
Table 2.2. Fractal dimensions calculated on the networks of Figure 2.13
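The kind of comparison reported in Table 2.2 can be reproduced with the two sketches introduced above (sandbox_dimension and box_counting_dimension, both hypothetical helper names); the simulated networks and the chosen radii and box sizes below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
extent = ((0.0, 1.0), (0.0, 1.0))
radii = np.linspace(0.05, 0.3, 8)
box_sizes = np.array([0.4, 0.2, 0.1, 0.05])

# a central cluster embedded in an otherwise homogenous background (cf. Figure 2.13)
clustered = np.vstack([rng.uniform(0, 1, (300, 2)),
                       rng.normal(0.5, 0.03, (200, 2)).clip(0, 1)])
homogenous = rng.uniform(0, 1, (500, 2))

for name, pts in (("clustered", clustered), ("homogenous", homogenous)):
    print(name,
          round(sandbox_dimension(pts, radii), 2),
          round(box_counting_dimension(pts, box_sizes, extent), 2))
```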
Figure 2.14 shows two other monitoring networks associated with the same numbers of samples: the first presents a regular distribution of samples in two regions of the 2-dimensional space, while in the second the samples are distributed regularly in the whole region.
Figure 2.14. Artificial monitoring networks: (left) regular samples distributed in two regions; (right) regular samples distributed in the whole region
                                     2 regions    Regular
Sandbox fractal dimension              1.86         1.91
Box-counting fractal dimension         1.79         2
Table 2.13. Fractal dimensions calculated on the networks of Figure 2.14
In this case, the result is unlike the previous example: the box-counting method detects the change in spatial distribution of samples due to the appearance of empty regions throughout the scales. The sandbox method, which analyzes the local densities, does not detect this change, because the distribution remains regular for both networks. Changes in the sandbox results are related to the bigger extent of the boundary of the first network: in this case, the boundary effects discussed above are stronger and tend to decrease the value of the fractal dimension.
2.3.3.3. Lacunarity

Although the fractal dimension can highlight the presence of clustering, it is still difficult to interpret: fractal dimension shows a departure from spatial homogeneity, but it is not a structural index. Two MN can share the same dimension, and still be very different [TUI 07b]. Both networks in Figure 2.15 share the same fractal dimension (1.093), even if a single cluster characterizes the first, while the second shows two of them.
Figure 2.15. Monitoring networks characterized by the same fractal dimension; (left) one single cluster in the center; (right) two clusters
In order to include additional information in the analysis of the dimensionality of a network, the second moment of the distribution, the variance, can be included (the fractal measures discussed above are based on the first moment only). An example of an index showing this property is the lacunarity [MAN 82, MAN 94, ALL 91]. At a descriptive level, lacunarity can be interpreted as a lack of rotational or translational invariance in an object. This property of fractals measures the degree of non-uniformity (heterogenity) of structure of an object: i.e., it measures the structural variance within the object [SMI 96]. An object is said to be lacunar if its structure is highly variable. Several methods have been proposed to calculate the lacunarity of a set of points. No general agreement exists about the best one to use. In this section, we will present the gliding box method proposed by Allain and Cloitre [ALL 91, PLO 96]. A box of size l is placed at the origin of the set and the mass s (number of events) within the box is counted. Then, the box is moved one space along the set and the process is iterated. Since the gliding box has counted the masses for every possible position, a mass distribution n(s, l) is defined. The frequency distribution is then
converted to a probability distribution by dividing n(s, l) by the total number of boxes of size l. Having defined this probability distribution, it is possible to derive the first and second moments of the distribution:
E(1) = \sum_{s=1} s \, Q(s, l)   [2.5]

E(2) = \sum_{s=1} s^2 \, Q(s, l)   [2.6]

The lacunarity Λ(l) can then be defined, as shown in equation [2.7]. The calculation is repeated for different sizes l, returning a size-dependent index:

\Lambda(l) = \frac{E(2)}{E^2(1)}   [2.7]
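A gliding-box lacunarity sketch, assuming the point set has already been rasterized to a binary occupancy grid, could look as follows; the rasterization step and the function name are assumptions of the illustration.

```python
import numpy as np

def lacunarity(grid, l):
    """Gliding-box lacunarity (equations [2.5]-[2.7]) of a binary occupancy
    grid: Lambda(l) = E(2) / E(1)^2, where the moments are taken over the
    box-mass distribution obtained by sliding an l x l box one cell at a time."""
    ny, nx = grid.shape
    masses = [grid[i:i + l, j:j + l].sum()
              for i in range(ny - l + 1)
              for j in range(nx - l + 1)]
    masses = np.asarray(masses, dtype=float)
    return np.mean(masses ** 2) / np.mean(masses) ** 2
```

Evaluated for several box sizes l, this returns a size-dependent lacunarity curve analogous to those discussed later in the chapter.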
2.4. Validity domains In the previous section some methods to evaluate the presence of clustering in a MN have been discussed. Even if the methods are simple and computationally light, the interpretation of the results in terms of predictive power is often difficult; for example, a network related to a fractal dimension of 1.4 is clustered, but which type of phenomena can be detected and predicted by a network related to such a level of clustering? Moreover, the correct estimate of real clustering measures deals with a finite number of measurement points and complex geographical regions where phenomena are monitored and studied; surveys are limited by constraining factors that can be geographical (areas that cannot be reached with instruments), political (administrative limits between countries), geomorphologic or socio-economic (for example, phenomena for which sampling out of inhabited areas is senseless). These factors define a Validity Domain (VD, or space of interest) that constrains the prediction space, decreasing the dimensionality of the phenomenon to be detected. In fact, the fractal dimension of the phenomenon studied in the mapping space will no longer be 2. Figure 2.16 shows two sampling design schemes for a hypothetical survey of forests: the one on the left side is homogenously distributed in two-dimensional space, but is not realistic (samples occur outside the forest areas). The one on the
right side is clustered, but in a forest-related space, i.e. by taking into account the VD of interest, it can be considered much more homogenous.
Figure 2.16. Sampling in forest areas. Homogenous 2D sampling (left column); homogenous sampling within the VD (right column)
Although the existence of VD seems obvious, their integration into cluster analysis is not trivial. One way to include the VD in cluster analysis is to proceed with a comparative analysis: in order to define the indices related to a homogenous sampling within the VD, some artificial MN with known properties in terms of clustering are generated and are considered as reference measures. For each of them, homogenously distributed (uniformly distributed in a Cartesian (x,y) coordinate system) random points have been generated within the VD. Departures from the homogenous distribution (clustering) are then calculated as differences (in terms of indices) in comparison with the simulated networks. The reference measures calculated on a simulated network are subjected to a certain degree of fluctuation related to the finite number of points and the randomness of the simulation procedure. Two points are crucial: – in order to deal with the finite number of points, each generated network has the same number of points (N) as the real monitoring network: the number of points
being the same, differences in the results are only related to the differences in the level of clustering and ergodic fluctuations;
– in order to take into account ergodic and other types of fluctuations, the networks have been generated using a type of bootstrapping technique: a large number of points M (with M >> N) have been produced within the validity domain and then several random samplings (bootstrapping) with N number of points have been extracted. Such a procedure can quantify uncertainty and sensitivity of the results due to the finite number of points and VD definition. The number of points M is usually defined by the following relation:

M \approx \frac{L_x L_y}{\delta^2}   [2.8]
where Lx and Ly are the limits of the region and δ is the required spatial resolution of the phenomena. The analysis of clustering within the VD can lead to the conclusion that there is no clustering within the validity domain. By limiting the following analysis to the VD of interest, declustering procedures can be avoided and the network can be considered as spatially optimized. The analyst should not forget that, if this option is chosen, the mapping of environmental variables outside the VD will be biased. In summary, the calculation of clustering indices within the relevant VD can be used to interpret indices such as fractal dimensions or Morisita diagrams. The VD are the regions of interest related to the phenomenon under study and important for the predictions; out of these regions, the location of points is not relevant or even senseless (e.g. forest fires in water bodies or water quality measures in desert areas). Such a procedure prevents us from erroneous conclusions that would come from the comparison with standard values from homogenous distribution (2 for fractals or 1 for Morisita index). The real clustering is thus the difference between the values obtained on the true MN and the artificial ones generated on the VD, representing the situation of spatial homogeneity within the VD of interest. In summary, “only relative measures matter”.

2.5. Indoor radon in Switzerland: an example of a real monitoring network

Let us consider a real case study based on the Swiss indoor radon monitoring network (SIRMN). This dataset represents about 29,000 measurements in dwellings,
and is used for indoor radon modeling and spatial predictions [KAN 04b]. Data are highly clustered, variable and anisotropic at different scales. It is very difficult to find spatial structures on raw data using traditional variography. The application of regularized and non-regular variography on transformed data can reveal spatial structures but still with a high nugget effect [KAN 04b]. 2.5.1. Validity domains In this study, 10 MN extracted by bootstrap for each of the VD presented below have been used for comparison with the real monitoring network (Figure 2.17): – random samples within the rectangular region covering the data under study. This is a theoretical homogenous network and it does not take into account any boundaries; – random samples within the political boundaries of the region under study. This network configuration allows us to take into account complex boundary effects; – random samples within the populated regions of the area under study: this kind of VD can be justified by the phenomena where priorities in prediction are given to the populated regions.
Figure 2.17. Artificial MN used for the study of the Swiss indoor radon survey; (left) random points in 2D bounding box; (center) random points within political boundaries; (right) random points within populated regions
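A hedged sketch of how such reference networks can be generated within a validity domain is given below; it combines rejection sampling with equation [2.8] and a simple bootstrap, and the function name, the mask callback and the example arguments are assumptions of the illustration, not the exact procedure used for Figure 2.17.

```python
import numpy as np

def reference_networks(vd_mask, extent, n_points, n_networks, delta, seed=None):
    """Reference MNs that are homogenous within a validity domain:
    M ~ Lx*Ly/delta**2 uniform candidates (equation [2.8]) are generated by
    rejection sampling inside the VD, then n_networks bootstrap samples of
    n_points locations each are drawn from them."""
    rng = np.random.default_rng(seed)
    (xmin, xmax), (ymin, ymax) = extent
    m = int((xmax - xmin) * (ymax - ymin) / delta ** 2)
    candidates = []
    while len(candidates) < m:
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        if vd_mask(x, y):                     # keep only points inside the VD
            candidates.append((x, y))
    candidates = np.array(candidates)
    return [candidates[rng.choice(m, size=n_points, replace=False)]
            for _ in range(n_networks)]

# example VD: a circular "populated" region inside the unit square
# networks = reference_networks(lambda x, y: (x - 0.5)**2 + (y - 0.5)**2 < 0.16,
#                               ((0, 1), (0, 1)), n_points=200, n_networks=10, delta=0.01)
```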
2.5.2. Topological index

The Voronoï polygon area analysis (see Figure 2.18) shows a different distribution of areas-of-influence depending on the level of clustering. On one hand, the behavior of the curves is quite different for the real data and for the random points on the squared grid. On the other hand, the more constraining the VD is, the more the curves begin to be similar. Analysis of the Kullback-Leibler divergence for these curves (Table 2.3) confirms these observations.
                             Min K    Mean K    Max K
Homogenous square VD         3.401    3.625     3.722
Political boundaries VD      2.511    2.634     2.809
Populated areas VD           0.340    0.366     0.384
Table 2.3. Kullback-Leibler divergence K between the real SIRMN and the artificial ones reflecting known clustering properties
Figure 2.18. Voronoï polygon area analysis for the SIRMN
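Divergences of the kind reported in Table 2.3 can be approximated from binned Voronoï-area distributions as sketched below; the binning, the small epsilon used to avoid division by zero and the function name are assumptions of this illustration.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """Kullback-Leibler divergence between two binned distributions, e.g. the
    Voronoi-area histograms of the real MN and of a simulated reference MN."""
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# sketch of use with shared bins:
# p, edges = np.histogram(real_areas, bins=20)
# q, _ = np.histogram(reference_areas, bins=edges)
# kl_divergence(p.astype(float), q.astype(float))
```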
2.5.3. Statistical indices 2.5.3.1. Morisita index The Morisita index (Figure 2.19) confirms the previous results: for real data, the index is higher than 1, which is the value of the index for a homogenous network, at all the scales. This shows a clustered distribution of points. The value of 1 is only reached asymptotically at larger scales, where one box covers the whole space, giving a value of I' of 1. Comparison with the simulated networks shows that the real data are distributed similarly to the random points over the populated regions VD, but also that the level
of clustering is significantly higher for every scale. The simulated networks can be interpreted as homogenous for large scales (for scales larger than 200 km, I' ≈ 1) and the random distribution on populated regions can be distinguished from the distribution within the political boundaries only at scales smaller than 100 km.
Figure 2.19. Morisita index for the SIRMN
2.5.3.2. K-function The application of the K-functions confirms the previous observations. It is interesting to note that the network generated in the squared grid corresponds to a random Poisson point process.
Figure 2.20. K-functions for the SIRMN
2.5.4. Fractal dimension

2.5.4.1. Sandbox and box-counting fractal dimension

The analysis of the fractal dimension on the SIRMN showed a general loss of dimensionality related to the degree of clustering (Table 2.4 and Figure 2.21).

                             dfBOX                        dfSAND^a
                             Min      Mean     Max        Min      Mean     Max
Homogenous square VD         2        2        2          1.954    1.956    1.958
Political boundaries VD      1.828    1.83     1.832      1.925    1.928    1.930
Populated areas VD           1.751    1.753    1.755      1.7      1.703    1.706
SIRMN                        Na       1.71     Na         Na       1.52     Na
Table 2.4. Fractal dimension of the considered networks (ten for every type). a Radius considered from 500 m to 50 km; Na: the real dataset consists of one unique realization, min and max statistics are not informative
Figure 2.21. Fractal dimension index for the SIRMN; above: sandbox method; below: box-counting method
The squared random configuration of points is associated with a fractal dimension close to 2 for the sandbox method and of 2 for the box-counting method, i.e. a regularly distributed monitoring network for both methods. The slight difference of the sandbox method (dfSAND = 1.96) is related to the random generation of points, which allows the presence of small clusters in the local distribution. The use of increasingly constraining validity domains is reflected in the dimensional index by a consistent decrease of the fractal dimension: for the box-counting method, the generation of random samples on the Swiss surface is related to a dfBOX of 1.83, while the use of the population validity domain decreases the fractal dimension to 1.75. The fractal dimension of the real MN is equal to 1.71, and is very close to a randomly distributed network on populated areas. Differences appear only at very small cell sizes, when the effects of very local clustering appear. For larger scales (log(L) greater than 5 in the figure), the four networks are equivalent because they homogenously cover this region approximately in the same way. Regarding the sandbox method, even if the tendency of the decrease of fractality is the same, the difference between the real samples and the network generated on the populated regions VD is stronger and about 0.2. This difference can be explained by the computational method of dfSAND: the sandbox method is based on local neighborhood measures between samples and is more sensitive to high local clustering than the box-counting method, which is more a method for calculating the degree of coverage of space by the network. Since the raw SIRMN and the populated area networks cover the space in a similar way, the difference of clustering between them is less visible with the box-counting method. The real SIRMN case study has shown dimensions between 1.52 and 1.71 depending on the method used. The dimensionality of a regular two-dimensional network fluctuates around 2, while the values of the populated regions VD are closer: between 1.70 and 1.75. Since the prediction of a phenomenon such as indoor radon is strongly linked to inhabited areas, we can consider the following: if, on the one hand, the real network appears to be heavily clustered for predictions in two-dimensional space, on the other hand it appears to be homogenous for predictions in populated regions. Therefore, the modeling should be constrained to populated regions, thus avoiding the need for declustering methods.

2.5.4.2. Lacunarity

The lacunarity index calculated for the four networks is shown in Figure 2.22. The global tendency to clustering is confirmed by the curves and differences in the point patterns can be guessed. Specifically, a difference can be observed between the MN of real data and the MN of data on populated regions, on the one hand, and the two other networks, on the other.
This was expected, since the real samples have been taken in populated regions (indoor radon is measured in dwellings). The non-sampled area distribution is therefore very similar for both networks and for the curves calculated.
Figure 2.22. Lacunarity index calculated for the four networks
2.6. Conclusion

The first questions regarding environmental spatial data deal with the spatial and dimensional resolution of a given monitoring network, i.e. which phenomena can be detected by the monitoring network and at what resolutions. Clustering (non-homogeneity) and preferential sampling already give rise to biased estimates of global statistics such as mean and variance values. Therefore, correct quantification of monitoring network quality and the subsequent selection of an appropriate declustering technique are extremely important both for exploratory data analysis and spatial predictions.
In this chapter topological, statistical and fractal measures were introduced to quantify clustering of simulated and real monitoring networks. The important concept of a validity domain – important regions for the analysis and predictions – was introduced. It was demonstrated that in the case of complex regions under study relative values between indices are more important than absolute values. In fact, such studies have close relationships with traditional topics – representativity of raw data, splitting of data into training/validation/testing data subsets – and with recent trends in machine learning, such as transductive and semi-supervised learning.

2.7. References

[ALL 91] ALLAIN C. and CLOITRE M., “Characterizing the lacunarity of random and deterministic fractal sets”, Physical Review A, 44, 1991, p. 3552-3558. [ALL 00] ALLARD D. and GUILLOT G., “Clustering geostatistical data”, Proceedings of the 6th International Geostatistics Congress, Cape Town, South Africa, 2000, 15 p. [BON 07] BONJORNE de ALMEIDA L., GALETTI M., “Seed dispersal and spatial distribution of Attalea geranensis (Arecaceae) in two remnants of Cerrado”, Acta Oecologica, in press. [CRE 93] CRESSIE N., Statistics for Spatial Data, John Wiley and Sons, NY, 1993. [DEU 97] DEUTSCH C.V. and JOURNEL A.G., GSLIB, Geostatistical Software Library and User Guide, Oxford University Press, NY, 1997. [DIG 03] DIGGLE P.J., Statistical Analysis of Spatial Point Processes, second edition, Oxford University Press, London, 2003. [FED 88] FEDER J., Fractals, Plenum Press, NY, 1988. [GRU 06] DE GRUIJTER J., BRUS D., BIERKENS M.F.P. and KNOTTERS M., Sampling for Natural Resource Monitoring, Springer, NY, 2006. [KAN 04a] KANEVSKI M., MAIGNAN M., Analysis and Modelling of Spatial Environmental Data, EPFL Press, Lausanne, 2004. [KAN 04b] KANEVSKI M., MAIGNAN M., PILLER G., “Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland”, Proceedings of the XVIII International Conference Enviroinfo 2004, Geneva. [KOV 04] KOVITZ J.L., CHRISTAKOS G., “Spatial statistics for clustered data”, Stochastic Environment Research and Risk Assessment, 18(3), 2004, p. 147-166. [LEN 06] LE N.D., ZIDEK J.V., Statistical Analysis of Environmental Space-Time Processes, Springer, NY, 2006. [LOV 86] LOVEJOY S., SCHERTZER D. and LADOY P., “Fractal characterization of inhomogeneous geophysical measuring networks”, Nature, 319, 1986, p. 43-44. [LOV 87] LOVEJOY S., SCHERTZER D. and TSONIS A., “Functional Box-Counting and Multiple Elliptical Dimensions in Rain”, Science, 1987, p. 1036-1038.
[MAN 94] MANDELBROT B.B., “Fractals, lacunarity, and how it can be tuned and measured”, NONNENMACHER, T.F., LOSA, G.A., WEIBEL, E.R. (eds.), Fractals in Biology and Medecine, Birkhäuser Verlag, Boston, 1994, p. 21-28. [MOE 04] MOLLER J., WAAGEPETERSEN R.P., Statistical Inference and Simulation for Spatial Point Processes, Chapman & Hall, Boca Raton, 2004. [MOR 50] MORAN P.A.P, “Notes on continuous stochastic phenomena“, Biometrika, 37, 1950, p. 17-23. [MOR 59] MORISITA M., “Measuring of the dispersion of individuals and analysis of the distribution patterns”, Mem. Fac. Sci. Kyushu Univ., Ser E., 2, 1959, p. 214-235. [MUL 07] MULLER W.G., Collecting Spatial Data. Optimum Design of Experiments for Random Fields, Third edition, Springer, NY, 2007. [NIC 00] NICHOLSON T., SAMBRIDGE M., GUDMUNDSSON O., “On entropy and clustering in earthquake hypocenter distributions”, International Journal of Geophysics, 142, 2000, p. 37-51. [OKA 00] OKABE A., BOOTS B., SUGIHARA K., Spatial Tessellations: Concepts and Applications of Voronoï Diagrams, John Wiley & Sons, 1992. [OLE 07] OLEA R., “Declustering of Clustered Preferential Sampling for Histogram and Semivariogram Inference”, Mathematical Geology, vol. 39, 2007, p. 453-467. [OUC 86] OUCHI, T., UEKAWA, T., “Statistical analysis of the spatial distribution of earthquakes – variation of the spatial distribution of earthquakes before and after large earthquakes”, Physics of the Earth and Planetary Interiors, 44(3), 1986, p. 211-225. [PLO 96] PLOTNICK R.E., GARDNER R.H., HARGROVE W.W., PRESTEGAARD K. and PERLMUTTER M., “Lacunarity analysis: a general technique for the analysis of spatial patterns”, Physical Review E, 53, 1996, p. 5461-5468. [POZ 06] POZDNOUKHOV A. and KANEVSKI M., “Monitoring network optimization for spatial data classification using support vector machines”, Int. Journal of Environment and Pollution, vol. 28, 2006, p. 465-484. [PRE 85] PREPARATA F.P. and SHAMOS F.I., Computational Geometry, Springer, NY, 1985. [PRO 07] PRODANOV D., NAGELKERKE N., MARANI E., “Spatial clustering analysis in neuroanatomy: applications of different approaches to motor nerve fiber distribution”, Journal of Neuroscience Methods, 160, 2007, p. 93-108. [RIC 02] RICHMOND A., “Two-points declustering for weighting data pairs in experimental variogram calculations”, Computers and Geosciences, 2(2), 2002, p. 231-241. [RIP 77] RIPLEY B.D., “Modelling spatial patterns”, Journal of the Royal Statistical Society, B39, 1977, 172-212. [SHA 04] SHAHID SHAUKAT S., ALI SIDDIQUI I., “Spatial pattern analysis of seed bank and its relationship with above-ground vegetation in an arid region”, Journal of Arid Environments, 57, 2004, p. 311-327.
[SMI 89] SMITH T.G., MARKS W.B., LANGE G.D., SHERIFF W.H. and NEALE E.A., “A fractal analysis of cell images”, Journal of Neuroscience Methods, 27, 1989, p. 173-180. [SMI 96] SMITH T.G., LANGE G. and MARKS W.B., “Fractal methods and results in cellular morphology – dimensions, lacunarity and multifractals”, Journal of Neuroscience Methods, 69, 1996, p. 123-136. [STO 95] STOYAN D., KENDALL W.S., MECKE J., Stochastic Geometry and its Applications, 2nd Ed., J. Wiley and Sons, Chichester, 1995. [TES 94] TESSIER Y., LOVEJOY S., and SCHERTZER D., “Analysis and Simulation of the Global Meteorological Network”, Journal of Applied Meteorology, vol. 33, 1994, p. 1572-1586. [THI 11] THIESSEN A.H., “Precipitation average for large areas”, Monthly Weather Review, 39, 1911, p. 1082-1084. [TUI 07a] TUIA D., LASAPONARA R., TELESCA L., and KANEVSKI M., “Identifying spatial clustering phenomena in forest-fire sequences”, Physica A, 376, 2007, p. 596-600. [TUI 07b] TUIA D., KAISER C. and KANEVSKI M., “Clustering in environmental monitoring networks: dimensional resolutions and pattern detection”, in GEOENV VI: Proceedings of the Sixth European Conference on Geostatistics and Environmental Applications, Springer, 2007.
Chapter 3
Geostatistics: Spatial Predictions and Simulations
3.1. Assumptions of geostatistics Geostatistics dates back to the first introduction of kriging in 1954 [MAT 1954]. The principles of geostatistics were developed by Matheron [MAT 1963] and extended in later works [JOU 1978; CRE 1993; CHI 1999]. An independent contribution to spatial data modeling and interpolations (objective analysis of meteorological fields) was made by L. Gandin [GAN 63]. Geostatistics considers a spatial phenomenon as a random process Z(x), where the argument x denotes location in space (x=(x,y) in a two-dimensional space). Available measurement data (Z(x1), …,Z(xN)) from N locations (x1,…,xN) are treated as realizations of the random process Z(x). Measurements in spatial statistics are unique for every sampling location and represent a single existing realization unlike in classical statistics, where measurements are considered as multiple samples of a random variable. For example, we can measure porosity of a core sample from a well core plug, but cannot make a repeated measurement from the exactly identical location. The repeated measurement in this case would come from a very close but not identical location. This limitation is overcome by an assumption of Z(x) spatial continuity – a similar behavior in the vicinity of a measurement. Mathematically, the measure of continuity is described by the spatial correlation structure, which reflects how similar the values are in respect to their mutual location in space. In classical geostatistics, spatial correlation is described by a covariance function or a variogram. These characteristics are related to stationarity assumptions. Spatial Chapter written by E. SAVELIEVA, V. DEMYANOV and M. MAIGNAN.
stationarity in a strict sense states that a distribution of Z(x) is invariant to its location; i.e. the distribution function is the same in any two parts of the considered region. Thus, if we take 10 samples from one region they would feature exactly the same distribution as 10 samples from another region. This is a very heavy assumption, which is impractical in the majority of real problems. Therefore, several weaker assumptions about stationarity are used in practice:
– second-order stationarity – the mean m of Z(x) is constant over the whole study area and the covariance function C(x+h, x) for any two locations x+h and x separated by a vector h depends only on h:

E[Z(x)] = m = \mathrm{const} \quad \text{and} \quad E[(Z(x+h) - m)(Z(x) - m)] = C(x+h, x) = C(h)   [3.1]
– intrinsic hypothesis – the mathematical expectation of the difference between function Z(x) values at any two locations x+h and x separated by a vector h is zero and the variance for the differences between these values (called a variogram) depends only on vector h:

E[Z(x+h) - Z(x)] = 0 \quad \text{and} \quad \mathrm{Var}[Z(x+h) - Z(x)] = 2\gamma(h)   [3.2]
Intrinsic hypothesis is a weaker assumption than the second-order stationarity as it does not imply knowledge of the mean m. Analysis and modeling of the spatial correlation structure is the key part of any geostatistical analysis as it is directly inserted into kriging estimation procedure [JOU 78; DAV 88]. It means that geostatistical analysis usually starts with calculation of raw spatial correlation functions based on the measured data. Assuming intrinsic hypothesis, the standard formula for statistically unbiased estimation of the variance is given by [JOU 78, CLA 84; CHI 99]:

\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{i,j} \left( Z(x_i) - Z(x_j) \right)^2   [3.3]
where h is the separation vector and N(h) is the number of pairs of samples separated by h. The summation is performed for all pairs (Z(x_i), Z(x_j)) such that ||x_i − x_j|| ∈ [h − Δh, h + Δh], where Δh is the tolerance inserted in order to find a statistically sufficient number of pairs separated exactly by the separation vector h in arbitrarily distributed raw data. Tolerance Δh is also a vector composed of a lag tolerance and a direction tolerance. To prevent the same pair being used twice the sum is divided by 2 (semi-variogram). Under assumption of the second-order stationarity the statistical estimate of the covariance is used instead of the variogram.
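An omnidirectional version of estimator [3.3], with a simple distance tolerance only (no direction tolerance), can be sketched as follows; the function name and the lag handling are assumptions of the illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

def experimental_variogram(coords, values, lags, tol):
    """Omnidirectional experimental semi-variogram (equation [3.3]): for each
    lag h, average 0.5*(Z(xi) - Z(xj))**2 over pairs whose separation distance
    lies in [h - tol, h + tol]."""
    d = pdist(coords)                                         # pairwise distances
    sq = pdist(values.reshape(-1, 1), metric="sqeuclidean")   # (Z(xi) - Z(xj))**2
    gamma = []
    for h in lags:
        mask = (d >= h - tol) & (d <= h + tol)
        gamma.append(0.5 * sq[mask].mean() if mask.any() else np.nan)
    return np.array(gamma)
```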
Under assumption of the second-order stationarity both the covariance and the variogram exist and are related: γ(h) = C(0) − C(h), where C(0) is a prior variance of the random process. There are some problems with estimation of a raw variogram associated with poor selection of tolerance, clustered samples, deterministic trends, outliers, etc. [DAV 88, PAR 91]. To improve the situation and to simplify the spatial correlation structure analysis, preliminary statistical analysis [KAN 04], a declustering procedure [JOU 83, DEU 89, CRO 83, SCH 93, BOU 97] and treating outliers [PAR 91] were proposed. The behavior of a variogram can depend on the orientation of the vector h separating the pairs. Such a situation is known as anisotropy [JOU 78]. Details of different types of variogram anisotropy are described in [ZIM 93]. Now, let us briefly consider key geostatistical models used for spatial predictions and spatial simulations.

3.2. Family of kriging models

There are several kriging models (kriging family) with the same basic principles differing by some assumptions (or knowledge) on the data (process). Any kriging belongs to the BLUE (best linear unbiased estimator) class. Consequently, the basic principles of a kriging model are:
– it is a linear estimator – a kriging estimate Z*(x0) at location x0 is obtained as a linear combination of known values Z(xi):
Z^*(x_0) = \sum_{i=1}^{N(x_0)} w_i(x_0) \, Z(x_i)   [3.4]
where N(x0) is the number of samples from the neighborhood of x0 taken into account for the estimation;
– it is an unbiased estimator – the mean value is reproduced by the kriging estimate (E{Z*(x0)} = E{Z(x0)});
– it is the best estimator among all estimators of the linear class, which minimizes the estimation error variance (Var{Z*(x0) − Z(x0)} → min). The value of the variance can be estimated together with the corresponding kriging estimate. This is called the kriging variance and is referred to as the kriging error.
Conditions of unbiasedness and estimation variance minimization are used to find the weights wi for the linear estimator. They lead to a system of linear equations, the form of which depends on the type of kriging, some of which are described below. 3.2.1. Simple kriging Simple kriging works under the assumption of second-order stationarity and with known mean values (E[Z(x)]=m) [JOU 78; CRE 93; CHI 99]. Knowledge of the mean automatically provides the unbiasedness of the estimator:
Z^*(x) = m + \sum_{i=1}^{n(x)} w_i(x) \, (Z(x_i) - m)   [3.4a]
The estimation error minimization leads to the system of simple kriging equations:

\sum_{j=1}^{N(x_0)} w_j(x_0) \, C(x_i - x_j) = C(x_i - x_0), \quad i = 1, \ldots, N(x_0)   [3.5]
where wj(x0) are the weight coefficients for the linear combination and C(xi − xj) is the covariance for the vector separating the locations xi and xj. The covariance matrix is easy to estimate in the case with the known mean. This system of equations has a single solution if the covariance matrix is non-singular, the covariance model is positive definite and there are no collocated samples in the data. The variance of a simple kriging estimate is given by the so-called simple kriging variance σSK:
\sigma_{SK}(x_0) = C(0) - \sum_{i=1}^{N(x_0)} w_i \, C(x_i - x_0)   [3.6]
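A minimal simple kriging sketch, directly solving system [3.5] and computing the variance [3.6] at a single location, is given below; the isotropic exponential covariance model and the function name are illustrative assumptions.

```python
import numpy as np

def simple_kriging(coords, values, x0, mean, cov):
    """Simple kriging estimate and variance at x0 (equations [3.4a], [3.5], [3.6]);
    cov(h) is an isotropic covariance model of the separation distance and the
    mean is assumed known."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = cov(d)                                       # data-to-data covariances
    c0 = cov(np.linalg.norm(coords - x0, axis=1))    # data-to-target covariances
    w = np.linalg.solve(C, c0)                       # simple kriging weights
    estimate = mean + w @ (values - mean)
    variance = cov(0.0) - w @ c0
    return estimate, variance

# illustrative exponential covariance model: sill 1.0, practical range 50
cov_model = lambda h: np.exp(-3.0 * np.asarray(h) / 50.0)
```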
3.2.2. Ordinary kriging Ordinary kriging differs from simple kriging by the unknown mean. The mean is assumed to be constant over the field, but it is unknown. Such a situation is more realistic, as the mean value is usually unknown for the real field and is not necessarily adequately represented by the sample mean.
An ordinary kriging estimator [JOU 78; CRE 93; CHI 99] works under the assumption of intrinsic hypothesis. The lack of knowledge leads to additional assumptions. To fulfil the unbiasedness, an additional constraint is imposed over the weights:

\sum_{i=1}^{N(x_0)} w_i(x_0) = 1   [3.7]
Minimization of the estimation error provides the system of equations called an ordinary kriging system (N(x0)+1 linear equations with N(x0)+1 unknowns). In a more general way it can be written in terms of the variogram:

\sum_{j=1}^{N(x_0)} w_j(x_0) \, \gamma_{ij} + \mu = \gamma_{i0}, \quad i = 1, \ldots, N(x_0)
\sum_{j=1}^{N(x_0)} w_j(x_0) = 1   [3.8]
where μ is a Lagrangian multiplier, introduced because of the variance minimization with a constraint, and γ is a semi-variogram. Note that γij can be calculated for each data pair Z(xi) and Z(xj) (i, j = 1, …, N(x0) – the number of data in the neighborhood of x0), whereas γi0 is approximated by the fitted theoretical variogram model γ(h), with the separating vector h = xi − x0 as an argument. An ordinary kriging variance appears as follows:
\sigma_{OK}(x_0) = \mu + \sum_{i=1}^{N(x_0)} w_i(x_0) \, \gamma_{i0}   [3.9]
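The ordinary kriging system [3.8] and variance [3.9] can be sketched in the same style; the spherical variogram helper and the function name are illustrative assumptions rather than the book's implementation.

```python
import numpy as np

def ordinary_kriging(coords, values, x0, gamma):
    """Ordinary kriging estimate and variance at x0 (equations [3.8], [3.9]);
    gamma(h) is a fitted variogram model of the separation distance."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(d)                  # gamma_ij between data locations
    A[n, n] = 0.0                         # last row/column carry the unit-sum constraint
    b = np.append(gamma(np.linalg.norm(coords - x0, axis=1)), 1.0)
    sol = np.linalg.solve(A, b)
    w, mu = sol[:n], sol[n]               # weights and Lagrange multiplier
    return w @ values, mu + w @ b[:n]     # estimate, ordinary kriging variance

def spherical(h, sill=3.0, a=50.0):
    """Illustrative spherical variogram model (no nugget), sill 3 and range 50."""
    h = np.minimum(np.asarray(h, dtype=float), a)
    return sill * (1.5 * h / a - 0.5 * (h / a) ** 3)
```

Varying the parameters of the variogram model passed to gamma reproduces the kind of sensitivity of the estimates illustrated below.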
3.2.3. Basic features of kriging estimation All models of the kriging family have a set of common features according to the basic principles of kriging. First, kriging weights do not depend on the variable values, as they are defined by a spatial correlation structure described by the covariance or the variogram model. Second, kriging is an exact estimator – the estimate honors the conditioning data exactly. Third, kriging features a smoothing effect on the estimates – the kriging estimate cannot exceed the data maximum or go below the data minimum. Smoothing is characterized by the variability of the kriging estimates. Fourth, the kriging variance is not higher than the variance of the initial data.
Figures 3.1 and 3.2 illustrate kriging estimates calculated in the 1D case. The estimates exactly honor the conditioning data and smoothly interpolate in between. Simple kriging (SK) and ordinary kriging (OK) estimates are quite similar except in the extrapolation regions at the edges. Kriging with the variogram model (correlation range 50) with a nugget effect (40% of the a priori variance) provides quite different estimates: they also exactly honor the observation data but feature much higher variance around the data locations, which results in a spiky pattern as if the data are less representative of the general function pattern. The kriging variance plotted in Figure 3.2 shows just slightly higher values for OK. The kriging model with high nugget clearly demonstrates high estimation variance between the data. The variance at the data locations is zero as the data values are reproduced exactly.
Figure 3.1. Kriging estimates in 1D case (spherical variogram, correlation range – 50, nugget – 2, sill – 3)
Figure 3.2. Kriging estimation error in 1D case
Smoothness of the kriging estimate depends on the variogram model used and in particular its correlation range. Ordinary kriging estimates with long (10) and short (1) variogram model ranges are presented in Figure 3.3. Both estimates honor the data points exactly and demonstrate different smoothness in between the data.
Figure 3.3. Kriging estimates with long and short variogram model ranges in 1D case
Figure 3.4 presents a 2D example of ordinary kriging estimates calculated with different variogram parameter values. A higher nugget results in a smoother estimated pattern, because the estimates are allowed to deviate more from the conditioning data towards the mean (compare Figures 3.4a and b). Different orientations of the anisotropy and different ratios between the ranges in the perpendicular directions may result in large differences between the estimated patterns (see Figure 3.4c–f). As the estimations are based on only 6 points, the uncertainty about the variogram parameters is very large (it is impossible to build a consistent variogram model based on just 6 items of data). The basic variogram model was chosen to be of the spherical type, whereas the nugget, the correlation range and its orientation may vary.
Figure 3.4. Ordinary kriging estimates in 2D with different variogram model parameter values: a) nugget=0, angle=0, range along=5, range across=5; b) nugget=0.5, angle=0, range along=5, range across=5; c) nugget=0, angle=90, range along=40, range across=10; d) nugget=0, angle=90, range along=40, range across=20; e) nugget=0, angle=90, range along=40, range across=30; f) nugget=0, angle=30, range along=40, range across=20
One of the benefits of all kriging models is the estimate of the corresponding kriging variance, which describes the distribution of the estimation error. Under some assumptions the kriging variance can be considered as a measure of the estimate's uncertainty. It is important to remember that the kriging variance is not conditional to the data values. It is defined by the initial sample locations (sample density) and the spatial correlation structure (this feature has already been illustrated in Figure 3.4). The kriging variance is higher in zones with a lower density of initial samples. This property is illustrated in Figure 3.5, where kriging estimates are plotted along with the kriging estimation variance. The data used for kriging come from the Spatial Interpolation Comparison contest SIC 2004 (see [SAV 05]). Comparison of the two plots shows that the kriging variance depends on the sample density (plotted with + marks) and does not depend on the actual values of the samples.
Figure 3.5. Kriging estimate (a) and kriging variance (b); crosses indicate initial data locations [SAV 05]
3.2.4. Universal kriging (kriging with trend)

The assumption of a constant mean made for simple or ordinary kriging is sometimes difficult to accept. However, in some cases it is possible to account for local variations by assuming a smooth trend function m(x) as the local mean. Universal kriging models the local mean (the trend of the function) as a linear combination of basic functions fk(x), k = 0,…,K:
$m(x') = \sum_{k=0}^{K} a_k(x')\,f_k(x'), \qquad a_k(x') \approx a_k \ \ \text{constant but unknown for } x' \in W(x)$    [3.10]
The requirement of unbiasedness in this case leads to a set of constraints:

$\sum_{i=1}^{N(x_0)} w_i(x_0)\,f_k(x_i) = f_k(x_0), \quad k = 0, \ldots, K$    [3.11]
The system of universal kriging equations appears as follows:

$\sum_{j=1}^{N(x_0)} w_j(x_0)\,C_R(x_i - x_j) + \sum_{k=0}^{K} \mu_k(x_0)\,f_k(x_i) = C_R(x_i - x_0), \quad i = 1, \ldots, N(x_0)$
$\sum_{j=1}^{N(x_0)} w_j(x_0) = 1$
$\sum_{j=1}^{N(x_0)} w_j(x_0)\,f_k(x_j) = f_k(x_0), \quad k = 1, \ldots, K$    [3.12]

where CR(h) is the covariance function of the residuals (R(x) = Z(x) − m(x)). A universal kriging variance can also be introduced:
$\sigma^2_{UK}(x_0) = C_R(0) - \sum_{i=1}^{N(x_0)} w_i(x_0)\,C_R(x_i - x_0) - \sum_{k=0}^{K} \mu_k(x_0)\,f_k(x_0)$    [3.13]
3.2.5. Lognormal kriging

Lognormal kriging was an early attempt to perform nonlinear kriging [REN 79; DOW 82; HAA 90]. It is usually applied to data with a lognormal statistical distribution.
Lognormal kriging is ordinary kriging performed on a transformed variable (Y(x) = ln(Z(x))). The semi-variogram is calculated and modeled for the transformed values. Then, the ordinary kriging system [3.8] is solved for the transformed values and the estimates of the transformed variable are obtained. The main problem with lognormal kriging concerns the back transform to the original scale in a way that preserves the unbiasedness of the final estimate. In the case of a lognormal distribution, the back transform of the estimate obtained in the logarithmic scale uses the kriging variance (σ²OK(x0)) and the Lagrange multiplier (μ):
$Z^*(x_0) = \exp\left\{ Y^*(x_0) + \tfrac{1}{2}\,\sigma^2_{OK}(x_0) - \mu \right\}$    [3.14]
Figure 3.6. Experimental variogram for original data (a) and lognormal data (b)
The variogram plots in Figure 3.6 show how sensitive the variogram can be to the lognormal transformation of the data. It is difficult to detect any correlation in the original data due to the influence of very large values (see Figure 3.6a). However, after the log transformation, which mitigates the influence of the high-value tail, a stationary correlation structure becomes clear (see Figure 3.6b). Thus, the transformation allows kriging to be applied to the log-transformed values, which is called lognormal kriging.
3.3. Family of co-kriging models

Spatial prediction problems sometimes involve more than one variable. Observations of the spatial variables can often be correlated due to their origin or to natural phenomena (e.g. rainfall and cloud cover, temperature and elevation, porosity and permeability, contamination from radioactive nuclides). Joint consideration of correlated data improves the accuracy of spatial predictions and allows us to use a large number of cheaper measurements to estimate a variable characterized by only a few items of data, which may be difficult to obtain. Geostatistics offers a selection of spatial interpolation methods which account for secondary correlated data. In the simpler kriging models, secondary data can be used in the form of an external drift, which forms a trend surface to be used in the estimates. The family of co-kriging models allows us to account for the secondary variable more accurately, including its own spatial correlation structure.

3.3.1. Kriging with linear regression

A simple linear regression relation between the two variables can be used to derive the value of the correlated variable from the one already estimated, e.g. using ordinary kriging. This method is very simple, computationally inexpensive and does not require any additional variogram modeling. It only requires the estimated value of the first variable at every location where the second variable is to be estimated (presumably a grid) and the linear regression coefficient. The latter has to be defined from prior knowledge of the relationship between the two variables and the correlation between their data distributions. However, linear regression suffers from a few drawbacks due to its simplicity. Using linear regression assumes that both variables have the same spatial correlation structure, which may not be the case. Furthermore, the estimated values of the second variable mimic the spatial distribution of the first variable and may not exactly reproduce the conditioning data. This means that both variables will have exactly the same spatial variability and distribution, but on different scales according to the variable ranges. This can lead, for example, to an under-estimation of the peak values and an over-estimation of the low values of the second variable if its distribution is more variable than the distribution of the first variable.

3.3.2. Kriging with external drift

Kriging with external drift models the trend with the help of another function y(x) defined over the same field (secondary variable):
$m(x) = a_0(x) + a_1(x)\,y(x)$    [3.15]
Kriging with external drift can be considered as a modification of kriging with trend (section 3.2.4); the difference lies in the trend modeling. Kriging with trend [3.10] becomes kriging with external drift if K = 1 and f1(x) = y(x):

$\sum_{i=1}^{N(x_0)} w_i(x_0)\,C_R(x_j - x_i) + \mu_0(x_0) + \mu_1(x_0)\,y(x_j) = C_R(x_j - x_0), \quad j = 1, \ldots, N(x_0)$
$\sum_{i=1}^{N(x_0)} w_i(x_0) = 1$
$\sum_{i=1}^{N(x_0)} w_i(x_0)\,y(x_i) = y(x_0)$    [3.16]
where CR(h) is the covariance function of the residuals (R(x) = Z(x) − m(x)). It is important to remember that, to apply kriging with external drift, the secondary variable needs to be known at every location to be estimated.

3.3.3. Co-kriging

Co-kriging is a generalization of kriging to the multivariate case. An estimate is carried out as a linear combination of the variable under estimation (the primary variable Zα1) and other variables (K secondary variables Zα):
$Z^*_{\alpha_1}(x_0) = \sum_{\alpha=1}^{K} \sum_{i=1}^{n_\alpha} w_{\alpha i}\, Z_\alpha(x_i)$    [3.17]
To perform co-kriging we need to calculate cross-variograms (cross-covariances) for pairs of variables, describing the spatial cross-correlation structure of every pair of variables, in addition to the auto-variograms (covariances) of every variable:

$\gamma_{\alpha\beta}(h) = \tfrac{1}{2}\,E\big[ (Z_\alpha(x+h) - Z_\alpha(x))\,(Z_\beta(x+h) - Z_\beta(x)) \big]$
$C_{\alpha\beta}(h) = \frac{1}{N(h)} \sum_{i=1}^{N(h)} Z_\alpha(x_i)\,Z_\beta(x_i+h) - m_{Z_\alpha} m_{Z_\beta}$
Co-kriging can be simple or ordinary. Simple co-kriging satisfies the unbiasedness requirement automatically. In the case of ordinary co-kriging, unbiasedness is achieved by fulfilment of the constraint

$\sum_{i=1}^{n_\alpha} w_{\alpha i} = \delta_{\alpha \alpha_0} = \begin{cases} 1, & \alpha = \alpha_0 \\ 0, & \text{otherwise} \end{cases}$    [3.18]
An ordinary co-kriging equation system has the form:

$\sum_{j=1}^{K} \sum_{\beta=1}^{n_j} w_{\beta j}\,\gamma_{ij}(h_{\alpha\beta}) + \mu_i = \gamma_{i i_0}(h_{\alpha 0}), \quad \alpha = 1, \ldots, n_i; \ \ i = 1, \ldots, K$
$\sum_{\beta=1}^{n_i} w_{\beta i} = \delta_{i i_0}, \quad i = 1, \ldots, K$    [3.19]
Co-kriging also allows us to estimate the error variance (the co-kriging variance):

$\sigma^2_{CK} = \sum_{i}\sum_{\alpha} w_{\alpha i}\,\gamma_{i i_0}(h_{\alpha 0}) + \mu_{i_0} - \gamma_{i_0 i_0}(0)$    [3.20]
3.3.4. Collocated co-kriging

Collocated co-kriging is a modification of full co-kriging for the case of a linear correlation between the variables. It simplifies the full co-kriging equations by deriving the cross-covariance term C12 from the auto-covariance of the primary variable and by dropping the auto-covariance model of the secondary variable C22(xα − x), leaving just the a priori variance of the secondary variable C22(0):

$\rho = \frac{C_{12}(0)}{C_{11}(0)}, \qquad C_{12}(h) = \rho\, C_{11}(h)$    [3.21]
where ρ is the linear regression correlation coefficient and h = xαi − x is the distance vector from the estimation location x to the observation location xαi. Collocated co-kriging requires the value of the correlated secondary variable at each location to be estimated. For collocated co-kriging we can consider
both simple and ordinary cases, analogous to kriging and full co-kriging models. The collocated simple co-kriging estimate in the case of two variables Z1 and Z2 at a location x is given by:
$Z^*_{SCK}(x) = \sum_{i_1=1}^{n_1(x)} w_{i_1}^{SCK}(x)\,\big[ Z_1(x_{i_1}) - m_1 \big] + w_2^{SCK}(x)\,\big[ Z_2(x) - m_2 \big] + m_1$    [3.22]
where m1 and m2 are the known means of the corresponding variables. The weights wi1 and w2 are determined from the following system of equations for the k1 observations of the primary variable Z1:

$\sum_{i_1=1}^{n_1(x)} w_{i_1}^{SCK}(x)\,C_{11}(x_{\alpha_1} - x_{i_1}) + w_2^{SCK}(x)\,\rho\,C_{11}(x_{\alpha_1} - x) = C_{11}(x_{\alpha_1} - x), \quad \alpha_1 = 1, \ldots, k_1$
$\sum_{i_1=1}^{n_1(x)} w_{i_1}^{SCK}(x)\,\rho\,C_{11}(x - x_{i_1}) + w_2^{SCK}(x)\,C_{22}(0) = \rho\,C_{11}(0)$    [3.23]

where C11 and C22 are the auto-covariances of the correlated variables Z1 and Z2.
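A minimal numerical sketch of the collocated simple co-kriging system [3.23] is given below (not part of the original text). For simplicity it assumes that both variables have been standardized to zero mean and unit variance, so that ρ coincides with the linear correlation coefficient and m1 = m2 = 0 in [3.22]; the covariance model, data locations and values are illustrative.

```python
# A minimal sketch of collocated simple co-kriging (equations [3.22]-[3.23]),
# assuming standardized variables and the Markov-type model C12(h) = rho*C11(h)
# of [3.21]; all values below are illustrative.
import numpy as np

def cov11(h, rng=30.0):
    """Exponential auto-covariance model of the primary variable (unit sill)."""
    return np.exp(-3.0 * np.abs(h) / rng)

def collocated_sck(x1, z1, z2_at_x0, rho, x0):
    n = len(x1)
    lhs = np.zeros((n + 1, n + 1))
    lhs[:n, :n] = cov11(x1[:, None] - x1[None, :])   # C11 between primary data
    lhs[:n, n] = rho * cov11(x1 - x0)                # cross-covariance to the collocated datum
    lhs[n, :n] = rho * cov11(x0 - x1)
    lhs[n, n] = 1.0                                  # C22(0) = 1 for a standardized secondary variable
    rhs = np.append(cov11(x1 - x0), rho * cov11(0.0))
    w = np.linalg.solve(lhs, rhs)
    # equation [3.22] with m1 = m2 = 0 for standardized variables
    return np.dot(w[:n], z1) + w[n] * z2_at_x0

x1 = np.array([2.0, 11.0, 25.0, 40.0])      # primary data locations (hypothetical)
z1 = np.array([-0.4, 0.9, 0.1, -1.2])       # standardized primary values
print(collocated_sck(x1, z1, z2_at_x0=0.7, rho=0.8, x0=18.0))
```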
Collocated co-kriging is faster than full co-kriging, avoids the instability caused by highly redundant secondary data and does not require modeling of the cross-covariance function C12(h) or the secondary covariance C22(h) at lags |h| > 0. The trade-off is that a secondary variable value must be provided at every estimated location, while the secondary data at locations other than those being estimated are ignored. Collocated co-kriging is also better at avoiding the screen effect because of its limited use of secondary data.

3.3.5. Co-kriging application example

This case study deals with the prediction of air temperature in Kazakh Priaralie. The selected region covers 1,400,000 km2 and contains 400 monitoring stations. The primary variable is the average long-term air temperature (°C) in June. The additional information is the elevation of the locations above sea level, which is available on a dense grid from a Digital Elevation Model.
The correlation between the air temperature and the altitude is linear and equals 0.9 (see Figure 3.7). The linearity of the correlation allowed us to use any of the geostatistical models (co-kriging, collocated co-kriging and kriging with external drift) for the modeling. Comparison of the methods is performed on a specially selected validation data set, not used during the estimation. The similarity between the training and the validation data sets was controlled by comparing summary statistics, histograms and spatial correlation structures (variograms). Similarity of the spatial structures of the obtained datasets to those of the initial data was considered even more important than the statistical factors.
Figure 3.7. The scatter plot between altitude and air temperature in June
The results of the geostatistical modeling are presented in Table 3.1 (errors on the validation dataset) and in Figure 3.8 (the estimates on the dense grid with known altitude values). It can be seen that the best validation results among the geostatistical methods are obtained with kriging with external drift. Co-kriging results are worse than kriging results because of the screening effect [WAC 95]. Kriging and collocated co-kriging demonstrate similar patterns, while kriging with external drift reproduces not only the large-scale structure but also the small-scale variability ignored by the kriging and co-kriging models.
Figure 3.8. Geostatistical estimates of air temperature in June on a grid: (a) kriging, (b) co-kriging, (c) collocated co-kriging and (d) kriging with external drift
Model                         Correlation   RMSE   MAE    MRE
Kriging                       0.874         3.13   2.04   -0.06
Co-kriging                    0.796         3.97   2.45   -0.11
Collocated co-kriging         0.881         3.05   1.95   -0.07
Kriging with external drift   0.984         1.19   0.91   -0.03

Table 3.1. The air temperature test results for geostatistical models
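The validation scores reported in Table 3.1 are simple to compute for any estimator. The following sketch (not from the original text) assumes arrays of observed and estimated temperatures at the validation stations; the numbers are illustrative, and the MRE is taken here as the mean of the signed relative errors, which is an assumption about the exact definition used.

```python
# A small sketch of the validation statistics of Table 3.1 (correlation, RMSE,
# MAE, MRE); the observed/estimated values below are illustrative only.
import numpy as np

def validation_scores(observed, estimated):
    err = estimated - observed
    return {
        "Correlation": np.corrcoef(observed, estimated)[0, 1],
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
        "MRE": np.mean(err / observed),   # mean signed relative error (assumed definition)
    }

observed = np.array([24.1, 22.7, 25.3, 21.8, 23.9])
estimated = np.array([23.5, 23.0, 24.8, 22.5, 23.6])
print(validation_scores(observed, estimated))
```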
3.4. Probability mapping with indicator kriging

The analysis of environmental spatial data is always associated with uncertainty. Major uncertainties arise from the limited amount of initial information (there is an infinite number of possible distributions honoring any initial data), sampling uncertainty, unpredictable stochastic source terms and measurement errors. Thus, the presence of uncertainty is impossible to ignore, and special attention should be paid to its characterization. Classical geostatistics (the kriging family of regression methods) accompanies an estimate with a variance, which can (under some assumptions) be treated as a description of the uncertainty. Another way of dealing with uncertainty is a probabilistic approach. It replaces estimates of values by estimates of the local probability density functions (pdf). Post-processing of the local pdf makes it possible to map the probability of exceeding a value (for example, a critical level of contamination), the probability of belonging to an interval of values (for example, of finding an important amount of biomass), etc. The probabilistic description and the corresponding maps are linked to risk analysis through the estimation of the probability of a dangerous event. Thus, geostatistical probability mapping is often treated as risk mapping. Indicator kriging is a non-parametric approach to estimate the local probability distribution function and to perform all types of probability and risk mapping. The non-parametric approach makes no assumptions about a model of the probability distribution [JOU 83]. The only assumption of indicator kriging concerns the spatial continuity of the process; this means that information on what is going on at a spatial location can be obtained from the values at the neighboring locations. This approach is useful for the reconstruction of a local probability distribution function and for the estimation of the probability of a specific event. Indicator kriging is the kriging of a binary indicator variable. It can be applied to both categorical and continuous variables, which are nonlinearly transformed into binary values.

3.4.1. Indicator coding

Indicator coding is a non-parametric estimation based on the discretization of the probability distribution as a step function over a series of K threshold (cut-off) values zk of the continuous variable Z(x):
$F(x; z_k \mid (n)) = \mathrm{Prob}\{ Z(x) \le z_k \mid (n) \}, \quad k = 1, \ldots, K$    [3.24]
Each sample is coded as a vector of K values I(x,zk), where I(x,zk) is the indicator transform carried out as follows:
$I(x, z_k) = \begin{cases} 1, & Z(x) \le z_k \\ 0, & \text{otherwise} \end{cases}$    [3.25]
Figure 3.9 illustrates the indicator transformation of well-log measurements. A threshold value zk is used to divide the continuous log data into two categories, assuming two types of geological facies, which correspond to the well-log ranges above and below the threshold value. Two binary indicator variables I1 and I2 are constructed for the two chosen categories. In the case of two indicator variables, a one-step pdf is estimated by indicator kriging at each local point. A more detailed pdf is obtained by increasing the number of thresholds and the corresponding indicator variables.
Figure 3.9. Indicator transformation of continuous and categorical data into binary variables
The basic point of the indicator approach is to select a set of thresholds. The number of thresholds has to be sufficient to represent the probability distribution, while the computational effort has to remain reasonable. In practice the reasonable number of thresholds depends on the number of samples and lies in the interval [5, 15]. The values of the K thresholds are selected so as to split the data into K+1 classes containing a nearly equal number of samples (e.g. using the K corresponding quantiles). Full indicator kriging requires a set of variogram models, one for each indicator variable.
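In practice the indicator coding [3.25] amounts to comparing each sample with the chosen thresholds. A minimal sketch (not from the original text) is given below, with thresholds taken as quantiles of an illustrative data set.

```python
# A minimal sketch of indicator coding [3.25]: continuous samples are transformed
# into K binary indicators at thresholds chosen as quantiles of the data; the
# sample values below are illustrative.
import numpy as np

def indicator_transform(z, thresholds):
    """Return an (n_samples, K) matrix with I(x, z_k) = 1 if Z(x) <= z_k, else 0."""
    z = np.asarray(z)
    return (z[:, None] <= np.asarray(thresholds)[None, :]).astype(int)

z = np.array([1.2, 7.9, 3.4, 22.5, 0.6, 14.8, 9.1, 5.5])
# K = 3 thresholds splitting the data into 4 classes of roughly equal size
thresholds = np.quantile(z, [0.25, 0.5, 0.75])
print(thresholds)
print(indicator_transform(z, thresholds))
```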
The conditional probability can be interpreted in terms of indicators, i.e. as an expectation of an indicator conditioned on the neighboring information (n):

$F(x; z_k \mid (n)) = E\{ I(x; z_k) \mid (n) \}$    [3.26]
The indicator transform is also applicable to categorical data (the classification problem). The categorical information is defined as a set of possible states sk, k = 1,…,K. Each location of the area under study belongs to one of the states (S(x)). The uncertainty is modeled analogously to the continuous case by a conditional probability distribution function:

$p(x; s_k \mid (n)) = \mathrm{Prob}\{ S(x) = s_k \mid (n) \}, \quad k = 1, \ldots, K$    [3.27]
Here again (n) indicates the neighboring information used for the estimation of the probability distribution, for the same reasons as in the continuous case. The indicator transform for categorical data is also similar to the continuous case. It can be seen in Figure 3.9, where a categorical variable (geological facies) is transformed into an indicator variable. In the multi-class case, the transform is performed for all classes; the possible states (classes) replace the threshold values. The indicator transform appears as follows:

$I(x; s_k) = \begin{cases} 1, & S(x) = s_k \\ 0, & \text{otherwise} \end{cases} \quad k = 1, \ldots, K$    [3.28]
3.4.2. Indicator kriging
Most often, members of the kriging family are used for the estimation of the unknown indicators; this can be considered as simple or ordinary kriging performed on indicators. The spatial correlation structure of the indicators is used during the estimation, which is why careful selection of the thresholds is very important: too many (or too few) zero indicator values significantly complicate variogram modeling. As for kriging, there exist different types of indicator kriging: simple indicator kriging (where the indicator mean is known [SOL 86]) and ordinary indicator kriging (where the indicator mean is constant throughout the area but unknown). A linear form provides the estimate of the indicator variable:

$i^*(x; z_k) = \sum_{i=1}^{n(x)} w_i(x; z_k)\, I(x_i; z_k)$    [3.29]
with the coefficients wi(x; zk) obtained by solving the (simple or ordinary) kriging equation system. To construct the ccdf using indicator kriging we need to calculate and model K semi-variograms and to solve K kriging systems for each location under estimation. Averaged over the whole zone, the indicator gives the global probability distribution function:

$E\{ I(x; z_k) \} = F(z_k)$    [3.30]
This allows us to treat the proportions of samples with values below zk as the means for simple indicator kriging. In the case of a categorical variable, the means are replaced by the a priori class probabilities, i.e. the expected proportions of each class. This works in the case of stationarity. If there are local concentrations of high values (hot spots), i.e. in the case of only local stationarity, the a priori probability distribution functions used as means for simple kriging do not correctly reflect the local situations; ordinary kriging then provides more reliable local estimates. Indicator kriging uses information on only one threshold at a time. The co-kriging formalism allows us to use the information on all K thresholds. In theory, co-kriging is a better estimator since it uses the information on all thresholds. However, in practice the computational complications of co-kriging outweigh the improvement of the obtained estimates: indicator co-kriging with K thresholds requires the calculation and modeling of K(K+1)/2 auto- and cross-semi-variograms of the indicators, and co-kriging matrices are larger than kriging matrices. The probabilistic description of the indicator estimates imposes some constraints on them. For each location these constraints are the following [GOO 97]:
$[F(x; z_k \mid (n))]^* = i^*(x, z_k) \in [0, 1]$    [3.31]
$[F(x; z_k \mid (n))]^* = i^*(x, z_k) \ \le\ [F(x; z_{k'} \mid (n))]^* = i^*(x, z_{k'}), \quad \forall\, z_{k'} > z_k$    [3.32]
Sometimes these constraints are violated at some locations. Such situations need to be detected and corrected. Usually, this is carried out by simple averaging of an upward and a downward correction. The upward correction enforces the non-decreasing behavior of the estimates as zk increases, while the downward correction enforces the non-increasing behavior in the opposite direction.
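A minimal sketch of this averaging correction is given below (not from the original text); the ccdf values estimated at the thresholds are illustrative.

```python
# A minimal sketch of the order-relation correction: the ccdf values estimated by
# indicator kriging are clipped to [0, 1] and made non-decreasing by averaging an
# upward and a downward correction pass; the input values are illustrative.
import numpy as np

def correct_order_relations(ccdf):
    """Clip to [0, 1] and enforce a non-decreasing ccdf across thresholds."""
    ccdf = np.clip(np.asarray(ccdf, dtype=float), 0.0, 1.0)
    upward = np.maximum.accumulate(ccdf)                   # forward pass: non-decreasing
    downward = np.minimum.accumulate(ccdf[::-1])[::-1]     # backward pass: non-increasing in reverse
    return 0.5 * (upward + downward)

# ccdf estimated at K = 6 thresholds, with two order-relation violations
raw = [0.10, 0.28, 0.25, 0.55, 0.52, 0.90]
print(correct_order_relations(raw))
```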
In order to use probability distribution values for arbitrary thresholds, the obtained discrete values have to be interpolated through the intervals and extrapolated at the ends. Linear interpolation is a simple and rather good method for internal intervals:
$[F(z)]_{Lin} = F^*(z_{k-1}) + \left[ \frac{z - z_{k-1}}{z_k - z_{k-1}} \right] \big[ F^*(z_k) - F^*(z_{k-1}) \big], \quad z \in (z_{k-1}, z_k]$    [3.33]
The tails of the estimated distribution function can be interpolated in the same way, only the expected local minimum or maximum is required. For the upper tail value, a hyperbolic model can be used. This allows us to extrapolate the positively skewed upper tail of a distribution toward an infinite upper bound. The hyperbolic model is the following:
$[F(z)]_{Hyp} = 1 - \frac{\lambda}{z^{\omega}}, \quad z > z_K$    [3.34]
where the parameter ω ≥ 1 controls the speed at which the cdf reaches its limiting value: the smaller ω is, the longer the tail of the distribution. The parameter λ is identified from the sample cumulative frequency F*(zK):
$\lambda = z_K^{\omega} \big[ 1 - F^*(z_K) \big]$    [3.35]
In the categorical case the probability interpretation also imposes constraints on the estimated indicators:

$[p(x; s_k \mid (n))]^* = i^*(x; s_k) \in [0, 1], \quad k = 1, \ldots, K$    [3.36]
$\sum_{k=1}^{K} [p(x; s_k \mid (n))]^* = \sum_{k=1}^{K} i^*(x; s_k) = 1$    [3.37]
The following two corrections help if constraints [3.36] and [3.37] are violated. First, all estimates with a conditional probability outside the interval [0, 1] are set to the closest bound. Second, all K estimates are standardized by their sum so as to satisfy constraint [3.37].
3.4.3. Indicator kriging applications
3.4.3.1. Indicator kriging for 241Am analysis

The indicator approach was applied to 241Am soil contamination data [KAN 03]. The aim was to reconstruct the local distribution functions at the validation locations and, based on them, to estimate the probability of exceeding the given thresholds: 17, 27 and 38 pCi/g. The analysis was performed on 163 samples. A more detailed description of the data can be found in [KAN 06].
The most appropriate levels for reconstructing a distribution function are quantiles of the initial data. In this case, 7 quantiles (with corresponding values 1.83, 4.79, 7.295, 8.67, 11.111, 14.874 and 21.71 pCi/g) were selected. Two additional cut-offs at the critical levels of 27 and 38 pCi/g were considered to better account for the high-value tail of the distribution. Figure 3.10 presents the results of the indicator transform for four cut-offs (1.83, 8.67, 21.71 and 38 pCi/g). Black filled circles represent locations with an indicator equal to one; white circles indicate zero indicators. Indicator variograms for the selected cut-offs were calculated and fitted with theoretical models. Ordinary indicator kriging was performed and the local cumulative probability distribution functions were estimated. Figure 3.11 presents 15 examples of the local cumulative pdfs.
Figure 3.10. Indicator transform for four cut-off levels (1.83, 8.67, 21.71 and 38 pCi/g). Black filled circles represent locations with indicator equal to 1; white circles indicate zero indicators
Figure 3.11. Indicator kriging: local cumulative probability density functions at 15 testing locations; the sample value (●) is plotted at the 0.5 p-value. The plots are labeled according to sample numbers
Based on indicator kriging with 9 thresholds, probability maps of exceeding the given critical levels (17, 27 and 38 pCi/g) were constructed; they are presented in Figure 3.12. In these figures white marks indicate measurements actually above the indicated level. Most of the plots show that the measured value lies close to the modeled P50 value.
Figure 3.12. Probability maps of exceeding given critical levels (17, 27 and 38 pCi/g). White marks indicate measurements actually above the indicated level
3.4.3.2. Indicator kriging for aquifer layer zonation

This example illustrates the application of indicator kriging to a classification task: the zonation of a geological aquifer layer. A detailed description of the data and the tasks can be found in [SAV 02; SAV 03]. The parameter zonation approach is an alternative to the analysis of the spatial variation of the continuous hydraulic parameters; it is primarily motivated by the lack of measurements that would be needed for direct spatial modeling of the hydraulic properties. In the current case the hydrogeological layer is represented by 5 zones with different hydraulic properties: class 1 – gravel type 1; class 2 – gravel type 2; class 3 – gravel type 3; class 4 – sand; class 5 – silt. The initial data are a set of 225 samples. Figure 3.13 presents the initial data using Voronoï polygons. Here the Voronoï tessellation is used only for clearer visualization, as a simple post plot would show overlapping of close samples. The data are transformed into indicators for the 5 possible values of the categorical function. The corresponding variogram is estimated and modeled for each indicator.
Figure 3.13. Initial data (classes)
Indicator kriging provides estimates of the probabilities of belonging to each class. Figure 3.14 presents examples of such probability maps for class 1 (gravel 1) and class 4 (sand). The final classification assigns to each location the class with the largest estimated probability. Figure 3.15 presents the classification result and the probability of the winning class. The decision rule can be changed if we only want highly probable classes to compete: if no class exceeds a given level of probability (for example, 0.7) at a location, this location is treated as unclassified. Figure 3.16 presents the classification result using such a rule with the level of probability set at 0.7. White zones (unclassified zones – zones where all classes have low probability) are zones of uncertainty for the current classification problem.
Figure 3.14. Probabilities of belonging to class 1 (gravel 1) and class 4 (sand)
Figure 3.15. Zonation by indicator kriging: classification (left) and probability of the class winner (right)
Figure 3.16. Classification by indicator kriging with strong classifying rule – the winning class needs to have a probability higher than 0.7
3.4.3.3. Indicator kriging for localization of crab crowds

This example deals with the spatial distribution of opilio crabs in the Bering Sea. Measurements were taken by trawl survey and are presented as catch numbers. The values vary from zero (nothing was found) up to 821,422 crabs. Such strong variability significantly complicates any interpolation. However, the interpolated value itself is of no special interest; the actual goal is to determine where to find a crowd of crabs, so the probability of finding a crowd is a sufficient type of estimate. A catch of over 5,000 crabs is considered to be a crowd. Figure 3.15 presents the spatial distribution of the indicator transformed measurements with the level zk = 5,000. Indicator kriging allows us to estimate the probability of finding a crowd of 5,000 individuals or more, since the indicator kriging estimate gives the probability of being below or above the given value. Figure 3.16 presents the probability of finding a crowd of opilio crabs (more than 5,000 crabs in a catch). Light marks indicate the locations of the initial data from this class of values. The correspondence seems to be quite good.
Figure 3.15. Spatial distribution of the indicator transformed number of opilio crabs (threshold level of 5,000 individuals)
Figure 3.16. Probability of finding a crowd of opilio crabs (more than 5,000 individuals). White marks indicate locations of measurements of such crowds
3.5. Description of spatial uncertainty with conditional stochastic simulations

3.5.1. Simulation vs. estimation
Spatial estimation models (e.g. kriging), described above, provide a single regression point estimate of a variable value for a chosen set of parameter values. Such point estimates are important in prediction mapping problems. However, a single estimate does not describe the range of uncertainty of the estimated variable. Reality is naturally more complex than all the sophisticated prediction models we can possibly think of, and even the very best prediction model is not able to reflect all the peculiarities of the real spatial pattern. Thus, our models can only represent an average property value with a resolution relevant to the chosen grid. The prediction grid resolution is usually lower than the resolution of the measurement data (measurement support). For example, in large-scale radioactive pollution mapping the common grid resolution is 10^2-10^3 m, while the observations are collected on scales from 10^-2 m (soil sample) and 1 m (repeated samples) to 10^2 m (aerial gamma survey). Detailed samples collected with a high spatial resolution can describe local variability, which remains an essential feature of spatial patterns in environmental problems. Spatial regression models are not able to reproduce this variability, since they impose a smooth surface honoring the available data. The stochastic simulation approach, as opposed to regression, provides multiple estimates of the variable values at every considered location, which are calculated as stochastic realizations of the unknown random function. Conditional stochastic simulations are able to honor the observation data by reproducing the data exactly, as in kriging. Multiple realizations of a spatial process bring several important benefits for solving a prediction problem:

– stochastic realizations do not smooth out the estimated pattern between the data locations, which is essential in preserving the realistic variability of the spatial pattern;

– realizations represent a range of uncertainty with a local distribution of possible values. Capturing the local variability allows us to describe small-scale variations, which may be reflected by the observations but are not reproduced by smoothed regression estimates away from data locations;

– multiple realizations have equal probability by the construction algorithm. This allows us to assess the probability of the true unknown value being over a certain level and to derive confidence bounds which encompass the real value with a certain probability.

The principal differences between stochastic simulation and estimation are illustrated in Figure 3.17, where five stochastic realizations of a 1D pattern are plotted against two regression estimates (simple and ordinary kriging) conditioned by six items of
data. Kriging estimates are smooth between the data, which are honored exactly. Conditional stochastic simulations provide multiple variable realizations between the data, which are also honored exactly. It is clearly shown that stochastic realizations can be larger than the maximum observation value and smaller than the minimum value, which bound the kriging estimates. We should also note that the variability of the realizations is smaller in the areas with more data (e.g. x ∈ [5, 20] in Figure 3.17), which naturally restricts the simulations from straying too far from the surrounding data.
Figure 3.17. Multiple stochastic simulation realizations against kriging estimates (simple kriging and ordinary kriging using a spherical variogram, correlation range – 50, nugget – 2, sill – 3)
Stochastic simulations can still be carried out in the absence of data measurements, based only on prior knowledge of a global distribution and its structure. Such unconditional simulations still preserve the global properties of the pattern: statistics (mean, variance, etc.), shape of the distribution density function (histogram) and spatial correlation (variogram).

3.5.2. Stochastic simulation algorithms
There exists a great variety of stochastic simulation algorithms used in spatial modeling, based on a Monte Carlo technique in one way or another. Most of the methods fall into one of two categories: cell-based (or pixel-based) models and object-based models [CHI 99; DEU 02].
Object-based algorithms model the variable value in local vicinities according to predefined geometrical shapes (objects), which together form a pattern. These shapes are placed over the modeling region following one or another optimization technique, which minimizes an objective function based on the data, the distribution statistics or the spatial structure. In this case, the spatial correlation is determined solely by the geometric shapes, as an alternative to a variogram. Object-based algorithms benefit from clear interpretability, based on the choice of realistic object shapes which reflect the nature of the modeled phenomenon. This is also one of the major weaknesses of the object approach: the choice of the shapes assumes good knowledge of the pattern structure and is subject to vast uncertainties. Another drawback of the object-based approach is its possible computational cost, since the optimization technique requires multiple refitting of the spatial patterns to the data. Data conditioning can be poor and numerically complex to achieve by iterative optimization. Also, conditioning an object-based realization to data from different scales (e.g. soft probabilities) is not straightforward, as it would require a complex objective function and lead to a further increase in computation. However, object-based models are widely used in fields where prior information about the pattern structure is available (geology, hydrology, etc.). Some examples of possible object-based simulation realizations are shown in Figure 3.18.
Figure 3.18. Examples of object-based model stochastic realizations: a) fluvial channels deposited in a river system; b) aeolian dunes occurring in a desert landscape
In the cell-based approach, unlike the object-based method, we model the pattern value in every grid cell sequentially – cell by cell. This approach does not entail a costly optimization algorithm, since data conditioning is straightforward in every particular cell where the data are located. There are cell-based algorithms to model both continuous and categorical variables. Here we will describe just two of the most widely used in spatial modeling: sequential indicator simulations and sequential Gaussian simulations.
The most recently developed cell-based models are associated with multiple-point statistics [STR 02]. Multi-point statistics simulation is based on a spatial correlation model represented by a multiple-point statistical moment, rather than on a second-order two-point statistical moment (variogram) as in most conventional geostatistical simulations. The multi-point statistic is described by a training image, which represents the global correlation structure more accurately than the conventional variogram. The use of the global structural information from a training image brings more realism and interpretability, as in the case of object-based models. The global structural dependencies obtained from the training image are conditioned to the local data using the sequential simulation principle (described below). Therefore, unlike in the object-based approach, data conditioning on all scales is straightforward in multi-point simulation and no iterative optimization algorithm is involved [CAE 05]. Simulated annealing is another cell-based algorithm, which is related to the object-based approach in some sense. It can be treated as a cell-based algorithm in the sense that it models realizations in each consecutive grid cell. However, as in the object-based approach, simulated annealing employs an optimization technique to minimize an objective function (variogram/histogram-based) by moving the simulated grid values around [DEU 98]. In a way it can be seen as a variant of the object-based approach in which all objects have a single shape – a basic grid cell. Although simulated annealing shares a similar optimization technique with object-based modeling, it does not imply any realistic prior shapes and therefore uses a variogram to represent the spatial correlation. Actually, simulated annealing is a much more general algorithm, built on an analogy with metal cooling in thermodynamics and based on the Boltzmann relation between temperature and energy [MET 53]. Most geostatistical cell-based algorithms are based on the sequential simulation principle, which represents the joint probability distribution function of the entire pattern as a product of N local probability distribution functions conditioned by the n observed data:

F(x1,…,xN; z1,…,zN | (n)) = F(xN; zN | (n+N−1)) F(xN−1; zN−1 | (n+N−2)) … F(x1; z1 | (n))

In practice, the sequential simulation principle means that each sequentially simulated cell is then used, along with the conditioning data, to simulate further cells. The sequence in which the cells are simulated is determined by a random path regenerated for each stochastic realization. The sequential simulation algorithm steps are illustrated in Figure 3.19, showing how a simulated value for a single realization is randomly drawn from the modeled local cdf and then added to the data set for simulation at further locations.
[Figure 3.19 schematic: (1) random selection of the simulation cell; (2) modeling of the local pdf (parametric or non-parametric); (3) random draw of the simulation value from the local pdf; (4) the simulated value is added to the data pool to be used in further simulations]
Figure 3.19. Sequential simulation algorithm
The stochastic nature of simulation is embedded in the algorithm by random sampling from a local probability density function, which is constructed at each simulated cell location. Each random draw from the distribution corresponds to a single stochastic realization. The question is how to obtain this probability distribution function. There are several ways of doing this: parametric and nonparametric. The parametric approach entails assumption about the form and shape of the distribution, defined analytically (e.g. Gaussian). The non-parametric approach implies the definition of the local distribution function directly using a set of p-quantiles and the relevant interpolation between them. Parametric and nonparametric approaches provide different types of sequential simulations: Gaussian and truncated Gaussian algorithms, indicator simulations, direct simulations. Multipoint statistics simulations are non-parametric algorithms as they do not assume any analytical form of the local pdf but obtain it from the probability of the multi-point statistics pattern (data event) in the training image. It is worth noting that the sequential simulation algorithm is subject to the screen effect [ISA 89]. This occurs when the kriging weights of the data points that fell between one of the sampling points and the simulated point are decreased. Thus, some of the previously simulated values accounted for in the simulation act as a screen for the original sample values. This may lead to the appearance of negative weights. In practice it is not necessary to use all the available sample data for building the cdf at the simulated location. To overcome the screen effect, the number of conditional samples from the neighborhood of the simulated point can be restricted to the closest points with respect to the octant search. 3.5.3. Sequential Gaussian simulation
3.5.3. Sequential Gaussian simulation

Sequential Gaussian simulation is widely used to model spatially distributed continuous variables (e.g. porosity, concentration, intensity, amount of rainfall) [DEU 02; GOO 97]. The key assumption behind the algorithm is a joint normal distribution of the spatial variable. This means that the variable components at all evaluated locations are jointly normally distributed. This property – called multi-normality – makes all local distributions Gaussian. The Gaussian distribution is determined by just two parameters – the mean and the variance – which makes it convenient to handle parametrically. However, multi-normality is quite a strong assumption, which cannot be checked in practice. There exist several tests for bi-normality, a weaker assumption, which states that every pair of values is jointly normally distributed [EME 04]. A sequential Gaussian simulation algorithm consists of the following steps [DEU 98]:
1. Normal score transformation of the original data to the standard Gaussian distribution N(0,1) is performed first if the data are not normally distributed initially. It is usually applied, unless the data are lognormally distributed. The transformation is carried out using an approximate analytical function φ. A tabular inverse function φ⁻¹ is constructed simultaneously to be used in the back transformation.
2. Choice of the simulated location along a random path which visits all the points to be simulated.
3. Calculation of a simple kriging (SK) estimate and variance at the chosen location using the normal score variogram model (see section 3.2) and the conditioning data in the local neighborhood. The simple kriging estimate is calculated with a constant known mean equal to zero for the normalized data.
4. Construction of a local normal distribution with the mean equal to the SK estimate and the variance equal to the SK variance at the simulated point.
5. Random draw of a stochastic realization value from the constructed local normal distribution, according to the Gaussian probability density function.
6. Addition of the simulated value to the pool of conditioning data and choice of the next location (step 2) to be simulated. The previously simulated data are used in the simulation at further locations along with the observations.
7. Back transformation of the simulated realization to the original distribution values. An interpolation/extrapolation function has to be used between the values of the tabular function φ⁻¹ and for the tails of the distribution.

Multiple stochastic realizations are obtained by repeating steps 2 through 7; the random path is regenerated for each stochastic realization in step 2. Gaussian models are theoretically consistent models and have several benefits. They are well known, and easy to calculate and integrate. Sampling from the local Gaussian distribution ensures that the set of spatial realizations keeps the form and shape of the local variability; the use of any other distribution would result in a variety of shapes of the local distributions. One of the drawbacks of the Gaussian approach is maximum entropy – it imposes maximum “disorder” in the data. Maximum entropy results in poor connectivity of the extreme values, which is not always the case in nature. This is perhaps not the best choice when the spatial correlation between high extreme values is of special interest; one possibility is then to take a non-parametric model such as indicator-based simulations. A normal score variogram is also more stable and robust than a raw variogram, which eases the variogram modeling. A variogram of the normal score data must have a total sill equal to 1, according to the standard Gaussian variance.
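A simplified 1D sketch of steps 2-6 of the algorithm is given below (not part of the original text). It assumes the data have already been transformed to normal scores (step 1), uses simple kriging with a known zero mean and an exponential covariance model, and skips the back transform (step 7); the data, grid and parameters are illustrative.

```python
# A simplified 1D sketch of sequential Gaussian simulation (steps 2-6), assuming
# the data are already normal scores; covariance model and values are illustrative.
import numpy as np

def cov(h, rng=20.0):
    """Normal-score covariance model (exponential, unit sill)."""
    return np.exp(-3.0 * np.abs(h) / rng)

def sgs_1d(x_data, y_data, x_grid, n_neighbors=8, seed=0):
    gen = np.random.default_rng(seed)
    x_cond, y_cond = list(x_data), list(y_data)
    realization = np.empty(len(x_grid))
    path = gen.permutation(len(x_grid))                   # step 2: random path
    for idx in path:
        x0 = x_grid[idx]
        xc, yc = np.asarray(x_cond), np.asarray(y_cond)
        near = np.argsort(np.abs(xc - x0))[:n_neighbors]  # closest conditioning points
        xn, yn = xc[near], yc[near]
        # step 3: simple kriging with known mean 0 on normal scores;
        # a tiny diagonal term keeps the system numerically stable
        K = cov(xn[:, None] - xn[None, :]) + 1e-10 * np.eye(len(xn))
        k0 = cov(xn - x0)
        w = np.linalg.solve(K, k0)
        sk_mean = np.dot(w, yn)
        sk_var = max(cov(0.0) - np.dot(w, k0), 0.0)
        # steps 4-5: random draw from the local Gaussian distribution N(sk_mean, sk_var)
        value = gen.normal(sk_mean, np.sqrt(sk_var))
        realization[idx] = value
        # step 6: the simulated value joins the conditioning data
        x_cond.append(x0)
        y_cond.append(value)
    return realization

x_data = np.array([5.3, 24.7, 60.2, 88.9])   # hypothetical normal-score data locations
y_data = np.array([-0.5, 1.2, 0.3, -1.1])
x_grid = np.linspace(0.0, 100.0, 51)
print(sgs_1d(x_data, y_data, x_grid)[:5])
```

Repeating the call with different seeds produces the multiple equally probable realizations discussed above.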
Figure 3.20. Stochastic realizations of sequential Gaussian simulations with different variogram parameters: a) nugget c0=0, angle α=60, main range along direction α R=40, minor range r=8; b) c0=0.4, α=60, R=40, r=8; c) c0=0, α=60, R=8, r=40; d) c0=0, α=60, R=40, r=40; e) c0=0, α=60, R=80, r=4; f) c0=0, α=60, R=8, r=8; g) c0=0, α=0, R=40, r=8; h) c0=0, α=90, R=8, r=40
To simulate a categorical (non-continuous) variable, another type of sequential Gaussian simulation can be used – truncated Gaussian simulation – which is based on the same Gaussian assumption [DEU 02]. A set of equally probable realizations of the spatial variable distribution is the result of sequential Gaussian simulation. The simulated realizations share the same global distribution characteristics (mean, variance, histogram, etc.) and spatial correlation, reproduce exactly the same conditioning data, but differ in local peculiarities. The difference between the realizations characterizes the uncertainty of the model. In order to obtain statistical inference for the spatial distribution (mean, variance, p-quantiles, etc.), further post-processing of the realizations can be carried out. Post-processing of the stochastic realizations offers a wide range of probabilistic results. Thus, an averaging over the realizations provides a smooth E-type estimate that can be compared to a regression type (kriging) estimate as a single solution. The difference between the realizations allows us to evaluate the range of variability of the spatial estimates. Local probability distribution functions (pdf) can be calculated from the multiple realizations at each location, assuming that all the realizations have equal probability. The probability of the true value exceeding a chosen threshold can be calculated from these local pdfs. Similarly, an estimate corresponding to a certain p-quantile can be obtained from the same probability density functions. Stochastic realizations depend heavily on the normal score variogram model parameters used. Figure 3.20 illustrates realizations calculated with different variogram nuggets, ranges and anisotropy direction angles. Overlaid arrows show the pattern structures corresponding to the variogram ranges along and across the anisotropy direction.
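Post-processing of the realizations at a single grid node can be sketched as follows (not from the original text); the realization values and the threshold are illustrative.

```python
# A small sketch of post-processing a set of equally probable realizations at one
# grid node: E-type estimate, a p-quantile and the probability of exceeding a
# threshold; the values below are illustrative.
import numpy as np

realizations = np.array([12.4, 8.9, 15.2, 10.7, 21.3, 9.8, 13.5, 17.1])
threshold = 15.0

e_type = realizations.mean()                      # E-type (mean over realizations)
p90 = np.quantile(realizations, 0.9)              # 0.9-quantile estimate
prob_exceed = np.mean(realizations > threshold)   # probability of exceeding the threshold
print(e_type, p90, prob_exceed)
```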
Sequential indicator simulation is a non-parametric method, which does not assume any analytical form of the local pdf, unlike in Gaussian simulation. Indicator simulation is based on the indicator approach which allows us to estimate local pdf using indicator kriging (see section 3.3). Indicator simulation is a cell-based algorithm which models values in each cell sequentially along a chosen random path. It follows the sequential principle, when previously simulated data are used in the next evaluations. A sequential indicator simulation algorithm consists of the following steps [DEU 98]: 1. Indicator transformation of data according to the set of thresholds (cut-offs). A global proportion (conditional density function value) and a variogram model are
built for each indicator variable. Note that the global proportions should be estimated taking into account the clustering of the observation data, so an appropriate declustering algorithm can be applied, e.g. [DEU 89].
2. Definition of a random path through all the simulation points.
3. Estimation of the probabilities for each indicator variable using indicator kriging and normalization of their sum to 1.
4. Construction of a local probability density function based on the estimated probabilities at the point.
5. Random draw of a simulated realization value from the probability density function and calculation of the corresponding values of the indicator variables.
6. Addition of the simulated indicator value to the set of conditioning data to be used in the simulation at further locations.
7. Move to the next simulation point along the chosen random path and repeat steps 3-6.
8. Multiple stochastic realizations are obtained by repeating steps 2-7.

Indicator simulation ensures the approximate representation of the average proportion of each category (indicator variable), given the global distribution and the indicator variogram for each category. Conditional indicator simulation realizations honor the observations and reproduce the spatial correlation structure (variogram). In the case of modeling a continuous variable, a simulated indicator realization provides values of the actual variable sampled from the local pdfs evaluated by indicator kriging. If the number of realizations is large, the realizations reproduce the estimated pdf fairly well. Thus, averaged realizations (E-type) and maps of probability quantiles are approximately the same as those obtained from indicator kriging. The benefit of simulation is in the realizations themselves, which can be used as an input for further risk modeling, resulting in the evaluation of uncertainty for decision-making. In the case of modeling a categorical variable, spatial realizations of the indicator variables for each category can be combined into a joint realization of a categorical stochastic pattern. A set of multiple realizations then characterizes the uncertainty in the occurrence of one or another category at every location. The probability of any category occurring at a particular location is determined by the corresponding pdf value, which is based on the family of realizations. Figures 3.21 and 3.22 illustrate the influence of the anisotropic variogram range on the simulation results.
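For a categorical variable, steps 3-5 at one simulation point reduce to normalizing the indicator kriging probabilities and drawing a class from them. A small sketch (not from the original text) with illustrative probabilities:

```python
# A small sketch of steps 3-5 of sequential indicator simulation for a categorical
# variable: normalize the indicator kriging probabilities and draw a class from
# the local pdf; the probability values below are illustrative.
import numpy as np

def draw_category(ik_probabilities, rng):
    p = np.clip(np.asarray(ik_probabilities, dtype=float), 0.0, None)
    p = p / p.sum()                      # normalization of the sum to 1 (step 3)
    return rng.choice(len(p), p=p)       # random draw from the local pdf (step 5)

rng = np.random.default_rng(42)
ik_probabilities = [0.35, 0.10, 0.25, 0.20, 0.15]   # 5 classes, sum slightly off 1
print(draw_category(ik_probabilities, rng))
```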
Figure 3.21. Sequential indicator simulation realizations with different horizontal correlation ranges: a) r=160, b) r=80, c) r=40, d) r=20 and vertical range R=8
Figure 3.22. Sequential indicator simulation realizations with different vertical correlation ranges: a) r=20, b) r=10, c) r=5, d) r=2 and horizontal range R=80
3.5.5. Co-simulations of correlated variables
The geostatistical stochastic simulation of correlated variables is called co-simulation and is based on a corresponding co-kriging model. Any of the co-kriging models discussed in section 3.3 can be used for co-simulation. In the case of a simple linear regression relationship between two variables, the value of the correlated variable can be obtained from the one already simulated, e.g. using Gaussian simulations. Analogously to estimation, each realization of the correlated variable then simply mimics the realization of the first one. Collocated co-kriging (see section 3.3.4) provides the mean and variance used to build the local normal probability density functions (pdf) of the primary variable in the case of Gaussian simulations. The local pdf of the secondary variable is similarly obtained from the simple co-kriging estimate (as the mean) and variance. The correlated variables are therefore sampled from two separate local pdfs, so the simulated distribution patterns are not the same as in the case of simulation with linear regression. Collocated co-simulation implies its own spatial correlation (variogram) model for the simulated correlated variable, which accounts for the peculiarities of its own spatial structure; thus, an additional modeling effort is needed to fit the variogram model of the second correlated variable. In collocated co-simulation the conditioning data of both variables are honored, as in sequential Gaussian simulation. In comparison to the simple simulation with linear regression, the increased computational cost of an additional stochastic simulation round is traded for a more accurate modeling of the spatial correlation of the second simulated variable. However, collocated co-kriging also assumes a linear correlation between the variables and thus only uses a single piece of secondary data at the estimated point, while full co-kriging uses all the secondary data from the local neighborhood. Such a simplification may not affect the kriging estimate if the neighboring secondary data do not differ much, but the variance can be over-estimated, which would lead to a larger variability of the stochastic realizations [DEU 02]. An example of collocated co-simulation is presented below in this section. The third way of simulating correlated variables is based on the full co-kriging method (see section 3.3.3). Sequential Gaussian co-simulation with full co-kriging simultaneously calculates individual local pdfs for all correlated variables at the simulated location, based on the co-kriging estimates and estimation errors as Gaussian means and variances. In full co-simulation the sampling of each simulated variable is carried out simultaneously and independently from the relevant distributions. This provides spatial variability of the individual distributions based on their different spatial correlation structures, while taking into account their joint correlation without assuming just a linear regression.
An example of modeling porosity and permeability in a sub-surface reservoir illustrates stochastic Gaussian collocated co-simulation. Porosity and permeability are usually highly correlated in permeable porous media (e.g. sands). A linear correlation between porosity and log-permeability can often be assumed. This assumption is used in a synthetic PUNQ-S3 case study, which is a benchmark case in the oil industry [FLO 01]. In the case study horizontal permeability was assumed to be correlated with porosity, with a correlation over 80%. Auto-variograms for porosity and permeability are shown in Figure 3.23 as anisotropic rose contour diagrams. They in fact feature quite similar, strongly anisotropic structures, which is also reflected by the cross-variogram (see Figure 3.23). The sequential Gaussian simulation realizations of porosity are presented in Figure 3.24. They look quite different due to the lack of conditioning data (from only 6 well locations). The corresponding permeability realizations calculated by sequential Gaussian co-simulations are presented in Figure 3.25.
Figure 3.23. Auto- and cross-variogram contours for porosity and permeability
Figure 3.24. Stochastic porosity realizations from sequential Gaussian simulations
Figure 3.25. Stochastic permeability realizations from sequential Gaussian co-simulations
Note that it is possible to calculate multiple stochastic realizations of permeability based on each porosity realization. This will result in additional variability in permeability results imposed by porosity realizations. Alternatively, using a single porosity realization as a secondary correlated variable input in permeability co-simulation will feature a reduced level of uncertainty associated with permeability only. Multiple stochastic realizations of porosity and permeability reproduce spatial correlation characterized by variograms (see Figures 3.26 and 3.27). The generated sets of porosity/permeability realizations are used as an input into the flow simulation model to simulate uncertainty of the oil production forecast (see [DEM 04]). Cumulative oil production profiles for 50 stochastic realizations are presented in Figure 3.28a in comparison with the known “TRUTH” case solution in the synthetic case. The histogram in Figure 3.28b shows the uncertainty distribution of the total cumulative oil production by the end of the forecasting period of 16.5 years against that for the “TRUTH” case.
Figure 3.26. Reproduction of omnidirectional variogram by 50 stochastic realizations of porosity (left) and permeability (right)
Figure 3.27. Reproduction of directional variograms by 50 stochastic realizations of porosity (left) and permeability (right)
Figure 3.28. Flow simulation: a) oil production forecasts with 50 realizations of poro/perm fields against the “TRUTH” case (solid line); b) histogram of the distribution of FOPT production predictions after 16.5 years against the “TRUTH” case (solid line at 3,870,000)
3.6. References
[BOU 97] BOURGAULT G., “Statistical declustering weights”, Mathematical Geology, vol. 29, p. 277-290, 1997.
[CAE 05] CAERS J., Petroleum Geostatistics, Society of Petroleum Engineers, 2005.
[CHI 99] CHILES J-P. and DELFINER P., Geostatistics: Modeling Spatial Uncertainty, John Wiley & Sons, 1999.
[CLA 84] CLARK I., Practical Geostatistics, Elsevier Applied Science Publishers, London and NY, 1984.
[CRE 93] CRESSIE N., Statistics for Spatial Data, John Wiley & Sons, 1993.
[CRO 83] CROZEL D. and DAVID M., “The combination of sampling and kriging in regional estimation of coal resources”, Mathematical Geology, vol. 15, p. 571-574, 1983.
[DAV 88] DAVID M., Handbook of Applied Advanced Geostatistical Ore Reserve Estimation, Elsevier Science Publishers, Amsterdam B.V., 216 p., 1988.
[DEM 04] DEMYANOV V., CHRISTIE M. and SUBBEY S., “Neighbourhood Algorithm with Geostatistical Simulations for Uncertainty Quantification Reservoir Modeling: PUNQ-S3 Case study”, 9th European Conference on Mathematics in Oil Recovery ECMOR IX 2004, Cannes, France, September 2004.
[DEU 82] DEUTSCH C. and JOURNEL A., GSLIB: Geostatistical Software Library, Oxford University Press, 1998.
[DEU 89] DEUTSCH C., “DECLUS: a Fortran 77 program for determining optimal spatial declustering weights”, Computers & Geosciences, 15(3), p. 325-332, 1989.
[DEU 02] DEUTSCH C., Geostatistical Reservoir Modeling, Oxford University Press, 2002.
[DOW 82] DOWD P.A., “Lognormal kriging – the general case”, Mathematical Geology, vol. 14, p. 475-499, 1982.
[EME 05] EMERY X., “Variogram of order ω: A tool to validate a bivariate distribution model”, Mathematical Geology, 37(2), p. 163-181, 2005.
[FLO 01] FLORIS F.J.T., BUSH M.D., CUYPERS M., ROGGERO F. and SYVERSVEEN A-R., “Methods for quantifying the uncertainty of the production forecasts: a comparative study”, Petroleum Geoscience, vol. 7, p. S87-S96, 2001.
[GAN 63] GANDIN L.S., Objective Analysis of Meteorological Fields, Israel Program for Scientific Translations, Jerusalem, 1963.
[GOO 97] GOOVAERTS P., Geostatistics for Natural Resources Evaluation, Oxford University Press, 1997.
[HAA 90] HAAS T.C., “Lognormal and moving window methods of estimating acid depositions”, Journal of the American Statistical Association, vol. 14, p. 950-963, 1990.
[JOU 78] JOURNEL A.G. and HUIJBREGTS C.J., Mining Geostatistics, Academic Press, 600 p., London, 1978.
[JOU 83] JOURNEL A.G., “Non-parametric estimation of spatial distributions”, Mathematical Geology, vol. 15, p. 445-468, 1983.
[ISA 89] ISAAKS E. and SRIVASTAVA R.M., Applied Geostatistics, Oxford University Press, 1989.
[KAN 03] KANEVSKI M., BOLSHOV L., SAVELIEVA E., DEMYANOV V., PARKIN R., TIMONIN V., CHERNOV S., MCKENNA S., “Spatio-temporal analysis of ground water contamination”, Proceedings of IAMG 2003, Portsmouth, UK.
[KAN 04] KANEVSKI M. and MAIGNAN M., Analysis and Modelling of Spatial Environmental Data, EPFL Press, 2004.
[KAN 06] KANEVSKI M., DEMYANOV V., SAVELIEVA E., PARKIN R., POZDNOUKHOV A., TIMONIN V., BOLSHOV L. and MCKENNA S., “Validation of geostatistical and machine learning models for spatial decision-oriented mapping”, Proceedings of StatGIS 99, J. Heyn, Klagenfurt, 2006.
[MET 85] METROPOLIS N., ROSENBLUTH A., TELLER A. and TELLER E., “Equations of state calculations by fast computing machines”, J. of Chem. Physics, 21(6), p. 1087-1092, 1985.
[MAT 54] MATHERON G., “Démonstration approchée de la convergence vers la loi lognormale du schéma doublement binomial”, Note Statistique, no. 5, CG, Ecole des Mines de Paris, 1954.
[MAT 63] MATHERON G., “Principles of geostatistics”, Economic Geology, vol. 58, p. 1246-1266, December 1963.
[MYE 91] MYERS D., “Pseudo-cross variograms, positive-definiteness, and cokriging”, Mathematical Geology, 23(6), p. 805-816, 1991.
[PAR 91] PARKER H.M., “Statistical treatment of outlier data in epithermal gold deposit reserve estimation”, Mathematical Geology, vol. 23, no. 2, 1991.
[PAR 03] PARKIN R., DEMYANOV V., KANEVSKI M., TIMONIN V. and MCKENNA S., “Improved uncertainty assessment for environmental decision making with hybrid models: ANN and stochastic simulations”, Proceedings of StatGIS 2003, Springer.
[PAR 04] PARKIN R., KANEVSKI M., SAVELIEVA E., DEMYANOV V. and MCKENNA S., “Geostatistical analysis of radioecological spatio-temporal data”, Izvestia of Russian Academy of Sciences, Applied Energy, 3, p. 59-73, 2004 (in Russian).
[REN 79] RENDU J.M., “Normal and lognormal estimation”, Mathematical Geology, vol. 11, p. 407-422, 1979.
[SAV 02] SAVELIEVA E., KANEVSKI M., TIMONIN V., POZDNUKHOV A., MURRAY C., SCHEIBE T., XIE Y., THORNE P. and COLE C., “Uncertainty in the hydrogeologic structure modelling”, in Proceedings of IAMG2002 Conference, September 2002, Berlin, Germany, p. 481-486.
[SAV 03] SAVELIEVA E., KANEVSKI M., TIMONIN V., POZDNUKHOV A., MURRAY C., SCHEIBE T., XIE Y., THORNE P. and COLE C., “Aquifer hydrogeologic layer zonation at the Hanford Site”, Proceedings of IAMG2003 Conference, September 2003, Portsmouth, UK.
[SAV 05] SAVELIEVA E., “Using Ordinary Kriging to Model Radioactive Contamination Data”, Applied GIS, vol. 1, no. 2, p. 10.1-10.10, 2005.
[SCH 93] SCHOFIELD N., “Using the entropy statistic to infer population parameters from spatially clustered sampling”, in Geostatistics Troia’92, Volume 1, Quantitative Geology and Geostatistics (ed. Soares A.), Kluwer Academic Publishers, Dordrecht, p. 109-119, 1993.
[SOL 86] SOLOW A.R., “Mapping by simple indicator kriging”, Mathematical Geology, vol. 18, no. 3, 1986.
[STR 02] STREBELLE S., “Conditional simulation of complex geological structures using multiple-point statistics”, Math. Geology, vol. 34, p. 1-22, 2002.
[WAC 95] WACKERNAGEL H., Multivariate Geostatistics, Springer-Verlag, Berlin, 256 p., 1995.
[ZIM 93] ZIMMERMAN D.L., “Another look at anisotropy in geostatistics”, Mathematical Geology, vol. 25, no. 4, 1993.
Chapter 4
Spatial Data Analysis and Mapping Using Machine Learning Algorithms
4.1. Introduction
This chapter presents a brief introduction to the broad field of machine learning (ML), an exciting research subject in which statistics, computer science and artificial intelligence overlap. As data becomes easier to collect, techniques that can handle massive datasets involving a large number of variables in an efficient manner become essential. ML real-world applications have flourished over the last two decades, especially in the fields of data mining, bioinformatics, speech and character recognition, web and text mining and, more recently, environmental science and remote sensing data analysis. Machine learning can be broadly defined as a set of mathematical, computational and statistical methods that aim to automatically learn rules and dependencies from examples. These rules can be a curve providing the best fit for a set of points from a predictive point of view, a discriminative function that guides a classification task, such as in soil type recognition from remote sensing data, or even hypotheses about the spatial and temporal distribution of a certain phenomenon. The examples usually take the form of a database that registers numerical, categorical or qualitative attributes of samples representing the phenomenon under study.
Chapter written by F. RATLE, A. POZDNOUKHOV, V. DEMYANOV, V. TIMONIN and E. SAVELIEVA.
At present there is excellent literature on machine learning algorithms and their applications. Here we can mention references in which the reader can find detailed explanations of theories, applications and algorithms [ABE 05, BIS 06, CRI 00, GUY 06, HAS 01, HAY 98, JEB 04, KOH 00, RAS 06, SCH 06, SHA 04, VAP 06, VAP 95, VAP 98].
4.2. Machine learning: an overview
4.2.1. The three learning problems
When confronted with a data modeling problem in environmental science, we want to perform tasks such as building a map, predicting to which category a certain soil belongs, evaluating a risk related to a specific pollutant, etc. These tasks, however numerous, usually fall into one or a combination of these categories:
– regression;
– classification;
– density estimation.
Building a map is a classical task of regression or function approximation. Given a finite set of points, we want to build a function predicting values at any location in space. The goal is to learn a functional relationship from a training set {xi, yi}, where xi is a p-dimensional vector, often called an input, and yi is a vector of continuous values. The latter is usually called the target or output vector. Input vectors xi are assumed to be drawn independently from the same (unknown) probability distribution. Figure 4.1 illustrates a typical function approximation problem in one dimension. Given a finite set of points – sampled here from the function sin(x)/x with added Gaussian noise – the goal is to build a function that will represent the data well while being useful for prediction, without knowledge about the original functional form sin(x)/x.
Figure 4.1. An example of regression problem
In Figure 4.1 the solid line represents a 2nd order polynomial fitted by least squares approximation, while the dashed line is a spline interpolation. Both models represent a valid hypothesis regarding the function that generated the data points, although neither of them is likely to be the best, i.e., the one that will provide the lowest generalization error on new data. In this specific case, one of the two models is too simple (the solid line), while the other is too complex. The issue of selecting a good compromise between these cases will be the focus of section 4.2.4. The methods most often encountered in practice are multivariate linear regression, kriging, splines, generalized linear models [HAS 01], generalized additive models [HAS 90], neural networks such as the multi-layer perceptron (MLP) and general regression neural network (GRNN), support vector regression (SVR) [SCH 98] and Gaussian processes [RAS 06]. MLP, GRNN and SVR will be explained in further detail in the following sections. A classification problem can be formulated the same way, but yi is a one-dimensional vector of discrete values, e.g., yi ∈ {-1, 1}. For example, from a satellite image, we could want to output, based on the reflectance of several spectral channels, whether the region on the image is a forest, a desert, water or an urban area. This is a typical multiclass classification task. From a practical point of view, the objects
to be classified are represented using variables that are able to separate the objects correctly. For instance, even though they share the same shape and color, only one variable is necessary to discriminate between pumpkins and oranges: their diameter. The variables are often called features in this context. Figure 4.2 illustrates a two-class and two-variable classification problem. Classification methods include linear discriminant analysis, probabilistic neural networks, decision trees and support vector machines (SVM).
Figure 4.2. Classification problem with two classes
In Figure 4.2 the solid line is a linear Bayes classifier, while the dashed line is an SVM with a Gaussian kernel. Even though the SVM achieves a better class separation, the linear classifier is sufficient for this problem and is a better choice for prediction in that case.
Density estimation is a task that arises when we want to model the probability distribution underlying a certain phenomenon, for visualization or clustering purposes, for instance. In this case, only the values {xi} are available. The most common way of dealing with such a problem is with so-called generative approaches [BIS 06]. These methods assume a functional form for the probability distribution, e.g., Gaussian, and then try to estimate the corresponding parameters. It is important to note that this functional form can be and usually is a combination of distributions. The most commonly used generative methods are Gaussian mixture
models, Bayesian networks and Markov random fields. The interested reader will find a good introduction to these methods in [BIS 04]. We have just mentioned that density estimation has often been achieved using generative methods. In fact, all machine learning methods are either generative or discriminative. The reader may find a thorough description of generative and discriminative models in [JEB 04]. Generative methods try to build a model of the whole joint probability distribution of the input and output sequence. Using Bayes’ rule, we know that
p(x, y) = p(x|y) p(y)   [4.1]
As the above formula illustrates, this implies estimating both the conditional distribution of the inputs and the distribution of the outputs. Discriminative models, on the other hand, only try to model the conditional probability distribution of the outputs. In a two-class classification problem, we usually want to find whether or not this inequality is respected for every new point:
p(y|x) > 0.5   [4.2]
Unlike generative methods, it is impossible to generate new data from this type of model, since we do not have knowledge of the full joint distribution. However, predicting the output is usually the ultimate goal in regression or classification, and estimating the full distribution, which is a very difficult problem, is not useful to perform as an intermediate task. Note that p(y|x), the posterior (or predictive) probability of the output, represents the probability of a data point belonging to class 1 (or -1, depending on the formulation of the problem). Most methods do not provide such a distribution, but instead only the most likely class label (1 or -1). Estimating the posterior distribution explicitly is slightly more difficult. We will come back to this point in section 4.2.5.
Most of the tasks we have mentioned rely on an inductive principle. From a finite set of locations, we build a general model able to predict the value of the function at any location. As underlined by Vapnik [VAP 95], we should not solve a problem that is more general than what is required. In many cases, we only need to know the values of that general model at particular locations. Traditionally, we solve a function estimation problem by first performing induction (constructing a general model from a finite dataset) and, subsequently, deduction (evaluation of the function at particular locations of interest). Transductive inference allows us to perform these tasks in only one step, from specific points to specific points. Figure 4.3 illustrates the different types of inference schemes. Transductive methods include nearest neighbor methods and GRNN. These methods have in common that they only use a
combination of data points in their vicinity in order to build a prediction. The output of a test point is usually a distance-based average of neighboring outputs. These methods are often referred to as lazy learners, as no training phase is necessary.
Figure 4.3. Illustration of inductive, deductive and transductive inference
4.2.2. Approaches to learning from data
Apart from the task itself, the specificities of the dataset may influence the type of learning approach that will be used. We should first check what the nature of the data is, and more particularly:
– whether the set of training points is labeled or not (with continuous or discrete values);
– whether there are many missing labels or data.
The answer to these questions will make the learning problem fall into one of these main categories: supervised learning, unsupervised learning and semi-supervised learning.
Supervised learning takes place when we are given a dataset {xi, yi}, where xi is a p-dimensional vector and yi is a one-dimensional vector of discrete (classification) or continuous (function approximation) values. We want to find a mapping from the input xi to the output yi. Thus, it is the type of learning most often encountered when dealing with classification or regression problems. The availability of the target vector yi suggests the use of a criterion measuring the discrepancy between yi and the output values predicted by the model that is being built in order to select the appropriate mapping.
Unsupervised learning deals with the case where the labels yi are not available. Only the data points xi are given, and the goal is thus to extract information regarding the distribution of the data and potential clusters of points. Unsupervised learning encompasses the general problems of clustering and density estimation, which are intimately related. Unsupervised learning usually requires greater a priori knowledge about the problem. In fact, no output data is available to measure the adequacy of a given model. We must therefore rely on assumptions about the structure of the data.
Semi-supervised learning is an intermediate situation between the two we have encountered so far. In this case, a small number of training samples is labeled, and most of the data is unlabeled. This is the case that occurs most often in real-life situations, as collecting data is relatively easy, but labeling it requires the manual annotation of the data, which can be prohibitively long, when feasible at all. In semi-supervised learning, the unlabeled data, which represent the unconditional distribution of the inputs, p(x), are used to bring information to the models obtained using the labeled data, which represent p(y|x). Popular semi-supervised learning algorithms include expectation-maximization (EM) with mixtures of Gaussians and transductive support vector machines. Even though semi-supervised learning is traditionally associated with classification problems, recent techniques have emerged dealing with semi-supervised clustering and regression.
4.2.3. Feature selection
It is often said that a well-posed problem is already half-solved. This is especially true in machine learning, where we can face a problem involving a large number of variables, very often redundant. More variables can mean more information, but also more noise in the data. A first example can be given by the coordinate system: the most obvious variables to consider in a spatial environmental context are the coordinates x, y or x, y, z of a data point or, equivalently, its latitude and longitude. Even if only spatial variables are considered, the origin of the coordinate system is arbitrary; a rotation of the system may decrease the correlation between x and y. A second example can be a situation where several environmental or meteorological variables are available and would bring very useful information: temperature, time, wind speed, etc. Selecting the most informative variables is essential in this case. These two cases illustrate the two major problems of feature extraction and feature selection. A good introduction to this important subject can be found in [GUY 03], and an exhaustive review of the field is presented in [GUY 06].
Feature selection can be defined as the task of selecting, among the available variables, the ones that are the most correlated with the output and uncorrelated with each other. This is a classical problem in statistics and many methods have been popularized in order to deal with it. Forward and backward stepwise selection is a standard method implemented in most statistical packages, but it is suitable for rather small datasets as the procedure is computationally demanding. Mallows’ Cp [MAL 73], the Akaike Information Criterion (AIC) [AKA 74] and the Minimum Description Length (MDL) [RIS 78] are other popular statistical methods. Feature extraction aims at building features that are, by construction, uncorrelated or independent, by combining existing features. The most well-known feature extraction method is Principal Component Analysis (PCA). PCA finds statistically uncorrelated features by building linear combinations of the initial features. We will come back to this technique in section 4.6. Figure 4.4 illustrates a case where performing feature extraction could be useful in a classification context.
Figure 4.4. Two-class classification problem. Two features are used, although only one is necessary to perfectly separate the data if we rotate the coordinate system
Using these features, both of them are necessary to separate the data. However, if we rotate the coordinate system by 45°, only one feature is necessary. This is easily achieved with PCA.
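As an illustration, the following minimal sketch performs this rotation with a principal component analysis computed via the singular value decomposition; the two-feature dataset is synthetic and only mimics the situation of Figure 4.4.

import numpy as np

rng = np.random.default_rng(0)
# synthetic two-feature data, varying mainly along a 45-degree direction
t = rng.normal(size=200)
X = np.column_stack([t + 0.1 * rng.normal(size=200),
                     t + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                  # center the features
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                       # principal component scores (rotated features)

# the first component carries almost all of the variance,
# so a classifier could use scores[:, 0] alone
var = scores.var(axis=0)
print(var / var.sum())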
4.2.4. Model selection
Selecting the right model is a difficult task which involves the determination of an appropriate functional form of a model and its parameters. Firstly, it is useful to distinguish between several classes of models which differ by their underlying assumptions about the data: parametric, non-parametric and semi-parametric models.
Parametric models (sometimes called “global” approaches or “model-driven” approaches) make strong assumptions about the underlying phenomenon. In fact, a functional form is first assumed (linear, polynomial, exponential, etc.), and the learning task consists of estimating the parameters of the chosen model. The errors are usually assumed to be Gaussian-distributed. Classical multivariate regression and linear discriminant analysis are good examples of this type of approach. The very good interpretability of these models compensates for their strong (and often false) assumptions. Parametric models often require a very large number of data points, especially if the number of parameters is large.
Non-parametric models (“local” approaches, “data-driven” approaches) make very few assumptions about the data. However, this advantage has the counterpart that the obtained model is not interpretable and must be used as a “black box”. This fact is the main reason why domain experts are usually very reluctant to use such approaches. Nonetheless, very often the context does not allow the use of parametric or physical models and this type of approach is the best choice. Classical examples of local approaches include nearest neighbor methods, probabilistic neural networks (PNN) and general regression neural networks (GRNN).
Semi-parametric models have both parametric and non-parametric components. Kriging with external drift can be seen as a semi-parametric model, as one component of the kriging estimator is model-driven (drift modeling) and another is data-driven (variography). Choosing between these types of models is an important task that will influence the reconstruction of the process. When a lot of prior knowledge is available, parametric models are often more interesting due to their interpretability. With very little prior knowledge, we should favor semi-parametric or non-parametric models, which rely on fewer assumptions. If possible, any available prior knowledge about the problem must be taken into account in order to make the choice of a general model (linear, nonlinear, polynomial, etc.).
Once a type of model has been selected, a number of parameters must be adjusted. To this end, numerous methods are available. One of the most popular methods for model selection is the comparison of test errors. Data are randomly
divided into three parts: the training set, the test set and the validation set. The training set is used to build the model, by minimizing a given criterion. If we want to compare models, we then calculate the error obtained on the test set. The validation set is used in order to obtain an estimation of the generalization error, i.e., the error that can be expected on new data. It is important never to use data that has been used to train the models in order to estimate the generalization error, as the models would be biased favorably with respect to this data. Note that in the machine learning literature, the terms “test” and “validation” are often used the other way around; we have adopted the present convention to follow geostatistical terminology.
Very often, the empirical error is minimized on the training set. However, if we reach a global minimum of the empirical error, it is likely that the model will be too closely adjusted to the training data, which means that predictions on unknown data are likely to be erroneous. This situation is called overfitting. If, on the contrary, the model is not optimized enough, or the class of functions chosen is too simple, it is likely that the data will not be well represented by the model: this situation is called underfitting. In Figure 4.1, these two situations are well represented. The function plotted with the solid line is not complex enough to capture the variation in the dataset. The opposite situation occurs with the model represented by the dashed line: the function passes through every data point. In this case, the empirical error on the training set is zero, but the generalization error is likely to be high, as the noise in the training set has also been fitted. Figure 4.5 illustrates this principle.
Figure 4.5. Evolution of the training and generalization error as the model complexity increases
This problem is also known as the bias-variance dilemma. The expected generalization error of a predictor can always be decomposed into these two terms. A model that is too simple exhibits low variance: training the model with another random sample coming from the same distribution would not modify the model greatly. However, it has a high bias, i.e., the model is likely to be far from the “true” hypothesis. Conversely, a model that is too complex has a low bias, i.e., it represents the data well. However, it has a very high variance in the sense that a different random sample would provide a radically different model.
Figure 4.6. Evolution of the bias, variance and generalization error
Very often, we are confronted by a problem where the dataset is not large enough to be split into training, test and validation sets. In fact, if the training dataset is too small, we may not capture all the information contained in the data. To get round this problem, techniques based on statistical resampling – more specifically jackknife methods – are commonly used. The jackknife estimates the standard error of a statistic by systematically recalculating the estimate, leaving out one observation at a time from the sample. The most popular jackknife-based method is cross-validation (CV). CV is extremely useful when we do not want to “waste” a large part of the dataset on estimating the test error: the training data are re-used to do so. Cross-validation is usually performed using either of the following two schemes:
– leave-one-out cross-validation (LOOCV);
– k-fold cross-validation (KFCV).
LOOCV can be summarized with the following procedure:
1) remove one point (xi, yi) from the dataset;
2) train the model ŷ with the remaining points;
3) calculate the error on the removed point, e.g., ei = (yi − ŷ(xi))²;
4) repeat 1 to 3 for every point in the dataset;
5) the average error (1/n) Σi ei is an estimation of the test error.
LOOCV is a very simple scheme, but has the drawback of being computationally demanding – as many models as there are data points have to be trained. KFCV is a more efficient method to estimate the test error. Rather than withdrawing one point at a time from the dataset, it works by randomly splitting the dataset into equal-sized partitions, which are sequentially removed. It can be summarized as follows:
1) split the dataset randomly into k partitions of size n;
2) remove the first partition from the dataset;
3) train the model ŷ on the remaining data;
4) calculate the error on the removed points, i.e., ej = (1/n) Σi (yi − ŷ(xi))²;
5) repeat 2 to 4 for all the k partitions;
6) the test error is (1/k) Σj ej.
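A minimal sketch of KFCV for a generic pair of fit/predict routines is given below; LOOCV is recovered by setting k equal to the number of data points. The fit and predict arguments are placeholders for any model of interest.

import numpy as np

def k_fold_cv_error(x, y, fit, predict, k, seed=0):
    # Estimate the test error by k-fold cross-validation.
    # fit(x_train, y_train) -> model ; predict(model, x_test) -> predictions
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)              # step 1: k random partitions
    errors = []
    for test_idx in folds:                      # steps 2-5: leave one partition out in turn
        train_idx = np.setdiff1d(idx, test_idx)
        model = fit(x[train_idx], y[train_idx])
        y_hat = predict(model, x[test_idx])
        errors.append(np.mean((y[test_idx] - y_hat) ** 2))
    return np.mean(errors)                      # step 6: average over the k folds

For instance, fit could wrap np.polyfit and predict could wrap np.polyval in order to compare polynomial models of different degrees.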
One parameter has to be tuned: k, the number of partitions. If the number of partitions is equal to the size of the dataset, we come back to LOOCV. When dealing with time series, a similar method to cross-validation is very often used: sequential validation. The model is trained using the data available at time t, and the data coming in at time t+1 is used to estimate the error on the current model. At each step, the whole dataset is used for training and the error is estimated on “fresh” data. This is a method of continuously improving the models.
4.2.5. Dealing with uncertainties
As mentioned in section 4.2.1, most machine learning algorithms only output a class label or a prediction. However, in many applications, it is also very important to estimate the posterior probability distribution p(y|x), which gives a measure of how confident we are in a prediction.
In classical pattern recognition problems such as handwritten digit classification, predictive uncertainty can be seen as side-information. However, in environmental, medical or safety applications, the importance of predictive uncertainty is dramatically increased. A meteorologist who predicts that “there is a 99% probability of having a tsunami tomorrow” makes a completely different prediction from another one who states that “there is a 60% probability of having a tsunami tomorrow”. However, a binary classifier would in both cases state that “there will be a tsunami tomorrow”. Many methods exist to deal with this problem, and it is still a hot topic of research (see for example [QUI 05]). For classification, the most straightforward (and somewhat heuristic) method is to map the results of a binary classifier on a sigmoid function. The closer the point is to the decision boundary, the more uncertain it is. Figure 4.7 illustrates that principle.
Figure 4.7. Mapping of a binary classifier output (left) on a sigmoid (right) in order to obtain probabilistic outputs. The flatness of the sigmoid has to be tuned
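A minimal sketch of this mapping is given below, where f is the signed output of a binary classifier (its distance from the decision boundary) and the slope a and offset b control the flatness of the sigmoid; in practice these two constants are fitted on held-out data, as in Platt scaling.

import numpy as np

def decision_to_probability(f, a=1.0, b=0.0):
    # map a signed decision value f to an estimate of p(y = 1 | x)
    return 1.0 / (1.0 + np.exp(-(a * f + b)))

print(decision_to_probability(np.array([-2.0, 0.0, 0.5, 3.0])))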
In the case of support vector machines (section 4.5.3), this methodology has been applied in [PLA 99]. Other methods exist based on data resampling. One of the most popular of these is bagging (bootstrap aggregation) [BRE 94]. For a given prediction that we want to output, we train N models, each of which is trained using a bootstrap sample (draw
with replacement) of the data. The average prediction of the N bagged predictors is the most confident prediction, and its variance reflects the uncertainty at that location.
4.3. Nearest neighbor methods
The simplest classification and regression algorithms are perhaps the k-nearest neighbors (KNN) methods. KNN methods predict the value of a new point by using simple combinations of training points that lie in its vicinity. Let x be a test point and y* the output value we want to predict. We define Nk(x) as the neighborhood
of size k of x. This neighborhood is defined as the k points xi closest to x, which have outputs yi. Finding this neighborhood implies calculating the distance from x to all points in the dataset, and selecting the k closest points. This distance is usually Euclidean, but other distance measures can be used, or the Euclidean distance can be weighted by a decreasing function. The output can simply be expressed as
y* = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i   [4.3]
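A minimal sketch of estimator [4.3] for a single test point is given below; for classification the mean would be replaced by a majority vote over the neighboring labels.

import numpy as np

def knn_predict(x_train, y_train, x_new, k=5):
    # k-nearest neighbor estimate [4.3]: average the outputs of the k closest points
    d = np.linalg.norm(x_train - x_new, axis=1)   # Euclidean distances to all training points
    neighbors = np.argsort(d)[:k]                 # indices of the k nearest points
    return y_train[neighbors].mean()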
The application to regression is direct. In the case of classification, the yi are discrete. The average is thus equivalent to a majority vote. Figure 4.8 plots the decision boundaries generated with a 1-nearest neighbor algorithm and a 20-nearest neighbor.
Figure 4.8. Application of k-nearest neighbors to a two-class classification problem. The decision boundaries for k=1 and k=20 are shown
It is clear that increasing the value of k provides a smoother decision boundary. The limit case of k=1 generally overfits the data. The other limit case, k=N, where N is the size of the dataset, outputs the same decision for any test point. As mentioned earlier, the contribution of each neighbor can be weighted by a decreasing function of distance. This is called kernel smoothing, which includes general regression neural networks. We will come back to this topic in section 4.4. It is worth mentioning that local methods such as KNN very often perform surprisingly well in low dimensions. However, when dealing with problems involving a large number of variables, these methods may not be reliable. In fact, the more variables are included in a model, the more the data become sparse in the variable space, and the neighbors of a point become far apart. This problem is known as the curse of dimensionality in the machine learning community. K-nearest neighbor methods are considered in detail along with real case studies in Chapter 5.
4.4. Artificial neural network algorithms
In this section we will describe several artificial neural network (ANN) models and illustrate them with simple examples. Among the most frequently used ANNs are the multi-layer perceptron (MLP) and kernel-based neural networks [HAY 98]. General Regression Neural Networks (GRNN) and Probabilistic Neural Networks (PNN) belong to the family of kernel neural networks. Common features of these ANNs are that they are feed-forward neural networks and learn with supervision (with known expected outputs). Feed-forward neural networks propagate information from the input to the output without recurrence, and the error flow is propagated backwards from the output to the input, modifying the ANN parameters according to the dependencies captured from the data. More comprehensive real case studies using these models will be presented in Chapter 5. An ANN of a totally different type that is also described in this section – the Self-Organizing Map (SOM) – is used in classification problems and is based on unsupervised learning.
4.4.1. Multi-layer perceptron neural network
MLP was developed in the 1960s as an algorithm that mimics the signal propagation process in human neurons. MLP consists of basic generic elements – neurons – which are mathematical analogs of brain neurons. An artificial neuron is capable of propagating the data and modifying itself accordingly whilst training.
MLP is a later development of a simpler single layer perceptron, invented in 1957 [ROS 57]. A single layer perceptron solves a simple regression problem evaluating the output y from the input vector X=(x1, …,xi,...,xn):
y* = \sum_i w_i x_i + b   [4.4]
or
y* = \sum_i \frac{w_i}{1 + \exp(-x_i)} + b   [4.5]
where wi are the weights of the connections coming to the neuron and b is the bias. These parameters are determined through the training procedure by minimizing the error between the target variable data y and the perceptron prediction output y* using conventional optimization algorithms (e.g. gradient descent).
Figure 4.9. Sigmoid function used as a nonlinear element in a perceptron
A single layer perceptron (see Figure 4.10) is capable of solving linearly separable binary classification problems (with equation [4.4]) or regression problems (with equation [4.5], using the sigmoid function, see Figure 4.9) [MIN 69]. The latter case is identical to logistic regression. The perceptron weights wi are updated through the learning procedure when the input data x are presented to the input and the corresponding output y* is compared with the expected output y:
w' = w + \alpha (y - y*) x   [4.6]
where α corresponds to the learning rate.
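A minimal sketch of this training loop for the linear output [4.4] is shown below; the learning rate alpha and the number of passes through the data are arbitrary illustrative choices.

import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=50, rng=None):
    # single layer perceptron [4.4] trained with the update rule [4.6]
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = w @ x_i + b                  # output of [4.4]
            w = w + alpha * (y_i - y_hat) * x_i  # update rule [4.6]
            b = b + alpha * (y_i - y_hat)
    return w, b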
Figure 4.10. Single layer perceptron
MLP is an extension of the single layer perceptron based on the addition of a hidden layer of neurons between the input and the output neurons (see Figure 4.11). MLP with a nonlinear element (sigmoid or of another type) in each hidden neuron is a universal approximator capable of approximating any continuous function with just one hidden layer. The MLP estimate with a single hidden layer of m neurons is calculated as a weighted sum:
f_m(x, w, v) = \sum_{i=1}^{m} w_i \, s(X \cdot v_i) + w_0   [4.7]
where wi are the weights corresponding to each neuron connection and w0 is an additive bias corresponding to the entire hidden layer; s(·) is a sigmoid activation function, which represents a nonlinear element in the MLP; vi is a gain – the activation function steepness parameter.
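A minimal sketch of the forward pass [4.7] is given below; hidden-layer biases are omitted, as in the formula, and the weights are assumed to have been obtained by a training procedure such as the one described next.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(X, V, w, w0):
    # single hidden layer MLP output [4.7]
    # X  : (n, p) inputs
    # V  : (p, m) input-to-hidden weights (one column per hidden neuron)
    # w  : (m,)   hidden-to-output weights
    # w0 : scalar output bias
    hidden = sigmoid(X @ V)        # s(X . v_i) for each hidden neuron
    return hidden @ w + w0         # weighted sum of the hidden activations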
Figure 4.11. MLP structure [3-6-1]
The MLP weights wi are obtained through the training procedure [HAY 98]. MLP uses a back-propagation learning algorithm in which the weights are successively modified according to the backward error propagation from the output to the input. Thus, the error is propagated in the opposite direction to the data flow from the input to the output. At the beginning, the weights are initialized at random. Then, the data are sequentially presented to the inputs. The discrepancy between the MLP output and the expected output available from the data produces the error, which has to be minimized using conventional iterative optimization techniques.
Error minimization between the expected output and the MLP output can be performed using a wide selection of known optimization algorithms. Optimization algorithms can be divided into two groups: gradient and stochastic. Gradient optimization algorithms are based on calculating the gradient of the minimized function and are very good at finding local minima. Gradient optimization algorithms vary in efficiency and speed. Among the slower first-order methods are steepest descent and conjugate gradients; faster algorithms, such as Newton's method, the Levenberg-Marquardt algorithm and quasi-Newton methods, use 2nd order derivatives or their approximations [NOC 99]. However, gradient methods do not guarantee that the global minimum is reached: very often gradient optimization gets stuck in a local minimum. Stochastic optimization methods have the ability to jump out of local minima due to their stochastic nature. Thus, they have a better chance of finding the global minimum, though they tend to converge more slowly than the gradient methods. Stochastic optimization methods include simulated annealing, genetic algorithms, the bootstrap, Hamiltonian Monte Carlo and particle swarm optimization. In practice, a combination of stochastic and fast gradient methods provides good results in MLP training. First, stochastic methods search globally for an appropriate starting point for the subsequent gradient search. This allows us to avoid entrapment in the local minima closest to the initialization point. Then, gradient optimization is used to improve the minima found by the stochastic search.
The efficiency of MLP training largely depends on the careful selection of the data used to tune the model parameters (the MLP weights). Usually, all the data are split into three parts of different size: the training set, the test set and the validation set. The training data are used directly: the data are sequentially presented to the MLP input, propagated through the network, and the output obtained is finally compared with the expected output from the training data set. The training error, calculated as the mean squared difference between the MLP output and the expected output, is then minimized as described above through repeated propagation of the training data to the network. The test data are also propagated through the network in the same way as the training data and the test error is calculated accordingly. The principal difference of the test set is that the test error is not minimized and thus not propagated backwards from the MLP outputs to the input. The test error is calculated and compared at every training iteration for quality control purposes. Thus, both the training and the
test data are used to choose and train the estimation model. The validation data are usually the ones initially hidden – not used at all or retained by the customer. The validation data are used as an independent data set to validate the predictions of the chosen MLP configuration. Since the choice of the validation set is beyond the modeler's control, careful selection of the data for training and testing becomes crucial. Both training and test data sets should represent the original distribution. Thus, a simple random sample may not be adequate when the data are clustered in high dimensions. More advanced sampling techniques, e.g. declustering in high-dimensional space, are then used. It is important that both training and test data include the outliers and extreme values. In practice, the test set is smaller than the training set, especially when the total amount of data is limited. The size of the test set can range between 10 and 25% of the total amount of data available. In the case of large amounts of available data this ratio may be higher.
During training it is crucial to identify the optimal number of iterations of the optimization algorithms – the number of weight updates. The decision on when the training should be terminated is made based on the profiles of the training and testing errors calculated during the iterations (see Figure 4.12). The training error can be minimized until it steadily approaches 0 with the increasing number of iterations. Zero training error means exact reproduction of the training data by the MLP predictions, which corresponds to over-fitting. An over-fitted MLP is able to reproduce only the data selected for training and lacks the generalization ability to predict data other than the training data. An increase of the test error is evidence of MLP over-fitting. Thus, the optimal MLP is chosen according to the minimum test error.
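The stopping criterion described above can be sketched as follows, assuming generic train_one_iteration and error routines supplied by the user; the model with the lowest test error seen so far is retained.

import copy

def train_with_early_stopping(model, train_one_iteration, error,
                              train_data, test_data, max_iter=1000, patience=20):
    # stop training when the test error no longer improves (cf. Figure 4.12)
    best_model, best_test_error, since_best = copy.deepcopy(model), float("inf"), 0
    for _ in range(max_iter):
        train_one_iteration(model, train_data)    # one pass of weight updates
        test_error = error(model, test_data)      # test error: monitored, never minimized
        if test_error < best_test_error:
            best_model, best_test_error, since_best = copy.deepcopy(model), test_error, 0
        else:
            since_best += 1
            if since_best >= patience:            # no improvement for a while: stop
                break
    return best_model, best_test_error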
Figure 4.12. Error minimization whilst training: training and testing errors
Let us illustrate how MLP works with a “toy” regression problem of predicting a 1D function. An analytical function was chosen for the synthetic example:
y = \sin(0.4x) + 0.01x + \frac{\sin(2x)}{x}   [4.8]
The chosen function features a semi-periodic structure with a nonlinear trend and multiple local minima in the argument range [0; 20] (see Figure 4.13). Random Gaussian noise (with a zero mean and variance σ2=0.07) was added to the function to generate the data for modeling. An MLP interpolation model is used to reconstruct the target function pattern based on a limited number of data. 20 randomly sampled items of data were generated to be used for the MLP training.
Figure 4.13. Theoretical function and 20 randomly sampled items of data corrupted with noise
Different MLP structures can be used for interpolation. They vary by the number of hidden layers and neurons and also by the type of optimization algorithm used for training. A single hidden layer is sufficient to model quite complex patterns in a simple 1D problem. In general, MLP with a single hidden layer models nonlinearity using the activation function in its neurons. Thus, MLP is able to model nonlinear patterns even in high dimensions.
The number of neurons in the hidden layer characterizes the MLP's capability of modeling multiple local features. Generally speaking, the more neurons there are in the hidden layer, the more degrees of freedom the MLP possesses for modeling complexity. The number of hidden neurons also depends on the amount of data available for training. On the one hand, an MLP with many neurons cannot be trained with too few data. On the other hand, an MLP with too few hidden neurons would not be able to capture complex dependencies represented by a large amount of data. For instance, a linear pattern is determined by just two degrees of freedom, so it can be modeled with the single neuron perceptron described above with no hidden layers, which has just two parameters – the weight and the bias. A quadratic polynomial function with 3 or more degrees of freedom requires a nonlinear element – at least a single hidden neuron – to model it. In practice, we do not know the shape of the considered pattern beforehand, although we may assume its level of complexity: multiple scales, presence of local peaks and dips, etc. In the case of such a complex data pattern, the MLP has to include a fairly large number of neurons, and, thus, a large number of data points is required for training to represent the pattern correctly. In practice, the ratio between the amount of data and the number of MLP connections (which represent the degrees of freedom) should be much larger than 1. If the ratio is 1, the amount of data equals the number of MLP weights (excluding the bias for each layer). In such a situation the MLP would not be able to learn the pattern and each weight would just represent one item of the data. If the ratio between the amount of data and the number of MLP weights is close to 2, then the MLP has a good chance of learning a pattern from the data.
A single hidden layer MLP with few degrees of freedom may be incapable of learning very complex patterns. An MLP structure with two hidden layers is usually used for interpolation of high dimensional problems with multiple inputs and outputs. It can be interpreted that the neurons in one hidden layer reflect the valleys of the manifold where the solution lies, while the neurons in the other hidden layer characterize smaller scale local variations within the valleys. No more than two hidden layers are necessary to analyze even very sophisticated data. In addition, too many neurons in a single layer lead to significant computation times for training, which can be reduced by rearranging the neurons into two hidden layers.
Figure 4.14 shows the performance of MLP with a different number of hidden neurons in a single layer – 3, 5, 10 and 20. MLPs with a small number of neurons are able to capture some of the pattern trends given just 20 items of data for training. The MLP with just 3 hidden neurons provides the smallest amount of detail, while adding 2 hidden neurons allows us to model additional small scale peaks. MLPs with a larger number of hidden neurons (10 and 20) are able to reproduce most of the training data exactly, which leads to over-fitting. As stated above, over-fitting means that the model predicts very well on the training data and loses its ability to generalize on testing and validation datasets. The evidence of over-fitting is also shown in Table 4.1, where the training and validation errors are presented. Validation was
performed using 800 values of the true function within the interpolation region. It is clearly seen that the validation error increases with the number of hidden neurons if no test data is used to control the training (see Figure 4.15 (left)). MLP prediction in the extrapolation region on the edge of the data region can vary significantly leading to high validation errors.
Figure 4.14. MLP trained without a test data set, predictions obtained with a different number of neurons in the hidden layer: 3, 5, 10 and 20
The problem of over-fitting can be overcome by using a separate set of test data to control the learning process and stop the training before the MLP loses its ability to generalize. The test data are presented to the MLP to calculate the corresponding MLP output, but the mismatch between the output and the target data is not propagated back through the MLP and, thus, does not influence the weight optimization. The mismatch between the test data and the corresponding MLP output is called the test error. The minimum of the test error corresponds to the optimally trained MLP, which is able to accurately predict data different from the training data. We have selected 5 out of the 20 items of initial training data for testing purposes, leaving the remaining 15 items of data for training with the test set
control. The MLP prediction results for a different number of hidden neurons are presented in Figure 4.16. Figure 4.15 (right) shows the dependence of the training and validation errors on the number of hidden neurons for the MLP trained with the test data. Use of the test data does not have much impact on the MLPs with 3 and 5 hidden neurons; they do not suffer from over-fitting. However, the performance of the MLPs with 10 and 20 hidden neurons is significantly improved with the use of the test data. The errors summarized in Table 4.1 show that the validation error is similar for all MLP configurations when the test data are used. The training error and the minimum of the validation error suggest that the MLP with 5 hidden neurons provides the best prediction for this particular problem.

Hidden neurons | Use of test data | Training error | Test error | Validation error
3              | No               | 0.063          | -          | 0.091
3              | Yes              | 0.072          | 0.031      | 0.093
5              | No               | 0.030          | -          | 0.166
5              | Yes              | 0.006          | 0.022      | 0.085
10             | No               | 0.016          | -          | 0.168
10             | Yes              | 0.075          | 0.033      | 0.094
20             | No               | 0.015          | -          | 0.204
20             | Yes              | 0.073          | 0.037      | 0.088

Table 4.1. Comparison of MLP interpolation with a different number of hidden neurons, without using the test data in training and with the test data to control the training and avoid over-fitting
The selection of the training algorithm has a significant impact on MLP prediction. Careful choice of the training algorithms allows us to improve the MLP prediction quality. Usually, a combination of stochastic and 2nd order gradient (e.g. Levenberg-Marquardt) algorithms provides the best results. However, the training results may be sensitive to the initialization of the weights and the starting point of the gradient optimization. Therefore, multiple runs of the gradient optimization can be considered in order to reach the global minimum (or a lower local minimum) rather than one of the multiple local minima.
Figure 4.15. Training and validation error for MLP with varied number of hidden neurons: trained using only 20 data for training (left), trained using 15 training data and 5 data for testing (right)
Figure 4.16. MLP trained with a test data set, predictions obtained with a different number of neurons in the hidden layer: 3, 5, 10 and 20
4.4.2. General Regression Neural Networks
Another representative of the kernel-based methods for the regression task is the General Regression Neural Network (GRNN). A GRNN is another name for a well-known statistical non-parametric method called the Nadaraya-Watson Kernel Regression Estimator. It was proposed independently by Nadaraya and Watson in 1964 [NAD 64; WAT 64]. In 1991 it was interpreted by Specht in terms of neural networks [SPE 91]. This method is based on kernel density estimation using the Parzen method [FAN 97; HAR 89]. Omitting the details of the mathematical background, let us present the final formula for the regression estimation of Z(x) using the available measurements Zi:
Z(x) = \frac{\sum_{i=1}^{N} Z_i \, K((x - x_i)/\sigma)}{\sum_{i=1}^{N} K((x - x_i)/\sigma)}   [4.9]
where N is the number of training points and Zi is the function value of the i-th training point with coordinate xi. To simplify the understanding of the GRNN estimation, the normalized weighting function as a function of x can be defined as
W_i(x) = \frac{K((x - x_i)/\sigma)}{\sum_{j=1}^{N} K((x - x_j)/\sigma)}, \quad i = 1, 2, \ldots, N   [4.10]
The denominator of [4.10] gives us the normalization property
\sum_{i=1}^{N} W_i(x) = 1 \quad \text{for all } x   [4.11]
Now we can rewrite equation [4.9] in a simplified form as
Z(x) = \sum_{i=1}^{N} W_i(x) \, Z_i   [4.12]
In this form, equation [4.12] describes the prediction at point x as a weighted average of the Zi observations for all N training points.
The core of this method is a kernel K(·). It depends on two parameters: the distance to the predicted point and the parameter σ. σ is a positive number called the bandwidth or simply the width of the kernel. Note that xi, in fact, is the center of the i-th kernel. Generally different types of kernels can be used, but the Gaussian kernel is usually chosen.
K((x - x_i)/\sigma) = \frac{1}{(2\pi\sigma^2)^{p/2}} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right), \quad i = 1, 2, \ldots, N   [4.13]
where p is the number of dimensions of the input vector x. Finally, the GRNN estimation formula with a Gaussian kernel and without a normalization term is
Z(x) = \frac{\sum_{i=1}^{N} Z_i \exp(-\|x - x_i\|^2 / 2\sigma^2)}{\sum_{i=1}^{N} \exp(-\|x - x_i\|^2 / 2\sigma^2)}   [4.14]
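A minimal sketch of estimator [4.14] for a single prediction point is given below (isotropic Gaussian kernel with bandwidth sigma).

import numpy as np

def grnn_predict(x_train, z_train, x_new, sigma):
    # Nadaraya-Watson / GRNN estimate [4.14]
    d2 = np.sum((x_train - x_new) ** 2, axis=1)    # squared distances to training points
    weights = np.exp(-d2 / (2.0 * sigma ** 2))     # unnormalized kernel weights
    return np.sum(weights * z_train) / np.sum(weights)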
Note that in fact GRNN, according to [4.14], is a linear estimator (the prediction depends linearly on the weights), but also that the weights are estimated nonlinearly according to the nonlinear kernel [4.13]. The model described above is the simplest GRNN algorithm. One of the useful improvements is to use multidimensional kernels instead of one-dimensional kernels as in [4.13]. When the σ² parameter is a scalar, we are dealing with an isotropic model. In a more general case, the parameter σ² may be presented as a covariance matrix Σ. A covariance matrix is a square symmetric matrix of dimension p by p with the number of parameters equal to p(p+1)/2. In the general anisotropic case, [4.13] can be rewritten as
K((x - x_i)/\sigma) = \frac{1}{(2\pi)^{p/2} (\det \Sigma)^{1/2}} \exp\left(-\frac{1}{2} (x - x_i)^T \Sigma^{-1} (x - x_i)\right)   [4.15]
where det means the determinant and Σ-1 is the inverse of the Σ matrix.
Model [4.15] is anisotropic and is much more flexible for modeling data. Such models can be very useful in the case of complex multidimensional data. For example, for 2D spatial mapping we can use the 2D parameter σ = (σx, σy, σxy). Usually, only the diagonal of the Σ matrix is used. In this case the number of adaptive (free) σ values equals the number of dimensions p: σ = (σ1,…,σp). The only adaptive (free) parameter in the GRNN model with a Gaussian kernel is therefore σ (iso- or anisotropic), the width of the kernel.
In order to demonstrate how GRNN works, let us consider a simple “toy” one-dimensional problem (Figure 4.17). A simple sine function represents the true underlying structure of a collection of sample data. In order to generate a training set, this sine function is sampled at a wide variety of points, and random noise is added. The true function and the training samples are shown in Figure 4.17a. Now let us examine the effect produced by different values of the smoothing parameter σ. Figure 4.17b shows what happens when a very small value of σ is used. The GRNN follows the training data closely, almost moving from point to point. If the data is known to be clean (without noise), the GRNN is an excellent interpolation algorithm, analogous to the nearest neighbor method. However, this result is acceptable only if the density of the training data is high enough. In other cases, the “overfitting” effect, which is well known for neural networks, may appear and such solutions will not be optimal. Thus, since in most cases the data are distorted by noise, straightforward interpolation is not an acceptable option.
Figure 4.17. A simple “toy” problem illustrating the influence of the parameter σ on the GRNN result: a) true function and noisy training samples; b) σ too small; c) a well-chosen σ; d) σ too large
A larger smoothing parameter σ gives the result shown in Figure 4.17c, which is almost ideal. Figure 4.17d illustrates the effect of a smoothing parameter that is too large. The global structure of the training set has been completely missed by the algorithm, leading to oversmoothing. Thus, we can conclude that the choice of the smoothing parameter σ is a vital problem for the GRNN model and that this choice is data dependent. For the estimation of σ, the cross-validation procedures discussed in section 4.2.5 may be implemented. Usually, a grid search is used to find the optimal value of the bandwidth. It is therefore necessary to define an interval of σ values [σlow, σhigh] and the number of steps M. The validation is then repeated for all M σ values
\sigma_i = \sigma_{low} + (i - 1)\,\frac{\sigma_{high} - \sigma_{low}}{M}, \qquad i = 1, \ldots, M        [4.16]
The final result (the optimal σ value) corresponds to the model with the smallest cross-validation error. The interval and the number of steps have to be chosen consistently in order to capture the expected optimal (minimum-error) value. Reliable limits are the minimum distance between points and the size of the area under study. In practice, the relevant interval is much smaller and can be defined according to the features of the monitoring dataset and/or prior expert knowledge about the studied phenomenon.
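As a minimal sketch, the GRNN estimator [4.14] and the bandwidth grid search [4.16] can be written in a few lines of Python (NumPy only; the noisy sine data, the σ interval and the use of leave-one-out cross-validation are illustrative assumptions rather than settings prescribed by the text):

import numpy as np

def grnn_predict(x_train, z_train, x_new, sigma):
    # Gaussian-kernel weighted average, equation [4.14]
    d2 = np.sum((x_new[:, None, :] - x_train[None, :, :]) ** 2, axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return (w @ z_train) / np.sum(w, axis=1)

# Noisy 1D toy data (as in the sine example of Figure 4.17)
rng = np.random.RandomState(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))[:, None]
z = np.sin(x[:, 0]) + 0.2 * rng.randn(100)

# Grid search over sigma, equation [4.16], scored by leave-one-out cross-validation
sig_low, sig_high, M = 0.05, 2.0, 40
best = None
for i in range(1, M + 1):
    sigma = sig_low + (i - 1) * (sig_high - sig_low) / M
    errs = []
    for k in range(len(x)):                      # leave point k out, predict it from the rest
        mask = np.arange(len(x)) != k
        pred = grnn_predict(x[mask], z[mask], x[k:k + 1], sigma)[0]
        errs.append((pred - z[k]) ** 2)
    rmse = np.sqrt(np.mean(errs))
    if best is None or rmse < best[1]:
        best = (sigma, rmse)
print("optimal sigma ~", round(best[0], 3), "cross-validation RMSE ~", round(best[1], 3))

The σ with the lowest cross-validation error plays the role of the optimum discussed above.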
4.4.3. Probabilistic Neural Networks

The GRNN model described above is a typical kernel-based model used for regression tasks. A similar model for classification problems is the Probabilistic Neural Network (PNN). Like GRNN, it was developed by Specht in terms of neural networks, in 1990 [SPE 90]. It uses the same mathematical background for the density estimation. However, in the case of classification, the estimated conditional densities for all classes are used to assign a point to one of the classes. The whole data set is divided into subsets according to class membership. Thus, the probability density function for each class is estimated independently.
Finally, a Bayesian optimal or maximum a posteriori (MAP) decision rule is used to make a decision
C(x) \in \{c_1, c_2, \ldots, c_K\}, \qquad C(x) = \arg\max_{c_i} P(c_i)\, p(x \mid c_i), \qquad i = 1, 2, \ldots, K        [4.17]
where K is the number of classes (or generators of random variables) ci (i = 1,2,…,K), P(ci) is the prior probability of class ci and p(x|ci) is the class-conditional distribution (over the whole input space x). The prior probability can be interpreted as an initial guess of the class membership, made before any measurement. Generally, the prior class distribution is highly dependent on the specific task and should be determined from additional (physical, expert, etc.) knowledge of the problem. In fact, a PNN model can make a prediction using these prior probabilities even without measurements! However, in most cases none of this additional information is available. In these cases, all P(ci) are assumed to be equal (P(c1) = P(c2) = … = P(cK)). The conditional distribution is defined by the following formula:
p(x \mid c_i) = \frac{1}{(2\pi\sigma^2)^{p/2}\, N_i} \sum_{n=1}^{N_i} \exp\!\left(-\frac{\| x - x_i^{(n)} \|^2}{2\sigma^2}\right)        [4.18]

where Ni is the number of samples (the class size) belonging to class ci and x_i^{(n)} represents the n-th sample of class ci.
The difference between [4.13] and [4.18], as mentioned above, lies only in the data used for the estimation. In the case of regression all N points are used, whereas here only the points belonging to the specified class ci are used. In order to make the prediction [4.17], PNN simply compares the values obtained with [4.18] for the different classes (taking into account the prior probabilities P(ci)) and attributes the class membership corresponding to the maximum value. An important and very useful characteristic of any model using a Bayesian framework is the possibility of obtaining a confidence measure for the prediction. This means that PNN can not only label a point with one of the classes, but also produces the probability of it belonging to each of them. These probabilities are called posterior – final, after measurements and calculations (as opposed to prior – initial, before measurements and calculations).
The Bayesian confidence (a posteriori probability of x belonging to class ci) is defined by
P(c_i \mid x) = \frac{P(c_i)\, p(x \mid c_i)}{\sum_{k=1}^{K} P(c_k)\, p(x \mid c_k)}        [4.19]
The above discussion about cross-validation showed that training a GRNN is very simple, because only the parameter σ has to be optimized. The same cross-validation procedure can be applied for PNN training/tuning as well. It is only necessary to modify the target/error function. In general, a continuous error function can be used for the minimization despite the fact that the classification error is a discrete value (the number of misclassified points). This is possible thanks to the Bayesian posterior probabilities [4.19], which are continuous functions within the limits [0; 1]. Therefore, a continuous error function for the σ optimization procedure can be defined as follows
e(x \mid c_i) = \left[ 1 - P(c_i \mid x) \right]^2        [4.20]
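A compact sketch of the PNN rules [4.17]-[4.19] is given below (NumPy only; the two-class toy data, the bandwidth and the equal priors are assumptions made purely for illustration):

import numpy as np

def pnn_posteriors(x_train, labels, x_new, sigma, priors=None):
    classes = np.unique(labels)
    if priors is None:                              # equal priors P(c_i), as assumed in the text
        priors = np.ones(len(classes)) / len(classes)
    p = x_train.shape[1]
    norm = (2 * np.pi * sigma ** 2) ** (p / 2.0)
    dens = []
    for c in classes:                               # class-conditional Parzen density, equation [4.18]
        xc = x_train[labels == c]
        d2 = np.sum((x_new[:, None, :] - xc[None, :, :]) ** 2, axis=2)
        dens.append(np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1) / norm)
    num = np.array(dens).T * priors                 # P(c_i) p(x|c_i)
    post = num / num.sum(axis=1, keepdims=True)     # Bayesian confidence, equation [4.19]
    return classes[np.argmax(post, axis=1)], post   # MAP decision [4.17] and posteriors

# Two-class toy data
rng = np.random.RandomState(2)
x0 = rng.randn(50, 2)
x1 = rng.randn(50, 2) + [3, 3]
X = np.vstack([x0, x1])
y = np.array([0] * 50 + [1] * 50)
labels, post = pnn_posteriors(X, y, np.array([[0.5, 0.5], [2.5, 3.0]]), sigma=1.0)
print(labels, post.round(3))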
GRNN and PNN are two models which are quite efficient for automatic mapping in regression and classification tasks (see Chapter 5 for case studies). Now, let us consider self-organizing (Kohonen) maps, which are neural networks based on unsupervised learning algorithms.

4.4.4. Self-organizing (Kohonen) maps
A self-organizing map (SOM) is a type of artificial neural network. SOM is a powerful tool when dealing with highly multivariate problems. SOM was developed by T. Kohonen. Since then, SOM has been successfully applied in many areas of interest, such as finance, medicine, robotics, (speech, image, signal) pattern recognition, classification of physical and chemical systems and many others. A large list of publications (containing more than 3,000 works) on SOM theory and applications can be found online at http://neuron-ai.tuke.sk/NCS/VOL1/P4_html/node35.html.
The characteristic features of SOM are: one hidden layer of neurons organized in a lattice structure and unsupervised competitive learning. As mentioned above, in unsupervised learning there are no examples with known answers to be learnt by the network, not even for a data subset. The goal of such a learning procedure is to organize training data into clusters according to their similarity and correlation criteria. That is why unsupervised learning can be referred to as self-organizing learning. Successful performance of unsupervised learning requires redundancy in training data. Once the network has been tuned to the statistical regularities of the training data, it develops the ability to specify encoding features of the data and thereby create new classes automatically. The learning procedure (as for any other neural approaches) is carried out by the modification of weights (wij) assigned to links between neurons (i and j):
w_{ij}(n+1) = w_{ij}(n) + \Delta w_{ij}(n)        [4.21]
where n is the number of iterations of the learning procedure. Modification of the weights follows some rules, such as, for example, the Hebbian or competitive rules. Hebbian learning is a two-phase rule [KOH 00, HAY 98]:
– if two neurons on both sides of a synapse (connection) are activated simultaneously (synchronously), then the strength of that synapse is selectively increased;
– if two neurons on both sides of a synapse are activated asynchronously, then that synapse is selectively weakened or eliminated.
A direct Hebbian rule for the modification, at iteration step n, of a link connecting a neuron producing output xi to a neuron receiving input xi and simultaneously producing output yj is
\Delta w_{ij}(n) = \eta\, x_i(n)\, y_j(n)        [4.22]
where η is the learning rate. The problem with the direct Hebbian learning rule is the risk of an uncontrollable increase of the weight values. There are different mathematical tricks to overcome this problem; one of them is the so-called Oja learning rule [OJA 82]
\Delta w_{ij}(n) = \eta\, y_j(n) \left( x_i(n) - \sum_{k=1}^{p} y_k(n)\, w_{ik}(n) \right)        [4.23]
where p is the number of neurons in the layer. A one-layer ANN with linear neurons trained by the Oja learning rule performs principal component analysis. Oja's network without the linearity constraint on the neurons produces nonlinear component analysis and is useful for data compression problems. It is also called a “bottleneck network”. The self-organization rule comes from ideas based on Shannon's information theory [KOH 00]. Its basis is Linsker's principle of maximum mutual information (Infomax), which states that the synaptic links of a multilayered neural network should be organized so as to maximize the amount of information that is preserved when signals are transformed while passing through the layers of the network. In other words, the mutual information between the input and output signals of each of the network's layers should be maximized. Development in this direction has led to independent component analysis [HYV 01]. Such networks are useful for the separation of signal from noise in all types of signal recognition systems. The simplest and easiest rule is competitive learning. The output neurons of the network compete to be activated. The competitive comparison between neurons is based on some measure of similarity to a given input. Only one neuron is the winner for each input (winner-take-all competition). The learning process consists of the modification of the winner's weights (wi) in order to make them more similar to the input pattern:
\Delta w_i(n) = \eta \left( x(n) - w_i(n) \right)        [4.24]
Figure 4.18. One step of competitive learning procedure
An illustration of one step of competitive learning is presented in Figure 4.18. An input sample is presented to the neural network. The node most similar to the input is selected according to a certain metric. Once the winner has been selected, it is modified to become more similar to the input. The new set of nodes contains the modified winner and all the other, unchanged nodes. The learning procedure is applied to every input. This algorithm is useful for clustering the data according to internal features. Nonetheless, it suffers from non-converging oscillations, and the optimal number of neurons is unknown. Too few neurons decrease the quality of cluster separation, while too many neurons can produce dead (never trained) neurons with unpredictable consequences. Competitive learning in an SOM differs from the standard algorithm in the modification step – not only the winning neuron is modified, but also its neighbors. The procedure looks like the stretching of an elastic net over the data. It prevents dead neurons, and the winner's neighbors are likely to be winners for data that are not in the training dataset. The vectors assigned to neighboring nodes in the SOM are neighbors in the data space as well. Therefore, we obtain not only a quantization of the input data set but also an ordering of the input data onto the map (the structure of neurons).
Now, let us give a theoretical presentation of SOM. The hidden layer of an SOM is organized as an array (M) with a lattice structure. The type of array can differ, but rectangular and hexagonal structures are the most common. Usually, the array is 1- or 2-dimensional. Each neuron (ri ∈ M) possesses a vector of weights mi = [μi1, μi2,…, μin]^T, mi ∈ R^n, also called a reference vector. Its dimension (n) is equal to the dimension of the input data space. There are no links issuing signals between neurons, but they have some knowledge of their vicinity through the neighborhood. The reply of the net to the presentation of an input vector x (x ∈ R^n) is given by the winner node c ∈ M, the one closest to the input according to the accepted metric: c = argmin_i ||x − mi||. Most often, the Euclidean distance is used. The training starts with the initialization of the reference vectors. Usually, the initial weights are defined as random values in the range of the corresponding coordinates of the input dataset. During the training, the vector mi referring to the node ri changes its values according to the input vectors. The modification on input x(t), which is presented to the net at iteration t of the learning process, is defined by the following formula:
m_i(t+1) = m_i(t) + \eta(t)\, h_{ci}(t)\, [\, x(t) - m_i(t) \,]        [4.25]
where η(t) is the learning rate (0 < η(t) < 1) and h_ci(t) is the neighborhood function

h_{ci}(t) = h(\| r_c - r_i \|, t)        [4.26]
where rc ∈ M and ri ∈ M are nodes. The distance between the nodes is estimated according to the level of neighborhood in the array. For the closest neighbors of a node rc the distance is equal to 1; the closest neighbors of nodes which are at distance i from node rc are separated from rc by distance i+1. The number of closest neighbors of a node depends on the structure of the lattice. If the lattice is organized as a rectangular grid, there are 4 neurons at distance 1 from a node, 8 at distance 2, 12 at distance 3, etc. (Figure 4.19a). For a hexagonal grid, there are 6 neurons at distance 1 from a node, 12 at distance 2, etc. (Figure 4.19b).
Figure 4.19. Estimation of the distance between neurons: a) in the rectangular grid; b) in the hexagonal grid
One of the simplest ways to describe h(r, t) is the so-called “bubble neighborhood”:
h(r, t) = \begin{cases} 1, & r \le R(t) \\ 0, & r > R(t) \end{cases}        [4.27]
where R(t) is a radius decreasing monotonically as t → ∞. The rate at which R(t) decreases in time is predefined. Such a neighborhood function is very easy to understand and implement. With a “bubble neighborhood”, all neurons within radius R(t) of the winner are modified in the same way (including the winner itself). The absence of any differentiation of the modification according to the distance from the winner is the main disadvantage of such a neighborhood function. To overcome this drawback, h(r, t) can be defined in the form of a Gaussian function:
h(r, t) = \exp\!\left(-\frac{r^2}{2\sigma^2(t)}\right)        [4.28]
where σ(t) is a parameter representing the radius of the neighborhood; it decreases monotonically as t → ∞. Usually, the training process is divided into two steps. During the first step the reference vectors are ordered (ordering phase). During the second step the reference vectors are tuned (convergence phase). The main purpose of the ordering phase is to construct the general ordering (a smooth elastic net). It starts with quite high values of R – around half of the map's grid size – and values of η usually around 0.1. For the ordering phase approximately 1,000 iterations are enough. The second phase deals with local peculiarities. Thus, this phase starts with slower learning rates (0.01) and a smaller radius (2). A practical implementation of the decreasing functions is, for example, to multiply the starting value by (1 − t/T), where T is the number of iterations in the phase. The reference vector of a node in a trained SOM contains the averaged values of the data for which this node is expected to be the winner. This property allows us to use an SOM to estimate missing values in vectors drawn from the same pattern. Each missing value is indicated as an unknown value. If an input vector has unknown values, its size is reduced to the number of known values. The distance is then calculated in the truncated space. The reference vector values of the winner are the SOM estimates of the missing values.
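A complete training loop along these lines can be summarized in a short sketch (NumPy only; the 8x8 rectangular map, the Gaussian neighborhood [4.28] and the simple linear decay of η(t) and of the radius are illustrative choices, not prescriptions from the text):

import numpy as np

rng = np.random.RandomState(3)
data = rng.rand(500, 3)                         # toy 3-dimensional input vectors

rows, cols, T = 8, 8, 2000                      # 8x8 rectangular lattice, T training iterations
grid = np.array([[i, j] for i in range(rows) for j in range(cols)], dtype=float)
m = rng.rand(rows * cols, data.shape[1])        # reference vectors m_i, random initialization

eta0, radius0 = 0.1, max(rows, cols) / 2.0
for t in range(T):
    x = data[rng.randint(len(data))]            # present one input vector
    c = np.argmin(np.sum((m - x) ** 2, axis=1)) # winner node, c = argmin_i ||x - m_i||
    decay = 1.0 - t / float(T)                  # linear decay of learning rate and radius
    eta, radius = eta0 * decay, max(radius0 * decay, 0.5)
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)  # squared lattice distance to the winner
    h = np.exp(-d2 / (2.0 * radius ** 2))       # Gaussian neighborhood, equation [4.28]
    m += eta * h[:, None] * (x - m)             # update rule, equation [4.25]

print("trained reference vectors:", m.shape)

In practice the two-phase schedule described above (ordering, then convergence) would replace the single linear decay used here.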
4.5. Statistical learning theory for spatial data: concepts and examples

4.5.1. VC dimension and structural risk minimization
In section 4.2.5, we discussed the role of the empirical error in model selection, and the importance of not minimizing it to a global minimum. In this section, the basic concepts of statistical learning theory, which formalize these intuitions, will be presented. Statistical learning theory, also called Vapnik-Chervonenkis theory, has been developed since the 1960s [VAP 98, VAP 95, VAP 06] and has shed new light on the problem of inference and model selection in statistics and machine learning. Its core concept is the notion of the Vapnik-Chervonenkis (VC) dimension, which is a way of quantifying the complexity of a model. Technically, a particular model belongs to a class of functions {f(x, w)}, where x is the input data and w the vector of parameters. A class of functions can be, for example, the polynomial functions of the third degree. The VC dimension can be defined as the number of data points that can be shattered by members of {f(x, w)} [HAS 01]. This principle is illustrated in Figure 4.20 with a first-order polynomial. It can be shown that for this class of functions, the VC dimension is always equal to d + 1, where d is the dimension (number of variables) of the input space.
Figure 4.20. Illustration of the VC dimension principle for a binary classification problem. A first-order polynomial can shatter any three-point configuration but appears to be useless for a four-point configuration
The more data points a classifier can shatter, the larger its capacity is said to be. Statistical learning theory aims at building models that have a powerful generalization ability by minimizing the empirical error while controlling the capacity – it is thus said to perform structural risk minimization. In fact, a capacity that is too high will lead to overfitting.
It is important to note that the capacity is often related to the number of parameters – especially for polynomial functions – but this is far from always the case. For example, the function f(x) = sin(ωx) only has one parameter (ω). However, it has a very large capacity, since a sinusoidal function can fit any set of points given a large enough frequency (except a set of points aligned along a straight line). First, let X be a random variable with distribution p(x), f a class of functions and L a loss function. The aim of any modeling method is to minimize the expected risk associated with f:
R(f) = E_X\!\left[ L(x, f) \right] = \int_X L(x, f)\, p(x)\, dx        [4.29]
We will thus select model parameters that minimize R(f). The main problem is that p(x) is unknown; we only have at hand a finite training set Dn, with a distribution that may be different from that of X. Minimizing the loss function on Dn minimizes the empirical risk:
\hat{R}(f, D_n) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, f)        [4.30]
Intuitively, we can guess that as the size of Dn goes to infinity, the empirical risk converges to the expected risk. In fact, it can be proven that the expectation of the empirical risk is equal to the expected risk. Obviously, this situation never happens in problems of interest, since the dataset available is always far from representative of the whole distribution. Consequently, a model found by minimizing solely the empirical error on the training set Dn will be better on Dn than on any other random sample of distribution X. VC theory states that we can bound the difference between the training error and the generalization error as a function of the capacity of the model and of the size of the dataset. On a more practical side, VC theory has shown that better generalization ability can be achieved, particularly when dealing with many variables, by finding the set of functions of minimum capacity that is consistent with the training data. This has led to the formulation of the support vector machine algorithm, which we will present in section 4.5.3.

4.5.2. Kernels
Methods based on kernels, also called reproducing kernel Hilbert spaces in statistical literature, are a broad and rapidly-evolving class of machine learning
methods. Kernel methods are considered to be the major breakthrough of the 1990s in machine learning. Just as neural networks rendered the analysis of nonlinear datasets possible via heuristic methods, kernel methods allow us to apply mathematically sound linear methods to nonlinear datasets using an implicit mapping, which we will describe in the following sections.

4.5.3. Support vector machines
Support vector machines are a nonlinear extension developed in [BOS 92] of the optimal margin classifier, based on VC theory. We will first have a look at the linear SVM. Two cases can be encountered: the case of separable or non-separable data. By separable data, we mean data that can be separated by a linear hyperplane. The main idea is to build a classifier that is able to separate the data by finding the hyperplane that is the farthest away from the closest training points. The minimal distance between the hyperplane and the training points is called the margin, which is maximized by the SVM algorithm.
Figure 4.21. Illustration of the maximum margin separation principle. The classifier obtained this way is likely to have the smallest generalization error
The points that lie on the dashed lines, i.e., the closest points to the separating hyperplane, are called the support vectors. The goal of the SVM algorithm is to determine those support vectors, which are the “important” points in the dataset.
The SVM algorithm assumes training patterns of the form \{x_i, y_i\}, where

y_i = \begin{cases} +1 & \text{if } x_i \in A \\ -1 & \text{if } x_i \in B \end{cases}        [4.31]
We are searching for a decision function D(x) which takes the form

D(x) = w \cdot x + b        [4.32]
The goal is to find parameters w such that

x_i \cdot w + b \ge +1 \quad \text{for} \quad y_i = +1
x_i \cdot w + b \le -1 \quad \text{for} \quad y_i = -1        [4.33]
This yields
y_i (x_i \cdot w + b) - 1 \ge 0        [4.34]
The constrained optimization problem can thus be stated as follows:

\min \ \frac{1}{2} \| w \|^2 \quad \text{s.t.} \quad y_i (x_i \cdot w + b) - 1 \ge 0        [4.35]
This problem can be solved by simply using Lagrange multipliers. The Lagrangian can be written as

L(w, b, \alpha) = \frac{1}{2} \| w \|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i        [4.36]
The αi are the Lagrange multipliers. We set the derivatives of the Lagrangian with respect to w and b to zero and, using simple algebraic manipulation, we obtain

w = \sum_{i=1}^{l} \alpha_i y_i x_i        [4.37]
and

\sum_{i=1}^{l} \alpha_i y_i = 0        [4.38]
We substitute these results in the previous Lagrangian and obtain its dual formulation:

L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)        [4.39]
By maximizing this dual Lagrangian with respect to the αi, we find the αi, i.e., the support vectors (the points with non-zero αi). The classifier output can simply be expressed as
f(x) = \mathrm{sign}\!\left( \sum_{i} \alpha_i y_i (x_i \cdot x) \right)        [4.40]
It is interesting to note that the complexity of the solution depends on the number of support vectors and not on the number of data points. If it is impossible to fit a linear hyperplane that separates the data perfectly, we may want to allow some errors, i.e., points that fall on the wrong side of the hyperplane. In that case, the problem may be re-written as

x_i \cdot w + b \ge +1 - \xi_i \quad \text{for} \quad y_i = +1
x_i \cdot w + b \le -1 + \xi_i \quad \text{for} \quad y_i = -1
\xi_i > 0 \quad \forall i        [4.41]

The ξi are called the slack variables, which measure the deviation of the wrongly classified points from the hyperplane. The objective function has to be changed from \frac{1}{2}\|w\|^2 to \frac{1}{2}\|w\|^2 + C \sum_i \xi_i. C is a user-specified regularization constant that controls the trade-off between the complexity of the classifier and the number of misclassified training points, measured by the ξi. Even though an optimal value for C can be analytically determined using VC theory, it is in practice determined experimentally using a validation dataset or cross-validation. A third case must be considered, when the data is intrinsically nonlinear. In fact, in most real-life situations, the linear separability hypothesis is too restrictive. A first approach that could be considered is projecting the data in a higher-dimensional space, where a linear separation could be more easily achieved. However, this approach could be quite computationally expensive, and it is not even necessary to do so to achieve our goal. In fact, in the solution of the classifier, the inputs only appear in the form of dot products, which represent a measure of similarity in the input space. It can be proven that any matrix K that is symmetric (i.e., Kij = Kji) and positive semi-definite (i.e., with eigenvalues greater than or equal to zero) represents a similarity measure in another mathematical space, usually of greater dimension.
The most traditionally used kernels are the following:
– polynomial: K(x, y) = (x \cdot y + 1)^p
– Gaussian: K(x, y) = \exp\!\left(-\frac{\| x - y \|^2}{2\sigma^2}\right)
– sigmoid: K(x, y) = \tanh(\kappa\, (x \cdot y) - \delta)
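The requirements of symmetry and positive semi-definiteness are easy to check numerically, as in the following sketch (a Gaussian kernel evaluated on random points; the data and the bandwidth are arbitrary assumptions):

import numpy as np

rng = np.random.RandomState(4)
X = rng.rand(30, 2)
sigma = 0.5

d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-d2 / (2 * sigma ** 2))              # Gaussian kernel matrix

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 (up to round-off) for a valid kernel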
Apart from these “standard” kernels, problem-specific kernels have been devised by scientists in many different fields. For example, in the field of remote sensing image analysis, textural kernels based on the homogeneity of pixel intensity have been formulated in [LAF 05] in order to detect forest fires and extract urban areas from satellite images. This approach can be more efficient than comparing images in a pixelwise manner. The only requirements on a kernel are that it should contain information about the similarity between data points, while being a symmetric and positive semi-definite matrix. Finally, most real-world problems involve the classification of objects into more than two classes. The most common way of dealing with multiclass problems is the one-against-all approach (see e.g. [WES 98]). For each of the M classes, a classifier that separates the class from all the others is trained. It is also possible to formulate the SVM algorithm for M classes directly, but the resulting optimization problem is hard to solve. Two types of parameters have to be tuned: the SVM parameters and the kernel parameters. In the case of SVM classification, the parameter C, previously described as the trade-off between complexity and classification error, has to be selected carefully. In most applications, an arbitrary value for C will provide poor classification performance. Kernel parameters are of course different for each type of kernel function. For the Gaussian kernel, the bandwidth σ has to be selected. This parameter controls the smoothness of the resulting classifier. For the polynomial kernel, we must select the degree of the polynomial. In Figure 4.22, we show the influence of the parameter σ on a practical problem. The studied dataset is the concentration of Caesium-137 (137Cs) taken from a soil survey in Russia shortly after the Chernobyl accident. Even though the concentration is a continuous value, we have transformed it into a binary {0,1} value for the purpose of this exercise. If the concentration exceeds a fixed threshold (800 kBq/m3), the location is considered as at-risk (1). Otherwise, the value is 0. On the left, σ is equal to 7, while its value is 2 on the right.
Figure 4.22. SVM classification with a Gaussian kernel of at-risk locations for the 137Cs dataset. Black dots represent unsafe locations, while the crosses are below the safety threshold. On the left, a value of 7 is used for σ, while a value of 2 is used on the right. We see that the classifier capacity increases as σ decreases. Even though a better classification performance is achieved on the right, this classifier might lead to overfitting. The simpler classifier (left) should be favored
As can be seen from Figure 4.22, the capacity of the SVM classifier increases as the value of σ decreases. Even though the SVM on the right achieves a better class separation, the one on the left is more likely to perform well on new data. In order to tune σ, we should aim at finding a value that yields a minimum error on test data (or on cross-validation), and not on training data.
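The effect of the bandwidth can be reproduced with a few lines of code, as in the hedged sketch below (scikit-learn is assumed, synthetic two-class data stands in for the 137Cs survey, and the bandwidths are arbitrary; note that scikit-learn parameterizes the Gaussian kernel through gamma = 1/(2σ²)):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(5)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # synthetic "at-risk" indicator

for sigma in (7.0, 2.0):                               # large sigma: smoother boundary, lower capacity
    clf = SVC(kernel='rbf', gamma=1.0 / (2 * sigma ** 2), C=10.0)
    clf.fit(X, y)
    print("sigma =", sigma,
          "training accuracy =", round(clf.score(X, y), 3),
          "support vectors =", len(clf.support_))

Comparing the two fits on an independent test set, rather than on the training accuracy printed here, is what the tuning recommendation above amounts to.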
4.5.4. Support vector regression

Support vector regression (SVR) is an adaptation of the SVM methodology to the problem of function approximation. An extensive introduction to SVR can be found in [SMO 98]. The methodology is in many ways similar to the SVM algorithm, but here we are trying to find the best regression function, which takes the form

f(x) = w \cdot x + b        [4.42]
We are searching for a function that is as flat as possible – in order to minimize the structural risk – while minimizing a certain loss function. Instead of using standard loss functions (mean square or absolute error), SVR uses a more robust criterion: the ε-insensitive loss function. This criterion can be stated as follows:
L(y, f(x)) = \begin{cases} | y - f(x) | - \varepsilon & \text{if } | y - f(x) | > \varepsilon \\ 0 & \text{otherwise} \end{cases}        [4.43]
This renders the loss function insensitive to errors that are smaller than a given threshold ε. This criterion is illustrated in Figure 4.23.
Figure 4.23. SVR ε-insensitive loss function
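For reference, the ε-insensitive loss [4.43] amounts to a one-line function (the sample values below are arbitrary):

import numpy as np

def eps_insensitive_loss(y, f, eps):
    # zero inside the tube |y - f| <= eps, linear outside, equation [4.43]
    return np.maximum(np.abs(y - f) - eps, 0.0)

print(eps_insensitive_loss(np.array([1.0, 2.0, 3.0]), np.array([1.05, 2.5, 1.0]), eps=0.1))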
As we want the flattest possible function, the norm of the parameters should be minimized, constrained by the error obtained with this loss function. This yields the following optimization problem:

\min \ \frac{1}{2} \| w \|^2 \quad \text{s.t.} \quad \begin{cases} y_i - w \cdot x_i - b \le \varepsilon \\ w \cdot x_i + b - y_i \le \varepsilon \end{cases}        [4.44]
This problem has a solution if and only if all the data points fit into the ε-insensitive tube, which is seldom the case. However, as we may want to allow some errors to occur, we can introduce the aforementioned (see section 4.5.3) slack variables. This leads to a similar reformulation of the problem:
\min \ \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad \begin{cases} y_i - w \cdot x_i - b \le \varepsilon + \xi_i \\ w \cdot x_i + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases}        [4.45]
By solving the dual Lagrangian, we find that the solution can be expressed as
f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b        [4.46]
This is called the support vector expansion. Again, if the data is nonlinear, the kernel trick can be applied, as the inputs only appear through dot products. For the nonlinear case, the support vector expansion is thus
f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) K(x_i, x) + b        [4.47]
where K is an appropriate kernel function. For illustrative purposes, we can go back to the 137Cs example, and map the concentration values over the whole area of study using SVR. To show the influence of the SVR parameters C and ε and of the kernel parameter σ, we show six graphs. In every pair, two parameters are kept at a fixed value, while the third is varied.
Figure 4.24. Mapping of 137Cs with SVR, with a Gaussian kernel. The two top graphs have different σ values, the two middle ones have a different C, while the bottom two have a varying ε
It is obvious that the parameters have a great influence on the behavior of the predictor. In order to tune these parameters, we can use one of the criteria mentioned in section 4.2.4. Here, we have mapped error surfaces using different values for σ, C and ε. For various settings of (σ, C) and (σ, ε), an SVR model has been trained, and its generalization performance has been measured on a test set of 200 samples not used for training. This technique is usually called grid search. It is important to note that a complete grid search for 3 parameters would involve testing settings in the whole space spanned by (σ, C, ε), which is far more computationally intensive. However, given the smoothness of the error surfaces, testing only the (σ, C) and (σ, ε) settings is sufficient. If more parameters have to be tuned, other techniques based on optimization heuristics would have to be considered, such as hill climbing, evolutionary algorithms or simulated annealing.
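A hedged sketch of such a parameter search with scikit-learn follows (synthetic data replaces the 137Cs measurements, the parameter grids are arbitrary, gamma again plays the role of 1/(2σ²), and a full three-parameter grid is searched here for brevity, whereas the text above uses two 2-dimensional slices):

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(6)
X = rng.uniform(0, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1]) + 0.1 * rng.randn(300)

param_grid = {
    'gamma':   [0.1, 1.0, 10.0],     # plays the role of 1/(2*sigma^2)
    'C':       [1.0, 10.0, 100.0],
    'epsilon': [0.01, 0.1, 0.5],
}
search = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5,
                      scoring='neg_root_mean_squared_error')
search.fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validation RMSE:", round(-search.best_score_, 3))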
Figure 4.25. Error surface of (σ, ε) for the mapping of 137Cs with SVR. Here, σ should be chosen between 0.2 and 0.5 and ε between 1 and 150
4.5.5. Unsupervised techniques
So far, we have focused on methods aiming at predicting an output from input values. Unsupervised techniques mainly encompass the problems of clustering and density estimation, which we have previously mentioned. In any of these two
problems, we want to infer properties of the probability distribution underlying the observed data, without knowledge of corresponding output values. This includes finding a simplified representation of a dataset for visualization or feature extraction purposes.

4.5.5.1. Clustering

Clustering can be defined as partitioning the dataset into subsets such that the data points in each subset share common characteristics. Similarities are judged with respect to a problem-specific distance measure. Perhaps the most popular clustering algorithm is k-means. It aims at finding k clusters with centers μk such that the within-cluster distance between the data points is minimized. The distance used is usually the Euclidean distance:
d(x_i, x_j) = \| x_i - x_j \|^2        [4.48]
We aim at finding an assignment vector C such that the total distance between the points and their respective centers is minimized. The k-means algorithm can be stated as follows:
1) initialize randomly k cluster centers μk;
2) calculate the assignment vector C by assigning each point to its closest center, i.e.,

C_i = \arg\min_{1 \le k \le K} \| x_i - \mu_k \|^2

3) calculate the new cluster centers as the mean of each cluster:

\mu_k^{new} = \frac{1}{N_k} \sum_{x_i \in c_k} x_i

where Nk is the number of points in cluster ck;
4) repeat steps 2 and 3 until the assignment C does not change.
Figure 4.26 illustrates the application of k-means on a “toy” dataset made of two Gaussian distributions.
Figure 4.26. Cluster membership for a two-cluster dataset using classical k-means
An important point is that with the “standard” k-means algorithm, no cluster overlap is possible if all clusters have the same dispersion, since the data points are assigned to the nearest cluster center. Often, fuzzy k-means is used instead. Rather than providing a “crisp” partitioning, fuzzy k-means provides a membership probability. Each point has a membership value for each cluster, which is inversely proportional to the distance separating it from that cluster. At each step, the cluster centers are calculated as in k-means, but the mean is weighted with respect to the membership of each point. In other words, fuzzy k-means allows us to deal with uncertain inputs, i.e., inputs for which the class membership is probabilistic. Note that within the k-means algorithm, we have assumed that the data lie in the usual Euclidean space, i.e., we calculated the Euclidean distance between data points. It is important to note that this algorithm can be used with other distance measures. In addition, the results of k-means depend on the initialization of the cluster centers. It is usually necessary to run the algorithm several times in order to verify the validity of the resulting clusters. In fact, the algorithm can become trapped in a cluster assignment vector that represents a local minimum.
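A minimal NumPy implementation of the four steps above is given below (the two-Gaussian toy data mirrors Figure 4.26; the number of clusters, the random initialization and the iteration cap are assumptions, and a production implementation would also guard against empty clusters):

import numpy as np

rng = np.random.RandomState(7)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + [4, 4]])   # two Gaussian clouds

k = 2
centers = X[rng.choice(len(X), k, replace=False)]   # step 1: random initialization
for _ in range(100):
    # step 2: assign each point to its closest center
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    assign = np.argmin(d2, axis=1)
    # step 3: recompute the centers as cluster means
    new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):            # step 4: stop when the centers are stable
        break
    centers = new_centers

print("cluster centers:\n", centers.round(2))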
Other clustering methods, such as spectral clustering (see e.g. [NGA 01]), can cluster data by using an arbitrary affinity measure between data points, yielding a weighted graph as a representation of the data. These methods are based on the eigenspectrum of a matrix called the graph Laplacian, which derives from the affinity matrix of the data. These methods are, however, out of the scope of this chapter. An introduction to this very promising subject can be found in [VON 06].

4.5.5.2. Nonlinear dimensionality reduction

Nonlinear dimensionality reduction is a recent and popular subfield of the larger domain of feature extraction. Feature selection, discussed previously, aims at selecting the best variables for a specific model. Feature extraction, rather than selecting already existing variables, tries to select a linear or nonlinear combination of the input variables which best suits the problem at hand. In the general case of nonlinear feature extraction, the problem is often called manifold learning, or nonlinear dimensionality reduction (NDR). The most popular linear feature extraction method is principal component analysis (PCA) [JOL 02], also called the Karhunen-Loève transform. This technique is well known in the geostatistical literature, where it is used as “factorial kriging”. PCA is a linear method in the sense that it searches for a linear combination of variables that may account for the variability observed in the data. Technically speaking, this is carried out by finding the eigenvectors of the covariance matrix of the data. The data is then projected on this vector basis, which represents uncorrelated features. The procedure can be summarized as follows. Let X be the input data matrix:
– standardize the data: x \leftarrow \frac{x - \bar{x}}{\sigma_x}
– calculate the covariance matrix of the data: S = \frac{1}{N} \sum_{j=1}^{N} x_j x_j^{T} = \frac{1}{N} X X^{T}
– calculate the eigenvalues λi and eigenvectors ui of S;
– calculate the new data matrix F with F = XU, where U is the eigenvector matrix.
However, nonlinear relationships between the variables may be missed by PCA. A natural extension of principal component analysis to nonlinear datasets is kernel PCA [SHA 04]. In the PCA formulation, we have seen that the covariance matrix of the standardized data is expressed as a dot product between input data. As we did for SVMs, we wish to map the data xi into a space of higher dimension in order to seek nonlinear combinations of the initial variables with linear methods. Let Φ(xi) be
the projection of xi in feature space. The covariance matrix in feature space can be re-written as
S = \frac{1}{N} \sum_{j=1}^{N} \Phi(x_j)\, \Phi(x_j)^{T}        [4.49]
The following eigenvalue problem must be solved
\lambda U = S U        [4.50]
Since we have defined the following property for kernels:
K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)        [4.51]
It is shown in [SCH 98] that the eigenvalue problem can be re-written as
N \lambda \alpha = K \alpha        [4.52]
The α are the eigenvectors of the kernel matrix. The algorithm is thus similar to PCA, but the kernel matrix and its eigenspectrum are calculated instead of those of the covariance matrix of the input data. The image of a point x on the eigenvector basis can be obtained with

U(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x)        [4.53]
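The eigenvalue problem [4.52] and the projection [4.53] can be sketched directly in NumPy (a noisy circle provides the nonlinear structure; the bandwidth is arbitrary, and the kernel matrix is centered in feature space, a step also required by [SCH 98]):

import numpy as np

rng = np.random.RandomState(8)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.randn(200, 2)   # noisy circle: nonlinear structure

sigma = 0.5
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-d2 / (2 * sigma ** 2))                # Gaussian kernel matrix

N = len(X)
one = np.ones((N, N)) / N
Kc = K - one @ K - K @ one + one @ K @ one        # center the kernel matrix in feature space

eigval, eigvec = np.linalg.eigh(Kc)               # solve N*lambda*alpha = K*alpha, equation [4.52]
order = np.argsort(eigval)[::-1]
alpha = eigvec[:, order[:2]] / np.sqrt(np.maximum(eigval[order[:2]], 1e-12))

projections = Kc @ alpha                          # image of the points on the first two axes, equation [4.53]
print("projected data shape:", projections.shape)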
Many other methods for NDR have emerged in recent years: isomap [TEN 00], locally linear embedding [ROW 00], Laplacian eigenmaps [BEL 03] and maximum variance unfolding [WEI 04]. These methods are all related to PCA or multidimensional scaling (MDS). Firstly, they build a distance measure that renders the nonlinear problem linear through a change of coordinate system. Secondly, they construct a new basis of coordinates based on the spectral decomposition of the distance matrix previously obtained. NDR has many interesting application avenues in spatial data analysis. For example, automatically learning nonlinear objects such as roads or lakes could help the automatic digitization of maps. Furthermore, when we deal with datasets with a large number of meteorological and topographic variables, NDR can help uncover nonlinear relationships between variables. We can then find a reduced representation of the dataset (comprising a smaller number of variables) using these dependencies.
4.6. Conclusion
In this chapter the basic principles of machine learning and several algorithms were presented: multilayer perceptrons, general regression neural networks, probabilistic neural networks, self-organizing maps and support vector machines. All the presented ML methods rely on powerful data-driven algorithms. Their correct application requires profound expert knowledge of their properties and behavior under different conditions. There are standard tools – splitting of data, the VC dimension, bootstrapping, jackknife, etc. – which can be used to control their complexity and the quality of data analysis and modeling. In the following chapter, case studies using real data will be presented in detail and the results will be compared with geostatistical models.

4.7. References

[ABE 05] ABE S., Support Vector Machines for Pattern Classification, Springer, 2005.
[AKA 74] AKAIKE H., “A new look at the statistical model identification”, IEEE Transactions on Automatic Control, 19 (6), 1974, p. 716-723.
[BEL 03] BELKIN M. and NIYOGI P., “Laplacian eigenmaps for dimensionality reduction and data representation”, Neural Computation, vol. 15, 2003, p. 1373-1396.
[BIS 06] BISHOP C., Pattern Recognition and Machine Learning, Springer, 2006.
[BOS 92] BOSER B., GUYON I. and VAPNIK V., “A training algorithm for optimal margin classifiers”, 5th ACM Workshop on Computational Learning Theory, 1992, p. 144-152.
[BRE 94] BREIMAN L., Bagging predictors, Technical report No. 421, University of California, Berkeley, 1994.
[BUR 98] BURGES C., “A tutorial on support vector machines for pattern recognition”, Data Mining and Knowledge Discovery, vol. 2, 1998, p. 121-167.
[CRI 00] CRISTIANINI N. and SHAWE-TAYLOR J., Support Vector Machines, Cambridge University Press, 2000.
[GUY 03] GUYON I. and ELISSEEFF A., “An introduction to variable and feature selection”, Journal of Machine Learning Research, vol. 3, 2003.
[GUY 06] GUYON I., GUNN S., NIKRAVESH M. and ZADEH L. (eds.), Feature Extraction: Foundations and Applications, Springer, 2006.
[HAS 90] HASTIE T. and TIBSHIRANI R., Generalized Additive Models, Chapman & Hall, 1990.
[HAS 01] HASTIE T., TIBSHIRANI R. and FRIEDMAN J., The Elements of Statistical Learning, Springer, 2001.
[HAY 98] HAYKIN S., Neural Networks: a Comprehensive Foundation, Pearson Higher Education, 2nd edition, 1998, 842 p.
[HYV 01] HYVARINEN A., KARHUNEN J. and OJA E., Independent Component Analysis, Wiley Interscience, 2001.
[JEB 04] JEBARA T., Machine Learning: Discriminative and Generative, Kluwer, 2004.
[JOL 02] JOLLIFFE I.T., Principal Component Analysis (2nd edition), Springer, 2002.
[KOH 00] KOHONEN T., Self-Organising Maps, Springer, 2000.
[LAF 05] LAFARGE F., DESCOMBES X. and ZERUBIA J., “Textural kernel for SVM classification in remote sensing: application to forest fire detection and urban area extraction”, Proc. of the IEEE Int. Conf. on Image Processing (ICIP), 2005.
[MAL 73] MALLOWS C.L., “Some Comments on Cp”, Technometrics, vol. 15, 1973, p. 661-675.
[MAY 98] MAYORAZ E. and ALPAYDIN E., Support vector machine for multiclass classification, Technical report IDIAP-RR 98-06, 1998.
[MCC 89] McCULLAGH P. and NELDER J., Generalized Linear Models, Chapman & Hall, 1989.
[MIN 69] MINSKY M. and PAPERT S., Perceptrons, MIT Press, 1969.
[NGA 01] NG A., JORDAN M.I. and WEISS Y., “On Spectral Clustering: Analysis and an algorithm”, Advances in Neural Information Processing Systems 14, 2001.
[NOC 99] NOCEDAL J. and WRIGHT S.J., Numerical Optimization, Springer-Verlag, 1999.
[OJA 82] OJA E., “A simplified neuron model as a principal component analyzer”, Journal of Mathematical Biology, vol. 15, 1982, p. 267-273.
[PLA 99] PLATT J.C., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods”, in A. SMOLA et al. (eds.), Advances in Large Margin Classifiers, MIT Press, 1999.
[QUI 05] QUIÑONERO-CANDELA J., RASMUSSEN C.E., SINZ F., BOUSQUET O. and SCHÖLKOPF B., “Evaluating predictive uncertainty challenge”, Machine Learning Challenges Workshop (MLCW), 2005.
[RAS 06] RASMUSSEN C.E. and WILLIAMS C.K.I., Gaussian Processes for Machine Learning, MIT Press, 2006.
[RIS 78] RISSANEN J., “Modeling by the shortest data description”, Automatica, vol. 14, p. 465-471, 1978.
[ROS 58] ROSENBLATT F., “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain”, Cornell Aeronautical Laboratory, Psychological Review, vol. 65, no. 6, p. 386-408, 1958.
[ROW 00] ROWEIS S. and SAUL L., “Nonlinear dimensionality reduction by locally linear embedding”, Science, vol. 290, p. 2323-2326, 2000.
[SCH 98] SCHÖLKOPF B., SMOLA A. and MÜLLER K., “Nonlinear component analysis as a kernel eigenvalue problem”, Neural Computation, vol. 10, p. 1299-1319, 1998.
[SCH 06] SCHÖLKOPF B. et al. (eds.), Semi-Supervised Learning, Springer, 2006.
[SHA 04] SHAWE-TAYLOR J. and CRISTIANINI N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[SMO 98] SMOLA A.J. and SCHÖLKOPF B., A tutorial on support vector regression, NeuroCOLT Technical report TR-98-030, 1998.
[TEN 00] TENENBAUM J., DE SILVA V. and LANGFORD J., “A global geometric framework for nonlinear dimensionality reduction”, Science, vol. 290, p. 2319-2323, 2000.
[VAP 95] VAPNIK V., The Nature of Statistical Learning Theory, Springer, 1995.
[VAP 98] VAPNIK V., Statistical Learning Theory, Wiley, 1998.
[VAP 06] VAPNIK V., Estimation of Dependences Based on Empirical Data (2nd edition), Springer, 2006.
[VON 06] VON LUXBURG U., A tutorial on spectral clustering, Technical report No. TR-149, Max Planck Institute for Biological Cybernetics, 2006.
[WEI 04] WEINBERGER K., SHA F. and SAUL L., “Learning a kernel matrix for nonlinear dimensionality reduction”, Proc. of the 21st Int. Conf. on Machine Learning (ICML), 2004.
[WES 98] WESTON J. and WATKINS C., Multi-class support vector machines, Technical report CSD-TR-98-04, 1998.
Chapter 5
Advanced Mapping of Environmental Spatial Data: Case Studies
5.1. Introduction

In this chapter several case studies using real spatial data are presented. Geostatistical models and machine learning algorithms (MLA) are applied for the analysis and mapping of spatial data. In some case studies the results obtained by the two approaches are compared and discussed. In general, geostatistics and MLA can be considered as complementary approaches for spatial data treatment and modeling. Machine learning has significant advantages when the dependencies in the data are hidden in high-dimensional spaces, which can be composed of the geographical space of coordinates and a geo-feature space of, for example, relief features. The case studies analyzed in detail cover a variety of data: climatic and meteorological data, chemical contamination and indoor radon data, spatiotemporal data on snow avalanches, and socio-economic data. An important section of the chapter deals with an application of general regression and probabilistic neural networks (GRNN and PNN) for automatic mapping and classification of spatial data. GRNN obtained the highest scores during the Spatial Interpolation Comparison (SIC2004) exercise [DUB 05]. GRNN and PNN can be important parts of environmental decision support systems and are very useful for operational pollution/contamination mapping. Other applications that concern decision support are devoted to the prediction of natural hazards such as indoor radon and snow avalanches. Chapter written by L. FORESTI, A. POZDNOUKHOV, M. KANEVSKI, V. TIMONIN, E. SAVELIEVA, C. KAISER, R. TAPIA and R. PURVES.
Examples of traditional and advanced data mapping studies using geostatistics and MLA considered below illustrate the theory and algorithms presented in the previous chapters. The case studies cover a wide range of applications and allow a researcher or practitioner to become familiar with the practical issues of data-driven environmental modeling. Case studies on the Bayesian Maximum Entropy approach for spatiotemporal data analysis and modeling are considered in Chapter 6.

5.2. Air temperature modeling with machine learning algorithms and geostatistics

Analysis and modeling of meteorological and climate data are of great importance for many reasons. Issues in climate change, natural hazards and risks, assessment of renewable resources, biodiversity, agriculture and food production and many other problems are related to the analysis, modeling and spatiotemporal prediction of climatic data from different information sources. In fact, it is an interdisciplinary and challenging research topic involving high variability at many spatial and temporal scales, nonlinearity and extreme events, and relationships with different geospatial features, e.g. relief and elevation, urbanization, agricultural land cover. The problem of integrating data, expert knowledge and science-based models is especially important for the development and calibration of models and for spatiotemporal predictions. Nowadays topics on climate and meteorological data attract great attention both from fundamental and applied points of view (see, for example, [BRY 01; BRY 02; DOB 07; LLO 06; PAR 03; RIG 01] and references therein); the list of references is far from complete. In this chapter Swiss climatic data are used for real case studies. Taking into account the complex geomorphology and topography of the country, Swiss data are very interesting for this research. In this section several meteorological situations are modeled with machine learning algorithms and geostatistics using data from automatic monitoring networks. The first part gives a step-by-step explanation of the methodology applied for spatial predictions of monthly mean air temperatures taking into account elevation data. The second part of the section focuses on the interpolation of temperature at short temporal scales (daily and instant temperatures). In these cases the elevation-temperature relationships are complex, vary locally and can be nonlinear. Therefore, the Multi-Layer Perceptron (MLP) is an appropriate method for modeling these particular situations. The typical neural structure used to interpolate temperature data is a 3-input architecture (X, Y, Z). It was found that these three variables are sometimes not enough to map particular situations such as
temperature inversion. Extra geo-features, derived from a digital elevation model (DEM), have to be included in the model.
Figure 5.1. Localization of the meteorological stations over the 250 m resolution Swiss DEM
5.2.1. Mean monthly temperature

The goal of this section is to present the procedure used to model climatic variables using machine learning algorithms. First, an introductory case study is considered, devoted to a very simple situation: modeling the mean temperature in Switzerland in August 2005. These data are characterized by a strong linear correlation between elevation and temperature.

5.2.1.1. Data description

The dataset used in this study was provided by the MeteoSwiss agency (www.meteosuisse.ch). Summary statistics of the data are presented in Table 5.1. The original data set was randomly split into training and validation subsets. The monitoring network presented in Figure 5.1 is quite homogeneous and a declustering procedure was not required [DEU 97; KAN 04].
Summary statistics     Mean air temperature, August 2005 (°C)
Number of stations     111 (88 training; 23 validation)
Mean                   13.23
Variance               22.54
Min value              -1.0
Max value              20.9

Table 5.1. Summary statistics of mean air temperature in August 2005
Figure 5.2. Temperature against elevation in August 2005
Figure 5.2 shows the strong linear correlation between temperature and elevation, which is quite typical of summer months. We are going to integrate this information into the neural network. Let us note that this is also a particularly good example for the geostatistical model of kriging with external drift [DEU 97].

5.2.1.2. Variography

Variography is used as an important exploratory tool in this study. The variogram rose and the omnidirectional variogram are presented in Figure 5.3. The hypothesis of spatial stationarity has to be handled carefully. In our case, the non-stationarity of the data is caused by the Alps, which create trends along certain directions. Moreover, the Alps cause a very high nugget effect, since the altitude varies quickly at short scales compared to the resolution of the monitoring network. Thus, the use of simple 2D geostatistical predictors is complicated. This situation can, however, be easily handled with the help of kriging with external drift.
Figure 5.3. Variogram rose and omnidirectional variogram for mean air temperature in August 2005. The very high nugget effect is caused by the temperature-elevation relationship
5.2.1.3. Step-by-step modeling using a neural network

After having calculated some traditional statistics and variography (exploratory variography and variogram modeling), MLP was applied to produce temperature maps. The theory of the MLP algorithm was explained in Chapter 4. Geographical (X, Y) and elevation (Z) coordinates were chosen as inputs for the spatial predictions. The dominating influence of elevation (Z) is evident in mountainous regions such as the Swiss Alps. The following step-by-step methodology was applied for temperature mapping:
1. Splitting of data into training, testing and validation datasets. The training dataset is used to find the optimum weights of the MLP and the testing one is used to select the optimum model – structure and number of hidden neurons. Normally, both datasets have to be large enough to reconstruct the spatial patterns of the data. In a more general setting a third data set – validation data – has to be extracted for estimating the generalization performance. Taking into account the general purpose of this demonstrative study and the limited amount of data, only two data subsets were used: training and validation. In this chapter, the latter is used for testing purposes.
2. Choosing the neural network structure: 3 input neurons (X, Y, Z), n hidden neurons and 1 output neuron with the target variable (T). Starting with a simple model (2 or 3 hidden neurons) and gradually increasing the MLP's complexity (by adding more hidden neurons) is a good way to avoid wasting computational time. A testing dataset is normally used after the training process to find the optimum number of hidden neurons; however, due to the lack of data, here we do it empirically and only use the validation set to illustrate the performance of the MLP model;
3. Training the neural network to find the optimum weights. First, initialize the weights randomly between some specified bounds (preferentially in the linear part of the activation function); then, find the global minimum of the cost function using conjugate gradient descent alternated with an annealing algorithm to avoid local minima;
4. When the training error has stopped decreasing significantly, end the training process and save the configuration of the weights;
5. Spatial prediction of the temperature over the prediction grid (in our case the digital elevation model). Evaluate the training and validation RMSE.
The choice of splitting the raw data into only two datasets (training and validation) is a consequence of the lack of data. Let us recall that in general, three data subsets are used to learn from data with data-driven modeling:
– a training dataset is used to fit the model;
– a testing dataset is used to tune hyperparameters and for the selection of the best model;
– a validation (independent) dataset is used to estimate the generalization capabilities of the model.
The optimum choice for splitting a small dataset is to bootstrap different testing datasets of the same size from the training data and then tune the parameters with these testing datasets. After the model has been selected, the independent validation error can be calculated. In this section, geostatistical models were directly compared with the best MLP model selected via the simplified procedure. The estimation of the generalization error is not an objective of this research (the main objective is to compare the best solutions from different approaches) and it was therefore omitted.

5.2.1.4. Overfitting and undertraining

The problem of overfitting is very important in our case because of the small number of stations (88). Overfitting occurs when a training dataset is small compared to the model capacity (which is sometimes, but not necessarily, related to the number of weights and parameters). A model that is too complex tends to fit every training point with a function which depends on many parameters (in our case, complexity can be measured as the number of hidden neurons): such a model loses a lot of its generalization capabilities when predicting new independent points. For a detailed description of overfitting see Chapter 4.
To check whether the model overfits the training data, three procedures have been used:

1. Several trials with different numbers of hidden neurons, followed by visualization of the results as a plot of the root mean square error (RMSE) of the training and validation datasets as a function of the number of hidden neurons. The training error usually decreases with an increasing number of neurons. On the other hand, the testing error follows a U-shaped curve whose minimum is reached at the optimum complexity of the MLP (number of hidden neurons). Given properly split data, the validation error behaves similarly to the testing error (Figure 5.4);
Figure 5.4. Training and validation error curves
2. Exploratory variography of the training residuals [KAN 97a]: if MLP has modeled all structured information in the data (and not more) the variogram shows a pure nugget effect (variance in the data which cannot be explained); however, if the number of relevant predictors is increased (for example by adding the elevation for temperature prediction), the nugget of the residuals can be reduced without overfitting data;
156
Advanced Mapping of Environmental Data
Figure 5.5. Variogram for training data and respective residuals of the model predictions of the training data
3. Injection of outliers into the training data. If the MLP neglects these outliers, it is a sign that the predictor does not tend to overfit.

The most popular and practical way to avoid overfitting is the correct use of the testing data subset [HAY 98; BIS 07].

5.2.1.5. Mean monthly air temperature prediction mapping

The results of prediction mapping of temperature are presented below in detail. The high linear correlation between air temperature and elevation facilitates the model training in this case study. The optimum complexity of the MLP is obtained with 3 neurons in the hidden layer (3-3-1). A linear MLP output was used to force the linear decrease of temperature with altitude in extrapolation areas (>3,500 m).
Figure 5.6. a) MLP prediction mapping of temperature; b) measured against estimated values of the validation dataset
The results are quite promising, as can be seen from the validation chart (Figure 5.6b) and Table 5.2. Different spatial prediction models, including inverse distance weighting (IDW), ordinary kriging (OK), regression kriging (RegrOK) and kriging with external drift (KED), were applied and compared to machine learning methods, i.e. MLP and Support Vector Regression (SVR). Examples of similar approaches for monthly temperature interpolation in mountainous regions of the Aral Sea area and China can be found in [BRY 01; BRY 02; PAR 03].
Temperature, August 2005

Method        KED     MLP     SVR     RegrOK    OK      IDW
Train RMSE    0.51    0.41    0.45    0.57      4.27    4.12
Valid RMSE    0.39    0.46    0.43    0.48      4.55    4.57

Table 5.2. Model comparison in terms of training and validation RMSE. The training error of kriging with external drift (KED), regression kriging (RegrOK), ordinary kriging (OK), Support Vector Regression (SVR) and inverse distance weighting (IDW) is obtained by cross-validation over the training data
Figure 5.7. a) Interpolation with inverse distance weighting; b) interpolation with ordinary kriging
Figures 5.7a and 5.7b show the results of IDW and OK modeled without taking into account the information on elevation. The integration of altitude into our models considerably improves the results, as seen from the validation RMSE. For this particular situation of temperature modeling in summer months, both linear and nonlinear three-dimensional models work well. However, for other more complicated situations MLP can be recommended as a nonlinear and adaptive interpolator, as demonstrated in the following sections. As opposed to summer temperatures, local effects in winter months such as Föhn or temperature inversion complicate the modeling and provide an appealing challenge for the abilities of data-driven models. This is the subject of sections 5.2.2 and 5.2.3.

5.2.2. Instant temperatures with regionalized linear dependencies

A more complex case is encountered when the relationship between temperature and altitude varies locally and/or if there are linear or nonlinear links among several variables other than the elevation. In this section we present a case study devoted to the modeling of air temperatures during the meteorological situation known as Föhn. In such situations the modeling of instant temperatures, or of the maximum or minimum temperatures over a short time period, is of particular interest.

5.2.2.1. The Föhn phenomenon

A Föhn situation occurs when a humid air mass converges towards a relief. The topography forces the air mass to ascend and condense, creating clouds. The condensation releases the latent heat gained during the vaporization process over the sea. This release heats the atmosphere, causing a linear temperature gradient close to 0.5°C/100 m on the up-wind slope. On the down-wind versant the dry air descends and heats up with an adiabatic linear temperature gradient of about 0.98°C/100 m. The consequence of these physical properties is a different temperature between the two versants at the same elevation. The difference is accentuated if there is orographic precipitation on the windward side of the mountains.
Figure 5.8. A simple diagram of the Föhn process
In order to show an example of a Föhn situation, we have chosen to model the maximum temperatures in the afternoon of January 19th 2007, between 2 and 3 pm. The meteorological characteristics of this day are very particular, because there was a heavy storm over Europe (Kyrill) with north-westward winds over Switzerland reaching 120 km/h. The low humidity of the air caused a passive Föhn situation without precipitation on the up-wind versant. However, the condensation of water vapor occurred at the crest of the Alps, and in this area low temperature gradients can be observed. On the contrary, the southern Alpine part of Switzerland showed very high gradients, which caused the temperature to increase up to 24°C, a record for the month of January.

5.2.2.2. Modeling of instant air temperature influenced by Föhn

Figure 5.9 presents the relationship between elevation and temperature in the considered time period. It is easy to note that at high altitudes the gradient is low because of the condensation process. At low elevations the gradient is high, but two patterns can be distinguished: a first pattern in the north of the Alps and a second pattern in the south, where the gradient is the highest.
Figure 5.9. Elevation against temperature in a Föhn situation; black boxes are southern alpine stations (with Föhn) and crosses are northern alpine stations (without Föhn)
The temperatures were modeled with MLP using the methodology described in the previous section. The structure of the network was chosen to be 3-3-1. Here we present the analysis of the obtained results to demonstrate the ability of MLP to extract the regionalized linear and nonlinear dependencies from data.
The map in Figure 5.10a provides the prediction mapping results. It confirms the presence of Föhn in the southern alpine area (high temperatures in dark gray tones), but there are also some local effects of Föhn in the northern alpine valleys because of the heavy winds. If we want to know where the condensation takes place or where there is adiabatic heating of the air mass, we need the map of the gradients (Figure 5.10b). This map has been calculated using 278 square moving windows over the MLP outputs. The high gradients in the southern part of Switzerland (Ticino) are caused by the Föhn. In addition there are local effects of Föhn in Valais (Rhône valley), Grisons (Rhine valley) and behind the Jura chain (white arrows in Figure 5.10b). The high gradients over the plateau in the north-east are not very reliable because there is no information about the temperature at high elevations there. In contrast, the Alps in general present low gradients caused by the condensation.
Figure 5.10. a) MLP prediction map; b) moving windows mean of the temperature-elevation gradients over Switzerland. White arrows indicate the areas where the local effects of Föhn caused the increase of temperature gradients
Looking at the validation scatterplot in Figure 5.11a, a particular validation data sample seems to be out of the general pattern. This data sample is probably incorrect: a comparison with the neighbors has demonstrated that there is no reason to observe this temperature at this place with such a turbulent atmosphere. Moreover, in this valley there is a local Föhn effect and the temperature should be higher. This sample was treated as an outlier by the MLP.
Figure 5.11. a) Measured against estimated values of the validation dataset; b) training and validation RMSE (with and without the outlier) as a function of the number of hidden neurons
The training and validation curves in Figure 5.11b confirm that the optimum complexity of the model is achieved with 3 hidden neurons.

5.2.3. Instant temperatures with nonlinear dependencies

Three-dimensional models having only (X, Y, Z) inputs may not be sufficient for modeling in other complex situations. Some phenomena can only be explained through the influence of local terrain features. Data-driven machine learning methods can deal with such situations by introducing additional inputs calculated from the DEM, such as terrain curvature, slope, aspect, etc. One of these cases is a temperature inversion situation.

5.2.3.1. Temperature inversion phenomenon

A temperature inversion occurs in winter months in the morning when there are high pressure conditions: the air mass is stable and tends to move downward. Roughly speaking, the surface releases thermal infrared radiation during the night and cools down, causing the cooling of the air by contact. Thus, the density of the air increases and causes an air movement downward from the mountains to the adjacent plains, forming a type of "cold air lake" at the bottom of the valley. This phenomenon occurs at different heights because of local temperature inversions, i.e. the elevation of the inversion layer is not constant over the whole area, but depends on the elevation of the bottom of the valley and some other topographical characteristics such as slope and curvatures.
Figure 5.12. Distribution of the atmosphere layers during a temperature inversion
Figure 5.12 proposes a simple interpretation of the inversion process: in the free atmosphere, the temperature is low because of the elevation; the temperature is also low near the soil in the thalweg of the valley; between these two layers lies the inversion layer, where the temperature is higher than in the other two and along which the cold air mass slips down.
This is a complex case for modeling, since the information about the elevation is not sufficient to explain the temperature variation. The need to introduce other topographical characteristics to predict temperature is clearly demonstrated by Figure 5.13, showing the nonlinear relationship between altitude and temperature. The instant temperatures at 6 am on February 5th 2007 have been used for the analysis.
Figure 5.13. Relationship between elevation and temperature; global linear correlation is 0.20; cross marks are stations below the inversion layer; boxes are stations in the inversion layer and in the free atmosphere
5.2.3.2. Terrain feature extraction using Support Vector Machines

From Digital Elevation Models (DEM) several terrain characteristics (slope, aspect, curvatures, etc.) can be calculated, but only a part or a particular combination of them is really important for predicting the temperature. We will refer to these derivatives of the DEM as geo-features. Concerning the temperature inversion, the relevant geo-features have to characterize the distance to the thalweg and the convexity of the bottom of a valley. We will, however, present here a more general data-driven scheme for selecting the relevant subset. The basic idea applied in this study is to classify the data in the geo-feature space according to an inversion against no-inversion criterion and eventually to define an inversion indicator that summarizes this information. The following procedure explains how to select relevant information from a large set of features:
1. define and label the meteorological stations which are under the inversion layer and those which are above. This can be done by an expert (meteorologist) who knows the area under study. The result of such an analysis is shown in Figure 5.13;

2. perform SVM classification using all the calculated features;

3. (optionally) carry out recursive feature elimination with a method specific to SVM [GUY 01]: eliminate the least weighted feature and perform the SVM classification again. If the result is still good, the eliminated feature is not put back into the dataset, and the next least relevant feature is eliminated, and so on. An important hyperparameter of this method is the number of features we want to keep;

4. summarize the information of these features in one variable which will be integrated as the fourth input (summary geo-feature) into the MLP. The probabilistic interpretation of SVM [PLA 99] is a good way to do this: the probability of being below the inversion layer is derived from the distance of the prediction point to the separating hyperplane in the feature space (see Chapter 4 and section 5.7.2.1 below for SVM details). This distance is used as an indicator (temperature inversion indicator, TII) which describes whether the local geomorphological conditions are favorable for the formation of the cold layer under the inversion (Figure 5.14).
Figure 5.14. Temperature inversion indicator map: probability of being below the inversion layer
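A minimal sketch of this feature selection and of the probabilistic SVM output is given below, on synthetic placeholder geo-features and with scikit-learn. The generic RFE class is used here only as a stand-in for the SVM-specific recursive feature elimination of [GUY 01], so the code illustrates the idea rather than the exact procedure used in the study.

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(1)
# placeholder geo-features derived from the DEM at the station locations (slope, curvatures, ...)
F = rng.normal(size=(100, 10))                       # 100 stations, 10 candidate geo-features
# step 1: expert labelling, 1 = station below the inversion layer, 0 = above
below = (F[:, 0] + 0.5 * F[:, 3] + rng.normal(0, 0.5, 100) > 0).astype(int)

# steps 2-3: SVM classification and recursive feature elimination (linear kernel, generic RFE)
selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=4).fit(F, below)
F_sel = F[:, selector.support_]

# step 4: probabilistic SVM (Platt scaling) gives the temperature inversion indicator (TII)
svm = SVC(kernel="rbf", probability=True).fit(F_sel, below)
F_grid = rng.normal(size=(5, 10))[:, selector.support_]   # placeholder grid geo-features
tii = svm.predict_proba(F_grid)[:, 1]                # probability of being below the inversion layer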
5.2.3.3. Temperature inversion modeling with MLP

The neural structure used to model the temperature inversion is 4-4-1 (four inputs – four hidden – and one output neuron). The four inputs or predictors of the MLP
are the X, Y, Z coordinates and the TII. With this type of neural net the temperature is supposed to vary on the XY plane, to be lower where the temperature inversion indicator is high, and to vary at different elevations even when the value of the temperature inversion indicator is the same. Due to the complexity of the situation, a very small number of representative validation stations was chosen. Aiming at a random selection of validation stations representative of the different regions of Switzerland, a cell declustering procedure was applied to split the data [DEU 97; KAN 04]. The quality and quantity of data is very important in this type of study. Emphasis has been placed on keeping the training dataset large enough for the MLP to correctly model the patterns. The MLP training procedure is also complicated by the complexity of the high-dimensional cost function, which is characterized by multiple local minima. Thus, even several attempts with the same number of neurons may lead to very dissimilar results. Therefore, visual inspection with the help of some expert knowledge can be recommended as a complementary validation procedure. From the prediction maps it can be seen whether the patterns of the temperature inversion are reproduced, as in Figure 5.15; in that case the validation errors will be small. On the contrary, if the patterns are erroneous (from a meteorological point of view), it is very probable that the validation RMSE will be high.
Figure 5.15. MLP results with a 4-4-1 neural structure
For further analysis of the results, a neural network was also trained with 5 and more neurons. This led to artefacts and a high validation RMSE. With less than 3 hidden
neurons, undertraining was observed: it was not possible to reproduce the low temperatures below the inversion layer. Concerning the validation plot (Figure 5.16), we note that the extreme values are less well reproduced than the other ones. This problem is explained by the fact that the thickness of the cold layer under the temperature inversion is not taken into account. A prediction point situated on a small hill centered at the bottom of a valley should have a low temperature even if its local topographical characteristics correspond to a situation with a high temperature. The nonlinear behavior of this phenomenon leaves enough room for further research.
Figure 5.16. Measured against estimated values for validation dataset
Figure 5.17. Temperature inversion over Valais; the inversion layer is easily seen at mid elevations (white); low temperatures at high elevation and the bottom of valleys are in black
In conclusion, the modeling of temperatures at hourly, daily and monthly temporal scales needs the correct choice of relevant predictors that can be derived from the DEM. The methodology proposed is of a generic nature and can be applied for the modeling of climatic and meteorological information in high-dimensional geo-feature spaces. Feature selection procedures can be applied to extract the most relevant contributors. MLP can be replaced by SVR or other nonlinear modeling algorithms.

5.3. Modeling of precipitation with machine learning and geostatistics

Now let us consider the data-driven modeling of precipitation data. Many different studies have been published on monthly and yearly precipitation data using different techniques – from simple deterministic models like inverse distance weighting to artificial neural networks of different architectures [BRY 01; BUY 06; DEM 03; DUB 03; GOO 00; HIG 03; LLO 06; PAR 03; RIG 01]. Taking into account the complexity and high spatial variability of precipitation data, different recommendations for spatial interpolation and mapping have been proposed. In most studies carried out in mountainous regions it was observed that the incorporation of a digital elevation model and its derivatives can improve spatial predictions even if the global correlation between precipitation and elevation is low. The spatial interpolation comparison of 1997 was based on daily precipitation data in Switzerland [DUB 03]. In the final report we can find comparisons between 14 different approaches to spatial interpolation with clear validation and testing procedures [DUB 03].

In this section several case studies on precipitation mapping using the multilayer perceptron are considered. Hybrid models based on MLP (modeling of nonlinear global trends) and geostatistics (kriging or simulations of the residuals) are also presented. In the first part of the section we introduce the methodology used to interpolate mean monthly precipitation data. The second part of the section presents more complex situations with an extreme precipitation event that occurred on a shorter temporal scale. This situation needs a simulation approach and some advanced techniques to reproduce the variability and the extreme values. The precipitation monitoring networks in Switzerland are shown in Figure 5.18.
Figure 5.18. Distribution of rain gauges over DEM
5.3.1. Mean monthly precipitation

The interpolation of precipitation is a challenging case study because of the complex nonlinear and locally variable relationships between precipitation and elevation usually observed. However, the complex relationships are partly compensated by a larger number of rain gauges, as we can see in Table 5.3. On the other hand, a low global correlation can be compensated by local correlation patterns between elevation and precipitation. Models which can take into account such local variability can improve the quality of mapping.

5.3.1.1. Data description

The dataset used in this case study was provided by the MeteoSwiss agency (www.meteoswiss.ch). Monthly precipitation in August 2005 is considered. This month is characterized by heavy precipitation events in the northern part of the Alpine chain. The interpolation of rain fields over large temporal scales (monthly data) is relatively easy as the spatial continuity is quite high and even benchmark geostatistical models work well. Summary statistics of the data are presented in Table 5.3. The training and validation subsets were randomly chosen from the raw data. The monitoring network is quite homogenous and a declustering procedure is not required.
Summary statistics: monthly amount of precipitation, August 2005, mm

Number of stations    440 (294 training; 146 validation)
Mean                  189.07
Variance              9188.06
Min value             55
Max value             499

Table 5.3. Summary statistics of precipitation in August 2005
Figure 5.19 shows the relationship between precipitation and elevation.
Figure 5.19. Elevation against precipitation in August 2005; the coefficient of linear correlation is 0.2
The global correlation coefficient between elevation and precipitation is 0.2. In the following section this information is used to see if the predictions can be improved using elevation data as complementary information. This may be the case if the correlations exist locally in space.
5.3.1.2. Precipitation modeling with MLP

Figure 5.20 shows the result of a 2 input MLP after having tuned the hyperparameters. The best result was obtained with 20 hidden neurons, with a validation RMSE of 36.95 mm.
Figure 5.20. a) MLP prediction mapping of precipitation using 2 inputs (spatial coordinates); b) measured against estimated values for the validation dataset
Then, the information concerning the elevation is introduced as an input into the neural network. Even though there is no strong global correlation between altitude and precipitation, from the analysis of moving window statistics it is possible to find local correlations that MLP can capture.
The results using a 3 input neural network are given in Figure 5.21. The best results are shown after having tuned the hyperparameters. An MLP 3-5-1 using input data scaled between 0.3 and 0.7 led to the best RMSE of 26.87 mm. We can note the significant difference between the prediction models (Figures 5.20a and 5.21a): the influence of elevation is clearly visible in the second prediction. The choice of the model can be assisted by an expert, though in this case study the information on elevation has led to an improved prediction in terms of validation RMSE.
Figure 5.21. a) MLP prediction mapping of precipitation using 3 inputs (spatial coordinates and elevation); b) the plot of measured against estimated values on the validation data
5.3.2. Modeling daily precipitation with MLP
As a more complex case study, a short-term extreme precipitation situation has been chosen. The aim of this section is to observe the quality of predictions when the intense precipitation is distributed over a small region only. Such a situation occurred between the 2nd and 3rd October 2006. The particularity of this situation is caused by an intense cold front that affected Switzerland over 2 days. The distribution of precipitation is quite discontinuous in space and depends on topographical characteristics. The southern Alpine area was the most affected because the precipitation of the cold front was reinforced by an orographic effect. The above-cited methodologies were applied to study this situation. However, the need to reproduce extreme values compels us to use hybrid techniques such as neural network residual kriging (NNRK) and neural network residual simulations (NNRS), which will be considered below in section 5.3.3. We start here with the description of the data and the simple MLP modeling, considering in more detail some practical aspects of training the MLP and validating the results.

5.3.2.1. Data description

Table 5.4 shows the summary statistics of the accumulated precipitation during the two days considered.
Summary statistics: precipitation of 2nd and 3rd October 2006, mm

Number of stations    413 (360 training; 53 validation)
Mean                  22.30
Variance              577.76
Min value             0
Max value             218.4

Table 5.4. Summary statistics of precipitations on 2nd and 3rd October 2006
The global correlation between elevation and precipitation is very low, but from moving-window statistics patterns of local correlation were found (dark tones in Figure 5.22b).
Figure 5.22. a) Relationship between elevation and precipitation of 2nd and 3rd October 2006; b) local correlation coefficient calculated in the moving windows (70 windows)
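Such a local correlation map can be obtained by computing the correlation between elevation and precipitation inside windows sliding over the study area. The following minimal sketch, on synthetic placeholder station data, illustrates the idea with a simple grid of 70 non-overlapping windows; it is an illustration of the principle rather than the exact moving-window analysis used here.

import numpy as np

rng = np.random.default_rng(2)
# placeholder station data: x, y in km, z elevation in m, p precipitation in mm
x, y = rng.uniform(0, 350, 413), rng.uniform(0, 220, 413)
z = rng.uniform(200, 3000, 413)
p = 20 + 0.01 * z * (x > 200) + rng.gamma(2.0, 5.0, 413)   # correlation only in part of the area

def local_correlation(x, y, z, p, nx=10, ny=7, min_points=5):
    # Pearson correlation between z and p inside each cell of an nx-by-ny window grid
    xe, ye = np.linspace(x.min(), x.max(), nx + 1), np.linspace(y.min(), y.max(), ny + 1)
    corr = np.full((ny, nx), np.nan)
    for i in range(ny):
        for j in range(nx):
            inside = (x >= xe[j]) & (x <= xe[j + 1]) & (y >= ye[i]) & (y <= ye[i + 1])
            if inside.sum() >= min_points:
                corr[i, j] = np.corrcoef(z[inside], p[inside])[0, 1]
    return corr

corr_map = local_correlation(x, y, z, p)   # high values correspond to the dark tones of Figure 5.22b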
5.3.2.2. Practical issues of MLP modeling

This section focuses on the effects of different optimization algorithms and on the ability of the MLP to reproduce extreme values. Although the validation data were chosen randomly, additional validation points were added manually, located in areas of interest with extreme precipitation. First, the annealing algorithm was used for training in order to avoid local minima. Then, different algorithms were used to find dissimilar solutions even with the same network structure (see Chapter 4). The next map (Figure 5.23a) shows the best
result (lowest validation RMSE) of a 2 input neural net using the conjugate gradient algorithm. The root mean square error is 9.43 mm for the training data and 9.69 mm for the validation data. The map in Figure 5.23b shows the best prediction result obtained with an MLP trained with the Levenberg-Marquardt algorithm: in this case the errors are 6.57 mm for training and 7.69 mm for validation, respectively.
Figure 5.23. a) MLP mapping using the conjugate gradient (15 hidden neurons); b) MLP mapping using the Levenberg-Marquardt (11 hidden neurons) algorithm
Figure 5.24. a) Validation scatterplot of Figure 5.23a; b) validation scatterplot of Figure 5.23b
Owing to its lower training error, the Levenberg-Marquardt algorithm is able to better reproduce the extreme values. The only problem is related to the high nonlinearity of the Levenberg-Marquardt solution: it may be useful to train the MLP several times and keep the solution with weights centered near the origin. To conclude with the 2 input neural network, we can assert that the Levenberg-Marquardt solution led to smaller validation errors, but was more difficult to control. In terms of pattern detection it is recommended to use the gradient descent solution because of its robustness (the same minimum of the error function is found even over several attempts). The stationary residuals obtained from the training data using this trend can be interpolated using kriging (NNRK) or simulated by sequential Gaussian simulations (NNRS). This is the subject of the following section. The next figures show the differences in the weight distribution in the neural network depending on the number of hidden neurons. A large number of hidden neurons causes part of them to become inactive; in this case there will be many synapses with weights close to 0.
Figure 5.25. a) Distribution of weights for 3 hidden neurons; b) distribution of weights for 30 hidden neurons
5.3.2.3. The use of elevation and analysis of the results

Now, let us consider modeling taking into account elevation data as well. Why do we have to use elevation if the global linear correlation coefficient with precipitation is only 0.15? From the moving window statistics (Figure 5.22b) it has been discovered that there are some local correlation patterns, in particular in the northern Alpine area. On the contrary, there are no important correlations in the southern Alpine area, where the extreme precipitation event happened.
Figure 5.26. a) 3 input MLP mapping of precipitation; b) local coefficient of linear correlation of MLP outputs and elevation calculated in moving windows
Figure 5.26a demonstrates that the MLP can integrate these spatial variations of the correlation and produce an acceptable map. Actually, the map shows that the relationship between precipitation and elevation is weaker in the southern Alpine area than in the northern Alpine area. This map presents a realistic interpretation of the distribution of the precipitation, but the problem is that the extreme values are smoothed out, as we can see in Figure 5.27. Figure 5.26b shows the R² calculated using moving windows over a subset of the MLP result. In other words, this value is the percentage of the variance of the precipitation which is explained by the elevation. We quickly note that the locations of high precipitation correspond to the lowest values of R².
Figure 5.27. Validation scatterplot of 3 input MLP
From the validation graph (Figure 5.27) it is clear that the high values have been smoothed out. For this case study the validation RMSE is higher than that of the 2 input MLP trained with the Levenberg-Marquardt algorithm: it is 9.58 mm. This smoothing effect occurs when we use the gradient descent algorithm. In this case, why don't we use the Levenberg-Marquardt algorithm? In reality, it was noted that this algorithm is not well adapted to a neural network with more than 2 inputs. In fact, when we are working with 3 or 4 input neural networks, it is better to have a uniform distribution of the weights close to the center of the cost function. Unfortunately, the Levenberg-Marquardt algorithm is not very stable and produces solutions with very large and even biased weights, and in many cases the general pattern is
lost. For example, when most of the variability of the precipitation is forced to be explained by the elevation, the output map is not consistent from a meteorological point of view and the validation error is very high. These observations about the optimization algorithms are confirmed in temperature modeling as well. For example, the temperature inversion situation needs the use of gradient descent to avoid overfitting and artefact solutions. On the contrary, the modeling of mean air temperatures over summer months is well mapped using the Levenberg-Marquardt algorithm.

5.3.3. Hybrid models: NNRK and NNRS

Theoretical principles of geostatistical models, including simulations, can be found in Chapter 3, and a complete description of the hybrid models neural network residual kriging (NNRK) and neural network residual simulations (NNRS) can be found in [KAN 96; KAN 97a; KAN 97b]. The hybrid models have demonstrated good efficiency in spatial data modeling and in this section they are used to reproduce extreme values. In fact, if geostatistical and machine learning methods are combined, the smoothing of extreme values is reduced and the variability of the phenomenon can be reproduced.

5.3.3.1. Neural network residual kriging

The objective of NNRK is to interpolate the training residuals with the help of geostatistics after having modeled the global patterns (nonlinear trends) using the multilayer perceptron. To do this, the variogram of the residuals must not show a pure nugget effect; this occurs when the MLP does not overfit and does not model small-scale spatial structures. The variography given in Figure 5.28a shows a global trend over a range of directions that the MLP should be able to model. The variogram of the MLP training residuals given in Figure 5.28b contains short-range structures that can be interpolated using kriging. This is a way to take into account the spatial structures which have not been modeled by the MLP.
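Once a trend model and a residual kriging model are available, the NNRK construction itself is straightforward. The sketch below, on synthetic placeholder data, illustrates the two steps with scikit-learn for the MLP trend and the pykrige package (assumed to be available) for ordinary kriging of the residuals; in the real study the residual variogram would of course be examined and modeled, as in Figure 5.28b, rather than fitted blindly.

import numpy as np
from sklearn.neural_network import MLPRegressor
from pykrige.ok import OrdinaryKriging   # assumed to be available for ordinary kriging

rng = np.random.default_rng(3)
# placeholder training data: coordinates (km) and precipitation (mm)
x, y = rng.uniform(0, 100, 300), rng.uniform(0, 100, 300)
precip = 30 + 0.3 * x + 10 * np.sin(y / 15.0) + rng.normal(0, 4, 300)
coords = np.column_stack([x, y]) / 100.0            # scaled inputs

# 1) model the large-scale trend with a deliberately smooth MLP
trend_model = MLPRegressor(hidden_layer_sizes=(5,), activation="tanh",
                           solver="lbfgs", max_iter=5000, random_state=0)
trend_model.fit(coords, precip)
residuals = precip - trend_model.predict(coords)     # should keep short-range structure

# 2) ordinary kriging of the training residuals
ok = OrdinaryKriging(x, y, residuals, variogram_model="spherical")
xg, yg = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
res_k, _ = ok.execute("points", xg.ravel(), yg.ravel())

# 3) NNRK estimate = MLP trend + kriged residuals
grid_coords = np.column_stack([xg.ravel(), yg.ravel()]) / 100.0
nnrk = trend_model.predict(grid_coords) + res_k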
Figure 5.28. a) Variogram in the direction of maximum continuity (dashed line) and minimum continuity (continuous line) and a priori variance of the training data (dotted line); b) variogram of training residuals
Figure 5.29 shows a comparison between the variogram of training data and variograms of the training residuals for different MLP complexities, characterized by the number of hidden neurons. The variogram of residuals represented by the dashed line indicates a spatial structure. This was obtained with a simple MLP model leading to undertraining. The variogram represented by the dotted line is a pure nugget variation. It was obtained using an MLP model of the optimum complexity which extracts all the spatially structured information from training data.
A variogram characteristic of an overfitting situation would lie below the one fluctuating around the nugget level, i.e. close to 0.
Figure 5.29. Omnidirectional variogram of training data (continuous line), training residuals for an undertraining situation (dashed line) and training residuals for the optimum complexity (dotted line)
The southern part of Switzerland (Ticino) was chosen for the analysis. This is the main region affected by the precipitation during the considered period. After obtaining the interpolated residuals via ordinary kriging, they are added to the trend modeled by the MLP in order to obtain the final NNRK map. This methodology produces an exact interpolator, since the function passes through every training point, as in the kriging model. This can be considered important in a number of applications, including meteorology. The results are presented in Figure 5.30. The validation error is 7.69 mm, but we cannot compare this value to the previous ones because we have interpolated the map only over an extracted area and the validation points belong only to this area. The variogram of the residuals of the whole training dataset was characterized by a long-range continuity caused by the data located in areas with low values of precipitation.
Figure 5.30. a) NNRK mapping of precipitation; b) validation scatterplot of NNRK
The same procedure can be carried out using a 3 input neural network, but in this case we have to pay attention to overfitting. The trend model extracted by MLP has to be smooth enough and the variogram of the training residuals should not be a simple nugget effect.

5.3.3.2. Neural network residual simulations

Neural network residual simulation (NNRS) has the same modeling phases as NNRK, with the replacement of the kriging model of the residuals by corresponding conditional stochastic simulations. Stochastic simulations reproduce the spatial variability and uncertainty of the residuals.
The MLP residuals were simulated using sequential Gaussian simulation algorithms [DEU 97]. Realizations of the simulations were added to the MLP trend model to obtain the final results. The following 4 images (Figure 5.31) show some examples of the realizations of sequential Gaussian simulations.
Figure 5.31. 4 examples from 100 realizations of sequential Gaussian simulations
The variograms of the realizations fluctuate around the raw variogram of the training data (Figure 5.32). This means that the NNRS mapping is less smooth than the NNRK mapping, i.e. not only the global statistics of the training data are reproduced but also the spatial correlations described by the variogram.
Figure 5.32. Variogram for training residuals (continuous line) and 10 realizations of simulated data (dashed line)
Because they rely on the simulation of a set of equally probable maps, simulations allow us to carry out numerous post-processing analyses. Examples are the estimation of the probability of exceeding a given threshold, or of the values associated with some quantiles of the probability density function. Figures 5.33 and 5.34 present some results of the post-processing of 100 simulations: E-type estimates (mean values of the simulations), the 95% quantile map, and the probabilities of being above the 100 mm and 150 mm levels. The E-type estimate map is very similar to the NNRK solution presented above.
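Post-processing of this kind reduces to simple statistics over the stack of realizations. A minimal sketch, with a synthetic placeholder array standing in for the 100 NNRS realizations:

import numpy as np

rng = np.random.default_rng(4)
# placeholder: 100 equally probable realizations on a grid of 2,500 cells (mm)
realizations = rng.gamma(shape=2.0, scale=30.0, size=(100, 2500))

etype = realizations.mean(axis=0)                    # E-type estimate (cf. Figure 5.33a)
q95 = np.percentile(realizations, 95, axis=0)        # 95% quantile map (cf. Figure 5.33b)
p_above_100 = (realizations > 100.0).mean(axis=0)    # probability of being above 100 mm
p_above_150 = (realizations > 150.0).mean(axis=0)    # probability of being above 150 mm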
Figure 5.33. a) NNRS mapping of the mean values of 100 SGS; b) precipitation value associated with the 95% quantile
Figure 5.34. a) Probability of being above 100 mm; b) probability of being above 150 mm
5.3.4. Conclusions

This section illustrates how to apply machine learning algorithms to topo-climatic data modeling. Due to the data-driven nature of the modeling, it was clearly seen that the quality of modeling depends on the strength and the type (linear vs. nonlinear) of the relationships between terrain characteristics and the target variable. For some specific case studies (see section 5.2.1), linear geostatistical models can interpolate data as well as machine learning methods. For other case studies (see
section 5.2.3), the use of ML methods is strongly recommended because of the nonlinear dependences between several variables. Depending on the goal of modeling (reproducing extreme values, modeling of spatial patterns), particular attention has to be paid to the choice of the optimization algorithms. Hybrid geostatistical and machine learning methods have been applied successfully to improve the prediction mapping of topo-climatic data.

5.4. Automatic mapping and classification of spatial data using machine learning

Mapping of environmental data in an automatic mode is very important and attractive for use in environmental decision support systems. Information such as contamination, levels of radiation, meteorological parameters and weather status for hazard prediction is usually continuously monitored with fixed or mobile stations. A huge amount of information needs to be explored in order to approach decision-oriented modeling and prediction in real time. An important issue of automatic modeling deals with using stable automatic algorithms, i.e. those which provide answers without human intervention and with unique solutions, independent of any initial parameters or tuning procedure. In principle, we would like to have nonlinear, non-parametric and, if possible, robust and computationally efficient algorithms in order to learn from environmental data and to make spatial predictions of good quality with low generalization errors.

In 2004 a Spatial Interpolation Comparison (SIC2004) study was organized by JRC Ispra [DUB 05]. The main objective of SIC2004 was to apply automatic algorithms used by different groups around the world in order to make spatial predictions in routine and emergency situations using automatic mapping algorithms. The general regression neural network presented in this section was among the best predictors. In this section, the promising candidates for automatic mapping are presented and illustrated using real-world case studies.

5.4.1. k-nearest neighbor algorithm

The first algorithm is the k-nearest neighbor model. It is the simplest and most intuitively understandable algorithm. It can be used either for classification or for regression tasks. Often this method is used for quick visualization (a preview) of the data or as a benchmark tool for comparison with other, more complicated (but not always more accurate) methods.
KNN is an example of the so-called "lazy learning" algorithms. With such an approach, the function is interpolated locally and all computations are made directly during the prediction step [AHA 97, ATA 95]. There is no actual training phase for these algorithms – all training examples are simply stored in memory for further predictions. To make a prediction at some point of the feature space whose class (discrete value for classification) or value (continuous for regression) is unknown, we find the k nearest training points according to a predefined distance measure. In the case of classification, the point is classified by the majority vote of these k neighbors. If the number of votes for 2 (or more) classes is equal, a random selection among them is made. In the case of regression, the prediction is simply the mean of the values of its k neighbors. To run the k-nearest neighbors algorithm, we need to define the distance measure and the number of neighbors to use. Generally, any kind of Minkowski p-norm distance can be used:
d_p(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p)                    [5.1]
where p is a parameter equal to or greater than one. Thus, with p = 1 it is the Manhattan (or cityblock) distance, p = 2 gives the Euclidean distance, and p = ∞ corresponds to the infinity-norm distance (max_i |x_i − y_i|). To illustrate how the distances influence the choice of the nearest neighbors, let us select 170 points closest to the center of a regular rectangular grid using different types of distances. In other words, let us draw circles of the same radius on the plane using different distances. Results are shown in Figure 5.35: the left figure corresponds to p = 1, the figure in the middle presents the Euclidean distance, and the right figure corresponds to p = ∞.
Figure 5.35. Special cases of the Minkowski metrics (shapes of circles with the same radius): Manhattan (left), Euclidean (middle) and infinity-norm (right)
The most used, probably due to its straightforward interpretation, is the Euclidean distance.
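For instance, the three special cases of equation [5.1] can be computed directly for a pair of points (a minimal example):

import numpy as np

x = np.array([3.0, -1.0, 2.0])
y = np.array([0.0, 1.5, 2.5])
d_manhattan = np.abs(x - y).sum()             # p = 1
d_euclidean = np.sqrt(((x - y) ** 2).sum())   # p = 2
d_infinity = np.abs(x - y).max()              # p = infinity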
A special case of KNN is encountered when the class or value is predicted as the class or value of the closest training sample (i.e. when k = 1). It is called the nearest-neighbor algorithm. However, this algorithm is not optimal for all tasks and data sets. For different datasets the optimum number of neighbors – the parameter k – is different. It is necessary to tune this parameter for every particular task, for instance using a cross-validation procedure.

5.4.1.1. Number of neighbors with cross-validation

Cross-validation is a very common approach to tuning the parameters of models. In n-fold cross-validation, the original training data set is partitioned into n subsets. One of the n subsets is used as validation data to calculate the error, and the remaining n−1 subsets are used as training data. The cross-validation process is repeated n times (the number of folds), with each of the n subsets used exactly once as validation data. The n validation errors from all the folds can then be averaged (or combined otherwise) to produce a single cross-validation error estimate for the specified set of parameters (the single parameter k in the case of the KNN model). This procedure is repeated for different values of the parameters (k values). The model with the lowest cross-validation error is chosen as the optimum one. Note that the term k-fold is usually used in the literature; however, in the context of the KNN model this would be confused with the number of neighbors k, which is why the term n-fold is used here.

As a special case, the number of folds n can be set equal to the number of observations in the original training data set. This special case of n-fold cross-validation is called leave-one-out cross-validation (or sometimes simply cross-validation). It involves using a single observation from the original data as the validation point and the remaining observations as the training data. This is repeated n times such that each sample is used as the validation data once. A very important note is that leave-one-out always produces a unique result, in contrast with the common n-fold case. This occurs because the leave-one-out procedure does not require any randomness in the partitioning of the folds. This property is extremely important for the requirements of automatic mapping. This advantage is counterbalanced by the higher computational cost.

5.4.2. Automatic mapping of spatial data

To demonstrate the use of the presented models in automatic mode, let us use the same problem that was described in the section devoted to temperature modeling with MLP (see the data description, summary statistics and variography there). The
MLP model required extensive human interaction in order to find an optimum structure, train the network and avoid over-fitting. The same task can be performed using GRNN in the automatic mode, making it possible to compare different methods in terms of prediction error. As in the MLP study, the model with only 2 inputs (spatial coordinates) and an extended ANNEX model with 3 inputs (spatial coordinates and altitude [PAR 03]) will be explored. As measures of the quality of the modeling, the Root Mean Squared Error (RMSE) and Pearson's linear coefficient of correlation (Ro) will be used. In addition, remember that all modeling will be performed in the automatic mode!

5.4.2.1. KNN modeling

Firstly, let us use a KNN model for prediction. The Euclidean metric will be used. The initial interval for the parameter k is from 1 to 30 neighbors for the model with 2 inputs and from 1 to 6 for the ANNEX model. In Figure 5.36 the leave-one-out cross-validation error curves for both cases are presented. For both cases, a well-defined unique minimum exists. Thus, it is easy to define the optimum number of neighbors (k value): for the model with 2 inputs k = 7; for the ANNEX model k = 2. We can see that for the second model the optimum value is significantly smaller. Generally, this means less smooth and more detailed mapping. Often, small values of kernel width lead to overfitting; however, these values were the result of a cross-validation procedure, which makes it possible to avoid it.
Figure 5.36. Leave-one-out cross-validation error curve for KNN model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
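The whole procedure – scanning candidate values of k and computing the leave-one-out error for each – is fully deterministic and therefore well suited to automatic mapping. A minimal sketch on synthetic placeholder data, using scikit-learn:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(5)
# placeholder inputs already scaled to [-1, 1]: (X, Y, Z) as in the ANNEX variant
X = rng.uniform(-1, 1, size=(110, 3))
t = 15.0 - 6.0 * X[:, 2] + rng.normal(0, 0.4, 110)    # temperature-like target

loo_rmse = {}
for k in range(1, 31):                                 # candidate numbers of neighbors
    knn = KNeighborsRegressor(n_neighbors=k, metric="euclidean")
    scores = cross_val_score(knn, X, t, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    loo_rmse[k] = np.sqrt(-scores.mean())              # leave-one-out RMSE for this k

k_opt = min(loo_rmse, key=loo_rmse.get)                # minimum of the curve (cf. Figure 5.36)
final_model = KNeighborsRegressor(n_neighbors=k_opt).fit(X, t)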
In this case the inclusion of the additional input (altitude), which is highly correlated with the target function (temperature), leads to a significantly better prediction result. This can be seen both from the error statistics on the validation data set in Table 5.5 and from the scatterplot of measured against estimated values in
Figure 5.37. Thus, we can see that using additional highly correlated data as an input greatly improves the quality of prediction. The results of mapping using the two and three input models are presented in Figure 5.38. The ANNEX-type model with three inputs reveals more details in the mapping.
Figure 5.37. KNN prediction on the validation dataset: measured values against estimated ones for the model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
Figure 5.38. Prediction mapping using the KNN model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
ANNEX k-NN gives surprisingly good results. Taking the average over the nearest neighbors selected by the altitude provides quite a precise prediction for temperature, due to the linear correlation between the two. It is important to note that these results were obtained when all the inputs were linearly scaled into [-1, 1]; thus in Figure 5.38b we can clearly observe the influence of the elevation. This influence would not be noticeable if the distances were calculated in the original coordinates, where the average spatial distances (in the X-Y projection) are much larger (kilometers) than the differences in elevation (meters).
While KNN requires the tuning of a single parameter, which can be performed fully automatically with cross-validation, it still strongly relies on the distance measure used.

5.4.2.2. GRNN modeling

Now let us use the GRNN model for automatic mapping. In the user-guided mode, general regression neural networks were efficiently applied for mapping of soil pollution in [KAN 99]. The quality of the modeling was controlled by variography of the residuals. In addition to the spatial predictions, estimates of the prediction variance were also obtained. Here the automatic mode will be considered. To avoid the problems of different scales of distances (kilometers) and elevation (meters) in the ANNEX model, the input coordinates were mapped into the [-1; 1] interval. We recall that this scaling is compulsory for MLP modeling even if it is not mentioned explicitly. In the GRNN model, the difference in characteristic scales of the different inputs can be handled by using an anisotropic (diagonal) matrix Σ. Otherwise, the inputs can be pre-scaled according to physical sense or other prior considerations. The initial interval for the search of the optimum width of the kernel (σ value) is fixed as [0.01; 0.5].
Figure 5.39. Leave-one-out cross-validation error curve for a GRNN model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
The results of kernel width searching using the cross-validation technique are given in Figure 5.39. Like the KNN, the curves have well-defined minima. Similarly, the optimum parameter for the case of ANNEX is much smaller than for a 2 input model.
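GRNN is essentially Nadaraya-Watson kernel regression, so a minimal implementation and the leave-one-out search for the kernel width σ fit in a few lines. The sketch below uses synthetic placeholder data and an isotropic Gaussian kernel; it is an illustration of the principle rather than the implementation used in the study.

import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma):
    # Nadaraya-Watson kernel regression with an isotropic Gaussian kernel (the GRNN estimator)
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return (w @ y_train) / np.maximum(w.sum(axis=1), 1e-300)

def loo_rmse(X, y, sigma):
    # leave-one-out error: the self-weight of each training point is set to zero
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    pred = (w @ y) / np.maximum(w.sum(axis=1), 1e-300)   # guard against underflow at tiny sigma
    return np.sqrt(np.mean((y - pred) ** 2))

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(110, 3))                    # placeholder scaled (X, Y, Z) inputs
y = 15.0 - 6.0 * X[:, 2] + rng.normal(0, 0.4, 110)       # placeholder temperature-like target

sigmas = np.linspace(0.01, 0.5, 50)                      # search interval for the kernel width
errors = [loo_rmse(X, y, s) for s in sigmas]
sigma_opt = sigmas[int(np.argmin(errors))]               # minimum of the curve (cf. Figure 5.39)
prediction = grnn_predict(X, y, rng.uniform(-1, 1, (5, 3)), sigma_opt)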
Figure 5.40. GRNN prediction on the validation dataset: measured values against estimated values for a model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
Figure 5.41. Prediction mapping using a GRNN model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
                          RMSE                     Ro
Model                     Training   Validation    Training   Validation
KNN    7-NN, 2 inputs     3.54       4.48          0.66       0.40
       2-NN, ANNEX        0.77       1.61          0.99       0.96
GRNN   2 inputs           3.57       4.19          0.67       0.47
       ANNEX              0.58       1.33          0.99       0.97

Table 5.5. Error statistics for the regression task (mean temperature in August 2005, Switzerland)
Figure 5.40 is the analog, for GRNN, of the KNN scatterplots in Figure 5.37 and looks very similar. The error statistics on the validation data set are presented in Table 5.5. In Figure 5.41 the resulting prediction mappings for both models are presented. The ANNEX prediction has much more detail and looks nearly perfect. GRNN with 2 inputs looks good as well, especially for global trend prediction or quick visualization tasks. Compared to the KNN model with 2 inputs (Figure 5.38a against Figure 5.41a), it is a much better model, while such a KNN mapping may be unacceptable in some cases. However, the ANNEX mappings for KNN and GRNN (Figure 5.38b against Figure 5.41b) are very similar and can both be acceptable.

5.4.3. Automatic classification of spatial data

In this section, the PNN model will be applied in automatic mode for modeling the precipitation data described in the previous sections and modeled with MLP. The detailed data description, graphs and summary statistics can be found there. Originally the precipitation data is a continuous function (the total precipitation in August 2005 in millimeters). To formulate a classification task, let us define a threshold of 300 mm. This converts the problem into a traditional two-class task, i.e. to predict the areas where the sum of precipitation was below and above the predefined threshold. In Figure 5.42 the distributions of the training (294) and validation (146) rain gauges which provide the measurements are presented.
Figure 5.42. Distribution of the sum of precipitations (August 2005; lower than 300 mm – empty circles, above this level – filled circles) data sets: a) training; b) validation
As in the previous regression study, two models were considered: one with only 2 inputs (geographical coordinates) and an extended ANNEX model with 3 inputs (geographical coordinates and altitude). Note that in this case the correlation (global, at least) is weak and is not as well defined as in the case of the temperature function. Therefore, it is interesting to investigate the feasibility of the ANNEX model in the
case of complex data. As in the previous regression study, all modeling procedures will be carried out in an automatic mode.

5.4.3.1. KNN classification

Let us repeat all the steps of KNN modeling as in the previous regression study. In Figure 5.43 the leave-one-out cross-validation error curves for both cases are presented. For the model with 2 inputs the optimum number of neighbors is k = 4; for the ANNEX model k = 24. This result is somewhat the opposite of the previous regression study, where the optimum parameter for the ANNEX model was significantly smaller than for the 2 input model (2 versus 7). In addition to this, the curves do not have such a well-defined structure (minima) as in the case of the ANNEX model with highly correlated additional information. It seems that a value of 8 might be preferable as the optimum parameter for the ANNEX model. However, since this section is devoted to the exploration of automatic mapping models, in this modeling we strictly follow the automatic procedure without any expert intervention or other knowledge.
Figure 5.43. Leave-one-out cross-validation error curve for a KNN model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
In this particular case the inclusion of the additional input, which is not very correlated (globally) with the target function (precipitation), does not lead to a significantly better prediction result. This can be seen from the misclassification error statistics on the validation dataset in Table 5.6.
Figure 5.44. Prediction mapping (gray area points – level above 300 mm; circles – training points for this class) using the KNN model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
The prediction mappings provided by the two models, presented in Figure 5.44, are quite different, especially in the area circled in the figure. Empty circles are the training points of the class above 300 mm, added for better visualization of the results. We can see that although the validation error is lower for the ANNEX model, the prediction mapping may be more acceptable in the case of the 2 input model, at least in the area selected in Figure 5.44.

5.4.3.2. PNN classification

Let us now examine the performance of the PNN model, an analog of the GRNN algorithm for classification tasks. Cross-validation is used here as well to find the optimum width of the kernel (σ value) from the initial search interval [0.01; 0.5]. Results are given in Figure 5.45. Like KNN, the ANNEX model has a larger local parameter than the 2 input model. Prediction mapping is presented in Figure 5.46. The training points (empty circles) and the selected area are presented as in Figure 5.44 with the KNN predictions, for comparison. We can observe a significant difference for the ANNEX model: the prediction in the selected area is much more accurate.

Let us now explore the principal advantage of the PNN model: the post-processing of the probabilistic outputs. As mentioned above, PNN produces not only the classification result but also the probabilities of belonging to each of the classes. This property is very important and is a particular advantage of generative and Bayesian machine learning methods.
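A PNN of this kind can be written compactly as a Parzen-window estimate of the class membership probabilities. The following minimal sketch, on synthetic placeholder data, shows how the class probabilities, the winning class and the low-confidence areas of Figures 5.46-5.48 can be derived; it illustrates the principle rather than the implementation used here.

import numpy as np

def pnn_probabilities(X_train, labels, X_query, sigma):
    # Parzen-window (PNN) estimate of the class membership probabilities
    dens = []
    for c in np.unique(labels):
        Xc = X_train[labels == c]
        d2 = ((X_query[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
        # class-conditional kernel density weighted by the class prior n_c / n
        dens.append(np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1) / len(X_train))
    dens = np.column_stack(dens)
    return dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(294, 3))                  # placeholder scaled inputs of the rain gauges
labels = (X[:, 0] + 0.3 * X[:, 2] > 0.2).astype(int)   # placeholder classes: 1 = above 300 mm
grid = rng.uniform(-1, 1, size=(1000, 3))              # placeholder prediction grid

p = pnn_probabilities(X, labels, grid, sigma=0.05)
winning_class = p.argmax(axis=1)                       # classification map (cf. Figure 5.46)
max_prob = p.max(axis=1)                               # probability of the winning class (cf. Figure 5.47)
uncertain_area = max_prob < 0.7                        # low-confidence area (cf. Figure 5.48)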
Figure 5.45. Leave-one-out cross-validation error curve for a PNN model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
Figure 5.46. Prediction mapping (gray area points – level above 300 mm; circles – training points for this class) using a PNN model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
Figure 5.47. Maximum probability (probability of the winning class, the same color scale) of the PNN model with a) two inputs (spatial coordinates); b) ANNEX (spatial coordinates and altitude)
Figure 5.48. Maximum probability less than 0.7 (circles – training points for level above 300 mm) of the PNN with a) two inputs; b) ANNEX
In Figure 5.47, the probability of the winning class (the maximum probability among the two classes) is presented. Note that for a two-class task this value cannot be less than 0.5. We can see clearly-defined areas of uncertainty where the winning probability is not very large (dark areas) and, correspondingly, the results are less reliable than in areas with larger values. For decision-oriented post-processing, some critical decision threshold may be defined. For example, in Figure 5.48 the area with probability less than the predefined threshold of 0.7 is presented. Predictions in this area are less reliable. Therefore, if possible, additional measurements might be performed in this area in order to improve the quality of the prediction.
                           Training error        Validation error
Model                      Points     %          Points     %
KNN    4-NN, 2 inputs      7          2.4        11         7.5
       24-NN, ANNEX        21         7.1        8          5.5
PNN    2 inputs            25         8.5        16         11.0
       ANNEX               21         7.1        13         8.9

Table 5.6. Misclassification error for the classification task (amount of precipitation in August 2005, Switzerland, threshold 300 mm)
5.4.3.3. Indicator kriging classification

Now, let us briefly consider the same problem of spatial data classification using indicator kriging, following the general methodology presented in Chapter 3. This is a geostatistical model which requires user interaction for the modeling of variograms. We therefore present these results in order to make some comparisons with those obtained in this section using the automatic algorithms of machine learning.
Figure 5.49. Experimental (top) and theoretical (bottom) variogram roses (variograms calculated for different directions and at different distances) for the precipitation level of 300 mm
An important step of any geostatistical modeling is variography: the analysis of experimental variograms and variogram modeling (fitting of experimental variograms using theoretical models). The selected experimental variogram and the corresponding theoretical model are presented in Figure 5.49. An anisotropic model was selected, which reflects the phenomenon under study – precipitation in the mountainous regions of Switzerland. Experimental variography was carried out with different angle and distance tolerances in order to find the best representation of the spatial correlation structure. In principle, the theoretical variogram model can also be fitted to the experimental one according to some criterion, such as Cressie's criterion [CRE 93].
The next step is the spatial classification using indicator kriging and the developed variogram model. As above, two models were considered: 1) a two-dimensional model with two inputs (geographical coordinates only) and 2) indicator kriging with external drift, a three-dimensional model which also takes altitude data into account. The output of indicator kriging is probabilistic; in the case of a two-class classification problem it is the probability of belonging to the given class. In our case the results are the probabilities of exceeding a precipitation level of 300 mm.
Figure 5.50. Spatial classification using indicator kriging, two-dimensional model. Gray levels (from light to dark) correspond to the 25%, 50% and 75% quantiles of the probability of being above the level of 300 mm. Data postplot: crosses correspond to validation data with values less than 300 mm; large circles correspond to values larger than 300 mm
As usual, the three-dimensional model provides more local, short-scale details. Finally, we have to estimate the validation (generalization) error in order to compare these two models and to compare indicator kriging with the other models (KNN, PNN). Taking into account the probabilistic nature of IK, a probability level of 0.5 was selected as the decision level, i.e. the classes were separated at the probability of 0.5. The results obtained by the two models were very similar, with a misclassification error of about 6.7%. Compared with the results of the automatic MLA (Table 5.6), they are quite good. However, developing and modeling a variogram requires considerable time and experience.
Figure 5.51. Spatial classification using indicator kriging, three-dimensional model. Gray levels (from light to dark) correspond to the 25%, 50% and 75% quantiles of the probability of being above the level of 300 mm. Data postplot: crosses correspond to validation data with values less than 300 mm; large circles correspond to values larger than 300 mm
5.4.4. Automatic mapping – conclusions

The machine learning methods – KNN and GRNN/PNN – were tested on real-world applications: mapping of the mean temperature in August 2005 in Switzerland (regression) and mapping of the values exceeding a given level of precipitation (classification task). Two models were used for each method: a standard model with 2 inputs (geographical coordinates) and an extended ANNEX model with 3 inputs (geographical coordinates and altitude).

The presented study had two major goals. First, the feasibility of the described models as tools for automatic mapping was examined. Secondly, the influence of adding additional information as an extra input of the model was investigated. Such information has to be available everywhere in the space where the prediction has to be performed. In our case this information is altitude, which can be obtained from high resolution digital elevation models.

Concerning the regression task, it can be concluded that both methods show good results and can be used in practice. The simpler method – KNN – provides worse results than GRNN; this is especially true for the model with 2 inputs, where the resulting KNN map may in some cases be considerably worse. However, in the case of the ANNEX model and good correlation between the inputs and the output, the performance of both models is good and differs insignificantly. GRNN models showed excellent performance in both cases.
The ANNEX model was clearly superior for both methods. We can see that using (globally or locally) highly correlated data as an additional input greatly improves the quality of prediction.

As for the classification task, both methods show acceptable results. KNN provides better performance on the validation data set for both models (2 inputs and ANNEX). PNN models produce acceptable results and, importantly, a probabilistic output. This is a very advantageous property for decision-oriented systems. The geostatistical model of indicator kriging provided similar results in this case study, despite the careful user-driven modeling of variograms.

As a general recommendation, let us mention the following. GRNN/PNN models are promising models for use in automatic mapping systems. In some cases, the simpler KNN method may equally be used, especially for a quick visualization of the data. In the case of very complicated data with a large amount of noise and possible outliers, the KNN method may be preferable due to its simplicity.

5.5. Self-organizing maps for spatial data – case studies

Self-organizing maps have a wide range of applications in different fields, including environmental applications (see, for example, [KOH 00, KAL 07, SOM 08] and references therein). In this section self-organizing maps are used to analyze multivariate spatial data. The first case study deals with sediment contamination data from Leman lake. The second case study is devoted to the analysis of high-dimensional socio-economic data.

5.5.1. SOM analysis of sediment contamination

The main purpose of this example is to illustrate the use of SOM in finding dependencies in high-dimensional data in an unsupervised way. Another result presented here concerns its ability to estimate missing values according to the revealed structures.

The data describe spatially distributed measurements of metal and non-organic contents in Leman lake sediments. Measurement campaigns of different chemical elements and compositions as well as environmental parameters were carried out by the international CIPEL agency (Lausanne office). This case study considers measurements from the years 1978 and 1988. The data for analysis contain 30 variables (the metals and their composites) measured at 294 points in 1978 and 20 variables at 200 points taken in 1988. The majority of the measuring locations (197) match in both databases. In 1988, only a few measurements were made in the central part of the lake; the 20 variables measured in
1988 are present in both databases. Figure 5.52 shows some examples of correlations between variables: nonlinear (top left), linear (top right) and absence of correlation (bottom). To prevent the domination of some variables over others in the distance calculation, all variables were transformed so as to be within the same range. The transformation was performed with the help of the normal score transform. All results in the figures are presented in the original value range, i.e. after a backward normal score transform. Euclidean distances between vectors were used.
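The normal score transform mentioned above replaces each value by the standard normal quantile of its empirical rank. A minimal sketch of this idea is given below; the function name and the simple rank-based quantile formula are assumptions made for illustration and ignore details such as tied values.

```python
import numpy as np
from scipy.stats import norm

def nscore(values):
    """Map a 1-D sample to normal scores via its empirical ranks."""
    n = len(values)
    ranks = np.argsort(np.argsort(values))   # ranks 0 .. n-1
    p = (ranks + 0.5) / n                    # plotting positions in (0, 1)
    return norm.ppf(p)                       # standard normal quantiles

# Example: a skewed variable becomes approximately N(0, 1) after the transform
x = np.random.lognormal(mean=3.0, sigma=1.0, size=294)
z = nscore(x)
print(z.mean().round(3), z.std().round(3))   # close to 0 and 1
```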
Figure 5.52. Types of correlation in Leman contamination data: nonlinear, linear and no correlation
Input vectors for training are composed of the 30 transformed variables (no coordinates). For this case a SOM with 200 (20×10) nodes was trained. The training procedure was performed in two stages: 1,000 iterations with a learning rate of 0.1 and a "bubble" neighborhood; then 10,000 iterations with a learning rate of 0.01, a "bubble" neighborhood and a radius R=2. Figure 5.53 presents the U-matrix for the vectors of the trained SOM. The U-matrix illustrates the average distance from a vector to its neighbors: the larger this distance is, the lighter the color. Using k-means, the SOM vectors were clustered into 5 classes. The borders between classes are presented in Figure 5.53. The clustering of the U-matrix is an important step in SOM applications. It can be carried out using
a variety of methods; for example, hierarchical clustering will be used in the next section instead of k-means.
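A minimal from-scratch sketch of the two-stage SOM training described above is given below; it uses a simple online update with a "bubble" (hard cut-off) neighborhood on a rectangular grid. The function, its parameter names and the initialization scheme are illustrative assumptions and not the software actually used in this study.

```python
import numpy as np

def train_som(data, grid=(20, 10), iters=1000, lr=0.1, radius=2, codebook=None, seed=0):
    """Online SOM with a 'bubble' (hard cut-off) neighborhood on a rectangular grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n_nodes = rows * cols
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    if codebook is None:
        codebook = data[rng.choice(len(data), n_nodes)].astype(float)  # init from data vectors
    for _ in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))             # best matching unit
        bubble = np.abs(coords - coords[bmu]).max(axis=1) <= radius    # nodes inside the bubble
        codebook[bubble] += lr * (x - codebook[bubble])                # pull neighbors toward x
    return codebook, coords

# Two-stage training as described in the text (X: nscore-transformed data, shape (n, 30)):
# cb, coords = train_som(X, iters=1000, lr=0.1, radius=2)
# cb, coords = train_som(X, iters=10000, lr=0.01, radius=2, codebook=cb)
```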
Figure 5.53. U-matrix for SOM vectors trained on Leman contamination data
Figure 5.54 illustrates the spatial distribution of the 5 SOM classes, shown with different grayscale levels. Although the coordinates were not used for training, the classes are spatially grouped. This means that the internal properties of the contamination data contain some spatially structured information, which may indicate, for example, the locations of contamination sources.

The next approach illustrated here is the use of SOM for the estimation of variables that were not measured in 1988. In order to check the prediction abilities of SOM, a part of the samples (a testing set) was extracted from the 1978 data. In these testing data the values of Fe2O3 concentrations were marked as missing. This variable was selected because it is partly present in 1988. These missing values were estimated by the trained SOM quite precisely, as illustrated in Figure 5.55.
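Missing-value estimation with a trained SOM can be done by finding the best matching unit using only the observed components and reading the missing components from that unit's reference vector. The sketch below assumes a trained codebook array (for example from the sketch above); the names and the simple masking scheme are illustrative assumptions.

```python
import numpy as np

def som_impute(sample, codebook):
    """Fill NaN components of `sample` from its best matching SOM reference vector."""
    observed = ~np.isnan(sample)
    # distance to each reference vector computed on the observed components only
    d2 = ((codebook[:, observed] - sample[observed]) ** 2).sum(axis=1)
    bmu = np.argmin(d2)
    filled = sample.copy()
    filled[~observed] = codebook[bmu, ~observed]   # estimate = reference vector value
    return filled

# Example: estimate a missing Fe2O3 value (assuming it is stored in column 7)
# x_est = som_impute(test_vector_with_nan_at_column_7, codebook)
```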
Figure 5.54. Spatial distribution of SOM classes for Leman contamination data
Figure 5.55. Post plot of SOM estimates versus real values for Fe2O3 in 1978 (left) and variograms for SOM estimates and real values of Fe2O3 in 1978 (right)
Figure 5.55 presents the post plot of the SOM-estimated values of Fe2O3 in Leman sediments against the real measurements, and the spatial correlation structures (variograms) of the real and estimated values. The largest estimation errors correspond to high contamination values. The averaging in the estimation (smoothing effect) is evident: low values are over-estimated, high values are under-estimated and the middle part is estimated well. This is natural, because the SOM estimates (the reference vectors of the SOM neurons) are expected to be averages over the class members. The SOM estimates seem to reproduce the spatial correlation structure rather well, which is important in data analysis.
Figure 5.56. Post plot of SOM estimates versus real values for Fe2O3 in 1988 (left) and variograms for SOM estimates and real values of Fe2O3 in 1988 (right)
The same SOM was used to estimate Fe2O3 in 1988. To check the quality of the prediction, even the known values of this variable were treated as missing. Figure 5.56 presents the correlation between real and estimated values in 1988 and their spatial correlation structures (variograms). The estimates of variables that were not measured in 1988 can be presented as maps using the geographical coordinates. These estimates can be considered as local averages and give some overview of the values of the unmeasured variables.
5.5.2. Mapping of socio-economic data with SOM

The main objective of this example is to integrate a SOM into a classical task of visualizing high-dimensional data in a way that reveals the underlying dependencies. In this case, the SOM performs a nonlinear transformation of a high-dimensional data set. The data contain 75 variables about the economic activities, the age groups and the percentage of foreigners for 427 municipalities in western Switzerland (cantons of Vaud and Geneva). The objective is to obtain an unsupervised classification model with 4 to 8 groups in order to map the socio-economic structures of the municipalities.

The 75 variables of the data set are divided into 54 economic variables about the number of jobs per economic domain in 2000, 20 demographic variables about the age structure in 2000, and 1 variable about the percentage of foreigners. For all variables, the percentage for each municipality has been calculated and the values have been standardized. This harmonization of the data set prevents the domination of some variables over others in the distance calculations.

The mapping process is divided into three phases (Figure 5.57):
– pre-processing of the data with the SOM. This enables a nonlinear transformation of the data and a generalization in order to exclude extreme values. The degree of generalization is determined by the SOM size. Intuitively, the SOM can be seen as a kind of stiff net which is laid over a virtual, ordered data relief. With a small SOM, there are fewer possibilities for adaptation to local asperities;
– the analysis of the U-matrix of the code vectors. The SOM output vectors (the code vectors) were divided into groups with a classical hierarchical classification. The number of groups to create can be determined with classical methods, for example, dendrogram analysis;
– the original data are assigned to one of the groups by the SOM. A thematic map presents the spatial distribution of the groups. An analysis of the mean profiles for each group enables the attribution of a signification to each class.
Figure 5.57. The classification and mapping process
The determination of the SOM size is of great importance; it has a direct impact on the degree of generalization of the data. Too small a SOM gives too strong a generalization and some phenomena cannot be assessed. Too large a SOM can lead to over-fitting the data, and the SOM pre-processing step loses its significance. For our case study we have chosen a SOM with a size of 16 × 16 cells. This means 256 SOM cells for 427 municipalities; the number of geographic units is 1.65 times higher than the number of cells. A ratio of about 1 to 2 should generally lead to a satisfying solution. The SOM itself has been created in two phases. The ordering phase has been executed with 1,000 iterations and an initial learning rate of 0.1. The convergence phase has been carried out with 10,000 iterations and an initial learning rate of 0.01. The Gaussian function has been chosen as the neighborhood function, with a radius of 8 for the first phase and 2 for the second.
Afterwards, the code vectors of the resulting SOM have been divided into groups using a classical hierarchical classification method. Figure 5.58 shows the corresponding dendrogram and the classified SOM. The dendrogram indicates the creation of 5 classes.
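Hierarchical clustering of the SOM code vectors, with the number of groups read from the dendrogram, can be sketched as follows; the choice of Ward linkage, the placeholder codebook and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# codebook: array of SOM code vectors, e.g. shape (256, 75) -- placeholder data here
codebook = np.random.rand(256, 75)

Z = linkage(codebook, method="ward")               # hierarchical clustering of code vectors
labels = fcluster(Z, t=5, criterion="maxclust")    # cut the tree into 5 classes

# dendrogram(Z) can be plotted to choose the number of classes visually
print(np.bincount(labels)[1:])                     # class sizes
```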
Figure 5.58. Resulting dendrogram from the hierarchical clustering and the classified SOM
The next step consists of the attribution of the class resulting from the hierarchical classification to each of the 427 municipalities. For each municipality, we determine the best fitting code vector, for which the class is known. This attribution enables the mapping of the classes and the creation of the mean profiles for each class (Figure 5.59). These mean profiles help the analyst to find an interpretation and a signification for each class. We can see in these profiles that the classification is able to highlight different characteristics of the municipalities. As we are working with standardized variables, the unit of the y-axis is the standard deviation; 0 corresponds to the global mean value and positive values show the presence of a specific characteristic.
Figure 5.59. The class profiles
Figure 5.60. Mapping of the classification result
The mapping of the phenomenon can be done in different ways; two possible examples are shown in Figure 5.60. However, it is not apparent from the maps that a SOM has been used during the analysis; an additional comment is therefore necessary.

As a conclusion we can say that a SOM can be used in a conventional analysis process as an additional step. In this case, it functions as a nonlinear data transformation. The generalization effect of the SOM generally leads to a quite robust classification. The size of the SOM can be estimated quite easily in order to obtain an acceptable result. However, the assessment of the classification quality is not trivial; most methods rely on linear indicators.

5.6. Indicator kriging and sequential Gaussian simulations for probability mapping. Indoor radon case study

In this section a risk-related mapping of indoor radon concentration using advanced geostatistical methods – indicator kriging (IK) and conditional geostatistical simulations, i.e. sequential Gaussian simulations (SGS) – is considered. Indicator kriging is a geostatistical method for modeling local probability density functions, which is important for risk mapping. Geostatistical conditional simulations generate many equally probable realizations of the phenomenon under study and, in principle, produce the most comprehensive information for risk and decision-oriented mapping. The indoor radon data constitute a challenging case study: the data are highly clustered and variable at different scales, and the experimental variography and variogram modeling are not easy tasks either.

5.6.1. Indoor radon measurements

Pollution by radon gas inside dwellings is an interesting example for the use of interpolation methods and particularly for geostatistics. Measurements of radon are carried out inside dwellings, and the accumulation of this radioactive gas is governed by distinct physical processes. Among the many sources of radon gas, the soil provides the main part. Once the gas is exhaled from the soil material, it can be diluted in water, diffuse directly into buildings through fractures in the floor or diffuse into the outside air. Once inside the dwelling, it can be transported upward by advection caused by hot/cold air interchange or, more intensively, by ventilation. Figure 5.61 sketches the process using a house model.
Figure 5.61. Physical processes of indoor radon accumulation
It is evident that indoor radon measurements may vary during the day, the month or even the year for the same house. In fact, they depend on a variety of factors, from the nature of the underlying lithological material, the insulation of the building and the rate of ventilation to the atmospheric conditions with their annual variations. The half-life of the 222Rn isotope is about 3.8 days, over which period it can be harmful to humans if it is inhaled at high concentrations on a daily basis. To evaluate the chronic exposure, mean annual concentrations are calculated by placing detectors in dwellings during long periods and normalizing to annual rates.

As this is a complex phenomenon, the influence of all the triggering factors gives a high local variability of measurements, with records differing even between neighboring buildings. We could even consider indoor radon accumulation to be particular to every single building. The statistical distributions of the data sets are also not very homogenous: the mean level of concentration for a country such as Switzerland is about 230 Bq/m3, but it is possible to find values up to 20,000 Bq/m3. The presence of these extreme values also poses difficulties for spatial modeling. Another type of obstacle has to do with the spatial distribution of the data; since measurements are carried out in buildings, they often follow clustered patterns, leaving large areas of the national territory without sample coverage.
5.6.2. Probability mapping

As we have seen, under these circumstances the modeling of indoor radon is a challenging task. The first approach that could be proposed for this type of data is to use robust methods which are not affected by local variability and extreme values. On the other hand, it is advisable to use methods which can take into account the innate uncertainty of indoor radon data and present it in the form of probability maps.

Indicator kriging (IK) is a robust method due to its first step, the indicator transform. The data are transformed into two categories, 0 and 1 indicators, depending on whether the values are below or above a certain cutoff value. This procedure already attenuates the influence of extremes and allows (indicator) variograms to be built even if the variogram of the original continuous data cannot be found. The same is true for the sequential Gaussian simulation (SGS) method. The SGS method also requires a data transformation in the first instance; sample values are transformed (nonlinearly) to Nscores, which correspond to a standard normal distribution with zero mean and unit variance. The Nscore transform provides a known distribution function from which the mean, variance and covariance functions can be derived and reproduced.

In addition, these two methods have an approach to uncertainty modeling that makes them interesting for indoor radon mapping. Kriging of the values processed with an indicator transform provides a probability of occurrence from 0 to 1, which can be interpreted as the local probability of being either below or above the selected indicator, following a bimodal distribution of probabilities. Once the IK interpolation is performed for unsampled points, the result is given on the 0 to 1 scale, and a probability decision level can be chosen to decide whether areas are considered to be below or above the threshold.

With the SGS method, the resulting information is even more complete, since a local conditional cumulative density function is built for every point on a simulation net. The sequential mechanism produces a realization of the joint Gaussian distribution of local Gaussians to form simulated images. As a Gaussian distribution is assumed, only the mean and the variance are required: the mean corresponds to the simple kriging mean and the variance to the simple kriging variance. In this way the uncertainty of the kriging prediction error is reincorporated by the simulation. The result is a set of simulated images that gives many values for every point, from which a local probability density function (pdf) is built. Having sets of pdfs on a simulation grid we can obtain a probability map, as for the IK method. In this map each point is assigned a probability of being above or below a certain threshold.
5.6.3. Exploratory data analysis

The data set used in this case study consists of 1,710 indoor radon measurements, considering only inhabited dwellings and measurements at the ground floor. The spatial distribution of the measurements is presented in Figure 5.62, superimposed on a topographic map of the area. We can observe that the distribution is irregular and clustered, and in general it follows the distribution of urban areas. The information presented is also restricted to cantonal limits; therefore, some urban areas beyond these limits, such as the one in the north-east, are not covered with samples. Agricultural land, forested areas, mountains and lakes are also not expected to have measurements.
Figure 5.62. Radon measurement locations
The dataset was subdivided into a set of training data with 1,310 samples and a set for validation with 400 values. The spatial distributions of values for these two sets are presented in Figure 5.63.
Figure 5.63. Postplot graphs for training (left) and validation (right) indoor radon datasets
A first analysis of the data sets is the calculation of the statistical parameters of the data distribution. For indoor radon records, which contain values far from the median, a useful representation is given by boxplots (as shown in Figure 5.64). In this graph, points that exceed one and a half times the interquartile range (the difference between the third and first quartiles), counting from the third quartile, may be considered as outliers when compared to a normal distribution. This group of high values contributes to the high sample variance (45,000 (Bq/m3)²) of the dataset, while the sample mean is 141 Bq/m3 and the median is 92 Bq/m3.
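The boxplot outlier rule mentioned above (values beyond 1.5 times the interquartile range above the third quartile) can be sketched as follows; the variable names and the placeholder data are illustrative assumptions.

```python
import numpy as np

# radon: 1-D array of indoor radon measurements (Bq/m3) -- placeholder data here
radon = np.random.lognormal(mean=4.5, sigma=0.8, size=1710)

q1, q3 = np.percentile(radon, [25, 75])
iqr = q3 - q1                                   # interquartile range
upper_fence = q3 + 1.5 * iqr                    # boxplot rule for high outliers
outliers = radon[radon > upper_fence]

print(f"median = {np.median(radon):.0f} Bq/m3, upper fence = {upper_fence:.0f} Bq/m3, "
      f"{outliers.size} high outliers")
```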
Figure 5.64. Representation of value distribution by boxplot graph
It is also necessary to build a pdf for the training data set, in order to verify the reproduction of the global histogram at a later stage. As the distribution is heavily skewed, a graphical comparison with the histograms of the results would be difficult; therefore, a selection of 95% of the data was used to plot the core of the data, as shown in Figure 5.65.
Figure 5.65. Probability distribution of training values, data is represented up to the 95th centile
In addition to global statistics, it is important to represent the spatial distribution of the data, in order to analyze whether locations of high or low values are associated with the clustering of data. In other words, it is important to understand how the values are distributed in space, to detect whether preferential sampling is present and, eventually, to decide whether any declustering technique should be applied. Preferential sampling or clustering of low or high values can be represented by decile maps. The first and ninth deciles are selected from the dataset and plotted in a graph (Figure 5.66). It can be observed that there is spatial clustering for both low and high values.
Figure 5.66. Spatial distribution of first decile data (crosses) and ninth decile (empty circles)
As mentioned, the robustness of the IK and SGS methods is mainly due to the data transformation. By assuming a known distribution, there is an evident loss of information, but at the same time the variance is less influenced and the results are less biased. In Figure 5.67 (left), we see a postplot of the training data transformed to 0/1 indicators, considering the threshold value of 92 Bq/m3 (the median value), as they will be used in the indicator kriging procedure. A map of Voronoï polygons for these indicator values is presented on the right; the effects of preferential sampling and extreme values are masked after the transformation. The trend of higher values towards the north-west (Figure 5.63) is also no longer evident. In Figure 5.68, Voronoï polygons are also built using transformed data, but this time transformed to Nscores, as they will be used in the SGS procedure. Voronoï polygons are useful tools to analyze the spatial distribution of data and to obtain an indication that the conditions of stationarity are approximated after transformation. They also show that the clustering and empty spaces in the dataset are important and should be considered when using the interpolation models.
Figure 5.67. Left: map of indicator values, dark squares indicates values above 92 Bq/m3 (median value). Right: map of Voronoï polygons for indicator 92 Bq/m3
Figure 5.68. Map of Voronoï polygons of the Nscores transform of data
5.6.4. Radon data variography

Variogram modeling is a crucial part of the kriging methods. A theoretical model must be fitted to the experimental variogram calculated from the real sample values. Models consist of a nugget effect, a sill and the variogram range. The nugget effect may be obtained by calculating the variogram at very short distances, the sill is the portion of the variance where regional structures are defined, and the range defines the limits of validity of these structures.

5.6.4.1. Variogram for indicators

For the IK method there is one main limitation during variogram modeling: the indicator variogram becomes unstructured as the indicator value moves away from the median value. This is not surprising; local variations diminish if one of the categories is over-represented, and the variance at shorter lag distances remains closer to the nugget effect. The a priori variance of the indicator transform varies depending on the cutoff value. For indicators far from the median, one of the categories (1 or 0) has a lower probability of occurrence, and the a priori variance is therefore reduced. This can easily be calculated from the variance formula:
\[ \sigma^2 = \sum_i (i - \bar{u})^2 \, p_i \]
where i is the indicator value, pi its probability and ū the mean indicator value; thus, for the median indicator the a priori variance is 0.25, while for the sixth decile it already decreases to 0.24 and at the ninth decile it is just 0.09. To illustrate this, the experimental variograms for the median value (92 Bq/m3) and for the indicator 150 Bq/m3 (a priori variance = 0.166, corresponding to the 79th centile) are presented in Figure 5.69.
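For a binary indicator taking the value 1 with probability p (the proportion of data above the cutoff), the formula above reduces to the Bernoulli variance; the following short derivation, added here for clarity, reproduces the values quoted in the text:

\[ \sigma^2 = (1-p)^2\,p + (0-p)^2\,(1-p) = p\,(1-p) \]

so that p = 0.5 gives 0.25, p = 0.4 (sixth decile) gives 0.24, p = 0.21 (79th centile) gives approximately 0.166, and p = 0.1 (ninth decile) gives 0.09.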
Figure 5.69. Experimental variograms for the median indicator value and for the 79th centile value (150 Bq/m3)
The median indicator kriging approach assumes that the spatial distribution has a similar configuration for all threshold values, so that the median indicator variogram may be used for all indicators, as if the categories 0 and 1 were spatially intercalated in a mosaic disposition. For this case study, the median indicator variogram was used. The experimental variogram is first calculated for pairs of points separated by a certain lag distance; for the median indicator it was calculated as omnidirectional (no distinction between directions), since no trend is evident for the indicator values. In Figure 5.70 the fitted variogram is shown: a spherical model with two structures, the first accounting for 0.05 of the variance within a range of 2,450 m, and the second, with 0.015 of the variance, extending up to 18 km.
Figure 5.70. Experimental variogram (continuous line) and variogram model (dashed line) for indoor radon median indicator
5.6.4.2. Variogram for Nscores

When a multi-Gaussian model is assumed (the case with the SGS method), only positive definite models must be used, such as spherical, exponential and Gaussian models. To hold the condition of stationarity, the sum of the nugget and the sill must not exceed the a priori variance of the population. For the SGS method the theoretical a priori variance corresponds to a standard Gaussian distribution; the variance of the data after the transformation to Nscores should therefore also be approximately 1.

The experimental variogram was calculated with the lag tolerance set to twice the lag in order to obtain a smoother variogram. A spherical variogram model with two nested structures was fitted. The nugget effect accounts for 0.55 of the variance; a first structure with 0.26 of the variance has a range of 850 m, and a second structure with 0.19 of the variance extends up to 18,000 m. Variogram modeling becomes more difficult if there is limited knowledge about the physical process under study.
The behavior of indoor radon, as mentioned, depends on many factors. For this case study a spherical model with a strong local influence is proposed, which is reflected in the nugget effect and the first structure; the first range of variance occurs at less than 1 km. The second structure is intended to model the influence of global environmental factors, such as the lithology or the soil properties. The total range should not exceed the maximum distance between samples. It is also reasonable to assume that most of the spatial variance can be detected within half of this maximum distance. Figure 5.71 shows the experimental variogram of the Nscores and the chosen model.
Figure 5.71. Experimental variogram (continuous line) and variogram model (dashed line) for indoor radon N scores
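The nested variogram model described above (nugget plus two spherical structures) can be written as a small function, shown below as an illustrative sketch; the parameter values are those quoted in the text, the function names are assumptions, and the nugget discontinuity at h = 0 is ignored.

```python
import numpy as np

def spherical(h, sill, rng):
    """Spherical variogram structure: rises to `sill` at range `rng`."""
    h = np.asarray(h, dtype=float)
    g = sill * (1.5 * h / rng - 0.5 * (h / rng) ** 3)
    return np.where(h < rng, g, sill)

def gamma_nscore(h, nugget=0.55, s1=0.26, r1=850.0, s2=0.19, r2=18000.0):
    """Nested model fitted to the Nscore variogram: nugget + two spherical structures."""
    return nugget + spherical(h, s1, r1) + spherical(h, s2, r2)

# The total sill is nugget + s1 + s2 = 1.0, consistent with unit-variance Nscores
lags = np.array([100.0, 500.0, 850.0, 5000.0, 18000.0, 30000.0])
print(gamma_nscore(lags).round(3))
```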
5.6.5. Neighborhood parameters

In addition to the variogram models, the neighborhood parameters have an important influence on the kriging results. The variogram is reproduced in different ways depending on the search radius, the number of neighbors and the size/shape of the simulation grid considered. The search radius has the effect of creating local stationarity, which is not assumed in SGS. It is also not advisable to use a fixed radius when data are irregularly distributed and the number of neighbors can change drastically between locations. It is more coherent to assume global stationarity, considering a very large search radius, and to tune the predictions through the number of neighbors.

The results can also vary depending on which simulation net is used. If simulation nets of different sizes and shapes are built, according to the purpose of the study, the conditions of relative clustering will not remain the same. In Figure 5.72 two alternative simulation nets are presented: the first net is of the "classical" type, a rectangular grid over the data; the second net is constrained to political boundaries, corresponding to the area of data collection and excluding lake surfaces. In order to analyze the effect of neighborhoods, the alternative nets and two different numbers of neighbors (20 and 40) were used for IK and the simulations.
Figure 5.72. Constrained and rectangular simulation nets for neighborhood testing. The net is shown in light gray dots and samples in dark
5.6.6. Prediction and probability maps

5.6.6.1. Probability maps with IK

After running an ordinary kriging interpolation of the indicators, the resulting predictions are given on a scale from 0 to 1, which can be interpreted as the local probability of being below or above the indicator value. In fact, the prediction map can be used without further processing as a probability map. Figure 5.73 shows two probability maps, the first using 20 neighbors and the second 40 neighbors. Subsequently, the 20 neighbor setting was used in order to compare the results with the SGS method.
Figure 5.73. Probability maps for the 200 Bq/m3 threshold using ordinary IK with 20 (left) and 40 (right) neighbors
5.6.6.2. Probability maps with SGS

100 simulations were run on the training data. Each of these simulations produced a prediction map, as represented in Figure 5.74, which shows the first 4 simulation maps after back-transformation to the original variable units. These images are possible realizations of the joint Gaussian distribution obtained with the sequential mechanism. A large number of realizations is then required to build a complete pdf for every location in the simulation net. From this pdf it is possible to calculate the probability of exceeding a certain threshold value. In Figure 5.75 the probability map of exceeding 200 Bq/m3 is shown using the rectangular and the constrained nets. The minimum number of neighbors was set to 40.
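Building the exceedance-probability map from a stack of simulated realizations is essentially a per-node frequency count, as the minimal sketch below illustrates; the array names, sizes and placeholder values are assumptions for illustration.

```python
import numpy as np

# realizations: back-transformed simulated values, shape (n_simulations, n_grid_nodes)
n_sim, n_nodes = 100, 5000
realizations = np.random.lognormal(mean=4.5, sigma=0.8, size=(n_sim, n_nodes))  # placeholder

threshold = 200.0                                    # Bq/m3
# local pdf summarized as the fraction of realizations exceeding the threshold
p_exceed = (realizations > threshold).mean(axis=0)   # one probability per grid node

print(p_exceed.min().round(2), p_exceed.max().round(2))
```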
Figure 5.74. Four simulation maps of the joint distribution using SGS with a rectangular net
Figure 5.75. Probability maps for the 200 Bq/m3 threshold using the SGS method with a rectangular net (left) and a constrained net (right)
5.6.7. Analysis and validation of results

There are different ways to test the precision and accuracy of the results. First, the relative precision of the procedures is checked by the reproduction of the model parameters; in the case of the simulations, it was verified whether the variogram and the histogram of the simulations were close to the variogram model and the sample histogram, respectively. In order to obtain a measure of accuracy it is necessary to confront the results with real independent values through validation.

5.6.7.1. Influence of the simulation net and the number of neighbors

The objective of a first test was to verify whether, as required for SGS, the proposed variogram model was reproduced after simulation when using different nets and different numbers of neighbors. For a first run, the minimum number of neighbors was fixed at 20 and the maximum at 30. Figure 5.76 shows the simulated variogram using the constrained simulation net (dashed line) and the variogram obtained using the rectangular net (continuous thin line), on top of the experimental variogram (bold line). For the comparison, the same single simulation was used. We see that a better reproduction of the long-range structures was obtained using the rectangular grid, whereas the short range was better reproduced using the constrained grid (image to the right). For the second run, the number of neighbors was increased to a minimum of 40 and a maximum of 60. The results in Figure 5.77 show a better reproduction of the variogram at the large range with both nets, and almost the same reproduction at short range as obtained with fewer neighbors. Nevertheless, while it is always possible to reproduce models with a longer range using a high number of neighbors, this carries the risk of introducing more noise than information.
Figure 5.76. Variogram reproduction using a constrained net (dashed line) and a rectangular grid (continuous thin line) for a minimum of 20 neighbors. The long range is shown to the left and the short range is shown to the right
Figure 5.77. Variogram reproduction using a constrained net (dashed line) and a rectangular grid (continuous bold line) for a minimum of 40 neighbors
According to this result, a fair reproduction of the variogram during simulation can be obtained with a minimum of 40 neighbors. It is also important to verify that these simulated variograms have fluctuations which do not exceed the model. In Figure 5.78 the ergodic fluctuations of 5 simulations are shown using a minimum of 40 neighbors and the rectangular net. To the right, the reproduction of the data histogram after back-transformation from Nscores is shown.
Figure 5.78. Comparison of variogram model and histogram (bold line) with variograms and histograms of simulations (thin lines)
5.6.7.2. Decision maps and validation of results

As mentioned earlier, a set of 400 samples was reserved to validate the results. Since we are working with threshold values and only two classes were defined for the analysis – locations that are either below or above a critical value – it is necessary to
perform a hardening of the results. Since the results of the IK and SGS methods are presented as probability maps, a probability (decision) level must be chosen to classify locations. Of course, there is no sense in selecting a 100% probability as the decision level; an evaluation task was therefore defined to make the comparison of methods and parameters possible. The task was to obtain the lowest validation error when a fixed amount of area is declared to be above the threshold. In order to fix the percentage of area, the same quantile of the map statistics is used. For example, if we want to obtain equal areas for both categories, the median of the probability is used as the decision level. In Figure 5.79, the decision map for 200 Bq/m3 using the median probability value is presented. The figures also show the locations of samples that are above the cutoff value.
Figure 5.79. Decision maps for 200 Bq/m3 thresholds using the constrained and rectangular grids. Validation data is on top (dark crosses)
As we see in Figure 5.79, it is possible to evaluate the methods by their classification error. The light zone corresponds to areas where there is a high probability of being below the cutoff, while the gray areas indicate a low probability. Independent data are classified correctly or not according to the model considered, which gives a measure of the validation error. Usually, the optimization of parameters for classification is performed by cross-validation, searching for the minimum of the total error. The total error is the sum of the omission and commission class errors. The omission error indicates how many values that effectively belong to a class were omitted during the classification. On the other side, the commission error for the same class indicates how many values that do not correspond to this class were wrongly included (false alarms).
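The omission and commission errors described above can be computed from a simple comparison of predicted and true class labels, as in the sketch below; the array names and example labels are illustrative assumptions.

```python
import numpy as np

# true / predicted binary labels for the validation samples (1 = above the cutoff)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])    # placeholder values
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])    # placeholder values

above = (y_true == 1)
# omission error: fraction of true "above" samples missed by the classification
omission = np.mean(y_pred[above] == 0)
# commission error: fraction of predicted "above" samples that are false alarms
commission = np.mean(y_true[y_pred == 1] == 0)

print(f"omission = {omission:.2f}, commission = {commission:.2f}")
```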
These two types of errors can be optimized separately depending on the goal of the classification. For example, in the case of environmental pollution such as indoor radon, it may be more important to have less uncertainty about the location of the higher values. The class of high values is in fact critical, and a lower omission error for this class is desirable. With a lower omission error, a maximum of locations that effectively exceed the threshold will be included within the corresponding class, in spite of more false alarms. For indoor radon modeling we wish to include as many measurements as possible that effectively exceed the thresholds (100, 200, 400 or 1,000 Bq/m3, depending on the cutoff analyzed), i.e. we are looking first for the lowest omission error in the exceeding class. Of secondary importance for this strategy is the error within the non-exceeding class, i.e. the commission error. This approach gives priority to the detection of polluted areas, even if they are over-estimated, and assumes a possible extra cost of sampling due to false alarms.

It is also important to note that, if one of the categories has few samples compared to the other, the corresponding class error will be masked by the total error. For instance, the sample size for the category over 400 Bq/m3 being small, a few mismatches will represent a high class error that will not be reflected in the total error, but only in the omission error. Table 5.7 shows the omission error for the validation data using different numbers of neighbors and simulation nets. For IK a single realization exists, because no new simulated values are added; therefore there is no influence of the shape of the prediction net.
Z cut                                 100    200    400    1,000
SGS rectangular grid, 40 neighbors     35     30      0        0
SGS constrained grid, 40 neighbors     36     65      0       13
SGS rectangular grid, 20 neighbors     33     35      6        0
SGS constrained grid, 20 neighbors     35     19      6        0
IK, 20 neighbors                       28     19      6        -

Table 5.7. Percentage of omission error for the SGS and IK methods using different simulation nets and neighborhood parameters
The comparison table shows that the omission errors diverge depending on the cutoff values and simulation nets. For lower cutoffs (100 and 200 Bq/m3) we can see that the SGS method (using the constrained net and 20 neighbors) and the IK method perform slightly better. For higher cutoffs (400 and 1,000 Bq/m3) the SGS method using a rectangular grid and 40 neighbors gives a lower omission error. For the IK method the cutoff value of 1,000 Bq/m3 produced highly unbalanced categories and thus the median variogram did not produce valid results. These results are coherent with the spatial variability of the data. Local variability is more evident for lower cutoffs than for higher ones, which tend to form distant spots. Thus, it is expected that the use of enlarged nets and more neighbors can better reproduce long range variability, and vice versa.

5.6.8. Conclusions

In this section an application of geostatistical methods to a data set with high local variability (indoor radon in Switzerland) was shown. Two methods were proposed that make use of transformed data and for which variogram models can be fitted (the IK and SGS methods). Variograms with a short and a long range structure were proposed. The reproduction of this variogram was used as a criterion of how good the model was for the SGS method. With both methods it was possible to obtain probability maps, on which to decide whether a critical value could be exceeded (decision maps). As heavy spatial clustering is present in the data, the neighborhood parameters appeared to be relevant for tuning. The use of a large number of neighbors and an enlarged simulation net gave a better reproduction of the variance at long ranges, while the constrained nets and a low number of neighbors helped to model the short range variogram. Validation of the results considering the omission error for independent data confirmed this behavior.

5.7. Natural hazards forecasting with support vector machines – case study: snow avalanches

Spatial mapping applied to the exploration and modeling of natural hazard phenomena is usually considered as spatial and spatiotemporal forecasting. The forecast produced is then generally used for vulnerability assessment, prevention and mitigation planning and the corresponding decision making. The production of such forecasts brings some special requirements to the spatial mapping of natural hazards. Since predictions of this kind are decision-oriented, rigorous validation procedures which deal with specific task-dependent performance measures are often
used [WIL 95]. These include measures such as the probability of event detection, the forecast success rate and various forecast skill scores. In this setting, uncertainty analysis is of particular importance; strictly speaking, a forecast without uncertainty analysis is almost useless for decision-making purposes. As well as an uncertainty analysis, a data-driven model should meet a number of special requirements, such as the ability to produce categorical and probabilistic forecasts and to provide information which can be clearly interpreted by a decision-maker.

In this section we explore the use of support vector machines (section 4.3), a machine learning approach derived from statistical learning theory. As a reminder, SVMs are designed to deal with high-dimensional data by approaching nonlinear problems in a robust and non-parametric way. A useful approach to the post-processing of SVM outputs is presented below in section 5.7.2.

Amongst the different natural hazards, events such as snow avalanches and landslides can be characterized by relatively low frequencies and complex nonlinear relationships with meteorological conditions, geomorphology and a large variety of other factors. This chapter is mainly focused on this type of hazard. A real case study on the application of SVMs, which illustrates the use of machine learning methods in this domain, is devoted to avalanche forecasting in the Lochaber region, the location of Ben Nevis, in Scotland. Firstly, temporal SVM forecasts – i.e. the problem of classifying the current avalanching conditions based on the meteorological and snowpack data and past avalanche events – are presented in section 5.7.4. Then, the extension of SVMs to the production of spatially variable forecasts within this local forecasting region is illustrated in section 5.7.5. In the spatiotemporal domain, avalanche events are even rarer and harder to classify. SVMs are well suited to solving problems of very high dimensionality, in contrast with, for example, the nearest neighbor methods which are commonly used in avalanche forecasting. It seems promising to incorporate input data containing a wide range of relevant features from a variety of sources. Such features might include data extracted from physical models, data-driven regionalized climatic maps (see sections 5.2–5.3), snowpack data and expert opinion. Some discussion of the potential use of data-driven machine learning methods such as SVMs in real-life decision support systems (such as their potential for decision support in operational avalanche forecasting) is presented in the concluding section 5.7.6.
5.7.1. Decision support systems for natural hazards

A wide range of numerical models and tools have been developed over recent decades to support the decision-making process in environmental applications, ranging from physical models and expert systems to a variety of statistically-based methods. In operational forecasting a mixture of all three approaches is often used, with process chains involving physical models and statistical or expert systems being relatively common. As model complexity has increased, so too has our ability to collect real-time, spatially distributed data describing a wide range of parameters through technological advances in sensor networks and automated environmental monitoring, and we can thus expect data-driven models to become increasingly important.

Different possible forms and interpretations of a natural hazard forecast can be considered. Firstly, in categorical forecasts, a decision boundary is constructed and used to classify the region/time as being either dangerous or not. Secondly, in probabilistic forecasts, the output of the system has to be interpreted as the probability of an event in the temporal or spatiotemporal domain of the forecast. Such forecasts can be used, for example, for risk assessment. Thirdly, a so-called descriptive forecast is often desirable, since experts wish to interpret and incorporate, for instance, a detailed list of similar events into their decision-making process. Concerning the last category, the nearest neighbor methods and their variations, commonly called "analog methods", are extensively used in a number of applications, with their probable roots in early atmospheric predictions [LOR 69], [HEI 04].

Concerning snow avalanche forecasting, different approaches have been proposed and used in operational practice. To name but a few, these are the interpretation of physical models of the development of the snowpack [BAR 02], expert systems which attempt to integrate expert knowledge [SCH 96] and nearest neighbor methods [BUS 83], [HEI 04], [PUR 03]. Nearest neighbor methods accord well with conventional inductive avalanche forecasting processes [LAC 80] and are thus relatively popular with forecasters. In machine learning, as was described in Chapter 4, this is a relatively simple pattern classification technique. Moreover, both through theoretical considerations and in forecasting practice [MCC 03], it has been noted that such methods may be prone to over-fitting when dealing with highly dimensional data.
5.7.2. Reminder on support vector machines

We will now review some aspects of SVMs which are important for constructing SVM-based decision support systems. A more detailed introduction to the theory of SVMs may be found in Chapter 4. Here we will introduce a probabilistic interpretation of the SVM decision boundary which will be useful for a reader wishing to include an SVM as a prediction engine in a decision-support system.

Given a dataset {(x1, y1), (x2, y2), …, (xn, yn)}, where xi is an m-dimensional vector describing the conditions at a given space/time moment and yi is a binary output (the occurrence of the event) associated with this input vector, SVMs construct a hyperplane in the input space which cleanly separates the binary events. It has been proven in statistical learning theory that the hyperplane which provides the maximum margin between the classes will provide the best generalization and lowest validation error. Only a small subset of the vectors xi, which lie at or near the decision boundary, are required to identify this hyperplane; these are known as the support vectors. Given that in most real-world datasets the data are noisy and some vectors may be mislabeled, SVMs seek a hyperplane which provides a trade-off between a large margin and an exact fit to the training data.

The next extension of the SVM consists of making the decision boundary nonlinear with the help of kernel functions. This corresponds to indirectly mapping the input space into a higher-dimensional space [SCH 02] and finding an optimal separating hyperplane there using quadratic programming. It leads to a nonlinear decision function in the initial feature space which takes the form of a kernel expansion,
\[ f(x, \alpha) = \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) \]
where xi is a vector describing conditions at a given space/time moment; yi is the binary event described by xi; αi is a weight constrained such that 0 ≤ αi ≤ C; and K(x,xi) is a kernel function. With the usual choice of a Gaussian radial basis function of radius σ, the SVM algorithm has two parameters: C, describing the possible range of weights, and the radius σ of the kernel function. In real-life problems, where the data are noisy or do not completely describe the events, the value of C bounds the range of possible weights and restricts the contribution of individual vectors to the decision function, thereby protecting the decision function from the danger of overfitting. Parameter C can be considered a measure of data quality with respect to the events. The value of σ describes the characteristic distance of continuity in the input space, with higher values
resulting in a smoother and more generalized form of the decision function. These two values, σ and C, are the hyper-parameters of the SVM which must be tuned to minimize misclassification by using cross-validation on either a training or a testing data subset. The decision function f(x,α) can be interpreted in terms of a categorical decision for a vector x according to a default threshold value of f(x,α), which is usually zero. We will now introduce an interpretation of SVM decisions which is helpful in providing probabilistic forecasts.

5.7.2.1. Probabilistic interpretation of SVM

Though the SVM is specifically constructed to solve the classification task, i.e. to discriminate the binary events, the outputs of SVMs can be probabilistically interpreted by post-processing. Let us start with a reminder of the notion of the classification margin. The decision function satisfies -1 < f(x,α) < 1 inside the margin, while |f(x,α)| > 1 for the "regular" samples, which are correctly classified. The samples inside the margin are the most uncertain of the whole dataset. To introduce an uncertainty measure, the values of the decision function can be transformed into probabilities. This is performed, for example, by taking a sigmoid transformation of f(x,α) [PLA 99]. The resulting transformation gives
\[ p(y=1 \mid x) = \frac{1}{1 + \exp\big(a\,f(x) + b\big)} \]
where a and b are constants. These constants are tuned by maximum likelihood (usually by minimizing the negative log-likelihood to simplify the optimization) on the testing dataset. Using only the training samples for this fit may lead to an over-fitted or biased estimate, though bootstrapping on the training data is suitable. The value of a is negative, and, if b is found to be close to zero, then the default SVM decision threshold f(x)=0 coincides with a probability level of 0.5. Empirical evidence suggests that this interpretation is most appropriate for linear SVMs and SVMs with Gaussian RBF kernels and low values of the parameter C. Generally, for nonlinear SVMs this probabilistic interpretation must be used with some caution. The major advantage of this interpretation is the possibility of introducing a decision threshold for the probabilistic outputs p(y=1|x). This threshold may later be tuned to satisfy the desired forecast quality measures.
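This sigmoid fit (known as Platt scaling) can be sketched as follows; the optimizer choice and the synthetic decision values are illustrative assumptions, and in practice the fit would use held-out or bootstrapped decision values as discussed above.

```python
import numpy as np
from scipy.optimize import minimize

# f: SVM decision values on a held-out set, y: binary labels in {0, 1} -- placeholders
f = np.array([-2.1, -0.8, -0.2, 0.1, 0.7, 1.5, 2.3])
y = np.array([0, 0, 0, 1, 1, 1, 1])

def neg_log_likelihood(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(a * f + b))            # sigmoid of the decision value
    eps = 1e-12                                     # avoid log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

res = minimize(neg_log_likelihood, x0=[-1.0, 0.0])  # a is expected to be negative
a, b = res.x
p_new = 1.0 / (1.0 + np.exp(a * 1.2 + b))           # probability for a new decision value f = 1.2
print(round(a, 2), round(b, 2), round(p_new, 2))
```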
5.7.3. Implementing an SVM for avalanche forecasting

The case study which illustrates the use of SVMs as a tool for a decision-support system concerns avalanche forecasting in Scotland. The maritime climate of Scotland is characterized by high wind speeds and rapid temperature changes. The mountains there are of relatively low elevation (the highest summit, Ben Nevis, is 1,344 m), though the northerly latitude and closeness to the sea provide conditions for regular severe snowfalls. High winds induce intense snow drifting, while the zero degree isotherm moves above and below the summits many times in an average winter.

The data used here were collected in the Lochaber region, one of five areas where avalanche forecasts are produced in Scotland. The region includes Scotland's highest mountain, Ben Nevis, and some of Scotland's most popular winter climbing venues. Avalanche forecasters are in the field on a daily basis to measure meteorological and snowpack conditions and to observe avalanche events, and the data used in the SVM decision support system are a mixture of those collected by the forecasters and those downloaded from automatic weather stations.

5.7.4. Temporal forecasts

Temporal forecasts of avalanche activity predict whether the current and past conditions suggest avalanche activity on a given day. SVMs are used here to classify the days with avalanche activity, and will be further extended to produce spatial forecasts in the next section.

The available data on avalanche activity in the region consist of two parts. The first is a series of daily measurements of meteorological conditions, including measurements of air and snow temperatures; subjective soft data on cloudiness and insolation; soft data on foot penetration, which provides some information on the state of the snowpack; measurements of wind speed and direction; binary variables such as snow drift and "raining at the elevation of 900 m"; and categorical data on the intensity of the current snowfall. There are a total of 10 variables. The second part of the dataset consists of the records of the particular avalanche events.

For the temporal forecasts, the binary output is simply an indicator of avalanche activity on the day considered. Concerning the input data, the meteorological and snowpack variables for the current and two previous days were combined to produce an input feature vector with 30 dimensions. This feature vector was further extended by asking the avalanche forecasters of the Lochaber region to list important indicators of avalanche activity.
These “expert features” included a cumulative snow index and snow drift for the previous 2 and 3 days, the gradients of air and snow temperatures, and several indicator variables including air temperature crossing 0°C, avalanche activity on the two previous days, strong south-easterly winds on the previous days, and bad visibility during the two previous days. The final feature vector included a total of 44 variables.
5.7.4.1. Feature selection
Though SVMs are well suited to solving problems in high-dimensional input spaces, it is often noted that the use of feature selection improves the results. This improvement is essentially due to filtering out redundant and noisy features, thus decreasing the dimensionality of the input space. An initial step in identifying suitable features used recursive feature elimination to filter redundant features [GUY 02]. This feature selection method iteratively omits the variables with the smallest influence on the decision surface of the SVM classifier (a minimal code sketch of this elimination loop is given after Table 5.8). The 20 features found to be the most valuable for SVM classification are listed in Table 5.8. It is interesting to note that these features were selected in a purely data-driven way, and that this choice is in good agreement with expert opinion. Furthermore, given the rapid changes of Scotland’s maritime climate, it is notable that only two features (foot penetration and wind direction) are retained from two days before the forecast day.

Current day: Snow Index; Foot Penetration; Cloudiness; Rain at 900m; Snow Index over season; Snow Drift; Snow Temperature
Previous days (-1), (-2): Air Temperature (-1); Rain at 900m (-1); Wind Speed (-1); Foot Penetration (-1); Foot Penetration (-2); Wind Direction (-2)
Expert features: Air Temperature Gradient (-1); Southerly or south-easterly Wind; Bad Visibility (-1); South-easterly Wind (-1); Cumulative Snow Index over 2 days; Cumulative Snow Drift over 2 days; Avalanche Activity (-1)

Table 5.8. The features selected by the recursive feature elimination algorithm amongst all the available features. The historic features for the previous day and the day before it are marked with (-1) and (-2) respectively
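As an indication of how such a ranking can be computed, the sketch below (Python with scikit-learn; an assumption on our part rather than the code used in this study) removes at each iteration the feature with the smallest weight in a linear SVM, following the recursive feature elimination idea of [GUY 02]:

import numpy as np
from sklearn.svm import SVC

def recursive_feature_elimination(X, y, n_keep):
    # X: (n_samples, n_features) input matrix; y: binary labels; n_keep: features to retain
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = SVC(kernel="linear", C=1.0).fit(X[:, remaining], y)
        weights = model.coef_.ravel()          # one weight per remaining feature
        worst = int(np.argmin(weights ** 2))   # smallest influence on the decision surface
        del remaining[worst]                   # eliminate it and refit
    return remaining                           # indices of the selected features

The ready-made class sklearn.feature_selection.RFE implements essentially the same loop.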
Current air temperature is not retained; however, this information is available to the system through the previous day’s air temperature and air temperature gradient. Foot penetration, whose values are unfortunately quite noisy and subjective, together with snowpack temperature, the only direct snowpack characteristic, are retained, including their values for previous days. Most of the expert features are also
retained. Southerly or south-easterly winds are perhaps particularly important, since the main climbing venues are found on north-facing slopes.
5.7.4.2. Training the SVM classifier
The data were divided into a training set of 1,123 samples (winters of 1991-2000) and a validation set of 712 samples (winters of 2001-2007). The validation data set was not available to the SVM during the training phase; it is only used to assess the quality of the predictions. The classification problem is thus formulated as the identification of a decision surface in the 20-dimensional space based on 1,123 training samples. We should be aware of the danger of over-fitting, since the dimensionality of the space is very high with respect to the relatively low amount of available data. A reasonable approach in this situation is to start by training a linear SVM. Applied to this dataset, the linear SVM provided 414 support vectors. The training error (percentage of misclassified days) was found to be 14.8%, while the validation error was 15.5%. We can conclude that the dataset is not linearly separable. At the same time, the relatively low number of support vectors (414 out of 1,123 samples, 37% of the data) suggests that the data are structured enough and that the stated classification task is reasonable. A linear decision surface may not be found due to noise and mislabelled samples.
Figure 5.80. SVM training error surface (left) and cross-validation error surface (right). The classification error is the percentage of misclassified data samples. The optimum pair of parameters is marked with a cross
The next step is to apply an SVM with a Gaussian RBF kernel. To select values for the parameters σ (kernel width) and C (trade-off between model complexity and data fitting), training and cross-validation error surfaces were calculated using an appropriate
range of values of σ and C. Figure 5.80 shows the training error surface with the minimum classification error lying at the top left of the figure (i.e. for the maximum value of C and the minimum value of σ). However, as shown by the cross-validation error surface, choosing these values of σ and C would result in over-fitting. The cross-validation error surface is generated by systematically removing one feature vector from the data set and recalculating the error surface. Values of σ and C were selected to lie roughly in the center of the central band with low errors, with σ=12 and C=25, thus minimizing the cross-validation error whilst having an acceptable training error. As discussed above, an important indicator for the SVM model is the number of support vectors. We consider here the relative number of SVs as a percentage of all training data. This is an indicator of model complexity, which reaches 100% in the case of over-fitting. Considering it along with the training error (Figure 5.81), it is noticeable that small kernel widths lead to over-fitting (a high percentage of support vectors and a low training error) and large values to over-smoothing (a low number of support vectors and a large training error).
Figure 5.81. The choice of kernel width for a fixed C=25 as a model selection problem: trade-off between fit to data and model complexity
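The parameter search just described can be reproduced with a simple grid of (σ, C) values and k-fold cross-validation. The sketch below (Python with scikit-learn; the grid bounds and number of folds are illustrative assumptions) returns the cross-validation error surface from which a well-balanced pair can be picked:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cv_error_surface(X, y, sigmas, Cs, n_folds=5):
    # returns a matrix of cross-validation errors: one row per sigma, one column per C
    errors = np.zeros((len(sigmas), len(Cs)))
    for i, sigma in enumerate(sigmas):
        gamma = 1.0 / (2.0 * sigma ** 2)      # Gaussian RBF kernel exp(-|x - x'|^2 / (2 sigma^2))
        for j, C in enumerate(Cs):
            model = SVC(kernel="rbf", gamma=gamma, C=C)
            accuracy = cross_val_score(model, X, y, cv=n_folds).mean()
            errors[i, j] = 1.0 - accuracy     # classification error
    return errors

# an illustrative log-spaced grid around the values retained in the text (sigma = 12, C = 25)
sigmas = np.logspace(-1, 3, 20)
Cs = np.logspace(-1, 3, 20)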
5.7.4.3. Adapting SVM forecasts for decision support
The percentage of misclassified samples is the simplest performance measure. While it was used above to find appropriate parameters of the model, we are now interested in more specific measures which fit the needs of decision-making. Given the binary classification task, there are four types of predictions, which we outline in Table 5.9. Specific measures can then be produced using these four basic values: hits, misses, false alarms and correct negatives.
                      Forecast
Observed          Yes (+1)         No (-1)
Yes (+1)          Hits             Misses
No (-1)           False Alarms     Correct Negatives

Table 5.9. Basic measures for binary categorical forecasts (contingency table or confusion matrix)
The probabilistic outputs of the developed SVM classifier were obtained as described in section 5.7.2.1, using the bootstrap on the training dataset. The decision threshold (probability level) can now be fine-tuned. We will first investigate the influence of different threshold values on the performance measures of the system using the validation dataset of 712 samples. Let us consider how the basic measures change with the varying threshold, taking three fixed threshold values of 0.25, 0.5 and 0.75 as examples.

                  Forecast at the           Forecast at the           Forecast at the
                  threshold of 0.25         threshold of 0.5          threshold of 0.75
Observed          Yes        No             Yes        No             Yes        No
Yes               164        14             131        47             61         117
No                139        395            52         482            19         515

Table 5.10. Joint distribution of forecasts and observations for binary categorical forecasts for several decision thresholds: 0.25, the default threshold of 0.5 and 0.75
Table 5.10 shows the joint distribution of forecasts and observations for binary categorical forecasts for the default threshold value of 0.5 and two other threshold values. When a low threshold (0.25) is selected, more avalanches are correctly forecast (164) at the cost of many more false alarms (139). Equally, when a higher threshold (0.75) is used many more misses occur (117) though the number of correct negatives also increases (515). These results confirm that a sensible threshold value lies, for these data, around a value of 0.5. These considerations can be further extended by the analysis of the forecast accuracy and skill measures. A range of conventional forecast verification measures is presented in Table 5.11. These are the probability of detection (PoD), success rate, hit rate, and two forecast skill scores. Note that both the basic values (hits, misses and so on) and the derived measures can be named differently in different scientific domains. For
example, in information retrieval PoD is often called “recall”, and one of the basic instruments is the analysis of the precision-recall curve. In the previous section on radon risk mapping, the misses were called omission errors.

Forecast accuracy measures
POD – Probability of detection: the probability that the event was forecast when it occurred; POD = Hits/(Hits + Misses)
SR – Success rate: the probability that the event occurred when it was forecast; SR = Hits/(Hits + False Alarms)
HR – Hit rate: the proportion of correct forecasts; HR = (Hits + Correct Negatives)/(Total Number of Days)

Forecast skill measures
HSS – Heidke skill score, based on the hit rate; HSS = (Hits + Correct Negatives - Chance)/(Total - Chance), where Chance is the expected number of correct forecasts due to chance
KSS – Kuipers skill score; KSS = (Hits × Correct Negatives - Misses × False Alarms)/((Hits + Misses)(False Alarms + Correct Negatives)) (like HSS, but the marginal distribution of the reference forecasts equals the base rate)

Table 5.11. Forecast verification measures [DOS 90; WIL 95]
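The measures of Table 5.11 follow directly from the four counts of Table 5.9. A possible implementation (Python sketch; it assumes that forecasts are available as probabilities and observations as 0/1 indicators, and the formula used for Chance is one common estimate) is:

import numpy as np

def contingency_counts(prob, observed, threshold=0.5):
    forecast = prob >= threshold
    observed = observed.astype(bool)
    hits = np.sum(forecast & observed)
    misses = np.sum(~forecast & observed)
    false_alarms = np.sum(forecast & ~observed)
    correct_negatives = np.sum(~forecast & ~observed)
    return hits, misses, false_alarms, correct_negatives

def verification_measures(hits, misses, false_alarms, correct_negatives):
    total = hits + misses + false_alarms + correct_negatives
    pod = hits / (hits + misses)                 # probability of detection
    sr = hits / (hits + false_alarms)            # success rate
    hr = (hits + correct_negatives) / total      # hit rate
    # expected number of correct forecasts due to chance (one common estimate)
    chance = ((hits + misses) * (hits + false_alarms)
              + (correct_negatives + misses) * (correct_negatives + false_alarms)) / total
    hss = (hits + correct_negatives - chance) / (total - chance)
    kss = (hits * correct_negatives - misses * false_alarms) / \
          ((hits + misses) * (false_alarms + correct_negatives))
    return {"POD": pod, "SR": sr, "HR": hr, "HSS": hss, "KSS": kss}

measures = verification_measures(131, 47, 52, 482)   # the threshold-0.5 column of Table 5.10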
The sensitivity of these measures to threshold values between 0 and 1 is shown in Figure 5.82. In choosing a threshold for categorical forecasts, a decision must be made about the acceptance of different forms of forecast error. For example, low threshold values maximize the probability of detection (i.e. the chances of missing an avalanche event are minimized), whilst leading to an increased number of false alarms. Figure 5.82 (left) shows that a reasonable compromise between PoD and hit rate lies somewhere between values of around 0.4 and 0.6. In Figure 5.82 (right), skill scores which attempt to describe the ability of a technique to forecast better than random chance are shown. Here, the Heidke skill score once again suggests that an ideal threshold value lies between about 0.4 and 0.6, whilst the Kuipers skill score suggests slightly lower threshold values. As explained above, it is also possible to interpret the output of SVMs probabilistically. To evaluate the quality of this output, the empirical probability of an event for a given range of output values has to be calculated and compared to the forecast probability. In general, we would like to make sure that the forecast probabilities agree well with the empirical probability of events, especially for cases with higher values.
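One simple way to carry out this comparison is to bin the forecast probabilities and, within each bin, compare the mean forecast probability with the observed event frequency (a reliability diagram). A minimal sketch, with arbitrarily chosen bin edges:

import numpy as np

def reliability_table(prob, observed, bins=np.linspace(0.0, 1.0, 11)):
    rows = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (prob >= lo) & (prob < hi)
        if np.any(in_bin):
            rows.append((prob[in_bin].mean(),       # mean forecast probability in the bin
                         observed[in_bin].mean(),   # empirical event frequency
                         int(in_bin.sum())))        # number of forecasts in the bin
    return rows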
Figure 5.82. Forecast accuracy and forecast skill measures. X axis corresponds to the SVM decision threshold
Here, in Figure 5.83, we show the SVM predictions for a single winter, 2003-2004, from the validation data set. The days with registered avalanche events are also marked in the figure. Qualitatively, a good agreement between the events and the forecast periods of high avalanche probability can be seen for this time period.
Figure 5.83. The prediction of the SVM for the validation data of winter 2003-2004. The observed events are plotted as black boxes (or continuous series of black boxes) at the 0 (no events) and 1 (avalanche activity) levels. The probabilistic output of the SVM is plotted as a continuous curve. The x axis corresponds to time in days
Finally, for the descriptive forecasts, the particular support vectors which contributed most to the given decision could be examined. The list of the k-nearest support vectors (which are the reference events with similar conditions in the past) may provide insightful information to the forecaster. However, these issues still have to be studied in more detail to be used in real life applications.
5.7.5. Extending the SVM to spatial avalanche predictions
Practical approaches to spatial avalanche forecasting depend on the scale of the forecasting area. For the relatively local scale of a regional forecast, and for forecasts for particular slopes and skiing/climbing venues, the importance of precise spatialization increases. Usually, descriptive textual forecasts are prepared by human experts. They highlight the currently important factors of avalanche activity, such as particularly dangerous slope orientations and altitudes. Automatic avalanche forecasting at local scales is an open research problem. Meteorological and snowpack conditions vary significantly over the region, and this snowpack variability is the main difficulty of the approach. The complex nonlinear relationships between snowpack formation in space and avalanche activity are thus even harder to model. As SVMs are well suited to high-dimensional inputs, it is relatively straightforward to add some level of spatial forecasting. The following presentation contains early results intended to illustrate how SVMs can be used in spatial avalanche forecasting. More work is needed to confirm the validity of the results.
5.7.5.1. Data preparation
In the case of Lochaber, information about some 700 avalanche events for 47 individual avalanche paths (Figure 5.84) was available. Ideally, each case would be characterized by its exact location, the aspect and slope of the avalanche release zone, the size of the event and some other notes on the avalanche debris, release type and related observations. In reality, this information is often hard to collect, so only the location and altitude are known for all the registered cases. A digital elevation model (DEM) of the region (Figure 5.84) was used to include the altitude and to calculate the other spatial inputs of the model, such as aspect and slope. If the slope and aspect of the avalanche release zone were unknown, they were approximated using the DEM. The DEM was also used in the spatialization of weather parameters. For every day of the observations, a feature vector describing the meteorological and snowpack parameters measured at a single distinct location was available. An important extension consists of the spatialization of these data over the forecasting region. This can be performed using physical models, heuristics or data-driven approaches (sections 5.2–5.3) and is a matter of profound independent research. As an example of these results, Figure 5.85 presents the spatialization of wind speed and wind direction obtained with a simple linear heuristic model [LIS 07].
Figure 5.84. DEM of the Lochaber region. The locations of the usual avalanche paths are marked with circles
Figure 5.85. An example of interpolated wind speed, used as an input feature for producing spatially variable forecasts with SVMs
The temperature was spatialized using the temperature-elevation gradient. Some of the data were treated as constants over the region for various reasons; for example, the indicator variable of rain at 900 m and the cloudiness were not spatialized.
The SVM can be used to generate a spatial avalanche forecast, extrapolated over the region using a DEM, based on the enhanced feature vectors. The altitude, coordinates, aspect and gradient of each avalanche path were added. This results in a much larger number of feature vectors with the same total number of avalanche events. While it was relatively straightforward to put the registered avalanche events into a dataset as a class representing avalanche events, it is much harder to describe the “safe” conditions. This is required to formulate a binary classification problem. The following approach was used in this study: for the days when no avalanche events were observed, the “safe” samples were constructed by combining the spatial features of all the avalanche paths and the current meteorological features. The intuition behind this was to provide the boundary and most discriminative samples to the system: those which are “safe” but still closest to turning into dangerous ones given changing weather conditions. The structure of the resulting dataset is presented in Figure 5.86.
Figure 5.86. The structure of the dataset for spatially variable avalanche danger prediction with SVMs
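The construction of this dataset can be sketched as follows (Python with pandas; the data frames and column names are hypothetical and serve only to illustrate the pairing of daily weather vectors with the spatial descriptors of the 47 paths):

import pandas as pd

def build_spatial_dataset(daily_weather, paths, events):
    # daily_weather: one row per day (meteorological/snowpack features), indexed by date
    # paths: one row per avalanche path (coordinates, altitude, aspect, slope), indexed by path id
    # events: rows (date, path_id) of the observed avalanches
    samples = []
    for date, weather in daily_weather.iterrows():
        day_events = set(events.loc[events["date"] == date, "path_id"])
        if day_events:
            # avalanche day: one positive sample per path that released
            selected, label = paths.loc[paths.index.isin(day_events)], 1
        else:
            # "safe" day: combine the current weather with the spatial features of every path
            selected, label = paths, 0
        for path_id, spatial in selected.iterrows():
            samples.append({**weather.to_dict(), **spatial.to_dict(), "label": label})
    return pd.DataFrame(samples)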
5.7.5.2. Spatial avalanche forecasting
The resulting problem is a binary classification which is quite unbalanced, in the sense that the number of samples in one class (avalanche events) is much smaller than the number of samples in the other (“safe”). Solutions have been proposed for this atypical situation, resulting in a modified setting of the SVM parameters [LIN 00].
Figure 5.87. The output of the spatiotemporal SVM model, indicating the probability of avalanching on 20.01.1991. The actual observed events are shown with circles
Figure 5.88. The sample output of the spatiotemporal SVM model for a sub-region, produced using the spatialized meteorological inputs. The variability of the prediction is considerably higher. The actual observed events are shown with circles
We present in Figure 5.87 an example of preliminary prediction mapping for 20th January 1991, obtained by using all the data from 1991-2001 excluding that day. The prediction presented in Figure 5.87 did not use the spatialized meteorological inputs. Compared to the results in Figure 5.88, where these inputs were used, it is less variable and smoother. At first glance, the obtained forecast appears to agree well with the locations of the observed avalanches for the day. However, rigorous validation procedures have to be defined to assess the quality of this type of prediction. The same questions of categorical, probabilistic and descriptive interpretation of the forecasts have to be discussed in a dialog with forecasters before such prediction systems can be considered for operational use.
5.7.6. Conclusions
Statistically-based methods are becoming more and more important in environmental decision support systems due to the growing amount of available data. The case study presented here showed how machine learning methods, and particularly the SVM, can be used for such problems. Several issues were elaborated to link the SVM with the practical requirements of decision support systems, such as the interpretation and validation of the forecasts. Decision support in avalanche forecasting is a complex process involving the assimilation of multiple data sources to make predictions over varying spatial and temporal resolutions. Numerically-assisted forecasting often uses nearest neighbor methods, which are known to have limitations in dealing with high-dimensional data. Here we demonstrated the application of SVMs as a prediction engine for decision support in avalanche forecasting in Lochaber, Scotland, which revealed promising perspectives.
5.8. Conclusion
In this chapter a variety of case studies were presented using different machine learning algorithms: multilayer perceptrons, support vector machines, general regression neural networks, probabilistic neural networks and self-organizing maps. Both simple and nontrivial applications were demonstrated. The results were compared with geostatistical models. Geostatistical case studies, especially those applying advanced models such as indicator kriging and conditional stochastic simulations, were carried out using real data.
In general, MLA can be considered as important data-driven models for the analysis and modeling of spatial environmental data. They produce comparable results with geostatistical models but without relying on variography. In real complex problems when the input space composed of many geo-features is highly dimensional they are indispensable. Nevertheless, these modeling approaches need the tools which can control the quality of the analysis and predictions. For spatial environmental data analysis some geostatistical tools, especially variography, can be proposed. The machine learning methods can be used as prediction engines in environmental support systems, including the decision support in natural hazards. Some approaches to the adaptation of machine learning for the latter tasks were presented. 5.9. References [AHA 97] AHA D.W. (1997) Editorial, Artificial Intelligence Review, 11(1-5), Special Issue on Lazy Learning, p. 1-6, 1997. [ATK 95] ATKESON C.G., MOORE A.W. and SCHAAL S., “Locally Weighted Learning”, Artificial Intelligence Review, 11(1-5), p. 11-73, 1995. [ATT 07] ATTORE F., ALFO M., SANCTIS M., FRANCESCONI F., and BRUNO F., “Comparison of interpolation methods for mapping climatic and bioclimatic variables at regional scale”, Int. J. of Climatology, vol. 27, p. 1825-1843, 2007. [BAR 02] BARLETT, P. and LEHNING, M., “A physical SNOWPACK model for the Swiss avalanche warning. Part I: Numerical model”, Cold Regions Science, 35, 3, 123-145, 2002. [BIS 07] BISHOP C. M., Pattern Recognition and Machine Learning, Springer, 2007. [BRA 00] BRABEC B. and MEISTER R., “A nearest neighbour model for regional avalanche forecasting”, Annals of Glaciology, 32, 130-134, 2000. [BRY 01] BRYAN B.A. and ADAMS J.M., “Quantitative and qualitative assessment of the accuracy of neurointerpolated annual mean precipitation and temperature surfaces for China”, Cartography, 30(2), p.1-14, 2001. [BRY 02] BRYAN B.A. and ADAMS J.M., “Three-dimensional neurointerpolation of annual mean precipitation and temperature surfaces for China”, Geographical Analysis 34(2), p. 94-111, 2002. [BUS 83] BUSER O., “Avalanche forecast with the method of nearest neighbors: an interactive approach”, Cold Regions Science and Technology, 8(2), 155-163, 1983. [BUY 06] BUYTAERT W., CELLERI R., WILLEMS P., DE BIÈVRE B. and WYSEURE G., “Spatial and temporal rainfall variability in mountainous areas: A case study from the south Ecuadorian Andes”, Journal of Hydrology, vol. 329, p. 413-421, 2006.
[CHI 99] CHILES J.-P., DELFINER P., Geostatistics, Modelling Spatial Uncertainty, Wiley series in probability and statistics, John Wiley & Sons, 1999. [DAV 99] DAVIS R.E., ELDER K., HOWLETT D. and BOUZAGLOU E., “Relating storm and weather factors to dry slab avalanche activity at Alta, Utah, and Mammoth Mountain, California, using classification and regression trees”, Cold Regions Science and Technology, 30, 1, 79-89, 1999. [DEM 03] DEMYANOV V., KANEVSKI M., CHERNOV S., SAVELIEVA E. and TIMONIN V., “Neural network residual kriging application for climatic data”, in [DUB 2003], 2003. [DEU 97] DEUTSCH C. V. and JOURNEL A. G., GSLIB: Geostatistical Software Library and User’s Guide, Oxford University Press, 1997. [DOB 07] DOBESCH H., DUMOLARD P., and DYRAS I (eds.), Spatial Interpolation for Climate Data: The Use of GIS in Climatology and Meteorology, (Geographical Information Systems series), ISTE, 2007. [DOS 90] DOSWELL C. DAIES-JONES R. and KELLER D.L., “On summary measures of skill in rare event forecasting based on contingency tables”, Weather and Forecasting, 5, 576-585, 1990. [DUB 03] DUBOIS G., MALCZEWSKI J., and DE CORT M. (eds.), Mapping radioactivity in the environment, Spatial Interpolation Comparison 97, European Commission, JRC Ispra, EUR 20667, 2003. [DUB 05] DUBOIS G. (ed.), Automatic mapping algorithms for routine and emergency data, European Commission, JRC Ispra, EUR 21595, 2005. [FAN 97] FAN J. and GIJBELS I., “Local Polynomial Modelling and its Applications”, Monographs on Statistics and Applied Probability, 66, London, Chapman & Hall, 1997. [GOO 97] GOOVAERST P., Geostatistics for Natural Resources Evaluation, Oxford University Press, 1997. [GOO 00] GOOVAERTS P., “Geostatistical approaches for incorporating elevation into the spatial interpolation of rainfall”, Journal of Hydrology, vol. 228, p. 113-129, 2000. [GUY 02] GUYON I., WESTON J., BARNHILL S. and VAPNIK V., “Gene Selection for Cancer Classification using Support Vector Machines”, J. Machine Learning, 46(1-3), 389-422, 2002. [HAR 89] HARDLE W., Applied Nonparametric Regression, Cambridge University Press, Cambridge, 1989. [HAY 98] HAYKIN S., Neural Networks: A Comprehensive Foundation, Prentice Hall, 1998. [HEI 04] HEIERLI J., PURVES R.S., FELBER A. and KOWALSKI J., “Verification of nearest-neighbors interpretations in avalanche forecasting”, Annals of Glaciology, 38, 1, 84-88, 2004. [HIG 03] HIGGINS N.A. and JONES J.A., Methods for Interpreting Monitoring Data Following an Accident in Wet Conditions, National Radiological Protection Board, Chilton, Didcot, 2003.
[ISA 90] ISAAKS E.H. and SRIVASTAVA R.M., An Introduction to Applied Geostatistics, Oxford University Press, 1990. [KAL 07] KALTEH A, HJORTH P. and BERNDTSSON R., “Review of the self-organising map SOM) approach in water resources: Analysis, modelling and application”, Environmental Modelling and Software, vol. 2007, p. 1-11. [KAN 96] KANEVSKI M., ARUTYUNYAN R., BOLSHOV L., DEMYANOV V. and MAIGNAN M. “Artificial neural networks and spatial estimations of Chernobyl fallout”. Geoinformatics, vol. 7, no. 1-2, p. 5-11, 1996. [KAN 97a] KANEVSKI M, DEMYANOV V. and MAIGNAN M., “Spatial estimations and simulations of environmental data by using geostatistics and artificial neural networks”, Proc. of the Annual Conf. of the International Association for Mathematical Geology (IAMG), 1997. [KAN 97b] KANEVSKI M., MAIGNAN M. and DEMYANOV V., “How neural network 2-D interpolations can improve spatial data analysis: neural network residual kriging (NNRK)”, Proc. of the Annual Conf. of the International Association for Mathematical Geology (IAMG), 1997. [KAN 99] KANEVSKI M., “Spatial Predictions of Soil Contamination Using General Regression Neural Networks”, Systems Research and Information Systems, vol. 8, no. 4, p. 241-256, 1999. [KAN 04] KANEVSKI M. and MAIGNAN M., Analysis and Modelling of Spatial Environmental Data, EPFL Press, 2004. [KAN 05] KANEVSKI M. and MAIGNAN M., “Analysis and modelling of indoor radon data in Switzerland: geostatistical approach and machine learning algorithms”, International Workshop of Radon Data: valorisation, analysis and mapping, Lausanne, 4-5 March 2005, Switzerland. [KAN 06] KANEVSKI M., MAIGNAN M. and TAPIA R., “Indoor radon risk mapping sing geostatistical simulations”, in E. Pirard, A. Dassargues and H.B. Havenish (eds.), XI International Congress for Mathematical Geology, Quantitative Geology from Multiple Sources, International Association for Mathematical Geology, September 2006. [KOH 00] KOHONEN T., Self-Organising Maps, Springer, 2000. [LAC 80] LaCHAPELLE E.R., “The fundamental processes in conventional avalanche forecasting”, Journal of Glaciology, 26(94), 75-84, 1980. [LEC 96] LECUN Y., BOTTOU L., ORR G.B., and MÜLLER K-R., Efficient BackProp, Lecture Notes in Computer Science, 1996. [LIN 00] LIN Y., LEE Y. and WAHBA G., Support Vector Machines for classification in nonstandard situations, Technical Report 1016, Department of Statistics, University of Wisconsin, Madison, 2000. [LLO 06] LLOYD C. D., Local Models for Spatial Analysis, CRC Press, 2006. [LOR 69] LORENZ E.N., “Atmospheric predictability as revealed by naturally occurring analogues”, Journal of the Atmospheric Sciences, 26, 636-646, 1969.
[MAL 05] MALARDEL S., Fondamentaux de Météorologie, à l’école du temps, Cépaduès Editions, 2005. [MCC 93] McCLUNG D. and SCHEARER P, The Avalanche Handbook, The Mountaineers, Seattle, Washington, 1993. [MCC 03] McCOLLISTER C., BIRKELAND K., HANSEN K., ASPINALL R., and COMEY R., “Exploring multi-scale spatial patterns in historical avalanche data, Jackson Hole Mountain Resort, Wyoming”, Cold Regions Science and Technology, 37, 3, 299-313, 2003. [NAD 64] NADARAYA E., “On estimating regression”, Theory of Probability and its Applications, vol. 9, p. 141-142, 1964. [OBL 80] OBLED C. and GOOD W., “Recent developments of avalanche forecasting by discriminant analysis techniques: a methodological review and some applications to the Parsenn area (Davos, Switzerland)”, Journal of Glaciology, 25, 92, 315-346, 1980. [PAR 62] PARZEN D., “On estimation of a probability density function and mode”, Annals of Mathematical Statistics, vol. 33, 1962, p. 1065-1076. [PAR 03] PARKIN R. and KANEVSKI M., “ANNEX Model: Artificial Neural Networks with External Drift Environmental Data Mapping”, StatGIS Conference 2003, Klagenfurt. 11 p., to be published in Pilz J (ed.) Interfacing Geostatistics and GIS, Springer, 2008. [PLA 99] PLATT J., “Probabilistic outputs for support vector machines and comparison to regularized likelihood methods”, in Advances in Large Margin Classiers, A.J. Smola, P. Bartlett, B. Scholkopf, D. Schuurmans, (eds.), MIT Press, Cambridge, MA, 1999. [PUR 03] PURVES R.S., MORRISSON K.W., MOSS G. and WRIGHT D.S.B., Nearest neighbors for avalanche forecasting in Scotland – development, verification and optimization of a model, Cold Regions Science and Technology, 37, 343-355, 2003. [RIG 01] RIGOL J., JARVIS C., and STUART N., “Artificial neural networks as a tool for spatial interpolation”, Int. J. Geographical Information Science, vol. 15, no. 4, p. 323-343, 2001. [ROS 56] ROSENBLAT M., “Remarks on some nonparametric estimates of a density function”, Annals of Mathematical Statistics, vol. 27, p. 832-837, 1956. [ROS 70] ROSENBLAT M., “Density estimates and Markov sequences”, in M. Puri (ed.), Nonparametric Techniques in Statistical Inference, p. 199-213, Cambridge University Press, London, 1970. [SCH 96] SCHWEIZER J. and FOHN P.M.B., “Avalanche forecasting – an expert system approach”, Journal of Glaciology, 42, 141, 318-332, 1996. [SCH 02] SCHOLKOPF B. and SMOLA A.A., Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, MIT Press, Cambridge, MA, 2002. [SOM 08] SOM bibliography, http://neuron-ai.tuke.sk/NCS/VOL1/P4_html/node35.html (last accessed January 2008). [SPE 90] SPECHT D., “Probabilistic Neural Networks”, IEEE Transactions on Neural Networks, vol. 3, p. 109-118, 1990.
[SPE 91] SPECHT D., “A General Regression Neural Networks”, IEEE Transactions on Neural Networks, vol. 2, p. 568-576, 1991. [TAP 06] TAPIA R., KANEVSKI M., MAIGNAN M. and GRUSON M., “Comprehensive multivariate analysis of indoor radon data in Switzerland”, in I. Barnet, M. Neznal and P. Pacherova (eds.), Radon Investigations in the Czech Republic XI and the 8th International Workshop on the Geological Aspects of Radon Risk Mapping, p. 228–238, Czech Geological Survey, Radon v.o.s., Joint Research Center IES REM Ispra, 2006. [TAP 07] TAPIA R., KANEVSKI M. and GRUSON M., “Geostatistical uncertainty quantification for indoor radon risk mapping”, Proceedings of the European Colloquium of Theoretical and Quantitative Geography, p. 379–383, September 2007. [VAP 95] VAPNIK V., The Nature of Statistical Learning Theory, Springer-Verlag, Berlin, 1995. [WAT 64] WATSON G., “Smooth regression analysis”, Sankhya: The Indian Journal of Statistics, Series A, vol. 26, p. 359-372, 1964. [WIL 05] WILKS D.S., Statistical Methods in the Atmospheric Sciences, Academic Press, 2005.
Chapter 6
Bayesian Maximum Entropy – BME
6.1. Conceptual framework A natural system (physical, biological, social or cultural) involves a number of interacting attributes (environmental contaminants, soil properties, hydrologic parameters, atmospheric variables, land use, human exposure indicators, disease incidence, mortality, poverty level, willingness to pay, commodity price, etc.) and the associated knowledge bases (intra- and inter-disciplinary). In this context, the system attributes manifest the composite space-time organization of the system. The realistic representation of such systems and the rigorous quantitative analysis of their attributes is a crucial part of Man’s effort to understand nature, use its valuable resources and avoid its varying hazards. Quantitative data analysis and system modeling, in their various forms, play an important role in this effort. In particular, spatiotemporal data analysis and modeling of natural systems in a modern statistical framework was introduced in [CHR 90a, CHR 91a and b, CHR 92]. Subsequent works include [GOO 94, HAA 95, BOG 96, CHR 98a and KYR 99]. More recent research efforts include [SER 03b, MAC 03, KOL 02, KOL 04, POR 06, YU 07a, b and c]; see the relevant literature for a detailed list of publications on the spatiotemporal statistics and geostatistics subject. Given its considerable importance, several developments in spatiotemporal modeling have taken place over the last two decades. Among them, the Bayesian Maximum Entropy (BME) conceptual framework and quantitative techniques of
Chapter written by G. CHRISTAKOS.
modeling and mapping in a composite space-time context are based on a synthesis that involves [CHR 90a and b; CHR 91b; CHR 92; CHR 00a]: (a) a methodology that is the fusion of ideas and functions from brain and behavioral sciences; (b) a stochastic theory providing adequate representation of spatiotemporal dependence and multi-sourced uncertainty; (c) a technology that offers the necessary means to integrate and visualize the conceptual, theoretical and methodological results of items a and b above. BME is a creation of the epistematics mode of thinking. Epistematics [CHR 08] introduces the conceptual means to integrate the various mental entities of spatiotemporal analysis (theories, techniques, knowledge sources and thinking modes). It involves models of the processes (perceptual, intellectual and linguistic) through which knowledge and understanding are achieved and communicated. These models include both mental constructions of the actual system and the general conditions of the mind that enable scientists to develop different disciplines that operate in an autonomous way but allow disciplinary integration to obtain a realistic world view. BME is used in the study of natural systems and attributes that are characterized by space-time dependence and multi-sourced uncertainty. In particular, BME distinguishes between two major knowledge bases (KB): – the general or core KB (denoted by G) that may include: physical laws, scientific theories, biological models, mechanistic relations, ecological systems, social structures and population dynamics that are relevant to the natural system under investigation; logical rules and reasoning principles of the human agent; as well as theoretical space-time dependence models (ordinary or generalized covariance, variogram and structure functions) previously known to adequately describe the general spatiotemporal characteristics of a wide range of natural systems (e.g., a physical equation may lead to the corresponding covariance equation, which can be solved yielding a sound dependence model); – the site-specific or specificatory KB (denoted by S), which includes different sources associated with the particular situation, such as: hard measurements characterized by a satisfactory level of accuracy and expressed as numerical attribute values across space-time; and soft data that include a significant amount of uncertainty (secondary sources, imperfect observations, categorical data, and fuzzy inputs). Thus, a G-KB comprises more of the intellectual side of research and education, with more emphasis on the cognitive and theoretical; whereas an S-KB comprises the more experiential, subjective, intuitively apprehended side. A main BME objective is to integrate various forms of core and site-specific KB in order to
generate informative maps and meaningful statements. While some KB include rigorous quantitative assessments, many others are about multi-sourced beliefs concerning the situation of interest. In many cases, the KB do not refer directly to the beliefs but rather to the sentences that people use to state them. There is a certain amount of hazard in this, which is yet another good reason to involve stochastic concepts and probabilistic techniques. The development of the BME theory, in fact, involves state-of-the-art stochastics. In real-world applications, uncertainty is a major factor expressed either in terms of statistical probabilities (related to ontic features of the actual system) or by means of inductive probabilities (representing agent-dependent considerations). The interpretation of space-time metrics (or distances) may account for the physical or social conditions of the situation. In addition, it is often the observation scale that determines the representation of a phenomenon. Stochastics is a considerable advancement over mainstream statistics, since it does not suffer from the limitations of the latter such as: the use of questionable independence assumptions (attributes are modeled as independently distributed random variables); the logical problems entailed by statistical tests that are often irrelevant to the objectives of a real-world study (e.g., a statistical test states the probability of the observed outcome given that a zero hypothesis is true, whereas a scientific investigation often seeks the probability that the zero hypothesis is true given the observed outcome); the inadequate consideration of the physics of spacetime by spatial statistics; and the lack of rigorous mechanisms by spatial econometrics to incorporate various important types of inter- and intra-discipline knowledge sources. Figure 6.1 provides a schematic conceptual representation of the main stages of the BME approach in a composite space-time domain (herein, the vector s denotes spatial location and the scalar t denotes time):
{ G → f_G ;  S → ξ_S }  →  f_K(s, t)

Figure 6.1. Schematic representation of the BME framework
– the G-KB and the S-KB refer to the general and the specific KB described above, which should be expressed in quantitative forms that are adequate for BME purposes;
– the probability density function (pdf) f_G and the operator ξ_S are derived so that they express mathematically the knowledge provided by G and S, respectively. Some examples are given in Tables 6.6 and 6.7 later in this chapter;
– the f_K is a pdf model that updates the previous model, f_G, in view of the S-KB; i.e., f_K is a space-time dependent model that accounts for the total KB, K = G ∪ S.
Naturally, the BME developments have followed a variety of paths (theoretical and applied) over the years, depending on the scientific discipline considered. The BME theory is continually being developed within wider conceptual frameworks, and the BME techniques are translated into computer software libraries and successfully applied in real-world studies within a variety of disciplines (physical, health and social). This will be discussed further in the following sections.
6.2. Technical review of BME
Below we provide a brief yet substantive review of the main technical elements of the BME approach. For more details, the reader is referred to the relevant literature.
6.2.1. The spatiotemporal continuum
The BME study of attributes varying across space and time requires the introduction of the notion of a spatiotemporal continuum E, i.e., a set of points associated with a continuous spatial arrangement of attribute values combined with their temporal order. Spatiotemporal continuity implies an integration of space with time and is a fundamental property of the mathematical formalism of natural phenomena. The space-time coordinates and metric used in the context of E can play an important role in BME modeling, mapping and problem-solving, in general. A coordinate system is a systematic way of referring to places, times, things and events. A point in a spatiotemporal domain E can be identified by means of two separate entities: the spatial coordinates s = (s_1, ..., s_n) ∈ S ⊆ R^n and the temporal coordinate t along the temporal axis T ⊆ R^1_{+,0}, so that the combined space-time coordinates are denoted as follows

p = (s, t)     [6.1]
A real-world attribute is distributed in the space-time domain defined by the p coordinates. Equation [6.1] suggests several ways to “locate” a point in a space-time domain. Essentially, the only constraint on the coordinate system implied by equation [6.1] is that it possesses n independent quantities available for denoting spatial position and one quantity for denoting a time instant. A classification of the spatial coordinate systems ( s ) can be made in terms of the following two major groups: – Euclidean coordinate systems, including systems for which there exists a transformation to rectangular (Cartesian) coordinates;
– non-Euclidean coordinate systems, including systems for which it is not possible to perform a transformation to Cartesian coordinates.
In the two-dimensional case (n = 2), for example, the Euclidean group is associated with a flat, Euclidean geometry (coordinate systems on a Euclidean plane can be transformed into a rectangular system), whereas the non-Euclidean group is associated with a curved, non-Euclidean geometry (rectangular coordinates do not exist on a non-Euclidean surface). Table 6.1 gives a summary of some commonly used Euclidean coordinate systems that belong to the group of orthogonal curvilinear coordinate systems (the polar, cylindrical and spherical systems can all be transformed into a rectangular system). Among the basic non-Euclidean coordinate systems are the Gaussian coordinate system and the Riemannian coordinate system. Other coordinate systems are mentioned in Table 6.2 below [CHR 02c]. These are systems of coordinates with certain particular physical properties (e.g., cyclidic coordinates are such that the Laplace equation is separable; toroidal coordinates are such that the equation of a magnetic-field line is that of a straight line in these coordinates).

Rectangular    Polar    Cylindrical    Spherical
s1             r        r              ρ
s2             θ        θ              φ
s3             –        s3             θ

Table 6.1. Common Euclidean coordinate systems
Barycentric     Chow        Ellipsoidal    Hamada          Quadriplanar
Bipolar         Conical     Glebsch        Orthocentric    Toroidal
Bispherical     Cyclidic    Grassman       Parabolic       Trilinear

Table 6.2. A partial list of coordinate systems with useful physical properties [CHR 02c]
A space-time metric, |Δp| = |p − p′|, is a mathematical expression that defines spatiotemporal distance. For an attribute that occurs in a natural continuum, this expression depends on two factors [CHR 00d]:
– a “relative” factor – the particular coordinate system;
– an “absolute” factor – the nature of the continuum E imposed by physical constraints (geometry of the space, physical laws, and internal structure of the medium within which an attribute takes place).
In addition, the permissibility of a space-time dependence function (e.g., covariance) is affected by the metric (or norm) that determines space-time distance in several dimensions; see section 6.3 below. In general, in BME modeling we distinguish between separable and non-separable metrics.
6.2.2. Separable metric structures
These metrics treat the concept of distance in space and time separately. The separable metric structure includes a spatial distance |s − s′| = |h| and an independent time lag |t − t′| = τ, so that

|Δp| = (|h|, τ).     [6.2]
Accordingly, in equation [6.2] the structures of space and time are introduced independently. The spatial distance |h| can have different meanings depending on the particular topographic space used. Euclidean distance in a rectangular coordinate system on R^n is defined as

|h| = (Σ_{i=1}^{n} h_i²)^{1/2}.     [6.3]
Non-Euclidean distances can be defined by generalizing equation [6.3] as follows

|h| = (Σ_{i=1}^{n} |h_i|^μ)^{1/μ},     [6.4]
where 1 ≤ μ ≤ 2; note that [6.4] becomes the Euclidean metric only if μ = 2. A common non-Euclidean metric of the form [6.4] is the absolute metric (μ = 1)

|h| = Σ_{i=1}^{n} |h_i|.     [6.5]
Another distance metric is defined by
|h| = max(|h_i|; i = 1, ..., n).     [6.6]
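For illustration, the metrics [6.3]-[6.6] can be evaluated as follows (Python sketch; the function names are ours):

import numpy as np

def euclidean(h):              # equation [6.3]
    return np.sqrt(np.sum(h ** 2))

def minkowski(h, mu):          # equation [6.4], with 1 <= mu <= 2
    return np.sum(np.abs(h) ** mu) ** (1.0 / mu)

def absolute(h):               # equation [6.5], the mu = 1 case
    return np.sum(np.abs(h))

def maximum(h):                # equation [6.6]
    return np.max(np.abs(h))

h = np.array([3.0, 4.0])
# euclidean(h) = 5.0, absolute(h) = 7.0, maximum(h) = 4.0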
The distance between two geographical locations on the surface of the Earth (considered as a sphere with radius r) is defined by

|h| = r [Δφ² + (cos²φ) Δθ²]^{1/2},     [6.7]
where Δφ and Δθ are the latitude and longitude differences, respectively (in radians). The spatiotemporal metric and the coordinate system in which the metric is evaluated are independent. An exception is the rectangular coordinate system, the definition of which involves the Euclidean metric. The following example illustrates how the above metrics can lead to different geometric properties of space. In the case of spatial isotropy (in R²), we define the set Θ of points at a distance r = |h| from a reference point O. Figure 6.2 shows that for the distance [6.3] the set Θ is a circle of radius r, whereas for the distance [6.5] Θ is a square with sides √2·r. As we shall see below, a similar distinction in terms of metrics is valid for other attribute characteristics, such as space-time dependence functions.

Figure 6.2. Set Θ of points at distance r from O when r is (a) the Euclidean distance and (b) the absolute distance; Θ also defines an iso-covariance contour
A general form for spatial metrics, Euclidean or non-Euclidean, can be summarized in terms of the distance

|h| = (Σ_{i,j=1}^{n} ε_ij h_i h_j)^{1/2},     [6.8]
where the ε_ij are coefficients that, in general, depend on the spatial location. The tensor ε = (ε_ij) is called the metric tensor. The Euclidean metric in a rectangular coordinate system is a special case of equation [6.8]: ε_ii = 1, ε_ij = 0 (i ≠ j). In a polar coordinate system the metric is obtained from [6.8] for n = 2, ε_11 = 1, ε_22 = s_1²,
ε_ij = 0 (i ≠ j). Equation [6.8] for n = 3, ε_11 = ε_33 = 1, ε_22 = s_1² and ε_ij = 0 (i ≠ j) provides the metric in a cylindrical coordinate system. In a spherical coordinate system, the metric is obtained from [6.8] for n = 3, ε_11 = 1, ε_22 = s_1², ε_33 = [s_1 sin(s_2)]² and ε_ij = 0 (i ≠ j). The metric structures of Gaussian and Riemannian coordinate systems are also represented using [6.8]. For n = 2, equation [6.8] gives the local distance on a curved surface; the metric coefficients ε_ij are functions of the spatial coordinates s_i (i = 1, 2), and ε_12 = ε_21. Thus, the curvature of a Gaussian (or Riemannian) surface is reflected in the metric.
6.2.3. Composite metric structures
A composite metric structure requires a higher level of physical understanding of space-time, which may involve theoretical and empirical facts about the attribute. The metric is determined by the geometry of space-time and also by the physical processes and space-time structures it generates. In composite metrics the structure of space-time is interconnected using an analytical expression of the form
|Δp| = ε(h_1, ..., h_n, τ),     [6.9]
where ε is a function determined by the KB available (topography, physical laws, etc.). Concerning knowledge representation, the Euclidean and non-Euclidean geometries display important differences. Euclidean geometry determines the metric that constrains the physics, in which case a single coordinate system implying a specific metric structure covers the entire spatiotemporal continuum. Non-Euclidean geometries distinguish between the spatiotemporal metric and the coordinate system, thus allowing for choices that are more appropriate for certain real-world problems. A special case of equation [6.9] is the space-time generalization of the distance [6.8] that leads to the spatiotemporal Riemannian metric

|Δp| = (Σ_{i,j=1}^{n} ε_ij h_i h_j + 2τ Σ_{i=1}^{n} ε_0i h_i + ε_00 τ²)^{1/2},     [6.10]
where the coefficients ε_ij (i, j = 1, ..., n) are functions of the spatial location and the time. In several applications considered in BME analysis and modeling the separable metric structure [6.2] is adequate. In other applications, however, the more involved composite structures [6.9], [6.10] may be necessary. In the latter case, considering the several existing spatiotemporal geometries that are mathematically distinct but a priori and generically equivalent, the metric structure (e.g., the function ε) that best describes reality must be determined. Mathematics describes the possible geometric spaces, and empirical knowledge determines which best represents the physical space. Axiomatic geometry is not sufficient for physical applications in space-time, and it is required to establish a relationship between the geometric concepts and the empirical investigation of space-time as a whole.
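A direct evaluation of the Riemannian space-time metric [6.10] can be sketched as follows (Python; the coefficient matrix eps_ij, vector eps_0i and scalar eps_00 are assumed to be supplied by the user):

import numpy as np

def riemannian_spacetime_metric(h, tau, eps_ij, eps_0i, eps_00):
    # |Δp| = (Σ eps_ij h_i h_j + 2 τ Σ eps_0i h_i + eps_00 τ^2)^(1/2), equation [6.10]
    quad = h @ eps_ij @ h + 2.0 * tau * (eps_0i @ h) + eps_00 * tau ** 2
    return np.sqrt(quad)

With eps_ij equal to the identity matrix, eps_0i = 0 and eps_00 = 1, the expression reduces to the Euclidean space-time distance (|h|² + τ²)^(1/2).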
6.2.4. Fractal metric structures
Many attributes that take place in non-uniform spaces with many-scale structural features are better represented by fractal rather than Euclidean geometry. In fractal spaces, it is not always possible to formulate explicit metric expressions such as equation [6.10], since the physical laws may not be available in the form of differential equations, for example. Geometric patterns in fractal space-time are self-similar (or statistically self-similar in the case of random fractals) over a range of scales [FED 88]. Self-similarity implies that fractional (fractal) exponents characterize the scale dependence of geometric properties. A common example is the percolation fractal generated by the random occupation of sites or bonds on a discrete lattice. Distance measures on a percolation cluster, denoted by ℓ(r), scale as power laws with the Euclidean (linear) size of the cluster. Power-law functions are called fractal if the scaling exponents are non-integer. The fractal functions are homogeneous, i.e. they satisfy

ℓ(b·r) = b^{d_o} ℓ(r),     [6.11]
where r is the appropriate Euclidean distance, d_o the fractal exponent for the specific property, and b a scaling factor. In practice, scaling relations [6.11] only hold within a range of scales bounded by lower and upper cut-offs, thus leading to

ℓ(r) = ℓ(r_co) (r/r_co)^{d_o},     [6.12]
where r_co is the lower cut-off for the fractal behavior. For example, the length of the minimum path on a percolation fractal scales as ℓ_min(r) ∝ r^{d_min}. The fractal dimension d_min of the percolation fractal on a hypercubic lattice satisfies 1 ≤ d_min ≤ 2, where d_min ≈ 1.1 and 1.3 for n = 2 and 3, respectively. Thus, if the minimum path length between two points at Euclidean distance r is on average 2 miles, the length of the minimum path between two points separated by 2r is, on average, more than 4 miles. Figure 6.3 shows the minimum path length between two points separated by r in Euclidean space and in fractal space with d_o = 1.15 [CHR 00d]. The Euclidean path length is a linear function of the distance between two points, for all types of paths (e.g., circular arcs or linear segments); the fractal path length increases nonlinearly, since the fractal space is non-uniform and obstacles to motion occur at all scales.
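The numerical example given above (2 miles becoming more than 4 miles when the separation doubles) can be checked directly from the scaling relation [6.11]; a small illustration in Python, with d_min = 1.1 as in the text:

d_min = 1.1
length_r = 2.0                           # minimum path length at Euclidean distance r (miles)
length_2r = (2.0 ** d_min) * length_r    # relation [6.11] with scaling factor b = 2
# length_2r is about 4.29 miles, i.e. more than twice the original path length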
6.3. Spatiotemporal random field theory
The spatiotemporal random field (S/TRF) theory of stochastics (section 6.1) aims to study the properties of a natural system as a whole and connect them to causal relations and space-time patterns under conditions of uncertainty. S/TRF tools include the spatiotemporal pdf, space-time dependence functions (e.g., ordinary and generalized covariance and variogram functions), and local scale heterogeneity characteristics (spatial and temporal continuity orders). A detailed presentation of stochastics, in general, and the S/TRF model, in particular, can be found in [CHR 92; CHR 98b]; these references also discuss several of the S/TRF applications in sciences and engineering. Here we provide a brief review that focuses on the S/TRF model.
[Plot: minimum path length (vertical axis) versus Euclidean distance (horizontal axis), curves (1) and (2)]

Figure 6.3. Minimum path length between two points separated by (1) distance r in Euclidean space and (2) in a space with fractal length dimension d_o = 1.15 [CHR 00d]
Let X_p = X(p) be an S/TRF representing an attribute that varies within the space-time domain E. As above, the vector p = (s, t) denotes a point in E, where s is the spatial location and t is the time instant under consideration. The S/TRF model is viewed as the collection of all physically possible realizations concerning the attribute we seek to represent mathematically. From a stochastics point of view, the S/TRF model is fully characterized by its multivariate pdf, f_KB, which is generally defined as
P_KB[χ_{p_1} ≤ X_{p_1} ≤ χ_{p_1} + dχ_{p_1}, ..., χ_{p_k} ≤ X_{p_k} ≤ χ_{p_k} + dχ_{p_k}] = f_KB(p_1, ..., p_k) dχ_{p_1} ... dχ_{p_k},     [6.13]
where the subscript KB denotes the knowledge base that BME analysis used to construct the pdf (e.g., KB = G, S or K), and the χ_{p_i} (i = 1, ..., k) denote random field realizations. The f_KB describes the comparative likelihoods of the various realizations and not the certain occurrence of a specific realization. Accordingly, the pdf unit is probability per realization unit. As it can investigate the different forms of space-time correlation that are allowed by the case-specific data and core knowledge available, the X_p model can provide multiple admissible realizations and can also characterize their likelihood of occurrence. Using equation [6.13], f_KB assigns probabilities to different X_p-realizations that involve multiple space-time points. Thus, the S/TRF theory has many conceptual layers and salient features:
– it assumes a composite space-time manifold, i.e., it considers space and time as an integrated whole rather than as separate entities;
– it incorporates spatiotemporal cross-correlations and interdependencies of the attribute distribution, as well as natural laws (e.g., expressed in terms of algebraic or differential equations);
– it is of immediate relevance to models that are mathematically rigorous and tractable while, at the same time, logically and physically plausible;
– it generates informative realizations enabling the determination of several important characteristics of the attribute distribution across space-time.
When representing an attribute in terms of an S/TRF model we assign to it a random character but, also, an equally important structural character; these two complementary characters are closely dependent and interact with each other in a space-time context. This is acknowledged, e.g., by the fact that a realization is allowed only if it is consistent with the knowledge available regarding the attribute. Clearly, not all realizations of the S/TRF are equally probable. Depending on the underlying mechanisms, some realizations are more probable than others, and this is reflected in the pdf of the S/TRF.
6.3.1. Pragmatic S/TRF tools
Pragmatic tools of the S/TRF theory include the spatiotemporal dependence functions of the attribute X_p, and its local scale heterogeneity characteristics. It must be noted at the outset that not every function can serve as a spatiotemporal dependence model. Certain permissibility criteria must be satisfied, which have been discussed in detail in the literature [CHR 84, CHR 92, CHR 05b; CHR 98b; CHR 00b and d; KOL 02]. These include permissibility criteria for spatiotemporal
dependence functions associated with ordinary, generalized and fractal random fields. Remarkably, an S/TRF characterization in terms of the dependence functions is sufficient in many cases in practice. The most commonly used space-time dependence functions are as follows.

The attribute mean function (the bar denotes stochastic expectation)

\bar{X}_p = \int d\chi_p\, \chi_p\, f_{KB}(p)   [6.14]

at each space-time point p = (s, t).

c_X(p, p') = \overline{\tilde{X}_p \tilde{X}_{p'}} = \sum_{j,k} \begin{cases} c_{jk}\, \xi_{1j,s}\, \xi_{1k,s'}\, \xi_{2j,t}\, \xi_{2k,t'} \\ \overline{A_j A_k}\, c_{\xi(j,k),s,s'}\, \xi_{2j,t}\, \xi_{2k,t'} \\ \overline{A_j A_k}\, \xi_{1j,s}\, \xi_{1k,s'}\, \xi_{2j,t}\, \xi_{2k,t'} \end{cases}

\xi = modes of the associated differential equation with amplitudes A (random or deterministic); c_{jk} = \overline{A_j A_k}, and c_{\xi(j,k)} denotes the mode correlation \overline{\xi_{1j,s}\, \xi_{1k,s}}.

Table 6.3. Spatiotemporal covariance models
The attribute covariance function,

c_X(p, p') = \overline{\tilde{X}_p \tilde{X}_{p'}} = \int\!\!\int d\chi_p\, d\chi_{p'}\, (\chi_p - \bar{X}_p)(\chi_{p'} - \bar{X}_{p'})\, f_{KB}(p, p'),   [6.15]

between pairs of points p = (s, t) and p' = (s', t'), where \tilde{X}_p = X_p - \bar{X}_p are attribute fluctuations. Covariance models can be separable (e.g., they can be expressed as the product of purely spatial and purely temporal components) or, more generally, non-separable (i.e., they cannot be expressed as the above product). Table 6.3 gives an example of a non-separable space-time covariance. Several others can be found in the BME literature; e.g., [KOL 04] and references therein. The attribute variance, \sigma^2_{X,p}, is obtained as a special case of [6.15] if p = p', thus yielding

\sigma^2_{X,p} = \overline{\tilde{X}_p^2} = \int d\chi_p\, (\chi_p - \bar{X}_p)^2 f_{KB}(p).   [6.16]
The attribute variogram function

\gamma_X(p, p') = \tfrac{1}{2}\, \overline{(X_p - X_{p'})^2} = \tfrac{1}{2} \int\!\!\int d\chi_p\, d\chi_{p'}\, (\chi_p - \chi_{p'})^2 f_{KB}(p, p').   [6.17]
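To make the covariance and variogram of equations [6.15] and [6.17] concrete, the following minimal Python sketch (not part of the original text; the array names, detrending choice and lag bins are illustrative assumptions) computes their method-of-moments sample analogs from scattered space-time observations.

```python
import numpy as np

def st_variogram_covariance(coords, times, values, r_bins, t_bins):
    """Method-of-moments estimates of the space-time variogram and covariance.

    coords: (n, 2) spatial coordinates; times: (n,) time instants;
    values: (n,) attribute observations; r_bins, t_bins: 1D bin edges for the
    spatial and temporal lags. Returns arrays of shape (n_rbins, n_tbins).
    """
    n = len(values)
    resid = values - values.mean()                       # simple constant-mean detrending
    i, j = np.triu_indices(n, k=1)                       # all point pairs (i < j)
    r = np.linalg.norm(coords[i] - coords[j], axis=1)    # spatial lag |s - s'|
    tau = np.abs(times[i] - times[j])                    # temporal lag |t - t'|
    gamma = np.full((len(r_bins) - 1, len(t_bins) - 1), np.nan)
    cov = np.full_like(gamma, np.nan)
    for a in range(len(r_bins) - 1):
        for b in range(len(t_bins) - 1):
            m = ((r >= r_bins[a]) & (r < r_bins[a + 1]) &
                 (tau >= t_bins[b]) & (tau < t_bins[b + 1]))
            if m.any():
                dv = values[i[m]] - values[j[m]]
                gamma[a, b] = 0.5 * np.mean(dv ** 2)             # sample analog of eq. [6.17]
                cov[a, b] = np.mean(resid[i[m]] * resid[j[m]])   # sample analog of eq. [6.15]
    return gamma, cov

# toy usage with synthetic data
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))
times = rng.uniform(0, 5, size=200)
values = np.sin(coords[:, 0]) + 0.2 * times + rng.normal(0, 0.1, 200)
g, c = st_variogram_covariance(coords, times, values,
                               r_bins=np.linspace(0, 5, 6),
                               t_bins=np.linspace(0, 2.5, 6))
```

If the sampling pattern is clustered, the covariance estimator above would be modified with declustering weights, as discussed in the next paragraph.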
The mean \bar{X}_p represents structural attribute trends, whereas the c_X(p, p') and \gamma_X(p, p') express space-time attribute dependence. This dependence is an inherent feature of attribute variation across geographical space and during different times. There exist, in fact, different forms of dependence that lead to distinct covariance and variogram shapes. If the data are clustered in space, efficient algorithms exist for the practical estimation of the sample covariance and variograms [KOV 04a]. In these cases, a coefficient of variation of the dimensionless spatial density of the point pattern of sample locations is introduced as a measure of the degree of clustering of the dataset; then, a modified form of the covariance estimator is used that incorporates declustering weights and proposes a scheme for estimating the declustering weights based on zones of proximity. Furthermore, if the physical context requires it, higher-order attribute spatiotemporal functions, also known as multiple-point dependence functions, can be considered; they are expressed as

g_{X,p_i,\lambda} = \overline{\prod_{i=1}^{\lambda} \tilde{X}_{p_i}} = \int \cdots \int \prod_{i=1}^{\lambda} d\chi_{p_i}\, (\chi_{p_i} - \bar{X}_{p_i})\, f_{KB}(p_1, \ldots, p_\lambda),   [6.18]
which includes the tri-, tetra- and penta-variogram functions. In the special case that p = p_i (for all i = 1, ..., \lambda), we obtain

\theta_{X,p,\lambda} = \overline{\tilde{X}_p^{\lambda}} = \int d\chi_p\, (\chi_p - \bar{X}_p)^{\lambda} f_{KB}(p),   [6.19]
where \lambda \ge 3; equation [6.19] includes the skewness and kurtosis functions. As we shall see below, the BME can generate pdf f_{KB} at arbitrary geographical locations and time instants (e.g., at the nodes of a mapping grid). These pdf effectively incorporate a body of knowledge that includes theoretical models of space-time dependence, as above. Next, a substantive distinction is made between ordinary and generalized S/TRF.

6.3.2. Space-time lag dependence: ordinary S/TRF

This is the case where the space-time attribute dependence can be expressed in terms of the corresponding lag, p - p' = (s - s', t - t'). The S/TRF is called an
ordinary field that is spatially homogenous and temporally stationary. The dependence functions above are simplified as follows: the attribute mean function is a constant m across space-time, i.e.

\bar{X}_p = m   [6.20]

at each space-time point p. The attribute covariance and variogram are functions of the space-time lag only, i.e.,

c_X(p, p') = c_X(p - p'),   [6.21]

\gamma_X(p, p') = \gamma_X(p - p').   [6.22]
For illustration, Table 6.4 presents covariances that can be used to model the space-time dependence of attributes with wave-like characteristics (e.g., epidemic attributes). Figure 6.4 shows one of the (non-separable) covariances used to model the monthly mortality distribution X_p = M_p (in %) for the 14th century Black Death epidemic in Europe; this covariance is spatially isotropic, as it is a function of the spatial lag magnitude |s - s'| = |h| = r, so that p - p' is replaced by (|h|, t - t') = (r, \tau) and c_M(p - p') = c_M(r, \tau). In Figure 6.2 above, the set \Theta also defines iso-covariance contours.
c_X(p - p') = \begin{cases} e^{-|(s - s') - v(t - t')|/a} \\ e^{-[(s - s') - v(t - t')]^2/a^2} \\ \left[1 + [(s - s') - v(t - t')]^2 / b^2\right]^{-\lambda/2} e^{-|(s - s') - v(t - t')|/a} \end{cases}

a, b, \lambda, v = coefficients calculated from the datasets available.

Table 6.4. Spatiotemporal covariance models representing epidemic attributes [CHR 05c]
Figure 6.4. Space-time covariance of mortality (%) during the 14th century Black Death epidemic in Europe [CHR 05c]
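As a purely illustrative sketch (not from the book; it assumes the exponential traveling-wave form reconstructed above in Table 6.4, with invented parameter values), the Python code below evaluates such a wave-type covariance on a grid of spatial and temporal lags; the correlation is strongest along the moving front where the spatial lag equals the wave speed times the temporal lag.

```python
import numpy as np

def wave_covariance(s_lag, t_lag, a=1.0, v=0.5):
    """Traveling-wave space-time covariance of exponential type,
    c(h, tau) = exp(-|h - v*tau| / a); a = range-like parameter, v = wave speed
    (illustrative reconstruction of the first model listed in Table 6.4)."""
    return np.exp(-np.abs(s_lag - v * t_lag) / a)

# evaluate on a grid of spatial (h) and temporal (tau) lags
h = np.linspace(0, 5, 51)
tau = np.linspace(0, 10, 101)
H, T = np.meshgrid(h, tau, indexing="ij")
C = wave_covariance(H, T)

i = np.argmin(np.abs(h - 2.5))     # row closest to spatial lag 2.5
print(C[i].max())                  # ~1 near tau = 5, where h - v*tau = 0
```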
As mentioned previously, the metric that determines space-time distance affects the permissibility of a dependence model. Thus, a model that is permissible for one metric may not be so for another. The matter has been discussed in considerable detail [CHR 00a, b and d; CHR 02c]. In particular, it was shown that a general class of functions that can be associated with Euclidean or non-Euclidean metrics is as follows

c_{X,h} = e^{-\kappa_{\mu}(h)},   [6.23]

where \kappa_{\mu}(h) = \sum_{i=1}^{n} |h_i|^{\mu} and 0 < \mu \le 2. A few examples are as follows: the covariance

c_{X,h} = e^{-|h|^2},   [6.24]
is permissible for Euclidean metric [6.3] only; it is not permissible for absolute metric [6.5]. The covariance

c_{X,h} = e^{-\kappa_1(h)}   [6.25]

is permissible with the non-Euclidean absolute metric [6.5]. The analysis above can be extended to include metrics of the more general form |h| = \left(\sum_{i=1}^{n} \lambda_i |h_i|^{\mu}\right)^{1/\mu}, where 1 \le \mu < 2 and \lambda_i (i = 1, ..., n) is a weight determining the "salience" of the h_i direction.

6.3.3. Fractal S/TRF

We now investigate space-time covariance models associated with fractal S/TRF. A covariance model within the fractal range is

c_X(p, p') \propto r^{\alpha} (\tau r^{-\beta})^{z},   [6.26]

where \tau_0 < \tau < \tau_m and r_0 < r < r_m define the space-time fractal ranges; and -1 < z < 0 and -(n+1)/2 < \alpha + \beta z < 0 are permissibility conditions. A covariance function that has a fractal behavior is

c_X(p, p') = \sigma_X^2\, \hat{f}_z(\tau r^{-\beta}; u_c)\, \hat{f}_{\alpha}(r; w_c),   [6.27]
where \sigma_X^2 is the variance and u_c, w_c are cut-offs; an illustration of the model is plotted in Figure 6.5. The function \hat{f}_z of model [6.27] has an unusual dependence on the space and time lags through \tau r^{-\beta}. For large \tau, the \tau r^{-\beta} is close to 0 if r is sufficiently large, and the \hat{f}_z value is close to 1. With regard to \hat{f}_z, two pairs of space-time points are equidistant if \tau_1 r_1^{-\beta} = \tau_2 r_2^{-\beta}. Thus, the equation for equidistant space-time contours is \tau r^{-\beta} = c. This dependence is physically different than that implied by, for example, a Gaussian space-time covariance model. In the latter, equidistant lags satisfy the equation r^2/\xi_r^2 + \tau^2/\xi_{\tau}^2 = c. The difference is shown in Figure 6.6, which plots the equidistant contours for \hat{f}_z (solid lines) and for e^{-r^2/\xi_r^2 - \tau^2/\xi_{\tau}^2} (dots) as a function of the space and time lags.
Figure 6.5. Plot of the fractal covariance model [6.27] for \sigma_X^2 = 1, z = \alpha = -1/2, \beta = 1.1 and u_c = w_c = 25
BME space-time estimation and mapping depend on the metric structure assumed, since the dependence models are used as inputs in most mapping techniques. In fact, [CHR 00d] showed that the same dataset with its space-time dependence represented by covariance models of the same functional form can lead to different space-time maps if estimation is performed using different metric structures. In conclusion, the choice of a coordinate system and associated norm to describe a natural phenomenon depends on the attribute properties being described. Metric-dependent permissibility analysis has important consequences in applications (e.g., space-time mapping or the solution of stochastic partial differential equations), in which we are concerned about the validity of space-time dependence functions associated with a physically meaningful metric (Euclidean or non-Euclidean).
Figure 6.6. Equidistant contours for fractal space-time dependence (solid contours) and for Gaussian dependence (dotted contours). Contour labels represent c_0 = \tau r^{-\beta} values (solid lines), and r^2/\xi_r^2 + \tau^2/\xi_{\tau}^2 values (dots) obtained using c_0 = 62.95, \xi_r = 10 and \xi_{\tau} = 5
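The contrast between the two equidistance measures discussed around Figure 6.6 can be reproduced with a short Python sketch (not from the book; the parameter values follow the reconstructed captions above and are assumptions).

```python
import numpy as np

# Space-time "equidistance" under two dependence structures (illustrative
# values: beta, xi_r, xi_tau follow the reconstructed captions above).
beta, xi_r, xi_tau = 1.1, 10.0, 5.0
c0 = 62.95                           # a chosen contour level
r = np.linspace(0.5, 20.0, 200)      # spatial lag

# Fractal-type measure: points (r, tau) with tau * r**(-beta) = c0
tau_fractal = c0 * r ** beta

# Gaussian-type measure: points with r^2/xi_r^2 + tau^2/xi_tau^2 = c0
inside = c0 - (r / xi_r) ** 2
tau_gauss = xi_tau * np.sqrt(np.clip(inside, 0.0, None))

# The fractal contour keeps growing with r, whereas the Gaussian contour
# closes at r = xi_r * sqrt(c0): physically very different equidistance.
print(tau_fractal[:3], tau_gauss[:3])
```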
6.3.4. Space-time heterogenous dependence: generalized S/TRF

A class of heterogenous S/TRF proposed by [CHR 91a and b] is considerably more general than the class of homogenous-stationary S/TRF. This is a generalized S/TRF class that is capable of handling complicated space-time variability of any size based on the following intuitive idea: the variability of an attribute can be characterized using its degree of departure from homogeneity and stationarity. This departure can be determined by a mathematical operation, in the following sense. Let Q_{\nu/\mu} be a space-time operator that transforms the S/TRF X_p = X_{s,t} into a homogenous-stationary attribute Y_p by annihilating heterogeneities of degrees \nu in space and \mu in time, i.e.,

Q_{\nu/\mu}[X_p] = Y_p.   [6.28]
In this case, the attribute X_p is said to be a generalized S/TRF with spatial and temporal heterogeneity orders \nu and \mu, respectively (S/TRF-\nu/\mu). The generalized S/TRF offers a theoretical model of the attribute distribution that expresses the way causal influence is propagated in space-time and gives information about the attribute dynamics at the scale of interest. For natural systems that evolve within domains containing complicated boundaries and trends, the departure of the S/TRF from homogeneity-stationarity is expected to vary geographically and temporally. It is meaningful to construct local Q_{\nu/\mu}-operators that produce homogenous/stationary attributes Y_p within local neighborhoods, instead of seeking global representations. Parameters \nu, \mu provide a quantitative assessment of the rate of change of attribute patterns: the lower the heterogeneity level, the smaller the \nu, \mu values. These parameters offer information about the stochastic model underlying the actual system; they may determine how "far away" in space and "deep" in time the model searches for information about the attribute. Plots of the \nu - \mu values associated with mortality distributions in the case of the bubonic plague in India (late 19th-early 20th century) are shown in Figure 6.7. Attribute correlations across space-time are characterized by the covariance c_X(p, p') between any pair of p, p' points. This covariance is non-homogenous in space and non-stationary in time, and according to the S/TRF-\nu/\mu theory it can be decomposed as

c_X(p, p') = \kappa_X(p - p') + \mu_{\nu/\mu},   [6.29]

where \kappa_X(p - p') is called the generalized spatiotemporal covariance and \mu_{\nu/\mu} is a space-time trend function. An important feature of this model is that only \kappa_X(p - p') is required in prediction and mapping [CHR 96]. For illustration, Table 6.5 presents some of the \kappa_X(p - p') models that were used in the study of the Indian bubonic plague.
\kappa_X(r, \tau) = c\,\delta_r \delta_{\tau} + \delta_r \sum_{\zeta=0}^{\mu} (-1)^{\zeta+1} a_{\zeta}\, \tau^{2\zeta+1} + \delta_{\tau} \sum_{\rho=0}^{\nu} (-1)^{\rho+1} b_{\rho}\, r^{2\rho+1} + \sum_{\rho=0}^{\nu} \sum_{\zeta=0}^{\mu} (-1)^{\rho+\zeta} a_{\rho\zeta}\, r^{2\rho+1} \tau^{2\zeta+1}

c, a_{\zeta}, b_{\rho}, a_{\rho\zeta} = coefficients calculated from the datasets; \delta_r and \delta_{\tau} = delta functions in space and time, respectively.

Table 6.5. Generalized spatiotemporal covariance models of the mortality distribution during the Indian bubonic plague [YU 06]
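The detrending idea behind the Q_{\nu/\mu} operator of equation [6.28] can be conveyed by a toy Python illustration (not from the book; it uses simple finite differences on a gridded field, with invented trend and orders): differencing of order \nu + 1 along space and \mu + 1 along time annihilates a polynomial trend of degree \nu in space and \mu in time, leaving only the homogenous-stationary residual.

```python
import numpy as np

nu, mu = 2, 1                               # assumed heterogeneity orders (toy values)
s = np.arange(60.0)                         # 1D spatial coordinate (toy grid)
t = np.arange(40.0)                         # time coordinate
S, T = np.meshgrid(s, t, indexing="ij")

rng = np.random.default_rng(1)
trend = 0.02 * S**2 - 0.5 * S + 3.0 * T     # degree-2 trend in space, degree-1 in time
noise = rng.normal(0.0, 1.0, S.shape)       # stationary residual playing the role of Y_p
X = trend + noise                           # heterogenous field playing the role of X_p

# space-time differencing operator: order nu+1 in space, mu+1 in time
Q_X = np.diff(np.diff(X, n=nu + 1, axis=0), n=mu + 1, axis=1)
Q_trend = np.diff(np.diff(trend, n=nu + 1, axis=0), n=mu + 1, axis=1)

print(np.max(np.abs(Q_trend)))   # ~0: the polynomial trend is annihilated
print(Q_X.std())                 # what remains reflects only the residual part
```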
Figure 6.7. Space-time maps of the \nu - \mu differences associated with the Indian bubonic plague mortality distributions during different times [YU 06]
The generalized covariance \kappa_X(p - p') can be expressed in terms of the residual ordinary covariance c_Y(p - p'), so that given the form of c_Y(p - p'), the corresponding \kappa_X(p - p') is derived [CHR 92]. Thus, the class of generalized space-time covariances is richer than that of ordinary ones. It is worth noting that, due to its considerable sophistication, the random field theory has been poorly understood by certain authors (e.g., [MYE 89, MYE 02, and GOO 97]). This has generated several nonsensical statements concerning the theory's mathematical structure and physical interpretation as well as a number of basic mistakes in its practical implementation.
6.4. About BME

6.4.1. The fundamental equations

Consider an attribute X_p distributed across space-time that is modeled as a S/TRF, ordinary or generalized. The BME approach for studying X_p is based on the following fundamental set of equations (see, also, Figure 6.1 above)

\left.\begin{aligned} \int d\chi_p\, (g - \bar{g})\, e^{\mu^{T} g} &= 0 \\ \int d\chi_p\, \xi_S\, e^{\mu^{T} g} - A\, f_K(p) &= 0 \end{aligned}\right\}   [6.30]
where g is a vector that represents the G-KB available concerning the physical situation, \mu is a vector of coefficients associated with g (\mu expresses the relative significance of each element of g and depends on the space-time coordinates), \xi_S represents the site-specific KB available, A is a normalization parameter and f_K is the pdf expressing the final distribution of the attribute X_p at each space-time point. f_K accounts for the integration of the general and site-specific KB. g and \xi_S are the inputs, whereas the unknowns in equations [6.30] are \mu and f_K across space-time.

\int\!\!\int d\chi\, d\chi'\, \chi\, (\partial/\partial t + a_1\, \partial/\partial s + a_2)\, f_G(p, p') = 0
\int\!\!\int d\chi\, d\chi'\, \chi^2\, (\partial/\partial t + a_1\, \partial/\partial s + 2 a_2)\, f_G(p, p') = 0
\int\!\!\int d\chi\, d\chi'\, \chi\, \chi'\, (\partial/\partial t + a_1\, \partial/\partial s + a_2)\, f_G(p, p') = 0

Table 6.6. G-equations of an advection-reaction contaminant law [KOL 02]

For illustration, Table 6.6 presents the relevant G-equations in the case of an advection-reaction contaminant law (e.g., along a river). Table 6.7 presents interval and probabilistic (soft) data of an S-KB (I is an interval of possible attribute values; P_S denotes a probability operator, and w is an empirical function relating attribute values between specified space-time points).

\chi \in I \qquad P_S[w(\chi)] \qquad P_S[w(\chi, \chi')]

Table 6.7. Soft attribute data of S-KB
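As a purely illustrative sketch (not from the book; the discretization, the constraint values and the use of scipy are assumptions), the first of equations [6.30] can be read as the stationarity condition of a maximum-entropy problem: the prior pdf has the exponential form proportional to e^{\mu^T g}, with the multipliers \mu chosen so that the prescribed G-KB moments are reproduced. The Python code below solves a one-point version with mean and variance constraints.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Discretized attribute values chi and target G-KB moments (assumed values)
chi = np.linspace(-5.0, 5.0, 401)
dchi = chi[1] - chi[0]
target_mean, target_var = 1.0, 0.8
g = np.vstack([chi, (chi - target_mean) ** 2])   # g_1 = chi, g_2 = centered square
g_bar = np.array([target_mean, target_var])      # prescribed expectations

def dual(mu):
    # log of the normalization constant A(mu) = integral of exp(mu^T g) dchi
    logA = logsumexp(mu @ g) + np.log(dchi)
    # convex dual of the maximum-entropy problem; its minimizer enforces E[g] = g_bar
    return logA - mu @ g_bar

mu = minimize(dual, x0=np.zeros(2), method="BFGS").x
f_G = np.exp(mu @ g)
f_G /= np.sum(f_G) * dchi                        # normalized prior pdf

# check that the moment constraints are honored
print(np.sum(chi * f_G) * dchi)                        # ~ target_mean
print(np.sum((chi - target_mean) ** 2 * f_G) * dchi)   # ~ target_var
```

In the full BME setting this step is carried out for the joint pdf over all mapping points, and the second of equations [6.30] then updates that prior with the site-specific knowledge \xi_S.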
The complete pdf, f_K, across space-time are critical components of many scientific applications, risk assessment studies, etc. In addition, after the space-time dependent f_K have been derived, a variety of attribute maps can be generated. Thus, from f_K we can select a space-time attribute estimate (or prediction), \hat{X}_p, of the actual (but unknown) attribute X_p value at any space-time point of interest. The choice of an estimator (predictor) depends on the goals of the study. A few examples are as follows (Table 6.8): the BMEmode estimate represents the most probable X_p realization and the BMEmean estimate minimizes the mean squared estimation error. Other forms of BME estimators can be derived so that they optimize an objective function [CHR 00a]. Predicted attribute values are used to create informative spatiotemporal maps, which can be scientifically interpreted to provide a useful picture of reality and generate science-based decisions. Due to the randomness of the X_p distribution and data inaccuracies, we use f_K to obtain an uncertainty assessment of the \hat{X}_p values. A popular accuracy measure is the prediction error standard (std) deviation of f_K, i.e.

BMEmode: \hat{X}_{p,mode}: \max_{X_p} f_K
BMEmean: \hat{X}_{p,mean} = \int d\chi_p\, \chi_p\, f_K

Table 6.8. Examples of space-time estimators

\sigma_K(p) = \left[\int d\chi_p\, (\chi_p - \bar{X}_p)^2 f_K\right]^{1/2},   [6.31]
which is calculated at each map grid point. Other accuracy measures (including confidence intervals and sets) can also be calculated [CHR 02c]. Fundamental equations [6.30] constitute a very general and concise expression that summarizes a host of theoretical and applied results derived as special cases of [6.30], depending on the choice of g and \xi_S; the latter, in turn, depend on the agent's epistemic conditions and the problem's objectives. For illustration, some of these results are shown in Figure 6.8 [CHR 08]. To gain insight, let us very briefly explain a few of the terms used in this figure (see relevant BME literature for a more detailed presentation of the notation, terminology and substantive technology). The S-KB includes a dataset \chi_{data} at a collection of space-time points denoted by the vector p_{data}; the \chi_{data} consists of hard attribute data \chi_{hard} at points p_{hard} and soft data \chi_{soft} at points p_{soft}. The vector p_{map} includes the p_{data} and the map grid points
where an estimate is sought. The c_{map} is a matrix whose elements are the covariances c_{ij} between any of the points belonging to p_{map}; accordingly, each c^{-1}_{ij} denotes the ij-th element of the inverse covariance matrix. The q_i are coefficients associated with the Q_{\nu/\mu} operator, and \theta_i are functions of \xi_S.
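For intuition only (not from the book; the grid, the pdf and the variable names are invented), the short Python sketch below shows how the BMEmode and BMEmean estimates of Table 6.8 and the error standard deviation of equation [6.31] would be extracted from a discretized prediction pdf f_K at a single grid point; here the mean of f_K plays the role of the centering value in [6.31].

```python
import numpy as np

# Discretized prediction pdf f_K at one estimation point (toy, non-Gaussian pdf)
chi = np.linspace(0.0, 10.0, 1001)
dchi = chi[1] - chi[0]
f_K = (np.exp(-0.5 * ((chi - 4.0) / 1.0) ** 2) +
       0.4 * np.exp(-0.5 * ((chi - 7.0) / 0.7) ** 2))
f_K /= np.sum(f_K) * dchi                      # normalize to a proper pdf

x_mode = chi[np.argmax(f_K)]                   # BMEmode: most probable realization
x_mean = np.sum(chi * f_K) * dchi              # BMEmean: minimizes mean squared error
sigma_K = np.sqrt(np.sum((chi - x_mean) ** 2 * f_K) * dchi)   # error std, as in eq. [6.31]

print(x_mode, x_mean, sigma_K)
```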
[Figure 6.8 (caption below) arranges around the fundamental BME equations [6.30] a set of special cases obtained for particular choices of the G-KB (e.g., g \to \gamma_X; g \to \bar{X}_p, c_X; g \to \kappa_X, \nu, \mu; g \to boundary and initial conditions of a stochastic partial differential equation) and of the S-KB (\xi_S \to hard data only, or hard and soft data); the single-point cases reduce to kriging-type estimators expressed as weighted sums of the hard data.]
Figure 6.8. Several cases derived from the general BME formulation of equations [6.30]
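To illustrate the kind of single-point special case summarized in Figure 6.8, namely that with a known mean and covariance as the only G-KB and hard data as the only S-KB the BME estimate coincides with the classical simple kriging estimator, the following Python sketch computes that estimator; the data values, locations and covariance parameters are invented for illustration.

```python
import numpy as np

def exp_cov(p1, p2, sill=1.0, a_r=2.0, a_t=3.0):
    """Separable exponential space-time covariance (illustrative model)."""
    r = np.linalg.norm(np.asarray(p1[:2]) - np.asarray(p2[:2]))
    tau = abs(p1[2] - p2[2])
    return sill * np.exp(-r / a_r) * np.exp(-tau / a_t)

# hard data points (x, y, t) with values, a constant known mean, and a target point
pts = [(0.0, 0.0, 0.0), (1.0, 0.5, 1.0), (2.0, 2.0, 0.5)]
vals = np.array([1.2, 0.7, 1.9])
mean = 1.0
target = (1.0, 1.0, 0.8)

C = np.array([[exp_cov(pi, pj) for pj in pts] for pi in pts])   # data-to-data covariances
c0 = np.array([exp_cov(pi, target) for pi in pts])              # data-to-target covariances

w = np.linalg.solve(C, c0)                  # simple kriging weights
x_hat = mean + w @ (vals - mean)            # estimate at the target point
var_hat = exp_cov(target, target) - w @ c0  # kriging (error) variance
print(x_hat, var_hat)
```

With soft data, physical laws or higher-order moments in the knowledge bases, the full equations [6.30] no longer reduce to this linear form.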
Equations [6.30] recognize a basic element that all theories and experiments have in common: human agents have created them, in one way or another. Thus, underlying [6.30] is the agent’s mental process concerning the problem in light of the KB available. In this sense, the philosophical underpinnings of [6.30] have much in common with Heisenberg’s perspective: “Contemporary science, today more than at any previous time, has been forced by nature herself to again pose the old
question of the possibility of comprehending reality using mental processes, and to answer it in a slightly different way." By way of a summary, BME possesses certain features of theoretical and practical significance, such as:
– BME's scientific methodology involves evolutionary principles based on brain and behavioral functions, which can embrace diverse phenomena and interdisciplinary descriptions in a single scheme [CHR 08];
– BME assumes space-time coordinate systems that accommodate Euclidean and non-Euclidean metrics and consider attribute variability and underlying physical mechanisms [CHR 00a];
– BME accounts for multi-sourced uncertainties (conceptual and technical, ontologic and epistemic) [SER 03b and c];
– BME represents space-time patterns in terms of dependence models (covariances and variograms, separable and non-separable, ordinary and generalized) [YU 07a];
– BME can also incorporate higher-order (multiple-point) moments [CHR 98c; HRI 01]; e.g., the effect of skewness on BME prediction is shown in Figure 6.9;
Figure 6.9. Pdf of space-time BME prediction errors showing the effect of incorporating knowledge about skewness (zero skewness corresponds to the Gaussian pdf); the error distributions change as the skewness values change [CHR 98c]
– BME offers complete system characterization in terms of prediction probability laws, non-Gaussian, in general, at every grid point – and not merely a single prediction at each point; e.g., [BOG 04a and LEE 07a];
– BME assumes nonlinear attribute predictors [CHR 90a,b, CHR 92; PAP 06] rather than restrictive linear or linearized attribute estimators commonly used in spatial and/or temporal statistics [CRE 93];
– BME relies on a natural knowledge-based methodology (rather than on mechanistic curve fitting, ad hoc trend surface, etc. techniques), which allows it to rigorously incorporate physical laws, theoretical models, scientific theories and well-established empirical relationships [KOL 02; YU 07b]. For illustration, Figure 6.10 presents the three-dimensional temperature distribution in a sub-region of the thermometric field of Nea Kessani (Greece); comparative analysis showed that the BME distribution offers a more realistic representation of the real-world phenomenon than the distributions obtained from conventional analytical and computational methods; and, unlike previous methods, this was in agreement with empirical quartz geothermometry analyses;
Figure 6.10. BME temperature distribution (in °C) [YU 07b]
– BME provides operational Bayesian assimilation rules that are effective and considerably flexible; and it can consider different types of space-time support (functional BME) as well as more than one attribute (vector BME or co-BME); e.g., [CHR 00a, CHO 03];
– BME can also account for categorical variables [BOG 02a] and secondary knowledge in terms of fuzzy sets (Figure 6.11; [KOV 04b]). Various other knowledge bodies can be used, such as soil texture triangles (Figure 6.12; [DOR 01, DOR 03]). Many previous results (e.g., spatial regression and geostatistical kriging techniques) are easily derived as special cases of BME under limited conditions – see, for example, Figure 6.8 above.
Figure 6.11. (a), (b) Example membership functions of type I fuzzy data; (c) location map of membership functions (centroid values are also shown for each membership function); (d) location map of pdf obtained by generalized defuzzification of membership functions
BME can be used to solve a variety of problems. Here are a few examples: [SER 03c] used BME to study the inverse problem of saturated subsurface flow; [KOL 02, PAP 06 and YU 07b] applied BME in the solution of stochastic differential equations representing physical laws of different kinds; [CHR 02c and CHR 06] proposed various applications of BME in geographic information science and decision analysis within multi-dimensional environments; [FAS 07a and b, BOG 07] used BME in data fusion and image processing problems; for more details, see the applications section below and references therein.
Various extensions of the BME theory and techniques are possible. These extensions include the generalized BME (GBME) group of techniques that involve S/TRF-\nu/\mu models and directly account for heterogenous space-time patterns and non-Gaussian data distributions (e.g., [YU 06 and YU 07a]); the stochastic logic space-time predictors [CHR 02b]; and the incorporation of Spartan random field models in which space-time dependence can be represented by means of physically or intuitively motivated "interactions" instead of the data-driven covariance matrix [HRI 03, ELO 08].
Figure 6.12. Belgian texture triangle. Each point on the triangle refers to a specific composition of sand (50 µm–2 mm), silt (2–50 µm) and clay (< 2 µm) [DOR 03]
6.4.2. A methodological outline

Conceptually, BME constitutes an important component of multi-disciplinary knowledge synthesis in the context of epistematics [CHR 08]. The latter institutes a broad framework in which different sets of mental entities describing constituent phenomena in the individual disciplines are integrated in order to solve (describe, explain and predict) the composite real-world problems. Naturally, BME developments (theoretical and applied) have followed a variety of paths, depending on the scientific discipline considered. The BME theory is continually considered within wider conceptual frameworks, whereas the BME, GBME, etc. techniques have been successfully used in real-world studies in a variety of scientific disciplines. Figure 6.13 briefly outlines the various BME components that provide the means for rigorous quantitative assessments (generating predictions, assessing space-time dependence, characterizing uncertainty, etc.); for establishing a general integration framework for multidisciplinary KB (scientific, cultural, social, economic, etc.); and for clarifying the underlying argumentation modes (taxonomic, analogical, mathematical, experimental, etc.).
As is the case with any dynamic conceptual system, a number of significant challenges are associated with BME theory including the following:
Figure 6.13. An outline of BME (methodology-modeling-implementation)
(a) what is the nature of space-time? Is space-time like a canvas that exists whether or not the artist paints on it, or is space-time akin to parenthood that does not exist until there are parents and children? A related question is the "asymmetry of time": is time's asymmetry a property of states of the world rather than a property of time as such? See, also, [CHR 06];
(b) how can BME account for differences having to do with the way each scientific discipline communicates knowledge? Physical sciences use mainly mathematical formulae and models to express conceptual, observational and experimental results. In humanistic disciplines there is little resort to mathematical formulae – chiefly, reliance is placed upon analogy and metaphor [CHR 08];
(c) what knowledge bases are most reliable and/or important? To address this question is to ask for a classification of types of knowledge, a ranking of these types by reference to some reliability/value standards, and a meaningful uncertainty characterization (conceptual vs. technical, etc.);
(d) how can the effectiveness of the BME technology be improved to allow the implementation of a large number of theoretical results currently available? Significant developments along these lines include the work of [SER 01, KOL 06, and YU 07a];
(e) is a further unification and generalization possible as regards the various BME concepts and techniques developed by different research and development groups over the years? See, for example, the stochastic logic generalizations discussed in [CHR 02b].

6.4.3. Implementation of BME: the SEKS-GUI

Implementation of the BME concepts and techniques requires the development of software packages of spatiotemporal analysis and mapping. One such package is the SEKS-GUI software library (Spatiotemporal Epistematics Knowledge Synthesis and Graphical User Interface; see [KOL 06, YU 07a]). The SEKS-GUI includes:
1. stochastic models of composite space-time X_p variation, uncertainty representation and data assimilation (Table 6.9).
Spatiotemporal Dependence | Uncertainty Representation | Data Assimilation
Homogenous-stationary: ordinary S/TRF theory | Gaussian probability laws | Bayesian conditional rules (operational)
Non-homogenous/non-stationary: generalized S/TRF theory | Non-Gaussian probability laws | Non-Bayesian adaptation principles (stochastic logic)

Table 6.9. A categorization of stochastic models by spatiotemporal dependence, uncertainty representation and data assimilation criteria [YU 07a]
2. a list of spatiotemporal covariance models of X_p for systems with different types of space-time dependence structures (Table 6.10). In addition to the models mentioned in Table 6.10, there are other classes of models that can be included in the SEKS-GUI framework [KOL 04];
3. varying levels of attribute space-time orders (\nu, \mu) expressing heterogeneities in the spatial and temporal patterns of X_p (Table 6.10);
4. mapping techniques with attractive features – from a theoretical and an applied perspective – Table 6.11.
Table 6.9 categorizes the stochastic models providing the theoretical support of the SEKS-GUI in terms of spatiotemporal dependence, uncertainty representation and data assimilation criteria. SEKS-GUI includes the classical BME and the GBME techniques. GBME directly accounts for heterogenous and non-Gaussian data distributions.
Spatiotemporal Dependence | Separable Models | Non-separable Models
Homogenous/stationary | \sum_q c_{r;p,q} \cdot \sum_{q'} c_{\tau;p',q'} | \sum_q c_{r;p,q}\, c_{\tau;p',q}; \; c_{h,u\tau;p}
Non-homogenous/non-stationary | c_{r;p,1}\, c_{\tau;p',1} | \kappa_{r,\tau;\nu,\mu}

Note: the c_{r;p,q} denote spatial exponential (p = 1), Gaussian (p = 2), spherical (p = 3), sine (p = 4), cosine (p = 5), Mexican-hat (p = 6) and nugget-effect (p = 7) models (q = non-negative integer); the c_{\tau;p',q} denote temporal exponential (p' = 8), Gaussian (p' = 9), spherical (p' = 10), sine (p' = 11), cosine (p' = 12), Mexican-hat (p' = 13) and nugget-effect (p' = 14) models. The c_{h,u\tau;p} denote space-time exponential (p = 1), Gaussian (p = 2) and polynomial-exponential (p = 3) models; u is a vector parameter. The \kappa_{r,\tau} is the polynomial generalized covariance of orders \nu, \mu.

Table 6.10. Examples of space-time covariance models
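The distinction between separable and non-separable space-time models listed in Table 6.10 can be made concrete with a short Python sketch (not from the book; the specific functional forms and parameter values are illustrative assumptions): a separable model is built as the product of a purely spatial and a purely temporal component, whereas in a non-separable model the two lags interact inside one expression.

```python
import numpy as np

def c_spatial_exp(r, a=2.0):          # spatial exponential component
    return np.exp(-r / a)

def c_temporal_gauss(tau, b=3.0):     # temporal Gaussian component
    return np.exp(-(tau / b) ** 2)

def c_separable(r, tau):
    # separable: product of purely spatial and purely temporal components
    return c_spatial_exp(r) * c_temporal_gauss(tau)

def c_nonseparable(r, tau, a=2.0, b=3.0):
    # non-separable: space and time lags interact inside a single expression
    return np.exp(-np.sqrt((r / a) ** 2 + (tau / b) ** 2))

r, tau = np.meshgrid(np.linspace(0, 10, 101), np.linspace(0, 10, 101), indexing="ij")
print(c_separable(r, tau).shape, c_nonseparable(r, tau).shape)
```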
Noteworthy elements of SEKS-GUI are its generalization powers that account for space-time heterogenous dependence, non-Gaussian probability laws, nonlinear space-time predictors, and interdisciplinary data assimilation. Spatial and temporal variations are interrelated, as the geographical propagation of an attribute may also be affected by temporal mechanisms. The GBME technique directly accounts for heterogenous and non-Gaussian data distributions. Numerical codes have been bundled into a user interface creating an easy-to-use SEKS-GUI framework that addresses the needs of users with multidisciplinary backgrounds and free of any programming requirements. The spatiotemporal modeling codes are unified within the GUI framework such that the SEKS-GUI comprehensively features:
1. a user-friendly interface for space-time modeling and mapping using a series of screens. These screens facilitate space-time modeling and mapping in a way that allows users to control each substantive step of their investigation; for illustration, a few examples of such screens are shown in Figures 6.14–6.19;
Site-specific knowledge assimilated:
Heterogenous, in general. (Homogenous-stationary are special cases.)
Hard data and soft (uncertain) information.
Predictor:
Prediction maps:
Nonlinear, in general. (Linear is special case.)
Complete pdf at each grid point. Joint pdf at several points. Mean, mode and median at each point.
Underlying probability laws:
Accuracy maps:
Non-Gaussian, in general. (Gaussian is special case.)
Error variance, std. deviation, skewness, and confidence intervals.
General (core) knowledge processed:
Forthcoming:
Theoretical models, scientific laws and empirical relations.
Multi-attribute (Vector) mapping. Functional (change-of-support) mapping.
Table 6.11. Some features of the BME and GBME mapping techniques [YU 07a]
278
Advanced Mapping of Environmental Data
Figure 6.14. A screenshot of the soft data wizard in SEKS-GUI, at the stage of providing the site-specific knowledge [KOL 06]
Figure 6.15. A screenshot of the covariance analysis phase (BME version). The plot displays the experimental covariance derived from data (circles connected with larger tiles), and the covariance model fitted to them (surface with semi-transparent smaller tiles); [KOL 06]
Bayesian Maximum Entropy – BME
279
Figure 6.16. A screenshot of the prediction phase for the GBME version [KOL 06]
Figure 6.17. Screenshot of the visualization phase using the BME version. The map displays the predicted total ozone means at the user-defined output grid nodes on July 9, 1998
280
Advanced Mapping of Environmental Data
Figure 6.18. Screenshot of the visualization stage using the BME version. The map displays a collection of predicted total ozone pdf on July 9, 1998; the pdf shown are all on the same scale at selected output locations
Figure 6.19. A screenshot of the visualization phase for GBME. At every grid node the map displays the difference Q P of the spatial and temporal heterogenity orders at this node
Bayesian Maximum Entropy – BME
281
2. built-in functions for any intermediate step so that users need not handle individual library functions nor connect the processing stages of the software. Since these steps appear seamless, users can concentrate on substantive modeling activities; 3. a complete graphics-based environment for spatiotemporal modeling that offers significant flexibility in providing the input, deciding the investigation course, choosing from amongst an array of available predictions, and selecting from a broad variety of output visualization options. The above features 1-3, combined with the unique characteristics of the BME knowledge synthesis methodology implemented in the software libraries that support the interface, render SEKS-GUI an innovative addition to the spectrum of data processing and interpretation tools of modern spatiotemporal analysismodeling-visualization technology, including temporal geographical information systems (temporal GIS [CHR 02c]). A detailed description of SEKS-GUI can be found in [YU 07a]. Additionally, a comprehensive “User’s Manual” is available (see: http://geography.sdsu.edu/Research/Projects/SEKS-GUI/SEKS-GUI.html) that addresses a broad audience ranging from the novice spatiotemporal modeling user to the field expert [KOL 06]. 6.5. A brief review of applications BME techniques have been implemented in a long list of real-world cases studies and in a variety of disciplines, including physical and medical geography, human exposure, earth and atmospheric sciences, environmental engineering, epidemiology, health sciences, risk assessment, and decision analysis. Only a limited number of case studies are briefly reviewed below; for more information, the interested reader is referred to the rich BME literature.
282
Advanced Mapping of Environmental Data
6.5.1. Earth and atmospheric sciences Applications of BME analysis and modeling in earth and atmospheric sciences include (but are not limited to) the following real-world case studies: – To generate high-resolution maps of total ozone ( TO3 ) distributions over the USA; [CHR 04]. The high natural variability of ozone concentrations and the different levels of accuracy of the algorithms used to generate data from remote sensing instruments introduce major sources of uncertainty that cannot be confronted satisfactorily by means of conventional data analysis and interpolation techniques. BME successfully processed datasets generated by measuring instruments on board the Nimbus 7 satellite. In addition to exact ozone data, uncertain measurements and secondary information were used in terms of “ TO3 tropopause pressure” empirical relationships. The TO3 analysis took into consideration major sources of error in the TOMS/SBUV TOR and produced high spatial resolution maps that were more accurate and informative than those obtained by conventional interpolation techniques; see Figures 6.20 and 6.21. – Establishing an operational Temporal GIS for the systematic air quality assessment over the major Cairo area, Egypt [SER 01]. The proposed framework addressed several critical issues, including the high variability of data in space and time, topographical effects, the combination of various bodies of hard and soft information (air quality reports, meteorological databases etc.), the local pollution standards, and the possibility of implementing cost-effective prevention measures. – In the spatiotemporal mapping of total carbon stock in agroforestry systems of Sub-Saharan Africa, where the multivariate BME method yielded a more reliable prediction than the co-kriging techniques; see [QUE 07]. – In data fusion and image pan-sharpening applications; [FAS 07a and b, BOG 07]. A weighting parameter was used for balancing spectral and spatial information, thus enhancing the versatility of the method with respect to the context and the user’s needs (e.g., photo-interpreters may favor image sharpness, whereas automated procedures may require a better color consistency). The method is fast when compared to other techniques (e.g., wavelet-based techniques). In addition to IKONOS image pan-sharpening, the method can be used for optical/SAR image fusion and for hyperspectral fusion.
Bayesian Maximum Entropy – BME
283
Figure 6.20. BME predictions of TO3 (in DU) for July 7-10, 1988. Hard data locations for each day are shown as triangles. Soft data were also available at other locations (see [CHR 04])
– In the spatial analysis and prediction of censored soil variables in the Ivybridge area (Devon, UK) using soft (imprecise) data; see, [ORT 07a and b]. Different methods were compared: ordinary kriging, BME1 (the generalized least squares technique was used to extract the mean from the data) and BME2 (the maximum likelihood technique was used to fit the local mean). The methods were compared in terms of their predictions of soil depths over solid parent material using censored data; see Figure 6.22. The results showed that BME2 provided the most accurate predictions; the degree of improvement depended on the parameters of the spatial covariance model. – In geophysical assimilation of various forms [CHR 98a, CHR 05a]. Assimilation modeling is based on a conceptual framework where the model describes incomplete knowledge about nature and focuses on mechanisms and modes of scientific thinking. This approach can lead to more realistic representations of the geophysical situation than conventional data assimilation where the model supposedly describes nature and focuses on form manipulations.
284
Advanced Mapping of Environmental Data
Figure 6.21. Scattergram of TOMS data fluctuations against BME predictions of TO3 (in DU) for July 7-10, 1988; the perfect correlation line is also shown for comparison [CHR 04]
(a)
(b)
(c)
Figure 6.22. Maps showing the differences between the predicted soil depths using the OK, BME1 and BME2 methods; (a) BME1–OK, (b) BME2–OK and (c) BME2–BME1 [ORT 07b]
Bayesian Maximum Entropy – BME
285
– To generate estimates of horizontal hydraulic conductivity at the KirkwoodCohansey aquifer that has been identified as a critical source for meeting existing and expected water supply deficits for southern New Jersey, USA [VYA 04]. The study involved the compilation-geocoding of existing data and the integration of actual measurements with soft information on likely conductivity ranges to estimate hydraulic conductivity maps. Estimation error maps provided insight into the uncertainty associated with the conductivity estimates, and indicate areas where more information on hydraulic conductivity is required. – To combine continuous and categorical (qualitative) spatial information in a mapping context; [BOG 02a and b, BOG 04a and b), WIB 06]; see Figure 6.23. The approach relies on a definition of a mixed random field that can account for stochastic links between categorical and continuous random fields through the use of cross-covariance functions. Adding categorical information can significantly improve the prediction of continuous attributes. – In heterogenous porous media upscaling [YU 05]. Numerical experiments involved effective conductivities in bounded two-dimensional spatial domains, and the results were compared to previous upscaling solutions. In addition to dealing with new and more general upscaling situations, the proposed BME-based upscaling approach reproduced well-known results, a fact that further demonstrated its power and nesting capabilities. – In energy studies of the Nea Kessani geothermal field, Greece [YU 07b]; see, e.g., Figure 6.24. Temperature maps are: (a) composite, in the sense that apart from being consistent with the heat transfer law, they also account for multi-sourced uncertainty of the model parameters and the site-specific information at a set of vertical drill holes; and (b) complete, i.e. the whole temperature probability density is generated at each location. By means of comparative analysis, it is shown that the composite maps are more informative than the maps obtained using conventional methods (the composite solution offers a more realistic representation of the realworld phenomenon and, unlike the previous methods, it is in agreement with empirical quartz geothermometry analyses). – In the mapping of soil salinity, which is a major hazard to agriculture [DOU 04, DOU 05]. The study included two datasets: one consisting of field salinity measurements (ECa) at 413 locations and 19 time instants; and another containing, in addition to ECa, salinity determined in the laboratory (EC2.5) at 13-20 locations. Using cross-validation, the performance of three prediction methods was compared in a space-time domain, as follows:
286
Advanced Mapping of Environmental Data
Figure 6.23. Prediction maps of a continuous attribute using (a) simple kriging and (f) residual kriging; BME prediction maps when categorical information is available (b) at 50 continuous data locations, (c) at an extra set of 50 locations, (d) at an extra set of 350 locations, and (e) exhaustively (black and white correspond to the lowest and highest values, respectively) [WIB 06]
1. kriging using hard data (denoted as HK), 2. kriging using hard and mid-interval soft data (denoted as HMIK), and 3. BME using probabilistic soft data. BME was less biased, more accurate, and it also gave estimates that were better correlated with the observed values than the estimates of the kriging techniques. Furthermore, BME allowed us to delineate saline from non-saline areas with more detail.
Bayesian Maximum Entropy – BME
287
(a)
(b)
(c) Figure 6.24. Horizontal temperature distributions (ºC) at 300 m depth (Nea Kessani, Greece) generated by (a) BME (circles denote hard temperature data – triangle interval data), (b) a standard numerical technique, and (c) an empirical method. BME map is superior to the other two maps; the BME map includes temperatures of 110ºC in agreement with geothermometry studies, whereas the temperatures in the other two maps do not exceed the maximum measured temperature of 80ºC [YU 07b]
288
Advanced Mapping of Environmental Data
– In the study of the Equus Beds aquifer, which is an alluvial deposit near Wichita city, Kansas [SER 99a]. A well-field was installed in the aquifer to supply water to the city. Ground water pumping and droughts (during the 1950s and late 1980s) resulted in a substantial decline of water-levels over a large area. This decline motivated regulatory agencies to monitor the water-levels using a network of groundwater observation wells. However, because of recording errors and the difficulties of accurately measuring fluctuating water-levels in a pumping well-field, the information available consisted of a combination of hard and soft (uncertain) data. BME incorporated both hard and soft data in order to produce accurate waterlevel spatiotemporal maps and perform a reliable error assessment. The maps improved the hydrogeologic understanding of the region and optimized local decision-making regarding the operation of the Wichita well-field. – The study by [QUI 04] showed that improved mean ocean surface wave characteristics can be obtained at global and local scales using BME to handle relatively coarse altimeter sampling, and that TOPEX/Poseidon and Jason-1 altimeters can be merged to provide altimeter mean wave period fields with a better resolution. Altimeter mean wave period estimates were compared with the WaveWatch-III numerical wave model to illustrate their usefulness for wave models tuning and validation.
Figure 6.25. BME maps of surface water PCE concentrations on April 15, 2002 (a) over the entire state of New Jersey and (b) over an area restricted to WMA05 [AKI 07]
– To integrate physical laws with different types of site-specific data and auxiliary site conditions in the prediction of hydrogeologic variables and chemical concentration distributions across space-time. A variety of physical laws have been studied using the BME methodology, including Darcy’s law, the advection-reaction law and the heat transfer law [SER 99b, CHR 99a, CHR 99b, KOL 02, SER 03 and PAP 06].
Bayesian Maximum Entropy – BME
289
– In the non-attainment assessment of surface water tetrachloroethene (PCE) in the state of New Jersey, USA; [AKI 07]; e.g., Figure 6.25. Due to budget and scientific limitations, the sampling data is insufficient to assess all river streams in the state. To address this problem, the space-time PCE concentrations were estimated throughout all river reaches in New Jersey (1999-2003) and their evolvement over time was studied. A non-attainment assessment analysis was conducted that identified the river miles that were highly likely in non-attainment of the standard, those that were highly likely in attainment of the standard, and the remaining labelled as non-assessed. Watershed management areas with contamination problems were identified. – To study the inverse hydrologic problem in heterogenous aquifers; see [SER 03c]. BME offered an efficient solution to the inverse problem by first assimilating various physical knowledge sources (subsurface flow law, water table elevation data, uncertain hydraulic resistivity measurements, etc.) and then producing robust estimates of the subsurface parameters across space. In addition, the optimal distribution of hard and soft data needed to minimize the associated estimation error at a specified sampling cost was determined. – To represent particulate matter (PM10) distributions in the state of California, USA; [CHR 01]. The study provided a complete PM10 characterization in terms of the map of pdf of PM10 across space-time. PM10 estimates were chosen that offered an appropriate representation of the real distribution in space-time and a meaningful assessment of the representation accuracy. Depending on the space-time scales considered, the PM10 distributions depicted considerable levels of variability, which may be associated with topographic features, climatic changes, seasonal patterns and random fluctuations. The importance of integrating secondary information at surrounding sites and the estimation points themselves was discussed. Areas were identified where the annual PM10 geometric mean reached or exceeded the California standards, which is valuable information for regulatory purposes. – To generate an accurate spatiotemporal representation of Phoenix’s urban heat island (UHI) process, which is a critical component in understanding the process and its relationship to energy and water use, urban design features and ecosystem patterns; see [LEE 07a]. BME was used to the UHI to account for data uncertainty from missing records, retrieve and map minimum temperature observations over time from historical weather station networks, and test mapping accuracy compared to traditional maps that do not account for data uncertainty; e.g., Figure 6.26. The results showed that BME leads to increases of mapping accuracy (up to 35.28% over traditional linear kriging). Synthetic case studies confirmed that substantial increases in mapping accuracy occur when there are many cases of missing or uncertain data. Use of BME reduces the need for costly sampling protocols and produces UHI maps that can be integrated with other data about human and environmental processes in urban sustainability studies.
290
Advanced Mapping of Environmental Data
Figure 6.26. Estimation maps of monthly mean of minimum temperatures (June, 1998) using (A) spatial simple kriging, (B) space-time simple kriging, and (C) spatiotemporal BME. The number near the hardened and soft data points indicates the number of days sampled [LEE 07a]
Bayesian Maximum Entropy – BME
291
– To map local-scale estimates of water use in Maricopa County (Arizona) based on data aggregated to census tracts and measured only in the city of Phoenix; see [LEE 07b]. Accurate representation of regional water use by means of such maps is an important factor in the context of urban growth and climate variability studies. However, it is a challenging affair, because water use data are often unavailable, and when available, they are geographically aggregated to protect the identity of individuals. Different types of data uncertainty sources were considered (extrapolation and downscaling processes) and (soft) data were generated that accounted for the uncertainty sources. The results ascertained that BME is a theoretically sound soft data assimilation approach that leads to an increased mapping accuracy over classical spatial statistics methods. The analysis provided useful knowledge on local water use variability in the whole county that was further applied to the understanding of causal factors of urban water demand. – To estimate soil properties from thematic maps and texture mapping, and the continuous valued reconstruction of maps of various kinds; see [BOG 02a and b], and [DOR 03]. Thematic maps are one of the most common tools for representing the spatial variation of a variable. They are easy to interpret, thanks to the simplicity of presentation: clear boundaries define homogenous areas. However, when the variable is continuous, abrupt changes between cartographic units are often unrealistic and the intra-unit variation is hidden behind a single representative value. In many applications, such non-natural transitions are not satisfactory, including the poor precision of such maps. As additional samples are often cost prohibitive, we should try to use the information in the available map to evaluate the spatial variation of the variable under study. BME can achieve such a goal using only the vague (soft) information in the map. BME was compared to a method frequently used in soil sciences: the legend quantification method. It was shown, first by means of a simulated case study that the use of BME increased noticeably the precision of the estimates. The resulting BME maps had smooth transitions between mapping units, which conformed to the expected behavior of continuous variables. These observations were subsequently corroborated in a real case study where the sand, silt and clay contents in soils had to be estimated from a soil map. 6.5.2. Health, human exposure and epidemiology BME techniques have been also used in a number of studies in human exposure and health sciences, including the following: – Modeling the space-time distribution of particulate matter (PM10) in Thailand and the generation of informative spatiotemporal PM10 maps during the most polluted day of each year (1998-2003) [PUA 07]: the map of the predicted daily PM10 (Figure 6.27a), the map of the associated prediction error (Figure 6.27b), and the non-attainment map showing areas where PM10 values did not attain a 68%
292
Advanced Mapping of Environmental Data
probability of meeting the ambient standard. These maps provided valuable information for air quality management purposes: developing and evaluating strategies to abate PM10 levels, identifying unhealthy zones (especially for sensitive populations such as asthmatic children, seniors or those with cardiopulmonary disease); and optimizing the pollution monitoring network.
(a)
(b)
Figure 6.27. Maps of (a) the BME median estimate and (b) the normalized prediction error variance of daily average PM10 in Thailand on the most polluted day in 1998 [PUA 07]
– The cost-effective water quality assessment at the Catawba River reservoir system (western North Carolina), where integration of modeling results via BME reduced the uncertainty of reservoir chlorophyll predictions. These predictions were used to illustrate the cost savings achieved by less extensive and rigorous monitoring methods within the BME framework [LOB 07]; see Figure 6.28. – The evaluation of the effects of climate change on influenza risk in California, USA [CHO 06, 08]. Maps of risk variation during El Niño differed from those during normal weather, the corresponding covariances exhibited distinct space-time dependence features, and the temporal mean mortality profiles were considerably higher during normal weather than during El Niño. BME analysis offered a methodological framework to evaluate public health management strategies.
Bayesian Maximum Entropy – BME
Stage
Step A
Structural Stage
Prior pdf derived from G-KB that includes the covariance model determined analytically via physical laws or empirically via an existing dataset.
293
Illustration c X (r, W ) V X e 3r / a r [ae 3W / a t1 (a 1)ae 3W / a t2 ]
Covariance of parameter value between two points defined as a function of the spatial distance r and temporal lag W .
Hard data (if any) identified. Illustration shows monitoring data for one location over time.
Specificatory Stage
Water Quality Parameter
B
D
E
Time ---->
Soft data also obtained from model predictions. Illustration shows mean estimate along with a confidence interval for a point prediction. General knowledge reconciled with site-specific data. In special case of homogenous/stationary mean and covariance with hard (monitoring) data, the BME reduces to ordinary kriging.
Time ---->
Integration of model results (soft data) with hard data. Uncertainty is reduced and greater resolution is obtained.
Water Quality Parameter
F
Time ---->
G Integration of uncertain monitoring data with hard data and model predictions. Uncertainty is reduced near soft data space-time locations.
Water Quality Parameter
Integration Stage (3 steps are shown independently. In fact, BME performs integration simultaneously)
Soft data identified for space/time locations where hard data is absent. Illustration shows three measurements with uncertainty expressed as probability distribution.
Water Quality Parameter
C
Water Quality Parameter
Time ---->
Time ---->
Figure 6.28. An illustration of the BME stages implemented in the water quality study of the Catawba River reservoir system in western North Carolina [LOB 07]
– The study of multi-scale data features and their effects on the estimation of space-time mortality distributions in California, USA [CHO 03]. Artificial effects were filtered out using soft information at the data points themselves. The generated BME map displayed more variability at the local scale, and the contour lines of mortality rate followed more the outline of county boundaries (for which the information was collected), rather than the centroid locations (which were arbitrary choices). Accuracy measures demonstrated that the multiscale approach offered more accurate mortality predictions at the local scale than existing approaches that did not account for scale effects.
– The characterization of the geographical dependence of contaminant exposure in south-west Russia due to the Chernobyl fallout (Ukraine). The extent and magnitude of radioactive soil contamination by 137Cs was estimated, which allowed incorporation of a variety of knowledge bases leading to improved prediction accuracy and informative soil contamination maps; see [SAV 05, PAR 05].
– The determination of the space-time extent of lead contamination at the Cherry Point Air Force site (North Carolina) and the corresponding health impact to nearby communities; see [AUG 02]. A composite lead dataset spread out over 14 years of sampling was used. The study analyzed the neurological impairment (depression in arithmetic ability in children) and lung cancer effects due to lead. It aimed at developing a general exposure and health effect assessment framework that is flexible enough to consider other contaminants of concern at Superfund sites. The study framework included demographic information and generated estimates of the population impact due to contamination exposure.
– The study of causal associations between environmental exposure and health effects in the state of North Carolina, by synthesizing sources of physical exposure and population health knowledge [CHR 00c]. The strength and consistency of the exposure-effect association were evaluated on the basis of health effect predictions that the combined physical-health analysis generated in space-time. Potential confounders were accounted for in the quantitative analysis, which resulted only in a slightly different strength in the reported mortality-temperature association.
– The analysis and mapping of syphilis distributions in Baltimore (USA) with the purpose of optimizing intervention and prevention strategies; see [LAW 06]. Covariance plots indicated that the distribution of the density of syphilis cases exhibited both spatial and temporal dependence. Disease maps suggested that syphilis increased within two geographic core areas of infection and spread outwards; see Figure 6.29. A new core area of infection was established to the northwest. As the outbreak waned, the density diminished and receded in all core areas. Morbidity remained elevated in the two original central and new northwestern core areas after the outbreak.
Figure 6.29. Yearly changes (1994–2002) in the spatial distribution of syphilis infection density in Baltimore (Maryland, USA). The composite space-time BME analysis was used to produce spatially and temporally dependent maps. All maps share the same scale ranging from a minimum of 0 cases per km2 to a maximum of 60 cases per km2 [LAW 06]
– The health effects of ozone exposure in the eastern USA [CHR 99b]. Spatiotemporal exposure distributions were generated and provided the input to toxicokinetic laws linked to population impact models that, in turn, were integrated with relationships describing how health effects are distributed across populations. The analysis helped health scientists and administrators derive valuable conclusions about the expected health impact on specific population cohorts within a geographical area and time period.
– The study of lifetime population damage due to exposure to arsenic in drinking water across Bangladesh [SER 03b]. BME provided the means to assimilate a variety of knowledge bases (physical, epidemiologic, carcinogenetic and demographic) and uncertainty sources (soft data, measurement errors and secondary information). Maps of the naturally occurring arsenic distribution in Bangladesh drinking water were generated. Global indicators of the adverse health effects on the population were derived (e.g., Figure 6.30), and valuable insight was gained by blending information from different scientific disciplines.
Figure 6.30. Bladder cancer maps (number of cases per km2) of Bangladesh using the empirical exposure-response (linear) model and the multistage carcinogenetic (nonlinear) model. In both cases, the results indicated an increased lifetime bladder cancer probability for the population due to arsenic [SER 03b]
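The two model families compared in Figure 6.30 can be sketched in generic form. The chapter does not reproduce the parameterizations fitted in [SER 03b]; the snippet below only contrasts an empirical linear excess-risk term with the commonly used multistage form P(d) = 1 - exp(-(q0 + q1 d + ... + qK d^K)), using purely illustrative coefficients.

```python
import math

def linear_excess_risk(dose: float, beta: float) -> float:
    """Empirical linear exposure-response: excess lifetime probability ~ beta * dose."""
    return min(1.0, beta * dose)

def multistage_risk(dose: float, q: list) -> float:
    """Multistage carcinogenetic model: P(d) = 1 - exp(-(q0 + q1*d + q2*d^2 + ...))."""
    poly = sum(qk * dose ** k for k, qk in enumerate(q))
    return 1.0 - math.exp(-poly)

# Illustrative coefficients only; they are NOT the values fitted in [SER 03b].
beta = 5.0e-4                    # excess risk per unit arsenic concentration
q = [0.0, 2.0e-4, 1.0e-6]        # multistage coefficients q0, q1, q2

for dose in (10.0, 50.0, 200.0): # hypothetical arsenic concentrations in drinking water
    print(dose, linear_excess_risk(dose, beta), multistage_risk(dose, q))
```

The multistage form is nonlinear in dose, whereas the empirical model scales linearly with it; this is the qualitative distinction behind the two maps in Figure 6.30.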
– The estimation of residential-level ambient particulate matter (PM2.5 and PM10) and ozone exposures at multiple time-scales in North Carolina, USA (Figure 6.31), and the study of the health effects of air pollution on lupus [YU 07c]. Since the spatiotemporal estimation of long-term exposure in residential areas on the basis of air quality system observations may suffer from missing data, due to scarce monitoring across space and inconsistent monitoring periods at different geographical locations, the study developed two upscaling methods: data aggregation followed by exposure estimation, and exposure estimation followed by aggregation (a schematic sketch of the two orderings is given after Figure 6.31). The methods were applied at multiple temporal scales of particulate matter and ozone exposure estimation in the residential areas considered in the health study.
Figure 6.31. Spatiotemporal map of BMEmean estimates of PM2.5/PM10 on (a) August 25, (b) August 31 and (c) September 6 (1996) [YU 07c]
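As announced above, the sketch below contrasts the two upscaling orderings: aggregating station data over the averaging period before estimating exposure at a residential location, versus estimating exposures per day and then aggregating the estimates. The estimate_exposure placeholder (a plain mean of station values) and the toy observations are assumptions made for the example only; the actual study in [YU 07c] used BME space/time estimation.

```python
import statistics
from typing import Dict, List, Tuple

# Daily observations per monitoring station: {station_id: [(day, value), ...]}
Obs = Dict[str, List[Tuple[int, float]]]

def aggregate_daily_to_period(series: List[Tuple[int, float]]) -> float:
    """Temporal aggregation: mean over the averaging period (e.g. a month)."""
    return statistics.mean(v for _, v in series)

def estimate_exposure(values_by_station: Dict[str, float]) -> float:
    """Placeholder space/time estimator at a residential location.
    A real study would use a BME (or other geostatistical) predictor here;
    a plain mean of the available station values stands in for it."""
    return statistics.mean(values_by_station.values())

def aggregate_then_estimate(obs: Obs) -> float:
    """Method 1: aggregate each station's daily data first, then estimate exposure."""
    period_means = {sid: aggregate_daily_to_period(series) for sid, series in obs.items()}
    return estimate_exposure(period_means)

def estimate_then_aggregate(obs: Obs) -> float:
    """Method 2: estimate daily exposures first, then aggregate over the period."""
    days = sorted({day for series in obs.values() for day, _ in series})
    daily_estimates = []
    for day in days:
        values = {sid: v for sid, series in obs.items() for d, v in series if d == day}
        if values:                       # skip days without any observation
            daily_estimates.append(estimate_exposure(values))
    return statistics.mean(daily_estimates)

# Toy PM2.5-like records; station "B" misses day 2, as happens in sparse networks
obs: Obs = {"A": [(1, 12.0), (2, 15.0), (3, 11.0)],
            "B": [(1, 20.0), (3, 18.0)]}
print(aggregate_then_estimate(obs), estimate_then_aggregate(obs))
```

With complete data and a linear estimator the two orderings would coincide; they differ precisely when monitoring is scarce or inconsistent over time, which is the situation described above.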
Figure 6.32. Total geographical area in Europe infected by Black Death at different times, denoted in black (from [CHR 07])
– The comparative study of space-time patterns and geographical propagation dynamics of major epidemics, such as the Black Death epidemic in 14th century Europe and the bubonic plague in late 19th-early 20th century India; see [CHR 05c, CHR 07, WAN 05, YU 06]. For the first time, a series of detailed space-time maps of important characteristics of the two epidemics (mortality, infected area propagation, centroid evolution, etc.) was obtained (e.g., Figures 6.32 and 6.33). The maps integrated a variety of interdisciplinary knowledge bases, generating a comparative epidemic modeling framework that led to a number of interesting findings. Epidemic indicators confirmed that Black Death mortality was two orders of magnitude higher than that of bubonic plague. Modern bubonic plague is a rural disease that typically devastates small villages in the countryside, whereas the Black Death indiscriminately attacked both large urban centers and the countryside. The epidemics had opposite areal extension features in response to annual seasonal variations. During the Indian epidemic, the disease disappeared and reappeared several times at certain locations; in Europe, once the disease entered a place, it lasted for a time proportional to the population and then disappeared for several years. On average, the Black Death was much faster than bubonic plague to reach virgin territories, despite the fact that India is slightly larger in area than Western Europe and had a railroad network almost instantly moving infected rats, fleas and people from one end of the subcontinent to the other.
These findings throw new light on the epidemics and need to be taken into consideration in the discussion concerning the two devastating diseases and the lessons learned from them.

In this section, an attempt was made to communicate how BME analysis and modeling, with its full conceptual and technical beauty, can be applied across disciplines. For this purpose, a series of case studies was discussed that span a variety of disciplines.
Figure 6.33. Space-time mortality rate maps (per thousand) of bubonic plague in India during 1902-1903 [YU 06]
For a more thorough and detailed discussion of the concepts, techniques and real-world case studies presented above, the reader is encouraged to consult the original sources.

6.6. References

[AKI 07] AKITA Y., CARTER G. and SERRE M.L., “Spatiotemporal non-attainment assessment of surface water Tetrachloroethene in New Jersey”, J. of Environmental Quality, vol. 36(2), p. 508-520, 2007.
[AUG 02] AUGUSTINRAJ A., A Study of Spatiotemporal Health Effects due to Water Lead Contamination, MS Thesis, Dept. of Environ. Sci. and Engin., Univ. of North Carolina, Chapel Hill, NC, 2002.

[BOG 96] BOGAERT P., “Comparison of kriging techniques in a space-time context”, Mathematical Geology, vol. 28, p. 73-86, 1996.

[BOG 02a] BOGAERT P., “Spatial prediction of categorical variables: the BME approach”, Stochastic Environ. Research and Risk Assessment, vol. 16, p. 425-448, 2002.

[BOG 02b] BOGAERT P. and D’OR D., “Estimating soil properties from thematic soil maps – The BME approach”, Soil Science Soc. of America Journal, vol. 66, p. 1492-1500, 2002.

[BOG 04a] BOGAERT P., “Predicting and simulating categorical random fields: the BME approach”, Proceed. of the 1st Intern. Conf. for Advances in Mineral Resources Management & Environ. Geotechnology (AMIREG 2004), p. 119-126, 2004.

[BOG 04b] BOGAERT P. and WIBRIN M.A., “Combining categorical and continuous information within the BME paradigm”, in Proceed. GeoEnv V-Geostatistics for Environmental Applications, Neuchatel, Switzerland, October 13-15, 2004.

[BOG 07] BOGAERT P. and FASBENDER D., “Bayesian data fusion in a spatial prediction context: a general formulation”, Stochastic Environmental Research and Risk Assessment, vol. 21, p. 695-709, 2007.

[CHO 03] CHOI K-M, SERRE M.L. and CHRISTAKOS G., “Efficient mapping of California mortality fields at different spatial scales”, J. of Exposure Analysis & Environmental Epidemiology, vol. 13, p. 120-133, 2003.

[CHO 06] CHOI K-M, CHRISTAKOS G. and WILSON M.L., “El Niño effects on influenza mortality risks in the state of California”, J. Public Health, vol. 120, p. 505-516, 2006.

[CHO 08] CHOI K-M, YU H-L and WILSON M.L., “Spatiotemporal analysis of influenza mortality risks in the state of California during the period 1997-2001”, Stochastic Environmental Research and Risk Assessment, 2008, available online, DOI 10.1007/s00477-007-0168-4.

[CHR 84] CHRISTAKOS G., “On the problem of permissible covariance and variogram models”, Water Resources Research, vol. 20, p. 251-265, 1984.

[CHR 90a] CHRISTAKOS G., “Random Field Modelling and its Applications in Stochastic Data Processing”, Applied Sciences, PhD Thesis, Harvard University, Cambridge, MA, 1990.

[CHR 90b] CHRISTAKOS G., “A Bayesian/maximum-entropy view to the spatial estimation problem”, Mathematical Geology, vol. 22, p. 763-776, 1990.

[CHR 91a] CHRISTAKOS G., “On certain classes of spatiotemporal random fields with application to space-time data processing”, IEEE Trans. Systems, Man, and Cybernetics, vol. 21(4), p. 861-875, 1991.

[CHR 91b] CHRISTAKOS G., “Some applications of the BME concept in Geostatistics”, in Fundamental Theories of Physics, Kluwer Acad. Publ., Amsterdam, The Netherlands, p. 215-229, 1991.
[CHR 92] CHRISTAKOS G., Random Field Models in Earth Sciences, Academic Press, San Diego, CA, 1992.

[CHR 96] CHRISTAKOS G. and BOGAERT P., “Spatiotemporal analysis of springwater ion processes derived from measurements at the Dyle Basin in Belgium”, IEEE Trans. Geosciences and Remote Sensing, vol. 34, p. 626-642, 1996.

[CHR 98a] CHRISTAKOS G. and LI X., “Bayesian maximum entropy analysis and mapping: A farewell to kriging estimators?”, Mathematical Geology, vol. 30(4), p. 435-462, 1998.

[CHR 98b] CHRISTAKOS G. and HRISTOPULOS D.T., Spatiotemporal Environmental Health Modelling, Kluwer Academic Publ., Boston, MA, 1998.

[CHR 98c] CHRISTAKOS G., “Spatiotemporal information systems in soil and environmental sciences”, Geoderma, vol. 85(2-3), p. 141-179, 1998.

[CHR 99a] CHRISTAKOS G., HRISTOPOULOS D.T. and SERRE M.L., “BME studies of stochastic differential equations representing physical laws-Part I”, 5th Annual Conference, Intern. Assoc. for Mathematical Geology, Trondheim, Norway, p. 63-68, 1999.

[CHR 99b] CHRISTAKOS G. and KOLOVOS A., “A study of the spatiotemporal health impacts of ozone exposure”, J. of Exposure Analysis & Environmental Epidemiology, vol. 9, p. 322-335, 1999.

[CHR 00a] CHRISTAKOS G., Modern Spatiotemporal Geostatistics, Oxford Univ. Press, New York, 2000.

[CHR 00b] CHRISTAKOS G. and PAPANICOLAOU V., “Norm-dependent covariance permissibility of weakly homogeneous spatial random fields”, Stochastic Environmental Research and Risk Assessment, vol. 14, p. 1-8, 2000.

[CHR 00c] CHRISTAKOS G. and SERRE M.L., “A spatiotemporal study of exposure-health effect associations”, J. of Exposure Analysis & Environmental Epidemiology, vol. 10, p. 168-187, 2000.

[CHR 00d] CHRISTAKOS G., HRISTOPOULOS D.T. and BOGAERT P., “On the physical geometry concept at the basis of space/time geostatistical hydrology”, Advances in Water Resources, vol. 23, p. 799-810, 2000.

[CHR 01] CHRISTAKOS G., SERRE M.L. and KOVITZ J., “BME representation of particulate matter distributions in the state of California on the basis of uncertain measurements”, J. of Geophysical Research, vol. 106(D9), p. 9717-9731, 2001.

[CHR 02a] CHRISTAKOS G., “On the assimilation of uncertain physical knowledge bases: Bayesian and non-Bayesian techniques”, Advances in Water Resources, vol. 25, p. 1257-1274, 2002.

[CHR 02b] CHRISTAKOS G., “On a deductive logic-based spatiotemporal random field theory”, Probability Theory & Mathematical Statistics (Teoriya Imovirnostey ta Matematychna Statystyka), vol. 66, p. 54-65, 2002.

[CHR 02c] CHRISTAKOS G., BOGAERT P. and SERRE M.L., Temporal GIS, Springer-Verlag, New York, NY, with CD-ROM, 2002.
[CHR 04] CHRISTAKOS G., KOLOVOS A., SERRE M.L. and VUKOVICH F., “Total ozone mapping by integrating data bases from remote sensing instruments and empirical models”, IEEE Trans. Geosciences and Remote Sensing, vol. 42(5), p. 991-1008, 2004.

[CHR 05a] CHRISTAKOS G., “Recent methodological developments in geophysical assimilation modelling”, Reviews of Geophysics, vol. 43, p. 1-10, 2005.

[CHR 05b] CHRISTAKOS G., Random Field Models in Earth Sciences, Dover Publ. Inc., Mineola, NY, 2005.

[CHR 05c] CHRISTAKOS G., OLEA R.A., SERRE M.L., YU H.L. and WANG L.-L., Interdisciplinary Public Health Reasoning and Epidemic Modelling: The Case of Black Death, Springer-Verlag, New York, NY, 2005.

[CHR 06] CHRISTAKOS G., “Modelling with Spatial and Temporal Uncertainty”, in Encyclopedia of Geographical Information Science (GIS), Springer, NY, 2006.

[CHR 07] CHRISTAKOS G., OLEA R.A. and YU H.-L., “Recent results on the spatiotemporal modelling and comparative analysis of Black Death and bubonic plague epidemics”, J. Public Health, vol. 121, p. 700-720, 2007.

[CHR 08] CHRISTAKOS G., Treatise on Epistematics, Springer, New York, NY, 2008.

[CRE 93] CRESSIE N., Statistics for Spatial Data, J. Wiley, NY, 1993.

[DOR 01] D’OR D., BOGAERT P. and CHRISTAKOS G., “Application of the BME approach to soil texture mapping”, Stochastic Environmental Research and Risk Assessment, vol. 15, p. 87-100, 2001.

[DOR 03] D’OR D. and BOGAERT P., “Continuous-valued map reconstruction with the Bayesian Maximum Entropy”, Geoderma, vol. 112, p. 169-178, 2003.

[DOU 04] DOUAIK A., VAN MEIRVENNE M., TOTH T. and SERRE M.L., “Space-time mapping of soil salinity using probabilistic BME”, Stochastic Environmental Research and Risk Assessment, vol. 18, p. 219-227, 2004.

[DOU 05] DOUAIK A., VAN MEIRVENNE M. and TOTH T., “Soil salinity mapping using spatio-temporal kriging and Bayesian maximum entropy with interval soft data”, Geoderma, vol. 128, p. 234-248, 2005.

[ELO 08] ELOGNE S., HRISTOPULOS D.T. and VAROUCHAKIS M., “An application of Spartan spatial random fields in environmental mapping: focus on automatic mapping capabilities”, Stochastic Environmental Research and Risk Assessment, vol. 22(5), 2008, forthcoming.

[FAS 07a] FASBENDER D., RADOUX J. and BOGAERT P., “Adaptable Bayesian data fusion for image pansharpening”, IEEE Trans. Geosciences and Remote Sensing, 2007, forthcoming.

[FAS 07b] FASBENDER D., TUIA D., BOGAERT P. and KANEVSKI M., “Support-based implementation of Bayesian Data Fusion for spatial enhancement: Applications to ASTER thermal images”, submitted to Geoscience and Remote Sensing Letters, 2007.

[FED 88] FEDER J., Fractals, Plenum Press, NY, 1988.
[GOO 94] GOODALL C. and MARDIA K.V., “Challenges in multivariate spatio-temporal modelling”, in Proceed. of the XVIIth Intern. Biometric Conference, p. 1-17, Hamilton, Ontario, Canada, 8-12 August 1994.

[GOO 97] GOOVAERTS P., Geostatistics for Natural Resources Evaluation, Oxford Univ. Press, New York, NY, 1997.

[HAA 95] HAAS T.C., “Local prediction of spatio-temporal process with an application to wet sulfate deposition”, J. of the Amer. Statistical Assoc., vol. 90, p. 1189-1199, 1995.

[HRI 01] HRISTOPULOS D.T. and CHRISTAKOS G., “Practical calculation of non-Gaussian multivariate moments in BME analysis”, Mathematical Geology, vol. 33(5), p. 543-568, 2001.

[HRI 03] HRISTOPULOS D.T., “Spartan Gibbs random field models for geostatistical applications”, SIAM Journal of Scientific Computing, vol. 24(6), p. 2125-2162, 2003.

[KOL 02] KOLOVOS A., CHRISTAKOS G., SERRE M.L. and MILLER C.T., “Computational BME solution of a stochastic advection-reaction equation in the light of site-specific information”, Water Resources Research, vol. 38, p. 1318-1334, 2002.

[KOL 04] KOLOVOS A., CHRISTAKOS G., HRISTOPULOS D.T. and SERRE M.L., “Methods for generating non-separable spatiotemporal covariance models with potential environmental applications”, Advances in Water Resources, vol. 27, p. 815-830, 2004.

[KOL 06] KOLOVOS A., YU H.-L. and CHRISTAKOS G., SEKS-GUI v.0.6 User Manual, Dept. of Geography, San Diego State University, San Diego, CA, 2006.

[KOV 04a] KOVITZ J. and CHRISTAKOS G., “Spatial statistics of clustered data”, Stochastic Environmental Research and Risk Assessment, vol. 18(3), p. 147-166, 2004.

[KOV 04b] KOVITZ J. and CHRISTAKOS G., “Assimilation of fuzzy data by the BME method”, Stochastic Environmental Research and Risk Assessment, vol. 18(2), p. 79-90, 2004.

[KYR 99] KYRIAKIDIS P.C. and JOURNEL A.G., “Geostatistical space-time models: a review”, Mathematical Geology, vol. 31(6), p. 651-684, 1999.

[LAW 06] LAW D.C., BERNSTEIN K., SERRE M.L., SCHUMACHER C.M., LEONE P.A., ZENILMAN J.M., MILLER W.C. and ROMPALO A.M., “Modelling an Early Syphilis Outbreak through Space and Time Using the Bayesian Maximum Entropy Approach”, Annals of Epidemiology, vol. 16(11), p. 797-804, 2006.

[LEE 07a] LEE S.-J., BALLING R. and GOBER P., “Bayesian Maximum Entropy mapping and the soft data problem in urban climate research”, Annals of the Association of American Geographers, 2007, forthcoming.

[LEE 07b] LEE S.-J. and WENTZ E.A., “Applying BME to extrapolating local-scale water consumption in Maricopa county, Arizona”, Water Resources Research, 2007, forthcoming.

[LOB 07] LOBUGLIO J.N., CHARACKLIS G.W. and SERRE M.L., “Cost-effective water quality assessment through the integration of monitoring data and modelling results”, Water Resources Research, vol. 43, 2007, doi:10.1029/2006WR005020.
[MAC 03] MA C., “Spatio-temporal stationary covariance models”, J. of Multivariate Analysis, vol. 86, p. 97-107, 2003.

[MYE 89] MYERS D.E., “To be or not to be…stationary: That is the question”, Mathematical Geology, vol. 21, p. 347-362, 1989.

[MYE 02] MYERS D.E., “Space-time correlation models and contaminant plumes”, Environmetrics, vol. 13, p. 535-554, 2002.

[ORT 07a] ORTON T.G. and LARK R.M., “Accounting for the uncertainty in the local mean in spatial prediction by BME”, Stochastic Environmental Research and Risk Assessment, vol. 21(6), p. 773-784, 2007.

[ORT 07b] ORTON T.G. and LARK R.M., “Estimating the local mean for Bayesian maximum entropy by generalized least squares and maximum likelihood, and an application to the spatial analysis of a censored soil variable”, J. of Soil Science, vol. 58, p. 60-73, 2007.

[PAP 06] PAPANTONOPOULOS G. and MODIS K., “A BME solution of the stochastic three-dimensional Laplace equation representing a geothermal field subject to site-specific information”, Stochastic Environmental Research and Risk Assessment, vol. 20(1-2), p. 23-32, 2006.

[PAR 05] PARKIN R., SAVELIEVA E. and SERRE M.L., “Soft geostatistical analysis of radioactive soil contamination”, in Ph. Renard (ed.), GeoENV V-Geostatistics for Environ. Applications, Kluwer Acad. Publishers, Dordrecht, 2005.

[POR 06] PORCU E., GREGORI P. and MATEU J., “Nonseparable stationary anisotropic space-time covariance functions”, Stochastic Environmental Research and Risk Assessment, vol. 21(2), p. 113-122, 2006.

[PUA 07] PUANGTHONGTHUB S., WANGWONGWATANA S., KAMENS R.M. and SERRE M.L., “Modelling the space/time distribution of Particulate Matter in Thailand and optimizing its monitoring network”, Atmospheric Environment, 2007, available online: doi:10.1016/j.atmosenv.2007.06.051.

[QUE 07] QUERIDO A., YOST R., TRAORE S., DOUMBIA M.D., KABLAN R., KONARE H. and BALLO A., “Spatiotemporal mapping of total Carbon stock in agroforestry systems of Sub-Saharan Africa”, in Proceed. of ASA-CSSA-SSSA Intern. Annual Meetings, November 4-8, New Orleans, Louisiana, 2007.

[QUI 04] QUILFEN Y., CHAPRON B., COLLARD F. and SERRE M.L., “Calibration/validation of an altimeter wave period model and application to TOPEX/Poseidon and Jason-1 Altimeters”, Marine Geodesy, vol. 27, p. 535-550, 2004.

[SAV 05] SAVELIEVA E., DEMYANOV V., KANEVSKI M., SERRE M.L. and CHRISTAKOS G., “BME-based uncertainty assessment of the Chernobyl fallout”, Geoderma, vol. 128, p. 312-324, 2005.

[SER 99a] SERRE M.L. and CHRISTAKOS G., “Modern Geostatistics: Computational BME in the light of uncertain physical knowledge – The Equus Beds study”, Stochastic Environmental Research and Risk Assessment, vol. 13(1), p. 1-26, 1999.
[SER 99b] SERRE M.L. and CHRISTAKOS G., “BME studies of stochastic differential equations representing physical laws-Part II”, 5th Annual Conference, Intern. Assoc. for Mathematical Geology, Trondheim, Norway, p. 93-98, 1999.

[SER 01] SERRE M.L., CHRISTAKOS G., HOWES J. and ABDEL-REHIEM A.G., “Powering an Egyptian air quality information system with the BME space/time analysis toolbox: Results from the Cairo baseline year study”, Geostatistics for Environ. Applications, P. Monestiez, D. Allard and R. Froidevaux (eds.), Kluwer Acad. Publ., Dordrecht, The Netherlands, p. 91-100, 2001.

[SER 03a] SERRE M.L. and CHRISTAKOS G., “Efficient BME estimation of subsurface hydraulic properties using measurements of water table elevation in unidirectional flow”, in Calibration and Reliability in Groundwater Modelling: A Few Steps Closer to Reality, K. Kovar and Z. Hrkal (eds.), IAHS Publ. no. 277, Oxfordshire, UK, p. 321-327, 2003.

[SER 03b] SERRE M.L., KOLOVOS A., CHRISTAKOS G. and MODIS K., “An application of the holistochastic human exposure methodology to naturally occurring Arsenic in Bangladesh drinking water”, Risk Analysis, vol. 23(3), p. 515-528, 2003.

[SER 03c] SERRE M.L., CHRISTAKOS G., LI H. and MILLER C.T., “A BME solution to the inverse problem for saturated groundwater flow”, Stochastic Environmental Research and Risk Assessment, vol. 17(6), p. 354-369, 2003.

[VYA 04] VYAS V.M., TONG S.N., UCHRIN C., GEORGOPOULOS P.G. and CARTER G.P., “Geostatistical estimation of horizontal hydraulic conductivity for the Kirkwood-Cohansey aquifer”, Jour. of the American Water Resources Assoc., vol. 40(1), p. 187-195, 2004.

[WAN 05] WANG L.-L., Spatiotemporal Analysis of Black Death in France, MS Thesis, Dept. of Environ. Sci. and Engin., Univ. of North Carolina, Chapel Hill, NC, 2005.

[WIB 06] WIBRIN M.-A., BOGAERT P. and FASBENDER D., “Combining categorical and continuous spatial information within the Bayesian Maximum Entropy paradigm”, Stochastic Environmental Research and Risk Assessment, vol. 20, p. 423-434, 2006.

[YU 05] YU H.-L. and CHRISTAKOS G., “Porous media upscaling in terms of mathematical epistemic cognition”, SIAM J. on Appl. Math., vol. 66(2), p. 433-446, 2005.

[YU 06] YU H.-L. and CHRISTAKOS G., “Spatiotemporal modelling and mapping of the bubonic plague epidemic in India”, Intern. Jour. of Health Geographics, vol. 5(12), 2006, Internet online journal [http://www.ij-healthgeographics.com/content/5/1/12].

[YU 07a] YU H.-L., KOLOVOS A., CHRISTAKOS G., CHEN J.-C., WARMERDAM S. and DEV B., “Interactive spatiotemporal modelling of health systems: The SEKS-GUI framework”, Stochastic Environmental Research and Risk Assessment – Special Issue on “Medical Geography as a Science of Interdisciplinary Knowledge Synthesis under Conditions of Uncertainty”, D.A. Griffith and G. Christakos (eds.), vol. 21(5), p. 555-572, 2007.
[YU 07b] YU H.-L., CHRISTAKOS G., MODIS K. and PAPANTONOPOULOS G., “A composite solution method for physical equations and its application in the Nea Kessani geothermal field (Greece)”, J. of Geophysical Research-Solid Earth, vol. 112, B06104, doi:10.1029/2006JB004900, 2007.

[YU 07c] YU H.-L., CHEN J.-C., CHRISTAKOS G. and JERRETT M., “Estimating residential level ambient PM10 and Ozone exposures at multiple time-scales in the Carolinas with the BME method”, Epidemiology, 2007, submitted.
List of Authors
George CHRISTAKOS
Department of Geography, Storm Hall 314, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182-4493, USA

Vasily DEMYANOV
Institute of Petroleum Engineering, Heriot-Watt University, Edinburgh, EH14 4AS, UK

Loris FORESTI
Institute of Geomatics and Analysis of Risk (IGAR), University of Lausanne, Amphipôle, 1015 Lausanne, Switzerland

Christian KAISER
Institute of Geography (IGUL), University of Lausanne, Antropôle, 1015 Lausanne, Switzerland

Mikhail KANEVSKI
Institute of Geomatics and Analysis of Risk (IGAR), University of Lausanne, Amphipôle, 1015 Lausanne, Switzerland

Michel MAIGNAN
Institute of Mineralogy and Geochemistry, University of Lausanne, Antropôle, 1015 Lausanne, Switzerland

Alexei POZDNOUKHOV
Institute of Geomatics and Analysis of Risk (IGAR), University of Lausanne, Amphipôle, 1015 Lausanne, Switzerland
Ross PURVES
Department of Geography, University of Zürich – Irchel, Zurich, Switzerland

Frédéric RATLE
Institute of Geomatics and Analysis of Risk (IGAR), University of Lausanne, Amphipôle, 1015 Lausanne, Switzerland

Elena SAVELIEVA
Environmental Modelling and System Analysis Lab., Nuclear Safety Institute (IBRAE), Russian Academy of Sciences, 52 B. Tulskaya, Moscow, 113191, Russia

Rafael TAPIA
Institute of Geomatics and Analysis of Risk (IGAR), University of Lausanne, Amphipôle, 1015 Lausanne, Switzerland

Vadim TIMONIN
Institute of Geomatics and Analysis of Risk (IGAR), University of Lausanne, Amphipôle, 1015 Lausanne, Switzerland

Devis TUIA
Institute of Geomatics and Analysis of Risk (IGAR), University of Lausanne, Amphipôle, 1015 Lausanne, Switzerland
Index
A a posteriori probability 124 accuracy 14, 58, 221, 234, 235, 236, 248, 250, 268, 277, 282, 289, 291, 294 ANNEX model 188, 190, 192–194, 199, 200 anisotropy 2, 6, 7, 49, 53, 84 atmospheric 210, 227, 247, 281, 282 automatic mapping 124, 150, 185, 187, 190, 193, 199 avalanches 149, 225, 226, 234, 241
B Bayesian 1, 5, 8, 12, 20, 99, 123, 124, 150, 194, 247, 250, 271, 276 behavior 47, 49, 141, 146, 167, 218, 225, 256, 262, 291 biological 14, 247, 248 BME 1, 8, 247, 248 box-counting 21, 27, 28, 30–32, 40–42 brain 109, 248, 250, 270
C classification 1–3, 13–15, 66, 71–74, 95– 102, 107–110, 122, 124, 131, 136, 137, 149, 165, 185, 186, 192–194, 200, 204– 209, 223–233, 239, 251, 275 cluster 31, 33, 35, 142–144, 256 clustering 2, 3, 6, 9, 14, 19, 20–44, 85, 98, 101, 127, 141, 142, 144, 201, 202, 206, 214, 215, 218, 225, 260
cognition 14 complex relationships 169 conditional distribution 99, 123 confidence measure 123 conjugate gradient 112, 154, 175 continuum 251–253, 255 co-simulations 5, 88–90 covariance 7, 47–51, 56, 59, 61, 120, 144, 145, 211, 248, 253, 254, 257, 259–263, 265, 266, 269, 270, 273, 276, 278, 283, 285, 292–294 cross-validation 105, 106, 122, 135, 137, 158, 187, 188, 190, 193–195, 229, 232, 233, 285
D decision support systems 149, 185, 226, 227, 228, 241 decision-oriented mapping 4, 8, 209 density estimation 96, 98, 99, 101, 119, 122, 141 de-trending 15 digital elevation model (DEM) 5, 61, 151, 154, 163, 164, 168, 169, 199, 237–239 directional variogram 91, 152, 153, 181 discriminative models 99 distribution of weights 176 drift 5, 58, 59, 62, 63, 103, 152, 157, 158, 198, 230, 231
E Earth 14, 254, 282 elevation 5, 58, 61, 150–156, 159–161, 163, 164, 166–174, 177–179, 189, 190, 230, 238, 289 empirical 1, 2, 6, 14, 104, 131, 132, 229, 235, 250, 255, 256, 267, 271, 277, 282, 285, 287, 296, entropy 82, 250 epidemiology 281, 291 epistematics 14, 248, 250, 273, 275 epistemology 14 Euclidean 23, 26, 31, 108, 143, 186, 188, 250–257, 262, 263, 270 Euclidean distance 108, 128, 142, 143, 186, 201, 253, 254, 256, 257 evolutionary 14, 141, 250, 270 experimental variogram 7, 8, 57, 197, 216– 218, 221 exploratory spatial data analysis (ESDA) 6, 20 exposure 210, 247, 281, 291, 294–297 extreme precipitation 168, 173, 174
F feature selection 101, 231 Föhn 159–162 forecast 90, 225, 226, 227, 229, 231, 234– 237, 239, 241 fractals 26, 33, 36, 256
G Gaussian kernel 98, 120, 121, 136, 137, 140 General Regression Neural Network (GRNN) 7, 13, 97, 99, 103, 109, 119– 122, 124, 146, 149, 185, 188, 190–192, 194, 199, 200, 241 generalization 8, 9, 59, 97, 104, 105, 113, 131, 132, 133, 141, 153, 154, 185, 198, 204, 205, 209, 228, 255, 275, 277 generative methods 98, 99 geo-features 151, 164, 242 Geostat Office 13 geostatistics 3, 5, 7, 10–13, 47, 58, 64, 149, 150, 168, 179, 197, 209, 247
GSLIB 12, 13 GUI 14, 275–278, 281
H health 14, 250, 251, 281, 291, 292, 294–297 Hebbian learning 125 heterogenity 257, 258, 265, 280 hybrid models 3, 15, 168, 179
I indoor radon 20, 36, 37, 42, 43, 149, 209– 213, 217, 218, 224, 225 infinity-norm distance 186 intrinsic hypothesis 2, 6, 48, 48, 51 inverse distance weighting 157, 158, 168 isotropy 254
J, K joint distribution 99, 220, 234 kernel-based methods 7 kernel width 120, 188, 190, 232, 233 kernels 120, 132, 136, 145, 229 K-function 25, 26, 28, 39, 40 k-nearest neighbors 13, 108, 186 kriging co-kriging 5, 13, 58–63, 67, 88, 282 collocated 62 indicator 7, 64–67, 69–74, 84, 85, 197–200, 209, 211, 215, 217, 241 lognormal 56, 57 ordinary 50 simple 50, 52, 58, 67, 77, 82, 211, 286, 290 universal 56 with external drift 5, 58, 59, 62, 63, 103, 152, 157, 158, 198
L lacunarity 27, 33, 34, 42, 43 laws 248, 250, 253, 255, 256, 258, 271, 272, 276, 277, 288, 293, 295 lazy learning 186 learning rate 110, 125, 128, 130, 205 leave-one-out cross-validation (LOOCV) 105, 106, 187, 188, 190, 193, 195
Levenberg-Marquardt 112, 117, 175, 176, 178, 179 LibSVM 13 local correlations 171, 173, 177 logic 273, 275, 276
M machine learning 1, 3, 6, 7, 9–14, 44, 95, 96, 99, 101, 104, 107, 109, 131–133, 146, 149, 150, 151, 157, 163, 168, 179, 184, 185, 194, 197, 199, 226, 227, 241, 242 Manhattan distance 186 maximum a posteriori (MAP) decision rule 123 meteorological stations 151, 165 meteorology 181 methodology 3, 5, 10, 107, 137, 150, 153, 160, 168, 181, 197, 248, 270, 271, 274, 281, 288 Minkowski distance 186 model assessment 2, 8, 9 model selection 2, 8, 9, 103, 131, 154, 233 Morisita index 24, 25, 26, 30, 36, 38, 39 moving windows statistics 173 multi-layer perceptron (MLP) 7, 97, 109– 118, 150, 153–162, 165, 166, 168, 171– 183, 187, 188, 190, 192
N Nadaraya-Watson Kernel Regression Estimator 119 natural hazard 149, 150, 225–227, 242 nearest neighbor methods 99, 103, 108, 109, 226, 227, 241 Netlab 13 neural networks 1, 7, 13, 97, 98, 103, 109, 119, 121, 122, 124, 133, 149, 168, 178, 190, 241 neural network residual kriging (NNRK) 173, 176, 179, 181–184 neural network residual simulations 173, 179, 182 neurons 109, 111, 114–118, 125–130, 153– 156, 162, 163, 166, 167, 171, 175, 176, 180, 203 n-fold cross-validation 187
non-Euclidean 250, 252–255, 262, 263, 270 nonlinear 7, 56, 103, 110, 111, 114, 115, 120, 126, 133, 135, 139, 144, 145, 150, 159, 160, 163, 164, 167–169, 179, 184, 185, 201, 204, 209, 226, 228, 229, 237, 250, 271, 277, 296 nonlinear dimensionality reduction 144 nonstationarity 152 Nscore transform 211
O optimization algorithms 110, 112, 113, 174, 179, 185 orographic precipitation 159 overfitting 104, 121, 131, 137, 154–156, 179, 181, 182, 188, 228, 233
P Parzen window 13–14 physical 2, 5, 14, 103, 123, 124, 159, 190, 209, 210, 217, 226, 227, 237, 247–249, 251–253, 255, 256, 260, 266, 267, 270– 272, 275, 281, 288, 289, 293, 294, 296 Poisson point process 39 precipitation 159, 160, 168–174, 177–179, 181, 182, 184, 192, 193, 196–199 principal component analysis (PCA) 102, 126, 144, 145 prior probability 123 Probabilistic Neural Network (PNN) 7, 13, 98, 103, 109, 122–124, 146, 149, 192, 194–196, 198–200, 241 probability density function (pdf) 1, 2, 5, 7, 15, 64, 65, 69, 70, 80–82, 84, 85, 88, 122, 184, 209, 211, 213, 220, 250, 257, 258, 260, 267, 268, 270, 272, 277, 280, 289, 293 probability mapping 64, 209, 211
Q, R quantile 81, 84, 85, 184, 213, 223 R 13 random field 5, 15, 99, 257, 258, 259, 266, 273, 285
realizations 5, 7, 47, 76–79, 82–91, 183, 209, 220, 257, 258 recursive feature elimination 165, 231 regression 1, 3, 7, 13, 14, 27, 28, 30, 58, 60, 64, 76, 84, 88, 96, 97, 99–101, 103, 108–110, 114, 119, 122–124, 137, 146, 149, 157, 158, 185, 186, 190–193, 199, 241, 272 regression kriging 97, 157, 158 residual kriging 173, 179, 286 residual simulations 173, 179, 182 risk 21, 22, 85, 96, 125, 131, 132, 136, 137, 150, 209, 250, 292 analysis 25, 64 assessment 14, 227, 268, 281 mapping 4, 5, 7, 64, 235 RMSE 63, 154, 155, 158, 159, 162, 166, 171, 172, 175, 178, 188, 191
S sandbox counting 26 science 5, 95, 96, 150, 248, 250, 257, 268, 269, 272, 275, 281, 282, 291 SEKS 14, 275–278, 281 self-organizing (Kohonen) maps 7, 124 sequential Gaussian simulations (SGS) 78, 83, 89, 176, 183, 184, 209, 211, 215, 217–221, 223–225 S-GeMS 13 simulated annealing 79, 112, 141 simulation cell-based 77–79, 84 Gaussian 13, 78, 81, 83, 84, 88, 89, 176, 183, 209, 211 indicator 13, 78, 81, 84–87 multiple-point 13 object-based 78 smoothing parameter 121, 122 software 11–14, 251, 275, 281 space-time dependence 14, 248, 250, 253, 254, 257, 259, 260, 261, 263, 264, 273, 276, 277, 292 spatiotemporal 1, 2, 3, 7, 8, 11, 14, 149, 150, 225–227, 240
stationarity 21, 47, 48, 67, 152, 215, 217, 218, 264, 265 second-order 2, 48–50 strict 48 statistical learning theory 7, 131, 226, 228 statistics 12, 13, 40, 43, 47, 62, 77– 79, 81, 95, 102, 131, 151–153, 169–171, 173, 177, 183, 187, 188, 191–193, 214, 223, 247, 249, 250, 271, 291 stochastic 2, 5, 64, 76–79, 81–85, 88– 91, 112, 248–250, 257, 259, 263, 265, 269, 272, 273, 275, 276, 285 simulations 7, 76, 77, 88, 182, 241 structural analysis 3, 6 supervised learning 3, 44, 100, 101 support vector 5, 133, 135, 139, 228, 229, 232, 233, 236 machines (SVM) 1, 7, 13, 19, 98, 101, 107, 132–134, 136, 137, 144, 146, 164, 165, 225, 226, 228–241 regression (SVR) 7, 97, 137–141, 157, 158, 168 synapse 125, 176 synthesis 14, 248, 273, 275, 281 system attributes 247
T temperature 58, 61, 62, 63, 79, 101, 150– 168, 179, 187–189, 191, 192, 199, 230, 231, 238, 271, 285, 287, 289, 290, 294 gradients 159–161, 231, 238 inversion 151, 159, 163, 165–167 temporal scale 150, 168, 169, 297 terrain features 163 testing 9, 10, 44, 70, 113, 115, 116, 118, 141, 153–156, 168, 202, 219, 229 theory 7, 67, 124, 126, 131–133, 135, 150, 153, 226, 228, 248, 249, 251, 257, 258, 265, 266, 273, 274, 276 topo-climatic 1, 185 topographical 163, 164, 167, 173, 282 TORCH 13 training and validation curves 155, 163
transductive 44, 99, 100, 101 trend 14, 44, 49, 56, 58, 59, 114, 115, 152, 168, 179, 181–183, 192, 215, 217, 260, 265, 271
U uncertainty 4, 5, 7, 11, 36, 53, 55, 64, 66, 72, 76, 84, 85, 90, 107, 108, 182, 196, 211, 224, 226, 229, 248–250, 257, 268, 273, 275, 276, 282, 285, 289, 291–293, 296 unsupervised learning 3, 100, 101, 109, 124, 125
V, W validity domains 34, 37, 42 variability 2, 5, 7, 14, 51, 58, 62, 74, 76, 77, 82, 84, 88, 90, 144, 150, 168, 169, 179, 182, 210, 211, 225, 237, 240, 264, 270, 282, 289, 291, 294 variogram model 51–54, 58, 60, 65, 66, 82, 84, 88, 153, 197, 198, 209, 216–218, 221, 222, 225 variography of residuals 6, 155, 190 VC dimension 131 Voronoï polygons 6, 24, 71, 215 Weka 14