ADVANCES IN IMAGING AND ELECTRON PHYSICS VOLUME 140
EDITOR-IN-CHIEF
PETER W. HAWKES CEMES-CNRS Toulouse, France
HONORARY ASSOCIATE EDITORS
TOM MULVEY BENJAMIN KAZAN
Advances in
Imaging and Electron Physics
EDITED BY
PETER W. HAWKES CEMES-CNRS Toulouse, France
VOLUME 140
AMSTERDAM • BOSTON • HEIDELBERG • LONDON NEW YORK • OXFORD • PARIS • SAN DIEGO SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Academic Press is an imprint of Elsevier 525 B Street, Suite 1900, San Diego, California 92101-4495, USA 84 Theobald’s Road, London WC1X 8RR, UK ∞ This book is printed on acid-free paper.
Copyright © 2006, Elsevier Inc. All Rights Reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the Publisher. The appearance of the code at the bottom of the first page of a chapter in this book indicates the Publisher’s consent that copies of the chapter may be made for personal or internal use of specific clients. This consent is given on the condition, however, that the copier pay the stated per copy fee through the Copyright Clearance Center, Inc. (www.copyright.com), for copying beyond that permitted by Sections 107 or 108 of the U.S. Copyright Law. This consent does not extend to other kinds of copying, such as copying for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. Copy fees for pre-2005 chapters are as shown on the title pages. If no fee code appears on the title page, the copy fee is the same as for current chapters. 1076-5670/2006 $35.00 Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail:
[email protected]. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.” For information on all Elsevier Academic Press publications visit our Web site at www.books.elsevier.com ISBN-13: 978-0-12-014782-3 ISBN-10: 0-12-014782-3 PRINTED IN THE UNITED STATES OF AMERICA 06 07 08 09 9 8 7 6 5 4 3 2 1
CONTENTS
CONTRIBUTORS . . . vii
PREFACE . . . ix
FUTURE CONTRIBUTIONS . . . xi
Recursive Neural Networks and Their Applications to Image Processing
MONICA BIANCHINI, MARCO MAGGINI, AND LORENZO SARTI

I. Introduction . . . 1
II. Recursive Neural Networks . . . 9
III. Graph-Based Representation of Images . . . 33
IV. Object Detection in Images . . . 39
References . . . 54
Deterministic Learning and an Application in Optimal Control
CRISTIANO CERVELLERA AND MARCO MUSELLI

I. Introduction . . . 62
II. A Mathematical Framework for the Learning Problem . . . 65
III. Statistical Learning . . . 69
IV. Deterministic Learning . . . 74
V. Deterministic Learning for Optimal Control Problems . . . 90
VI. Approximate Dynamic Programming Algorithms . . . 94
VII. Deterministic Learning for Dynamic Programming Algorithms . . . 99
VIII. Experimental Results . . . 104
References . . . 114
X-Ray Fluorescence Holography
KOUICHI HAYASHI

I. Introduction . . . 120
II. Theory . . . 122
III. Experiment and Data Processing . . . 138
IV. Applications . . . 159
V. Related Methods . . . 174
VI. Summary and Outlook . . . 180
References . . . 181
A Taxonomy of Color Image Filtering and Enhancement Solutions
RASTISLAV LUKAC AND KONSTANTINOS N. PLATANIOTIS

I. Introduction . . . 188
II. Color Imaging Basics . . . 190
III. Image Noise . . . 193
IV. Color Image Filtering . . . 199
V. Edge Detection . . . 244
VI. Conclusion . . . 257
References . . . 257
General Sweep Mathematical Morphology
FRANK Y. SHIH

I. Introduction . . . 265
II. Theoretical Development of General Sweep Mathematical Morphology . . . 268
III. Blending of Swept Surfaces with Deformations . . . 275
IV. Image Enhancement . . . 278
V. Edge Linking . . . 280
VI. Shortest Path Planning for Mobile Robot . . . 286
VII. Geometric Modeling and Sweep Mathematical Morphology . . . 288
VIII. Formal Language and Sweep Morphology . . . 291
IX. Representation Scheme . . . 292
X. Grammars . . . 297
XI. Parsing Algorithm . . . 300
XII. Conclusions . . . 303
References . . . 304
Further Reading . . . 306
INDEX . . . 307
CONTRIBUTORS
Numbers in parentheses indicate the pages on which the authors’ contributions begin.
MONICA BIANCHINI (1), Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, 53100 Siena, Italy
CRISTIANO CERVELLERA (61), Istituto di Studi sui Sistemi Intelligenti per l'Automazione, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy
KOUICHI HAYASHI (119), Institute for Materials Research, Tohoku University, Sendai 980-8577, Japan
RASTISLAV LUKAC (187), Multimedia Laboratory—BA 4157, The Edward S. Rogers Sr. Department of ECE, University of Toronto, Toronto, Ontario M5S 3G4, Canada
MARCO MAGGINI (1), Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, 53100 Siena, Italy
MARCO MUSELLI (61), Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy
KONSTANTINOS N. PLATANIOTIS (187), Multimedia Laboratory—BA 4157, The Edward S. Rogers Sr. Department of ECE, University of Toronto, Toronto, Ontario M5S 3G4, Canada
LORENZO SARTI (1), Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, 53100 Siena, Italy
FRANK Y. SHIH (265), Computer Vision Laboratory, College of Computing Sciences, New Jersey Institute of Technology, Newark, New Jersey 07102, USA
PREFACE
The five chapters that make up this volume cover several aspects of image processing, control theory, and a form of holography using X-rays. First, we have an account by M. Bianchini, M. Maggini, and L. Sarti of the role of recursive neural networks in image processing. The authors begin with a very clear description of the reasons why this approach is so powerful for pattern recognition in real-world situations. Then they present the networks themselves, the graph-based representation of images and, finally, show how objects can be detected by means of these tools.

This is followed by a contribution on deterministic learning and its use in control theory by C. Cervellera and M. Muselli. Before describing deterministic learning in detail, they give a brief account of statistical learning in order to bring out the advantages of the former in certain situations. Dynamic programming algorithms are then set out and the chapter concludes with some experimental results.

The subject of the third chapter is very different. Among the many forms of holography, X-ray fluorescence holography is proving very valuable. It is a recent addition to the family, owing to the difficulty of obtaining sufficiently strong signals, but the results obtained at the European Synchrotron Radiation Facility show how useful it can be. K. Hayashi describes the technique itself and illustrates this with numerous applications. Finally, the technique is compared briefly with related methods, such as γ-ray and neutron holography, and photon interference XAFS.

Processing color images is distinctly more complicated than with black-and-white images, and several contributions on the question are planned for these Advances. In the fourth chapter, R. Lukac and K.N. Plataniotis first examine the task of filtering such images, and then discuss enhancement and edge detection.

Finally, we have another contribution in the area of mathematical morphology, which regularly appears in these pages. Here, F.Y. Shih introduces general sweep mathematical morphology, a branch of the subject useful in robotics and in automated construction and machining. After explaining what is meant by 'sweeping' in this context, the author gives formal definitions of the various operations required and then applies them to a wide variety of tasks. Three sections are devoted to a representation scheme, grammars, and parsing.
As always, I thank the authors most sincerely for their efforts to make their subjects clear to non-specialist readers. Contributions promised for future volumes in the series are listed in the next section.

Peter Hawkes
FUTURE CONTRIBUTIONS
G. Abbate: New developments in liquid-crystal-based photonic devices
S. Ando: Gradient operators and edge and corner detection
A. Asif: Applications of noncausal Gauss–Markov random processes in multidimensional image processing
C. Beeli: Structure and microscopy of quasicrystals
V.T. Binh and V. Semet: Cold cathodes
G. Borgefors: Distance transforms
A. Buchau: Boundary element or integral equation methods for static and time-dependent problems
B. Buchberger: Gröbner bases
J. Caulfield (vol. 142): Optics and information sciences
T. Cremer: Neutron microscopy
H. Delingette: Surface reconstruction based on simplex meshes
A.R. Faruqi: Direct detection devices for electron microscopy
R.G. Forbes: Liquid metal ion source
C. Fredembach: Eigenregions for image classification
S. Fürhapter: Spiral phase contrast imaging
L. Godo and V. Torra: Aggregation operators
A. Gölzhäuser: Recent advances in electron holography with point sources
M.I. Herrera: The development of electron microscopy in Spain
D. Hitz (vol. 144): Recent progress on high-frequency electron cyclotron resonance ion sources
D.P. Huijsmans and N. Sebe: Ranking metrics and evaluation measures
K. Ishizuka: Contrast transfer and crystal images
J. Isenberg: Imaging IR-techniques for the characterization of solar cells
K. Jensen: Field-emission source mechanisms
L. Kipp: Photon sieves
G. Kögel: Positron microscopy
T. Kohashi: Spin-polarized scanning electron microscopy
W. Krakow: Sideband imaging
R. Leitgeb: Fourier domain and time domain optical coherence tomography
B. Lencová: Modern developments in electron optical calculations
Y. Lin and S. Liu (vol. 141): Grey systems and grey information
W. Lodwick: Interval analysis and fuzzy possibility theory
L. Macaire, N. Vandenbroucke, and J.-G. Postaire: Color spaces and segmentation
M. Matsuya: Calculation of aberration coefficients using Lie algebra
S. McVitie: Microscopy of magnetic specimens
S. Morfu and P. Morquié: Nonlinear systems for image processing
L. Mugnier, A. Blanc, and J. Idier (vol. 141): Phase diversity
M.A. O'Keefe: Electron image simulation
D. Oulton and H. Owens: Colorimetric imaging
N. Papamarkos and A. Kesidis: The inverse Hough transform
K.S. Pedersen, A. Lee, and M. Nielsen: The scale-space properties of natural images
I. Perfilieva: Fuzzy transforms
E. Rau: Energy analysers for electron microscopes
H. Rauch: The wave-particle dualism
E. Recami: Superluminal solutions to wave equations
J. Řeháček, Z. Hradil, J. Peřina, S. Pascazio, P. Facchi, and M. Zawisky (vol. 142): Neutron imaging and sensing of physical fields
G. Ritter and P. Gader (vol. 144): Fixed points of lattice transforms and lattice associative memories
J.-F. Rivest (vol. 144): Complex morphology
P.E. Russell and C. Parish: Cathodoluminescence in the scanning electron microscope
G. Schmahl: X-ray microscopy
G. Schönhense, C.M. Schneider, and S.A. Nepijko (vol. 142): Time-resolved photoemission electron microscopy
R. Shimizu, T. Ikuta, and Y. Takai: Defocus image modulation processing in real time
S. Shirai: CRT gun design methods
N. Silvis-Cividjian and C.W. Hagen (vol. 143): Electron-beam-induced nanometre-scale deposition
H. Snoussi: Geometry of prior selection
T. Soma: Focus-deflection systems and their applications
W. Szmaja (vol. 141): Recent developments in the imaging of magnetic domains
I. Talmon: Study of complex fluids by transmission electron microscopy
G. Teschke and I. Daubechies: Image restoration and wavelets
M.E. Testorf and M. Fiddy: Imaging from scattered electromagnetic fields, investigations into an unsolved problem
M. Tonouchi: Terahertz radiation imaging
N.M. Towghi: Ip norm optimal filters
D. Tschumperlé and R. Deriche: Multivalued diffusion PDEs for image regularization
E. Twerdowski: Defocused acoustic transmission microscopy
Y. Uchikawa: Electron gun optics
C. Vachier-Mammar and F. Meyer: Watersheds
K. Vaeth and G. Rajeswaran: Organic light-emitting arrays
M. van Droogenbroeck and M. Buckley: Anchors in mathematical morphology
M. Wild and C. Rohwer: Mathematics of vision
B. Yazici and C.E. Yarman (vol. 141): Stochastic deconvolution over groups
J. Yu, N. Sebe, and Q. Tian (vol. 144): Ranking metrics and evaluation measures
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 140
Recursive Neural Networks and Their Applications to Image Processing MONICA BIANCHINI, MARCO MAGGINI, AND LORENZO SARTI Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Siena, 53100 Siena, Italy
I. Introduction . . . 1
   A. From Flat to Structural Pattern Recognition . . . 1
   B. Recursive Neural Networks: Properties and Applications . . . 7
II. Recursive Neural Networks . . . 9
   A. Graphs . . . 9
   B. Processing DAGs with Recursive Neural Networks . . . 11
      1. Processing DPAGs . . . 14
      2. Processing DAGs-LE . . . 17
   C. Backpropagation Through Structure . . . 18
   D. Processing Cyclic Graphs . . . 22
      1. Recursive-Equivalent Transforms . . . 25
      2. From Cyclic Graphs to Recursive Equivalent Trees . . . 28
   E. Limitations of the Recursive Neural Network Model . . . 30
      1. Theoretical Conditions for Collision Avoidance . . . 31
III. Graph-Based Representation of Images . . . 33
   A. Introduction . . . 33
   B. Segmentation of Images . . . 33
   C. Region Adjacency Graphs . . . 36
   D. Multiresolution Trees . . . 38
IV. Object Detection in Images . . . 39
   A. Object Detection Methods . . . 39
   B. Recursive Neural Networks for Detecting Objects in Images . . . 42
      1. Learning Environment Setup . . . 42
      2. Detecting Objects . . . 47
References . . . 54
I. INTRODUCTION

A. From Flat to Structural Pattern Recognition

Pattern recognition algorithms and statistical classifiers, such as neural networks or support vector machines (SVMs), can deal with real-life noisy data in an efficient way, so that they can be successfully applied in several different
domains. However, the majority of such tools are restricted to processing real vectors of a finite and fixed dimensionality. On the other hand, most real-world problems have no natural representation as a single “table,” i.e., in several applications, the information that is relevant for solving problems is organized in entities and relationships among entities, so that applying traditional data mining methods implies that an extensive preprocessing has to be performed on the data. For instance, categorical variables are encoded by one-hot encoding, time series are embedded into finite dimensional vector spaces using time windows, preprocessing of images includes edge detection and the exploitation of various filters, sound signals can be represented by spectral vectors, and chemical compounds are characterized by topological indices and physicochemical attributes.

However, other data formats and data representations exist, and can be exploited to represent patterns in a more natural way. Sets, without a specified order, can describe objects in a scene or a pool of measurements. Functions, evaluated at specific points, constitute a natural description for time series or spectral data. Sequences of any length also represent time series or spatial data. Tree structures describe terms, logical formulas, parse trees, or document images. Graph structures can be used to encode chemical compounds, images, and, in general, objects composed of atomic elements. Feature encoding of such data can produce compact vectors, even if the encodings are often problem-dependent, time-consuming, and heuristic. Moreover, some information is usually lost when complex data structures such as sequences, trees, or graphs of arbitrary size are encoded in fixed dimensional vectors.

The need to deal with complex structures has focused the researchers' efforts on developing methods able to process such types of data. However, this issue has given rise to a long-standing debate between fans of artificial intelligence methods, based on symbols, and fans of computational intelligence methods, which operate on numbers (Bezdek, 1994). As a matter of fact, in the last three decades, the emphasis in pattern recognition research has been swinging from decision-theoretic to structured approaches. Decision-theoretic models are essentially based on numerical features, which provide a global representation of patterns and are obtained using some sort of preprocessing (like those listed above). Many different decision-theoretic methods have been developed in the framework of connectionist models, which operate on subsymbolic pattern representations. On the other hand, syntactic and structural pattern recognition (and also artificial intelligence-based) methods have been developed that emphasize the symbolic, structural nature of patterns. However, both purely decision-theoretic and syntactical/structural approaches have a limited value when they are applied to many interesting real-life problems. In fact, syntactical and structural approaches can model the structure of patterns, but they are not very well-suited for dealing with patterns corrupted by
noise. This limitation was recognized early, and several approaches have been pursued to incorporate statistical properties into structured approaches. The data representations used for either syntactical or structural techniques have been enriched with attributes that are in fact vectors of real numbers describing appropriate features of the patterns. These attributes are expected to allow some statistical variability in the patterns under consideration. A comprehensive survey on the embedding of statistical approaches into syntactical and structural pattern recognition can be found in Tsai (1990). On the other hand, parametric or nonparametric statistical methods can nicely deal with distorted noisy patterns, but they are severely limited when the patterns are strongly structured. The feature extraction process in those cases seems to be inherently ill-posed. In fact, a structured pattern can be regarded as an arrangement of elements, deeply dependent by the interactions among them, and by the intrinsic nature of each element. Hence, the causal, hierarchical, and topological relations among parts of a given pattern yield significant information. In the past few years, some new models, which exploit the above definition of pattern as an integration of symbolic and subsymbolic information, have been developed. These models try to solve one of the most challenging tasks in pattern recognition: obtain a flat representation for a given structure (or for each atomic element that belongs to a structure) in an automatic, and possibly adaptive, way. This flat representation, computed following a recursive computational schema, takes into account both the local information associated with each atomic entity and the information induced by the topological arrangement of the elements, inherently contained in the structure. Nevertheless, pattern recognition approaches to structured data are often upgrades of methodologies originally developed for flat vectorial data. Among them, there are popular data mining methods, such as decision trees, supervised and unsupervised neural networks, Markov models, rule learners, and distance-based algorithms. With respect to Markov models, random walk (RW) techniques have recently been proposed (Gori et al., 2005a) that can compute the relevance for each node in a graph. The relevance depends on the topological information collected in the graph structure and on the information associated with each node. The relevance values, computed using an RW model, have been used in the past to compute the ranking of the Web pages inside search engines, and Google, probably the most popular one, uses a ranking technique based on a particular RW model. Classical problems related to graph theory, like graph or subgraph matching, can also be addressed in this framework. The RW approach can be used, for example, in problems of image retrieval. On the other hand, support vector machines (Boser et al., 1992; Vapnik, 1995) are among the most successful recent developments within the machine learning
and the data mining communities. Along with some other learning algorithms, like Gaussian processes and kernel principal component analysis, they form the class of kernel methods (Müller et al., 2001; Schölkopf and Smola, 2002). The computational attractiveness of kernel methods comes from the fact that they can be applied into high-dimensional feature spaces without the high cost of explicitly computing the mapped data. In fact, the kernel trick consists in defining a positive-definite kernel so that a set of nonlinearly separable data can be mapped onto a larger metric space, on which they become linearly separable, without explicitly knowing the mapping between the two spaces (Gärtner, 2003). Using a different kernel corresponds to a different embedding and thus to a different hypothesis language. Crucial to the success of kernel-based learning algorithms is the extent to which the semantics of the domain are reflected in the definition of the kernel. In fact, kernel functions that directly handle data represented by graphs are often designed a priori or at least they allow a limited adaptation, so that they cannot grasp any structural property that has not been guessed by the designer. Some examples are convolution kernels (Collins and Duffy, 2002), which recursively take into account subgraphs, string kernels (Vishwanathan and Smola, 2002), which require that each tree is represented by the sequence of labels generated by a depth-first visit, or graph kernels (Gärtner et al., 2003), based on a measure of the walks in two graphs that share some labels. An alternative approach consists in adaptive kernel functions, which are able to adapt the kernel to the dataset. Kernel functions defined on structured data have recently received growing attention, since they are able to deal with many real-world learning problems in bioinformatics, natural language processing, or document processing. In recent years, supervised neural networks have also been developed that are able to deal with structured data encoded as labeled directed positional acyclic graphs (DPAGs). These models are called recursive neural networks (RNNs) (Sperduti and Starita, 1997; Frasconi et al., 1998; Küchler and Goller, 1996). The essential idea of RNNs is to process each node of an input graph by a multilayer perceptron, and then to process the DPAG from its leaf nodes toward the root node [if any, otherwise such a node must be opportunely added (Sperduti and Starita, 1997)], using the structure of the graph to connect the neurons from one node to another. The output of the neurons corresponding to the root node can then be exploited to encode the whole graph. In other words, to process an input DPAG, the RNN is unfolded through the graph structure, producing the encoding network. Then, the computation of the state of the network is performed from the frontier of the graph to the root node, following, in the reverse direction, the arcs of the input structure. The state of the network at a generic node of the input structure depends on the label associated with the node and on the states of the children of the
node itself (this kind of computation establishes a sort of causal relationship between each node and its children). A gradient descent method is then used to learn the weights of the multilayer perceptrons. The models are simplified by assuming that all multilayer perceptrons at each node and across the training set share the same parameters. This approach basically consists of an extension to graphic structures of the traditional “unfolding” process adopted by recurrent neural networks for sequences (Elman, 1990). The main limitation of this model is inherently contained in the kind of structures that can be processed. In fact, it is not always easy to represent real data using DPAGs. In this kind of graph, each edge starting from a node has an assigned position, and any rearrangement of the children of a node produces a different graph. While such an assumption is useful for some applications, it may sometimes introduce an unnecessary constraint on the representation. For example, this hypothesis is not suitable for the representation of a chemical compound and might not be adequate for several pattern recognition problems. Considering this limitation, some researchers have recently proposed several models aimed at processing more general classes of graphs. In Bianchini et al. (2001a), a new model able to process DAGs was presented. This model exploits a weight-sharing approach to relax the positional constraint. Even if interesting from a theoretical point of view, this methodology has limited applications. In fact, the complexity of the network architecture grows exponentially with the maximum outdegree of the processed structures. A different way of relaxing the positional constraint has been proposed (Bianchini et al., 2005a; Gori et al., 2003). This approach is based on processing directed acyclic graphs with labels also on the edges (DAGs-LE). The state of each node depends on the label attached to the node and on a combination of the contributions of its children weighed by the edge labels. This total contribution can be computed using a feedforward neural network or an ad hoc function, and is independent of both the number and the order of the children of the node. Therefore, the model allows processing of graphs with any outdegree. Moreover, since it is not always easy to determine useful features that can be associated with the edges of the structures, a procedure that allows a DPAG to be transformed into a DAG-LE is presented (Bianchini et al., 2004a). To process also cyclic graphs, in Bianucci et al. (2001), a collapse strategy is proposed for cycles, which are represented by a unique node that resembles all the information collected in the nodes belonging to the cycle. Unfortunately, this strategy cannot be carried out automatically, and it is intrinsically heuristic. A different technique for processing cyclic structure has been proposed (Bianchini et al., 2002, 2006). This method preprocesses the cyclic structures, transforming each graph into a forest of recursive-
equivalent trees. The forest of trees collects the same information contained in the cyclic graph. This method allows processing of both cyclic and undirected graphs. In fact, undirected structures can be transformed into cyclic-directed graphs by replacing each undirected edge with a pair of directed arcs with opposite directions. Finally, the graph neural networks (GNN) model (Gori et al., 2004, 2005b) is able to process general graphs, including directed and undirected structures, both cyclic and acyclic. In the GNN model, the encoding network can be cyclic and nodes are activated until the network reaches a steady state. An alternative approach for undirected structures has been proposed (Vullo and Frasconi, 2002). During a preprocessing phase, a direction for each edge of the structure is defined. Then the state of the RNN is computed for each node, considering a bidirectional computation. First, the RNN is unfolded following the chosen direction for the edges, then it is unfolded again considering the opposite direction. This model has proven to be particularly suited for bioinformatics applications. Finally, a model that relaxes the causal relationship between each node and its children has been proposed (Micheli et al., 2004). In this approach, called cascade-correlated, the state of each node depends on the attached label and both on the states of its children and the states of its parents. In fact, when real data are represented using structures, it is not always clear how the father/children relationship must be established. The aim of this approach is to relax the causal relationship established by the original model, defining a new processing schema that allows the father/children relationship to be disregarded. All the models cited above were defined inside the supervised learning paradigm. Supervised information, however, either may not be available or may be very expensive to obtain. Thus, it is very important to develop models that are able to deal with structured data in an unsupervised fashion. In the past few years, some RNN models have been proposed in the framework of unsupervised learning (Hammer et al., 2004), and various unsupervised models for nonvectorial data are available in the literature. The approaches presented (Günter and Bunke, 2001; Kohonen and Sommervuo, 2002) use a metric for self-organizing maps (SOMs) that directly works on structures. Structures are processed as a whole by extending the basic distance computation to complex distance measures for sequences, trees, or graphs. The edit distance, for example, can be used to compare strings of arbitrary length. Such a technique extends the basic distance computation for the neurons to a more expressive comparison that tackles the given input structure as a whole. Early unsupervised recursive models, such as the temporal Kohonen map or the recurrent SOM, include the biologically plausible dynamics of leaky integrators (Chappell and Taylor, 1993; Koskela et al., 1998a, 1998b). This
idea has been used to model direction selectivity in models of the visual cortex and for time series representation (Koskela et al., 1998a, 1998b; Farkas and Mikkulainen, 1999). Combinations of leaky integrators with additional features can increase the capacity of the models as demonstrated in further proposals (Euliano and Principe, 1999; Hoekstra and Drossaers, 1993; James and Mikkulainen, 1995; Kangas, 1990; Vesanto, 1997). Recently, more general recurrences with richer dynamics have been proposed (Hagenbuchner et al., 2001, 2003; Strickert and Hammer, 2003a, 2003b; Voegtlin, 2000, 2002; Voegtlin and Dominey, 2001). These models transcend the simple local recurrence of leaky integrators and can represent much richer dynamic behavior, which has been demonstrated in many experiments. While the processing of tree-structured data has been discussed (Hagenbuchner et al., 2001, 2003), all the remaining approaches have been applied to time series. B. Recursive Neural Networks: Properties and Applications RNNs can compute maps from a graph space to an isomorph graph space or to a vector space (for instance Rn ). Some of the models presented above were studied from a theoretical point of view to understand how the approximation capabilities of feedforward neural networks can be extended to RNNs. In fact, feedforward networks were proved to be universal approximators (Hornik et al., 1989) and able to compute any function between vector spaces. In Hammer (1998), the pioneering work on the approximation capabilities of RNNs, the original RNN model, tailored to process DPAGs, was shown to behave, in probability, as a universal approximator for the space of positional trees, so that it is able to compute any function from such space to Rn . Subsequently (Bianchini et al., 2001a), RNNs with linear neurons have also been proved to be universal approximators for the domain of DPAGs, and, finally, the universal approximation capability has recently been extended to cyclic graphs (Bianchini et al., 2002, 2006) and to DAGs-LE (Bianchini et al., 2005a). Also recently (Bianchini et al., 2001a), some theoretical results have been stated on linear recursive networks to establish necessary and sufficient conditions to guarantee a unique vectorial representation for structures belonging to certain classes or, in other words, to avoid the collision phenomenon. In fact, provided that the dimension of the internal state grows exponentially with the height of trees, collisions can always be avoided. Of course, this result considerably limits the range of applicability of recursive architectures in dealing with general large structures. As a matter of fact, the problem of recognizing general trees becomes intractable as soon as their height increases. Also, it is interesting that such a negative conclusion can be directly extended to nonlinear recursive networks. Nevertheless, there are
some significant characteristics of trees that can be recognized by using a reasonable amount of resources. In fact, in Bianchini et al. (2001a), it has been proven that a simple class of linear recursive networks can count the number of nodes per level or the number of left and right branches per level. Thus, the linear recursive model cannot be used to encode all trees, although it can be useful, in some practical applications, to recognize different classes of trees. Such useful properties of RNNs have been exploited in many pattern recognition applications. In particular, RNNs have been applied to image analysis, chemistry, bioinformatics, Web searching, theorem proving, and natural language processing, for solving both classification and regression tasks. Actually, RNNs allow state-of-the-art results to be obtained in some bioinformatics problems, however, for the cases in which RNNs do not achieve the best results, they allow us to define general and nonheuristic techniques for pattern recognition applications. In the bioinformatics field, RNNs were applied to the prediction of protein topologies (Vullo and Frasconi, 2002; Pollastri et al., 2002). In this problem, the contact map of each protein is represented by an undirected graph, and then the birecursive architecture presented in Vullo and Frasconi (2002) is used to predict the protein secondary structures. In chemistry, an RNN application to the quantitative structure-activity relationship (QSAR) problem of benzodiazepines has been presented (Bianucci et al., 2001). This application has also allowed the performances of the recursive cascade correlation architecture proposed in Micheli et al. (2004) to be evaluated. With respect to natural language processing, RNNs were used (Sturt et al., 2003) to learn firstpass structural attachment preferences in sentences represented as syntactic trees, and in regard to structural pattern recognition, they were also used, combined with SVM classifiers, to classify fingerprints (Yao et al., 2003). For Web searching, GNN were applied (Scarselli et al., 2005) to implement an adaptive ranking, used for learning the importance of Web pages by examples. Finally, considering image analysis, RNNs were exploited for the classification of company logos (Diligenti et al., 2001), for the definition of a similarity measure useful for browsing image databases (de Mauro et al., 2003), and for the localization and detection of the region of interest in colored images. Moreover, a combination of RNNs for cyclic graphs and RNNs for DAGsLE was exploited (Bianchini et al., 2003a, 2003b) to locate faces, while an extension of the same model was proposed (Bianchini et al., 2004b, 2005a, 2005b) to detect general objects. In this chapter, the recursive neural network model is presented, paying attention to its evolution and, therefore, to its present capacity for processing general graphs. Moreover, the backpropagation through structure algorithm is briefly sketched, to allow the reader to grasp how the learning takes place.
The computational capabilities of the recursive model are also assessed, to establish what kinds of tasks RNNs are able to face and how and when they are prone to failure. In Section III, the graph-based representation of images is described, starting from the segmentation process, and defining several different types of structures that can appropriately collect the perceptual/topological information extracted from images. Finally, in Section IV, the capacity of RNNs to process images is definitely established, showing some interesting results on object detection problems.
II. RECURSIVE NEURAL NETWORKS

RNNs were conceived to process structured information coded as a graph. The term “recursive” reflects the fact that a local computation is recursively applied to each node in the input graph to yield a result that depends on the whole input data structure. In the proposed framework this local computation is performed by a neural network, and this choice allows us to extend the supervised learning paradigm of neural networks to structured domains. Backpropagation through structure (BPTS) (Sperduti and Starita, 1997) is a straightforward extension of the original backpropagation and backpropagation through time algorithms used to train feedforward and recurrent neural networks, respectively. Depending on the characteristics of the input graphs, different models of RNNs can be defined. The original RNN model was proposed to process DPAGs (Sperduti and Starita, 1997; Frasconi et al., 1998). Extensions of this model can process more general classes of graphs, like DAGs-LE or general cyclic graphs. In the following sections we will introduce the different architectures of RNNs and the related learning algorithm based on BPTS. Finally, some considerations on the approximation/classification capabilities of the recursive model are briefly sketched.

A. Graphs

A graph can encode structured data by representing elements or parts of the information as nodes and the relationships among them as arcs. A directed unlabeled graph is defined by the pair GU = (V, E), where V is the finite set of nodes and E ⊆ V × V represents the set of arcs. An arc from node u to node v is a directed link represented by the ordered pair (u, v) ∈ E, u, v ∈ V. In the following we will consider only directed graphs. An undirected graph can be conveniently represented as a directed graph by substituting each undirected edge with a pair of directed arcs: an edge between nodes u and v will correspond to the two directed arcs (u, v) and (v, u).
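Purely as an illustration, and not part of the original chapter, the sketch below shows one way such labeled directed graphs might be held in memory; the class name LabeledDigraph, its fields, and the choice of plain Python containers are assumptions made only for this example.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class LabeledDigraph:
    """Directed graph with optional node labels L(v) and edge labels E((u, v))."""
    node_labels: Dict[str, Optional[Sequence[float]]] = field(default_factory=dict)
    children: Dict[str, List[str]] = field(default_factory=dict)   # ch[v]
    parents: Dict[str, List[str]] = field(default_factory=dict)    # pa[v]
    edge_labels: Dict[Tuple[str, str], Sequence[float]] = field(default_factory=dict)

    def add_node(self, v: str, label: Optional[Sequence[float]] = None) -> None:
        self.node_labels.setdefault(v, label)
        self.children.setdefault(v, [])
        self.parents.setdefault(v, [])

    def add_arc(self, u: str, v: str, label: Optional[Sequence[float]] = None) -> None:
        # The ordered pair (u, v): v becomes a child of u, u a parent of v.
        self.add_node(u)
        self.add_node(v)
        self.children[u].append(v)
        self.parents[v].append(u)
        if label is not None:
            self.edge_labels[(u, v)] = label

    def add_undirected_edge(self, u: str, v: str) -> None:
        # An undirected edge is replaced by the two opposite arcs (u, v) and (v, u).
        self.add_arc(u, v)
        self.add_arc(v, u)

    def outdegree(self, v: str) -> int:
        return len(self.children[v])


# Example: the two directed arcs standing in for one undirected edge.
g = LabeledDigraph()
g.add_undirected_edge("u", "v")
assert g.children["u"] == ["v"] and g.children["v"] == ["u"]

The children and parents maps correspond to the sets ch[v] and pa[v] introduced just below, so quantities such as od[v] and the maximum outdegree o can be read directly from them.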
The pair (V , E) defines the structure of the graph by specifying the topology of the connections among the nodes. Anyway, when representing structured data, each node can be characterized by a set of values assigned to a predefined group of attributes. For example, if a node represents a region in an image, features describing the perceptual and geometric properties can be stored in the node to characterize the region. Thus, the data representation can be enriched by attaching a label to each node in the graph. In general a different set of attributes can be attached to each node, but, in the following, we will assume that the labels are chosen from a unique label space L (for instance, we can consider labels represented as vectors of rationals, i.e., L = Qm , or vectors of reals, i.e., L = Rm ). Thus, we define a directed labeled graph as a triple GL = (V , E, L), where V and E are the set of nodes and arcs, respectively, and L : V → L is a node labeling function that defines the label L(v) ∈ L for each node v in the graph. Finally, also the semantics of the arcs can be enriched by associating a label to each arc (u, v) in the graph. A graph with labeled edges can also encode attributes related to the relationships between pairs of nodes. For example, the arc between two nodes representing two regions in an image can encode the “adjacency” relationship. A label attached to the arc can specify a set of features that describes the mutual position of the two regions. We will assume that the labels for arcs belong to a given edge label space Le . A directed graph with labeled edges is defined by a quadruple GLE = (V , E, L, E ), where the edge labeling function E : E → Le attaches a label E ((u, v)) ∈ Le to the arc (u, v) ∈ E. Notice that, in general, the two arcs (u, v) and (v, u) can have different labels. The topology of the graph can be characterized by the following properties. Given any node v ∈ V , pa[v] = {w ∈ V | (w, v) ∈ E} is the set of the parents of v, while ch[v] = {w ∈ V | (v, w) ∈ E} represents the set of its children. The outdegree of v, od[v] = |ch[v]|, is the cardinality of ch[v], and o = maxv od[v] is the maximum outdegree in the graph, while the indegree of v is the cardinality of pa[v] (|pa[v]|). Nodes having no parents (i.e., |pa[v]| = 0) are called sources, whereas nodes having no children (i.e., |ch[v]| = 0) are referred to as leaves. We denote the class of graphs with maximum indegree i and maximum outdegree o as #(i,o) . Moreover, we denote the class of graphs with bounded indegree and outdegree (but unspecified) as #. Given a labeled graph GL , the structure obtained by ignoring the node and/or edge labels will be referred to as the skeleton of GL , denoted as skel(GL ). Finally, the class of all data structures defined over the domain of the labeling function L and skeleton in #(i,o) will be denoted as (i,o) L# and will be referred to as a structured space. A path from node u to node v in a graph G is a sequence of nodes (w1 , w2 , . . . , wp ) such that w1 = u, wp = v, and the arcs (wi , wi+1 ) ∈ E,
i = 1, . . . , p − 1. If there is at least one path such that w1 = wp , the graph is cyclic. As we will discuss in the following sections, this property is crucial for defining the recursive schema used to process the graph. In fact, if the graph is acyclic we can define a partial ordering on the set of nodes V , such that u ≺ v if u is connected to v by a direct path. The set of the descendants of a node u, desc(u) = {v ∈ V | u ≺ v}, contains all the nodes that precede v in the partial ordering. We will focus on the class of DAGs since it allows a simple scheme for RNNs. In particular we will consider models designed to process the following subclasses of DAGs: 1. Directed Positional Acyclic Graphs (DPAGs). It is a subclass of DAGs for which an injective function ov : ch[v] → {1, . . . , o} assigns a position ov (c) to each child c of a node v. Therefore, a DPAG is represented by the tuple (V , E, L, O), where O = {o1 , . . . , o|V | } is the set of functions defining the position of the children for each node. Since the range for each function ov (c) is {1, . . . , o}, if for a node v |ch[v]| < o holds, there will be some empty positions that will be considered as null pointers (NIL). Thus, using a more intuitive view, in a DPAG the children of each node v can be organized in a fixed size vector ch[v] = [ch1 [v], . . . , cho [v]], where chk [v] ∈ V ∪ {NIL}. We denote with PTREEs the subset of DPAGs that contains graphs that are trees, i.e., such that each node in the structure has just one parent. 2. Directed Acyclic Graphs with Labeled Edges (DAGs-LE). DAGs-LE represent the subclass of DAGs for which an edge labeling function E is defined. In this case it is not necessary to define an ordering among the children of a given node. Finally, we denote with TREEs-LE the subset of DAGs-LE that contains graphs that are trees. When the result of the recursive processing is a single value for the whole graph, as it is for graph classification and regression tasks, the DAG G is required to possess a supersource, that is a node s ∈ V such that any other node in G can be reached by a directed path starting from s. Note that if a DAG does not have a supersource, it is still possible to define a convention for adding an extra node s with a minimal number of outgoing arcs, such that s is a supersource for the expanded DAG (Sperduti and Starita, 1997). B. Processing DAGs with Recursive Neural Networks We consider a processing scheme based on a set of state variables Xv that are defined for each node v. Each variable Xv is supposed to encode the information relevant for the overall computation, related to node v and all its descendants. The range for state variables is the state space X, whose
choice depends on the particular model we exploit. In the following, we will consider Xv ∈ Rn since neural networks can naturally compute functions on real vectors. The proposed state-based processing is closely related to the computation carried out by recurrent neural networks while analyzing a time series. The internal state of the recurrent network, which acts like an adaptive dynamic system, encodes the past history of inputs and collects all the information needed to define the future evolution of the computation. In the recursive model, the state can be computed locally at each node depending on the states of its children1 and on the input information available at the current node (the node label). This schema is analogous to the processing of recurrent neural networks where the new state of the network at time t is computed from the state at time t − 1 and the current input. Thus, this framework requires the definition of a state transition function f that is used to compute Xv given the states of the set of the children of node v, the labels eventually attached to the arcs connecting v to each child, and the label Uv stored in the node. In general, the function f can be implemented by a neural network that depends on a set of trainable parameters θf . Apart from the constraints on the number and type of inputs and outputs of the neural network, there are no other assumptions on its architecture (type of neurons, number of layers, etc.). More precisely, the state Xv is computed by the transition function Xv = f (Xch[v] , L(v,ch[v]) , Uv , θf ),
(1)
where Xch[v] = {Xchi[v] | i = 1, . . . , o(v)} is the set of the states of the children of node v and L(v,ch[v]) = {L(v,chi[v]) | i = 1, . . . , o(v)} is the set of edge labels attached to the arcs connecting v to its children. Given an input DAG G, the transition function of Eq. (1) is applied recursively to the nodes following their inverse topological order. Thus, first the state of the leaves is computed and then the computation is propagated to the upper levels of the graph until the source nodes are reached (or the supersource if there is only one source node). The use of the inverse topological order to process the nodes in a DAG guarantees that when the state of node v is computed using Eq. (1), the states of its children have already been calculated. To apply this computational scheme, the requirement of the graph to be acyclic is crucial. If a cycle was present in the graph, the state of a node belonging to the cycle would recursively depend on itself. In fact, by the recursive application of Eq. (1) the computation flows backward through the paths defined in the graph and the state of a given node v depends on all its descendants desc(v). The presence of cycles in the graph would make the
state computation undefined, unless a different scheme is used. A possible approach to extend the recursive neural network computation to cyclic graphs will be presented in a following section.

¹ A child of a given node is a direct descendant in the partial ordering defined by the arcs.

The output of the state propagation is a graph having the same skeleton of the input graph G. In fact, the states Xv can be considered as new labels attached to the nodes of G. On the other hand, the computation can be viewed as a “copy” of the transition function f in each node v. Thus, moving from a local to a global view, the state computation can be seen as the application of a function that is obtained by combining different instances of the transition function following the topology of the input graph. This view yields the encoding network that is obtained by unfolding the transition function on the input graph. By applying the same transition function to different input graphs, we obtain different encoding networks featuring the same building block but assembled with a different structure. When processing a graph G having a supersource s, the state Xs can effectively be considered as the encoding of the whole graph. Figure 1 depicts how the encoding network is obtained by the unfolding on the input graph of the recursive network that implements the transition function f. The function f is replicated for each node in the graph and the network inputs are properly connected following the topology of the arcs in the graph.

FIGURE 1. The encoding and the output networks associated with a graph. The recursive network is unfolded through the structure of the graph.

As can be observed in Figure 1, the
encoding network resulting from the unfolding is essentially a multilayered feedforward network, whose blocks share the same weights θf . Finally, an output network g can be defined to map the states to the actual output of the computation: Yv = g(Xv , θg ),
where θg is a set of trainable parameters. Yv belongs to an output space and in the following we will consider Yv ∈ Rr. The function g can be computed for each node in the input graph G, thus yielding an output graph with the same skeleton of G and the nodes labeled with the values Yv. In this case the RNN realizes a transduction from a graph G to a graph G′, such that skel(G) = skel(G′). Otherwise, the output can be computed only for the supersource of the input graph, realizing a function ψ from the space of DAGs to Rr defined as ψ(G) = g(Xs, θg). This second approach can be used in classification and regression tasks. Figure 1 shows this latter case, where the output network is applied only at the supersource. As shown in Figure 1 the pair of functions f and g defines the RNN. In particular the recursive connections on the function f define the dependencies among the variables in the connected nodes. In fact, the recursive connections define the topology of the encoding network establishing the modality of combination of the states of the children of each node. The parameters θf and θg are the trainable connection weights of the network, θf and θg being independent of node v (in this case, we say that the RNN is stationary). The parametric representations of f and g can be implemented by a variety of neural network models.

1. Processing DPAGs

When considering DPAGs the set of children of a given node is ordered and it can be conveniently represented using a vector. The position of each child can be significant in the computation of the output, and two graphs differing just in the order of the children of a given node v may yield a different output. Thus, the transition function f of Eq. (1) must be properly redefined to take into account the position of each child. Basically, the position can be considered as a label attached to each arc connecting the node to a child, but it is simpler to code the position by organizing the children in a vector as shown in Section II.A. Thus, in the case of DPAGs the transition network is

Xv = f(Xch[v], Uv, θf),    (2)

with

Xch[v] = [X′ch1[v], . . . , X′cho[v]]′,    o = max_{v∈V} od[v],
and Xchi[v] equal to the frontier state X0, if node v lacks its ith child. For example, when processing a leaf node, the state depends only on its label Uv and on the frontier state X0, since Xleaf = f(X0, . . . , X0, Uv, θf). If the function f is implemented by a two-layer perceptron, with sigmoidal activation functions in the hidden units and linear activation functions in the output units, the state is calculated according to

Xv = V · σ( Σ_{k=1}^{o} Ak · Xchk[v] + B · Uv + C ) + D,    (3)
where σ is a vectorial sigmoid function and θf collects the pointer matrices Ak ∈ Rq,n, k = 1, . . . , o, B ∈ Rq,m, C ∈ Rq, D ∈ Rn, and V ∈ Rn,q. Here, m is the dimension of the label space, n the dimension of the state space, and q represents the number of hidden neurons. As can be observed from Eq. (3), the dependency of the propagation on the position of each child is obtained by using a different pointer matrix Ak for each position k. This solution can show some limitations when the maximum outdegree in a graph is large but most of the nodes have a smaller number of children. In fact, even if the number of parameters grows linearly with the maximum graph outdegree, it can become quite large and, more importantly, for most of the nodes many pointer matrices are just used to propagate the NIL pointer value X0, thus carrying very little information. In this case, two different solutions can be pursued. If possible, the arcs in the input graph can be conveniently pruned to reduce the maximum graph outdegree. For example, when extracting the representation of images based on the region adjacency graph, the arcs corresponding to adjacent regions sharing a border having a length under a predefined threshold could be pruned. Anyway, this approach results in the loss of part of the original information and thus is not always feasible. The second solution is to use a nonstationary transition network, in which different sets of parameters are used depending on the node outdegree. By using this approach, we avoid introducing noisy information due to the padding of empty positions with the frontier state, otherwise needed for nodes having lower outdegrees. A similar equation holds for the output function g:

Yv = W · σ(E · Xv + F) + G,

where θg collects E ∈ Rq′,n, F ∈ Rq′, G ∈ Rr, W ∈ Rr,q′. A simple neural network architecture implementing the transition function of Eq. (3) is shown in Figure 2. The recursive connections link the output of the state neurons to the network inputs, corresponding to each position in the child vector. This notation specifies how the network is assembled in the encoding network. Apart from the recursive connections, the transition function is implemented by a classical feedforward network with a layer of
hidden units. The network has n · o + m inputs corresponding to the o states of the children (n components each) and to the node label (m components).

FIGURE 2. Transition function realized with a multilayer perceptron. This network can process graphs with a maximum outdegree o = 2.

Example 1. Referring to Figure 1, where o = 2, and supposing that f and g are implemented with a three-layer perceptron, the state at each node and the output at the supersource are computed as

Xd = Vσ((A1 + A2)X0 + BUd + C) + D,
Xc = Vσ((A1 + A2)X0 + BUc + C) + D,
Xb = Vσ(A1Xc + A2Xd + BUb + C) + D,
Xa = Vσ(A1Xb + A2Xd + BUa + C) + D,
Ya = Wσ(EXa + F) + G.
Note that the states are computed starting from the leaf nodes d and c up to the root node a. Remark 1. In the case of sequences (Figure 3), each node represents a time step t and the arcs represent the relationship “follows.” Using this representation for time series, recursive networks reduce to recurrent networks. In fact, the state updating described in Eq. (3) becomes Xt = V · σ (AXt−1 + BUt + C) + D. Matrix A weighs the recurrent connections, while matrix B weighs the external inputs. When the output is computed at the supersource, an RNN implements a function from the set of DPAGs to Rr , h : DPAGs → Rr , where h(G) = Ys .
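To make the stationary transition of Eq. (3) and the bottom-up computation of Example 1 concrete, the following sketch (in Python with NumPy) processes a small DPAG with maximum outdegree o = 2; all dimensions, parameter values, and the choice of tanh as the sigmoid are illustrative assumptions, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)
n, m, q, r, o = 3, 2, 4, 1, 2            # state, label, hidden, output sizes, max outdegree

# Parameters of f (Eq. 3) and of the output function g: illustrative random values.
A = [rng.standard_normal((q, n)) for _ in range(o)]   # pointer matrices A_k
B = rng.standard_normal((q, m)); C = rng.standard_normal(q)
V = rng.standard_normal((n, q)); D = rng.standard_normal(n)
E = rng.standard_normal((q, n)); F = rng.standard_normal(q)
W = rng.standard_normal((r, q)); G = rng.standard_normal(r)

X0 = np.zeros(n)                          # frontier state

# DPAG of Example 1: node -> ordered children (None marks a missing child).
children = {"a": ["b", "d"], "b": ["c", "d"], "c": [None, None], "d": [None, None]}
labels = {v: rng.standard_normal(m) for v in children}

def f(child_states, u):
    """State transition of Eq. (3), with tanh standing in for the sigmoid."""
    s = sum(A[k] @ x for k, x in enumerate(child_states)) + B @ u + C
    return V @ np.tanh(s) + D

def g(x):
    """Output network applied to a state vector."""
    return W @ np.tanh(E @ x + F) + G

X = {}
for v in ["d", "c", "b", "a"]:            # reverse topological order: leaves first
    kids = [X[c] if c is not None else X0 for c in children[v]]
    X[v] = f(kids, labels[v])

Y_a = g(X["a"])                           # output at the supersource, h(G) = Y_s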
FIGURE 3. A temporal sequence. The arcs code the “follows” relationship.
Formally, h = g ◦ f̃, where f̃, recursively defined as

f̃(G) = X0 ,                                   if G is empty,
f̃(G) = f (f̃(G1 ), . . . , f̃(Go ), Uv ),      otherwise,
denotes the process that takes a graph and returns the state at the supersource, f̃(G) = Xs . In fact, the function f̃ depends both on the topology and on the labels of the DPAG. Following Eq. (2), the state transition function f , computed by the RNN, depends on the order of the children of each node, since the state of each child occupies a particular position in the list of the arguments of f . To overcome such a limitation, in Bianchini et al. (2001b), a weight-sharing approach was described, able to relax the order constraint and to devise a neural network architecture suited for DAGs with a bounded outdegree. However, the weight-sharing technique used in this approach cannot be applied to DAGs with a large outdegree o, due to the factorial growth in the number of network parameters with respect to o. Even if the maximum outdegree can be bounded, for instance by pruning those connections that are heuristically classified as less informative, some important information may be discarded in this preprocessing phase.

2. Processing DAGs-LE
In many applications, like image processing, the assumptions required by the DPAG-based model defined in the previous section introduce unnecessary constraints. First, in many cases the definition of a position for each child of a node is arbitrary. Second, as also noted previously, the need to bound the maximum outdegree in the graph can cause the loss of important information. For example, when considering the image representation based on the region adjacency graph, the order of the adjacent regions may be significant, but assigning a specific position to them by choosing a starting direction may be arbitrary. Moreover, to bound the maximum graph outdegree some of the arcs must be pruned and, anyway, for each node v for which |ch[v]| < o the last positions in the child vector have to be arbitrarily padded with the frontier state X0 . In fact, the need to consider exactly o children is a limitation of the model and not a feature of the problem. When considering DAGs-LE such limitations can be effectively removed (Gori et al., 2003). In fact, the edge label can encode the relevant features of
the relationship represented by the arcs and the constraints on the number of children can be avoided. For DAGs-LE, we can define a transition function f that does not have a predefined number of arguments and that does not depend on their order. The different contribution of each child depends on the label attached to the corresponding arc. At each node v, the total contribution X(ch[v], L(v,ch[v]) ) ∈ Rp of the children is computed as

X(ch[v], L(v,ch[v]) ) = Σ_{i=1}^{|ch[v]|} φ(Xchi[v] , L(v,chi[v]) , θφ ),    (4)
where L(v,chi[v]) ∈ Rk is the label attached to the arc (v, chi [v]) and the edge-weighting function φ : R(n+k) → Rp is a nonlinear function parameterized by θφ . Then, the state at node v is computed combining X(ch[v], L(v,ch[v]) ) and the node label Uv by a parametric function f̃, as

Xv = f (Xch[v] , L(v,ch[v]) , Uv , θf ) = f̃( X(ch[v], L(v,ch[v]) ), Uv , θf̃ ).    (5)
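A minimal sketch of Eqs. (4) and (5) follows; the realization of φ and f̃ as small perceptron-like maps, and all dimensions and parameter values, are illustrative assumptions (the text discusses admissible realizations next).

import numpy as np

rng = np.random.default_rng(1)
n, m, k, p, q = 3, 2, 2, 4, 5     # state, node-label, edge-label, contribution, hidden sizes

# Parameters of the edge-weighting function phi and of f~ (illustrative values).
A = rng.standard_normal((q, n)); B = rng.standard_normal((q, k))
C = rng.standard_normal(q);      D = rng.standard_normal(p)
V = rng.standard_normal((p, q))
Wf = rng.standard_normal((n, p + m)); bf = rng.standard_normal(n)

def phi(x_child, edge_label):
    """Edge-weighting function of Eq. (4): a two-layer perceptron with linear outputs."""
    return V @ np.tanh(A @ x_child + B @ edge_label + C) + D

def transition(child_states, edge_labels, u):
    """Eq. (5): sum the per-child contributions, then combine them with the node label."""
    total = sum(phi(x, l) for x, l in zip(child_states, edge_labels))
    if not child_states:                      # node without children
        total = np.zeros(p)
    return np.tanh(Wf @ np.concatenate([total, u]) + bf)

# The same transition applies to any number of children, in any order:
u = rng.standard_normal(m)
xs = [rng.standard_normal(n) for _ in range(3)]
ls = [rng.standard_normal(k) for _ in range(3)]
x_v = transition(xs, ls, u)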
With this approach, the transition function f can be applied to nodes with any number of children, and it is also independent of the order of the children. The parametric functions φ, f˜, and g, involved in the recursive network, can be implemented by feedforward neural networks. For example, φ can be computed by a two-layer perceptron with linear outputs as φ(Xchi [v] , L(v,chi [v]) , θφ ) = Vσ (AXchi [v] + BL(v,chi [v]) + C) + D,
where θφ collects A ∈ Rq,n , B ∈ Rq,k , C ∈ Rq , D ∈ Rp , and V ∈ Rp,q , with q the number of hidden neurons. On the other hand, the function φ can also be realized by an ad hoc model. In the following, we will consider the solution originally proposed in Gori et al. (2003), where φ is realized as

X(ch[v], L(v,ch[v]) ) = Σ_{i=1}^{|ch[v]|} Σ_{j=1}^{k} Hj L(v,chi[v])^{(j)} Xchi[v] ,    (6)

with H ∈ Rp,n,k the edge-weight matrix. In particular, Hj ∈ Rp,n is the j th layer of matrix H and L(v,chi[v])^{(j)} is the j th component of the edge label. In the following, the RNN that computes the state transition function defined in Eq. (5) will be referred to as RNN-LE (recursive neural network for DAGs-LE).

C. Backpropagation Through Structure

A learning task for RNNs requires specification of the network architecture, a learning environment Le that contains the data used for learning, and a cost
function to measure the error produced by the current network with respect to the target behavior. Once the learning dataset is chosen, the cost function depends only on the free parameters of the RNN, that is, the vectors θf and θg . By collecting all the network parameters in a unique vector, that is, θ = [θf′ θg′ ]′ , we can write the cost function used in the learning process as E = E(θ ). Hence, the learning task is reformulated as the optimization of a multivariate function. For a supervised learning task, the learning environment contains a set of graphs for which a supervisor provided target values for the network outputs at given nodes. More precisely, for each graph Gp , p = 1, . . . , P , in the learning set the supervisor provides a set of nodes SGp ⊆ VGp together with a target output for the network at each node in SGp , that is, the supervision is a set of pairs (v, Ytv ), v ∈ SGp , and Ytv ∈ Rr . For graph classification or regression tasks, the supervision is provided only at the graph supersource, thus only one target is specified for each graph in the training set. In this latter case, the examples can be specified in a more compact form as pairs (Gp , Yt (p)). Using a quadratic cost function, the error function on the learning set can be defined as

E(θ ) = (1/P ) Σ_{p=1}^{P} EGp (θ ) = (1/(2P )) Σ_{p=1}^{P} Σ_{v∈SGp} ‖Yv (Gp , θ ) − Ytv ‖^2 ,    (7)
where Yv (Gp , θ ) is the output produced by the RNN at node v, while processing the graph Gp , using the values θ for the neural network weights. If the functions f and g that define the RNN are differentiable with respect to the parameters θ, the cost function E(θ ) is a continuously differentiable function and can be optimized by using a gradient descent technique. In particular, the simplest approach is to update the weights at each iteration k as

θk = θk−1 − ηk ∇θ E(θ )|θ=θk−1 ,    (8)
where the gradient of E(θ ) is computed for θ = θk−1 and ηk is the learning rate. The weight vector is usually initialized at step k = 0 with random small values. The weight update equation (8) is iteratively applied until a stopping criterion is met. Usually the learning algorithm is stopped when the cost function assumes a value below a predefined threshold, when a maximum number of iterations (epochs) is reached, or when the gradient norm is smaller than a given value. Unfortunately, the training algorithm is not guaranteed to halt yielding a global optimum. For example, the learning procedure can be trapped in a local minimum of the function E(θ ), which yields a suboptimal solution to the problem. In some cases suboptimal solutions can be acceptable, whereas in others a new learning procedure should be run starting from a
different initial set of weights θ0 . However, more sophisticated gradient-based optimization techniques can be applied to increase the speed of convergence. Thus, at each iteration, the learning algorithm requires two different steps: the computation of the gradient and the update of the RNN weights. In the scheme proposed in Eq. (8) the weight update is performed in batch mode, that is, the weight update is performed using the gradient computed on all the graphs in the learning set. This approach yields the correct optimization of the cost function of Eq. (7). Approximate versions can be defined by updating the weights after the presentation of each graph (pattern mode) or of a set of graphs (block mode). Anyway, the most computationally intensive part is the computation of the gradient of the cost function. Since the total cost E(θ ) is obtained by summing the contributions of the errors on each graph Gp , the total gradient can be obtained by accumulating the gradients computed for each example Gp in the training set. The gradient computation can be efficiently carried out by using the BPTS algorithm, which is derived by extending the original backpropagation algorithm for feedforward networks. The intuition is that the unfolding of the RNN on a given input graph Gp yields an encoding network that is substantially a multilayered feedforward network whose layers share the same set of weights. Thus the gradient computation can be performed in two steps as in the backpropagation algorithm: in the forward pass the recursive network outputs are computed and stored, building the encoding network; in the backward pass the errors are computed at the nodes where the targets are provided and they are backpropagated in the network to compute the gradient components. Let us consider a generic weight ϑ ∈ θ of the RNN. While processing the input graph Gp , the network is replicated at each node v ∈ VGp yielding the encoding network. Thus, we can consider each replica of the recursive neural network as an independent instance having a different set of weights θ (v). Under this assumption we can compute the derivative of the partial cost function EGp (θ ) with respect to the unfolded parameters θ (v). More precisely, we consider the cost function as EGp (θ (v1 ), . . . , θ (vNp )), where VGp = {v1 , . . . , vNp }. Since the replicas of the RNN share the same set of weights, that is, θ (v1 ) = · · · = θ (vNp ) = θ, by applying the rule for the derivative of a composition of functions, we obtain the following rule for each weight ϑ:

∂EGp /∂ϑ = Σ_{v∈VGp} ∂EGp /∂ϑ(v).    (9)
From a practical point of view, Eq. (9) states that the complete derivative with respect to a given weight can be obtained by accumulating the contributions of the derivatives with respect to the instance of the weight at each node. The
local derivatives ∂EGp /∂ϑ(v) can be computed using the same approach as in the original backpropagation algorithm. Basically, the derivative with respect to each local weight ϑ(v) is expanded by considering the local variable that is directly affected by a change in ϑ(v). In particular, backpropagation considers the neural unit that uses the weight ϑ(v) and rewrites the derivative as

∂EGp /∂ϑ(v) = (∂EGp /∂zk (v)) · (∂zk (v)/∂ϑ(v)),
where zk (v) is the output of the affected neural unit. The term δ^z_k (v) = ∂EGp /∂zk (v) is the generalized error that is computed recursively by the backpropagation algorithm, whereas ∂zk (v)/∂ϑ(v) is a factor that depends on the model of the considered unit. For example, for a linear unit, this last derivative equals the value of the input to the connection corresponding to the weight ϑ(v). Since the functions f and g can be implemented by very diverse architectures, it is difficult at this level of abstraction to detail the backpropagation algorithm. Anyway, assuming that the classical backpropagation procedure is implemented both for the transition network and the output network, it suffices to propagate the generalized errors available for the outputs of these networks to their inputs. The generalized errors for the outputs of each replica of the transition and output function in the unfolding graph can be propagated from the sources to the leaf nodes. In particular, let us consider the replica of the transition network corresponding to node v. Given any state variable xj (v) available at the output of this function, it affects the values of the inputs of the replicas of the transition function at the parents of v. The actual input affected by xj (v) for each node u ∈ pa[v] depends on the architecture of the transition network and on the role of v among the children of u (e.g., its position for DPAGs). Thus, we will indicate the affected input as x_j^{(u,v)}(u) [i.e., the variable depends on node u and is identified by the arc (u, v)]. The backpropagation procedure applied to the replica of the transition function in u yields a generalized error δ^{x(u,v)}_j (u). Since a variation in xj (v) affects all the corresponding inputs of the parent nodes, we derive

δj (v) = ∂EGp /∂xj (v) = Σ_{u∈pa[v]} [∂EGp /∂x_j^{(u,v)}(u)] · [∂x_j^{(u,v)}(u)/∂xj (v)] = Σ_{u∈pa[v]} δ^{x(u,v)}_j (u).    (10)
If we consider the output function g, the generalized errors for its output variables yj (v) at node v can be computed directly from the cost function. In fact, if there is no supervision at node v, δ^y_j (v) = 0; otherwise δ^y_j (v) = [yj (v) − y^t_j (v)]. The generalized errors can be backpropagated through the output network g to yield the generalized errors δ^g_j (v) = ∂EGp /∂xj (v) for its inputs, corresponding to the state variables at node v. This generalized error is an additional contribution to the generalized error δj (v) and, thus, Eq. (10) can be completed as

δj (v) = Σ_{u∈pa[v]} δ^{x(u,v)}_j (u) + δ^g_j (v).    (11)
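To illustrate Eqs. (9)–(11) without committing to a particular network architecture, the following toy sketch uses an assumed, highly simplified setting: scalar states, a linear transition Xv = a · Σ_child X + b · Uv , and a linear output y = w · Xs at the supersource only. The graph, parameter values, and supervision are all hypothetical.

# Toy DAG with supersource "s"; children lists define the arcs.
children = {"s": ["p", "q"], "p": ["r"], "q": ["r"], "r": []}
parents = {v: [u for u in children if v in children[u]] for v in children}
U = {"s": 0.5, "p": -1.0, "q": 2.0, "r": 0.3}        # node labels (scalars)
a, b, w, target = 0.7, 0.2, 1.5, 1.0                 # shared weights and supervision at "s"

topo = ["s", "p", "q", "r"]                           # parents listed before children

# Forward pass (leaves first): X_v = a * sum of child states + b * U_v.
X = {}
for v in reversed(topo):
    X[v] = a * sum(X[c] for c in children[v]) + b * U[v]
y = w * X["s"]
E = 0.5 * (y - target) ** 2

# Backward pass, Eq. (11): delta(v) = sum over parents of their backpropagated error
# (here simply a * delta(parent)) plus the output-network contribution (only at "s").
delta = {}
for v in topo:                                        # from the supersource down
    delta[v] = sum(a * delta[u] for u in parents[v])
    if v == "s":
        delta[v] += w * (y - target)

# Eq. (9): accumulate the gradient over all replicas of each shared weight.
grad_b = sum(delta[v] * U[v] for v in topo)
grad_a = sum(delta[v] * sum(X[c] for c in children[v]) for v in topo)
grad_w = (y - target) * X["s"]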
Once the generalized errors δj (v) are computed, the backpropagation procedure for the replica of the transition function at node v is executed yielding both the partial gradients ∂EGp /∂ϑ(v) and the generalized errors at the network inputs. Hence, the BPTS algorithm can be implemented with a very modular structure, by realizing the backpropagation function for the transition and output networks and by combining the contributions at each node with a simple backpropagation of the generalized errors along the graph arcs following Eq. (11). This computation can be carried out by processing the nodes from the sources to the leaves.

D. Processing Cyclic Graphs

The model presented in the previous sections cannot be directly applied to the processing of cyclic graphs, because the unfolding of the recursive network would yield an infinite encoding network. In fact, the computation in the presence of cycles could be extended by iterating the state transition function starting from given initial values of the states, until the state values converge to a fixed point. To overcome the problems of cyclic structure processing, some techniques have been proposed based on the idea of collapsing each cycle into a unique unit that encodes the information corresponding to the nodes belonging to the cycle (Bianucci et al., 2001). Unfortunately, the strategy for collapsing the cycles cannot be carried out automatically, and it is intrinsically heuristic. Therefore, the effect on the resulting structures and the possible loss of information are almost unpredictable. In Bianchini et al. (2002, 2003c) a different approach is proposed for the case of graphs having a supersource and whose nodes store distinct labels (Directed Graphs with Unique labels, DUGs). These requirements are needed to assess the computational capabilities of this class of RNNs, but they do not pose actual limitations in real applications. For example, in image processing tasks, each node may represent a region and the label is a real valued vector. In this case it is quite unlikely that two nodes share exactly the same label. According to this framework, the encoding network has the same topology as the graph: if the graph is cyclic, the encoding network is also cyclic (see

FIGURE 4. The encoding and the output networks for a cyclic graph.
Figure 4). In fact, a replica of the transition network “replaces” each node of the graph and the connections between the transition networks are defined by the topology of the arcs in the graph. The computation is carried out by setting all the initial states Xv to X0 . Then, the copies of the transition network are repeatedly activated to update the states. According to Eq. (1), the transition network attached to node v produces the new state Xv . After a given number of updates, the computation can be stopped. The value of the output function is the result of the whole recursive processing. This general procedure is formalized in the following algorithm.

Algorithm 1. CyclicRecursive(G)
begin
  for each v ∈ V do Xv = X0 ;
  repeat
    < Select v ∈ V >;
    Xv = f (Xch[v] , Uv , θf );
  until stop();
  return g(Xs , θg );
end

More precisely, Algorithm 1 is a generic framework that describes a class of procedures. To implement a particular procedure, we should decide the strategy adopted to select the nodes and the criterion used to halt the iterations. In fact, the theoretical results show that no particular ordering must be imposed on the sequence of node activations provided that each arc in the graph is considered at least once during the processing (Bianchini et al. 2003c, 2006). The nodes can be activated following any ordering, and also random sequences are admitted. Algorithm 1 can also be extended by adopting a synchronous activation strategy. The stopping criterion implemented in the function “stop()” should guarantee that Algorithm 1 halts after a “sufficient”
FIGURE 5. An artificial image (a), its RAG (b), and the corresponding directed graph (c).
number of iterations. A more precise definition of the stopping criterion will be given in the following, clarifying when the performed iterations are sufficient. In the following, Algorithm 1 is further discussed using an example. Example 2. An image can be represented by its region adjacency graph (RAG, see Section III.C), which is extracted from the image by associating a node to each homogeneous region and linking the nodes corresponding to adjacent regions (see Figure 5). Each node is labeled by a real vector that represents perceptual and geometric features of the region (perimeter length, area, average color, texture, etc.). Since RNNs can process only directed graphs, the undirected edges of RAGs must be transformed into a pair of directed arcs (see Figure 5c). When Algorithm 1 is applied to a RAG, the computation appears to follow an intuitive scheme. At each time step, a region is selected. Then, the state of the corresponding node is computed, based on the states of the adjacent nodes using the function f (see Figure 6). According to the recursive paradigm, the state of a node is an internal representation of the object denoted by that node. Thus, at each step, the algorithm adjusts the representation of a region using the representations of the adjacent regions. After some steps, the computation is stopped. Then, the output function g is applied to the
FIGURE 6. An application of Algorithm 1 to a RAG. (a) Paying attention at the door: X4 = f (X2 , X5 , . . .). (b) Paying attention at the roof: X8 = f (X3 , X5 , . . .). (c) Paying attention at the sky: X1 = f (X2 , . . .). (d) Computing the output: Y = g(X5 , . . .).
state at the supersource3 to produce the output of the recursive network (see Figure 6d).

3 In this case, the supersource can be any node of the graph, since in RAGs there is a path between any pair of nodes.

Remark 2. Algorithm 1 is an extension of the original recursive processing for acyclic structures. In fact, the execution of Algorithm 1 on DAGs produces the same state at each node as the classical approach, provided that these states were updated a sufficient number of times and the order of activation is such that no node remains nonactivated for an infinite number of steps. Obviously, for acyclic graphs, whereas Algorithm 1 describes a generic way to activate each node, the ad hoc processing model constitutes the most efficient strategy.

1. Recursive-Equivalent Transforms

The intuitive idea that supports the application of Algorithm 1 to DUGs is that in this case the presence of unique labels allows us to encode the presence of cycles with a finite unfolding of the graph. In fact, the presence of a cycle will be evidenced by the presence of nodes with the same label in the unfolding. Hence, we can introduce the concept of recursive equivalence between DUGs and trees stating that the computation of a recursive function on a given DUG yields the same result on a recursive-equivalent tree. From a theoretical point of view, this observation allows us to assess the computational capabilities of RNNs when processing DUGs (Bianchini et al., 2003c, 2006). In particular, it can be proved that an appropriate RNN can approximate in probability, up to any degree of precision, any real valued measurable function on the space of DUGs given any recursive-equivalent transform. The processing carried out by Algorithm 1 represents a recursive-equivalent transform if appropriate node selection and halting criteria are chosen. Thus, the theoretical results support the fact that any cyclic graph G can be transformed into a tree T such that Algorithm 1 applied on G produces the same output as when the recursive network is fed with T . This possibility also provides a mechanism for training RNNs on cyclic graphs. In fact, a learning set A containing cyclic graphs can be transformed into a set of recursive-equivalent trees B. Then, the recursive network can be trained using BPTS (Küchler and Goller, 1996) on the learning set B. After training, the RNN can be applied to unseen cyclic graphs by executing Algorithm 1 using the same criteria that were applied to obtain the trees in B. The concept of recursive equivalence of two graphs can be defined formally.
Definition 1. Two arcs a = (v1 , w1 ), b = (v2 , w2 ) are said to be recursive equivalent, a ≈r b, if the labels of v1 and v2 and those of w1 and w2 are equal. Moreover, two graphs G1 , G2 are recursive equivalent, G1 ≈r G2 , if, for each arc a in G1 , there exists an arc b in G2 such that a ≈r b and, vice versa, for each arc a in G2 , there exists an arc b in G1 such that a ≈r b. Definition 2. A function F from directed graphs to trees is said to be a recursive-equivalent transform, if F (G) ≈r G, for each G. Intuitively, two graphs are recursive equivalent if they have the same arcs, where we assume that arcs can be distinguished only by looking at the labels of their delimiting nodes. Figure 7 shows a cyclic graph and two recursive-equivalent trees. It is easy to show that if G is a DUG and F is a recursive-equivalent transform, then G can be uniquely reconstructed from F (G). In fact, the nodes of a DUG are identified by their labels and thus its nodes can be obtained by merging together all the nodes in F (G) having the same label. Therefore, let us suppose that a given procedure implements a recursive-equivalent transform F . Then, any cyclic DUG can be processed by an RNN after a preprocessing phase carried out using such a procedure, since F is injective. Now let us discuss the computational properties of Algorithm 1. Notice that the state Xv (t) computed by Algorithm 1 at a given time step t depends on the states of the children of v, computed at the previous steps. Thus, due to the
FIGURE 7. (a) A cyclic graph. (b and c) Two trees that are recursive equivalent to the graph. A covering tree of the graph is represented by the continuous arcs.
recursive processing, the dependence of Xv (t) on the previously calculated states can be represented by a computation tree Tv (t), which collects the nodes visited by the algorithm to compute Xv (t). In fact, computation trees are unfoldings of the input graphs. For example, Figure 7b and c shows two computation trees for the node v1 . The following definition formalizes the above concept.

Definition 3. The computation tree Tv (t) for node v at time t is defined by

Tv (t) = ({v}, ∅)                                           if t = 0,
Tv (t) = Tree(v, Tch1[v] (t − 1), . . . , Tcho[v] (t − 1))   if v is active at t > 0,
Tv (t) = Tv (t − 1)                                          otherwise,
where Tree( ) is a function that builds a tree from its root and subtrees.
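For concreteness, the following sketch implements Algorithm 1 with a random node-selection policy and a fixed iteration budget standing in for stop(); both choices, the toy graph, and the hand-made f and g are illustrative assumptions, not the criteria prescribed by the text.

import random

def cyclic_recursive(nodes, children, labels, supersource, f, g, x0, n_steps):
    """Sketch of Algorithm 1: iterate the state transition on a (possibly cyclic) graph."""
    X = {v: x0 for v in nodes}                 # every state starts at the frontier state
    for _ in range(n_steps):                   # stop(): here, simply a fixed budget
        v = random.choice(nodes)               # node-selection policy: uniform at random
        X[v] = f([X[c] for c in children[v]], labels[v])
    return g(X[supersource])

# Toy usage on a directed cycle, with scalar states.
nodes = ["v1", "v2", "v3"]
children = {"v1": ["v2"], "v2": ["v3"], "v3": ["v1"]}
labels = {"v1": 1.0, "v2": -0.5, "v3": 0.25}
f = lambda xs, u: 0.5 * sum(xs) + u            # contractive toy transition
g = lambda x: x
print(cyclic_recursive(nodes, children, labels, "v1", f, g, 0.0, n_steps=50))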
In Algorithm 1 an important role is played by the state Xs (tH ), where tH is the time when the algorithm halts and s is the supersource of the input graph. In fact, the output of the algorithm is g(Xs (tH )). Definition 4. Xs (tH ) is the principal state of Algorithm 1. The principal unfolding function is the map H that takes a graph and returns the computation tree for the supersource s at time tH , that is, H (G) = Ts (tH ). Basically there is no difference between computing the output of the recursive network on the computation tree Ts (tH ) or on the original input graph G. In particular, if Algorithm 1 is such that for any G and any arc a in G there exists an arc b in H (G) such that a ≈r b (H (G) contains a copy of a), then H is a recursive-equivalent transform. In practice, Algorithm 1 merges the construction of a recursive-equivalent representation of the input graph G and the computation of the RNN in a unique procedure. Thus, by defining an appropriate policy for visiting the graph nodes and by halting the visit after “enough” steps, to obtain an unfolding that is recursive equivalent to the input DUG, we can choose an RNN that is able to approximate any real valued measurable function on DUGs, up to a given degree of precision. Remark 3. In Hammer (1999) it is also proved that it suffices to choose n = 2 (i.e., each state attached to a node is a vector with two entries) to reach any degree of approximation, whereas some hints are also given on how to choose the architecture of the transition network (number of layers and neurons per layer). Those results on the structural parameters of the architecture also apply to this case. Nevertheless, in practice, the selection of the optimal values for n and for the number of hidden units, both in the transition and in the output networks, is commonly a trial-and-error procedure.
FIGURE 8. Some graphs and their unfoldings. Gray nodes represent supersources.
Remark 4. The theoretical results on the approximation capabilities of RNNs provide a hint about the design of the halt function “stop()” and on the method for the selection of the active node in Algorithm 1. In fact, to guarantee the universal approximation property, the main loop of Algorithm 1 should be repeated until H (G) becomes recursive equivalent to G.4 Thus, for example, a solution can be to activate the nodes randomly, halting the algorithm only after many iterations, when G ≈r H (G) holds with high probability. Another solution consists of activating all the nodes at the same time and stopping the algorithm after |V | steps. In general, also admitting that some nodes share the same labels, an RNN is able to distinguish between two different graphs G1 and G2 provided it is possible to define an unfolding H such that H (G1 ) ≠ H (G2 ). Depending on the topology of the graph and the labeling of nodes, this requirement might not be satisfied. In Figure 8, the two graphs (a) and (b) yield the same unfolding tree for any H and thus no RNN is able to distinguish between them. Hence, Algorithm 1 can approximate, in probability, any function on general (possibly cyclic) graphs, provided that the stop function and the selection method are designed so that H produces a different unfolding for each graph of the input domain. The set of trees is a straightforward example where, even if the graphs may have shared labels, this hypothesis is satisfied. In fact, the principal unfolding of a tree is the tree itself, and H is injective on this domain, provided that all the nodes are activated a sufficient number of times.

4 Notice that the loop is not necessarily halted as soon as H (G) ≈r G becomes true. In fact, the algorithm can continue for any number of steps after the condition is satisfied.

2. From Cyclic Graphs to Recursive-Equivalent Trees

To train the recursive neural network to be used in Algorithm 1, the graphs in the learning set S must be preprocessed by a recursive-equivalent transform F , that is, the actual learning set is TS = {F (G) | G ∈ S}. More
precisely, the RNN should be trained using a set of examples that is a significant sample of the trees yielded by the principal unfolding H , that is, it is required that F ≈ H . More generally, if Algorithm 1 is nondeterministic, then F (G) ⊇ H (G) must hold and the probability distribution of the trees in F (G) should approximate the distribution of the trees in H (G).

Remark 5. Note that H (G) may contain a number of recursive-equivalent trees having different depths, for example, when the preprocessing unfolds G a random number of times. In this case, the network output g(Xs (t)) should reach the target value and remain stable after that, that is, the output at the supersource is the same for all the different unfoldings of G. To achieve this stable behavior, the training set must contain several unfoldings of the same graphs.

In the following, an example of a procedure that implements a recursive-equivalent transform is shown.

Algorithm 2. CyclicGraphToTree(G)
  < Build a covering tree5 T of G; let Q be its set of arcs >;
  MarkedArcs = Q;
  For each v set GetOriginalCopyOf (v) = v;
  Repeat
    Select an arc (v, w) in G and a node v2 in T s.t.
      – v = GetOriginalCopyOf (v2 )
      – (v2 , w2 ) ∉ Q, being w = GetOriginalCopyOf (w2 );
    Set w2 = NewCopy(w) and GetOriginalCopyOf (w2 ) = w;
    Extend T with node w2 and arc (v2 , w2 );
    MarkedArcs = MarkedArcs ∪ {(v, w)};
  until (MarkedArcs = E AND stop(T , G, . . .));
  return T ;

Algorithm 2 builds a covering tree T of graph G6 and then iteratively extends T with other copies of nodes in G. The extension is carried out by the main loop, where GetOriginalCopyOf is a function that keeps the relationship between the nodes in G and the corresponding copies in T , and NewCopy(v) produces a new node having the same label as v. At each step, the procedure looks for a node v2 in T that lacks children and produces a copy w2 of one of its children. The copy is then inserted into T .

5 A tree is a covering for G if it contains the same set of nodes V and its arcs Q are such that Q ⊆ E. Covering trees can be easily computed by visiting the graph (Aho et al., 1983).
6 The construction of the covering tree is not fundamental to the definition of a recursive-equivalent transform. However, this initialization is useful to reduce the number of steps required to reach the halt condition on MarkedArcs.
The loop halts when all the arcs of G have been visited and the function stop returns true. For our purpose, which consists of implementing a recursive-equivalent transform, we can use any function stop that depends on some or all of the program variables. In the simplest case, stop = true when all the arcs have been visited once. When stop is a more complex condition, Algorithm 2 can visit each arc and each node many times: it continues to add nodes to T until stop becomes true. Figure 7b and c shows examples of trees that can be constructed by a stop function that always returns true and by a more complex one, respectively. In fact, there is experimental evidence, at least for particular applications (Bianchini et al., 2003a, 2004b, 2005b), that repeatedly visiting some nodes in the graph makes it possible to better catch the information encoded in the cycles and to improve the network performance.

E. Limitations of the Recursive Neural Network Model

The approximation capability of recursive models is highly related to their ability to store a significant representation of the input structure into the internal state at the supersource. Let G1 and G2 be two distinct graphs. If the RNN fed with G1 and G2 reaches the same state at the supersource, it will produce the same output response: in the following, such a phenomenon, which sometimes cannot be avoided, will be referred to as a collision.

Definition 5. Given a recursive network, we say that a collision occurs for graphs G1 and G2 if the state corresponding to the supersource of G1 and G2 is the same, that is, Xs1 = Xs2 .

Remark 6. In some applications, the presence of collisions may be undesirable, since it limits the approximation capabilities of an RNN. On the other hand, in pattern recognition, collisions may be useful when G1 and G2 are different representations of the same object7 or the representation of similar objects in the application domain. In those cases, the collision is desirable since the network yields the same result for G1 and G2 . As a matter of fact, the presence of collisions indicates a sort of robustness with respect to the noise and, moreover, collisions can be exploited to capture similarities between different objects.

7 For instance, because of the noise, the same object can be represented by different graphs.

In this section, to guarantee collision avoidance, we investigate some conditions that are of central interest for understanding what the recursive model is expected to realize from an approximation point of view. For the sake of simplicity, the results on the computational power of the recursive model are obtained for linear architectures. Nevertheless, they can be simply extended to the general nonlinear case.

FIGURE 9. A graph and a tree that are output equivalent.

1. Theoretical Conditions for Collision Avoidance

In the simplest case, collisions happen because the symbolic representations that define the output responses are the same for different graphs.

Example 3. Referring to Figure 9, where A1 , A2 are the network weights and a, b, c, d are the labels, the state responses at nodes 4, 8, and 9 obviously coincide. Therefore, if the signal is propagated bottom-up (from the leaves to the supersource), the states at the supersources (nodes 1 and 5) also coincide, producing the same output response for the two graphs:

Ys = C(A1 A2 X0 + A2 A1 X0 + Ba + A1 Bb + A2 Bc + A1 A2 Bd + A2 A1 Bd).
Graphs that have the same output expression always cause the network to produce a collision at the supersource, regardless of the values assigned to the labels and to the frontier state. Those graphs are completely indistinguishable and are “equivalent” for the purpose of linear recursive processing. The definition of the symbolic output-equivalence ≈o formalizes this idea.

Definition 6. For all G1 , G2 , we say that G1 ≈o G2 , provided that ∀Ak , k = 1, . . . , o, B (parameters), ∀Uv (labels), and ∀X0 (frontier state), a collision occurs for graphs G1 and G2 .

Since it is always possible to transform a DPAG into an output-equivalent tree (nodes with many parents must be replicated to produce an instance for each parent), from now on, theoretical conditions for collision avoidance will
be stated on trees, as more general structures, like graphs, may always be reduced to output-equivalent trees. Therefore, let us address the problem of distinguishing trees with nonnull real labels. It can be proved that in this case, collisions can always be avoided, provided that a linear recursive network with a sufficiently large internal state is used. To this purpose, an enumeration of all the paths that can appear in a tree is considered. In fact, such an enumeration makes it possible to represent trees uniquely, since each tree is exactly defined by the list of all paths it contains. Example 4. A possible way to enumerate paths is based on ordering the nodes of the complete tree of height p. In fact, each node in the complete tree identifies one of the paths, so that a one-to-one relationship exists between the numbering of the paths and the numbering of the nodes. Figure 10 shows a complete ternary tree (a) where the nodes have been ordered by a breadth– first visit. The tree in Figure 10b can then be represented by the set of nodes {1, 2, 3, 4, 5, 6, 7, 9}. An alternative representation uses a binary vector [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0], with the ith element equal to 1 when (b) contains the ith path of (a) and 0, otherwise. As suggested by Example 4, the dimension of the state representation grows exponentially with the height of the trees, limiting the practical use of recursive architectures for structures of large dimension. On the other hand (see Bianchini et al., 2001a), avoiding collisions on general trees means that the weights of the network must be coded with a number of bits that grows at least exponentially in the tree height. Therefore, the general problem of recognizing different trees becomes intractable when the height of the trees
FIGURE 10. An example of a complete ternary tree along with the enumeration of the nodes that induces an enumeration on the paths.
increases because the number of trees grows so rapidly that an exponential number of bits is needed to distinguish all trees at the root. Nevertheless, even if an exponentially large state is required to avoid collisions, important classes of trees can be distinguished with a reasonable amount of resources. In fact, in most of the problems, the objective does not consist of recognizing every tree, but instead of distinguishing some classes. Thus, the state at the root must be able to store only a coding of the classes, not of the whole input tree. In Bianchini et al. (2001a), linear RNNs with a small number of parameters are proved to be able to recognize interesting properties of trees: for example, the number of nodes in each level, the number of leaves, and the number of left and right branches in the paths of binary trees.
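Returning to Example 4, the following sketch makes the path-based encoding concrete: it enumerates the paths of the complete o-ary tree of height p in breadth-first order and marks which of them occur in a given tree. The example tree and the nested-dict representation are illustrative assumptions (they do not reproduce the specific tree of Figure 10); the length of the vector, (o^(p+1) − 1)/(o − 1), grows exponentially with p, as noted above.

from itertools import product

def path_enumeration(o, p):
    """All paths of the complete o-ary tree of height p, in breadth-first order.
    A path is the sequence of child positions followed from the root."""
    paths = [()]
    for depth in range(1, p + 1):
        paths.extend(product(range(o), repeat=depth))
    return paths

def tree_paths(tree, prefix=()):
    """Paths present in a tree given as nested dicts {child_position: subtree}."""
    yield prefix
    for pos, sub in tree.items():
        yield from tree_paths(sub, prefix + (pos,))

def path_vector(tree, o, p):
    present = set(tree_paths(tree))
    return [1 if path in present else 0 for path in path_enumeration(o, p)]

# A small ternary tree (child positions 0, 1, 2), encoded against height p = 2.
tree = {0: {0: {}, 2: {}}, 1: {}, 2: {}}
print(path_vector(tree, o=3, p=2))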
III. GRAPH-BASED REPRESENTATION OF IMAGES

A. Introduction

The neural network models presented in Section II are assumed to process structured data. To exploit such models to perform any tasks on images (classification, localization, or detection of objects, etc.) a preprocessing phase that allows each image to be represented by a graph is needed. In the last few years, graph-based representations of images have received growing attention, since, as a matter of fact, they allow both symbolic and structural information to be collected in a unique “pattern.” To obtain a graph-based representation, a preprocessing phase, which allows a set of homogeneous regions from the image to be extracted, has to be performed, and a set of attributes that describes each region must be chosen. In the following, we will describe several segmentation methods that are used to extract the set of homogeneous regions, and some graphic structures, particularly suited to represent images.

B. Segmentation of Images

The preliminary phase for obtaining a graph-based representation of images consists in their segmentation. The term “segmentation” can refer both to the process of extracting a set of regions with homogeneous characteristics and to the process of determining the boundary of the objects depicted in an image. Clearly, the second meaning is related to a very complex task, and probably a correct description of this process should be “object segmentation.” In this section, the meaning of the word “segmentation” refers to the first interpretation proposed.
The segmentation phase is crucial for image analysis and pattern recognition systems, and very often it determines the quality of the final results. Segmenting an image means dividing it into different regions such that each region is homogeneous with regard to some relevant characteristics, while the union of any pair of adjacent regions is not. A theoretical definition of segmentation (Pat, 1993) is as follows: If P () is a homogeneity predicate defined on groups of connected pixels, then a segmentation is a partition of the whole set of the pixels F into connected subsets or regions (S1 , S2 , . . . , Sn ) such that

∪_{i=1}^{n} Si = F,    Si ∩ Sj = ∅ (i ≠ j).
The predicate P (Si ), that measures the homogeneity of the set Si is true for each region, and P (Si ∪ Sj ) is false if Si and Sj are adjacent. Unfortunately, according to Fu and Mui (1981), “the image segmentation problem is basically one of psychophysical perception, and therefore not susceptible to a purely analytical solution.” Thus, there is no universal theory on image segmentation yet. All of the existing methods are, by nature, ad hoc, and they are strongly application dependent, that is, there are no general algorithms that can be considered effective for all images. However, from the late 1970s, a wide variety of methods were proposed; here we report a classification presented in a recent survey (Cheng et al., 2001). Segmentation algorithms can be classified into two main categories, according to the method used to represent the images: monochrome and color segmentation. Color segmentation attracted more attention in the past few years, because, as a matter of fact, color images provide more information with regard to gray level images; however, color segmentation is a time-expensive process, even if the rapid increase of the computational capabilities of computers allows such a limitation to be overcome. The main color image segmentation methods can be classified as follows: • Histogram thresholding: This technique is widely used for gray level images, but can be directly extended to the more general case of color images. The color space is divided with regard to each color component, then a threshold is considered for each component. However, since the color information is represented by tristimulus R, G, and B, or by their linear/nonlinear transformations, representing the histogram of a color image and selecting effective thresholds are very challenging tasks (Haralick and Shapiro, 1985). • Color space clustering: The methods belonging to this class generally exploit one or more features to determine separate clusters in the considered color space. “Clustering of characteristic features applied to image segmentation is the multidimensional extension of the concept of thresholding”
(Fu and Mui, 1981). Applying the clustering approach to color images is a straightforward idea, because the colors tend to form clusters in the color space. The main problem of these methods is how to determine the number of clusters in an unsupervised scheme. • Region-based approaches: Region-based approaches, including region growing, region splitting, region merging, and their combination, attempt to group pixels into homogeneous regions. In the region-growing approach, a seed region is first selected, then expanded to include all homogeneous neighbors. One problem with region growing is its dependence on the choice of the seed region and the order in which pixels are examined. However, in the region-splitting approach, the initial seed region is the whole image. If the seed region is not homogeneous, it is divided, generally, into four squared subregions, which become the new seed regions. The main disadvantage of this approach is that the regions obtained are too squared. The region-merging approach is often combined with region growing and splitting with the aim of obtaining homogeneous regions as large as possible. • Edge detection: In monochrome image segmentation, an edge is defined as a discontinuity in the gray level, and can be detected only when there is a sharp difference in the brightness between two regions. However, in color images, the information about edges is much richer than that in the monochrome case. For example, an edge between two objects with the same brightness but different hue can simply be detected (Macaire et al., 1996). According to monochrome image segmentation, edge detection in color images can be performed defining a discontinuity in a three-dimensional color space. The main disadvantage of edge detection techniques is that the result of the segmentation process can be particularly affected by noise. • Other techniques: Many other segmentation methods were proposed in the past, based on fuzzy techniques, physics approaches, and neural networks. Fuzzy techniques exploit the fuzzy logic to model the uncertainty. For instance, if the fuzzy theory is used combined with a clustering method, each pixel has an assigned score for each candidate region, which represents the “degree of membership.” Physics approaches aim at solving the segmentation problem by employing physical models to locate the objects’ boundaries, while eliminating the spurious edges due to shadows or highlights. Among the physics models, the dichromatic reflection model (Shafer, 1985) and the approximate color-reflectance model (Healey and Binford, 1989) are the most commonly used. Finally, neural network approaches exploit a wide variety of network architectures (Hopfield neural network, self-organizing maps, feedforward neural networks). Generally, unsupervised approaches are preferable, since providing the target class for each pixel that belongs to an image is very difficult.
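As a minimal illustration of the region-growing idea mentioned in the region-based family above, the following sketch grows a region from a seed pixel on a gray-level image, absorbing 4-connected neighbors whose value is close to the running region mean. The tolerance-based homogeneity test and the toy image are assumptions for illustration only.

import numpy as np
from collections import deque

def region_grow(image, seed, tol):
    """Seeded region growing: expand from `seed`, absorbing 4-connected pixels
    whose gray value differs from the current region mean by at most `tol`."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    total, count = float(image[seed]), 1
    frontier = deque([seed])
    while frontier:
        i, j = frontier.popleft()
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if (0 <= ni < h and 0 <= nj < w and not mask[ni, nj]
                    and abs(float(image[ni, nj]) - total / count) <= tol):
                mask[ni, nj] = True
                total += float(image[ni, nj])
                count += 1
                frontier.append((ni, nj))
    return mask

# Toy usage: the region grown from (0, 0) covers the dark left-hand area.
img = np.array([[10, 12, 200, 205],
                [11, 13, 210, 201],
                [12, 11, 198, 203]])
print(region_grow(img, seed=(0, 0), tol=20))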
In the following, we describe the segmentation method we used to represent images. The proposed segmentation algorithm is independent of the considered color space. The color space can be chosen considering the particular application that should be solved. The segmentation algorithm can be sketched as follows:
• A K-means clustering (Duda and Hart, 1973) of the pixels belonging to each image is performed; the clustering algorithm minimizes the Euclidean distance (defined in the chosen color space) of each pixel from its centroid.
• At the end of the K-means, a region growing procedure is carried out to reduce the number of regions.
In practice, the number of initial clusters k is chosen to be approximately equal to the number of regions in which the image should be correctly divided. Nevertheless, such an initial choice is not so crucial with regard to the whole process of segmentation, due to the successive region growing phase, during which the number of regions with homogeneous features decreases. In fact, the number of regions computed via the K-means algorithm is greater than the number of clusters, since each cluster is divided into a certain number of connected components (regions).
After the segmentation process, a structure that represents the arrangement of the regions obtained can be extracted. Such a structure normally also collects information associated with each node, which describes the geometric and visual properties of the associated region. Instead, the edges that link the nodes of the structure are exploited to describe the topological arrangement of the extracted regions. The graph obtained can be directed or undirected; moreover, the presence of an edge can represent adjacency or some hierarchical relationship. In the following, two kinds of structures, particularly suited to represent images, will be described: RAGs and multiresolution trees.

C. Region Adjacency Graphs

The segmentation method yields a set of regions, each region being described by a vector of real valued features. Moreover, the structural information related to the spatial relationships between pairs of regions can be coded by an undirected graph. Two connected regions R1 , R2 are adjacent if, for each pixel a ∈ R1 and b ∈ R2 , there exists a path connecting a and b, entirely lying in R1 ∪ R2 . The RAG is extracted from the segmented image by (see Figure 11)
1. Associating a node with each region; the real vector of features represents the node label.
2. Linking the nodes associated with adjacent regions with undirected edges.
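A compact sketch of the clustering step and of the RAG extraction follows. The use of scikit-learn's KMeans, the 4-connectivity used to decide adjacency, and the choice of mean color and area as node labels are assumed implementation details; the region growing/merging phase described above is omitted for brevity.

import numpy as np
from sklearn.cluster import KMeans
from scipy import ndimage

def segment_and_build_rag(image, k):
    """K-means color clustering, connected components as regions, and RAG extraction."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(pixels).reshape(h, w)

    # Split each color cluster into its connected components (the regions).
    regions = np.zeros((h, w), dtype=int)
    next_id = 0
    for c in range(k):
        comp, n_comp = ndimage.label(clusters == c)
        regions[comp > 0] = comp[comp > 0] + next_id
        next_id += n_comp
    regions -= 1                                   # region ids 0 .. next_id - 1

    # Node labels: mean color and area of each region.
    node_labels = {i: (image.reshape(-1, 3)[regions.ravel() == i].mean(axis=0),
                       int((regions == i).sum())) for i in range(next_id)}

    # Undirected edges between 4-adjacent pixels lying in different regions.
    edges = set()
    for r1, r2 in zip(regions[:, :-1].ravel(), regions[:, 1:].ravel()):
        if r1 != r2:
            edges.add((int(min(r1, r2)), int(max(r1, r2))))
    for r1, r2 in zip(regions[:-1, :].ravel(), regions[1:, :].ravel()):
        if r1 != r2:
            edges.add((int(min(r1, r2)), int(max(r1, r2))))
    return node_labels, edges

# Toy usage on a random image.
img = np.random.default_rng(3).integers(0, 256, size=(16, 16, 3)).astype(float)
nodes, edges = segment_and_build_rag(img, k=4)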
FIGURE 11. The original image, the segmented image, and the extracted RAG.
A RAG takes into account both the topological arrangement of the regions and the symbolic visual information. Moreover, the RAG connectivity is invariant under translations and rotations (while labels are not), which is a useful property for a high-level representation of images. The information collected in each RAG can be enriched further by associating with each undirected edge a real vector of features (an edge label), which describes the mutual position of the regions associated with the linked nodes. This kind of structure is defined as a region adjacency graph with labeled edges (RAG-LE). For example, given a pair of adjacent regions i and j , the label of the edge (i, j ) can be defined as the vector [D, A, B, C] (see Figure 12), where • D represents the distance between the two barycenters. • A measures the angle between the two principal inertial axes. • B is the angle between the intersection of the principal inertial axis of i and the line connecting the barycenters.
FIGURE 12. Features stored into the label of each edge. The features describe the relative position of the two regions.
• C is the angle between the intersection of the principal inertial axis of j and the line connecting the barycenters.

D. Multiresolution Trees

Multiresolution trees (MRTs) are hierarchical data structures that are generated during the segmentation process, as, for instance, quad-trees (Hunter and Steiglitz, 1979). While quad-trees can be used to represent a region splitting process, MRTs are used to describe the region growing phase of the segmentation algorithm described in the previous section. Some different hierarchical structures, like monotonic trees (Song and Zhang, 2002) or contour trees (Morse, 1969; Roubal and Peucker, 1985; van Kreveld et al., 1997), can be exploited to describe the set of regions obtained at the end of the segmentation process, representing the inclusion relationships established among the region boundaries. However, MRTs represent both the final result of the segmentation and the sequence of steps that produces the final set of regions. An MRT is built performing the following steps (see Figure 13):
• Each region obtained at the end of the clustering phase is associated with a leaf of the tree.
• During the region growing phase, any time two regions are merged together, a new node is added to the tree as the father of the nodes corresponding to the merged regions.
• At the end of the region growing step, a virtual node is added as the root of the tree. Nodes corresponding to the set of regions obtained at the end of the segmentation process become the children of the root node.
Each node of the MRT, except the root, is labeled by a real vector that describes the geometric and visual properties of the associated region. Moreover, each edge can be labeled by a vector that collects information regarding the merging process. Considering a pair of nodes joined by an edge, the region associated with the child node is completely contained in the region associated with the father, and it is useful to associate some features with the edge to describe how the child contributes to the creation of the father. For instance, some fruitful features can be the color distance between the regions associated with the father and the child, the distance between their barycenters, and the ratio obtained dividing the area of the region that corresponds to the child by the area of the region associated with the father. Note that MRTs do not directly describe the topological arrangement of the regions, which, however, can be inferred considering both the geometric features associated with each node (for instance, the coordinates
FIGURE 13. Multiresolution tree generation: red nodes represent vertices added to the structure when a pair of similar regions is merged together.
of the bounding box of each region can be stored in the node label) and the MRT structure. In the following section, both RAGs and MRTs are examined as possible graph representations of images when trying to solve an object detection problem using RNNs.
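Before moving on, a minimal sketch of the MRT bookkeeping described in Section III.D above: leaves for the initial regions, a new father node for every merge performed during region growing, and a virtual root added at the end. The node class, the merge sequence, and the placeholder features of internal nodes are assumptions for illustration (in practice each merged node would store the recomputed region features, and edges would carry the merge descriptors discussed above).

class MRTNode:
    def __init__(self, node_id, feature, children=()):
        self.id, self.feature, self.children = node_id, feature, list(children)

def build_mrt(initial_features, merges):
    """initial_features: one feature vector per initial region (the future leaves).
    merges: sequence of (region_i, region_j) pairs produced by region growing;
    each merge creates a new region whose id is the next unused integer."""
    nodes = {i: MRTNode(i, f) for i, f in enumerate(initial_features)}
    current = dict(nodes)                       # regions still "alive"
    next_id = len(nodes)
    for i, j in merges:                         # every merge adds a father node
        father = MRTNode(next_id, None, [current.pop(i), current.pop(j)])
        nodes[next_id] = father
        current[next_id] = father
        next_id += 1
    return MRTNode(next_id, None, list(current.values()))   # virtual root

# Toy usage: four initial regions; region growing merges regions 0-1 and 2-3.
root = build_mrt(initial_features=[[0.1], [0.2], [0.8], [0.9]], merges=[(0, 1), (2, 3)])
# The root has two children (the merged regions), each with two leaves.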
IV. OBJECT DETECTION IN IMAGES

A. Object Detection Methods

The ever-increasing performance of image acquisition techniques implies that computer vision systems can be deployed in desktop and embedded systems (Pentland, 2000). On the other hand, the useful exploitation of computer vision software requires understanding the content of the images that must be processed. Thus, the researchers’ efforts have recently been focused on understanding the image content, with the aim of creating several
software tools to autoannotate images stored in a database or to help robots to understand the environment around them. A preliminary step in any image understanding system is locating significant objects. However, object detection is a challenging task because of the variability in scale, location, orientation, and pose of the instances of the object in which we are interested. Moreover, occlusions and light conditions also change the overall appearance of objects in images. A definition of the object detection problem, which represents an extension of the definition of face detection reported in Yang et al. (2002), is: “Given an arbitrary image and assuming to be interested in locating a particular object, the goal of object detection is to determine whether or not there is any object of interest and, if present, return the image location and extent of each instance of the object.” The challenges associated with the object detection problem can be attributed to the following factors: • Pose. The images of an object can vary because of the relative camera– object position, and some object features may become partially or wholly occluded. • Object deformation. Nonsolid objects can appear deformed due to some forces applied to them. • Occlusions. Objects may be partially occluded by other objects. • Image orientation. The images of object vary for different rotations and translations with regard to the camera axis. • Imaging conditions. When the image is acquired, factors such as lighting and camera characteristics affect the appearance of the objects. There are many related problems derived from object detection. Object localization aims at determining the position of a single object in an image (Moghaddam and Pentland, 1997); this is a simplified detection problem with the assumption that an input image contains only one object. In object recognition or object identification, an input image is compared to a database and matches, if any, are reported. Finally, object tracking methods continuously estimate the location and, possibly, the orientation of an object in an image sequence, in real time. Consequently, object detection is the preliminary step in any automated system that solves the above problems, and it can be seen as a two-class recognition problem in which each region of an image is classified as an object or part of it, or as an uninteresting region. Object detection methods can be classified in four main categories (Yang et al., 2002): • Knowledge-based; • Feature invariant;
• Template matching; • Appearance-based. Knowledge-based methods exploit the human knowledge on the searched objects and use some rules to describe the object models. Those rules are then used to detect and localize objects that match the predefined models. A possible drawback of these approaches is the difficulty in translating human knowledge into well-defined rules. If the rules are detailed (i.e., strict), they may fail to detect objects that do not match all the rules. If the rules are too general, they may yield many false positives. Moreover, it is difficult to extend this approach to detect objects in different poses due to an inability to enumerate all possible cases. Instead, the aim of feature invariant approaches (McKenna et al., 1998; Leung et al., 1998) is to define a set of features that is invariant with regard to object orientation, light conditions, dimension, etc. The underlying assumption is based on the observation that humans can effortlessly detect objects in different poses and light conditions and so there must exist properties or features that are invariant over these variabilities. Template matching methods store several patterns of objects and describe each pattern by visual and geometric features. The correlation between an input image and the stored patterns is computed for detecting objects (Shina, 1995). However, this class of techniques has proved to be often inadequate for object detection in images since it cannot effectively deal with variations in scale and pose. Finally, in contrast to template matching methods, appearance-based methods (Moghaddam and Pentland, 1997; Schneiderman and Kanade, 2000) learn the templates from examples. In general, appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of images that either contain or do not contain a certain object. Many appearance-based methods can be understood in a probabilistic framework. An image, or a representation of it, is viewed as a random variable x, which is characterized by the class-conditional density functions p(x|object) and p(x|nonobject). Bayesian or maximum likelihood classifiers can be used to decide if a candidate image location represents an object. Unfortunately, a straightforward implementation of Bayesian classification is not possible because of the high dimensionality of x. Generally, image patterns are projected to a lower dimensional space and then a discriminant function is used for classification, or a nonlinear decision surface can be exploited using multilayer neural networks (Carleson et al., 1999). Recently, methods based on SVMs were also proposed (Papageorgiou et al., 1998). Those models project the patterns to a higher dimensional space and then form a decision surface between the projected object and nonobject patterns, under the assumption that the determination of the decision surface is easier in
Among the object detection methods, those based on learning algorithms have recently attracted much attention and have demonstrated excellent results (Yang et al., 2002). In the following, we present a machine learning technique, based on RNNs, that allows us to detect objects in images.

B. Recursive Neural Networks for Detecting Objects in Images

Recently, RNNs have been proposed as a tool for object detection in images. These models allow us to exploit a structured representation of images in a paradigm based on learning from examples.

1. Learning Environment Setup

The proposed object detection method assumes a graph-based representation of images, which can be obtained by segmenting the images, as described in Section III. Both the training of RNNs and the subsequent exploitation of trained networks to detect objects depend on the kind of graph structure used to represent images. Thus, in the following, we describe how these tasks are performed when images are represented by RAGs or MRTs.

a. Region Adjacency Graphs. If a RAG (or a RAG-LE) is extracted to represent an image, a target equal to 1 is attached to each node of the RAG that corresponds to a part of the object in which we are interested, whereas a target equal to 0 is attached otherwise. The target association is sketched in Figure 14. In this example, we want to localize the “toy car.” The black nodes correspond to parts of the car and have target 1, while white nodes correspond to parts of other objects and have target 0. The target association is a crucial step since, during this phase, we provide the RNN with the information that defines the model of the object.
FIGURE 14. The extracted RAG and the associated targets.
During the segmentation, some spurious regions can be associated with an area of the image that corresponds only partially to the object of interest. If the target association is performed manually, the supervisor who prepares the training set chooses from each segmented image the set of regions that belong to the object. Otherwise, if an automatic association is performed, ground-truth information is exploited to associate the targets. In the latter case, the ratio between the area of the spurious region that intersects the bounding box of the object and the whole area of the spurious region can be calculated to decide whether the region belongs to the object.

After the target association, since all the RNN models described in Section II can process only directed graphs, each RAG must be transformed into one or more directed graphs. The transformation to be performed depends on the computation scheme realized by the RNN model. In fact, if the selected RNN realizes a transduction from a graph G to a graph G′ (see Section II.B), the RAG is transformed into a unique DAG, while if it realizes a supersource transduction, the RAG is converted into a forest of trees, exploiting the recursive-equivalent transform described in Section II.D.1.

If we consider a transduction from a graph G to a graph G′, the RAG (RAG-LE) is transformed into a DPAG (DAG-LE) by applying the following steps (see Figure 15 and the sketch after the list):

1. A starting region (root node) is chosen.
2. An ordering is imposed among the adjacent regions; for example, adjacent regions can be ordered by scanning the region boundary clockwise, starting from the vertical axis.
3. The graph is constructed recursively using a breadth-first visit of the nodes starting from the root node; when a new node a is visited, the edges from a to the nodes bk that have not already been visited are considered; the direction of the edges is chosen to be from a to bk; moreover, the order of these arcs is defined using the ordering established by the previous rule; finally, the target of node a is associated with the corresponding node in the DPAG (DAG-LE).
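A minimal Python sketch of this construction is given below. It assumes the RAG is available as a dictionary that lists, for every region, its adjacent regions already sorted in the clockwise order of step 2; the data layout and names are illustrative, not taken from the chapter.

```python
from collections import deque

def rag_to_dpag(ordered_adj, targets, root):
    """Orient the edges of a RAG by a breadth-first visit from `root`.

    ordered_adj: dict region -> adjacent regions in clockwise order
    targets:     dict region -> 0/1 target attached to the node
    Returns (children, node_targets), where children[a] lists the ordered
    arcs leaving node a in the resulting directed graph.
    """
    children = {n: [] for n in ordered_adj}
    processed = set()                 # nodes already expanded by the visit
    queue = deque([root])
    enqueued = {root}
    while queue:
        a = queue.popleft()
        processed.add(a)
        for b in ordered_adj[a]:      # clockwise order is preserved
            if b not in processed:    # edge still undirected: orient it a -> b
                children[a].append(b)
                if b not in enqueued:
                    enqueued.add(b)
                    queue.append(b)
    return children, {n: targets[n] for n in ordered_adj}
```

In this sketch every undirected edge receives exactly one direction, from the endpoint reached first by the breadth-first visit to the other endpoint, which yields a directed acyclic structure rooted at the chosen region.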
FIGURE 15. (a) The original image. (b) The segmented image and the corresponding RAG. (c) The DPAG obtained by the RAG using the top-left region as the starting node.
Even if this transformation can be performed very efficiently, it presents some limitations. First, the arbitrary choice of the starting region affects the DPAG generation. Moreover, since the RNN computation proceeds from the frontier of the DPAG to the root node, the network predictions associated with the leaves are performed considering only the labels (i.e., only visual and geometric properties of the regions they represent), since the leaves have no descendants and the topological arrangement of the corresponding regions is unknown. However, this limitation can be partially overcome by transforming each RAG into a set of DPAGs, obtained by considering a random set of nodes of the original RAG as root nodes.

When considering an RNN model that performs supersource transductions, the transformation procedure takes an RAG R, along with a selected node n, as input, and produces a tree T having n as its root. The method must be repeated for each node of the RAG or, more practically, for a random set of nodes. It can be proved that the forest of trees built from R is recursive-equivalent to R, that is, the RNN behavior is the same whether the network processes R or the forest of trees (Bianchini et al., 2002, 2006). The first step of the procedure is a preprocessing phase that transforms R into a directed RAG G by assuming that a pair of directed edges replaces each undirected one. If the original undirected graph is an RAG-LE, each edge in the pair is assigned the same label as the original undirected edge. G is unfolded into T by the following algorithm:

1. Insert a copy of n in T.
2. Visit G, starting from n, using a breadth-first strategy; for each visited node v, insert a copy of v into T and link v to its parent node, preserving the information attached to each edge, if it exists.
3. Repeat step 2 until a predefined stop criterion is satisfied and, in any case, until all edges have been visited at least once.
4. Attach the target associated with n to the root node of T.

The above procedure represents a possible implementation of Algorithm 2, which is presented in Section II.D.1 as a general framework for processing cyclic graphs. Note that the preprocessing step that transforms an RAG into a directed structure generates a directed cyclic structure, which cannot be directly processed by an RNN. The above unfolding strategy produces a recursive-equivalent tree that holds the same information contained in R. With respect to the chosen stop criterion, if the breadth-first visit is halted when all the arcs have been visited once, the minimal recursive-equivalent tree is obtained (minimal unfolding—see Figure 16a). However, other stop criteria are acceptable.
FIGURE 16. The transformation from an RAG-LE to a recursive-equivalent tree. The dimension of the recursive-equivalent tree depends on the stop criterion chosen during the unfolding of the directed RAG.
For example, each edge can be visited once; then the visit proceeds starting from each leaf node v ∈ T: if a stochastic variable x_v is true, all the children of v are added to T (probabilistic unfolding—see Figure 16b). Otherwise, we can replace the breadth-first visit with a random visit of the graph (random unfolding—see Figure 16c). In this case, starting from the current node v, the visit proceeds or not depending on a set of stochastic variables x_v1, . . . , x_vo, one for each arc outgoing from v. The probability of visiting a given arc is uniform over the whole graph structure. In any case, each edge must be visited at least once to guarantee the recursive equivalence between R and T.
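The following Python fragment sketches the unfolding under the minimal stop criterion, that is, it stops expanding as soon as every directed arc appears in the tree; the containers and names are illustrative assumptions rather than the authors' implementation, and the probabilistic and random variants can be obtained by changing the stopping test.

```python
from collections import deque
import itertools

def unfold_minimal(succ, start):
    """Breadth-first unfolding of a directed graph into a tree in which
    every arc of the graph appears exactly once (minimal unfolding).

    succ:  dict node -> list of successors (each undirected RAG edge is
           assumed to appear twice, once per direction)
    start: the node n from which the unfolding begins
    Returns the tree as {tree_node: (graph_node, parent_tree_node)}.
    """
    new_id = itertools.count()
    root = next(new_id)
    tree = {root: (start, None)}
    used_arcs = set()
    queue = deque([(root, start)])
    while queue:
        t_node, g_node = queue.popleft()
        for nxt in succ[g_node]:
            if (g_node, nxt) in used_arcs:
                continue                  # this arc is already in the tree
            used_arcs.add((g_node, nxt))
            copy = next(new_id)           # graph nodes may be copied many times
            tree[copy] = (nxt, t_node)
            queue.append((copy, nxt))
    return tree
```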
From a cognitive point of view, the unfoldings performed during the transformation from an RAG to a forest of trees seem to reproduce the behavior of a human observer who pays attention to the parts that constitute the image in order to detect the possible presence of an object. However, humans usually do not need to analyze the whole image to detect a particular object. This observation suggests relaxing the recursive-equivalence constraint and performing the unfoldings by stopping the breadth-first visit before each edge has been visited once. For instance, for each node t belonging to a directed RAG, we can start the unfolding from t and halt the visit when all the nodes within a certain distance from t have been reached, thus generating a forest of trees that are not recursive-equivalent to the original structure and, at the same time, simulating the behavior of a human who pays attention to an object and to a certain area around it.

Independently of the chosen unfolding strategy, we can consider a set of images as a training set, transform each image into an RAG (or an RAG-LE), associate the correct target with each node of the undirected graph, and finally extract the corresponding forest of trees for each RAG. Every RNN model presented in Section II is able to process a tree, and so we can train an RNN to predict whether the root node of each tree is a part of the object in which we are interested.

b. Multiresolution Trees. If images are represented by MRTs, no preliminary transformations are needed on the structures, which are directed and can directly constitute an input for RNNs. As described previously in this section, when RNNs deal with transformed RAGs, they predict whether each region is a part of the object in which we are interested. However, if MRTs are exploited, each node has an associated target that states whether the node represents a part of the object, whether it corresponds to a region that contains the object or part of it, or, finally, whether it corresponds to a region that does not contain the object (see Figure 17). Thus, RNNs that process MRTs solve a multiclass classification problem. Moreover, since each node has an associated target, the RNN computes a transduction from a graph to another graph. The targets can also be associated with only a subset of the nodes. In particular, since RNNs process the leaves only on the basis of the visual and geometric features stored in the node labels, targets may be left unassigned up to a certain distance from the frontier. Using this method, RNNs are guaranteed to perform predictions that depend both on the symbolic information and on the topological arrangement of the nodes. RNNs process MRTs straightforwardly, and the predictions of a trained network locate the subtrees that contain the object. The targets can be associated by exploiting the same methods described for RAGs.

The main limitation regarding MRTs is the height of the tree. In fact, the RNN training can suffer from the so-called “long-term dependencies,” which were investigated originally with regard to recurrent neural networks (Bengio et al., 1994).
FIGURE 17. Targets associated with the nodes of an MRT. Gray nodes represent regions that contain the object or part of it, black nodes represent regions that correspond to part of the object, and white nodes represent regions that do not contain the object.
The node states computed by an RNN are only marginally affected by the information collected in far descendants: given a generic node, the contribution of the descendants to the state of that node decreases as their distance from the node increases. In essence, RNNs have difficulty extracting properties related to long-term memory if the processed structure is too deep. This limitation can be partially overcome by cutting MRTs at a certain depth, provided that the information needed to detect the objects is maintained. In particular, nodes that correspond to parts of the object must not be discarded. For example, if the object dimensions are known a priori (at least approximately), we can consider discarding nodes that are associated with regions much smaller than those representing the target object.

2. Detecting Objects

After the setup of the learning environment, we need to select the RNN architecture. Unfortunately, no rules exist to guide this choice, and a trial-and-error procedure must be carried out to determine the best RNN. Then, the RNN can be trained using the BPTS algorithm (see Section II.C). Given a trained RNN and an input image, the detection procedure differs depending on whether an RAG-based or an MRT-based representation is used. If the image is represented by an RAG, the detection of an object is obtained as follows (a sketch of the final merging step is given after the list):

1. The image is segmented and the corresponding RAG (RAG-LE) is built.
2. The RAG (RAG-LE) is unfolded, producing a forest of trees.
3. Each tree is processed by the trained RNN. The network predicts whether the root node of each tree is a part of the object or not.
4. Adjacent regions predicted as parts of the object are merged together to compute the minimum bounding boxes that contain the detected objects.
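A minimal sketch of step 4 follows: adjacent regions predicted as object parts are grouped into connected components, and a minimum bounding box is computed for each group. The region descriptors and predictions are hypothetical inputs, not the chapter's code, and the same merging applies to the MRT-based procedure described next.

```python
from collections import deque

def merge_detections(adj, predicted, boxes):
    """Merge adjacent positive regions and return one bounding box per group.

    adj:       dict region id -> set of adjacent region ids
    predicted: dict region id -> True if the RNN labeled it as an object part
    boxes:     dict region id -> (xmin, ymin, xmax, ymax)
    """
    positives = {r for r, is_obj in predicted.items() if is_obj}
    seen, detections = set(), []
    for seed in positives:
        if seed in seen:
            continue
        # breadth-first visit restricted to positive regions
        group, queue = [], deque([seed])
        seen.add(seed)
        while queue:
            r = queue.popleft()
            group.append(r)
            for n in adj[r]:
                if n in positives and n not in seen:
                    seen.add(n)
                    queue.append(n)
        xs0, ys0, xs1, ys1 = zip(*(boxes[r] for r in group))
        detections.append((min(xs0), min(ys0), max(xs1), max(ys1)))
    return detections
```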
However, if the image is represented by an MRT, the detection can be performed as described in the following:

1. The image is segmented and the corresponding MRT is built.
2. The MRT is processed by the trained RNN. The network predicts, for each node of the tree, whether it is a part of the object and whether it contains or does not contain the object.
3. Regions predicted as parts of the object are merged together to compute the minimum bounding boxes that contain the detected objects.

Even if, for a given image, the related MRT usually contains more nodes than its RAG, the detection based on MRTs can be performed in a very efficient way. In fact, dealing with MRTs allows us to avoid the transformation from undirected to directed structures, which can be particularly time consuming.

Independent of the kind of structure used to represent images, the detection technique described has several advantages. First, the use of structures allows us to obtain a representation of the images and of the object that is invariant with regard to rotations, translations, and scale (this holds true as long as the change in scale does not alter the information contained in the image). Moreover, the user can define the model of the object in which he or she is interested by collecting a representative set of images as a training set and by specifying the “concept” of the object that he or she wants to detect (by associating the target with each node of the graphs in the training set). In this way, the method described is completely object independent. In fact, during training, the RNN builds its model of the object, following the “concept” expressed by the supervisor.

During the development of the object detection technique described above, several experiments were carried out to evaluate the effectiveness of the proposed approach. Most of the experiments focused on detecting faces in images; nevertheless, the proposed method does not exploit any a priori information about the particular object model and, therefore, is independent of the problem at hand. For any kind of experimentation, the choice of the dataset plays a crucial role. Even if several datasets have been proposed in the literature for evaluating face and object detection methods, none of them fits our requirements. In fact, benchmark datasets usually contain gray-level images, while our method works in the more general case of color images. Moreover, very often the benchmark datasets collect images that include only one face or one object, and so they are useful for evaluating face
FIGURE 18. Examples of localized faces in images acquired by TV video sequences using the proposed method.
or object localization methods. The objects are often centered in the images and in a frontal pose, and these controlled acquisition conditions do not allow us to evaluate the robustness of the methods with regard to variations in scale, position, and pose. Finally, often no ground-truth information is available together with the benchmark datasets; therefore, it is not always clear whether an object is present or not, for instance, if it is only partially visible. For a complete list of benchmark datasets, the reader can refer to Yang et al. (2002).

We performed our experimentation using three distinct datasets. Two datasets were chosen with the aim of using our detection method to locate faces: the first dataset contains images acquired from TV video sequences, while the second one includes images acquired by an indoor camera. The third dataset was created artificially using the objects of the COIL-100 dataset (Nene et al., 1996a). In the following, the three datasets, together with the main results obtained, are described.

a. TV Video Sequences. The experimental dataset contains 201 images and 238 faces (each image contains at least one face) and was acquired from TV video sequences. The appearance of faces in the images varies widely with respect to orientation, light conditions, dimension, etc. (see Figure 18). The images were divided into three sets: training, validation, and test sets. Both the training and validation sets contain 50 images, whereas 101 images (118 faces) constitute the test set. This dataset was used to understand which unfolding strategy is most promising for transforming an RAG or RAG-LE into a forest of trees (see Section II.D.1; these results were already discussed in Bianchini et al., 2003a, 2003b), and to compare the performances
of standard RNNs for DPAGs, RNNs-LE, and feedforward neural networks (these results were presented in Bianchini et al., 2005b). Each image was segmented in both the RGB and the HSV color spaces, obtaining RAGs with 90 nodes, on average. Each node of the extracted RAG has an associated label that collects some geometric information (area, perimeter, barycenter coordinates, bounding box coordinates, and moments) and color information. Moreover, for each RAG-LE, an edge label was added, whose elements are the distance between the barycenters of the regions and the angle formed by the intersection of their principal inertial axes.

With respect to the transformation from undirected structures to forests of trees, the most promising unfolding strategy is the probabilistic unfolding, which allows us to reach a recall of 90% and a precision of 72%, on average. Considering the results used to determine the best model, RNNs-LE outperform both the original RNNs defined for DPAGs and traditional feedforward neural networks. These results were evaluated considering only the accuracy of the RNNs, without performing the merge of adjacent regions predicted as parts of a face. RNNs-LE reached a global accuracy of 87% (94% on nonface regions and 81% on face regions), as opposed to accuracies of 81% and 79% reached by RNNs for DPAGs and feedforward neural networks, respectively.

The approach does not use a priori heuristics specific to the face detection problem. From this point of view, it is completely different from other solutions described in the literature. Moreover, no postprocessing of the detected bounding boxes was performed. False positives, for instance, often correspond to very small bounding boxes, or to bounding boxes whose height-to-width ratio is very far from the typical ratio obtained by dividing the height of a face by its width. Thus, even a naive postprocessing procedure that checks some geometric properties of the detected bounding boxes allows us to significantly improve the performance.

b. Images Acquired by an Indoor Camera. The experimental dataset contains 500 images and 348 faces (each image contains at most one face) and was acquired by an indoor camera placed in front of a door. One person at a time went in through the door and walked until he or she was out of the camera's field of view. Each image corresponds to a frame of the acquired scene. We are interested in detecting only the face position; no tracking of the faces was performed, and no information derived from the movement of the object was exploited. The faces appear in different orientations, dimensions, and positions (see Figure 19). Both the training and the cross-validation sets contain 100 images, whereas 300 images (199 faces) constitute the test set. This experimentation was performed to investigate how indoor light conditions affect the performance of our method.
FIGURE 19. Variability of face appearance in the indoor camera dataset. Faces vary with regard to dimension and pose and can be partially occluded. The images used to perform the experimentation were provided by ELSAG S.p.A.; all the images were used strictly for research purposes and are published under license of the persons depicted.
In fact, images acquired from TV video sequences usually have controlled light conditions, which allow a high-quality video to be obtained. In a TV studio, the lights are oriented in a way that minimizes shadows and, in any case, in our TV video sequence dataset the skin color of the depicted persons is close to pink (except for dark-skinned persons). By contrast, the acquisition conditions used to collect our indoor dataset produce a skin color that ranges from green to gray, and several shadows are visible on faces and other objects. This particular skin color is due to the neon lighting, which causes a prevalence of green not only in the skin color but in the whole image. This situation limits the relevance of color as a discriminative feature. Therefore, a fundamental contribution can be provided by exploiting the information derived from the mutual position of the regions, and the proposed object detection model actually uses this information to improve its performance. Moreover, in these experiments only RNNs-LE are used, since we are interested in determining how the implementation of the function φ (the function exploited to compute the average contribution of the children to the state of their parents—see Section II.B.2) affects the detection results.

During the described experimentation, each image was represented in the RGB color space and segmented, producing RAGs-LE with 100 nodes, on average. The geometric and visual features stored in the label associated with each node, and the mutual spatial position represented by the label associated with each edge, are exactly the same as described for the experimentation on the TV video sequence dataset; a possible way of computing them is sketched below. Each RAG-LE was subsequently unfolded using the probabilistic unfolding strategy described in Section IV.B.
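The NumPy fragment below is one possible implementation of such node and edge labels, starting from an integer label mask and an RGB image; the exact feature set and its normalization are an illustrative guess rather than the authors' code.

```python
import numpy as np

def region_features(label_mask, rgb, region_id):
    """Node label for one region: geometric and color features (illustrative)."""
    mask = (label_mask == region_id)
    ys, xs = np.nonzero(mask)
    area = int(mask.sum())
    # boundary pixels: region pixels with at least one 4-neighbor outside
    pad = np.pad(mask, 1)
    interior = (pad[1:-1, 1:-1] & pad[:-2, 1:-1] & pad[2:, 1:-1]
                & pad[1:-1, :-2] & pad[1:-1, 2:])
    perimeter = int(mask.sum() - interior.sum())
    barycenter = np.array([xs.mean(), ys.mean()])
    bbox = (xs.min(), ys.min(), xs.max(), ys.max())
    moments = np.cov(np.stack([xs, ys]).astype(float))   # second-order moments
    mean_color = rgb[mask].mean(axis=0)
    return {"area": area, "perimeter": perimeter, "barycenter": barycenter,
            "bbox": bbox, "moments": moments, "color": mean_color}

def edge_label(fa, fb):
    """Edge label: barycenter distance and angle between principal inertial axes."""
    dist = float(np.linalg.norm(fa["barycenter"] - fb["barycenter"]))
    ax_a = np.linalg.eigh(fa["moments"])[1][:, -1]        # principal axis of a
    ax_b = np.linalg.eigh(fb["moments"])[1][:, -1]
    cos_angle = np.clip(abs(ax_a @ ax_b), 0.0, 1.0)
    return dist, float(np.arccos(cos_angle))
```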
To obtain balanced training and cross-validation sets, each RAG-LE corresponding to a training or validation image was unfolded by starting the breadth-first visit from all the nodes belonging to a part of a face and from a randomly chosen set of nodes that do not belong to parts of faces. We assume that the number of nodes corresponding to parts of faces is smaller than the number of nodes of other kinds, and this assumption always holds for our dataset. However, the test set is obtained by performing the probabilistic unfolding for all the nodes belonging to each RAG-LE. In fact, to locate faces, the trained RNN-LE must be able to predict whether each node in the recursive-equivalent trees represents a part of a face.

Several RNNs-LE were trained to determine how the implementation of the function φ affects the detection results. In fact, the function φ can be implemented using a feedforward neural network [neural φ—see Eq. (4)] or using an ad hoc model (linear φ). In the neural case, φ can be obtained by considering a two-layer feedforward neural network with sigmoidal hidden units and linear output units, while in the linear case a three-dimensional weight matrix can be considered [see Eq. (6)]. As discussed in Bianchini et al. (2004b, 2005a), the choice of the neural φ allows us to obtain better performances. The accuracy obtained by RNNs with neural φ is 87%, on average, while the choice of linear φ allows us to reach an average accuracy of 82%. Moreover, all the tested RNN architectures succeeded in their learning task, showing that the proposed model is able to generalize even if the training is performed on perfectly balanced data sets. Since the focus of these experiments is mainly on the RNN-LE model, the postprocessing procedure was not carried out. However, the results achieved on the TV video sequence dataset show that the accuracy in the detection of the bounding boxes is generally greater than the accuracy reached by the RNN. Actually, the correct bounding boxes can be computed even if some regions belonging to faces are not correctly classified.

c. Artificial Dataset Generated from COIL-100. The COIL-100 dataset contains 7200 color images of 100 objects (72 images per object, one at every 5 degrees of rotation—see Figures 20 and 21), and it has been used in the past to evaluate the performance of three-dimensional object recognition systems (Nene et al., 1996b; Pontil and Verri, 1998). The object appearance varies both in its geometric and in its reflectance characteristics. We created some artificial datasets by pasting onto a dark canvas, for each generated image, three COIL objects chosen randomly with regard to both the depicted object and its degree of rotation. The pasting position of the first object was chosen randomly, while the second and third objects were placed while checking that objects already present on the canvas were not completely occluded. The generated images have the same properties as the COIL collection; thus, objects can vary their appearance with regard
FIGURE 20. Images of the 100 objects of the COIL database. (All trademarks remain the property of their respective owners. All trademarks and registered trademarks are used strictly for educational and scholarly purposes and without intent to infringe on the copyright owners.)
FIGURE 21. Twenty-four of 72 images of a COIL object.
to orientation, light conditions, and scale; moreover, some objects can be partially occluded (see Figure 22). The above generation technique allows us, at the same time, to create a set of images and their associated ground truth.
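A possible way of realizing this generation procedure is sketched below in Python; the canvas size, the foreground threshold, and the visibility test are illustrative assumptions, not the settings used by the authors.

```python
import numpy as np

def generate_image(objects, canvas_hw=(240, 320), rng=None):
    """Paste three randomly chosen objects onto a dark canvas and return
    the image plus one ground-truth bounding box per pasted object.

    objects: list of (h, w, 3) uint8 arrays with a (near) black background,
             each assumed to be smaller than the canvas.
    """
    rng = rng or np.random.default_rng()
    H, W = canvas_hw
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    owner = -np.ones((H, W), dtype=int)          # which object owns each pixel
    boxes = []
    picks = rng.choice(len(objects), size=3, replace=False)
    for k, idx in enumerate(picks):
        obj = objects[idx]
        h, w = obj.shape[:2]
        fg = obj.mean(axis=-1) > 20              # crude foreground mask
        for _ in range(100):                     # retry until nothing is hidden
            y, x = rng.integers(0, H - h), rng.integers(0, W - w)
            trial = owner.copy()
            trial[y:y + h, x:x + w][fg] = k
            # every previously pasted object must keep some visible pixels
            if all((trial == j).any() for j in range(k)):
                owner = trial
                canvas[y:y + h, x:x + w][fg] = obj[fg]
                boxes.append((x, y, x + w, y + h))
                break
    return canvas, boxes
```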
FIGURE 22. Examples of images generated by pasting COIL objects on a black canvas. (All trademarks remain the property of their respective owners. All trademarks and registered trademarks are used strictly for research purposes and without intent to infringe on the copyright owners.)
To evaluate the ability of RNNs to compute transductions that take a graph as input and produce another graph as output, we generated a dataset whose images always contained a white piggybank. The dataset collects 250 images, and each image contains exactly one piggybank. The images were segmented, producing RAGs-LE with about 70 nodes. Subsequently, each RAG-LE was transformed into a DAG-LE, using the procedure described in Section IV.B.1, and a target was associated with each node, stating whether the node corresponds to a part of the piggybank. Several RNNs-LE were trained and evaluated, varying their architecture. The accuracy obtained, on average, was equal to 87%. Moreover, most of the misclassified regions belonged to the leaves of the DAGs-LE. This situation is probably due to the absence of topological information related to the regions associated with the leaves, and it shows that this type of information plays a crucial role in our object detection approach.
R EFERENCES Aho, A., Hopcroft, J., Ullman, J. (1983). Data Structures and Algorithms. Addison-Wesley, Reading, MA. Bengio, Y., Frasconi, P., Simard, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 157–166. Bezdek, J. (1994). What is computational intelligence? In: Computational Intelligence: Imitating Life. IEEE Press, New York, pp. 1–12. Bianchini, M., Gori, M., Scarselli, F. (2001a). Theoretical properties of recursive networks with linear neurons. IEEE Trans. Neural Netw. 12 (5), 953–967.
Bianchini, M., Gori, M., Scarselli, F. (2001b). Processing directed acyclic graphs with recursive neural networks. IEEE Trans. Neural Netw. 12 (6), 1464–1470. Bianchini, M., Gori, M., Scarselli, F. (2002). Recursive processing of cyclic graphs. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2002), pp. 154–159. Bianchini, M., Mazzoni, P., Sarti, L., Scarselli, F. (2003a). Face spotting in color images using recursive neural networks. In: Gori, M., Marinai, S. (Eds.), IAPR—TC3 International Workshop on Artificial Neural Networks in Pattern Recognition (Florence, Italy). Bianchini, M., Gori, M., Mazzoni, P., Sarti, L., Scarselli, F. (2003b). Face localization with recursive neural networks. In: Marinaro, M., Tagliaferri, R. (Eds.), Neural Nets—WIRN ’03, Vietri (Salerno, Italy). Springer, Berlin. Bianchini, M., Gori, M., Sarti, L., Scarselli, F. (2003c). Backpropagation through cyclic structures. In: Cappelli, A., Turini, F. (Eds.), LNAI — AI*IA 2003: Advances in Artificial Intelligence (Pisa, Italy), LNCS. Springer, Berlin, pp. 118–129. Bianchini, M., Maggini, M., Sarti, L., Scarselli, F. (2004a). Recursive neural networks for processing graphs with labelled edges. In: Proceedings of ESANN 2004 (Bruges, Belgium), pp. 325–330. Bianchini, M., Maggini, M., Sarti, L., Scarselli, F. (2004b). Recursive neural networks for object detection. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2004), pp. 1911–1915. Bianchini, M., Maggini, M., Sarti, L., Scarselli, F. (2005a). Recursive neural networks for processing graphs with labelled edges: Theory and applications. Neural Netw. 18, 1040–1050. Bianchini, M., Maggini, M., Sarti, L., Scarselli, F. (2005b). Recursive neural networks learn to localize faces. Pattern Recognit. Lett. 26, 1885–1895. Bianchini, M., Gori, M., Sarti, L., Scarselli, F. (2006). Recursive processing of cyclic graphs. IEEE Trans. Neural Netw. 17, 10–18. Bianucci, A., Micheli, A., Sperduti, A., Starita, A. (2001). Analysis of the internal representations developed by neural networks for structures applied to quantitative structure-activity relationship studies of benzodiazepines. Chem. Info. and Comp. Sci. 41 (1), 202–218. Boser, B., Guyon, I., Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In: Haussler, D. (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. ACM Press, New York, pp. 144–152. Carleson, A., Cumby, C., Rosen, J., Roth, D. (1999). The SNoW learning architecture. Tech. Rep. UIUCDCS-R-99-2101, University of Illinois at Urbana–Campaign, Computer Science Department. Chappell, G., Taylor, J. (1993). The temporal Kohonen map. Neural Netw. 6, 441–445.
Cheng, H.D., Yang, X.H., Sun, Y., Wang, J.L. (2001). Color image segmentation: Advances and prospects. Pattern Recognit. 34, 2259–2281. Collins, M., Duffy, N. (2002). Convolution kernels for natural language. In: Dietterich, T., Becker, S., Ghahramani, Z. (Eds.), Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA. de Mauro, C., Diligenti, M., Gori, M., Maggini, M. (2003). Similarity learning for graph based image representation. Pattern Recognit. Lett. 24 (8), 1115– 1122. Diligenti, M., Gori, M., Maggini, M., Martinelli, E. (2001). Adaptive graphical pattern recognition for the classification of company logos. Pattern Recognit. 34, 2049–2061. Duda, R., Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York. Elman, J. (1990). Finding structure in time. Cog. Sci. 14, 179–211. Euliano, N., Principe, J. (1999). A spatiotemporal memory based on SOMs with activity diffusion. In: Oja, E., Kaski, S. (Eds.), Kohonen Maps. Elsevier, Amsterdam. Farkas, I., Mikkulainen, R. (1999). Modeling the self-organization of directional selectivity in the primary visual cortex. In: Proceedings of the International Conference on Artificial Neural Networks. Springer, pp. 251– 256. Frasconi, P., Gori, M., Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Trans. Neural Netw. 9 (5), 768–786. Fu, K., Mui, J.K. (1981). A survey on image segmentation. Pattern Recognit. 13, 3–16. Gärtner, T. (2003). A survey of kernels for structured data. SIGKDD Explorations 5 (1), 49–58. Gärtner, T., Flach, P., Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In: Proceedings of the 16th Annual Conference on Computational Learning Theory and the 7th Kernel Workshop, pp. 129– 143. Gori, M., Maggini, M., Sarti, L. (2003). A recursive neural network model for processing directed acyclic graphs with labeled edges. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2003), pp. 1351–1355. Gori, M., Hagenbuchner, M., Scarselli, F., Tsoi, A.-C. (2004). Graphicalbased learning environment for pattern recognition. In: Proceedings of SSPR 2004. Gori, M., Maggini, M., Sarti, L. (2005a). Exact and approximate graph matching using random walks. IEEE Trans. Pattern Anal. Mach. Intell. 27 (7), 1100–1111. Gori, M., Monfardini, G., Scarselli, F. (2005b). A new model for learning in graph domains. In: Proceedings of IJCNN 2005, vol. 2, pp. 729–734.
Günter, S., Bunke, H. (2001). Validation indices for graph clustering. In: Jolion, J.-M., Kropatsch, W., Vento, M. (Eds.), Proceedings of the Third IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, pp. 229–238. Hagenbuchner, M., Tsoi, A.-C., Sperduti, A. (2001). A supervised selforganizing map for structured data. In: Allison, L.A.N., Yin, H., Slack, J. (Eds.), Advances in Self-Organizing Maps. Springer, Berlin, pp. 21–28. Hagenbuchner, M., Sperduti, A., Tsoi, A.-C. (2003). A self-organizing map for adaptive processing of structured data. IEEE Trans. Neural Netw. 14 (3), 491–505. Hammer, B. (1998). On the approximation capability of recurrent neural networks. In: NC’98, International Symposium on Neural Computation (Vienna, Austria). Hammer, B. (1999). Approximation capabilities of folding networks. In: ESANN ’99 (Bruges, Belgium), pp. 33–38. Hammer, B., Micheli, A., Stricker, M., Sperduti, A. (2004). A general framework for unsupervised processing of structured data. Neurocomputing 57, 3–35. Haralick, R., Shapiro, L. (1985). Image segmentation techniques. Comput. Vision, Graph. Image Process. 29, 100–132. Healey, G., Binford, T. (1989). Using color for geometry-insensitive segmentation. J. Opt. Soc. Am. 22 (1), 920–937. Hoekstra, A., Drossaers, M. (1993). An extended Kohonen feature map for sentence recognition. In: Gielen, S., Kappen, B. (Eds.), Proceedings of the International Conference on Artificial Neural Networks. Springer, Berlin, pp. 404–407. Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366. Hunter, G.M., Steiglitz, K. (1979). Operations on images using quadtrees. IEEE Trans. Pattern Anal. Mach. Intell. 1, 145–153. James, D., Mikkulainen, R. (1995). SARDNET: A self-organizing feature map for sequences. In: Tesauro, G., Touretzky, D., Leen, T. (Eds.), Advances in Neural Information Processing Systems, vol. 7. MIT Press, Cambridge, MA, pp. 577–584. Kangas, T. (1990). Time-delayed self-organizing maps. In: Proceedings of IEEE/INNS IJCNN, vol. 2, pp. 331–336. Kohonen, T., Sommervuo, P. (2002). How to make large self-organizing maps for nonvectorial data. Neural Netw. 15 (8–9), 945–952. Koskela, T., Varsta, M., Heikkonen, J., Kaski, K. (1998a). Recurrent SOM with local linear models in time series prediction. In: Verleysen, M. (Ed.), Proceedings of the 6th European Symposium on Artificial Neural Networks, pp. 167–172.
Koskela, T., Varsta, M., Heikkonen, J., Kaski, K. (1998b). Time series prediction using recurrent SOM with local linear models. In: Proceedings of the Int. J. Conf. of Knowledge-Based Intelligent Engineering Systems, vol. 2(1), pp. 60–68. Küchler, A., Goller, C. (1996). Inductive learning in symbolic domains using structure-driven recurrent neural networks. In: Görz, G., Hölldobler, S. (Eds.), Advances in Artificial Intelligence. Springer, Berlin, pp. 183–197. Leung, T.K., Burl, M.C., Perona, P. (1998). Probabilistic affine invariants for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 678–684. Macaire, L., Ultre, V., Postaire, J. (1996). Determination of compatibility coefficients for color edge detection by relaxation. In: Proceedings of the ICIP, pp. 1045–1048. McKenna, S., Raya, Y., Gong, S. (1998). Tracking colour objects using adaptive mixture models. Image Vision Comput. 17 (3/4), 223–229. Micheli, A., Sona, D., Sperduti, A. (2004). Contextual processing of structured data by recursive cascade correlation. IEEE Trans. Neural Netw. 15 (6), 1396–1410. Moghaddam, B., Pentland, A. (1997). Probabilistic visual learning for object recognition. IEEE Trans. Pattern Anal. Mach. Intell. 19 (7), 696–710. Morse, S. (1969). Concepts of use in computer map processing. Commun. ACM 12 (3), 147–152. Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 2 (2), 181–201. Nene, S., Nayar, S., Murase, H. (1996a). Columbia object image library (COIL-100). Tech. Rep. CUCS-006-96, Columbia University. Nene, S., Nayar, S., Murase, H. (1996b). Real-time 100 object recognition system. In: Proceedings of the IEEE Conference on Robotics and Automation, vol. 3, pp. 2321–2325. Papageorgiou, C., Oren, M., Poggio, T. (1998). A general framework for object detection. In: Proceedings of the 6th IEEE International Conference on Computer Vision, pp. 555–562. Pat, S.K. (1993). A review on image segmentation techniques. Pattern Recognit. 29, 1277–1294. Pentland, A. (2000). Perceptual intelligence. Commun. ACM 43 (3), 35–44. Pollastri, G., Baldi, P., Vullo, A., Frasconi, P. (2002). Prediction of protein topologies using generalized IOHMMs and recursive neural networks. In: Proceedings of NIPS. Pontil, M., Verri, A. (1998). Support vector machines for 3D object recognition. IEEE Trans. Pattern Anal. Mach. Intell. 20 (6), 637–646. Roubal, J., Peucker, T. (1985). Automated contour labeling and the contour tree. In: Proceedings of AUTO-CARTO 7, pp. 472–481.
Scarselli, F., Yong, S., Gori, M., Hagenbuchner, M., Tsoi, A.-C., Maggini, M. (2005). Graph neural networks for ranking Web pages. In: Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence, pp. 666– 672. Schneiderman, H., Kanade, T. (2000). A statistical method for 3D object detection applied to faces and cars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 746–751. Schölkopf, B., Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge, MA. Shafer, S. (1985). Using color to separate reflection components. Color Res. Appl. 10, 201–218. Shina, P. (1995). Processing and Recognizing 3D Forms. Ph.D. thesis, Massachusetts Institute of Technology. Song, Y., Zhang, A. (2002). Monotonic tree. In: Proceedings of the 10th International Conference on Discrete Geometry for Computer Imagery (Bordeaux, France). Sperduti, A., Starita, A. (1997). Supervised neural networks for the classification of structures. IEEE Trans. Neural Netw. 8, 714–735. Strickert, M., Hammer, B. (2003a). Neural gas for sequences. In: Proceedings of WSOM ’03, pp. 53–57. Strickert, M., Hammer, B. (2003b). Unsupervised recursive sequence processing. In: Verleysen, M. (Ed.), Proceedings of the European Symposium on Artificial Neural Networks, pp. 27–32, D-side publications. Sturt, P., Costa, F., Lombardo, V., Frasconi, P. (2003). Learning first-pass structural attachment preferences with dynamic grammars and recursive neural networks. Cognition 88 (2), 133–169. Tsai, W. (1990). Combining statistical and structural methods. In: Syntactic and Structural Pattern Recognition: Theory and Applications. World Scientific, Singapore, pp. 349–366. van Kreveld, M., van Oostrum, R., Bajaj, C., Pascucci, V., Schikore, D. (1997). Contour trees and small seed sets for iso-surface traversal. In: Proceedings of the 13th Annual Symposium on Computational Geometry, pp. 212–220. Vapnik, V. (1995). The Nature of Statistical Learning Theory. SpringerVerlag, Berlin. Vesanto, J. (1997). Using the SOM and local models in time-series prediction. In: Proceedings of the Workshop on Self-Organizing Maps, pp. 209–214. Vishwanathan, S., Smola, A. (2002). Fast kernels for string and tree matching. In: Becker, S., Thrun, S., Obermayer, K. (Eds.), Advances in Neural Information Processing Systems, vol. 15. MIT Press, Cambridge, MA. Voegtlin, T. (2000). Context quantization and contextual self-organizing maps. In: Proceedings of the IJCNN, vol. 5, pp. 20–25.
Voegtlin, T. (2002). Recursive self-organizing maps. Neural Netw. 15 (8–9), 979–992. Voegtlin, T., Dominey, P.F. (2001). Recursive self-organizing maps. In: Allison, N., Yin, H., Allinson, L., Slack, J. (Eds.), Advances in SelfOrganizing Maps. Springer, Berlin, pp. 210–215. Vullo, A., Frasconi, P. (2002). A bi-recursive neural network architecture for the prediction of protein coarse contact maps. In: Proceedings of the 1st IEEE Computer Society Bioinformatics Conference (Stanford). Yang, M.-H., Kriegman, J., Ahuja, N. (2002). Detecting faces in images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 24 (1), 34–58. Yao, N.Y., Marcialis, G.L., Pontil, M., Frasconi, P., Roli, F. (2003). Combining flat and structural representations for fingerprint classification with recursive neural networks and support vector machines. Pattern Recognit. 36 (2), 397–406.
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 140
Deterministic Learning and an Application in Optimal Control

CRISTIANO CERVELLERA(a) AND MARCO MUSELLI(b)

(a) Istituto di Studi sui Sistemi Intelligenti per l'Automazione, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy
(b) Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy
I. Introduction . . . 62
   Notation . . . 64
II. A Mathematical Framework for the Learning Problem . . . 65
III. Statistical Learning . . . 69
IV. Deterministic Learning . . . 74
   A. The Distribution-Free Case . . . 75
   B. Ensuring a Bounded Variation . . . 80
      1. Feedforward Neural Networks . . . 83
      2. Radial Basis Functions . . . 84
   C. Bounds on the Convergence Rate of the ERM Approach . . . 85
   D. The Distribution-Dependent Case . . . 87
   E. The Noisy Case . . . 88
V. Deterministic Learning for Optimal Control Problems . . . 90
VI. Approximate Dynamic Programming Algorithms . . . 94
   A. T-SO Problems . . . 94
   B. ∞-SO Problems . . . 96
      1. Approximate Value Iteration . . . 96
      2. Approximate Policy Iteration . . . 97
   C. Performance Issues . . . 98
VII. Deterministic Learning for Dynamic Programming Algorithms . . . 99
   A. The T-SO Case . . . 99
   B. The ∞-SO Case . . . 102
VIII. Experimental Results . . . 104
   A. Approximation of Unknown Functions . . . 104
   B. Multistage Optimization Tests . . . 107
      1. The Inventory Forecasting Model . . . 108
      2. The Water Reservoir Network Model . . . 109
References . . . 114
I. INTRODUCTION

In a wide variety of real-world situations a functional dependence y = g(x) has to be estimated from a set of observations (x^L, y^L) = {(x_l, y_l), l = 0, . . . , L − 1} concerning a phenomenon of interest. This is the case when the behavior of a continuous signal has to be forecast starting from its previous history, or when the value of an unmeasurable quantity has to be inferred from the measurements of other related variables. If an insufficient amount of a priori information about the form of the functional dependence g is available, its estimation must provide for two different actions:

1. at first, a sufficiently large class Γ of functions must be properly selected (model selection);
2. then, the best element g ∈ Γ must be retrieved by adopting a suitable optimization algorithm (training phase).

The model selection task is usually performed by taking a very general paradigm, whose complexity can be controlled by acting on a small number of constant values. For example, the usual polynomial series expansion can approximate arbitrarily well every measurable function; that is, polynomials are universal approximators. However, by including in Γ only the functions whose polynomial series expansion does not contain terms with exponent greater than a prescribed maximum k, we can control the richness of the class Γ. In particular, if k = 1 only linear functions are included in Γ; if k = 2 the expansion can realize only linear and quadratic functions, and so on.

Other general paradigms have been extensively used for model selection: neural networks have been shown to possess the universal approximation property (Cybenko, 1989; Hornik et al., 1989; Barron, 1993; Girosi et al., 1995) and have been successfully applied in many different fields. In this case, the complexity of the class Γ can be controlled by acting on the architecture of the network (the number of layers and the number of neurons in the feedforward structure).

Once the class Γ has been chosen, the optimization algorithm to be employed in the training phase is selected accordingly. For example, the backpropagation technique (and its modifications) is often adopted to retrieve the function in Γ that best fits the collection (x^L, y^L) of observations at our disposal, usually called the training set. However, the basic goal to be pursued is to obtain a function g that generalizes well, that is, that behaves correctly even in correspondence with points of the domain not included in the training set. How can it be guaranteed that the element of Γ that best fits our observations also generalizes well? This is a fundamental question in the context of learning theory.
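As a toy illustration of these two steps (choosing the richness of Γ and then fitting its best element), the following sketch fits polynomial models of increasing maximum exponent k to noisy observations of an assumed dependence g(x) = sin(2πx) and compares the error on the training points with the error on fresh points; the function, the noise level, and the sample sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * np.pi * x)              # assumed unknown dependence
x_train = rng.uniform(0, 1, 30)
y_train = g(x_train) + rng.normal(0, 0.1, x_train.size)   # noisy observations
x_test = rng.uniform(0, 1, 1000)                 # points outside the training set

for k in (1, 2, 5, 9):                           # richness of the class Gamma
    coeffs = np.polyfit(x_train, y_train, deg=k) # training phase (least squares)
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"k={k}  train MSE={mse(x_train, y_train):.4f}  "
          f"test MSE={mse(x_test, g(x_test)):.4f}")
```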
Since in many practical situations the input vectors x_l cannot be freely chosen and the training set (x^L, y^L) can be corrupted by noise, most results on learning theory are based on a statistical framework, which arose in the pattern recognition community (Vapnik and Chervonenkis, 1971; Valiant, 1984; Blumer et al., 1989; Devroye et al., 1997) and has been naturally extended to other inductive problems, like regression estimation (Pollard, 1990; Vapnik, 1995; Alon et al., 1997) and probability density reconstruction (Vapnik, 1995). In this framework, called statistical learning (SL), the input vectors x_l, for l = 0, . . . , L − 1, are viewed as realizations of a random variable, generated according to an unknown (but fixed) probability density p.

On the other hand, there are several cases where the position of the points x_l in the input space can be suitably selected for the problem at hand. If a deterministic algorithm is employed to choose the input vectors x_l, SL is no longer the most appropriate approach. In this case a new framework, called deterministic learning (DL), is able to capture the peculiarities of the situation at hand, thus providing precise conditions on the generalization ability of the function g ∈ Γ that best fits the observations of the training set (x^L, y^L).

This chapter presents a survey of DL, comparing its results with those obtained by standard SL. In particular, basic quantities, like variation and discrepancy, are introduced, pointing out their centrality in the derivation of upper bounds for the generalization error that decrease as 1/L (apart from logarithmic factors) with the size L of the training set. This behavior outperforms the equivalent result obtained by SL, where a convergence rate of 1/√L has been derived.

An important application of DL concerns system control and, specifically, the solution of multistage stochastic optimization (MOS) problems, a particular kind of Markovian decision process. In such problems, the aim is to minimize a cost that depends on the evolution of a system, affected by random disturbances, through a horizon of either a finite or an infinite number of stages. This very classic framework is widely employed in many different contexts, such as economics, artificial intelligence, engineering, etc. Since in most practical situations optimal control and cost functions cannot be obtained in analytical form, a numerical approach is needed to solve MOS problems. The standard tool is dynamic programming (DP), introduced by Bellman (1957), as is documented by the large number of studies devoted to this method through the years.

The basic idea underlying the DP procedure is to define, at each stage, a function, commonly named cost-to-go or value function, that quantifies the cost that has to be paid from that stage on to the end of the time horizon. In this way, it is possible to transform the MOS problem into a sequence of simpler static optimization subproblems, which can be solved recursively. The basics of the recursive solution adopted in DP are introduced and discussed in several classic references (see, for example, Bellman, 1957; Bellman and Dreyfus, 1962; Larson, 1968).
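The cost-to-go idea can be illustrated with a small backward recursion on a discretized scalar state space; the state equation, cost, grid size, and disturbance distribution in the sketch below are illustrative stand-ins, not the benchmark problems discussed later in the chapter.

```python
import numpy as np

T = 5                                        # number of stages
x_grid = np.linspace(-2.0, 2.0, 41)          # uniform grid on the state space
controls = np.linspace(-1.0, 1.0, 21)        # admissible controls
theta = np.array([-0.1, 0.0, 0.1])           # support of the random disturbance
p_theta = np.array([0.25, 0.5, 0.25])        # its probabilities

f = lambda x, u, th: 0.9 * x + u + th        # state equation
h = lambda x, u: x ** 2 + 0.5 * u ** 2       # single-stage cost

J = np.zeros_like(x_grid)                    # cost-to-go at the final stage
for t in reversed(range(T)):
    J_prev = np.empty_like(J)
    for i, x in enumerate(x_grid):
        costs = []
        for u in controls:
            # cost-to-go outside the grid points is approximated by interpolation
            future = np.interp(f(x, u, theta), x_grid, J)
            costs.append(h(x, u) + p_theta @ future)
        J_prev[i] = min(costs)
    J = J_prev

print("approximate optimal cost from x0 = 0:", np.interp(0.0, x_grid, J))
```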
Among the most recent surveys on DP techniques and Markov decision processes in general are two excellent monographs (Puterman, 1994; Bertsekas, 2000). Although efficient variations of the DP procedure exist for the “deterministic” version of the MOS problem, such as differential dynamic programming (Jacobson and Mayne, 1970), the general approach followed to implement DP in a practical situation requires choosing for each stage a number of sampling points in the d-dimensional state space, then approximating the cost-to-go functions outside these points. In this way, solving the original MOS problem implies the reconstruction of several functional dependencies, one for each stage.

The most common sampling technique used in the literature is the “full uniform” grid, that is, the uniform discretization of each component of the state space in a fixed number of values. This clearly leads to an exponential growth of the number of points, commonly known as the curse of dimensionality: if each of the d components of the state space is discretized by means of m equally spaced values, the number of points of the grid is equal to m^d. A nonexponential complexity can be obtained by adopting a finer sampling scheme, like those proposed by SL and DL. In the former case, a uniform probability density is employed to generate the training set x_l, l = 0, . . . , L − 1; in the latter approach the points x_l are selected by using a deterministic algorithm, which is able to guarantee an almost linear rate of convergence. Numerical simulations confirm the superiority of DL over SL when dealing with complex MOS problems solved through the DP procedure.

Notation

X ⊂ R^d, Y ⊂ R: input and output space
x ∈ X, y ∈ Y: input vector and scalar output
x_i ∈ R: ith component of the input vector x
g(x): unknown function to be estimated
(x^L, y^L): training set for estimating the functional dependence
L: number of points in the training set
(x_l, y_l): lth example of the training set, with l = 0, . . . , L − 1
x^L ∈ X^L: collection of all the input vectors x_l in the training set
Γ: family of models
ψ(x, α): generic model in Γ
Λ ⊂ R^k: parameter space for the family Γ
α ∈ Λ: parameter vector of the model ψ(x, α)
ℓ(·, ·): loss function
R_Q(α): expected risk for the model ψ(x, α)
Q(x): probability measure for the evaluation of the expected risk
q(x): probability density function for the measure Q
R(α): expected risk computed with the uniform distribution (risk functional)
R_emp(α, x^L): empirical risk computed on the sample x^L
A_L(x^L, y^L): training algorithm minimizing the empirical risk
Ψ(L): deterministic function for the selection of x^L
λ(B): Lebesgue measure of the set B ⊂ R^d
c_B(x): characteristic function of the set B ⊂ R^d
D(x^L), D*(x^L): discrepancy and star discrepancy of the sample x^L
∆(ϕ, B): alternating sum of the function ϕ at the vertexes of the interval B
V^(d)(ϕ): variation in the sense of Vitali of the function ϕ
V_HK(ϕ): variation in the sense of Hardy and Krause of the function ϕ
∂_{i1,...,ik} ϕ: kth partial derivative of ϕ(x) with respect to the components x_i1, . . . , x_ik
W_M(B): class of functions ϕ such that ∂_{i1,...,ik} ϕ is continuous and bounded
η ∈ R: random noise with zero mean
x_t ∈ X_t ⊂ R^d: state vector of a dynamic system at stage t
u_t ∈ U_t ⊂ R^m: control vector of a dynamic system at stage t
θ_t ∈ Θ_t ⊂ R^q: random vector acting on a dynamic system at stage t
f(x_t, u_t, θ_t): state equation of a dynamic system
µ_t(x_t): closed-loop control function (policy) at stage t
h(x_t, u_t, θ_t): cost function for the single stage t
β ∈ R: discount factor for infinite-horizon MOS problems
J°(x): cost-to-go function of MOS problems
J̃°(x): generic approximated value of J°(x)
Ĵ°(x, α): approximated value of J°_t(x) based on a parameterized model
II. A MATHEMATICAL FRAMEWORK FOR THE LEARNING PROBLEM

We want to estimate, inside a family of functions (models) Γ = {ψ(x, α): α ∈ Λ ⊂ R^k}, the parameter α* corresponding to the ψ that best approximates a functional dependence of the form y = g(x), where x ∈ X ⊂ R^d and
y ∈ Y ⊂ R, on the basis of a training set (x^L, y^L) ∈ (X^L × Y^L) containing L samples (x_l, y_l) with l = 0, . . . , L − 1.

The quality of a model ψ ∈ Γ can be evaluated at any point of X by a loss function ℓ : Y² → R that measures the difference between the output of ψ and the function g. ℓ must be symmetric and nonnegative; furthermore, ℓ(y, y′) = 0 if and only if y = y′. The output y assigned to each observation point x is generally noisy; thus, we suppose that y is the realization of a random variable on Y described by a conditional probability P̃ that admits a density p̃(y|x).

An overall evaluation of the model ψ(x, α) can then be obtained by averaging the value of the loss function ℓ[y, ψ(x, α)] over the whole input domain X. To this aim we assume the existence of a probability measure Q that determines the occurrence frequency of any input vector x. Again we suppose that Q admits a probability density q(x). With this notation we can define the expected risk R_Q(α),

    R_Q(α) = ∫_{X×Y} ℓ[y, ψ(x, α)] p̃(y|x) q(x) dy dx,
which measures the mean error committed by the model ψ(x, α) over the whole space X. The learning problem can then be stated as follows:

Problem 1. Find α* ∈ Λ such that R_Q(α*) = min_{α∈Λ} R_Q(α).

If the minimum does not exist, the target of our problem can be to find α* ∈ Λ such that R_Q(α*) < inf_{α∈Λ} R_Q(α) + ε for some fixed ε > 0.

The most common loss function is the squared error

    ℓ[y, ψ(x, α)] = [y − ψ(x, α)]².
With this choice, when Y is an interval of R, the solution to Problem 1 corresponds to the function ψ ∈ Γ that is closest to the regression function given by

    g*(x) = ∫_Y y p̃(y|x) dy.
For this reason, when y assumes continuous values, Problem 1 is usually referred to as a regression estimation problem or, simply, a regression problem.

To verify the above assertion, we can write the risk as

    R_Q(α) = ∫_{X×Y} [y − g*(x) + g*(x) − ψ(x, α)]² p̃(y|x) q(x) dy dx
           = ∫_{X×Y} [y − g*(x)]² p̃(y|x) q(x) dy dx
             + ∫_{X×Y} [g*(x) − ψ(x, α)]² p̃(y|x) q(x) dy dx
             + 2 ∫_{X×Y} [y − g*(x)][g*(x) − ψ(x, α)] p̃(y|x) q(x) dy dx
           = R*_Q + ∫_X [g*(x) − ψ(x, α)]² q(x) dx                    (1)

where R*_Q = ∫_{X×Y} [y − g*(x)]² p̃(y|x) q(x) dy dx is the expected risk of the regression function g*(x). In the derivation of Eq. (1) for R_Q(α) we have used the following identity:

    ∫_{X×Y} [y − g*(x)][g*(x) − ψ(x, α)] p̃(y|x) q(x) dy dx
        = ∫_X { ∫_Y [y − g*(x)] p̃(y|x) dy } [g*(x) − ψ(x, α)] q(x) dx = 0

since by definition of the regression function ∫_Y [y − g*(x)] p̃(y|x) dy = 0.
However, the probability densities p˜ and q are unknown; hence, we cannot derive the behavior of g ∗ (x). Consequently, the regression problem must be solved only by employing the knowledge inherent in the training set (x L , y L ). A typical way of proceeding consists of minimizing the empirical risk Remp (α, x L ), which evaluates a measure of RQ (α) on the samples included in the training set. In general, the empirical risk is defined as L−1 1 ℓ yl , ψ(x l , α) Remp α, x L = L l=0
which becomes, in the case of quadratic loss function,
L−1 2 1 L Remp α, x = yl − ψ(x l , α) . L l=0
α ∗L
Denote with ∈ Λ the point of minimum of the empirical risk Remp (α, x L ); a nonlinear optimization method AL : (XL × Y L ) → Λ can
68
CERVELLERA AND MUSELLI
be adopted as a learning algorithm to determine a close approximation to the optimum α ∗L . In particular, since optimization techniques are generally (m) iterative, we can define AL as the learning algorithm obtained by taking L L = A(m) the first m iterations of AL . Accordingly, let α (m) L L (x , y ) be the (m) parameter vector produced by AL . A suboptimal solution to Problem 1 can then be retrieved from the training set (x L , y L ) by performing a sufficiently high number (mL ) of iterations with (m ) the learning algorithm AL and by using the resulting parameter vector α L L as an approximation for α ∗ . This approach, called empirical risk minimization (mL ) (mL ) ) of the expected risk in α L is (ERM), is successful if the value RQ (α L ∗ close to the minimum RQ (α ). Note that (m ) (m ) RQ α L L − RQ (α ∗ ) ≤ RQ α L L − RQ α ∗L + RQ α ∗L − RQ (α ∗ ). Thus, the ERM approach is valid if the following two basic conditions are satisfied; when this is the case, Problem 1 is said to be learnable.
Condition 1. The sequence {α ∗L }∞ L=1 of minima of the empirical risk Remp (α, x L ) converges to the desired minimum α ∗ of the expected risk RQ (α). (m)
Condition 2. For every L the sequence {α L }∞ m=1 of optimal points found by the learning algorithm AL at different iterations converges to the minimum α ∗L of the empirical risk. The former condition depends on the characteristics of the learning problem at hand, whereas the latter is related to the behavior of the optimization technique employed to search for the minimum of the empirical risk. In particular, if the learning algorithm belongs to the class of global optimization methods, which are always able to find the global minimum of a cost function when the number of iterations increases indefinitely, Condition 2 is surely verified, at least in probability. However, an analysis of the properties of an optimization technique that lead to the fulfillment of Condition 2 is a central topic in nonlinear programming theory and will not be included in the present chapter. The interested reader is referred to dedicated monographs, such as Törn and Žilinskas (1989). In the following sections the focus will be centered on Condition 1 examining the hypotheses on the learning problem that ensure its fulfillment. Two different situations will be considered:
DETERMINISTIC LEARNING AND AN APPLICATION
69
Passive learning: when the generation of the points in x L for the training set is not under our control; in this case they are viewed as realizations of a random variable with an unknown probability measure. Active learning: when points in x L are produced by a generation algorithm, which can be freely chosen. In particular, the behavior of the difference RQ (α ∗L ) − RQ (α ∗ ) when L increases is examined; this allows us to obtain lower bounds for the size L of the training set, which guarantees the achievement of a desired generalization error. Since the context of passive learning is intrinsically probabilistic, the convergence involved in Condition 1 can be ensured only in a probabilistic sense. The analysis of this case forms the subject of SL, whose main results will be presented in the following section. On the other hand, active learning can be studied in a deterministic way, thus leading to hypotheses for standard convergence in the fulfillment of Condition 1. This is the subject of DL, whose treatment is contained in Section IV.
III. S TATISTICAL L EARNING If the generation of the input points x l to be included in the training set (x L , y L ) is not under our control, we can assume there is an external random source that generates them. Denote with P the probability measure that characterizes this external source and suppose that P admits a density p(x), whose behavior is unknown. However, the learning problem at hand can be solved only if the probability measure P is related to the probability Q adopted to evaluate the expected risk RQ introduced in the previous section. Specifically, the following condition of absolute continuity must hold: if P (S) = 0 for some S ⊂ X then it must also be Q(S) = 0. If this condition is not true for a subset S, but we have P (S) = 0 and Q(S) > 0, there is no hope of minimizing the contribution to the expected risk due to S by examining the points of the training set (x L , y L ), which cannot belong to S. To rule out critical situations, the following two assumptions are normally supposed to hold in the SL framework: Assumption 1. Points in x L are generated by i.i.d. realizations of an unknown density p. Assumption 2. The density p, used in the generation of the training set, is equal to the density q adopted in the evaluation of the expected risk. The first requirement is rarely verified in real world situations; nevertheless, the removal of the i.i.d. hypothesis limits the applicability of typical theoreti-
70
CERVELLERA AND MUSELLI
cal results such as those reported in the following (Vidyasagar, 1997), which are heavily based on Hoeffding’s inequality (Devroye et al., 1997; Hoeffding, 1961). An attempt in this direction is described in Najarian et al. (2001), but its validity is restricted to nonlinear FIR models. Assumption 2 regarding the equality between p and q cannot be verified in practice; it can only be hoped that the mechanism involved in obtaining the samples for the training phase remains almost unchanged when new data are generated. On the other hand, if p and q are radically different from each other, the indirect minimization of the expected risk can lead to poor results. In the SL framework the empirical risk Remp (α) is a random variable, since it depends on the training set (x L , y L ). It follows that the point of minimum α ∗L is also a random variable and therefore the convergence involved in Condition 1 must be formulated in a probabilistic way. For example, it can be rewritten as (2) lim P RQ α ∗L − RQ (α ∗ ) > ε = 0 for every ε > 0, L→∞
which amounts to considering the convergence in probability of RQ (α ∗L ), or as P lim RQ α ∗L = RQ (α ∗ ) = 1, (3) L→∞
which corresponds to the convergence a.s. of the sequence {RQ (α ∗L )}. The probabilities involved in Eqs. (2) and (3) are defined on the product space of the possible training sets (x L , y L ). If Eq. (2) holds and Condition 2 is verified, the learning problem is said to be probably approximately correct (PAC) learnable (Valiant, 1984; Angluin, 1987). The following theorem gives sufficient conditions for the regression problem to be PAC learnable. Theorem 1.
Condition 1 is verified if the following convergence holds: lim P sup Remp α, x L − RQ (α) > ε = 0 for every ε > 0. (4)
L→∞
α∈Λ
The proof can be found, for example, in Vidyasagar (1997). Condition (4) is often referred to as uniform convergence of empirical means; sufficient conditions on the class Γ of functions that ensure its validity can be derived by using the notion of Pollard dimension (or pseudodimension, or P -dimension), a generalization of the VC-dimension (“Vapnik–Chervonenkis dimension”), which is at the core of all the relevant results in SL, as it provides a way of measuring the “richness” of a set of functions. A description of the VC-dimension, more suited to classification problems, can be found in several papers and books on statistical learning, such as
DETERMINISTIC LEARNING AND AN APPLICATION
F IGURE 1.
71
P-shattering.
(Vapnik, 1995) and the references therein. P-dimension was first introduced by Pollard (1990). Without loss of generality, we will suppose henceforth that Y = [0, 1]. Definition 1. A set S = {x 0 , . . . , x j −1 } is P-shattered by the family of functions Γ if there exists a vector c ∈ [0, 1]j such that, for every binary vector e ∈ {0, 1}j , there exists a corresponding function ψ(α e , x i ) ∈ Γ such that ψ(α e , x i ) > ci when ei = 1 and ψ(α e , x i ) < ci when ei = 0 for i = 0, . . . , j − 1, where ei is the ith component of e. In other terms, if the set S is P-shattered, there must exist a vector [c0 , . . . , cj −1 ] such that it is possible to find a function in Γ that can arbitrarily “pass” above or below the various cj . Figure 1 illustrates graphically the concept of P-shattering. Definition 2. The P-dimension of Γ is the largest integer m such that there exists a set S of cardinality m that is P-shattered by Γ . As an example, consider (for the one-dimensional case) the family Γ = {y = k1 x + k2 , k1 , k2 ∈ R}. Figure 2 depicts this situation. As we can see from the figure, it is not possible to find three points that the function can arbitrarily pass above or under, while all the combinations are possible if we consider two points. As an example, in the situation represented in the right part of Figure 2, a function in Γ for e = [1, 0, 1] cannot be found. Therefore, we can conclude that the P-dimension for this particular family of functions is equal to 2.
72
CERVELLERA AND MUSELLI
F IGURE 2.
P-dimension of one-dimensional linear functions.
The exact value of the P-dimension can be computed only for very simple classes of functions, such as hyperplanes or hyperspheres. For more realistic models, like neural networks or radial basis function networks, only upper bounds or asymptotic behaviors are available (Anthony and Bartlett, 1999). However, the definition of P-dimension allows us to obtain some results about uniform convergence of empirical means. The following notation will be used: ρ(L, ε, ℓ, Γ ) = sup P sup Remp α, x L − RQ (α) > ε Q∈Q
α∈Λ
where Q is the set of all the probability measures on X. It can be observed that if ρ(L, ε, ℓ, Γ ) → 0 when L → ∞, for every ε > 0, the uniform convergence of the empirical means (4) occurs independently of the underlying probability. In this case the term distribution-free convergence is usually adopted. The following theorem gives sufficient conditions for the validity of condition (4) as well as an explicit upper bound for the number L of samples needed to achieve a desired accuracy for ρ(L, ε, ℓ, Γ ). The proof of the theorem can be found in Vidyasagar (1997). Theorem 2. Suppose the family Γ has finite P-dimension m and the loss function ℓ satisfies the following uniform Lipschitz condition ℓ(y, u1 ) − ℓ(y, u2 ) ≤ μ|u1 − u2 | for every y, u1 , u2 ∈ [0, 1] (5) for some constant μ. Then, the property of distribution-free uniform convergence of empirical means holds: lim ρ(L, ε, ℓ, Γ ) = 0 for every ε > 0.
L→∞
DETERMINISTIC LEARNING AND AN APPLICATION
73
Moreover, the inequality ρ(L, ε, ℓ, Γ ) ≤ δ is satisfied, provided at least 16eμ 16eμ 8 32 + ln ln (6) L ≥ 2 ln + m ln δ ε ε ε samples are drawn. This theorem states that if we want the difference between the empirical risk and the expected risk to be less than ε with probability 1 − δ for each ψ ∈ Γ , we must choose the number of samples L according to Eq. (6). This lower bound is independent of the probability measure adopted to generate the training set, provided that Assumptions 1 and 2 are valid. We can use this bound to relate the number L of samples to the error between the expected risk RQ (α ∗L ) in the point of minimum of Remp (α, x L ) and the best achievable risk RQ (α ∗ ). For what concerns the uniform Lipschitz condition (5) on ℓ, commonly used loss functions such as ℓ(y, u) = |y − u|n satisfy this property. Equation (6) and Theorem 1 provide an explicit indication about the sample complexity of the learning problem, that is, how many samples we must draw to attain a given error between the best approximating function in Γ and the one obtained with our training algorithm. The first thing to be noted is that the bound is, apparently, independent of the dimension d of the input vector. In most situations this is not the case, since in general the P-dimension m depends on d. However, we have L = O(ln m); consequently, if we choose a class of approximating functions with a P-dimension that does not grow superexponentially with d, we can say that the curse of dimensionality is avoided, at least for what concerns the sample complexity. Inequality (6) seems to suggest that a certain accuracy can be achieved by using fewer training samples providing that a class Γ with a small Pdimension is adopted. Nevertheless, if we reduce the complexity too much, then the approximation capability becomes too limited. In other words, there exists a trade-off between the estimation error RQ (α ∗L ) − RQ (α ∗ ) and the approximation error RQ (α ∗ ) itself: by reducing the P-dimension m we can obtain a small estimation error at the expense of increasing the approximation error. On the other hand, if we increase m to retrieve a good approximation for the function we want to learn, we need to increase the number L of samples to obtain an acceptable estimation based on the ERM approach. If we use too few samples, the phenomenon generally known as overfitting (Bishop, 1995) occurs, where too much freedom in the choice of the approximating function leads the trained model to show made-up complex behavior in undersampled regions.
74
CERVELLERA AND MUSELLI
It is important to note that the bound in Eq. (6) is derived in a distributionfree context. This means that it is valid for every possible probability measure on X. This is useful when we do not have any prior knowledge about the underlying probability by which the samples are drawn. Including this knowledge may probably lead to better bounds on L, especially if we choose a suitable training algorithm. For a discussion about this particular topic the reader is referred to Vidyasagar (1997). It is also important to point out that the bound on the number of samples corresponds to a quadratic rate for the sample complexity, since it depends on ε −2 . This is consistent with typical convergence results of random methods and Monte Carlo algorithms. Furthermore, such bounds are probabilistic in nature, that is, we must expect the results to hold true within a certain confidence interval δ. We will show in the next section how the possibility of choosing the points x l of the training set can lead to a significant improvement in the rate of sample complexity, besides providing the possibility of retrieving deterministic results not involving any confidence value.
IV. D ETERMINISTIC L EARNING If the location of the input patterns x l to be included in the training set (x L , y L ) is not fixed beforehand, but is part of the learning process, the SL approach is no more applicable since Assumptions 1 and 2 do not hold anymore. In fact, the position of the points x l in active learning is typically decided by a deterministic algorithm that does not make subsequent choices in an independent manner. Most existing active learning methods use an optimization procedure for the generation of the input sample x l+1 on the basis of the information contained in previous training points (MacKay, 1992; Cohn, 1994; Kindermann et al., 1995; Fukumizu, 1996), which possibly leads to a heavy computational burden. In addition, some strong assumptions on the observation noise y − g(x) [typically, noise with normal density (Cohn, 1994; Fukumizu, 2000)] or on the class of learning models are introduced. In this section, the validity of Condition 1 in a very general situation is examined. In particular, it will be shown that even if the location of the input patterns x l is decided a priori, the learning problem can still be learnable, provided that some mild assumptions on the class Γ of models is verified. Three different cases will be examined in the following sections: 1. the distribution-free case, where the probability density q, employed to evaluate the expected risk, is unknown (Section IV.A), but the output y = g(x) is observed without noise;
DETERMINISTIC LEARNING AND AN APPLICATION
75
2. the distribution-dependent case, where q is known and can be suitably taken into account (Section IV.D); again, the output is supposed to be noise free; and 3. the noisy case, where the observation noise can be described by any probability distribution (Section IV.E), provided that it does not depend on x and its mean is zero. In the first two cases we suppose that the output y for a given input x is observed without noise, that is, y = g(x). With this assumption the expected risk RQ (α) becomes RQ (α) = ℓ g(x), ψ(x, α) q(x) dx X
since the output y is no longer a random variable. The generation of the points x l to be included training set is in the l such that Ψ (L) performed by a deterministic algorithm Ψ : N → ∞ X l=1 is a collection of exactly L input patterns x l , which can be written as x L . Ψl (L) denotes the single point x l of the sequence. We assume henceforth that X is the d-dimensional semiclosed unit cube [0, 1)d . However, it is possible to extend the results to other intervals of Rd or more complex input spaces, such as spheres and other compact convex domains like simplexes, by suitable transformations (Fang and Wang, 1994). If the input space X is not compact, it is always possible to find a compact K ⊂ X such that the probability measure of the difference X \ K is smaller than any fixed positive value ε [see the Ulam’s theorem (Dudley, 1989)]. Now, the smallest interval I including K can be considered as the input space by simply assigning null probability to the measurable set I \ K and by defining g(x) = 0 for x ∈ I \ K. A. The Distribution-Free Case In this section we suppose that no information is available about the behavior of the probability Q. We only know that Q belongs to a subset Q of the complete collection Q including all the probability measures on X. With this assumption privileging certain regions of the input space over others would be unreasonable; consequently, we can consider the uniform probability density instead of q in the computation of the expected risk, which reduces to the following risk functional: R(α) = ℓ g(x), ψ(x, α) dx. (7) X
76
CERVELLERA AND MUSELLI
The use of this risk functional in place of RQ (α) is theoretically motivated by the following result. Theorem 3. Suppose that every Q ∈ Q admits a density q with q∞ ≤ M for some fixed M ∈ R and that the risk functional R(α) can be minimized up to any desired accuracy (zero-error hypothesis), that is, minα∈Λ R(α) = 0. Then, R(α) = 0 implies RQ (α) = 0 for every Q ∈ Q. Proof. Consider α ∈ Λ such that R(α) = 0. For every Q ∈ Q, having density q, we can write RQ (α) = ℓ g(x), ψ(x, α) q(x) dx ≤ M ℓ (x), ψ(x, α) dx X
X
= MR(α) = 0.
Thus, RQ (α) = 0 since it is a nonnegative quantity. This result ensures that if the risk R(α) can be minimized up to any accuracy (as is the case for many approximators, including most neural network architectures), conditions for learnability obtained by considering the risk functional (7) hold also for the expected risk RQ (α), provided that the probability measure Q is absolutely continuous with respect to the uniform one and its density is bounded. Under this assumption the validity of Condition 1 can be established by employing the following result, which can be viewed as the parallel of Theorem 1. Theorem 4. Condition 1 is verified if the sequence {Ψ (L)}∞ L=1 satisfies lim sup Remp α, Ψ (L) − R(α) = 0. (8) L→∞ α∈Λ
¯ such Proof. If condition (8) holds, for any ε > 0 we can choose an L¯ = L(ε) that for every L ≥ L¯ ε R(α ∗L ) ≤ Remp α ∗L , Ψ (L) + 2 and ε Remp α ∗ , Ψ (L) ≤ R(α ∗ ) + . 2 Since, by definition of α ∗L we have Remp [α ∗L , Ψ (L)] ≤ Remp [α ∗ , Ψ (L)] the fulfillment of Condition 1 follows.
DETERMINISTIC LEARNING AND AN APPLICATION
77
Condition (8) can be considered as the equivalent for DL of the uniform convergence of empirical means property analyzed in Section III for SL. Since we are using the uniform density to compute the risk functional R(α), a basic requirement for the fulfillment of such a condition is that the points of the deterministic sequence x L = Ψ (L) are well spread over the input space X. If β is a collection of Lebesgue-measurable subsets of X and B ∈ β, denote with cB the characteristic function 1 if x ∈ B cB (x) = 0 otherwise and with C(B, x L ) the number of points of x L that belong to B L−1 cB (x l ). C B, x L =
(9)
l=0
Then, if λ(B) is the Lebesgue measure of the subset B, the spreading of the set of points x L over B can be measured by the absolute difference between the ratio C(B, x L )/L and λ(B). If we consider the whole collection β, this measure gives rise to the quantity C(B, x L ) L (10) − λ(B). Dβ x = sup L B∈β The following particular choices of β are commonly employed in numerical analysis (Fang and Wang, 1994; Niederreiter, 1992) and probability (Alon and Spencer, 2000).
Definition 3. If β is the collection of all the closed subintervals of X of the form di=1 [ai , bi ], then the quantity Dβ (x L ) is called discrepancy and is denoted by D(x L ). If β is the collection of all the closed subintervals of X of the form di=1 [0, bi ], then the quantity Dβ (x L ) is called star discrepancy and is denoted with D ∗ (x L ). A classic result (Kuipers and Niederreiter, 1974) states that the following three properties are equivalent: 1. Ψ (L) is uniformly distributed in X, that is, limL→∞ C[B, Ψ (L)]/L = λ(B) for all the subintervals B of X. 2. limL→∞ D(Ψ (L)) = 0. 3. limL→∞ D ∗ (Ψ (L)) = 0. Thus, a uniformly well-distributed sequence of points in the input domain has a small discrepancy or star discrepancy.
78
CERVELLERA AND MUSELLI
Now, with each vertex v of a given subinterval B = di=1 [ai , bi ] of X a binary string s can be associated, whose ith bit is 0 if the corresponding component vi of the vertex is equal to ai and 1 if vi = bi . Denote with EB (respectively OB ) the set of vertexes whose associated strings contain an even (respectively odd) number of 1s. For every function ϕ : X → R we define (ϕ, B) as the alternating sum of ϕ computed at the vertexes of B, that is, ϕ(x) − ϕ(x). (ϕ, B) = x∈EB
x∈OB
Definition 4. The variation of ϕ on X in the sense of Vitali is defined by (Niederreiter, 1992) (ϕ, B) (11) V (d) (ϕ) = sup β B∈β
where β is any partition of X into subintervals of the form
d
i=1 [ai , bi ].
If the partial derivatives of ϕ are continuous on X, the variation V (d) (ϕ) can be written as (Niederreiter, 1992) V
(d)
(ϕ) =
1 0
1 ∂d ϕ · · · ∂x . . . x 1
0
d
dx1 . . . dxd
(12)
where xi is the ith component of x. The equivalence between Eq. (11) and Eq. (12) can be readily seen when the function ϕ is monotone increasing in the domain [0, 1]d . In this case, the supremum in Eq. (11) is reached when the partition β contains only the whole interval [0, 1]d . On the other hand, the dth derivative in Eq. (12) is always nonnegative and a direct integration shows that the alternating sum (ϕ, [0, 1]d ) follows as result. Similar reasoning makes it possible to achieve the same conclusion when ϕ is monotone decreasing. For a general ϕ the equivalence between Eq. (11) and Eq. (12) can be viewed by partitioning the domain [0, 1]d into subintervals, where the restriction of ϕ to each of them is again monotone. For 1 ≤ k ≤ d and 1 ≤ i1 < i2 < · · · < ik ≤ d, let V (k) (ϕ, i1 , . . . , ik ) be the variation in the sense of Vitali of the restriction of ϕ to the k-dimensional face {(x1 , . . . , xd ) ∈ X: xi = 1 for i = i1 , . . . , ik }. Definition 5. The variation of ϕ on X in the sense of Hardy and Krause is defined by (Niederreiter, 1992) VHK (ϕ) =
d
k=1 1≤i1
V (k) (ϕ, i1 , . . . , ik ).
(13)
DETERMINISTIC LEARNING AND AN APPLICATION
79
By the following result, due to Hlawka (1961), we can tie the accuracy of the integration for a function of bounded variation to the star discrepancy of the sample x L . Theorem 5. [Koksma–Hlawka (KH) inequality]. If ϕ has bounded variation VHK (ϕ) on X in the sense of Hardy and Krause, then, for any x L ∈ XL , we have L−1 1 ϕ(x ) − ϕ(x) dx (14) ≤ VHK (ϕ)D ∗ x L . l L l=0
X
Assume the following conditions hold:
Assumption 3. The sequence of points Ψ (L) satisfies limL→∞ D ∗ [Ψ (L)] = 0. Assumption 4. The loss function ℓ, the function g, and the family Γ of models are such that supα∈Λ VHK (α) < ∞ where VHK (α) = VHK ℓ g(x), ψ(x, α) .
Under such assumptions, the KH inequality can be employed to verify condition (8). Theorem 6. The following upper bound holds: Remp α, Ψ (L) − R(α) ≤ VHK (α)D ∗ Ψ (L) . Thus, if Assumptions 3 and 4 are verified, we have lim sup Remp α, Ψ (L) − R(α) = 0. L→∞ α∈Λ
Proof. It is a direct consequence of the KH inequality.
From this result we see that the rate of convergence of the estimation error can be directly related to the rate of convergence of the star discrepancy of the sequence Ψ (L). As we have seen earlier, this corresponds to ensuring a sufficiently uniform spreading of the points in Ψ (L). Furthermore, due to the structure of the KH inequality, we can note that it is possible to deal with the uniformity of the sampling separately from the issue of the complexity of the approximating model. In fact, the latter depends only on the value of the variation of the loss function (which here plays the role of the Pdimension introduced in Section III). This is different from what happens in
80
CERVELLERA AND MUSELLI
bounds obtained for SL [such as the one in Eq. (6)], where the number of points needed to reach a given accuracy (within a certain confidence interval) depends on the richness of the involved functions in a more complex way. B. Ensuring a Bounded Variation We have seen that for the DL problem to be learnable, all the involved functions must satisfy regularity conditions imposed by Assumption 4. In particular, the variation of the loss function ℓ, of the family of models Γ , and of the unknown function g must be finite. Given a function φ(z) : A ⊂ Rp → R, we introduce the notation ∂i1 ,...,ik φ
∂kφ ∂zi1 . . . ∂zik
for 1 ≤ k ≤ p and 1 ≤ i1 ≤ i2 ≤ · · · ≤ ik ≤ p, where zi is the ith component of z. Then, for M ∈ R, we define the class WM (X) of functions in the following way: WM (X) = f : X → Y such that ∂i1 ,...,ik f is continuous on X and |∂i1 ,...,ik f | ≤ M for 1 ≤ k ≤ d and 1 ≤ i1 ≤ · · · ≤ ik ≤ d .
According to Eqs. (12) and (13), all the elements of WM (X) have bounded variation in the sense of Hardy and Krause. To state sufficient conditions for the finiteness of the variation of the loss function, we introduce the following lemma (Cervellera and Muselli, 2004). Lemma 1. Consider the generic composite function g[μ1 (ξ1 , . . . , ξd ), . . . , μs (ξ1 , . . . , ξd )], where μj : [0, 1)d → Hj ⊂ R for 1 ≤ j ≤ s. Suppose the following conditions hold: 1. g ∈ WM ( j Hj ); 2. μj ∈ WM ([0, 1)d ) for all 1 ≤ j ≤ s. Then g has bounded variation in the sense of Hardy and Krause on [0, 1)d .
Proof. By Eq. (13) we see that the total variation VHK (g) is finite, provided that every term V (k) (g; i1 , . . . , ik ), for 1 ≤ k ≤ d, is finite, where ξi = 1 for i = i1 , . . . , ik . To this purpose, we recur to Eq. (12) and write (k) (g; i1 , . . . , ik ) V 1 k 1 ∂ g[μ1 (ξ1 , . . . , ξd ), . . . , μs (ξ1 , . . . , ξd )] dξi . . . dξi . = · · · k 1 ∂ξ . . . ∂ξ 0
0
i1
ik
DETERMINISTIC LEARNING AND AN APPLICATION
81
(k) (g; i1 , . . . , ik ) is the actual variation in the sense of Vitali The term V 1 , . . . , ik ), when the partial derivative under integration is continuous on [0, 1)d . In particular, we have k (k) (g; i1 , . . . , ik ) ≤ sup ∂ g(μ1 (ξ1 , . . . , ξd ), . . . , μs (ξ1 , . . . , ξd )) . V ∂ξi1 . . . ∂ξik ξi1 ,...,ξik
V (k) (g; i
For a generic ξp , i1 ≤ p ≤ ik , we can write s
∂g ∂μj ∂g = . ∂ξp ∂μj ∂ξp j =1
The computation of the second partial derivative with respect to another generic ξq , q = p, yields s ! s ∂ 2 g ∂μn ∂μj ∂ 2g ∂g ∂ 2 μj . = + ∂ξp ∂ξq ∂μj ∂μn ∂ξq ∂ξp ∂μj ∂ξp ∂ξq j =1
n=1
In general, when deriving ∂ r g/∂ξi1 . . . ∂ξir to obtain the higher order derivatives ∂ r+1 g/∂ξi1 . . . ∂ξir+1 , 2 < r ≤ k − 1, we have a combination through sums and products of many terms having one of the two following structures: 1.
∂j g ∂μi1 ...∂μij
, for 1 ≤ j ≤ r.
Deriving each of these terms with respect to ξir+1 generates a term corresponding to s
m=1
2.
∂j μ
p
∂ξi1 ...∂ξij
∂ j +1 g ∂μm . ∂μi1 . . . ∂μij ∂μm ∂ξir+1
, for 1 ≤ j ≤ r and 1 ≤ p ≤ s.
Each of these terms, once derived with respect to ξir+1 , generates a term that has the following structure: ∂ j +1 μp . ∂ξi1 . . . ∂ξij ∂ξir+1 As 1 corresponds to ∂i1 ,...,ij g and 2 to ∂i1 ,...,ij μp , it is straightforward from conditions 1 and 2 to verify that ∂ k g[μ1 (ξ1 , . . . , ξd ), . . . , μs (ξ1 , . . . , ξd )] ∂ξi1 . . . ∂ξik is continuous on [0, 1)d .
82
CERVELLERA AND MUSELLI
Then, by recalling that for every a, b ∈ R, |a + b| ≤ |a| + |b|, and |ab| = |a||b|, we have k ∂ g[μ1 (ξ1 , . . . , ξd ), . . . , μs (ξ1 , . . . , ξd )] < ∞. ∂ξi1 . . . ∂ξik
(k) (g; i1 , . . . , ik ) are finite for each 1 ≤ k ≤ It has been shown that all the V d. Given the hypotheses of the lemma, we show that such terms actually correspond to V (k) (g; i1 , . . . , ik ), thus proving the finiteness of VHK (g). By using this lemma we can see that Assumption 4 is verified provided that 1. the family of models Γ is such that ∂α,i1 ,...,ik ψ is continuous for any α ∈ Λ and sup |∂α,i1 ,...,ik ψ| < ∞
α∈Λ
for all 1 ≤ k ≤ d and all 1 ≤ i1 ≤ · · · ≤ ik ≤ d; 2. the unknown function g is such that ∂i1 ,...,ik g is continuous and |∂i1 ,...,ik g| < ∞ for all 1 ≤ k ≤ d and all 1 ≤ i1 ≤ · · · ≤ ik ≤ d; 3. the loss function ℓ is such that ∂i1 ,...,ik ℓ is continuous and |∂i1 ,...,ik ℓ| < ∞ for all 1 ≤ k ≤ 2 and all 1 ≤ i1 ≤ · · · ≤ ik ≤ 2. These three conditions imply that ψ (for any α ∈ Λ), g, and ℓ have bounded variation (in the sense of Hardy and Krause). Furthermore, ℓ must belong to C 2 and g and ψ (for any α ∈ Λ) must belong to C d . To show how we can use Eq. (12) to compute directly the variation in the sense of Hardy and Krause, we present three examples regarding algebraic functions in WM ([0, 1)d ). " 1. ϕ1 (x) = α0 + di=1 αi xi , with αi ∈ R for i = 0, 1, . . . , d. All the terms V (k) (ϕ1 , i1 , . . . , ik ) for k > 1 are equal to 0. Therefore, the variation is given by VHK (ϕ1 ) =
d i=1 0
1
|αi | dxi =
d i=1
|αi |.
DETERMINISTIC LEARNING AND AN APPLICATION
83
2. ϕ2 (x) = di=1 xi . In this case every term V (k) (ϕ2 , i1 , . . . , ik ) is equal to 1. Thus, the variation is given by VHK (ϕ2 ) = d
d
1
k=1 1≤i1
dxi =
d d k=1
k
= 2d − 1.
3. ϕ3 (x) = i=1 (1 − xi ). If we consider the restriction of ϕ3 to the k-dimensional face {(x1 , . . . , xd ) ∈ X: xi = 1 for i = i1 , . . . , ik }, we can see that it is equal to 0 for each k < d. Therefore, only the term V (d) has nonzero value, and the resulting variation is VHK (ϕ3 ) =
1
dxi = 1.
0
However, computing the variation of a generic function is not always so straightforward. In general, this can be a very difficult task [a discussion of this kind of variation can be found in Blumlinger and Tichy (1986), Hua and Wang (1981), Tichy (1984)], made even more difficult by the fact that similar functions can have completely different variations, as shown by examples 2 and 3, where the two functions are identical up to rotation. However, in the case of “well-behaved” functions, that is, functions with continuous derivatives, the computation of upper bounds for the variation can be much simpler. This is the case of commonly used approximating models, such as neural networks. As an example, we report here upper bounds for the variation of feedforward multilayer perceptrons and radial basis functions. 1. Feedforward Neural Networks Consider the class of one hidden layer feedforward neural networks, described by the functions d ν (15) cn σ ani xi + bn + c0 ψ(x, α) = n=1
i=1
where σ is the activation function, c0 , cn , bn ∈ R, and a n [an1 , . . . , and ]⊤ ∈ ⊤ Rd . Then the parameter vector α is given by α = [a ⊤ 1 , . . . , a ν , b1 , . . . , b ν , c0 , . . . , cν ]⊤ and d ν # ∂i1 ,...,ik ψ = ani xi + bn anij cn σ (k) n=1
i=1
1≤j ≤k
84
CERVELLERA AND MUSELLI
where σ (k) (z), z ∈ R, is the kth derivative of σ . Define σ¯ (k) supz∈R |σ (k) (z)|, and suppose it is finite. Then we obtain |∂i1 ,...,ik ψ| ≤
ν n=1
|cn |σ¯ (k)
#
1≤j ≤k
|anij |.
(16)
Thus, if the parameters a n , bn , cn are bounded and σ¯ (k) is finite, Eq. (12) implies that the terms V (k) (ψ) are finite; then by Eq. (13) we also have that VHK (ψ) is bounded. Note that for typical sigmoidal activation functions, we have σ¯ (k) = 1. A possible way of controlling the P-dimension of neural networks consists of imposing an upper bound on the absolute value of the parameters a n , bn , cn . In this way, overfitting can be avoided, thus establishing a trade-off between approximation and estimation error (Bishop, 1995). Inequality (16) ensures that the same approach is also valid in the DL framework. 2. Radial Basis Functions Radial basis function networks are characterized by the following structure: "d ν 2 i=1 (xi − τni ) ψ(x, α) = + c0 cn exp − 2sn2 n=1
where c0 , cn , sn ∈ R, and τ n [τn1 , . . . , τnd ]⊤ ∈ X. In this case the parameter vector α has the form α = [c0 , . . . , cν , s1 , . . . , sν , ⊤ ⊤ τ⊤ 1 , . . . , τ ν ] , and the computation of ∂i1 ,...,ik ψ yields "d ν 2 # −2 k i=1 (xi − τni ) −sn cn exp − (xij − τnij ). ∂i1 ,...,ik ψ = 2sn2 1≤j ≤k
n=1
Consequently we obtain
|∂i1 ,...,ik ψ| ≤
ν # −2k s |cn | |xij − τnij |. n 1≤j ≤k
n=1
If the parameters cn and τnij are bounded for each n = 1, . . . , ν, the finiteness of supα∈Λ VHK (ψ) follows directly. If X = [0, 1)d , usually τni ∈ [0, 1); in this case we have ν −2k s |cn |. |∂i1 ,...,ik ψ| ≤ n n=1
It is interesting to note that the upper bound on the variation increases as sn gets smaller, that is, as the basis functions “shrink.” This is another classic way of controlling overfitting.
DETERMINISTIC LEARNING AND AN APPLICATION
85
C. Bounds on the Convergence Rate of the ERM Approach Theorem 6 allows us to conclude that the convergence rate of the ERM approach in the DL framework can be controlled by the rate of convergence of the star discrepancy of the sequence Ψ (L). In this section we introduce a special family of deterministic sequences that leads to an almost linear convergence of the estimation error. Such sequences, typically employed in quasirandom (or quasi-Monte Carlo) integration methods, are usually referred to as low-discrepancy sequences. A detailed discussion of quasirandom methods and low-discrepancy sequences can be found in Niederreiter (1992). Here we report only the most relevant results for the purposes of deterministic learning. Definition 6. An elementary interval in base b (where b ≥ 2 is an integer) is a subinterval E of X having the form E=
d # −p ai b i , (ai + 1)b−pi i=1
where ai , pi ∈ Z, pi > 0, 0 ≤ ai ≤ bpi for 1 ≤ i ≤ d. Let t, m be two integers satisfying 0 ≤ t ≤ m. A (t, m, d)-net in base b is a set F of bm points in X ⊂ Rd such that C(E, F ) = bt [as defined in Eq. (9)] for every elementary interval E in base b with λ(E) = bt−m . It is easy to see that a (t, m, d)-net is endowed with properties of good uniform spreading in X. In fact, if the sample x L is a (t, m, d)-net in base b, every elementary interval in which we divide X must contain bt points of x L . However, the cardinality of (t, m, d)-nets, being constrained to be equal to bm , can be a limitation for the choice of the sample x L ; therefore, to have a higher degree of freedom in choosing the sample size L, we introduce (t, d)-sequences. Definition 7. Let t ≥ 0 be an integer. A sequence {x 0 , . . . , x L−1 } of points in X ⊂ Rd is a (t, d)-sequence in base b if, for all the integers k ≥ 0 and m ≥ t [with (k + 1)bm ≤ L], the point set consisting of {x kbm , . . . , x (k+1)bm } is a (t, m, d)-net in base b. The definitions of (t, m, d)-nets and (t, d)-sequences are due to Sobol’ (1967) for the case b = 2 and to Niederreiter (1987) for the general case. The following result (Niederreiter, 1992) can be proved.
86
CERVELLERA AND MUSELLI
Theorem 7. For every dimension d ≥ 1 and every prime power v (i.e., an integer power of a prime number), there exists a [T (v, d), d]-sequence in base v, where T (v, d) is a constant that depends only on v and d. Although explicit nonasymptotic bounds for the star discrepancy of a (t, d)sequence in base b exist, it is necessary to verify the advantages of using such sequences, with respect to classic Monte Carlo methods (Hammersley and Handscomb, 1964), by considering their asymptotic behavior. In particular, suppose that b is a prime power v. By Theorem 7 there is a [T (v, d), d]sequence x L where T (v, d) can be determined from v and d. For this sequence we have (Niederreiter, 1992) LD ∗ x L ≤ C(d, v)v T (v,d) (log L)d + O (log L)d−1 . We can then choose the optimal v that yields
Cd = min C(d, v)v T (v,d) v
(17)
where the minimum is made over all the prime powers v. It is possible to show (Niederreiter, 1992) that Cd can be found as the minimum in a finite set, therefore we can tabulate the values of Cd for every dimension d, along with the corresponding optimal prime power v. Asymptotically we have (Niederreiter, 1992) d d 1 Cd < . d! log(2d)
In other words, Cd → 0 superexponentially as d → ∞. This means that our [T (v, d), d]-sequence x L , obtained after the optimization in Eq. (17), satisfies (log L)d−1 ∗ L . (18) D x ≤O L Consequently, we can see how (t, d)-sequences in base b satisfy Assumption 3 with an almost linear rate of convergence. For what concerns the ERM approach in the DL framework, if we use the optimized [T (v, d), d]sequence, the following result can be obtained from Theorem 6 and inequality (18).
Corollary 1. Let x L be a [T (v, d), d]-sequence optimized as in (17), and V = supα∈Λ VHK (ℓ) < ∞. Then we have V (log L)d−1 L . (19) sup Remp α, x − R(α) ≤ O L α∈Λ
DETERMINISTIC LEARNING AND AN APPLICATION
87
Now, when facing an active learning problem, a possible way of choosing the points in x L amounts to performing a random sampling with uniform probability inside the input domain X. In this case we can analyze the convergence rate of the ERM approach by applying the results obtained in the SL framework. However, if we compare the bound in Eq. (6) with that derived from Corollary 1, we can conclude that the use of deterministic low-discrepancy sequences permits a faster asymptotic convergence. Specifically, if we ignore logarithmic factors, we have a rate of O(1/L) for a [T (v, d), d]-sequence, and a rate of O(1/L1/2 ) for a random generation of points. From a practical point of view, it is often recognized that the theoretical advantages of low-discrepancy sequences arise, for high-dimensional contexts, only when L is large. The results coming from actual applications of low-discrepancy methods in the literature seem to show that they definitely outperform Monte Carlo methods when the dimension is not particularly large (d ≤ 10) or, equivalently, only a few components of the input are effectively important (see, for example, Caflisch et al., 1997; Morokoff and Caflisch, 1995; and the references therein). The main advantages of Monte Carlo methods rely on the fact that the quadratic rate of convergence is mostly independent on the dimension d, and that it is easy to obtain estimates of the integration error (for example, an estimate of its variance) for a finite sample size. For low-discrepancy sequences, instead, only exaggerated upper bounds on finite-sample accuracy are known. Still, the good theoretical properties of low-discrepancy sequences have suggested developing randomized versions, usually referred to as randomized quasi-Monte Carlo methods, which try to combine the good properties of purely Monte Carlo and low-discrepancy techniques. Basically, these methods consist of applying random permutations to the digits of points in elementary intervals, so that their basic properties are preserved. For such sequences, it is possible to provide accurate estimates of the variance of the error for a finite sample of points. Then, if the function we want to integrate presents some additional regularity such as the Lipschitz condition, it is possible to further improve the linear rate of convergence of pure low-discrepancy sequences (Owen, 1997). The interested reader is referred to Owen (1995) for a detailed discussion of randomized quasi-Monte Carlo methods. D. The Distribution-Dependent Case In the context of deterministic learning, now consider the problem of minimizing the functional risk when we have complete information about the
88
CERVELLERA AND MUSELLI
probability measure Q by which the samples will be drawn in the application phase. Once again suppose that the output value is not corrupted by noise, that is, y = g(x). In this case, the learning problem aims at finding min RQ (α) = min ℓ g(x), ψ(x, α) q(x) dx α∈Λ
α∈Λ
X
with q(x) being the density of the probability measure Q. This learning problem can still be solved by employing low discrepancy sequences introduced in the previous section provided that the following assumption holds. Assumption 5. The probability density q belongs to WM (X) and the loss function ℓ belongs to WM (Y 2 ) for some M ∈ R. In fact, it is possible to write the risk RQ as RQ (α) = ℓ g(x), ψ(x, α) q(x) dx = ℓ′ (x, α) dx. X
X
Thus, we can solve the original problem by generating a low-discrepancy sequence x L = {x 0 , . . . , x L−1 } and by minimizing the following empirical risk L−1 Remp,Q (α) = ℓ′ (x l , α) l=0
ℓ′ (x
where l , α) = ℓ[g(x l ), ψ(x l , α)]q(x l ). Using Assumption 5 and Lemma 1, it is possible to prove the existence of M ′ ∈ R such that ℓ′ belongs to WM ′ (X). In this way, by minimizing Remp,Q (α) we also obtain a minimum for RQ (α), thus ensuring the fulfillment of Condition 1. E. The Noisy Case When applying a learning procedure to real world problems, it is very likely that the output of the observations in given sampled points is affected by random noise. Also, in this case, low-discrepancy sequences can be successfully applied, although a possible degradation in the rate of convergence can arise. Suppose that the value of the output y, for a given input x, is perturbed by a random noise η ∈ E ⊂ R y = g(x) + η.
DETERMINISTIC LEARNING AND AN APPLICATION
89
Then, we have a random term ηl in correspondence with any sample input x l . The following assumptions are commonly employed in the literature: Assumption 6. 1. The random variables ηl are i.i.d. according to a probability measure Pη with density pη and have zero mean; 2. the random variables ηl are independent of x l for l = 0, . . . , L − 1; 3. the loss function ℓ is quadratic. In this case the functional risk R(α) assumes the form 2 R(α) = g(x) + η − ψ(x, α) pη (η) dη dx X×E
=
X
2 g(x) − ψ(x, α) dx +
+ =
X
X
ηpη (η) dη
E
!
η2 pη (η) dη
E
g(x) − ψ(x, α) dx
2 g(x) − ψ(x, α) dx +
η2 pη (η) dη
(20)
E
since the random variable η has zero mean. With a similar procedure, the following expression is obtained for the empirical risk: L−1 2 1 L−1 1 (ηl )2 Remp α, x L = g(x l ) − ψ(x l , α) + L L l=0
+2
1 L
l=0
L−1 l=0
ηl g(x l ) − ψ(x l , α) .
By combining Eqs. (20) and (21) the following inequality is derived: sup Remp α, x L − R(α) α∈Λ
L−1 1 2 2 ≤ sup g(x) − ψ(x, α) dx g(x l ) − ψ(x l , α) − α∈Λ L l=0
X
(21)
90
CERVELLERA AND MUSELLI L−1 1 |ηl | sup g(x l ) − ψ(x l , α) L α∈Λ l=0 L−1 1 + (ηl )2 − η2 p(η) dη. L
+2
l=0
(22)
E
We have seen in the previous sections that the first summand in Eq. (22) converges, with the usual rate of O(1/L). Concerning the second and third terms, which do not depend on α, we can use Hoeffding’s inequality (Devroye et al.,√1997; Hoeffding, 1961) to obtain a rate of convergence of order O(1/ L). This proves that the consistency of the ERM approach is also preserved in the noisy case, even when adopting low-discrepancy sequences. However, the presence of the noise spoils the linear rate of estimation for the “deterministic” part of the output, resulting in a global quadratic rate of convergence (which is still not worse than rates obtained in the SL framework). In a way, this can be considered as a hybrid context between pure SL and pure DL. Intuitively, we can expect to fully exploit the advantageous properties of the quasirandom approach in a purely deterministic context. Still, if the output error is small, we can apply the Bernstein–Chernoff bounds (Angluin and Valiant, 1979) for the last two terms at the right-hand side of Eq. (22), thus again obtaining an almost linear rate of convergence.
V. D ETERMINISTIC L EARNING FOR O PTIMAL C ONTROL P ROBLEMS We will devote the rest of the chapter to an important application of efficient learning in a high-dimensional space, namely a particular kind of optimal control problem, focusing on the applicability of the DL framework to take advantage of the good theoretical properties previously derived. Consider a Markovian decision process in which we want to control a discrete dynamic system while it evolves through a certain horizon of temporal stages, according to the general stochastic state equation x t+1 = f (x t , ut , θ t ),
t = 0, 1, . . .
where x t ∈ Xt ⊂ Rd is the state vector, ut ∈ Ut ⊂ Rm is the control vector, and θ t ∈ Θt ⊂ Rq is a random vector, for example, a disturbance affecting the system. Suppose the random vectors θ t are characterized by a probability measure P (θ t ) with density p(θ t ), defined on the Borel σ -algebra of Rq .
DETERMINISTIC LEARNING AND AN APPLICATION
91
The aim of the control is to minimize a cost function associated with the evolution of the state, which generally has an additive form over the various stages. As the decision problem is Markovian, we want to obtain control functions in a closed-loop form, that is, the control vector at each stage must be a function µt , typically called policy, of the current state vector ut = µt (x t ),
t = 0, 1, . . . .
This entails an optimization problem that we consider in two versions: a T-stage stochastic optimization (T-SO) problem and a discounted infinitehorizon stochastic optimization (∞-SO) problem. In the first case, we want to find the optimal control law u◦ = [µ0 ◦ (x 0 )⊤ , . . . , µT −1 ◦ (x T −1 )⊤ ]⊤ that minimizes ! T −1 F (u) = E h(x t , ut , θ t ) + hT (x T ) θ
t=0
subject to µt (x t ) ∈ Ut ,
t = 0, . . . , T − 1
and x t+1 = f x t , µt (x t ), θ t ,
t = 0, . . . , T − 1
⊤ ⊤ ⊤ ⊤ ⊤ where x 0 is a given initial state, θ = [θ ⊤ 0 , . . . , θ T −1 ] , u = [u0 , . . . , uT −1 ] , h(x t , ut , θ t ) is the cost paid at the single stage t, and hT (x T ) is the cost associated with the final stage [in many cases hT (x T ) ≡ 0].
For what concerns ∞-SO problems, where the number of stages T is not limited, we usually look for policies that do not change from stage to stage. We consider here discounted problems, where the effect of future costs is weighted by a parameter β ∈ [0, 1). Now, the form of the cost to be minimized is lim E
T →∞ θ
subject to
T t β h(x t , ut , θ t ) t=0
ut = µ(x t ), and x t+1 = f (x t , ut , θ t ),
t = 0, 1, . . .
t = 0, 1, . . . .
92
CERVELLERA AND MUSELLI
As stated at the beginning of this chapter, problems of this kind are generally solved by DP. Yet, although efficient variations of the DP procedure exist for the “deterministic” version of the MOS problem, such as differential dynamic programming (Jacobson and Mayne, 1970), the presence of the random disturbances on the state makes the DP equations analytically solvable only when some assumptions on the dynamic system and on the cost function are verified. Typically, these assumptions are the classic “LQ” hypotheses (linear system equation and quadratic cost). For the general case we must look for approximate numerical solutions, that is, we must accept suboptimal policies based on an approximation of the cost and possibly of the control functions. This generally requires, for each stage t, the choice of sampling points in the d-dimensional state space and the approximation of the cost-togo functions outside these points. Several numerical algorithms have been proposed for the approximate solution of the DP procedure (see, e.g., Bellman et al., 1963; Bertsekas, 1975; Chow and Tsitsiklis, 1991; Foufoula-Georgiou and Kitanidis, 1988; Johnson et al., 1993). Anyway, if the problem is stated under very general hypotheses, any method based on discretization suffers from the problem of dimensionality, which prevents finding accurate solutions for nontrivial dimensions d. By “accurate” we mean solutions that are ε-close to the true solution (which implies the problem of finding an ε-close approximation to the cost-to-go functions). In fact, although ε-convergence of the approximate solution to the true solution can be proved for most numerical methods, the problem lies in their complexity in terms of the number of required sampling points as a function of the desired accuracy. In Chow and Tsitsiklis (1989, 1991) it is proved that the complexity of finding ε-close approximations to the true solutions of continuous Markovian decision processes by general discretization techniques, even when regularity hypotheses are supposed (such as Lipschitz assumptions), is ruled by an exponential upper bound, in terms of dimension of the state and control variables. If the class of considered problems is sufficiently restricted, though, there is hope of overcoming the problem of dimensionality by suitable techniques. For example, in Rust (1997) a class of Markovian decision processes with a finite set of possible actions is proved to be ε-solvable with quadratic complexity by means of Monte Carlo discretization. Unfortunately, this result cannot be employed whenever the sets of admissible controls Ut are uncountable. Such a general formulation finds application in many different practical examples, such as multicommodity networks, reservoirs operation, and inventory problems. Therefore, despite the unavoidable problem of dimensionality, there is the need to find computationally tractable methods that can be effectively applied to the general context.
DETERMINISTIC LEARNING AND AN APPLICATION
93
A very general algorithm is based on the approximation of the cost-to-go function by means of some fixed-structure parametric architecture, which is “trained” on the basis of sample points coming from the discretization of the state space. After the cost-to-go approximation is obtained for stage t, it can be used recursively to build the approximations for stages t − 1, t − 2, . . . . There are many examples of such an approach in which different structures are employed. Among the others are polynomial approximators (Bellman et al., 1963), splines (Johnson et al., 1993), multivariate adaptive regression splines (Chen et al., 1999), and neural networks (Baglietto et al., 2001; Bertsekas and Tsitsiklis, 1996) (in the last case the term neurodynamic programming is often used). This scheme for an approximate solution has succeeded in different high-dimensional actual examples of applications. Anyway, from a theoretical point of view, the method is affected by three main possible sources of the curse of dimensionality: 1. the complexity of the class of approximators needed to include an εa -close element to the unknown true cost-to-go function. We will refer to this issue as model complexity; 2. the number of sample points required to estimate a model that is εe -close to the “best” one inside our class (i.e., the one that is εa -close to the true cost-to-go function). This can be defined as sample complexity; 3. the complexity of the algorithm employed to perform the above mentioned estimation process (computational complexity). For what concerns model complexity, most of the commonly employed nonlinear models mentioned in previous sections, such as feedforward sigmoidal neural networks and radial basis functions networks, do not need an exponential growth of the number of parameters to approximate, within an arbitrary accuracy, suitable classes of functions defined on the basis of different a priori regularity assumptions. The interested reader is referred to Girosi et al. (1995) and Zoppoli et al. (2002) for a detailed discussion. Anyway, since the cost-to-go function is completely unknown, the regularity assumptions cannot actually be verified; thus we have very little control on this particular form of the curse of dimensionality. The two issues of sample complexity and model complexity can be easily interpreted in the context of learning problems that we have analyzed in the previous sections. In particular, we can exploit the advantageous bounds of deterministic learning to cope with the problem of dimensionality related to sample complexity. We have already pointed out how the use of a full uniform grid directly leads to the exponential growth of the number of points as the dimension of the state vector grows. A more feasible option for discretization is given by random sampling, for which the results on regression that we have seen in Section III hold. As we
94
CERVELLERA AND MUSELLI
have seen, under suitable hypotheses on the structure of the class of models, the sample complexity of the estimation is quadratic, almost independently of the dimension d. In the next sections we will focus on the use of low-discrepancy sequences at the base of the DL framework in the context of approximate DP. Examples of low-discrepancy sequences [(t, m, d)-nets and the Hammersley sequence] in the DP context are cited among the possible options of deterministic discretization in Chen (2001), where the use of orthogonal arrays is adopted. In Chen et al. (2003) and Cervellera et al. (2006), a 30-dimensional optimal reservoirs operation problem has been solved by applying different lowdiscrepancy sequences and feedforward neural networks, showing a successful practical application of the theory presented here.
VI. A PPROXIMATE DYNAMIC P ROGRAMMING A LGORITHMS We discuss general approximate dynamic programming (ADP) algorithms, commonly employed for the solution of both T -SO and ∞-SO problems. Since such methods are heavily based on the estimation of an unknown function from a set of finite samples that we are generally free to choose, it will be easy to see how all the results of Section IV can be directly applied to analyze their efficiency. A. T-SO Problems The computation of the optimal controls for a vector x t at stage t through DP is based on a function, called the cost-to-go function or value function, which represents the optimal cost that has to be paid starting from the state x t to reach stage T . The optimal controls can be obtained recursively by the following wellknown equations J ◦t (x t ) = min E h(x t , ut , θ t ) + J ◦t+1 f (x t , ut , θ t ) , t = T − 1, . . . , 0, J ◦T (x T )
ut ∈Ut θ t
hT (x T )
where J ◦t (x t ) is the cost-to-go function. It is possible to prove (Bertsekas, 2000) that J ◦0 (x 0 ) corresponds to the optimal cost of the T -SO problem.
As previously stated, it is generally impossible to solve the DP equations analytically. Thus, we discuss here a general numerical solution for which a discretization of the state spaces is needed. In that way, we can compute estimated values of the cost-to-go functions in the points of such discretization, and approximate them in the other points of the state space.
DETERMINISTIC LEARNING AND AN APPLICATION
95
We define xL t = {x t,l ∈ Xt : l = 0, . . . , L − 1},
t = 1, . . . , T − 1
as a sample of L points x t,l chosen in Xt , for each stage t. At each stage t, we consider approximations Jˆt+1 of the cost-to-go functions, having the form of generic parameterized functions with a fixed structure ψ(x, α), where α ∈ Λ ⊂ Rk is a set of “free” parameters to be optimized. Then, we define Jˆt x t , α ◦t = ψ x t , α ◦t where the notation α ◦ is used to indicate that the model has been optimized in a way that will be described in the following. With the approximation Jˆt+1 (x t+1 , α ◦t+1 ), we can write the DP equation as J˜◦t (x t,l ) = min E h(x t,l , ut , θ t ) + Jˆt+1 f (x t,l , ut , θ t ), α ◦t+1 (23) ut ∈Ut θ t
˜◦ for each x t,l ∈ x L t . Henceforth we will denote by J t an approximated value ◦ of the true cost-to-go function J t . Here, the quality of the approximation is influenced by the use of Jˆt+1 , by the impossibility of computing exactly the true minimum, and by the need to estimate the expected value on θ t through an average over a finite number of realizations of the random vectors. Once the L values J˜◦t (x t,l ), l = 0, . . . , L − 1, are computed, we can build the approximation Jˆt by optimizing the empirical risk, as defined in Section II. If we employ a typical mean square error (MSE) criterion, which corresponds to the quadratic loss function, we have α ◦t = arg min αt
L−1 2 1 ˜ ◦ Jt (x t,l ) − Jˆt (x t,l , α t ) . L l=0
Now we are able to evaluate the cost-to-go function for each point of Xt , as is required at stage t − 1 for the computation of J˜◦t−1 . It must be pointed out that this procedure for obtaining the various costto-go approximations can be performed entirely off-line. The policy to be employed on-line can be obtained by applying a reoptimization procedure that involves the use of the DP equations and the approximation of the costto-go functions Jˆt (x t , α ◦t ) obtained off-line. In particular, at a given state x t , the optimal vector u˜ ◦t is derived through the following minimization u˜ ◦t = arg min E h(x t , ut , θ t ) + Jˆt+1 f (x t , ut , θ t ), α ◦t+1 . ut ∈Ut θ t
96
CERVELLERA AND MUSELLI
B. ∞-SO Problems In the case of infinite horizon we look for stationary policies µt ◦ (x t ) = µ◦ (x t ), to which corresponds a stationary cost-to-go function J ◦ (x t ). To simplify the notation the subscript t will be dropped, assuming that for every stage t Xt ≡ X, Ut ≡ U , and Θt ≡ Θ. Furthermore, we assume X is such that f (x, u, θ) ∈ X for every x ∈ X, u ∈ U , and θ ∈ Θ. The optimal cost-to-go function for the infinite horizon case can be obtained by solving the well-known Bellman’s equation J ◦ (x) = min E h(x, u, θ) + βJ ◦ f (x, u, θ) (24) u∈U θ
J◦
is the same function at the left-hand and at the right-hand side of where Eq. (24). Now, unlike the finite horizon case, we formally need to solve a functional equation, rather than applying a recursive procedure. Since this is again generally not possible in an analytic way, different methods have been proposed to obtain J ◦ . Those based on the estimation of the cost-to-go function through approximating models are among the most popular and successful algorithms [see, e.g., the book by Bertsekas and Tsitsiklis (1996) for a survey of the aforementioned methods]. In particular, we consider two quite general approaches, namely (1) approximate value iteration and (2) approximate policy iteration. 1. Approximate Value Iteration The solving algorithm is basically an iterative version of the same ADP procedure described for the finite horizon case. In particular, the generic kth iteration is based on the use of an approximation of the cost-to-go function J ◦ Jˆk x, α ◦k = ψ x, α ◦k
that is obtained from step k − 1 in the following way. Consider once again a sample x L = {x l ∈ X: l = 0, . . . , L − 1} of L points chosen in X, and suppose, for the sake of simplicity, that the sample x L remains the same for every k. Next, for each state x l ∈ x L , compute (25) J˜◦k (x l ) = min E h(x l , u, θ) + β Jˆk−1 f (x l , u, θ), α ◦k−1 . u∈U θ
Then, obtain the cost-to-go approximation for the kth iteration by minimizing the empirical risk α ◦k
L−1 2 1 ˜◦ = arg min Jk (x l ) − Jˆk (x l , α) . α L l=0
DETERMINISTIC LEARNING AND AN APPLICATION
97
After a sufficient number of iterations, we can obtain the on-line control vector ut for the generic stage x t by taking the argument of the minimum in Eq. (25) and replacing x l by x t . 2. Approximate Policy Iteration This method involves, at the kth iteration and for each state x l ∈ x L , the use of estimates of the cost-to-go functions J˜k ◦ (x l ) evaluated by using the current policy µk . Theoretically, this would correspond to J˜k ◦ (x l ) = E θ
∞ t=0
β t h x t,l , µk (x t,l ), θ t
(26)
where x t+1,l = f [x t,l , µk (x t,l ), θ t ] and x 0,l = x l . In practice we simulate the system by using the current policy µk ; when a sufficiently long finite horizon of T stages is employed and Q realizations of a sequence of random vectors {θ 1,q , . . . , θ T ,q }, q = 1, . . . , Q, are used, the true infinite-horizon cost-to-go function J˜k ◦ is estimated as Q T −1 1 t J˜k ◦ (x l ) = β h x t,l,q , µk (x t,l,q ), θ t,q Q q=1 t=0
where x t,l,q = f [x t−1,l,q , µk (x t−1,l,q ), θ t,q ] and x 0,l,q = x l . This phase is usually called policy evaluation. Next, the cost-to-go approximation Jˆk (x, α ◦k ) = ψ(x, α ◦k ), corresponding to the kth iteration, is obtained in the usual way: α ◦k = arg min α
L−1 2 1 ˜ ◦ Jk (x l ) − Jˆk (x l , α) . L l=0
Finally, we can improve the policy for all the states in x L by employing (27) µk+1 (x l ) = arg min E h(x l , u, θ) + β Jˆk f (x l , u, θ), α ◦k , u∈U θ
which can be used in the policy evaluation phase at the (k + 1)th iteration. For what concerns the value of the control function in the points outside x L , Eq. (27) can be used, provided we replace x l with the state that is actually reached through the simulation. Since this can result in being too computationally intensive, especially when T is large, we can also estimate the control functions µk by approximating models, on the basis of the available L pairs x l , µk (x l ) . This means that we build approximations
98
CERVELLERA AND MUSELLI μ◦
ˆ k (x, α k ) where µ μ◦
α k = arg min μ α
L−1 2 1
µk (x l ) − µ ˆ k x l , αμ . L l=0
In this way, we can immediately evaluate the “optimal” control vector for any given state, at the price of a further level of suboptimality. It should be note that all the properties of the sampling methods discussed in the previous sections clearly also hold for the problem of approximating the policies. C. Performance Issues For ∞-SO discounted problems, it is possible to derive bounds on the goodness of the performance of both the value iteration and the policy iteration methods. Concerning value iteration, it is easy to determine that Jˆk approaches J ◦ as k → ∞ within an absolute error of ε/(1 − β), where ε is defined by the following upper bound (see Bertsekas and Tsitsiklis, 1996)
Jˆk − J˜k ◦ ≤ ε ∞
and J˜k ◦ is computed as in Eq. (25). For policy iteration we have (see Bertsekas and Tsitsiklis, 1996) that J˜k ◦ approaches J ◦ as k → ∞ within an absolute error of (δ + 2βε)/(1 − β)2 where again Jˆk − J˜k ◦ ∞ ≤ ε with J˜k ◦ obtained as in Eq. (26). Actually, when we employ a simulated version involving a finite horizon of T stages and Q realizations of the random vectors, a further error term should formally be added to ε. For the sake of simplicity, we will assume that Q and T are large enough for such a term to be neglected. δ is an error term that takes into account the impossibility, in general, of performing an exact minimization in Eq. (27), and possibly also the use of μˆ k for approximating the control function μk . Similar bounds may be derived for the finite horizon case. From these results, a key requirement for the accuracy of both methods is that the functions J˜k ◦ , even if defined differently for each method, must be approximated as closely as possible. As previously said, ε depends on the richness of the class of models we choose for the approximation and on how accurately we can estimate, inside such class, the closest element to the “true” unknown function. It appears clear how a good sampling method, that is, one that does not suffer from exponential sample complexity (curse of dimensionality), is crucial to the efficiency of approximated dynamic programming algorithms. In fact, even if
DETERMINISTIC LEARNING AND AN APPLICATION
99
the convergence results of deterministic learning as discussed in the previous sections actually hold for Lp norms (in particular, they are well suited to the L2 norm that is standard in the DP literature), it is easy to verify that convergence in the Lp norm eventually implies convergence in the supremum norm, when the involved functions are sufficiently regular. In particular, this is true for the conditions on the variation that will be discussed in the following section.
VII. D ETERMINISTIC L EARNING FOR DYNAMIC P ROGRAMMING A LGORITHMS In this section we focus on the variation of the cost-to-go functions involved in the ADP procedures described in Section VI, and derive sufficient conditions under which the finiteness of such variation is attained. In this way, we can effectively take advantage of the almost linear sample complexity given by low-discrepancy sequences. Henceforth, for simplicity, we generically use X ≡ [0, 1)d to denote the input space for every stage. This can be done assuming there exists a homeomorphism, that is, a bijective function ζ such that both ζ and ζ −1 are continuous, between either the sets Xt (finite horizon) or X (infinite horizon) and [0, 1)d . A. The T-SO Case Consider the estimation problem related to the ADP algorithm described in Section VI.A and, in particular, the estimation of the cost-to-go approximations J˜◦t , defined as in Eq. (23), by means of the parameterized models Jˆt (x t , α t ). If we consider Lemma 1, we note that Assumption 4 is verified, in the ADP context, when the following conditions hold: 1. the cost-to-go approximations Jˆ(x t , α t ) are such that, for all t, ∂i1 ,...,ik Jˆ is continuous for any α t ∈ Λ and sup |∂i1 ,...,ik Jˆ| < ∞
α t ∈Λ
for all 1 ≤ k ≤ d and all 1 ≤ i1 ≤ · · · ≤ ik ≤ d; 2. the cost-to-go approximations J˜◦t (x t ) belong to WM ′ (X) for some M ′ ∈ R and all t; 3. the loss function ℓ belongs to WM (Y × Y ) for some finite M.
100
CERVELLERA AND MUSELLI
We have already seen that Conditions 2 and 3 are verified by commonly employed loss functions and standard approximators, such as neural networks, radial basis functions networks, and support vector machines (for proper behavior of the kernel function). In the following we discuss sufficient hypotheses for condition 1 to be satisfied. For every t we define Jˆ◦t (x t ) = Jˆt (x t , α ◦t ) as the cost-to-go approximation for stage t obtained after the training, and U C = {µ: μi ∈ WM (X) for all 1 ≤ i ≤ m}, where μi is the ith component of µ. Assumption 7. a. For all t, and all θ t ∈ Θ, we have fi ∈ WMi (X × Ut ), where fi is the ith component of f and h ∈ WM (X × Ut ); ◦ b. for t = T , JT ≡ hT ∈ WM ′′ (X). For all t = T − 1, . . . , 1, Jˆ◦t ∈ WM ′′ (X). We define the argument of the minimum in Eq. (23) as ut ◦ = µ◦t (x t ) for each point x t . Note that if we define the optimal solution µt ◦ as the one corresponding pointwise to each x t , we must be aware that such a solution belongs to U C only under particular assumptions of convexity on h and f (e.g., see Stokey et al., 1989). Therefore, we formally need to focus our attention on “well-behaved” control function spaces, and accept a further slight level of suboptimality. This has little impact on the accuracy of the algorithm, provided the following assumption holds. Assumption 8. For every t and for any ε > 0, there exists µ∗t ∈ U C that is ε-close (in some norm) to the optimal pointwise solution µ◦t . It is known that under mild hypotheses, for any measurable function ζ there exists an arbitrarily close function ζ¯ ∈ C ∞ (see, e.g., Dudley, 1989). Therefore, the only requirement for µ◦t really implied by Assumption 8 is that it admits an arbitrarily close approximation with bounded derivatives. It can easily be verified that even when f and h are not convex, the regularity properties imposed by Assumption 7 imply that the cost-to-go function corresponding to µ∗t can also be made arbitrarily close to J˜◦t (x t ) (i.e., the one corresponding to µ◦t ). The following theorem considers an arbitrarily close approximation to J˜◦t (x t ), obtained by constraining the control functions to belong to U C , and presents conditions on the applicability of the results of Section IV to the T -SO context. For the sake of readability, we will employ the same notation J˜◦t (x t ) for its above-mentioned approximation.
DETERMINISTIC LEARNING AND AN APPLICATION
101
Theorem 8. Suppose Assumptions 7 and 8 hold. Then we have J˜◦t ∈ d ∈ R and all t, that is, J˜◦t has finite variation in WM ([0, 1) ) for some M the sense of Hardy and Krause. Proof. We write the approximation of the tth cost-to-go function in this form: J˜◦t (x t ) = E h x t , µ∗t (x t ), θ t + Jˆ◦t+1 f (x t , µ∗t (x t ), θ t ) θt = h x t , µ∗t (x t ), θ t p(θ t ) dθ t Θt
+
Θt
Jˆ◦t+1 f (x t , µ∗t (x t ), θ t ) p(θ t ) dθ t
¯ t ) + J¯◦t+1 (x t ). = h(x
(28)
d ∈ R, provided It follows from Lemma 1 that J˜◦t ∈ WM ([0, 1) ) for some M ◦ d ¯ ¯ both h and J t+1 have finite variation on [0, 1) . ¯ we have, for k = 1, . . . , d and 1 ≤ i1 ≤ i2 ≤ · · · ≤ For what concerns h, ik ≤ d ∂k ∂i1 ,...,ik h¯ = h x t , µ∗t (x t ), θ t p(θ t ) dθ t ∂xt,i1 . . . ∂xt,ik
=
θt
and
θt
∂ k h[x t , µ∗t (x t ), θ t ] p(θ t ) dθ t ∂xt,i1 . . . ∂xt,ik
∂ k h[x , µ∗ (x ), θ ] t t t t ¯ = p(θ t ) dθ t |∂i1 ,...,ik h| ∂xt,i1 . . . ∂xt,ik θt
k ∂ h[x t , µ∗t (x t ), θ t ] p(θ t ) dθ t . ≤ ∂x . . . ∂x θt
t,i1
t,ik
Performing a change of variables, we can write h x t , µ∗t (x t ), θ t = h γi1 (x t ), . . . , γik+m (x t ), θ t
where
γij (x t )
for 1 ≤ j ≤ k xt,ij ∗ μt,j −k (x t ) for k + 1 ≤ j ≤ k + m.
102
CERVELLERA AND MUSELLI
It follows from Lemma 1, Assumptions 7 and 8 that for every θ t , the function ∂ k h[x t , µ∗t (x t ), θ t ] ∂xt,i1 . . . ∂xt,ik has finite variation. So we can define M(θ t ) =
sup xt,i1 ,...,xt,ik
k ∂ h[x t , µ∗t (x t ), θ t ] . ∂x . . . ∂x t,i1
t,ik
¯ is bounded by It is easy to see that ∂i1 ,...,ik h¯ is continuous, and |∂i1 ,...,ik h| sup M(θ t ). θ t ∈Θ
This proves that h¯ has bounded variation on [0, 1)d . ¯ After For what concerns J¯◦t+1 , we can proceed in the same way as for h. the same change of variables we can write J¯◦t+1 f x t , µ∗t (x t ), θ t = J¯◦t+1 f1 γi1 (x t ), . . . , γik+m (x t ) , . . . , fd γi1 (x t ), . . . , γik+m (x t ) , θ t thus obtaining, from Lemma 1 and Assumptions 7 and 8 that J¯◦t+1 has bounded variation on X. As a consequence, the same also holds true for J˜◦t . B. The ∞-SO Case We focus now on the estimation of the functions J˜k ◦ using parameterized models Jˆk (x, α k ), as we have seen in Section VI.B. For the approximate value iteration algorithm, it is easy to see how the results of Theorem 8, due to the structure of Eq. (25), can still be applied, provided we replace the subscripts related to the temporal stage t with those related to the iteration k of the algorithm. For the approximate policy iteration algorithm and, in particular, the estimation of functions J˜k ◦ (x), given the kth policy µk (x), we consider the costs in their “exact” form, that is, J˜k ◦ (x) = E θ
∞ t β h x t , µk (x t ), θ t t=0
where x 0 = x. The convergence results can be directly extended to the practical case, where we employ a finite horizon of T stages and approximate the expected value through Q realizations of the random sequences. Suppose the following conditions hold:
DETERMINISTIC LEARNING AND AN APPLICATION
103
Assumption 9. a. For all θ ∈ Θ, we have fi ∈ WMi (X × U ), where fi is the ith component of f ; b. for all θ ∈ Θ, we have h ∈ WM ′ (X × U ); c. for all k, we have µk ∈ U C ; d d Mμ,1 < 1, where Mf,1 = supθ maxn=1,...,d |∂i fn | (for i = d. βMf,1 1, . . . , d + m) and Mμ,1 = maxk supθ maxn=1,...,m |∂i μk,n | (for i = 1, . . . , d), respectively. Then we can prove the following result: Theorem 9. Suppose Assumption 9 is verified. Then we have J˜k ◦ ∈ d ∈ R, that is, J˜◦ has finite variation in the sense WM ([0, 1) ) for some M k of Hardy and Krause. Proof. We write the estimated cost given µk as J˜k ◦ (x) = lim E T →∞ θ
T −1 t=0
t β h x t , µk (x t ), θ t
T −1 t = lim β h x t , µk (x t ), θ t p(θ) dθ. T →∞ t=0 ΘT
Consider the term of the sum corresponding to t = 1, that is, h[x 1 , µk (x 1 ), θ 1 ]. The explicit dependence on x can be written as h1 (x) = h f x, µk (x), θ 0 , µk f x, µk (x), θ 0 ) , θ 1 .
In general, when the “unfolded” expression of the tth term is considered, a long recursion is obtained made of terms where the function f is nested t times and µk is nested up to t + 1 times. To write an expression for |∂i1 ,...,ij ht | we need to compute the partial derivatives, up to the j th order, of the “unfolded” form mentioned above. By applying iteratively the chain rule for differentiating compositions of functions, and some tedious algebra, we determine that this leads to a term that is bounded, for each sequence θ 0 , . . . , θ t , by |∂i1 ,...,ij ht | ≤ δ(j, d, m, Mf,2 , . . . , Mf,j , Mμ,2 , . . . , Mμ,j , Mh,1 , . . . , Mh,j ) jt
j (t+1)
× t γ (j,d,m) Mf,1 Mμ,1
where Mf,q , Mh,q , and Mμ,q correspond to supθ maxn=1,...,d |∂i1 ,...,iq fn |, supθ |∂i1 ,...,iq h|, and maxk maxn=1,...,m |∂i1 ,...,iq μk,n |, respectively, while δ, γ
104
CERVELLERA AND MUSELLI
are functions that do not depend on t. δ is finite provided that Mf,q , Mμ,q , Mh,q are finite (which is true from conditions a–c of Assumption 9). Thus, we can write T jt j (t+1) ∂i ,...,i J˜k ◦ ≤ lim δt γ (j,d,m) β t Mf,1 Mμ,1 , j 1 T →∞
t=0
j
j
which converges to a finite value provided βMf,1 Mμ,1 < 1.
It is worth noting that Assumption 9c is automatically true when we approximate the control functions µk by the usual standard parameterized models, as seen in Section VI.B. For what concerns Assumption 9d, we have a bound on the absolute value of the first-order partial derivatives of f and µk . Therefore, the requirement can theoretically be fulfilled by a suitable scaling of the input space and/or choosing a sufficiently small β. Furthermore, when we consider the real implementation of the approximate policy iteration algorithm, in which we need to estimate the true cost-togo function values through simulation over a finite horizon of stages and averaging over a finite number of random sequences, point d of Assumption 9 is no longer required. That is, finiteness of the variation of the functions f , µk , h, stated in points a–c, is enough to guarantee finiteness of the variation of J˜k ◦ .
VIII. E XPERIMENTAL R ESULTS To analyze experimentally the advantages of employing low-discrepancy sequences in the context of function estimation, we consider two different groups of tests concerning (1) approximation of unknown generic functions and (2) solution of multistage optimization problems. A. Approximation of Unknown Functions Three different test functions, taken from Cherkassky and Mulier (1998), have been considered for approximation. The models employed for this purpose are one hidden-layer feedforward networks with sigmoidal activation function, that is, nonlinear mappings of the form (15), where σ (·) is the hyperbolic tangent σ (z) =
ez − e−z . ez + e−z
DETERMINISTIC LEARNING AND AN APPLICATION
105
For each function, a comparison between low-discrepancy sequences (LDS) and training sets formed by i.i.d. samples drawn from uniform probability (URS) was carried out. The LDS come from Niederreiter sequences with different prime power bases; a detailed description of such sequences can be found in Niederreiter (1992), while the actual implementation has been taken from Bratley et al. (1994). A quadratic loss function was used for the empirical risk Remp (α, x L ) L−1 2 1 L g(x l ) − ψ(x l , α) Remp α, x = L l=0
and such a function has been minimized through the Levenberg–Marquardt algorithm (Hagan and Menhaj, 1994), up to the same level of accuracy for each function. The generalization error has been estimated through the square root of the mean square (RMS) error computed over a set of points obtained by a uniform grid on the input space. Function 1 (highly oscillatory behavior): g1 (x) = sin exp 4x1 x2 sin(π x3 ) sin(π x4 ) , x ∈ [0, 1]4 .
Six random training sets with length L = 3000 have been generated: three of them are produced by random sampling with uniform probability, while the other three are based on Niederreiter’s low-discrepancy sequences. Every training set has been employed to train a neural network with ν = 50 hidden units. To show the improvement achieved by increasing the number of training samples, the results obtained with subsets of the original sequences having size L = 500, 1000, 1500, 2000, and 2500 are also included. The chosen level of accuracy for the minimization of the empirical risk is 0.14 for any size of the training set. Table 1 contains the relative RMS errors with respect to the lowest error, for the various sequences, computed over a TABLE 1 F UNCTION 1: R ELATIVE RMS E RRORS FOR THE 36 D ISCRETIZATIONS Training set
500
Sample size L 1000
1500
2000
2500
3000
UR-1 UR-2 UR-3 LD-3 LD-5 LD-7
1.1196 0.8370 1.0163 0.9239 0.6576 0.6848
0.6522 0.5598 0.7011 0.7011 0.4783 0.5652
0.3804 0.2120 0.5924 0.4457 0.2174 0.4239
0.2554 0.1304 0.4511 0.2935 0.0924 0.3261
0.0924 0.0598 0.4293 0.1467 0.0489 0.2065
0.0380 0.0543 0.1522 0.0598 0 0.0870
106
CERVELLERA AND MUSELLI
TABLE 2 F UNCTION 1: AVERAGE R ELATIVE RMS E RRORS FOR R ANDOM AND L OW-D ISCREPANCY S EQUENCES Training set
500
Sample size L 1000
1500
2000
2500
3000
URS LDS
0.9909 0.7554
0.6377 0.5815
0.3949 0.3623
0.2790 0.2373
0.1938 0.1341
0.0815 0.0489
TABLE 3 F UNCTION 2: R ELATIVE RMS E RRORS FOR THE 36 D ISCRETIZATIONS Training set
500
Sample size L 1000
1500
2000
2500
3000
UR-1 UR-2 UR-3 LD-3 LD-5 LD-7
3.3879 5.1724 5.4483 2.4224 2.0259 2.9397
4.0517 2.7500 2.9397 1.9310 1.7414 2.0086
1.4914 1.2241 1.2931 1.2241 0.7845 0.7069
1.1466 0.6379 0.7069 0.2500 0.2672 0.5517
0.8017 0.3448 0.4741 0.1293 0.1293 0.2672
0.1638 0.2155 0.4397 0 0.0690 0.1293
fixed uniform grid of 154 = 50,625 points. In particular, the relative error er is defined as er = |e − e∗ |/e∗ , where e is the true RMS error and e∗ is the lowest RMS error among the various sequences. In Table 1, “UR-n” is “uniform random sequence number n” and “LD-q” is “low-discrepancy sequence in base q.” In Table 2, the average values of the relative RMS errors for the two kinds of sampling are presented. Function 2 (multiplicative): $ 2 1 1 g2 (x) = 4 x1 − x4 − sin 2π x2 + x32 , x ∈ [0, 1]4 . 2 2 The same network with ν = 50 hidden units and the same input vectors of the 36 training sets used for g1 (x) were employed for this function. The empirical risk was minimized up to an accuracy level of 10−4 . Table 3 contains the relative RMS errors for the different sequences, computed over the same uniform grid of 154 = 50,625 points used for function 1. Table 4 contains the average values of the relative RMS errors for the two kinds of discretizations.
DETERMINISTIC LEARNING AND AN APPLICATION
107
TABLE 4 F UNCTION 2: AVERAGE R ELATIVE RMS E RRORS FOR R ANDOM AND L OW-D ISCREPANCY S EQUENCES Training set
500
Sample size L 1000
1500
2000
2500
3000
URS LDS
4.6695 2.4626
3.2471 1.8937
1.3362 0.9052
0.8305 0.3563
0.5402 0.1753
0.2730 0.0661
TABLE 5 F UNCTION 3: R ELATIVE RMS E RRORS FOR THE 36 D ISCRETIZATIONS Training set
500
Sample size L 1000
1500
2000
2500
3000
UR-1 UR-2 UR-3 LD-8 LD-9 LD-11
1.3636 0.6818 1.3485 0.4394 1.2273 1.2424
0.7273 0.6061 0.5758 0.2879 0.9848 0.8030
0.4242 0.3636 0.4242 0.1364 0.4545 0.4091
0.3182 0.3485 0.2727 0.0606 0.2879 0.2727
0.3030 0.3182 0.2727 0.0303 0.2273 0.2273
0.2879 0.2727 0.2273 0 0.1667 0.1667
Function 3 (additive): 1 2 g3 (x) = 10 sin(π x1 x2 ) + 20 x3 − + 10x4 + 5x5 + x6 , 2
x ∈ [0, 1]6 .
For this six-dimensional function, 36 new training sets with L = 3000 points were generated. Eighteen of them were again based on a random extraction with uniform probability, while the remaining 18 were based on Niederreiter sequences. For each training set, the same network with ν = 40 hidden units was trained by minimizing the empirical risk up to an accuracy level of 10−4 . The RMS errors are computed over a uniform grid of 66 = 46,656 points. In Table 5 the relative RMS errors for the different sequences are presented, while Table 6 contains the average values of the relative RMS errors for the two kinds of sampling. B. Multistage Optimization Tests We consider two high-dimensional test problems, namely (1) a 9-dimensional inventory forecasting problem and (2) a 30-dimensional water reservoirs
108
CERVELLERA AND MUSELLI
TABLE 6 F UNCTION 3: AVERAGE R ELATIVE RMS E RRORS FOR R ANDOM AND L OW-D ISCREPANCY S EQUENCES Training set
500
Sample size L 1000
1500
2000
2500
3000
URS LDS
1.1313 0.9697
0.6364 0.6919
0.4040 0.3333
0.3131 0.2071
0.2980 0.1616
0.2626 0.1111
optimal management problem. For both systems we compare low-discrepancy sequences and random sequences, when used for the solution of a finite horizon problem. 1. The Inventory Forecasting Model The purpose of the optimization is to satisfy the demand of three items in an inventory while keeping the storage levels as small as possible. Thus, we have a V-shaped cost function for each stage t, which has the following structure: h(x t , ut , θ t ) =
3 ωj max{xt+1,j , 0} − πj min{0, −xt+1,j } j =1
where xt+1,j = xt,j + ut,j − xt,j +3 · θt,j xt+1,j = xt+1,j +3 · θt,j xt+1,j = qj · θt,j
for j = 1, 2, 3 for j = 4, 5, 6 for j = 7, 8, 9.
The components xt,j represent the item levels in period t when j = 1, 2, 3, the forecasts for the demand of each item in period t + 1 when j = 4, 5, 6, and the forecasts for the demand of each item in period t + 2 when j = 7, 8, 9. The random vector θt,j represents a correction between the forecast and the true demand of items, with the 9-dimensional vector θ t having lognormal distribution. ωj ≥ 0 is the holding cost parameter for item j , πj ≥ 0 is the backorder cost parameter for item j , and qj is a constant. To have differentiable costs, we used for the tests “smooth” approximations of the ideal V shape of the cost. Specifically, we have ⎧ 0 for z ≤ 0 ⎪ ⎪ ⎨ 1 1 z3 − z4 for 0 < z < 2δ Q+ (z, δ) = 4δ 2 16δ 3 ⎪ ⎪ ⎩z − δ for z ≥ 2δ
DETERMINISTIC LEARNING AND AN APPLICATION
⎧ −z − δ ⎪ ⎪ ⎨ 1 1 4 − z Q (z, δ) = − 2 z3 − 4δ 16δ 3 ⎪ ⎪ ⎩0
109
for z ≤ −2δ for − 2δ < z < 0 for z ≥ 0.
Note that when δ → 0, we have Q+ (z, δ) → max{0, z} and Q− (z, δ) → − min{0, z}. A detailed description of the model can be found in Chen et al. (1999). 2. The Water Reservoir Network Model We consider a water reservoir network that is made up of 10 basins, each one affected by stochastic inflows. The objective of the optimization problem is to control the releases in such a way that (1) the level of the water in the reservoirs at the beginning of the new stage is kept as close as possible to a target value smaller than the maximum capacity and (2) a cost represented by a function g of the water releases and/or water levels (e.g., power generation, irrigation) is minimized (or, alternatively, a benefit is maximized). We can write the state equation for the j th reservoir, j = 1, . . . , 10, as xt+1,j = xt,j + rt,i − rt,j + εt,j i∈U j
where xt,j , for j = 1, . . . , 10, is the amount of water in the j th reservoir at the beginning of stage t, rt,j is the amount of water released from the j th reservoir during stage t, εt,j is the total inflow into the j th reservoir during stage t, and U j is a set of indexes corresponding to the reservoirs that releases water directly into reservoir j . The stochastic inflows ε t are modeled through an autoregressive process of order 2, stochasticity being given by a random correction θ t having a standard normal distribution. This implies the need, at each stage t, of including the values of the inflows of stages t − 1 and t − 2 into the state vector, thus leading to a 30-dimensional problem. For what concerns the other 20 components of the state vector, we can define, for j = 1, . . . , 10, xt,j +10 = εt−1,j , and xt,j +20 = εt−2,j , and write xt+1,j +10 = εt,j ,
xt+1,j +20 = xt,j +10 .
The structure of the cost function is h(x t , r t , εt ) =
10 j =1
|xt+1,j − x˜j | −
10
pj Q+ (rt,j , δj )
j =1
where x˜j is the desired level for the j th reservoir and pj is a coefficient that weights the benefit due to the release of water from reservoir j . Also in this
110
CERVELLERA AND MUSELLI
F IGURE 3.
Reservoir network.
TABLE 7 TARGET WATER L EVELS FOR THE 30-D IMENSIONAL P ROBLEM Res. 1 Res. 2 Res. 3 Res. 4 Res. 5 Res. 6 Res. 7 Res. 8 Res. 9 Res. 10 Target water level 200
250
260
270
220
420
200
500
180
340
case, the absolute value has been approximated by using Q+ and Q− . Q+ has also been used to model a generic benefit that becomes relevant for “large” values of the water releases, depending on the value of δj . Suitable constraints are considered to make sure that each release is positive and limited by a maximal pumpage capability, and that it never exceeds the sum of the amount of water at the beginning of the period and the water that gets in from the upstream reservoirs. The configuration of the network is depicted in Figure 3, while the target levels for the various reservoirs are reported in Table 7. A detailed description of the model is contained in Cervellera et al. (2006). For what concerns the actual tests, the ADP procedure described in Section VI was applied for both the 9-dimensional and the 30-dimensional models to solve the T -SO problem over T = 3 stages.
111
DETERMINISTIC LEARNING AND AN APPLICATION TABLE 8 B OUNDS FOR THE I NVENTORY F ORECASTING PROBLEM Components Bounds of X0
x1
x2
x3
x4
x5
x6
x7
x8
x9
X0min X0max
−20 20
−24 24
−15 15
0 20
0 24
0 15
0 13
0 16
0 10
The state spaces Xt for the various sets have been discretized by 18 different sequences, 9 coming from i.i.d. sequences with uniform distribution and 9 based on different kinds of low-discrepancy sequences (the implementation of such types of sequences comes from various sources, including Bratley et al. (1994) http://www.csit.fsu.edu/~burkardt/m_src/halton/halton.html, http:// ldsequences.sourceforge.net). For the two aforementioned discretization types, three basic sequences with L = 600 have been generated for the 9-dimensional case, while three with L = 2000 points have been generated for the 30-dimensional case. Again, to test the improvement of the solution when new points are added to the discretization, the other sequences are obtained as subsets of these basic samples, giving rise to three sequences with size L = 200 and L = 1000 and three with size L = 400 and L = 1500, respectively. The 18 sequences thus obtained, generated in the d-dimensional hypercube [0, 1]d , have then been scaled to fit the state spaces for the various stages t, according to the values shown in Tables 8 and 9. For what concerns the expected value of the costs with respect to the random vectors, an average over a finite number of realizations drawn from the proper distribution (i.e., the log-normal distribution for the inventory problem and the standard normal distribution for the reservoirs management problem) has been employed. In particular, 8 vectors were used for the 9dimensional problem and 10 vectors for the 30-dimensional one. The models chosen for the approximation of the cost-to-go functions were again feedforward one hidden-layer perceptron networks. For the inventory problem, we used ν = 7 and a logarithmic sigmoid activation function 1 σ (z) = . 1 + e−z For the 30-dimensional problem, we chose ν = 15 and hyperbolic tangent activation function. For every stage t, a network initialized in the same way has been trained on the basis of each of the 18 training sets, through minimizing the empirical risk as described in Section VI, using the Levenberg–Marquardt method with the same number of iterations.
112
CERVELLERA AND MUSELLI TABLE 9 B OUNDS FOR THE WATER R ESERVOIRS P ROBLEM Components 1–10
Bounds of X0
x1
x2
x3
x4
x5
x6
x7
x8
x9
x10
X0min X0max
180 220
230 270
240 280
250 290
200 240
400 440
180 220
480 520
160 200
320 360
Components 11–20 Bounds of X0 X0min X0max
x11
x12
x13
x14
x15
x16
x17
x18
x19
x20
16 38
15 37
16 37
17 37
15 37
14 21
14 22
14 22
14 21
8 11
Components 21–30 Bounds of X0 X0min X0max
x21
x22
x23
x24
x25
x26
x27
x28
x29
x30
14 44
10 42
12 41
11 43
12 43
3 23
3 25
1 25
3 25
2 15
To evaluate the goodness of the approximate cost-to-go functions obtained by the various discretization schemes, 10 initial vectors x 0,i , i = 1, . . . , 10, have been taken in a set X0 . Tables 8 and 9 show the ranges for the various components of the initial state. For each point, the optimal cost has been computed by means of reoptimization averaging over 20 different “online” random sequences. Then, the average value of these 200 costs has been employed to measure the performance of a given discretization. Tables 10a and 11a contain the comparison among the performances of 18 training sets, based on the value of the relative errors for the mean cost defined as above (with respect to the lowest mean cost), for the inventory forecasting and the reservoir management examples, respectively. It is worth noting that for the reservoir management problem, good performance is represented by a highly negative cost. Tables 10b and 11b contain the relative costs for the average RMS costs for both kinds of discretizations. Again, “UR-n” means “uniform random sequence number n” and “LD-n” means “low-discrepancy sequence number n.” In particular, LD-1 is a Niederreiter sequence, LD-2 is a Halton sequence, and LD-3 is a Sobol’ sequence for the inventory model, while LD-1 and LD-2 are two different Sobol’ sequences and LD-3 is a Halton sequence for the reservoir model. For what concerns the approximation of the unknown functions, each having different behavior and complexity, we can see that LDS generally outperform URS.
DETERMINISTIC LEARNING AND AN APPLICATION TABLE 10 I NVENTORY F ORECASTING P ROBLEM a. Relative Costs Training set
Cost
UR-1 UR-2 UR-3 LD-1 LD-2 LD-3
L = 200
L = 400
L = 600
0.6206 0.1229 0.2590 0.5471 0.2784 0.1701
0.2854 0.1142 0.7891 0.2734 0.1845 0.0967
0.2019 0.1712 0.1243 0.0391 0 0.2876
b. Average Relative Costs Training set
Average cost URS
LDS
L=200 L=400 L=600
0.3342 0.3962 0.1658
0.3319 0.1849 0.1089
TABLE 11 R ESERVOIRS N ETWORK M ANAGEMENT P ROBLEM a. Relative Costs Training set
Cost L = 1000
L = 1500
L = 2000
UR-1 UR-2 UR-3 LD-1 LD-2 LD-3
0.0154 0.0155 0.4029 0.0128 0.0232 0.0267
0.0110 0.0193 0.0119 0.0084 0.0053 0.0172
0.0205 0.0303 0.0022 0.0025 0 0.0031
b. Average Relative Costs Training set
Average cost URS
LDS
L=1000 L=1500 L=2000
0.1446 0.0141 0.0177
0.0209 0.0103 0.0019
113
114
CERVELLERA AND MUSELLI
In fact, not only is the best single RMS always given by an LDS, but in all but one case (function 3, L = 1000) the average RMS given by LDS are smaller. Furthermore, the better performance of the LDS with respect to URS becomes evident when the size L increases, according to the good asymptotic properties of the low-discrepancy approach. Also for what concerns the dynamic programming tests, Tables 10a and 11a show that low-discrepancy sequences generally perform better than random ones, both in the context of the 9-dimensional and the 30-dimensional problem. In fact, LDS not only provide the lowest costs, they also show the lowest average (for the two kinds of sequences) for every dimension, as Tables 10b and 11b show. Then, only LDS show an actual improvement of the average solution as the dimension of the sample size increases. If we look at the single discretizations, though, this does not hold for LD-3 in the 9-dimensional case. However, for L = 400 such a training set still provides a cost that is better than all those coming from random sequences.
R EFERENCES Alon, N., Spencer, J.H. (2000). The Probabilistic Method. Wiley, New York. Alon, N., Ben-David, S., Cesa-Bianchi, N., Haussler, D. (1997). Scalesensitive dimensions, uniform convergence, and learnability. J. ACM 44, 615–631. Angluin, D. (1987). Queries and concept learning. Mach. Learn. 2, 319–342. Angluin, D., Valiant, L. (1979). Fast probabilistic algorithms for Hamiltonian circuits and matchings. J. Comput. Syst. Sci. 18, 155–193. Anthony, M., Bartlett, P.L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge. Baglietto, M., Cervellera, C., Parisini, T., Sanguineti, M., Zoppoli, R. (2001). Approximating networks for the solution of T -stage stochastic optimal control problems. In: Bittanti, S. (Ed.), Proceedings of the IFAC Workshop on Adaptation and Learning in Control and Signal Processing (Cernobbio–Como, Italy, 29–31 August 2001). Elsevier, Oxford, UK. Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39, 930–945. Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton. Bellman, R., Dreyfus, S. (1962). Applied Dynamic Programming. Princeton University Press, Princeton.
DETERMINISTIC LEARNING AND AN APPLICATION
115
Bellman, R., Kalaba, R., Kotkin, B. (1963). Polynomial approximation— a new computational technique in dynamic programming allocation processes. Math. Comp. 17, 155–161. Bertsekas, D. (1975). Convergence of discretization procedures in dynamic programming. IEEE Trans. Automatic Control 20, 415–419. Bertsekas, D. (2000). Dynamic Programming and Optimal Control, vol. I, 2nd ed. Athena Scientific, Belmont. Bertsekas, D., Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont. Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.K. (1989). Learnability and the Vapnik–Chervonenkis dimension. J. ACM 36 (1989), 929–965. Blumlinger, M., Tichy, R.F. (1986). Bemerkungen zu einigen Anwendungen gleichverteilter Folgen. Sitzungsber. Österr. Akad. Wiss. Math.-Natur. Kl. II 195, 253–265. Bratley, P., Fox, B.L., Niederreiter, H. (1994). Programs to generate Niederreiter’s low-discrepancy sequences. ACM Trans. Math. Software 20 (4), 494–495. Caflisch, R., Morokoff, W., Owen, A. (1997). Valuation of mortgage backed securities using Brownian bridges to reduce effective dimension. J. Comp. Finance 1, 27–46. Cervellera, C., Muselli, M. (2004). Deterministic design for neural network learning: An approach based on discrepancy. IEEE Trans. Neural Networks 15, 533–543. Cervellera, C., Chen, V.C., Wen, A. (2006). Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization. Eur. J. Oper. Res. 117, 1139–1151. Chen, V.C.P. (2001). Measuring the goodness of orthogonal array discretizations for stochastic programming and stochastic dynamic programming. SIAM J. Optim. 12, 322–344. Chen, V., Ruppert, D., Shoemaker, C. (1999). Applying experimental design and regression splines to high-dimensional continuous-state stochastic dynamic programming. Oper. Res. 47, 38–53. Chen, V.C., Cervellera, C., Wen, A. (2003). Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization, In: INFORMS Joint International Meeting, Istanbul, Turkey. Cherkassky, V., Mulier, F. (1998). Learning from Data: Concepts, Theory, and Methods. Wiley, New York. Chow, C., Tsitsiklis, J. (1989). The complexity of dynamic programming. J. Complex. 5, 466–488.
116
CERVELLERA AND MUSELLI
Chow, C., Tsitsiklis, J. (1991). An optimal multigrid algorithm for continuous state discrete time stochastic control. IEEE Trans. Automatic Control 36, 898–914. Cohn, D.A. (1994). Neural network exploration using optimal experiment. In: Cowan, J., Tesauro, G., Alspector, J. (Eds.), Advances in Neural Information Processing Systems, vol. 6. Morgan Kaufmann, San Mateo, CA, pp. 679–686. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Math. of Control, Signals, Systems 2, 303–314. Devroye, L., Györfi, L., Lugosi, G. (1997). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York. Dudley, R.M. (1989). Real Analysis and Probability. Wadsworth & Brooks/Cole, Pacific Grove, CA. Fang, K.-T., Wang, Y. (1994). Number-Theoretic Methods in Statistics. Chapman & Hall, London. Foufoula-Georgiou, E., Kitanidis, P. (1988). Gradient dynamic programming for stochastic optimal control of multidimensional water resources systems. Water Resour. Res. 24, 1345–1359. Fukumizu, K. (1996). Active learning in multilayer perceptrons. In: Touretzky, D., Mozer, M., Hasselmo, M. (Eds.), Advances in Neural Information Processing Systems, vol. 8. MIT Press, Cambridge, MA, pp. 295–301. Fukumizu, K. (2000). Statistical active learning in multilayer perceptrons. IEEE Trans. Neural Netw. 11, 17–26. Girosi, F., Jones, M., Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Comp. 7, 219–269. Hagan, M., Menhaj, M. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Trans. Neural Netw. 5, 989–993. Hammersley, J.M., Handscomb, D.C. (1964). Monte Carlo Methods. Methuen, London. Hlawka, E. (1961). Funktionen von Beschränkter Variation in der Theorie der Gleichverteilung. Ann. Mat. Pura Appl. 54, 325–333. Hoeffding, W. (1961). Probability inequalities for sum of bounded random variables. Am. Stat. Assoc. Math. Soc. Trans. 17, 277–364. Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366. Hua, L.K., Wang, Y. (1981). Applications of Number Theory to Numerical Analysis. Springer-Verlag, Berlin. Jacobson, D., Mayne, D. (1970). Differential Dynamic Programming. Academic Press, New York. Johnson, S., Stedinger, J., Shoemaker, C., Li, Y., Tejada-Guibert, J. (1993). Numerical solution of continuous-state dynamic programs using linear and spline interpolation. Oper. Res. 41, 484–500.
DETERMINISTIC LEARNING AND AN APPLICATION
117
Kindermann, J., Paass, G., Weber, F. (1995). Query construction for neural networks using the bootstrap. In: Proceedings of the International Conference on Artificial Neural Networks, vol. 95, pp. 135–140. Kuipers, L., Niederreiter, H. (1974). Uniform Distribution of Sequences. Wiley, New York. Larson, R.E. (1968). State Increment Dynamic Programming. Elsevier Publ. Co., New York. MacKay, D. (1992). Information-based objective functions for active data selection. Neural Comp. 4, 305–318. Morokoff, W.J., Caflisch, R.E. (1995). Quasi-Monte Carlo integration. J. Comp. Phys. 122, 218–230. Najarian, K., Dumont, G.A., Davies, M.S., Heckman, N.E. (2001). PAC learning in non-linear FIR models. Int. J. Adaptive Control Signal Process. 15, 37–52. Niederreiter, H. (1987). Point sets and sequences with small discrepancy. Monatsh. Math. 104, 273–337. Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods. SIAM, Philadelphia. Owen, A. (1995). Randomly permuted (t, m, s)-nets and (t, s) sequences. In: Niederreiter, H., Shiue, P.J.-S. (Eds.), Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing. Springer-Verlag, Berlin. Owen, A. (1997). Scrambled net variance for integrals of smooth functions. Ann. Stat. 25, 1541–1562. Pollard, D. (1990). Empirical Processes: Theory and Applications, vol. 2. NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics. Puterman, M. (1994). Markov Decision Processes. Wiley, New York. Rust, J. (1997). Using randomization to break the curse of dimensionality. Econometrica 65, 487–516. Sobol’, I.M. (1967). The distribution of points in a cube and the approximate evaluation of integrals. Zh. Vychisl. Mat. Mat. Fiz. 7, 784–802. Stokey, N., Lucas, R., Prescott, E. (1989). Recursive Methods in Economic Dynamics. Harvard University Press, Cambridge, MA. Tichy, R.F. (1984). Über eine zahlentheoretische Methode zur numerischen Integration und zur Behandlung von Integralgleichungen. Sitzungsber. Österr. Akad. Wiss. Math.-Natur. Kl. II 193, 329–358. Törn, A., Žilinskas, A. (1989). Global Optimization. Springer-Verlag, Berlin. Valiant, L. (1984). A theory of the learnable. Commun. ACM 27, 1134–1142. Vapnik, V.N. (1995). Statistical Learning Theory. Wiley, New York. Vapnik, V.N., Chervonenkis, A.Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16, 264–280.
118
CERVELLERA AND MUSELLI
Vidyasagar, M. (1997). A Theory of Learning and Generalization. SpringerVerlag, London. Zoppoli, R., Sanguineti, M., Parisini, T. (2002). Approximating networks and extended Ritz method for the solution of functional optimization problems. J. Optim. Theory Appl. 112, 403–439.
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 140
X-Ray Fluorescence Holography KOUICHI HAYASHI Institute for Materials Research, Tohoku University, Sendai 980-8577, Japan
I. Introduction . . . . . . . . . . . . . . II. Theory . . . . . . . . . . . . . . . . A. Theory Using Simple Models . . . . . . . . B. Simulation Using Realistic Models . . . . . . . C. Kossel and X-Ray Standing Wave Lines . . . . . D. Removal of Twin Images . . . . . . . . . 1. Multiple Energy Method . . . . . . . . . 2. Two Energy Method . . . . . . . . . . 3. Complex Holography . . . . . . . . . . E. Polarization Effect of Incident X-Ray . . . . . . F. Near Field Effect . . . . . . . . . . . . III. Experiment and Data Processing . . . . . . . . A. Experimental Geometries for Normal and Inverse Modes B. Laboratory XFH Apparatus . . . . . . . . . C. Fast X-Ray Fluorescence Detection System at SR . . D. Details of Data Processing for Obtaining Atomic Images E. Sample Cooling Effect . . . . . . . . . . F. Inverse Fourier Analysis . . . . . . . . . . 1. Theoretical Proof . . . . . . . . . . . 2. Demonstration by Experimental Holograms . . . IV. Applications . . . . . . . . . . . . . . A. Ultrathin Film . . . . . . . . . . . . . B. Dopants . . . . . . . . . . . . . . 1. GaAs:Zn . . . . . . . . . . . . . . 2. Si:Ge . . . . . . . . . . . . . . . C. Quasicrystal . . . . . . . . . . . . . D. Complex X-Ray Holography . . . . . . . . V. Related Methods . . . . . . . . . . . . . A. π XAFS . . . . . . . . . . . . . . B. γ -Ray Holography . . . . . . . . . . . C. Neutron Holography . . . . . . . . . . . VI. Summary and Outlook . . . . . . . . . . . References . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
120 122 122 124 128 129 130 130 133 134 136 138 138 140 143 145 149 150 150 156 159 159 162 162 165 168 170 174 174 176 178 180 181
119 ISSN 1076-5670/05 DOI: 10.1016/S1076-5670(05)40003-8
Copyright 2006, Elsevier Inc. All rights reserved.
120
HAYASHI
I. I NTRODUCTION A structural analysis of solids at an atomic level is necessary to develop advanced materials, such as semiconductors, superconductors, and catalysis. Transmission electron microscopy and scanning probe microscopy, which are widespread in scientific and engineering fields, are recognized as general atomic visualization techniques. Structural analysis techniques using X-rays, such as X-ray diffraction, have made it possible to evaluate crystal structures for over 50 years. However, the determination of atomic arrangements is not straightforward. It needs an accurate data fitting procedure for experimental and theoretical diffraction profiles, and this requires sufficient knowledge and experience of X-ray diffraction analysis. Therefore, a direct three-dimensional (3D) atomic imaging technique, which will help determine crystal structure, has long been desired. X-ray fluorescence holography (XFH) is one solution. Holography, which is a way of recording and then reconstructing waves, was invented by Dennis Gabor in 1948. The waves may be of any kind— light, sound, X-ray, corpuscular, etc. The word “holography” originates from the Greek “holos” meaning “the wholes.” By using the word holography the inventor of the technique wanted to stress that it records complete information about a wave. In conventional photography, only the distribution of the amplitude is recorded in a two-dimensional projection of an object onto the plane of the photograph. However, a hologram can regenerate the field of the wave scattered by an object, and therefore it can reconstruct the object. Holography’s unique ability makes it a valuable tool for industry, science, business, and education. It is commonly used on labels and covers of magazines. Recently, we have seen holograms on credit cards and paper currencies, which prevents copies because the hologram is difficult to reproduce. For advanced scientific fields, holography using photons, electrons, and neutrons with 2 ∼sub-Ångstrom wavelength has attracted attention as an atomic-order microscopic tool. Gabor (1948) proposed the principle of holography and demonstrated that it improves the power of electron microscopes. His idea was very simple. He used the interference between the wave scattered by the object (object wave) and one passing through the object (reference wave), and recorded the wavefront of the object wave. Although many researchers expected to visualize atoms in solids using Gabor’s method, it could not be realized at that time. Holography became widely used in many areas of science and technology with the introduction of the laser (Leith and Upatnieks, 1965). Szöke (1986) pointed out that photoelectrons and fluorescent X-rays from ionized atoms in a single crystal formed atom-resolved holograms. His idea was first proved by Harp et al. (1990) as X-ray photoelectron holography. Photoelectron holography is a powerful tool for studying surface structures.
X - RAY FLUORESCENCE HOLOGRAPHY
121
Much theoretical and experimental work on photoelectron holography has been conducted. However, the atomic image obtained from the hologram is not clear due to a phase shift resulting from electron scattering and multiple scattering. Since the effect of phase shift and the multiple scattering of X-rays are negligible in data analysis, X-ray scattering is much better than electron scattering. Feasibility studies of XFH by computer simulations started in 1991 (Tegze and Faigel, 1991). However, the experimental application of X-rays for holographic imaging has been delayed in comparison to photoelectron holography. The primary reason for this is the weakness of the holographic oscillation, which is 0.1–0.01% in the angular distribution of X-ray fluorescence intensity. Moreover, the weak oscillations are masked by strong and sharp Kossel or X-ray standing wave lines due to X-ray diffraction. An XFH experiment was first performed by Tegze and Faigel (1996) as a demonstration of the structural analysis of strontium titanate (SrTiO3 ). Similar to photoelectron holography, XFH measures the spherically distributed fluorescence intensity varying the detector position. The number of measurable holograms is limited by number of X-ray emission lines, such as Kα and Kβ. Shortly after the XFH experiment by Tegze and Faigel, Gog et al. (1996) demonstrated multiple energy X-ray holography, which was a time-reversed version of normal XFH. Here, we call this method “inverse XFH.” In inverse XFH, a holographic pattern can be obtained by detecting the fluorescence by varying the sample orientation relative to the incident beam. In contrast to normal XFH, inverse XFH allows holograms to be recorded at an arbitrary energy using an energy tunable X-ray source, such as synchrotron radiation. This is the reason why Gog et al. named this method “multiple energy X-ray holography.” The first experiment by Tegze and Faigel (1991) using a conventional X-ray generator and solid-state X-ray detector needed a few months to record one hologram due to the weakness of the holographic signals. Today, we can measure the hologram within a few hours using a strong incident X-ray beam and advanced X-ray detecting system (Tegze et al., 1999; Hayashi et al., 2001a). The XFH setup at the European Synchrotron Radiation Facility (ESRF) makes it possible to take a hologram within 10 min using a pink beam, which is a fundamental undulator radiation (Marchesini et al., 2001). The spatial resolution of the atomic images was 0.5 Å with a 4π full extension technique using crystal symmetry (Tegze et al., 1999). Moreover, light atoms such as oxygen can be visualized due to a data set with extremely high statistical accuracy (Tegze et al., 2000). Since the sample of XFH needs a regular orientation of atomic arrangement around a specific element, the amorphous or powder samples cannot be measured. However, it is not limited to systems with a long-range order but can also be applied to cluster, surface adsorbates, and impurities. Several
122
HAYASHI
attractive applications have been demonstrated. Holograms of Zn doped in a GaAs wafer were measured and a dominant site was clarified by visualizing the environment around Zn (Hayashi et al., 2001b). Takahashi et al. (2003c) measured the holograms of FePt thin films and successfully reconstructed atomic images of a Pt layer up to 15 Å in radius. Marchesini et al. (2000) applied XFH to understanding the icosahedral atomic arrangement of quasicrystal AlPdMn. In this article we describe the principle of X-ray holography, reconstruction techniques, experimental systems, and some applications. Furthermore, other related holographic techniques, such as πXAFS, γ -ray holography, and neutron holography, are reviewed. Concerning photoelectron holography, there is so much work in this field that a review is outside the scope of this chapter. Finally, we would like to mention the perspective of atom-resolved holography.
II. T HEORY A. Theory Using Simple Models There are two types of XFH methods, i.e., “normal XFH” and “inverse XFH.” Let us consider these X-ray holography techniques using a simple dimer model. In the normal mode, the wave source is a fluorescing atom A, as shown in Figure 1a. Atom A is excited by an external source such as X-rays or high-energy electrons and emits X-ray fluorescence photons in the form of a spherical wave. This wave can reach the detector surface directly, constituting the holographic reference wave, or after scattering by the neighboring atom B, constituting the holographic object wave. An interference of these two waves produces an intensity modulation on a spherical surface surrounding the sample. In this case, the path difference between these reference and object waves AB–AC is expressed as d(1−cos θ ), where d is the interatomic distance
(a)
(b)
F IGURE 1. Principle of X-ray fluorescence holography. (a) Normal and (b) inverse methods. A and B indicate emitter and scatterer atoms, respectively.
X - RAY FLUORESCENCE HOLOGRAPHY
123
between A and B and θ is the angle of BAC in Figure 1a. The X-ray phase is shifted by π due to the negative charge of the electron when scattering by atoms. Thus, phases of the reference and object waves coincide and an intensity maximum appears when d(1 − cos θ )/λ is equal to a half-integer, where λ is the wavelength of the fluorescent X-ray. Using the atomic structure factor of atom A, f (θ, λ), for X-ray at λ, the relative intensity of fluorescence I (θ, λ) can be expressed by λre f (θ, λ) i2πd(cos θ−1)/λ 2 e I (θ, λ) = 1 − 2π d λre f (θ, λ) i2πd(cos θ−1)/λ e = 1 − 2Re 2π d λre f (θ, λ) i2πd(cos θ−1)/λ 2 , e (1) + 2π d
where re is the classical electron radius. Since a scattering cross section of the atom for the X-ray is extremely small, the value of λf (θ )/2π d is less than 10−3 . Therefore, Eq. (1) can be approximated as λre f (θ, λ) i2πd(cos θ−1)/λ ∼ e . (2) I (θ, λ) = 1 − 2Re 2π d The second term refers to the hologram formed by atom B. The inverse XFH is based on the idea of optical reciprocity of the normal XFH. Figure 1b is a scheme of the inverse XFH. The fluorescence emitted from atom A is used to monitor an interference field originating from X-rays directly coming to atom A or after scattering off atom B, which correspond to reference and object beams, respectively. The holographic pattern is obtained by detecting the fluorescence while varying the sample orientation relative to the incident beam. The normalized intensity of the fluorescence can be expressed by Eq. (2). Here, λ is the wavelength of the incident X-ray. Since the inverse XFH allows holograms to be recorded at any incident energy above the absorption edge of an emitter, the twin image effect is suppressed and the spatial resolution of an atomic image is improved. The holograms in inverse mode were calculated using a simple [001] and [100] CuI dimer oriented along the vertical and horizontal directions, respectively. Here, the Cu and I atoms act as emitters and scatterers, respectively. Figure 2 shows the calculated holographic intensities at an energy of E = 27.8 keV for both the dimers. The holographic interference fringes in the case of unpolarized incident radiation are visible as azimuthal bands for the vertical [001] CuI dimer in Figure 2a, and as vertical bands centered along the [100] direction for the horizontal [100] CuI dimer in Figure 2b. The concept
124
HAYASHI
(a)
(b)
F IGURE 2. Holograms calculated from CuI dimers. (a) [001] CuI dimer and its hologram. (b) [100] CuI dimer and its hologram.
The concept of inverse XFH is also applicable to γ-ray holography using resonant nuclear scattering due to the Mössbauer effect (Korecki et al., 1997).

B. Simulation Using Realistic Models

In the preceding section, the theory of XFH was explained using the dimer model. In this section, the hologram pattern is discussed using a large cluster with a realistic crystal structure instead of the dimer. Equation (2) can be extended to the large cluster model. When k is the wave number vector of the incident X-ray and r_j is the coordinate of the jth atom, the relative intensity of X-ray fluorescence I(k) is expressed by

I(\mathbf{k}) \cong 1 - 2\,\mathrm{Re}\!\left[ \sum_j \frac{r_e f_j(\theta_{\mathbf{k}\mathbf{r}_j})}{r_j}\, e^{i(-\mathbf{k}\cdot\mathbf{r}_j - kr_j)} \right]
+ \left| \sum_j \frac{r_e f_j(\theta_{\mathbf{k}\mathbf{r}_j})}{r_j}\, e^{i(-\mathbf{k}\cdot\mathbf{r}_j - kr_j)} \right|^2,   (3)
FIGURE 3. Theoretical holograms of (a) 109 and (b) 33,453 CuI clusters. The CuI cluster has the ZnS structure with a lattice constant of a = 6.04 Å.
where f_j is the atomic structure factor and θ_{kr_j} is the angle between k and r_j. Though Eq. (3) is written for the inverse XFH, it can be used for normal XFH by replacing k by −k; in that case, k is the wave number vector of the fluorescent X-ray. The second term of Eq. (3) is of the order of 10⁻³, which is a very weak signal to observe. Here, the inverse XFH hologram pattern around Cu in CuI with the zinc blende structure was calculated using Eq. (3). The incident X-ray energy was assumed to be 27.8 keV. The spherical cluster model of CuI includes 109 atoms; the radius of the cluster is about 9 Å. Figure 3a shows the calculated 4π full-sphere hologram. A fine structure of intensity undulation is exhibited over the whole hologram. An atomic image can be obtained by the Helmholtz–Kirchhoff formula, which is a Fourier-transform-like data processing (Barton, 1988):

U(\mathbf{r}) = \int e^{-i\mathbf{k}\cdot\mathbf{r}}\, \chi(\mathbf{k})\, d\sigma.   (4)
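The following sketch shows, under strong simplifying assumptions (constant scattering factors, a toy three-atom "cluster," and a crude sin θ solid-angle weight), how the second term of Eq. (3) generates a hologram and how Eq. (4) can be evaluated as a discrete sum over the sphere to recover an atomic image. It is an illustration only, not the calculation actually used for Figures 3 and 4:

```python
import numpy as np

# Sketch of Eqs. (3) and (4): single-scattering hologram of a small cluster and a
# Helmholtz-Kirchhoff (Barton) reconstruction along a line through the emitter.

R_E = 2.818e-5                      # classical electron radius (A)
k = 2.0 * np.pi * 27.8 / 12.398     # wave number (1/A) at 27.8 keV

# toy "cluster": a few scatterers around the emitter at the origin (A)
atoms = np.array([[2.62, 0.0, 0.0], [0.0, 2.62, 0.0], [0.0, 0.0, 2.62]])
f_atoms = np.array([53.0, 53.0, 53.0])          # constant f for each scatterer

theta = np.linspace(0.01, np.pi - 0.01, 90)     # polar angles of incident k
phi = np.linspace(0.0, 2.0 * np.pi, 180, endpoint=False)
T, P = np.meshgrid(theta, phi, indexing="ij")
khat = np.stack([np.sin(T) * np.cos(P), np.sin(T) * np.sin(P), np.cos(T)], axis=-1)

chi = np.zeros_like(T)
for rj, fj in zip(atoms, f_atoms):
    r = np.linalg.norm(rj)
    phase = -k * (khat @ rj) - k * r            # exponent of Eq. (3)
    chi += -2.0 * (R_E * fj / r) * np.cos(phase)   # keep only the real part

def reconstruct(points):
    """Eq. (4) as a discrete sum of chi(k) exp(-i k.r) over the direction grid."""
    dsigma = np.sin(T)                          # approximate solid-angle weight
    return np.array([abs(np.sum(chi * np.exp(-1j * k * (khat @ p)) * dsigma))
                     for p in points])

x = np.linspace(0.5, 5.0, 46)
U = reconstruct([np.array([xi, 0.0, 0.0]) for xi in x])
# a peak appears near |x| = 2.62 A; for a single-energy, real-valued hologram the
# true and twin images at -x and +x have equal magnitude (cf. Figure 5 below)
print("image maximum near x =", x[np.argmax(U)], "A")
```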
Figure 4a is a reconstruction of the (110) plane. Solid and dotted circles in the figure indicate the positions of I and Cu, respectively. The emitter Cu atom is located at the origin. Point images are seen within these circles, revealing that atomic images were reconstructed from the calculated hologram. In addition, twin images appear at positions centrosymmetric to the true atomic images. Moreover, in the vicinity of the origin of Figure 4a, some artifacts are visible. These twin images and artifacts disturb accurate reconstruction of the atomic image. The solutions that can suppress these undesired images will be explained in Section II.D. Next, using a large CuI cluster including 33,453 atoms, whose radius is about 60 Å, the hologram was calculated, as shown in Figure 3b. The pattern
FIGURE 4. Atomic images of the (110) plane of CuI. (a and b) Reconstructions from the holograms in Figure 3a and b, respectively.
becomes finer compared to that of the 109-atom cluster, because a high-frequency component in the hologram is formed by the far atoms in the model cluster. Moreover, line structures are observed in Figure 3b; these are X-ray standing wave lines caused by Bragg reflection due to the structural periodicity of the model cluster. Consequently, the hologram pattern varies with the cluster size. Of course, experimental hologram patterns are similar to the pattern in Figure 3b. Figure 4b shows an atomic image reconstructed from the hologram in Figure 3b. Comparing it with Figure 4a, the shapes and positions of the neighboring I atoms are not greatly affected by the cluster size. However, the background intensity becomes higher, and the number of artifacts increases. Fanchenko et al. (2002) reported that the far atoms form not only a high-frequency component but also a low-frequency one, and this is considered to be one reason for the above phenomenon. The grain size of the actual sample is normally much larger than the cluster size used here. With Eq. (3), the calculation time is proportional to the cube of the cluster radius; calculating the hologram of a cluster over 1000 Å in radius would take many years. There is, however, a calculation method using the diffraction structure factor. In this technique, the calculation time of the holographic pattern depends on the number of observable diffraction planes and does not depend on the cluster radius.
The holographic function χ(k) in the inverse hologram is the second term of Eq. (3),

\chi(\mathbf{k}) \cong -2\,\mathrm{Re}\!\left[ \sum_j \frac{r_e f_j(\theta_{\mathbf{k}\mathbf{r}_j})}{r_j}\, e^{i(-\mathbf{k}\cdot\mathbf{r}_j - kr_j)} \right].   (5)
Adams et al. (1998) expressed Eq. (5) in terms of the electron density ρ(r) as

\chi(\mathbf{k}) = -2\,\mathrm{Re}\!\left[ \int_V \frac{\rho(\mathbf{r})}{r}\, e^{i(\mathbf{k}\cdot\mathbf{r} - kr)}\, dV \right],   (6)
where V is the volume of a single domain in the crystal. ρ(r) is obtained by a Fourier transformation of F_H, the structure factor at reflection index H:

\rho(\mathbf{r}) = \sum_{\mathbf{H}} F_{\mathbf{H}}\, e^{i\mathbf{H}\cdot\mathbf{r}},   (7)
where H is the reciprocal lattice vector. Substituting Eq. (7) for ρ(r) in Eq. (6) gives

\chi(\mathbf{k}) = -2\,\mathrm{Re}\!\left[ \sum_{\mathbf{H}} F_{\mathbf{H}} \int_V \frac{\cos\{(\mathbf{k}-\mathbf{H})\cdot\mathbf{r}\}}{r}\, e^{ikr}\, dV \right].   (8)
Solving the volume integral, the holographic function χ(k) is expressed as

\chi(\mathbf{k}) = -\sum_{\mathbf{H}} F_{\mathbf{H}}\, \frac{4\pi}{|\mathbf{k}-\mathbf{H}|^2 - k^2}\,(2 - A),   (9)
where

A = \frac{|\mathbf{k}-\mathbf{H}| - k}{|\mathbf{k}-\mathbf{H}|}\, \cos\{(|\mathbf{k}-\mathbf{H}| + k)\,r\} + \frac{|\mathbf{k}-\mathbf{H}| + k}{|\mathbf{k}-\mathbf{H}|}\, \cos\{(|\mathbf{k}-\mathbf{H}| - k)\,r\}.   (10)
Here, r corresponds to the cluster radius. Using this technique, the holographic pattern of a large cluster can be calculated within a realistic computation time. Even if the cluster size is over 100 nm, the holographic pattern with an angular step of 1.0° can be calculated in a few minutes using a Pentium 4 PC.
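A schematic implementation of the reciprocal-space summation of Eqs. (9) and (10) is sketched below. The list of reflections and their structure factors F_H are placeholders to be supplied from crystallographic tables; the function name and the cubic example are assumptions of this sketch, not part of the original calculation:

```python
import numpy as np

# Sketch of the structure-factor route to the hologram, Eqs. (9) and (10).
# The cost scales with the number of reflections, not with the cluster radius.

def chi_from_structure_factors(khat, k, H_list, F_list, radius):
    """Holographic function chi(k) for one incident direction.

    khat   : unit wave-number direction;  k : |k| in 1/A
    H_list : reciprocal lattice vectors (N, 3) in 1/A
    F_list : corresponding (real) structure factors, arbitrary units
    radius : cluster (single-domain) radius r in A
    """
    kvec = k * np.asarray(khat, dtype=float)
    chi = 0.0
    for H, F in zip(H_list, F_list):
        q = np.linalg.norm(kvec - H)            # |k - H|
        A = ((q - k) / q) * np.cos((q + k) * radius) \
            + ((q + k) / q) * np.cos((q - k) * radius)      # Eq. (10)
        chi += -F * 4.0 * np.pi / (q**2 - k**2) * (2.0 - A)  # Eq. (9)
    return chi

# Hypothetical usage with two placeholder reflections of a cubic crystal (a = 6.04 A).
# Note that the denominator diverges near the Bragg condition |k - H| = k.
a = 6.04
H_list = 2.0 * np.pi / a * np.array([[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]])
F_list = [1.0, 0.8]                             # placeholder structure factors
k = 2.0 * np.pi * 27.8 / 12.398
print(chi_from_structure_factors([0.0, 0.0, 1.0], k, H_list, F_list, radius=1000.0))
```

Because the sum runs over reflections rather than atoms, the computation time is independent of the cluster radius, which is the point made above.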
C. Kossel and X-Ray Standing Wave Lines

As the size of the crystalline cluster grows, Kossel lines (KL) (Kossel et al., 1935) or X-ray standing wave (XSW) (Batterman, 1969; Bedzyk and Materlik, 1985) lines appear in the hologram; they are caused by the diffraction of the fluorescent or incident X-rays due to the long-range periodicity of the atomic arrangement in the cluster. In the notation of holography, KL and XSW lines are observed in "normal" and "inverse" XFH measurements, respectively. These lines mask the hologram pattern because they are quite sharp and localized. In KL or XSW analysis, the shapes of the lines are analyzed to obtain the phase information of the diffraction. The fine structure of these lines has been explained by the dynamical theory of X-ray diffraction via the reciprocity theorem used in optics. These techniques have usually been applied to the localization of dopants in high-quality crystals and of surface-adsorbate distributions on substrate crystals. As described in the preceding section, calculation of the hologram using the single scattering theory produces XSW lines when the cluster size is large. In Eq. (3), the second term represents the holographic oscillation; thus, to perform a holographic analysis, the third term must be negligible. While the amplitude of the second term is principally proportional to r_e f(θ)/r, which is of the order of 10⁻³, the third term is quadratic in this quantity. If the phase factors in the sum are random, the third term remains much smaller than the second term. In special directions of the X-rays that satisfy the Bragg condition, however, the phase factors of each element in the sums can have the same value, so for large enough samples both the second and third terms in Eq. (3) become comparable to the first term. In this case the third term can no longer be neglected, and for this reason the KL and XSW lines cannot be regarded as a part of the hologram. In addition, the extinction effect, which is expressed by another formula, is remarkable under the Bragg diffraction condition. Based on the multiple scattering power transfer equation, Korecki et al. (2004a) examined the secondary extinction effect in the inverse XFH. They measured the holograms of a Cu3Au single crystal by X-ray fluorescence and total electron yield detection and explained the difference between the hologram patterns due to the detection modes. To extract a true holographic component from the measured fluorescence intensity distribution, the effects of the third term in Eq. (3) and of extinction must be examined accurately. We have to study these effects extensively. Recently, there have been a few interesting topics concerning XSW. The KL and XSW methods can collect the phases and amplitudes of the diffraction. Thus, in principle, the data can be Fourier inverted to give direct-space, element-specific atomic distributions. However, data from a single diffraction are not sufficient. Cheng et al. (2003) conducted XSW measurements on the first eight orders of allowed (00l) (l = 2, 4, . . . , 16)
Bragg reflections of a muscovite mica crystal, and then determined the model-independent distributions of impurities with respect to the known muscovite (001) lattice. In addition, Marchesini et al. (2002) performed unique work that combined the theory of XFH and KL/XSW in the kinematical approximation to directly obtain the phases of the diffraction structure factor. They determined the partial phase from experimental data, obtaining the sign of the real part of the structure factor for several reciprocal lattice vectors of a vanadium crystal.

D. Removal of Twin Images

Twin images, which are conjugate images of the true ones, appear at positions centrosymmetric to the true atomic images in single energy holography. Figure 5 shows the relation between a sample dimer and its reconstruction from the single energy hologram. An overlap of true and twin images causes a diminishment, distortion, and position shift of atomic images. To resolve these problems, the multiple energy X-ray holography method, the two energy method, and complex holography have previously been used. Using the
FIGURE 5. Concept of twin image. (a) Emitter–scatterer dimer. The emitter and scatterer are located at the origin and at x = x1, respectively. (b) Reconstruction from the hologram of (a). True and twin images appear at x = −x1 and x = x1, respectively. (c) Vectors of the true and twin images. Magnitudes of these vectors are equal, but the phases are different.
theoretical hologram of the CuI cluster, the characteristics of these techniques are described below.

1. Multiple Energy Method

Though the real space image U(r), which is obtained by Eq. (4), lies in a complex vector space, the atomic images are generally displayed as their absolute values. In U(r), the absolute values of the true and twin images are equal, but their phases are different, as shown in Figure 5c. Utilizing this principle, Barton (1991) proposed that twin images could be removed by summing the reconstructions at different energies. Barton's multiple energy reconstruction is obtained by modifying Eq. (4) and can be expressed as

U(\mathbf{r}) = e^{-ikr} \int e^{-i\mathbf{k}\cdot\mathbf{r}}\, \chi(\mathbf{k})\, d\sigma.   (11)
In the reconstruction by this equation, the phase of the true image is fixed at π, while the phase of the twin image varies with the incident X-ray energy. Therefore, by summing the reconstructions from holograms recorded at several X-ray energies, the intensities of the true images increase and those of the twin images decrease. Figure 6a shows an example of the intensities and phases of true and twin images obtained using the multiple energy method. Here, the holograms at 26.8, 27.3, 27.8, 28.3, and 28.8 keV are calculated using the 109-atom CuI cluster, and atomic images are reconstructed by Eq. (11). Figure 6b and c shows the reconstructions from three and five holograms, respectively. The twin images still remain in Figure 6b, while they vanish completely in Figure 6c. The multiple energy algorithm was originally proposed by Barton for the reconstruction of atomic images at a sample surface by photoelectron holography. In X-ray holography, the first successful experiment involving a multiple energy hologram in the inverse mode was performed by Gog et al. (1996) on a natural slab of a hematite (Fe2O3) single crystal. Fluorescence emitted from the sample was collected by a proportional counter via a cylindrical graphite analyzer selecting the Fe Kα line. The holograms were recorded at three incident energies (E = 9.00, 9.65, and 10.30 keV). The reconstruction gives the superposition of two Fe layers corresponding to two possible stacking orders present in the crystal, because XFH cannot distinguish between the different layers. However, image aberrations known to distort single energy holograms are effectively suppressed by summing data for several incident energies.

2. Two Energy Method

Though the multiple energy method is a good technique for suppressing the twin images, several holograms must be recorded to obtain clear atomic
FIGURE 6. Atomic images of the (110) plane of CuI reconstructed by the multiple energy X-ray holography method. X-ray energies for the calculations are 26.8, 27.3, 27.8, 28.3, and 28.8 keV. (a) Vectors of true and twin images by the multiple energy reconstruction algorithm. (b and c) Reconstructions from three (26.8, 27.8, 28.8) and five (26.8, 27.3, 27.8, 28.3, 28.8) energy holograms, respectively.
images. This requires a long measurement time. As an alternative to the multiple energy X-ray method, Nishino et al. (2002) proposed a two energy method, a twin image removal algorithm using two holograms recorded at X-ray energies a few hundred electronvolts apart. In the two energy method, the real space reconstruction is obtained by

U(\mathbf{r}) = e^{ikr} \int e^{-i\mathbf{k}\cdot\mathbf{r}}\, \chi(\mathbf{k})\, d\sigma.   (12)
This equation is obtained by replacing e^{-ikr} in Eq. (11) by e^{ikr}. The phase of the twin image is necessarily constant in the reconstruction obtained by applying Eq. (12). Thus, the difference of two such reconstructions removes only the twin images, as shown in Figure 7a. Here, Eq. (12) was applied to the theoretical holograms of the 109-atom CuI cluster at 27.3 and 27.8 keV, and the difference of the two reconstructions was calculated. As shown in Figure 7b, the twin images are perfectly removed. Moreover, the artifacts observed in Figure 4a are also removed in this simulation. Since this technique relies on a small difference between the two reconstructions, it requires higher-quality hologram data in the experiment than the multiple energy method does.
FIGURE 7. Atomic images of the (110) plane of CuI reconstructed by the two energy X-ray holography method. (a) Vectors of true and twin images by the two energy reconstruction algorithm. (b) Differential between two reconstructions recorded at 27.3 and 27.8 keV.
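The role of the e^{∓ikr} prefactors in Eqs. (11) and (12) can be sketched as follows. The data layout (a dictionary mapping wave number to a hologram sampled on a common grid of incident directions with solid-angle weights) and the function names are assumptions of this illustration, not the author's implementation, and the sign conventions may need to be adjusted to match Figure 5:

```python
import numpy as np

# Sketch of the multiple energy (Eq. 11) and two energy (Eq. 12) reconstructions.
# `holograms` maps wave number k (1/A) to chi(k) sampled on unit directions `khat`
# (N, 3) with solid-angle weights `dsigma` (N,).

def single_energy_U(r, k, chi, khat, dsigma, sign):
    """exp(sign*i*k*|r|) * sum of chi(k) exp(-i k.r) dsigma  (Eqs. 11 and 12)."""
    kernel = np.exp(-1j * k * (khat @ r))
    return np.exp(sign * 1j * k * np.linalg.norm(r)) * np.sum(chi * kernel * dsigma)

def multiple_energy_image(r, holograms, khat, dsigma):
    # Eq. (11) summed over energies: the image whose phase is energy-independent
    # (the true image in the text's convention) adds coherently; the other washes out.
    return abs(sum(single_energy_U(r, k, chi, khat, dsigma, -1)
                   for k, chi in holograms.items()))

def two_energy_image(r, holograms, khat, dsigma):
    # Eq. (12): with the e^{+ikr} prefactor the twin phase is constant, so the
    # difference of two reconstructions cancels the twins and keeps the true images.
    k1, k2 = sorted(holograms)[:2]
    return abs(single_energy_U(r, k1, holograms[k1], khat, dsigma, +1)
               - single_energy_U(r, k2, holograms[k2], khat, dsigma, +1))

# hypothetical usage, given khat, dsigma and holograms = {k1: chi1, k2: chi2, ...}:
#   multiple_energy_image(np.array([x, y, z]), holograms, khat, dsigma)
#   two_energy_image(np.array([x, y, z]), holograms, khat, dsigma)
```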
The two energy method is useful when several holograms of a sample cannot be measured due to the limitation of beam time at a synchrotron facility. Furthermore, in a laboratory-scale holography experiment, the recordable X-ray energies are limited to two or three, such as the Kα and Kβ lines. In this case, the two energy method is powerful for twin image elimination. Nishino et al. (2002) demonstrated the two energy method, as well as its theoretical feasibility, using a (110) ZnSe single crystal with the zinc blende structure. The atomic structure of the sample has no centrosymmetry, and therefore the true and corresponding twin images of Se do not overlap. The Zn Kα fluorescence X-rays were analyzed by a cylindrical LiF crystal and collected by a Si PIN diode. They adopted the pure inverse holography scheme (Adams et al., 2000). The hologram scans were made at 11.0 and 11.3 keV. A three-dimensional phase-combined image obtained from 4π extended holograms at 11.0 and 11.3 keV clearly represents the four nearest-neighbor Se atoms around the central Zn atom without twin images.

3. Complex Holography

Though we indicated that X-ray holography can measure the intensity and phase of the scattered X-rays, the phase problem is not perfectly resolved in a single energy hologram. In Figure 1, the complex value of e^{i2πd(cosθ−1)/λ} gives complete information on the phase concerning the path difference AB–AC, but actually only its real part can be measured. This lack of phase information is the cause of the twin image occurrence. The concept of complex X-ray holography is therefore important for ideal hologram measurement. Using resonant X-ray scattering around an X-ray absorption edge or nuclear resonant scattering due to the Mössbauer effect (Korecki et al., 1997), a slight change of the X-ray or γ-ray energy can control the phase of the scattered photon. The complex hologram can be derived from a few holograms recorded at energies in the region of the resonant scattering, and it is expressed as

\chi(\mathbf{k}) = \sum_j \frac{c}{2kr_j}\, e^{i(-\mathbf{k}\cdot\mathbf{r}_j - kr_j)},   (13)
where c is a constant. Korecki et al. (2001) demonstrated a complex γ-ray holography experiment using nuclear resonant scattering. Here, I calculated the complex γ-ray hologram using the CuI cluster. Iodine has two isotopes, 129I and 127I. Since 129I has a Mössbauer transition at 27.8 keV, the complex γ-ray hologram can, in principle, be constructed. Similar to the simulations in the previous sections, the complex hologram of the 109-atom CuI cluster can be calculated using Eq. (13).
FIGURE 8. Complex hologram of CuI. (a) Real and (b) imaginary parts.
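A brief sketch of Eq. (13) for a single resonant scatterer illustrates why a complex hologram is twin-free: the full complex exponential, not only its real part, is available to the reconstruction of Eq. (4). The constant c, the grid, and the bond length are arbitrary choices of this sketch:

```python
import numpy as np

# Sketch of a complex hologram, Eq. (13), for one resonant scatterer, reconstructed
# with Eq. (4).  Only one of the two centrosymmetric image positions survives.

k = 2.0 * np.pi * 27.8 / 12.398          # 27.8 keV (Mossbauer transition of 129I)
rj = np.array([2.62, 0.0, 0.0])          # one resonant scatterer (A)
c = 1.0                                  # arbitrary constant of Eq. (13)

theta = np.linspace(0.01, np.pi - 0.01, 90)
phi = np.linspace(0.0, 2.0 * np.pi, 180, endpoint=False)
T, P = np.meshgrid(theta, phi, indexing="ij")
khat = np.stack([np.sin(T)*np.cos(P), np.sin(T)*np.sin(P), np.cos(T)], axis=-1)

chi = (c / (2.0 * k * np.linalg.norm(rj))
       * np.exp(1j * (-k * (khat @ rj) - k * np.linalg.norm(rj))))

def U(p):
    """Eq. (4) applied to the complex hologram (discrete sum over directions)."""
    return abs(np.sum(chi * np.exp(-1j * k * (khat @ p)) * np.sin(T)))

# strong image at -rj (the true image in the convention of Figure 5), essentially
# nothing at +rj: the twin is absent because the hologram is complex-valued
print(U(-rj), U(rj))
```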
Figure 8a and b shows the real and imaginary parts of the complex hologram. Applying Eq. (4) to the complex hologram, the atomic image can be obtained, as shown in Figure 9. The twin images are strongly suppressed, as with the multiple energy and two energy methods. Furthermore, the reconstruction displays only the I atoms, which give rise to the resonant nuclear scattering. Complex X-ray holography using resonant X-ray scattering is introduced in Section IV.D. As mentioned above, experimental complex holography was first realized by the γ-ray hologram measurement of an epitaxial Fe (001) film by Korecki et al. (2001). By detuning the resonance, the phase of the nuclear scattering amplitude can be changed. Using γ-rays equally detuned on either side of the narrow Mössbauer resonance peak, they could take linear combinations of two holograms to separate the real and imaginary parts. From the complex holograms, an accurate, twin-free real-space image can be reconstructed.

E. Polarization Effect of Incident X-Ray

The multiple energy, two energy, and complex holography methods are promising for obtaining clear atomic images because they can effectively suppress the twin images and artifacts by reconstructing images from holograms covering a range of different energies. Normally, they are conducted in the inverse mode, and therefore they are strongly influenced by the polarization of the incident radiation when synchrotron radiation is used. Synchrotron radiation is the most practical source of incident radiation for the inverse XFH experiment due to its energy tunability, high brightness, and high energy resolution. Thus, linearly polarized incident radiation must be considered. The effect of incident radiation polarization on the hologram patterns is discussed here for the two types of CuI dimers.
FIGURE 9. Atomic images of the (110) plane of CuI reconstructed from the complex hologram in Figure 8.
The general form of the well-known Thomson scattering factor is sin θ_{êk'}, where θ_{êk'} is the angle between the polarization vector ê of the incident radiation and the direction k' of the scattered radiation (Len et al., 1997). For an unpolarized incident beam, the Thomson scattering factor is expressed as (1 + cos²Θ)/2, where Θ is the angle between the incident beam and the scattered X-rays. All holograms calculated in the previous sections are for unpolarized radiation. The effect of polarized incident radiation on the inverse hologram can be demonstrated by considering the theoretical holograms of the [001] and [100] CuI dimers used in Section II.A. A top view of the experimental geometry is illustrated in Figure 10. Since the beam line is fixed in a real experimental situation, the dimers were rotated with respect to the incident radiation polarization during the calculations.
FIGURE 10. Orientation of the sample with respect to the horizontal (ê1) or vertical (ê2) polarization vector of incident radiation. The polar rotation axis θ is perpendicular to ê1 and is parallel to ê2.
Figure 11a and b shows the inverse mode holograms of the CuI dimers for incident radiation polarized horizontally with respect to a stationary synchrotron source, where the polarization vector ê1 is always perpendicular to the θ rotation axis. Due to the azimuthal symmetry of the vertical polarization with respect to the [001] CuI dimer for all incident radiation angles, the resulting hologram in Figure 11a is nearly identical in the lower polar region to the unpolarized incident radiation hologram in Figure 2a, but is weaker in intensity at higher polar angles (the outside of the hologram). In contrast, the hologram intensity of the [100] CuI dimer for horizontally polarized light in Figure 11b is suppressed in directions perpendicular to the [100] CuI axis in the lower polar region (the inside of the hologram). Figure 11c and d shows the holograms of the [001] and [100] CuI dimers for incident radiation polarized vertically with respect to a stationary synchrotron source, where the polarization vector ê2 is parallel to the θ rotation axis, as shown in Figure 10. The resulting hologram of the [001] CuI dimer is again similar to the unpolarized incident radiation hologram, but the holographic fringes in Figure 11c show relatively higher contrast. The hologram intensity of the [100] CuI dimer in Figure 11d is suppressed in directions perpendicular to the [100] CuI axis in the higher polar region, in contrast to the case of horizontal polarization. These polarization effects are also seen in the reconstructed images: the atomic image of the [001] CuI dimer is higher, and that of the [100] CuI dimer lower, for horizontally polarized light than for vertically polarized light. Recently, Bortolani et al. (2003) calculated theoretical holograms for an Fe bcc crystal using a multipole expansion for the scattered field, and found a crucial polarization effect on atom observability in the reconstructed image.
FIGURE 11. Holograms with horizontal (a, b) and vertical (c, d) polarization for [001] (left) and [100] (right) CuI dimers.
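The two Thomson factors quoted above can be written down directly; multiplying the scattering factor f in Eq. (2) by the linear-polarization factor reproduces, qualitatively, the suppression of fringes seen in Figure 11. The function names and the assignment of k' to the scatterer-to-emitter direction in inverse mode are assumptions of this sketch:

```python
import numpy as np

# Sketch of the Thomson polarization factors quoted in the text.  For a linearly
# polarized incident beam the scattering amplitude carries sin(theta_ek'); for an
# unpolarized beam the intensity factor is (1 + cos^2(Theta)) / 2.

def linear_polarization_factor(e_hat, kprime_hat):
    e = np.asarray(e_hat, float)
    kp = np.asarray(kprime_hat, float)
    cos_t = np.dot(e, kp) / (np.linalg.norm(e) * np.linalg.norm(kp))
    return np.sqrt(1.0 - cos_t**2)          # sin(theta_ek'), amplitude factor

def unpolarized_factor(kin_hat, kprime_hat):
    cos_T = np.dot(kin_hat, kprime_hat)
    return 0.5 * (1.0 + cos_T**2)           # intensity factor

# In the inverse-mode dimer geometry the object beam is scattered from B towards
# the emitter A, so k' is taken along the B -> A direction (assumption of this sketch).
print(linear_polarization_factor([1, 0, 0], [1, 0, 0]))  # 0: no scattering along e_hat
print(linear_polarization_factor([1, 0, 0], [0, 0, 1]))  # 1: full amplitude
```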
F. Near Field Effect

In most of the earlier work on XFH, a first-order approximation has been used. This approximation assumes that the size of the core electron distribution of the scatterer is much smaller than the radius of the incident spherical wavefront and is thus valid only for a pointlike scatterer. This approximation breaks down when the scatterer is close to the emitter, and this is called the near field effect. Accurate hologram calculations taking the near field effect into account help the reconstruction using a nonlinear least-squares fitting algorithm, which can produce the true electron charge density. Tegze and Faigel (2001) investigated this problem by first formulating the hologram of a single electron and then calculating the hologram of an atom using numerical integrals over the charge density associated with the atom. Bai (2003) derived a formula for calculating the atomic scattering factor for spherical waves and used it to discuss the near field effect in XFH. Figure 12 shows a hologram calculated by Bai's method for a single emitter–scatterer pair of Cu atoms. The fluorescence emitter atom is at the origin, while the scatterer is at 2.56 Å along the x-axis. The X-ray energy used for the calculations was 16.0 keV. The largest difference between the calculations by the plane wave approximation and by Bai's method appears near the forward
FIGURE 12. Hologram oscillations of a single pair of Cu atoms separated by 2.56 Å. The incident energy is 16.0 keV. The solid line shows the oscillation taking the near field effect into account; the dotted line was calculated by the plane wave approximation for comparison.
scattering direction. The near field effect includes the vector nature of the electromagnetic field and the curved wavefront; both effects reduce the forward scattering. With increasing X-ray energy, the near field effect due to the curved wavefront becomes larger. This causes phase shifts due to the finite size of the electron cloud. Taking the near field effect into account, Tegze and Faigel (2001) calculated the hologram of a spherical NiO cluster containing more than 33,000 atoms, and revealed that its main features were in agreement with an experimental hologram.
III. EXPERIMENT AND DATA PROCESSING

A. Experimental Geometries for Normal and Inverse Modes

As mentioned previously, there are two types of experimental geometries: the normal mode and the inverse mode. The measurements of both normal and inverse holograms are carried out using the setup displayed in Figure 13a. In general, the samples are limited to single crystals or epitaxial films with
FIGURE 13. Experimental setup for recording X-ray fluorescence holography. (a) System using an energy-dispersive X-ray detector. (b) System using a crystal analyzer and avalanche photodiode.
the orientational regularity of atomic arrangements. To avoid the effect of sample shape, the area irradiated by the incident beam should be flat. In the normal mode, the incident angle θ1 is fixed, and the angular variation of the fluorescence intensity is measured by scanning the azimuthal angle of the sample φ and the X-ray exit angle θ2. Normally, the step angles of φ and θ2 are less than 1°, and a pinhole slit is set in front of the X-ray detector to limit the acceptance angle of the X-ray fluorescence. To record the hologram in k-space, the scan ranges of φ and θ2 should be set as wide as possible, such as 0° ≤ φ ≤ 360° and 0° ≤ θ2 ≤ 80°. However, if θ1 is not set to 0°, a pure normal hologram cannot be directly obtained. In all X-ray holography experiments performed so far, the atoms within the sample are excited to fluorescence by incident monochromatic X-rays; thus, X-ray standing waves are formed not only by the fluorescence but also by the incident X-rays. Since θ1 is kept constant during the measurement, the pattern obtained is a mixture of a two-dimensional hologram of the fluorescence (normal hologram) and a one-dimensional hologram of the incident X-rays (inverse hologram). Hiort et al. (2000) reported a pure normal holography measurement setting θ1 to 0°. However, this setup could not measure the hologram in the range between 0° and 17° due to blocking of the direct beam by the detector. The energy-dispersive solid state detector (SSD) is convenient for detecting fluorescence photons. The holographic oscillation is about 0.1% of its background intensity; this requires about 10,000 measurements in one hologram, each with at least one million fluorescence photons. Since the maximum count rate of a commercial SSD is about 10,000 cps, the first XFH experiment by Tegze and Faigel, using a sealed-off X-ray tube, needed a few months to attain satisfactory statistical accuracy. If undesired fluorescent X-rays can be cut by an absorption foil filter, an imaging plate or X-ray CCD camera will
be available and the scanning of the sample and detector is not necessary. In this case, the measurement time is shortened. Kopecky et al. (2001) placed a thin iron absorber between a CoO single crystal sample and the detector to pass the Co Kα fluorescence line, and measured the hologram using the imaging plate in one shot. In the inverse mode, the relationship between θ1 and θ2 is the reverse of that in the normal mode, and therefore θ2 must be kept constant. The variation of the fluorescence intensity is recorded by scanning φ and θ1 in Figure 13a. The total fluorescence from the emitter atoms should be collected, but in practice this is difficult. A large angular acceptance for the fluorescence photons is still preferable, because the component of the normal hologram is then smeared out. To measure a pure inverse hologram, θ2 is set to 0°. Using this setup, Adams et al. measured the pure inverse hologram of a Cu3Au single crystal (Adams et al., 1998). However, the hologram of the lower polar region cannot be recorded due to the blocking of the incident beam by the detector, similar to the pure normal XFH measurement.

B. Laboratory XFH Apparatus

Laboratory XFH equipment with a conventional X-ray source has been built to conveniently carry out preliminary and basic research and to increase the number of users. It was developed using a singly bent graphite monochromator with a large curvature and a high count rate X-ray detecting system. Figure 14a shows a schematic drawing of the high intensity incident X-ray system for the laboratory XFH equipment that we designed. A 21-kW rotating-anode X-ray generator with a molybdenum target was adopted as the X-ray source. A cylindrically bent graphite crystal with a 21-mm curvature radius (Matsushita Electric Co.) (Figure 14b) was installed as the incident monochromator, so that high intensity monochromatic Mo Kα radiation is focused at the sample. The focal length is 197 mm. Both the vertical and horizontal full width at half maximum (FWHM) of the focal spot are about 1.3 mm. The vertical and horizontal convergence angles are 1.2° and 2.2°, respectively. The photon flux of the incident X-rays at the sample position is of the order of 10⁹ photons per second per mm² when the generator is operated at 60 kV and 350 mA. This is more than 100 times as intense as nonfocused monochromatic X-rays. Figure 14c shows a schematic drawing of the incident beam monitor. The incident X-ray beam passes through two pinholes of 2 mm diameter at the entrance and exit of the monitor. The convergence angle of the transmitted X-ray beam is about 1.2°. The intensity of the incident beam was monitored by measuring both fluorescent and scattered X-rays from a 2-µm-thick copper foil with an avalanche photodiode (APD), which can count X-rays even at a high rate of ∼10⁷ cps without any counting loss.
FIGURE 14. Illustration of the laboratory XFH apparatus. (a) Top view. (b) Photograph of the bent graphite monochromator. (c) Intensity monitor of the incident beam.
Using the setup in Figure 13a, normal and inverse XFH measurements were conducted. X-ray fluorescence emitted from the sample was detected by an SSD designed to detect X-rays at a count rate of ∼10⁵ cps with an energy resolution of about 200 eV. An Au (001) single crystal sample was mounted on the φ stage. In the normal mode, θ1 is kept constant. In the inverse mode, the pinhole slit is not set in front of the SSD; the SSD is placed near the sample to catch the fluorescence within a large solid angle, and the convergence angle of the incident X-rays determines the angular resolution of the holograms. Detailed experimental conditions for measuring holograms have been described elsewhere (Takahashi et al., 2003d, 2004). In the normal XFH measurements, a pinhole of 3 mm diameter was set in front of the SSD so as to detect X-ray fluorescence with a receiving angle of about 2°. The X-ray generator was set at 55 kV and 300 mA. The Au Lα, Lβ, and Lγ lines, forming three different holographic patterns, were detected simultaneously. It took eight days to record these patterns. The patterns in Figure 15a–c correspond to the XFH profiles of Au Lα, Lβ, and Lγ, respectively. The intensities of the Au Lα, Lβ, and Lγ fluorescence photons at each pixel were, respectively, about 5 × 10⁵, 4 × 10⁵, and 5 × 10⁴ counts with
FIGURE 15. Holograms of an Au single crystal. (a) 9.7 keV (Au Lα). (b) 11.6 keV (Au Lβ). (c) 13.4 keV (Au Lγ). (d) 17.4 keV (Mo Kα).
an integration time of 20 sec. In the inverse XFH measurements, the Au Lα and Lβ lines were detected for recording a holographic pattern. The X-ray generator was set at 50 kV and 50 mA so as to limit the total count rate at the detector to about 10⁵ cps. The total intensities of the Au Lα and Lβ fluorescent X-rays at each pixel were about 5 × 10⁵ counts with an integration time of 5 sec. Figure 15d shows a holographic pattern of Au in the inverse mode; it took two days to obtain. Figure 16a and b shows the real space images reconstructed using Barton's (1991) multiple energy algorithm. Blue circles indicate the atomic positions of Au calculated from its lattice parameter. The 110 atomic images in the (001) plane are reconstructed near the true atomic positions. Their intensity maxima shift outward by 0.5 Å compared to the actual atomic
FIGURE 16. Atomic images of the (001) and (002) planes for an Au single crystal.
positions. Strong artifacts appear outside the atomic positions labeled A. This is due to the strong image of the 9.71 keV (Au Lα) hologram in the normal mode, whose 110 atomic images shift outward. The origin of this shift is not fully understood yet. In the (002) plane, a 0 1/2 1/2 atomic image appears at the true position with an accuracy of ±0.1 Å. These results clearly demonstrate that this multiple-energy technique provides quite accurate atomic images, even with laboratory XFH equipment.

C. Fast X-Ray Fluorescence Detection System at SR

XFH experiments at a synchrotron radiation (SR) facility are conducted mainly in the inverse mode because the wavelength of the hologram is selectable. Since the beam time at a synchrotron radiation facility is usually limited to a few days, the SSD is not adequate as the X-ray detector for the holography experiment. To overcome this difficulty, the pure fluorescence intensity has to be detected at a high counting rate. Tegze et al. designed an XFH measurement system using a cylindrical graphite analyzer and an APD for the inverse method (Marchesini et al., 1998; Tegze et al., 1999) and succeeded in recording a high quality hologram that provides images of oxygen atoms (Tegze et al., 2000). Its geometry is illustrated in Figure 13b. The APD is a fast X-ray detector that can count photons at a rate of 10⁷ cps, but its energy resolution is very low; since energy discrimination is necessary to suppress the unwanted radiation, the energy analysis was performed by the cylindrical graphite analyzer. This arrangement resulted in a large loss of detection solid angle compared to the measurement system using an SSD. This
was compensated for by the high flux of synchrotron radiation. If the count rate exceeds 10⁷ cps, it cannot be handled by single photon counting; in this case, a Si PIN diode operated in current mode is available. Advancing this concept, Marchesini et al. (2001) designed a quick hologram measurement system equipped with fast rotation stages. It could record a hologram within 10 min, even for a thin film sample, using a direct undulator beam (pink beam). A toroidally bent analyzer is preferable to a cylindrically bent one because it collects fluorescence photons more effectively (Sekioka et al., 2005). The disadvantage of the toroidally bent analyzer is that the selectable fluorescence lines are strictly limited. I mentioned that the energy resolution of an APD is low, larger than 20% at room temperature; however, it improves to about 10% with thermoelectric cooling (Kishimoto et al., 2001). In the inverse mode, the incident energy is usually well away from the absorption edge, and it is not necessary to collect all K or L fluorescence lines from each element separately. Thus, a very high energy resolution like that of an SSD is not required. We used the cooled APD instead of the SSD and measured the hologram of a Ge wafer. Ge fluorescence and scattered X-rays from the sample were observed by the cooled APD system. Fast data signals from the APD were processed by a fast amplifier, discriminator, and scaler. By scanning the discriminator threshold, we measured a pulse-height distribution of the signal. Figure 17 shows an example. Peaks of the Ge Kα and scattered X-rays are observed at channels 366 and 455, respectively. The peak of Ge Kβ is hidden by the strong peaks of the Ge Kα and scattered X-rays. As shown in Figure 17, the Ge Kα, Kβ, and elastic scattering peaks and their backgrounds were evaluated by Proctor and Sherwood's (1982) fitting method, which was originally developed to determine photoelectron peak intensities. The weak Ge Kβ peak was obtained from the energy dependence of the APD efficiency and the well-known intensity ratio of the Ge Kα and Kβ peaks. The energy resolution of these spectra was about 13%; this value can be improved to 10% or less by optimizing the device temperature and suppressing the electronic noise. In the actual measurement, the total Ge fluorescence intensity was evaluated by subtracting the counts above channel 420 from those above channel 300 using the discriminator. Since the peaks of the Ge fluorescence overlapped those of the scattering, the total intensities of Ge Kα were corrected by the profile analysis illustrated in Figure 17. In the present and preceding sections, I have described several X-ray fluorescence detection systems. The maximum count rates, acceptance solid angles, and energy resolutions of the various detectors that we used in the holography experiments are summarized in Table 1.
FIGURE 17. Energy spectra of Ge K fluorescent and scattered X-rays measured with the cooled APD.
TABLE 1
PERFORMANCES OF X-RAY DETECTORS FOR X-RAY FLUORESCENCE HOLOGRAPHY

Detector                          Count rate (cps)   Solid angle (sr)   Energy resolution (%)
Multielement SSD (19 elements)    ∼10⁶               ∼1                 3
Si drift detector                 ∼10⁵               ∼1                 3
Cylindrical analyzer and APD      ∼10⁶               ∼0.01              5
Toroidal analyzer and APD         ∼10⁶               ∼0.1               5
Cooled APD                        ∼10⁶               ∼1                 10
D. Details of Data Processing for Obtaining Atomic Images

Here, I show typical data processing using the hologram data of the Ge wafer obtained by the cooled APD, which was introduced in the preceding section. The sample was set on a two-axis rotatable stage. Intensities of Ge fluorescence were measured as a function of the azimuthal angle φ and polar angle θ1 within the ranges 0° ≤ φ ≤ 360° and 20° ≤ θ1 ≤ 70°. The definitions of θ1 and φ are given in Figure 13a. The dwell time for each measurement was 1 sec with a 0.5° step in φ and a 1° step in θ1. The total integrated intensity of Ge in each measurement was about 500,000 counts. The Ge X-ray fluorescence intensity is normalized with respect to the incident X-ray intensity, because the SR intensity fluctuates and decays
exponentially during the scan. The normalized fluorescence intensity I(θ, φ) is transformed into χ(θ, φ) using the following expression:

\chi(\theta,\varphi) = \left[ I(\theta,\varphi) - I_0 \right] / I_0,   (14)
where I₀ is the average intensity over the whole φ scan range. Figure 18a shows χ(θ1, φ) of the Ge single crystal. Wide stripes along the θ1 direction and narrow X-ray standing wave lines can be seen. The wide stripes are attributed to the scanning method that we adopted (Tegze et al., 1999). The entire system rotated around the polar angle θ1, and thus the exit angle of the X-ray fluorescence was fixed at θ2 = 60°. Using this setup, the holographic pattern obtained was a sum of the inverse hologram and a part of a normal one; this comes from the scanning technique we used, and details of its origin are described in Section III.A. A pure normal hologram at θ2 = 60° could be precisely measured by setting θ1 to 0°. However, I illustrate here another technique for subtracting the component of the normal hologram, which uses the Fourier transform method. From the data in Figure 18a, Figure 19 shows the pattern obtained by Fourier transformation. Several strong spots exist along the ωφ direction at ωθ1 = 0. These are attributed to the one-dimensional periodicity along the ωφ direction, namely, the component of the normal hologram. After removing these spots, the data in Figure 19 are inverse Fourier transformed; the resulting pattern is displayed in Figure 18b. Since the stripes are not observed in this pattern, it is confirmed that the normal hologram component was removed. Of course, the atomic images could be reconstructed at this stage; however, the reconstructed atomic images were distorted and blurry because of insufficient statistical accuracy. The remaining sharp line pattern is due to Bragg reflection (X-ray standing waves) caused by the long-range periodic order present in the sample. These lines reflect the four-fold symmetry of the Ge (001) wafer. Therefore, the resulting pattern can be symmetrized with the aid of the X-ray standing wave lines to gain statistical accuracy. Figure 18c shows the symmetrized hologram pattern. The X-ray standing wave patterns are clearly visible compared to those in Figure 18b. The processed hologram, interpolated to a k-space mesh, is presented in Figure 20a using the following expressions:

k_x = |\mathbf{k}| \sin\theta \cos\varphi, \qquad k_y = |\mathbf{k}| \sin\theta \sin\varphi.   (15)
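The processing chain just described — normalization by Eq. (14), removal of the normal-hologram stripes in Fourier space, four-fold symmetrization, and projection onto a kx–ky mesh by Eq. (15) — can be sketched as follows. The grid shapes, the filter choice (zeroing the whole ωθ1 = 0 row instead of individual spots), and all function names are simplifying assumptions of this sketch:

```python
import numpy as np

# Sketch of the inverse-mode data processing described above for a scan I(theta1, phi).

def normalize(I):
    """Eq. (14), with I0 taken as the mean over each phi scan."""
    I0 = I.mean(axis=1, keepdims=True)
    return (I - I0) / I0

def remove_normal_component(chi):
    # The text removes individual strong spots at omega_theta1 = 0; for brevity this
    # sketch zeroes that whole row of the 2D Fourier transform.
    F = np.fft.fft2(chi)
    F[0, :] = 0.0
    return np.real(np.fft.ifft2(F))

def symmetrize_fourfold(chi, phi_step_deg):
    n90 = int(round(90.0 / phi_step_deg))       # phi steps per 90 degrees
    return np.mean([np.roll(chi, i * n90, axis=1) for i in range(4)], axis=0)

def to_kxky(theta1_deg, phi_deg, k):
    """Eq. (15): map the (theta1, phi) grid to scattered (kx, ky) points."""
    T, P = np.meshgrid(np.radians(theta1_deg), np.radians(phi_deg), indexing="ij")
    return k * np.sin(T) * np.cos(P), k * np.sin(T) * np.sin(P)

# Hypothetical scan: 0.5 deg steps in phi, 1 deg steps in theta1 (as in the text),
# filled here with dummy counts for illustration only.
phi = np.arange(0.0, 360.0, 0.5)
theta1 = np.arange(20.0, 71.0, 1.0)
I = 1.0 + 1e-3 * np.random.randn(theta1.size, phi.size)
chi = symmetrize_fourfold(remove_normal_component(normalize(I)), 0.5)
kx, ky = to_kxky(theta1, phi, k=2.0 * np.pi * 12.5 / 12.398)
```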
Figure 20a shows the hologram pattern in kx–ky space. We also calculated the hologram of a spherical Ge cluster containing 185,771 atoms, as shown in Figure 20b. The X-ray energy used (E = 12.5 keV) was the same as one of the measurement energies. We applied the same low-pass filter to both the experimental and calculated hologram data to compare the results.
FIGURE 18. Holograms of the Ge (001) single crystal. (a) Raw data. (b) Data after removing the component of the normal XFH pattern. (c) Four-fold symmetrized data.
Comparing them, the details are not exactly the same, but the main features of the measured and calculated holograms agree very well. The small differences can be explained by the limited size of the cluster used in the calculation. The reconstruction of the atomic image was carried out with the Barton (1988) algorithm. Figure 21a and b shows planes parallel to the {001} lattice planes taken at distances of z = 0 Å and 1.4 Å (≅ a/4), respectively. In Figure 21a, the 1/2 1/2 0 atom is visible, but the distance between this atom and the
FIGURE 19. Fourier transformation of the hologram data in Figure 18a. Integrations were carried out along the φ and θ1 directions.
FIGURE 20. Holograms of (001) Ge projected on the kx–ky plane. (a) Experimental data. (b) Calculated data.
emitter is smaller than the tabulated a/√2 = 4.00 Å. In Figure 21b, the 1/4 1/4 1/4 atom is displayed at a position 2.0 Å from the center, in agreement with the known value of a/(2√2) = 2.00 Å. The image at this level is a superposition of two associated environments, because atoms in a Ge crystal lie in two distinct crystallographic sites; the XFH technique cannot distinguish them. The images reconstructed from the theoretical hologram in Figure 20b are also in good agreement with those from the experimental hologram. In particular, the displacement of the 1/2 1/2 0 atom toward the center is observed, similar to the reconstruction at z = 0 Å from the theoretical hologram. The calculation
FIGURE 21. Reconstructions from the hologram in Figure 20a. The planes parallel to the {001} lattice plane cutting through the fluorescent emitter atom and 1.4 Å above the emitter atom are displayed in (a) and (b), respectively. Circles show the theoretical positions of atoms.
shows that the shift of these atomic images is mostly caused by real–twin interference (Hiort et al., 2000). Moreover, the far atoms, such as the 100 and 3/4 1/4 1/4 atoms, are hardly seen in the theoretical reconstructions, similar to the experimental ones, suggesting that the disappearance of these atoms is not explained simply by weak holographic signals; it is explained by the effect of the twin images. These displacement and disappearance problems can be solved by recording the holograms at several incident energies.

E. Sample Cooling Effect

In the previous sections, the temperature dependence of the signal was not taken into account in the theoretical calculations. However, thermal vibrations of atoms affect the holographic oscillation: they reduce the intensities of X-ray diffraction, an effect described by the Debye–Waller factor. The atomic scattering factor including the Debye–Waller factor is expressed as

f(\theta) = f_0(\theta)\, e^{-M}, \qquad
M = B \left( \frac{\sin\theta}{\lambda} \right)^2, \qquad
B = \frac{6h^2 T}{m_a k_B \Theta^2} \left[ \varphi(x) + \frac{x}{4} \right], \qquad
\varphi(x) = \frac{1}{x} \int_0^x \frac{\xi}{e^{\xi} - 1}\, d\xi, \qquad x = \frac{\Theta}{T},   (16)

where T is the absolute temperature, Θ is the Debye temperature, m_a is the mass of the atom, h is the Planck constant, and k_B is the Boltzmann constant.
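A quick numerical check of the B factor defined in Eq. (16) is given below (a sketch; the physical constants, the use of scipy's quad routine for the Debye integral, and the Pb parameters are my own choices). With Θ = 88 K and m_a = 207.2 u it reproduces, to within rounding, the Pb values quoted in the following paragraph:

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of the Debye-Waller B factor of Eq. (16); constants in SI units,
# result converted to A^2.

H = 6.62607e-34      # Planck constant (J s)
KB = 1.38065e-23     # Boltzmann constant (J/K)
U_KG = 1.66054e-27   # atomic mass unit (kg)

def debye_B(T, theta_D, mass_u):
    x = theta_D / T
    phi = quad(lambda xi: xi / np.expm1(xi), 0.0, x)[0] / x
    B_m2 = 6.0 * H**2 * T / (mass_u * U_KG * KB * theta_D**2) * (phi + x / 4.0)
    return B_m2 * 1e20   # m^2 -> A^2

# ~2.15 and ~0.73 A^2, cf. the 2.155 and 0.732 quoted below for Pb
print(debye_B(300.0, 88.0, 207.2), debye_B(100.0, 88.0, 207.2))
```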
These equations indicate that the holographic signal is damped with increasing k due to the temperature factor; thus, sample cooling suppresses the damping of the holographic signal. Note that simply replacing f(θ) by f(θ)e^{-M} in Eq. (5) is not sufficient for calculating an accurate holographic oscillation: since the emitter atoms as well as the scatterer atoms vibrate thermally, another thermal factor for the emitter atom must be added to the formula of the holographic oscillation. To check the sample cooling effect, a cryostream cooler (Oxford Cryostream 70 series), which blows nitrogen gas onto the sample, was installed in the laboratory XFH apparatus. Here, a Pb (001) single crystal was used as the sample; its lattice constant is a = 4.9505 Å. The Debye temperature of Pb is 88 K, which is very small among metals. The sample was cooled to 100 K. The B values of Pb at 300 K and 100 K are 2.155 and 0.732, respectively. The measurement was carried out using the laboratory XFH apparatus in the inverse mode. The Mo Kα line monochromatized by the bent graphite crystal was used as the incident beam. The sums of the Pb Lα, Lβ, and Lγ fluorescence intensities were measured by a silicon drift detector as functions of φ and θ1 within the ranges 0° ≤ φ ≤ 360° and 0° ≤ θ1 ≤ 60°. We also obtained the hologram at room temperature under the same conditions for comparison. The measured data were symmetrized using the four-fold symmetry of the Pb (001) surface. The hologram pattern at 100 K clearly exhibits standing wave lines compared to that at room temperature; this corresponds to an enhancement of the X-ray diffraction on cooling the sample. Figure 22a and d shows the reconstructions of the (001) and (002) planes. Atomic images are not seen in the reconstructions at room temperature; however, at 100 K, atomic images emerge at the Pb fcc sites. Figure 22c shows the differential image between Figure 22a and b: the 100, 1/2 1/2 0, and 010 atomic images are clearly observed, though the 100 and 010 images shift outward, probably due to the overlap of real and twin images. Figure 22f shows the differential image between Figure 22d and e, exhibiting, remarkably, the neighboring 1/2 0 1/2 and 0 1/2 1/2 atomic images. The experimental results proved that cooling the sample enhances only the atomic images and is useful for visualizing light atoms in the reconstruction.

F. Inverse Fourier Analysis

1. Theoretical Proof

If we could determine atomic positions to within an accuracy of ±0.01 Å, the local lattice distortion around dopants could be evaluated quantitatively and the mechanisms responsible for material properties would be elucidated.
FIGURE 22. Temperature dependence of reconstructions of the (001) (top) and (002) (bottom) planes for the Pb crystal. (a, d) Room temperature. (b, e) 100 K. (c, f) Differential images between room temperature and 100 K.
However, no novel data analysis technique has been proposed since the development of Barton's reconstruction algorithm (Tegze et al., 1999; Barton, 1991). In the data analysis of extended X-ray absorption fine structure (EXAFS), the measured signal, which corresponds to electron momentum, is Fourier transformed into a radial distribution function around a specified element (Kikuta, 1992). After the inverse Fourier transformation of a selected peak in the radial distribution function, the filtered signal is fitted by a theoretical signal calculated using a simple cluster model (Sayers et al., 1971). The errors of the estimated interatomic distances are within ±0.01 Å. This method can be utilized for the accurate determination of the atomic positions observed in a real space reconstruction from the hologram. On the basis of this concept, we propose a new analytical method that can determine accurate interatomic distances in XFH. In this study, the potential of this method is investigated theoretically and experimentally using an Au single crystal. First, a simple dimer model of Au is used for the calculation of the hologram pattern, as shown in Figure 23a. To calculate accurate holograms, the near field effect of X-ray scattering is taken into account in all hologram calculations
FIGURE 23. The calculated holograms of a dimer. (a) The dimer model. (b) The hologram and outlines of holograms at 22.5, 24.0, and 25.5 keV on the kx–ky plane. (c) The one-dimensional holograms averaged along the ky direction.
based on Bai’s (2003) method. The interatomic distance is 2.884 Å, which corresponds to the Au–Au bond length in bulk crystal. Here, we assumed that the left and right Au atoms in Figure 23a are the scatterer and emitter, respectively. Figure 23b shows the inverse hologram χ (k) of the dimer calculated at an incident X-ray energy of 22.5 keV, where k denotes the wave number vector. As can be seen, the projection of χ (k) on the kx –ky plane shows a stripe pattern along the ky -axis. Thus, this pattern can again be
projected on the kx-axis, as shown in Figure 23c (Lee et al., 1981). This is the procedure for averaging the hologram over the ky-direction, or projecting it from the surface of the sphere onto the x-axis:

\bar{\chi}_{k_x}(k_x) = \frac{\displaystyle\iint_S \chi(\mathbf{k})\, dk_y\, dk_z}{\displaystyle\iint_S dk_y\, dk_z}.   (17)
The one-dimensional hologram of the single scatterer is a cosine-like wave, whose amplitude decreases with increasing kx due to the scattering-angle dependence of the atomic structure factor. The radius of the hologram in k-space increases with the wave number of the incident X-rays, which is proportional to the X-ray energy. In addition to the hologram at 22.5 keV, the outlines of holograms at 24.0 and 25.5 keV are displayed in Figure 23b. Here, the minimum values of kx of these holograms are defined to be 0. In this case, the phases of these holograms coincide, as shown in Figure 23c, and therefore the sum of the amplitudes of the holographic oscillations from a single scatterer simply increases with the number of holograms, even at different energies. To evaluate the feasibility of the present data analysis procedure using a single crystal as a sample, we calculated the inverse holograms of a large Au cluster. The structure of the Au cluster is a face-centered cubic (fcc) cell with a = 4.079 Å, and the cluster contains 16,726 atoms within a 40-Å radius around the emitter. First, the inverse holograms were calculated for unpolarized radiation. The incident energies ranged from 22.5 to 26.0 keV in steps of 0.5 keV. The kx-direction is defined to be the same as the crystallographic [110] direction of the Au cluster. A one-dimensional hologram projected on the kx-axis could be obtained using Eq. (17). However, as opposed to the case of the dimer model, the hologram of the large cluster has a complex two-dimensional pattern, which is well known to researchers in the field of atom-resolved holography. In principle, the processed one-dimensional single energy hologram includes the holographic signals of the atoms located on the x-axis; however, scattering patterns from the other atoms, not located on the x-axis, remain (Lee et al., 1981). These components are canceled out by the summation of the holograms at different energies. Figure 24 shows the calculated average of the eight holograms at energies ranging from 22.5 to 26.0 keV in steps of 0.5 keV. The displayed oscillation is similar to that seen in EXAFS. The reconstructed intensity along the x-axis was obtained by a simple Fourier transformation, as shown in Figure 25. Before applying the Fourier transformation to the oscillation shown in Figure 24, we doubled the kx-range by defining χ_{kx}(−kx) = χ_{kx}(kx), because the phase of the holographic cosine-like curve at kx = 0 from any scatterer is fixed at approximately π. This increases the resolution of the peaks of the atomic images. The reconstructed
FIGURE 24. The one-dimensional hologram obtained from an average of eight holograms of the Au cluster. Incident X-ray energies for the calculation are 22.5–26.0 keV in steps of 0.5 keV.
FIGURE 25. The reconstructed intensity along the [110] direction obtained by Fourier transform of the plot shown in Figure 24.
intensity obtained up to 10 Å shows peaks corresponding to the 1/2 1/2 0, 110, and 3/2 3/2 0 atoms. In addition to these, an artifact peak appears at 1.0 Å. It is also seen in the reconstruction obtained using the Barton algorithm, and it disappears when the cluster size is small. This result corresponds to that reported by Fanchenko et al. (2002), namely, the existence of a low-frequency component corresponding to distant atoms in a large sample. The
FIGURE 26. The fitted holographic signal of the 1/2 1/2 0 atom (solid line). The dotted line represents the hologram of the dimer.
1/2 1/2 0 peak was inverse Fourier transformed, for which the R-range of the filter window was between 2.46 and 3.08 Å. The holographic signal obtained and that of the dimer in Figure 23c are shown in Figure 26. It can be seen that the phases of these oscillations coincide up to 17 Å⁻¹. Here, we defined the fitting function as

\chi'_{k_x}(k_x) = \sum_{i=0}^{5} a_i k_x^i \cdot \cos(a_6 k_x + a_7),   (18)
where a_i (i = 0–7) are fitting parameters. This function is simply the product of a fifth-degree polynomial and a cosine function. Using Eq. (18), we fitted both oscillations and determined the parameters a_i. The k-range of the fitting was 0–16.5 Å⁻¹. The value of a_6, which sets the period of the cosine curve, is a measure of the interatomic distance. The estimated values of a_6 for the cluster and the dimer were 0.1653 and 0.1658 Å⁻¹, respectively. If the a_6 of the dimer corresponds to the actual Au–Au bond length of 2.884 Å, the value for the cluster is calculated to be 2.893 Å, indicating that the difference is 0.009 Å. This feasibility study of the inverse Fourier analysis method using the large Au cluster model demonstrated the determination of accurate interatomic distances from XFH data. The interatomic distances for the 100, 110, 3/2 1/2 0, 200, and 3/2 3/2 0 atoms were also estimated using the present method and were in good agreement with the actual values within an accuracy of ±0.01 Å.
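The inverse Fourier analysis just described — projection by Eq. (17), mirroring about kx = 0, Fourier transformation, windowing of a single peak, back-transformation, and fitting with Eq. (18) — can be sketched as follows. The synthetic test signal, the window limits, the FFT conventions, and the initial guesses are illustrative assumptions of this sketch, not the actual Au data or the author's code:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_model(kx, a0, a1, a2, a3, a4, a5, a6, a7):
    """Eq. (18): fifth-degree polynomial envelope times a cosine."""
    poly = a0 + a1*kx + a2*kx**2 + a3*kx**3 + a4*kx**4 + a5*kx**5
    return poly * np.cos(a6 * kx + a7)

def filtered_oscillation(kx, chi_1d, r_window):
    """Mirror chi(kx) about kx = 0, transform, keep one R-range, transform back."""
    chi_sym = np.concatenate([chi_1d[::-1], chi_1d])        # chi(-kx) = chi(kx)
    spectrum = np.fft.rfft(chi_sym)
    r = np.fft.rfftfreq(chi_sym.size, d=kx[1] - kx[0]) * 2.0 * np.pi
    spectrum[(r < r_window[0]) | (r > r_window[1])] = 0.0    # R-range filter window
    return np.fft.irfft(spectrum, n=chi_sym.size)[chi_sym.size // 2:]

# Synthetic test: a damped cosine standing in for a first-neighbor signal.
kx = np.linspace(0.0, 17.0, 512)
chi_1d = 1e-3 * np.exp(-0.05 * kx) * np.cos(2.884 * kx + np.pi)
filtered = filtered_oscillation(kx, chi_1d, r_window=(2.5, 3.2))
popt, _ = curve_fit(fit_model, kx, filtered,
                    p0=[1e-3, 0, 0, 0, 0, 0, 2.9, np.pi], maxfev=20000)
print("fitted a6 =", popt[6])   # approximately the 2.884 used to build the signal
```

For real data, the fitted a_6 would then be calibrated against the dimer value, as described above, to convert it into an interatomic distance.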
2. Demonstration by Experimental Holograms

The data analysis procedure described in the previous section was applied to the experimental hologram data of an Au single crystal. The experiment was carried out using beam line BL39XU at the third-generation synchrotron radiation facility SPring-8. The incident energies were in the range of 22.5–30.0 keV in steps of 0.5 keV. The intensity of the incident beam was monitored using an ionization chamber. The data were collected in the inverse mode. The Au Lα (9.712 keV) X-ray fluorescence was analyzed by a cylindrical LiF crystal and detected using an avalanche photodiode. The count rate of the X-ray fluorescence was approximately two million cps. The fluorescence intensities were measured as functions of φ and θ1 within the ranges 0° ≤ φ ≤ 360° and 0° ≤ θ1 ≤ 76°; θ2 was fixed at 45°. We recorded 16 holograms at different energies in this experiment. For data handling, we extended the holograms to the full sphere using the measured X-ray standing wave lines and the crystal symmetry of the sample. To obtain the atomic image, the multiple-energy reconstruction described by Barton was applied to these hologram data (Barton, 1991). The arrangement of atoms in the reconstruction clearly shows the fcc structure. We plotted the reconstructed intensity of the 1/2 1/2 0 atom and its real part in a radial direction. The maximum peak intensity is located at 2.95 Å from the central Au atom. This value is 0.07 Å larger than the actual position of the first-neighbor atom (2.884 Å). This accuracy is sufficient for qualitative analysis, but not for quantitative analysis.
FIGURE 27. (a) The one-dimensional hologram obtained from an average of 16 experimental holograms of the Au single crystal and (b) its Fourier transform. Incident X-ray energies were 22.5–30.0 keV in steps of 0.5 keV. The kx direction is the same as the crystallographic [110] direction of the Au cluster.
The filter windows for the 1/2 1/2 0 and 110 atoms were 2.52–3.20 and 5.55–6.34 Å, respectively. By fitting Eq. (18) to these cosine-like curves, the interatomic distances to the 1/2 1/2 0 and 110 atoms from the emitter were evaluated as being 2.888 and 5.772 Å, respectively.
FIGURE 28. (a) The filtered holographic signal of the 1/2 1/2 0 and (b) 110 atoms (dotted lines). Solid lines show curves fitted using Eq. (18).
Since the actual interatomic distances for the 1/2 1/2 0 and 110 atoms were 2.884 and 5.768 Å, respectively, the accuracy of the atomic positions for both images was within an error of ±0.01 Å, which is equivalent to that obtained by EXAFS (Matsushita et al., 2004). Since the Au–Au bond length estimated from the reconstruction using the Barton algorithm was 0.07 Å larger than the actual value, the inverse Fourier transformation technique markedly refines the determined atomic positions. According to this procedure, we also estimated the interatomic distance from the emitter to the 3/2 3/2 0 atom. The value obtained was 8.531 Å, which is 0.121 Å lower than the actual value.
TABLE 2. Values of interatomic distances estimated from experimental XFH data, actual interatomic distances, R-range of Fourier fittings, and k-range for curve fittings.

Atom        Experimental (Å)   Actual (Å)   R-range (Å)   k-range (Å−1)
1/2 1/2 0   2.888              2.884        2.52–3.20     0–16.9
100         4.092              4.079        3.58–5.20     7.3–11.8
110         5.772              5.768        5.55–6.34     0–8.4
3/2 1/2 0   6.436              6.448        6.23–7.07     0–10.1
200         8.152              8.157        6.96–9.21     3.9–5.6
3/2 3/2 0   8.531              8.652        8.31–8.98     0–4.5
The accuracy obtained for this atomic image is very poor compared to those for the 1/2 1/2 0 and 110 atoms. The peak corresponding to the 3/2 3/2 0 atom shown in Figure 27b is quite small compared to the other ones, and therefore it is contaminated more strongly by the artifacts, which causes a modulation in the hologram of the 3/2 3/2 0 atom. The estimated interatomic distances and the parameters used are summarized in Table 2. The interatomic distances of neighboring atoms around an emitter were estimated from 16 experimental holograms of an Au single crystal, and most of them were in good agreement with the actual values within an error of 0.3%. This reveals that local lattice distortions, e.g., the environment around dopants in electronic materials, can be quantitatively evaluated using experimental XFH data. Moreover, the present technique has the potential to provide information on the type of element and the Debye–Waller factor as well as the interatomic distance.
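The windowing step described in this subsection (Fourier transforming the one-dimensional hologram, selecting an R-window around a single atomic peak, and transforming back to an isolated cosine-like oscillation) can be sketched as follows. The grid, the synthetic input, and the exact mapping of the Fourier variable to R are illustrative assumptions; the window limits are those listed in Table 2 for the 1/2 1/2 0 atom.

```python
import numpy as np

# one-dimensional hologram chi(kx) sampled on a uniform grid (synthetic stand-in)
kx = np.linspace(0.0, 17.0, 2048)                       # 1/Angstrom
chi = 0.010*np.cos(2*2.884*kx) + 0.004*np.cos(2*5.768*kx)

# Fourier transform with respect to kx; the conjugate variable plays the role of R
spectrum = np.fft.rfft(chi)
R = np.fft.rfftfreq(kx.size, d=kx[1] - kx[0]) * np.pi   # cos(2*k*r) then peaks near R = r

# rectangular window around the 1/2 1/2 0 peak (2.52-3.20 Angstrom, Table 2)
keep = (R >= 2.52) & (R <= 3.20)
filtered = np.where(keep, spectrum, 0.0)

# back-transform: cosine-like oscillation of the selected atomic image only,
# which is then fitted with Eq. (18) to refine the interatomic distance
chi_half_half_0 = np.fft.irfft(filtered, n=kx.size)
```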
IV. APPLICATIONS

A. Ultrathin Film

Most electronic devices are fabricated by using the technique of epitaxial growth on single-crystal substrates. Since these film samples have a translational order similar to a single crystal, XFH is applicable. Here, the application to an L10-ordered FePt ultrathin film, which has a large magnetic moment anisotropy, is described (Takahashi et al., 2003e). Its atomic arrangement is illustrated in Figure 29. Shima et al. (2002) in the Takanashi group at Tohoku University studied L10-ordered FePt films, which are prepared at low temperatures below 503 K.
FIGURE 29. Schematic drawing of atomic arrangements around Fe atoms in the L10-ordered FePt film.
In a study of the same kind of magnetic films, the correlation between magnetic moment anisotropies and long-range chemical order was suggested (Gehanno et al., 1998; Kamp et al., 1998). However, the relevant parameter for the magnetic properties should be the short-range directional chemical order, which can be studied only with the XFH technique or by more indirect methods. Here, holograms of an FePt film are measured and atomic images around Fe atoms are reconstructed. The FePt film was prepared by the Takanashi group using a UHV deposition system with two independent e-guns. A 10-Å-thick Fe seed layer was deposited on an MgO(001) substrate, and subsequently a 400-Å-thick Pt buffer layer was epitaxially grown at 343 K. Monoatomic layers of Fe and Pt were then alternately deposited at 503 K, 50 times. Since the sample contained Fe and Pt, holograms of Fe or Pt could be recorded by detecting the Fe K or Pt L lines, respectively. The amount of Pt was larger in the Pt buffer layer than in the FePt layer. Therefore, in a reconstruction from the Pt holograms, the structural image of the buffer layer would be more dominant than that of the FePt layer. A 10-Å seed layer of Fe also exists between the substrate and the Pt buffer layer; as opposed to the Pt buffer layer, however, it does not affect the Fe holograms of the FePt layer because the fluorescence from the Fe seed layer is negligibly weak. Hologram measurements of the FePt film were carried out at BL37XU in SPring-8. Nine incident energies were selected from 9.50 to 11.50 keV with 0.25-keV steps, between the Fe K (7.11 keV) and Pt L3 (11.56 keV) absorption edges, so as not to excite the Pt atoms. The incident beam was monochromatized using an Si(111) double-crystal monochromator, and an Rh-coated mirror was used to suppress higher harmonic X-rays. The beam size at the sample was adjusted to be 0.3 mm along the horizontal direction and 0.5 mm along the vertical direction.
FIGURE 30. Reconstructed image from Fe holograms of FePt film. (a) (001) and (b) (002) planes.
The incident beam intensity was monitored by detecting X-rays scattered from a polyimide film of 125 µm thickness using an Si PIN photodiode. Fe Kα fluorescence emitted from the FePt was analyzed by a cylindrically bent graphite crystal and focused on an avalanche photodiode. Intensities of the Fe Kα fluorescence were measured as a function of the azimuthal angle φ (0° ≤ φ ≤ 360°) and θ1 (0° ≤ θ1 ≤ 70°). The φ rotation speed was 3.0°/sec, and the fluorescence intensity was integrated over 0.33°, corresponding to a 0.1-sec sampling time; θ1 was rotated discretely in 1.0° steps. The count rate of the Fe Kα line was over 1,000,000 cps at any θ1 value. The measured holograms show X-ray standing wave lines indicating a fourfold symmetry. To improve the statistical accuracy, the symmetry-equivalent data were summed up, referring to these XSW lines. The atomic environment around Fe was reconstructed using the Barton algorithm. Images of the (001) and (002) planes reconstructed from the nine holograms are shown in Figure 30a and b. Pt atomic images in the (002) plane are reconstructed at the positions predicted from knowledge of the bulk crystal structure of FePt. However, the Fe atomic images shift outward by 0.8 Å. At present, the reason for this shift is still unclear; clarifying it will help us obtain new structural knowledge of the present sample. The experiment demonstrated here proved that the hologram of a 200-Å ultrathin film can be recorded within a reasonable measurement time using third-generation synchrotron radiation. The structure of this kind of thin film can, however, also be evaluated by grazing-incidence X-ray diffraction. XFH is advantageous for a thinner layer, such as a few atomic layers buried in a multilayer. In that case, the intensity of fluorescence will not be sufficient to complete the hologram recording within a reasonable measurement time; this will be resolved by the use of a toroidal-type graphite analyzer (Sekioka et al., 2005).
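The symmetry summation mentioned above, in which the measured pattern is folded using the fourfold symmetry indicated by the XSW lines, can be sketched as follows for a hologram stored on a regular (θ1, φ) grid; the grid layout and the function name are assumptions for illustration.

```python
import numpy as np

def fourfold_average(holo):
    """Average a hologram over 90-degree rotations about the film normal.

    holo: 2D array indexed as holo[theta_index, phi_index], with the phi axis
    covering 0-360 degrees on a grid whose length is divisible by 4.
    """
    n_phi = holo.shape[1]
    if n_phi % 4:
        raise ValueError("phi grid must be divisible by 4 for a fourfold average")
    shift = n_phi // 4
    rotations = [np.roll(holo, i * shift, axis=1) for i in range(4)]
    return np.mean(rotations, axis=0)
```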
B. Dopants

1. GaAs:Zn

We performed the hologram measurement of dopants for the first time. We carried out the holography experiment on Zn atoms in GaAs at two different energies using synchrotron radiation. In this experiment, a multielement SSD was used to record the holograms. The hologram measurement was carried out using the synchrotron beam line BL10XU at SPring-8. The synchrotron radiation from an undulator was monochromatized by a Si(111) double-crystal monochromator. The Zn concentration in the wafer was determined to be 1.0 × 10^19 atoms/cm3 (0.02 mass%) by a Hall measurement. The diameter and thickness of the sample were 50.0 and 0.25 mm, respectively. The incident X-ray energies were 9.7 and 10.0 keV, which lie between the Zn and Ga K absorption edges, so as to avoid excitation of the Ga and As X-ray fluorescence. The beam size was 1 × 1 mm2. The 19-element SSD was placed parallel to the incident X-ray electric field. The intensity of the Zn Kα X-ray fluorescence was measured as a function of φ and θ within the ranges of 0° ≤ φ ≤ 360° and 26° ≤ θ ≤ 60°. The total integrated intensity of the Zn Kα X-ray fluorescence at each pixel was about 200,000 counts. The total measurement time for one hologram was about 10 h. Figure 31a and b shows the two-fold averaged and low-pass filtered holographic patterns at 9.7 and 10.0 keV, respectively. From these holograms, we reconstructed atomic images of the plane 1.41 Å above the emitter Zn atom, the plane parallel to {001} containing the Zn atom, and the plane 1.41 Å below the emitter, which are here termed planes A, B, and C, respectively. Figure 32a shows the three-dimensional view consisting of the reconstructed images of planes A, B, and C. At plane B, the 1/2 1/2 0, -1/2 1/2 0, 1/2 -1/2 0, and -1/2 -1/2 0 atoms are clearly seen, and the distances between the intensity maxima of these atoms and the emitter are all 4.02 Å. The crystal structure of GaAs is the zinc blende structure with a = 5.65 Å; that is, it consists of two interpenetrating fcc lattices. The Ga and As layers stack alternately along the c-axis; they are separated by 1.41 Å. The atomic configuration of the Ga layer is the same as that of the As layer, and the nearest Ga–Ga or As–As distances are 4.00 Å. Thus, the Zn atoms are found to substitute for a Ga or As site. This result agreed well with the EXAFS one (Kitano et al., 1989). At plane A, strong images of the 1/4 1/4 1/4, -1/4 -1/4 1/4, 3/4 3/4 1/4, and -3/4 -3/4 1/4 atoms and weak images of the -1/4 1/4 1/4, 1/4 -1/4 1/4, -3/4 3/4 1/4, and 3/4 -3/4 1/4 atoms are seen, revealing that the Zn atoms substituted selectively for one site, either Ga or As. The possibility of As-site substitution may be negligible because of charge neutrality. Figure 32b shows a possible model of the atomic arrangement around the Zn atoms.
FIGURE 31. Holograms of Zn in GaAs recorded at the incident energies of (a) 9.7 and (b) 10.0 keV.
Since the intensity of the X-rays scattered from the atoms lying on plane A is stronger than that from those on plane C, the 1/4 1/4 1/4, -1/4 -1/4 1/4, 3/4 3/4 1/4, and -3/4 -3/4 1/4 atoms are considered to be the real images, and the -1/4 1/4 1/4, 1/4 -1/4 1/4, -3/4 3/4 1/4, and 3/4 -3/4 1/4 atom-like images are twin images of the 1/4 -1/4 -1/4, -1/4 1/4 -1/4, 3/4 -3/4 -1/4, and -3/4 3/4 -1/4 atoms existing on plane C, respectively. Holographic twin images necessarily appear in the image reconstructed from a single-energy hologram and are suppressed by reconstruction from multiple-energy hologram data.
FIGURE 32. (a) Holographic reconstruction of an environment around the Zn atom. (b) Model of a local environment around the Zn atom.
Twin image suppression becomes effective with an increase in the number of holograms recorded at different energies. However, since we measured only two holograms in our experiment, this effect is small. The -1/4 -1/4 -1/4, 1/4 1/4 -1/4, -3/4 -3/4 -1/4, and 3/4 3/4 -1/4 atomic images appearing at plane C in Figure 32a are the twin images of the 1/4 1/4 1/4, -1/4 -1/4 1/4, 3/4 3/4 1/4, and -3/4 -3/4 1/4 atoms at plane A, respectively. Since the intensity of each twin image is nearly equal to that of the paired real image, the twin image suppression is not confirmed for these atoms. In contrast, the real images of the 1/4 -1/4 -1/4, -1/4 1/4 -1/4, 3/4 -3/4 -1/4, and -3/4 3/4 -1/4 atoms at plane C are obviously stronger than their twin images displayed at plane A. This result revealed that the two-energy hologram data contribute to suppressing the twin images of the 1/4 -1/4 -1/4, -1/4 1/4 -1/4, 3/4 -3/4 -1/4, and -3/4 3/4 -1/4 atoms.
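The bookkeeping of real and twin images in the preceding discussion can be checked with a short script: the twin of an image at position r appears at -r, so testing whether -r coincides with an actual As site around a Ga-site emitter tells which plane-A and plane-C spots are real atoms and which are artifacts. The sketch below is an illustrative check and not part of the original analysis; positions are given in units of the lattice constant.

```python
import numpy as np

# As sublattice of zinc blende GaAs around a Ga-site emitter at the origin
fcc = np.array([[0, 0, 0], [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5]])
shifts = np.array([[i, j, k] for i in (-1, 0) for j in (-1, 0) for k in (-1, 0)])
as_sites = (np.array([0.25, 0.25, 0.25]) + fcc[:, None, :] + shifts[None, :, :]).reshape(-1, 3)

def is_as_site(r, tol=1e-6):
    """True if the fractional position r coincides with an As site near the emitter."""
    return bool(np.any(np.all(np.abs(as_sites - r) < tol, axis=1)))

# strong plane-A images quoted in the text: all of them are genuine As positions
for r in [(0.25, 0.25, 0.25), (-0.25, -0.25, 0.25), (0.75, 0.75, 0.25), (-0.75, -0.75, 0.25)]:
    print(r, "real As site:", is_as_site(np.array(r)))

# weak plane-A images: not As sites themselves, but their inversions -r are As sites on plane C
for r in [(-0.25, 0.25, 0.25), (0.25, -0.25, 0.25)]:
    print(r, "As site:", is_as_site(np.array(r)), "| twin source -r is As site:", is_as_site(-np.array(r)))
```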
2. Si:Ge

The doping of crystalline Si has played an important role in the fabrication of advanced semiconductor devices (Pearsall, 1911; Matsuura et al., 1991), which requires state-of-the-art tailoring of the band gap. To understand the nature of the doping-induced electronic states, it is essential to study the local structures around impurities in the doped semiconductor. The properties of the SiGe system have had a significant impact on device applications and on the basic science of semiconductor materials, which includes the original concept of direct optical transitions in an indirect semiconductor with a superlattice-induced band structure, providing optoelectronic capabilities for integration with standard VLSI (Pearsall, 1911). Accordingly, numerous XAFS studies on the SiGe system have been conducted (Matsuura et al., 1991; Aldrich et al., 1994; Woicik et al., 1998; Aubry et al., 1999; Wei et al., 1997). In many cases, coordination numbers and bond lengths of the first shell were estimated accurately. Analysis of the second or third shell is difficult, however, because the second-shell signal strongly overlaps the third-shell signal and is heavily contaminated by a triangular multiple-scattering path. In XFH, by contrast, there is no such interference of the holographic signals, since the atoms of the second and third shells are imaged at different positions. In addition, multiple scattering of X-ray photons is negligible, as opposed to that of electrons in XAFS. Therefore, the use of XFH is advantageous from the viewpoint of long-range local structural analysis. Here, a Si single crystal doped with dilute Ge was used as the sample, and its X-ray fluorescence holograms were measured. The experiment was done using beam line BL47XU at the third-generation synchrotron radiation facility, SPring-8. The electron storage ring current was between 100 and 80 mA during the measurement. The synchrotron radiation from an undulator was monochromatized by an Si(111) double-crystal monochromator.
FIGURE 33. Multiple energy X-ray holograms of Si0.999Ge0.001. The displayed patterns were recorded at 14.50, 15.75, and 17.00 keV.
Si0.999Ge0.001 grown by the Czochralski method was used as the sample (Yonenaga, 1999). The dimensions of the sample were 5 × 5 × 2 mm3. Incident energies were 14.5–17.0 keV with 0.25-keV steps. A fully tuned X-ray beam at BL47XU was too brilliant to monitor its intensity with an ionization chamber; thus, we detected elastic scattering from the air with an Si PIN diode in current mode instead of using an ionization chamber. The data were collected in the inverse mode. Ge Kα (9.87 keV) X-ray fluorescence was detected by an avalanche photodiode via a cylindrical LiF crystal. The count rate of the X-ray fluorescence was about 200,000 cps. The fluorescence intensities were measured as a function of the azimuthal angle φ and polar angle θ1 within the ranges of 0° ≤ φ ≤ 360° and 0° ≤ θ1 ≤ 76°. The X-ray exit angle θ2 was fixed at 45°. We recorded 11 holograms at different energies in this experiment. For data handling, we incorporated extension of the holograms to the full sphere by using the crystal symmetries of the sample. Figure 33 shows the resulting hologram patterns at three different energies. The multiple-energy reconstruction algorithm was applied to these hologram data (Barton, 1991). The real-space image is depicted in Figure 34. The atomic images were extremely fine, and the artifacts, which were obvious in the reconstruction from a single-energy XFH, were sufficiently suppressed. Although only the atoms up to the fourth coordination shell are displayed in the figure, to avoid overcrowding it, atoms up to the seventh coordination shell were visible. The arrangement of the atoms in the reconstruction clearly shows a superposition of the two environments associated with the diamond structure, revealing that the Ge atoms lie in two distinct crystallographic sites.
FIGURE 34. 3D atomic image around Ge in Si0.999Ge0.001.
In the field of SiGe alloys, it is well known that Ge atoms are randomly substituted for Si sites (Matsuura et al., 1991; Aldrich et al., 1994; Woicik et al., 1998; Aubry et al., 1999; Wei et al., 1997). Thus, the present result is naturally understood. Taking into account the Ge concentration, most of the atomic images were regarded as Si. Doping impurities locally cause a lattice distortion in the host crystal. XAFS is a powerful tool with which to evaluate this local lattice distortion, since it can determine the bond length within ±0.01 Å. If XFH can also determine accurate atomic positions with the same precision, it will become an even more powerful tool with which to evaluate local lattice distortion from the viewpoint of a three-dimensional atomic arrangement. Figure 35 shows the intensity variation of the nearest-neighbor atoms in a radial direction. The peak position of the experimental curve estimated by the Gaussian fitting method was 2.46 Å. This value is larger than the Ge–Ge bond length of the bulk Ge crystal. Wei et al. (1997) used XAFS to study the environment of dilute Ge in Si (Si0.994Ge0.006) and reported that the Si–Ge bond length was 2.38 Å. Thus, the reconstructed image of the nearest-neighbor atom was considered to have shifted 0.08 Å outward compared to the predicted value. To check this, we created a 98-atom Si cluster model, at the center of which a Ge atom was placed, and calculated holograms at 14.5–17.0 keV with 0.25-keV steps. During the calculation, the interatomic distance between Ge and the first-neighbor Si atoms, rGeSi, was varied, and reconstructions were obtained from the theoretical holograms. When rGeSi in the cluster was set to 2.38 Å, the position of the reconstructed peak from the calculated hologram was in good agreement with that from the experimental hologram, as shown in Figure 35.
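The Gaussian estimate of the first-neighbor peak position quoted above (2.46 Å) corresponds to fitting the radial intensity profile with a Gaussian plus a constant background; a minimal sketch, using a synthetic profile in place of the curve of Figure 35, is given below.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(r, amp, r0, sigma, offset):
    """Gaussian peak plus constant background for the radial intensity profile."""
    return amp * np.exp(-0.5 * ((r - r0) / sigma) ** 2) + offset

# radial grid (Angstrom) and a synthetic profile standing in for the experimental curve
r = np.linspace(1.5, 3.5, 200)
profile = gaussian(r, 1.0, 2.46, 0.30, 0.05)
profile += 0.01 * np.random.default_rng(0).normal(size=r.size)

p0 = [profile.max(), r[np.argmax(profile)], 0.3, 0.0]
popt, _ = curve_fit(gaussian, r, profile, p0=p0)
print("peak position (Angstrom):", popt[1], " FWHM (Angstrom):", 2.355 * abs(popt[2]))
```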
FIGURE 35. Reconstructed intensities of single first neighbor atoms.
We confirmed that the position of the nearest-neighbor atoms was hardly affected by the cluster size or by the distances of the second-nearest and more distant atoms from the emitter. On the other hand, the FWHMs of the experimental and theoretical reconstructed images (0.71 and 0.44 Å, respectively) were different. The main origin of this peak broadening is considered to be the thermal vibration of the nearest Si atoms. Application of the inverse Fourier analysis described in Section III.F to the present data is now in progress.

C. Quasicrystal

It is well known that quasicrystals lack long-range translational order. The atomic decoration is modeled from chemical considerations. Surface and projected atomic images are obtained by scanning probe microscopy (SPM) and transmission electron microscopy (TEM), respectively. To obtain a bulk picture of the atomic order, X-ray methods have to be used. Although traditional crystallographic measurements show strong peaks in well-defined directions, the atomic positions cannot be derived directly. Marchesini et al. (2000) recorded the local atomic structure in the icosahedral quasicrystal Al70.4Pd21Mn8.6. Although nonperiodic, the quasicrystal had sufficient orientational order to allow direct visualization of the arrangement of atoms by XFH. The experiment was done at the ID22 undulator beam line of the ESRF. The inverse hologram of the Mn fluorescence was recorded at an incident X-ray energy of 16 keV. The scanned ranges of the sample orientation were 0° ≤ φ ≤ 360° and 0° ≤ θ1 ≤ 70°.
FIGURE 36. Reconstructed real-space image around Mn sites in the quasicrystal Al70.4Pd21Mn8.6. (From Marchesini et al., 2000.)
The Mn fluorescence was selected by a cylindrical crystal analyzer ribbon with a small diameter. This led to a shorter beam path, reducing the air absorption. The measured hologram pattern was extended to a 4π full sphere using the observed X-ray standing wave pattern lines. The real space was reconstructed by the Helmholtz–Kirchhoff integral transformation, as depicted in Figure 36. Since the present sample has several different Mn sites, the displayed image can be regarded as an overlap of the images of the different environments around Mn. Moreover, the real-twin cancellation effect due to the single-energy hologram appears in the reconstruction. From Figure 36, it is found that the 12 highest-intensity spots are at the corners of an icosahedron, and that the distance of these spots from the central atom is about 4.6 Å. From the chemistry and material density, these spots cannot be the first atomic neighbors; a specific coordination shell around a specific Mn site seems to be emphasized. To understand these observations, Marchesini et al. tabulated the average scattering factor and distances for the first 10 atomic shells about the central Mn atoms according to one of the existing models of this quasicrystal. They indicated that the highest-intensity spots are a combination of the fifth coordination shell of Pd, the third of Al, and the first of Mn. Furthermore, closer shells are found to be much less occupied, and as a consequence, they cannot be seen at all. These findings are the first direct evidence for the well-established model of the quasicrystal. The demonstration shows that structural analysis by atomic-resolution holography of solids with orientational order in the absence of periodicity is possible.
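As a rough illustration of the Helmholtz–Kirchhoff reconstruction used here, and of its multiple-energy extension due to Barton in which holograms recorded at several wave numbers are summed before the modulus is taken, the sketch below back-propagates holograms sampled on a (θ, φ) grid to a single real-space point. The sign convention, discretization, and normalization are assumptions for illustration only.

```python
import numpy as np

def image_intensity(chi_list, k_list, theta, phi, r):
    """Helmholtz-Kirchhoff back-propagation of one or more holograms to the point r.

    chi_list: 2D arrays chi[theta_index, phi_index] on a uniform (theta, phi) grid
    k_list:   wave numbers (1/Angstrom) of the corresponding holograms
    r:        real-space position (Angstrom), e.g. a candidate atomic site
    """
    th, ph = np.meshgrid(theta, phi, indexing="ij")
    khat = np.stack([np.sin(th) * np.cos(ph), np.sin(th) * np.sin(ph), np.cos(th)])
    d_omega = np.sin(th) * (theta[1] - theta[0]) * (phi[1] - phi[0])   # solid-angle element
    amplitude = 0.0 + 0.0j
    for chi, k in zip(chi_list, k_list):
        phase = np.exp(1j * (k * np.linalg.norm(r) - k * np.tensordot(khat, r, axes=([0], [0]))))
        amplitude += np.sum(chi * phase * d_omega)   # adding several energies suppresses twins
    return np.abs(amplitude) ** 2
```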
D. Complex X-Ray Holography

The twin image in single-energy holography results from the fact that the observable fluorescence intensities include only a cosine term of the complex interference in the holographic function. The twin image distorts atomic images, which makes it difficult to determine atomic positions accurately. γ-Ray holography proposed by Korecki et al. (2001) can measure a complex hologram by combining two holograms measured at two symmetrical energies detuned from a Mössbauer resonance. The complex hologram includes the complete phase information, and its reconstruction can provide a real-space image free from the twin-image problem. The idea of complex X-ray holography can be utilized in inverse XFH by adopting resonant X-ray scattering. Omori et al. (2001) proposed resonant X-ray fluorescence holography, which enables us to reconstruct an atomic image with element selectivity using differential holograms near an absorption edge; however, this technique still has the twin-image problem. We therefore proposed a complex X-ray holography using resonant X-ray scattering, which can solve the twin-image problem and provides elemental identification in the real-space image. The atomic scattering factor fj is expressed as fj = f0j + f1j + if2j, where f0j is the atomic form factor, and f1j and f2j are the real and imaginary parts of the anomalous dispersion term, respectively. The holographic intensity of Eq. (5) can then be rewritten in the form

    χ(k) ≅ −2 Σj (re/rj) { [f0j(θkrj) + f1j] cos(−k·rj − krj) − f2j sin(−k·rj − krj) }.    (19)
Let us assume a binary compound crystal consisting of elements X and Y, where the K absorption edge of Y lies on the higher-energy side of that of X. Figure 37 shows the X-ray anomalous dispersion terms of the Y element. Three incident X-ray energies, EA, EB, and EC, are selected. The relations between the dispersion terms at these energies are f1B ≈ f1A and f2B ≈ f2C. Δf1 and Δf2 are defined as Δf1 = f1C − f1B and Δf2 = f2A − f2B, respectively.
FIGURE 37. Example of a selection of X-ray anomalous dispersion terms for CXH.
The real and imaginary parts of the complex hologram of element Y are derived from the holograms χA, χB, and χC by the following equations:

    [χB(k) − χC(k)] / Δf1 ≈ 2re Σj cos(φk,aj) / aj = χreal,
    [χA(k) − χB(k)] / Δf2 ≈ 2re Σj sin(φk,aj) / aj = χimag.    (20)
The complex hologram χcmplx is expressed as χcmplx = χreal + iχimag. Applying Eq. (4) to χcmplx, the real-space image can be reconstructed. A feasibility study was performed by computer simulation using a 34-atom GaAs cluster (Takahashi et al., 2003a). The reconstruction clearly shows only the As atoms around Ga at the theoretical atomic positions within an accuracy of ±10−2 Å. In the above simulation, Δf1 and Δf2 are 3.41 in electron units, which is about 10% of the total scattering factor of As. This corresponds to about a 0.01% difference in fluorescence intensity, which is comparable to the XFH signals from light elements. Hologram measurements were carried out at the high-intensity undulator beamline BL37XU in SPring-8, Harima, Japan. Incident beams monochromatized by an Si(111) double-crystal monochromator with an Rh-coated mirror were used. Incident beam intensities were monitored by detecting X-rays scattered from a 125-µm-thick polyimide film with an Si PIN photodiode. A GaAs(110) single crystal was used as the sample.
FIGURE 38. XAFS spectrum at the As K edge of a GaAs single crystal. The As K absorption edge and the three incident X-ray energies for recording the holographic patterns are shown by arrows. (From Takahashi et al., 2003b.)
Images of Ga and As reconstructed by XFH cannot be distinguished because of their close atomic numbers. The As K absorption edge was precisely determined from the X-ray absorption fine structure (XAFS) of the present sample. Figure 38 shows the XAFS spectrum at the edge. The As K absorption edge was defined as the inflection point. Three holograms (χA, χB, χC) were recorded at 11.872 keV (EA, +5 eV), 11.865 keV (EB, −2 eV), and 11.767 keV (EC, −100 eV) around the As K absorption edge (11.867 keV). The GaAs single crystal was cooled to 100 K to suppress atomic thermal vibrations. Ga Kα fluorescence (10.367 keV) emitted from the sample was focused by the cylindrical graphite analyzer, which was set in the pure inverse XFH geometry (Adams et al., 2000), and detected by an Si PIN diode. The intensities were measured as a function of the azimuthal angle φ (0° ≤ φ ≤ 360°) and polar angle θ1 (0° ≤ θ1 ≤ 70°). The φ rotation speed was 1.0°/sec, and the fluorescence intensity was integrated over a 1-sec sampling time; θ1 was rotated discretely in 1.0° steps. It took 6 h to measure one hologram.
FIGURE 39. Three-dimensional atomic images reconstructed from (a) the single-energy (11.767 keV) hologram and (b) the complex hologram. Black circles show the Ga emitter position. Purple and orange images show first-neighbor As and Ga atoms, respectively. The region below 50% of the maximum intensity observed in the whole image volume is cut off. (From Takahashi et al., 2003b.)
The observed hologram data were corrected for background fluorescence and normalized. Both the real and imaginary parts of the phase of the X-rays scattered by the As atoms were derived from Eq. (20). The experimental anomalous dispersion terms of As, i.e., (f1B, f1C) = (−9.3, −5.5) and (f2A, f2B) = (8.6, 3.8), were used for the calculation in Eq. (20). After 4π full extension of the complex hologram, the real space was reconstructed by applying Eq. (4). Three-dimensional atomic images around a Ga emitter reconstructed from the complex hologram are compared with those reconstructed from the single-energy (11.767 keV) hologram in Figure 39. In the single-energy reconstruction, both Ga and As images exist (Figure 39a); moreover, the As atomic images at 1/4 1/4 1/4 shift outward considerably. However, since the complex hologram includes only the As scattering component, the Ga atomic images disappear and the reconstructed image represents isolated As images, as shown in Figure 39b. A tetrahedral configuration of As atoms around a Ga emitter is reconstructed with an accuracy of ±0.01 Å. In this experiment, the second-neighbor As atoms could not be observed. This demonstration revealed that complex X-ray holography with resonant scattering is feasible using an extremely strong X-ray beam. The result (Takahashi et al., 2003b) shows that the present technique has the potential for further improvement of the accuracy of a reconstructed image as well as for elemental identification of the neighbor atoms around a certain emitter.
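To make the use of Eq. (20) concrete, the sketch below combines three holograms recorded at EA, EB, and EC into the real and imaginary parts of the complex hologram, using the experimental dispersion values quoted above. The placeholder arrays and variable names are assumptions for illustration.

```python
import numpy as np

# placeholders for three normalized holograms on the same (theta, phi) grid
chi_A = np.zeros((71, 360))
chi_B = np.zeros((71, 360))
chi_C = np.zeros((71, 360))

# experimental anomalous dispersion terms of As (electron units) quoted in the text
f1B, f1C = -9.3, -5.5
f2A, f2B = 8.6, 3.8
delta_f1 = f1C - f1B        # = 3.8
delta_f2 = f2A - f2B        # = 4.8

# Eq. (20): pairwise differences divided by the dispersion differences isolate the
# cosine (real) and sine (imaginary) parts of the As scattering contribution
chi_real = (chi_B - chi_C) / delta_f1
chi_imag = (chi_A - chi_B) / delta_f2

# complex hologram used for the twin-free reconstruction via Eq. (4)
chi_cmplx = chi_real + 1j * chi_imag
```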
V. RELATED METHODS

A. π XAFS

The samples for XFH need translational symmetry in their atomic arrangements, like single crystals and epitaxial thin films. Thus, the X-ray fluorescence holograms of powder and amorphous materials cannot be measured. However, structural information on amorphous materials and powders can be obtained by an application of XFH. The technique presented here gives a radial distribution function around a specified element, like those obtained by EXAFS and anomalous X-ray scattering (AXS), though the signal is extremely weak. For a single cluster, the angular anisotropy of the fluorescence intensity is calculated by Eq. (3). For a powder sample, the fluorescence intensity is the angular average of Eq. (3), that is,

    I(k) ≅ ∫Ω { 1 − 2Re Σj [re f(θkrj)/(krj)] e^(i(−k·rj − krj)) } dσk,    (21)

where dσk = k² cos θ δθ δφ, and θ and φ are the polar and azimuthal angles in the polar coordinate system. The fluorescence intensity as a function of φ and θ obtained from this equation is constant, which is the reason why the hologram cannot be measured for powder and amorphous samples. However, based on this equation, Nishino and Materlik (1999) and Nishino et al. (2001) found that the angular-averaged fluorescence intensity varies with the change of the incident X-ray energy, which is expressed as

    χ̄ = −(re/k) Im Σj≠0 [fj(π)/rj²] e^(2ikrj).    (22)

In this equation, only backward-scattering contributions remain, and therefore χ̄ oscillates with the phase 2krj. As opposed to photoelectron interference by neighboring atoms in EXAFS, this phenomenon is caused by photon interference of the incident beam. Thus, this technique is called photon interference X-ray absorption fine structure (π XAFS). I assumed a randomly oriented powder of the CuI polycrystal and calculated the energy variation of the fluorescence intensity of Cu. The amplitude of the obtained oscillation is of the order of 10−5. Compared to the EXAFS oscillation of over 10−3, the π XAFS signal is very weak. However, the damping rate of XAFS with respect to the incident energy is higher than that of π XAFS. Therefore, 2 keV above the X-ray absorption edge, the signal of π XAFS is higher than that of XAFS. Since π XAFS oscillates with the phase 2krj, a radial distribution function can be obtained by Fourier transformation, similar to XAFS.
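The statement that χ̄ oscillates with phase 2krj and can be Fourier transformed into a radial distribution function can be illustrated with the toy calculation below; the constant backscattering factor, the neighbor shells, and the grids are assumptions rather than the parameters used by Nishino et al.

```python
import numpy as np

r_e = 2.818e-5                          # classical electron radius in Angstrom
k = np.linspace(5.0, 12.0, 4000)        # photon wave numbers (1/Angstrom)

# toy neighbor shells (distance in Angstrom, multiplicity) and a constant f(pi)
shells = [(2.77, 12), (3.92, 6), (4.80, 24)]
f_pi = 30.0 + 0.0j

# Eq. (22): only backward scattering survives the angular average
chi_bar = np.zeros_like(k)
for r_j, n_j in shells:
    chi_bar += -(r_e / k) * n_j * np.imag(f_pi * np.exp(2j * k * r_j)) / r_j**2

# Fourier transform with respect to 2k; peaks of |ft| appear near the shell radii
window = np.hanning(k.size)
ft = np.fft.rfft(chi_bar * window)
radius = np.fft.rfftfreq(k.size, d=2 * (k[1] - k[0])) * 2 * np.pi
peak = radius[np.argmax(np.abs(ft[1:])) + 1]
print("strongest RDF peak near", round(peak, 2), "Angstrom")
```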
FIGURE 40. π XAFS of a Pt foil. (From Nishino et al., 2001.)
Nishino et al. improved the π XAFS formulation by considering elastic scattering in addition to photoelectric absorption (Korecki et al., 2004b). A feasibility study of π XAFS was performed at the bending-magnet beamline X1 of HASYLAB at Deutsches Elektronen-Synchrotron (DESY). This beamline is optimized for high-precision X-ray absorption experiments. The X-rays were monochromatized by an Si(311) double-crystal monochromator, which was stabilized using a remotely controlled digital monochromator stabilization. A high-purity 25-µm-thick Pt foil was used, which ensures that a signal of the order of 10−4 is not influenced by impurities. The incident energy was varied from 14.5 to 22.0 keV in 3-eV steps over a wide region above the Pt L1 absorption edge. Measurements were performed in transmission mode at room temperature. The high-frequency Pt L1 EXAFS signal fades out as the incident energy increases, and the fine structure of π XAFS becomes dominant in the energy region around 16.0 keV and above. Figure 40 shows the structure of π XAFS in the energy region from 15.5 to 23 keV. Theoretical signals from simulations using Eq. (22) are also plotted in this figure; the cluster for the simulations had a radius of 600 Å. Figure 41 shows the Fourier transform of the π XAFS oscillation with respect to 2k. The observed peaks correspond to the radial coordinates of neighbor atoms. The peak positions of the experimental data in Figure 41 agree well with those from the simulation up to a radial distance of 15 Å. The present demonstration proved that π XAFS directly provides short-range-order structural information. Moreover, compared with EXAFS, π XAFS can obtain the RDF to a longer distance from the absorbing atoms.
FIGURE 41. Fourier transform of π XAFS of a Pt foil. (From Nishino et al., 2001.)
If the sample contains only a few elements, we can obtain the local structure around specified elements from the π XAFS oscillation by measuring the X-ray fluorescence yield.

B. γ-Ray Holography

The greatest difficulty with the X-ray holography experiment is the very low contrast. This problem does not exist in γ-ray holography. γ-Ray holography was first suggested by Tegze and Faigel, and it was first performed by Korecki et al. γ-Rays from nuclei having a low-lying excited state undergo absorption and resonant scattering by nuclei due to a Mössbauer transition; for example, 57Fe nuclei are excited by the γ-ray photons of a 57Co source. γ-Ray holography uses both the absorption and the resonant scattering phenomena, and its principle is the same as that of inverse XFH. The γ-rays reach the absorbing nuclei either directly or after resonant scattering on neighboring nuclei. Since the cross section of γ-ray resonant scattering is two orders of magnitude higher than that of normal X-ray scattering (Thomson scattering), γ-ray holography has the advantage of a large contrast of the hologram pattern. This significantly reduces the number of counts necessary. Nuclei that have absorbed γ-rays deexcite by γ-ray, X-ray, or electron emission; the latter two emissions are due to the internal conversion effect. The first demonstration by Korecki et al. (1997) measured the γ-ray hologram of a 57Fe epitaxial film by detecting electrons, and about 2% holographic oscillation was obtained. These processes take place within resonance lines that are only 10−8–10−9 eV wide. Thus, the multiple- and two-energy X-ray holography techniques, which resolve the twin-image problem of a single-energy hologram, are not applicable.
The problem of the twin image was solved by complex γ-ray holography, as described in Section II.D.3. The phase of the scattered γ-rays shifts upon detuning from the resonance. The real and imaginary parts of a complex hologram can be obtained from the difference and the sum of the patterns recorded symmetrically below and above the resonance. Therefore, two holograms have to be recorded for each transition. Korecki et al. (2001) performed the complex γ-ray holography experiment using a 57Fe epitaxial film four years after their first demonstration and successfully obtained a fine real-space reconstruction without twin images. The most promising feature of γ-ray holography is the ability to record holograms characteristic of ions in different chemical environments, characterized by different hyperfine interaction parameters. Korecki et al. (2004b) chose a magnetite (Fe3O4) thin film as a sample for γ-ray holography and carried out site-selective holographic imaging. In magnetite, iron ions occupy interstitial tetrahedral and octahedral positions with respect to the cubic oxygen lattice of the inverse spinel structure that is found above the Verwey temperature Tv ∼ 125 K. Tetrahedral A sites are occupied by eight Fe3+ ions, whereas octahedral B sites are randomly occupied by eight Fe3+ and eight Fe2+ ions. The Mössbauer spectrum of magnetite shows two Zeeman sextets corresponding to Fe3+ in the A site and to Fe2.5+, i.e., cations with an average valency, in the B site. Complex γ-ray holograms were recorded by detuning below and above the exact resonances of the Fe cations in the A and B sites. Full sets of hologram data were taken over 4.6 months. Figure 42 shows the 3D real-space images of the iron arrangements in magnetite. The complex holograms properly visualize the ionic arrangements corresponding to the expected positions in the Fe sublattices of the inverse spinel to within 0.15 Å, with a spatial resolution of 0.6 Å. Since there are two equivalent orientations of the local environment for A sites and four equivalent ones for B sites, the reconstruction shows linear combinations of all the environments. Nuclei imaged by the real and imaginary holograms are shown using different color scales to emphasize that these holograms are formed in scattering processes having different phase shifts. The present reconstructed images agree well with the images resulting from simulations. Small discrepancies were also recognized between the experiment and the simulations (Korecki and Szymoński, 2002); these were thought to result from the imperfect resonances. In all of Korecki's experiments, holograms were recorded by detecting conversion electrons from the nuclei. However, the nuclei detecting the wave field of the γ-rays can be replaced by atoms emitting fluorescence. In the case of an FeS sample, a γ-ray hologram using the Mössbauer effect of Fe could in principle be recorded by detecting the S fluorescence yield, because the energy of the S K absorption edge is lower than that of the Mössbauer transition of Fe.
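The construction of the complex γ-ray hologram from the two detuned measurements described above can be sketched as follows. The text states only that the difference and the sum of the two patterns give the two parts of the complex hologram, so the assignment to the real and imaginary parts, the normalization, and the placeholder arrays are assumptions.

```python
import numpy as np

# placeholders for oscillation patterns recorded symmetrically below and above the resonance
chi_below = np.zeros((71, 360))
chi_above = np.zeros((71, 360))

# difference and sum of the detuned patterns (assignment to real/imaginary is assumed)
chi_real = 0.5 * (chi_below - chi_above)
chi_imag = 0.5 * (chi_below + chi_above)

# complex hologram carrying the phase information needed for a twin-free reconstruction
chi_cmplx = chi_real + 1j * chi_imag
```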
FIGURE 42. 3D real space images of iron arrangements in magnetite. Images were reconstructed from the holograms recorded for the nuclear resonance characteristic of (a) Fe3+ and (b) Fe2.5+ ions. The hot color scale corresponds to images reconstructed from real holograms, while the cold color scale corresponds to images reconstructed from imaginary holograms. (From Korecki and Szymoński, 2002.)
γ-Ray holography has the advantages of a high contrast of the holograms and site selectivity. However, it is limited to special isotopes (Mössbauer nuclei). This restriction could be partially lifted by the use of nuclear resonant scattering beamlines at synchrotrons.

C. Neutron Holography

Atomic resolution holography has been developed using electron, X-ray, and γ-ray waves at sub-Å wavelengths. In addition to these, demonstrations of neutron holography in the normal and inverse modes were performed using single-crystal samples by Sur et al. in 2001 and by Cser et al. in 2002, respectively. Neutron holography was realized by detecting incoherent neutron scattering, or the γ-ray emission from nuclei that have captured neutrons, instead of the X-ray fluorescence in XFH. In normal-mode neutron holography, a pointlike inner source of monochromatic spherical neutron waves is realized by a nucleus having an extremely high incoherent scattering cross section, since incoherent scattering redistributes the incident neutron waves isotropically (Cser et al., 2001).
FIGURE 43. Reconstructions of the planes (a) +0.9 Å and (b) −1.7 Å from the hydrogen atom (the origin). (From Sur et al., 2001.)
Hydrogen has a large incoherent neutron scattering cross section, about 80 barns. This value is two orders of magnitude larger than the cross sections of many other nuclei. This approach was demonstrated with a single-crystal sample of simpsonite, Al4Ta3O13(OH), by Sur et al. (2001). Since simpsonite contains only one H atom in the unit cell, overlaps of atomic images of different environments can be avoided. The hologram measurement was conducted on the N5 instrument located at the National Research Universal Reactor (Chalk River, Canada). The data were collected during 10 days of measurements with neutrons of 1.3 Å wavelength from a (113) reflection of a germanium monochromator. The angular ranges for scanning were 0° ≤ φ ≤ 360° and 17° ≤ θ2 ≤ 83° with 2° × 2° pixels. The Bragg peaks were excluded from the hologram data. The structure of simpsonite contains two layers of oxygen atoms +0.9 Å above and −1.4 Å below the hydrogen atom along the direction of the c-axis. Figure 43 shows the reconstructed planes of these layers. The vertices of the triangles indicate oxygen atoms. In this normal neutron holography study, the positions of seven oxygen atoms in simpsonite were identified and validated by comparison with the results of X-ray structure investigations. On the other hand, Cser et al. (2002) used the prompt γ-ray emission from Cd nuclei to confirm the feasibility of inverse neutron holography. The sample was a spherically shaped single crystal of Pb0.9974Cd0.0026. The absorption cross section of Cd is more than four orders of magnitude larger than that of Pb, so that the Cd atoms act as highly efficient detectors. Since the Cd concentration is very low, virtually all lattice sites surrounding any one Cd atom are expected to be occupied by Pb atoms. The Pb nuclei act as the object, while the cadmium nuclei serve as detectors sensing the internal field of the neutron wave. The experiment was carried out on the D9 instrument of the Institut Laue-Langevin (Grenoble).
The neutron wavelength was λ = 0.84 Å. The sample was set on a four-circle diffractometer and was rotated about the angle θ1 through a range of 45° and about φ through a range of 354° during the measurement. The angular step widths for φ and θ1 were 3°. The prompt γ-rays emitted from the Cd nuclei were detected by scintillation counters. The data collection time was about 18 h. Reconstruction from the restored hologram clearly displays the 12 equidistant nearest-neighbor atoms around a Cd detector nucleus. The lattice parameter a = 4.93 Å obtained from the holographic data is in very good agreement with the values determined in the usual way by X-ray and neutron diffraction measurements. The results introduced here show the advantages of neutron holography. Although a large number of materials can be studied by the present technique, exploring the environments around hydrogen in hydrogen storage materials is particularly important. Moreover, recording holograms using magnetic scattering may provide a new perspective on the investigation of magnetic materials.
VI. SUMMARY AND OUTLOOK

The theory, experiments, and applications of XFH have been presented in the preceding sections. Today, TEM and SPM are well-known tools for visualizing atoms in solids and are used routinely in various fields of science. However, since the atomic images obtained by these methods are projection and surface images, determination of the precise arrangements of atoms is not easy. XFH can provide isotropic three-dimensional atomic images around specific elements within a radius of 10 Å. Electron emission holography, which was realized five years earlier than XFH, gives similar 3D atomic images. However, its difficulties are strong forward scattering, multiple scattering, and phase shift effects, which often cause image distortions and artifacts. In XFH, these effects are negligible, and therefore the reconstructed images represent the actual atomic arrangements. Since hard X-rays penetrate the sample to micrometer-order depths, not only surfaces but also the bulk can be analyzed, in contrast to electron emission holography. For these reasons, analysis of the environment around dopants in single crystals is a promising application. Moreover, in strongly correlated electron systems with a layered perovskite structure, such as cuprates and manganites, many researchers have investigated the relations between local lattice distortion and the electronic or magnetic properties (Oyanagi and Bianconi, 2001). XFH will help clarify this question. At an early stage of the XFH experiments, the weakness of the holographic signals was the most serious problem, but this has been resolved by the use of third-generation synchrotron radiation and a fast X-ray detector.
As represented by light-atom imaging, elemental identification using resonant X-ray scattering, improvement of spatial resolution, and the precision of atomic positions, the experimental and data-processing techniques have developed rapidly in the past decade. Furthermore, γ-ray and neutron holography methods were developed after the appearance of XFH, revealing that the field of atomic-resolution holography is growing step by step. In the future, I think that the development of a fitting-based reconstruction algorithm will be important (Marchesini and Fadley, 2003). This will provide quantitative information on the occupancy of atoms at each site and the atomic positions within an accuracy of 0.01 Å. I believe that the potential of XFH is much greater than its present performance and that it will clarify the unknown structures of materials, which cannot be obtained from other structural analyses. I hope that many researchers will become interested in the present technique and use it.
R EFERENCES Adams, B., Novikov, D.V., Hiort, T., Materlik, G. (1998). Atomic holography with X-rays. Phys. Rev. B 57, 7526–7534. Adams, B., Nishino, Y., Materlik, G. (2000). A novel experiment technique for atomic X-ray holography. J. Synch. Rad. 7, 274–279. Aldrich, D.B., Nemanich, R.J., Sayers, D.E. (1994). Bond-length relaxation in Si1−x Gex alloys. Phys. Rev. B 50, 15026–15033. Aubry, J.C., Tyliszczak, T., Hitchcock, A.P., Baribeau, J.-M., Jackman, T.E. (1999). First-shell bond lengths in Six Ge1−x crystalline alloys. Phys. Rev. B 59, 12872–12883. Bai, J. (2003). Atomic scattering factor for a spherical wave and near-field effects in X-ray fluorescence holography. Phys. Rev. B 68, 144109. Barton, J.J. (1988). Photoelectron holography. Phys. Rev. Lett. 61, 1356–1359. Barton, J.J. (1991). Removing multiple scattering and twin images from holographic images. Phys. Rev. Lett. 67, 3106–3109. Batterman, B.W. (1969). Detection of foreign atom sites by their X-ray fluorescence scattering. Phys. Rev. Lett. 22, 703–705. Bedzyk, M.J., Materlik, G. (1985). Two-beam dynamical diffraction solution of the phase problem: A determination with X-ray standing wave. Phys. Rev. B 32, 6456–6463. Bortolani, V., Celli, V., Marvin, A.M. (2003). Multiple-energy X-ray holography: Polarization effect. Phys. Rev. B 67, 024102. Cheng, L., Fenter, P., Bedzyk, M.J., Sturchio, N.C. (2003). Fourier-expansion solution of atom distributions in a crystal using X-ray standing waves. Phys. Rev. Lett. 90, 255503. Cser, L., Krexner, G., Török, Gy. (2001). Atomic-resolution neutron holography. Europhys. Lett. 54, 747–752.
Cser, L., Török, Gy., Krexner, G., Sharkov, I., Faragó, B. (2002). Holographic imaging of atoms using thermal neutrons. Phys. Rev. Lett. 89, 175504. Fanchenko, S.S., Novikov, D.V., Schley, A., Materlik, G. (2002). Invalidity of low-pass filtering in atom-resolving X-ray holography. Phys. Rev. B 66, R060104. Gabor, D. (1948). A new microscopic principle. Nature 161, 777–778. Gehanno, V., Revenant-Brizard, C., Marty, A., Gilles, B. (1998). Studies of epitaxial Fe0.5 Pd0.5 thin films by X-ray diffraction and polarized fluorescence absorption spectroscopy. J. Appl. Phys. 84, 2316–2323. Gog, T., Len, P.M., Materik, G., Bahr, D., Fadley, C.S., Sanchez-Hanke, C. (1996). Multiple-energy X-ray holography: Atomic images of hematite (Fe2 O3 ). Phys. Rev. Lett. 76, 3132–3135. Harp, G.R., Saldin, D.K., Tonner, B.P. (1990). Scanned-angle X-ray photoemission holography with atomic resolution. Phys. Rev. B 42, 9199–9202. Hayashi, K., Miyake, M., Tobioka, T., Awakura, Y., Suzuki, M., Hayakawa, S. (2001a). Development of apparatus for multiple energy X-ray holography at SPring-8. Nucl. Instrum. Methods Phys. Res. A 467/468, 1241–1244. Hayashi, K., Matsui, M., Awakura, Y., Kaneyoshi, T., Tanida, H., Ishii, M. (2001b). Local-structure analysis around dopant atoms using multiple energy X-ray holography. Phys. Rev. B 63, R41201. Hiort, T., Novikov, D.V., Kossel, E., Materlik, G. (2000). Quantitative assessment of X-ray fluorescence holography for bcc Fe as a test case. Phys. Rev. B 61, R830–R833. Kamp, P., Marty, A., Gilles, B., Hoffman, R., Marchesini, S., Belakovsky, M., Boeglin, C., Purr, H.A., Dhesi, S.S., Laan, G.V., Rogalev, A. (1998). Correlation of spin and orbital anisotropies with chemical order in Fe0.5 Pd0.5 alloy films using magnetic circular dichroism. Phys. Rev. B 59, 1105–1112. Kikuta, S. (1992). X-Ray Diffraction and Scattering, vol. 1. University of Tokyo Press, Tokyo. (Japanese). Kishimoto, S., Adachi, H., Ito, M. (2001). A cooled avalanche photodiode detector for X-ray magnetic diffraction experiments. Nucl. Instrum. Methods Phys. Res. Sect. A 467, 1171–1174. Kitano, T., Watanabe, H., Matsui, J. (1989). Existence of interstitially Zn atoms in GaAs:Zn grown by the liquid-encapsulated Czochralski technique. Appl. Phys. Lett. 54, 2201–2203. Kopecky, M., Busetto, E., Lausi, A., Miculin, M., Savoia, A. (2001). Recording of X-ray holograms on a position-sensitive detector. Appl. Phys. Lett. 78, 2985–2987. Korecki, P., Szymo´nski, M. (2002). Three-dimensional imaging of local atomic and magnetic structure in compound epitaxial films with γ -ray holography. Surf. Sci. 507–510, 422–428. Korecki, P., Koreski, J., Slezak, T. (1997). Atomic resolution γ -ray holography using the Mössbauer effect. Phys. Rev. Lett. 79, 3518–3521.
Korecki, P., Materlik, G., Korecki, J. (2001). Complex γ -ray hologram: Solution to twin images problem in atomic resolution imaging. Phys. Rev. Lett. 86, 1534. Korecki, P., Novikov, D.V., Tolkiehn, M., Materlik, G. (2004a). Extinction effects in X-ray holographic imaging with internal reference. Phys. Rev. B 69, 184103. ´ Korecki, P., Azymo´nski, M., Korecki, J., Slezak, T. (2004b). Site-selective holographic imaging of iron arrangement in magnetite. Phys. Rev. Lett. 92, 205501. Kossel, W., Loeck, V., Voges, H. (1935). Die Richtungsverteilung der in einen Kristall entstandenen charakteristischen Röntgenstrahlung. Z. Phys. 94, 139–144. Lee, P.A., Citrin, P.H., Eisenberg, P., Kincard, B.M. (1981). Extended X-ray absorption fine structure—its strength and limitations as a structural tool. Rev. Mod. Phys. 53, 769–806. Leith, E.N., Upatnieks, J. (1965). Wavefront reconstruction with continuous tone objects. J. Opt. Soc. Am. 53, 1377–1381. Len, P.M., Gog, T., Novikov, D., Eisenhower, R.A., Materlik, G., Fadley, C.S. (1997). Multiple energy X-ray holography: Incident-radiation polarization effects. Phys. Rev. B 56, 1529–1539. Marchesini, S., Belakovsky, M., Baron, A.Q.R., Faigel, G., Tegze, M., Kamp, P. (1998). Standing wave and Kossel line patterns in structure determination. Solid State Commun. 105, 685–687. Marchesini, S., Schmithüsen, F., Tegze, M., Faigel, G., Calvayrac, Y., Belakhovsky, M., Chevrier, J., Simionovici, A. (2000). Direct 3D imaging of Al70.4 Pd21 Mn8.6 quasicrystal local atomic structure by X-ray holography. Phys. Rev. Lett. 85, 4723–4726. Marchesini, S., Ulrich, O., Faigel, G., Tegze, M., Belakhovsky, M., Simionovicim, A.S. (2001). Instrumental development of X-ray atomic holography. Nucl. Instrum. Methods Phys. Res. Sect. A 457, 601–606. Marchesini, S., Mannella, N., Fadley, C.S., Van Hove, M.A., Bucher, J.J., Shuh, D.K., Fabris, L., Press, M.H., West, M.W., Stolte, W.C., Hussain, Z. (2002). Holographic analysis of diffraction structure factors. Phys. Rev. B 66, 094111. Marchesini, S., Fadley, C.S. (2003). X-ray fluorescence holography: Going beyond the diffraction limit. Phys. Rev. B 67, 024115. Matsushita, T., Agui, A., Yoshige, A. (2004). A new approach for three dimensional atomic image reconstruction from a single-energy photoelectron hologram. Europhys. Lett. 65, 207–213. Matsuura, M., Tonnerre, J.M., Calgill III, G.S. (1991). Lattice parameters and local atomic-structure of Si-rich SiGe/Si(100) films. Phys. Rev. B 44, 3842– 3849.
Nishino, Y., Materlik, G. (1999). Holographies and EXAFS in quantum electrodynamics. Phys. Rev. B 60, 15074. 61 (2000) 14845(E). Nishino, Y., Tröger, L., Korecki, P., Materlik, G. (2001). Photon interference X-ray absorption fine structure. Phys. Rev. B 64, 201101. Nishino, Y., Ishikawa, T., Hayashi, K., Takahashi, Y., Matsubara, E. (2002). Two-energy twin image removal in atomic-resolution X-ray holography. Phys. Rev. B 66, 092105. Omori, S., Zhao, L., Marchesini, S., Van Hove, M.A., Fadley, C.S. (2001). Resonant X-ray fluorescence holography: Three-dimensional atomic imaging in true color. Phys. Rev. B 65, 014106. Oyanagi, H., Bianconi, A. (2001). Physic in Local Lattice Distortions. AIP Conference Proceedings, vol. 554. American Institute of Physics, New York. Pearsall, T.P. (1911). Si–Ge alloys and superlattices for optoelectronics. Mater. Sci. Eng. B 9, 225–231. Proctor, A., Sherwood, P.M. (1982). Data-analysis techniques in X-ray photoelectron spectroscopy. Anal. Chem. 54, 13–19. Sayers, D.E., Stern, E.A., Lytle, A.W. (1971). New technique for investigating noncrystalline structures—Fourier analysis of extended X-ray absorption fine structure. Phys. Rev. Lett. 27, 1204–2107. Sekioka, T., Hayashi, K., Matsubara, E., Takahashi, Y., Hayashi, T., Terasawa, M., Mitamura, T., Iwase, A., Michikami, O. (2005). Atomic imaging in EBCO superconductor films by an X-ray holography system using a toroidally bent graphite analyzer. J. Synch. Rad. 12, 530–553. Shima, T., Moriguchi, T., Mitani, S., Takanashi, K. (2002). Low-temperature fabrication of L1(0) ordered FePt alloy by alternate monatomic layer deposition. Appl. Phys. Lett. 80, 288–290. Sur, B., Rogge, R.B., Hammond, R.P., Anghel, V.N.P., Katsuras, J. (2001). Atomic structure holography using thermal neutrons. Nature 414, 525– 527. Szöke, A. (1986). Short wavelength coherent radiation: Generation and applications. In: Attwood, D.T., Boker, J. (Eds.), AIP Conference Proceedings, vol. 147. American Institute of Physics, New York, pp. 361–367. Takahashi, Y., Hayashi, K., Matsubara, E. (2003a). Complex X-ray holography. Phys. Rev. B 68, 052103. Takahashi, Y., Hayashi, K., Matsubara, E. (2003b). Elemental identification of a three dimensional environment by complex X-ray holography. Phys. Rev. B 71, 134107. Takahashi, Y., Hayashi, K., Matsubara, E., Shima, T., Takanashi, K. (2003c). X-ray fluorescence holography of atomically controlled magnetic thin film. Experiment Report of Nanotechnology in SPring-8 2, 19–20.
Takahashi, Y., Hayashi, K., Wakoh, K., Nishiki, N., Matsubara, E. (2003d). Development of laboratory X-ray fluorescence holography equipment. J. Mater. Res. 18, 1471–1473. Takahashi, Y., Hayashi, K., Matsubara, E., Shima, T., Takanashi, K., Mori, T., Tanaka, M. (2003e). A new technique for study of local atomic environment in artificially grown magnetic thin film. Scripta Mater. 48, 975–979. Takahashi, Y., Hayashi, K., Matsubara, E. (2004). Development and application of laboratory X-ray fluorescence holography equipment. Powder Diffraction 19, 77–80. Tegze, M., Faigel, G. (1991). Atomic resolution X-ray holography. Europhys. Lett. 16, 41–46. Tegze, M., Faigel, G. (1996). X-ray holography with atomic resolution. Nature 380, 49–51. Tegze, M., Faigel, G. (2001). X-ray holography: Theory and experiment. J. Phys.: Condens. Matter 13, 10613–10623. Tegze, M., Faigel, G., Marchesini, S., Belakhovsky, M., Chumakov, A.I. (1999). Three dimensional imaging of atoms with isotropic 0.5 Å resolution. Phys. Rev. Lett. 82, 4847–4850. Tegze, M., Faigel, G., Marchesini, S., Belakhovsky, M., Ulrich, O. (2000). Imaging light atoms by X-ray holography. Nature 407, 38. Wei, S.Q., Oyanagi, H., Kawanami, H., Sakamoto, K., Sakamoto, T., Tamura, K., Sami, N.L., Usaki, K. (1997). Local structures of isovalent and heterovalent dilutre impurities in Si crystal proved by fluorescence X-ray absorption fine structure. J. Appl. Phys. 82, 4810–4815. Woicik, J.C., Miyano, K.E., King, C.A., Johnson, R.W., Pellegrino, J.G., Lee, T.-L., Lu, Z.H. (1998). Phase-correct bond lengths in crystalline Gex Si1−x alloys. Phys. Rev. B 57, 14592–14595. Yonenaga, I. (1999). Czochralski growth of GeSi bulk alloy crystals. J. Cryst. Growth 198/199, 404–408.
A Taxonomy of Color Image Filtering and Enhancement Solutions

RASTISLAV LUKAC AND KONSTANTINOS N. PLATANIOTIS

Multimedia Laboratory—BA 4157, The Edward S. Rogers Sr. Department of ECE, University of Toronto, Toronto, Ontario M5S 3G4, Canada
I. Introduction . . . 188
II. Color Imaging Basics . . . 190
III. Image Noise . . . 193
   A. Natural Image Noise . . . 193
   B. Noise Modeling . . . 194
      1. Sensor Noise . . . 195
      2. Transmission Noise . . . 197
IV. Color Image Filtering . . . 199
   A. Noise-Reduction Techniques . . . 202
      1. Order-Statistic Theory for Color Vectors . . . 202
      2. Component-Wise Median Filters . . . 205
      3. Vector Median Filters . . . 207
      4. Vector Directional Filters . . . 212
      5. Selection Weighted Vector Filters . . . 215
      6. Data-Adaptive Vector Filters . . . 218
      7. Adaptive Multichannel Filters Based on Digital Paths . . . 220
      8. Switching Filtering Schemes . . . 223
      9. Similarity Based Vector Filters . . . 226
      10. Adaptive Hybrid Vector Filters . . . 228
   B. Performance Evaluation of the Noise Reduction Filters . . . 231
      1. Objective Evaluation . . . 231
      2. Subjective Evaluation . . . 233
   C. Inpainting Techniques . . . 234
   D. Image Sharpening Techniques . . . 235
   E. Image Zooming Techniques . . . 239
   F. Applications . . . 241
      1. Virtual Restoration of Artworks . . . 241
      2. Television Image Enhancement . . . 243
V. Edge Detection . . . 244
   A. Scalar Operators . . . 245
      1. Gradient Operators . . . 248
      2. Zero-Crossing-Based Operators . . . 249
   B. Vector Operators . . . 250
   C. Evaluation Criteria . . . 253
      1. Objective Evaluation Approach . . . 254
      2. Subjective Evaluation Approach . . . 255
VI. Conclusion . . . 257
References . . . 257
ISSN 1076-5670/05  DOI: 10.1016/S1076-5670(05)40004-X
Copyright 2006, Elsevier Inc. All rights reserved.
VI. Conclusion . . . 257
References . . . 257
I. INTRODUCTION

The purpose of this chapter is to present the state-of-the-art in color image filtering and enhancement techniques, and to discuss in a systematic and comprehensive way the most important developments in the field of color image filtering. The perception of color is of paramount importance since human observers routinely use color information to sense the environment, recognize semantic objects of interest, and convey information about the environment. Color image processing is concerned with the manipulation of digital color images on electronic devices, such as computers and digital cameras, through the utilization of digital signal processing methods. Digital image processing, a well-established research discipline with many important practical applications, focuses on the conversion of a continuous image field into an equivalent digital form. The synthesis of images from the signals arising from various sensor systems is accomplished by a digital process directed to transforming the signal into a form allowing visual or machine perception. The requirements for an ideal conversion system are usually expressed in terms of certain technical properties, such as the resolution of the imaging systems, photometric accuracy, quantization levels, intensity of intrinsic noise, and many others.

In this chapter, Section II briefly reviews the fundamentals of the trichromatic theory of color representation, emphasizing the connections between colorimetric properties and vectorial representation in a three-dimensional (3D) color space. This part also includes Section III, in which synthetic color image noise models are reviewed. The purpose is to provide insights into the fundamentals of color image formation and a basic understanding of the objectives behind the development of the various color image filtering schemes. The analysis of the image noise in digital image acquisition systems often focuses on random noise sources, such as those associated with quantum signal detection (shot noise) and signal-independent fluctuations (dark current, readout noise). Another important source of image noise is the inhomogeneity of the responsiveness of the sensor elements and signal disturbances that introduce repeatable patterns into image data.

The second part of the chapter, Section IV, is devoted to color image processing solutions, focusing on noise reduction and image enhancement.
FIGURE 1. Image processing chain.
The correction of the signal distortions is a digital process by which disturbances introduced by the sensor are rectified, with the goal being to obtain the image or, generally, the signal that corresponds as closely as possible to the output of an ideal imaging system. Thus, correcting signal artifacts, in practice, means adjusting the characteristics of the imaging system to meet specific demands of the human observer or the computer vision system (Figure 1). Particular emphasis is placed on the so-called nonlinear vector processing solutions that follow the order-statistic framework. Color images are nonlinear in nature due to the presence of structural information, and are perceived through the human visual system, which has strong nonlinear characteristics. Nonlinear methods are able to preserve important color structural elements and eliminate degradations occurring during signal formation or transmission through nonlinear channels, and they have proven efficient in the suppression of impulsive, Gaussian, and mixed types of noise, which, as will be discussed in Section III, are assumed present in perceived color data. State-of-the-art solutions including data-adaptive filters, weighted vector processing filters, filters utilizing the concept of digital paths in the color image, fuzzy logic principles, and switching filtering concepts are reviewed, commented upon, and taxonomized. The connection between noise filtering and other frequently used color image processing tasks, such as color image inpainting, sharpening, and spatial interpolation, is also provided.

The second part of Section IV deals with the problem of performance evaluation in the context of color image processing. Improvement of the quality of images has always been one of the central tasks of digital image processing. In modern terms, improvements in sensitivity, resolution, and noise reduction have equated higher quality with greater informational throughput. Image noise is an unwanted feature that is either contained in the relevant light signal or added by the imaging process, and its assessment involves a precise evaluation of the light signal distribution. Section IV also includes applications of the reviewed filtering and enhancement solutions. Examples and experimental results included in the chapter indicate that the operators presented are computationally attractive, yield good performance, and are able to preserve color information and fine details while removing noise and visual impairments.
The last part of the chapter, Section V, shows the close relationship between color image filters and edge-based image analysis. Edges convey essential information about a scene. Determination of object boundaries is important in many areas such as visual communication, medical imaging, dactyloscopy, quality control, photogrammetry, and intelligent robotic systems. Thus, edge detection—a process of transforming an input digital image into an edge map—is a common component in image processing systems. By modifying the robust order-statistic concepts reviewed in Section IV, a number of efficient, easily applicable, edge detectors can be designed. Such edge operators can be used to detect the edge information and fine details not only in conventional color images, but also in emerging microarray image processing.
II. COLOR IMAGING BASICS

To utilize color as a visual cue in multimedia, image processing, computer graphics, and computer vision applications, an appropriate method for representing color signals is needed. Since human vision is based on three types of color photoreceptor cone cells, three numerical components are necessary and sufficient to describe a color, specified by a three-component vector (Wyszecki and Stiles, 1982).

As shown in Figure 2, a K_1 × K_2 RGB color image x : Z^2 → Z^3 represents a two-dimensional matrix of three-component samples (pixels) x_{(p,q)} = [x_{(p,q)1}, x_{(p,q)2}, x_{(p,q)3}] occupying the spatial location (p, q), with p = 1, 2, . . . , K_1 and q = 1, 2, . . . , K_2 denoting the image row and column, respectively. In the color vector x_{(p,q)}, the value x_{(p,q)k}, for k = 1, 2, 3, defined in the integer domain Z, denotes the kth spectral component of the vector and takes an integer value ranging from 0 to 2^B − 1 in a B-bits-per-component representation (typically B = 8 in standard RGB color images). Namely, x_{(p,q)1} signifies the R component, x_{(p,q)2} denotes the G component, and x_{(p,q)3} indicates the B component. A large value of x_{(p,q)k} denotes a high contribution of the kth primary in the color vector x_{(p,q)}. The process of displaying an image creates a graphic representation of the image matrix where the pixel values represent particular colors.

Each individual channel {x_{(·,·)k}} of a color image x can be considered a K_1 × K_2 monochrome image x_k : Z^2 → Z. It has been widely observed that the frequency of the G color band is close to the peak of the human luminance frequency response and, thus, the G channel elements x_{(·,·)2} contribute the most to the perception of color images by the end-user (Gunturk et al., 2002; Lukac et al., 2005e). In addition, the G color channel {x_{(·,·)2}} is the most similar to the gray-scale representation L of the color image x.
FIGURE 2. Color image representation.

FIGURE 3. RGB color cube with the Maxwell triangle.
In practice, the pixel values L_{(p,q)} of the gray-scale image L : Z^2 → Z can be obtained in one of the two following ways:

L_{(p,q)} = 0.299\, x_{(p,q)1} + 0.587\, x_{(p,q)2} + 0.114\, x_{(p,q)3}   (1)

L_{(p,q)} = (x_{(p,q)1} + x_{(p,q)2} + x_{(p,q)3}) / 3.   (2)
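As an illustration of Eqs. (1) and (2), the following short sketch (Python with NumPy; the function names and the random test image are illustrative, not part of the original text) converts an RGB image to a gray-scale representation by either the luminance weighting or the simple channel average.

```python
import numpy as np

def grayscale(x, method="luminance"):
    """Gray-scale conversion of an RGB image x of shape (K1, K2, 3).

    method="luminance" applies Eq. (1): L = 0.299 R + 0.587 G + 0.114 B.
    method="average"   applies Eq. (2): L = (R + G + B) / 3.
    """
    x = x.astype(np.float64)
    if method == "luminance":
        weights = np.array([0.299, 0.587, 0.114])
        return x @ weights
    return x.mean(axis=2)

# Example on a random 4 x 4 RGB image with 8-bit components (B = 8).
rgb = np.random.randint(0, 256, size=(4, 4, 3))
print(grayscale(rgb, "luminance").shape)   # (4, 4)
print(grayscale(rgb, "average").shape)     # (4, 4)
```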
According to the tristimulus theory of color representation, the 3D RGB vector x_{(p,q)} = [x_{(p,q)1}, x_{(p,q)2}, x_{(p,q)3}] is uniquely defined (Lukac et al., 2005a) by its length (magnitude)

M_{x_{(p,q)}} = \|x_{(p,q)}\| = \sqrt{x_{(p,q)1}^2 + x_{(p,q)2}^2 + x_{(p,q)3}^2}   (3)

and orientation (direction)

O_{x_{(p,q)}} = \frac{1}{\|x_{(p,q)}\|}\, x_{(p,q)} = \frac{1}{M_{x_{(p,q)}}}\, x_{(p,q)}   (4)
where \|O_{x_{(p,q)}}\| = 1 denotes the unit sphere defined in the vector space. The directional properties may also be expressed as the point on the Maxwell triangle, which is a triangle in 3D space that intersects the RGB color primaries in the corners of the RGB cube (Figure 3). Similarly to the definition of the vector's directionality in (4), the Maxwell triangle represents a parameterization of the chromaticity space, where each chrominance line is entirely determined by its intersection point with the Maxwell plane (Gomes and Velho, 1997; Lukac et al., 2005f). Operating on the Maxwell plane, the color vector x_{(p,q)} is expressed as the point C_{x_{(p,q)}} = [c_{x_{(p,q)1}}, c_{x_{(p,q)2}}, c_{x_{(p,q)3}}]
with coordinates

c_{x_{(p,q)k}} = \frac{x_{(p,q)k}}{x_{(p,q)1} + x_{(p,q)2} + x_{(p,q)3}}, \quad \text{for } k = 1, 2, 3   (5)

where c_{x_{(p,q)1}} + c_{x_{(p,q)2}} + c_{x_{(p,q)3}} = 1. Thus, any color image can be considered a vector field in which each vector's direction and length are related to the pixel's color characteristics and significantly influence its perception by the human observer (Lukac et al., 2005a). With respect to the previous definitions (3)–(5), color image processing can be performed using the magnitude information, the directional information, or both the magnitude and directional characteristics of the processed vectors.
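A minimal NumPy sketch of the vector representation in Eqs. (3)–(5) follows; the pixel values used here are illustrative only, and a nonzero color vector is assumed.

```python
import numpy as np

def magnitude(x):
    # Eq. (3): Euclidean length of the color vector.
    return np.linalg.norm(x)

def direction(x):
    # Eq. (4): unit vector (orientation) of the color vector.
    return x / np.linalg.norm(x)

def chromaticity(x):
    # Eq. (5): projection onto the Maxwell plane; coordinates sum to 1.
    return x / x.sum()

x = np.array([200.0, 120.0, 40.0])   # an arbitrary RGB vector
print(magnitude(x))                  # length M_x
print(direction(x))                  # unit-norm orientation O_x
print(chromaticity(x))               # [c1, c2, c3], with c1 + c2 + c3 = 1
```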
III. IMAGE NOISE

A. Natural Image Noise

Visual data are usually corrupted by noise and other impairments associated with the measurement or the transmission apparatus. These impairments (Figure 4) significantly degrade the value of the color information, decrease the perceptual fidelity, and complicate other processing and analysis tasks. For example, television images are corrupted by atmospheric interference and imperfections of the reception apparatus (Lukac et al., 2005a). Noise is introduced into digitized artworks by scanning the damaged and granulated surfaces of the original artworks (Lukac et al., 2005f). The noise floor present in cDNA microarray image data can be attributed to both source and detector noise introduced due to the nature of microarray technology (Lukac et al., 2004d).

To design a filter capable of removing image noise and producing visually pleasing color images, the effect of both the noise and the filtering structure on the desired (original) signal is usually studied in simulated, approximated conditions. Such an approach allows for both the objective and subjective evaluation of the filtering results and allows for further analysis and refinement of the processing framework. The first step in such an approach is the introduction of artificial noise to the original image for the purpose of analysis and study. To simulate the real noise effects observed in real-life digital images, the development of specialized noise models is of paramount importance (see Figures 4 and 5).
FIGURE 4. Real color image noise: (a) cDNA microarray image, (b) digitized artwork image, (c) television image.
B. Noise Modeling

In many practical applications, multichannel signals, such as color images, are corrupted by additive noise. The most commonly used model is defined (Astola et al., 1990; Lukac et al., 2005a; Plataniotis et al., 1999) as

x_{(p,q)} = o_{(p,q)} + v_{(p,q)}   (6)

where x_{(p,q)} = [x_{(p,q)1}, x_{(p,q)2}, x_{(p,q)3}] represents the observation (noisy) sample, o_{(p,q)} = [o_{(p,q)1}, o_{(p,q)2}, o_{(p,q)3}] is the desired (noise-free) sample,
v_{(p,q)} = [v_{(p,q)1}, v_{(p,q)2}, v_{(p,q)3}] is the vector describing the noise process, and (p, q) characterizes the spatial position of the samples in the image. It should be noted that v_{(p,q)} in Eq. (6) can be used to describe both signal-dependent and signal-independent additive noise.

The appearance of the noise and its influence on the image relate to its characteristics (Lukac et al., 2005a, 2005g; Plataniotis and Venetsanopoulos, 2000). Noise signals can be either periodic in nature or random. In certain cases noise signals can be described in terms of the commonly used Gaussian noise model, as is, for example, the case for the sensor thermal noise caused by thermal degeneration of materials in optical sensors. However, noise-corrupted natural images are often characterized by abrupt local changes, in which case the noise masking the true signal can be modeled as impulsive sequences, which occur in the form of short-duration, high-energy spikes attaining large amplitudes with probability higher than predicted by a Gaussian density model (Kayargadde and Martens, 1996; Zheng et al., 1993). Such impulsive-type noise is introduced to the images either by electronic interference, flaws in the data transmission procedure, or because of aging and faulty storage material. In the next few sections, the most common noise types and their mathematical models are briefly discussed.

1. Sensor Noise

A charge-coupled device (CCD) is commonly used as the sensor in most imaging devices (Sharma and Trussell, 1997) and is usually characterized by numerous noise sources such as photon shot noise, dark current shot noise, on-chip and off-chip amplifier noise, and fixed pattern noise (Holst, 1998). Among these sources, shot noise resulting from the photoelectric process, usually described in terms of Poisson statistics, can never be removed at the camera hardware level (Holst, 1998). However, considering the likely presence of all these different types of noise at the sensing apparatus, it is reasonable to assume that the overall lumped noise follows an additive Gaussian noise model with zero mean, which affects each color component and spatial image pixel position independently (Figure 5b) (Plataniotis and Venetsanopoulos, 2000; Sung, 1992). If it is further assumed, without loss of generality, that the noise variance σ is the same for all three color components in the RGB color space representation, the noise corruption process can be reduced to a scalar perturbation. Defining the magnitude of the noise vector as

M_v = \|v_{(p,q)}\| = \sqrt{v_{(p,q)1}^2 + v_{(p,q)2}^2 + v_{(p,q)3}^2}   (7)
the distribution of the M_v quantities in Eq. (7) is expressed as follows (Lukac et al., 2005a; Sung, 1992):

\Pr(M_v) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{3} 4\pi M_v^2 \exp\!\left( -\frac{M_v^2}{2\sigma^2} \right).   (8)

FIGURE 5. Simulated color image noise: (a) 256 × 256 color image Lena, (b) additive Gaussian noise with σ = 30, (c) impulsive noise with p_v = 0.10, (d) mixed noise (additive Gaussian noise with σ = 30 followed by impulsive noise with p_v = 0.10).
The perturbation due to the noise results in the creation of a cone in the RGB space as can be seen in Figure 6. Following the derivation in Sung (1992), the noise-induced vector magnitude perturbation can be mapped to an angular
perturbation A described by Lukac et al. (2005a):

\Pr(A) \approx A\, \frac{o^2}{\sigma^2} \exp\!\left( -\frac{o^2 A^2}{2\sigma^2} \right).   (9)

It is not difficult to see that in the Rayleigh distribution of Eq. (9), the mean angular perturbation A can be defined as

A \approx \sqrt{ \frac{\sigma^2 \pi}{2 o^2} }.   (10)

FIGURE 6. Angular noise margins for a color signal corrupted by sensor noise.
Using the concept of color noise expressed as an angular perturbation of the original color vector represented in a correlated vector color space such as RGB and sRGB (Stokes et al., 1996), the effect of the different filters can be roughly derived (Lukac et al., 2005a; Sung, 1992).

2. Transmission Noise

Aside from acquisition noise, natural images are corrupted by noise during transmission (Plataniotis et al., 1997; Smolka et al., 2001). Transmission noise has been found to be mostly impulsive in nature, with sources ranging from human-made (e.g., switching and interference) to natural (e.g., lightning) (Henkel et al., 1995; Neuvo and Ku, 1975; Plataniotis and Venetsanopoulos, 2000). Such an effect can be described as follows (Figure 5c)
(Lukac, 2004a; Lukac et al., 2004c):

x_{(p,q)} = \begin{cases} v_{(p,q)} & \text{with probability } p_v \\ o_{(p,q)} & \text{with probability } 1 - p_v \end{cases}   (11)

where (p, q) characterizes the image pixel position, o_{(p,q)} is the original sample, x_{(p,q)} represents the sample from the noisy image, and p_v is a corruption probability (also referred to as the percentage of corrupted pixels). The impulse v_{(p,q)} is usually considered independent from pixel to pixel and generally has a much larger (or smaller) amplitude than that of the neighboring samples in at least one of the spectral components. Alternatively, the transmission noise model can be expressed in terms of the additive model of Eq. (6) with the noise component v_{(p,q)} defined as follows (Lukac et al., 2005f, 2005g):

v_{(p,q)} = \begin{cases} v_{(p,q)} & \text{with probability } p_v \\ 0 & \text{with probability } 1 - p_v. \end{cases}   (12)

In the noise models above, color pixels were represented using vector notation. Considering the 3D nature of the color signal (Viero et al., 1994), the noise-corrupted color image can be represented as

x_{(p,q)} = \begin{cases}
[v_{(p,q)1}, o_{(p,q)2}, o_{(p,q)3}] & \text{with probability } p_v p_{v_1} \\
[o_{(p,q)1}, v_{(p,q)2}, o_{(p,q)3}] & \text{with probability } p_v p_{v_2} \\
[o_{(p,q)1}, o_{(p,q)2}, v_{(p,q)3}] & \text{with probability } p_v p_{v_3} \\
[v_{(p,q)1}, v_{(p,q)2}, o_{(p,q)3}] & \text{with probability } p_v p_{v_1} p_{v_2} \\
[v_{(p,q)1}, o_{(p,q)2}, v_{(p,q)3}] & \text{with probability } p_v p_{v_1} p_{v_3} \\
[o_{(p,q)1}, v_{(p,q)2}, v_{(p,q)3}] & \text{with probability } p_v p_{v_2} p_{v_3} \\
[v_{(p,q)1}, v_{(p,q)2}, v_{(p,q)3}] & \text{with probability } p_v p_{v_1} p_{v_2} p_{v_3} \\
o_{(p,q)} & \text{with probability } 1 - p_v
\end{cases}   (13)

where p_v is the probability that the original color image vector o_{(p,q)} is corrupted by noise, whereas p_{v_k} denotes the probability of corruption of a particular [e.g., R (k = 1), G (k = 2), and B (k = 3)] component of the vector o_{(p,q)}. In such a model, the probability values should satisfy the following condition:

p_{v_1} + p_{v_2} + p_{v_3} + p_{v_1} p_{v_2} + p_{v_1} p_{v_3} + p_{v_2} p_{v_3} + p_{v_1} p_{v_2} p_{v_3} = 1.   (14)
The transmission noise models previously discussed are not the only methods available. In Astola and Kuosmanen (1997) and Lukac et al. (2004e), a bit-error representation was used to describe the effect of the transmission noise. In particular, the noise-corrupted pixel bit-representation level is given as

x_{(p,q)k}^{j} = \begin{cases} 1 - o_{(p,q)k}^{j} & \text{with probability } p_v \\ o_{(p,q)k}^{j} & \text{with probability } 1 - p_v \end{cases}   (15)

where (p, q) denotes the sample position, j is the bit level affected by the noise process, and p_v is the bit-error probability. It is not hard to see that the pixel-level description of the noise-corrupted and noise-free (original) samples is given as follows:

x_{(p,q)k} = x_{(p,q)k}^{B-1} 2^{B-1} + x_{(p,q)k}^{B-2} 2^{B-2} + \cdots + x_{(p,q)k}^{1} 2 + x_{(p,q)k}^{0}   (16)

o_{(p,q)k} = o_{(p,q)k}^{B-1} 2^{B-1} + o_{(p,q)k}^{B-2} 2^{B-2} + \cdots + o_{(p,q)k}^{1} 2 + o_{(p,q)k}^{0}   (17)
where B denotes the bit representation of the color components o_{(p,q)k} and x_{(p,q)k}. Note that the value B = 8 should be used for standard RGB color images.

In more realistic application scenarios, color image data are corrupted by both acquisition and transmission noise (Figure 5d). To simulate such a corruption process, the mixed noise model (additive Gaussian noise followed by impulsive noise) is defined as follows (Plataniotis and Venetsanopoulos, 2000; Tang et al., 1995):

x_{(p,q)} = \begin{cases} v_{(p,q)} & \text{with probability } p_v \\ o_{(p,q)} + v^{A}_{(p,q)} & \text{with probability } 1 - p_v \end{cases}   (18)

where v^{A}_{(p,q)} is the additive Gaussian noise and v_{(p,q)} denotes the impulsive noise.

In this section, a number of models that relate original color image data to noise-corrupted signals were introduced. The models can be used either to approximate the true noise corruption mechanism or for simulation purposes. In what follows, noise filtering is posed as the reconstruction of the true image data from the observed noise-corrupted color signal readings.
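To make the noise models concrete, the sketch below (Python/NumPy; parameter values and the random test image are illustrative) corrupts an RGB image with additive Gaussian noise in the sense of Eq. (6), channel-wise impulsive noise in the spirit of Eqs. (11) and (13), and the mixed model of Eq. (18).

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian(o, sigma=30.0):
    # Additive zero-mean Gaussian sensor noise, Eq. (6): x = o + v.
    return np.clip(o + rng.normal(0.0, sigma, o.shape), 0, 255)

def add_impulsive(o, pv=0.10):
    # Impulsive (transmission) noise: with probability pv a pixel's channels
    # are replaced by random impulses (a simplification of Eqs. (11), (13)).
    x = o.copy()
    corrupted = rng.random(o.shape[:2]) < pv
    impulses = rng.integers(0, 256, o.shape)
    x[corrupted] = impulses[corrupted]
    return x

def add_mixed(o, sigma=30.0, pv=0.10):
    # Mixed model, Eq. (18): Gaussian noise followed by impulsive noise.
    return add_impulsive(add_gaussian(o, sigma), pv)

o = rng.integers(0, 256, (64, 64, 3)).astype(np.float64)  # noise-free test image
noisy = add_mixed(o)
```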
IV. COLOR IMAGE FILTERING

Noise and other impairments associated with the measurement or the transmission apparatus significantly degrade the value of the color information (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000). This usually reduces the perceptual quality and the fidelity of the images, and also decreases the performance of the task for which the image was used. Since humans and computer vision systems use color information to sense the environment, and the correct perception of color can help in different tasks of image understanding and object recognition, it is not surprising that the most common signal processing task is noise filtering (Lukac et al., 2005a). Noise filtering is an essential part of any image processing based system, whether the final information is used for human inspection or for an automatic analysis (Plataniotis et al., 1999; Plataniotis and Venetsanopoulos, 2000).
FIGURE 7. Component-wise filtering concept: (a) color channel decomposition, (b) separate filtering of the color channels, (c) projection of the separately filtered channels into the output color image.
In recent decades, several noise reduction techniques have been proposed. They can be divided into linear and nonlinear techniques (Lukac et al., 2004c; Mitra and Sicuranza, 2001; Peltonen et al., 2001). Since linear processing techniques are relatively easy to analyze and implement, they have been widely used in digital signal processing applications (Peltonen et al., 2001). However, many multichannel image processing tasks cannot be efficiently accomplished by linear techniques. Image signals are nonlinear in nature due to the presence of edges and, thus, most linear techniques tend to blur structural elements such as fine image details (Lukac et al., 2004c; Peltonen et al., 2001). In addition, visual information is perceived via the human visual system, which has strong nonlinear characteristics (Faugeras, 1979). It is therefore not surprising that nonlinear methods can potentially preserve important multichannel structural elements, such as color edges, and eliminate degradations occurring during signal formation or transmission through nonlinear channels (Mitra and Sicuranza, 2001; Plataniotis et al., 1998a; Smolka et al., 2004).

With regard to the multichannel nature of the color image, the color image techniques developed in the past are often classified into component-wise (marginal) methods and vector (multichannel) methods (Lukac et al., 2004c; Plataniotis and Venetsanopoulos, 2000). Component-wise filters (Figure 7), directly adopted from gray-scale imaging, process each channel of the color image separately (Rantanen et al., 1992; Zheng et al., 1993). By omitting the essential spectral information and introducing a certain inaccuracy in the filter estimates, the projection of the output color components into the restored RGB image often produces a color artifact, that is, a new color quite different from its neighbors. This is not the case when vector filters that utilize the inherent correlation between the color channels are used (Figure 8).
FIGURE 8. Vector filtering concept.

FIGURE 9. Sliding filtering window concept.
Vector filters process the color pixels as vectors and, thus, they avoid color artifacts in the output (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 1998).

Since natural images are nonstationary, filtering schemes operate on the premise that an image can be subdivided into small regions, each of which can be treated as stationary (Lukac et al., 2005a; Pitas and Venetsanopoulos, 1990). The filters use a processing window to determine small image regions and process information in such a localized area of the input image (Figure 9). The window, defined as Ψ_{(p,q)} = {x_{(i,j)}; (i, j) ∈ ζ}, for p = 1, 2, . . . , K_1 and q = 1, 2, . . . , K_2, slides over the entire image x, successively placing every pixel at the center of a local neighborhood denoted by ζ. The procedure replaces the color vector x_{(p,q)} located at the window center (p, q) with the output y_{(p,q)} = f(Ψ_{(p,q)}) of a filter function f(·) operating over the noise-corrupted samples listed in Ψ_{(p,q)}. Thus, the value of the estimated pixel depends on the values of the image samples x_{(i,j)} in its neighborhood. The concept and the properties of the sliding (running) window are discussed in detail in Lukac et al. (2005a).

The performance of a filtering scheme is generally influenced by the size of the local area inside the processing window Ψ_{(p,q)}. Some applications may require different support to read local image features and complete the task appropriately. As shown in Lukac et al. (2005a), the processing window may vary in shape. The type of window determines both the area of support and the overall performance of the procedure.
A particular window, such as a unidirectional or bidirectional window, can be designed to preserve specifically oriented image edges. However, the most commonly used window is of rectangular shape, such as a 3 × 3 window described by ζ = {(p − 1, q − 1), (p − 1, q), . . . , (p + 1, q + 1)}, due to its versatility and demonstrated good performance.

A. Noise-Reduction Techniques

1. Order-Statistic Theory for Color Vectors

Probably the most popular family of nonlinear filters is the one based on the concept of robust order statistics (Barnett, 1976; Hardie and Arce, 1991; Pitas and Venetsanopoulos, 1992). Their nonlinearity is given by the ordering operation, which isolates scalar noisy samples as extremes of an ordered set (Astola and Kuosmanen, 1997; Pitas and Tsakalides, 1991). For multivariate data, however, an additional step in the process is required, namely the adoption of an appropriate subordering principle as the basis for expressing the extremeness of observations. Among the four well-known concepts, namely marginal, conditional, partial, and reduced ordering, the two most commonly used approaches are the marginal and reduced ordering schemes (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000; Tang et al., 1995). Using marginal ordering, the components x_{(i,j)k} of the vector x_{(i,j)}, for (i, j) ∈ ζ, are ordered along each dimension (for k = 1, 2, 3) independently, resulting in the scalar ordered sets (Pitas and Tsakalides, 1991):

x_{(1)k} \le x_{(2)k} \le \cdots \le x_{(\tau)k} \le \cdots \le x_{(|\zeta|)k}   (19)
where x_{(\tau)k}, for τ = 1, 2, . . . , |ζ|, denotes the so-called τth marginal (component-wise) order statistic. Since the marginal ordering approach often produces output vectors y_{(p,q)} = [x_{(\tau)1}, x_{(\tau)2}, x_{(\tau)3}] ∉ Ψ_{(p,q)}, which differ from the set of vectorial inputs, application of marginal ordering to natural color images often results in color artifacts.

In reduced ordering, each vector x_{(i,j)}, for (i, j) ∈ ζ, is reduced to a scalar representative D_{(i,j)} and then the vectorial inputs are ordered in coincidence with the ranked scalars. To order the color vectors x_{(i,j)} located inside the supporting window Ψ_{(p,q)}, the R-ordering-based vector filters use the aggregated distances or the aggregated similarities (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000)

D_{(i,j)} = \sum_{(g,h)\in\zeta} d(x_{(i,j)}, x_{(g,h)}) \quad \text{or} \quad D_{(i,j)} = \sum_{(g,h)\in\zeta} s(x_{(i,j)}, x_{(g,h)})   (20)

associated with the vectorial input x_{(i,j)}, for (i, j) ∈ ζ. Thus, the ordered sequence D_{(1)} \le D_{(2)} \le \cdots \le D_{(\tau)} \le \cdots \le D_{(|\zeta|)} of the scalars D_{(i,j)}, for (i, j) ∈ ζ, implies the same ordering of the corresponding vectors x_{(i,j)} ∈ Ψ_{(p,q)} as follows (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000):

x_{(1)} \le x_{(2)} \le \cdots \le x_{(\tau)} \le \cdots \le x_{(|\zeta|)}   (21)

where x_{(\tau)}, for τ = 1, 2, . . . , |ζ|, denotes the so-called τth vector order statistic. The reduced ordering scheme is the most attractive and widely used in color image processing since it relies on an overall ranking of the original set Ψ_{(p,q)} of input samples and the output y_{(p,q)} = x_{(\tau)} ∈ Ψ_{(p,q)} is selected from the same set (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000; Smolka et al., 2004). In such an extension of order-statistic theory to vector-valued signals, outliers or vectors that diverge greatly from the data population usually appear in the higher-indexed locations of the ordered sequence and are associated with the maximum extremes of aggregated distances to the other input samples in the sliding window. For that reason, the output of the ranking-based vector filters is the lowest ranked vector x_{(1)} (or x_{(\tau)} for τ = 1) in a predefined sliding window.

The most commonly used measure to quantify the distance between two color vectors x_{(i,j)} = [x_{(i,j)1}, x_{(i,j)2}, x_{(i,j)3}] and x_{(g,h)} = [x_{(g,h)1}, x_{(g,h)2}, x_{(g,h)3}] in the magnitude domain is the generalized weighted Minkowski metric (Lukac et al., 2005a; Nosovsky, 1984):

d(x_{(i,j)}, x_{(g,h)}) = \|x_{(i,j)} - x_{(g,h)}\|_L = c \left( \sum_{k=1}^{3} \xi_k |x_{(i,j)k} - x_{(g,h)k}|^L \right)^{1/L}   (22)

where the nonnegative scaling parameter c is a measure of the overall discrimination power and the exponent L defines the nature of the distance metric. The parameter ξ_k measures the proportion of attention allocated to the dimensional component k and thus \sum_k \xi_k = 1. Vectors having a range of values greater than a desirable threshold can be scaled down by the use of the weighting function ξ. The most commonly used members of the Minkowski metric family are (Duda et al., 2000; Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000) the so-called city-block distance

\|x_{(i,j)} - x_{(g,h)}\|_1 = \sum_{k=1}^{3} |x_{(i,j)k} - x_{(g,h)k}|   (23)

and the Euclidean distance

\|x_{(i,j)} - x_{(g,h)}\|_2 = \sqrt{ \sum_{k=1}^{3} (x_{(i,j)k} - x_{(g,h)k})^2 }.   (24)
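A small sketch of the reduced-ordering machinery of Eqs. (20)–(24) follows: aggregated Minkowski distances are computed for every vector in a window and the vectors are ranked by the resulting scalars. The window contents and default parameters below are placeholders, not values taken from the text.

```python
import numpy as np

def minkowski(a, b, L=2, xi=None, c=1.0):
    # Generalized weighted Minkowski metric, Eq. (22); L=1 corresponds to the
    # city-block distance (23), L=2 to the Euclidean distance (24).
    d = np.abs(np.asarray(a, float) - np.asarray(b, float))
    xi = np.full(d.shape, 1.0 / d.size) if xi is None else np.asarray(xi)
    return c * np.sum(xi * d ** L) ** (1.0 / L)

def reduced_ordering(window, L=2):
    # Eqs. (20)-(21): aggregate each vector's distances to all the others and
    # order the vectors according to the resulting scalars D_(i,j).
    D = np.array([sum(minkowski(a, b, L) for b in window) for a in window])
    order = np.argsort(D)
    return [window[t] for t in order], D[order]

# A 3 x 3 window flattened into nine RGB vectors (values are illustrative).
window = [np.array(v) for v in
          [[10, 12, 11], [11, 13, 10], [12, 11, 12], [10, 10, 13],
           [250, 20, 30], [11, 12, 12], [13, 12, 11], [12, 13, 13], [11, 11, 10]]]
ranked, scores = reduced_ordering(window)
print(ranked[0], ranked[-1])  # x_(1) is the most central vector; the outlier ranks last
```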
Another special case of the Minkowski metric in Eq. (22) is the chess-board distance, which corresponds to L → ∞. In this case, the distance between the two 3D vectors is considered equal to the maximum distance among their corresponding components.

The Minkowski class of metrics is not the only way to measure differences among color vectors. The Canberra distance (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000)

d(x_{(i,j)}, x_{(g,h)}) = \sum_{k=1}^{3} \frac{|x_{(i,j)k} - x_{(g,h)k}|}{|x_{(i,j)k} + x_{(g,h)k}|}   (25)

is another metric readily applicable to positively valued signals such as color signal values. It should be noted that the summand is defined to be zero if both x_{(i,j)k} and x_{(g,h)k} are zero (Plataniotis et al., 1999).

It was argued in Plataniotis et al. (1999) and Plataniotis and Venetsanopoulos (2000) that for image processing purposes the commonly used metric distances can be replaced by a similarity measure s(x_{(i,j)}, x_{(g,h)}) between two color vectors x_{(i,j)} and x_{(g,h)}. Usually a symmetric function s(·) returns a large value when the vectorial inputs x_{(i,j)} and x_{(g,h)} are similar and converges to zero if the two inputs are dissimilar. The well-known normalized inner product (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000)

s(x_{(i,j)}, x_{(g,h)}) = \frac{x_{(i,j)} x_{(g,h)}^{T}}{|x_{(i,j)}|\, |x_{(g,h)}|},   (26)

which corresponds to the cosine of the angle between x_{(i,j)} and x_{(g,h)}, can be viewed as a similarity in orientation. Since similar colors have almost parallel orientations and significantly different colors point in different overall directions in a 3D color space such as the RGB space, the normalized inner product, or equivalently the angular distance (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000)

\theta = A(x_{(i,j)}, x_{(g,h)}) = \arccos\!\left( \frac{x_{(i,j)} x_{(g,h)}^{T}}{|x_{(i,j)}|\, |x_{(g,h)}|} \right)   (27)

can be used instead of the Minkowski metric to quantify the dissimilarity between the two vectors.

It is obvious that a generalized similarity measure model that can effectively quantify differences among color signals should take into consideration both the magnitude and the orientation of the color vectors. Thus, a generalized measure based on both the magnitude and orientation of vectors should provide a robust solution to the problem of similarity quantification between two vectors. Such an idea is used in constructing the generalized content
model family of measures, which treat the similarity between two vectors as the degree of common content in relation to the total content of the two vectors. Therefore, given the common quantity (commonality) C_{(i,j)}^{(g,h)} and the total quantity (totality) T_{(i,j)}^{(g,h)}, the similarity between x_{(i,j)} and x_{(g,h)} is defined as (Plataniotis et al., 1999; Plataniotis and Venetsanopoulos, 2000):

s(x_{(i,j)}, x_{(g,h)}) = \frac{C_{(i,j)}^{(g,h)}}{T_{(i,j)}^{(g,h)}}.   (28)

Based on this general framework, different similarity measures can be obtained by utilizing different commonality and totality concepts (Lukac et al., 2005a; Plataniotis et al., 1999):

s(x_{(i,j)}, x_{(g,h)}) = w_{(i,j)} \frac{x_{(i,j)} x_{(g,h)}^{T}}{|x_{(i,j)}|\, |x_{(g,h)}|}\; w_{(g,h)} \left[ 1 - \frac{\big|\, |x_{(i,j)}| - |x_{(g,h)}| \,\big|}{\max(|x_{(i,j)}|, |x_{(g,h)}|)} \right]   (29)

s(x_{(i,j)}, x_{(g,h)}) = \frac{h_{(i,j)} + h_{(g,h)}}{|x_{(i,j)}| + |x_{(g,h)}|} = \frac{|x_{(i,j)}| \cos(\theta) + |x_{(g,h)}| \cos(\theta)}{|x_{(i,j)}| + |x_{(g,h)}|} = \cos(\theta)   (30)

s(x_{(i,j)}, x_{(g,h)}) = \frac{h_{(i,j)} + h_{(g,h)}}{[\,|x_{(i,j)}|^2 + |x_{(g,h)}|^2 + 2|x_{(i,j)}||x_{(g,h)}| \cos(\theta)\,]^{1/2}}   (31)

s(x_{(i,j)}, x_{(g,h)}) = \frac{[\,|h_{(i,j)}|^2 + |h_{(g,h)}|^2 + 2|h_{(i,j)}||h_{(g,h)}| \cos(\theta)\,]^{1/2}}{|x_{(i,j)}| + |x_{(g,h)}|}   (32)

s(x_{(i,j)}, x_{(g,h)}) = 1 - \frac{[\,|x_{(i,j)}|^2 + |x_{(g,h)}|^2 - 2|x_{(i,j)}||x_{(g,h)}| \cos(\theta)\,]^{1/2}}{[\,|x_{(i,j)}|^2 + |x_{(g,h)}|^2 + 2|x_{(i,j)}||x_{(g,h)}| \cos(\theta)\,]^{1/2}}.   (33)
(i,j )∈ζ
206
LUKAC AND PLATANIOTIS
where y(p,q)k = x(g,h)k ∈ Ψ(p,q)k is the output of the scalar median filter operating in the kth color channel of the input image x. The component-wise input set Ψ(p,q)k is composed of the kth components x(i,j )k of the input vectors x(i,j ) ∈ Ψ(p,q) for (i, j ) ∈ ζ . Assuming that each location (i, j ) ∈ ζ is assigned a real-valued weight w(i,j ) , the weighted median (WM) of the component-wise input set Ψ(p,q)k is the component x(g,h)k ∈ Ψ(p,q)k minimizing the following expression (Gabbouj et al., 1992; Yin et al., 1996; Yu and Liao, 1994): y(p,q)k = arg min w(i,j ) |x(g,h)k − x(i,j )k |. (35) x(g,h)k
(i,j )∈ζ
If each weight w(i,j ) , for (i, j ) ∈ ζ is equal to 1, the above definition reduces to Eq. (34). To choose an appropriate weight vector and obtain the desired performance of the WM filters, the optimization algorithms (Yang et al., 1995; Yin and Neuvo, 1994) that originate from the stack filter design (Coyle et al., 1989; Yin et al., 1993) have been developed. Estimating a desired signal o(p,q) in Eq. (6), the loss in performance (error in the filtering operation) is defined using the absolute error criterion as e(p,q) = o(p,q) − y(p,q) 1 .
(36)
One (but not the only) natural way of choosing the weighting coefficients w(i,j ) of the weight vector w is to require that this choice should minimize the average loss or risk. Therefore, the cost function of the WM filtering is defined as follows: JWM (w) = E o − y1 (37) where E{·} indicates statistical expectation. With the constraint of nonnegative weights, the optimization problem with inequality constraints can be expressed as follows (Yin et al., 1996): minimize JWM (w) subject to w(i,j ) ≥ 0,
for (i, j ) ∈ ζ.
(38)
The above optimization problem can be solved using different methods, with an overview to be found in Lukac (2004b) and Yin et al. (1996). Using the adaptation algorithm based on the sigmoidal function (Yin and Neuvo, 1994) sgn(a) =
2 −1 1 + exp(−a)
(39)
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
207
an adjustment of the filter weights in a supporting window Ψ(p,q) sliding over the image during processing can be expressed as follows (Lukac et al., 2004b; Yin and Neuvo, 1994): w(i,j ) = P w(i,j ) + 2μ(o(p,q)k − y(p,q)k ) sgn(x(i,j )k − y(p,q)k ) (40)
where μ is the iteration constant, y(p,q)k is the component-wise WM output [Eq. (35)], and 0 if w(i,j ) < 0 (41) P (w(i,j ) ) = w(i,j ) otherwise
is a projection function used to project the updated weight w(i,j ) onto the constraint space of w during the adaptation process in Eq. (40). Assuming for the moment that P (·) is an identity function, for x(p,q)k ≫ y(p,q)k and positive μ the adaptation formula [Eq. (40)] reduces to w(i,j ) = w(i,j ) + 2μ(o(p,q)k − y(p,q)k ).
(42)
According to the above expression, the importance of the sample occupying the (i, j ) location in a supporting window Ψ(p,q) increases if o(p,q)k is greater than the actual output y(p,q)k in Eq. (35) and decreases if o(p,q)k is smaller than y(p,q)k . Thus, this difference multiplied by the regularization factor represents the weight increment (for 0 < o(p,q)k − y(p,q)k ), the weight decrement (for 0 > o(p,q)k − y(p,q)k ), or it can remain with the weights unchanged (for o(p,q)k − y(p,q)k = 0). An alternative solution to Eq. (40) can be obtained by solving the constrained problem in Eq. (38) as follows (Lukac et al., 2004b; Yin et al., 1993): w(i,j ) = P w(i,j ) + 2μ x(|ζ |)k − x(1)k − 2|o(p,q)k − x(i,j )k | & (43) w(g,h) x(|ζ |)k − x(1)k − 2|x(i,j )k − x(g,h)k | − (g,h)∈ζ
where x(|ζ |)k and x(1)k represent the uppermost and the lowest componentwise order statistics in Eq. (19), respectively, and μ is the positive adaptation constant. 3. Vector Median Filters Unlike the scalar filters described above, the essential spectral characteristics of the noisy color image x are utilized by vector filtering schemes. The most popular vector filter is the vector median filter (VMF) (Astola et al., 1990). The VMF is a vector processing operator that has been introduced as an extension of the scalar median filter. The VMF can be derived either as
208
LUKAC AND PLATANIOTIS
an MLE or by using vector order-statistic techniques (Lukac et al., 2005a). Using the reduced ordering in Eq. (21), the vector median of a population of the vectors inside the supporting window Ψ(p,q) is the lowest ranked vector x(1) ∈ Ψ(p,q) . Since the ordering can be used to determine the positions of the different input vectors without any a priori information regarding the signal distributions, vector order-statistic filters, such as the VMF, are robust estimators. Similarly to the traditional MF in Eq. (34), the VMF output can be defined using the minimization concept as follows (Lukac et al., 2005a): x(g,h) − x(i,j ) L (44) y(p,q) = arg min x(g,h)
(i,j )∈ζ
where y(p,q) = x(g,h) ∈ Ψ(p,q) denotes the outputted vector belonging to the input set Ψ(p,q) . Such a concept has been used to develop the VMF modifications following the properties of color spaces (Regazoni and Teschioni, 1997). To speed up the calculation of the distances between the color vectors, the VMF based on the linear approximation of the Euclidean norm has been proposed in Barni et al. (1994). It is widely observed that the VMF excellently suppresses impulsive noise (Astola et al., 1990; Smolka et al., 2004). To improve its performance in the suppression of additive Gaussian noise, the VMF has been combined with linear filters (Astola et al., 1990). This so-called extended VMF is defined as
⎧ AMF
yVMF − x(i,j )
yAMF − x(i,j ) ≤ ⎪ (p,q) (p,q) ⎨ y(p,q) if L L (i,j )∈ζ (i,j )∈ζ (45) y(p,q) = ⎪ ⎩ VMF y(p,q) otherwise
VMF where yVMF (p,q) is the VMF output obtained in Eq. (44) and y(p,q) is an arithmetic mean filter (AMF) defined over the vectors inside the neighborhood ζ : 1 x(i,j ) . (46) yAMF (p,q) = |ζ | (i,j )∈ζ
Even though the sample outputted in Eq. (45) is not always one of the input samples, the filter adapts to the input signal by applying the VMF near a signal edge and using the AMF in the smooth areas. Thus, the extended VMF preserves the structural information of the image x while improving noise attenuation in the smooth areas. Another modification of the VMF filter uses the multistage filtering concept and the finite impulse response (FIR) filters to increase the design freedom of the VMF and to reduce the number of processing operations required to find the VMF output (Astola et al., 1990). The so-called vector FIR-median hybrid filters combine linear filtering with the VMF operation by dividing the
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
209
filter window ζ into an odd number of smaller subwindows where FIR filters are operating. The output of such a vector hybrid filter is the vector median of the FIR filter outputs. For example, using the three AMF filters with the subwindows ζ1 = {(p − 1, q − 1), (p − 1, q), (p − 1, q + 1), (p, q − 1)}, (p, q), and ζ2 = {(p, q + 1), (p + 1, q − 1), (p + 1, q), (p + 1, q + 1)}, respectively, the output of the vector FIR-median filter is defined as follows (Astola et al., 1990): y(p,q) = fVMF fAMF (x(i,j ) ; (i, j ) ∈ ζ1 ), x(p,q) , fAMF (x(i,j ) ; (i, j ) ∈ ζ2 ) (47) where ζ = {ζ1 ∪ (p, q) ∪ ζ2 }. The functions fVMF (·) and fAMF (·) denote the VMF and AMF operations, respectively. Since the central sample x(p,q) is usually the most important sample in the supporting window Ψ(p,q) , it remains unchanged by the corresponding FIR filter (identity filter). The main advantage of the vector FIR-median filters is that they significantly speed-up the filtering process compared to the traditional VMF. Both VMF and extended VMF utilize the aggregated Euclidean distance of the input vectors within a processing window. However, these measures do not take into account either the importance of the specific samples in the filter window or structural contents of the image. Much better results can be obtained when the weighting coefficients w(i,j ) are introduced into the filter structure to control the contribution of the associated input vectors x(i,j ) to the aggregated distances (Lukac et al., 2003a; Viero et al., 1994), w(g,h) x(i,j ) − x(g,h) L (48) D(i,j ) = (g,h)∈ζ
used as the ordering criterion in Eq. (21). Based on the aggregated weighted magnitude distances in Eq. (48), the so-called weighted VMF (WVMF) operators (Lucat et al., 2002; Lukac et al., 2004c; Viero et al., 1994) produce the lowest ranked vector x(1) in Eq. (21) as the filter output, that is, y(p,q) = x(1) . Similarly to the traditional VMF operator, the output of the WVMF filters is equivalently determined using the minimization concept as follows (Lukac et al., 2004c): w(i,j ) x(g,h) − x(i,j ) L (49) y(p,q) = arg min x(g,h)
(i,j )∈ζ
where y(p,q) = x(g,h) ∈ Ψ(p,q) represents the filter output. In the case of the unity weight vector w = [w(i,j ) = 1, (i, j ) ∈ ζ ], the WVMF definition [Eq. (49)] reduces to the earlier one [Eq. (44)] of the VMF. Note that both the VMF and the WVMF are generalized within a class of selection weighted vector filters (SWVF) (Lukac et al., 2004c) presented in Section IV.A.5. Thus,
210
LUKAC AND PLATANIOTIS
to tune the performance of the WVMF operators (Lukac et al., 2003a) the SWVF optimization framework can be utilized (Lukac et al., 2004c). In some situations, the choice of the outputted sample y(p,q) from the input set Ψ(p,q) may come as a limitation. Therefore, the combined vector and component-wise filtering can be used to achieve better noise attenuation (Astola et al., 1990). Operating on the set of the filter weights w(i,j ) , for (i, j ) ∈ ζ , associated with the input vectors x(i,j ) ∈ Ψ(p,q) the so-called extended WVMF is defined as follows (Viero et al., 1994): ⎧
⎪ yWAMF if w(i,j ) yAMF ⎪ (p,q) − x(i,j ) L (p,q) ⎪ ⎪ ⎪ (i,j )∈ζ ⎪ ⎨
≤ w(i,j ) yVMF y(p,q) = (50) (p,q) − x(i,j ) L ⎪ ⎪ (i,j )∈ζ ⎪ ⎪ ⎪ ⎪ ⎩ yWVMF otherwise (p,q)
yWVMF (p,q)
where denotes the WVMF output calculated using Eq. (44), and AVMF y(p,q) is the output of a weighted averaging filter yWAMF (p,q) = "
1 (g,h)∈ζ
w(g,h)
w(i,j ) x(i,j ) .
(51)
(i,j )∈ζ
Similarly to Eq. (45), the extended WVMF operation in Eq. (50) chooses yWAMF (p,q) in smooth areas to produce the final output, whereas near edges it tends to choose the VMF to preserve the structural information. Since the weighted averaging operation in Eq. (50) tends to smooth fine details and it is prone to outliers, improved design characteristics can be obtained by replacing the weighted averaging filter yWAMF (p,q) in Eq. (50) with the alpha-trimmed filter: 1 x(i,j ) (52) yα(p,q) = |ζα | (i,j )∈ζα
where α is a design parameter that can have values α = 0, 1, . . . , |ζ | − 1. The set ζα = {(i, j ), for D(i,j ) ≤ D(|ζ |−α) } ⊂ ζ , consists of the spatial locations of the vectors x(p,q) ∈ Ψ(p,q) , which have the aggregated weighted distances in Eq. (48) smaller or equal to the (|ζ | − α)th largest aggregated weighted distance D(|ζ |−α) ∈ {D(i,j ) ; (i, j ) ∈ ζ }. If α = |ζ | − α, the filter in Eq. (50) is equivalent to a WVMF. Vector rational filters (VRF) (Khriji and Gabbouj, 1999, 2002) operate on the input vectors x(i,j ) of Ψ(p,q) using rational functions: y(p,q) =
P [x(i,j ) ; (i, j ) ∈ ζ ] Q[x(i,j ) ; (i, j ) ∈ ζ ]
(53)
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
211
where P (·) = [P1 (·), P2 (·), P3 (·)] is a vector-valued polynomial with the components Pk x(i,j ) ; (i, j ) ∈ ζ = a0 + ai,j x(i,j )k (i,j )∈ζ
+
(i1 ,j1 )∈ζ (i2 ,j2 )∈ζ
ai1 ,j1 ,i2 ,j2 x(i1 ,j1 )k x(i2 ,j2 )k + · · ·
(54)
and a0 , ai,j , ai1 ,j1 ,i2 ,j2 , . . . , are the functions f (x(i,j ) ; (i, j ) ∈ ζ ) of the input set Ψ(p,q) . The function Q(·) is a scalar polynomial Q x(i,j ) ; (i, j ) ∈ ζ = b0 + bi1 ,j1 ,i2 ,j2 x(i1 ,j1 ) − x(i2 ,j2 ) L (55) (i1 ,j1 )∈ζ (i2 ,j2 )∈ζ
where b0 > 0 and bi1 ,j1 ,i2 ,j2 are constant. The kth component y(p,q)k of the VRF output vector y(p,q) is defined as y(p,q)k = [[Pk (·)]]/Q(·), where [[Pk (·)]] denotes the integer part of Pk (·). Thus, depending on the filter parameters the VRF can remove the different type of noise (additive Gaussian, impulsive, mixed), while retaining sharp edges. Similarly to the other VMF-like filters, the VRF reduces to rational scalar filters if the vector dimension is one. Finally, there also exist vector median rational hybrid filters that combine the VRF and linear low-pass filters to reduce the computational complexity of the standard VRF operators. Multichannel L filters (Kotropoulos and Pitas, 2001; Nikolaidis and Pitas, 1996) use a linear combination of the ordered input samples to determine the filter output: y(p,q) =
|ζ |
wτ x(τ )
(56)
τ =1
where wτ is the weight associated with the τ th ordered vector x(τ ) ∈ Ψ(p,q) . These filters can be designed optimally under the mean-square error (MSE) criterion for a specific additive noise distribution. Assuming the weight vector w = [w1 , w2 , . . . , w|ζ | ] and the unity vector e = [1, 1, . . . , 1] of the dimension identical to that of w, the optimal coefficients wτ , for τ = 1, 2, . . . , |ζ |, can be determined as follows: w=
R−1 e eT R−1 e
(57)
212
LUKAC AND PLATANIOTIS
F IGURE 10.
Directional processing concept on the Maxwell triangle.
where wT e = 1 is the constraint imposed on the solution and R is a |ζ | × |ζ | correlation matrix of the ordered noise variables. Alternatively, the popular r based on the least mean square (LMS) formula w = w + 2μe(p,q) Ψ(p,q) r ordered input set Ψ(p,q) can be used instead of Eq. (57) to speed-up the optimization process. 4. Vector Directional Filters It has been observed that the filtering techniques taking into account the vectors’ magnitude may produce color outputs with chromaticity impairments. To alleviate such problems, a new type of multichannel filters has been proposed (Trahanias and Venetsanopoulos, 1993). The so-called vector directional filter (VDF) family operates on the direction of the image vectors, aiming to eliminate vectors with atypical directions in the vector space (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000). To achieve its objective, the VDF utilizes the angle [Eq. (27)] between the image vectors to order vector inputs inside a processing window (Plataniotis et al., 1998b). The output of the basic vector directional filter (BVDF) (Trahanias et al., 1996) defined within the VDF class is the color vector x(g,h) ∈ Ψ(p,q) whose direction is the MLE of directions of the input vectors (Nikolaidis and Pitas, 1998). Thus, the BVDF output x(g,h) minimizes the angular ordering criterion (Figure 10) (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000) to other samples inside the sliding filtering window Ψ(p,q) : y(p,q) = arg min A(x(g,h) , x(i,j ) ). (58) x(g,h)
(i,j )∈ζ
The above definition can be used to express a spherical median (SM) (Trahanias et al., 1996), which minimizes the angular criterion in Eq. (58)
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
213
without the constraint that the filter output y(p,q) is one of the original samples within the filtering window Ψ(p,q) . It was argued in Tang et al. (2001) that the vector’s direction defines its color chromaticity properties. Thus minimizing the angular distances between vector inputs may produce better performance than the VMF-based approaches in terms of direction preservation (Lukac et al., 2005a, 2005g). On the other hand, the BVDF does not take into account the magnitude characteristics and thus ignores the brightness of color vectors. To utilize both features in color image filtering (Trahanias et al., 1996), the generalized vector directional filters (GVDF) first eliminate the color vectors with atypical directions in the vector space: y(p,q) = fGVDF (x(1) , x(2) , . . . , x(τ ) )
(59)
where {x(1) , x(2) , . . . , x(τ ) } is the set of the τ lowest vector order statistics obtained using the angular distances in Eq. (27). As a result of this process, a set of input vectors with approximately the same direction in the vector space is produced as the output set. Then, the GVDF operators process the vectors with the most similar orientation according to their magnitude. Thus, the GVDF splits the color image processing into directional processing and magnitude processing. Another approach, the directional-distance filter (DDF) (Karakos and Trahanias, 1997), combines both ordering criteria used in the VMF and the BVDF schemes. Using equal amounts of the magnitude and directional information, the DDF makes use of a hybrid ordering criterion expressed through a product of the aggregated Euclidean distance and the aggregated angular distance as follows: A(x(i,j ) , x(g,h) ) . (60) x(i,j ) − x(g,h) L D(i,j ) = (g,h)∈ζ
(g,h)∈ζ
The DDF output is the sample x(1) ∈ Ψ(p,q) in Eq. (21) associated with the smallest value D(i,j ) , for (i, j ) ∈ ζ . The introduction of the DDF inspired a new set of heuristic vector processing filters such as the hybrid vector filters (HVF) (Gabbouj and Cheickh, 1996; Plataniotis and Venetsanopoulos, 2000). These filters try to capitalize on the same appealing principle, namely the simultaneous minimization of the distance functions used in the VMF and the BVDF. The HVFs operate on the direction and the magnitude of the color vectors independently and then combine them to produce a unique final output. The HVF1 approach, viewed as a nonlinear combination of the VMF and BVDF filters, generates an output according to the following rule (Lukac et al., 2004b; Plataniotis and
214
LUKAC AND PLATANIOTIS
Venetsanopoulos, 2000): y(p,q) =
yVMF (p,q)
BVDF if yVMF (p,q) = y(p,q)
y¯ 1(p,q)
otherwise
(61)
VMF yVMF (p,q) is the VMF output obtained in (44), y(p,q) characterizes the BVDF output in Eq. (58) and y¯ 1(p,q) is the vector calculated as
y¯ 1(p,q)
=
|yVMF (p,q) |
|yBVDF (p,q) |
yBVDF (p,q)
(62)
with | · | denoting the magnitude of the vector. A more refined, nonlinear combiner is the so-called HVF2 (Gabbouj and Cheickh, 1996), which combines AMF, VMF, and BVDF as follows (Lukac et al., 2004b; Plataniotis and Venetsanopoulos, 2000): ⎧ VMF BVDF y(p,q) if yVMF ⎪ (p,q) = y(p,q) ⎪ ⎪ ⎪ ⎨ 1 x(i,j ) − y¯ 1 < x(i,j ) − y¯ 2 ¯ if y (p,q) (p,q) (p,q) y(p,q) = ⎪ (i,j )∈ζ (i,j )∈ζ ⎪ ⎪ ⎪ ⎩ 2 y¯ (p,q) otherwise (63) where y¯ 1(p,q) is the vector obtained in Eq. (62) and y¯ 2(p,q) is the vector defined as AMF |y(p,q) | BVDF 2 y(p,q) y¯ (p,q) = (64) |yBVDF (p,q) |
with yAMF (p,q) in Eq. (46) denoting the output of the AMF operating inside the same processing window Ψ(p,q) positioned in (p, q). Both hybrid vector filters are computationally demanding due to the required evaluation of both the VMF and BVDF outputs (Plataniotis and Venetsanopoulos, 2000). Thus, the two independent ordering schemes are applied to the input samples to produce a unique final output. The recently introduced weighted vector directional filters (WVDF) (Lukac, 2004a; Lukac et al., 2004b) utilize nonnegative real weighting coefficients w(i,j ) associated with the input vectors x(i,j ) , for (i, j ) ∈ ζ . These filters output the color vector y(p,q) = x(g,h) ∈ Ψ(p,q) , which minimizes the aggregated weighted angular distance to the other samples inside the processing window Ψ(p,q) : y(p,q) = arg min w(i,j ) A(x(g,h) , x(i,j ) ). (65) x(g,h)
(i,j )∈ζ
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
215
Equivalently, the WVDF output is determined using the lowest vector order statistics x(1) from the ordered set in Eq. (21) based on the aggregated weighted angular distance criterion (Lukac et al., 2004b): D(i,j ) = w(g,h) A(x(i,j ) , x(g,h) ) (66) (g,h)∈ζ
where D(i,j ) is associated with the input vector x(i,j ) . Since the WVDF output is by construction one of the original samples in the input set Ψ(p,q) , the filter never introduces new outlying vectors. Based on the actual weight vector w in Eq. (65) or (66), the WVDF filtering class extends the flexibility of the VDFbased designs, improves the detail-preserving filtering characteristics of the conventional VDF schemes, and provides a powerful color image filtering tool capable of tracking varying signal and noise statistics. To obtain the desired performance of the WVDF operators, the least mean absolute (LMA) errorbased multichannel adaptation algorithms operating in the directional domain of the processed vectors have been introduced in Lukac et al. (2004b). The WVDF and WVMF weights’ adaptation algorithms along with the WVDF and WVMF operators have been generalized in the unified framework of the selection weighted vector filters in Lukac et al. (2004c). 5. Selection Weighted Vector Filters The structure of the selection weighted vector filters (SWVF) (Lukac et al., 2004c) is characterized by a design parameter ξ ranging from 0 to 1, and a set of nonnegative real weights w = {w(i,j ) ; (i, j ) ∈ ζ }. For each input sample x(i,j ) , (i, j ) ∈ ζ , the weights w(i,j ) are used to form a SWVF processing function fSWVF (Ψ(p,q) , w, ξ ) defined as follows (Lukac et al., 2004c; Lukac and Plataniotis, 2005b): 1−ξ w(i,j ) x(g,h) − x(i,j ) L y(p,q) = arg min x(g,h)
×
(i,j )∈ζ
(i,j )∈ζ
w(i,j ) A(x(g,h) , x(i,j ) )
ξ
(67)
where y(p,q) = x(g,h) ∈ Ψ(p,q) represents the filter output. The selective nature of the SWVF operator and the use of the minimization concept ensure the outputting of the input color vector x(g,h) , which is the most similar, under the specific setting of w, to other samples in Ψ(p,q) . The weighting coefficient w(i,j ) signifies the importance of x(i,j ) in Ψ(p,q) . Through the weight vector w and the design parameter ξ , the SWVF scheme tunes the overall filter’s detail-preserving and noise-attenuating characteristics
216
LUKAC AND PLATANIOTIS
F IGURE 11. Adaptive filtering concept with the parameter’s adaptation obtained using (a) original signal, (b) noisy signal.
and uses both the spatial and spectral characteristics of the color image x during processing. Depending on the value of parameter 0 ≤ ξ ≤ 1 in Eq. (67), color image processing can be performed in the magnitude (ξ = 0) or directional (ξ = 1) domain. By setting ξ = 0.5, the SWVF process the input image using an equal amount of the magnitude and directional information. Any deviation from this value to a lower or larger value of ξ places more emphasis on the magnitude or directional characteristics, respectively. Thus, each setting of the filter parameters w and ξ represents a specific filter that can be used for a specific task. This suggests that SWVF filters constitute a wide class of vector operators. For example, the use of the unity weight vector w = 1 in Eq. (67) with ξ = 0 reduces an SWVF operator to the VMF, while the use of w = 1 reduces the SWVF to the BVDF (for ξ = 1) and to the DDF (for ξ = 0.5). Similarly, the use of ξ = 0 and ξ = 1 reduces the SWVF to the WVMF and WVDF, respectively, for an arbitrary weight vector w. The use of the SWVF scheme [Eq. (67)] requires the determination of the weight vector w by the end-user. Alternatively, if the original signal o(p,q) of Eq. (6) is available, the weights w(i,j ) in Eq. (67) can be adapted as follows (Figure 11a) (Lukac et al., 2004c): (68) w(i,j ) = P w(i,j ) + 2μR(o(p,q) , y(p,q) ) sgn R(x(i,j ) , y(p,q) )
where μ is a regulation factor. Each weight w(i,j ) is adjusted by adding the contributions of the corresponding input vector x(i,j ) and the SWVF output y(p,q) . These contributions are measured as the distances to the original signal o(p,q) , which is used to guide the adaptation process (Lukac et al., 2004b). The initial weight vector can be set to any arbitrary positive value, but equally aligned weighting coefficients such as w(i,j ) = 1, for (i, j ) ∈ ζ , corresponding to the robust smoothing functions and μ ≪ 0.5, are the values recommended in Lukac et al. (2004c) for conventional color image processing applications. To minimize the influence of the initial setting of the SWVF parameters, the adaptation formula should allow for the adjustment
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
217
of w(i,j ) using both positive and negative contributions. Therefore, Eq. (68) is constructed using the sign sigmoidal function [Eq. (39)] and the vectorial sign function (Lukac et al., 2004c): ξ 1−ξ R(x(i,j ) , x(g,h) ) = S(x(i,j ) , x(g,h) ) x(i,j ) − x(g,h) L A(x(i,j ) , x(g,h) ) , (69) which considers contributions using both the magnitude and directional information with the polarity S(·) ∈ {−1, 1} defined as follows: +1 for x(i,j ) − x(g,h) ≥ 0 (70) S(x(i,j ) , x(g,h) ) = −1 for x(i,j ) − x(g,h) < 0.
The use of R(·) is essential in sgn(·) since the positive (or negative) values of R(x(i,j ) , y(p,q) ) allow for the corresponding adjustment of w(i,j ) in Eq. (68) by adding the negative (or positive) value of 2μR(o(p,q) , y(p,q) ) sgn[R(x(i,j ) , y(p,q) )]. If the sample under consideration x(i,j ) and the actual SWVF output y(p,q) are identical [i.e. R(x(i,j ) , y(p,q) ) = 0], then sgn(·) = 1, which suggests that w(i,j ) is adjusted based solely on the difference between the SWVF output y(p,q) and the original signal o(p,q) . To keep the aggregated distances in Eq. (67) positive, and thus to ensure the unbiased low-pass characteristics of the SWVF filters, a projection function [Eq. (41)] is used to project the updated weight w(i,j ) onto the constraint space of w during the adaptation process in Eq. (68). If the original signal o(p,q) is not available (Figure 11b), the weights w(i,j ) in Eq. (67) can be adapted replacing the original signal o(p,q) in Eq. (68) with ∗ ∗ ∗ , y(p,q)2 , y(p,q)3 ] as follows: the feature signal y∗(p,q) = [y(p,q)1 w(i,j ) = P w(i,j ) + 2μR(y∗(p,q) , y(p,q) ) sgn R(x(i,j ) , y(p,q) ) . (71)
The considered adaptation scheme leads to a number of SWVF filters with different design characteristics. For example, the feature signal y∗(p,q) can be obtained using one of the following ways (Lukac et al., 2004c; Lukac and Plataniotis, 2005b):
• The use of the acquired signal y∗(p,q) = x(p,q) is useful, when the corrupting noise power is low and strong detail-preserving characteristics are expected from the SWVF operators. • The robustness of the SWVF operator and its noise attenuation capability is ensured using a robust, easy to calculate estimate such as ∗ = the component-wise median of Ψ(p,q) with the components y(p,q)k ∗ median{x(i,j )k ; (i, j ) ∈ ζ } of y(p,q) obtained in Eq. (34). • The trade-off between the noise attenuating and detail-preserving characteristics can be obtained through the combination of the input signal and the obtained estimate.
218
LUKAC AND PLATANIOTIS
Different design characteristics of the SWVF operators are obtained when the adaptation algorithm in Eq. (43) is extended to process the vector signals. In this case, the uppermost x(|ζ |) and the lowest x(1) ranked vectors in Eq. (21) are used to update the weights w(i,j ) , for (i, j ) ∈ ζ , as follows (Lukac et al., 2004b): w(i,j ) = P w(i,j ) + 2μ R(x(|ζ |) , x(1) ) − 2R(o(p,q) , x(i,j ) ) −
(g,h)∈ζ
& w(g,h) R(x(|ζ |) , x(1) ) − 2R(x(i,j ) , x(g,h) ) (72)
where μ is the positive adaptation stepsize. Similarly to Eq. (68), the negative weight coefficients are modified by projection operation [Eq. (41)]. Following the rationale in Eq. (71), the adaptation formula in Eq. (72) can be modified by replacing the desired signal o(p,q) with the feature signal y∗(p,q) . Finally, it should be mentioned that the both multichannel adaptation algorithms in Eqs. (68) and (72) generalize their scalar versions Eqs. (40) and (43), respectively. In addition, the framework generalizes simplified versions of the algorithms used to optimize the WVMF and WVDF operators in the magnitude (ξ = 0) or the directional (ξ = 1) domain of the processed vector signals, respectively. Therefore, it can be concluded that the SWVF framework constitutes a flexible tool for multichannel image processing. 6. Data-Adaptive Vector Filters Since the images are highly nonstationary due to the edges and fine details, and it is difficult to differentiate between noise and edge pixels, fuzzy sets are highly appropriate for image filtering tasks (Plataniotis and Venetsanopoulos, 2000). A number of fuzzy filters, such as the one proposed in Tsai and Yu (2000), adopt a window-based, rule-driven approach leading to a datadependent fuzzy solution. To obtain the desired performance of such a filter, the fuzzy rules must be optimally set using the optimization procedure, which often requires the presence of the original signal. However, the original data are usually not available in practical applications. Therefore, the fuzzy vector filters in Lukac et al. (2005b) and Plataniotis et al. (1996, 1999) are designed to remove noise in multichannel images without the requirement of fuzzy rules. The most commonly used method to smooth high-frequency variations and transitions is averaging. Therefore, the general form of the data-dependent filter is given as a fuzzy weighted average (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000) of the input vectors inside the supporting win-
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
dow Ψ(p,q) : y(p,q) = f
w(i,j ) x(i,j )
(i,j )∈ζ
219
(73)
where f (·) is a nonlinear function that operates over the weighted average of the input set and + w(i,j ) = μ(i,j ) μ(g,h) (74) (g,h)∈ζ
is the normalized filter weight calculated using the weighting coefficient μ(i,j ) equivalent to the fuzzy membership function associated with the input color " vector x(i,j ) ∈ Ψ(p,q) . Note that the two constraints w(i,j ) ≥ 0 and (i,j )∈ζ w(i,j ) = 1 are necessary to ensure that the filter output is an unbiased estimator and produces the samples within the desired intensity range. Operating on the vectorial inputs x(i,j ) , the weights w(i,j ) in Eq. (74) are determined adaptively using functions of a distance criterion between the input vectors (Lukac et al., 2005a). Since the relationship between distances measured in physical units and perception is generally exponential (Plataniotis et al., 1999), an exponential type of function may be suitable for use in the weighting formulation (Lukac et al., 2005b; Plataniotis et al., 1999): −r μ(i,j ) = β 1 + exp{D(i,j ) } (75)
where r is a parameter adjusting the weighting effect of the membership function, β is a normalizing constant, and D(i,j ) is the aggregated distance or similarity measure defined in Eq. (20). The data-adaptive filters can be optimized for any noise model by appropriately tuning their membership function in Eq. (75). The vector y(p,q) outputted in Eq. (73) is not part of the original input set Ψ(p,q) . In some image processing application (Lukac et al., 2005a), constrained solutions such as the VMF of Eq. (44) and the BVDF of Eq. (58), which can provide higher preservation of image details (see Figure 12) compared to the unconstrained solutions, are required. Therefore, a different design strategy should be used and the adaptive weights in Eq. (73) can be redefined as follows (Lukac et al., 2005b; Plataniotis et al., 1999): 1 if μ(i,j ) = μmax w(i,j ) = (76) 0 if μ(i,j ) = μmax where μmax ∈ {μ(i,j ) ; (i, j ) ∈ ζ } is the maximum fuzzy membership value. If the maximum value occurs at a single point only, Eq. (73) reduces to a selection filtering operation y(p,q) = x(i,j ) ,
for μ(i,j ) = μmax
(77)
220
LUKAC AND PLATANIOTIS
(a)
(b)
(c)
(d)
F IGURE 12. Filtering of additive Gaussian noise in Figure 5b: (a) MF output, (b) VMF output, (c) data-adaptive filter output, (d) DPAL filter output.
which identifies one of the samples inside the processing window Ψ(p,q) as the filter output. 7. Adaptive Multichannel Filters Based on Digital Paths Adaptive multichannel filters proposed in Szczepanski et al. (2003, 2004) exploit connections between image pixels using the concept of digital paths instead of using a fixed supporting window. Operating in a predefined search area, image pixels are grouped together forming paths that reveal the underlying structural image content and are used to determine the weighting coefficients of a data-adaptive filter in Eq. (73).
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
221
Assuming that ρ0 = (i, j ) and ρη = (g, h)—the two spatial locations inside the search area—are connected by a digital path P η (ρ0 , ρ1 , . . . , ρη ) of length η, the connection cost Λη (·) defined over the digital path linking the starting ρ0 and the ending ρη location using η − 1 connecting locations ρc , for c = 1, 2, . . . , η − 1, is expressed as η
Λ (ρ0 , ρ1 , . . . , ρη ) = f (xρ0 , xρ1 , . . . , xρη ) =
η c=1
xρc − xρc−1 . (78)
The function Λη (·) can be seen as a measure of dissimilarity between color image pixels xρ0 , xρ1 , . . . , xρη . If a path P η (·) joining two distinct locations consists of the identical vectors xρc , for c = 0, 1, . . . , η, then Λη (·) = 0, otherwise Λη (·) > 0. In general, two distinct pixel’s locations on the image lattice can be connected by many paths. Moreover the number of possible geodesic paths of certain length η connecting two distinct points depends on their locations, length of the path, and the neighborhood system used (Smolka et al., 2004). Similarly to a fuzzy membership function [Eq. (75)] used in a data-adaptive filter in Eq. (73), a similarity function is employed here to evaluate the appropriateness of the digital paths leading from (i, j ) to (g, h) as follows: χ
μ
η,Ψ
η,b f Λ (i, j ), (g, h) (i, j ), (g, h) = =
b=1 χ b=1
exp −β · Λη,b (i, j ), (g, h)
(79)
where χ is the number of all paths connecting (i, j ) and (g, h), Λη,b [(i, j ), (i, j )] is a dissimilarity value along a specific path b from the set of all χ possible paths leading from (i, j ) to (g, h) in the search area Ψ , f (·) is a smooth function of Λη,b , and β is the design parameter. Note that Ψ can be restricted by the dimension of the supporting window in conventional filtering. If x(p,q) ∈ Ψ(p,q) is the color vector under consideration and x(i,j ) ∈ Ψ(p,q) represents the vector connected to x(p,q) via a digital path, the digital path approach (DPA) filter is defined as follows: (p,q) w(i,j ) x(i,j ) y(p,q) = (i,j )⇔(p,q)
=
(i,j )⇔(p,q)
"
μη,Ψ [(p, q), (i, j )]x(i,j ) η,Ψ [(p, q), (g, h)] (g,h)⇔(p,q) μ
(80)
222
LUKAC AND PLATANIOTIS
where (i, j ) ⇔ (p, q) denotes all points (i, j ) connected by digital paths with (p, q) contained in Ψ(p,q) . Adapting the concept in Eq. (73), the outputted vector y(p,q) is equivalent to the weighted average of all vectors x(i,j ) connected by digital paths with the vector x(p,q) . A more sophisticated solution is obtained by incorporating the information on the local image features into the filter structure (Smolka et al., 2004). This can be done through the investigation of the connection costs Λη (·) of digital paths that originate at ρ0 , cross ρ1 , and then pass the successive locations ρc , for c = 2, 3, . . . , η, until the path reaches length η. Operating on the above assumptions, the similarity function [Eq. (79)] is modified as μ
η,Ψ
(ρ0 , ρ1 , η) =
χ b=1
exp −β · Λη,b (ρ0 , ρ1 , ρ2∗ , . . . , ρη∗ )
(81)
where χ denotes the number of the paths P η (ρ0 , ρ1 , p2∗ , . . . , ρη∗ ) originating at ρ0 crossing ρ1 and ending at ρη∗ , which are totally included in the search area Ψ . If the constraint of crossing the location ρ1 is omitted, then Λη,b (ρ0 , ρ1 , ρ2∗ , . . . , ρη∗ ) can be replaced with Λη,b (ρ0 , ρ1∗ , ρ2∗ , . . . , ρη∗ ). " Using the normalized weights w(ρ0 , ρ1∗ ) = μη,Ψ (ρ0 , ρ1 , η)/ μη,Ψ (·) the so-called DPAF filter replaces the color vector x(p,q) ∈ Ψ(p,q) under consideration as follows (Szczepanski et al., 2003): y(p,q) = w(ρ0 , ρ1∗ )xρ1∗ . (82) ρ1∗ ∼ρ0
Thus, operating inside the supporting window Ψ(p,q) , the weights are calculated exploring all digital paths starting from the central pixel and crossing its neighbors. The output vector y(p,q) is obtained through a weighted average of the nearest neighbors of x(p,q) . In a similar way, the so-called DPAL filter can be defined as (Szczepanski et al., 2003) w(ρ0 , ρη∗ )xρη∗ (83) y(p,q) = ρη∗
where the weights w(ρ0 , ρη∗ ) are obtained by exploring all digital paths leading from the central pixel x(p,q) , for ρ0 = (p, q) to any of the pixels in the supporting window. Then, a weighted average of all pixels contained in the supporting window is calculated to determine y(p,q) . Note that the DPAL filter involves all the |ζ | pixels from Ψ(p,q) into the averaging process, whereas the DPAF filter determines the weighted output using only its nearest neighbors. Therefore, the DPAL filter has more efficient smoothing capability.
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
(a)
(b)
(c)
(d)
223
F IGURE 13. Filtering of impulsive noise in Figure 5c: (a) MF output, (b) VMF output, (c) SWVF output, (d) output of the switching filter in Lukac et al. (2004e).
8. Switching Filtering Schemes Besides the adaptive filters, such as SWVF, fuzzy vector filters, and DPAbased filters, the essential trade-off between noise suppression and imagedetail preservation (see Figure 13) can be achieved in the impulsive environment by switching within a range of predefined filtering operators (Figure 14) (Hore et al., 2003; Ma and Wu, 2006; Smolka, 2002). As explained in Lukac et al. (2005f), filters following the switching mode paradigm most often switch between a robust nonlinear smoothing mode and an identity processing mode that leaves input samples unchanged during
224
LUKAC AND PLATANIOTIS
F IGURE 14. Switching filter concept based on (a) the fixed threshold ξ and (b) fully adaptive control using the signal statistics.
the filtering operation. Their decoupled, easily implemented structure and their computational simplicity made such filters popular and a method of choice in a variety of applications where the desired signal is corrupted by impulsive noise. Recent developments have seen the introduction of switching schemes based on vector operations, which further increased the appeal of the switching framework in color processing applications (Lukac, 2003, 2004a; Lukac et al., 2005c). Operating on the color vectors inside the supporting window Ψ(p,q) , the switching vector filter (SVF) output is defined as follows (Lukac, 2004a): y(p,q) =
yNSF (p,q) x(p,q)
if λ ≥ ξ otherwise
(84)
where yNSF (p,q) denotes a robust nonlinear smoothing filter (NSF) output (e.g., VMF, BVDF, or DDF) and x(p,q) is the input color vector occupying the center of the supporting window Ψ(p,q) . The switching mechanism is controlled by comparing the adaptive parameter λ(Ψ(p,q) ) and the nonnegative threshold ξ , which can be defined either as the fixed value used in Figure 14a or the function ξ(Ψ(p,q) ) of the input set Ψ(p,q) as shown in Figure 14b. In the case of noise detection (λ ≥ ξ ), the input vector x(p,q) ∈ Ψ(p,q) is replaced with the NSF output yNSF (p,q) . If λ < ξ , then x(p,q) is considered noise-free and remains unchanged (i.e., the SVF performs the so-called identity operation). If ξ = 0, then the filter output is always the NSF output, while for large values of ξ , the switching filter output will always be the central pixel x(p,q) as follows: y(p,q) =
yNSF (p,q) x(p,q)
if ξ = 0 if ξ → ∞.
(85)
The SVF family in Lukac (2002a, 2003) uses the switching mechanism in Eq. (84) based on the function λ(Ψ(p,q) , τ ) of the window center x(p,q) and the
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
225
robust order statistics {x(c) ∈ Ψ(p,q) , c = 1, 2, . . . , τ } obtained in Eq. (21): τ 1 λ = d x(p,q) , (86) x(c) . τ c=1
The achieved value of λ is compared with the fixed threshold ξ . It has been found in Lukac (2002a) that the utilization of the angular distance Eq. (27) in Eq. (86) along with the set of the five (τ = 5) lowest vector directional order statistics x(c) , the BVDF-based yNSF (p,q) , and the threshold ξ = 0.16 provides excellent color/structural preserving characteristics. A more robust solution in Lukac (2003) uses Eqs. (86) and (21) based on the Euclidean metric [Eq. (24)], the VMF-based yNSF (p,q) , and the control parameters τ = 5 and ξ = 60. A sophisticated switching filtering scheme is obtained using the selection weighted filters (WM, WVMF, WVDF, SWVF) with the weight vector w = [w(i,j ) ; (i, j ) ∈ ζ ] constituted as follows (Lukac, 2004a): w(i,j ) = |ζ | − 2c + 2 if (i, j ) = (p, q) (87) 1 otherwise. By tuning the smoothing parameter c, for c = 1, 2, . . . , (|ζ | + 1)/2, in such a center-weighted filter the mechanism regulates the amount of smoothing provided by the filter ranging from no smoothing (the identity operation for c = 1) to the maximum amount of smoothing [c = (|ζ | + 1)/2 reduces, for example, the WVMF and WVDF operators to the VMF and BVDF, respectively] (Lukac, 2002b). Operating in this range, the switching parameter λ(Ψ(p,q) , τ ) is defined as follows: τ +2 λ= d yc(p,q) , x(p,q)
(88)
c=τ
where yc(p,q) denotes the vector obtained using the selection weighted filters based on Eq. (87) with the parameter c. The value of λ in Eq. (88) is compared in Eq. (84) with the fixed threshold ξ . Using the different distance measures d(·) and selection weighted filters yc(p,q) in Eq. (88), and the different smoothing filters in Eq. (84), a multitude of SVF filters varying in their performance and complexity can be obtained. For example, using the directional processing based WVDF yc(p,q) , the angular distance [Eq. (27)], the BVDF-based yNSF (p,q) , and the parameters τ = 2 and ξ = 0.19 excellent color/structural preservation is obtained (Lukac, 2004a). Robust impulsive noise filtering characteristics are observed when the WVMF-based yc(p,q) , the Euclidean distance [Eq. (24)], the VMF-based yNSF (p,q) , and ξ = 80 are used
226
LUKAC AND PLATANIOTIS
instead (Lukac, 2001). Very low complexity is obtained when the componentwise implementation with the WM-based yc(p,q) , the MF-based yNSF (p,q) , and ξ = 60 is employed (Lukac et al., 2004e). Finally, it should be noted that using center-weighted selection filters, the concept can be extended from a bilevel smoothing scheme in Eq. (84) to a multilevel smoothing [up to (|ζ | + 1)/2 smoothing levels in Eq. (84)], which can allow for additional flexibility and performance improvement (Lukac and Marchevsky, 2001a; Lukac, 2004a). The SVF filters in Lukac et al. (2005c) use the approximation of the multivariate dispersion. In this design, the value of λ is determined through the function λ(Ψ(p,q) ) = D(p,q) defined as the aggregated distance D(p,q) in Eq. (20) between the window center x(p,q) and the other vectors x(i,j ) , for (i, j ) ∈ ζ , inside the supporting window Ψ(p,q) . The parameter ξ(Ψ(p,q) , τ ) in Eq. (84) is determined as (Lukac et al., 2005c) ξ = D(1) + τ ψ =
|ζ | − 1 + τ D(1) |ζ | − 1
(89)
|ζ | + τ Dx¯ |ζ |
(90)
where ψ is the variance approximated using the smallest aggregated distance D(1) ∈ {D(i,j ) ; (i, j ) ∈ ζ } as ψ = D(1) /(|ζ | − 1) and τ (suboptimal value is τ = 4) is the tuning parameter used to adjust the smoothing properties of the SVF filter. An alternative " solution can approximate the variance via ψx¯ = Dx¯ /|ζ |, where Dx¯ = x(p,q) , x(i,j ) ) is the aggregated (i,j )∈ζ d(¯ distance between multichannel input samples x(i,j ) ∈ Ψ(p,q) and the sample mean x¯ (p,q) . In this case, the parameter ξ(Ψ(p,q) , τ ) in Eq. (84) is determined as follows (Lukac et al., 2005c): ξ = Dx¯ + τ ψx¯ = with the suboptimal value τ = 12. 9. Similarity Based Vector Filters Taking advantage of the switching vector filters, the class of similarity-based vector filters has been proposed in Smolka et al. (2003). These filters use the modified aggregated similarity measures ′ D(p,q) μ(x(p,q) , x(g,h) ) (91) = (g,h)∈ζ (g,h)=(p,q)
′ D(i,j ) =
(g,h)∈ζ (i,j )=(p,q) (i,j )=(g,h)
μ(x(i,j ) , x(g,h) )
(92)
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
227
associated with the input central vector x(p,q) and the neighboring vectors x(i,j ) , respectively, located inside the supporting window Ψ(p,q) . The function μ(x(i,j ) , x(g,h) ) = μ(x(i,j ) − x(g,h) ) denotes the similarity function μ : [0; ∞) → R, which is nonascending and convex in [0; ∞) and satisfies μ(0) = 1, μ(∞) = 0. The similarity between two identical vectors is equal to 1, and the similarity between the two maximally different color vectors [0, 0, 0] and [255, 255, 255] should be very close to 0. The above conditions are satisfied by the following similarity functions (Smolka et al., 2003): 2 μ(x(i,j ) , x(g,h) ) = exp − x(i,j ) − x(g,h) / h μ(x(i,j ) , x(g,h) ) = exp −x(i,j ) − x(g,h) / h
μ(x(i,j ) , x(g,h) ) =
1 , 1 + x(i,j ) − x(g,h) / h
h ∈ (0; ∞)
1 (1 + x(i,j ) − x(g,h) )h 2 μ(x(i,j ) , x(g,h) ) = 1 − arctan x(i,j ) − x(g,h) / h π 2 , h ∈ (0; ∞) μ(x(i,j ) , x(g,h) ) = 1 + exp{x(i,j ) − x(g,h) / h} μ(x(i,j ) , x(g,h) ) =
μ(x(i,j ) , x(g,h) ) =
1 . 1 + x(i,j ) − x(g,h) h
(93) (94) (95) (96) (97) (98) (99)
′ ′ By comparing the values D(p,q) and D(i,j ) , for (i, j ) ∈ ζ and (i, j ) = (p, q), the switching filtering function (Smolka et al., 2003) is formed: ′ ′ ′ y(p,q) if D(p,q) ≤ min{D(i,j )} y(p,q) = (100) x(p,q) otherwise ′ ′ where the satisfied condition D(p,q) ≤ min{D(i,j ) } identifies the noisy window center x(p,q) (with the minimum similarity to other vectors in Ψ(p,q) ) to be replaced with
′ y′(p,q) = arg max D(i,j ) (x(i,j ) ); (i, j ) ∈ ζ, (i, j ) = (p, q) . x(i,j )
(101)
The vector y′(p,q) denotes the input vector x(i,j ) , which maximizes the aggregated similarity measure in Eq. (92) defined over the vectors neighboring ′ ′ the central vector x(p,q) ∈ Ψ(p,q) . If D(p,q) > min{D(i,j ) } in Eq. (100), then the window center x(p,q) is passed to the filter output unchanged.
228
LUKAC AND PLATANIOTIS
Apart from the various similarity measures listed in Eqs. (93)–(99), the simplest function 1 − x(i,j ) − x(g,h) / h if x(i,j ) − x(g,h) ≤ h μ(x(i,j ) , x(g,h) ) = (102) 0 otherwise where h ∈ (0; ∞), can be used through the aggregated distance function ⎧ −h + x(i,j ) − x(g,h) if (i, j ) = (p, q) ⎪ ⎪ ⎪ ⎪ ⎪ (g,h)∈ζ ⎪ ⎨ (g,h)=(p,q) (103) D(i,j ) = x(i,j ) − x(g,h) otherwise ⎪ ⎪ ⎪ ⎪ (g,h)∈ζ ⎪ ⎪ ⎩ (i,j )=(p,q) (i,j )=(g,h)
as the basis of the fast VMF-like filter. Taking into consideration the quantities D(i,j ) obtained in Eq. (103), the filter replaces the original vector x(p,q) in the supporting window Ψ(p,q) with x(i,j ) ∈ Ψ(p,q) as follows: (104) y(p,q) = arg min D(i,j ) (x(i,j ) ); (i, j ) ∈ ζ . x(i,j )
The construction of the above vector filter is similar to that of the VMF, with the major difference related to the omission of the central vector x(p,q) when calculating D(i,j ) in Eq. (103), for (i, j ) = (p, q). Since the central vector x(p,q) is not used in calculating the aggregated distances associated with its neighbors, the filter replaces x(p,q) only when it is really noisy. Similar to other switching filters, this preserves the desired image information. 10. Adaptive Hybrid Vector Filters A robust structure-adaptive hybrid vector filter (SAHVF) (Ma et al., 2005) classifies the central pixel x(p,q) into several different signal activity categories using noise-adaptive preprocessing and modified quadtree decomposition. The classification is performed during processing for each input set Ψ(p,q) determined by the sliding supporting window and a window adaptive hybrid filtering operation is then chosen according to the structure classification. Thus, the filter adapts itself to both local statistics through an update of the filter parameters and local structures by modifying the window dimension. The SAHVF employs the presmoothing filter similar to the well-known adaptive filter in Lee and Fam (1987). To use fast component-wise filter and avoid color artifacts in the outputted RGB vector, the presmoothing operation
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
229
is performed in the decorrelated YCb Cr space as follows: ′ y(p,q)k
=
′ x¯(p,q)k
+
σx2′
k
σk2
′ ′ (x(p,q)k − x¯(p,q)k )
(105)
where k = 1, 2, 3 indicates the Y, Cb , Cr channel, respectively; and σk2 = " ′ ′ 2 1/|ζ |2 (i,j )∈ζ (x(i,j component-wise variance calculated )k − x¯ (p,q)k ) is a" ′ ′ using the sample mean x¯(p,q)k = 1/|ζ | (i,j )∈ζ x(i,j )k of the kth components ′ ′ ′ x(i,j )k ∈ Ψ(p,q)k . The components x¯(p,q)k , for k = 1, 2, 3, constitute the ′ ′ ′ YCb Cr versions x′(·,·) = [x(·,·)1 , x(·,·)2 , x(·,·)3 ] of the original RGB vectors 2 x(·,·) . The local image signal variance σx ′ = max{σk2 − σv2k , 0} is obtained k √ using the contaminated additive noise deviation σvk = max{ξ π/18σ¯ vk − ε, 0}, where √ K 1 −1 K 2 −1 π/2 ′ (p,q)k ∗ u(p,q) (106) Ψ σ¯ vk = 6(K1 − 2)(K2 − 2) p=2 q=2
is the global noise deviation calculated using all the YCb Cr image values ′ ′ ∈Ψ xˆ(·,·)k (p,q)k corresponding to the outputted RGB VMF values in Eq. (44). A 3 × 3 Laplacian convolution mask u(p,q) = {1, −2, 1, −2, 4, −2, 1, −2, 1} is utilized to collect the additive noise energy from the input image (Ma et al., 2005). The empirically determined parameters ξ = 1.2 and ε = 4.4 has been found to well compensate the bias from fine image structures. To achieve sufficient suppression of noise in background areas, the choice of the window dimension of the presmoothing filter in Eq. (106) should depend on the value of the noise deviation σvk , namely, a 3 × 3, 5 × 5, and 7 × 7 window is recommended for 0 ≤ σvk ≤ 15, 15 < σvk ≤ 30, and 30 < σvk , respectively. The structure activity classification is performed at the second SAHVF stage. By converting the preprocessed image in Eq. (105) into the luminance ′ ′ ′ signal L′(p,q) ([y(p,q)1 , y(p,q)2 , y(p,q)3 ]), the modified quadtree decomposition (Ma et al., 2005) is used to decompose the luminance image {L′(·,·) } into nonoverlapping rectangular blocks represented by the deviation quantities λc , for c = 1, 2, 3, 4. Using the design parameter η (recommended values 0.7 ≤ η ≤ 2.5) and the median absolute deviation (MAD) of the ensemble deviations σ(p,q) , the satisfied condition max {λc } − min {λc }
1≤c≤4
1≤c≤4
≤ η.MAD{σ(p,q) ; p = 1, 2, . . . , K1 , q = 1, 2, . . . , K2 }
(107)
denotes a homogeneous block. If any block is classified as inhomogeneous, the splitting step is recursively applied until all the subblocks are either
230
LUKAC AND PLATANIOTIS
marked as homogeneous or they reach the minimum 1 × 1 size. If the dimension of any homogeneous block is larger than 16 × 16 pixels, the recursive block splitting procedure is applied until all the subblocks reach a 16 × 16-square shape. Since each luminance pixel L(p,q) is contained in an l × l block, for l = 1, 2, . . . , 16, an activity flag a(p,q) (L(p,q) ) = l can be determined. Using a 3 × "3 mean filter, the structure activity index is determined as I(p,q) = 1/|ζ | (i,j )∈ζ a(p,q) , where |ζ | = 9. Based on the obtained value of I(p,q) , each input RGB vector x(p,q) is classified finally as (1) high activity area (for 1.0 ≤ I(p,q) ≤ 2.5), (2) medium activity area (for 2.5 < I(p,q) ≤ 5.5), and (3) low activity area (for 5.5 < I(p,q) ≤ 16.0). Thus, the small values of I(p,q) denote details or impulses requiring use of a detail-preserving nonlinear filter with a small supporting window. The medium values of I(p,q) usually correspond to edges and textures, whereas the large values of I(p,q) denote flat areas and allow for the utilization of a large supporting window. If I(p,q) denotes the high-activity area (1.0 ≤ I(p,q) ≤ 2.5), then the L-filtering SAHVF output y(p,q) is obtained via a 3 × 3 window-based " structure in Eq. (56) with the coefficients wc = μc / τc=1 μc , for μc = (d(τ ) − d(c) )/(d(τ ) − d(1) ) and c = 1, 2, . . . , τ . The parameter τ is determined using the central pixel peer group concept as τ = min arg " α
1 "α
"|ζ |−1 & 1
c=1 d(c) − |ζ |−1−α c=α+1 d(c) α 2 "|ζ |−1 2 "|ζ |−1 α 1 1 "α c=1 d(c) + c=1 d(c) − α c=α+1 d(c) − |ζ |−1−α c=α+1 d(c)
(108) where α = 1, 2, . . . , |ζ |−1. The quantities {d(c) ; for c = 1, 2, . . . , |ζ |−1} ⊂ {d(x(p,q) , x(i,j ) ); (i, j ) = (p, q), (i, j ) ∈ ζ } are the ordered distance between the central vector x(p,q) and its neighbors x(i,j ) inside the processing window Ψ(p,q) . The values of d(x(p,q) , x(i,j ) ) are calculated using the combined distance measure from Eq. (29) as follows (Plataniotis et al., 1996; Plataniotis and Venetsanopoulos, 2000): |x(p,q) − x(i,j ) | 1− . (109) d(x(p,q) , x(i,j ) ) = x(p,q) x(i,j ) max{x(p,q) , x(i,j ) }
x(p,q) .xT(i,j )
In the case of a medium activity area (for 2.5 < I(p,q) ≤ 5.5), the recommended window size is 3 × 3 or 5 × 5 pixels and the SAHVF output y(p,q) is obtained using the data-adaptive concept in Eq. (73) with the weights w(i,j ) in Eq. (74) defined using μ(i,j ) =
(D(|ζ |) − D(i,j ) ) + γ (D(i,j ) − D(1) ) (1 + γ )(D(|ζ |) − D(1) )
(110)
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
231
where γ is an empirical parameter used to control the nonlinearity of the weighted function. The value of D(|ζ |) ∈ {D(i,j ) ; (i, j ) ∈ ζ } and D(1) ∈ {D(i,j ) ; (i, j ) ∈ ζ } denotes the maximum and the minimum, respectively, of the aggregated distances in Eq. (20), which are calculated using the combined distance in Eq. (109). The extremes D(|ζ |) and D(1) can be used (Plataniotis and Venetsanopoulos, 2000) to define γ as γ = (D(|ζ |) − D(1) )−1 . Finally, if the input vector x(p,q) corresponds to the 5.5 < I(p,q) ≤ 16.0 range, then the SAHVF output y(p,q) is determined using the data-adaptive concept in Eq. (73) with the weights w(i,j ") equivalent to the normalized structure activity values w(i,j ) = I(i,j ) / (g,h)∈ζ I(g,h) , for (i, j ) ∈ ζ . Depending on the value of I(p,q) , the supporting window can be chosen from 7 × 7 to 11 × 11 pixels in size. B. Performance Evaluation of the Noise Reduction Filters In many application areas, such as multimedia, visual communications, production of motion pictures, the printing industry, and graphic arts, greater emphasis is given to perceptual image quality. Consequently, the perceptual closeness of the filtered image to the uncorrupted original image is ultimately the best measure of the efficiency of any color image filtering method (see images shown in Figures 5, 12, 13, and 15). There are basically two major approaches1 used for assessing the perceptual error between two color images, namely, the objective evaluation approach and the subjective evaluation approach. 1. Objective Evaluation Following conventional practice, the difference between the original and noisy images, as well as the difference between the original and filtered images, is often evaluated using the commonly employed objective measures (Lukac et al., 2004c), such as mean absolute error (MAE) and MSE corresponding to signal-detail preservation and noise suppression, respectively. Since the RGB space is the most popular color space used conventionally to store, process, display, and analyze color images, both the MAE and the MSE are expressed in the RGB color space as follows: MAE =
K2 3 K1 1 |o(p,q)k − y(p,q)k | 3K1 K2
(111)
k=1 p=1 q=1
1 Comprehensive comparisons of various noise removal filters can be found in Lukac (2004a), Lukac et al. (2004c, 2005f), Plataniotis et al. (1999), Plataniotis and Venetsanopoulos (2000), and Smolka et al. (2004).
232
LUKAC AND PLATANIOTIS
(a)
(b)
(c)
(d)
F IGURE 15. Filtering of mixed noise in Figure 5d: (a) MF output, (b) VMF output, (c) data-adaptive filter output, (d) DPAL filter output.
MSE =
K2 3 K1 1 (o(p,q)k − y(p,q)k )2 3K1 K2
(112)
k=1 p=1 q=1
where o(p,q) = [o(p,q)1 , o(p,q)2 , o(p,q)3 ] is the original RGB pixel, y(p,q) = [y(p,q)1 , y(p,q)2 , y(p,q)3 ] is the processed pixel with (p, q) denoting a spatial position in a K1 × K2 color image, and k characterizing the color channel. However, the above criteria do not measure the perceptual closeness between the two images because the RGB is not a uniform color space. Therefore, the additional criterion expressed in the perceptually uniform CIE Lab or CIE Luv color space should be used in conjunction with the MAE
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
233
and MSE. The normalized color difference (NCD) criterion (Plataniotis et al., 1999) is defined in the CIE Luv color space and its usage is essential in determining the color information preservation. Using the NCD criterion, the perceptual similarity between the original and the processed image is quantified as follows: "K1 "K2 $"3 2 k=1 (o¯ (p,q)k − y¯(p,q)k ) q=1 p=1 $ (113) NCD = "K1 "K2 "3 2 ( o ¯ ) k=1 (p,q)k q=1 p=1
where o¯ (p,q) = [o¯ (p,q)1 , o¯ (p,q)2 , o¯ (p,q)3 ] and y¯ (p,q) = [y¯(p,q)1 , y¯(p,q)2 , y¯(p,q)3 ] are the vectors representing the RGB vectors o(p,q) and y(p,q) , respectively, in the CIE Luv color space. Since the NCD as well as the MAE and the MSE evaluate the difference between two images, small error values denote enhanced performance. A precisely designed color image filter should yield consistently good results with respect to all of the above measures. An alternative criteria to the MSE are the signal-to-noise ratio (SNR) and the peak signal-to-noise ratio (PSNR) defined as "K1 "K2 "3 2 k=1 (o(p,q)k ) q=1 p=1 (114) SNR = 10 log10 "K "K "3 1 2 2 k=1 (o(p,q)k − y(p,q)k ) p=1 q=1 2552 (115) PSNR = 10 log10 MSE whereas the NCD criterion can be replaced using the Lab criterion (Vrhel et al., 2005): ( K2 ) K1 3 ) 1 * (o¯ 2 (116) Lab = (p,q)k − y¯(p,q)k ) K1 K2 p=1 q=1
k=1
where o¯ (p,q)k is the kth component of the CIE Lab vector o¯ (p,q) corresponding to the original RGB vector o(p,q) . Similarly, y¯(p,q)k is the kth component of the CIE Lab vector y¯ (p,q) corresponding to the RGB vector y(p,q) in the filtered image y. 2. Subjective Evaluation Since most enhanced images are intended for human inspection, a subjective image quality evaluation approach is widely used. Subjective evaluation is also required in practical application, where the original, uncorrupted images are unavailable. In this case, standard objective measures (MAE, MSE, NCD) of performance evaluations, which are based on the difference in the statistical
234
LUKAC AND PLATANIOTIS TABLE 1 S UBJECTIVE I MAGE E VALUATION G UIDELINES Score
Overall evaluation of the distortion
Noise removal evaluation
1 2 3 4 5
Very disruptive Disruptive Destructive but not disruptive Perceivable but not destructive Imperceivable
Poor Fair Good Very good Excellent
distributions of the pixel values, cannot be utilized (Lukac et al., 2005f; Plataniotis and Venetsanopoulos, 2000). Using the subjective evaluation approach, the image quality is evaluated with respect to (Lukac et al., 2005f) (1) image detail preservation (DP), (2) presence of residual noise, and (3) the introduction of color artifacts as a result of faulty, or excessive, processing. The choice of these criteria follows the well-known fact that the human visual system is sensitive to changes in color appearance. Furthermore, a good restoration method should maintain the edge information while it removes image noise. Edges are important features since they indicate the presence and the shape of various objects in the image. As shown in Table 1, performance (or lack of it) should be ranked subjectively in five categories (Plataniotis and Venetsanopoulos, 2000). In the subjective evaluation procedure (Lukac et al., 2005f), the methods under consideration are usually applied to the reference images and compared according to criteria listed in Table 1. Input (reference) images and filtered outputs should be viewed simultaneously under identical viewing conditions either on panel or on the screen by the set of observers. Although specific criteria can be applied in choosing the observers, to simulate a realistic situation where viewing of images is done by ordinary citizens and not image processing experts, the observers should be unaware of the specifics of the experiments (Lukac et al., 2005f). The subjective evaluation experiment should be performed in a controlled room (external light had no influence on image perception) with gray painted walls. If the screen is used to display the images during evaluation tests, it should be a calibrated, highquality, displaying device with controlled illumination. Pixel-resize zooming functionality control over the images can be allowed to highlight the image details. Finally, it should be mentioned that the images can be presented either in specific or random order. C. Inpainting Techniques Although many noise reduction filters can excellently enhance color images corrupted by point-wise acquisition and/or transmission noise, there are
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
235
situations when the damaged image areas are a size larger than that of the supporting window. Such a problem occurs in video data archiving, transmission over best effort networks or wireless communication channels, and aggressive coding where visual impairments in visual data are often observed (Criminisi et al., 2004; Park et al., 2005; Rane et al., 2003). For example, missing blocks are introduced by packet loss during wireless transmission. Archived photographs, films, and videos are exposed to chemical and physical elements as well as environmental conditions, which cause visual information loss and artifacts (e.g., cracks, scratches, and dirt) in the corresponding digital representation. Finally, undesired image objects such as logos, stamped dates, text, and persons can also be considered as the damaged area to be reconstructed using digital image processing. To restore the damaged image areas, digital inpainting techniques designed either for image or video restoration should be used. Image inpainting (Rane et al., 2003) refers to the process of filling in missing data in a designated region of the visual input by means of image interpolation. The object of the process is to reconstruct missing parts or damaged images in such a way that the inpainted region cannot be detected by a casual observer (Figure 16). To recover the color, structural, and textural content in a large damaged area, output pixels are calculated using the available data from the surrounding undamaged areas (Rane et al., 2003). The required input can be automatically determined by the inpainting technique or supplied by the user. Since different inpainting techniques focus on pure texture or pure structure restoration, both the quality and cost of the inpainting process differ significantly. Boundaries between image regions constitute structural (edge) information, which is a complex, nonlinear phenomenon produced by blending together different textures. It is not therefore surprising that the state-of-the-art inpainting methods attempt to simultaneously perform texture and structure filling in (Rane et al., 2003). D. Image Sharpening Techniques Apart from the noise reduction techniques, image sharpening or high-pass filtering is often required to enhance the appearance of the edges and fine image details (Figure 17) (Hardie and Boncelet, 1995; Konstantinides et al., 1999; Tang et al., 1994). Images are usually blurred by image processing, such as low-pass filtering and compression resulting in various edge artifacts. Since many practical applications suffer from noise, image sharpeners should be insensitive to both noise and compression artifacts (Fischer et al., 2002; Tang et al., 1994). It has been widely observed that linear sharpeners such as unsharp masking are inefficient and introduce new artifacts into the image (Fischer et al., 2002;
236
LUKAC AND PLATANIOTIS
(a)
(b)
(c)
(d)
F IGURE 16. Color image inpainting: (a, c) damaged images, and (b, d) the corresponding images reconstructed using image inpainting.
Hardie and Boncelet, 1993). Therefore, nonlinear sharpening methods based on robust order statistics are used instead. Operating in the component-wise manner, the comparison and selection (CS) filter enhances the image by replacing the input components x(p,q)k with its enhanced value y(p,q)k as follows (Lee and Fam, 1987): if x¯k ≥ x((|ζ |+1)/2)k x(τ )k y(p,q)k = (117) x(|ζ |−τ +1)k otherwise where x¯k = mean{x(i,j )k ; (i, j ) ∈ ζ } is the component-wise mean of Ψ(p,q)k , x((|ζ |+1)/2)k obtained in Eq. (19) is the component-wise median of
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
(a)
(b)
(c)
(d)
237
F IGURE 17. Image sharpening demonstrated on (a) a 256×256 color image in Figure 5a blurred by a 3 × 3 mean filter. Images (b)–(d) were obtained by sharpening the blurred image in (a) using (b) CS filter, (c) LUM sharpener, (d) WM sharpener.
Ψ(p,q)k , x(τ )k and x(|ζ |−τ +1)k are the component-wise order statistics defined in Eq. (19), and τ , for τ = 1, 2, . . . , (|ζ | + 1)/2, is the parameter used to control the level of enhancement. The smaller, value of τ , the more significant sharpening characteristics are obtained. Although the CS filter offers good performance, it often smoothes fine details (Hardie and Boncelet, 1993). This is not the case when lower-uppermiddle (LUM) sharpeners (Hardie and Boncelet, 1993, 1995) are used. The output of the LUM sharpener is obtained by comparing the window center (middle sample) x(p,q)k with the lower x(τ )k and the upper x(|ζ |−τ +1)k
238
LUKAC AND PLATANIOTIS
component-wise order statistics in Eq. (19) as follows: if x(τ ) < x(p,q)k ≤ x¯kτ x(τ )k y(p,q)k = x(|ζ |−τ +1)k if x¯kτ < x(p,q)k < x(|ζ |−τ +1)k x(p,q)k otherwise
(118)
where x¯kτ = (x(τ )k + x(|ζ |−τ +1)k )/2, and τ , for τ = 1, 2, . . . , (|ζ | + 1)/2, is the parameter used to control the enhancement process. The level of enhancement varies from the maximum amount of sharpening (for τ = 1) to no sharpening [identity operation obtained for τ = (|ζ | + 1)/2]. If x(τ ) < x(p,q)k < x(|ζ |−τ +1)k , then the input central sample x(p,q)k represents an edge transition. By shifting x(p,q)k to extreme order statistics x(τ )k and x(|ζ |−τ +1)k in Eq. (118), the transition point is removed resulting in a stepper edge. More sophisticated sharpeners can be obtained using the WM framework, which admits the negative weights (Arce, 1998). However, such an approach requires complex optimization procedures to obtain the desired performance. To avoid this drawback, computationally efficient approaches combine linear sharpening operators and robust order-statistics (Fischer et al., 2002). By adding to the input image x a high-pass filtered version of x the sharpening filter can be obtained. Following the derivation in Fischer et al. (2002), the Laplacian-based high-pass permutation WM (PWM) filter can be used to enhance the image x as follows: (x(τ +1)k − x(1)k )/2 if r(p,q)k ≤ τ if τ < r(p,q)k ≤ |ζ | − τ (119) y(p,q)k = x(p,q)k + (x(p,q)k − x(1)k )/2 (x(|ζ |−τ )k − x(1)k )/2 otherwise where the addition of the kth component x(p,q)k of the central vector x(p,q) ∈ Ψ(p,q) normalizes the output of the high-pass filter to the desired intensity range. Another approach follows the unsharp masking concept (Polesel et al., 2000) and expresses the sharpening filter as y(p,q) = x(p,q) + λf (Ψ(p,q) ). The scaling parameter λ is used to tune the amount of sharpening operation and f (·) is a high-pass filter defined over the input set Ψ(p,q) used to extract the high-frequency component in the image. By employing the WM high-pass filter in the processing pipeline, the component-wise sharpening procedure is performed as follows (Fischer et al., 2002): y(p,q)k = (1 + λ)x(p,q)k − λx¯(p,q)k
(120)
x¯(p,q)k = (x(1)k + x(|ζ |)k )/2
(121)
where
denotes the mid-range of the component-wise input set Ψ(p,q)k .
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
239
The concept can be extended to derive the PWM sharpener (Fischer et al., 2002): x + 0.5λx − 0.5λx¯ if r ≤τ (p,q)k
(τ +1)k
(p,q)k
(p,q)k
if τ < r(p,q)k ≤ |ζ | − τ otherwise (122) where x¯(p,q)k is the mid-range obtained by Eq. (121). The sharpener obtained can be further modified as follows (Fischer et al., 2002): PWM y(p,q)k if xˆ(p,q)k > ξ y(p,q)k = (123) NSF otherwise y(p,q)k y(p,q)k =
(1 + 0.5λ)x(p,q)k − 0.5λx¯(p,q)k x(p,q)k + 0.5λx(|ζ |−τ )k − 0.5λx¯(p,q)k
where xˆ(p,q)k = x(|ζ |)k −x(1)k and the threshold ξ form the switching function. The large value of xˆ(p,q)k indicates the presence of an edge, and in this case PWM should be used. The small values indicate no the sharpening filter y(p,q)k presence of an edge allowing for the utilization of the different processing NSF to flat image areas. Note that type, for example, the smoothing filter y(p,q)k the accuracy of the enhancement process depends highly on the value of ξ . E. Image Zooming Techniques Another application of multichannel image filters is image zooming or spatial interpolation of a digital color image, which is the process of increasing the number of pixels representing the natural scene (Figure 18) (Lukac et al., 2005a, 2005d). It is frequently used in high resolution display devices (Herodotou and Venetsanopoulos, 1995) and consumer-grade digital cameras (Lukac et al., 2004a, 2005e). Spatial interpolation preserves the spectral representation of the input. Operating on the spatial domain of a digital image, spatial interpolation transforms a color image into an enlarged color image. It is well-known that a typical natural image exhibits significant spectral correlation among its RGB color planes. Therefore, scalar techniques operating separately on the individual color channels are insufficient and produce various spectral artifacts and color shifts (Lukac et al., 2005d, 2005e). Moreover, many conventional methods such as bilinear interpolation and spline based techniques often cause excessive blurring or geometric artifacts (Herodotou and Venetsanopoulos, 1995; Lukac et al., 2005e). Therefore, the development of the more sophisticated, vector processing-based, nonlinear approaches is of paramount importance. Zooming a K1 × K2 color image x with the pixels x(p,q) by a factor of z results in a zK1 × zK2 zoomed color image y. The zooming factor z ∈ Z can be an arbitrary positive integer, however, the value z = 2 is selected
240
LUKAC AND PLATANIOTIS
(a)
(b)
(c)
(d)
F IGURE 18. Spatial interpolation of a 256×256 color image shown in Figure 5a with a zooming factor of z = 2: (a) MF output, (b) VMF output, (c) BVDF output, (d) data-adaptive filter output.
here to facilitate the discussion. Assuming the aforementioned setting, the use of the zooming procedure maps the original color vectors x(p,q) with spatial coordinates p and q into the enlarged image y as y(2p−1,2q−1) = x(p,q) where the pixels y(2p,2q) denote the new rows and columns (e.g., of zeros) added to the original data (Lukac et al., 2005e). Using a 3 × 3 processing window Ψ(p,q) , for p = 1, 2, . . . , 2K1 and q = 1, 2, . . . , 2K2 , sliding on the up-sampled image y to calculate individually all the new image pixels in the enlarged color image, the three pixel configurations are obtained when the window is centered on an empty pixel position. In these configurations, the available pixels are described
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
241
using ζ = {(p, q − 1), (p, q + 1)}, ζ = {(p − 1, q), (p + 1, q)}, and ζ = {(p − 1, q − 1), (p − 1, q + 1), (p + 1, q − 1), (p + 1, q + 1)}. Since the first two configurations provide an insufficient number of original pixels for the estimation of the unknown vector y(p,q) at the center of Ψ(p,q) , a two-iteration interpolation procedure is employed. In the first interpolation step, the unknown vector y(p,q) is estimated using the filtering function f (·) defined over the pixels y(i,j ) , for (i, j ) ∈ ζ = {(p − 1, q − 1), (p − 1, q + 1), (p + 1, q − 1), (p + 1, q + 1)}. When this processing step is completed over all regular locations, the second interpolation step is performed on all remaining positions with unknown pixels y(p,q) located in the center of Ψ(p,q) by using an operator f (·) defined over the vectors y(i,j ) , for (i, j ) ∈ ζ = {(p − 1, q), (p, q − 1), (p, q + 1), (p + 1, q)}, constituted of the two original and the two previously estimated pixels. This processing step completes the spatial interpolation process resulting in the fully populated, enlarged color image y. Finally, it should be mentioned that the number of interpolation steps in the spatial interpolation process increases with the value of the zooming factor z. F. Applications 1. Virtual Restoration of Artworks Virtual restoration of artworks is an emerging digital image processing application (Barni et al., 2000; Li et al., 2000; Lukac et al., 2005f). Since original materials, such as mural, canvas, vellum, photography, and paper medium, are invariably exposed to various aggressive environmental factors that lead to the deterioration of the perceived image quality, digital image processing solutions are used to restore, interpret, and preserve collections of visual cultural heritage in a digital form. Environmental conditions include sunshine, oxidation, temperature variations, humidity, and the presence of bacteria. These undesirable effects result in significant variation in the color characteristics and pigmentation of the artwork, preventing proper recognition, classification, and dissemination of the corresponding digitized artwork images. It was argued in Lukac et al. (2005f) that modern color image filters can be used as a preprocessing tool to eliminate noise introduced during the digital acquisition of the original visual artworks. The most common sources of noise and visual impairments are the acquisition device limitations, the granulation of the artwork’s surfaces, as well as the encrusting and accumulation of dirt on protecting surfaces. Thus, color image enhancement methods utilized in virtual restoration of artwork should eliminate noise and impairments present
242
LUKAC AND PLATANIOTIS
(a)
(b)
(c)
(d)
(e)
(f)
F IGURE 19. Artwork image enhancement: (a–c) artwork images, and (d–f) the corresponding enhanced images.
in the corresponding digital data, while at the same time preserving the original colors (pigment) and the fine details of the artwork (Figure 19). Apart from denoising, one of the most critical issues in digitized artwork image restoration is the task of crack removal and fading color enhancement. Cracks are breaks in the original medium, paint, or varnish of the original artwork usually caused by aging, drying, or mechanical factors (Barni et al., 2000; Giakumis et al., 2006). With various degrees of interaction, cracks can first be localized by a sophisticated detection process. In the sequence, the damaged area is restored using image inpainting, which fills the corresponding spatial locations with image interpolated values (Figure 19c, f). Similar to crack removal, a region with faded colors and obscure shadows is first localized (Li et al., 2000). Then the user, by selecting target colors from a color template and an inpainting method, fills in the detected gaps and restores both intensity and color information.
TAXONOMY OF COLOR IMAGE FILTERING AND ENHANCEMENT
(a)
(b)
(c)
(d)
(e)
(f)
243
F IGURE 20. Television image reconstruction: (a–c) television images, and (d–f) the corresponding reconstructed images.
2. Television Image Enhancement Television image enhancement represents a typical application where filtering is used to remove strong transmission noise and other visual impairments (Hamid et al., 2003; Lukac et al., 2005a; Rantanen et al., 1992). In most cases, television signals transmitted over the air are highly corrupted by impulsive or mixed noise due to atmospheric conditions. In this case, the noise can be removed (Figure 20) using robust estimators such as the techniques described in Section IV.A. In addition to transmission noise, images received often contain large damaged areas that are mostly present in the form of corrupted image rows or noise-like diagonal lines. Since the dimension of the damaged areas usually exceeds the size of support ζ used in traditional filtering solutions, image quality is enhanced using image inpainting (Figure 20c, f) rather than filtering. Apart from processing still television images, motion video enhancement is often required in restoring archived films and videos (Kokaram et al., 1995). Motion video can be viewed as a 3D image signal or a time sequence of two-dimensional image frames (Arce, 1991; Lukac and Marchevsky, 2001b).
244
LUKAC AND PLATANIOTIS
Such a visual input exhibits significant spatial and temporal correlation. Since temporal restoration of motion video without spatial processing results in blurring of the structural information in the reconstructed video, and ignoring the temporal correlation in processing each one of the video frames as still images produces strong motion artifacts, the development of spatiotemporal restoration techniques is of paramount importance (Lukac et al., 2004e). Since the spatial position and intensity of visual impairments, such as missing data patches caused by macroblocks dropped during transmission over the best effort type of networks, or speckle noise, random data patches, and sparkles caused by the presence of dirt, dust, and scratches in the original medium, vary significantly in the corresponding digital motion video, impairments are localized as temporal discontinuity (Kokaram et al., 1995). Through the employed motion compensation algorithms, this discontinuity is viewed as a spatial area in the actual frame that cannot be matched to a similar area in reference frames. After localizing the artifacts at the target frame, image inpainting implemented either as a spatial or a spatiotemporal solution can be used to fill in the missing information.
V. E DGE D ETECTION Edges convey essential information about a scene. For gray-scale images, edges are commonly defined as physical, photometric, and geometric discontinuities of the image function, as the edges can be defined as the boundaries of distinct image regions that differ in intensity (Gonzalez and Woods, 1992; Plataniotis and Venetsanopoulos, 2000). In the case of color images, represented in the 3D color space, the edges may be defined as discontinuities in the vector space representing the color image. In this way the edges split image regions of different color or intensity. Determination of object boundaries is important in many areas such as visual communication, medical imaging, dactyloscopy, quality control, photogrammetry, and intelligent robotic systems. Thus, edge detection—a process of transforming an input digital image into an edge map that can be viewed as a line drawing image with a spatial resolution identical to that of the input—is a common component in image processing systems (Lukac et al., 2005a). It has been widely observed that color images carry much more information than monochrome images (Plataniotis and Venetsanopoulos, 2000). For example, the monochrome images may not contain enough information in cases of image scenes, in which two close objects are of quite different color but of the same brightness and are merged in the gray-scale imaging. In this case, the utilization of the full information contained in color channels of the
FIGURE 21. Scalar edge detection based on color to gray-scale image conversion: (a) color to gray-scale image conversion, (b) scalar edge detection.
input image enables better detection of color edges compared to the use of monochrome (gray-scale) edge techniques, which operate separately on the individual color channels (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000). Moreover, multichannel images such as the color RGB images carry additional information contained in the various spectral channels. In this case, the boundary between two surfaces with different properties can be determined in more than one way (Lukac and Plataniotis, 2005a; Plataniotis and Venetsanopoulos, 2000). The development of an efficient edge detector, which properly detects the objects' edges, is a rather demanding task. The most popular edge operators generate the edge maps by processing information contained in a local image neighborhood as determined by an element of support (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000). These operators (1) do not use any prior information about the image structure, (2) are image content agnostic, and (3) are localized, in the sense that the detector output is solely determined by the features obtained through the element of support. Such color edge detectors can be classified as scalar and vector techniques.

A. Scalar Operators

Since color images are arrays of three-component color vectors, the use of the scalar edge detectors requires the conversion of the color image to its luminance-based (monochrome) equivalent (Lukac et al., 2003b; Lukac and Plataniotis, 2005a). Assuming the conventional RGB representation, the conversion of the color image to a luminance-based image can be performed via Eq. (1) or (2) with (p, q) denoting the spatial location in the image. Subsequently, a scalar edge detector is applied on the luminance image (Figure 21):

m_{(p,q)} = f\bigl(L_{(i,j)}\bigr), \quad \text{for } (i,j) \in \zeta \qquad (124)
where f (·) denotes the edge operator defined over the local luminance quantities L(i,j ) , with (i, j ) ∈ ζ denoting the area of support (e.g., a 3 × 3 filtering window Ψ(p,q) ). Alternatively, the edge map of the color image can be achieved using component-wise processing (Figure 22) (Lukac et al., 2003b; Plataniotis and
FIGURE 22. Component-wise edge-detection concept: (a) color channel decomposition, (b) scalar edge detection in the separated color channels, (c) combination of the separate edge maps to form the output edge map.
Venetsanopoulos, 2000). In this way, each of the three color channels is processed separately. The operator then combines the three distinct edge maps to form the output map as follows (Lukac and Plataniotis, 2005a):

m_{(p,q)} = \max\bigl(m_{(p,q)1}, m_{(p,q)2}, m_{(p,q)3}\bigr) \qquad (125)

where

m_{(p,q)k} = f\bigl(x_{(i,j)k}\bigr), \quad \text{for } (i,j) \in \zeta \qquad (126)
denotes the edge map obtained by applying the scalar edge detector on the R (k = 1), G (k = 2), and B (k = 3) channels of the color image, respectively. The output edge description corresponds to the dominant indicator of edge activity observed in the different color bands. The edge operator's output [Eq. (124) or (125)] is compared with a predefined threshold to obtain the edge map. In other words, the purpose of the thresholding operation is to determine whether a given pixel belongs to an edge or not. The resulting edge map E(p, q) with pixels E_{(p,q)} is determined as follows (Lukac et al., 2003b; Lukac and Plataniotis, 2005a):

E_{(p,q)} = \begin{cases} m_{(p,q)}, & \text{if } m_{(p,q)} \ge \xi \\ 0, & \text{otherwise} \end{cases} \qquad (127)

where ξ is a nonnegative threshold value. Edge operators are known to be sensitive to noise and to small variations in intensity, phenomena often encountered in localized image processing areas (Lukac and Plataniotis, 2005a), so the edge map usually contains noise. With an appropriate setting of ξ, the thresholding operation [Eq. (127)] can extract the structural information that corresponds to the edge discontinuities (Figure 23). Note that an excessively large value of ξ excludes edge information, whereas a too small value of ξ usually produces edge maps that include noise-like pixels and redundant details.
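As an illustration of the component-wise scheme of Eqs. (125)-(127), the following sketch (added here; not part of the original text) applies a simple gradient-magnitude operator to each RGB channel, fuses the channel responses with the maximum of Eq. (125), and thresholds the result as in Eq. (127); the choice of operator and the value of xi are placeholder assumptions.

```python
import numpy as np

def channel_edge_strength(channel):
    # placeholder scalar edge operator f(.): gradient magnitude of one channel
    gy, gx = np.gradient(channel.astype(np.float64))
    return np.hypot(gx, gy)

def componentwise_edge_map(rgb, xi=30.0):
    """Component-wise edge detection with thresholding, Eqs. (125)-(127).

    rgb : (H, W, 3) array; xi : nonnegative threshold value.
    """
    # one scalar edge map per color channel, Eq. (126)
    maps = [channel_edge_strength(rgb[..., k]) for k in range(3)]
    # dominant edge response across the channels, Eq. (125)
    m = np.max(np.stack(maps, axis=0), axis=0)
    # thresholding, Eq. (127): keep responses above xi, zero elsewhere
    return np.where(m >= xi, m, 0.0)
```

The same skeleton covers the luminance-based scheme of Eq. (124) by applying channel_edge_strength to the gray-scale image alone.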
FIGURE 23. Scalar edge detection: (a) 512 × 512 color image butterfly and (b)-(d) the corresponding edge maps obtained by (b) Sobel detector, (c) Canny detector, (d) LwG detector.
In practice, most popular edge detectors are approximated through the use of convolution masks. Assuming a 3 × 3 supporting window Ψ(p,q), where each spatial location (i, j), for (i, j) ∈ ζ, is associated with the mask coefficient w(i,j), the edge map's pixel m(p,q) is obtained as follows (Gonzalez and Woods, 1992):

m_{(p,q)} = w * U(p,q) = \sum_{(i,j) \in \zeta} w_{(i,j)} u_{(i,j)} \qquad (128)
with ζ denoting the spatial locations within the area of support. The quantities u(i,j) ∈ U(p, q) and w(i,j) ∈ w denote the image inputs and the mask coefficients, respectively. The set U(p, q) denotes the signal values used as the input for an edge operator, that is, U(p, q) = {L(i,j); (i, j) ∈ ζ} for Eq. (124) and U(p, q) = {x(i,j)k; (i, j) ∈ ζ} for Eq. (125). The set w = {w(i,j); (i, j) ∈ ζ} forms the so-called set of mask coefficients. Both component-wise and dimensionality reduction-based edge detection operators can be further grouped into two main classes of operators (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000):
• gradient methods, which use the first-order directional derivatives of the image to determine the edge contrast used in edge map formation, and
• zero-crossing-based methods, which use the second-order directional derivatives to identify locations with zero crossings.

1. Gradient Operators

The gradient methods use the so-called gradient (Gonzalez and Woods, 1992):

\nabla U(p,q) = \left( \frac{\partial U(p,q)}{\partial p}, \; \frac{\partial U(p,q)}{\partial q} \right) \qquad (129)

of the function U(p, q). Based on the definition of an edge as an abrupt change in the image intensity, that is, of the luminance values in Eq. (124) or the image channel intensity values in Eq. (125), a derivative operator was proposed for the detection of intensity discontinuities. The first derivative provides information on the rate of change of the image intensity. Using this information, it is possible to localize points where large changes of intensity occur. Of particular interest is the gradient magnitude

\|\nabla U(p,q)\| = \sqrt{ \left( \frac{\partial U(p,q)}{\partial p} \right)^{2} + \left( \frac{\partial U(p,q)}{\partial q} \right)^{2} } \qquad (130)

denoting the rate of change of the image intensity, and the gradient direction

\theta = \arctan\!\left( \frac{\partial U(p,q)}{\partial q} \Big/ \frac{\partial U(p,q)}{\partial p} \right) \qquad (131)
denoting the orientation of an edge. The mask w of a particular operator in Eq. (128) should be considered a digital approximation of the gradient in a given direction. Typically, two masks are defined, enabling the determination of the gradient magnitude in two orthogonal directions. For the most commonly used scalar gradient operators, such as the Prewitt, Sobel, isotropic, and Canny operators, the convolution masks are defined as follows (Gonzalez and Woods, 1992; Lukac et al., 2003b), with a short illustrative sketch given after the definitions:
• Prewitt operator:

w = \begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix}, \; \begin{pmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix} \qquad (132)

or

w = \begin{pmatrix} 0 & 1 & 1 \\ -1 & 0 & 1 \\ -1 & -1 & 0 \end{pmatrix}, \; \begin{pmatrix} -1 & -1 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix} \qquad (133)

• Sobel operator:

w = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \; \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix} \qquad (134)

or

w = \begin{pmatrix} 0 & 1 & 2 \\ -1 & 0 & 1 \\ -2 & -1 & 0 \end{pmatrix}, \; \begin{pmatrix} -2 & -1 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 2 \end{pmatrix} \qquad (135)

• Canny operator:

w = \begin{pmatrix} 0 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}, \; \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix} \qquad (136)
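The following sketch (an illustrative addition) applies the Sobel pair of Eq. (134) through the weighted sum of Eq. (128) and combines the two responses into the gradient magnitude of Eq. (130); the edge padding and the same-size output are implementation choices not prescribed by the text.

```python
import numpy as np

SOBEL_P = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)   # Eq. (134), first mask
SOBEL_Q = SOBEL_P.T                                  # Eq. (134), second mask

def apply_mask3x3(image, mask):
    """Weighted sum over each 3x3 supporting window, as in Eq. (128)."""
    h, w = image.shape
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(3):
        for j in range(3):
            out += mask[i, j] * padded[i:i + h, j:j + w]
    return out

def sobel_edge_strength(lum):
    gp = apply_mask3x3(lum, SOBEL_P)   # derivative along one direction
    gq = apply_mask3x3(lum, SOBEL_Q)   # derivative along the orthogonal direction
    return np.hypot(gp, gq)            # gradient magnitude, Eq. (130)
```

The resulting magnitude map would then be thresholded as in Eq. (127) to produce the final edge map.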
It should be mentioned that the isotropic operator can be obtained by using the value √2 instead of 2 in Eqs. (134)-(135). Other members of the family of gradient edge detectors are the Roberts operator and the Kirsch compass operator (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000). Among gradient operators, the Canny operator (Canny, 1986) is considered by many the most advanced gradient-based operator. It does not rely solely on intensity variations: it limits the effect of noise through Gaussian presmoothing, improving the quality of the edge maps, and it improves the appearance of the edge maps through hysteresis-based thinning of the thresholded edge maps.

2. Zero-Crossing-Based Operators

It is well known that when the first derivative achieves a maximum, the second derivative is zero (Lukac et al., 2003b; Ziou and Tabbone, 1998). Therefore, operators may localize edges by evaluating the zeros of the second derivatives of U(p, q). The most commonly used operator of this type is the Laplacian operator

\Delta U(p,q) = \nabla^{2} U(p,q) = \frac{\partial^{2} U(p,q)}{\partial p^{2}} + \frac{\partial^{2} U(p,q)}{\partial q^{2}} \qquad (137)
FIGURE 24. Vector edge detection.
which is approximated in practice using the convolution masks

w = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad w = \begin{pmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{pmatrix} \qquad (138)

defined for a four- and an eight-neighborhood, respectively. Other zero-crossing-based methods are the so-called LwG and LoG edge detectors, which combine Laplacian and Gaussian operators, and the so-called DoG operator defined through the difference of Gaussians (Gomes and Velho, 1997; Lukac et al., 2003b). Neither gradient nor zero-crossing scalar edge operators use the full potential of the spectral image content and, thus, they can miss edges in multichannel images.

B. Vector Operators

Psychological research on the characteristics of the human visual system reveals that color plays a significant role in the perception of edges or boundaries between two surfaces. Since the ability to distinguish between different objects is crucial for applications such as object recognition, image segmentation, image coding, and robot vision, the additional boundary information provided by color is of paramount importance (Plataniotis and Venetsanopoulos, 2000; Scharcanski and Venetsanopoulos, 1997). Given the major performance requirements in color edge detection, namely the ability to extract edges accurately, robustness to noise, and computational efficiency, the most popular color edge detectors are vector edge detectors (Figure 24) based on vector order statistics (Lukac et al., 2003b, 2005a; Plataniotis and Venetsanopoulos, 2000). Edge detectors based on order statistics operate by detecting local minima and maxima in the color image function and combining them in an appropriate way to produce the corresponding edge map (Figure 25). Since there is no unique way to define ranks for multichannel signals, the reduced ordering scheme in Eq. (21) is commonly used to obtain the ranked sequence of the color vectors inside the processing window. Based on these two extreme vector order statistics x(1) and x(N), the vector range (VR) detector is defined as follows (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000):
FIGURE 25. Vector edge detection using the image shown in Figure 23a: (a) VR detector, (b) MVD detector, (c) NNVR detector, (d) NNMVD detector.
m_{(p,q)} = \bigl\| x_{(|\zeta|)} - x_{(1)} \bigr\| \qquad (139)
where (p, q) corresponds to the center spatial location of Ψ(p,q) . The output of Eq. (139) quantitatively expresses the deviation of the vector outlier in the highest rank from the most representative vector in the lowest rank within Ψ(p,q) . It is not difficult to see that in a uniform area, where all vectors x(i,j ) , for (i, j ) ∈ ζ , are characterized by a similar magnitude Mx(i,j ) and/or the direction Ox(i,j ) , the output of Eq. (139) will be small. However, this is not the case in high-frequency regions, where x(N ) is usually located at one side of an
edge, whereas x(1) is included in the set of vectors occupying spatial positions on the other side of the edge. Thus, the response of Eq. (139) is a large value. Due to the utilization of the distance between the lowest and uppermost ranked vector, the VR operator is rather sensitive to noise. More robust color edge detectors are obtained using linear combinations of the lowest ranked vector samples. This is mainly due to the fact that the lowest ranks are associated with the most similar vectors in the population of the color vectors, and upper ranks usually correspond to the outlying samples. The so-called vector dispersion edge detector (VDED) is obtained as follows (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000):
m_{(p,q)} = \left\| \sum_{r=1}^{|\zeta|} w_r \, x_{(r)} \right\| \qquad (140)
where ‖·‖ denotes the magnitude of the vector operand and wr is the weight coefficient associated with the ranked vector x(r). Different coefficients wr in the linear combinations result in a multitude of edge detectors that vary significantly in terms of performance and/or complexity. For example, the VR operator [Eq. (139)] is a special case of the VDED operator, obtained using w1 = −1, w|ζ| = 1, and wr = 0 for r = 2, 3, . . . , |ζ| − 1. To design robust edge detectors, the operators should utilize a linear combination of the lowest ranked vectors x(r), for r = 1, 2, . . . , c and c < |ζ|. This is mainly because (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000) (1) the lowest ranks are associated with the most similar vectors in the population of the vectorial inputs, whereas upper ranks usually correspond to the outlying samples, and (2) the lowest ranked vector is commonly used to attenuate noise in vectorial data sets. Therefore, employing the set of the c lowest ranked vectors x(1), x(2), . . . , x(c), for c < |ζ|, makes the output of Eq. (140) immune to noise. The minimum over the magnitudes of these linear combinations defines the output of the so-called minimum vector dispersion (MVD) operator (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000):
m_{(p,q)} = \min_{j}\left\{ \left\| x_{(|\zeta|-j+1)} - \frac{1}{c}\sum_{r=1}^{c} x_{(r)} \right\| \right\}, \quad j = 1, 2, \ldots, b, \quad b, c < |\zeta| \qquad (141)
where the parameters b and c control the trade-off between complexity and noise attenuation. Such an operator exhibits significant robustness against image noise. Since the response of the MVD operator is much larger at true edges, highly precise edge maps can be obtained through subsequent thresholding [Eq. (127)].
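To illustrate the order-statistic machinery behind Eqs. (139) and (141), the sketch below (an added illustration, not the authors' implementation) ranks the color vectors of a single window by their aggregated Euclidean distances (reduced ordering) and evaluates the VR and MVD responses; the values of b and c are arbitrary examples.

```python
import numpy as np

def ranked_vectors(window):
    """Reduced ordering of the color vectors in one processing window.

    window : (N, 3) array of RGB vectors. Returns the vectors sorted by
    their aggregated Euclidean distance to all other vectors in the window.
    """
    d = np.linalg.norm(window[:, None, :] - window[None, :, :], axis=2)
    order = np.argsort(d.sum(axis=1))
    return window[order].astype(np.float64)

def vector_range(window):
    """VR detector, Eq. (139): distance between the extreme ranked vectors."""
    x = ranked_vectors(window)
    return np.linalg.norm(x[-1] - x[0])

def minimum_vector_dispersion(window, b=3, c=4):
    """MVD detector, Eq. (141): minimum distance between one of the b highest
    ranked vectors and the mean of the c lowest ranked vectors."""
    x = ranked_vectors(window)
    low_mean = x[:c].mean(axis=0)
    return min(np.linalg.norm(x[-(j + 1)] - low_mean) for j in range(b))
```

For a complete edge map, each pixel's 3 × 3 neighborhood would be reshaped into such an (N, 3) window and the response thresholded as in Eq. (127).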
An alternative design of the vector edge operators utilizes the adaptive nearest-neighbor filter, whose coefficients are chosen to adapt to local image characteristics. Instead of constants, the coefficients are determined by an adaptive weight function for each window Ψ(p,q). The so-called nearest-neighbor VR (NNVR) operator is defined as the distance between the outlier and the weighted sum of all the ranked vectors (Lukac et al., 2003b; Plataniotis and Venetsanopoulos, 2000):
m_{(p,q)} = \left\| x_{(|\zeta|)} - \sum_{r=1}^{|\zeta|} w_r \, x_{(r)} \right\| \qquad (142)
The weight function wr is determined adaptively using transformations of a distance criterion at each image location and is not uniquely defined. Similar to the adaptive design in Eq. (73), each weight coefficient is positive (wr ≥ 0) and the weight function is normalized (\sum_{r=1}^{|\zeta|} w_r = 1). The MVD concept can also be incorporated into the NNVR operator to further improve its performance in the presence of impulse noise. The resulting NNMVD operator is defined as follows (Lukac et al., 2005a; Plataniotis and Venetsanopoulos, 2000):
m_{(p,q)} = \min_{j}\left\{ \left\| x_{(|\zeta|-j+1)} - \sum_{r=1}^{|\zeta|} w_r \, x_{(r)} \right\| \right\}, \quad j = 1, 2, \ldots, b, \quad b < |\zeta| \qquad (143)

where wr denotes the normalized weighting coefficient. A possible weight function to be used in Eqs. (142)-(143) can be defined as follows:

w_r = \frac{ D_{(|\zeta|)} - D_{(r)} }{ |\zeta| \cdot D_{(|\zeta|)} - \sum_{b=1}^{|\zeta|} D_{(b)} } \qquad (144)
where D(r) is the aggregated distance associated with the vector x(r) ∈ Ψ(p,q). In a highly uniform area (no edge), all pixels have the same aggregated distance, the denominator of Eq. (144) becomes zero, and the weight function cannot be used; in this case, the NNVR and NNMVD outputs should be set to zero.

C. Evaluation Criteria

The performance of the edge detectors, in terms of accuracy in edge detection and robustness to noise, is usually evaluated using both quantitative and qualitative measures (Avcibas et al., 2002; Lukac et al., 2003b). The quantitative performance measures can be grouped into two types: (1) probabilistic measures, which are based on the statistics of correct edge detection and false edge rejection, and (2) distance measures, which are based on edge deviation from the true edges.
The probabilistic measures can be adopted to evaluate the accuracy of edge detection by measuring the percentage of correctly and falsely detected edges. Since a predefined edge map (ground truth) is needed, synthetic images are preferred for this experiment. The distance measures can be adopted to evaluate the noise performance by measuring the deviation of the edges caused by noise from the true edges. Since numerical measures, such as various percentage criteria, are not sufficient to model the complexity of the human visual system, and evaluation based on synthetic images has limited value, qualitative evaluation using subjective (visual) tests is often used. Such an approach allows for the utilization of real RGB color images in the evaluation process.

1. Objective Evaluation Approach

The use of the percentage criteria requires the presence of a reference (ground-truth) edge map with known locations of edges in the ideal test image. Then the so-called coefficient of found edges C_F and the coefficient of lost edges C_L are determined as follows:

C_F = \frac{|\xi_F|}{|\chi|}\,100\%, \quad \xi_F = \{ m_{(p,q)} : m_{(p,q)} \in \chi \wedge m_{(p,q)} \in \xi_A \} \qquad (145)

C_L = \frac{|\xi_L|}{|\chi|}\,100\% = 100\% - C_F, \quad \xi_L = \{ m_{(p,q)} : m_{(p,q)} \in \chi \wedge m_{(p,q)} \notin \xi_A \} \qquad (146)
where ξ_F denotes the set of edge pixels in the output edge map that coincide with the edges derived from the artificial image, ξ_L denotes the lost edge pixels, χ is the set of pixels of the ideal edges, and ξ_A denotes the set of edge pixels found by the tested edge detector. The performance of the edge operators can be further evaluated using the quotient of pixels falsely detected as edges and the total number of pixels comprising the found edges (Lukac et al., 2003b):

C_{FD} = \frac{|T_F|}{|\xi_A|}\,100\%, \quad T_F = \{ m_{(p,q)} : m_{(p,q)} \notin \chi \wedge m_{(p,q)} \in \xi_A \} \qquad (147)
where T_F denotes the set of false edge pixels. The so-called fault ratio (Plataniotis and Venetsanopoulos, 2000) is calculated as the quotient of pixels wrongly classified as edges to the correctly identified edge pixels:

C_{FR} = \frac{|T_F|}{|T_H|}, \quad T_F = \{ m_{(p,q)} : m_{(p,q)} \notin \chi \wedge m_{(p,q)} \in \xi_A \}, \quad T_H = \{ m_{(p,q)} : m_{(p,q)} \in \chi \wedge m_{(p,q)} \in \xi_A \} \qquad (148)

where T_F and T_H denote the false and real edge pixels, respectively.
Finally, the measure in Avcibas et al. (2002) is based on knowledge of the ideal reference edge map, where the reference edges should preferably have a width of one pixel. The measure considers both the accuracy of the edge location and false (or missing) edge elements as follows:

C_R = \frac{1}{\max\{\xi_A, \chi\}} \sum_{i=1}^{\xi_A} \frac{1}{1 + \alpha d_i^{2}} \qquad (149)

where ξ_A and χ are the numbers of detected and ground-truth edge points, respectively, and d_i denotes the distance to the closest real edge pixel corresponding to the ith detected edge pixel. Parameter α is a scale factor (e.g., 1/9 for the Pratt edge operator), which provides the relative weighting between smeared edges and thin but offset (shifted, dislocated) edges (Lukac et al., 2003b).
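A minimal sketch of this measure (an illustrative addition) is given below; it assumes binary ground-truth and detected edge maps, uses a brute-force search for the nearest reference edge pixel, and sets alpha = 1/9 as for the Pratt operator.

```python
import numpy as np

def edge_quality(detected, reference, alpha=1.0 / 9.0):
    """Quality measure of Eq. (149) for binary edge maps.

    detected, reference : boolean arrays of identical shape.
    Each detected edge pixel contributes 1 / (1 + alpha * d^2), where d is
    its distance to the closest reference edge pixel.
    """
    det = np.argwhere(detected)
    ref = np.argwhere(reference)
    if len(det) == 0 or len(ref) == 0:
        return 0.0
    score = 0.0
    for p in det:
        d = np.min(np.linalg.norm(ref - p, axis=1))
        score += 1.0 / (1.0 + alpha * d * d)
    return score / max(len(det), len(ref))
```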
Unlike the percentage measures, distance measures do not require the reference image, and they enable the analysis of edge maps acquired from real scenes. Since smoothing filters usually introduce an error by processing the edge pixels, criteria such as the SNR and PSNR can be used to measure the difference between the image edge maps produced before and after image smoothing. Alternatively, the reference edge map required in Eqs. (145)-(148) can be obtained as the absolute difference between the input image and the image produced by a robust smoothing filter.

2. Subjective Evaluation Approach

The subjective evaluation allows for further investigation of the characteristics of the obtained edge maps through the involvement of human factors (Plataniotis and Venetsanopoulos, 2000). The edge operators are usually rated in terms of several criteria, such as (1) ease of organizing objects, (2) continuity of edges, (3) thinness of edges, and (4) performance in suppressing noise. Visual inspection of the edge maps shown in Figures 23 and 25 reveals that the performance of the scalar and vector edge operators is very similar for noiseless images. The more sophisticated MVD and NNMVD operators produce thinner edges and, due to the employed averaging operation, they are less sensitive to small texture variations. If the edge operators are used to localize the edges in noisy images such as the cDNA microarray images (Figure 26),² then, due to the utilization of the robust order statistic concept, the vector edge operators usually outperform the scalar edge detectors in terms of both the accuracy of edge localization and the robustness against noise.

² Complementary deoxyribonucleic acid (cDNA) microarray imaging is considered one of the most important and powerful technologies used to extract and interpret genomic information (Lukac et al., 2004d; Lukac and Plataniotis, 2005b). The image formation process produces two monochromatic images that are further registered into a two-channel, red-green image that contains thousands of spots carrying the genetic information. The generated cDNA microarray image is a multichannel vector signal that can be represented, for storing or visualization purposes, as the RGB color image with a zero blue component (Lukac et al., 2004d, 2005a).
FIGURE 26. Edge detection-based cDNA microarray spot localization: (a) 200 × 200 cDNA microarray image, and (b)-(d) the corresponding edge maps obtained by (b) Sobel detector, (c) MVD detector, (d) NNMVD edge detector.
Since cDNA microarray image formation is affected by a number of impairments that can be attributed to (Lukac et al., 2005b; Lukac and Plataniotis, 2005b) (1) variations in the image background, (2) variations in the spot sizes and positions, (3) artifacts caused by laser light reflection and dust on the glass slide, and (4) photon and electronic noise
introduced during scanning; microarray spot localization necessitates the use of robust vector operators that are able to follow the spectral correlation that exists between the R and G channels of the vectorial cDNA microarray image (Lukac and Plataniotis, 2005a).
VI. CONCLUSION

This chapter provided a taxonomy of modern color image filtering and enhancement solutions. Since image signals are nonlinear in nature, due to the presence of edges and fine details, and are often processed by the highly nonlinear human visual system, nonlinear color image processing solutions were the main focus of the chapter. Moreover, given the vectorial nature of the color image, particular emphasis was given to nonlinear vector operators that constitute a rich and expanding class of tools for color image filtering, enhancement, and analysis. By utilizing robust order statistics calculated through a supporting processing window, nonlinear filters can preserve important structural elements, such as color edges, and eliminate degradations occurring during signal formation and transmission. As shown in this work, vector processing operators constitute a basis for noise detection and removal in color images. The same color vector tools can be used to inpaint missing color information, to enhance color input in acquired images and videos, to increase the spatial resolution of the visual data, and to localize color image edges and fine details. The utilization of spatial, structural, and spectral characteristics of the visual input is essential in modern imaging systems that attempt to mimic the human perception of the visual environment. Therefore, it is not difficult to see that color image filtering techniques have an extremely valuable position in modern color image science, communication, multimedia, and biomedical applications.
REFERENCES

Arce, G.R. (1991). Multistage order statistic filters for image sequence processing. IEEE Trans. Signal Process. 39 (5), 1146–1163. Arce, G.R. (1998). A general weighted median filter structure admitting negative weights. IEEE Trans. Signal Process. 46 (12), 3195–3205. Astola, J., Kuosmanen, P. (1997). Fundamentals of Nonlinear Digital Filtering. CRC Press, Boca Raton, FL. Astola, J., Haavisto, P., Neuvo, Y. (1990). Vector median filters. Proc. IEEE 78 (4), 678–689. Avcibas, I., Sankur, B., Sayood, K. (2002). Statistical evaluation of image quality measures. Journal of Electronic Imaging 11 (2), 206–223.
Barnett, V. (1976). The ordering of multivariate data. Journal of Royal Statistical Society A 139, 318–354. Barni, M., Cappellini, V., Mecocci, A. (1994). Fast vector median filter based on Euclidean norm approximation. IEEE Signal Processing Letters 1 (6), 92–94. Barni, M., Bartolini, F., Cappellini, V. (2000). Image processing for virtual restoration of artworks. IEEE Multimedia 7 (2), 34–37. Canny, J.F. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 679–698. Coyle, E.J., Lin, J.H., Gabbuoj, M. (1989). Optimal stack filtering and the estimation and structural approaches to image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing 37 (12), 2037–2066. Criminisi, A., Perez, P., Toyama, K. (2004). Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 13 (9), 1200–1212. Duda, R.O., Hart, P.E., Stork, D.G. (2000). Pattern Classification and Scene Analysis, 2nd ed. John Wiley, New York. Faugeras, O. (1979). Digital color image processing within the framework of a human visual model. IEEE Transactions on Acoustics, Speech, and Signal Processing 27 (4), 380–393. Fischer, M., Paredesm, J.L., Arce, G.R. (2002). Weighted median image sharpeners for the World Wide Web. IEEE Trans. Image Process. 11 (7), 717–727. Gabbouj, M., Cheickh, A. (1996). Vector median–vector directional hybrid filter for color image restoration. In: Proceedings of the European Signal Processing Conference EUSIPCO’96, pp. 879–881. Gabbouj, M., Coyle, E.J., Gallagher, N.C. (1992). An overview of median and stack filtering. Circuit Systems Signal Processing 11 (1), 7–45. Giakumis, I., Nikolaidis, N., Pitas, I. (2006). Digital image processing techniques for the detection and removal of cracks in digitized paintings. IEEE Transactions on Image Processing 15 (1), 178–188. Gomes, J., Velho, L. (1997). Image Processing for Computer Graphics. Springer-Verlag, Berlin. Gonzalez, R., Woods, R.E. (1992). Digital Image Processing. AddissonWesley, Reading, MA. Gunturk, B., Altunbasak, Y., Mersereau, R. (2002). Color plane interpolation using alternating projections. IEEE Trans. Image Process. 11 (9), 997– 1013. Hamid, M.S., Harvey, N.L., Marshall, S. (2003). Genetic algorithm optimization of multidimensional grayscale soft morphological filters with applications in film archive restoration. IEEE Trans. Circuits Systems Video Tech. 13 (5), 406–416.
Hardie, R.C., Arce, G.R. (1991). Ranking in R p and its use in multivariate image estimation. IEEE Trans. Circuits Systems Video Tech. 1 (2), 197– 208. Hardie, R.C., Boncelet, C.G. (1993). LUM filters: A class of rank-order-based filters for smoothing and sharpening. IEEE Trans. Signal Process. 41 (3), 1061–1076. Hardie, R.C., Boncelet, C.G. (1995). Gradient-based edge detection using nonlinear edge enhancing prefilters. IEEE Trans. Image Process. 4 (11), 1572–1578. Henkel, W., Kessler, T., Chung, H.Y. (1995). Coded 64-CAP ADSL in an impulse-noise environment—modeling of impulse noise and first simulation results. IEEE J. Selected Areas Communications 13 (9), 1611–1621. Herodotou, N., Venetsanopoulos, A.N. (1995). Colour image interpolation for high resolution acquisitions and display devices. IEEE Trans. Consumer Electron. 41 (4), 1118–1126. Holst, G.C. (1998). CCD Arrays, Cameras, and Displays, 2nd ed. JCD Publishing and SPIE Optical Engineering Press. Hore, E.S., Qiu, B., Wu, H.R. (2003). Improved vector filtering for color images using fuzzy noise detection. Opt. Eng. 42 (6), 1656–1664. Karakos, D.G., Trahanias, P.E. (1997). Generalized multichannel imagefiltering structure. IEEE Trans. Image Process. 6 (7), 1038–1045. Kayargadde, V., Martens, J.B. (1996). An objective measure for perceived noise. Signal Process. 49 (3), 187–206. Khriji, L., Gabbouj, M. (1999). Vector median-rational hybrid filters for multichannel image processing. IEEE Signal Process. Lett. 6 (7), 186–190. Khriji, L., Gabbouj, M. (2002). Adaptive fuzzy order statistics-rational hybrid filters for color image processing. Fuzzy Sets and Systems 128 (1), 35–46. Kokaram, A.C., Morros, R.D., Fitzerald, W.J., Rayner, P.J.V. (1995). Detection of missing data in image sequences. IEEE Trans. Image Process. 4 (11), 1496–1508. Konstantinides, K., Bhaskaran, V., Beretta, G. (1999). Image sharpening in the JPEG domain. IEEE Trans. Image Process. 8 (6), 874–878. Kotropoulos, C., Pitas, I. (2001). Nonlinear Model-Based Image/Video Processing and Analysis. John Wiley, New York. Lee, Y.H., Fam, A.T. (1987). An edge gradient enhancing adaptive order statistic filters. IEEE Trans. Acoust. 35 (5), 680–695. Li, X., Lu, D., Pan, Y. (2000). Color restoration and image retrieval for Donhuang fresco preservation. IEEE Multimedia 7 (2), 38–42. Lucat, L., Siohan, P., Barba, D. (2002). Adaptive and global optimization methods for weighted vector median filters. Signal Processing: Image Communications 17 (7), 509–524.
Lukac, R. (2001). Vector LUM smoothers as impulse detector for color images. In: Proc. European Conference on Circuit Theory and Design ECCTD’01, vol. III, pp. 137–140. Lukac, R. (2002a). Color image filtering by vector directional order-statistics. Patt. Recognition Image Anal. 12 (3), 279–285. Lukac, R. (2002b). Optimised directional distance filter. Machine Graphics and Vision: Special Issue on Colour Image Processing and Its Applications 11 (2–3), 311–326. Lukac, R. (2003). Adaptive vector median filtering. Patt. Recognition Lett. 24 (12), 1889–1899. Lukac, R. (2004a). Adaptive color image filtering based on center-weighted vector directional filters. Multidimensional Systems and Signal Processing 15 (2), 169–196. Lukac, R. (2004b). Performance boundaries of optimal weighted median filters. Intern. J. Image Graphics 4 (2), 157–182. Lukac, R., Marchevsky, S. (2001a). Adaptive vector LUM smoother. In: Proc. 2001 IEEE International Conference on Image Processing ICIP’01, vol. 1, pp. 878–881. Lukac, R., Marchevsky, S. (2001b). LUM smoother with smooth control for noisy image sequences. EURASIP Journal of Applied Signal Processing 2001 (2), 110–120. Lukac, R., Plataniotis, K.N. (2005a). Vector edge operators for cDNA microarray spot localization. Image Vision Comput., Submitted for publication. Lukac, R., Plataniotis, K.N. (2005b). cDNA microarray image segmentation using root signals. Intern. J. Imaging Systems Tech., Submitted for publication. Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2003a). Weighted vector median optimization. In: Proc. 4th EURASIP Conference focused on Video/Image Processing and Multimedia Communications ECVIP-MC’03, vol. 1, pp. 227–232. Lukac, R., Plataniotis, K.N., Venetsanopoulos, A.N., Bieda, R., Smolka, B. (2003b). Color edge detection techniques. In: Signaltheorie und Signalverarbeitung, Akustik und Sprachakustik, Informationstechnik, vol. 29. W.E.B. Universität Verlag, Dresden, pp. 21–47. Lukac, R., Martin, K., Plataniotis, K.N. (2004a). Digital camera zooming based on unified CFA image processing steps. IEEE Trans. Consumer Electron. 50 (1), 15–24. Lukac, R., Smolka, B., Plataniotis, K.N., Venetsanopulos, A.N. (2004b). Selection weighted vector directional filters. Computer Vision and Image Understanding, Special Issue on Colour for Image Indexing and Retrieval 94 (1–3), 140–167.
Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2004c). Generalized selection weighted vector filters. EURASIP Journal on Applied Signal Processing: Special Issue on Nonlinear Signal and Image Processing 2004 (12), 1870–1885. Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2004d). A multichannel order-statistic technique for cDNA microarray image processing. IEEE Trans. Nanobioscience 3 (4), 272–285. Lukac, R., Fischer, V., Motyl, G., Drutarovsky, M. (2004e). Adaptive video filtering framework. Intern. J. Imaging Systems Tech. 14 (6), 223–237. Lukac, R., Smolka, B., Martin, K., Plataniotis, K.N., Venetsanopulos, A.N. (2005a). Vector filtering for color imaging. IEEE Signal Processing Magazine: Special Issue on Color Image Processing 22 (1), 74–86. Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2005b). cDNA microarray image processing using fuzzy vector filtering framework. Journal of Fuzzy Sets and Systems: Special Issue on Fuzzy Sets and Systems in Bioinformatics 152 (1), 17–35. Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2005c). A statistically-switched adaptive vector median filter. J. Intell. Robot. Syst. 42 (4), 361–391. Lukac, R., Plataniotis, K.N., Smolka, B., Venetsanopoulos, A.N. (2005d). Vector operators for color image zooming. In: Proc. IEEE International Symposium on Industrial Electronics ISIE’05, vol. 3, pp. 1273–1277. Lukac, R., Plataniotis, K.N., Hatzinakos, D. (2005e). Color image zooming on the Bayer pattern. IEEE Trans. Circuit Syst. Video Tech. 15 (11), 1475– 1492. Lukac, R., Plataniotis, K.N., Smolka, B. (2005f). Adaptive color image filter for application in virtual restoration of artworks. Image Vision Comput., in preparation. Lukac, R., Plataniotis, K.N., Venetsanopoulos, A.N. (2005g). Color image denoising using evolutionary computation. Intern. J. Imaging Syst. Tech. 15, Submitted for publication. Ma, Z., Wu, H.R. (2006). Partition based vector filtering technique for color suppression of noise in digital color images. IEEE Trans. Image Process., in preparation. Ma, Z., Wu, H.R., Qiu, B. (2005). A robust structure-adaptive hybrid vector filter for color image restoration. IEEE Trans. Image Process. 14 (12), 1990–2001. Mitra, S., Sicuranza, J. (2001). Nonlinear Image Processing. Academic Press, San Diego. Neuvo, Y., Ku, W. (1975). Analysis and digital realization of a pseudorandom Gaussian and impulsive noise source. IEEE Trans. Commun. 23, 849–858. Nikolaidis, N., Pitas, I. (1996). Multichannel L filters based on reduced ordering. IEEE Trans. Circuits Syst. Video Tech. 6 (5), 470–482.
Nikolaidis, N., Pitas, I. (1998). Nonlinear processing and analysis of angular signals. IEEE Trans. Signal Process. 46 (12), 3181–3194. Nosovsky, R.M. (1984). Choice, similarity and the context theory of classification. J. Exp. Psychol. Learn. Mem. Cog. 10 (1), 104–114. Park, J., Park, D.C., Marks, R.J., El-Sharkawi, M.A. (2005). Recovery of image blocks using the method of alternating projections. IEEE Trans. Image Process. 14 (4), 461–474. Peltonen, S., Gabbouj, M., Astola, J. (2001). Nonlinear filter design: Methodologies and challenges. In: Proc. International Symposium on Image and Signal Processing and Analysis ISPA’01, pp. 102–107. Pitas, I., Tsakalides, P. (1991). Multivariate ordering in color image filtering. IEEE Trans. Circuits Syst. Video Tech. 1 (3), 247–259. Pitas, I., Venetsanopoulos, A.N. (1990). Nonlinear Digital Filters, Principles and Applications. Kluwer Academic Publishers, Dordrecht. Pitas, I., Venetsanopoulos, A.N. (1992). Order statistics in digital image processing. Proc. IEEE 80 (12), 1892–1919. Plataniotis, K.N., Venetsanopoulos, A.N. (1998). Vector processing. In: Sangwine, S.J. (Ed.), Colour Image Processing. Chapman & Hall, London, U.K., pp. 188–209. Plataniotis, K.N., Venetsanopoulos, A.N. (2000). Color Image Processing and Applications. Springer-Verlag, Berlin. Plataniotis, K.N., Androutsos, D., Venetsanopoulos, A.N. (1996). Fuzzy adaptive filters for multichannel image processing. Signal Process. 55 (1), 93–106. Plataniotis, K.N., Androutsos, D., Vinayagamoorthy, S., Venetsanopoulos, A.N. (1997). Color image processing using adaptive multichannel filters. IEEE Trans. Image Process. 6 (7), 933–950. Plataniotis, K.N., Androutsos, D., Venetsanopoulos, A.N. (1998a). Adaptive multichannel filters for colour image processing. Signal Process. Image Commun. 11 (3), 171–177. Plataniotis, K.N., Androutsos, D., Venetsanopoulos, A.N. (1998b). Color image processing using adaptive vector directional filters. IEEE Trans. Circuits Syst. 45 (10), 1414–1419. Plataniotis, K.N., Androutsos, D., Venetsanopoulos, A.N. (1999). Adaptive fuzzy systems for multichannel signal processing. Proc. IEEE 87 (9), 1601–1622. Polesel, A., Ramponi, G., Mathews, V.J. (2000). Image enhancement via adaptive unsharp masking. IEEE Trans. Image Process. 9 (3), 505–510. Rane, S.D., Sapiro, G., Bertalmio, M. (2003). Structure and texture fillingin of missing image blocks in wireless transmission and compression applications. IEEE Trans. Image Process. 12 (3), 296–303.
Rantanen, H., Karlsson, M., Pohjala, P., Kalli, S. (1992). Color video signal processing with median filters. IEEE Trans. Consumer Electron. 38 (3), 157–161. Regazoni, C.S., Teschioni, A. (1997). A new approach to vector median filtering based on space filling curves. IEEE Trans. Image Process. 6 (7), 990–1001. Scharcanski, J., Venetsanopoulos, A.N. (1997). Edge detection of color images using directional operators. IEEE Trans. Circuits Syst. Video Tech. 7 (2), 397–401. Sharma, G., Trussell, H.J. (1997). Digital color imaging. IEEE Trans. Image Process. 6 (7), 901–932. Smolka, B. (2002). Adaptive modification of the vector median filter. Machine Graphics and Visions 11 (2–3), 327–350. Smolka, B., Chydzinski, A., Wojciechowski, K., Plataniotis, K.N., Venetsanopoulos, A.N. (2001). On the reduction of impulsive noise in multichannel image processing. Opt. Eng. 40 (6), 902–908. Smolka, B., Lukac, R., Plataniotis, K.N., Wojciechowski, K., Chydzinski, A. (2003). Fast adaptive similarity based impulsive noise reduction filter. Real-Time Imaging, Special Issue on Spectral Imaging 9 (4), 261–276. Smolka, B., Plataniotis, K.N., Venetsanopoulos, A.N. (2004). Nonlinear techniques for color image processing. In: Barner, K.E., Arce, G.R. (Eds.), Nonlinear Signal and Image Processing: Theory, Methods, and Applications. CRC Press, Boca Raton, FL, pp. 445–505. Stokes, M., Anderson, M., Chandrasekar, S., Motta, R. (1996). A standard default color space for the internet—sRGB. Technical Report, www.w3.org/Graphics/Color/sRGB.html. Sung, K.K. (1992). A Vector Signal Processing Approach to Color. M.S. Thesis, Massachusetts Institute of Technology. Szczepanski, M., Smolka, B., Plataniotis, K.N., Venetsanopoulos, A.N. (2003). On the geodesic paths approach to color image filtering. Signal Process. 83 (6), 1309–1342. Szczepanski, M., Smolka, B., Plataniotis, K.N., Venetsanopoulos, A.N. (2004). On the distance function approach to color image enhancement. Discrete Appl. Math. 139 (1–3), 283–305. Tang, K., Astola, J., Neuvo, Y. (1994). Multichannel edge enhancement in color image processing. IEEE Trans. Circuits Syst. Video Tech. 4 (5), 468– 479. Tang, K., Astola, J., Neuvo, Y. (1995). Nonlinear multivariate image filtering techniques. IEEE Trans. Image Process. 4 (6), 788–798. Tang, B., Sapiro, G., Caselles, V. (2001). Color image enhancement via chromaticity diffusion. IEEE Trans. Image Process. 10 (5), 701–707.
Trahanias, P.E., Venetsanopoulos, A.N. (1993). Vector directional filters: A new class of multichannel image processing filters. IEEE Trans. Image Process. 2 (4), 528–534. Trahanias, P.E., Karakos, D., Venetsanopoulos, A.N. (1996). Directional processing of color images: Theory and experimental results. IEEE Trans. Image Process. 5 (6), 868–881. Tsai, H.H., Yu, P.T. (2000). Genetic-based fuzzy hybrid multichannel filters for color image restoration. Fuzzy Sets and Systems 114 (2), 203–224. Viero, T., Oistamo, K., Neuvo, Y. (1994). Three-dimensional median related filters for color image sequence filtering. IEEE Trans. Circuits Syst. Video Tech. 4 (2), 129–142. Vrhel, M.J., Saber, E., Trussell, H.J. (2005). Color image generation and display technologies. IEEE Signal Process. Mag. 22 (1), 22–33. Wyszecki, G., Stiles, W.S. (1982). Color Science, Concepts and Methods, Quantitative Data and Formulas, 2nd ed. John Wiley, New York. Yang, R., Yin, L., Gabbouj, M., Astola, J., Neuvo, Y. (1995). Optimal weighted median filtering under structural constraints. IEEE Trans. Signal Process. 43 (3), 591–604. Yin, L., Neuvo, Y. (1994). Fast adaptation and performance characteristics of FIR-WOS hybrid filters. IEEE Trans. Signal Process. 41 (7), 1610–1628. Yin, L., Astola, J., Neuvo, Y. (1993). Adaptive stack filtering with application to image processing. IEEE Trans. Signal Process. 41 (1), 162–184. Yin, L., Yang, R., Gabbouj, M., Neuvo, Y. (1996). Weighted median filters: A tutorial. IEEE Trans. Circuits Syst. 43 (3), 157–192. Yu, P.T., Liao, W.H. (1994). Weighted order statistics filters—their classification, some properties, and conversion algorithm. IEEE Trans. Signal Process. 42 (10), 2678–2691. Zheng, J., Valavanis, K.P., Gauch, J.M. (1993). Noise removal from color images. J. Intell. Robot. Syst. 7 (3), 257–285. Ziou, D., Tabbone, S. (1998). Edge detection techniques: An overview. Patt. Recognition Image Anal. 8 (4), 537–559.
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 140
General Sweep Mathematical Morphology FRANK Y. SHIH Computer Vision Laboratory, College of Computing Sciences, New Jersey Institute of Technology, Newark, New Jersey 07102, USA
I. Introduction . . . 265
II. Theoretical Development of General Sweep Mathematical Morphology . . . 268
A. Computation of Traditional Morphology . . . 268
B. General Sweep Mathematical Morphology . . . 270
C. Properties of Sweep Morphological Operations . . . 273
III. Blending of Swept Surfaces with Deformations . . . 275
IV. Image Enhancement . . . 278
V. Edge Linking . . . 280
A. Edge Linking Using Sweep Morphology . . . 281
VI. Shortest Path Planning for Mobile Robot . . . 286
VII. Geometric Modeling and Sweep Mathematical Morphology . . . 288
A. Tolerance Expression . . . 289
B. Sweep Surface Modeling . . . 291
VIII. Formal Language and Sweep Morphology . . . 291
IX. Representation Scheme . . . 292
A. Two-Dimensional Attributes . . . 292
B. Three-Dimensional Attributes . . . 293
X. Grammars . . . 297
A. Two-Dimensional Attributes . . . 298
B. Three-Dimensional Attributes . . . 298
XI. Parsing Algorithm . . . 300
XII. Conclusions . . . 303
References . . . 303
Further Reading . . . 306
I. INTRODUCTION

The sweep operation to generate a new object by sweeping an object along a space curve trajectory provides a natural design tool in solid modeling. The simplest sweep is linear extrusion defined by a two-dimensional (2D) area swept along a linear path normal to the plane of the area to create a volume. Another simple sweep is rotational sweep defined by rotating a 2D object about an axis. Though simple, these two sweeps are often seen in real applications. Sweeps that generate area or volume changes in size, shape,
or orientation during the sweeping process, and follow an arbitrarily curved trajectory, are called general sweeps (Requicha, 1980). General sweeps of solids are useful in modeling the region swept out by a machine-tool cutting head or a robot following a path. General sweeps of 2D cross sections are known as generalized cylinders in computer vision, and are usually modeled as parameterized 2D cross sections swept at right angles along an arbitrary curve. Being the simplest of general sweeps, generalized cylinders are somewhat easy to compute. However, general sweeps of solids are difficult to compute since the trajectory and object shape may make the swept object self-intersect (Foley et al., 1995). Mathematical morphology involves the geometric analysis of shapes and textures in images. Appropriately used, mathematical morphological operations tend to simplify image data, preserving their essential shape characteristics and eliminating irrelevancies (Haralick et al., 1987; Serra, 1982; Shih and Mitchell, 1989, 1992). As object recognition, feature extraction, and defect detection correlate directly with shape, it becomes apparent that mathematical morphology is a natural processing approach for the machine vision recognition process and the visually guided robot problem. The mathematical morphological operations can be thought of as working with two images. Conceptually, the image being processed is referred to as the active image, and the other image, acting as a kernel, is referred to as the structuring element. Each structuring element has a designed shape, which can be thought of as a probe or filter of the active image. We can modify the active image by probing it with various structuring elements. The two fundamental mathematical morphological operations are dilation and erosion. Dilation combines two sets using vector addition of set elements. Dilation by disk structuring elements corresponds to the isotropic expansion algorithms popular in binary image processing. Dilation by a small square (3 × 3) is an eight-neighborhood operation that can be easily implemented by adjacently connected array architectures and is known by the names "fill," "expand," or "grow." Erosion is the morphological dual of dilation. It combines two sets using vector subtraction of set elements. Some equivalent terms for erosion are "shrink" and "reduce." The traditional morphological operations perform vector additions or subtractions by a translation of the structuring element to the object pixel. They are far from being capable of modeling the swept volumes of structuring elements moving with complex, simultaneous translation, scaling, and rotation in Euclidean space. In this chapter, we develop an approach that adopts sweep morphological operations to study the properties of swept volumes. We present the theoretical framework for representation, computation, and analysis of a new class of general sweep mathematical morphology and its practical applications.
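For reference, a minimal sketch of the two fundamental operations just described (dilation and erosion of a binary image) follows; it is an illustrative addition, not taken from the chapter, and the centered origin of the structuring element and the treatment of pixels outside the image as background are conventional assumptions.

```python
import numpy as np

def dilate(image, se):
    """Binary dilation A (+) B: union of copies of B placed at every object pixel."""
    h, w = image.shape
    sh, sw = se.shape
    oy, ox = sh // 2, sw // 2            # origin at the center of the SE
    out = np.zeros_like(image, dtype=bool)
    ys, xs = np.nonzero(image)
    for dy in range(sh):
        for dx in range(sw):
            if se[dy, dx]:
                yy, xx = ys + dy - oy, xs + dx - ox
                keep = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
                out[yy[keep], xx[keep]] = True
    return out

def erode(image, se):
    """Binary erosion A (-) B: keep a pixel only if B, translated there, fits in A."""
    h, w = image.shape
    sh, sw = se.shape
    oy, ox = sh // 2, sw // 2
    padded = np.pad(image.astype(bool), ((oy, sh - 1 - oy), (ox, sw - 1 - ox)),
                    constant_values=False)
    out = np.ones_like(image, dtype=bool)
    for dy in range(sh):
        for dx in range(sw):
            if se[dy, dx]:
                out &= padded[dy:dy + h, dx:dx + w]
    return out
```

With a 3 × 3 square of ones as the structuring element, dilate reproduces the eight-neighborhood "grow" operation and erode its dual "shrink."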
Geometric modeling is the foundation for CAD/CAM integration (Pennington et al., 1983). The goal of automated manufacturing inspection and robotic assembly is to generate a complete process automatically. The representation must not only possess the nominal geometric shapes, but also reason about the geometric inaccuracies (or tolerances) in the locations and shapes of solid objects. Boundary representation and constructive solid geometry (CSG) representation are popularly used as internal databases (Requicha and Voelcker, 1982; Rossignac, 2002) for geometric modeling. Boundary representation consists of two kinds of information: topological information and geometric information, including vertex coordinates, surface equations, and connectivity between faces, edges, and vertices. There are several advantages of boundary representation: large domain, unambiguity, uniqueness, and explicit representation of faces, edges, and vertices. There are also several disadvantages: a verbose data structure, difficulty in creation, difficulty in checking validity, and unavailability of variational information. The CSG representation works by constructing a complex part by hierarchically combining simple primitives using Boolean set operations (Mott-Smith and Baer, 1972). There are several advantages of using the CSG representation: large domain, unambiguity, easy validity checking, and ease of creation. There are also several disadvantages: nonuniqueness, difficulty in editing graphically, input data redundancy, and unavailability of variational information (Voelcker and Hunt, 1981). The framework we propose for geometric modeling and representation is sweep mathematical morphology. The sweep operation to generate a volume by sweeping a primitive object along a space curve trajectory provides a natural design tool. The simplest sweep is linear extrusion defined by a 2D area swept along a linear path normal to the plane of the area to create a volume (Chen et al., 1999). Another sweep is rotational sweep defined by rotating a 2D object about an axis. General sweep is useful in modeling the region swept out by a machine-tool cutting head or a robot following a path (Blackmore et al., 1994). General sweeps of 2D cross sections are known as generalized cylinders in computer vision and are usually modeled as parameterized 2D cross sections swept at right angles along an arbitrary curve. Being the simplest of general sweeps, generalized cylinders are somewhat easy to compute. However, general sweeps of solids are difficult to compute since the trajectory and object shape may make the swept object self-intersect (Foley et al., 1995). A generalized sweeping method for CSG modeling was developed by Shiroma et al. (1982, 1991) to generate swept volumes. It is shown that complex solid shapes can be generated with a blending surface to join two disconnected
solids, fillet volumes for rounding corners, and swept volumes formed by the movement of numeric control (NC) tools. Ragothama and Shapiro (1998) presented a B-Rep method for deformation in parametric solid modeling. This chapter is organized as follows. Section II presents the theoretical development of general sweep mathematical morphology along with its properties. Section III describes an application of sweep morphology, which represents the blending of swept surfaces with deformations. Section IV presents the usage of sweep morphology for image enhancement, Section V the edge linking, and Section VI the shortest path planning. Section VII describes modeling based on the sweep mathematical morphology. Section VIII describes the formal languages. Section IX proposes the representation scheme for 2D and three-dimensional (3D) objects. Section X introduces the adopted grammars. Section XI applies the parsing algorithm to determine whether a given object belongs to the language. The conclusions are drawn in Section XII.
II. THEORETICAL DEVELOPMENT OF GENERAL SWEEP MATHEMATICAL MORPHOLOGY

Traditional morphological dilation and erosion perform vector additions or subtractions by translating a structuring element along an object. These operations obviously have the limitation of orientation dependence and can represent only sweep motions that involve translation. By including not only translation but also rotation and scaling, the entire theoretical framework and its practical applications become extremely fruitful. Sweep morphological dilation and erosion describe a motion of a structuring element that sweeps along the boundary of an object or an arbitrary curve by geometric transformations. The rotation angles and scaling factors are defined with respect to the boundary or the curve.

A. Computation of Traditional Morphology

Because rotation and scaling are inherently defined on each pixel of the curve, the traditional morphological operations of an object by a structuring element need to be converted to the sweep morphological operations of a boundary by the structuring element. We assume throughout this chapter that the sets considered are connected and bounded.

Definition 1. A set S is said to be connected if each pair of points p, q ∈ S can be joined by a path that consists of pixels entirely located in S.

Definition 2. Given a set S, a boundary ∂S is defined as the set of points all of whose neighborhoods intersect both S and its complement S^c.
Definition 3. If a set S is connected and has no holes, it is called simply connected; if it is connected but has holes, it is called multiply connected. Definition 4. Given a set S, the outer boundary ∂+ S of the set is defined as the closed loop of points in S that contains every other closed loop consisting of points of the set S; the inner boundary ∂− S is defined as the closed loop of points in S that does not contain any other closed loop in S. Proposition 1. If a set S is simply connected, then ∂S is its boundary; if it is multiply connected, then ∂S = ∂+ S ∪ ∂− S. Definition 5. The positive filling of a set S is denoted as [S]+ and is defined as the set of all points that are inside the outer boundary of S; the negative filling is denoted as [S]− and is defined as the set of all points that are outside the inner boundary. Note that if S is simply connected, then [S]− is a universal set. Therefore, we determine that whether S is simply or multiply connected, S = [S]+ ∩ [S]− . Proposition 2. Let A and B be simply connected sets. The dilation of A by B equals the positive filling of ∂ A ⊕ B, that is, A ⊕ B = [∂ A ⊕ B]+ . The significance is that if A and B are simply connected sets, we can compute the dilation of the boundary ∂A by the set B. This leads to a substantial reduction of computation. Proposition 3. If A and B are simply connected sets, the dilation of A by B equals the positive filling of the dilation of their boundaries, that is, A ⊕ B = [∂ A ⊕ ∂B]+ . This proposition further reduces the computation required for the dilation. Namely, the dilation of sets A by B can be computed by the dilation of the boundary of A by the boundary of B. Proposition 4. If A is multiply connected and B is simply connected, A ⊕ B = [∂+ A ⊕ ∂B]+ ∩ [∂− A ⊕ ∂B]− . Since A and B possess the commutative property with respect to dilation, the following proposition can be easily obtained. Proposition 5. If A is simply connected and B is multiply connected, A ⊕ B = [∂ A ⊕ ∂+ B]+ ∩ [∂ A ⊕ ∂− B]− .
B. General Sweep Mathematical Morphology

The sweep morphology can be represented as a four-tuple Ψ(B, A, S, Θ), where B is a structuring element set, indicating a primitive object; A is either a curve path or a closed object whose boundary represents the sweep trajectory with a parameter t along which the structuring element B is swept; S(t) is a vector consisting of the scaling factors; and Θ(t) is a vector consisting of the rotation angles. Note that both scaling factors and rotation angles are defined with respect to the sweep trajectory.

Definition 6. If A is a simply connected object and ∂A denotes its boundary, the sweep morphological dilation of A by B in Euclidean space is denoted by A ⊞ B and is defined as

A \boxplus B = \bigl\{\, c \mid c = a + \hat{b} \ \text{for some } a \in A \ \text{and } \hat{b} \in S(t) \times \Theta(t) \times B \,\bigr\}.

This is equivalent to performing the operation on the boundary of A (i.e., ∂A) and taking the positive filling:

A \boxplus B = \Bigl[\, \bigcup_{0 \le t \le 1} \bigcup_{b \in B} \bigl\{ \partial A(t) + b \times S(t) \times \Theta(t) \bigr\} \,\Bigr]_{+}.

If A is a curve path, that is, ∂A = A, the sweep morphological dilation of A by B is defined as

A \boxplus B = \bigcup_{0 \le t \le 1} \bigcup_{b \in B} \bigl\{ A(t) + b \times S(t) \times \Theta(t) \bigr\}.
Note that if B does not involve rotations (or B is rotation-invariant like a circle) and scaling, then the sweep dilation is equivalent to the traditional morphological dilation.
Example 1. Figure 1a shows a curve and Figure 1b shows an elliptical structuring element. The rotation angle θ is defined as θ (t) = tan−1 (dy/dt)/(dx/dt) along the curve with parameter t in the range of [0, 1]. The traditional morphological dilation is shown in Figure 1c and the sweep dilation using the defined rotation is shown in Figure 1d. A geometric transformation of the structuring element specifies the new coordinates of each point as functions of the old coordinates. Note that the new coordinates are not necessarily integers after a transformation to a digital image is applied. To make the results of the transformation into a digital image, they must be resampled or interpolated. Since we are transforming a two-valued (black-and-white) image, the zero-order interpolation is adopted.
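The sketch below (an illustrative addition, not the author's implementation) rasterizes the sweep dilation of a sampled curve A(t) by an elliptical structuring element whose semimajor axis follows the local tangent direction, as in Example 1; the ellipse axes, the image size, and the half-pixel sampling of the structuring element are arbitrary choices, and the rounding to pixel coordinates plays the role of the zero-order interpolation mentioned above.

```python
import numpy as np

def sweep_dilate_curve(curve, a=6.0, b=2.0, shape=(128, 128)):
    """Sweep dilation of a sampled curve by an oriented elliptical SE.

    curve : (T, 2) array of (x, y) samples of A(t).
    a, b  : ellipse semi-axes; the semimajor axis is kept aligned with the
            local tangent direction theta(t) = atan2(dy/dt, dx/dt).
    """
    out = np.zeros(shape, dtype=bool)
    # rotation factor Theta(t): tangent angle along the curve
    d = np.gradient(curve.astype(np.float64), axis=0)
    theta = np.arctan2(d[:, 1], d[:, 0])
    # rasterize the canonical (unrotated) ellipse on a half-pixel grid
    gx, gy = np.meshgrid(np.arange(-a, a + 0.5, 0.5),
                         np.arange(-b, b + 0.5, 0.5))
    inside = (gx / a) ** 2 + (gy / b) ** 2 <= 1.0
    ell = np.stack([gx[inside], gy[inside]], axis=1)
    for (x, y), th in zip(curve, theta):
        c, s = np.cos(th), np.sin(th)
        rot = np.array([[c, -s], [s, c]])
        pts = ell @ rot.T + np.array([x, y])      # rotate, then translate
        cols = np.clip(np.rint(pts[:, 0]).astype(int), 0, shape[1] - 1)
        rows = np.clip(np.rint(pts[:, 1]).astype(int), 0, shape[0] - 1)
        out[rows, cols] = True                    # zero-order interpolation
    return out
```

A closed object A would be handled in the same way by sweeping along its boundary ∂A and taking the positive filling of the result.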
Figure 1. (a) An open curve path, (b) a structuring element, (c) result of a traditional morphological dilation, and (d) result of a sweep morphological dilation.
The sweep morphological erosion, unlike dilation, is defined only for a closed object, whose boundary represents the sweep trajectory.

Definition 7. Let ∂A be the boundary of an object A and B be a structuring element. The sweep morphological erosion of A by B in Euclidean space, denoted by A ⊟ B, is defined as

A ⊟ B = { c | c + b̂ ∈ A for every b̂ ∈ S(t) × Θ(t) × B }.
An example of a sweep erosion by an elliptical structuring element whose semimajor axis is tangent to the boundary is shown in Figure 2.

Figure 2. (a) Traditional erosion and (b) sweep erosion.

As in traditional morphology, the general sweep morphological opening can be defined as a general sweep erosion of A by B followed by a general sweep dilation, where A must be a closed object. The sweep morphological closing can be defined in the opposite sequence, that is, a general sweep dilation of
A by B followed by a general sweep erosion, where A can be either a closed object or a curve path. The propositions of traditional morphological operations can be extended to sweep morphological operations. Proposition 6. If the structuring element B is simply connected, the sweep dilation of A by B equals the positive filling of the sweep dilation by the boundary of B, that is, A ⊞ B = [A ⊞ ∂B]+ . Extending this proposition to multiply connected objects, we get the following three cases.
Case 6a. If A is multiply connected, that is, ∂A = ∂+A ∪ ∂−A, then A ⊞ B = [∂+A ⊞ B]+ ∩ [∂−A ⊞ B]−.

Case 6b. If B is multiply connected, that is, ∂B = ∂+B ∪ ∂−B, then A ⊞ B = [A ⊞ ∂+B]+ ∩ [A ⊞ ∂−B]−.

Case 6c. Finally, if both A and B are multiply connected, that is, ∂A = ∂+A ∪ ∂−A and ∂B = ∂+B ∪ ∂−B, then A ⊞ B = [(∂+A ∪ ∂−A) ⊞ ∂+B]+ ∩ [(∂+A ∪ ∂−A) ⊞ ∂−B]−.

This leads to a substantial reduction of computation. We can make an analogous development for the sweep erosion.

Proposition 7.
If A and B are simply connected sets, then A ⊟ B = A ⊟ ∂B.

With the aforementioned propositions, and by considering the boundary of the structuring element, we can further reduce the computation of sweep morphological operations.

C. Properties of Sweep Morphological Operations

Property 1 (Noncommutativity). Because of the rotational factor in the operation, commutativity does not hold; that is, A ⊞ B ≠ B ⊞ A.

Property 2 (Nonassociativity). Because the rotational and scaling factors depend on the characteristics of the boundary of the object, associativity does not hold; hence, A ⊞ (B ⊞ C) ≠ (A ⊞ B) ⊞ C. However, associativity between regular dilation and sweep dilation holds, that is, A ⊕ (B ⊞ C) = (A ⊕ B) ⊞ C. This is because the structuring element is rotated based on the boundary properties of B, and after A ⊕ B the boundary properties remain similar to those of B.

Property 3 (Translational invariance).

Ax ⊞ B = [ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + x + b × S(t) × θ(t) } ]+
       = [ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + b × S(t) × θ(t) + x } ]+
       = [ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + b × S(t) × θ(t) } ]+ + x
       = (A ⊞ B)x.
Sweep erosion can be derived similarly.

Property 4 (Increasing). The increasing property does not hold in general. If the boundary is smooth, that is, the derivative exists everywhere, then the increasing property holds.

Property 5 (Distributivity).

a. Distributivity over union of structuring elements. Dilation is distributive over a union of structuring elements; that is, the dilation of A with the union of two structuring elements B and C is the same as the union of the dilation of A with B and the dilation of A with C:

A ⊞ (B ∪ C) = [ ⋃_{0≤t≤1} ⋃_{b∈B∪C} { ∂A + b × S(t) × θ(t) } ]+
            = [ ⋃_{0≤t≤1} ( ⋃_{b∈B} { ∂A + b × S(t) × θ(t) } ∪ ⋃_{b∈C} { ∂A + b × S(t) × θ(t) } ) ]+
            = [ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + b × S(t) × θ(t) } ]+ ∪ [ ⋃_{0≤t≤1} ⋃_{b∈C} { ∂A + b × S(t) × θ(t) } ]+
            = (A ⊞ B) ∪ (A ⊞ C).
b. Dilation is not distributive over a union of sets. That is, the dilation of (A ∪ C) with a structuring element B is not the same as the union of the dilation of A with B and the dilation of C with B: (A ∪ C) ⊞ B ≠ (A ⊞ B) ∪ (C ⊞ B).
c. Erosion is antidistributive over a union of structuring elements. That is, the erosion of A with the union of two structuring elements B and C is the same as the intersection of the erosion of A with B and the erosion of A with C.

d. Distributivity over intersection.

∂A ⊞ (B ∩ C) = ⋃_{0≤t≤1} ⋃_{b∈B∩C} { ∂A + b × S(t) × θ(t) }

⇒ ∂A ⊞ (B ∩ C) ⊆ ⋃_{0≤t≤1} ⋃_{b∈B} { ∂A + b × S(t) × θ(t) }

and also

∂A ⊞ (B ∩ C) ⊆ ⋃_{0≤t≤1} ⋃_{b∈C} { ∂A + b × S(t) × θ(t) }.

Therefore,

∂A ⊞ (B ∩ C) ⊆ (∂A ⊞ B) ∩ (∂A ⊞ C),

which implies [∂A ⊞ (B ∩ C)]+ ⊆ [∂A ⊞ B]+ ∩ [∂A ⊞ C]+, that is,

A ⊞ (B ∩ C) ⊆ (A ⊞ B) ∩ (A ⊞ C).
III. Blending of Swept Surfaces with Deformations

By using general sweep mathematical morphology, a smooth sculptured surface can be described as the trajectory of a cross-section curve swept along a profile curve, where the trajectory of the cross-section curve is the structuring element B and the profile curve is the open or closed curve C. It is easy to describe the sculptured surface by specifying the 2D cross sections, and the resulting surface is aesthetically appealing. The designer can envision the surface as a blended trajectory of cross-section curves swept along a profile curve.

Let ∂B denote the boundary of a structuring element B. A swept surface Sw(∂B, C) is produced by moving ∂B along a given trajectory curve C. The plane of B must be perpendicular to C at any time instance. The contour curve is represented as a B-spline curve and ∂B is represented as the polygon net of the actual curve. This polygon net is swept along the trajectory to obtain the intermediate polygon nets, which are later interpolated by a B-spline
surface. The curve can be deformed by twisting or by scaling uniformly, or by applying the deformations to selected points of ∂B. The curve can also be deformed by varying the weights at each of the points. When a uniform variation is desired, it is applied to all the points; otherwise it is applied to some selected points. These deformations are applied to ∂B before it is moved along the trajectory C.

Let ∂B denote a planar polygon with n points, each point ∂Bi = (xi, yi, zi, hi), where i = 1, 2, . . . , n. Let C denote any 3D curve with m points, each point Cj = (xj, yj, zj), where j = 1, 2, . . . , m. The scaling factors, weight, and twisting factor for point j of C are denoted as sxj, syj, szj, wj, and θj, respectively. The deformation matrix is obtained as [Sd] = [Ssw][Rθ], where

          ⎡ sxj  0    0    0  ⎤
[Ssw]  =  ⎢ 0    syj  0    0  ⎥
          ⎢ 0    0    szj  0  ⎥
          ⎣ 0    0    0    wj ⎦

and

          ⎡ cos θj   sin θj  0  0 ⎤
[Rθ]   =  ⎢ −sin θj  cos θj  0  0 ⎥
          ⎢ 0        0       1  0 ⎥
          ⎣ 0        0       0  1 ⎦ .
The deformed ∂B must be rotated in 3D with respect to the tangent vector at each point of the trajectory curve C. To calculate the tangent vector, we add two points to C, C0 and Cm+1, where C0 = C1 and Cm+1 = Cm. The rotation matrix Rx about the x-axis is given by

         ⎡ 1  0        0       0 ⎤
[Rx]  =  ⎢ 0  cos αj   sin αj  0 ⎥
         ⎢ 0  −sin αj  cos αj  0 ⎥
         ⎣ 0  0        0       1 ⎦ ,

where

cos αj = (cy,j−1 − cy,j+1) / hx,   sin αj = (cz,j+1 − cz,j−1) / hx,

hx = √[ (cy,j−1 − cy,j+1)² + (cz,j+1 − cz,j−1)² ].
The rotation matrices about the y- and z-axes can similarly be derived. Finally, ∂B must be translated to each point of C and the translation matrix Cxyz is
given by

           ⎡ 1          0          0          0 ⎤
[Cxyz]  =  ⎢ 0          1          0          0 ⎥
           ⎢ 0          0          1          0 ⎥
           ⎣ Cxj − Cx1  Cyj − Cy1  Czj − Cz1  1 ⎦ .

Figure 3. Sweeping of a square along a trajectory with deformation to a circle.
The polygon net of the sweep surface will be obtained by [Bi,j] = [∂Bi][Sd][SwC], where [SwC] = [Rx][Ry][Rz][Cxyz]. The B-spline surface can be obtained from the polygon net by finding the B-spline curve at each point of C. To obtain the whole swept surface, the B-spline curves at each point of the trajectory C have to be calculated. This computation can be reduced by selecting a few polygon nets and calculating the B-spline surface.

Example 2. Sweeping of a circle along a trajectory with deformation to a square. Here the deformation is only the variation of the weights. The circle is represented as a rational B-spline curve. The polygon net is a square with nine points, with the first and last being the same, and the weights of the corners vary from 5 to √2/2 as it is being swept along the trajectory C, which is given in parametric form as x = 10s and y = cos(πs) − 1. The sweep transformation is given by

          ⎡ cos ψ   sin ψ   0  0 ⎤
[SwT]  =  ⎢ −sin ψ  cos ψ   0  0 ⎥ ,   where ψ = tan⁻¹[−π sin(πs)/10].
          ⎢ 0       0       1  0 ⎥
          ⎣ 10s     cos πs  0  1 ⎦

Figure 3 shows the sweeping of a square along a trajectory with deformation to a circle.
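As a rough illustration of Example 2, the sketch below assembles the homogeneous sweep transformation [SwT] at sampled values of s and applies it, in the row-vector convention used above, to a small square polygon net; the particular net and the sample spacing are assumptions made for illustration.

```python
# Build [SwT](s) along the trajectory x = 10s, y = cos(πs) − 1 and sweep a
# square polygon net along it.
import numpy as np

def sweep_transform(s):
    """Homogeneous sweep transformation [SwT] at trajectory parameter s."""
    psi = np.arctan(-np.pi * np.sin(np.pi * s) / 10.0)   # ψ = tan⁻¹[−π sin(πs)/10]
    c, si = np.cos(psi), np.sin(psi)
    return np.array([
        [c,        si,                0.0, 0.0],
        [-si,      c,                 0.0, 0.0],
        [0.0,      0.0,               1.0, 0.0],
        [10.0 * s, np.cos(np.pi * s), 0.0, 1.0],
    ])

# Unit square polygon net (closed: first point repeated), in homogeneous form.
square = np.array([[-1, -1], [1, -1], [1, 1], [-1, 1], [-1, -1]], float)
net_h = np.hstack([square, np.zeros((5, 1)), np.ones((5, 1))])   # (x, y, 0, 1)

# Sweep the net along the trajectory and collect the transformed point sets.
frames = [net_h @ sweep_transform(s) for s in np.linspace(0.0, 1.0, 11)]
```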
IV. Image Enhancement

Because they adapt to local properties of the image, general sweep morphological operations can provide varying degrees of smoothing for noise removal while preserving the object features. Statistical analyses of traditional morphological operations have been reported. Stevenson and Arce (1987) developed the output distribution function of opening with flat structuring elements by threshold decomposition. Morales and Acharya (1993) presented general solutions for the statistical analysis of morphological openings with compact, convex, and homothetic structuring elements.

Traditional opening removes noise as well as object features whose sizes are smaller than the structuring element. With the general sweep morphological opening, object features of similar shape and greater size compared to the structuring element are preserved while noise is removed. In general, the highly varying parts of the image are assigned smaller structuring elements and the slowly varying parts larger ones. The structuring elements can be assigned based on the contour gradient variation. An example is illustrated in Figure 4.
A step edge is an important feature in an image. Assume a noisy step edge is defined as

f(x) = { Nx,       if x < 0
       { h + Nx,   if x ≥ 0,
where h is the strength of the edge and Nx is i.i.d. Gaussian random noise with mean 0 and variance 1.

Figure 4. Structuring element assignment using general sweep morphology.

For image filtering with a general sweep morphological opening, we essentially adopt a smaller structuring element at the important feature points and a larger one at other
locations. Therefore, the noise in an image is removed while the features are preserved. For instance, in a one-dimensional image this can easily be achieved by computing the gradient as f(x) − f(x − 1) and assigning smaller structuring elements to those points whose gradient values are larger than a predefined threshold. In Chen et al. (1993), the results of noisy step-edge filtering by both the traditional morphological opening and the so-called space-varying opening (involving both scaling and translation in our general sweep morphology model) were shown and compared by computing the mean and variance of the output signals. The mean value of the output distribution follows the main shape of the filtering result well, which gives evidence of the shape-preserving ability of the proposed operation. Meanwhile, the variance of the output distribution coincides with the noise variance, which shows the corresponding noise-removing ability. It is observed that the general sweep opening possesses approximately the same noise-removing ability as the traditional one. Moreover, the relative edge strength with respect to the variation over the transition interval, say [−2, 2], is larger for the general sweep opening than for the traditional one. This explains why the edge is degraded in the traditional morphology case but is enhanced in the general sweep case. Although a step-edge model was tested successfully, other more complicated cases need further elaboration. Statistical analysis providing a quantitative approach to general sweep morphological operations will be investigated further. Chen et al. (1999) have shown image filtering using adaptive signal processing, which is nothing but sweep morphology with only scaling and translation. The method uses space-varying structuring elements by
assigning different filtering scales to the feature parts and other parts. To adaptively assign structuring elements, they have developed the progressive umbra-filling (PUF) procedure. This is an iterative process. The experimental results have shown that this approach can successfully eliminate noise without oversmoothing the important features of a signal.
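A minimal one-dimensional sketch of this idea is given below, assuming a synthetic noisy step edge and hand-picked window sizes and threshold; it uses plain gray-scale openings combined by a feature mask rather than the PUF procedure itself.

```python
# Gradient-based structuring-element assignment for a 1D noisy step edge:
# points with a large local gradient keep a small flat structuring element,
# all other points use a large one.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
x = np.arange(200)
signal = np.where(x < 100, 0.0, 4.0) + rng.normal(0.0, 1.0, size=x.size)  # noisy step edge

gradient = np.abs(np.diff(signal, prepend=signal[0]))   # f(x) − f(x − 1)
feature = gradient > 2.0                                # high-gradient (edge) points

small = ndimage.grey_opening(signal, size=3)            # preserves the edge
large = ndimage.grey_opening(signal, size=15)           # smooths flat regions strongly

# Combine: small structuring element near the features, large one elsewhere.
filtered = np.where(ndimage.binary_dilation(feature, iterations=3), small, large)
```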
V. Edge Linking

An edge is a local property of a pixel and its immediate neighborhood. An edge detector is a local operator that locates sharp changes in the intensity function. An ideal edge has a step-like cross section, as gray levels change abruptly across the border. In practice, edges in digital images are generally slightly blurred owing to the effects of sampling and noise. There are many edge detection algorithms, and the basic idea underlying most edge detection techniques is the computation of a local derivative operator (Gonzalez and Woods, 2002). Some algorithms, such as the LoG filter, produce closed edges; however, false edges are generated when blur and noise appear in an image. Other algorithms, such as the Sobel operator, produce noisy boundaries that do not actually lie on the borders, as well as broken gaps where border pixels should reside. That is because noise and breaks are present in the boundary from nonuniform illumination and other effects that introduce spurious intensity discontinuities. Thus, edge detection algorithms are typically followed by linking and other boundary detection procedures, which are designed to assemble edge pixels into meaningful boundaries.

Edge linking by the tree search technique was proposed by Martelli (1976) to link the edge sequentially along the boundary between pixels. The cost of each boundary element is defined by the step size between the pixels on both of its sides. A larger intensity difference corresponds to a larger step size, which is assigned a lower cost. The path of boundary elements with the lowest cost is linked up as an edge. The cost function was later redefined by Cooper et al. (1980), where the edge is extended through the path having a maximal local likelihood. Similar efforts were made by Eichel et al. (1988) and by Farag and Delp (1995). Basically, the tree search method is time consuming and requires a suitable assignment of root points. Another method locates all of the end points of the broken edges and uses a relaxation method to pair them up, so that line direction is maintained, lines are not allowed to cross, and closer points are matched first. However, this results in problems if unmatched end points or noise are present.

A simple approach to edge linking is a morphological dilation of points by some arbitrarily selected radius of circles, followed by the OR operation of
the boundary image with the resulting dilated circles; the result is finally skeletonized (Russ, 1992). This method, however, has a problem in that some of the points may be too far apart for the circles to touch, while the circles may obscure details by touching several existing lines. To overcome this, sweep mathematical morphology is used to allow the variation of the structuring element according to local properties of the input pixels.

A. Edge Linking Using Sweep Morphology

Let B denote the elliptic structuring element shown in Figure 5, where p and q denote, respectively, the semimajor and semiminor axes. That is,

∂B ≡ { [x, y]ᵀ | x²/p² + y²/q² = 1 }.

Figure 5. The elliptic structuring element.
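A small sketch of generating such a directional elliptical structuring element as a binary mask is shown below; the semiaxes p and q and the orientation angle are illustrative parameters, not values prescribed by the chapter.

```python
# Binary elliptical structuring element with semiaxes p, q, rotated by theta
# so that the semimajor axis aligns with a given tangent direction.
import numpy as np

def elliptical_se(p, q, theta):
    """Binary mask of an ellipse with semiaxes p, q rotated by theta (radians)."""
    r = int(np.ceil(max(p, q)))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    # Rotate the coordinates into the ellipse frame.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return (xr / p) ** 2 + (yr / q) ** 2 <= 1.0

se = elliptical_se(p=7, q=2, theta=np.pi / 6)   # major axis along the local tangent
```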
An edge-linking algorithm was proposed by Shih and Cheng (2004) based on the sweep dilation, thinning, and pruning. This is a three-step process as explained below. Step 1: Sweep Dilation. The broken line segments can be linked up by using the sweep morphology provided that the structuring element is suitably adjusted. Considering the input signal plotted in Figure 6a, the concept of using the sweep morphological dilation is illustrated in Figure 6b. Extending the line segments in the direction of the local slope performs the linking. The basic shape of the structuring element is an ellipse, where the major axis is always aligned with the tangent of the signal. The elliptical structuring element reduces noisy edge points and small insignificant branches. The width of the ellipse is selected to accomplish this purpose. The major axis of the ellipse should be adapted to the local curvature of the input signal to protect it from overstretch at high curvature points. At high curvature points, a short major axis is selected and vice versa. Step 2: Thinning. After performing the sweep dilation by directional ellipses, the edge segments are extended in the direction of the local slope.
Figure 6. (a) Input signal and (b) sweep dilation with elliptical structuring elements.

Figure 7. (a) Original elliptical edge and (b) its randomly discontinuous edge.
Because the tolerance (or the minor axis of the ellipse) is added, the edge segments grow a little thick. To suppress this effect, morphological thinning is adopted. An algorithm of thinning using mathematical morphology was proposed by Jang and Chin (1990). The skeletons generated by their algorithm are connected, one pixel width, and closely follow the medial axes. The algorithm is an iterative process based on the hit/miss operation. Four structuring elements
are constructed to remove boundary pixels from four directions, and another four are constructed to remove the extra pixels at skeleton junctions. There are four passes in each iteration. Three of the eight predefined structuring element templates are applied simultaneously in each pass. The iterative process is performed until the result converges. The thinning algorithm does not shorten the skeletal legs; therefore, it is applied to the sweep-dilated edges.

Step 3: Pruning. The dilated edge segments after thinning may still produce a small number of short skeletal branches. These short branches should be pruned. In a skeleton, any pixel that has three or more neighbors is called a root. Starting from each neighbor of the root pixel, the skeleton is traced outward. Those paths whose lengths are shorter than a given threshold k are treated as branches and are pruned away.

Figure 7a shows an original elliptical edge and Figure 7b shows its randomly discontinuous edge. The sweep morphological edge-linking algorithm is tested on the edge in Figure 7b. Figure 8 shows the results of using circular structuring elements with radius r = 3, r = 5, and r = 10, respectively, in five iterations. Compared with the original ellipse in Figure 7a, we see that if the gap is larger than the radius of the structuring element, it is difficult to link the gap smoothly. However, if a very large circular structuring element is used, the edge will look hollow and protuberant; using a big circle can also obscure the details of the edge. Figure 9 shows the result of using the sweep morphological edge-linking algorithm.

Figure 8. Using circular structuring elements in five iterations with (a) r = 3, (b) r = 5, and (c) r = 10.

Figure 9. Using the sweep morphological edge-linking algorithm.

Figure 10a shows the edge of an industrial part and Figure 10b shows its randomly discontinuous edge. Figure 10c shows the result of using the sweep morphological edge-linking algorithm. Figure 11a shows the edge with added uniform noise and Figure 11b shows the edge after removing noise. Figure 11c shows the result of using the sweep morphological edge-linking algorithm. Figure 12a shows a face image with the originally detected broken edge. Figure 12b shows the face image with the edge linked by the sweep morphological edge-linking algorithm.

Figure 10. (a) The edge of an industrial part, (b) its randomly discontinuous edge, and (c) using the sweep morphological edge-linking algorithm.
Figure 11. (a) Part edge with added uniform noise, (b) part edge after removing noise, and (c) using the sweep morphological edge-linking algorithm.
Figure 12. (a) Face image with the originally detected broken edge and (b) face image with the edge linked by the sweep morphological edge-linking algorithm.

VI. Shortest Path Planning for Mobile Robot

The recent advances in the fields of robotics and artificial intelligence have stimulated considerable interest in robot motion planning and the shortest path-finding problem (Latombe, 1991). Path planning is in general concerned with finding paths connecting different locations in an environment (e.g., a network, a graph, or a geometric space). Depending on the specific
applications, the desired paths often need to satisfy some constraints (e.g., obstacle avoiding) and optimize certain criteria (e.g., variant distance metrics and cost functions). The problems of planning shortest paths arise in many disciplines, and in fact these constitute one of the most powerful tools for modeling combinatorial optimization problems.
In the path planning problem, a mobile robot of arbitrary shape moves from a starting position to a destination in a finite space containing arbitrarily shaped obstacles. When traditional mathematical morphology is applied to solve the problem, its drawback is the fixed directional movement of the structuring element (i.e., the robot), which no longer yields the optimal path in real-world applications (Lin and Chang, 1993). By incorporating rotation into the motion of the moving object, more realistic solutions to the shortest path finding problem are obtained. The shortest path finding problem is equivalent to applying a sweep (rotational) morphological erosion to the free space, followed by a distance transformation (Shih and Wu, 2004) on the domain with the grown obstacles excluded, and then tracing back the distance map from the destination point to the neighbors with the minimum distance until the starting point is reached (Pei et al., 1998). An example illustrating the shortest path of an H-shaped car by using the sweep (rotational) morphology is shown in Figure 13.

Figure 13. Shortest path of an H-shaped car by using the sweep (rotational) morphology.
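The following is a simplified grid-based sketch of this scheme; for brevity it erodes the free space with a fixed (nonrotating) footprint and uses a breadth-first-search distance map, so it only approximates the sweep (rotational) formulation. The map, footprint, start, and goal are assumptions for illustration.

```python
# Erode the free space by the robot footprint, compute a distance map from the
# start, and trace back from the destination along decreasing distances.
import numpy as np
from collections import deque
from scipy import ndimage

free = np.ones((40, 40), bool)
free[10:30, 18:22] = False                      # a wall-like obstacle
footprint = np.ones((3, 3), bool)               # robot shape (structuring element)
safe = ndimage.binary_erosion(free, structure=footprint)   # obstacle-grown-free space

start, goal = (5, 5), (35, 35)
dist = np.full(free.shape, -1, int)
dist[start] = 0
queue = deque([start])
while queue:                                     # 4-connected BFS distance transform
    r, c = queue.popleft()
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 40 and 0 <= nc < 40 and safe[nr, nc] and dist[nr, nc] < 0:
            dist[nr, nc] = dist[r, c] + 1
            queue.append((nr, nc))

path = [goal]                                    # trace back to the start
while path[-1] != start:
    r, c = path[-1]
    neighbors = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))]
    path.append(min((n for n in neighbors if dist[n] >= 0), key=lambda n: dist[n]))
path.reverse()
```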
VII. Geometric Modeling and Sweep Mathematical Morphology

The dilation can be represented in matrix form as follows. Let A(t) be represented by the matrix [ax(t), ay(t), az(t)], where 0 ≤ t ≤ 1. For every t, let the scaling factors be sx(t), sy(t), sz(t) and the rotation factors be θx(t), θy(t), θz(t). By using homogeneous coordinates, the scaling transformation matrix can be represented as

          ⎡ sx(t)  0      0      0 ⎤
S(t)   =  ⎢ 0      sy(t)  0      0 ⎥
          ⎢ 0      0      sz(t)  0 ⎥
          ⎣ 0      0      0      1 ⎦ .

The rotation matrix about the x-axis is represented as

            ⎡ 1  0           0          0 ⎤
[Rx(t)]  =  ⎢ 0  cos θx(t)   sin θx(t)  0 ⎥
            ⎢ 0  −sin θx(t)  cos θx(t)  0 ⎥
            ⎣ 0  0           0          1 ⎦ ,

where

cos θx(t) = [ay(t − 1) − ay(t + 1)] / hx,   sin θx(t) = [az(t + 1) − az(t − 1)] / hx,

hx = √{ [ay(t − 1) − ay(t + 1)]² + [az(t + 1) − az(t − 1)]² }.
The rotation matrices about the y- and z-axes can be derived similarly. Finally, the structuring element is translated by using

          ⎡ 1      0      0      0 ⎤
A(t)   =  ⎢ 0      1      0      0 ⎥
          ⎢ 0      0      1      0 ⎥
          ⎣ ax(t)  ay(t)  az(t)  1 ⎦ .

Therefore, the sweep dilation is equivalent to the concatenated transformation matrices

(A ⊞ B)(t) = [B][S(t)][Rx(t)][Ry(t)][Rz(t)][A(t)],   where 0 ≤ t ≤ 1.
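A short sketch of estimating the tangent-based angle θx(t) from sampled trajectory coordinates, and of concatenating part of the chain of homogeneous matrices (here only S, Rx, and the translation), is given below; the sampled trajectory and the scale values are assumptions for illustration.

```python
# Estimate the x-axis rotation from the central differences given above and
# concatenate a few of the homogeneous matrices in row-vector form.
import numpy as np

def x_rotation_from_tangent(ay, az, j):
    """[Rx] at sample j, from the central differences of ay and az."""
    dy = ay[j - 1] - ay[j + 1]
    dz = az[j + 1] - az[j - 1]
    hx = np.hypot(dy, dz)
    cos_a, sin_a = dy / hx, dz / hx
    return np.array([[1, 0, 0, 0],
                     [0, cos_a, sin_a, 0],
                     [0, -sin_a, cos_a, 0],
                     [0, 0, 0, 1.0]])

t = np.linspace(0.0, 1.0, 50)
ax, ay, az = t, np.cos(np.pi * t), np.sin(np.pi * t)      # sampled A(t)
j = 10
S = np.diag([1.5, 1.5, 1.0, 1.0])                          # S(t) at sample j
T = np.eye(4); T[3, :3] = ax[j], ay[j], az[j]               # translation A(t), last row
M = S @ x_rotation_from_tangent(ay, az, j) @ T              # part of [S][Rx]...[A(t)]
```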
Schemes based on sweep representation are useful in creating solid models of two-and-a-half-dimensional objects, which include both solids of uniform thickness in a given direction and axis-symmetric solids. Computer representation of the swept volume of a planar surface has been used as a primary modeling scheme in solid modeling systems (Shih and Cheng, 2004). Representation of the swept volume of a 3D object (Stevenson and Arce, 1987), however, has received limited attention. Leu et al. (1986) presented a method for representing the swept volumes of translating objects using boundary representation and ray in–out classification. Their method is restricted to translation only. Representing the swept volumes of moving objects under a general motion is a more complex problem. A number of researchers have examined the problem of computing swept volumes, including Korein (1985) for rotating polyhedra, Kaul (1993) using Minkowski sums for translation, Wang and Wang (1986) using envelope theory, and Martin and Stephenson (1990) using envelope theory and computer algebraic techniques. In this chapter, geometric modeling based on sweep morphology is proposed. Because of the morphological operators' geometric nature and nonlinear property, some modeling problems become simple and intuitive. This framework can be used for modeling not only swept surfaces and volumes but also tolerances in manufacturing.

A. Tolerance Expression

Tolerances constrain an object's features to lie within regions of space called tolerance zones (Requicha, 1984). Tolerance zones in Rossignac and Requicha (1985) were constructed by expanding the nominal feature to obtain the region bounded by the outer closed curve, shrinking the nominal feature to obtain the region bounded by the inner curve, and then subtracting the two resulting regions. This procedure is equivalent to the morphological dilation of the offset inner contour with a tolerance-radius disked structuring element.
Figure 14. Tolerance zones. (a) An annular tolerance zone that corresponds to a circular hole. (b) A tolerance zone for an elongated slot.
Figure 15. (a, b) An example of adding tolerance by a morphological dilation.
Figure 14a shows an annular tolerance zone that corresponds to a circular hole, and Figure 14b shows a tolerance zone for an elongated slot. Both can be constructed by dilating the nominal contour with a tolerance-radius disked structuring element, as shown in Figure 15. The tolerance zone for testing the size of a round hole is an annular region lying between two circles with the specified maximal and minimal diameters; the zone corresponding to a form constraint for the hole is also an annulus, defined by two concentric circles whose diameters must differ by a specified amount but are otherwise arbitrary (Shih et al., 1994). The sweep mathematical morphology supports the conventional limit (±) tolerances on "dimensions" that appear in engineering drawings. The positive deviation is equivalent to the dilated result and the negative deviation is equivalent to the eroded result. Industrial parts with added tolerance information can thus be expressed using a dilation with a circle.
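As a small illustration, the sketch below constructs an annular tolerance zone in the spirit of Figure 14a by dilating a nominal circular contour with a tolerance-radius disk; the nominal radius, tolerance value, and grid size are assumptions for illustration.

```python
# Annular tolerance zone: dilate a nominal circular contour with a disk whose
# radius equals the tolerance.
import numpy as np
from scipy import ndimage

size, nominal_r, tol = 101, 30, 4
y, x = np.ogrid[:size, :size]
c = size // 2
radius = np.hypot(x - c, y - c)
contour = np.abs(radius - nominal_r) <= 0.5          # nominal feature contour

ty, tx = np.ogrid[-tol:tol + 1, -tol:tol + 1]
disk = tx ** 2 + ty ** 2 <= tol ** 2                 # tolerance-radius disk

zone = ndimage.binary_dilation(contour, structure=disk)   # annular tolerance zone
```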
Figure 16. Modeling of sweep surface.
B. Sweep Surface Modeling

The simplest sweep surface is generated by a profile sweeping along a spine, with or without deformation. This is nothing but the sweep mathematical dilation of the two curves. Let P(u) be the profile curve, B(w) be the spine, and S(u, w) be the sweep surface. The sweep surface can be expressed as S(u, w) = P(u) ⊞ B(w). A sweep surface with initial and final profiles P1(u) and P2(u) at relative locations O1 and O2, respectively, and with the sweeping rule R(w) is shown in Figure 16 and can be expressed as

S(u, w) = [1 − R(w)]{[P1(u) ⊞ B(w)] − O1} + R(w){[P2(u) ⊞ B(w)] − O2}.
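A discrete sketch of this blending is given below: two profiles are swept along a straight spine and blended by a linear rule R(w). The particular profiles, spine, and rule are assumptions for illustration, and rotation of the cross section is omitted.

```python
# Blend two cross-section profiles along a spine with a linear rule R(w).
import numpy as np

u = np.linspace(0.0, 2 * np.pi, 40)          # profile parameter
w = np.linspace(0.0, 1.0, 60)                # spine parameter

P1 = np.column_stack([np.cos(u), np.sin(u)])               # circular initial profile
P2 = np.column_stack([2 * np.cos(u), 0.5 * np.sin(u)])     # elliptical final profile
spine = np.column_stack([np.zeros_like(w), np.zeros_like(w), 10 * w])  # straight spine B(w)

R = w                                         # blending rule R(w) from 0 to 1
surface = np.empty((w.size, u.size, 3))
for j, (rw, b) in enumerate(zip(R, spine)):
    profile = (1 - rw) * P1 + rw * P2         # blended cross section
    surface[j, :, :2] = profile + b[:2]       # translate by the spine point (sweep)
    surface[j, :, 2] = b[2]
```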
VIII. Formal Language and Sweep Morphology

Our representation framework is formulated as follows. Let E^N denote the set of points in the N-dimensional Euclidean space and p = (x1, x2, . . . , xN) represent a point in E^N. In this way, any object is a subset of E^N. The formal model for geometric modeling is a context-free grammar, G, consisting of a four-tuple (Fu, 1982; Gosh, 1988; Shih, 1991):

G = (VN, VT, P, S),

where VN is a set of nonterminal symbols, such as complicated shapes; VT is a set of terminal symbols that contains two sets: one is the decomposed primitive shapes, such as lines and circles, and the other is the shape operators; P is a finite set of rewriting rules or productions denoted by A → β, where A ∈ VN and β is a string over VN ∪ VT; S is the start symbol, which represents
the solid object. The operators used include the sweep morphological dilation, set union, and set subtraction. Note that such a production allows the nonterminal A to be replaced by the string β independent of the context in which A appears. A context-free grammar has productions of the form A → β, where A is a single nonterminal and β is a nonempty string of terminals and nonterminals. The languages generated by context-free grammars are called context-free languages. Object representation can be viewed as the task of converting a solid shape into a sentence in the language, whereas object classification is the task of "parsing" a sentence. The criteria for primitive selection are influenced by the nature of the data, the specific application in question, and the technology available for implementing the system. The following serves as a general guideline for primitive selection.

1. The primitives should be the basic shape elements that can provide a compact but adequate description of the object shape in terms of the specified structural relations (e.g., the concatenation relation).

2. The primitives should be easily extractable by existing nonsyntactic (e.g., decision-theoretic) methods, since they are considered to be simple and compact shapes and their structural information is not important.
IX. Representation Scheme

In this section, we describe the 2D and 3D attributes in our representation scheme.

A. Two-Dimensional Attributes

Commonly used 2D attributes are the rectangle, parallelogram, triangle, rhombus, circle, and trapezoid. They can be represented easily by using the sweep morphological operators. The expressions are not unique, and the preference depends on the simplest combination and the least computational complexity. The common method is to decompose the attributes into smaller components and apply morphological dilations to grow these components. Let a and b represent unit vectors in the x- and y-axes, respectively. The unit vector could represent 1 m, 0.1 m, 0.01 m, and so on as needed. Note that when the sweep dilation is not associated with rotation and scaling, it is equivalent to the traditional dilation.

a. Rectangle: It is represented as a unit x-axis vector a swept along a unit y-axis vector b, that is, b ⊞ a with no rotation or scaling.
Figure 17. (a, b) The decomposition of two-dimensional attributes.
b. Parallelogram: Let k denote the vector sum of a and b as defined in item a. It is represented as k ⊞ a with no rotation or scaling.

c. Circle: Using a sweep rotation, a circle can be represented as a unit vector a swept about a point p through 2π degrees, that is, p ⊞ a.

d. Trapezoid: b ⊞ a with a linear scaling factor to change the magnitude of a into c as it is swept along b, as shown in Figure 17a. Let 0 ≤ t ≤ 1. The scaling factor along the trajectory b is S(t) = (c/a)t + (1 − t).

e. Triangle: b ⊞ a, similar to a trapezoid but with a linear scaling factor to change the magnitude of a into zero as it is swept along b, as shown in Figure 17b. Note that the shape of triangles (e.g., an equilateral or right triangle) is determined by the fixing location of the reference point (i.e., the origin) of the primitive line a.
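The sketch below generates a few of these 2D primitives as dense point sets directly from the sweep expressions, including the trapezoid's linear scaling S(t) = (c/a)t + (1 − t); the vector lengths and sampling density are assumptions chosen for illustration.

```python
# 2D primitives from sweep expressions: rectangle b ⊞ a, trapezoid b ⊞ a with
# linear scaling, and circle p ⊞ a (rotational sweep).
import numpy as np

t = np.linspace(0.0, 1.0, 100)[:, None]   # position along the trajectory b
s = np.linspace(0.0, 1.0, 100)[None, :]   # position along the primitive vector a

def rectangle(a_len, b_len):
    x = s * a_len + 0 * t                  # sweep a along b, no scaling or rotation
    y = t * b_len + 0 * s
    return np.column_stack([x.ravel(), y.ravel()])

def trapezoid(a_len, b_len, c_len):
    scale = (c_len / a_len) * t + (1.0 - t)   # S(t) = (c/a)t + (1 − t)
    x = s * scale * a_len
    y = t * b_len + 0 * s
    return np.column_stack([x.ravel(), y.ravel()])

def circle(a_len):
    theta = np.linspace(0.0, 2 * np.pi, 100)[:, None]   # rotational sweep through 2π
    r = s * a_len
    return np.column_stack([(r * np.cos(theta)).ravel(), (r * np.sin(theta)).ravel()])
```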
B. Three-Dimensional Attributes The 3D attributes can be applied by a similar method. Let a, b, c denote unit vectors in the x-, y-, and z-axes, respectively. The formal expressions are presented below. a. Parallelepiped: It is represented as a unit vector a swept along a unit vector b to obtain a rectangle and then it is swept along a unit vector c to obtain the parallelepiped, that is, c ⊞ (b ⊞ a). b. Cylinder: It is represented as a unit vector a swept about a point p through 2π degrees to obtain a circle, and then it is swept along a unit vector c to obtain the cylinder, that is, c ⊞ (p ⊞ a).
Figure 18. Sweep dilation of a rectangle with a corner truncated by a circle.
c. Parallelepiped with a corner truncated by a sphere: A unit vector a is swept along a unit vector b to obtain a rectangle. A vector r is swept about a point p through 2π degrees to obtain a circle, and then it is subtracted
from the rectangle. The result is swept along a unit vector c, that is, c ⊞ [(b ⊞ a) − (p ⊞ r)], as shown in Figure 18.

d. Sweep dilation of a square along a trajectory with deformation to a circle: The square is represented as a rational B-spline curve. The polygon net is specified by a square with nine points, with the first and the last being the same, and the weights of the corners vary from 5 to √2/2 as it is swept along the trajectory C, which is defined in the parametric form as x = 10s and y = cos(πs) − 1. The sweep transformation is given by

          ⎡ cos ψ   sin ψ   0  0 ⎤
[SwT]  =  ⎢ −sin ψ  cos ψ   0  0 ⎥ ,   where ψ = tan⁻¹[−π sin(πs)/10].
          ⎢ 0       0       1  0 ⎥
          ⎣ 10s     cos πs  0  1 ⎦

Figure 19. Sweeping of a square along a trajectory with deformation to a circle.
The formal expression is C⊞B. The sweeping of a square along a trajectory with deformation to a circle is shown in Figure 19. e. Parallelepiped with a cylindrical hole: A unit vector a is swept along a unit vector b to obtain a rectangle. A vector r is swept about a point p through 2π degrees to obtain a circle, and it is subtracted from the rectangle. The
result is swept along a unit vector c, that is, c ⊞ [(b ⊞ a) − (p ⊞ r)], as shown in Figure 20.

Figure 20. Sweep dilation of a rectangle with a circular hole.

f. U-shape block: A unit vector a is swept along a unit vector b to obtain a rectangle. A vector r is swept about a point p through π degrees to obtain a half circle, and it is dilated along the rectangle to obtain a two-rounded-corner rectangle that is then subtracted from another rectangle to obtain a U-shaped 2D object. The result is swept along a unit vector c to obtain the final U-shaped object, that is, c ⊞ {(b′ ⊞ a′) − [(b ⊞ a) ⊞ (p ⊞ r)]}, as shown in Figure 21.

Figure 21. Machining with a round bottom tool.

Note that the proposed sweep mathematical morphology model can be applied to the NC machining process. For example, the ball-end milling cutter can be viewed as the structuring element, and it can be moved along a predefined path to cut a work piece. During the movement, the cutter can be rotated to be perpendicular to the sweep path. If the swept volume is subtracted from the work piece, the remaining part can be obtained.
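The following sketch evaluates a composite expression of the form c ⊞ [(b ⊞ a) − (p ⊞ r)] on a discrete grid: the rectangle and circle are built as binary masks, subtracted, and then extruded along c; all sizes are assumptions for illustration.

```python
# Composite sweep expression on a grid: rectangle minus circle, then extrusion.
import numpy as np

H, W = 60, 80
yy, xx = np.mgrid[:H, :W]

rect = (xx >= 10) & (xx < 70) & (yy >= 10) & (yy < 50)      # b ⊞ a
hole = (xx - 55) ** 2 + (yy - 20) ** 2 <= 8 ** 2            # p ⊞ r
profile = rect & ~hole                                      # (b ⊞ a) − (p ⊞ r)

depth = 20
solid = np.repeat(profile[None, :, :], depth, axis=0)       # c ⊞ [...]: extrusion along c
```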
X. Grammars

In this section, we describe grammars for the 2D and 3D attributes. We have experimented on many geometric objects. The results show that our model works successfully.
A. Two-Dimensional Attributes

All the primitive 2D objects can be represented by the following grammar using the sweep mathematical morphology model: G = (VN, VT, P, S), where

VN = {S, A, B, K},
VT = {a, b, k, p, ⊞},
P : S → B ⊞ A | B ⊞ K | p ⊞ A, A → aA | a,
B → bB | b,
K → kK | k.
The sweep dilation ⊞ can be ⊕ (S = 0, θ = 0), ⊕ [S = (c/a)t + (1 − t), θ = 0], or ⊕ (S = 0, θ = 2π). Note that the repetition of a unit vector in a generated string is the usual method of grammatical representation. We can shorten the length of a string by adopting a repetition symbol "*." For example, "*5a" denotes "aaaaa."

a. Rectangle can be represented by the string bb ⊞ aa, with as and bs repeated any number of times depending on the required size.

b. Parallelogram can be represented by the string kk ⊞ aaa, with as and ks repeated any number of times depending on the required size.

c. Circle can be represented by the string p ⊞ aaa, with as repeated any number of times depending on the required size and with ⊞ as ⊕ (S = 0, θ = 2π).

d. Trapezoid can be represented by the string bb ⊞ aa, with as and bs repeated any number of times depending on the required size and with ⊞ as ⊕ [S = (c/a)t + (1 − t), θ = 0].

e. Triangle can be represented by the string bb ⊞ aa, with as and bs repeated any number of times depending on the required size and with ⊞ as ⊕ [S = (1 − t), θ = 0].

B. Three-Dimensional Attributes

All the primitive 3D objects can be categorized into the following grammar:

G = (VN, VT, P, S),
where VN = {S, A, B, C},
VT = {a, b, c, p, (, ), ⊞},
P : S → C ⊞ (B ⊞ A) | C ⊞ (p ⊞ A), A → aA | a,
B → bB | b,
C → cC | c.
The sweep dilation ⊞ can be either ⊕ (S = 0, θ = 0) or ⊕ (S = 0, θ = 2π ).
a. Parallelepiped can be represented by the string ccc ⊞ (bb ⊞ aaa), with as, bs, and cs repeated any number of times depending on the required size. b. Cylinder can be represented by the string cccc ⊞ (p ⊞ aaa), with as and cs repeated any number of times depending on the required size and with the first dilation operator ⊞ as ⊕ (S = 0, θ = 2π ) and the second dilation as the traditional dilation. c. Consider the grammar G = (VN , VT , P , S), where VN = {A, B, C, N},
VT = {a, b, c, p, ⊞, −, (, )},
P : S → C ⊞ N,
N → rectangle − circle, C → cC | c.
The productions for the rectangle and circle are given in Section X.A. c.1. The sweep dilation of a rectangle with a corner truncated by a circle can be represented by the string cc ⊞ [(bb ⊞ aaa) − (p ⊞ aa)], with as, bs, and cs repeated any number of times depending on the required size and with the second dilation operator ⊞ as ⊕ (S = 0, θ = 2π ). c.2. The sweep dilation of a rectangle with a circular hole can be represented by the string cc ⊞ [(bb ⊞ aaa) − (p ⊞ a)], with as, bs, and cs repeated any number of times depending on the required size and with the second dilation operator ⊞ as ⊕ (S = 0, θ = 2π ). The difference from the previous one is that the circle lies completely within the rectangle, and hence we obtain a hole instead of a corner truncated.
d. The grammar for the U-shape block can be represented as follows: G = (VN , VT , P , S), where VN = {A, B, C, N, M, half_circle},
VT = {a, b, c, p, ⊞, −},
P : S → C ⊞ N,
N → rectangle − M, C → cC | c,
M → rectangle ⊞ half_circle,
half_circle → p ⊞ A.
The U-shape block can be represented by the string ccc ⊞ {(bbb ⊞ aaaa) − [(bb ⊞ aa) ⊞ (p ⊞ a)]}, with as, bs, and cs repeated any number of times depending on the required size and with the fourth dilation operator ⊞ as ⊕ (S = 0, θ = π).
XI. Parsing Algorithm

Given a grammar G and an object representation as a string, the string can be parsed to determine whether it belongs to the given grammar. There are various parsing algorithms, among which Earley's parsing algorithm for context-free grammars is very popular. Let V* denote the set of all sentences composed of elements from V. The algorithm is described as follows:

Input: A context-free grammar G = (VN, VT, P, S) and an input string w = a1 a2 . . . an in VT*.
Output: The parse lists I0, I1, . . . , In.
Method: First construct I0 as follows:

1. If S → α is a production in P, add [S → .α, 0] to I0. Now, perform steps 2 and 3 until no new item can be added to I0.
2. If [B → γ., 0] is on I0, add [A → αB.β, 0] for all [A → α.Bβ, 0] on I0.
3. Suppose that [A → α.Bβ, 0] is an item in I0. Add to I0, for all productions in P of the form B → γ, the item [B → .γ, 0] (provided this item is not already in I0).

Now, we construct Ij, having constructed I0, I1, . . . , Ij−1.

4. For each [B → α.aβ, i] in Ij−1 such that a = aj, add [B → αa.β, i] to Ij.
Now, perform steps 5 and 6 until no new items can be added.

5. Let [A → γ., i] be an item in Ij. Examine Ii for items of the form [B → α.Aβ, k]. For each one found, add [B → αA.β, k] to Ij.
6. Let [A → α.Bβ, i] be an item in Ij. For all B → γ in P, add [B → .γ, j] to Ij.

The algorithm, then, is to construct Ij for 0 < j ≤ n. Some examples of the parser are shown below.

Example 1. Let a rectangle be represented by the string b ⊞ aa. The given grammar is G = (VN, VT, P, S),
where
VN = {S, A, B},
VT = {a, b, ⊞},
P : S → B ⊞ A, A → aA | a,
B → bB | b.
The parsing lists obtained are as follows:

I0: [S → .B ⊞ A, 0], [B → .bB, 0], [B → .b, 0]
I1: [B → b.B, 0], [B → b., 0], [S → B. ⊞ A, 0], [B → .bB, 1], [B → .b, 1]
I2: [S → B ⊞ .A, 0], [A → .aA, 2], [A → .a, 2]
I3: [A → a.A, 2], [A → a., 2], [S → B ⊞ A., 0], [A → .aA, 3], [A → .a, 3]
I4: [A → a.A, 3], [A → a., 3], [A → aA., 2], [A → .aA, 4], [A → .a, 4], [S → B ⊞ A., 0]
Since [S → B ⊞ A., 0] is on the last list, the input belongs to the language L(G) generated by G.

Example 2. Consider the input string b ⊞ ba. The given grammar is given by G = (VN, VT, P, S),
where VN = {S, A, B},
VT = {a, b, ⊞},
P : S → B ⊞ A, A → aA | a,
B → bB | b.

The parsing lists obtained are as follows:

I0: [S → .B ⊞ A, 0], [B → .bB, 0], [B → .b, 0]
I1: [B → b.B, 0], [B → b., 0], [S → B. ⊞ A, 0], [B → .bB, 1], [B → .b, 1]
I2: [S → B ⊞ .A, 0], [A → .aA, 2], [A → .a, 2]
I3: Nil.
Since there is no production starting with S on the last list, the input does not belong to the language L(G) generated by G.

A question that arises is how we could construct a grammar that will generate a language to describe any kind of solid object. Ideally, it would be nice to have a grammatical inference machine that would infer a grammar from a set of given strings describing the objects under study. Unfortunately, such a machine is not available except for some very special cases. In most cases so far, the designer constructs the grammar based on the available a priori knowledge and experience. In general, the increased descriptive power of a language is paid for in terms of the increased complexity of the analysis system. The trade-off between the descriptive power and the analysis
efficiency of a grammar for a given application is almost completely justified by the designer.

Consider the swept surface shown in Figure 22. Its string representation is the same as that of the swept surface shown in Figure 19; the only difference is that it is not deformed as it is being swept.

Figure 22. An example of a swept surface.
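To make the parse-list mechanics of Section XI concrete, the following is a compact sketch of an Earley recognizer applied to the grammar of Example 1; the Python encoding of the grammar and the two test strings are assumptions made for illustration, not code from the chapter.

```python
# A minimal Earley recognizer for the rectangle grammar of Example 1.
GRAMMAR = {
    "S": [["B", "⊞", "A"]],
    "A": [["a", "A"], ["a"]],
    "B": [["b", "B"], ["b"]],
}
NONTERMINALS = set(GRAMMAR)

def earley_recognize(tokens, start="S"):
    # chart[j] holds items (lhs, rhs, dot, origin)
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in GRAMMAR[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for j in range(len(tokens) + 1):
        changed = True
        while changed:                      # predictor/completer closure of I_j
            changed = False
            for (lhs, rhs, dot, origin) in list(chart[j]):
                if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                    for prod in GRAMMAR[rhs[dot]]:           # predictor
                        new = (rhs[dot], tuple(prod), 0, j)
                        if new not in chart[j]:
                            chart[j].add(new); changed = True
                elif dot == len(rhs):                        # completer
                    for (l2, r2, d2, o2) in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in chart[j]:
                                chart[j].add(new); changed = True
        if j < len(tokens):                                  # scanner
            for (lhs, rhs, dot, origin) in chart[j]:
                if dot < len(rhs) and rhs[dot] == tokens[j]:
                    chart[j + 1].add((lhs, rhs, dot + 1, origin))
    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[-1])

print(earley_recognize(list("b⊞aa")))   # True, as in Example 1
print(earley_recognize(list("b⊞ba")))   # False, as in Example 2
```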
XII. Conclusions

We have described the limitations of traditional morphological operations and defined new morphological operations, called general sweep morphology. It is shown that traditional morphology is a subset of general sweep morphology. The properties of the sweep morphological operations have been studied. We provide several examples of the proposed approach to demonstrate the advantages obtained by using sweep morphological operations instead of traditional morphological operations. The properties of opening and closing will be studied in the future.

We have also presented a method of geometric modeling and representation based on sweep mathematical morphology. Since the shape and the dimension of a 2D structuring element can be varied during the process, not only simple rotational and extruded solids but also more complicated objects with blending surfaces can be generated by sweep morphology. We have developed grammars for solid objects and have applied Earley's parsing algorithm to determine whether a given string belongs to a group of similar objects. We compared our model with two popular solid models: boundary representation and the CSG model. Our mathematical framework for modeling solid objects is sweep morphology, which provides a natural tool for shape representation. There are several advantages: simplicity, a large domain, lack of ambiguity, and ease of graphical creation and editing. Furthermore, it supports the conventional limit (±) tolerances on "dimensions" that appear in many engineering drawings; the positive deviation is equivalent to the dilated result and the negative deviation is equivalent to the eroded result. It has been demonstrated that sweep mathematical morphology is an efficient tool for geometric modeling and representation in an intuitive manner.
References

Blackmore, D., Leu, M.C., Shih, F.Y. (1994). Analysis and modeling of deformed swept volumes. Comput. Aided Design 26, 315–326.
Chen, C., Hung, Y., Wu, J. (1993). Space-varying mathematical morphology for adaptive smoothing of 3D range data. In: Asia Conference on Computer Vision, Osaka, Japan, pp. 23–25.
Chen, C.S., Wu, J.L., Hung, Y.P. (1999). Theoretical aspects of vertically invariant gray-level morphological operators and their application on adaptive signal and image filtering. IEEE Trans. Signal Process. 47, 1049– 1060. Cooper, D., Elliott, H., Cohen, F., Symosek, P. (1980). Stochastic boundary estimation and object recognition. Comput. Graphics Image Process 12, 326–356. Eichel, P.H., Delp, E.J., Koral, K., Buda, A.J. (1988). A method for a fully automatic definition of coronary arterial edges from cineangiograms. IEEE Trans. Med. Imag. 7, 313–320. Farag, A.A., Delp, E.J. (1995). Edge linking by sequential search. Pattern Recognit. 28, 611–633. Foley, J., van Dam, A., Feiner, S., Hughes, J. (1995). Computer Graphics: Principles and Practice, 2nd ed. Addison-Wesley, Reading, MA. Fu, K.S. (1982). Syntactic Pattern Recognition and Applications. PrenticeHall, Englewood Cliffs, NJ. Gonzalez, R.C., Woods, R.E. (2002). Digital Image Processing. AddisonWesley, New York. Gosh, P.K. (1988). A mathematical model for shape description using Minkowski operators. Compt. Vision, Graphics, and Image Processing 44, 239–269. Haralick, R.M., Sternberg, S.K., Zhuang, X. (1987). Image analysis using mathematical morphology. IEEE Trans. Pattern Anal. Mach. Intell. 9, 532– 550. Jang, B.K., Chin, R.T. (1990). Analysis of thinning algorithms using mathematical morphology. IEEE Trans. Pattern Anal. Mach. Intell. 12, 541–551. Kaul, A. (1993). Computing Minkowski Sums. Ph.D. thesis, Department of Mechanical Engineering, Columbia University. Korein, J. (1985). A Geometric Investigating Reach. MIT Press, Cambridge, MA. Latombe, J. (1991). Robot Motion Planning. Kluwer Academic, New York. Leu, M.C., Park, S.H., Wang, K.K. (1986). Geometric representation of translational swept volumes and its applications. ASME J. Eng. Industry 108, 113–119. Lin, P.L., Chang, S. (1993). A shortest path algorithm for nonrotating objects among obstacles of arbitrary shapes. IEEE Trans. Syst. Man Cybern. 23, 825–832. Martelli, A. (1976). An application of heuristic search methods to edge and contour detection. Communication ACM 19, 73–83. Martin, R.R., Stephenson, P.C. (1990). Sweeping of three dimensional objects. Comput. Aided Design 22, 223–234. Morales, A., Acharya, R. (1993). Statistical analysis of morphological openings. IEEE Trans. Signal Process. 41, 3052–3056.
Mott-Smith, J.C., Baer, T. (1972). Area and volume coding of pictures. In: Huang, T.S., Tretiak, O.J. (Eds.), Picture Bandwidth Compression. Gordon and Breach, New York. Pei, S.-C., Lai, C.-L., Shih, F.Y. (1998). A morphological approach to shortest path planning for rotating objects. Pattern Recognit. 31, 1127–1138. Pennington, A., Bloor, M.S., Balila, M. (1983). Geometric modeling: A contribution toward intelligent robots. In: Proc. 13th Inter. Symposium on Industrial Robots, pp. 35–54. Ragothama, S., Shapiro, V. (1998). Boundary representation deformation in parametric solid modeling. ACM Trans. Graphics 17, 259–286. Requicha, A.A.G. (1980). Representations for rigid solids: Theory, methods, and systems. ACM Computing Surveys 12, 437–464. Requicha, A.A.G. (1984). Representation of tolerances in solid modeling: Issues and alternative approaches. In: Boyse, J.W., Pickett, M.S. (Eds.), Solid Modeling by Computers. Plenum, New York, pp. 3–12. Requicha, A.A.G., Voelcker, H.B. (1982). Solid modeling: A historical summary and contemporary assessment. IEEE Comput. Graph. Appl. 2, 9–24. Rossignac, J. (2002). CSG-Brep duality and compression. In: Proc. ACM Symposium on Solid Modeling and Applications, Saarbrucken, Germany, pp. 59–66. Rossignac, J., Requicha, A.A.G. (1985). Offsetting operations in solid modeling. Production Automation Project, University of Rochester, NY Tech. Memo 53. Russ, J.C. (1992). The Image Processing Handbook. CRC Press, Boca Raton, FL. Serra, J. (1982). Image Analysis and Mathematical Morphology. Academic Press, London. Shih, F.Y. (1991). Object representation and recognition using mathematical morphology model. J. Syst. Integr. 1, 235–256. Shih, F.Y., Cheng, S. (2004). Adaptive mathematical morphology for edge linking. Inf. Sci. 167, 9–21. Shih, F.Y., Mitchell, O.R. (1989). Threshold decomposition of gray-scale morphology into binary morphology. IEEE Trans. Pattern Anal. Mach. Intell. 11, 31–42. Shih, F.Y., Mitchell, O.R. (1992). A mathematical morphology approach to Euclidean distance transformation. IEEE Trans. Image Process. 1, 197– 204. Shih, F.Y., Wu, Y. (2004). The efficient algorithms for achieving Euclidean distance transformation. IEEE Trans. Image Process. 13, 1078–1091. Shih, F.Y., Gaddipati, V., Blackmore, D. (1994). Error analysis of surface fitting for swept volumes. In: Proc. Japan–USA Symp. Flexible Automation, Kobe, Japan, pp. 733–737.
Shiroma, Y., Okino, N., Kakazu, Y. (1982). Research on 3-D geometric modeling by sweep primitives. In: Proc. of CAD, Brighton, United Kingdom, pp. 671–680. Shiroma, Y., Kakazu, Y., Okino, N. (1991). A generalized sweeping method for CSG modeling. In: Proc. of the First ACM Symposium on Solid Modeling Foundations and CAD/CAM Applications, Austin, Texas, pp. 149–157. Stevenson, R.L., Arce, G.R. (1987). Morphological filters: Statistics and further syntactic properties. IEEE Trans. Circuits Syst. 34, 1292–1305. Voelcker, H.B., Hunt, W.A. (1981). The role of solid modeling in machining process modeling and NC verification. SAE Tech. Paper #810195. Wang, W.P., Wang, K.K. (1986). Geometric modeling for swept volume of moving solids. IEEE Comp. Graph. Appl. 6, 8–17.
Further Reading

Brooks, R.A. (1981). Symbolic reasoning among 3-D models and 2-D images. Artif. Intell. 17, 285–348.
Requicha, A.A.G., Voelcker, H.B. (1983). Solid modeling: Current status and research direction. IEEE Comput. Graph. Appl. 3, 25–37.
Index
A
B
Absolute continuity, 69 Active learning, 69 Adaptive filtering, 216f Adaptive hybrid vector filters, 228–231 Adaptive multichannel filters, based on digital paths, 220–222 Adjacency, 36 ADP. See Approximate dynamic programming algorithms Algorithms, distance-based, 3 AMF. See Arithmetic mean filter Angular noise margins, 197 Anomalous X-ray scattering (AXS), 174 APD. See Avalanche photodiode Approximate dynamic programming algorithms (ADP), 94–99, 144 performance issues, 98–99 T-SO problems and, 94–96 Approximate policy iteration, 97–98 Approximate value iteration, 96–97 Approximation error, 73 Arithmetic mean filter (AMF), 208 Artworks, restoration of, 241–242 Atomic arrangements, around Fe atoms, 160f Atomic images data processing for obtaining, 145–149 three-dimensional, from single-energy holograms, 173f Atomic resolution holography, 178 Atom-resolved holography, 122 Au clusters, 153 Au crystals atomic images of, 143f dimer model of, 151–152 holograms of, 142f Avalanche photodiode (APD), 140 energy resolution of, 144 Averaging, 218–219 AXS. See Anomalous X-ray scattering
Backpropagation through structure (BPTS), 9 RNNs and, 18–22 Backward pass, 20 Barton algorithm, 142, 147–148, 154–155 Base b elementary intervals in, 85 sequence in, 85 Basic vector directional filter (BVDF), 212 Batch mode, 20 Bellman’s equation, 96 Bioinformatics, RNNs in, 8 BL47XU, 165 Block mode, 20 Boltmann constant, 150 Boundaries inner, 269 outer, 269 representation of, 267 Bounded variations, ensuring, 80–84 BPTS. See Backpropagation through structure Bragg condition, 128 Bragg peaks, 179 Bragg reflection, 126 BVDF. See Basic vector directional filter
C Cameras, indoor, images acquired by, 50–51 Canberra distance, 204 Cascade-correlated approach, 6 Cd atoms, 179–180 Charge-coupled devices, 195 Circles, 298 Circular structuring elements, 283f City-block distance, 203 Closed-loop forms, 91 Clustering color space, 34–35 K-means, 36
307
308 COIL-100, datasets generated from, 52–54 images from, 53f, 54f Collision phenomenon, 7, 30 conditions for avoidance of, 31–33 Color, perception of, 188 Color image filtering, 199–244 applications of, 241–244 television image enhancement, 243–244 virtual restoration of artworks, 241–242 basics, 190–193 component-wise, 200f edge detection, 244–257 scalar operators, 245–250 vector operators, 250–253 image sharpening, 235–239 image zooming techniques, 239–241 introduction to, 188–190 noise-reduction techniques, 202–231 adaptive hybrid vector filters, 228–231 adaptive multichannel filters based on digital paths, 220–222 component-wise median filtering, 205– 207 data-adaptive filters, 218–220 order-statistic theory, 202–205 selection weighted vector filters, 215– 218 similarity based vector filters, 226–228 switching filtering schemes, 223–226 vector directional filters, 212–215 vector median filtering, 207–212 sliding, 201f vector, 201f Color space clustering, 34–35 Company logos, RNNs and, 8 Comparison and selection (CS) filter, 236 Complementary DNA microarray imaging, 255, 256f Complex holography, 133–134, 170–173 atomic images from, 135f Complexity computational, 93 model, 93 sample, 93 Component-wise filtering, 200f median, 205–207 Computation trees, 27 Computational complexity, 93 Connection costs, 222
INDEX Constructed solid geometry (CSG) representation, 267 Continuity, absolute, 69 Control vectors, 98 Convergence distribution-free, 72 ERM approach and, 85–87 uniform, of empirical means, 70, 77 Conversion electrons, from nuclei, 177–178 Cooling effect, sample, 149–150 Cost-to-go, 63, 94, 100, 101 Cryostream coolers, 150 CS filter. See Comparison and selection filter CSG representation. See Constructed solid geometry representation CuI clusters atomic images of, 126f, 131f complex hologram of, 134f theoretical holograms of, 125f CuI dimers, 136 holograms calculated from, 124f Curse of dimensionality, 64 Curve fittings, k-range for, 159t Cyclic graphs encoding and output networks for, 23f processing, 5–6, 22–30 recursive equivalent trees and, 28–30 recursive-equivalent transforms and, 25–28 Cylinder, 293
D DAG-LE. See Directed acyclic graphs with labeled edges Data formats, 2 Data processing, 138–150 Data-adaptive vector filters, 218–220 Debye–Waller factor, 149–150 Decision-theoretic approach, 2 Deformations, swept surfaces and, 275–280 Descendants, 11 Deterministic learning (DL), 63, 74–90 distribution-dependent case in, 87–90 distribution-free case, 75–80 dynamic programming and, 99–104 T-SO problems, 99–104 experimental results of, 104–114 unknown function approximation, 104– 107 mathematical framework for, 65–69
309
INDEX multistage optimization tests and, 107–114 inventory forecasting model, 108–109 water reservoir network model, 109–114 noisy cases in, 88–89 for optimal control problems, 90–94 Digital path approach (DPA) filter, 221–222 Digital paths, adaptive multichannel filters based on, 220–222 Dilation, 266 Dimensionality, 64 Dimers, calculated holograms of, 152f Directed acyclic graphs with labeled edges (DAG-LE), 5, 7, 11 RNNs and, 11–18 Directed graphs with labeled edges, 10 Directed graphs with unique labels (DUGs), 22 Directed labeled graphs, 10 Directed positional acyclic graphs (DPAGs), 4, 5, 7, 11, 44, 50 RNN processing of, 14–17 Directed unlabeled graphs, 9 Directional processing concept, on Maxwell triangle, 212f Directional-distance filters (DDF), 213 Discounted problems, 91 Discrepancy, 77 star, 77 Discretizations, RMS errors for, 105t, 106t, 107t Distance-based algorithms, 3 Distribution-dependent cases, 75 in deterministic learning, 87–90 Distribution-free cases, 74 in deterministic learning, 75–80 Distribution-free convergence, 72 DL. See Deterministic learning Dopants, 162–168 GaAs:Zn, 162–165 quasicrystal, 168–170 Si:Ge, 165–168 DP. See Dynamic programming DPA filter. See Digital path approach DPAF filter, 222 DPAGs. See Directed positional acyclic graphs DPAL filter, 222 DUGs. See Directed graphs with unique labels Dynamic programming (DP), 63
deterministic learning and, 99–104 T-SO problems, 99–104
E Edge detection, 35, 244–257 component-wise, 246f evaluation criteria, 253–257 objective evaluation approach, 254–255 subjective evaluation approach, 255– 257 scalar operators, 245–250 gradient, 248–249 zero-crossing-based operators, 249–250 vector dispersion, 252 vector operators, 250–253 Edge linking, 280–285 sweep morphological algorithm, 284f using sweep morphology, 281–285 Edge-weighting functions, 18 Electron scattering, 121 Elementary intervals, 85 Elliptic structuring elements, 281f, 282f Emitter-scatterer dimer, 129f Empirical risk, 67 minimization of, 68 Encoding networks, 13 for cyclic graphs, 23f Energy dispersive solid state detectors, 139 Energy resolution, of APD, 144 Energy spectra, of scattered X-rays, 145f Enhancement techniques, 188–190 ERM approach, 90 convergence rates of, 85–87 Erosion sweep, 272f traditional, 272f ESRF. See European Synchrotron Radiation Facility Estimation error, 73 Euclidean distance, 203 European Synchrotron Radiation Facility (ESRF), 121, 168 EXAFS. See Extended X-ray absorption fine structure Expected risks, 66 Experiment processing, 138–150 Experimental holograms, demonstration by, 156–159
Extended X-ray absorption fine structure (EXAFS), 151, 156, 158. See also XAFS
F Faces, appearance of, 51f Fe atoms atomic arrangements around, 160f holograms of, 161f Feature encoding, 2 Feedforward neural networks, 83–84 FePt film, 160–161 Filtered holographic signals, 158f Filtering. See Specific types Finite horizon cases, 98 Finite impulse response (FIR), 208–209 FIR. See Finite impulse response Fitted holographic signals, 155f Fittings curve, 159t Fourier, 159t Flat pattern recognition, 1–7 Fluorescence photons, 139 Formal language, sweep morphology and, 291–292 Forward pass, 20 Fourier fittings, r-range for, 159t Fourier transforms, 125, 127, 146, 153–154 of holograms, 148f inverse, theoretical proof of, 150–155 reconstructed intensity and, 154f of π XAFS, 176f Frontier states, 15 Full width at half maximum (FWHM), 140 Functions, 2 Fuzzy filters, 218 Fuzzy membership function, 219 Fuzzy theory, 35 FWHM. See Full width at half maximum
G G color band, 190–191 Ga emitters, 173 GaAs:Zn, 162–165 XAFS spectrum of, 172f Gabor, Dennis, 120–121 Gaussian noise, filtering of, 220f Ge crystals atomic images of, 167f
holograms, 166f holograms of, 147f, 148f Ge X-ray fluorescence, 145–146 General sweeps, 266 in computation of traditional morphology, 268–269 deformations and, 275–280 edge linking and, 281–285 formal language and, 291–292 geometric modeling and, 288–291 tolerance expression, 289–291 grammars, 297–300 three-dimensional attributes, 298–300 two-dimensional attributes, 298 image enhancement and, 278–280 mathematical morphology, 270–273 parsing algorithm, 300–303 representation scheme, 292–297 three-dimensional attributes, 293–297 two-dimensional attributes, 292–293 theoretical development of, 268–275 Generalized cylinders, 266, 267 Generalized errors, 21 Generalized similarity measure model, 204 Generalized vector directional filters (GVDF), 213 Geometric modeling, sweep mathematical morphology and, 288–291 tolerance expression, 289–291 Geometries, experimental, for normal and inverse modes, 138–140 Gradient operators, 248–249 Grammars, 297–300 Graph(s) cyclic, 22–30 encoding and output networks for, 23f recursive equivalent trees and, 28–30 recursive-equivalent transforms and, 25–28 directed labeled, 10 directed unlabeled, 9 directed, with labeled edges, 10 encoding networks associated with, 13f output networks associated with, 13f region adjacency, 24, 36–38 structures, 2 topology of, 10–11 Graph-based representation, 33–39 introduction to, 33
multiresolution trees, 36–38 region adjacency graphs, 36–38 segmentation of images in, 33–36 region-based approaches, 33 γ-ray holography, 176–178 Gray-scale imaging, 200 GVDF. See Generalized vector directional filters
H Halton sequences, 112 Helmholtz–Kirchhoff formulas, 125, 169 Histogram thresholding, 34 Holograms, 120 Au crystal, 142f calculated from CuI dimers, 124f calculated, of dimer, 152f experimental, 156–159 Fe, 161f filtered signals, 158 fitted signals, 155 Fourier transformation of, 148f Ge crystal, 147f, 148f horizontal polarization of, 137f in inverse mode, 123–124, 123f of ions in chemical environments, 177 in k space, 153 multiple energy, 166f one-dimensional, 157f oscillations, 138f reconstructions from, 149f single-energy, 173f theoretical, of CuI clusters, 125f vertical polarization of, 137f Zn, 163f Holography. See also XFH atomic resolution, 178 complex, 133–134 complex X-ray, 170–173 γ-ray, 176–178 history of, 120 neutron, 178–180 Horizontal polarization, 137f Human luminance frequency response, 190 HVF. See Hybrid vector filters Hybrid vector filters (HVF), 213, 214 Hydrogen atom, reconstruction of planes from, 179f Hyperbolic tangents, 104
I Image analysis graph-based representation, 33–39 indoor camera, 50–51 RNNs in, 8 Image enhancement, general sweeps and, 278–280 Image filtering. See Color image filtering Image noise, 193–199 Gaussian, 220f impulsive, 222f mixed, 232f natural, 193–194 noise modeling, 194–199 sensor noise, 195–197 transmission noise, 197–199 real color, 194f simulated color, 196f Image orientation, 40 Image processing chain, 189f Image sharpening, 235–239 Image zooming techniques, 239–241 Imaging conditions, 40 Impulsive noise, filtering of, 223f Indegrees, 10 Industrial parts, 286f edges of, 285f Infinite-horizon stochastic optimization problems, 91, 96–98 approximate policy iteration and, 97–98 Inpainting techniques, 234–235, 236f Interatomic distances, 159t Inventory forecasting model, 108–109, 113t bounds for, 111t Inverse Fourier analysis, theoretical proof of, 150–155 Inverse mode, holograms in, 123–124, 123f
K Kernel trick, 4 KL. See Kossel lines K-means clustering, 36 Kohonen map, 6 Koksma–Hlawka inequality, 79 Kossel lines (KL), 128–129 k-range, for curve fittings, 159t
L Laboratory XFH apparatus, 140–143 illustration of, 141f Learnable problems, 68 probably approximately correct, 70 Learning active, 69 passive, 69 Learning algorithms, 68 Learning environment setup MRTs in, 46–47 RAGs in, 42–46 Learning rates, 19 Learning theory, 62 Least mean absolutes (LMA), 215 Leaves, 10 Lemmas, 82, 101, 102 Levenberg–Marquardt algorithm, 105 LiF crystals, 166 Linear extrusion, 265, 267 Linear functions, Pollard dimension of one-dimensional, 72f LMA. See Least mean absolutes Localized faces, 49f Loss functions, 66 Low-discrepancy sequences, 85, 105 RMS errors for, 106t Lower–upper–middle (LUM) sharpeners, 237 LQ hypotheses, 92 LUM sharpeners. See Lower–upper–middle sharpeners
M Magnetite, 177 iron arrangements in, 178f Marginal ordering, 202 Markov models, 3 Mathematical morphology, 266 general sweep, 270–273 Maximum likelihood estimates (MLE), 205 Maxwell triangle directional processing concept on, 212f RGB color cube with, 192f Mean square error (MSE) criterion, 95, 211 Median absolute deviation, 229 Median filters (MF), 205 MF. See Median filters Minkowski metric family, 203–204
Mixed noise, 232 MLE. See Maximum likelihood estimates Mn sites, in quasicrystal, 169f Mobile robot, shortest path planning for, 286–288 Model complexity, 93 Model selection, 62 Monochromators, 175 Monte Carlo methods, 87, 92 randomized, 87 Morphology, 266 MOS problem. See Multistage stochastic optimization Mössbauer effect, 124, 133, 177–178 MRT. See Multiresolution trees MSE criterion. See Mean square error criterion Multilayer perceptrons, transition functions realized with, 16f Multiple energy method, 130 atomic images constructed using, 131f Multiply connections, 269 Multiresolution trees (MRT), 38–39 generation of, 39f in learning environment setup, 46–47 targets associated with nodes of, 47f Multistage stochastic optimization (MOS) problem, 63 deterministic, 92
N Natural image noise, 193–194 NCD. See Normalized color difference Near field effect, 136–138 Nearest neighbor vector range (NNVR), 253 Negative filling, 269 Neural networks, 1, 3 recursive, 4 supervised, 4 Neurodynamic programming, 93 Neutron holography, 178–180 Niederreiter sequences, 105, 112 NNVR. See Nearest neighbor vector range Noise. See Image noise Noise margins, 197f Noise modeling, 194–199 sensor noise, 195–197 transmission noise, 197–199 Noise-reduction techniques, 202–231
adaptive hybrid vector filters, 228–231 adaptive multichannel filters based on digital paths, 220–222 component-wise median filtering, 205–207 data-adaptive filters, 218–220 order-statistic theory, 202–205 performance evaluation of, 231–235 inpainting techniques, 234–235 objective, 231–233 subjective, 233–234 selection weighted vector filters, 215–218 similarity based vector filters, 226–228 switching filtering schemes, 223–226 vector directional filters, 212–215 vector median filtering, 207–212 Noisy cases, 75 in deterministic learning, 88–89 Nonlinear smoothing filter (NSF), 224 Nonlinear vector processing, 189 Normalized color difference (NCD), 233 NSF. See Nonlinear smoothing filter Nuclei, conversion electrons from, 177–178 Null pointers, 11
O Object deformation, 40 Object detection, 39–54 methods, 39–42 appearance-based, 41 challenges in, 40 feature invariant, 40–41 knowledge-based, 40–41 template matching, 41 RNNs in, 42–54 Observation noise, 74 Occlusions, 40 One-dimensional holograms, 157f Open curve paths, 271f Optical reciprocity, 123 Optimal control problems, deterministic learning for, 90–94 Optimal management problems, 108 Optimization problems discounted infinite-horizon stochastic, 91 T-stage, 91 Optimization tests, multistage, 107–114 inventory forecasting model, 108–109 water reservoir network model, 109–114
Ordering marginal, 202 reduced, 202 Order-statistic theory, 202–205 Outdegrees, 10 Output networks, 13f, 14, 31f for cyclic graphs, 23f Overfitting, 73
P PAC. See Probably approximately correct Parallelograms, 298 Parallelepipeds, 293–297, 299–300 Parsing algorithm, 300–303 Passive learning, 69 Path planning, for mobile robot, 286–288 Paths, 10–11 Pattern mode, 20 Pattern recognition flat, 1–7 structural, 1–7 Pb atoms, 179–180 Pb crystals, reconstructions of planes for, 151f Performance evaluation, 189 of noise reduction techniques, 231–235 inpainting techniques, 234–235 objective evaluation, 231–233 subjective evaluation, 233–234 Performance issues, ADP, 98–99 Permutation weighted medians (PWM), high pass, 238 Photons, fluorescence, 139 Planck constant, 150 Pointer matrices, 15 Polarization effect horizontal, 137f of incident X-ray, 134–136 vertical, 137f Policy evaluation, 97 Pollard dimension, 70–71 of one-dimensional linear functions, 72f Pose, 40 Probably approximately correct (PAC) learnability, 70 Protein topologies, RNNs in, 8 Pruning, 284 P-shattering, 71 Pt foils, π XAFS of, 175f PWM. See Permutation weighted medians
Q QSAR. See Quantitative structure-activity relationships Quadratic rates, 74 Quantitative structure-activity relationships (QSAR), 8 Quasicrystal, 168–170 reconstructed real-space image around Mn sites in, 169f Quasirandom integration methods, 85
R Radial basis functions, 84–87 RAG. See Region adjacency graph Random variables, 89 Random walk (RW) techniques, 3–4 Randomized quasi-Monte Carlo methods, 87 γ-ray holography, 176–178 Real color image noise, 194f Reconstructed intensity, 153, 156–157 Fourier transforms and, 154f of single first neighbor atoms, 168f Rectangles, 298 Recursive equivalent trees cyclic graphs and, 28–30 RAG transformation to, 45f Recursive neural networks (RNN), 4 BPTS and, 18–22 cyclic graph processing with, 22–30 DAG-LE processing and, 11–18 DPAG processing and, 14–17 graphs and, 9–11 limitations of, 30–33 in object detection, 42–54 detecting objects, 47–54 learning environment setup in, 42–46 properties and applications of, 7–9 in bioinformatics, 8 in company logo classification, 8 in image analysis, 8 Recursive-equivalent transforms, 25–28 Reduced ordering, 202 Region adjacency graph (RAG), 24, 36–38 extracted, 42f features stored in, 37f in learning environment setup, 42–46 transformation to recursive-equivalent trees, 45f Region-based approaches, 35
Regression estimation problem, 66 Regression functions, 66 Regularization factors, 207 Reoptimization, 95 Reservoir network, 110f Restoration, of artworks, 241–242 RGB color images, 190 with Maxwell triangle, 192f Risk functionals, 75–76 RMS. See Root of mean square RNN. See Recursive neural networks Root of mean square (RMS) errors for discretizations, 105t, 106t, 107t low-discrepancy sequences, 106t, 108t for random sequences, 106t, 108t Rotational sweep, 265, 267 Round bottom tools, 297f r-range, for Fourier fittings, 159t Rule learners, 3 RW. See Random walk
S Sample complexity, 73, 93 Sample cooling effect, 149–150 Samples, 95 orientation of, 136f Scalar operators, 245–250 gradient, 248–249 zero-crossing-based operators, 249–250 Scalar techniques, 239 Scanning probe microscopy (SPM), 168 Scattering, 169–170 electron, 121 Thomson, 135 X-ray, 121 Segmentation, 33–36 color space clustering, 34 edge detection, 35 fuzzy theory, 35 histogram thresholding, 34 neural network approaches to, 35 physics approaches to, 35 region-based, 35 Selection weighted vector filters (SWVF), 209–210, 215–218 Self-organizing maps (SOM), 6 Sensor noise, 195–197 Sequences, 2 Sets, 2
Si, holograms, 166f Si clusters, 167 Si:Ge, 165–168 Similarity based vector filters, 226–228 Similarity functions, 221, 227 Single-energy reconstruction, 173 Skeletons, 10–11 SL. See Statistical learning Sliding filtering, 201f Sobol sequences, 112 Solid state detector (SSD), 141, 144 energy dispersive, 139 SOM. See Self-organizing maps Sources, 10 Space-varying, 279 Spatial interpolation, 240f Spherical median (SM), 212f SPM. See Scanning probe microscopy SPring-8, 165 Spurious regions, 43 SR. See Synchrotron radiation SSD. See Solid state detector Stack filter design, 206 Star discrepancy, 77 State space, 11 State transition functions, 12 State variables, 11 Statistical learning (SL), 63, 69–74 Step edge, 278 Structural methods, 2 Structural pattern recognition, 1–7 Structure-adaptive hybrid vector filter (SAHVF), 228–231 Structuring element assignment, 279f circular, 283f elliptic, 281f Subjective image evaluation, 233–234 guidelines for, 234t Superexponential growth, 73, 86 Supersources, 11, 28f Supervised learning paradigm, 6 Supervised neural networks, 4 Support vector machines (SVM), 1 SVF. See Switching vector filter SVM. See Support vector machines Sweep, 265 general, 266 rotational, 265, 267 Sweep dilation, 281, 294f, 299–300 Sweep erosion, 272f
Sweep morphological dilation, 270 Sweep morphological erosion, 271 Sweep morphological operations, properties of, 273–275 Sweep surface modeling, 291 Switching filtering schemes, 223–225 based on fixed threshold, 224f based on fully adaptive control, 224f Switching vector filter (SVF), 224, 226 SWVF. See Selection weighted vector filters Symbolic output-equivalence, 31 Symmetry, XSWs and, 161 Synchrotron radiation (SR), fast X-ray fluorescence detection systems at, 143–149 Syntactical methods, 2
T Tangents, hyperbolic, 104 (t, d)-sequences, 85 Television image enhancement, 243–244 TEM. See Transmission electron microscopy Temporal sequences, 17f Ternary trees, 32 Thinning, 281–282 Thomson scattering factors, 135 Three-dimensional attributes, 293–297 Tolerance expression, 289–291 zones, 290f Traditional morphology, computation of, 268–269 Training phase, 62 Training set, 62 Transition functions, multilayer perceptrons and, 16f Transmission electron microscopy (TEM), 168 Transmission noise, 197–199 Tree search method, 280 Tree structures, 2 Triangles, 298 Trichromatic theory, 188 Tristimulus theory, 192 T-SO. See T-stage stochastic optimization T-stage stochastic optimization (T-SO) problems, 91 ADP algorithms and, 94–96 deterministic learning for dynamic programming, 99–102
TV video sequences, 49–50 Twin images concept of, 129f removal of, 129–134 complex holography, 133–134 multiple energy method, 130 two energy method, 130–133 Two energy method, 130–133 atomic images constructed using, 132f Two-dimensional attributes, decomposition of, 293f
U Ultrathin film, 159–161 Undirected structures, 6 Unfolded parameters, 20, 103 Unfoldings, 28f Uniform convergence of empirical means, 70, 77 Uniform distribution, 77 Uniform probability (URS), 105 Universal approximators, 62 Unknown functions, approximation of, 104–107 URS. See Uniform probability U-shape blocks, 296
V Value function, 63, 94 Variations, bounded, 80–84 VC dimension, 70 VDED. See Vector dispersion edge detector VDF. See Vector directional filters Vector directional filters (VDF), 212–215 basic, 212 directional-distance, 213 generalized, 213 hybrid, 213 weighted, 214 Vector dispersion edge detector (VDED), 252 Vector median filtering (VMF), 201f, 207–212 selection weighted, 209–210 weighted, 209 Vector operators, 250–253 Vector processing, nonlinear, 189 Vector range (VR) detectors, 250–251 Vector rational filters (VRF), 210–211 Vertical polarization, 137f VMF. See Vector median filtering
VR. See Vector range VRF. See Vector rational filters
W Water levels, 110t Water reservoir network model, 109–114, 113t bounds for, 112t Weighted filters, 225 Weighted medians (WM), 206 permutation, 238 Weighted vector directional filters (WVDF), 214–215 Weighted vector median filters (WVMF), 209 WVDF. See Weighted vector directional filters WVMF. See Weighted vector median filters
X XAFS. See X-ray absorption fine structure XFH. See X-ray fluorescence holography X-ray absorption fine structure (XAFS), 172. See also EXAFS π, 174–176 Fourier transforms of, 176f Pt foils of, 175f at As K edge, 172f X-ray fluorescence holography (XFH) applications of, 159–173 complex X-ray holography, 170–173 dopants, 162–168 ultrathin film, 159–161 experiment and data processing, 138–150 experimental geometries, 138–140 experimental holograms, 156–159 interatomic distances estimated by, 159t inverse Fourier analysis, 150–155 laboratory XFH apparatus, 140–143 for obtaining atomic images, 145–149 sample cooling effect, 149–150 at synchrotron radiation, 143–149 experimental setup for, 139f introduction to, 120–122 inverse method, 122f Kossel lines and, 128–129 near field effect, 136–138 normal method, 122f outlook, 180–181 polarization effect of incident X-ray, 134–136
related methods, 174–180 γ-ray holography, 176–178 neutron holography, 178–180 π XAFS, 174–176 simulation using realistic models, 124–127 theory using simple models, 122–124 twin image removal and, 129–134 complex holography, 133–134 multiple energy method, 130 two energy method, 130–133 X-ray detector performance, 145f XSWs and, 128–129
X-ray scattering, 121 X-ray standing wave lines (XSW), 128–129 four-fold symmetry and, 161 XSW. See X-ray standing wave lines
Z Zeeman sextets, 177 Zero-crossing-based operators, 249–250 Zn atom, holographic reconstruction of environment around, 164f Zn holograms, 163f