DATA HANDLING IN SCIENCE AND TECHNOLOGY- VOLUME 22
Wavelets in Chemistry
DATA HANDLING IN SCIENCE AND TECHNOLOGY
Advisory Editors: B.G.M. Vandeginste and S.C. Rutan
Other volumes in this series:

Volume 1  Microprocessor Programming and Applications for Scientists and Engineers, by R.R. Smardzewski
Volume 2  Chemometrics: A Textbook, by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3  Experimental Design: A Chemometric Approach, by S.N. Deming and S.L. Morgan
Volume 4  Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology, by P. Valkó and S. Vajda
Volume 5  PCs for Chemists, edited by J. Zupan
Volume 6  Scientific Computing and Automation (Europe) 1990, Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June 1990, Maastricht, The Netherlands, edited by E.J. Karjalainen
Volume 7  Receptor Modeling for Air Quality Management, edited by P.K. Hopke
Volume 8  Design and Optimization in Organic Synthesis, by R. Carlson
Volume 9  Multivariate Pattern Recognition in Chemometrics, illustrated by case studies, edited by R.G. Brereton
Volume 10  Sampling of Heterogeneous and Dynamic Material Systems: Theories of Heterogeneity, Sampling and Homogenizing, by P.M. Gy
Volume 11  Experimental Design: A Chemometric Approach (Second, Revised and Expanded Edition), by S.N. Deming and S.L. Morgan
Volume 12  Methods for Experimental Design: Principles and Applications for Physicists and Chemists, by J.L. Goupy
Volume 13  Intelligent Software for Chemical Analysis, edited by L.M.C. Buydens and P.J. Schoenmakers
Volume 14  The Data Analysis Handbook, by I.E. Frank and R. Todeschini
Volume 15  Adaption of Simulated Annealing to Chemical Optimization Problems, edited by J. Kalivas
Volume 16  Multivariate Analysis of Data in Sensory Science, edited by T. Naes and E. Risvik
Volume 17  Data Analysis for Hyphenated Techniques, by E.J. Karjalainen and U.P. Karjalainen
Volume 18  Signal Treatment and Signal Analysis in NMR, edited by D.N. Rutledge
Volume 19  Robustness of Analytical Chemical Methods and Pharmaceutical Technological Products, edited by M.W.B. Hendriks, J.H. de Boer and A.K. Smilde
Volume 20A  Handbook of Chemometrics and Qualimetrics: Part A, by D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. de Jong, P.J. Lewi and J. Smeyers-Verbeke
Volume 20B  Handbook of Chemometrics and Qualimetrics: Part B, by B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. de Jong, P.J. Lewi and J. Smeyers-Verbeke
Volume 21  Data Analysis and Signal Processing in Chromatography, by A. Felinger
Volume 22  Wavelets in Chemistry, edited by B. Walczak
DATA HANDLING IN SCIENCE AND TECHNOLOGY - VOLUME 22
Advisory Editors: B.G.M. Vandeginste and S.C. Rutan

Wavelets in Chemistry

edited by
Beata Walczak
Institute of Chemistry, Silesian University, 9 Szkolna Street, 40-006 Katowice, Poland

2000
ELSEVIER
Amsterdam - Lausanne - New York - Oxford - Shannon - Singapore - Tokyo
ELSEVIER SCIENCE PUBLISHERS B.V.
Sara Burgerhartstraat 25
P.O. Box 211, 1000 AE Amsterdam, The Netherlands

© 2000 Elsevier Science B.V. All rights reserved.

This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use:

Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permission may be sought directly from Elsevier Science Rights & Permissions Department, PO Box 800, Oxford OX5 1DX, UK; phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: [email protected]. You may also contact Rights & Permissions directly through Elsevier's home page (http://www.elsevier.nl), selecting first 'Customer Support', then 'General Information', then 'Permissions Query Form'. In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (978) 7508400, fax: (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 171 631 5555; fax: (+44) 171 631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Science Rights & Permissions Department, at the mail, fax and e-mail addresses noted above.

Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

First edition 2000

Library of Congress Cataloging in Publication Data
A catalog record from the Library of Congress has been applied for.

ISBN: 0 444 50111 8

The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).

Printed in The Netherlands.
PREFACE

Wavelets seem to be the most efficient tool for signal denoising and compression. They can find countless applications in all fields of chemistry where instrumental signals are the source of information about the chemical systems or phenomena under study, and in all cases where these signals have to be archived. The quality of the instrumental signals determines the quality of the answers to the basic analytical questions: how many components are there in the studied system, what are these components, and what are their concentrations? Efficient compression of signal sets can drastically speed up further processing, such as data visualization, modelling (calibration and pattern recognition), and library searching. Exploration of the possible applications of wavelets in analytical chemistry has only just started, and a book about wavelet theory and the already existing applications can significantly speed up this process.

Presently, wavelets are a hot issue in many different fields of science and technology. There are already many books about wavelets, but almost all of them are written by mathematicians or by people involved in information science. Because wavelet theory is quite complicated and is presented in several different mathematical languages, these books are almost unreadable for chemists. The lack of texts comprehensible to chemists seems to be a barrier, and can be considered a reason why wavelets have entered chemistry so slowly and so shyly.

The book is written in a tutorial-like manner. We intended to introduce wavelets gently to an audience of chemists. Although the particular chapters are written by independent authors, we intended to cover all important aspects of wavelet theory and to present wavelet applications in chemistry and in chemical engineering. Basic concepts of wavelet theory, together with all important aspects of wavelet transforms, are presented in the first part of the book. This part is extensively illustrated with figures and simulated examples. The second part of the book consists of examples of wavelet applications in chemistry and in chemical engineering. Written by chemists for chemists, this book can be of great help to all those involved in signal and data processing. All invited authors are widely recognized experts in the field of chemometrics, with unquestionable competence in the theory and practice of wavelets.

The book is addressed to: analytical chemists dealing with any type of spectral data (main interest: signal-to-noise enhancement and/or signal compression); organic chemists involved in combinatorial chemistry (main interest: compression of instrumental signals); chemists involved in chemometrics (main interest: compression of ill-posed data sets for further preprocessing, and data denoising); those working in artificial intelligence (main interest: compression of spectral libraries and speeding up of library searching); theoretical chemists (main interest: wavelets as a new family of basis functions with special properties); and engineers involved in process control (main interest: analysis of trends). Readers are expected to know the basic terms of linear algebra and to be familiar with matrix notation.

As a team of contributors to this volume, we are well aware of certain repetitions occurring on its pages, which are hardly avoidable in joint enterprises of this sort. There are, however, certain advantages to this situation as well, the main one being the enriching demonstration of selected wavelet issues from different perspectives.

Finally, may I allow myself to express my profound gratitude to all the colleagues whose experience, endurance and willingness to cooperate materialized in this volume, which hopefully will become a useful and up-to-date source textbook in the field of wavelets applied to chemistry.

Beata Walczak
Katowice, November 1999
CONTENTS

PREFACE

LIST OF CONTRIBUTORS

PART I: THEORY

CHAPTER 1  FINDING FREQUENCIES IN SIGNALS: THE FOURIER TRANSFORM (B. van den Bogaert)
1 Introduction
2 The Fourier integral
3 Convolution
4 Convolution and discrete Fourier
5 Polynomial approximation and basis transformation
6 The Fourier basis
7 Fourier transform: Numerical examples
8 Fourier and signal processing
9 Apodisation

CHAPTER 2  WHEN FREQUENCIES CHANGE IN TIME: TOWARDS THE WAVELET TRANSFORM (B. van den Bogaert)
1 Introduction
2 Short-time Fourier transform
3 Towards wavelets
4 The wavelet packet transform

CHAPTER 3  FUNDAMENTALS OF WAVELET TRANSFORMS (Y. Mallet, O. de Vel and D. Coomans)
1 Introduction
2 Continuous wavelet transform
3 Inverse wavelet transform
4 Discrete wavelet transform
5 Multiresolution analysis
6 Fast wavelet transform
7 Wavelet families and their properties
8 Biorthogonal and semiorthogonal wavelet bases

CHAPTER 4  THE DISCRETE WAVELET TRANSFORM IN PRACTICE (O. de Vel, Y. Mallet and D. Coomans)
1 Introduction
2 Introduction to matrix theory
2.1 Patterned matrices
2.2 Matrix operations
2.3 Some matrix properties
3 Matrix representation of the discrete wavelet transform
3.1 The discrete wavelet transform for infinite signals
3.2 Discrete wavelet transform for signals with finite length

CHAPTER 5  MULTISCALE METHODS FOR DENOISING AND COMPRESSION (M.N. Nounou and B.R. Bakshi)
1 Introduction
2 Multiscale representation of signals using wavelets
3 Characterization of noise
3.1 Autocorrelation function
3.2 Power spectrum
3.3 Wavelet spectrum
4 Denoising and compression
4.1 Denoising and compression of data with Gaussian errors
4.2 Filtering of data with non-Gaussian errors
5 On-line multiscale filtering
5.1 On-line multiscale filtering of data with Gaussian errors
5.2 OLMS filtering of data with non-Gaussian errors
5.3 Hints for tuning the filter parameters in multiscale filtering and compression
6 Conclusions

CHAPTER 6  WAVELET PACKET TRANSFORMS AND BEST BASIS ALGORITHMS (Y. Mallet, D. Coomans and O. de Vel)
1 Introduction
2 Wavelet packet transforms
2.1 What do wavelet packet functions look like?
3 Best basis algorithm

CHAPTER 7  JOINT BASIS AND JOINT BEST-BASIS FOR DATA SETS (B. Walczak and D.L. Massart)
1 Introduction
2 Discrete wavelet transform and joint basis
3 Wavelet packet transform and joint best-basis

CHAPTER 8  THE ADAPTIVE WAVELET ALGORITHM FOR DESIGNING TASK SPECIFIC WAVELETS (Y. Mallet, D. Coomans and O. de Vel)
1 Introduction
2 Higher multiplicity wavelets
3 m-Band discrete wavelet transform of discrete data
4 Filter coefficient conditions
5 Factorization of filter coefficient matrices
6 Adaptive wavelet algorithm
7 Criterion functions
8 Introductory examples of the adaptive wavelet algorithm
8.1 Simulated spectra
8.2 Mineral spectra
9 Key issues in the implementation of the AWA

PART II: APPLICATIONS

CHAPTER 9  APPLICATION OF WAVELET TRANSFORM IN PROCESSING CHROMATOGRAPHIC DATA (F.-t. Chau and A.K.-m. Leung)
1 Introduction
2 Applications of wavelet transform in chromatographic studies
2.1 Baseline drift correction
2.2 Signal enhancement and noise suppression
2.3 Peak detection and resolution enhancement
2.4 Pattern recognition with combination of wavelet transform and artificial neural networks
3 Conclusion

CHAPTER 10  APPLICATION OF WAVELET TRANSFORM IN ELECTROCHEMICAL STUDIES (F.-t. Chau and A.K.-m. Leung)
1 Introduction
2 Application of wavelet transform in electrochemical studies
2.1 B-spline wavelet transform in voltammetry
2.2 Other wavelet transform applications in voltammetry
3 Conclusion

CHAPTER 11  APPLICATIONS OF WAVELET TRANSFORM IN SPECTROSCOPIC STUDIES (F.-t. Chau and A.K.-m. Leung)
1 Introduction
2 Applications of wavelet transform in infrared spectroscopy
2.1 Novel algorithms for wavelet computation in IR spectroscopy
2.2 Spectral compression with wavelet neural network
2.3 Standardization of IR spectra with wavelet transform
3 Applications of wavelet transform in ultraviolet-visible spectroscopy
3.1 Pattern recognition with wavelet neural network
3.2 Compression of spectrum with wavelet transform
3.3 Denoising of spectra with wavelet transform
4 Application of wavelet transform in mass spectrometry
5 Application of wavelet transform in nuclear magnetic resonance spectroscopy
6 Application of wavelet transform in photoacoustic spectroscopy
7 Conclusion

CHAPTER 12  APPLICATIONS OF WAVELET ANALYSIS TO PHYSICAL CHEMISTRY (H. Teitelbaum)
1 Introduction
2 Quantum mechanics
2.1 Molecular structure
2.2 Spectroscopy
3 Time-series
3.1 Chemical dynamics
3.2 Chemical kinetics
3.3 Fractal structures
4 Conclusion

CHAPTER 13  WAVELET BASES FOR IR LIBRARY COMPRESSION, SEARCHING AND RECONSTRUCTION (B. Walczak and J.P. Radomski)
1 Introduction
2 Theory
2.1 Wavelet transforms
2.2 Compression of individual signals
2.3 Data set (library) compression
2.4 Compression ratio
2.5 Storage requirements
2.6 Matching criteria
2.7 The data
3 Results and discussion
3.1 Principal component analysis applied to IR data compression
3.2 Individual compression of IR spectra in wavelet domain
3.3 Joint basis and joint best-basis approaches to data set compression
3.4 Matching performance
4 Conclusions

CHAPTER 14  APPLICATION OF THE DISCRETE WAVELET TRANSFORMATION FOR ONLINE DETECTION OF TRANSITIONS IN TIME SERIES (M. Marth)
1 Introduction
2 Early transition detection
3 Application of the DWT
4 Results and conclusions

CHAPTER 15  CALIBRATION IN WAVELET DOMAIN (B. Walczak and D.L. Massart)
1 Introduction
2 Feature selection coupled with MLR
2.1 Stepwise selection
2.2 Global selection procedures
3 Feature selection with latent variable methods
3.1 UVE-PLS
3.2 Feature selection in wavelet domain
4 Illustrative example
5 Conclusions

CHAPTER 16  WAVELETS IN PARSIMONIOUS FUNCTIONAL DATA ANALYSIS MODELS (B.K. Alsberg)
1 Introduction
2 Functional data analysis
2.1 From vectors to functions
2.2 Spline basis
2.3 Non-linear bases
2.4 Wavelet bases
3 Methods for creating parsimonious models
3.1 The simple multiscale approach
3.2 The optimal scale combination (OSC) method
3.3 The masking method
3.4 Genetic algorithms
3.5 The dummy variables approach
3.6 Mutual information
3.7 Selecting large w coefficients
4 Regression and classification
4.1 Regression
4.2 Classification
5 Example applications
5.1 Regression
5.2 Classification
5.3 Conclusion

CHAPTER 17  MULTISCALE STATISTICAL PROCESS CONTROL AND MODEL-BASED DENOISING (B.R. Bakshi)
1 Introduction
2 Wavelets
3 General methodology for multiscale analysis, modeling, and optimization
4 Multiscale statistical process control
4.1 MSSPC methodology
4.2 MSSPC optimization
5 Multiscale denoising with linear steady-state models
5.1 Single-scale model-based denoising
5.2 Multiscale Bayesian data rectification
5.3 Performance of multiscale model-based denoising
6 Conclusions

CHAPTER 18  APPLICATION OF ADAPTIVE WAVELETS IN CLASSIFICATION AND REGRESSION (Y. Mallet, D. Coomans and O. de Vel)
1 Introduction
2 Adaptive wavelets and classification analysis
2.1 Review of relevant classification methodologies
2.2 Classification assessment criteria
2.3 Classification criterion functions for the adaptive wavelet algorithm
2.4 Explanation of the data sets
2.5 Results
3 Adaptive wavelets and regression analysis
3.1 Review of relevant regression methodologies
3.2 Regression assessment criteria
3.3 Regression criterion functions for the adaptive wavelet algorithm
3.4 Explanation of the data sets
3.5 Results

CHAPTER 19  WAVELET-BASED IMAGE COMPRESSION (O. de Vel, D. Coomans and Y. Mallet)
1 Introduction
2 Fundamentals of image compression
2.1 Performance measures for image compression
3 Image decorrelation using transform coding
3.1 The Karhunen-Loève transform (KLT)
3.2 The discrete cosine transform (DCT)
3.3 Wavelet transform coding
4 Integrated task-specific wavelets and best-basis search for image compression

CHAPTER 20  WAVELET ANALYSIS AND PROCESSING OF 2-D AND 3-D ANALYTICAL IMAGES (S.G. Nikolov, M. Wolkenstein and H. Hutter)
1 Introduction
2 The 2-D and 3-D wavelet transform
3 Mathematical measures
4 Image acquisition
4.1 SIMS images
4.2 EPMA images
5 Wavelet de-noising of 2-D and 3-D SIMS images
5.1 De-noising via thresholding
5.2 Gaussian and Poisson distributions
5.3 Wavelet de-noising of 2-D SIMS images
5.4 Wavelet de-noising of 3-D SIMS images
6 Improvement of image classification by means of de-noising
6.1 Classification
6.2 Results
7 Compression of 2-D and 3-D analytical images
7.1 Basics
7.2 Quantisation
7.3 Entropy coding
7.4 Results
8 Feature extraction from analytical images
8.1 Edge detection
8.2 Wavelets for texture analysis
9 Registration and fusion of analytical images
9.1 Image registration
9.2 Image fusion
10 Computation and wavelets
11 Conclusions

INDEX
LIST OF CONTRIBUTORS

B.K. Alsberg
Department of Computer Science, University of Wales, Aberystwyth, Ceredigion SY23 3DB, UK
e-mail: [email protected]

Bhavik R. Bakshi
Department of Chemical Engineering, The Ohio State University, 140 West 19th Avenue, Columbus, OH 43210, USA
e-mail: [email protected]

Bas van den Bogaert
Solvay SA, Rue de Ransbeek 310, DCRT/ACE, Industrial IT and Statistics, 1120 Brussels, Belgium
e-mail: [email protected]

Foo-tim Chau
Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, People's Republic of China
e-mail: [email protected]

Danny Coomans
Statistics and Intelligent Data Analysis Group, School of Computer Science, Mathematics and Physics, James Cook University, Townsville, Queensland 4811, Australia
e-mail: [email protected]

H. Hutter
Research Group on Physical Analysis and Computer Based Analytical Chemistry, Institute of Analytical Chemistry, Vienna University of Technology, Getreidemarkt 9/151, Vienna 1060, Austria
e-mail: [email protected]

Alexander Kai-man Leung
Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, People's Republic of China
e-mail: [email protected]

Yvette Mallet
Statistics and Intelligent Data Analysis Group, School of Computer Science, Mathematics and Physics, James Cook University, Townsville, Queensland 4811, Australia
e-mail: [email protected]

Michael Marth
Freiburg Materials Research Center FMF, University of Freiburg, Germany

D.L. Massart
Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium
e-mail: [email protected]

Stavri G. Nikolov
Image Communications Group, Centre for Communications Research, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol BS8 1UB, UK
e-mail: [email protected]

Mohamed N. Nounou
Department of Chemical Engineering, The Ohio State University, 140 West 19th Avenue, Columbus, OH 43210, USA

Jan P. Radomski
Interdisciplinary Center for Mathematical and Computational Modeling, Warsaw University, Pawinskiego 5A, 02-106 Warsaw, Poland
e-mail: [email protected]

Heshel Teitelbaum
Department of Chemistry, University of Ottawa, Ottawa, Ontario, Canada K1N 6N5
e-mail: [email protected]

Olivier de Vel
Statistics and Intelligent Data Analysis Group, School of Computer Science, Mathematics and Physics, James Cook University, Townsville, Queensland 4811, Australia
e-mail: [email protected]

Beata Walczak
Institute of Chemistry, Silesian University, 9 Szkolna Street, 40-006 Katowice, Poland
e-mail: [email protected]

M. Wolkenstein
Research Group on Physical Analysis and Computer Based Analytical Chemistry, Institute of Analytical Chemistry, Vienna University of Technology, Getreidemarkt 9/151, Vienna 1060, Austria
e-mail: [email protected]
Part I
Theory
CHAPTER 1

Finding Frequencies in Signals: The Fourier Transform

Bas van den Bogaert
Solvay SA, DCRT/ACE, Industrial IT and Statistics, Rue de Ransbeek 310, 1120 Brussels, Belgium
1 Introduction

This is a chapter on the Fourier transform. One may wonder: why speak of Fourier in a book on wavelets? To be honest, there are plenty of people who learn to use and appreciate wavelets without knowing about Fourier. You might be one of them. Yet, all those involved in the development of wavelets certainly knew Fourier, and as a consequence, wavelet literature is full of Fourier jargon. So, whereas you may not need to know Fourier to apply wavelets, you probably will need to know it in order to appreciate the literature. The goal of this chapter is to introduce Fourier in a soft way. Fourier has a rather bad reputation amongst chemists: the reputation of something highly mathematical and abstract. We will not argue with that. Part of Fourier is indeed inaccessible to the less mathematically inclined. Another part, however, is easy to grasp and apply. The discrete Fourier transform in particular, as one might use it in digital signal processing, has a simple basic structure and comprehensible consequences. It is also that part of Fourier that links well to the wavelet transform; the discrete wavelet transform, that is, the kind you are most likely to be using in the future. What makes these discrete transforms easy to understand is that they have a geometrical interpretation. In terms of linear algebra: they are basis transformations. Nevertheless, we will take a glance at pure and undiluted Fourier: the transform in integral form. Not that we need it, but it would be odd not to mention it. Moreover, useful notions from the Fourier integrals can be effectively used, if only loosely, for discrete Fourier.
2 The Fourier integral

Let us look the beast in the eye:

$$F(\omega) = \int_{-\infty}^{+\infty} f(t)\, e^{-i\omega t}\, dt \qquad (1)$$

$$f(t) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} F(\omega)\, e^{+i\omega t}\, d\omega \qquad (2)$$

with i² = −1. We have some function f of t, where t is often associated with time, so that we can think of f as a signal. Eq. (1) transforms f into F, where F is no longer a function of t, but of ω. When we associate t with time, we may think of ω as frequency, as the exponential may be written as:

$$e^{-i\omega t} = \cos(\omega t) - i\sin(\omega t) \qquad (3)$$
Eq. (2) does the same thing as Eq. (1), but in the other direction. It takes F of ω and transforms it into f of t. We see that in order to go in the other direction, the sign of the exponent has been swapped from minus to plus. Furthermore, there is a multiplication factor outside the integral. The factor is needed to get back to the same size if we were to go from f to F and back again to f. We could also have defined a set of Fourier integrals putting that factor in the first equation, or dividing it over both. Eqs (1) and (2) have everything to scare off chemists. There are integrals, complex numbers, and ω is said to represent frequency, which leaves us pondering about the meaning of negative values for it. This is pure mathematics, it seems. Yet, this form of Fourier is not just a toy for mathematicians. It is useful for mathematical reasoning on models of the real world. Analytical solutions may be obtained for real-world problems. Useful or not, our present vision of the world becomes increasingly digital; we observe and manipulate the real world using digital tools that discretise it. Most often, signals are not continuous and infinitely long, they are discrete
and of finite length. Mathematics exist that allow travelling back and forth between the continuous and the discrete representation. When the continuous Fourier reasoning is to be used for our discrete data, the additional maths do not simplify things. Arguments that are compelling in continuous Fourier may get twisted upon translation to the digital domain. In fact, the discrete representation of Fourier analysis may seem better off without the burden of its continuous ancestor. However, it is possible to loosely apply continuous Fourier reasoning to discrete settings, reasoning that gives a feeling for what happens when one filters a signal, for instance. The most interesting example of such reasoning involves convolution, an operation that is ubiquitous in the domains where Fourier is used. It will be discussed in Section 3.
3 Convolution
We will introduce the concept of convolution using a simple example from systems analysis. We will take a small side step to introduce the basics of the system. Suppose we have a single reactor that has one flow going in and one going out as depicted in Fig. 1. Suppose the reactor is a so-called continuously stirred tank reactor, or CSTR. A CSTR is a well-known theoretical concept. In a CSTR, mixing is by definition perfect. As soon as some material arrives, it is instantaneously completely and homogeneously dispersed throughout the reactor. Imagine there is water flowing through. Now we spike the input with some ink. When the ink arrives at the reactor, we will immediately see it appear in the output. Not as strong as it was, but diluted to the volume of the reactor. After the initial jump, the colour of the output will gradually fade away, as the ink is washed out of the reactor. In the beginning, when the
Fig. 1 A CSTR and its impulse response.
concentration is high, the material is washed away quickly and the concentration drops fast. As the concentration becomes lower, the rate at which the material leaves with the outflow becomes lower, i.e. the concentration drops more slowly. In short: the rate at which the concentration decreases is proportional to the current concentration. This amounts to a simple differential equation. When we solve this equation we obtain a formula for the concentration profile in the output after a single spike in the input. This is called the impulse response of the CSTR:

$$c(t) = c(0)\, e^{-kt} \qquad (4)$$
where k, the time constant, depends on the ratio of flow to reactor volume, and c(0), the initial concentration, depends on the amount of ink introduced and, again, reactor volume. A high k means that the reactor is flushed rapidly and the concentration drops fast. Now, what would happen if we were to spike the input several times, with some time in between? As depicted in Fig. 2, the concentration profile in the output would be the sum of the responses to the individual spikes. When we cut through the output profile at some moment t we see that the contributions correspond to different positions on the basic impulse response. For the first spike, we are already on the tail of the response, for the last we are still close to the top. To get another view, we start by taking the mirror image of the impulse response. Its exponential slope will be to the left and its perpendicular edge will be to the right. The three contributions at time t are obtained by multiplying the input with the mirrored impulse response positioned at time t. In this example the input consists of the three impulses in the midst of zeros. Therefore, the multiplication leads to a sampling of three points from the (mirrored) impulse response. In general, the input is a
Fig. 2 A series of impulses on a CSTR, the responses and their envelope.
continuous signal, which can be regarded as a series of infinitely closely spaced impulses of different amplitude, and there is an infinite number of contributions. In the example, the overall output signal at time t is the sum of the three products. In general, it is the integral of the product of two signals, namely the input and the mirrored impulse response positioned at time t. To obtain the entire output signal, we drag the mirrored impulse response over the input. At each position t of the impulse response, we multiply and sum to get the output signal in t. This dragging process is illustrated by Fig. 3. That operation is called convolution. Any input signal can be thought of as a series of impulses and the output signal will be the convolution of the impulse response and the input. In other words: if we know the impulse response of a system, we can derive what the output will be like given some input. The formal description of a convolution is the convolution integral:

$$g(t) = \int_{-\infty}^{+\infty} f(\tau)\, h(t - \tau)\, d\tau \qquad (5)$$
where g(t) could be the output of a system with impulse response h(t) and input f(t). Impulse responses are usually relatively easy to come by, but the effect of a convolution is often difficult to picture without actually evaluating the convolution integral, which is seldom a simple task. This is where Fourier comes in. The convolution theorem states that a convolution in the t-domain is equivalent to a multiplication in the ω-domain. What we need to do is Fourier transform the input and the impulse response. The product of these functions is the Fourier transform of the output. So if we want the output, we need to transform back.
Fig. 3 Every point of the envelope response can be seen as a multiplication of the input impulses and a mirrored impulse response.
This may not seem to be a simplification, but in many cases it is, and in the following, we will frequently use this property.
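To make the convolution theorem concrete, here is a minimal sketch, not part of the original chapter, that assumes NumPy and an arbitrary time constant. It builds the CSTR impulse response of Eq. (4), convolves it with a spike train in the time domain, and checks that multiplying the Fourier transforms of input and impulse response gives the same output.

```python
import numpy as np

n = 512
t = np.arange(n)
k = 0.05                                # assumed time constant for the demo
impulse_response = np.exp(-k * t)       # Eq. (4) with c(0) = 1

x = np.zeros(n)                         # input: three spikes of ink
x[[50, 120, 160]] = 1.0

# Route 1: convolution in the time domain, truncated to the signal length.
y_time = np.convolve(x, impulse_response)[:n]

# Route 2: the convolution theorem. Zero-padding to length 2n avoids the
# circular wrap-around discussed in the next section.
fx = np.fft.rfft(x, 2 * n)
fh = np.fft.rfft(impulse_response, 2 * n)
y_freq = np.fft.irfft(fx * fh, 2 * n)[:n]

print(np.allclose(y_time, y_freq))      # True: the two routes agree
```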
4 Convolution and discrete Fourier
In the discrete Fourier setting, the convolution theorem still holds, but with an important modification. The multiplication of discrete Fourier transforms corresponds to a convolution that is circular. One can imagine that the convolution described above, dragging some impulse response along the signal, gets into trouble when we have a finite set of data. At the beginning and the end of the signal, the shape we are dragging along it will stick out, as depicted in Fig. 4. The simplest solution is to exclude those sections of the signal in the output, i.e. to start the convolution on the position where the entire shape encounters signal, and to stop when the front of the shape meets the end of the signal. That would make the output signal shorter than the input. An alternative could be to simply sum the remaining products when, in some position, the shape sticks out. That would be equivalent to assuming that the signal is zero beyond the available data. In the CSTR example above that was a reasonable assumption, but in general it is not. Yet another way of solving the problem of missing data at the edges of the signal is to think of the signal as something that repeats itself. After the end
Fig. 4 Convolution at the beginning of a discrete signal. The impulse response is too long.
of the signal, we will suppose it starts over as at the beginning. Hence, before the start, we will suppose the signal behaved as it does at the end. This is a circular convolution, as depicted in Fig. 5. In a discrete convolution, either we lose part of the signal, or we deform that part. As long as the signal is long compared to the shape it is being convoluted with, we do not worry too much about the deformation. Under those circumstances, we will loosely use the convolution theorem, as if the circular aspect were not there.
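The wrap-around is easy to demonstrate numerically. In the sketch below (an illustration assuming NumPy; the signal and impulse response are made up), multiplying DFTs of equal length yields a circular convolution, so the tail of the response reappears at the start of the signal.

```python
import numpy as np

x = np.array([0., 0., 0., 0., 0., 0., 1., 0.])   # spike near the end
h = np.array([1.0, 0.5, 0.25, 0.125])            # decaying impulse response

# Equal-length DFTs multiplied together -> circular convolution of x and h.
circ = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h, len(x))).real
print(np.round(circ, 3))
# [0.25  0.125 0.    0.    0.    0.    1.    0.5  ]  <- the tail wraps around
```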
5 Polynomial approximation and basis transformation
This section will elaborate the following ideas. The Fourier transform can be interpreted as a polynomial approximation of a signal, where the polynomial is a series of sines (and cosines) of increasing frequency. When the degree of the polynomial is high enough, the approximation will be perfect: we will accurately reproduce the entire signal. At that point, the polynomial can be seen as a basis for signal space, and the calculation of the coefficients boils down to a basis transformation. Suppose we have a set of n data points (xi, Yi), a calibration line, for example. The data are plotted in Fig. 6.
Fig. 5 In a circular convolution, the signal is wrapped around to avoid the problem of an impulse response that is too long at the edges of the signal.
Fig. 6 Scatter plot of a set of x - y data.
We wish to describe y as a function of x. A straight line seems okay as a first approximation. In that case the model is a first-order polynomial:

$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (6)$$

where the error term ε describes the fact that the yᵢ will not perfectly fit our model, due to measurement error and to model insufficiency. We might want to add a quadratic term if we suspect curvature, i.e. go to a second-order polynomial:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \qquad (7)$$

Note that the ε of Eq. (7) is not the same as in Eq. (6). If a second-order model is not sufficient we try a third-order, etc. If we use a polynomial of order n − 1, we are sure to perfectly describe the data. There would be no degrees of freedom left. Fig. 7 shows the first orders of approximation of the data of Fig. 6. In general, a perfect description is not what we aim for. As the responses have not been measured with infinite precision, a perfect description would go beyond describing the process we set out to observe. It would describe the measurement error as well. In a polynomial approximation, we would typically stop at an order well below the limiting n − 1. In other words, we suspect the higher-order terms to be representing noise. That is a general principle we will also encounter in Fourier. For the calculation of the coefficients in our polynomial model we use linear regression, i.e. a least squares projection of the data onto the model. This is very easy to write down in matrix notation. Our model becomes:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (8)$$
Fig. 7 Polynomial approximation of orders 0, 1, 2 and 10 for the data in Fig. 6.
where y is the n-vector of responses y₁ to yₙ, ε the n-vector of residual errors, β the p-vector of the coefficients if the polynomial is of order p − 1, and X the n × p model matrix. In case of a second-order model, X can be constructed as:

$$\mathbf{X} = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix} \qquad (9)$$

The coefficients β are estimated using:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \qquad (10)$$
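As a worked illustration of Eqs (8)-(10), the following sketch (not from the book; it assumes NumPy and uses simulated data) builds the model matrix of Eq. (9) and estimates the coefficients through the normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 20, 30)
y = 1.0 + 0.8 * x + 0.05 * x**2 + rng.normal(0, 1, x.size)  # noisy quadratic

X = np.column_stack([x**p for p in range(3)])  # columns 1, x, x^2, as in Eq. (9)
beta = np.linalg.solve(X.T @ X, X.T @ y)       # Eq. (10), without forming the inverse
print(beta)                                    # estimates close to (1.0, 0.8, 0.05)
```

Solving the normal equations directly is numerically preferable to explicitly inverting XᵀX, but the result is the same least squares projection.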
The matrix inversion is a crucial element. In the ideal situation, the columns of X are orthogonal. That means that XᵀX is diagonal and the matrix inversion boils down to simple divisions. We can go one step further and normalise those orthogonal columns, making XᵀX the identity matrix and allowing us to write:

$$\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y} \qquad (11)$$

Each coefficient βⱼ can be calculated independently as the inner product of the response vector y and column j of X:

$$\hat{\beta}_j = \sum_{i=1}^{n} x_{i,j}\, y_i \qquad (12)$$
If X is constructed simply by adding increasing powers of the basic x, as in Eq. (9), XᵀX is not diagonal. However, it is possible to construct polynomials in x that do result in diagonal XᵀX, i.e. to construct orthogonal polynomials. A simple solution is the Chebychev series. If the xᵢ are equidistant and we rescale them to {0, 1, ..., n}, the following series of polynomials is orthogonal:

$$p_0 = 1$$
$$p_1 = x - \frac{n}{2}$$
$$p_{k+1} = p_1 p_k - \frac{k^2\left[(n+1)^2 - k^2\right]}{4\,(4k^2 - 1)}\, p_{k-1} \quad (1 \le k \le n-1) \qquad (13)$$
The columns of X contain the successive orders of this polynomial. Orders 0 to 8 are plotted in Fig. 8 for the case of n = 100. With such a large n, the functions become smooth, whereas for small n, they are rather ragged. When the Chebychev polynomial series is developed until the order k equals n − 1, i.e. until there are as many polynomial coefficients as there are observations, the matrix X can be considered an orthogonal basis for n-dimensional space. The coefficients βⱼ are the co-ordinates in this alternative system of axes, this other domain, as it is often called. We could speak of the Chebychev domain in this case. Eq. (10) describes the basis transformation, i.e. the projection of the signal onto the alternative basis. The transform has only changed our perspective on the data; nothing has been changed or lost. So we could also transform back, using the model at the start of all this: y = Xβ.
Fig. 8 Some Chebychev polynomials.
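The recurrence of Eq. (13) is simple to program. The sketch below (an illustration assuming NumPy; n = 20 is chosen smaller than the n = 100 of Fig. 8 to keep the polynomial values numerically comfortable) generates the series and confirms that the columns of X are mutually orthogonal.

```python
import numpy as np

n = 20
x = np.arange(n + 1, dtype=float)

p = [np.ones_like(x), x - n / 2]          # p0 and p1 of Eq. (13)
for k in range(1, n):                     # 1 <= k <= n - 1
    a_k = k**2 * ((n + 1)**2 - k**2) / (4 * (4 * k**2 - 1))
    p.append(p[1] * p[-1] - a_k * p[-2])

X = np.column_stack(p)                    # one column per polynomial order
Xn = X / np.linalg.norm(X, axis=0)        # normalise the columns
print(np.max(np.abs(Xn.T @ Xn - np.eye(n + 1))))  # ~0: an orthonormal basis
```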
People working in chemometrics will be familiar with another kind of basis transformation: principal component analysis (PCA). They may be puzzled by the differences between PCA and orthogonal polynomials. Therefore we will compare the two. An orthogonal polynomial provides a fixed basis for n-dimensional space onto which we can project any individual vector that is n points long. The basis is fixed in the sense that it is not dependent on the vectors that will be projected on it. Or vice versa, we do not need to know those vectors in order to construct any of these polynomial bases. This is in sharp contrast to PCA, which uses a set of vectors to define a new basis, well suited to represent just that set. The Chebychev polynomial is just one possibility for constructing a fixed orthogonal basis for n-dimensional space. There are many others. Interesting members of the family are the Hermite polynomials, Fourier and wavelets. As there are many, the question arises how to choose between them. Before we are able to answer that question, we need to deal with another, more fundamental one: why do a basis transformation in the first place? The purpose of a basis transformation is always to make things easier to see or do. Take PCA for instance: its main usage is to reduce the number of dimensions of a data set, i.e. to use only a limited set of basis vectors to describe the data. The reduction can be a goal in itself, but it also allows us to concentrate on the main tendencies in the data. It is like the polynomial approximation, where we strive for a model with few terms, typically only the lower-order ones. The purpose of the Fourier transform, or the wavelet transform, is much the same. A dimension reduction, i.e. a description of the data using a subset of basis functions, serves to compress and to improve visibility. When visibility is the issue, the process of dropping irrelevant basis functions is usually referred to as filtering in a Fourier setting, or denoising in wavelets.
6 The Fourier basis
The Fourier polynomial series is not a sequence of increasing powers of x, like the Chebychev polynomial, but a series of sines and cosines of increasing frequency. In fact, there is no longer a notion of x and y, as in the initial example of polynomial approximation, but just y, a series of numbers. We will call it a signal, which may be a variable recorded over time, an absorption recorded as a function of wavenumber, or any vector you like. It is common practice to speak of the Fourier coefficients as the representation of the signal in the Fourier domain, or, alternatively, the frequency domain. The signal itself is then said to be a representation in the time domain. This is confusing when our signal is not a function of time, but, e.g. an absorption spectrum. In Fourier and wavelet literature, however, these notions of time and frequency are so common that they are unavoidable. If the signal to be transformed is n points long, the terms of the polynomial are defined on {0, 1, ..., n − 1}. This equidistant grid will be called x. For ease of notation we will assume that n is odd. The terms of the polynomial, i.e. the n functions that make up the Fourier basis, are:

$$a_0 = 1$$
$$a_k = \cos(k \cdot 2\pi x/n), \quad k \in \left\{1, \ldots, \tfrac{n-1}{2}\right\}$$
$$b_k = \sin(k \cdot 2\pi x/n), \quad k \in \left\{1, \ldots, \tfrac{n-1}{2}\right\} \qquad (14)$$
Fig. 9 gives a plot of the aₖ for k ∈ {0, 1, 2, 3} and the bₖ for k ∈ {1, 2, 3}, for n = 99. Small n do not give smooth curves. The functions aₖ and bₖ enter as columns into the matrix X, e.g. as:

$$\mathbf{X} = \begin{pmatrix} \mathbf{a}_0 & \mathbf{a}_1 & \cdots & \mathbf{a}_{(n-1)/2} & \mathbf{b}_1 & \cdots & \mathbf{b}_{(n-1)/2} \end{pmatrix} \qquad (15)$$
The Fourier transform can be done using either Eq. (10) or Eq. (11). The β will be the Fourier coefficients. When we use Eq. (11), we should realise that X is orthogonal, not orthonormal. Therefore the coefficients will not be the true β but XᵀXβ. When we back-transform from the true β, from the result of Eq. (10), we simply use y = Xβ. But when we go back from the result of Eq. (11), we need to use y = X(XᵀX)⁻¹β. In short, the use of (XᵀX)⁻¹ is inevitable, either forwards or backwards. It should be noted here that the Fourier coefficients will probably not be calculated using Eq. (10) or (11). Most applications of Fourier are based on the so-called Fast Fourier Transform (FFT), which is a clever way of arranging the calculations in order to speed them up. That is of little concern but for two reasons.
Fig. 9 Some elements of the Fourier series, i.e. base functions of the Fourier basis.
One is that the FFT works only for signals whose length is an integer power of 2. So we have to do something to the signals that do not happen to have that length. This is not an easy problem. When we cut the signal we lose information; when we add something to it we introduce artefacts. We will not go into any detail on this, but we note that the fast wavelet transform suffers from the same problem. The other reason the calculation may concern us is that the FFT will return the Fourier coefficients as a series of n complex numbers. This is the most common representation in the world of Fourier. We should not be bothered by it. The real parts correspond to the cosine terms aₖ, the imaginary parts belong to the sine terms bₖ. But there are n complex numbers, and this way we would find 2n coefficients for only n points of the signal. In fact we need only look at the first half of the series of complex numbers, because this series is symmetrical with respect to its central value. The second half is often referred to as the negative frequencies. A pair of aₖ and bₖ for which k is the same refers to a single frequency. In other words, k can be interpreted as frequency. This is mathematically obvious as cos(x) = sin(x + π/2), i.e. a cosine is a shifted sine. By taking the sum of a sine and a cosine, we do not create a different frequency; we describe the same basic frequency they have in common, and by changing the ratio of sine and cosine, we set the phase of this frequency. Fig. 10 illustrates this.
Fig. 10 Two linear combinations of a sine and a cosine, resulting in oscillations of different phase.
By adding a bit of a sine to a cosine, we shift the cosine towards the sine. The more we add, the closer we get to the sine. The n coefficients refer to (n + 1)/2 frequencies if n is odd, or 1 + n/2 if n is even. a₀ is zero frequency, the offset of the signal. k = 1 refers to the base frequency of the analysis; all other frequencies are integer multiples of it. We could also say that it is the frequency resolution of our analysis, as we step through the frequency domain in steps the size of the base frequency. The longer the signal (the bigger n), the lower the base frequency, as its period exactly fits the length of the signal. In other words, the higher our frequency resolution. The maximum frequency in the Fourier basis is uniquely determined by the sampling frequency and does not depend on the length of the signal. It is always n/2 times the base frequency, i.e. its period is two nths of the length of the signal, i.e. two points. Having set out the outlines of the Fourier basis, we can start to answer the question why one should want to use it. One answer could be: because there are signals for which it is appropriate. Which changes the question to: what kind of signals are those? We have seen that a polynomial development of x is a logical thing to do when we want to describe data that most likely fall onto a straight line, with potential curvature that would need to be captured. Analogously, we can think of a signal that is basically a sinusoid, with potentially some higher frequencies and certainly noise, as illustrated in Fig. 11(a). The Fourier basis would be an obvious choice for such a signal. In Fig. 11(c), the approximation of the signal using the appropriate frequencies has been plotted. Being able to pick the appropriate frequencies is a result of Fourier analysis.
Fig. 11 (a) Noisy periodic signal; (b) its PSD; (c) reconstruction of the signal using only the two strong frequencies.
When the signal of Fig. 11(a) is transformed, a series of n coefficients is obtained that we know to be grouped in pairs referring to the frequencies in the basis. In this case, we are not interested in the phases of those frequencies, only in their relative importance for describing the signal. The sum of the squares of the coefficients of sine and cosine gives what is called the power of that frequency. We could also look at the amplitude, which is the square root of the power. A plot of power versus frequency is called the power spectrum or power spectral density (PSD). For the signal in Fig. 11(a), the PSD is given in Fig. 11(b). Two frequencies clearly stand out in the spectrum. The others will be considered noise. In this example we knew that the signal was periodic. In practice, this may not be the case. There may, e.g. be seasonal effects in a product property, or resonance effects in a controlled variable in a process. Fourier analysis aids in finding those phenomena. We could extend our view to any signal that is periodical, not just sinusoidal. Let us look at the block wave of Fig. 12(a). A sine as a first approximation of the block is obvious, but then what? Higher frequencies, with their steeper slopes, are required to get into the corners of the block. It can be derived mathematically that it is a beautiful series of discrete frequencies that makes up a block wave, viz. f, 3f, 5f, ..., where f is a sine with the same period as the block. The amplitude diminishes as the frequencies get higher.
Fig. 12 (a) Some periods of a block wave and a sine with the same period and corresponding phase; (b) the PSD of the block wave; (c) approximation of the block wave using the first six frequencies differing from zero in the PSD.
The PSD of the signal in Fig. 12(a) is given in Fig. 12(b). When we do not use all frequencies to reconstruct the block wave, oscillations will remain, as illustrated by Fig. 12(c). We observe that, although Fourier is suitable for periodic signals, it is not particularly efficient for sharp edges. It is obvious why not: the basis functions themselves, the sines and cosines, are smooth. This leads to a general observation: something sharp, sudden or narrow in the time domain will be something wide in the frequency domain.
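The PSD is easily obtained from an FFT: square and add the real (cosine) and imaginary (sine) parts of each coefficient. A short sketch (assuming NumPy; the two-frequency test signal loosely mimics Fig. 11(a) but is otherwise an arbitrary choice):

```python
import numpy as np

n = 99
t = np.arange(n)
rng = np.random.default_rng(1)
signal = (np.sin(2 * np.pi * 4 * t / n)
          + 0.5 * np.sin(2 * np.pi * 11 * t / n)
          + rng.normal(0, 0.2, n))

coeffs = np.fft.rfft(signal)            # one complex number per frequency k
psd = coeffs.real**2 + coeffs.imag**2   # power of frequency k: a_k^2 + b_k^2
print(np.argsort(psd)[-2:])             # the two dominant frequencies: [11  4]
```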
7 Fourier transform: Numerical examples
To get a feeling for Fourier transformation, in this section we will go through a small numerical example. Suppose we want to transform signals that are 9 points long. We need to set up a 9-by-9 transformation matrix. The matrix is constructed following Eq. (15):
$$\mathbf{X} = \begin{pmatrix} 1 & \cos(1 \cdot 2\pi \cdot 0/9) & \cdots & \cos(4 \cdot 2\pi \cdot 0/9) & \sin(1 \cdot 2\pi \cdot 0/9) & \cdots & \sin(4 \cdot 2\pi \cdot 0/9) \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 1 & \cos(1 \cdot 2\pi \cdot 8/9) & \cdots & \cos(4 \cdot 2\pi \cdot 8/9) & \sin(1 \cdot 2\pi \cdot 8/9) & \cdots & \sin(4 \cdot 2\pi \cdot 8/9) \end{pmatrix}$$
This evaluates to:

$$\mathbf{X} = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0.77 & 0.17 & -0.50 & -0.94 & 0.64 & 0.98 & 0.87 & 0.34 \\
1 & 0.17 & -0.94 & -0.50 & 0.77 & 0.98 & 0.34 & -0.87 & -0.64 \\
1 & -0.50 & -0.50 & 1 & -0.50 & 0.87 & -0.87 & 0 & 0.87 \\
1 & -0.94 & 0.77 & -0.50 & 0.17 & 0.34 & -0.64 & 0.87 & -0.98 \\
1 & -0.94 & 0.77 & -0.50 & 0.17 & -0.34 & 0.64 & -0.87 & 0.98 \\
1 & -0.50 & -0.50 & 1 & -0.50 & -0.87 & 0.87 & 0 & -0.87 \\
1 & 0.17 & -0.94 & -0.50 & 0.77 & -0.98 & -0.34 & 0.87 & 0.64 \\
1 & 0.77 & 0.17 & -0.50 & -0.94 & -0.64 & -0.98 & -0.87 & -0.34
\end{pmatrix}$$
1
1
1 1
0.77 0.17
1
-0.5
1
0.17 -0.94 -0.5
a very simple signal: all zeros except for a one in the middle, 5. This may seem a bit artificial, but it is a very i m p o r t a n t analysis. It is the impulse we can use to perturb a system in its impulse response. The calculation of the Fourier coeffiEq. (11) is given below:
1
1
-0.5 -0.5
-0.94 0.77
-0.94 0.77
-0.5
-0.5
1
1
1
1
-0.5 -0.5
0.17 -0.94
1
-0.5
1 -0.94 0.77 -0.5 0.17 0.17 -0.5 0.77 0 0.64 0.98 0.87 0.34 -0.34 -0.87 -0.98 0 0.98 0.34 -0.87 -0.64 0.64 0.87 -0.34 0 0.87 -0.87 0 0.87 -0.87 0 0.87 0 0.34 -0.64 0.87 -0.98 0.98 -0.87 0.64
1
0
1
0.77 0.17
0 0
-0.94 0.77
0
-0.5
-0.5
-0.94 -0.64 -0.98 -0.87 -0.34
x
1 0 0 0 0
=
0.17 0.34 -0.64 0.87 -0.98
F o r this particular signal, the calculations are strongly simplified because of all the zeros. In fact the signal can be said to select just the 5th column of X T, i.e. the 5th row of X.
20
a1
b~
a2
b2
a3
b3
a4
I)4
Fig. 13 The columns of the matrix X in the numerical example.
The result is not directly interpretable. We prefer to calculate the power spectrum.

  k    a_k      b_k      Power
  0    1        –        1
  1    -0.94    0.34     1
  2    0.77     -0.64    1
  3    -0.50    0.87     1
  4    0.17     -0.98    1
We see that all frequencies are equally important! This is exactly the reason why the impulse is so popular: it contains all frequencies. When we use it to perturb a system, we excite every frequency. On the other hand, when efficiency of representation is the issue, the sines and cosines of Fourier clearly are not ideal for this completely localised phenomenon in our signal. Now we will Fourier transform a Gaussian that is 9 points long. The transformation matrix remains the same, and the calculation goes like:

$$\boldsymbol{\beta} = \mathbf{X}^T \begin{pmatrix} 0.0003 \\ 0.0111 \\ 0.1353 \\ 0.6065 \\ 1.0000 \\ 0.6065 \\ 0.1353 \\ 0.0111 \\ 0.0003 \end{pmatrix} = \begin{pmatrix} 2.51 \\ -1.85 \\ 0.72 \\ -0.14 \\ 0.00 \\ 0.67 \\ -0.61 \\ 0.24 \\ -0.10 \end{pmatrix}$$
The power spectrum is obtained as:
  k    a_k      b_k      Power
  0    2.51     –        6.28
  1    -1.85    0.67     3.86
  2    0.72     -0.61    0.89
  3    -0.14    0.24     0.08
  4    0.00     -0.10    0.00
The spectrum shows us that the Gaussian contains primarily low frequencies. This can be expected, as the shape of a Gaussian is rather smooth, and has no very sharp features that would require high frequencies. For continuous, infinitely long signals, it can be derived using Fourier integrals that the transform of a Gaussian is itself a Gaussian, whose width is inversely proportional to that of the original. In other words, the wider the Gaussian in our signal, the narrower its counterpart in the Fourier domain. The spectrum we just calculated contains only half a Gaussian, because our discrete Fourier transformation finds only positive frequencies. If we want to relate our discrete spectrum to the continuous one, all we have to know is that a spectrum is symmetrical in zero frequency. The next thing we will try is to transform back. As we used β = Xᵀy to transform forwards whilst X was not orthonormal, we have to use y = X(XᵀX)⁻¹β to transform backwards. In other words, we have to multiply our coefficients by (XᵀX)⁻¹. That is a diagonal matrix, and this multiplication comes down to dividing each coefficient by the sum of squares of the corresponding basis function, which is 9 for the first and 4.5 for the others. The corrected coefficients are:
(0.2785, -0.4102, 0.1610, -0.0311, 0.0022, 0.1493, -0.1351, 0.0539, -0.0124)

When we do not touch these coefficients, going back will simply reproduce the initial Gaussian. Here we will try what happens when we drop the highest frequencies:

$$\mathbf{X} \begin{pmatrix} 0.2785 \\ -0.4102 \\ 0.1610 \\ 0 \\ 0 \\ 0.1493 \\ -0.1351 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0.0293 \\ -0.0448 \\ 0.1568 \\ 0.6494 \\ 0.9252 \\ 0.6494 \\ 0.1568 \\ -0.0448 \\ 0.0293 \end{pmatrix}$$

with X the 9-by-9 matrix given above.
8
Fourier and signal processing
Suppose we have a signal consisting of some gaussian peaks and noise, like a chromatogram for instance. A plot of such a signal is given in Fig. 15(a), and its power spectrum in Fig. 15(b).
23
9 ..
Gaussian n
Fig. 14 A Gaussian and its low-pass reproduction.
(a)
(b)
40
(c)
Fig. 15 (a) Gaussians with some noise," (b) The PSD; (c) the smoothed version of (a).
The peaks are found back at low frequencies, the high frequencies are primarily noise. We can imagine filtering in the Fourier domain by dropping or attenuating the high frequencies and then transforming back. Let us simply set all frequencies above 40 to zero, as illustrated in Fig. 15(d). The result is given in Fig. 15(c). We just applied a low-pass filter (LP). The opposite would be a high-pass filter (HP). Under other circumstances, it may be useful to select a band of frequencies somewhere in the frequency range, not necessarily to the high or low extreme. In that case we would be using a band-pass filter (BP). Which filter is appropriate depends on the type of signal and the type of noise.
24 A filter can be implemented in the time domain as well. It would be the convolution of the signal with the back transform of the weight function we apply to the frequencies. Vice versa, a filter designed in the time domain can be implemented in the Fourier domain as a multiplication with the Fourier transform of the impulse response of the filter. The hard cut-off we applied in Fig. 15 amounts to a weight function with the shape of a block; ones up to the cut-off frequency, and zeros above. The back transform of a block is a sinc function, i.e. the function sin(x)/x. The wider the block, the narrower the sinc. The equivalent operation in the time domain would thus be a convolution of the signal with this sinc, as illustrated by Fig. 16. The consequences of a filter shape can be visualised most easily by picturing what happens if there is a spike in the signal. The output of the filter than contains a copy of the filter shape on the position of the spike. In other words, we get to see the impulse response of the filter. For a filter shape that is wide and oscillating, phenomena that are purely local in the time domain get spread out and deformed. If we want to have a more reasonable filter shape in the time domain, we have to use a smoother cut-off in the frequency domain. It is the sharpness of the cut that introduces the oscillations, as it disturbs the delicate balance of frequencies required to localise something in
Fig. 16 Convolution of the signal from Fig. 15(a) with the sinc function corresponding to the cut-off applied in Fig. 15(d).
in time. As an example: in order to improve the smoothness of the cut-off we could make the weight drop from 1 to 0 like a sigmoidal rather than in a single blow. The more we soften the drop, the more localised the filter shape will be, as illustrated by Fig. 17. Starting in the time domain we arrive at more or less the same conclusions. A popular filter is the moving average. It calculates the average over a small window on the data, a window that is slid over the data. When xi are the data points and yi is the output of the moving average, the calculation for a window of 4 points would be:
y1 = (x1 + x2 + x3 + x4)/4
y2 = (x2 + x3 + x4 + x5)/4
y3 = (x3 + x4 + x5 + x6)/4
The functioning of the moving average over n points can be seen as a convolution of the signal with a block (0 ... 0 1/n ... 1/n 0 ... 0). Therefore, its effect in the frequency domain is a multiplication with, again, the shape of a sinc, or, if we evaluate it using a power spectrum, the square of a sinc. Fig. 18 gives the power spectrum of a block 11 points wide applied to a signal 100 points long. The higher frequencies are strongly attenuated, which describes the smoothing effect of the filter, but there are oscillations
Fig. 17 Sigmoidal cut-offs with different slope and the corresponding filters.
Fig. 18 Power spectrum of an 11-point block in a signal 100 pts long.
that ensure that ranges of higher frequencies do get through. These oscillations are due to the sharpness of the filter shape. When we are looking for a compromise, something that is smooth and monotonic in both domains, and has a reasonably sharp cut-off, a Gaussian would be a good choice. When the moving average is calculated over fewer points, the power spectrum gets wider, and the oscillations move to frequencies above those in the signal. Fig. 19 shows the power spectrum of a block that is only two points wide. In the case of a moving average over two points, it is easy to understand what happens to the variation that is averaged away. This variation is found by taking the first difference of the data, i.e. taking the difference of two neighbouring points where the moving average takes the sum. When xi are the data points, yi is the output of the moving average and zi is the output of the first difference operator, the calculations are:
Fig. 19 Power spectrum of a block of 2 pts in a signal 100 points long, the first difference operator, and the sum of the spectra.
y1 = (x1 + x2)/2        z1 = (x2 - x1)/2
y2 = (x2 + x3)/2        z2 = (x3 - x2)/2
y3 = (x3 + x4)/2        z3 = (x4 - x3)/2
In fact, when we calculate both y and z we could drop every second yi and zi and still be able to reconstruct the original xi. The first difference operator can be seen as a convolution of the signal with the sequence (0 ... 0 -1/2 +1/2 0 ... 0). It is a coarse approximation of the first derivative of the noise-free signal. Using Fourier integrals, it can be shown that the transform of the first derivative of a function is equal to the transform of the function multiplied by jω. In other words, taking the first derivative amplifies high frequencies (the higher the frequency, the more it is amplified) and annihilates zero frequency. The power spectrum of the first difference operator shows such a high-pass effect, see Fig. 19, where it has been plotted together with the spectrum of the moving average over 2. Note that the power spectrum would remain the same if the sign of the impulse response were changed. That would change the sign of the output, but that would not change the filtering effect. When combining the power spectra of the first difference operator and the moving average over two, we see that the two spectra have a constant sum. This corresponds with what we already knew: the information passed by the two operators is complementary. The moving average over 2 and the first difference operator form a special pair of an LP filter and an HP filter that divide the frequency domain in two, right in the middle of the domain. This type of filter pair plays an important role in the discrete wavelet transform. Although this kind of pair has a special property, that does not mean it is hard to come by. There are an infinite number of such pairs. Consider for instance the sharp cut-off we applied in Fig. 15. We could place the cut-off in the middle of the frequency domain. Letting pass everything below the cut-off creates a low-pass filter, letting pass everything above it creates a complementary high-pass filter. This very neat cut-up of the frequency domain has a less appealing counterpart in the time domain, as the shape of the low-pass filter is a sinc much like the one in Fig. 16 (only narrower) and the shape of the high-pass filter is something similar. Sigmoidals as in Fig. 17 offer an alternative. Where the low-pass filter drops following a sigmoidal, a high-pass
filter could be made to rise following the mirror image of that sigmoidal. We would again have a pair of complementary LP and HP filters, as illustrated in Fig. 20.
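The complementary behaviour of this filter pair is easy to verify numerically. The following minimal check (ours, not from the text) computes the power spectra of the impulse responses (1/2 1/2) and (1/2 -1/2) on a dense frequency grid and confirms that they sum to a constant over the whole frequency axis.

```python
import numpy as np

n = 512                       # frequency grid resolution
lp = np.array([0.5, 0.5])     # two-point moving average
hp = np.array([0.5, -0.5])    # first difference operator

# Power spectra of the zero-padded impulse responses.
P_lp = np.abs(np.fft.fft(lp, n)) ** 2
P_hp = np.abs(np.fft.fft(hp, n)) ** 2

# The pair cuts the frequency domain into two complementary halves:
# the two power spectra sum to the same constant at every frequency.
print(np.allclose(P_lp + P_hp, 1.0))  # True
```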
9 Apodisation
Suppose we have a signal like the one in Fig. 21(a), a signal in which we recognise a strong trend that makes it end much higher or lower than it started. The Fourier transform of such a signal risks being dominated by the trend. The reason is that the Fourier basis does not have any simple shapes at its disposal to describe such a trend efficiently. All it has is zero frequency, which is used to move the sines/cosines to the right level. For the sines and cosines, bridging the distance between the mean and the first value of the signal is just as hard as bridging a sudden jump elsewhere in the signal. High frequencies are required to describe a sudden jump.
Fig. 20 A pair of high-pass and low-pass filters with sigmoidal cut-off: representation in the frequency domain (left) and time domain (right).
Fig. 21 (a) A sine on a trend; (b) its PSD.
Fig. 21(b) shows the power spectrum of the signal in Fig. 21(a). That signal consists of a sine and a trend. The power spectrum shows several frequencies in addition to the one sine at f = 3. Another way of looking at it starts from the periodic nature of the basis functions. As far as the sines and cosines are concerned, the signal could be just one period of a cyclic phenomenon. When we plot the concatenated signal (Fig. 22(a)), we see that it is dominated by a triangular oscillation, a saw-tooth as it is called. The Fourier transform of the signal will be equally dominated by the transform of that saw-tooth, which, due to the sharp edges in the saw-tooth, contains a lot of high frequencies. To illustrate this, the saw-tooth and its power spectrum have been plotted in Fig. 22(b) and (c). A pragmatic chemist may want to solve the problem by detrending the signal, a common practice in, e.g., data pre-treatment of NIR spectra, but in signal processing the problem is dealt with by multiplying the signal with a weighting function that squeezes the ends to nearly zero. Apodisation, as it is called. A multiplication in the time domain corresponds to a convolution in the frequency domain. One might say that this time it is the power spectrum that is filtered rather than the signal itself. So apodisation will move the problem from a surplus of high frequencies to a global deformation of the spectrum of the signal. The shape of the apodisation function is chosen so as to minimise this deformation. One candidate is the Gaussian. The very wide Gaussian required to apodise corresponds with a very narrow Gaussian in the frequency domain, and will have a limited filtering effect on the spectrum. Fig. 23(a) shows the signal of Fig. 21(a) after apodisation, and Fig. 23(b) gives the power spectrum. We observe that the power spectrum has been cleaned up, but that the one sine in the signal is broadened.
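The whole procedure can be sketched in a few lines. The signal and window parameters below are arbitrary illustrative choices of ours: a sine on a linear trend is multiplied by a wide Gaussian window, and the two power spectra are compared.

```python
import numpy as np

n = 256
t = np.arange(n)
signal = np.sin(2 * np.pi * 3 * t / n) + 0.01 * t  # a sine on a linear trend

# A wide Gaussian apodisation window that squeezes the ends towards zero.
w = np.exp(-0.5 * ((t - (n - 1) / 2) / (n / 6)) ** 2)

psd_raw = np.abs(np.fft.rfft(signal)) ** 2
psd_apo = np.abs(np.fft.rfft(signal * w)) ** 2

# The trend spreads power over many frequencies; after apodisation the
# spectrum is cleaner, at the price of a broadened peak at f = 3.
print(psd_raw[:8].round(1))
print(psd_apo[:8].round(1))
```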
Fig. 22 (a) The signal of Fig. 21(a) repeated several times; (b) the corresponding saw-tooth; (c) its PSD.
Fig. 23 (a) The signal of Fig. 21(a) after Gaussian apodisation; (b) its power spectrum.
We have introduced apodisation as a weighting of the signal, but we can just as well view it as a weighting of the Fourier basis functions. The sines and cosines become squeezed down at the ends, as illustrated by Fig. 24. To the left it shows a sine base function, a Gaussian apodisation that is chosen narrow in order to amplify its effect, and the resulting apodised base function that has the shape of a ripple. Without apodisation, the basis functions of the Fourier set correspond to sharp pulses in the frequency domain. With apodisation, these pulses become convoluted with the transform of the apodisation function, e.g. a Gaussian. The convolution of a pulse with some shape moves this shape to the position of the pulse. As a consequence, the frequency domain is not cut up in disjoint frequencies, but in a series of overlapping Gaussians. Note that this is no more than a different view of the filtering effect of the transform of the apodisation function.
Fig. 24 (a) A sine base function; (b) a Gaussian apodisation function; (c) the apodised sine; (d) power spectrum of (a); (e) power spectrum of (b); (f) power spectrum of (c).
The effect of apodisation in the frequency domain is illustrated by Fig. 24. To the right we see the power spectra corresponding to the signals to the left: a pulse, i.e. a single frequency, for the base function; a decaying signal (in fact half a Gaussian) for the Gaussian; and a Gaussian at the location of the single sine for the apodised base function. The apodised base functions no longer represent single frequencies, and, in the case of the Gaussian apodisation, they are no longer orthogonal.
CHAPTER 2
When Frequencies Change in Time; Towards the Wavelet Transform

Bas van den Bogaert
Solvay SA, DCRT/ACE, Industrial IT and Statistics, Rue de Ransbeek 310, 1120 Brussels, Belgium
1 Introduction
In this chapter, wavelets will be introduced starting from the perspective of Fourier transformation. We are not going to derive wavelets in any formal way, nor will we strive for mathematical correctness. We merely wish to make wavelets plausible, to give a simple view of what they do and how they work. Using a discrete Fourier transformation we describe a signal using a set of sines and cosines of different frequency. A signal does not necessarily have to be a recording of a variable over time; it may also be light absorption as a function of wavelength, or any series of data points you like. To make writing about Fourier, and wavelets, easier, we stick to the terms signal and time. Do not let this upset you. It is much like avoiding the use of he/she in a generic text on people. What will drive us towards wavelets is the notion that frequencies may change in time. Or, to be more correct, that the frequency content of a signal may change in time. Consider, e.g. the signal of Fig. 1. We could imagine that this is a signal picked up by a microphone, i.e. that it is a sound recording. What we would hear if we played back this recording is a series of short beeps of the same pitch. The signal consists of one tone, one sine, interrupted by pauses. However, the building blocks of the Fourier transform continue in the silences. They are not well suited to represent the intermittent nature of this signal. Using continuous Fourier maths we can derive what the Fourier transform of the signal will look like without having to actually calculate it.
The signal can be seen as the multiplication of a sine and a block wave, so its Fourier transform is the convolution of their individual transforms, i.e. a pulse for the sine and a decaying series of pulses for the block. The convolution with the pulse is a shift operator: it moves the Fourier transform of the block to the position of the pulse. The actual power spectrum of the signal of Fig. 1 is given in Fig. 2. The spectrum does not contain just the principal sine, but also a sequence of satellite frequencies, required to annihilate the signal during the silences. And what if the pitches of the notes are different? What if the changes themselves are not periodical? In practice, the changes may be completely irregular and the Fourier transform will be far more difficult to interpret than in the above example. It may also lead us to non-optimal solutions to, e.g., filtering problems. We need to know the frequency content of the noise and that of the noise-free signal in order to derive the optimal filter. When the signal changes in frequency content, we would need to adjust the filter in order to remain optimal. In a chromatogram, e.g., the width of the peaks increases with elution time. A wide peak contains more low frequencies and fewer high frequencies than a narrow one. A global view of the chromatogram would obscure those differences and lead to a sub-optimal filter design. We are faced with a property of Fourier that may be considered a shortcoming: it is not efficient in describing local phenomena. This is inherent to
Fig. 1 An interrupted sine wave.
Fig. 2 Power spectrum of the signal in Fig. 1.
the nature of the Fourier basis functions. Each of the sines and cosines covers the entire signal. Only a delicate mix of many of those global oscillations can exactly cancel out everywhere except at the local feature. The most extreme case is the impulse, a one in a series of zeros, whose description requires all frequencies in the basis.
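The interrupted sine and its satellite frequencies are easy to generate. The sketch below is ours; the beep length and pitch follow the description of Fig. 1 only loosely.

```python
import numpy as np

# Three 100-point beeps (a sine of 4 periods) alternating with
# 100-point silences, as in Fig. 1.
beep = np.sin(2 * np.pi * 4 * np.arange(100) / 100)
signal = np.concatenate([beep, np.zeros(100)] * 3)

psd = np.abs(np.fft.rfft(signal)) ** 2

# The principal frequency (24 cycles over the 600 points) is surrounded
# by the satellite frequencies needed to cancel the sine in the silences.
strongest = np.argsort(psd)[-7:]
print(sorted(strongest))
```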
2 Short-time Fourier transform
When localisation is an issue, the intuitive solution still making use of the Fourier transform would be to cut up the signal and to transform the pieces. This approach is called the short-time Fourier transform. It adds a dimension to the Fourier transform, namely time, as it allows following frequencies over time. Where the Fourier transform is a frequency analysis, the short-time Fourier transform is a time-frequency analysis. Instead of describing the signal in either the time or the frequency domain, we describe it in both, a joint time-frequency domain. When we do this, we are faced with a fundamental limitation: we cannot localise in the time domain and the frequency domain at the same time. The Fourier transform describes a signal as a series of sines and cosines of increasing frequency. The set of frequencies consists of zero frequency, a constant, describing the overall level of the signal, the base frequency, whose period is equal to the length of the signal, and integer multiples of the base frequency up to the maximum frequency, which has a period of just two points. Hence, the Fourier description of the signal steps through the frequency domain in steps the size of the base frequency. If the signal is long, the base frequency is low, as it has a long period, and the frequency resolution is high. If the signal is short, the frequency resolution is low. As a consequence, if we cut up the signal in many pieces, which will give a good localisation in time, we will get poor frequency resolution. As an example, Fig. 3 shows the power spectra of an impulse in a short signal and a long signal. The power spectra are rendered as stick plots to clearly show the difference in frequency resolution. It is interesting to return to the basis functions of the Fourier transforms of the individual segments into which the signal is cut up. Take for instance the sine base frequency of the leftmost segment. Now imagine we make this basis function as long as the signal by adding zeros. We can do the same for the base frequency of the next segment, by adding zeros to the left and right.
Fig. 3 Power spectra of an impulse in a short (a) and a long (b) signal.
We can continue doing this for the remaining segments and for all other basis functions. Together, all these zero-padded base functions form an orthogonal basis for the entire signal, the short-time basis. Fig. 4 allows comparing the global Fourier basis to the short-time basis by showing the first few sine basis functions for a signal of 99 points. For simplicity of the figure, the short-time analysis is done by cutting the signal into just two segments. The base frequency in the short-time analysis is twice as high; hence its frequency resolution is half that of the global analysis. In return, it doubles the number of basis functions for each frequency that does belong to the set.
Fig. 4 The first sine basis functions of the Fourier basis (left) and the short-time Fourier basis (right).
It is unlikely that each of the segments will end at the level where it started. A segment may well display a trend, which will cause the Fourier transform of that segment to contain many high frequencies. Therefore, it is a logical choice to combine the cutting up of the signal with apodisation. Apodisation is the multiplication of the signal, or, in this case, a segment of the signal, with a weighting function that squeezes down the ends. See section 9 of Chapter 1 for more information. Alternatively, we can think of this multiplication as being applied to the Fourier basis functions rather than to the signal. The resulting basis functions no longer represent distinct frequencies, but small bands of frequencies. The shape of these bands depends on the apodisation function. It could be Gaussian, for instance. Their width depends fundamentally on the frequency resolution, i.e. on the length of the pieces into which the signal is being cut up. We can again extend the basis functions of the segments, this time the apodised ones, to the length of the signal to create a basis for the entire signal. In the case of a Gaussian apodisation, the basis of a segment is no longer orthogonal, so neither is the overall basis. Fig. 5 gives the equivalent of Fig. 4 for the apodised transforms. The shapes look like little waves, but they are not the wavelets this book is about. An essential difference is that those wavelets do correspond to orthogonal bases.
Fig. 5 The first apodised sine basis functions of the Fourier basis (left) and the short-time Fourier basis (right).
Each Fourier coefficient in a transform with apodisation represents a band of frequencies. The width of that band is controlled via the length of the signal that is transformed and the shape of the apodisation function. We can introduce the notion of frequency localisation as an extension of the previously introduced frequency resolution and in analogy to localisation in time. When the bands are wide, the frequency information returned by the transform is less localised than when the bands are narrow. In other words, when the time localisation is good, the frequency localisation is poor. The joint time-frequency domain can be represented on a plane, with time as the horizontal axis and frequency as the vertical one. Time is limited by the length of the signal and frequency by the maximum frequency of the Fourier basis, in turn dependent on the length of the signal. Therefore, the joint time-frequency domain of the signal is a rectangle. Given a certain apodisation function, the time-frequency analysis is defined by just one parameter, viz. the length of the segments into which the signal is cut up. The effect of that parameter can be visualised using a tiling of the time-frequency domain as in Fig. 6. Note that the horizontal boundaries, those that limit tiles in the frequency direction, are not as sharp in reality. The shape of the frequency bands is typically smooth, which blurs the boundaries of adjacent tiles of different frequency.
Fig. 6 Different tilings of the time-frequency domain for a signal that is 32 points long.
In order to reduce the figure to a manageable size the length of the signal has been limited to an unrealistic 32 points only. If we do not cut up the signal, the Fourier basis consists of 16 frequency bands (plus one for zero frequency, but we will leave zero frequency out of the scheme). The information we get for each band is valid throughout the signal. This is visualised as 16 tiles, each having the length of the signal. If the signal is cut in two parts of 16 points, the transforms return 8 frequency bands that are twice as wide as before. If the signal is cut in four, there are only 4 frequency bands left, again twice as wide, etc. By the time we have cut up the signal into the individual points, only zero frequency is left. In other words, with one point we have no detail, i.e. no information whatsoever on the frequency content of that one point. On the time boundaries between tiles, the information from the signal is attenuated by the apodisation. For a better coverage, we could therefore decide to create overlapping rather than disjoint tiles, i.e. to slide a window over the data instead of cutting the data into segments. This is known as the Gabor transform. The sliding does not change localisation in either time or frequency. Sliding a window point by point results in as many Fourier transforms as we can fit windows to the signal. A window of 8 points can slide over 32 points in a total of 25 positions, leading to 8 times 25 or 200 coefficients. For a wider window, it would be even more. Using disjoint segments, we obtain 32 coefficients, regardless of the width of the segments. The disjoint segmentation of the time domain can be pictured as a subsampling of the result of the sliding window. Fig. 7 gives an example of the application of a Gabor transform to the signal shown in the introduction to this chapter. The window width is 80 points, whilst the length of the beeps and the silences is 100 points each. This width gives an intermediate time-frequency localisation. For frequencies 0 through 9, the amplitude is plotted as a function of time. The frequency 4 corresponds to the pitch of the beeps. Its amplitude rises and falls following the pattern of the beeps and the pauses. It does not change abruptly because the time localisation is not sufficient to do so. The surrounding frequencies follow the same pattern, as the frequency resolution is not high enough to isolate the beep on its proper frequency. At the borders between beeps and silences, the spectrum temporarily broadens, as the sudden change introduces high frequencies. This type of time-frequency analysis gives us a very rich view of the data, but it is poor when we are after an efficient representation. The results of the Fourier transforms of two neighbouring, not disjoint but severely overlapping
Fig. 7 The power of the lower frequencies in the signal of Fig. 1 as a function of time.
pieces of the signal are likely to be similar. In the following, we will work towards a more efficient representation, and end up with wavelets. Note that a representation without redundancy can also be seen as an orthogonal basis for the signal.
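Before moving on, the sliding-window analysis just described can be sketched as follows. This is an illustration of ours; the Hanning window stands in for an unspecified apodisation function, and the counting reproduces the 25 window positions mentioned above.

```python
import numpy as np

def gabor(signal, width):
    """Sliding-window Fourier transform: one spectrum per window position."""
    n = len(signal)
    window = np.hanning(width)          # a smooth apodisation window
    spectra = []
    for start in range(n - width + 1):  # slide the window point by point
        piece = signal[start:start + width] * window
        spectra.append(np.abs(np.fft.rfft(piece)))
    return np.array(spectra)            # shape: (positions, frequencies)

x = np.sin(2 * np.pi * 4 * np.arange(32) / 32)
S = gabor(x, 8)
print(S.shape)  # (25, 5): 25 window positions, 5 non-negative frequencies
```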
3 Towards wavelets
We will again imagine extending the basis functions of the individual windows to the length of the signal by padding with zeros. Fig. 8 illustrates the effect for a single sine basis function. The total set of zero-padded basis functions for all window positions no longer represents a basis, let alone an orthogonal one. For each window position, the corresponding Fourier coefficient is obtained as the inner product of the signal and the zero-padded basis function. The inner product is an element-by-element multiplication followed by a summation of the resulting products. In other words, the row of Fourier coefficients corresponding to all positions of this basis function is like the result of a convolution of the signal with the basis function. Each basis function selects a particular band of the frequency domain; i.e. it acts like a band-pass (BP) filter.
Fig. 8 The first apodised sine basis functions for several positions of the sliding window.
To be precise, a sine and a cosine basis function are required to define both amplitude and phase of a frequency band, but one can imagine that phase has lost its interest once the basis functions are slid. Roughly put: a cosine will return the same information as a sine that is lagging behind. We have obtained two different views of the short-time Fourier transform. We can see it as a series of Fourier transforms on a window sliding over the data, or as a set of band-pass filters covering the frequency domain in fixed-size bands. Now we could imagine covering the frequency domain using bands of unequal width. If we are not interested in detail in the high frequencies, we could use larger bands there. A larger band corresponds to a narrower window on the data, or, switching back to the tiling of the time-frequency domain, a narrower tile. Allowing for poorer frequency localisation permits a better time localisation. An example of an alternative time-frequency tiling is given in Fig. 9. Remember that there are more coefficients than tiles, as the tiles in the picture are disjoint in order to show localisation, whereas the window is stepped point by point. It is not coincidental that the example gives high frequencies a better time localisation than low frequencies. There is no rule as to what frequency changes more rapidly, but it is reasonable to require that each frequency be allowed the same number of periods to express itself properly. As a consequence, we would be less demanding on the time localisation of low frequencies than of high frequencies. In fact, that is what a discrete wavelet analysis does systematically. Its tiling for a 32-point signal is given in Fig. 10.
Fig. 9 A tiling that gives high frequency resolution in the low frequencies and low resolution in the high frequencies.
Reducing frequency localisation over part of the frequency domain reduces the total number of calculations to be made, as there are fewer band-pass filters to apply to the signal. An additional reduction can come from the observation that it is not necessary to slide the window point by point for something that changes slowly. Hence we could slide the wide tiles with bigger steps than the narrow ones. This is what wavelets do in a clever way, by using a single pair of filters and changing the scale of the data to which these are applied, as will be explained in the following. We focus on a window that is two points wide. Apodisation has no effect here, and we can just as well look at the Fourier basis functions for a signal of that length. These functions are (1 1) and (1 -1). We recognise the impulse responses of the moving average and the first difference operator, in that order. See section 8 of Chapter 1. We know that these operators cut the frequency domain in the middle, into two complementary bands. The moving average is the low-pass (LP) filter, and the first difference the high-pass (HP) filter.
Fig. 10 Wavelet tiling of a signal that is 32 points long.
We also know that we can step the convolutions by two points instead of by one without losing any information on the signal, as long as we retain the output of both moving average and first difference. Now let us repeat the operation on the output of the moving average. We will again cut the frequency domain in two. In this case the domain is just the low-frequency part of the original signal, the part that was passed by the moving average. The combined effect of the two filtering steps boils down to that of applying the filters with impulse responses (1 1 1 1) and (1 1 -1 -1) to the original signal. In other words, to obtain the lower two quarters of the frequency domain we have applied the original impulse responses, but stretched to a scale that is twice as long. If the length of the signal is a power of 2, we can continue to apply our pair of LP and HP filters, each time reducing the number of points in the output by half, until only one point is left in the output of each filter. The point resulting from the last moving average operation is the overall average of the signal. This sequence of calculations is known as the pyramid algorithm. We will go through a numerical example of the pyramid algorithm for a signal of only 16 points. We will adopt the following code. The output of the LP filter, with impulse response (1 1), will be called "a", for approximation, and that of the HP filter (1 -1) will be called "d", for detail. The results are collected in Table 1. The first element of the column "a" is the sum of the first two elements of the signal. The second element is the sum of the elements three and four, etc. In the same way, the first element of the column "d" is the difference between the first two elements of the signal, and so forth. Together, the sixteen elements of the columns "a" and "d" form an alternative representation of the signal. We could say that we have just performed a basis transformation. The basis functions are presented in Table 2. The first element of column "a" is the inner product of the signal and the basis function given in the first column of Table 2. The inner product means that we calculate the product of the first element of the signal and the first element of the basis function, the product of the second element of the signal and the second element of the basis function, etc. Then we sum the products. As all but the first two products are zero, it is easy to see that the inner product boils down to the sum we calculated earlier. The basis of Table 2 can be regarded as a short-time Fourier basis for a window width of 2 points. The first eight columns of Table 2 are the equivalent of a convolution with the impulse response of an LP filter.
Table 1

Signal             a        d
0.128   0.097      0.224    0.031
0.096   0.233      0.329   -0.137
0.306   0.776      1.082   -0.470
0.841   1.160      2.001   -0.319
1.007   0.922      1.929    0.084
0.462   0.305      0.767    0.157
0.322   0.025      0.348    0.297
0.092  -0.025      0.067    0.117
Table 2

1  0  0  0  0  0  0  0   1  0  0  0  0  0  0  0
1  0  0  0  0  0  0  0  -1  0  0  0  0  0  0  0
0  1  0  0  0  0  0  0   0  1  0  0  0  0  0  0
0  1  0  0  0  0  0  0   0 -1  0  0  0  0  0  0
0  0  1  0  0  0  0  0   0  0  1  0  0  0  0  0
0  0  1  0  0  0  0  0   0  0 -1  0  0  0  0  0
0  0  0  1  0  0  0  0   0  0  0  1  0  0  0  0
0  0  0  1  0  0  0  0   0  0  0 -1  0  0  0  0
0  0  0  0  1  0  0  0   0  0  0  0  1  0  0  0
0  0  0  0  1  0  0  0   0  0  0  0 -1  0  0  0
0  0  0  0  0  1  0  0   0  0  0  0  0  1  0  0
0  0  0  0  0  1  0  0   0  0  0  0  0 -1  0  0
0  0  0  0  0  0  1  0   0  0  0  0  0  0  1  0
0  0  0  0  0  0  1  0   0  0  0  0  0  0 -1  0
0  0  0  0  0  0  0  1   0  0  0  0  0  0  0  1
0  0  0  0  0  0  0  1   0  0  0  0  0  0  0 -1
The power spectrum of any of these columns shows the filtering effect in the frequency domain. The same is true for the next eight columns, which correspond to the HP filter. The power spectra are given in Fig. 11(a). The basis transformation can be calculated as a convolution, or as a matrix multiplication. When we call the matrix of Table 2 W, and the signal x, that multiplication is simply:

c = W'x    (1)
where c is a concatenation of the approximation vector a and the detail vector d:

c = [a1 ... a8 d1 ... d8]^T    (2)
To transform back, i.e. to reconstruct the signal starting from the vector c, we calculate:

x = (W'W)^(-1) W c    (3)
The inversion of W'W is required because W is orthogonal, but not orthonormal. The diagonal elements of W'W are equal to 2, hence the inversion leads to a diagonal matrix with diagonal elements equal to 1/2.
Fig. 11 Power spectra of the different stages in the pyramid algorithm. (a) First application of a pair of filters to the 16-point signal; (b) the filters applied to the low-frequency part of (a); (c) and (d) further cutting up of the lower frequencies, analogous to (b).
Let us look in some detail at what happens in the calculations of Eqs (1) and (2). Eq. (1) can be elaborated to:

a1 = x1 + x2        d1 = x1 - x2
a2 = x3 + x4        d2 = x3 - x4

When going back, we can easily obtain from Table 2 that:

x1 = (a1 + d1)/2 = (x1 + x2 + x1 - x2)/2 = (2 x1)/2 = x1
x2 = (a1 - d1)/2 = (x1 + x2 - x1 + x2)/2 = (2 x2)/2 = x2
Apart from the scaling by (W'W)^(-1), the calculation required for undoing the effect of the LP-HP filter pair, i.e. to transform back, boils down to applying the same pair to a reordered version of the output obtained in the forward transform. We recognise the [1 1] and [1 -1] impulse responses in the rows of Table 2, although their elements are further apart in the matrix. What we wanted to show here is that the back transform can also be performed as a convolution. The basis of Table 2 corresponds to the frequency tiling given in Fig. 12(a). There is a good time localisation, but a very coarse frequency resolution. We will now repeat the LP and HP filter operations on the elements of the column "a" of Table 1. The calculation is summarised in Table 3. The code "aa" means that we have applied the LP filter to "a". The first element contains the sum of the first two elements of "a". The column "da" contains the output of the HP filter as applied to the column "a". Its first element is the difference between the first two elements of "a". The eight elements of the columns "aa" and "da" of Table 3 give an alternative description for the eight elements of the column "a". In other words, the elements of "aa", "da" and "d" completely describe the signal. The elements of "aa" and "da" could be calculated directly from the signal. The first element of "aa" is the sum of the first four elements of the signal. Analogously, the first element of "da" is the difference between the sum of the elements one and two of the signal and the sum of the elements three and four of the signal. We can again associate basis functions with this calculation. The total basis for the signal is obtained by replacing the basis functions for "a", i.e. the first eight columns of Table 2, by the basis functions for "aa" and "da". The result is given in Table 4. The basis functions inherited from Table 2 are the last eight columns.
Fig. 12 Time-frequency domain tiling for the pyramid algorithm. (a) First application of a pair of filters to the 16-point signal; (b) the filters applied to the low-frequency part of (a); (c) and (d) further cutting up of the lower frequencies, analogous to (b).
Table 3

a                  aa       da
0.224   0.329      0.554   -0.105
1.082   2.001      3.083   -0.919
1.929   0.767      2.696    1.162
0.348   0.067      0.414    0.281
In contrast to the basis of Table 2 and Fig. 12(a), this is no longer a short-time Fourier basis. Note that the application of an HP filter on the output of an LP filter effectively constitutes a band-pass (BP) filter.
Table 4

1  0  0  0   1  0  0  0   1  0  0  0  0  0  0  0
1  0  0  0   1  0  0  0  -1  0  0  0  0  0  0  0
1  0  0  0  -1  0  0  0   0  1  0  0  0  0  0  0
1  0  0  0  -1  0  0  0   0 -1  0  0  0  0  0  0
0  1  0  0   0  1  0  0   0  0  1  0  0  0  0  0
0  1  0  0   0  1  0  0   0  0 -1  0  0  0  0  0
0  1  0  0   0 -1  0  0   0  0  0  1  0  0  0  0
0  1  0  0   0 -1  0  0   0  0  0 -1  0  0  0  0
0  0  1  0   0  0  1  0   0  0  0  0  1  0  0  0
0  0  1  0   0  0  1  0   0  0  0  0 -1  0  0  0
0  0  1  0   0  0 -1  0   0  0  0  0  0  1  0  0
0  0  1  0   0  0 -1  0   0  0  0  0  0 -1  0  0
0  0  0  1   0  0  0  1   0  0  0  0  0  0  1  0
0  0  0  1   0  0  0  1   0  0  0  0  0  0 -1  0
0  0  0  1   0  0  0 -1   0  0  0  0  0  0  0  1
0  0  0  1   0  0  0 -1   0  0  0  0  0  0  0 -1
This can be seen by looking at the power spectrum of any of the columns 5 to 8 of Table 4. In Fig. 11(b), this power spectrum is given together with that of one of the first four columns, the ones that correspond to the application of the LP filter to the output of a previous LP filter. The power spectrum of HP after LP has the shape of a peak. Only that band of frequencies is passed. The repeated LP operation still has an LP character. The sum of the two power spectra gives the same characteristic as the first columns of Table 2, i.e. the LP curve of Fig. 11(a). Transforming back to the original basis can be done using Eq. (3) when we associate the transformation matrix with Table 4. This is a direct one-step operation. If we implement the filters as convolutions, going back becomes a two-step process just as going forwards. First we undo the second step, then the first. The tiling corresponding to Table 4 is given in Fig. 12(b). The upper half of the domain is covered as in Fig. 12(a), but in the lower half the frequency resolution is doubled and the time localisation is cut in two.
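The repeated sums and differences are easy to automate. The sketch below is illustrative only: it keeps the unnormalised (1 1) and (1 -1) pair of this example rather than orthonormal Haar filters, and carries the decomposition through to the end. Applied to the 16-point signal, its successive detail vectors reproduce the columns d, da, daa and daaa of Tables 1, 3, 5 and 7, up to the rounding of the printed values.

```python
import numpy as np

def haar_pyramid(x):
    """Unnormalised Haar pyramid: pairwise sums (a) and differences (d)."""
    a = np.asarray(x, dtype=float)
    details = []
    while len(a) > 1:
        d = a[0::2] - a[1::2]   # HP filter (1 -1), stepped by two points
        a = a[0::2] + a[1::2]   # LP filter (1  1), stepped by two points
        details.append(d)
    return a, details

# The 16-point signal of Table 1.
signal = [0.128, 0.097, 0.096, 0.233, 0.306, 0.776, 0.841, 1.160,
          1.007, 0.922, 0.462, 0.305, 0.322, 0.025, 0.092, -0.025]

a, details = haar_pyramid(signal)
print(a)           # the overall sum of the signal: 6.747, cf. Table 7
print(details[1])  # the "da" column of Table 3, up to rounding
```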
We repeat the filter operations once more, obtaining the results given in Table 5.

Table 5

aa                 aaa      daa
0.554   3.083      3.637   -2.529
2.696   0.414      3.111    2.282

The coding is an extension of what was previously introduced. The first element of "aaa" is the sum of the first two elements of "aa", i.e. the sum of the first eight elements of the signal. The first element of "daa" is the difference of the first two elements of "aa", or the difference between the sum over the points 1 through 4 of the signal and the sum over the points 5-8 of the signal. The column "aa" can be replaced by the columns "aaa" and "daa". That means that the basis functions corresponding to those elements of "aa", i.e. the first four columns of Table 4, can be replaced by a new set of basis functions. The overall basis is given in Table 6, and the tiling in Fig. 12(c). Fig. 11(c) shows the power spectra of the first four columns of Table 6.

Table 6

1  0   1  0   1  0  0  0   1  0  0  0  0  0  0  0
1  0   1  0   1  0  0  0  -1  0  0  0  0  0  0  0
1  0   1  0  -1  0  0  0   0  1  0  0  0  0  0  0
1  0   1  0  -1  0  0  0   0 -1  0  0  0  0  0  0
1  0  -1  0   0  1  0  0   0  0  1  0  0  0  0  0
1  0  -1  0   0  1  0  0   0  0 -1  0  0  0  0  0
1  0  -1  0   0 -1  0  0   0  0  0  1  0  0  0  0
1  0  -1  0   0 -1  0  0   0  0  0 -1  0  0  0  0
0  1   0  1   0  0  1  0   0  0  0  0  1  0  0  0
0  1   0  1   0  0  1  0   0  0  0  0 -1  0  0  0
0  1   0  1   0  0 -1  0   0  0  0  0  0  1  0  0
0  1   0  1   0  0 -1  0   0  0  0  0  0 -1  0  0
0  1   0 -1   0  0  0  1   0  0  0  0  0  0  1  0
0  1   0 -1   0  0  0  1   0  0  0  0  0  0 -1  0
0  1   0 -1   0  0  0 -1   0  0  0  0  0  0  0  1
0  1   0 -1   0  0  0 -1   0  0  0  0  0  0  0 -1
The first two columns have an LP characteristic; the following two have a band-pass characteristic. The filter operation can be repeated one last time, as summarised in Table 7.

Table 7

aaa                aaaa     daaa
3.637   3.111      6.747    0.526

The one element of column "aaaa" is the sum over all the elements of the signal. The corresponding basis function is a column of ones. The element of "daaa" is the difference between the first half of the signal and the second half of the signal. Together, "aaaa" and "daaa" are an alternative description for "aaa". They correspond to an alternative set of basis functions for the first two columns of Table 6. The overall basis is given in Table 8, and the corresponding tiling in Fig. 12(d).
Table 8

1  1   1  0   1  0  0  0   1  0  0  0  0  0  0  0
1  1   1  0   1  0  0  0  -1  0  0  0  0  0  0  0
1  1   1  0  -1  0  0  0   0  1  0  0  0  0  0  0
1  1   1  0  -1  0  0  0   0 -1  0  0  0  0  0  0
1  1  -1  0   0  1  0  0   0  0  1  0  0  0  0  0
1  1  -1  0   0  1  0  0   0  0 -1  0  0  0  0  0
1  1  -1  0   0 -1  0  0   0  0  0  1  0  0  0  0
1  1  -1  0   0 -1  0  0   0  0  0 -1  0  0  0  0
1 -1   0  1   0  0  1  0   0  0  0  0  1  0  0  0
1 -1   0  1   0  0  1  0   0  0  0  0 -1  0  0  0
1 -1   0  1   0  0 -1  0   0  0  0  0  0  1  0  0
1 -1   0  1   0  0 -1  0   0  0  0  0  0 -1  0  0
1 -1   0 -1   0  0  0  1   0  0  0  0  0  0  1  0
1 -1   0 -1   0  0  0  1   0  0  0  0  0  0 -1  0
1 -1   0 -1   0  0  0 -1   0  0  0  0  0  0  0  1
1 -1   0 -1   0  0  0 -1   0  0  0  0  0  0  0 -1
Table 9

Column of Table 8    Coefficient
 1                   aaaa
 2                   daaa
 3                   daa(1)
 4                   daa(2)
 5                   da(1)
 6                   da(2)
 7                   da(3)
 8                   da(4)
 9                   d(1)
10                   d(2)
11                   d(3)
12                   d(4)
13                   d(5)
14                   d(6)
15                   d(7)
16                   d(8)
Fig. 13 Illustration of the application of the pyramid algorithm for the numerical example.
Table 9 lists the coefficients that correspond to the columns of Table 8. Fig. 11(d) shows the power spectra of the first two columns. The first column is simply zero frequency, or the offset of the signal. The second column consists primarily of the base frequency, but contains also some small contributions from higher frequencies. Obtaining the coefficients listed in Table 9 by repeated application of a set of LP and HP filters is like peeling an onion. In each step, the high-frequency coat is peeled away and a low-frequency core remains. The core can be regarded as an approximation of the signal, and the coat that is peeled off as the details to that approximation. The peeling process is known as the pyramid algorithm or the fast wavelet transform. The pyramid nature can be visualised using yet another representation of the data, a pile of graphs as shown in Fig. 13. The upper graph represents the signal to be transformed. On the row below, we find the output of the LP filter, the approximation, to the left, and that of the HP filter, the details, to the right. We see that the signal is a noisy Gaussian, and that its approximation is a smoothed version of that. On the third row, the story repeats itself, this time applied to the approximation box of the second level. The approximation of the approximation to the left, and its details to the right. And so forth, until only one point remains to the left, and one to the right, the one to the left being the sum over the entire signal. Note that a point in a graph does not per se follow from the points that are found directly above it in Fig. 13. The right-most point in the approximation, e.g., results from the right-most points in the upper graph, hence from points that are positioned above the details. The position of the graphs is meant to visualise the reduction in points, and the fact that we zoom in on the approximation. Note also that one is not obliged to continue to the bitter end. When we are after a not too coarse approximation of the signal, we can stop at any level that satisfies us. In the example, this could already be at the first approximation. The basis listed in Table 8 is the simplest wavelet basis, and the associated transform is called the Haar transform. Surprisingly, other wavelet bases will lead to the same frequency tiling as we just found, i.e. to Fig. 12(d). The only way we can picture the difference between wavelets is by considering the fact that the tile boundaries are not always as sharp as they are drawn. Depending on the shape of the wavelets, the tiles are more or less blurred. More complex wavelets have in common with the Haar wavelet that they correspond with a complementary pair of LP and HP filters that cut the
53 frequency domain in the middle. This property is directly related to the orthogonality of the basis functions constructed from the impulse responses of those filters. The Haar impulse responses are the most localised ones that exist, as they differ from zero on only two points, which corresponds to a very dull cut-off of the filters in the frequency domain. In other words: sharp vertical tile boundaries and very blurred horizontal ones. All other wavelet bases will have sharper cut-offs and wider impulse responses, i.e. the vertical tile boundaries get blurred as well, and the horizontal ones get sharper. The wider wavelet basis functions have no resemblance to the series of ones and minus ones of the Haar basis, but they do look rather like the short-time Fourier basis functions for wider windows. For an impulse response that differs from zero on, let us say, four points, several aspects of the pyramid algorithm are less obvious. We need to be able to drop half the points and still represent the signal using the output of the LP and HP filters. In other words, we need to step the linear convolution of signal and impulse response by two points. The Haar wavelet basis is also special in the sense that, as the impulse responses are only two points wide, we do not lose points at the extremes when performing a linear convolution. For wider impulse responses, something has to be done about those extremes, e.g. a circular convolution, which puts an additional constraint on the shapes of those impulse responses.
4 The wavelet packet transform
A wavelet basis allows a time-frequency analysis similar to that of a short-time Fourier basis. It differs in that its time localisation is better (and hence its frequency localisation worse) for high frequencies than for low
Fig. 14 (a) The pyramid algorithm as an upside-down pyramid of boxes; (b) The wavelet packet decision tree as the full set of boxes.
frequencies. It is easy to derive a more flexible tool based on the pyramid algorithm. In that algorithm, a pair of LP and HP filters is repeatedly applied, first to the signal, then to the output of the previous pass of the LP filter. Now imagine that the pair of filters is applied to the output of the previous pass of the HP filter. In that case we would improve the frequency resolution of those high frequencies, hence reduce their time localisation. Once armed with this idea, we have a choice at each stage of the algorithm to continue improving the frequency localisation of the low frequencies and/or the high frequencies. The special filter pair guarantees that their combined outputs give an alternative representation of their input. In other words, whatever path we take through the decision tree, the combination of all the last outputs represents an orthogonal basis for the signal. The transform associated with this family of bases is called the wavelet packet transform.
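A sketch of the full decision tree (ours, again with the unnormalised Haar pair purely for illustration): at every level the pair of filters is applied to every box of the previous level, which generates the complete grid of Fig. 14(b).

```python
import numpy as np

def packet(x, depth):
    """Full wavelet packet tree built with the unnormalised Haar pair."""
    nodes = [np.asarray(x, dtype=float)]
    tree = [nodes]
    for _ in range(depth):
        nxt = []
        for node in nodes:
            nxt.append(node[0::2] + node[1::2])  # LP output
            nxt.append(node[0::2] - node[1::2])  # HP output
        nodes = nxt
        tree.append(nodes)
    return tree  # tree[level] holds the 2**level boxes of that row

x = np.random.randn(32)
tree = packet(x, 3)
print([len(level) for level in tree])  # [1, 2, 4, 8] boxes per row
```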
Fig. 15 Some selections of the grid of Fig. 14(b) representing wavelet packet bases (left), and the corresponding tilings of the time-frequency domain (right).
The decision tree can be illustrated using the pyramid representation of Fig. 13 as the starting point. Instead of the graphs of a signal being analysed, we will show only boxes the size of the graphs they could contain. Fig. 14(a) gives this representation. In the case of the pyramid algorithm we have a single large box on top, picturing the signal, and two smaller ones below that together are as wide as the one on top, picturing the first pass of the filter pair. The LP output to the left, the HP to the right. This structure is repeated for the LP output. In Fig. 14(b), this structure has been complemented with all other possible continuations. Each box can be cut in two, so each following row has twice the number of boxes, but the boxes are half as wide. The scheme looks like a tiling of the time-frequency domain, but it is not. All the boxes on the left-hand edge of the grid correspond to pure LP filters. In the same way, the right-hand edge corresponds to HP filters. All other boxes are related with band-pass filters. The overall grid of boxes does not correspond to a basis; it is highly redundant. A basis is obtained only for special selections of boxes, viz. those selections that completely cover the horizontal dimension without overlap. Each such selection, i.e. each basis, corresponds to a specific tiling of the time-frequency domain. Fig. 15 shows some arbitrary wavelet packet bases and their corresponding tilings. Using the wavelet packet transform, we can zoom in on any frequency band. But of course, as we zoom in, the information obtained becomes less localised in the time domain. Being able to zoom in is a nice feature, but what if one does not know what to zoom in on, which is the most likely situation in chemical applications? We do not usually know what tiling of the time-frequency domain is most suited for, let us say, our NIR spectrum. Fortunately, one does not need to know this in advance. Techniques exist that select the best basis for a particular situation from the wealth of bases offered by wavelets and wavelet packets. In practice, we may not even need a basis. A basis allows us to go back to the domain of the original signal, but sometimes there is no need to go back, e.g. when we want to extract features from our signals.
CHAPTER 3
Fundamentals of Wavelet Transforms

Y. Mallet, O. de Vel and D. Coomans
Statistics and Intelligent Data Analysis Group, School of Computer Science, Mathematics and Physics, James Cook University, Townsville, Australia
1 Introduction
This chapter discusses various aspects of the wavelet transform when applied to continuous functions or signals. It may seem strange that this book has a chapter dealing with wavelet analysis for continuous functions since, in practice, most of us will be dealing with discretised functions or discrete signals. Here we think of a discrete signal as being equivalent to a discrete near-infrared spectrum, for example. Whilst continuous functions are not practical, they do assist substantially in providing an understanding of the basic concepts and theories associated with wavelet analysis. A link actually emerges between wavelet transforms for continuous functions and discrete signals. Before we embark on a discussion of the fundamentals of wavelet transforms associated with continuous functions, we diverge momentarily to present a simple example and to provide motivation for the use of wavelet methods in signal analysis. This example does not use a continuous function, but rather a sine curve which has been sampled 512 times.
Example 1 Fig. 1 plots the function sin(2t) which has been sampled 512 times in [-π, π]. The sine curve on the right has a small disturbance at t = 1.5. Below each of the sine curves are the Fourier coefficients and the wavelet coefficients. Since the Fourier coefficients are complex, the magnitudes of the coefficients are shown. A noticeable feature of the Fourier coefficients is the large coefficient at the second index, which reflects the period of the sine curve. This is evident in the Fourier coefficients from both the original and disturbed signal. There is little visible difference between these sets of Fourier coefficients. The disturbance occurring at t = 1.5 has been spread across all of the Fourier coefficients. The wavelet coefficients are, however, able to detect the disturbance.
Fig. 1 Fourier and wavelet coefficients of a sampled sine signal, with (right) and without (left) a small disturbance.
What is also appealing is that the change in wavelet coefficients occurs in approximately the same region as the disturbance in the sine curve. The large wavelet coefficients occurring at the edges of the graphs are due to end effects. End effects are discussed in greater detail in Chapter 4. Example 1 illustrates one of the main advantages associated with wavelets: their ability to detect changes that occur over a short duration in a signal, and more importantly, when the changes occur. Although not demonstrated in Example 1, wavelets are also able to detect changes that occur over a longer duration in the signal. We now set out to discuss in more detail the fundamentals of the wavelet transform. To avoid confusion, it should be stated that much of the theory of wavelets has evolved from continuous functions, so wavelets in this chapter will be addressed using functions which are continuous (unlike the example shown above in Fig. 1). Wavelets form a set of basis functions which linearly combine to represent a function f(t) from the space of square integrable functions L2(R).
Functions in this space have finite energy, that is, ∫_{-∞}^{+∞} f²(t) dt < ∞. Since wavelet basis functions linearly combine to represent functions from L2(R), they are from this space as well. It is important to mention that other spaces could have been chosen. The reason for choosing this space largely relates to the properties of the L2 norm. This is explained in more detail in Strang and Nguyen [1, pp. 26-27]. The wavelet basis functions are derived by translating and dilating one basic wavelet, called a mother wavelet. The dilated and translated wavelet basis functions are called children wavelets. The wavelet coefficients are the coefficients in the expansion of the wavelet basis functions. The wavelet transform is the procedure for computing the wavelet coefficients. The wavelet coefficients convey information about the weight that a wavelet basis function contributes to the function. Since the wavelet basis functions are localised and have varying scale, the wavelet coefficients therefore provide information about the frequency-like behaviour of the function. This chapter proceeds by introducing the continuous wavelet transform and conditions required for invertibility in Section 2, and Section 3 discusses the inverse or reconstruction formula. In Section 4 we mention that the continuous wavelet transform is difficult to implement because both the transform and the parameters are continuous. It is possible to discretise the parameters and avoid redundancy. The result is a discrete wavelet transform: discrete because the parameters in the transform are discrete; the function or signal to be analysed is still continuous. Section 5 discusses the concept of multiresolution analysis (MRA) and links scaling and wavelet functions. This section also shows how functions can be represented at different resolutions using combinations of wavelets and scaling functions. When the wavelets are orthonormal it is possible to derive from the MRA an efficient algorithm for computing wavelet coefficients. This is the fast wavelet transform, and is discussed in Section 6. Whilst much of the discussion in this chapter focuses on wavelets that are orthonormal, in Sections 6 and 7 we mention that wavelets need not be orthogonal, and that other kinds of wavelets exist, such as biorthogonal and semiorthogonal wavelets.
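Example 1 can be reproduced along the following lines. This sketch is ours and assumes the PyWavelets package (pywt) is available; the chapter does not say which wavelet family generated Fig. 1, so the Daubechies choice 'db4' is an assumption.

```python
import numpy as np
import pywt  # assumption: the PyWavelets package is installed

t = np.linspace(-np.pi, np.pi, 512)
clean = np.sin(2 * t)
disturbed = clean + 0.5 * np.exp(-((t - 1.5) / 0.02) ** 2)  # local bump

def top10_share(v):
    """Fraction of the total absolute change carried by the 10 largest entries."""
    v = np.sort(np.abs(v))[::-1]
    return v[:10].sum() / v.sum()

# Fourier: the disturbance is smeared over many coefficients.
d_fourier = np.abs(np.fft.rfft(disturbed)) - np.abs(np.fft.rfft(clean))

# Wavelets: the change stays concentrated in coefficients near t = 1.5.
d_wavelet = (np.concatenate(pywt.wavedec(disturbed, 'db4'))
             - np.concatenate(pywt.wavedec(clean, 'db4')))

print(top10_share(d_fourier), top10_share(d_wavelet))
```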
2 Continuous wavelet transform
Before introducing the continuous wavelet transform, we first recall some details about Fourier transforms. Let f(t) represent a signal from the L2(R)
class of functions, then the continuous (integral) Fourier transform of f(t) is written

F_FT(ω) = ∫_{-∞}^{+∞} f(t) e^{-jωt} dt    (1)
where t ∈ R and j = √(-1). Eq. (1) states that in order to obtain information about a single frequency ω, it is necessary to integrate over the entire signal. Essentially, the Fourier transform takes our signal from the time domain into a frequency domain. Since the Fourier coefficients involve complex numbers, the magnitude of the coefficients F_FT(ω) is often plotted against the frequency ω, as shown in Fig. 2. A disadvantage of the Fourier transform is that any isolated frequency changes in the signal are averaged with the frequencies across the remainder of the signal. This makes it difficult to extract frequency information relative to time. The windowed Fourier transform [2,3] (also called the short time Fourier transform) was introduced so that the frequency information about a signal could be localised with respect to time. Instead of analysing the function f(t) as a whole, the windowed Fourier transform performs a Fourier transform on pieces of the function. The pieces are obtained by using a windowing function G(t) which slides across the function. The windowed Fourier transform of f(t) is defined as
F_WFT(ω, b) = ∫_{-∞}^{+∞} f(t) G(t - b) e^{-jωt} dt,    ω, b ∈ R    (2)
Fig. 2 The Fourier transform of a function f(t) takes a function from a time domain into a frequency domain.
Fig. 3 The windowed Fourier transform of f(t) obtains frequency information of the function for constant time intervals.
The windowed Fourier coefficients are functions of the frequency ω as well as of the translated position of the window b. Fig. 3 demonstrates how the window slides across the function, and thus, how it is possible to obtain information about the frequency content of the signal over some constant time interval. Window functions place more weight on the part of the function which is central to the window, and less weight as the function nears the border of the window. The windowing function should have a finite integral and be non-zero over a finite interval. One example of a windowing function is mathematically described as follows:

G(t) = 1 + cos(πt)  for -1 ≤ t ≤ 1,  and  G(t) = 0  otherwise
This window function is presented in Fig. 4.
Fig. 4 An example of a window function.
In many instances, the procedure for determining the width of the windowing function is not straightforward. If a window width is too small or too large, then important information may still remain undetected or become distorted. Thus, the choice of window width is not automatic. Indeed, different window lengths may be needed for different parts of a function. For example, parts of a function that change slowly over time will need a long window, while parts of a function which change quickly will need a shorter window length. This problem is reflected in window B in Fig. 3. Within this window there are two frequency components that may go unrecognised or be distorted due to an inappropriate window length. The continuous wavelet transform differs from the windowed Fourier transform in that it allows us to view the signal through windows whose widths vary in size. The continuous wavelet transform convolves the function f(t) with translated and dilated versions of a single basis function ψ(t). The basis function ψ(t) is often called a mother wavelet. The various translated and dilated versions of the mother wavelet are called children wavelets. The children wavelets have the form ψ((t - b)/a), where a is the dilation parameter which squeezes or stretches the window. The continuous wavelet transform is written
i
3(;
FcwT(a, b) - [a] -1/2
f(t)~ ( t - ba )
dt
a. b E R, a =/=0.
(3)
--2X2
Eq. (3) is an inner product of the signal f(t) with the children wavelets multiplied by a constant. The inner product of two real functions f, h c L 2(R) is given by (f, h) - f_~ f(t)h(t)dt. The larger the scale or dilation parameter the narrower the window, and vice versa. The factor a1-1/2 is included so that the rescaled wavelets all have equal energy, that is, [ ] ~ ( t - ]bl)a
-][~(t)l[
Note: For clarification of terminology, we refer to the "wavelet transform" as being the procedure for producing the wavelet coefficients. When the function f(t) is represented as a linear combination of the wavelet coefficients and wavelet basis functions this is referred to as the "'wavelet series representation" or "wavelet decomposition" of f(t). This is discussed in greater detail in Section 5.
63 Fig. 5 shows some wavelet functions which are translated and dilated by different amounts. Notice that they all possess the same shape and differ by the amount by which they are translated and dilated. There exist many kinds or families of wavelets. The wavelets shown in Fig. 5 are wavelets from the Daubechies family, named after Ingrid Daubechies. Fig. 6 indicates that by translating and dilating the mother wavelet, localised information about high and low frequency events can be obtained. It should be mentioned that we use the term "frequency" loosely when talking about wavelet transforms, since it is not really frequency that we are describing but rather low and high scale events. 3
Inverse wavelet transform
A desirable property of any transform is to be able to revert from the transformed function into the original function. An inverse transform exists for the continuous wavelet transform. The original function can be reconstructed using
translate
dilate
dilate
translate
I ti
Fig. 5 An example of dilating and translating wavelets from the Daubechies family.
64
,- ................................
The
|
window changes
width
of
the
f(t)
Short time intervals for high frequency events
v
time ~
J
'o f(t) Wavelet Transform
o= g r
i. v
time
,
time
,, ', |
r ........................
|
s i
/
Long time intervals for low frequency events
f(t)
/ \\ \..
-> F,-
timq
Fig. 6 The continuous wavelet transform of f ( t ) obtains localised.~'equencv in/brmation of the function for varying constant time intervals.
oc
f(t)-lc / --~
~r
J ' F c w T ( a ' b ) ] a l - ' / z ~ ( t - d abd) b a a 2 --DC
where the constant c = 2Tt f IqJfTI2 I~1" For c to be finite, the Fourier transform of the mother wavelet should equal to zero at o3 - 0, i.e. qIFf(0 ) -- 0 and ~(t) oscillates so that its integral is zero. A decaying function ~(t) with f qt(t) - 0 is a suitable wavelet for the continuous wavelet transform [1]. It is not necessary to perform the continuous wavelet transform for all values of a and b, since f(t) can be reconstructed from a much sparser set of (a, b)
65 values. In fact, it is possible to obtain an analysis which is just as accurate, and more efficient, by using discrete values for the parameters a and b. This leads to the discrete wavelet transform. (We still speak in terms of a continuous function.)
4
Discrete wavelet transform
Restricting the parameters a and b to represent the discrete measures a = m -j
b = m-Jk
where j, k E Z, m _> 2, m E Z +, then the discrete wavelet transform is defined ~X2
FDWT(j, k) -- m j/2 /
f(t)~(m j - k) dt
j, k E Z
--3C
Typically, m -- 2 [4,5,6] so that, a = 2 -j
b = 2-Jk
in which case the mother wavelet is stretched or compressed by factors of two. Wavelets with m > 2 are sometimes referred to as higher multiplicity wavelets see Chapter 8. Our immediate discussion will however assume that m = 2 unless otherwise stated. The main difference between the continuous wavelet transform and the discrete wavelet transform (of continuous functions) is that the wavelet is stretched or dilated by 2 -j for some integer j, and translated by 2-Jk for some integer k. For example if j = 2, the children wavelets will be dilated by 2 -2 - 8 8 and translated by 1 k.
5
Multiresolution analysis
Multiresolution analysis (MRA) [7,8,9] provides a concise framework for explaining many aspects of wavelet theory such as how wavelets can be constructed [1,10]. M R A provides greater insight into the representation of functions using wavelets and helps establish a link between the discrete wavelet transform of continuous functions and discrete signals. M RA also allows for an efficient algorithm for implementing the discrete wavelet transform. This is called the fast wavelet transform and follows a pyramidal
66 scheme. Of course it should be stated that M R A still exists in the absence of wavelets, and that wavelets need not be associated with a multiresolution. However, the wavelets which we prefer to use, i.e. those with compact support (non-zero over a finite interval), will, in most instances be generated from an MRA. For these reasons it is desirable to have wavelets which satisfy the properties of a multiresolution. Consider the following example which presents some concepts that we will use when we explain the idea of a multiresolution analysis in greater detail. Example 2
Let V0 be a subspace that consists of all functions which are constant on unit intervals k _< t < k + 1 for any k E Z. These intervals are denoted by . . . , [ - 3 , - 2 ) , [-2, 1), [ - 1 , 0 ) , [0, 1), [1,2), [2, 3 ) , . . . An example of such a function is depicted in Fig. 7. You will notice that if we shift f(t) along by 1, then this function still remains in the same space, V0. Hence, if f(t) E V0, then f(t + 1) is also in V0. This property is called a shift invariance or a translation invariance property. Integer translates of any function remain in the same space - this is more generally stated: if f(t) E V0, then f ( t - k) E V0. Notice that if we rescale f(t) by a factor of 2, then this function will be constant on [~, k~__21).The function f(2t)is then in V1. If we translate f(2t) by half an integer, then this function remains in V1. This is demonstrating shift
r
9
-4-
9 " .~
0
-'2
6
.... ~
;~
t
Fig. 7 Example of a piecewise constant function, constant oll integer intervals.
67 invariance again, but at a different scale. Following this pattern, the functions in V2 are constant at [k,k]____!l), the functions in V3 are constant at [gk,k~___tl),and so on. Decreasing the resolution we can say that functions which change value at every second integer, i.e. are constant on [2k, 2(k + 1)] correspond to the space V_~. Note, from this example that the subspaces are nested i.e. V_~ c V0 c V I. Example 3
Fig. 8 depicts the scaling property for another piecewise constant function f(t) over an integer interval. You will observe that f(2 -1 t) is an element of the space V-1 and is "twice as stretched" as the function f(t) which is in the space V0. Conversely, f(2t) is an element of the space V I and is "twice as squashed" as the function f(t). Again the subspaces are nested i.e. V_I c V0 c V l.
f ( 2 1 t ) E V-t
bl
f(t)r
I
f(2t ) e Vl
i
i~
i/2
i
~,
Fig. 8 Piecewise constant functions in
V _ l , V O, V 1 .
68 Now that we have introduced some terminology we continue with the explanation of multiresolution analysis. As the name suggests, M RA allows us to represent functions at different resolutions, which can be likened to wavelets analysing functions through different size windows. A multiresolution divides the space of all square integrable functions L2(R) into a nested sequence of subspaces {Vj}jEz. Each subspace corresponds to a particular scale, just like the functions in Examples 2 and 3 are at different scales in V_ 1, V0 and V1. The subspaces corresponding to the different scales provide the key for representing functions from L 2(R) at different resolutions. The reason being, given some function f(t) E L2(R) then f(t) has pieces in each subspace. Let fvj denote the piece of f(t) deposited in Vj, then fv, is an approximation of f(t) at resolution 2J. We also define fv, to be an orthogonal projection of f(t) onto fvj. This implies that fv, will be the closest approximation of f(t) at resolution 2J, mathematically this is expressed V g(t) E Vj,
IIg(t)- f(t)ll > Ilfvj- f(t)ll.
The subspace Vj contains all the possible approximations of functions in LZ(R) at resolution 2J. For the subspaces to generate a multiresolution, they must satisfy some conditions. It has already been mentioned that the subspaces are nested, this means that Vj E Z, Vj C gj+l. That is, a function at a lower resolution can be represented by a function at a higher resolution. Since information about a function is lost as the resolution decreases, eventually the approximated function will converge to 0, i.e., limj_._~fvj- 0, and the intersection of all subspaces Vj is equal to {0}, or, ["lJ=~ j=_~ Vj - {0}. Conversely, as the resolution increases the approximated function gets progressively closer to the original function l i m j ~ f v j - f(t), and U j=~ vj is dense in L2(R) that is, the space L2(R) is a closure of the union of all subspaces Vj. Where do these subspaces come from? The subspaces {Vj} can be generated from each other by scaling the approximated functions in the appropriate subspace such that, f(t) E Vj ~:# f(2t) E Vj+,,
j E Z.
It can also be stated that integer translates of the approximated functions, remain in the same subspace: f(t) E V j ~ f ( 2 t - k ) E V j ,
j, k E Z .
69
Summarising, the sequence of subspaces {Vj}jc z is a multiresolution of L2(R) if the following conditions are satisfied:
l. Vj C Vj+l j=~ Vj - {0}, [.Jj=_~ j=~ vj is dense in L 2(R) 2. ["]j=_~ 3. f(t) E Vj r f(2t) E Vj+I, 4. f(t) E Vj r
f ( 2 t - k) E Vj,
jEZ j, k E Z
Theorem. If {Vj}jEz is a rnultiresolution of L2(R), then there exists a unique function ~(t) E LZ(R), called a scaling function such that {(~j.k(t) -- 2J/2(~(2Jt- k)} is an orthonormal basis of Vj [8].
Example 4 fv~(t)- ~k~-~ak4~(2t-k) constant.
is the part of f(t) in Vl, where ak is some
Example 5 If we wanted to construct a basis that could be used to represent any piecewise constant function in V0 a simple choice would be the box function (see Fig. 9)" 1 for0< t < 1 d~(t)- 0 otherwise
d)(t)
Fig. 9 The box function 4)(t).
70 This then implies that any function in Vj can be represented by a linear combination of the {~j.k(t)}. Hence, the orthogonal projection of f(t) E L2(R) into gj can be expressed as 3(2'
fvj(t) - ~
cj.k q~j.k(t).
--:3C
The coefficients Cj.k are called scaling coefficients. Since V0 c V~, 3O
4(t)- v'5
lk+(2t- k)
(4)
k=-
Eq. (4) is often referred to as the dilation equation.
Example 6 For the box function 1 0 - l l - l/x~2, thus ~ ( t ) - ~(2t)+ ~ ( 2 t - 1), this is clearly demonstrated in Fig. 10. So how do wavelets enter the picture? Wavelets are basis functions which can be used to represent the information lost in approximating a function at a lower resolution. This difference is called the detailed part of the function. We prefer that this error lies in the orthogonal complement of the Vj's. Consider the difference between approximating a function at resolution 2J and at 2j+l . This difference will lie in the orthogonal complement of Vj which is denoted by Wj such that,
vj+ - vj
4ff2t-l)
~(t) ~(2t)1
1
1
0
1
0
1/2
1
0
Fig. 10 Haar scaling basis functions.
1/2
1
71
In terms of the functions in the subspaces, then
fwj is
fVj+l -- fvj "j-fwj
the orthogonal projection of f(t) into Wj. Further decomposing fv, produces J fvj+I -- fvj_I -J-fwj_I 'J-fwj -- ~
fw i 9
i=-~
Then for some function f(t) we have DC
f(t) - fvj + ( f ( t ) - fvj) - fvj +
Zrwi i=j
and one can then understand how a multiresolution allows us to represent a function at various resolutions. Next, consider how we can represent each fwj. In order to represent the orthogonal projection of f(t) into Wj, it is convenient if we have an orthonormal basis for Wj, just as we had an orthonormal basis for Vj. It can be shown [8] that provided { ~ j . k ( t ) -- 2 j/2 ~ ( 2 J t -- k)} is an orthonormal basis for Vj then there will exist a wavelet basis {qt_j,k(t ) -- 2 j/2 -- q t ( 2 J t - k)} which spans Wj. Since W0 c V1, an expression for the wavelets can be obtained from a linear combination of the scaling functions in the space V1. That is oc
q,(t) - x/2 ~
h k ~ ( 2 t - k)
k=-oc
Example 7 The wavelet function 1
~(t)-~(Zt)-~(Zt-1)-
-1 0
0
which corresponds to the box function as depicted in Fig. 11.
72 The detail of the function obtained by decreasing the resolution from 2j+l to 2J is oc
fwj(t)-- Z
dj'kqtJ,k(t)
k - - - - :x;
, every function in L 2(R) can be represented as a Since LZ(R)= | linear combination of wavelet basis functions :X2
OC
f(t)- ~ Z dj'k~tJ.k(t) j=-oc k=- :x: Thus we have arrived at the wavelet series representation of f(t) (also called the wavelet decomposition of f(t)). Alternatively, one could write f(t) as a linear combination of scaling and wavelet basis functions as follows 2C
f(t)-
3r
3r
~ Cjo.k0(t) -+- Z Z djkqtJ ,k(t) k=-x j=jo k=- x
The Cj,k are referred to as scaling coefficients and the dj.k are the wavelet coefficients as described previously. Often, in practice it is these coefficients that we wish to obtain, and these are obtained from the wavelet transform. Due to the orthogonality of the scaling and wavelet functions the scaling coefficients can be calculated by the inner product
(5)
Cj,k -- [ f(t)~j.k(t)dt
d
1
v l
Fig. 11 The Haar it'avelet.
73 and the wavelet coefficients can be calculated by
f
(6)
f(t)qJJ,k(t)dt"
The orthogonality conditions on the scaling and wavelet functions as presented in Strang and Nguyen [1] are summarized as follows: 1. The scaling functions ~ ( t - k) are orthonormal to each other: f
d~(t- k ) 4 ) ( t - k')dt - 8(k - k')
--00
2. The scaling functions are orthogonal to the wavelets: (X3
~ ( t - k ) q t ( t - k')dt
f
0
--(X)
3. The wavelets q/j,k(t) -- 2J/2~(2Jt- k) are orthonormal at all scales: (x)
/
~j,k(t)~j, k,(t) dt - 8(j
m
j')5(k - k')
--(X)
In many cases ~(t) and ql(t) will not have a closed form, and are not straightforward to calculate. Strang and Nguyen [1] discuss various procedures for calculating ~(t) and ~(t). This may lead to concern regarding the computation of the scaling and wavelet coefficients. Since, if we do not know ~(t) and ~(t), how can the scaling and wavelet coefficients be computed? In the next section, we show that the wavelet coefficients can be obtained without actually having to construct ~(t) and ~(t), using the properties of the MRA. In fact all we need are the lk's and the hk'S. These coefficients are often referred to as the low-pass and high-pass filter coefficients, respectively. What is remarkable is that it is possible to place conditions on these filter coefficients, so that an MRA and associated wavelet basis exist. These conditions are summarized below"
1. ~-~k lk -- X/~. This condition is a result of conservation of area. One way of generating the scaling coefficient is by iteration. The conservation of area insures that the area beneath the scaling function at each iteration remains constant. Typically, f_~'~ ~(t)dt - 1.
74
2. h k - (--1)kll_k. This condition arises from the wavelet function ~(t) being orthogonal to the scaling function ~(t) and all the translates of the scaling function. 3. ~-~khk = 0. This condition arises because the area underneath the wavelet function is equal to zero, i.e. f _ ~ O(t)dt - 0. 4. ~-~klklk+2i- ~0i- If this condition is satisfied, then the scaling function 4)(0 is orthogonal to its translates. Often there is a finite number of non-zero filter coefficients. We use the notation Nf to denote the number of non-zero filter coefficients. Values for the filter coefficients appear in several texts, see for example [7]. Each set of filter coefficients defines the corresponding scaling and wavelet basis functions. Whilst it is possible to use ~off-the-shelf' wavelets, in Chapter 8 we suggest a possible approach for designing your own wavelets. It should be mentioned that other conditions can be placed on the filter coefficients, such as the accuracy condition [1]. For a function f(t) to be described by a polynomial expansion with terms like 1, x , X 2 , . . . , X n we require Y~k (-- 1)kkmlk -- 0 for m - 0, 1, . . . . n - 1.
6
Fast wavelet transform
The fast wavelet transform provides an efficient algorithm for computing the discrete wavelet transform. We will show that provided we know some function fvj, then the scaling and wavelet coefficients can be calculated in the absence of the scaling and wavelet functions, avoiding the integral expression in Eqs. (5) and (6). An expression for the scaling coefficients will be derived first, an expression for the wavelet coefficients then follows. Let us assume that we know fvj, which is expressed as follows 0(2
fvj(t) - ~
Cj.k~)j.k(t)
--0(2
Since the scaling basis functions in Vj are orthonormal to their translates, oo Cj,k - -
/ fvj (])j,k dt - (fv~,~ j . k ) --(X)
(7)
75 Eq. (7) requires some formulation of (l)j.k and hence qb(t) which may be difficult to obtain. It is desirable that an expression for the scaling Cj,k and wavelet coefficients dj.k be attainable without the need to construct ~(t) or ~(t). We now set about doing this. First, write OC
)C
~C
fv,(t) - ~ Cj.k(l)j.k(t) -- Z Cj-lk*J -l.k(t) + Z dj-lkqtJ -l'k(t) -~c k=-x k=-x This is an expression for fv, which has projections in Vj_ 1 and Wj_l. Now an expression for the scaling coefficients can be written as ~C
Cj_l. k --
/
fvjqbj-l.k(t)dt - / f v i , qbj-l.k) --
/
~Cj.kqbj.k(t), qbj-l.k
)
9
Using the fact that qbj_l k -- 2(J-l)/2~)(2J-lt- k) and ~X2
~(2J-lt - k) - ~
lk_ziqbj.k(t)
--2X2
then the following expression for the scaling coefficients is obtained ~X2
Cj_l,i -- Z lk-2i Cj.k. k=-~c Essentially we are using the scaling coefficients at the higher resolution to calculate the scaling coefficients at the next resolution. This is sometimes referred to as the pyramidal algorithm [9,11]. A similar procedure is performed for obtaining the wavelet coefficients, leading to the following expression ~X2
dj_l.i -
~ hk-2i Cj,k. k=-~
Provided we know the scaling coefficients at some resolution level j, the remaining scaling coefficients and wavelet coefficients can be found by the pyramidal filtering algorithm without even having to construct a wavelet or scaling function. We need only work with the filter coefficients lk and hk. Now that we know how to compute the wavelet coefficients (the easy way), there is one last issue that we must deal with. To make use of the fast wavelet
76 transform, we repeatedly say, provided you know the scaling coefficients at some level j, then the wavelet and scaling coefficients at lower levels of j can be easily computed. So what constitutes the scaling coefficients at level j? One such practice is to discretely sample from the function f(t). Be warned however, Strang and Nguyen [1] (p. 232) refer to this procedure as being "a wavelet crime". They do mention that whilst it is a wavelet crime it is a convenient practice to employ. We refer the interested reader to this reference for more details.
7
Wavelet families and their properties
This chapter has briefly eluded to two wavelet families, the Haar and Daubechies wavelets. In fact when Nf = 2 the Daubechies wavelet is identical to the Haar wavelet. In this section we would like to discuss in greater detail more about these wavelet families and other wavelet families. We will also provide a brief comparison between the different properties possessed by these wavelets and other wavelet families. This is important because depending on your application, you may need to choose a wavelet that satisfies special properties. We first review the terms orthogonal and compact support. Following this, we will introduce some more properties, namely smoothness and symmetry of wavelets and also discuss the term vanishing moments. The Haar, Daubechies, symmlets and coiflets are wavelet families which exhibit orthogonality and compact support (see Fig. 12). Criteria which the scaling ~(t) and wavelet ~(t) must satisfy for orthogonality were discussed in Section 5. Also, in this section the term compact support was briefly mentioned. A wavelet is compactly supported if it is nonzero over a finite interval and zero outside this interval. Such wavelets include the Haar, Daubechies, symmlets and coiflets. The decision to use a particular wavelet may extend to other properties besides orthogonality and compact support. Other properties include symmetry, smoothness, and vanishing moments. 9 Symmetry
Symmetry is a useful property in image processing [14]. Unfortunately, with the exception of the Haar wavelet it is not possible to have orthogonal compactly supported wavelets which are symmetric. Owing to this fact, the
77 (a)
(b)
(c)
(d)
Fig. 12 An example of a Haar wavelet (a), and wavelets from the Daubechies (b), symmlet (c) and coiflet (d) families.
symmlets and symmetrical. Usually if symmetry and orthogonality is required, then one opts to use biorthogonal wavelets. These are discussed in greater detail in Section 8. 9 Smoothness
So that a wavelet function can efficiently represent characteristics of an underlying function, it is necessary that the wavelet be sufficiently smooth [14]. If one considers the Haar wavelet depicted in Fig. 11, one can clearly see that this wavelet suffers from a lack of smoothness. Smoothness is closely related to how many times a wavelet can be differentiated and to the number of vanishing moments possessed by the wavelet qJ(t). One measure of smoothness is the H61der exponent [14], which is defined to equal ~ = q + ~, where and 7 are the largest values such that ]t~(q)(t)- ~(q)(t + x)[ _< C[x] ~' for all t,x and f(q) denotes the qth derivative of the wavelet ~(t).
9 Vanishing m o m e n t s
A wavelet is M - 1 times differentiable if it has M vanishing moments. A wavelet has M vanishing moments if
78
oc
f
tq~(t)dt = 0
for q = O, 1,2 . . . . . M - 1.
--C)C
Vanishing moments are a necessary condition for qt(t) to be M times differentiable. More precisely, a wavelet is M times differentiable only if qt(t) has M vanishing moments. (Note that the reverse is not true.) When the wavelet qt(t), has M vanishing moments then the polynomials of form ~0
(a)
(b)
t~,
(c)
[!
/1 Fig. 13 The morlet (a), mexican hat (b) and Meyer (c) wavelets.
79 Table 1 provides a summary of the wavelets discussed in this section. For a more extensive summary see [14].
8
Biorthogonal and semiorthogonal wavelet bases
Much of the literature on wavelets tends to be biased towards discussions on orthogonal wavelets because they are convenient and simple to implement. However, we feel that is necessary to make the reader aware that wavelets need not be orthogonal and that wavelets with other properties can be quite useful too. In this section we discuss biorthogonal wavelets as one alternative to orthogonal wavelets. We direct the reader to [1,7,10,14,15] for more information on other kinds of wavelets. Briefly, when using orthogonal compactly supported wavelets it is not straightforward to obtain a wavelet which has symmetrical properties [7,12] and allows for an exact reconstruction. That is of course with the exception of the trivial Haar wavelet. Biorthogonal wavelets relax the assumptions of orthogonality, and allow for a perfect reconstruction with symmetrical wavelets. There are also semiorthogonal wavelets which are slightly more restrictive than biorthogonal wavelets, but may also be worthy of consideration. Strang and Nguyen [1] presents a section on the symmetry and orthogonality issue and suggests some alternative approaches which can be considered if both symmetry and orthogonality are required. Turcajova [13] provides an excellent discussion on the application of higher multiplicity wavelets as an alternative approach to using orthogonal wavelets with symmetrical properties. When dealing with orthogonal wavelets, the same wavelets used in the wavelet series expansion of f(t), are used in the inverse wavelet transform for computing the wavelet coefficients. When using biorthogonal wavelets, one wavelet is used in the series representation, and another set of wavelets is used for computing the wavelet coefficients from the inverse transform. Thus, biorthogonal wavelet analysis considers two scaling functions qb(t) and ~(t), and two wavelet functions ~(t) and t~(t). The functions ~(t) and q/(t) are the usual scaling and wavelet basis functions, while ~)(t) and ~(t) are called the dual scaling and wavelet functions, respectively. Fig. 14 shows an example of a biorthogonal scaling and wavelet functions together with their corresponding dual functions. Notice that these wavelets are compact.
00
0
Table 1. Properties summary for the Haar, Daubechies, symlets, coiflets, morlet, Mexican hat and Meyer wavelets.
Property Symmetry Asymmetry Nearly symmetrical Orthogonal Compact support Explicit expression Scaling function
Haar
Daubechies
Symlets
Co@t
0
Morlet
Mexican Hat
Meyer
0
0
0
0 0
0
0
0
0
0
0
0
0
0
0
0
0
0 0
0
0 0
81
As usual we have ~j,k(t) -- 2J/2q~(2Jt- k,)
~j,k(t) -- 2J/2q/(zJt-
k)
and the dual functions are also translated dilated versions of each other ~j,k -- 2 J / 2 0 ( z J t - k)
~j,k(t) -- 2 J / 2 ~ ( z J t - k).
The biorthogonal wavelet transform is written ,OC
0(3
f(t)- ~
~ ' ~]j,k~j,k(t) j=-oc k=-ec,
where /,
dj,k - J f(t)~j,k(t)dt. Here we have used the original wavelets for representing the functions, and the dual wavelets for computing the coefficients in the expansion of the W a v e l e t Function
Scaling Function
=
y
,
,
,
-1.0
-0.5
0.0
0.5
1.0
'
-r
::
,
1.5
2.0
.
w
,
,
-4
-2
0
.
.
.
T
2
--
r
4
6
Dual W a v e l e t Function
Dual Scaling Function
,e-
. ii
.
F
~
,
,
-1.0
-0.5
0.0
0.5
'
. . . .
~
1.0
.
.
.
,
1.5
2.0
-,4
-2
o
2
4
Fig. 14 An example of biorthogonal scaling and wavelet functions with their duals.
6
82 wavelet basis functions. The roles of these basis functions can be reversed such that f(t) is a linear combination of the dual wavelets oc
f(t)-- E
Z
dj'k~J.k(t)
j = - ~ k=-~c
in which case the coefficients are computed from the original wavelets dj,k -- / f(t)~j,k(t)dt. There are two sets of scaling and wavelet defining equations for each pair of basis functions ~c
CX5
,(t) - v~ Z
l k , ( 2 t - k)
and
,(t) - v~ Z
k=-oc
k--~X2
oo
~(t) - v/2 Z
h k * ( 2 t - k), ~c
[k~(2t- k)
and
}(t) - V~ Z
k=-oc
k---
fak}(2t- k). ~c
As for orthogonal wavelets, a fast wavelet transform exists for computing the scaling and wavelet coefficients. Provided the scaling and wavelet coefficients are known for some scale, the scaling and wavelet coefficients at the next lower scale j - 1, are computed as follows: o(3
k-2icjk
Cj_I, i --
dj-l,i-
Z k---
flk-2iCj-k ~_~
The biorthogonal relationships among the basis functions are [1]" oo
1. /
~ ( t - k)t~(t- k ' ) d t - 8(k - k')
--fiX) OC
2.
/
,(t- k)+ , ( t - k')dt - 0 for all k, k' 0(3
3. /
, ( t - k ) ~ ( t - k ' ) d t - ~5(k- k').
83 The dual scaling function ~ and dual wavelet function t) generate a dual multiresolution analysis such that Vj _L ~r and Wj _1_Vj and Vj + W j -
Vj+, and Vj + VCj - "Vj+l.
--vj /.... .....
vj+,= +
•
+ =~j+~
A biorthogonal scaling and wavelet function are semiorthogonal if they generate an orthogonal multiresolution analysis Wj2_Wj, a n d W j 2 _ W j ,
forjC-j
In the case that Vj - Vj and Wj - Wj, we say that the biorthogonal scaling and wavelet functions are semiorthogonal. For every space Vj, there are two bases dpj.k(t) and (~j.k(t). In the orthogonal case not only do we have Vj - Qj and W j - Wj, but also ~ ( t ) - ~(t) and q l ( t ) - ~(t). For more theoretical details on biorthogonal and semiorthogonal wavelets see [1,7,10].
References 1. G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley (1996). 2. D. Gabor, Theory of Communication, Journal blstitute of Electrical Engineers, 93 (1946), 429-457. 3. G. Kaiser, A Friendly Guide to Wavelets, Birkauser, Boston (1994). 4. M. Frazier and B. Jawerth, A Discrete Transform and Decompositions of Distribution Spaces, Journal of Functional Analysis, 93 (1990), 34-170. 5. I. Daubechies, The Wavelet Transform, Time Frequency Localization and Signal Analysis, IEEE Transactions on bformation Theoo', 36 (1990), 961-1005. 6. I. Daubechies, Orthonormal Bases of Compactly Supported Wavelets, Communications on Pure and Applied Mathematics, 41 (1988), 909-996. 7. I. Daubechies, Ten Lectures on Wavelets, SIAM (1992). 8. S. Mallat, A Theory for Multi-resolution Signal Decomposition: The Wavelet Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 11 (1989), 674-693. 9. S. Mallat, Multifrequency Channel Decompositions of Images and Wavelet Models, IEEE Transactions on Acoustics Speech and Signal Processing, 37 (1989), 20912110.
84
10. B. Jawerth and W. Sweldens, An Overview of Wavelet Based Multiresolution Analyses, SIAM Review, 36 (1994), 377-412. 11. Y. Meyer, Wavelets." Algorithms and Applications, SIAM, Philadelphia (1993). 12. J. Kautsky, A Matrix Approach to Discrete Wavelets, in Wavelets." Theoo', Algorithms and Applications (C. Chui, L. Montefusco and L. Puccio Eds) (1994), pp. 117335. 13. R. Turcajov'a, Compactly Supported Wavelets and Their Generalizations: An Algebraic Approach, PhD Thesis, The Flinders University of South Australia (1995). 14. B. Andrew and H. Gao, S+WAVELETS User's Manual, Version 1.0, Seattle: StatSci, a division of MathSoft, Inc. (1994). 15. M. Misiti, Wavelet Toolbox User's Guide, Math Works, Massachutes (1996).
Wavelets in Chemistry Edited by B. Walczak 9 2000 Elsevier Science B.V. All rights reserved
85
CHAPTER 4 The D i s c r e t e W a v e l e t T r a n s f o r m in P r a c t i c e O. de Vel, Y. Mallet and D. Coomans Statistics and Intelligent Data Analysis Group, School of Computer Science, Mathematics and Physics, James Cook University, Townsville, Australia
I
Introduction
In this chapter we present the finite discrete wavelet transform (DWT) using matrices. The main difference between the description of the D W T in this chapter, compared to the description given in the previous chapter, is now we consider the D W T for discrete data, as opposed to continuous functions. We first provide a brief introduction to matrix theory and discuss some special forms of matrices such as, orthogonal and banded matrices. We then give the matrix representation of the D W T for finite-length signals and discuss some of the practical issues, including signal length and boundary conditions. The same techniques presented in this chapter can be extended to provide the basis for the development of task specific wavelets in Chapter 7. The reader who is familiar with matrix terms and definitions may skip Section 2.
2
Introduction to matrix theory
An n x m matrix B is a rectangular array of elements arranged in a set of n horizontal rows and m vertical columns:
g
bll b21
b12 b22
"'" "'"
blm b2m
bnl
bn2
"-
bnm
where the subscript is used to denote the (row, column) position of each element in the matrix. We write B = [bij]. The matrix has dimensions or order "n by m", usually written n x m. When n = m the matrix is called square matrix of order n.
86
The transpose of an n • m matrix B, denoted B T, is obtained by interchanging the rows and columns of B. This results in an m • n matrix. For example, if 3 5
B-
4 6
then
BT_
1 2
3 4
5 6
Transposing twice in succession returns the original matrix i.e. B -
(BT) T .
A vector is a matrix having either a single row or a single column. A matrix consisting of a single row that has n elements is called a row n-vector; a matrix consisting of a single column is referred to as a c o l u m n n-vector. Note that, if x is a c o l u m n n-vector, then x T is a row n-vector.
2.1 P a t t e r n e d m a t r i c e s
Matrices that have elements exhibiting certain patterns are encountered in many applications, including the DWT. Such matrices are often referred to as p a t t e r n e d matrices. The simplest pattern matrix is the d i a g o n a l matrix D, defined as a square n • n matrix with all off-diagonal elements being equal to zero, i.e. d 0 = 0, if i r j. The diagonal in this case is referred to as the p r i n c i p a l diagonal. If the diagonal elements are unitary, we have the unit diagonal matrix or identity matrix, I. Other patterned matrices include triangular, symmetric and Hermitian matrices. For the D W T we consider the banded, block circulant and permutation patterned matrices. A (p, q)-banded matrix B has all non-zero elements contained in a band consisting of the principal diagonal together with p diagonals above it and q diagonals below it. This is shown as follows: (
all
a12
..-
al. p+l
0
.--
0
'~
a21
a q+l.1
0 .
aq, .
.
.M
. 9
0
...
0
aN,p,
""
aU,Mj
87 Well-known examples of banded matrices include the (1, 1)-banded ("tridiagonal") matrix and the ( n - 1 , 0)-banded ("upper triangular") matrix. An n x n circulant matrix H is defined as / al am am-1
a2 al am
a3 a2 al
...... am a3 999 am-I a2 999 am-2
a3
a4
.--
1-I(al,a2,... ,am) =
a2
am
al
where the matrix elements ai are real numbers and where each row consists of the elements of the preceding row shifted one position to the right, with the right-end element rotating into the first row position. The special case of H(0, 1 , 0 , . . . , 0) is called the shift matrix, since post-multiplying any matrix by H shifts its columns exactly one place to the right. An n x n permutation matrix P has a single value equal to 1 in every row and column of the matrix (the other values are 0) and the rows of P consist of a permutation of the rows. Pre-multiplying any n x m matrix B by P has the effect of applying the same permutation to the rows of B. That is, a 1 in row i and column j of P selects row j in B to be positioned in row i of PB. For example, if
P-
(010) 0 1
0 0
1 0
and
B-
(123) 4 7
5 8
6 9
then
PB=
4 7 1
5 8 2
6) 9 3
Similarly, post-multiplying an m x n matrix B by P has the effect of applying the permutation on the columns of the matrix B. The circulant matrix lq is a special case of a permutation matrix. In the above example, the P matrix is the shift matrix II (0, 1, 0).
88 A Toeplitz matrix T = It;j] has equal elements along diagonals parallel to the principal diagonal. A n x n Toeplitz matrix is defined as
ti+l, j + l
ti4
=
i,j = 1 , 2 , . . . , n
For example, for the case of n = 4: to tl t2 t3
T -
t_~ to tl t2
t_2 t_l to t~
t_3
t_2 t-1 to
2.2 Matrix operations We can define some basic matrix operations:
Matrix addition: The sum B + C of two matrices B and C having the same order is obtained by adding the corresponding elements in B and C. That is,
B + C = [b;j] + [c;j] = [b;j + cij]
/5 3/
So, for example, if
B -
-1
2
7
-5
and C -
3
2/
8
10
-1
-3
8 then B + C =
7
12
6
-8
Matrix multiplication: If B = [b/j] is an n x m matrix and C = [c;j] is an m x p matrix (i.e. the number of columns of B is the same as the number of rows of C) the product D - BC = [d,-j] is defined as m
dij - E
bikCkj
k=l
So, for example, if (1 B-
4
2
3)
5
6
andC-
-1 4
5 3
6 2
7) 1
3
-2
0
5
89 then 16 34
D-BC-
5 23
10 34
24) 63
When B is a row vector, or when C is a column vector, we denote this as a matrix-vector multiplication. We also define the matrix polynomial product, using the symbol <> as the operator" (BoB1,..., Br-1) ~ (CoC1, . . . . Cs-1) - ( G o G 1 , . . . . Gr+s-2) with Gi
BkCi-k
-- ~ k
where the sum is effected over the indices of the matrices Bk and Ck. For example, for the case of r = s = 2, we have (BOB1) <>(CoC1)
-
(BoCo
BoC1 + B1Co
BlC1)
We will use the matrix polynomial product in the context of D W T factorisation (see Chapter 7).
2.3 Some matrix properties The determinant of a square matrix B of order n, denoted by det B, is defined as
detB -
i=l
aijBij - Z aijBij j=l
where Bij is the cofactor of b;j. A cofactor B;j of b;j is defined as
Bij - ( - 1)i+JMij where Mij is called the minor of bij and is the determinant of the sub-matrix obtained from B by deleting row i and column j.
90 So, for example, if 1 2 3
B
-3 -4 5
7 ) -2 1
then det B - a ~ B ~ + a~2B~2 + a~3B~3 1(_1)2 --4
=
--2
5
1
+(-3)(-1
)3 2
-2
3
1
+7(-1) 4
2
-4
3
5
-- 1 ( - 4 + 1 0 ) + 3(2 + 6 ) + 7(10 + 7) -- 129 Though the above definition for the determinant is inherently recursive in nature, it is not computationally efficient for large matrices. Therefore, other computationally efficient approaches are generally implemented. A matrix B is called singular if det B = 0, otherwise if det B -r 0 then B is non-singular. With this we can now introduce the concept of the rank of a matrix. The rank of an n • m matrix B, denoted rank(B), is defined to be the order of the largest non-singular square sub-matrix which can be formed by the selection of (possibly non-adjacent) rows and columns of B. For example,
3 5
B
1 8
6 7
0) 1 1
Here, the square 3 x 3 sub-matrix formed by rows and columns l, 3 and 4 is singular since 2 3 5
1 6 7
0 1 -0 1
However, there is at least one other square 3 x 3 sub-matrices that is nonsingular. Thus rank(B)= 3. There is a direct relationship between the rank of a matrix and the linear independence of its components. That is, the rank of a matrix is equal to the
91 maximum number of its columns which are linearly independent. For the example above, columns 1, 3 and 4 are linearly dependent (since (-1)(B)I + 2(B)~ + (-9)(B)3 = 0, where we have denoted t h e j t h column of B as (B)j). All other columns are linearly independent. So, rank(B)= 3 as before.
3
Matrix representation of the discrete wavelet transform
In Chapter 3 we developed the mathematical framework for (continuous) wavelet construction as formalised by multiresolution analysis (MRA). MRA divides the space of square integrable functions, L2(R), into a series of successive approximation spaces Vs. and their orthonormal band-pass complement spaces Wj. This MRA paradigm is useful for various reasons. Firstly, it is well-suited to many applications such as multiresolution image representation where coarse images are often used as approximations (e.g. when browsing MRI images), or when representing the different characteristics of a spectrum at different scales. Secondly, MRA is intuitive as it is similar to the psychophysical models of the Human visual system. Finally, it provides the basis for the "fast" computation algorithm for the wavelet transform. We now focus on the wavelet transform for discrete signals and, more significantly, for finite length (non-infinite) discrete signals since, in practice, most signals are finite. We develop the matrix representation for the discrete wavelet transform.
3.1 The discrete wavelet transform for infinite signals MRA generates efficient lattice decomposition and reconstruction equations for the wavelet transform of (hypothetical) infinite signals or infinite spectra. The lattice decomposition and reconstruction equations are sometimes referred to as the analysis and synthesis wavelet transform equations, respectively. The analysis phase generates the wavelet coefficients, and the synthesis phase inverts ("back-transforms") from the wavelet coefficients back to the original data. In essence synthesis is the inverse of analysis. In this section the reader will notice many similarities and extensions of the theory presented in Chapter 3. The most notable similarity is the way in which the wavelet and scaling coefficients are computed. Recall from Chapter 3, that the wavelet and scaling coefficients at some level j - 1 are computed using the scaling
92 coefficients at level j, via the pyramidal scheme. This was done using the following pair of recursive equations: DC
lk-2/Cj,k
Cj--l,i-- ~ k=-~c 3(2
dj-l,i- Z
hk-2iCj.k
k--cx2
The two-way lattice decomposition equations from any level j to j - 1 for a discrete infinite signal are computed using the same equations, that is
~c Cj--l'i -- Z lk-2iCj'k k---~c
(1)
and ~c
dj-l.i -
~
h,_2icj.k
(2)
k---~c
where cj, i and dj.i are, respectively, the scaling and wavelet coefficients at level j, and 1~ and hk are the set of "low-pass" (or scale filter) and "high-pass" (or wavelet filter) filter coefficients, respectively. The term 'qattice" refers to the grid indexed by the integer variables i and j. The term "two-way" (also referred to as "dyadic" or "2-band") eludes to the fact that we are using two filters, a single low-pass and a single high-pass filter, so that when a data sequence is passed through each of the filters, two new bands of filtered data emerge. The equations provide an efficient mechanism for the segmentation of the time-frequency (or, for the case of a spectrum, amplitude-wavelength) space. The low-pass filter acts as a smoother, producing a smoothed version of the data sequence which it is filtering. The high-pass filter acts as a differencing operators, producing a non-overlapping band of the signal which the low-pass filter did not capture. The above decomposition equation can be viewed as a discrete convolution where we convolve the signal with the filter coefficients followed by down-sampling by the factor of 2 (i.e. we retain only every second element). The down-sampling by "2"' is taken care of by the factor 2 in the filter coefficient subscript k - 2 i . (We note in passing that the above equations define a "forward" filtering operation unlike the usual discrete convolution notation used in some of the literature where the filtering operation is run "backwards" or time-reversed. There is no real difference between the definitions.)
93 In the simplest case, we can perform the lattice decomposition over the lowpass branch keeping the high frequencies intact, leading to the wavelet transform (see Fig. 1). This division of the frequency axis gives good frequency resolution at low frequencies, and acceptable resolution at high frequencies, a trade-off which works in many practical cases. However, for signals/spectra rich in high-frequency components, this decomposition scheme may not be satisfactory. A full decomposition over the high frequency components may be preferable. The one-dimensional 2-band decomposition scheme can be viewed as a full two-way frequency-time tree,
Frequency (amplitude)
Time (wavelength) Fig. 1 Dyadic wavelet transform (WT) decomposition scheme showing time-frequency segmentation and the associated wavelet transform tree.
Frequency (amplitude)
Time (wavelength) Fig. 2 Dyadic wavelet packet transform (WPT) decomposition scheme showing timefrequency segmentation and wavelet packet tree (depth - 3).
94 with each tree level corresponding to an iteration in time (wavelength) of the lattice decomposition equations (see Fig. 2). This decomposition structure is called the wavelet packet transform (WPT) and is discussed in more detail in Chapter 6. From the full WPT we can generate a large number of possible redundant subtrees, or arbitrary WP trees (called wavelet bases). In fact, the total number of two-way (dyadic) bases is at least (22) n~e' for a tree depth equal to nlev. For example, a dyadic WPT with tree depth nlev- 12 has a "library" of at least 4.2 • 106 bases! The WPT has an important advantage compared with the WT because fast algorithms exist for the efficient search of the best wavelet basis, based on the minimisation of a cost function such as an entropy criterion. For the case of time-varying signals (as encountered in speech, music and video or in applications with time transients), it is possible to segment the time axis into disjoint intervals and construct wavelet bases on each intervalcalled spatial segmentation. This allows the WPT to adapt to each time interval. This is referred to as a multi-tree WPT or spatially adaptive WPT (see Fig. 3 for an example two-way segmentation of the time axis for the case of a dyadic WPT). The lattice reconstruction equation is formulated in a similar way, ~c
~c
Cj., -- ~_~ 1/,--2iCj-l.i -+- ~ j-'-~c
h/,._2idj_1.i
j=-~c
Frequency (amplitude) ~k
-1
I I I
I I Time (wavelength) Fig. 3 Dyadic (m - 2) W P T decomposition scheme with two-way time axis segmentation.
95 As stated previously, with most applications in analytical chemistry and chemometrics, the data we wish to transform are not continuous and infinite in size but discrete and finite. We cannot simply discretise the continuous wavelet transform equations to provide us with the lattice decomposition and reconstruction equations. Furthermore it is not possible to define a M R A for discrete data. One approach taken is similar to that of the continuous Fourier transform and its associated discrete Fourier series and discrete Fourier transform. That is, we can define a discrete wavelet series by using the fact that discrete data can be viewed as a sequence of "weights" of a set of continuous scaling functions. This can then be extended to defining a discrete wavelet transform (over a finite interval) by equating it to one period of the data length and generating a discrete wavelet series by its infinite periodic extension. This can be conveniently done in a matrix framework. We now develop the matrix representation for the wavelet transform that allows us to represent the pyramidal synthesis and analysis lattice equations for finite length signals in a convenient matrix computational framework. We first introduce the concept of a wavelet matrix in the context of infinite signals. The wavelet matrix A is defined as an infinite row of 2 • 2 matrix blocks: A_{...
1-3 h-3
1_2 h-2
1_1 h_l
A=(...
A-2
A-1
Ao
1o ho
ll hi
12 h2
A1
A2
"")
13 h3
.-"~
)
or,
where
Aj--
lzj h2j
12j+1 ) h2j+l
The first row in the wavelet matrix A, simply contains the low-pass filter coefficients. The second row in the wavelet matrix A, contains the high-pass filter coefficients. As shown above, it is sometimes more convenient to store the filter coefficients in the matrix A as a sequence of sub-blocks Aj (j . . . . , - 2 , - 1 , 0 , 1,2,...). The sub-blocks are simply the filter coefficients found in the lattice decomposition and reconstruction equations ar-
96 ranged in row form. This special structure is most useful when attempting to factorise the wavelet matrix (see Chapter 7). We now generate a filter matrix, W, as an infinite block Toeplitz matrix:
r
W
A-1 A-2 A-3
m
A0 A-1 A-2
A1 A0 A-1
A2 AI Ao
A3 A2 A1
"'"
We also define the infinite column vectors of interlaced scaling and wavelet coefficients, f J~
(.
9
9 c j,-1
cj.o
Cj. 1
"'"
cj.0
dj, 0
)T
and
fj -- (''"
Cj,_l
dj,_l
cj.1
dj.l
... )T
Note that the coefficients in fj are interlaced. This is a convenient method for writing down infinite vectors (as opposed to concatenation). The decomposition and reconstruction lattice equations can then be rewritten as a simple recursive matrix multiplication, f9-1-Wf
(o) 9
and f j(0) _ wTfj_ 1 In general, we choose compact wavelets (i.e. only a finite number of coefficients for the dilation and wavelet equations are non-zero) and therefore only a finite number of Aj blocks are non-zero. In this case the matrix W is sparse and banded (see Section 4.2.2). Compactly supported wavelets have good localisation properties but may not always have a high degree of smoothness (e.g. the Haar wavelet). There are many ways of constructing wavelets and the strategy chosen may depend on the particular application. We can construct compact wavelets
97 with additional properties such as smoothness and symmetry (useful for wavelets whose scaling and wavelet filters have linear phase). However, there is no simple scheme to obtain general m-way scaling and wavelet filters as this generally involves solving a set of non-linear equations. In Chapter 8 we demonstrate how to design wavelets targeted for a specific application. Fortunately, there is a sufficiently large library of existing wavelets that are satisfactory for many applications. For reasons of convenience, and as will be shown in Chapter 6, the decomposition and reconstruction lattice equations can also be written as a pair of equations of recursive matrix products where we separate out the low-pass and high-pass filter coefficients in the form of infinite matrices Cj and Dj. That is, e j - 1 --
Cjcj
9
o
Cj-l,-1
Cj-l,O
--
Cj-l,1
9
9
o
o
9
...
1-1
1o
Ii
...
cj _~
...
1-3
1-2
1-1
Cj.-2
...
1-5
1-4
1-3
Cj-I
and,
Djcj
dj_ 1 =
9
9
~
9
.
dj-1,- 1 dj-l,O d / - 1.1
-
9
9
9
9
.
o
9
...
h-i
ho
hi
...
cj.-3
...
h-3
h-2
h-I
cj,-2
...
h-5
h-4
h-3
c/-I
..
3.2 Discrete wavelet transform for signals with finite-length Prior to this chapter, we have considered multi-resolution analysis for the space of square integrable functions, L2(R). However, in most practical applications such as spectral analysis, the data are obtained as a sequence of
98
discrete values over a finite interval rather than continuously over the (infinite) real line. An example finite-length N M R spectrum is shown in Fig. 4. The example chosen here consists of N = 512 discrete sample values, a subset of a raw soft lobophytum compactum coral spectrum containing isolobophytolide. The DWT will require the computation of non-existent values as these values reside outside the interval. The main difficulty is not with the computation of the coefficients away from the interval boundaries, but with those that are at, or close to, the boundaries. The problem that arises is that of what to do with the boundaries of the interval as these, if not taken into account, can lead to artifacts in the filtered data. Similar problems also arise in the context of the discrete Fourier transform. There are many different approaches to handling boundaries. One possibility is to extend the sequence to an infinite one in some suitable way and then apply the standard DWT to the extension. Example extension techniques include extension by constant-padding, by periodicity, by reflection and by
""
I
I
["
0
100
200
"'
I
I
I
300
400
500
Fig. 4 Example finite-length partial NMR spectrum of lobophytum compactum coral (N = 512).
99 polynomial extrapolation. Periodic extension is used implicitly when filtering with the DFT (i.e. a circular convolution). Another possibility is to construct wavelets specifically on the interval. We first consider boundary handling using periodised wavelets. We then briefly consider other simple techniques and look at their practical implications. Another difficulty that arises with data sets is the non-standard size or length. That is, the DFT assumes a data size equal to a power of 2 which is not normally the size of spectra or other similar discrete data. When the data set size is a power of 2 (such as in an image), periodic extension with the DFT inherently handles such a situation. However, with non-standard sizes, some form of extension up to a power of 2 is usually performed, thereby allowing the application of the periodised DFT. This we also investigate in the context of the boundary extension techniques.
3.2.1 Boundary handling using periodised ii'avelets We assume here that the scaling and wavelet functions, qb and q~ respectively, are defined over some interval [0, K] and then expanded to reside on L2(R) by regarding it as a periodic function with period K. This corresponds to "wrapping around" a function over the interval [0, K] and extending it to infinity, as shown in Fig. 5. The circles in Fig. 5 indicate the end-points at which wrap-around is applied. A multi-resolution analysis for L2([0, K]) can be built with the sequence of sub-spaces being of finite dimension equal to 2JK and the top-level subspace representing the coarsest resolution. An alternative way of viewing the periodic extension of data is to periodically wrap the transform matrix. In practice, we make the top-level (say,/ = J ) of coefficients equivalent to the raw input signal or spectrum. Furthermore, we assume that the dimensionality (number of samples of the input signal) is a power of two, that is dim = N - 2 J for some J E Z + . There exist several techniques to take into account signals which do not have a dimensionality equal to a power of 2. We investigate these later. We rewrite the decomposition equations as (finite-length) filtering equations, N f -I
Cj_I, i z
Z lkCj,2i+k k--0
100 (a)
(b)
Je__L 0
K
Fig. 5 Boundary handling b)" periodisation." (a) sample spectrum, (b) spectrum wraparound with period K and extension to infinity.
and NI-1
dj-l'i- Z hkCj.2i+k k=0
or,
fj-1 - WN~ .~ where f j (o) - -
(Cj,0
Cj.I
cj~._
"'"
cj ._~.;_1 )T
and fj-- (Cj,0
dj,o
cj.1
d j,1
""
cj,2;_l
dj,2J-l)W
101
We note in passing that, when there is a finite number of filter coefficients, the filter is called a finite impulse response (FIR) filter. Another commonly used filter is a causal filter. Here the filter coefficients with negative indices are zero, that is, lk = 0 for k < 0, we say that the filter is causal (hk could also have been used in the definition of causal). The remaining discussion will consider filters that are both FIR and causal. The notation N f will be used to denote the number of finite filter coefficients. Similarly, the reconstruction equation is, ~.0)__ w T f j _ 1
The N • N filter matrix WN is
WN--
Ao 0
A1 Ao
A2 A1
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..A2
Aq_l ""
A2 A1
--" ...
Aq_l Aq_2
0 Aq-1
...... 0
0 Aq-1
...... 0 -.. .
0 ......
0 0
Ao 0
Al A0
where q - NU/2. Notice the special banded circulant structure of WN, highlighting the periodic wrap-around effect. Since we obtain an interlaced vector of scaling and wavelet coefficients at each stage of applying the decomposition equation, the vector has to be unshuffled by pre-multiplying with a permutation matrix P. The permutation matrix selects every second element (due to the dyadic nature of the decomposition scheme) and reorders them in sequence. P is given by /1 0 0
O
0 0 0
.
0 1 0
-.. 0 0
O
1
9 " "
0
9 .
p
0 0 0
1 0 0
O
0
0 0 0
1 0
9 . o
1
0
102
The decomposition equation therefore becomes (o)
fj-1 -- P WNfj
We illustrate the decomposition scheme for a finite-length sequence by using the following example:
Example Consider the following input sequence of N = 16 sample values (so
s1
$2
9.. s15 ) ~ _
(cj.o
cj,1
cj,2
9.. cj,15 )T
where J = log 2 16 = 4 levels of decomposition (since we down-sample by a factor equal to 2 when going from one decomposition level to another).We calculate the dyadic D W T using, for example, a 4-tap filter (i.e., N U = 4). There is one low-pass filter (with filter coefficients 10.11,12, 13) and one highpass filter (with filter coefficients h0, hl,h2, h3). At the first level of decomposition we have,
/C3.o'~
/1
0
0
0
0
...
C3.1
0
0
1
0
0
0
...
C3.2
0
0
0
0
1
0
0
..-
c3.7
0
0
0
d3.r
0
1
0
0
0
..-
d3,~
0
0
0
1
0
0
.-.
0
d3.:
0
0
0
0
0
1
0
0
kd3,;
\o
0'~ 0 0
1
0 0
0
0
0
1
103 i lo
|1
12
13
ho
hi
h2
h3 11 hi
lo ho
/' s0 '~ 0 12
13
h~
h3
, SI ,
$2 S~
i
I
12 h2
10
]l
12
13
h0
hi
h2
h~
1o
11
ho
hi
13 h3
~,Sls)
Notice the wrap around of the filter coefficients in the last two rows of the matrix corresponding to the extension of the input vector by periodisation. Fig. 6 shows the equivalent scheme for all of the four levels of decomposition. To compute the second level, the vector of coefficients (c3.0 c3.1 .. c3,7)T produced from the first decomposition level is used as the input vector, generating the second level of scaling and wavelet coefficients 9 )T (C2,0 C2,1 C2,2 C2,3 and (d2.0 d2,1 d2,2 d2,3) T respectively. Finally, this process
S0 SI S2
...
i'C'3.0 C3., ... C3,7
S16.., ]
[63.o d3,1 ..-d3,7,1
IC:.o... c:.3l I d2.o... [
'@
[C,.o....
d,.,[
Fig. 6 Dyadic D W T for a signal length equal to N = 16.
104 is repeated twice more until the fourth level is obtained (when the smallest resolution, corresponding to single scalar and wavelet coefficients, is obtained). Periodisation will be applied at each level (i.e. we assume periodic extension for each decomposition). F o r a numerical demonstration, consider the case of the D W T of a simple signal using the Daubechies-4 wavelet filter ( N u - 4). The periodic input signal is one period of a sinusoidal waveform (N - 24 - 16), with matched end-points: SO S1 $2 $3 $4 $5 S6 $7 $8 $9 ~l( ~1; ~1~ ~1,
I
0.38268~ 0.70711 0.92388 1.00000 0.92388 0.70711 0.38268 0.00000 -0.38268 -0.70711 -0.92388 -1.00000 -0.92388 -0.70711 -0.38268 \ 0.00000/
The filter points are given as follows [1]" l0
-
0.482963
h0
-
0.129409
ll
--
0.836516
hi
-
0.224144
12
-
0.224144
h2
-
-0.836516
13
-
-0.129409
h3
-
0.482963
Substituting the above values into the matrix decomposition equation, we obtain the following result for the first and subsequent two decomposition levels:

    c3 = ( 0.854016,  1.39830,   1.12349,  0.190556, -0.854016, -1.39830, -1.12349, -0.190556)^T
    d3 = (-0.081875, -0.087644, -0.042079, 0.028151,  0.081874,  0.087644, 0.042079, -0.028151)^T

    c2 = ( 1.80932,  0.691537, -1.80932, -0.691537)^T
    d2 = (-0.423848, 0.241590,  0.423848, -0.241590)^T

    c1 = ( 1.13626, -1.13626)^T
    d1 = ( 1.56868, -1.56868)^T
This result is also shown in Fig. 7. As can be seen, the effect of the low-pass filter is to perform a smoothing operation on adjacent samples, whereas the high-pass filter performs a differencing operation. The sub-sampling by a factor of 2 between adjacent decomposition levels can also be observed. As stated previously, repeated application of the matrix decomposition scheme until the fourth decomposition level yields the DWT. Using the alternative notation we introduced beforehand, the previous example would be computed as follows. For the first decomposition, we calculate the coefficients as c3 = C4 s and d3 = D4 s. The next lower level of coefficients would be computed as c2 = C3 c3 and d2 = D3 c3. Continuing down to level 1 we have the scaling and wavelet coefficients c1 = C2 c2 and d1 = D2 c2, and finally at the lowest level in the tree we have c0 = C1 c1 and d0 = D1 c1.
Fig. 7 First two decompositions (three levels are shown, including the top level) of the dyadic DWT for a sinusoidal waveform with matched end-points, signal length N = 16 and using a Daubechies-4 wavelet filter (Nf = 4).
The exact matrix representations for C4, C3, C2 and C1 are given below. The corresponding high-pass matrices D4, D3, D2 and D1 have the same structure, except that the li's are replaced with hi's:

    C4 = ( l0 l1 l2 l3 0  0  0  0  0  0  0  0  0  0  0  0  )
         ( 0  0  l0 l1 l2 l3 0  0  0  0  0  0  0  0  0  0  )
         ( 0  0  0  0  l0 l1 l2 l3 0  0  0  0  0  0  0  0  )
         ( 0  0  0  0  0  0  l0 l1 l2 l3 0  0  0  0  0  0  )
         ( 0  0  0  0  0  0  0  0  l0 l1 l2 l3 0  0  0  0  )
         ( 0  0  0  0  0  0  0  0  0  0  l0 l1 l2 l3 0  0  )
         ( 0  0  0  0  0  0  0  0  0  0  0  0  l0 l1 l2 l3 )
         ( l2 l3 0  0  0  0  0  0  0  0  0  0  0  0  l0 l1 )

    C3 = ( l0 l1 l2 l3 0  0  0  0  )
         ( 0  0  l0 l1 l2 l3 0  0  )
         ( 0  0  0  0  l0 l1 l2 l3 )
         ( l2 l3 0  0  0  0  l0 l1 )

    C2 = ( l0 l1 l2 l3 )
         ( l2 l3 l0 l1 )

    C1 = ( l0 + l2   l1 + l3 )
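The recursive scheme above can be checked numerically. The Python sketch below is our own illustration (hypothetical names, not from the text); it builds the Cj and Dj matrices with periodic wrap-around and reruns the sine example with the Daubechies-4 coefficients quoted earlier:

    import numpy as np

    def analysis_matrices(n, l, h):
        # Cj (low-pass) and Dj (high-pass) matrices of size (n/2, n) with
        # periodic wrap-around; same structure as C4 ... C1 shown above.
        C, D = np.zeros((n // 2, n)), np.zeros((n // 2, n))
        for k in range(n // 2):
            for i in range(len(l)):
                C[k, (2 * k + i) % n] += l[i]
                D[k, (2 * k + i) % n] += h[i]
        return C, D

    l = np.array([0.482963, 0.836516, 0.224144, -0.129409])
    h = np.array([0.129409, 0.224144, -0.836516, 0.482963])
    s = np.sin(2 * np.pi * np.arange(1, 17) / 16)   # matched end-points

    c = s
    while len(c) > 1:            # c3 = C4 s, d3 = D4 s, c2 = C3 c3, ...
        C, D = analysis_matrices(len(c), l, h)
        c, d = C @ c, D @ c
        print(len(c), np.round(c, 6), np.round(d, 6))

Note that for n = 2 the modulo wrap-around automatically collapses the row to (l0 + l2, l1 + l3), reproducing C1.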
To demonstrate using a numerical example with an input signal with a substantially larger number of points, consider the DWT with the Daubechies-6 wavelet (Nf = 6). The periodic input signal with matched end-points (a cosine waveform, N = 2^7 = 128) and the resulting DWT are shown in Fig. 8 (first three levels shown only). As can be seen, the DWT of a periodic signal with matched end-points generates a signal at each decomposition level that is consistent with the wavelet transform of an equivalent infinitely long signal. A further example of the DWT of a signal with matched end-points is shown in Fig. 9, where "band(X,0)" and "band(X,1)" are the sets of scaling and wavelet coefficients, respectively, at decomposition level X. The use of periodised wavelets for handling boundaries is widespread in many applications; it is the default extension scheme (for sample sizes equal to a power of 2) used in some software packages (e.g. S+WAVELETS).
Fig. 8 Results of the DWT using the Daubechies-6 wavelet filter for a cosine waveform.
Fig. 9 First six levels of the DWT of a signal with matched end-points (N = 256).
Unfortunately, one of the main problems observed when handling boundaries using periodised wavelets is that, unless the input sequence is truly periodic and the end-points of the sequence match at the boundaries, artificial singularities can be introduced for wavelets near the boundaries. This is due to the discontinuity (sudden jump) of the input sequence at the boundary. Furthermore, it is assumed that periodic extension occurs at each level of the decomposition. If, for a given sub-band, periodicity does not occur, then boundary singularities may be observed for one or more decomposition levels below that sub-band. This can be observed in Fig. 10, where artificial values appear in the first sub-band at level 1 owing to the lack of periodicity at the top level. The singularities will tend to propagate into lower levels of the tree and affect the overall wavelet decomposition of the signal. Singularities will be more pronounced for the wavelet coefficients, as these are particularly sensitive to mismatches at the boundaries. A larger number of singularities will also appear at the boundaries as the filter length, Nf, increases, owing to the increased effect of longer filters on the convolution operation. Fig. 11 shows the DWT decomposition of a short-wave infra-red (SWIR) mineralogical spectrum, using the same notation to label the bands at each decomposition level as in Fig. 9. Singularities can be observed in some of the bands, particularly for the upper (right-hand side) wavelet coefficients (e.g. band(7,1)).
Fig. 10 WPT results for a linear ramp signal with periodic boundary extension (N = 128, Daubechies wavelet, Nf = 6). First six levels shown only.
Fig. 11 DWT results for an SWIR mineralogical spectrum with periodic boundary extension (N = 256). First six levels shown only.
More detailed effects of the periodic boundary extension on the WPT are shown in Fig. 12 (where only the scaling coefficients at the first decomposition level are shown) for the NMR spectrum of Lobophytum compactum coral (Fig. 4). A relatively large singularity appears at the right-hand boundary (circled). The major advantage of periodic extension is that no additional coefficients are required (if the number of data points is a power of 2) and it preserves the orthogonality of the DWT. However, if singular end-effects owing to non-periodicity are a major concern, then consideration should be given to using other extension techniques, such as symmetric extension.

3.2.2 Boundary handling using symmetric extension
A symmetric extension involves the symmetric reflection of a function about the boundaries of the interval [0, K] in which the finite-length function is defined, as shown in Fig. 13. Symmetric extension has the advantage, compared with periodic extension, that the function at the interval boundaries is continuous.
Fig. 12 WPT results (scaling coefficients at the first decomposition level shown only) for an NMR spectrum with periodic boundary extension (Daubechies wavelet, Nf = 6).
Fig. 13 Boundary handling using symmetric extension by reflection.
However, the first and higher-order derivatives of the function at the boundaries may not be continuous. Periodic extension introduces singularities due to a jump at the boundaries, whereas symmetric extension introduces singularities due to the existence of boundary "corners". Discontinuities in the first and higher odd-order derivatives can be eliminated by using anti-symmetric extension, that is, reflection about the boundaries that is anti-symmetric (see Fig. 14). The periodised DWT developed in the previous section can be used for data subject to symmetric or anti-symmetric extension. However, the data will have to be extended to 2^J data points, where J = max[floor(log2 dim) + 1, J0], J0 = ceil(log2 dim), and dim is the original dimensionality of the data set. This will result in more than 2^j wavelet coefficients at level j due to the additional coefficients introduced by the extension procedure. Fig. 15 shows the resulting WPT for a ramp signal (same input sequence as in Fig. 10) when symmetric boundary extension has been applied.
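A minimal sketch of such an extension step is given below; it is our own illustration with hypothetical names, and real packages differ in exactly how many points are reflected and where the extra points are placed. It pads a sequence to the next power of two by symmetric or anti-symmetric reflection, assuming the padding on each side is shorter than the signal itself:

    import numpy as np

    def reflect_extend(x, mode="symmetric"):
        # Pad x to 2**J points, J = floor(log2(dim)) + 1, by reflection.
        x = np.asarray(x, dtype=float)
        dim = len(x)
        J = int(np.floor(np.log2(dim))) + 1
        pad = 2 ** J - dim
        nl, nr = pad // 2, pad - pad // 2
        left = x[1:nl + 1][::-1]          # mirror about the first point
        right = x[-nr - 1:-1][::-1]       # mirror about the last point
        if mode == "antisymmetric":       # point reflection removes the
            left = 2 * x[0] - left        # corner in the first derivative
            right = 2 * x[-1] - right
        return np.concatenate([left, x, right])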
Fig. 14 Boundary handling using extension by anti-symmetric reflection.
Fig. 15 WPT results for a linear ramp signal (N = 128, Daubechies wavelet, Nf = 6) with boundary handling using symmetric extension. First six levels shown only.
Singularities about the boundaries have been minimised. Fig. 16 shows the effect of symmetric boundary extension on the WPT (only the scaling coefficients at the first level of decomposition are shown) of the NMR spectrum of Lobophytum compactum coral (Fig. 4). The singularity generated by applying periodic extension (see Fig. 12) has been eliminated through the use of symmetric boundary extension. Unfortunately, the presence of additional coefficients can introduce dependencies in the coefficients close to the boundaries as well as prevent perfect reconstruction upon back-transformation. Furthermore, the symmetric extension property should be applied at each level of the decomposition. Perfect reconstruction can only be achieved through the use of symmetric wavelets such as bi-orthogonal wavelets [2].

3.2.3 Boundary handling using constant-padding extension
Constant-padding extension involves padding the coefficients at each level of the DWT/WPT decomposition with a constant value (usually zero, or the end values of the signal). Zero-padding is the default extension mechanism used in some software packages (e.g. MATLAB), or when signal lengths are not equal to a power of 2 (e.g. S+WAVELETS).
Fig. 16 WPT results (scaling coefficients at the first decomposition level shown only) for an NMR spectrum with symmetric boundary extension (Daubechies wavelet, Nf = 6).
As in the case of symmetric extension, the data will need to be padded (at both ends of the sequence) to 2^J data points prior to applying periodic extension. Fig. 17 shows the resulting WPT for a ramp signal when zero-padding boundary extension has been applied. Fig. 18 shows the effect of zero-padding boundary extension on the WPT (only the wavelet coefficients obtained at the first level of decomposition are shown) of the NMR spectrum of Lobophytum compactum coral (Fig. 4). Of the original 512 data points, only 480 values were used; zero-padding was then applied prior to the DWT. Significant artifacts can be observed near the boundaries for the wavelet coefficients, of a magnitude similar to the case of periodic extension. Since singularities about the boundaries can be significant, boundary handling with constant-padding extension should be used with care.
3.2.4 Boundary handling using polynomial extension

In the polynomial extension approach to boundary handling, the data sequence is extrapolated by fitting a polynomial to the end-points of the sequence and then extending that polynomial.
Fig. 17 WPT results for a linear ramp signal (N = 90, Daubechies wavelet, Nf = 6) with boundary handling using zero-padding extension. First five levels shown only.
Fig. 18 WPT results (wavelet coefficients at the first decomposition level shown only) for an NMR spectrum with zero-padding boundary extension (N = 480, Daubechies wavelet, Nf = 6).
The constant-padding and symmetric extension methods are simple examples of polynomial extrapolation. Polynomials of different degrees can be used, the higher-degree polynomials requiring the fitting of more end-points. For example, a polynomial of degree two is fitted to the first and last three data sequence values, whereas a polynomial of degree one need only be fitted to the first and last two data points. Higher-degree polynomials generally give better extrapolation accuracies. Fig. 19 shows the resulting WPT for a ramp signal when a degree-two polynomial boundary extension has been applied. The boundary singularities have been significantly minimised.
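A sketch of the extrapolation step described above (our own illustration; the helper name and the amount of padding are hypothetical): fit a degree-d polynomial to the d + 1 points at each end and evaluate it outside the interval:

    import numpy as np

    def poly_extend(x, degree=2, pad=8):
        # Extrapolate `pad` points past each end by fitting a polynomial
        # of the given degree to the (degree + 1) end-points.
        x = np.asarray(x, dtype=float)
        m = degree + 1
        t = np.arange(m)
        p_head = np.polyfit(t, x[:m], degree)    # exact fit at each end
        p_tail = np.polyfit(t, x[-m:], degree)
        left = np.polyval(p_head, np.arange(-pad, 0))
        right = np.polyval(p_tail, np.arange(m, m + pad))
        return np.concatenate([left, x, right])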
3.2.5 Boundary handling using wavelets on an interval

The extension techniques for boundary handling investigated so far are quite useful, producing the desired effect as long as the appropriate technique is used for the data set at hand. However, a more direct approach is to construct the wavelet on the interval [0, K] itself. This avoids the unpleasant side-effects, such as singularities, that arise from many extension methods. Various approaches have been suggested, including Meyer's boundary wavelets [3] and dyadic boundary wavelets [4]. Though these wavelet construction methods are more elegant than the simpler boundary extension techniques, their implementation is considerably more involved.
Fig. 19 WPT results for a linear ramp signal (N = 90, Daubechies wavelet, Nf = 6) with boundary handling using polynomial extrapolation extension. First five levels shown only.
They also have some problems of their own, such as numerical instabilities and an inherent imbalance between the number of scaling and wavelet coefficients.
3.2.6 Summary of the DWT and boundary handling methods

When taking the discrete wavelet transform of an ordered sequence of N data points, several important issues have to be considered. These include:

1. Is the length or dimensionality, N, of the data set a power of 2, or is it an arbitrary value?
2. Is it important to minimise boundary artifacts?
3. Does the perfect reconstruction of the original signal need to be ensured?
4. Is the preservation of the orthogonality of the transform required?
5. Is the preservation of the property of vanishing moments required?
6. Does the transform need to be computationally fast?
7. Are numerical accuracy and stability of the transform important considerations?

Generally, we acquire data sequences with arbitrary lengths that are not a power of 2. One approach is to shorten the data sequence to a length equal to a power of 2 by clipping one or both ends of the sequence. However, this may cause a loss of relevant information, particularly if the important information resides at the ends of the sequence. Therefore, to avoid the possible loss of information, we extend the data sequence by using one of the boundary extension techniques prior to applying the DWT/WPT. The DWT/WPT is inherently periodic with a period equal to a power of 2 and can be applied once the boundary extension method has been implemented. The choice of the boundary extension method depends on the morphology of the end-points of the data sequence and on whether the perfect reconstruction of the original data sequence is required. If the end-points match and the sequence is known to be periodic, applying the periodic boundary condition is the preferred option. However, if the end-points do not match, periodic extension may generate singularities (the number of singularities increases with the length of the filter). In this case it is better to use symmetric/anti-symmetric extension or boundary handling using wavelets on the finite interval. Constant-padding or polynomial extrapolation can also be applied successfully in many cases. Perfect reconstruction may be an important consideration when you wish to reconstruct the sequence from its sub-bands.
This may be helpful when back-transforming to identify the important features in the original data sequence. Bi-orthogonal wavelets ensure the perfect reconstruction of the original data sequence. In many applications, the DWT/WPT operator should be orthogonal. For example, some mapping transformations (e.g. principal component analysis) assume that the resulting scaling and wavelet coefficients are independent of each other (if the coefficients are used as inputs to the mapping transformation). Orthogonality ensures independence of the coefficients. Only the periodic extension method and boundary handling using wavelets on the finite interval preserve orthogonality. The symmetric extension method can be used, but only preserves bi-orthogonality; that is, a bi-orthogonal wavelet should be used in this case. Whether the DWT/WPT should be computationally efficient depends on the particular application. Real-time applications would need boundary extension methods that allow fast transform implementations. Applying the periodic boundary condition requires no preliminary extension, as the DWT/WPT is inherently periodic.
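The orthogonality point can be made concrete: for an orthonormal filter pair under periodic extension, the combined analysis matrix is orthogonal, so back-transformation is exact up to the rounding of the filter values. The sketch below is our own illustration (hypothetical names), reusing the matrix construction from the worked example earlier in this chapter:

    import numpy as np

    def periodised_step(x, l, h):
        # One orthogonal, periodised analysis step; returns (c, d) and the
        # combined matrix M = P W, so the inverse step is simply M.T.
        N, Nf = len(x), len(l)
        M = np.zeros((N, N))
        for k in range(N // 2):
            for i in range(Nf):
                M[k, (2 * k + i) % N] += l[i]           # scaling rows
                M[N // 2 + k, (2 * k + i) % N] += h[i]  # wavelet rows
        y = M @ x
        return y[:N // 2], y[N // 2:], M

    l = np.array([0.482963, 0.836516, 0.224144, -0.129409])
    h = np.array([0.129409, 0.224144, -0.836516, 0.482963])
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16)
    c, d, M = periodised_step(x, l, h)
    x_rec = M.T @ np.concatenate([c, d])   # orthogonality: M^{-1} = M^T
    print(np.abs(x - x_rec).max())         # ~1e-6, limited only by the
                                           # rounded 6-digit filter values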
References

1. W. Press, S. Teukolsky, W. Vetterling and B. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge University Press, Cambridge (1992).
2. G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley (1996).
3. Y. Meyer, Ondelettes sur l'intervalle, Revista Matematica Iberoamericana 7 (1992), 115-133.
4. L. Andersson, N. Hall, B. Jawerth and G. Peters, Wavelets on Closed Subsets of the Real Line, in: Recent Advances in Wavelet Analysis, Academic Press (1993), pp. 1-61.
Wavelets in Chemistry
Edited by B. Walczak
© 2000 Elsevier Science B.V. All rights reserved

Chapter 5
Multiscale Methods for Denoising and Compression

Mohamed N. Nounou and Bhavik R. Bakshi
Department of Chemical Engineering, The Ohio State University, OH, USA
1 Introduction

Advances in sensors and computers have made it relatively easy to collect large quantities of measured data from various industrial and manufacturing processes. The data are usually contaminated by errors due to random perturbations, sensor malfunctioning, instrument degradation, and human errors. These errors mask the underlying information and may degrade the efficiency of the tasks that rely on extracting information from the data. Therefore, the measured data need to be cleaned, filtered, or denoised for efficient execution of data-dependent tasks. The task of removing errors from measured data is an ill-posed problem and cannot be solved without some information about the data. Depending on the type of information used, techniques for data filtering are categorized as: denoising with fundamental process models, denoising with empirical process models, and denoising without process models. Denoising of an individual signal is usually performed by various filtering methods based on assumptions about the nature of the errors and the smoothness of the underlying signal. These methods are the primary emphasis of this chapter. For denoising multivariate data, fundamental or empirical models relating the variables may be used. Denoising with fundamental process models is referred to as data reconciliation [1], and requires solution of a constrained optimization problem. If an accurate fundamental model is not available, multivariate data may be denoised by empirical modeling methods such as principal component analysis. Since accurate process models relating the hundreds of measured variables are not easily obtained, the simplest and most widely used denoising methods do not rely on a fundamental or empirical process model. Instead, these methods use information about the nature of the errors or the smoothness of the underlying signal, and include various filtering methods. Linear, univariate low-pass filtering techniques such as mean filtering and exponential smoothing are the most commonly used methods in the chemical industry [2], as they are simple and can be easily used on-line. These filtering methods represent the data at a single scale or resolution in time and frequency, which forces them to trade off the extent of filtering against the quality of filtered local features. Consequently, linear filters are not very effective in filtering signals containing features with different localizations in both time and frequency. The poor representation of the underlying localized features by linear filtering may be overcome by nonlinear filtering methods, such as Finite Impulse Response (FIR) Median Hybrid (FMH) filters [3] and multiscale wavelet-based filtering [4]. FMH filters are most effective when applied to piecewise constant signals contaminated with white noise, require careful selection of the filter length, and are limited to off-line use. Multiscale filtering methods based on wavelet analysis represent the data as a weighted sum of orthonormal wavelets. The multiscale representation captures deterministic features in a few relatively large coefficients, while stationary autocorrelated errors are approximately decorrelated or whitened. Thus, stochastic stationary errors may be decreased with minimum distortion of the underlying deterministic signal by eliminating wavelet coefficients smaller than a threshold. The filtered signal may be recovered by reconstructing the thresholded coefficients back to the time domain. This multiscale filtering technique has superior theoretical and practical properties [4], but is restricted to off-line application and to data of dyadic length. On-line multiscale (OLMS) filtering [5] applies wavelet thresholding to data in a moving window of dyadic length. When time delay is acceptable, the quality of filtering can be further improved by averaging the filtered signals in each window, resulting in an off-line approach that overcomes the boundary effects encountered in the translation invariant (TI) filtering of [6]; this is called boundary corrected translation invariant (BCTI) filtering. Also, the wavelet representation lacks robustness to outliers or gross errors. Multiscale filtering has been extended to data with Gaussian and non-Gaussian errors by multiscale median filtering [7]. This robust multiscale method is also incorporated with OLMS and BCTI to deal with non-Gaussian errors. As measured data are collected, efficient methods are needed for their storage and retrieval. Compact storage or compression can be achieved by eliminating irrelevant and redundant contributions in the data. These contributions are usually high frequency, low amplitude noise that contaminates the important features and transients in the data. Therefore, the compression problem is similar to filtering, in the sense that both rely on noise removal to extract the important features in the data. Some of the popular compression techniques used in the chemical manufacturing industry include the boxcar
and the backward slope methods [8-10], the swinging door method [11], and the piecewise linear on-line trending (PLOT) method [12]. Similar to linear filtering, these compression methods rely on linear interpolation of the data to eliminate irrelevant features, and operate at a single scale in time and frequency. Thus, they are best suited for steady-state analysis of processes. Since real data contain localized features in both time and frequency, a multiscale compression technique is needed for proper and efficient data analysis [13]. As in filtering, the efficient wavelet-based representation can be exploited to separate features from high frequency noise components, improving compression by suppressing wavelet coefficients smaller than a threshold and storing only those corresponding to important features. This compression can even be performed at the same time the data are filtered, to reduce computation. The rest of this chapter is organized as follows. A description of multiscale wavelet-based representation of data and its advantages is presented next, followed by a description of various methods to characterize different types of errors present in the data. Then, various filtering and compression techniques are described, including linear filtering such as FIR and infinite impulse response (IIR) filtering, and nonlinear filtering and compression such as FMH filtering and multiscale filtering. Finally, an on-line multiscale filtering technique is presented, followed by some concluding remarks.
2 Multiscale representation of signals using wavelets

Measured data of chemical processes are usually multiscale in nature, due to deterministic features occurring at different locations and resolutions, and stochastic measurements with varying contributions over time and frequency. A proper analysis of such data requires their representation at multiple scales or resolutions. Such representation can be achieved by expressing the signal as a weighted sum of orthonormal basis functions defined in the time-frequency space, such as wavelets. Any square-integrable signal may be represented at multiple scales by decomposition on a family of wavelets and scaling functions, as shown in Fig. 1. The signals in Fig. 1(b), (d), and (f) are at increasingly coarser scales as compared to the original signal shown in Fig. 1(a). The original signal can be represented as the sum of all detail signals and the last scaled signal,
    x(t) = Σ_{k=0}^{2^J0 − 1} cJ0,k φJ0,k(t) + Σ_{j=J0}^{J−1} Σ_{k=0}^{2^j − 1} dj,k ψj,k(t)    (1)

Fig. 1 Multiscale decomposition of a stair-step signal using Haar.
where cJ0,k is the kth scaling function coefficient at the coarsest scale J0, and dj,k is the kth wavelet coefficient at scale j. Fast wavelet decomposition and reconstruction algorithms of complexity O(n) have been developed for a discrete signal of dyadic length [14]. A dyadic signal can also be decomposed non-recursively using the Fast Wavelet Transform (FWT), in which the signal is multiplied by a decomposition matrix, W, of size (n × n) as follows:

    ( cJ0   )            ( CJ0   )
    ( dJ0   )            ( DJ0   )
    ( dJ0+1 ) = W x(t) = ( DJ0+1 ) x(t)    (2)
    (  ...  )            (  ...  )
    ( dJ-1  )            ( DJ-1  )

where cJ0 is the vector of scaling coefficients at scale J0, dj the vector of wavelet coefficients at scale j, CJ0 the low-pass filtering matrix at scale J0, and Dj the high-pass filtering matrix at scale j.
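A minimal sketch of the recursive O(n) pyramid algorithm, using the Haar filters for brevity (our own illustration with hypothetical names; a dyadic signal length is assumed):

    import numpy as np

    def haar_decompose(x):
        # Recursive (Mallat) decomposition: each pass halves the scaled
        # signal and emits one detail signal, for O(n) total work.
        c = np.asarray(x, dtype=float)
        details = []
        while len(c) > 1:
            even, odd = c[0::2], c[1::2]
            details.append((even - odd) / np.sqrt(2.0))   # d_j
            c = (even + odd) / np.sqrt(2.0)               # c_j
        return c, details[::-1]   # coarsest first, matching Eqs. (1)-(2)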
Orthonormal wavelet basis functions are of fixed shape, as they tile the time-frequency space in a pre-determined and rigid manner. This rigidity may be overcome by developing a library of wavelet packet basis functions, obtained by decomposing each detail and scaled signal from the wavelet decomposition by repeated application of the corresponding filters until no further decomposition is possible. This decomposition results in a complete binary tree of basis functions that covers a wide variety of time-frequency localizations and shapes. Efficient branch-and-bound algorithms of complexity O(n log n) have been developed for searching this wavelet packet library to select the best orthonormal basis for representing a given signal [15].
3 Characterization of noise

Proper data filtering and compression requires a good understanding of the nature of the contaminating errors. Different types of errors can be present in a measured signal. According to the time-dependency of their statistics, errors can be classified as stationary or non-stationary. An error signal is called stationary if it has time-independent mean, variance, and autocorrelation function, and is called non-stationary if any of these statistics is time-varying. Examples of non-stationary errors include random walk signals and signals with mean shifts. Denoising of signals with non-stationary errors usually requires filtering techniques that adapt their parameters to account for the changing nature of these errors. Errors can also be classified according to their governing probabilistic distributions. Errors are called Gaussian or random if they follow the Gaussian or normal distribution, and are called non-Gaussian or gross errors otherwise. Denoising data containing random and gross errors is usually a more challenging task, since gross errors need to be identified and eliminated as an extra step in the denoising process. Moreover, error signals can be classified according to their autocorrelation function or power spectrum as either white or correlated. White noise observations are independent and identically distributed, whereas correlated observations are dependent on their neighbors. Therefore, denoising correlated noise is usually more difficult than denoising white noise, since the error correlation must be taken into account. In summary, efficient denoising requires proper characterization of the noise imposed upon the underlying data. Without such understanding, proper denoising cannot be accomplished. In this section, different noise characterization techniques, such as the autocorrelation function, the power spectrum, and the wavelet power spectrum, are described.
3.1 Autocorrelation function

The autocorrelation function (ACF) reveals how the correlation between any two values of the signal changes as their separation changes [16]. It is a time domain measure of the stochastic process memory, and does not reveal any information about the frequency content of the process. Generally, for an error signal et, the ACF is defined as

    ρk = Cov(et, et+k) / sqrt(Var(et) Var(et+k))    (3)

For a stationary stochastic process of variance σ², the previous expression for the ACF reduces to

    ρk = Cov(et, et+k) / σ²    (4)

which is time-independent. A white noise process has an autocorrelation function of zero at all lags except a value of unity at lag zero, indicating that the process is completely uncorrelated. A correlated process, on the other hand, such as ARMA or ARIMA, has non-zero values at lags other than zero, indicating a correlation between different lagged observations. However, the wavelet coefficients corresponding to a correlated process are approximately decorrelated at multiple scales, as seen in column 2 of Fig. 2, in which the detail signals of an ARMA(1,1) process are approximately decorrelated. This advantage of multiscale representation will help improve the quality of denoising, as will be shown later.
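This decorrelation is easy to check numerically. The following sketch (our own illustration, not from the text) generates the ARMA(1,1) process used in Fig. 2 and compares the sample ACF of the raw signal with that of its finest-scale Haar detail signal:

    import numpy as np

    def acf(e, max_lag=5):
        # Sample autocorrelation function, cf. Eqs. (3) and (4).
        e = e - e.mean()
        v = e @ e / len(e)
        return np.array([(e[:len(e) - k] @ e[k:]) / (len(e) * v)
                         for k in range(max_lag + 1)])

    # ARMA(1,1) model from Fig. 2: y_t = 0.8 y_{t-1} + a_t - 0.3 a_{t-1}
    rng = np.random.default_rng(0)
    a = rng.standard_normal(4096)
    y = np.zeros(4096)
    for t in range(1, 4096):
        y[t] = 0.8 * y[t - 1] + a[t] - 0.3 * a[t - 1]

    d1 = (y[0::2] - y[1::2]) / np.sqrt(2.0)   # finest Haar detail signal
    print(acf(y))    # clearly correlated at non-zero lags
    print(acf(d1))   # approximately white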
3.2 Power spectrum

A Fourier transform of the autocorrelation function of a stochastic process gives the power spectrum function, which shows the strength or energy of the process as a function of frequency [17]. Frequency analysis of a stochastic process is based on the assumption that it contains features changing at different frequencies, and thus can be described using sine and cosine functions having the same frequencies [16]. The power spectrum is defined in terms of the covariance function of the process, γk = Cov(et, et+k), as
    PS(ω) = 2 [ γ0 + 2 Σ_{k=1}^{n−1} γk cos(2πωk) ],    0 ≤ ω ≤ 0.5    (5)

where n is the length of the stochastic signal. Thus, the relative importance of individual frequencies can be determined by computing the sum of squares of the sine and cosine function coefficients at these frequencies. The power spectrum estimate obtained using Eq. (5) is usually noisy and is filtered to obtain a smoother estimate. Theoretically, the power spectrum of an uncorrelated random process is a constant function, since all frequencies are equally present in the signal. This is not the case for a correlated process. However, since the wavelet coefficients of a correlated process are approximately decorrelated at multiple scales, the power spectrum of the wavelet coefficients at each scale is approximately a constant function, as shown in column 3 of Fig. 2, where the power spectra of the detail signals corresponding to an ARMA(1,1) process are flattened at multiple scales.

Fig. 2 The autocorrelation function (correlation versus lag) and power spectrum (log2(power) versus log2(frequency)) of the wavelet coefficients for an ARMA(1,1) process with the model yt = 0.8 yt-1 + at − 0.3 at-1, where a is white noise of variance one.
3.3 Wavelet spectrum

Since wavelets provide a time-frequency representation of signals, an approximation of a signal's power spectrum, called the "wavelet spectrum", can be obtained from its wavelet representation. Like the power spectrum, the wavelet spectrum shows the energy of a process as a function of frequency. A major difference between the two is the type of basis functions used in the computation: unlike the power spectrum, which uses trigonometric functions, the wavelet spectrum uses wavelets as basis functions. The wavelet spectrum function is determined by computing the variances of the wavelet coefficients at every scale and then plotting each variance value versus its scale number on a log-log graph [18]. The wavelet spectrum usually results in a smoother but sparser estimate of the power spectrum, as the frequency content at a certain scale is lumped into a single point in the wavelet spectrum function.
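A sketch of the computation (our own illustration; Haar filters and a dyadic length are assumed, and the name is hypothetical):

    import numpy as np

    def wavelet_spectrum(x):
        # Variance of the wavelet coefficients at every scale; plotting
        # log2(variance) against scale number gives the wavelet spectrum.
        c = np.asarray(x, dtype=float)
        variances = []
        while len(c) >= 4:                 # keep enough points per scale
            even, odd = c[0::2], c[1::2]
            variances.append(np.var((even - odd) / np.sqrt(2.0)))
            c = (even + odd) / np.sqrt(2.0)
        return np.log2(np.array(variances))   # finest scale first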
4 Denoising and compression

4.1 Denoising and compression of data with Gaussian errors

4.1.1 Linear filtering

Currently used data denoising or filtering techniques, such as FIR and IIR filtering, are single scale in nature, meaning that they represent the filtered signal using basis functions that have a fixed time-frequency localization. These techniques are very popular in the chemical industry because they are computationally efficient, easy to implement, and can be readily used in on-line processes [2]. These linear filtering techniques cannot be efficiently used for compression, since they keep as many coefficients as the number of measured data points.

FIR filtering. Finite impulse response (FIR) filters are linear low-pass filters which can be represented as

    Xt = Σ_{i=0}^{I−1} bi xt-i    (6)

where I is the filter length and {bi} is a finite sequence of weighting coefficients, which define the characteristics of the filter and satisfy the following condition:
    Σ_i bi = 1    (7)
The weighting sequence {bi} is the impulse response of the FIR filter. When all weighting coefficients {bi} are equal, the FIR filter reduces to a mean filter. Mean filtering is a popular FIR filtering technique, in which measured signals are filtered by taking the mean of the data points in a pre-specified moving window [2,19]. For a filter of length I, any filtered data point can be represented mathematically as the average of the last I measured data points:

    Xt = (1/I)(xt + xt-1 + ... + xt-I+1)    (8)

The mean filter can also be thought of as a convolution of the measured signal with a vector of I constant coefficients, each equal to 1/I. A schematic diagram of the weights used by a mean filter of length I = 5 is shown in Fig. 3(a).
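A sketch of Eq. (8) in code (our own illustration; the first I − 1 points simply average whatever history is available):

    import numpy as np

    def mean_filter(x, I=5):
        # Causal mean (FIR) filter: each output averages the last I samples.
        x = np.asarray(x, dtype=float)
        return np.array([x[max(0, t - I + 1):t + 1].mean()
                         for t in range(len(x))])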
IIR filtering. Infinite impulse response (IIR) filters are linear low-pass filters which can be represented as

    Xt = Σ_{i=0}^{∞} bi xt-i    (9)
Fig. 3 Filter used in (a) mean filtering and (b) exponential smoothing.
The IIR filter coefficients also satisfy the condition shown in Eq. (7). The time horizon of the IIR filter is infinite, and therefore any filtered data point is represented as a weighted sum of all previous measurements. A more detailed discussion of FIR and IIR filters is presented in [20]. The exponentially weighted moving average (EWMA) is a popular IIR filter. An EWMA filter smoothes a measured data point by exponentially averaging that particular point with all previous measurements. Similar to the mean filter, the EWMA filter is a low-pass filter that eliminates high frequency components in the measured signal. It is implemented recursively by taking a weighted average of the last measured data point and the previous smoothed one, with corresponding weights α and (1 − α):

    Xt = α xt + (1 − α) Xt-1    (10)

The parameter α is an adjustable smoothing parameter lying between zero and unity, which defines the cut-off frequency above which features are eliminated. The weightings of the EWMA filter are shown in Fig. 3(b): the EWMA filter coefficients drop exponentially depending on the smoothing parameter α, giving more importance to the more recent measurements. Higher values of α make the filter coefficients drop faster, which increases the cut-off frequency to keep higher frequency features, while smaller values of α slow the coefficients' exponential drop, which lowers the cut-off frequency to eliminate features at lower frequencies.
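The recursion of Eq. (10) is one line per sample. A minimal sketch (our own illustration; the filter is seeded with the first measurement):

    import numpy as np

    def ewma(x, alpha=0.3):
        # Exponentially weighted moving average, Eq. (10).
        x = np.asarray(x, dtype=float)
        out = np.empty_like(x)
        out[0] = x[0]                      # seed with first measurement
        for t in range(1, len(x)):
            out[t] = alpha * x[t] + (1.0 - alpha) * out[t - 1]
        return out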
Drawbacks of linear filtering. The basis functions representing raw measured data have a temporal localization equal to the sampling interval. Linear filters represent the measurements on basis functions with a broader temporal localization and narrower frequency localization, as shown in Fig. 4(a). These filters are single scale in nature, since the basis functions have a fixed time-frequency localization. Consequently, these methods face a trade-off between the accurate representation of temporally localized changes and the efficient removal of temporally global noise. Therefore, simultaneous noise removal and accurate feature representation of non-stationary measured signals cannot be effectively achieved by single scale filtering methods. Linear filtering methods also do not provide simultaneous data compression.
4.1.2 Nonlinear filtering and compression

Nonlinear filtering techniques have been developed to overcome the inability of linear filters to capture features at different scales. Nonlinear filtering methods include FIR-Median Hybrid filtering and multiscale filtering, and nonlinear compression methods include piecewise linear compression and multiscale compression.
Fig. 4 Comparison of the time-frequency space decomposition by (a) linear filtering techniques and (b) multiscale techniques.
FMH filtering. A FIR-Median Hybrid (FMH) filter is a median filter which has a preprocessed input from M linear FIR filters [3,21]. Thus, the FMH filter output is the median of only M values, which are the outputs of the M FIR filters applied to the original data. For an FMH filter of length 2I + 1 with three FIR substructures (M = 3), the data are split into three parts on which the FIR filters are applied. Then, the median operator is applied to the outputs of all FIR filters to obtain the output of the FMH filter. For this particular example, the three FIR filters used are:

    y1 = (xt-1 + xt-2 + ... + xt-I) / I
    y2 = xt                                  (11)
    y3 = (xt+1 + xt+2 + ... + xt+I) / I

where {xt-I, xt-I+1, ..., xt-1, xt, xt+1, xt+2, ..., xt+I} are the 2I + 1 data points filtered by the FMH filter, and y1, y3, and y2 are called the backward predictor, forward predictor, and the center value, respectively. FMH filtering is a batch filtering technique, which is most effective in capturing sharp changes in piecewise constant signals. The lengths of the FIR filters are chosen to preserve the signal's features while eliminating the high frequency noise.
Long FIR filters can oversmooth sharp edges, while short FIR filters may not eliminate enough noise. Since the central point in the FMH filter is the original noisy data point, FMH filters tend to retain some noise. Better noise removal is possible by applying the FMH filter several times, which will result in a root signal that does not change with further filtering. Further extensions of FMH filters to IIR and predictive FIR filters have also been developed [21].
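A sketch of the three-substructure FMH filter of Eq. (11) (our own illustration with hypothetical names; end-points where the full window does not fit are left unfiltered here):

    import numpy as np

    def fmh_filter(x, I=10):
        # FIR-Median Hybrid filter: median of backward mean, center value,
        # and forward mean over a window of 2I + 1 points, Eq. (11).
        x = np.asarray(x, dtype=float)
        y = x.copy()
        for t in range(I, len(x) - I):
            y1 = x[t - I:t].mean()            # backward predictor
            y2 = x[t]                         # center value
            y3 = x[t + 1:t + I + 1].mean()    # forward predictor
            y[t] = np.median([y1, y2, y3])
        return y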
Piecewise linear compression. Methods of compression by piecewise linear approximation rely on representing the signal as linear segments. These methods compress the data by retaining only the end points of each linear segment. The end points of the line segments are chosen to minimize a defined error criterion or to satisfy error bounds. The reconstructed signal, obtained only from the retained data points, is a piecewise linear approximation of the original one. These compression techniques are the most widely used in the chemical industries as they are simple and computationally efficient, and include the boxcar and backward slope methods [8-10], the swinging door method [11], and the piecewise linear on-line trending (PLOT) method [12]. In general, these techniques represent the signal as a weighted sum of basis functions of the form

    φi(t) = mi t + di,    t ∈ [ti, ti+1]    (12)

where

    mi = (x̂(ti+1) − x̂(ti)) / (ti+1 − ti),    di = x̂(ti) − mi ti

The end points of any linear segment, x̂(ti) and x̂(ti+1), are either interpolated from the data or taken as actual data points, as in the boxcar method. These piecewise linear approximation techniques perform well for steady-state process data with little noise, but are inadequate for process data with important low amplitude transients and are inefficient for data with relevant high frequency features. Also, the line segments used in the approximation satisfy a local, not a global, error criterion.
Multiscale filtering and compression. Multiscale filtering and compression using wavelets are based on the observation that random errors in a signal are present over all the coefficients while deterministic changes get captured in a small number of relatively large coefficients. Thus, stationary Gaussian noise may be removed by a three-step method [4]:
1. Transform the noisy signal into the time-frequency domain by decomposing the signal on a set of orthonormal wavelet basis functions.
2. Threshold the wavelet coefficients by suppressing the coefficients smaller than a selected threshold value.
3. For filtering, transform the thresholded coefficients back into the original domain; for compression, store only the thresholded coefficients.

Multiscale filtering by wavelet thresholding and its statistical properties have been studied [4,22]. It has been shown that, for a noisy signal of length n, the filtered signal will have an error within O(log n) of the error between the noise-free signal and the signal filtered with a priori knowledge about the smoothness of the underlying signal. This wavelet thresholding technique can be used to filter white as well as correlated noise, since the wavelet coefficients of correlated signals are approximately decorrelated at multiple scales.

Threshold selection. Selecting the threshold value is a critical step for both filtering and compression, and the accuracy of the reconstructed signal depends on the optimized criterion. A non-exclusive list of optimality criteria includes: the mean-square error, as in "Riskshrink" [23]; the Stein Unbiased Risk Estimate, as in "SureShrink" [22]; and visual quality, as in "Visushrink" [4]. In the Visushrink method, Donoho proposed a universal scale-dependent threshold which is applied using soft thresholding. For wavelet filtering, this universal threshold is given by

    tj = σj sqrt(2 log n)    (13)

where n is the signal length and σj is the standard deviation of the noise at scale j, which can be estimated from the wavelet coefficients at that scale by

    σj = median{|dj|} / 0.6745    (14)

where dj are the wavelet coefficients at scale j. For signals corrupted by white noise, the Visushrink threshold value is constant for all scales, since the detail signals obtained by decomposing a white noise signal are also white noise signals with the same standard deviation as the original white noise signal. This Visushrink method for estimating the threshold is used in all illustrative examples of this chapter.
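The whole three-step Visushrink procedure fits in a few lines. The following Python sketch is our own illustration (hypothetical names; Haar filters, soft thresholding, and a dyadic signal length are assumed for simplicity): it estimates σj by Eq. (14), forms the threshold of Eq. (13) at each scale, and reconstructs:

    import numpy as np

    def visushrink_denoise(x):
        # Step 1: Haar decomposition (dyadic length assumed).
        c = np.asarray(x, dtype=float)
        details = []
        while len(c) > 1:
            even, odd = c[0::2], c[1::2]
            details.append((even - odd) / np.sqrt(2.0))
            c = (even + odd) / np.sqrt(2.0)
        n = len(x)
        # Step 2: threshold each scale with t_j = sigma_j * sqrt(2 log n),
        # sigma_j = median(|d_j|) / 0.6745 (Eqs. (13)-(14)); soft rule.
        thresholded = []
        for d in details:
            sigma_j = np.median(np.abs(d)) / 0.6745
            t_j = sigma_j * np.sqrt(2.0 * np.log(n))
            thresholded.append(np.sign(d) * np.maximum(np.abs(d) - t_j, 0.0))
        # Step 3: reconstruct back into the original domain.
        for d in reversed(thresholded):
            new = np.empty(2 * len(c))
            new[0::2] = (c + d) / np.sqrt(2.0)
            new[1::2] = (c - d) / np.sqrt(2.0)
            c = new
        return c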
For wavelet packets, a similar threshold value is applied after selecting the best basis that represents the signal by minimizing a cost criterion. This cost criterion is selected such that it is large when the wavelet coefficients are of the same magnitude and small when they are of different magnitudes. Such a selection guarantees a good separation of the wavelet coefficients corresponding to additive noise from those corresponding to the underlying signal. The threshold for wavelet packet filtering, which is a function of the selected basis, is given by

    t = σ sqrt(2 log G)    (15)

where G = n log2(n) is the size of the wavelet packet library, and σ is the standard deviation of the coefficients of the selected basis. For compression, other threshold selection criteria include the compression ratio (CR), the mean-square error (MSE), and the local point-wise error of approximation, which are discussed in more detail in [13]. In this chapter, the universal threshold given in Eq. (13) will be used for both filtering and compression. Two thresholding techniques have been studied: hard thresholding and soft thresholding [4,6,22]:
1. Hard thresholding. For the wavelet coefficients dj,k at scale j and a threshold value t, the thresholded coefficients d̂j,k are determined as

        d̂j,k = dj,k   if |dj,k| > t
        d̂j,k = 0      if |dj,k| ≤ t    (16)

   Hard thresholding can lead to better reproduction of peak heights and discontinuities, but at the price of occasional artifacts that can roughen the appearance of the signal estimate [4].

2. Soft thresholding. In soft thresholding, all coefficients are shrunk towards zero by the threshold value. Mathematically,

        d̂j,k = sign(dj,k)(|dj,k| − t)   if |dj,k| > t
        d̂j,k = 0                        if |dj,k| ≤ t    (17)
Usually, soft thresholding gives better visual quality of filtering than procedures based on minimizing the mean square error [25].
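Both rules are one-liners in code. A sketch of Eqs. (16) and (17) (our own illustration):

    import numpy as np

    def hard_threshold(d, t):
        # Eq. (16): keep coefficients with magnitude above t, zero the rest.
        return np.where(np.abs(d) > t, d, 0.0)

    def soft_threshold(d, t):
        # Eq. (17): shrink surviving coefficients towards zero by t.
        return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)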
End effects. The problem of end effects, or inaccuracy at the filtered signal's boundaries, is common in wavelet-based filtering, due to the finite length of the measured signal and the non-causal nature of most wavelet filters, which require additional data points at the two ends of the signal for decomposition. These additional data are usually estimated by assuming a mirror image or by cyclic augmentation of the end points of the original signal. These assumptions may or may not be valid depending on the signal and the filter used for decomposition, which introduces inaccuracies at the boundaries. One rigorous solution of the problem of end effects is to use boundary corrected filters, as suggested by Cohen et al. [26]. In this approach, wavelet filters different from those used in the signal's interior are used at the edges, and are called boundary corrected filters. These filters, which are derived from the edge scaling functions, are designed to be causal and orthogonal to each other and to the interior filters. Boundary corrected filters will be used in the on-line multiscale (OLMS) filtering method presented later. Other alternative solutions for end effects are explained by Cohen et al. [26].
Example 1. Off-line multiscale filtering and compression. The capability of multiscale denoising and compression is illustrated by comparing the quality of the reconstructed signals and the mean square errors at a fixed compression ratio obtained by the multiscale and boxcar methods. For this example, and all examples in this chapter, the mean square error refers to the error between the filtered and the noise-free signals, and all filtering parameters are selected to minimize this error. A bumps signal contaminated with white noise of variance 0.5, shown in Fig. 5(a), is used in this illustration. In multiscale filtering, the Daubechies boundary corrected filter (D2) and a scale depth of 3 are used, resulting in a filtered signal, shown in Fig. 5(b), with a mean square error of 0.1045 and a compression ratio of 7.6992. For the same compression ratio, the boxcar method resulted in a reconstructed signal, shown in Fig. 5(c), with a mean square error of 0.5721. Note that using the boxcar method, it is not possible to obtain a mean square error as low as that obtained by the multiscale method, no matter how small the compression ratio gets.

Translation invariant filtering. Filtering by wavelet thresholding sometimes exhibits visual artifacts in the neighborhood of discontinuities, due to the lack of translation invariance [6]. When a sharp change is present at a non-dyadic location from the end of a signal, the wavelet function used to represent the change and the change itself do not align, creating an artifact in the reconstructed signal which is not present in the original one.
Fig. 5 (a) Bumps signal with white noise, (b) multiscale filtered signal using the D2 boundary corrected filter (MSE = 0.1045, CR = 7.7), (c) reconstructed signal using boxcar compression (MSE = 0.5721, CR = 7.7).
This observation is called the pseudo-Gibbs phenomenon. One method to suppress such artifacts, termed "cycle spinning", is to average out the translation dependence [6]. For a signal of length n, translation invariant (TI) filtering is performed by shifting the signal n times, filtering it n times, and then averaging all translations to weaken the pseudo-Gibbs phenomenon, as shown in Fig. 6(a); this can be accomplished with O(n log n) complexity. The idea behind cycle spinning, or TI filtering, is that shifting the signal changes the positions of its features, which diminishes the unfortunate misalignment between features in the signal and features in the basis functions used to represent them. Translation invariant filtering can be implemented using wavelets as well as wavelet packets. As indicated in [6], hard thresholding and translation invariance combined give good visual and quantitative characteristics.
135
1, ! ~ !~ ! ~ !, i 21 ~[,1
!o!,1,!,!~!~!:1,] !,1~!,1~!~!,!,1ol I~!,I~1~!,1,!ol,i _
(a)
l'!213
']
i'i213
'151617[8 (b)
Fig. 6 The translation mechanisms used in (a) TI and (b) BCTI filtering (thick lines indicate augmentation of the signal ends).
Translation invariant filtering is a step forward in solving the problem of data filtering, and is even referred to as the "second generation" of data filtering [6]. It helps diminish the presence of spikes and artifacts in the filtered data and gives better visual quality of thresholding. However, it usually results in errors at the boundaries, especially when the signal's ends differ in value, since it assumes the signal to be a cyclic list; more importantly, it cannot be used for on-line filtering, since a signal of dyadic length is needed for filtering. These limitations are overcome by the on-line multiscale (OLMS) filtering method described later and by the boundary corrected translation invariant (BCTI) method presented next.
Boundary corrected translation invariant filtering. Boundary corrected translation invariant (BCTI) filtering diminishes the presence of artifacts due to the pseudo-Gibbs phenomenon by shifting the signal's features using a moving window of dyadic length, in which the data are filtered, and then averaging the filtered signals from all windows, as shown in Fig. 6(b). The BCTI filtered value for the first measurement will be the mean of two translations, whereas the filtered value for the fourth measurement will be the mean of five translations. This approach is similar to TI filtering, but since it does not assume the signal to be a cyclic list, it overcomes the problem of boundary effects encountered in TI filtering. This advantage is illustrated in Fig. 6, which compares the translation mechanisms for TI and BCTI filtering. The comparison shows that, in eliminating errors at the boundaries, BCTI filtering takes the mean of fewer translations. Consequently, BCTI requires fewer computations than TI, but may be less smooth near the boundaries. The performance of the BCTI filtering technique is illustrated and compared with other off-line filtering methods in the next example. Some practical issues regarding the implementation of BCTI filtering, and hints to improve its performance, will be discussed later in the on-line filtering section, along with those of OLMS filtering.
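The averaging idea behind TI filtering can be sketched as follows (our own illustration with hypothetical names; `denoise` can be any fixed filter taking just the signal, for instance the visushrink_denoise sketch above). TI uses circular shifts, as below; BCTI replaces the circular shifts with a moving window so the signal is not treated as a cyclic list:

    import numpy as np

    def ti_filter(x, denoise):
        # Cycle spinning: shift, filter, unshift, and average over all
        # n circular shifts of the signal.
        x = np.asarray(x, dtype=float)
        n = len(x)
        acc = np.zeros(n)
        for s in range(n):
            acc += np.roll(denoise(np.roll(x, -s)), s)
        return acc / n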
Example 2. BCTI, TI, and FMH filtering. The performance of BCTI filtering is illustrated by comparing it with TI and FMH filtering for a cusp signal contaminated with white noise of variance 0.2, shown in Fig. 7(a). In both BCTI and TI filtering, the Haar wavelet, a scale depth of 5, and hard thresholding are used, resulting in BCTI and TI filtered signals, shown in Fig. 7(b) and (d), with mean square errors of 0.0045 and 0.0055, respectively. In BCTI filtering, an initial window of length 512 is used. In FMH filtering, a filter of length 31 with three substructures is used, resulting in a filtered signal, shown in Fig. 7(c), with a mean square error of 0.0115. Notice that both TI and BCTI filtering are smoother than FMH filtering, and that BCTI is more accurate at the boundaries than TI, resulting in the lowest mean square error.

4.2 Filtering of data with non-Gaussian errors

The presence of outliers changes the statistical properties of the data, such as the autocorrelation function and the power spectrum.
Fig. 7 (a) Cusp signal with white noise, (b) BCTI filtering using Haar (MSE = 0.0045), (c) FMH filtering, filter length = 31 (MSE = 0.0115), (d) TI filtering using Haar (MSE = 0.0055).
In the filtering techniques described so far, it is assumed that only stationary Gaussian errors are present in a measured signal. When this assumption is violated, outliers are likely to be retained in the filtered signal. Therefore, the tasks of non-Gaussian error detection and isolation need to be incorporated into the filtering process. Some of the non-Gaussian error removal techniques, such as median, FMH, and multiscale median filtering, are described below.
4.2.1 M e d i a n a n d F M H f i l t e r i n g
Standard median and F M H filters have been widely used in non-Gaussian error elimination [21,27]. Standard median filters simply use the middle observation from data in a moving window, whereas F M H filters preprocess the data with F I R filters as discussed earlier. F M H filters are superior to the standard median filters due to their improved ability to preserve temporally localized features, while eliminating errors. However, proper selection of the F I R filters requires knowledge about the maximum duration of the outliers. When such knowledge is available, the length of the F I R filters used can be
determined so that a complete elimination of gross errors is achieved. Similarly, the standard median filter length should be selected long enough to eliminate the longest patch of outliers present in the data. Otherwise, some or all outliers may be retained in the filtered signal. For a signal with a maximum outlier patch length p, a median filter of length 2p + 1 guarantees complete elimination of all outliers.
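To make the moving-window operation concrete, a minimal Matlab sketch of a standard median filter is given below. The function name, the assumption of a row-vector input, and the end-point extension used at the boundaries are our own illustrative choices, not part of the methods discussed above.

    function y = medfilt(x, L)
    % Standard moving median filter of odd length L = 2p + 1 (a sketch).
    % x is assumed to be a row vector; the signal is extended by
    % repeating its end values so that y has the same length as x.
    p = (L - 1) / 2;
    xe = [repmat(x(1), 1, p), x, repmat(x(end), 1, p)];
    y = zeros(size(x));
    for i = 1:numel(x)
        y(i) = median(xe(i:i + L - 1));   % middle observation of the window
    end
    end

For an outlier patch of maximum length p = 3, for example, medfilt(x, 7) ensures that no window is ever dominated by the patch, so all its points are replaced.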
4.2.2 Multiscale median filtering

Wavelet-based filtering is a very effective approach for denoising signals contaminated by white as well as correlated Gaussian noise. However, if it is used to filter signals with non-Gaussian errors, outliers will be present at multiple scales, and the large coefficients corresponding to outliers get misinterpreted as important features. Thus, wavelet thresholding is not effective in eliminating non-Gaussian errors. This limitation has been overcome by combining wavelet thresholding with multiscale median filtering [7]. In this technique, outliers are eliminated at each scale using median filters, as shown in Fig. 8.
A standard median filter or an FMH filter can be used in this approach. The original signal, S(0), passes through a median filter at the finest scale, which results in the signal U(0). Then the low- and high-pass wavelet filters, L and H, are applied to U(0), resulting in the scaled and detail signals at the finer resolution, S(1) and D(1), respectively. The same process is repeated on the signal S(1), which passes through the median filter to get U(1), on which the low- and high-pass wavelet filters are applied to get S(2) and D(2). The process is then repeated to get scaled and detail signals at coarser scales. Due to the dyadic down-sampling used in the wavelet transform, the effective median filter length increases at coarser scales. Therefore, outliers in short patches can be eliminated at finer scales, while longer patches of outliers can be eliminated at coarser scales. When the effective median filter length at finer scales is shorter than the duration of the scaling function coefficients corresponding to a particular outlier patch, outliers will leak into coarser scales.
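A minimal Matlab sketch of this decomposition is given below, using the Haar filters and the medfilt function sketched in the previous section. It is an illustration under those assumptions, not the authors' implementation; all names are our own.

    function [S, D] = robust_decompose(x, depth, L)
    % One realization of the robust multiscale decomposition of Fig. 8
    % with Haar filters (a sketch): at each scale the scaled signal is
    % passed through a median filter of length L before the low- and
    % high-pass wavelet filters with dyadic down-sampling are applied.
    S = cell(depth + 1, 1);
    D = cell(depth, 1);
    S{1} = x(:)';                       % S(0); dyadic length assumed
    for j = 1:depth
        u = medfilt(S{j}, L);           % U(j-1): suppress outliers
        e = u(1:2:end);  o = u(2:2:end);
        S{j + 1} = (e + o) / sqrt(2);   % scaled signal S(j)
        D{j}     = (e - o) / sqrt(2);   % detail signal D(j)
    end
    end

Because of the down-sampling, the median filter of fixed length L acts over an effectively doubled span at each coarser scale, which is the mechanism described in the text.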
Fig. 8 Robust multiscale wavelet decomposition method [7].
Such leakage may be eliminated by repeating robust multiscale filtering and by selecting a long enough median filter [7]. In theory, for a low-pass wavelet filter of length z, leakage is prevented by using median filters of length 2z + 1 [28]. However, when long patches of outliers are present, a complete elimination of outliers can be accomplished by selecting the median filter long enough so that the effective median filter at the coarsest scale is longer than the longest patch of outliers, and by repeating multiscale median filtering. The advantages of multiscale median filtering are combined with those of BCTI to improve upon its performance for data with non-Gaussian errors.

BCTI filtering of non-Gaussian errors. The capabilities of the BCTI filtering technique may be extended to simultaneously eliminate Gaussian and non-Gaussian errors by combining it with multiscale median filtering. The robust BCTI technique uses a moving window of dyadic length, in which the data are filtered using the robust wavelet transform illustrated in Fig. 8. In the robust BCTI technique, good elimination of Gaussian and non-Gaussian errors is possible by averaging the filtered signals from different time shifts. The averaging is equivalent to combining the decisions about the nature of a particular change from all windows, and thus allows better judgment and improved non-Gaussian error elimination, as illustrated by the next example.

Example 3 Robust BCTI and FMH filtering of data with non-Gaussian errors. The performance of the robust BCTI technique is shown and compared with that of FMH filtering for a bumps signal with a mean shift, contaminated with white noise of variance 0.5 and two outlier patches of length 3, which is shown in Fig. 9(a). In robust BCTI filtering, the Haar wavelet, a scale depth of 2, and a median filter of length 7 are used, resulting in a filtered signal, shown in Fig. 9(b), with a mean square error of 0.1989. In FMH filtering, a filter of length 21 with three substructures is used, resulting in a filtered signal, shown in Fig. 9(c), with a mean square error of 0.2704. As Fig. 9 shows, both techniques could identify and eliminate all outliers. However, FMH filtering retains more noise than the robust BCTI, leading to a higher mean square error.
5 On-line multiscale filtering
The existing nonlinear filtering techniques described in the previous section do perform better than linear filters, such as FIR and IIR filters, for a broad
Fig. 9 (a) Noisy and noise-free bumps signal with outlier patches of length 3, (b) robust BCTI filtering using Haar and a scale depth of 2 (MSE = 0.1989), (c) FMH filtering using a filter of length 21 (MSE = 0.2704).
variety of signals. However, a significant disadvantage of these nonlinear or multiscale methods is that they cannot be implemented on-line. The non-causality of most wavelet filters introduces a time delay in the computation, which increases at coarser scales and with smoother filters. This time delay may be overcome in a rigorous manner by using boundary corrected filters at the signal boundaries [26]. Another reason for restricting the wavelet-based methods to off-line use is the dyadic discretization of the wavelet parameters, which requires a signal of dyadic length for the wavelet decomposition. A signal containing a dyadic number of measurements can be decomposed as shown in Fig. 10(a). In contrast, if the number of measurements is odd, the last point cannot be decomposed without a time delay. For example, when three points are
Fig. 10 Time delay introduced due to the dyadic length requirement in wavelet decomposition.
available, as shown in Fig. 10(b), only the first two data points can be decomposed, introducing a time delay of one data point. However, when seven data points are available, as shown in Fig. 10(c), only the first four data points can be decomposed to two scales, and the next two data points can be decomposed to only one scale. Therefore, a time delay of one data point (point 7) at the finest scale and a time delay of three data points (points 5-7) at the coarser scale are introduced. In many applications such a time delay may be unacceptable. Consequently, this section describes an on-line method for multiscale filtering (OLMS), where absolutely no time delay is allowed. When a time delay in the filtering is acceptable, OLMS filtering reduces to boundary corrected TI (BCTI) filtering. Both of these techniques are also extended to deal with measurements corrupted by non-Gaussian errors using multiscale median filtering [5].

5.1 On-line multiscale filtering of data with Gaussian errors

On-line multiscale filtering is based on multiscale filtering of data in a moving window of dyadic length, as shown in Fig. 11. The OLMS methodology can be summarized as follows:

1. Decompose the measured data within a window of dyadic length using a causal boundary corrected wavelet filter.
2. Threshold the wavelet coefficients and reconstruct the filtered signal.
Fig. 11 A schematic diagram of OLMS filtering.
3. Retain only the last data point of the reconstructed signal for on-line use.
4. When new measured data are available, move the window in time to include the most recent measurement while maintaining the maximum dyadic window length. The window length is held constant after reaching an upper limit, which will be discussed in Section 5.3.

The measurements in each window are filtered by the wavelet thresholding approach described in the previous section [4]; a minimal sketch of this moving-window procedure is given below. This simple approach is very effective compared to the single scale techniques, as shown by the theoretical analysis presented next and the illustrative example. It retains the benefits of the wavelet decomposition in each moving window, while allowing each measurement to be filtered on-line. Deeper insight into the properties and benefits of OLMS filtering may be obtained by relating it to the existing single scale methods of mean filtering and exponential smoothing. The filters corresponding to the Haar scaling function at different scales are shown in Fig. 12(a). The shape of these filters is identical to that of the mean filter shown in Fig. 3(a). Due to its multiscale character, OLMS filtering using Haar wavelets is able to automatically select a mean filter of dyadic length that is best for representing each signal feature.
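The following Matlab sketch illustrates steps 1-4 for the Haar wavelet. It is a simplified illustration rather than the authors' implementation: the threshold is assumed fixed, the trivial boundary treatment that the Haar filters allow is used in place of a general boundary corrected filter, and all function and variable names are our own.

    function xf = olms_haar(x, depth, thr)
    % On-line multiscale (OLMS) filtering sketch: each new measurement
    % closes a moving window of dyadic length; the window is decomposed
    % with the Haar wavelet, hard-thresholded, reconstructed, and only
    % the last reconstructed point is kept for on-line use. 'thr' is a
    % fixed threshold for simplicity; in practice it would be
    % re-estimated as measurements accumulate (see Section 5.3).
    n = numel(x);  xf = zeros(1, n);
    wmax = 1024;                            % upper limit on the window length
    for t = 1:n
        w = min(2^floor(log2(t)), wmax);    % largest dyadic window ending at t
        seg = x(t - w + 1:t);
        L = min(depth, log2(w));            % scale depth within this window
        [a, d] = haar_dec(seg, L);
        for j = 1:numel(d)
            d{j}(abs(d{j}) < thr) = 0;      % hard thresholding
        end
        r = haar_rec(a, d);
        xf(t) = r(end);                     % retain only the last point
    end
    end

    function [a, d] = haar_dec(s, L)
    % Haar decomposition of a dyadic-length row vector to L scales.
    a = s(:)';  d = cell(L, 1);
    for j = 1:L
        e = a(1:2:end);  o = a(2:2:end);
        d{j} = (e - o) / sqrt(2);
        a    = (e + o) / sqrt(2);
    end
    end

    function r = haar_rec(a, d)
    % Inverse of haar_dec.
    for j = numel(d):-1:1
        e = (a + d{j}) / sqrt(2);  o = (a - d{j}) / sqrt(2);
        a = zeros(1, 2 * numel(e));
        a(1:2:end) = e;  a(2:2:end) = o;
    end
    r = a;
    end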
Fig. 12 (a) Haar filter and (b) post-conditioned Daubechies D2 boundary corrected filter at multiple scales.
In practice, OLMS filtering using Haar wavelets subsumes a class larger than the mean filters of dyadic lengths, because if the last wavelet coefficient at some intermediate scale is eliminated while the coefficient at the finer scale is retained, then the effective filter will not be a mean filter of dyadic length. Instead, its coefficients will have different magnitudes, to give more weighting to the more important features in the signal. For example, if a signal is filtered by decomposing it using Haar at two scales, keeping the scaled signal at j = J - 2 and the wavelet coefficients at the finest scale j = J - 1, but eliminating the wavelet coefficients at the coarser scale j = J - 2, the effective FIR filter will have the shape shown in Fig. 13, which is not a mean filter. It gives more weighting to the most recent measurement, to capture the important feature represented by the last retained wavelet coefficient at scale j = J - 1, and at the same time it gives a negative weighting to the previous measurement, to account for the eliminated noise component represented by the eliminated wavelet coefficient at scale j = J - 2. The filter used in OLMS filtering will also be able to adapt its length to best represent the important features in the measured signal. Thus, a short filter will be automatically selected for representing a fast change in the underlying signal,
Fig. 13 The effective FIR filter corresponding to OLMS Haar filtering at two scales (L = 2) by keeping all coefficients except the last wavelet coefficient at scale m = 2.
whereas a long filter will be selected for representing a slow change or a constant segment. The nature of the filter selected for obtaining each filtered measurement is decided by the coefficients that are retained after thresholding. The theoretical analysis of OLMS Haar filtering shows that it should be better than mean filtering using a dyadic filter. In practice, due to its adaptive nature, OLMS Haar filtering performs better than mean filtering even for signals that are best suited to non-dyadic mean filters. OLMS filtering using smoother wavelets will approximately subsume exponential smoothing to provide better filtering; compare Figs. 3(b) and 12(b) [5]. The performance of OLMS filtering is compared with that of existing methods in the next illustrative example.

Example 4 On-line multiscale (OLMS) filtering. In this example, the performance of OLMS filtering is illustrated and compared with those of mean filtering and exponential smoothing. The noisy signal, shown in Fig. 14(a), is a bumps signal with a mean shift, contaminated with white noise of variance 0.5. In OLMS filtering, the Haar wavelet, a scale depth of 3, hard thresholding, and an initial window of 1024 were used, resulting in a filtered signal, shown in Fig. 14(b), with a mean square error of 0.1635. In mean filtering and
Fig. 14 (a) Bumps signal with mean shift and white noise of variance 0.5, (b) OLMS filtering using Haar (MSE = 0.1635), (c) mean filtering (MSE = 0.2530), (d) exponential smoothing (MSE = 0.2237).
exponential smoothing, a mean filter of length 2 and a smoothing parameter value of 0.49 were used, resulting in filtered signals, shown in Fig. 14(c) and (d), with mean square errors of 0.2530 and 0.2237, respectively. The length of the original noisy bumps signal is 2048. The signals shown in Fig. 14(b)-(d) are the last 1025 points of the signals filtered by the various filtering methods. This example shows the advantage of OLMS filtering, in improved noise removal capability over mean filtering and exponential smoothing, while capturing the main features in the data.

5.2 OLMS filtering of data with non-Gaussian errors
Similar to the robust BCTI technique, OLMS filtering may be extended to simultaneous on-line elimination of Gaussian and non-Gaussian errors by combining it with multiscale median filtering. In the robust OLMS technique, data are filtered in a moving window of dyadic length that always includes
the current measurement, and only the last point in the window is kept for on-line use. The robust OLMS technique uses the robust wavelet transform algorithm illustrated in Fig. 8 to help suppress outliers at each scale. Without the help of process models to identify the nature of a sharp change in the data, the robust OLMS filtering tends to smooth the change, since it is assumed to be an outlier until it persists beyond the coarsest scale of the median filter. Consequently, robust multiscale filtering is best suited for steady-state data, as demonstrated by the next example.

Example 5 Robust OLMS filtering. The performance of robust OLMS filtering is illustrated using a noisy bumps signal with a mean shift and two outlier patches of length 3, which is shown in Fig. 15(a). In robust OLMS filtering, a median filter of length 9, a scale depth of 2, and the Haar wavelet are used, resulting in the filtered signal shown in Fig. 15(b), with a mean square error of 0.8366. Fig. 15 shows that robust OLMS filtering tends to oversmooth sharp changes in the data. As discussed earlier, this smoothing is due to the fact
Fig. 15 (a) A bumps signal contaminated with white noise of variance 0.5 and outlier patches of length 3, (b) robust OLMS filtering, median filter length = 9, Haar wavelet, scale depth = 2 (MSE = 0.8366).
that a significant change is initially considered to be an outlier until the change persists for a duration longer than the coarsest scale of the multiscale median filter. However, at steady state, data with non-Gaussian errors can be filtered without distortion by the robust OLMS technique, as shown by the last 300 points in Fig. 15(b).
5.3 Hints for tuning the filter parameters in multiscale filtering and compression

As in the case of any filtering or compression method, multiscale filtering relies on some information about the data and the nature of the errors to tune its filter parameters, which include the threshold, the decomposition depth, the wavelet filter, and the size of the median filter for the robust techniques. Hints for selecting these tuning parameters in off-line and on-line modes are discussed below.
Value of threshold. In off-line filtering, and for data corrupted with stationary errors, the threshold may be estimated by applying the Visushrink method described earlier to the available measurements. In on-line filtering, however, the threshold value needs to be estimated every time a new measurement is collected. For stationary noise, the value of the threshold stops changing much after an adequate number of measurements are available. Consequently, the threshold may be estimated from the measurements until the change is below a user-specified value. Note that the Visushrink method for estimating the threshold cannot be performed recursively, due to the median operator used in Eq. (14), and will require storage of a large number of measurements. Modifications for determining the threshold for time-dependent errors have been suggested by Neumann and Von Sachs [29].

Depth of decomposition. Thresholding wavelet coefficients at very coarse scales usually increases the compression ratio, but may result in the elimination of important features, whereas thresholding only at very fine scales may not eliminate enough noise. Therefore, the depth of the wavelet decomposition needs to be selected to optimize the quality of the filtered signal, while maintaining as high a compression ratio as possible. Empirical evidence suggests that a good initial guess for the decomposition depth is about half of the maximum possible depth, that is log2(n)/2, where n is the signal's length in off-line mode and the moving window length in on-line mode.
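As a concrete illustration, these two rules of thumb might be coded as follows (a minimal Matlab sketch; the variable d1 for the finest-scale detail coefficients and n for the signal or window length are assumed to be available, and the constant 0.6745 is the usual conversion from the median absolute deviation to the standard deviation for Gaussian noise):

    % Visushrink threshold from the finest-scale detail coefficients d1
    sigma = median(abs(d1)) / 0.6745;   % robust estimate of the noise level
    thr   = sigma * sqrt(2 * log(n));   % universal (Visushrink) threshold
    % Rule of thumb for the decomposition depth: about half the maximum
    depth = floor(log2(n) / 2);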
However, a smaller depth might be more appropriate in OLMS filtering if a long boundary corrected filter with large support is used in the decomposition, since the filters at the two edges might overlap at very coarse scales. For signals with non-Gaussian errors, the decomposition depth also needs to be chosen so that the effective median filter at the coarsest scale is longer than the longest patch of outliers.
Selected wavelet filter. The type, length, and nature of the wavelet filter affect the quality of filtering. In off-line filtering, using a boundary corrected filter improves the accuracy of the filtered signal at the boundaries. In OLMS filtering, using a causal or boundary corrected filter is an even more critical issue, since only the last filtered data point from each window is used. If boundary corrected filters are not used, then the last point is among the least accurate ones, due to end effect errors. This is not a major concern in TI and BCTI filtering, because inaccuracies due to end effects are greatly reduced by averaging different translations. In off-line filtering, similarity between the shape of the wavelet filter and the shape of the signal enhances the quality of the filtered signal. For example, the Haar filter is a better choice for a stair-step signal than a smoother filter, such as Daubechies; however, a smooth Daubechies filter is a better choice than Haar for a smooth signal. Since the filtered signal using TI or BCTI is just the average of different off-line filtered signals, the quality of filtering is improved when the wavelet filter resembles the signal's features. For OLMS filtering, however, this advantage does not necessarily hold. Just as EWMA often gives a smaller mean square error than mean filtering and preserves sudden changes more accurately, OLMS filtering using a Daubechies second-order boundary corrected filter often results in a smaller mean square error than OLMS filtering using Haar.
6 Conclusions
This chapter provides an overview and a comparison of multiscale methods for denoising and compression with other linear and nonlinear filtering and compression techniques. The advantages of multiscale representation are better understood through a discussion of the nature of various types of measurement errors and of techniques to characterize their behavior. These advantages of multiscale representations are reflected in the quality of noise removal in multiscale filtering and compression, especially for signals with multiscale features, as shown through several examples.
Some of the limitations of multiscale filtering, such as end effects, sensitivity to non-Gaussian errors, and the restriction to off-line use, are addressed. In this chapter, we also present an on-line multiscale filtering (OLMS) method that extends the advantages of multiscale filtering using wavelets to on-line processes where no time delay is allowed. The method is based on filtering the data in a moving window of dyadic length by wavelet thresholding. OLMS is shown to outperform other linear filtering methods through a theoretical analysis and illustrative examples. When off-line filtering is acceptable, the filtering quality can be further improved by averaging the filtered signals in each moving window, which is equivalent to TI filtering without boundary effects, and is called boundary corrected translation invariant (BCTI) filtering. These methods are also extended to deal with non-Gaussian errors by using multiscale median filtering.
References

1. M.A. Kramer and R.S.H. Mah, Model Based Monitoring, in: Proceedings of the International Conference on Foundations of Computer Aided Process Operations (D. Rippen, J. Hale and J. Davis, Eds), CACHE, Austin, TX, (1994).
2. M.T. Tham and A. Parr, Succeed at On-Line Validation and Reconstruction of Data, Chemical Engineering Progress, 90 (5), (1994), 46.
3. P. Heinonen and Y. Neuvo, FIR-Median Hybrid Filters, IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35 (6), (1987), 832-838.
4. D.L. Donoho, I.M. Johnstone, G. Kerkyacharian and D. Picard, Wavelet Shrinkage: Asymptopia?, Journal of the Royal Statistical Society B, 57 (2), (1995), 301-369.
5. M.N. Nounou and B.R. Bakshi, On-Line Multiscale Filtering of Random and Gross Errors without Process Models, AIChE Journal, 45 (5), (1999), 1041-1058.
6. R.R. Coifman and D.L. Donoho, Translation-Invariant De-Noising, in: Wavelets and Statistics (A. Antoniadis and G. Oppenheim, Eds), Lecture Notes in Statistics 103, Springer, New York, (1995), 125.
7. A.G. Bruce, D.L. Donoho, H.-Y. Gao and R.D. Martin, Denoising and Robust Non-Linear Wavelet Analysis, Proceedings of SPIE, 2242 (1994), 325-336.
8. J.C. Hale and H.L. Sellars, Historical Data Recording for Process Computers, Chemical Engineering Progress, 38 (November 1981).
9. F.B. Bedar and T.W. Tucker, Data Compression Applied to a Chemical Plant Using a Distributed Historian Station, ISA Transactions, 26 (4), (1987a), 9.
10. F.B. Bedar and T.W. Tucker, Real Time Data Compression Improves Plant Performance Assessment, InTech, 53 (1987b).
11. E.H. Bristol, Swinging Door Trending: Adaptive Trend Recording?, ISA Conference Proceedings, 749, (1990).
12. R.S.H. Mah, A.C. Tamhane, S.H. Tung and A.N. Patel, Process Trending with Piecewise Linear Smoothing, Computers and Chemical Engineering, 19 (2), (1995), 129.
13. B.R. Bakshi and G. Stephanopoulos, Compression of Chemical Process Data by Functional Approximation and Feature Extraction, AIChE Journal, 42 (2), (1996), 477.
14. S.G. Mallat, A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 11 (7), (1989), 674.
15. R.R. Coifman, Y. Meyer and M.V. Wickerhauser, Wavelet Analysis and Signal Processing, in: Wavelets and their Applications (M.B. Ruskai et al., Eds), Jones and Bartlett, Boston, (1992).
16. G.E.P. Box and G.M. Jenkins, Time Series Analysis: Forecasting and Control, Revised Edition, Holden-Day, (1976).
17. F. Sakaguchi, Pseudodiagonalization of the Autocorrelation of a Stochastic Process by an Over-Complete Wavelet System, Electronics and Communications in Japan, Part 3, 78 (4), (1995).
18. P. Bansal, On-Line Rectification of Stationary Random Errors from Chemical Process Data Using Wavelet-Based Methods, M.S. Thesis, The Ohio State University, (1996).
19. J. Schroedar and M. Chitre, Adaptive Mean Median Filtering, IEEE Acoustics, Speech, and Signal Processing, 1 (1996).
20. R.D. Strum and D.E. Kirk, First Principles of Discrete Systems and Digital Signal Processing, Addison-Wesley, Reading, MA, (1989).
21. P. Heinonen and Y. Neuvo, FIR-Median Hybrid Filters with Predictive FIR Substructures, IEEE Transactions on Acoustics, Speech, and Signal Processing, 36 (6), (1988), 892-899.
22. D.L. Donoho and I.M. Johnstone, Ideal De-Noising in an Orthonormal Basis Chosen from a Library of Bases, Technical Report, Department of Statistics, Stanford University, (1994).
23. D.L. Donoho and I.M. Johnstone, Ideal Spatial Adaptation by Wavelet Shrinkage, Technical Report, Department of Statistics, Stanford University, (1993).
24. G.P. Nason, Wavelet Shrinkage Using Cross-Validation, Journal of the Royal Statistical Society B, 58 (2), (1996), 463.
25. D.L. Donoho, De-Noising by Soft Thresholding, Technical Report, Department of Statistics, Stanford University, (1992).
26. A. Cohen, I. Daubechies and P. Vial, Wavelets on the Interval and Fast Wavelet Transforms, Applied and Computational Harmonic Analysis, 1 (1993), 54-81.
27. N.C. Gallagher, Jr. and G. Wise, A Theoretical Analysis of the Properties of Median Filters, IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29 (6), (1981).
28. A. Bruce and H.-Y. Gao, Applied Wavelet Analysis with S-Plus, Springer, New York, (1996).
29. M. Neumann and R. von Sachs, Wavelet Thresholding: Beyond the Gaussian I.I.D. Situation, in: Wavelets and Statistics (A. Antoniadis and G. Oppenheim, Eds), Lecture Notes in Statistics 103, Springer, New York, (1995), 103.
CHAPTER 6
Wavelet Packet Transforms and Best Basis Algorithms
Y. Mallet, D. Coomans and O. de Vel
Statistics and Intelligent Data Analysis Group, School of Computer Science, Mathematics and Physics, James Cook University, Townsville, Australia
1 Introduction

The wavelet packet transform (WPT) [1] is an extension of the discrete wavelet transform (DWT). The basic difference between the wavelet packet transform and the wavelet transform relates to which coefficients are passed through the low-pass and high-pass filters. With the wavelet transform, the scaling coefficients are filtered through each of these filters. With the WPT, not only do the scaling coefficients pass through the low-pass and high-pass filters, but so do the wavelet coefficients. Since both the scaling and wavelet coefficients are filtered, there is a surplus of information stored in the WPT, which has a binary tree structure. An advantage of this redundant information is that it provides greater freedom in choosing an orthogonal basis. The best basis algorithm [2] seeks a basis in the WPT which optimizes some criterion function. Thus, the best basis algorithm is a task-specific algorithm, in that the particular basis is dependent upon the role for which it will be used.
2 Wavelet packet transforms
So far we have only considered filtering the scaling coefficients, but it seems perfectly viable to filter the wavelet coefficients as well. The WPT is obtained by filtering both the scaling and wavelet coefficients. In this section the discussion of the WPT assumes the m = 2 case. Although it is not necessary, this discussion of the WPT can be simplified if one assumes that $p = 2^J$. The WPT has a tree-like structure, where each band in the transform produces two new children bands at the next lower level. The tree-like structure occurs because now the detail (or wavelet) coefficients are filtered through a low-pass and a high-pass filter to obtain the next lower level of the WPT.
This is done in the same way that the smoothed (or scaling) coefficients are filtered. Fig. 1 presents the structure of a wavelet packet transform for some discretely sampled signal $x = (x_0, x_1, \ldots, x_{2^J-1})^T = x^{[J]}(0)$. Here the notation $x^{[j]}(\tau)$ is used to represent the wavelet packet coefficients which occur at the $j$th level in the $\tau$th band of the decomposition. We now describe how the filtering operations depicted in Fig. 1 are obtained mathematically. For some $x = (x_0, x_1, \ldots, x_{2^J-1})^T = x^{[J]}(0)$, the $(J-1)$st level of the WPT would be obtained as for the DWT; that is, the data are passed through a low-pass and a high-pass filter, so that

$$x^{[J-1]}(0) = C_J\, x^{[J]}(0) = C_J x$$
$$x^{[J-1]}(1) = D_J\, x^{[J]}(0) = D_J x$$
In Fig. 1 the number of bands doubles from one level to the next (lower) level, since each of the bands in the previous level is passed through a low-pass and a high-pass filter. At the next level, there will be four bands of wavelet packet coefficients, which are obtained by
$$x^{[J-2]}(0) = C_{J-1}\, x^{[J-1]}(0)$$
$$x^{[J-2]}(1) = D_{J-1}\, x^{[J-1]}(0)$$
$$x^{[J-2]}(2) = C_{J-1}\, x^{[J-1]}(1)$$
$$x^{[J-2]}(3) = D_{J-1}\, x^{[J-1]}(1)$$
Continuing to the next level, one then has
$$x^{[J-3]}(0) = C_{J-2}\, x^{[J-2]}(0)$$
$$x^{[J-3]}(1) = D_{J-2}\, x^{[J-2]}(0)$$
$$x^{[J-3]}(2) = C_{J-2}\, x^{[J-2]}(1)$$
$$x^{[J-3]}(3) = D_{J-2}\, x^{[J-2]}(1)$$
$$x^{[J-3]}(4) = C_{J-2}\, x^{[J-2]}(2)$$
$$x^{[J-3]}(5) = D_{J-2}\, x^{[J-2]}(2)$$
$$x^{[J-3]}(6) = C_{J-2}\, x^{[J-2]}(3)$$
$$x^{[J-3]}(7) = D_{J-2}\, x^{[J-2]}(3)$$
The same procedure may continue until there is one wavelet packet coefficient in each of the bands. As for the DWT, there can be a maximum of J levels in the WPT; the main difference is that the WPT has $2^{J-j}$ bands at each level $j \in \{J, J-1, \ldots, 0\}$. One may see some resemblance between the WPT presented in Fig. 1 and the DWT. In actual fact, the DWT consists of the two leftmost bands at each level of the WPT. Fig. 2 displays the coefficients appearing in the wavelet packet transform for a discrete signal that has 128 data points, obtained by discretely sampling a block function and two sine functions with periods of six and two, respectively. Since the dimensionality of the signal is equal to 128, we refer to the highest level of coefficients as being at level J = 7, since $2^J = 2^7 = 128$. The original signal is therefore displayed in the top row in Fig. 2. Each coefficient is plotted as a vertical line extending from zero. The length of the vertical line indicates the magnitude of the coefficient, and the direction of the line indicates the sign of the coefficient. The nodes in the WPT are outlined by dashed grid lines. The filter coefficients which produced the wavelet packet coefficients in Fig. 2 define a symmlet wavelet with Nr = 8 filter coefficients.
Fig. 1 Wavelet packet transform with m = 2.
Fig. 2 Example of the wavelet packet transform applied to a simulated signal.
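To make the filtering scheme of Fig. 1 concrete, a minimal Matlab sketch of the WPT with Haar filters (the m = 2 case, with the bands of each level stored side by side in one row vector) is given below; the function name, the natural band ordering, and the sign convention of the high-pass filter are our own illustrative choices.

    function W = wpt_haar(x, L)
    % Wavelet packet transform sketch with Haar filters. W{1} is the
    % signal itself (level J); W{l+1} holds the 2^l bands of the next
    % lower level side by side. Each band of length q is split into a
    % low-pass and a high-pass child of length q/2.
    W = cell(L + 1, 1);
    W{1} = x(:)';
    n = numel(x);                          % dyadic length assumed
    for l = 1:L
        prev = W{l};
        q = n / 2^(l - 1);                 % band length at the previous level
        cur = zeros(1, n);
        for tau = 0:2^(l - 1) - 1
            b  = prev(tau * q + 1:(tau + 1) * q);
            e  = b(1:2:end);  o = b(2:2:end);
            lo = (e + o) / sqrt(2);        % low-pass child, band 2*tau
            hi = (o - e) / sqrt(2);        % high-pass child, band 2*tau + 1
            cur(tau * q + 1:tau * q + q / 2)       = lo;
            cur(tau * q + q / 2 + 1:(tau + 1) * q) = hi;
        end
        W{l + 1} = cur;
    end
    end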
2.1 What do wavelet packet functions look like?

Before we answer the question posed by this section, we need to take a brief look at wavelet packet functions from the continuous side. Because wavelet packet analysis filters the high frequency components as well as the low frequency components, wavelet packets are more suited to representing functions that have oscillatory or periodic behaviour [3]. We recall from wavelet analysis that the scaling function $\phi(t)$ plays a key role, in that the wavelet $\psi(t)$ is generated from the scaling function. The same is true for wavelet packets. We still have the scaling function $\phi(t)$, and the wavelets generated from the mother wavelet $\psi(t)$. In addition to these functions, we also have wavelet packet functions, so a function $f(t)$ can now be represented as a sum of orthogonal wavelet packet functions. We will let $W(t)$ represent a wavelet packet function. Now that we speak of wavelet packet functions, we need to add another index. We still have the dilation ($j$) and translation ($k$) indices, but there is also an oscillation index ($b$). Hence a continuous function can now be represented as a linear combination of orthogonal wavelet packet functions as follows:

$$f(t) \approx \sum_{j,b,k} w_{j,b,k}\, W_{j,b,k}(t),$$
where the coefficients $w_{j,b,k}$ are referred to as wavelet packet coefficients, and the wavelet packet functions for a defined oscillation index are defined as follows:

$$W_{j,b,k}(t) = 2^{-j/2}\, W_b(2^{-j}t - k).$$

Due to the orthogonality of the wavelet packet functions, we have

$$w_{j,b,k} = \int W_{j,b,k}(t)\, f(t)\, dt.$$
Unlike the representation of functions using wavelet basis functions, there are many different combinations of wavelet packet basis functions that can be used in signal representation. Hence there is some degree of redundancy. With redundancy comes choice, and Section 3 discusses one approach for selecting a set of basis functions. Fig. 3 demonstrates how the dilation and oscillatory behaviour of a symmlet changes depending on the indices j and b. Changes in the index k simply result in a shift across the horizontal axis of the wavelet packet function. Note that the oscillatory behaviour of the wavelet packets increases as one moves from right to left. Hence large coefficients occurring in bands on the right-hand side of the tree imply that a wavelet packet function with a high frequency contributes largely to the representation of that function, i.e. the coefficient is representative of a high frequency event in the original function. When both the scaling and wavelet coefficients are filtered, there is a surplus of information stored in the wavelet packet tree. An advantage of this redundant information is that it provides greater freedom in choosing an orthogonal basis. The best basis algorithm is a routine which endeavours to find a basis in the WPT which optimizes some criterion.
3 Best basis algorithm
The best basis algorithm seeks a basis in the WPT which optimizes some criterion function. Thus, the best basis algorithm is a task-specific algorithm, in that the particular basis is dependent upon the role for which it will be used. For example, a basis chosen for compressing data may be quite different from a basis that might be used for classifying or calibrating data, since different criterion functions would be optimized. The wavelet packet coefficients which result from the best basis may then be used for some specific task, such as compression or classification.
Fig. 3 A demonstration of how the dilation and oscillatory behaviour of a symmlet changes depending on the indices j and b.
The first step in obtaining the wavelet packet coefficients from the best basis is to produce the wavelet packet decomposition tree to some level $j_0$. A criterion measure for the wavelet packet coefficients in each node (or band) of the wavelet packet decomposition is calculated, and is denoted by
$B(\text{band}(j, \tau))$ for $j = J, \ldots, j_0$. One starts at level $j_0$ in the tree and works up, gradually deleting the bands of coefficients in the tree which do not produce sufficiently good criterion measures. This can be formalised. Initially, the criterion measure for each of the bands of coefficients at level $j_0 + 1$ is
158
II
III
i, I
I
1
III l
I IE
I I
I|
.....
viiI,
Fig. 3 (Contd.)
compared with the criterion measures for the bands of coefficients in the descendants at level $j_0$. Here, descendant nodes are used to categorize any nodes which lie beneath a node at a higher level in the tree. The node which the descendant nodes lie under is called a parent node. If the criterion measure of the parent node is superior to that of the descendant nodes, then the descendant nodes are deleted. If the descendant nodes produce a superior criterion measure, then the descendant nodes are kept and the parent node is deleted. This procedure continues all the way to the top of the tree, and the coefficients in the best basis will lie in the bands which were not deleted in the elimination process.
Fig. 4 describes the best basis algorithm, or more specifically, how to find the wavelet packet coefficients from the best basis algorithm. Step 1 performs the WPT to some prespecified level $j_0$, as described previously. Step 2 then initializes a current best basis, or best set of bands; initially, the best set of bands (BB) is simply all the bands at level $j_0$ in the WPT. Steps 3-9 compare the cost measure of the parent nodes against the current best set of bands which are descendants of the parent nodes being examined. Here, the aim is to minimize the cost measure. Consider finding the best basis for some signal $x = (x_0, x_1, \ldots, x_7)^T$. Once the wavelet packet transform has been calculated, the next step of the best basis algorithm is to calculate the criterion measurement for each of the nodes in the wavelet packet transform. This is done for some task-specific criterion. The criterion measurements for each of the nodes are shown in Fig. 5, so that $B(\text{band}(1, 2)) = 8$. The best basis is also highlighted in Fig. 5 for some cost function which is to be minimized. For this example, $j_0 = 0$, since the WPT is performed to the lowest level. We now describe how the best set of bands is formed by working our way up the tree, comparing descendant and parent nodes. The value of the cost function for each band of coefficients is displayed in the respective nodes. Assuming that we wish to minimize the cost function, the best basis for this example is computed as follows:
Obtaining the wavelet packet coefficients from the best basis algorithm:

1. Perform the WPT for $x = (x_0, x_1, \ldots, x_{2^J-1})^T = x^{[J]}(0)$ to level $j_0$
2. $BB(j_0, \tau) = \text{band}(j_0, \tau)$ for $\tau = 0, \ldots, 2^{J-j_0} - 1$
3. FOR $j = j_0 + 1, \ldots, J$
4.   FOR $\tau = 0, \ldots, 2^{J-j} - 1$
5.     IF $B(\text{band}(j, \tau)) < B(BB(j-1, 2\tau) \cup BB(j-1, 2\tau+1))$
6.       $BB(j, \tau) = \text{band}(j, \tau)$
7.     ELSE $BB(j, \tau) = BB(j-1, 2\tau) \cup BB(j-1, 2\tau+1)$
8.   END
9. END

Fig. 4 Best basis algorithm.
Level 3:  43
Level 2:  29 | 15
Level 1:  6 | 21 | 8 | 13
Level 0:  5 | 4 | 7 | 11 | 3 | 12 | 7 | 2

Fig. 5 An example of a best basis where the aim was to minimize the cost of the nodes.
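A direct Matlab transcription of the Fig. 4 algorithm might read as follows (a minimal sketch; the function name best_basis and the convention that C{j+1}(tau+1) holds the cost B(band(j, tau)) are our own illustrative choices):

    function BB = best_basis(C, j0, J)
    % Best basis search following Fig. 4 (a sketch). BB is returned as
    % a list of [level, band] rows describing the selected bands.
    best = cell(J + 1, 1);  cost = cell(J + 1, 1);
    % Step 2: initialize the best basis with all bands at level j0.
    best{j0 + 1} = cell(1, 2^(J - j0));
    for tau = 0:2^(J - j0) - 1
        best{j0 + 1}{tau + 1} = [j0, tau];
        cost{j0 + 1}(tau + 1) = C{j0 + 1}(tau + 1);
    end
    % Steps 3-9: work up the tree, comparing parents with descendants.
    for j = j0 + 1:J
        best{j + 1} = cell(1, 2^(J - j));
        for tau = 0:2^(J - j) - 1
            childcost = cost{j}(2*tau + 1) + cost{j}(2*tau + 2);
            if C{j + 1}(tau + 1) <= childcost   % parent wins (ties to parent)
                best{j + 1}{tau + 1} = [j, tau];
                cost{j + 1}(tau + 1) = C{j + 1}(tau + 1);
            else                                % descendants win
                best{j + 1}{tau + 1} = [best{j}{2*tau + 1}; best{j}{2*tau + 2}];
                cost{j + 1}(tau + 1) = childcost;
            end
        end
    end
    BB = best{J + 1}{1};
    end

With the node costs of Fig. 5, i.e. C{1} = [5 4 7 11 3 12 7 2], C{2} = [6 21 8 13], C{3} = [29 15] and C{4} = 43, calling best_basis(C, 0, 3) returns band(1,0), band(0,2), band(0,3) and band(2,1), in agreement with the worked example that follows.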
when $j = 1$:
$BB(1,0) = \{\text{band}(1,0)\}$ since $6 \le 5 + 4$
$BB(1,1) = \{\text{band}(0,2), \text{band}(0,3)\}$ since $21 > 7 + 11$
$BB(1,2) = \{\text{band}(1,2)\}$ since $8 \le 3 + 12$
$BB(1,3) = \{\text{band}(0,6), \text{band}(0,7)\}$ since $13 > 7 + 2$

when $j = 2$:
$BB(2,0) = \{\text{band}(1,0), \text{band}(0,2), \text{band}(0,3)\}$ since $29 > 6 + 7 + 11$
$BB(2,1) = \{\text{band}(2,1)\}$ since $15 < 8 + 7 + 2$

when $j = 3$:
$BB(3,0) = \{\text{band}(1,0), \text{band}(0,2), \text{band}(0,3), \text{band}(2,1)\}$ since $43 > 6 + 7 + 11 + 15$

One commonly used cost function, particularly in data compression, is entropy. If we let $w_{j,\tau,k}$ denote the $k$th wavelet packet coefficient in band$(j, \tau)$ of the wavelet packet transform, then the entropy-like criterion for band$(j, \tau)$ is defined as follows:

$$E(j, \tau) = -\sum_k \tilde{w}_{j,\tau,k}^2 \log \tilde{w}_{j,\tau,k}^2$$

where $\tilde{w}_{j,\tau,k}$ is the normalized wavelet packet coefficient, obtained as $w_{j,\tau,k}/\lVert x \rVert$, with $\lVert x \rVert$ the norm of the original signal.
Let $x_1, x_2, \ldots, x_n$ be any $n$ non-negative real numbers with a given sum; then

$$E = -\sum_{i=1}^{n} \Big(\frac{x_i}{\lVert x \rVert}\Big)^2 \log \Big(\frac{x_i}{\lVert x \rVert}\Big)^2$$

is a measure of the scatter or dispersion of the values $x_1, x_2, \ldots, x_n$. The more similar the values, the larger the value of E; the more concentrated the frequency is in one class, the smaller the value of the entropy. It is also worth noting that $0 \cdot \log(0)$ is defined to be zero.
Example 1
Compute E for each of the following:

(a) 1, 1, 1, 1, 1, 1: E = 1.791759
(b) 1, 1, 1, 2, 2, 2: E = 1.599015
(c) 1, 1, 2, 2, 3, 3: E = 1.523619
(d) 1, 2, 3, 4, 5, 6: E = 1.443165
(e) 1, 1, 1, 3, 3, 3: E = 1.423695
(f) 2, 2, 10, 2, 2, 2: E = 0.7188009
(g) 1, 1, 1, 1, 1, 6: E = 0.567067
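These values can be checked with a few lines of Matlab (a sketch; the function name is our own). The squared values are normalized to unit sum and the Shannon entropy is computed with the natural logarithm, which reproduces the figures above.

    function e = entropyE(x)
    % Entropy E of a sequence of non-negative numbers.
    p = x(:).^2 / sum(x(:).^2);   % normalized squared values
    p = p(p > 0);                 % 0*log(0) is taken as zero
    e = -sum(p .* log(p));
    end

For instance, entropyE([1 1 1 1 1 1]) returns log(6) = 1.791759 and entropyE([2 2 10 2 2 2]) returns 0.7188009, matching cases (a) and (f).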
Example 2
Perform the wavelet packet transform using filter coefficients associated with the Haar wavelet and scaling functions; then compute the wavelet packet coefficients associated with the best basis, using the entropy cost function, for the signal x = (0.0000, 0.0491, 0.1951, 0.4276, 0.7071, 0.9415, 0.9808, 0.6716). The Haar low-pass filter coefficients are $l_0 = l_1 = 1/\sqrt{2}$, and the respective high-pass filter coefficients are $h_0 = -1/\sqrt{2}$, $h_1 = 1/\sqrt{2}$. The wavelet packet coefficients resulting from the wavelet packet transform are displayed in Fig. 6, and the entropy cost table is displayed in Fig. 7. The best basis is shaded in grey.
Level 3:  0.0000  0.0491  0.1951  0.4276  0.7071  0.9415  0.9808  0.6716
Level 2:  0.0347  0.4403  1.1658  1.1684 | 0.0347  0.1644  0.1658  -0.2187
Level 1:  0.3359  1.6505 | 0.2868  0.0018 | 0.0917  -0.2718 | 0.1408  -0.0374
Level 0:  1.4046 | 0.9296 | -0.2015 | 0.2041 | -0.1274 | -0.2571 | -0.1260 | 0.0731

Fig. 6 An example of a wavelet packet transform.
Level 3:  1.5360
Level 2:  0.8977 | 0.1536
Level 1:  0.2164 | 0.0981 | 0.1071 | 0.0365
Level 0:  0.2785  0.3580 | 0.0579  0.0590 | 0.0281  0.0836 | 0.0276  0.0112

Fig. 7 Entropy cost table with best basis shaded in grey.
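The tables of Figs. 6 and 7 can be reproduced with the sketches given earlier (wpt_haar and best_basis are the illustrative functions from the previous sections). Note that band ordering conventions differ between implementations, so some bands within a level may appear in a different order than in the figures, although the band costs and the selected basis are the same.

    x = [0.0000 0.0491 0.1951 0.4276 0.7071 0.9415 0.9808 0.6716];
    W = wpt_haar(x, 3);               % W{1}..W{4}: levels 3 down to 0
    nx2 = sum(x.^2);                  % squared norm of the original signal
    J = 3;  C = cell(J + 1, 1);
    for j = 0:J                        % level j has bands of length 2^j
        lev = W{J - j + 1};  q = 2^j;
        for tau = 0:2^(J - j) - 1
            w = lev(tau * q + 1:(tau + 1) * q);
            p = w.^2 / nx2;  p = p(p > 0);
            C{j + 1}(tau + 1) = -sum(p .* log(p));
        end
    end
    % C{4} = 1.5360, C{3} = [0.8977 0.1536], ... as in Fig. 7
    BB = best_basis(C, 0, 3)          % selects the four level-1 bands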
Another cost function is the threshold cost function, which counts the number of coefficients greater than some prespecified value. For example, S-Plus [3] sets the threshold value (thresh) as the median of the absolute value of all the coefficients in the wavelet packet transform. This cost function can also be useful for data compression. More formally, the threshold cost function for a band$(j, \tau)$ of wavelet packet coefficients is defined as

$$\text{Threshold}(j, \tau) = \sum_k I(|w_{j,\tau,k}| > \text{thresh})$$

where $I = 1$ if $|w_{j,\tau,k}| > \text{thresh}$ and $I = 0$ otherwise. Other cost functions include Stein's Unbiased Risk Estimate and the Lp norm cost function; for more details we refer the reader to Chapter 5 or the reference [3]. As with the E criterion, the Threshold criterion is minimized, since minimizing this criterion will result in a few large coefficients, which is typically preferred for data compression. The best basis algorithm for the simulated signal shown in Fig. 2 is presented in Fig. 8. The cost function used here is entropy. It is interesting to mention that the same best basis results for all four cost functions; this may not always be the case, however. The next example performs the wavelet packet transform for a linear chirp signal. We show the coefficients in the wavelet packet transform in Fig. 9, and then the best basis resulting from the entropy criterion in Fig. 10. The best basis resulting from the threshold criterion is presented in Fig. 11. Notice that for this example the best basis changes with the cost function; hence the "best" in best basis is really dependent on the cost function and the goal of the waveleteer!
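The threshold cost described above amounts to a couple of lines of Matlab (a sketch; the variable wall is assumed to hold all wavelet packet coefficients, and w the coefficients of one band):

    thresh = median(abs(wall(:)));   % S-Plus-style threshold value
    cost   = sum(abs(w) > thresh);   % threshold cost of band(j, tau)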
Fig. 8 Best basis for the simulated signal.
Fig. 9 Wavelet packet transform of a linear chirp signal.
Fig. 10 Best basis for the linear chirp signal using the entropy cost function.
Fig. 11 Best basis for the linear chirp signal using the threshold cost function.
References

1. M. Wickerhauser, Adapted Wavelet Analysis from Theory to Software, AK Peters, (1994).
2. R. Coifman and V. Wickerhauser, Entropy-based Algorithms for Best-basis Selection, IEEE Transactions on Information Theory, 38 (1992), 496-518.
3. A. Bruce and H. Gao, S+Wavelets User's Manual, Version 1.0, StatSci, a Division of MathSoft, Seattle, (1994).
CHAPTER 7
Joint Basis and Joint Best-basis for Data Sets
B. Walczak* and D.L. Massart
ChemoAC, VUB-FABI, Laarbeeklaan 103, B-1090 Brussels, Belgium
1 Introduction
In chemometrics we are very often dealing not with individual signals, but with sets of signals. Sets of signals are used for calibration purposes, to solve supervised and unsupervised classification problems, in mixture analysis, etc. All these chemometrical techniques require a uniform data presentation. The data must be organized in a matrix, i.e. for the different objects the same variables have to be considered or, in other words, each object is characterized by a signal (e.g. a spectrum). Only if all the objects are represented in the same parameter space is it possible to apply chemometric techniques to compare them. Consider for instance two samples of mineral water. If for one of them the calcium and sulphate concentrations are measured, but for the second one the pH value and the PAH concentrations are available, there is no way of comparing these two samples. The comparison can only be done when the same sets of measurements are performed for both samples, e.g. for both samples the pH value and the calcium and sulphate concentrations are determined. Only in that case can each sample be represented as a point in the three-dimensional parameter space, and their mutual distances can be considered measures of similarity. Now suppose that the data are indeed organized in matrix form. These can be, for instance, spectroscopic or chromatographic data, which contain a small number of objects and variables, a high number of objects but a small number of variables, a small number of objects but a high number of variables, or high numbers of both objects and variables (see Fig. 1).
* Present address: On leave from Silesian University, 40-006 Katowice, Szkolna 9, Poland
Fig. 1 Data sets with different dimensionality.
Depending on the type of data, different problems can be encountered. Of course, these problems are associated with the data processing methods applied. Consider for instance Principal Component Analysis (PCA), which is one of the most popular chemometrical methods. PCA can be used for many purposes, e.g. to reduce data dimensionality, to visualize data, to orthogonalize variables, to suppress data noise, etc. If the data set is small or flat (cases (a)-(c) in Fig. 1), very efficient PCA algorithms are available [1,2]. The problems start with a data set belonging to case (d) in Fig. 1: if the data dimensionality exceeds 10^4, there is a problem. In other chemometrical techniques of data processing, such as Linear Discriminant Analysis (LDA), the problems occur not only with big data sets (case (d)), but also when the number of variables exceeds that of the objects (case (c)). The same is true for the Neural Networks (NN) approach. Numerous variables, which are very often strongly correlated, are not the best input to a NN. In this situation, PCA is often used to reduce dimensionality and orthogonalize the variables. The new orthogonal features, PCs, speed up NN training and allow a reduction of its architecture. This approach, although extremely popular, has a drawback, namely that PCs are global features, which means that local phenomena that would allow data classification are spread over numerous PCs. The respective final models are then in some way redundant and unstable. In this situation we would prefer to have local features describing the data variability [3]. The wavelet transform can help to solve such problems. It allows transformation of the data from the original domain (i.e. from the time or space domain) into the frequency-time, or scale-time, domain. The new features, i.e. wavelet coefficients, describe local phenomena in a very efficient way and, moreover, allow data compression. As the reader may have already noticed,
in preceding chapters the wavelet transform was applied to individual signals only. What is the difference between the compression of individual signals and of a data set as a whole? In the case of an individual signal, we are looking for the basis optimal for its compression, whereas when dealing with a data set, a joint basis for the compression of all signals has to be found. Only a joint basis allows the uniform data representation required for further processing by chemometrical techniques. The bases that are optimal for the individual signals from a given set can differ to a great extent. To have a uniform representation of the signals in the scale-time domain, a joint basis has to be found which is optimal in the statistical sense for all the signals. In this chapter, we will present a methodology to select such a basis.
2 Discrete wavelet transform and joint basis
Let us consider the Discrete Wavelet Transform (DWT), applied to a set of m signals (e.g. spectra), of length n each, presented in the form of matrix X. If all signals are decomposed by DWT with the same filter and to the same decomposition level, they can be presented as m vectors of length n each in the time-frequency domain, forming matrix Z (see Fig. 2). The information content of
Fig. 2 Schematic representation of DWT decomposition of a set of signals.
168
both matrices, X and Z is exactly the same and so is also the data variance. The total variance of matrix X is equal to the total variance of matrix Z. Now we would like to compress the data in the time-frequency domain, without loss of an essential information about the data variability. This can be done, based on the variance of the wavelet coefficients (matrix Z). For each column of matrix Z, a variance can be calculated and in this way vector v (1 x n) is obtained, which contains n elements, each of them describing the variance of one column of matrix Z (see Fig. 3). The jth element of vector v is defined as: vj-Zi(zij-zj)2/m
fori-l,2
. . . . ,n and j -
1,2,...,m
(1) where: zij denotes the element of matrix Z, i numbers the rows of Z, j numbers the columns of Z, and z.j denotes the mean of the j-th column: zj-~izij/m
for i -
1,2 . . . . . n and j -
1.2 . . . . , m
(2)
In the matrix-vector notation used in Matlab, the vector v can be calculated as:

    v = sum((Z - ones(m,1)*mean(Z)).^2)/m

Of course, the sum of the elements of vector v equals the total variance of the data. The elements of vector v can be sorted in decreasing order. The first n' coefficients, with their sum exceeding a predefined percentage of variance, form the joint basis for all signals.
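Putting the pieces together, the whole selection (variance vector, sorting, and retention of the coefficients explaining, e.g., 99% of the variance) could read as follows (a minimal Matlab sketch extending the line above; variable names other than v and Z are our own):

    v = sum((Z - ones(m, 1) * mean(Z)).^2) / m;   % variance of each column
    [vs, idx] = sort(v, 'descend');               % sorted variances
    cumvar = cumsum(vs) / sum(vs);
    nprime = find(cumvar >= 0.99, 1);             % e.g. 62 for the NIR example
    Zstar  = Z(:, idx(1:nprime));                 % compressed matrix Z*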
wavelets domain n
n
n
compression
DWT
m
v
I
n
J---)~
varience vector
n'
compressed joint basis
Fig. 3 Schematic representation of data compression in the joint basis.
Example
A Near-Infrared data set, containing 1000 spectra of length 512 each, was decomposed by DWT (filter Daubechies No. 4), and the variance vector, v, calculated for the decomposed spectra, is presented in Fig. 4(a). As can easily be noticed, many elements of vector v are very small. The elements of v, ranked according to their values, are presented in Fig. 4(b). By changing the scale of Fig. 4(b), the first 100 elements of v can be visualized. The sum of all elements of vector v is equal to the data variance. For the data set studied, the total variance equals 3.6686. The cumulative sum of the elements of v, sorted in decreasing order and expressed as a percentage of the explained variance, is presented in Fig. 4(d). As can be noticed, the percentage of explained variance increases very rapidly for the first 20 coefficients, and then slowly approaches the limit of 100%. If the percentage of the explained variance needed is predefined as, e.g., 99%, then the 62 top coefficients are needed to describe the data set. The sum of these 62 top elements of vector v is equal to 3.6313, thus corresponding to 99% of the
Fig. 4 Elements of the variance vector for the NIR data set (a); elements of the variance vector sorted according to their amplitude (b); zoom on the top 100 elements of the sorted variance vector (c); and cumulative percentage of the explained variance versus the number of retained coefficients (d).
total variance (100 x (3.6313/3.6686) = 99%). In other words, the data matrix Z with 512 columns can be compressed to a data matrix Z* with only 62 columns, corresponding to the top 62 elements of vector v. In Fig. 5, one randomly selected spectrum from the data set studied is shown, reconstructed with 10, 20 and 62 top wavelet coefficients from the selected basis.
Fig. 5 Original NIR spectrum (a) and the spectra reconstructed with 10 (b), 30 (c) and 62 (d) top wavelet coefficients, with the corresponding residuals, i.e. differences between the original and reconstructed spectra.
3 Wavelet packet transform and joint best-basis
It is also possible to apply a filter F to decompose all m signals using the Wavelet Packet Transform (WPT). For each signal, a matrix is obtained that contains the wavelet coefficients (see Fig. 6). Element $w^{(k)}_{j,\tau,i}$ denotes the $i$th wavelet coefficient at the $j$th level in the $\tau$th band of the $k$th signal decomposition. This representation of individual signals in the scale-frequency domain is redundant, as for each signal there are many possible orthogonal bases. In the case of individual signal compression, the efficient and elegant best-basis selection algorithm proposed by Coifman and Wickerhauser [4] can be used. The basis with the amplitudes (i.e. the absolute values) of the transform coefficients differentiated to the highest degree possible is considered optimal, because in this basis the signal can be compressed efficiently, i.e. represented by a few coefficients with high amplitudes only, whereas the remaining small coefficients can be discarded without an essential loss of information. As the criterion of amplitude differentiation, entropy can be used. The entropy depends on the distribution of the coefficients. When all the coefficients are almost randomly distributed, the entropy reaches a maximal value, which tends to decrease when the distribution is less random, such as when only a few coefficients achieve high amplitudes whereas the rest approach zero. If the best-basis selection algorithm with the entropy criterion is applied to individual signals decomposed by WPT, the following situations can occur: (a) for all signals the same best-basis is selected (see Fig. 7(a)), or (b) for different signals different best-bases are selected (see Fig. 7(b)). The first case occurs when the data set contains very similar signals, which can happen in calibration or pattern recognition situations.
Fig. 6 WPT decomposition of the set of m signals.
Fig. 7 All signals have the same best-basis (a), and different best-bases are selected for different signals (b).
In this case, all signals can be represented in the same basis, and the further procedure of data compression is fully analogous to that described for DWT. In the second case, the uniform representation of signals in the wavelet domain is not possible; instead, the joint best-basis, in which a small number of wavelet coefficients can describe the majority of the data variance, must be selected. This can be done by applying the Coifman-Wickerhauser algorithm of best-basis selection to the so-called 'variance tree', the elements of which represent the variance of wavelet coefficients with the same addresses (indices) (see Fig. 8) [5]. As the data set at hand can contain a huge number of objects, m, one would like to avoid storage of the m decomposition tables. For this purpose, the variance can be described by the following formula:
Fig. 8 Schematic representation of the variance tree, the elements of which represent the variance of wavelet coefficients of m signals with the same addresses.
vt_{j,τ,i} = (1/m) Σ_{k=1:m} [w_{j,τ,i}^{(k)}]² − [(1/m) Σ_{k=1:m} w_{j,τ,i}^{(k)}]²    (3)

where the first term involves the sum of the squared coefficients and the second term the sum of the coefficients.
This formula indicates that the results of the consecutively decomposed signals can be arranged into two accumulator trees:

1) The tree of coefficient sums. The elements of this tree, divided by the number of signals, form the 'tree of means', TM, with elements:

tm_{j,τ,i} = (1/m) Σ_{k=1:m} w_{j,τ,i}^{(k)}    (4)
2) The tree of the sums of squared coefficients. The elements of this tree, divided by the number of signals, m, form the 'tree of squares', TS, with elements:

ts_{j,τ,i} = (1/m) Σ_{k=1:m} [w_{j,τ,i}^{(k)}]²    (5)
Based on these two accumulator trees, TM and TS, the 'variance tree', VT, can be constructed, with its elements calculated as:

vt_{j,τ,i} = ts_{j,τ,i} − tm²_{j,τ,i}    (6)

In MATLAB code the algorithm can be presented as:

TM = 0; TS = 0;                        % accumulator trees
for i = 1:m
    wp = wpanalysis(X(i,:), entropy, filtercoefficients);  % WPT of signal i
    TM = TM + wp;                      % accumulate coefficient sums
    TS = TS + wp.^2;                   % accumulate squared-coefficient sums
end
VT = TS./m - (TM./m).^2;               % variance tree, Eq. (6)
where wpanalysis denotes the WPT decomposition of signal i (i = 1, …, m) with the selected filter and the entropy as the criterion for best-basis selection. Once the variance tree is constructed, it can be searched for the joint best-basis, using the Coifman-Wickerhauser best-basis selection algorithm with, e.g., the entropy criterion. The entropy cost function (see Chapter 6) for the variance tree coefficients which occur at the jth level in the τ band of the signal decomposition is defined as:

Entropy(j, τ) = −Σ_i v̄t_{j,τ,i} log v̄t_{j,τ,i}

where v̄t_{j,τ,i} is the normalized within-τ-band coefficient obtained as vt_{j,τ,i}/||vt(j,τ)||². The selection of the best-basis involves computing the entropy for each subblock (band) of the variance tree, and then performing a recursive binary comparison between the entropy of subblock τ and the sum of entropies of this subblock's two immediate descendants. The orthogonal basis minimizing the entropy is considered as the joint best-basis for the data set. Of course, while entropy is a good measure of the efficiency of an expansion, various other cost functions are possible. The main steps of the joint best-basis selection procedure [4] can be summarized as (Fig. 9):

1. Expanding m vectors into wavelet packet coefficients
2. Summing squares into the variance tree
3. Searching the variance tree for a best basis
Fig. 9 Joint best-basis selection.
4. Sorting the best-basis vectors into decreasing order
5. Representing all vectors in the best basis, reduced to the top n vectors.

Updating of the joint best basis requires the following steps:

1. Expanding one vector into wavelet packet coefficients
2. Adding the coefficients into the means tree
3. Adding the squared coefficients into the squares tree
4. Forming the variance tree and computing the new information costs
5. Searching the variance tree for the joint best basis.
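As a small illustration, the entropy cost of a single band of the variance tree, as defined above, could be computed with the following MATLAB sketch (our own illustration; the function name is ad hoc, not a toolbox routine):

function H = band_entropy(vt)
% vt: vector of variance-tree elements vt_{j,tau,i} for one band(j,tau)
v = vt / norm(vt)^2;       % normalization within the band
v = v(v > 0);              % convention: 0*log(0) = 0
H = -sum(v .* log(v));

In the best-basis search this cost is evaluated for every band, and a parent band is kept whenever its entropy is smaller than the sum of the entropies of its immediate descendant bands.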
Example
All signals from the NIR data set were decomposed by WPT. The joint best-basis selected for the variance tree is presented in Fig. 10(a), whereas the variance vector in that basis is visualized in Fig. 10(b). The percentage of the explained variance as a function of the number of retained coefficients is presented in Fig. 10(c). In the selected joint best-basis, only 39 wavelet coefficients are necessary to describe 99% of the data variance. The spectrum presented in Fig. 5(a), reconstructed with 10 and 39 coefficients, is given in Fig. 11(a) and (c), respectively. In subplots (b) and (d), the respective residuals are visualized.
Fig. 10 (a) Joint best-basis selected for the variance tree of the NIR data set (E gain denotes the entropy drop between parent and child bands); (b) elements of the variance vector in the joint best-basis; (c) cumulative percentage of the explained variance versus the number of retained coefficients.
Fig. 11 Spectra reconstructed with 10 (a) and 39 (c) wavelet coefficients and the corresponding residuals (b) and (d).
The presented example demonstrates the higher compression efficiency of WPT when compared with DWT. In order to describe 99% of the variance, the DWT-decomposed data require 62 wavelet coefficients, whereas WPT requires only 39 coefficients.
References

1. W. Wu, D.L. Massart and S. de Jong, The Kernel PCA Algorithms for Wide Data. Part I: Theory and Algorithms, Chemometrics and Intelligent Laboratory Systems, 36 (1997) 165-172.
2. W. Wu, D.L. Massart and S. de Jong, The Kernel PCA Algorithms for Wide Data. Part II: Fast Cross-Validation and Application in Classification of NIR Data, Chemometrics and Intelligent Laboratory Systems, 37 (1997) 271-280.
3. B. Walczak, B. van den Bogaert and D.L. Massart, Application of Wavelet Packet Transform in Pattern Recognition of NIR Data, Analytical Chemistry, 68 (1996) 1742-1747.
4. R. Coifman and V. Wickerhauser, Entropy-Based Algorithm for Best-Basis Selection, IEEE Transactions on Information Theory, 38 (1992) 496-518.
5. V. Wickerhauser, Adapted Wavelet Analysis from Theory to Software, A.K. Peters, Wellesley, MA (1994).
CHAPTER 8
The Adaptive Wavelet Algorithm for Designing Task Specific Wavelets
Y. Mallet, D. Coomans and O. de Vel
Statistics and Intelligent Data Analysis Group, School of Computer Science, Mathematics and Physics, James Cook University, Townsville, Australia
1 Introduction

There exist many different kinds or families of wavelets. These wavelet families are defined by their respective filter coefficients, which are readily available for the situation when m = 2, and include for example the Daubechies wavelets, Coiflets, Symlets, and the Meyer and Haar wavelets. One basic issue to overcome is deciding which set (or family) of filter coefficients will produce the best results for a particular application. It is possible to trial different sets of filter coefficients and proceed with the family of filter coefficients which produces the most desirable results. It can be advantageous, however, to design your own task specific filter coefficients rather than using a predefined set. In this chapter, we describe one method for generating your own set of filter coefficients. Here we demonstrate how wavelets can be designed to suit almost any general application, but in this chapter we concentrate on designing wavelets for the classification of spectral data. In Chapter 18, we extend the principle of the adaptive wavelet algorithm to regression and classification. Since wavelets can be derived from their respective filter coefficients, we generate the filter coefficients which optimize a relevant criterion function. We introduce a wavelet matrix called A which stores both the low-pass and high-pass filter coefficients. Instead of optimizing over each element in A, we make use of the factorized form [1] of a wavelet matrix and the conditions placed therein to reduce the number of parameters to be optimized. Since the filter coefficients gradually adapt to the application at hand, the procedure for designing the task specific filter coefficients is referred to as the adaptive wavelet algorithm (AWA). This should not be confused with the adaptive wavelets of Coifman and Wickerhauser, who use the term for a procedure for constructing a best basis [2] (see Chapter 6).
There exist other applications involving the optimization of wavelets. This includes the work performed by Telfer et al. [3] and Szu et al. [4]. In [3] the shift and dilation parameters of the discretization of a chosen wavelet transform are optimized, while [4] sought the optimal linear combination of predefined wavelet bases for the classification of speech signals. In both papers, the wavelet features are updated by adaptively computing the wavelet parameters and shape. This is a form of integrated feature extraction which also makes use of neural networks. Sweldens [5] also discusses a lifting scheme for the construction of biorthogonal second generation wavelets. The main distinction between [3,4,5] and our algorithm is that the filter coefficients are generated from first principles, without any reference to predefined families. Our approach also allows for the general m-band wavelet transform to be utilized, as well as the more common 2-band wavelet transform. The adaptive wavelet algorithm presented in this chapter is an extension of the material presented in [6], which introduced adaptive wavelets for the detection and removal of disturbances from signals. Before describing how wavelets can be designed for a specific task, we first discuss the idea of higher multiplicity wavelets and the m-band discrete wavelet transform. This is done in Sections 2 and 3, respectively. Basically, higher multiplicity wavelets consider dilating the wavelet functions by integers greater than or equal to two. This can be likened to down-sampling discrete signals by integer amounts greater than or equal to 2. We let m equal the amount by which we dilate or down-sample. Consequently, the m-band discrete wavelet transform has m bands. One band contains the scaling coefficients, and the remaining m − 1 bands contain wavelet coefficients. In Section 4 we discuss conditions which can be placed on the filter coefficients so that a multiresolution analysis (MRA) and wavelet basis exist. These coefficients are stored in a matrix called a filter coefficient matrix. Section 5 shows how we can factorize the filter coefficient matrix, thereby allowing us to search for a filter coefficient matrix which optimizes some cost function relevant to the task at hand. Section 6 summarises the adaptive wavelet algorithm, and Section 7 discusses various criterion functions. Section 8 presents introductory examples, and the chapter concludes in Section 9, where key issues arising from the implementation and interpretation of adaptive wavelets are discussed.
2 Higher multiplicity wavelets
Much of the discussion on wavelets has focused on the case m = 2, i.e. when wavelets are rescaled by a factor of two (see Chapters 3 and 4). In some situations it may be advantageous to rescale by some integer m > 2. When m > 2, wavelets are referred to as higher multiplicity wavelets [7,8,9,10]. For higher multiplicity wavelets, there exists a single scaling function defined by

φ(t) = √m Σ_{k=−∞}^{∞} l_k φ(mt − k)

which generates m − 1 wavelets

ψ^{(z)}(t) = √m Σ_{k=−∞}^{∞} h_k^{(z)} φ(mt − k),   z = 1, …, m − 1

which have m − 1 corresponding sets of high-pass filter coefficients {h_k^{(z)}}, z = 1, …, m − 1. The normalization constant √m is used so that the wavelets form an orthonormal basis. We first consider redefining a multiresolution to cater for situations when functions are rescaled by a general factor m > 2, and then show how the fast wavelet transform (or pyramidal algorithm) is performed for higher multiplicity wavelets. The sequence of closed subspaces {V_j}_{j∈Z} is an m-multiresolution of L²(R) if the following conditions are satisfied [11]:

1. V_j ⊂ V_{j+1}
2. lim_{j→−∞} V_j = ∩ V_j = {0}, and lim_{j→∞} V_j = ∪ V_j is dense in L²(R)
3. f(t) ∈ V_j ⟺ f(mt) ∈ V_{j+1}, j ∈ Z
4. f(t) ∈ V_j ⟹ f(t − k) ∈ V_j, j, k ∈ Z
The subspace V_j contains all the possible approximations of functions in L²(R) at resolution m^j. The orthogonal projection of some function f(t) ∈ L²(R) into V_j is written as

f_{V_j}(t) = Σ_{k=−∞}^{∞} c_{j,k} φ_{j,k}(t)

where φ_{j,k}(t) = m^{j/2} φ(m^j t − k). The orthogonal projection of f(t) into W_j is described by
f_{W_j}(t) = Σ_{z=1}^{m−1} Σ_{k=−∞}^{∞} d_{j,k}^{(z)} ψ_{j,k}^{(z)}(t)

Notice that the wavelet coefficients d_{j,k}^{(z)} are also indexed by z. The function f(t) can be written as a linear combination of wavelet basis functions

f(t) = Σ_{z=1}^{m−1} Σ_{j=−∞}^{∞} Σ_{k=−∞}^{∞} d_{j,k}^{(z)} ψ_{j,k}^{(z)}(t)
in what is often called the wavelet series representation of f(t). We reserve the term wavelet transform for the procedure which results in the computation of the scaling and wavelet coefficients. When m = 2, a pyramidal scheme is used for computing the scaling and wavelet coefficients. Likewise, a pyramidal algorithm can also be used for calculating the scaling and wavelet coefficients comprising the wavelet transform for higher multiplicity wavelets. That is, the scaling coefficients at some resolution are used to produce the scaling and wavelet coefficients at the next (lower) resolution. This is done as follows:

c_{j−1,i} = Σ_{k=−∞}^{∞} l_{k−mi} c_{j,k}

d_{j−1,i}^{(z)} = Σ_{k=−∞}^{∞} h_{k−mi}^{(z)} c_{j,k}

3 m-Band discrete wavelet transform of discrete data
Similar recursion formulae exist for computing the scaling and wavelet coefficients in the m-band DWT of discrete data as those derived for the DWT of continuous functions using higher multiplicity wavelets. Recall that in the case of higher multiplicity wavelets there is one scaling function, defined by one set of low-pass filter coefficients, and m − 1 wavelet functions, defined by m − 1 sets of high-pass filter coefficients. The DWT with higher multiplicity wavelets for continuous functions is likened to performing the DWT on discrete data using a filter system which contains one low-pass filter and m − 1 high-pass filters. The latter is referred to as an m-band DWT [10] of discrete data. For the m-band DWT, the down-sampling rate is by a factor of m. This corresponds to shifting the filter coefficients in each row of the filter matrices by m. This is explained further in the example presented next. A 3-band DWT for the spectrum x = (x0, x1, …, x8) is shown in Fig. 1. There is one low-pass and two high-pass filters, producing one set of scaling (or smoothed) coefficients and two sets of wavelet (or detailed) coefficients. As before, to go from one level to the next, only the scaling coefficients are filtered, and the number of coefficients in each band is reduced by one third when moving from one level to the next. We have presented a transform with two levels (nlev = 2). Following the same notation as introduced earlier, band(j,τ) will be referred to as the τth band, τ ∈ {0, 1, …, m − 1}, at the jth level, j ∈ {J, J − 1, …, J − nlev}, of the DWT. The band at the top of the tree is band(2,0). At the next level the bands from left to right are referred to as band(1,0), band(1,1) and band(1,2). Similarly, the bands in the last level of the DWT are band(0,0), band(0,1) and band(0,2). Using the notation of Chapter 4, where the low- and high-pass filter coefficients are combined into one matrix, the 3-band DWT outlined in Fig. 1, going from level 2 to level 1, is written f₁ = PWf₂. Let us say that Nf = 6; then the full matrix expression is written
Fig. 1 A 3-band discrete wavelet transform.
( c_{1,0}  c_{1,1}  c_{1,2}  d_{1,0}^{(1)}  d_{1,1}^{(1)}  d_{1,2}^{(1)}  d_{1,0}^{(2)}  d_{1,1}^{(2)}  d_{1,2}^{(2)} )ᵀ = P W ( c_{2,0}  c_{2,1}  c_{2,2}  c_{2,3}  c_{2,4}  c_{2,5}  c_{2,6}  c_{2,7}  c_{2,8} )ᵀ

where P is the 9 × 9 permutation matrix that groups the interleaved filter outputs band by band,

P = ( 1 0 0 0 0 0 0 0 0 )
    ( 0 0 0 1 0 0 0 0 0 )
    ( 0 0 0 0 0 0 1 0 0 )
    ( 0 1 0 0 0 0 0 0 0 )
    ( 0 0 0 0 1 0 0 0 0 )
    ( 0 0 0 0 0 0 0 1 0 )
    ( 0 0 1 0 0 0 0 0 0 )
    ( 0 0 0 0 0 1 0 0 0 )
    ( 0 0 0 0 0 0 0 0 1 )

and W is the banded filter matrix (with periodic wrap-around in the last three rows):

W = ( l0      l1      l2      l3      l4      l5      0       0       0      )
    ( h0(1)   h1(1)   h2(1)   h3(1)   h4(1)   h5(1)   0       0       0      )
    ( h0(2)   h1(2)   h2(2)   h3(2)   h4(2)   h5(2)   0       0       0      )
    ( 0       0       0       l0      l1      l2      l3      l4      l5     )
    ( 0       0       0       h0(1)   h1(1)   h2(1)   h3(1)   h4(1)   h5(1)  )
    ( 0       0       0       h0(2)   h1(2)   h2(2)   h3(2)   h4(2)   h5(2)  )
    ( l3      l4      l5      0       0       0       l0      l1      l2     )
    ( h3(1)   h4(1)   h5(1)   0       0       0       h0(1)   h1(1)   h2(1)  )
    ( h3(2)   h4(2)   h5(2)   0       0       0       h0(2)   h1(2)   h2(2)  )
If we let A denote the matrix of filter coefficients, with the first row containing the low-pass filter coefficients and the remaining m − 1 rows the sets of high-pass filter coefficients, and if Nf is the number of filter coefficients contained in each filter, then A will be an m × Nf matrix. A can be partitioned into m × m sub-matrices as follows:

A = (A0 A1 … Aq)

Here, q is a non-negative integer such that q = (Nf/m) − 1. For our example, there were three filters (m = 3), with each filter containing six filter coefficients (Nf = 6), hence q = 6/3 − 1 = 1; then

A = ( l0      l1      l2      l3      l4      l5     )
    ( h0(1)   h1(1)   h2(1)   h3(1)   h4(1)   h5(1)  )
    ( h0(2)   h1(2)   h2(2)   h3(2)   h4(2)   h5(2)  )

could be expressed as A = (A0 A1) with

A0 = ( l0      l1      l2     )        A1 = ( l3      l4      l5     )
     ( h0(1)   h1(1)   h2(1)  )             ( h3(1)   h4(1)   h5(1)  )
     ( h0(2)   h1(2)   h2(2)  )             ( h3(2)   h4(2)   h5(2)  )
An alternative way to describe the DWT is to introduce a convolution matrix for each of the low-pass and high-pass filtering operations. For the case m = 3 and Nf = 6 as presented previously, the filter coefficient matrices which decompose the data at level 2 to the next lower level 1 would be represented as follows:

C2 = ( l0  l1  l2  l3  l4  l5  0   0   0  )
     ( 0   0   0   l0  l1  l2  l3  l4  l5 )
     ( l3  l4  l5  0   0   0   l0  l1  l2 )

D2(1) = ( h0(1)  h1(1)  h2(1)  h3(1)  h4(1)  h5(1)  0      0      0     )
        ( 0      0      0      h0(1)  h1(1)  h2(1)  h3(1)  h4(1)  h5(1) )
        ( h3(1)  h4(1)  h5(1)  0      0      0      h0(1)  h1(1)  h2(1) )

D2(2) = ( h0(2)  h1(2)  h2(2)  h3(2)  h4(2)  h5(2)  0      0      0     )
        ( 0      0      0      h0(2)  h1(2)  h2(2)  h3(2)  h4(2)  h5(2) )
        ( h3(2)  h4(2)  h5(2)  0      0      0      h0(2)  h1(2)  h2(2) )
and the scaling and wavelet coefficients at level one in each of the bands would be calculated by

c1 = C2 c2
d1(1) = D2(1) c2
d1(2) = D2(2) c2

where

c2 = (c_{2,0}  c_{2,1}  c_{2,2}  c_{2,3}  c_{2,4}  c_{2,5}  c_{2,6}  c_{2,7}  c_{2,8})ᵀ
c1 = (c_{1,0}  c_{1,1}  c_{1,2})ᵀ
d1(1) = (d_{1,0}^{(1)}  d_{1,1}^{(1)}  d_{1,2}^{(1)})ᵀ
d1(2) = (d_{1,0}^{(2)}  d_{1,1}^{(2)}  d_{1,2}^{(2)})ᵀ

In general, the m-band DWT from some level j to the next lower level j − 1 can be computed as

c_{j−1} = C_j c_j
d_{j−1}^{(z)} = D_j^{(z)} c_j,   z = 1, …, m − 1.

In summation notation one has

c_{j−1,i} = Σ_{k=0}^{Nf−1} l_k c_{j,mi+k}
d_{j−1,i}^{(z)} = Σ_{k=0}^{Nf−1} h_k^{(z)} c_{j,mi+k}   for z = 1, …, m − 1.

Periodic boundary conditions give

c_{j,k} = c_{j,m^j+k}
d_{j,k}^{(z)} = d_{j,m^j+k}^{(z)}
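To make the recursion concrete, one level of the m-band DWT with periodic boundary conditions could be computed as in the following MATLAB sketch (our own illustration; the function name is ad hoc, and A is the m × Nf filter matrix introduced above, with the low-pass filter in row 1):

function bands = mband_step(c, A)
% c: row vector of scaling coefficients at level j (length divisible by m)
% bands: row 1 holds c_{j-1}, rows 2..m hold the d_{j-1}^{(z)}
[m, Nf] = size(A);
n = length(c);
bands = zeros(m, n/m);
for i = 0:(n/m - 1)
    idx = mod(m*i + (0:Nf-1), n) + 1;   % periodic wrap of index mi+k
    bands(:, i+1) = A * c(idx).';       % all m bands computed at once
end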
These operations can be considered equivalent to the discrete wavelet transform of a continuous function using higher multiplicity wavelets. Our applications involve performing the m-band DWT (m > 2) for each object vector in a spectral data set containing n spectra, each of dimension p. The wavelet (or scaling) coefficients produced from the DWT are used as features for some multivariate method. The m-band DWT has previously been described for a single data vector, but it is more convenient to redefine the transform using a slight change of notation. Let x^{[j]}(τ) be a column vector containing the coefficients in band(j,τ) of the DWT, so that for a given j, the scaling coefficients will be stored in x^{[j]}(0) and x^{[j]}(τ) will be a vector of wavelet coefficients for τ ∈ {1, …, m − 1}. The DWT from level j to level j − 1 is then described by the matrix operations
x^{[j−1]}(0) = C_j x^{[j]}(0)
x^{[j−1]}(z) = D_j^{(z)} x^{[j]}(0),   z = 1, …, m − 1.
The DWT from level j to level j − 1 for all spectra comprising a data set is then described by

X^{[j−1]}(0) = C_j X^{[j]}(0)
X^{[j−1]}(z) = D_j^{(z)} X^{[j]}(0),   z = 1, …, m − 1.
where X^{[j]}(τ) is the matrix containing the coefficients for the objects which lie in band(j,τ). More specifically, if x_i^{[j]}(τ) denotes the coefficients in band(j,τ) obtained for object x_i at level j, then this vector forms the ith column in X^{[j]}(τ). The original data matrix is then represented by
X^{[J]}(0).

4 Filter coefficient conditions
We have demonstrated that it is possible to obtain the discrete wavelet transform of both continuous functions and discrete data points without having to construct the scaling or wavelet functions. We only need to work with the filter coefficients. One may begin to wonder where the filter coefficients actually come from. Basically, wavelets with special characteristics, such as orthogonality, can be determined by placing restrictions on the filter coefficients. The restrictions which are imposed on the filter coefficients so that an MRA and orthogonal wavelet basis exist are summarized as follows [6]:

1. Orthogonality:
   Σ_k A_k A_{k+i}ᵀ = δ(0,i) I
   where δ(0,i) = 1 if i = 0 and zero otherwise, and I is the identity matrix.
2. The basic regularity condition:
   Σ_k l_k = √m
3. The Lawton matrix
   M_ij = Σ_k l_k l_{k+j−mi}
   must have 1 as a simple eigenvalue.
F a c t o r i z a t i o n of filter coefficient matrices
Recall from Section 4, that the wavelet matrix A can be partitioned into m x m submatrices as follows A = (AoAl ... Aq). Provided that the orthogonality condition: ~-]k AkAk+iT --8(0, i) I is satisfied, the wavelet matrix can also be written in the factorized form [1] A = Q<>FI <>F2.-" <>Fq. The symbol <> denotes the "polynomial p r o d u c t " which is defined by (BoB1 "'" Bp-I)~>(CoCICs-I) - (GoGI "'" Gp+s-2) with Gi - ~
BkCi-k k
The factors Fi -- (Rill - Ri)
where Ri is a projection matrix and Q -
~-~i Ai is an orthogonal matrix.
Example If m = 3 and q = 2 then A - (A0 Al A2) with each Aj having dimension 4 • 4 thus, A has size m • m(q + 1) - 3 • 9. Assuming the orthogonality condition is satisfied then A - Q~>FI~F2 = Q<>(R1 ]I - R1)~(R2 1 - R2) = [QR1R21Q(R1 - 2RIR2 + R 2 [ Q ( I -
R1)(I- R2)]
Our aim is to construct Q and each projection matrix Ri (for i - 1 , . . . , q). We first consider the representation of Q. The regularity condition ~-]k lk -- v/m, places a constraint on the first row of Q. This is equivalent to setting the first row of Q to (1/X/~)l~m where lm denotes an m x 1 column vector of ones.
The remaining m − 1 rows are constructed ensuring that the orthogonality of Q is maintained. If the last m − 1 rows are calculated by (I − 2vvᵀ)(T • D), Q will be orthogonal. Here, v represents a normalized vector, and T is an upper triangular matrix with T_ii = i − m and off-diagonal elements equal to 1. The symbol • indicates element-by-element scalar multiplication across two matrices, such that B • C = G ⟺ B_ij C_ij = G_ij. This scalar product of T with the matrix D normalizes the rows of T. The m × m orthogonal matrix Q is thus partitioned as follows:

Q = ( (1/√m)·1_mᵀ       )
    ( (I − 2vvᵀ)(T • D) )
Now for the projection matrices. A symmetric projection matrix of rank P can be written R = UUᵀ, where U_{m×P} is a matrix with orthonormal columns. For the wavelet matrix to be non-redundant we require rank(R1) ≤ rank(R2) ≤ ⋯ ≤ rank(Rq); that is, the individual ranks of the projection matrices form a monotonically increasing sequence [1]. For simplicity we set rank(R1) = rank(R2) = ⋯ = rank(Rq) = 1 and

R_i = u_i u_iᵀ   where   u_iᵀ u_i = 1.
Example
The following example illustrates how A with m = 3 and q = 2 can be constructed. The example begins by defining the column vector v of length m − 1 and two column vectors u1 and u2, both of length m. Let

v = (−0.7918, −0.6107)ᵀ
u1 = (−0.3873, −0.9097, 0.1497)ᵀ
u2 = (−0.9062, 0.1674, 0.3884)ᵀ

First, consider calculating the symmetric projectors R1 = u1u1ᵀ and R2 = u2u2ᵀ:

R1 = (  0.1500   0.3523  −0.0580 )
     (  0.3523   0.8276  −0.1362 )
     ( −0.0580  −0.1362   0.0224 )
and

R2 = (  0.8212  −0.1517  −0.3520 )
     ( −0.1517   0.0280   0.0650 )
     ( −0.3520   0.0650   0.1509 )
Now consider calculating Q. The first row of Q is (1/√3, 1/√3, 1/√3), and the remaining two rows are calculated by (I − 2vvᵀ)(T • D), where

T • D = ( −2   1   1 ) • ( 1/√6  1/√6  1/√6 ) = ( −0.8165   0.4082   0.4082 )
        (  0  −1   1 )   ( 1/√2  1/√2  1/√2 )   (  0       −0.7071   0.7071 )

I − 2vvᵀ = ( −0.2539  −0.9671 )
           ( −0.9671   0.2541 )

which together give

Q = ( 0.5774   0.5774   0.5774 )
    ( 0.2073   0.5802  −0.7875 )
    ( 0.7896  −0.5745  −0.2151 )
Now consider forming the wavelet matrix A. Using the factorized form of the wavelet matrix one has

A = Q◊F1◊F2 = Q◊(R1 | I − R1)◊(R2 | I − R2) = [QR1R2 | Q(R1 − 2R1R2 + R2) | Q(I − R1)(I − R2)];
then substituting for Q, R1 and R2 one arrives at the following result for A:

A = (  0.1542  −0.0285  −0.0661   0.1316   0.6257  −0.0456   0.2917  −0.0198   0.6891 )
    (  0.1690  −0.0312  −0.0724   0.3027   0.6566  −0.1179  −0.2643  −0.0451  −0.5972 )
    ( −0.0430   0.0079   0.0184   0.8258  −0.3336  −0.3569   0.0069  −0.2488   0.1234 )

where

A0 = (  0.1542  −0.0285  −0.0661 )
     (  0.1690  −0.0312  −0.0724 )
     ( −0.0430   0.0079   0.0184 )

A1 = (  0.1316   0.6257  −0.0456 )
     (  0.3027   0.6566  −0.1179 )
     (  0.8258  −0.3336  −0.3569 )

A2 = (  0.2917  −0.0198   0.6891 )
     ( −0.2643  −0.0451  −0.5972 )
     (  0.0069  −0.2488   0.1234 )
We have now shown that A can be constructed from the normalized vectors u1, …, uq and v. Initially, u1, …, uq and v are randomly assigned elements from the uniform distribution. The optimization routine then proceeds to update the elements of these vectors so that some modelling criterion can be optimized.
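The whole construction condenses into a few lines of MATLAB. The following sketch reproduces the m = 3, q = 2 example above (our own illustration; the variable names are ad hoc):

m = 3;
v  = [-0.7918; -0.6107];            % length m-1, unit norm
u1 = [-0.3873; -0.9097; 0.1497];    % length m, unit norm
u2 = [-0.9062; 0.1674; 0.3884];
R1 = u1*u1';  R2 = u2*u2';          % rank-1 symmetric projectors

T = [-2 1 1; 0 -1 1];               % T(i,i) = i - m, ones above the diagonal
D = [ones(1,3)/sqrt(6); ones(1,3)/sqrt(2)];       % row-normalizing factors
Q = [ones(1,m)/sqrt(m); (eye(m-1) - 2*(v*v'))*(T.*D)];  % orthogonal Q

% polynomial product Q <> (R1 | I-R1) <> (R2 | I-R2)
A0 = Q*R1*R2;
A1 = Q*(R1 + R2 - 2*R1*R2);
A2 = Q*(eye(m) - R1)*(eye(m) - R2);
A  = [A0 A1 A2];                    % the 3 x 9 wavelet matrix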
6 Adaptive wavelet algorithm
In this section we summarize the adaptive wavelet algorithm for designing task specific wavelets. Fig. 2 summarizes the adaptive wavelet algorithm. Step 1 of the algorithm sets values for the parameters m, q, j0 and τ0, and Step 2 initializes v and u1, …, uq. Steps 3-6 construct the filter coefficient matrix A, so that the m-band DWT can be performed. Step 7 performs the DWT to level j0. Step 8 extracts the coefficients X^{[j0]}(τ0), and the multivariate criterion measure, which we denote by J(X^{[j0]}(τ0)), is calculated for the extracted data in Step 9. Step 10 assesses whether the stopping criterion of the algorithm has been reached. The stopping criteria are discussed further at the end of this section. If the stopping criterion has not been reached, then the parameters v and {u_i} are updated and the algorithm returns to Step 3. If some stopping criterion has been reached, then the algorithm proceeds to the verification of the Lawton matrix condition. Provided Conditions 1 and 2 of Section 4 hold, the Lawton matrix condition fails only in exceptional degenerate cases; thus the Lawton matrix is verified after the adaptive wavelet has been found. Finally, the multivariate statistical procedure can be performed using the coefficients X^{[j0]}(τ0). The optimizer used in the adaptive wavelet algorithm is the default unconstrained MATLAB optimizer [12]. Before applying the adaptive wavelet algorithm, the m, q, j0 and τ0 values need to be specified. There is no empirical rule for determining these parameters, and more experimentation is required to find a suitable combination. We can however suggest some recommendations.
Fig. 2 The adaptive wavelet algorithm for designing task specific wavelets.
Choosing values for m and q: Since m determines the number of bands in the DWT and the down-sampling factor, we choose m such that p/m^(J−j0) is an integer value. It is important to recall that m combines with q to determine the number of the filter coefficients (Nf = m(q + 1)). The larger the value of Nf, the more parameters are required to be optimized. For this reason another constraint is placed on m, so that Nf does not become too large. We constrain q for similar reasons. In this book we consider setting Nf = 12 and Nf = 16.
Choosing values for j0 and τ0:
• The parameters j0 and τ0 simultaneously determine the band(j0, τ0), and hence the coefficients X^{[j0]}(τ0), on which optimization of the discriminant criterion is based. The coefficients X^{[j0]}(τ0) are later used as inputs to the multivariate statistical method. The value of j0 determines the level of the DWT to which the spectra are decomposed. A value for j0 should be chosen such that p/m^(J−j0), which is the number of coefficients in band(j0, τ0), is suitable (not too large) for the multivariate procedure. For example, in classification the number of objects should be taken into consideration, since classifiers such as Bayesian linear discriminant analysis prefer the number of variables to be much less than the number of observational units. In our applications we set the reduced dimensionality of the data set to be 8 or 16.
• A value for τ0 is also required. To ensure the best j0 and τ0 combination, each of the appropriate values of j0 should be individually tested with each value of τ0. To reduce this computational burden, we have chosen to select τ0 as the band which gives the largest J(X^{[j0]}(τ0)) at initialization. It is recommended that if one suspects the basic shape of the data will be useful, then optimization over the scaling band may prove worthwhile.
7 Criterion functions

The adaptive wavelet algorithm outlined in Section 6 can be used for a variety of situations, and its goal is reflected by the particular criterion which is to be optimized. In this chapter, we apply the filter coefficients produced from the adaptive wavelet algorithm for discriminant analysis. It was stated earlier that the dimensionality is reduced by selecting some band(j0, τ0) of wavelet coefficients from the discrete wavelet transform. It then follows that the criterion function will be based on the same coefficients, i.e. X^{[j0]}(τ0). If the filter coefficients are to be used for discriminatory purposes, then the criterion function should strive to reflect differences among classes. In this section three suitable discriminant criterion functions are described. These discriminant criterion functions are Wilks' lambda (J_Λ), entropy (J_E), and the cross-validated quadratic probability measure (J_CVQPM).
Wilks' lambda

The Wilks' Λ criterion can be used to test the significance of the differences between group centroids [13]. A smaller value of Λ is preferred, since this indicates a larger significance. Wilks' Λ is the ratio of the determinant of the within covariance matrix to the determinant of the total covariance matrix and is defined to be

Λ = |S_W| / |S_T| = |S_W| / |S_W + S_B|

where the total covariance matrix S_T = S_B + S_W is the sum of the between (S_B) and within (S_W) covariance matrices.
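A direct MATLAB sketch of this criterion follows (our own illustration; X holds the band coefficients with objects in rows, y the class labels):

function L = wilks_lambda(X, y)
mu = mean(X, 1);                      % grand mean
d = size(X, 2);
Sw = zeros(d);  Sb = zeros(d);
for r = unique(y).'
    Xr = X(y == r, :);
    mur = mean(Xr, 1);
    Sw = Sw + (Xr - mur).' * (Xr - mur);              % within-class scatter
    Sb = Sb + size(Xr,1) * (mur - mu).' * (mur - mu); % between-class scatter
end
L = det(Sw) / det(Sw + Sb);           % smaller Lambda = better separated groups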
Entropy

Saito and Coifman [14] discuss a cross-entropy measure which can be used to measure how differently vectors are distributed. Let δ(1) and δ(2) be vectors from classes 1 and 2, respectively. If the elements in δ(1) and δ(2) are non-negative and sum to unity, then cross-entropy is defined by

E_cross(δ(1), δ(2)) = Σ_{i=1}^{p0} δ_i(1) log( δ_i(1) / δ_i(2) )    (1)

where p0 = length(δ(1)) = length(δ(2)) is the dimensionality of the vectors. Eq. (1) is not symmetric; that is, the measure of discrepancy E_cross(δ(1), δ(2)) will be different to that for E_cross(δ(2), δ(1)). For our purposes we prefer to use a symmetric criterion, which is defined in [14] as
E_sym(δ(1), δ(2)) = E_cross(δ(1), δ(2)) + E_cross(δ(2), δ(1))

Measuring the distinctness of several vectors from different classes involves calculating E_sym for each combination of vectors. Call this entropy measure the total entropy, E_tot. For example, the total symmetric entropy for δ(1), δ(2) and δ(3) is calculated as follows

E_tot = E_sym(δ(1), δ(2)) + E_sym(δ(1), δ(3)) + E_sym(δ(2), δ(3))

It is necessary to construct a single vector which in some way is representative of each class; this could for instance be a mean vector.
In [14], the representative vector from each class is an energy vector. More specifically, define the class energy vector of the wavelet coefficients from band(j, τ) as

e_{(r)}^{[j]}(τ) = diag( X_{(r)}^{[j]}(τ) (X_{(r)}^{[j]}(τ))ᵀ ) / const,   r = 1, …, R.

The denominator is a normalization constant. The numerator is simply the sum of squares of the wavelet coefficients from either the DWT or WPT which occur in the same position of the wavelet trees, where the DWT or WPT has been performed for objects belonging to the same class. The discriminatory criterion function is then

J_E( X^{[j]}(τ) ) = E_tot( e_{(1)}^{[j]}(τ), …, e_{(R)}^{[j]}(τ) ) = Σ_{r<r'} E_sym( e_{(r)}^{[j]}(τ), e_{(r')}^{[j]}(τ) )
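In MATLAB the symmetric entropy of two class energy vectors could be written as follows (a sketch under the assumption that both vectors are strictly positive and sum to one):

ecross = @(d1, d2) sum(d1 .* log(d1 ./ d2));     % Eq. (1)
esym   = @(d1, d2) ecross(d1, d2) + ecross(d2, d1);
% total entropy for three classes, cf. the expression for Etot above:
% Etot = esym(e1,e2) + esym(e1,e3) + esym(e2,e3)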
Cross-validated quadratic probability measure (CVQPM)

The cross-validated quadratic probability measure (CVQPM) (see Chapter 12 for more details) assesses the trustworthiness of the class predictions made by the discriminant model. The CVQPM ranges from 0 to 1. Ideally, larger values of the QPM are preferred, since this implies the classes can be differentiated with a higher degree of certainty. The CVQPM criterion function based on a band of coefficients X^{[j]}(τ) is defined as follows

J_CVQPM( X^{[j]}(τ) ) = (1/n) Σ_{i=1}^{n} QPM( x_i^{[j]}(τ), −i )

where

QPM( x_i, −i ) = 1 − (1/2) Σ_{r=1}^{R} [ P^{(−i)}( r | x_i^{[j]}(τ) ) − z_{ir} ]²

with z_{ir} = 1 if object i belongs to class r and zero otherwise, and the superscript (−i) indicating that the posterior probability is computed with object i left out. The posterior probability P(r|x) is computed as for Bayesian linear discriminant analysis [15]. That is,

P(r|x) = p(x|r) P(r) / p(x)

where P(r) is the a priori probability of belonging to class r, p(x) the probability density of x, and

p(x|r) = (2π)^{−p/2} |S_W|^{−1/2} exp[ −0.5 (x − μ_r)ᵀ S_W^{−1} (x − μ_r) ]

is the class probability density function, which is assumed to follow a normal distribution.
8 Introductory examples of the adaptive wavelet algorithm

Section 8 applies the adaptive wavelet algorithm to two sets of data in an attempt to further illustrate the mechanics behind the procedures. The first set of data is simulated, whilst the second considers real spectra of various kinds of minerals. The classifier that we use is Bayesian linear discriminant analysis [15].

8.1 Simulated spectra
The simulated data containing three classes were previously generated by [14] as follows:

x(1)_t = (6 + η) χ_[a,b](t) + ε(t)                    Class 1
x(2)_t = (6 + η) χ_[a,b](t) (t − a)/(b − a) + ε(t)    Class 2
x(3)_t = (6 + η) χ_[a,b](t) (b − t)/(b − a) + ε(t)    Class 3
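Under our reading of these formulas (the symbols are defined in the paragraph that follows), one spectrum from each class could be simulated in MATLAB as:

t = 1:128;                           % variable index
a = randi([16 32]);                  % a ~ U_I(16,32)
b = a + randi([32 96]);              % (b - a) ~ U_I(32,96)
chi = double(t >= a & t <= b);       % indicator of [a,b]
eta = randn;                         % eta ~ N(0,1)
x1 = (6 + eta)*chi + randn(1,128);                    % class 1
x2 = (6 + eta)*chi.*(t - a)/(b - a) + randn(1,128);   % class 2
x3 = (6 + eta)*chi.*(b - t)/(b - a) + randn(1,128);   % class 3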
Here η and ε(t) ~ N(0,1), χ_[a,b](t) = 1 if t ∈ [a,b] and zero otherwise, a ~ U_I(16, 32) and (b − a) ~ U_I(32, 96), where U_I denotes the integer-valued uniform distribution. Each of the parameters from the normal and uniform distributions varies for each object. The training data set contains 300 spectra with equal class sizes and the testing data contains 3000 spectra, also with equal class sizes. The dimensionality of the data (i.e. the number of variables) is 128. Fig. 3 shows five sample spectra from each class of the training data. For these data we believe that the basic shape of the data will be useful for classification, and the scaling band will therefore be considered as a possible candidate. Indeed, the scaling band produced the largest CVQPM of 0.9353 at initialization (see Fig. 4(a)). Thus, τ0 = 0 was selected. A marginally smaller CVQPM was produced for band(2,1) at initialization, followed by band(2,3) and then band(2,2). Upon termination of the algorithm, the discriminant measure for band(2,0) has further increased to 0.9641 and clearly produces a larger CVQPM than the remaining bands. To test the classification performance of the adaptive wavelet, the coefficients from each of the bands (at level 2) at initialization and at termination of the algorithm were used as inputs to the classifier. The results are summarized for both the training and test data in Table 1. At initialization the coefficients in band(2,0) gave the best classification rates, closely followed by band(2,1). At completion the classification performance of band(2,0) has further improved,
Fig. 3 Simulated spectra.
Fig. 4 The CVQPM for the coefficients at initialization and termination of the adaptive wavelet algorithm. Optimization was based on (a) the coefficients X^{[2]}(0) and (b) the coefficients X^{[2]}(1).
producing the most favourable results, band(2,1) gave the next best classification results. Since band(2,1) produced quite a competitive CVQPM at initialization, and promising classification results in the previous analysis, optimization over
Table 1. The percentage of correctly classified spectra, at initialization and termination of the adaptive wavelet algorithm. Optimization (criterion J_CVQPM) was based on the coefficients X^{[2]}(0).

                        τ = 0   τ = 1   τ = 2   τ = 3
Initialization  Train    92.7    89.7    58.7    67.0
                Test     90.3    88.4    53.8    66.7
Termination     Train    95.7    81.7    72.7    53.0
                Test     93.3    80.8    67.0    50.4
this band was investigated. As presented in Fig. 4(b), band(2,1) now gives the largest CVQPM at termination, even larger than the scaling band. This is a general observation that can be made: the band on which the discriminant measure is calculated will, in most instances, produce the best discriminant measure at completion. The percentage of correctly classified spectra, as displayed in Table 2, is also more favourable than for the remaining bands, but not as favourable for the testing data as those produced when optimization was based on the scaling coefficients X^{[2]}(0) (see Table 1).
8.2 Mineral spectra

We now apply the AWA to a mineralogical spectral data set. In this example we will investigate the performance of the various discriminatory criterion functions, namely J_Λ, J_E and J_CVQPM. The mineral data consist of five classes, each representing a different mineral: amphibolite, calsilicate, granite, mica and soil. Both the training and test sets contain 20 spectra per class. The response is log 1/reflectance, and this was measured for 512 wavelengths 1478, 1480, …, 2500 nm; hence the dimensionality of the data is
Table 2. The percentage of correctly classified spectra, at initialization and termination of the adaptive wavelet algorithm. Optimization (criterion J_CVQPM) was based on the coefficients X^{[2]}(1).

                        τ = 0   τ = 1   τ = 2   τ = 3
Initialization  Train    92.7    89.7    58.7    67.0
                Test     90.3    88.4    53.8    66.7
Termination     Train    86.0    96.3    61.3    61.0
                Test     87.6    90.6    60.5    58.8
Fig. 5 Five sample spectra from each class of the mineralogical data.
512. Fig. 5 shows five sample spectra from each class of the mineralogical data. In this example, the parameters m, q and j0 were set at 4, 3 and 3, respectively. Optimization was based on the coefficients X^{[3]}(τ0) which gave the maximum J(X^{[3]}(τ0)) at initialization, where τ0 ∈ {0, 1, 2, 3}. The results for each of the criterion functions are displayed in Table 3. Here the classification rates of the individual bands at initialization and at completion of the algorithm are shown. Note that the same starting parameters for v, u1 and u2 have been used for the implementations involving the different modelling criteria; hence the same classification results occur at initialization for each of the criterion functions J_Λ, J_E and J_CVQPM. The shading in the original table indicates the band upon which optimization was based. For the Wilks' lambda criterion, optimization was based on band(3,3), while the entropy criterion optimized over band(3,2). The CVQPM criterion optimized over the scaling band(3,0). A feature which we might expect from the adaptive wavelet algorithm is that, at termination, the band on which optimization was based would outperform the other bands, at least in
Table 3. The percentage of correctly classified spectra, using the coefficients X^{[3]}(τ) for τ = 0, …, 3 at initialization and at termination of the adaptive wavelet algorithm. The discriminant criterion functions were Wilks' lambda, symmetric entropy and the CVQPM.

                                          τ = 0   τ = 1   τ = 2   τ = 3
Initialization (J_Λ, J_E, J_CVQPM)  Train    97      96      97      97
                                    Test     90      90      91      88
Termination (J_Λ)                   Train    98      96      95     100
                                    Test     91      89      88      90
Termination (J_E)                   Train    97      94      94      97
                                    Test     86      89      90      87
Termination (J_CVQPM)               Train   100      98      96      95
                                    Test     96      92      89      87
terms of the percentage of correctly classified training objects. This is the case with the CVQPM and entropy criteria, but it is not so for the Wilks' lambda criterion. Overall, for the results presented in Table 3, the CVQPM seems to perform the most adequately. It is the only criterion function which has improved the test classification rate over that obtained at initialization. One reason why the CVQPM may be outperforming the other criterion functions could be that optimization, and hence classification, is based on the scaling coefficients. So that a fair comparison could be made, the optimization routine using the Wilks' lambda and symmetric entropy criterion functions was repeated, this time forcing optimization over the scaling band. These results are summarized in Table 4. Optimization over the scaling band did improve the results slightly for the Wilks' lambda and symmetric entropy criteria, but these criterion functions were not able to improve upon the results previously obtained with the CVQPM criterion function. As demonstrated in this example, the CVQPM criterion seems to behave more appropriately than the other two functions. This may be due to the cross-validation being implemented, as well as the probability-based measure. In applications to follow we consider the CVQPM function only. More ex-
Table 4. The percentage of correctly classified spectra, using the coefficients X^{[3]}(τ) for τ = 0, …, 3 at initialization and at termination of the adaptive wavelet algorithm. Optimization was based on X^{[3]}(0) and the discriminant criterion functions were Wilks' lambda, symmetric entropy and the CVQPM.

                                          τ = 0   τ = 1   τ = 2   τ = 3
Initialization (J_Λ, J_E, J_CVQPM)  Train    97      96      97      97
                                    Test     90      90      91      88
Termination (J_Λ)                   Train   100      95      96      96
                                    Test     91      89      86      90
Termination (J_E)                   Train    96      94      85      91
                                    Test     92      90      76      87
Termination (J_CVQPM)               Train   100      98      96      95
                                    Test     96      92      89      87
amples of the AWA algorithm are presented in Chapter 12 where comparisons are made with the predefined filter coefficients.
9 Key issues in the implementation of the AWA
There are several items regarding the adaptive wavelet algorithm which warrant further discussion. These items are now considered separately.
Number of iterations. One can argue that using a prespecified number of iterations in the AWA (as we have done) does not necessarily allow for an optimal value to be found. This is quite a valid statement, but from a practical perspective it is more convenient. It is possible that with more extensive experimentation on real and simulated data, a more suitable maximum number of iterations could be found.

Local and global minima. If the AWA does converge to an optimal value prior to reaching the maximum number of iterations, then one can query whether it is indeed a local or a global minimum. As we have discussed previously, unless the problem is continuous and has only one optimal point, there can be no guarantee that a global optimal value has been found. One suggestion offered in [12] is that starting the optimization routine using different parameter values at initialization may assist in overcoming this problem. Due to time constraints this was not
done for every model produced by the AWA. It was, however, trialled for a few settings, where the criterion function did converge to the same optimal value. There is a need for more experimentation to be conducted with regard to the optimization part of the AWA.

Constrained versus unconstrained optimization. In the adaptive wavelet algorithm, it was possible to avoid using constraints which ensure orthogonality. This is due to some clever algebraic factorizations of the wavelet matrix, for which much credit is due to [6]. However, one constraint which we have not discussed in very much detail is that the vectors v, u1, …, uq are required to have unit length. This normalization procedure occurs during the optimization routine. An alternative strategy which could be employed is to place constraints on these vectors requiring them to be normalized.

Choosing the best (m, q, j0, τ0) settings. Selecting the (m, q, j0, τ0) combination involved trialling several suitable combinations of these values. Presently, it is unknown how one might predetermine, with any degree of certainty, which setting combinations will produce the preferred results; which settings are more preferable remains to be further explored.

Validation without an independent test set. Each application of the adaptive wavelet algorithm has been applied to a training set and validated using an independent test set. If there are too few observations to allow for independent testing and training data sets, then cross-validation could be used to assess the prediction performance of the statistical method. Should this be the situation, it is necessary to mention that it would be an extremely computationally expensive exercise to implement a full cross-validation routine for the AWA. That is, it would be too time consuming to leave out one observation, build the AWA model, predict the deleted observation, and then repeat this leave-one-out procedure separately. In the absence of an independent test set, a more realistic approach would be to perform cross-validation using the wavelet produced at termination of the AWA, but it is important to mention that this would not be a full validation.
References

1. R. Turcajová and J. Kautsky, Shift Products and Factorizations of Wavelet Matrices, Numerical Algorithms 8 (1994), 27-54.
2. R. Coifman and M. Wickerhauser, Entropy-Based Algorithms for Best Basis Selection, IEEE Transactions on Information Theory 38 (1992), 713-718.
3. B.A. Telfer, H.H. Szu, G.J. Dobeck, J.P. Garcia, H. Ko, A. Dubey and N. Witherspoon, Adaptive Wavelet Classification of Acoustic Backscatter and Imagery, Optical Engineering 33 (1994), 2192-2203.
4. H.H. Szu, B. Telfer and S. Kadambe, Neural Network Adaptive Wavelets for Signal Representation and Classification, Optical Engineering 31 (1992), 1907-1916.
5. W. Sweldens, The Lifting Scheme: A Construction of Second Generation Wavelets, Preprint, Department of Mathematics, University of South Carolina (1994).
6. J. Kautsky and R. Turcajová, Adaptive Wavelets for Signal Analysis, Proceedings of the Sixth International Conference on Computer Analysis of Images and Patterns, Prague, Springer-Verlag (1995), 906-911.
7. J. Kautsky and R. Turcajová, Pollen Product Factorization and Construction of Higher Multiplicity Wavelets, Linear Algebra and its Applications 22 (1995), 241-260.
8. J. Kautsky, An Algebraic Construction of Discrete Wavelet Transforms, Applications of Mathematics 3 (1993), 169-193.
9. P. Steffen, P. Heller, R.A. Gopinath and C.S. Burrus, Theory of Regular M-band Wavelet Bases, IEEE Transactions on Signal Processing 41 (1993), 3497-3511.
10. P. Heller, H. Resnikoff and R. Wells, Jr., Wavelet Matrices and the Representation of Discrete Functions, in Wavelets: A Tutorial in Theory and Applications (C. Chui, Ed.), Academic Press (1992).
11. R. Turcajová, Compactly Supported Wavelets and Their Generalizations: An Algebraic Approach, Ph.D. Thesis, The Flinders University of South Australia (1995).
12. A. Grace, Optimization Toolbox for Use with MATLAB, The MathWorks, Inc., Natick (1994).
13. M. Tatsuoka, Multivariate Analysis: Techniques for Educational and Psychological Research, Wiley, New York (1971).
14. N. Saito and R.R. Coifman, Local Discriminant Bases, in Mathematical Imaging: Wavelet Applications in Signal and Image Processing II (A.F. Laine and M.A. Unser, Eds), Proc. SPIE 2303 (1994).
15. G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York (1992).
Part II Applications
CHAPTER 9
Application of Wavelet Transform in Processing Chromatographic Data
Foo-tim Chau* and Alexander Kai-man Leung
Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, People's Republic of China
1 Introduction

The term chromatography is derived from Greek words meaning "colour" and "write" [1]. The name of this technique evolved from the earliest work of separating dyes or plant pigments on paper. Today, chromatography is used widely in analytical chemistry for the separation of compounds in sample mixtures. By exploiting different chemical and physical properties, various chromatographic techniques and instruments have been developed for chemical analysis. Such techniques include paper chromatography, thin layer chromatography (TLC), gas chromatography (GC), liquid chromatography (LC), capillary electrophoresis (CE), supercritical fluid chromatography (SFC), ion chromatography (IC) and gel permeation chromatography (GPC). In the past, chromatography was used mainly for the separation of compounds. However, this situation has changed in the last decade. There has been a tendency to combine different analytical techniques or instruments with chromatography for separation and characterization [2]. Examples include gas chromatography coupled with mass spectrometry (GC-MS) or Fourier transform infrared spectroscopy (GC-FTIR), liquid chromatography coupled with mass spectrometry (LC-MS), high performance liquid chromatography coupled with a diode array detector (HPLC-DAD), and capillary electrophoresis coupled with mass spectrometry (CE-MS) or a diode array detector (CE-DAD) [3]. In recent years, the development of wavelet transform (WT) theory in different fields of science has been growing very rapidly. The WT has two major characteristics: the basis functions of WT are localized in both the time and frequency domains, and there are a number of possible wavelet basis functions available. Such properties have attracted analytical chemists to
adopt WT in data analysis and signal processing in chromatography. Up to 1998, more than 120 publications reported the application of WT as a tool for data and signal processing (Table 1). So far, thirteen papers have reported the adoption of WT in chromatographic data processing [4,5]. This chapter brings to the attention of the international chemometrical community the results of the above research, originally largely published exclusively in Chinese.
2 Applications of wavelet transform in chromatographic studies
In chromatographic data analysis and signal processing, analytical chemists always face several problems such as noise suppression, signal enhancement, peak detection, resolution enhancement, and multivariate signal resolution [6]. Various chemometric methods have been proposed for tackling these problems, and give satisfactory results. Transformation techniques such as the Fourier transform, Laplace transform and Hartley transform have been utilized in chromatography for data processing [6]. Recently, the new mathematical technique WT has been introduced to help find answers to the above problems. In the following Sections, we describe selected major applications of WT in chromatography.
Table 1. Number of published papers from 1989 to 1998 that relate to the application of the wavelet transform in chemistry.

Year                          1989  1990  1991  1992  1993  1994  1995  1996  1997  1998  Total
Number of published papers       1     0     0     2     5     6     4    22    50    34    124
2.1 Baseline drift correction
Baseline drift is a very common problem in chromatographic studies. It is classified as a type of long-term noise and is defined as a change in the baseline position. This kind of drift is mainly caused by changes of temperature, solvent programming, and temperature effects on the detector [7]. In most cases, the drift is represented by a curve instead of a linear function. As a result, it induces errors in the determination of peak height and peak area, which are very important parameters for quantitative analysis. In practice, an artificial baseline is usually drawn beneath the peak (Fig. 1). The peak areas and heights determined will be either greater or smaller than the actual values, depending on whether the true baseline has a convex or concave shape. Therefore, most analytical chemists prefer to find out the exact shape of the baseline and then to subtract it from the original raw chromatogram. Pan et al. [8] developed a wavelet-based method for correcting baseline drift in high performance liquid chromatography (HPLC).
Fig. 1 A simulated chromatogram with baseline drift is shown. The straight line drawn beneath the peak represents the artificial baseline for peak area measurement.
In general, the noise, chromatographic peaks, and the baseline are located, respectively, in the higher, middle and lowest frequency regions of the raw data. WT has an intrinsic property of enabling the resolution of a signal into higher and lower frequency parts. With proper use of this property, the correct baseline can be extracted from the raw data. Pan et al. proposed processing the chromatogram with the Daubechies D6 wavelet function at an optimum resolution level j. Then, zero values are assigned to the corresponding peak positions in C_j. After the inverse WT treatment of the signal, the reconstructed chromatogram at the resolution level J, C_{J,baseline}, is obtained, which represents the baseline for the chromatogram under study (Fig. 2(a)). Finally, a baseline-free chromatogram can be obtained by subtracting the baseline from the raw data (Fig. 2(b)). These workers applied this technique to resolve the baseline for an HPLC determination of a complex mixture of sixteen rare-earth elements, and a satisfactory result was obtained. A similar technique was also adopted by Alsberg et al. [9] for baseline removal in a Raman spectroscopic study.
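The procedure just described can be sketched roughly in MATLAB as follows (our own illustration, not the authors' code, and it requires the Wavelet Toolbox; the mapping of the Daubechies D6 filter onto MATLAB's 'db6' and the index vector peaks, which marks the peak positions at the coarse scale, are assumptions):

n = 6;                                    % resolution level (data dependent)
[C, L] = wavedec(x, n, 'db6');            % wavelet decomposition of the signal
cA = appcoef(C, L, 'db6', n);             % scale coefficients C_j
cA(peaks) = 0;                            % zero values at the peak positions
C(1:numel(cA)) = cA;                      % write the modified C_j back
baseline = wrcoef('a', C, L, 'db6', n);   % inverse WT: reconstructed baseline
xc = x - baseline;                        % baseline-corrected chromatogram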
2.2 Signal enhancement and noise suppression

Noise suppression is a very common technique in chromatographic data processing. It aims to enhance an analytical signal to give a higher signal-to-noise ratio. Nowadays, many chromatographic instruments are controlled by computers, and it has become common practice to reduce the noise by
.
.
.
(a)
35ooor/
9
30000 ~
25000 .~ 20000
i
(b)
1
1 5ooo'-1
-50001 0
-
I ._ ~ ~ I .l l i0 20 30 40 50 60 t/s
J 70
,
'
i 50000
10 90 30 40 t/s
50 60
70
Fig. 2 (a) Curve 1 shows the signal of a mixture of sixteen rare-earth elements from an HPLC measurement and Curve 2 shows the baseline after wavelet treatment. (b) HPLC signal with baseline subtraction. Reproduced from reference [8] with the kind permission of the Chinese Chemical Society.
employing digital processing methods such as filtering. Traditionally, analytical chemists favour the adoption of the Savitzky-Golay, Fourier, and Kalman filters for signal processing [10,11]. After the introduction of the WT technique to analytical chemistry, some workers found that the performance of WT is much better than that of the above-mentioned filters in data denoising [12,13]. Shao et al. [14] reported the use of WT to smooth the HPLC signals of rare earth elements. Smoothing and de-noising are two different processes. Smoothing removes high frequency components of the transformed signal regardless of their amplitudes, whereas de-noising removes small-amplitude components of the transformed signal regardless of their frequencies [13]. The basic principle of WT smoothing is very simple. When a chromatogram in digital form is treated with a proper wavelet function at the optimum resolution level j, the C_j thus produced represents the smoothed chromatogram, while D_j, D_{j+1}, …, D_{J−1} represent the noise at the various resolution levels (Fig. 3). In Shao's work, the original chromatograms were treated with the Haar wavelet function, and the smoothed chromatogram C_j was employed for
Fig. 3 C_J is a simulated chromatogram with white noise having a value of 0.05. (a) shows the scale coefficients C at resolution levels J to J−4, and (b) shows the corresponding wavelet coefficients D at these resolution levels. The Daubechies D16 wavelet function was adopted for the WT computation.
quantitative analysis. This result shows that WT can improve the signal-to-noise ratio and the detection limit for HPLC analysis. As compared with WT smoothing, de-noising via WT is another story. It requires one more step, thresholding, for removing noisy components from the wavelet coefficients D. Several methods have been proposed for discarding negligible coefficients or noise in the wavelet domain. These include absolute cut-off, relative energy, an entropy criterion, decreasing-rearrangement and fixed-percentage methods [15], and a universal thresholding algorithm [16]. In these methods, only coefficients with values greater than a pre-defined threshold value are retained. A zero value is assigned to those coefficients with magnitudes less than the threshold value. After the inverse WT treatment, a de-noised chromatogram is obtained. Mittermayr et al. [17] adopted this technique to process chromatographic data. These authors aimed to apply the German DIN 32645 standard for the determination of the detection limit of chromatographic data. Their results demonstrated that WT de-noising could improve the detection limit by up to a factor of three. They explained that the de-noising process could reduce the variance of the peak area and height, and the limit of detection is mainly determined by their variance.
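A minimal MATLAB sketch of this hard-thresholding step follows (our own illustration, requiring the Wavelet Toolbox; the level, wavelet and threshold thr are placeholders, with thr to be chosen by one of the rules cited above):

[C, L] = wavedec(x, 5, 'db8');        % decompose the chromatogram
dstart = L(1) + 1;                    % detail coefficients follow the approximation
d = C(dstart:end);
d(abs(d) < thr) = 0;                  % discard small-amplitude coefficients
C(dstart:end) = d;
xd = waverec(C, L, 'db8');            % inverse WT: de-noised chromatogram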
2.3 Peak detection and resolution enhancement

Peak detection and resolution enhancement are further problems encountered by analytical chemists in chromatographic studies. Obviously, the performance of each chromatographic system has limitations of its own, and consequently no system is sufficiently universal to provide complete separation of excessively complex mixtures. As a result, peak overlap always exists to some extent in a chromatogram. In this situation, we must resort to mathematical techniques to solve the problem. Usually, these problems are solved by using linear or non-linear regression analysis, curve-fitting techniques [18,19], derivative techniques [6], neural networks [20], statistical theory [21], and factor analysis [22]. The curve-fitting technique is the most common method and is widely available in commercial software packages such as PeakFit (SPSS Inc.) and GRAMS/32 CurveFit (Galactic Industries Corporation) for chromatographic data processing. These packages allow the user to fit the chromatogram with a certain number of Gaussian and/or exponentially modified Gaussian functions [23] (Fig. 4); the parameters of these functions are determined via linear or non-linear regression analysis, as in the sketch below. In the following sub-sections, techniques are described which use WT to handle peak detection and resolution enhancement.
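In the same spirit as those packages, a chromatogram can be fitted by non-linear regression; the following is a minimal sketch, assuming NumPy and SciPy, with the two-Gaussian model, the simulated data, and the starting guesses as illustrative assumptions.

    import numpy as np
    from scipy.optimize import curve_fit

    def two_gaussians(t, a1, c1, w1, a2, c2, w2):
        # Sum of two Gaussian peaks: a = height, c = centre, w = width.
        return (a1 * np.exp(-((t - c1) / w1) ** 2 / 2)
                + a2 * np.exp(-((t - c2) / w2) ** 2 / 2))

    # t, y: digitized retention times and detector response (simulated here).
    t = np.linspace(0.0, 10.0, 500)
    y = two_gaussians(t, 1.0, 4.2, 0.35, 0.6, 5.1, 0.40) \
        + 0.01 * np.random.randn(t.size)

    p0 = [1.0, 4.0, 0.3, 0.5, 5.0, 0.3]   # initial guesses for the regression
    params, _ = curve_fit(two_gaussians, t, y, p0=p0)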
Fig. 4 Overlapped chromatographic peaks resolved by the conventional curve-fitting method. The dotted lines represent the Gaussian functions from the curve-fitting treatment.
2.3.1 Derivative technique
The derivative technique is another powerful method for resolving overlapping chromatographic peaks, because the differentiated data offer a higher apparent resolution than the original data [24]. Although the technique is a useful tool for data analysis, it has a major drawback: the noise level increases in higher-order derivative calculations [25]. Recently, our research group proposed a novel method which uses WT for approximate derivative calculation [26]. This method can enhance the signal-to-noise ratio in higher-order derivative calculations and, at the same time, retain all the major properties of the conventional methods. An approximate first derivative of an analytical signal can be expressed as the difference between two scale coefficients C_{J-1}, generated from two different Daubechies wavelet functions. For example, a chromatographic signal X can be treated with two Daubechies wavelet functions D_{2m} and D_{2m̃}, with m and m̃ being any positive integers and m ≠ m̃. Then, the first derivative of X can be expressed as:

$$X^{(1)} \approx C_{J-1,D_{2m}} - C_{J-1,D_{2\tilde{m}}} \qquad (1)$$
Eq. (1) can be applied to X^(n-1) again to determine the approximate derivative at the next higher order. The approximate derivative calculation at higher order can be generalized as:

$$X^{(n)} \approx C^{(n)}_{J-1,D_{2m}} - C^{(n)}_{J-1,D_{2\tilde{m}}}, \quad m \neq \tilde{m} \ \text{and} \ n \geq 1 \qquad (2)$$
with C^(n)_{J-1,D2m} and C^(n)_{J-1,D2m̃} being obtained from a WT treatment of X^(n-1) at the first resolution level with the Daubechies wavelet functions D_{2m} and D_{2m̃}. In our studies, we have found that the signal-to-noise ratio of the first derivative is highest with the use of the D8 and D18 wavelet functions. Fig. 5 shows a comparison between the conventional and wavelet methods for a signal with overlapping peaks. The first derivative obtained from the traditional method was smoothed with a 17-point Savitzky-Golay filter before the second-derivative calculation (Fig. 5(c)).
Fig. 5 (a) A simulated chromatographic signal generated by overlapping two Gaussian functions, with a white noise level of 0.001 added (SNR = 500 for peak 1 and SNR = 250 for peak 2), for the WT derivative calculation. The first derivative (b) and the second derivative (c) of the signal in Fig. 5(a) were obtained by using the conventional method. The first derivative (d) and the second derivative (e) of the signal in Fig. 5(a) were obtained by the proposed WT derivative method.
In the first-derivative plots (Figs 5(b) and (d)), both methods give the same results for the positions of the peak maximum and turning point. Moreover, in the second-derivative plots (Figs 5(c) and (e)), both give similar results for the peak-centre position. The major differences between the two methods are the signal-to-noise values of the first and second derivatives, and the number of coefficients in each derivative, as can be seen from the plots. This wavelet-based derivative method can help analytical chemists to resolve overlapping chromatographic peaks at lower signal-to-noise ratios.
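A minimal sketch of Eqs (1) and (2), assuming PyWavelets (pywt); note that pywt names Daubechies filters by their number of vanishing moments, so db4 and db9 below denote the 8- and 18-tap filters (D8 and D18), and the periodized transform is assumed so that the two coefficient sets have equal length.

    import pywt

    def wt_first_derivative(x, wav_a='db4', wav_b='db9'):
        # Level-1 scale (approximation) coefficients from two different
        # Daubechies filters; their difference approximates X' (Eq. (1)).
        c_a, _ = pywt.dwt(x, wav_a, mode='periodization')
        c_b, _ = pywt.dwt(x, wav_b, mode='periodization')
        return c_a - c_b

    def wt_derivative(x, order=1):
        # Apply the scheme of Eq. (2) repeatedly for higher-order derivatives.
        for _ in range(order):
            x = wt_first_derivative(x)
        return x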
2.3.2 Wavelet coefficients method

Shao and his co-workers developed another WT method to resolve overlapping chromatographic peaks [27-29]. They adopted the wavelet coefficients D_j for quantitative calculation; Fig. 6 shows the results of their study. The chromatograms of a mixture of benzene, methylbenzene and ethylbenzene at different concentrations are given in Fig. 6(a), and peak overlap is evident. After wavelet treatment of one of the chromatograms with the Haar wavelet function, the wavelet coefficients D at resolution levels J-1 to J-4 are as depicted in Fig. 6(b). These workers found that D_{J-3} is the best choice for resolving the overlapped peaks and for quantitative calculation. Fig. 6(c) shows the D_{J-3} signals for all the samples in Fig. 6(a). In order to determine the peak areas of the individual components in Fig. 6(c), the D_{J-3} signals were first baseline-corrected by linking the minimum points of every peak as the baseline. After this treatment, three separated peaks can be identified in the baseline-corrected D_{J-3} signal plot (Fig. 6(d)), each corresponding to one of the components in the samples. Fig. 6(e) shows the calibration curves for the individual components in Fig. 6(d); satisfactory results were obtained for quantitative analysis. These authors also applied the method successfully to the quantitative determination of plant hormones by HPLC with WT treatment [30], and found that better calibration curves were obtained when the chromatograms were processed with WT.
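A minimal sketch of the coefficient-extraction step, assuming PyWavelets (pywt); the decomposition depth is an illustrative assumption, and in pywt's ordering the first detail array returned corresponds to the coarsest level requested (the D_{J-3} used above when level = 3).

    import pywt

    def detail_at_level(chromatogram, level=3, wavelet='haar'):
        # wavedec returns [C_{J-level}, D_{J-level}, ..., D_{J-1}].
        coeffs = pywt.wavedec(chromatogram, wavelet, level=level)
        return coeffs[1]   # detail coefficients D at resolution level J-3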
2.3.3 Multi-resolution and factor analysis

In modern chemical laboratories, new types of instrument known as hyphenated instruments, such as HPLC-DAD, GC-MS, GC-FTIR and CE-DAD, have become very powerful tools for quantitative and qualitative analysis. These instruments can provide information from both chromatographic and spectroscopic studies at the same time.
However, two-way data matrices with overlapped signals are frequently produced from these kinds of measurement. It is very difficult to determine the number of components within the peaks and to resolve them from the raw data. The most economical way to handle this problem is to use computers and chemometric techniques such as factor analysis. Commonly used methods include evolving factor analysis (EFA) [31,32], alternating regression (AR) [3], window factor analysis (WFA) [33], fixed-size moving-window evolving factor analysis (FSMW-EFA) [34], and heuristic evolving latent projections (HELP) [35]. All these methods are very useful for resolving two-way data matrices. However, in real situations, their performance is usually degraded by the presence of high levels of noise. The de-noising and compression properties of WT can be utilized to enhance factor analysis in this situation. In factor analysis, a mathematical model is set up to resolve the chromatogram and spectrum into individual components. Two main steps are involved in the model. The first involves the determination of the minimum number of components, or significant factors, and the positions at which the compounds elute. The second step involves performing factor rotations to determine the elution profiles and pure spectra [36]. Chromatographic baseline drift, spectral background, and high levels of noise have been identified as the major factors affecting the accuracy of the model. As pointed out by Maeder and Zilian [31] and Gemperline [37], baseline offset introduces additional factors into the factor analysis. It causes the wrong number of components to be determined for the analytical system and also affects the final resolution of the mixture. Besides, the presence of a spectral background may also introduce false information into the system. Although this problem can be fixed by the double-centring method proposed by Lewi [38], that method induces rank changes in the matrix, as well as destroying the positivity of the data [39]. In the theory of factor analysis, the term rank is defined as the number of components in the analytical system [33]. In real situations, the value of the rank is always greater than the true number of components.
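The first of these steps is, in essence, a rank determination; a minimal sketch using the singular values of the data matrix follows, assuming NumPy, with the relative noise-floor cut-off as an illustrative assumption.

    import numpy as np

    def estimate_rank(X, noise_floor=1e-2):
        # Rows of X: spectra recorded at successive retention times.
        s = np.linalg.svd(X, compute_uv=False)
        # Singular values above the noise floor count as chemical components.
        return int(np.sum(s / s[0] > noise_floor))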
Fig. 6 (a) HPLC chromatograms of samples at different concentrations. (b) Plots of the wavelet coefficients D obtained by decomposing one of the samples in (a) using the Haar wavelet function. (c) A plot of the wavelet coefficients D_{J-3} of all the samples in (a). (d) A plot of the baseline-corrected wavelet coefficients D_{J-3} of all the samples in (a). (e) Calibration curves of the samples in (a): (i) benzene, (ii) methylbenzene, and (iii) ethylbenzene. Reproduced from reference [29] with the kind permission of the American Chemical Society.
The background of a spectro-chromatogram has the following properties: there is no direct correlation between the chromatographic baseline drift and the spectral background; the spectral background is very similar at the two ends of a chromatographic peak; and, since the scanning time for each spectrum is very short, the drift of the baseline is similar at each retention time [40]. Based on these properties, Shen et al. [39,40] proposed that one can employ WT for the simultaneous removal of the chromatographic baseline drift and the spectral background in HPLC-DAD studies. In their work, the high-pass filter H was adopted to filter out the spectral background from the zero-component regions. A zero-component region is defined as a region in which no chemical component elutes, and corresponds to the noise of the measurement [35]. Correction of the spectral background can then be done by directly subtracting the transformed background spectrum determined in the zero-component regions. In the chromatographic direction, a similar approach, as proposed by Pan et al. [8], can be employed to fix the drift of the baseline. The eigenvalue plots in Fig. 7 show the results of their study. With WT, a proper rankmap was observed and the number of components in the system could be deduced correctly from the rankmap. Shen's algorithm has two advantages: WT does not induce any change of the chemical rank of the analytical system, and it can eliminate various kinds of background regardless of their shapes and behaviours. In the previous section, we mentioned that Shao and his co-workers developed a wavelet-based method to resolve overlapped chromatographic peaks. The same research group also applied WT as a pre-processing step for HPLC-DAD data analysis [41,42]. In one of their works, the number of components in an overlapping chromatogram was determined from the wavelet coefficients D [42]. When a chromatogram is processed with different wavelet functions, such as the Haar, Daubechies, and Symmlet wavelet functions, a special pattern is observed in these coefficients (Fig. 8). By counting the number of positive peaks in one set of the wavelet coefficients, the number of components can be determined. This algorithm has some limitations, in that the resolution cannot be too small and the relative peak heights cannot be too different. As compared with the abstract factor analysis method, a chemometric technique for determining the number of components in HPLC-DAD studies, more accurate results can be obtained within a short processing time. In other work by these authors, WT was coupled with window factor analysis (WFA) for the resolution and quantitative determination of multi-component chromatograms [41].
Fig. 7 Comparison of the rankmap results before and after WT treatment for a simulated two-component system. (a) Rankmap of the raw data without any background added. (b) Rankmap of the data without any background added, after WT treatment. (c) Rankmap of the raw data with only chromatographic baseline drift added. (d) Rankmap of the data with chromatographic background added, after WT treatment. (e) Rankmap of the raw data with spectral and chromatographic background added. (f) Rankmap of the data after background correction by WT. Reproduced from reference [39] with the kind permission of Elsevier Science.
The WFA technique is one of the most powerful methods for resolving overlapping multi-component chromatograms. However, it is difficult to obtain satisfactory results when the data matrix contains a high level of noise [43]. The wavelet transform was therefore chosen as a pre-de-noising step for WFA. The chromatogram at each wavelength is processed by WT, and the scale coefficients C at the optimum resolution level j, which represent a smoothed data set, are selected for WFA. In this way, better-resolved chromatograms can be generated (Fig. 9), leading to an improvement in the quantitative analysis.
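A minimal sketch of this pre-de-noising step, assuming PyWavelets (pywt); reconstructing from the scale coefficients C alone, with all details zeroed, is one simple way to realize the smoothing described here, and the wavelet and level are illustrative assumptions.

    import numpy as np
    import pywt

    def wt_smooth(signal, wavelet='haar', level=3):
        # Keep the scale coefficients C_j, zero all detail coefficients D,
        # and reconstruct: the result is the smoothed signal.
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        coeffs[1:] = [np.zeros_like(d) for d in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)

    # Smooth every wavelength channel of a two-way HPLC-DAD matrix X
    # (rows: retention times, columns: wavelengths) before applying WFA:
    # X_smooth = np.column_stack([wt_smooth(X[:, k]) for k in range(X.shape[1])])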
Fig. 8 (a) Simulated overlapping double peaks with different resolutions, and the discrete details obtained from WT. (b) Simulated overlapping tetra peaks with different resolutions, and the discrete details obtained from WT. (c) Simulated overlapping triple peaks with different half-height widths, and the discrete details obtained from WT. (d) Simulated overlapping triple peaks with different relative heights, and the discrete details obtained by WT. The lines (H), (D) and (S) represent wavelet coefficients generated from WT using the Haar, Daubechies D4 and Symmlet S4 wavelet functions, respectively, at resolution level 5. Reproduced from reference [42] with the kind permission of Elsevier Science.
Fig. 9 Normalized chromatograms of Yb and Tm resolved by WFA (a) without and (b) with WT treatment. The solid lines represent the resolved chromatograms, while the dotted lines represent the standard chromatograms. Reproduced from reference [41] with the kind permission of Wiley & Sons, Ltd.
2.4 Pattern recognition with a combination of wavelet transform and artificial neural networks

Two special applications of WT in chromatographic studies have been reported in recent years. Collantes et al. [44] proposed the employment of the wavelet packet transform (WPT) for pre-processing HPLC results for an artificial neural network; applications of WPT to data processing in chemistry are still very rare. These authors aimed to evaluate several artificial
neural network (ANN) algorithms as potential tools for pharmaceutical fingerprinting based on the analysis of HPLC trace-organic-impurity patterns. The WPT method was chosen as a pre-processing scheme for compressing the raw HPLC data in their work. The compressed data at the optimum resolution level were rearranged and utilized as inputs to various neural networks. It was demonstrated that WPT could provide a fast and efficient method for encoding the chromatographic patterns into a highly reduced set of numerical inputs for the classification process. Shao et al. [45] proposed a new technique, called the immune neural network (INN), to process chromatographic data. The construction of an INN resembles that of an ANN, but it adjusts itself during the process of evolution according to the output of the immune system. The overlapping chromatographic signal acts as an antigen, while the signals of the pure standard components act as antibodies. The WT was employed in the immune interaction process to regulate the immune system; the INN therefore takes advantage of both the ANN and the WT. These authors applied this new algorithm to a noisy three-component overlapping chromatogram. The results showed that the noise in the original signal, the baseline, and the impurity peaks can be clearly removed by the method, resulting in successful retrieval of the information on every component in the overlapping chromatogram.
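A minimal sketch of WPT-based compression of a chromatographic trace into a reduced input vector, assuming PyWavelets (pywt); the wavelet, tree depth, and magnitude-ranked coefficient selection are illustrative assumptions rather than the scheme of reference [44].

    import numpy as np
    import pywt

    def wpt_compress(chromatogram, wavelet='db4', level=4, n_inputs=32):
        # Decompose to a full wavelet-packet tree and gather the coefficients
        # of all nodes at the chosen depth.
        wp = pywt.WaveletPacket(chromatogram, wavelet, maxlevel=level)
        coeffs = np.concatenate(
            [node.data for node in wp.get_level(level, 'natural')])
        # Keep only the largest-magnitude coefficients as the reduced ANN input.
        idx = np.argsort(np.abs(coeffs))[::-1][:n_inputs]
        return coeffs[np.sort(idx)]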
3 Conclusion
In conclusion, wavelet transforms have been employed by analytical chemists to solve various problems in chromatographic studies. Owing to the popularity of hyphenated instruments, more applications based on the two-dimensional wavelet transform (2D-WT) can be expected, as the 2D-WT technique is better suited to processing the data produced by such instruments.
4 Acknowledgement
This work was supported by the Research Grants Council (RGC) of the Hong Kong Special Administrative Region (Grant No. HKP 45/94E) and the Research Committee of The Hong Kong Polytechnic University (Grant No. A020).
References

1. R.A. Day, Jr. and A.L. Underwood, Quantitative Analysis, Sixth Edition, Prentice-Hall, Englewood Cliffs, NJ, (1991), pp. 490-492.
2. E. Jooken, Hyphenated Techniques in Chromatography, Trends in Analytical Chemistry 17 (1998), VIII-IX.
3. E.J. Karjalainen and U.P. Karjalainen, Data Analysis for Hyphenated Techniques, Elsevier, Amsterdam, (1996), pp. 17-22.
4. A.K.M. Leung, Wavelet Transform in Chemistry, http://fg702-6.abct.polyu.edu.hk/~kmleung/wavelet.html, (accessed January 1999).
5. A.K.M. Leung, F.T. Chau and J.B. Gao, A Review on Applications of Wavelet Transform Techniques in Chemical Analysis: 1989-1997, Chemom. Intell. Lab. Syst. 43 (1998), 165-184.
6. A. Felinger, Data Analysis and Signal Processing in Chromatography, Elsevier, Amsterdam, (1998).
7. N. Dyson, Chromatographic Integration Methods, Second Edition, Royal Society of Chemistry, Cambridge, (1998), pp. 50-60.
8. Z.X. Pan, X.G. Shao, H.B. Zhong, W. Liu, H. Wang and M.S. Zhang, Correction of Baseline Drift in High-Performance Liquid Chromatography by Wavelet Transform, Chinese Journal of Analytical Chemistry 24 (1996), 149-153 (in Chinese).
9. B.K. Alsberg, A.M. Woodward and D.B. Kell, An Introduction to Wavelet Transforms for Chemometricians: A Time-frequency Approach, Chemometrics and Intelligent Laboratory Systems 37 (1997), 215-239.
10. S. Brown, T.B. Blank, S.T. Sum and L.G. Weyer, Chemometrics, Anal. Chem. 66 (1994), 315R-359R.
11. S. Brown, S.T. Sum and F. Despangne, Chemometrics, Anal. Chem. 68 (1996), 21R-62R.
12. C.R. Mittermayr, S.G. Nikolov, H. Hutter and M. Grasserbauer, Wavelet Denoising of Gaussian Peaks: A Comparative Study, Chemometrics and Intelligent Laboratory Systems 34 (1996), 187-202.
13. V.J. Barclay, R.F. Bonner and I.P. Hamilton, Application of Wavelet Transforms to Experimental Spectra: Smoothing, De-noising, and Data Set Compression, Anal. Chem. 69 (1997), 78-90.
14. L.M. Shao, B. Tang, X.G. Shao, G.W. Zhao and S.T. Liu, Wavelet Transform Treatment of Noise in High Performance Liquid Chromatography, Chinese Journal of Analytical Chemistry 25 (1997), 15-18 (in Chinese).
15. R.R. Coifman, Y. Meyer, S. Quake and M.V. Wickerhauser, Signal Processing and Compression with Wavelet Packets, in Wavelets and Their Applications (J.S. Byrnes, J.L. Byrnes, K.A. Hargreaves and K. Berry, Eds), Kluwer Academic Publishers, The Netherlands, (1994), pp. 363-379.
16. D.L. Donoho, De-noising by Soft-Thresholding, IEEE Transactions on Information Theory 41 (1995), 613-627.
17. C.R. Mittermayr, H. Frischenschlager, E. Rosenberg and M. Grasserbauer, Filtering and Integration of Chromatographic Data: A Tool to Improve Calibration?, Fresenius' J. Anal. Chem. 358 (1997), 456-464.
18. D. Ozdemir and R.R. Williams, Simple Method for Extracting Gaussian Peak Parameters, Applied Spectroscopy 51 (1997), 749-754.
19. M.L. Phillips and R.L. White, Dependence of Chromatogram Peak Areas Obtained by Curve-fitting on the Choice of Peak Shape Function, J. Chromatogr. Sci. 35 (1997), 75-81.
20. S.R. Gallant, S.P. Fraleigh and S.M. Cramer, Deconvolution of Overlapping Chromatographic Peaks using a Cerebellar Model Arithmetic Computer Neural Network, Chemometrics and Intelligent Laboratory Systems 18 (1993), 41-57.
21. F. Dondi, A. Bassi, A. Cavazzini and M.C. Pietrogrande, A Quantitative Theory of the Statistical Degree of Peak Overlapping in Chromatography, Anal. Chem. 70 (1998), 766-773.
22. F.C. Sánchez, S.C. Rutan, M.D. Gil García and D.L. Massart, Resolution of Multicomponent Overlapped Peaks by the Orthogonal Projection Approach, Evolving Factor Analysis and Window Factor Analysis, Chemom. Intell. Lab. Syst. 36 (1997), 153-164.
23. S. Le Vent, Simulation of Chromatographic Peaks by Simple Functions, Anal. Chim. Acta 312 (1995), 263-270.
24. M.J. Adams, Chemometrics in Analytical Spectroscopy, Royal Society of Chemistry, Cambridge, (1995), pp. 54-62.
25. S.J. Haswell, Practical Guide to Chemometrics, Marcel Dekker, New York, (1992), pp. 264-267.
26. A.K.M. Leung, F.T. Chau and J.B. Gao, Wavelet Transform: A Method for Derivative Calculation in Analytical Chemistry, Anal. Chem. 70 (1998), 5222-5229.
27. X.G. Shao, P.Y. Sun, W.S. Cai and M.S. Zhang, Resolution of Overlapping Chromatograms by Wavelet Transform, Chinese Journal of Analytical Chemistry 25 (1997), 671-674 (in Chinese).
28. X.G. Shao, P.Y. Sun, W.S. Cai and M.S. Zhang, Wavelet Analysis and its Application to the Resolution of Overlapping Chromatograms, Chemistry (Huaxue Tongbao) 8 (1997), 59-62 (in Chinese).
29. X.G. Shao, W.S. Cai, P.Y. Sun, M.S. Zhang and G.W. Zhao, Quantitative Determination of the Components in Overlapping Chromatographic Peaks using Wavelet Transform, Anal. Chem. 69 (1997), 1722-1725.
30. X.G. Shao, S.Q. Hou, N.H. Fang, Y.Z. He and G.W. Zhao, Quantitative Determination of Plant Hormones by High Performance Liquid Chromatography with Wavelet Transform, Chinese Journal of Analytical Chemistry 26 (1998), 107-110 (in Chinese).
31. M. Maeder, Evolving Factor Analysis: a New Multivariate Technique in Chromatography, Anal. Chem. 59 (1987), 527-530.
32. M. Maeder and A. Zilian, Evolving Factor Analysis, a New Multivariate Technique in Chromatography, Chemometrics and Intelligent Laboratory Systems 3 (1988), 205-213.
33. E.R. Malinowski, Window Factor Analysis: Theoretical Derivation and Application to Flow Injection Analysis Data, J. Chemom. 6 (1992), 29-40.
34. H.R. Keller and D.L. Massart, Peak Purity Control in Liquid Chromatography with Photodiode-Array Detection by a Fixed Size Moving Window Evolving Factor Analysis, Anal. Chim. Acta 246 (1991), 379-390.
35. O.M. Kvalheim and Y.Z. Liang, Heuristic Evolving Latent Projections: Resolving Two-way Multicomponent Data. 1. Selectivity, Latent-Projective Graph, Datascope, Local Rank, and Unique Resolution, Anal. Chem. 64 (1992), 936-946.
36. A.K. Elbergali and R.G. Brereton, Influence of Noise, Peak Position and Spectral Similarities on Resolvability of Diode-Array High-Performance Liquid Chromatography by Evolutionary Factor Analysis, Chemometrics and Intelligent Laboratory Systems 23 (1994), 97-106.
37. P.J. Gemperline, Target Transformation Factor Analysis with Linear Inequality Constraints Applied to Spectroscopic-Chromatographic Data, Anal. Chem. 58 (1986), 2656-2663.
38. P. Lewi, Spectral Map Analysis: Factorial Analysis of Contrasts, Especially from Log Ratios, Chemometrics and Intelligent Laboratory Systems 5 (1989), 105-116.
39. H.L. Shen, J.H. Wang, Y.Z. Liang, K. Pettersson, M. Josefson, J. Gottfries and F. Lee, Chemical Rank Estimation by Multiresolution Analysis for Two-way Data in the Presence of Background, Chemometrics and Intelligent Laboratory Systems 37 (1997), 261-269.
40. H.L. Shen, J.H. Wang, Y.Z. Liang and W.C. Chen, Multiresolution Analysis of Hyphenated Chromatographic Data, Chemical Journal of Chinese Universities 18 (1997), 530-534.
41. X.G. Shao and W.S. Cai, Resolution of Multicomponent Chromatograms by Window Factor Analysis with Wavelet Transform Preprocessing, J. Chemom. 12 (1998), 85-93.
42. X.G. Shao, W.S. Cai and P.Y. Sun, Determination of the Component Number in Overlapping Multicomponent Chromatograms using Wavelet Transform, Chemometrics and Intelligent Laboratory Systems 43 (1998), 147-155.
43. A.K. Elbergali, R.G. Brereton and A. Rahmani, Influence of the Method of Calculation of Noise Thresholds on Wavelength Selection in Window Factor Analysis of Diode-Array High-Performance Liquid Chromatography, Analyst (London) 121 (1996), 585-590.
44. E.R. Collantes, R. Duta, W.J. Welsh, W.L. Zielinski and J. Brower, Preprocessing of HPLC Trace Impurity Patterns by Wavelet Packets for Pharmaceutical Fingerprinting using Artificial Neural Networks, Anal. Chem. 69 (1997), 1392-1397.
45. X.G. Shao, Z.H. Chen, J. Chen and X.Q. Lin, Immune Neural Network Algorithm and its Application in High Performance Liquid Chromatography Analysis, in Abstracts of The Third International Symposium of Worldwide Chinese Scholars on Analytical Chemistry, 16-18 Dec. 1998, Hong Kong, Hong Kong Baptist University Printing Section, Hong Kong, (1998), pp. 3-6.
CHAPTER 10
Application of Wavelet Transform in Electrochemical Studies
Foo-tim Chau and Alexander Kai-man Leung
Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, People's Republic of China
1 Introduction
Electrochemistry involves the study of the relationship between electrical signals and chemical systems incorporated into an electrochemical cell. It plays a very important role in many areas of chemistry, including analysis, thermodynamic studies, synthesis, kinetic measurements, energy conversion, and biological electron transport [1]. Electroanalytical techniques such as conductivity, potentiometry, voltammetry, amperometric detection, coulometry, measurements of impedance, and chronopotentiometry have been developed for chemical analysis [2]. Nowadays, most electroanalytical methods are computerized, not only in their instrumental and experimental aspects, but also in their use of powerful methods for data analysis. Chemometrics has become a routine approach to data analysis in many fields of analytical chemistry, including electroanalytical chemistry [3,4]. In the previous chapters, applications of the wavelet transform (WT) in spectroscopic and chromatographic studies have been discussed. In this chapter, we will focus our discussion on the applications of WT in electrochemical studies. Up to December 1998, 25 publications had reported the use of WT in one area of electrochemistry, namely voltammetry [5,6].
2 Application of wavelet transform in electrochemical studies
2.1 B-spline wavelet transform in voltammetry

In WT computation, many wavelet functions have been proposed by different workers. The simplest one, the Haar wavelet (which is also the first member of the family of Daubechies wavelets [7]), has been known for more
than 80 years in various mathematical fields. The Daubechies wavelet is the most popular one in WT applications. In addition, there are many other wavelet families, such as the Meyer wavelet, Coiflet wavelet, orthogonal wavelet, and spline wavelet [7,8]. In voltammetry, the spline wavelet was chosen as the major wavelet function for data de-noising. The function has been applied successfully to the analysis of voltammetric data since 1994 by Lu and Mo [9], and Mo and his co-workers have published more than fifteen papers on this topic in various journals. The spline wavelet is different from the Daubechies wavelet functions. Mathematically, the mth-order basis spline (B-spline) wavelet, N_m, is defined recursively by convolution of the Haar wavelet function as follows [10]:

$$N_m(t) = N_{m-1}(t) * N_1(t) = \int_0^1 N_{m-1}(t - x)\,dx, \quad m \geq 2 \qquad (1)$$

The symbol $*$ denotes the convolution operation between N_{m-1} and N_1. The kth term of N_m is given by

$$N_m(t_k) = \sum_{j=0}^{k} N_{m-1}(t_j)\, N_1(t_{k-j}) \qquad (2)$$

with j ≥ 0 [11]; the result is equivalent to a sum of products between the coefficients of N_{m-1} and N_1 in a shifted manner. The mother wavelet function ψ(t) may be expressed as

$$\psi(t) = \sum_{n=0}^{3m-2} q_n N_m(2t - n) \qquad (3)$$

with

$$q_n = \frac{(-1)^n}{2^{m-1}} \sum_{j=0}^{m} \binom{m}{j} N_{2m}(n - j + 1), \quad n = 0, 1, 2, \ldots, 3m-2 \qquad (4)$$

In spline wavelet computation, optimization needs to be carried out on two parameters, namely the order of the B-spline, m, and the truncation frequency (or frequency scale), L, which represents the cut-off frequency between the useful signal and the noise. Details of B-spline theory can be found in references [12,13].
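The recursion in Eq. (1) is easy to realize numerically; the following is a minimal sketch, assuming NumPy, with the sampling density as an illustrative assumption. For m = 3 this reproduces the quadratic B-spline that underlies the third-order filter discussed below.

    import numpy as np

    def bspline(m, samples_per_unit=256):
        # N_1 is the indicator function on [0, 1); Eq. (1): N_m = N_{m-1} * N_1.
        n1 = np.ones(samples_per_unit)
        nm = n1.copy()
        for _ in range(m - 1):
            # Discrete convolution approximates the integral in Eq. (1);
            # dividing by samples_per_unit supplies the quadrature step dx.
            nm = np.convolve(nm, n1) / samples_per_unit
        return nm   # samples of N_m on its support [0, m]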
In 1995, Mo and Yan published their first paper on the application of WT in electroanalytical chemistry [14]. They developed a real-time continuous wavelet filter to de-noise signals from staircase voltammetry. After a prolongation pre-treatment of the original voltammetric signal, the pre-processed signal is taken as the input of the filter system. A detailed study is then performed on the discrete values, in the time domain, of the impulse-response function of the wavelet filter. The real-time wavelet filter is set up by identifying the relationship between the prolongated signal and the original signal. By modifying the boundary condition of the input signal, the filter can improve the signal-to-noise ratio (SNR) and the standard deviation of the post-processed signal. As shown in Fig. 1, the performance of the real-time wavelet filter in de-noising voltammetric signals is very good. The authors [14] applied the new method successfully to real-time signal de-noising in investigating staircase voltammetry in the ZnSO4-K2SO4 and K3Fe(C2O4)3-K2C2O4-H2C2O4 systems.
Fig. 1 (a) Experimental voltammogram from a solution containing 1.0 × 10^-7 mol/l ZnSO4 + 1.0 × 10^-3 mol/l K2SO4, and the signal processed with the single-side prolongation treatment. (b) 0.5-order deconvolution voltammogram from a solution containing 4.30 × 10^-5 mol/l K3Fe(C2O4)3 + 0.10 mol/l K2C2O4 + 0.010 mol/l H2C2O4, and the signal processed with the double-side prolongation treatment. (● represents the original signal and the solid line represents the signal processed with the third-order B-spline wavelet filter.) Reproduced from reference [14] with kind permission of Science in China Press.
In another two publications, Mo and his co-workers report a detailed study to optimize the order of the B-spline wavelet basis, m, the truncation frequency, L, the SNR, and the number of sampling points, n, for voltammetry [13,15]. In these studies, the B-spline wavelet de-noising technique was employed to analyse the Ti(IV)-H2C2O4 reversible system [13,15], the W(VI)-Mo(VI)-HAPP-KClO3 adsorption catalytic system [15], and the ferrocene-LiClO4-CH3CN system [16]. Fig. 2 shows the effect of different B-spline orders, m. The authors found that the third-order B-spline wavelet, with m = 3, is the most suitable function for de-noising the voltammetric signals, because a very smooth voltammogram is obtained (Fig. 2(b)) and there is a minimal change in the peak width when compared with the theoretical voltammogram. Fig. 3 shows the effect of different truncation frequencies, L. When L is greater than 4, noise remains embedded in the voltammograms. At L = 3, a smoothed voltammogram with minimum deviation from the theoretical peak potential, Ep, and peak current, Ip, is observed (Fig. 3(d)). For the effect of SNR (from 1.0 to 0.15), they found that there is a large deviation in Ep and Ip when the SNR value is reduced.
Fig. 2 Effect of different B-spline orders: (a) m = 2; (b) m = 3; (c) m = 4; and (d) the theoretical curve, with SNR = 0.2, L = 3, n = 2^8. Reproduced from reference [13] with kind permission of Elsevier Science.
Fig. 3 Effect of different truncation frequencies: (a) L = 6; (b) L = 5; (c) L = 4; and (d) L = 3, with SNR = 0.2, m = 3, n = 2^8. Reproduced from reference [13] with kind permission of Elsevier Science.
Finally, they tested the effect of the number of sampling points, from 2^6 to 2^10; the deviations in Ep and Ip are reduced at higher values of this parameter. The authors concluded that the third-order B-spline wavelet basis and a truncation frequency of L = 3 are the optimum parameters for processing voltammetric signals.

Recently, Mo's research group has developed a multi-filtering technique to process voltammetric signals with the B-spline wavelet [17,18]. In this investigation, B-spline wavelet analysis was adopted to decompose the signals into different low-frequency components and noise. Occasionally, part of the useful information is filtered out together with the noise. In order to extract the useful information thoroughly, the noise contribution was treated as the original signal and processed again with the B-spline wavelet method. The low-frequency signals so obtained were then utilized to compensate for the loss of information in the original signal after de-noising. In this work, the proposed method was tested with simulated staircase voltammetric signals; Eqs (5)-(7) are the equations used in their simulation. The current function is

$$F(x) = \frac{1}{1 + \exp(x)} \qquad (5)$$

and its semi-differential form is

$$V_k = y^{0.5} = \frac{d^{0.5}}{dx^{0.5}} F(x) \qquad (6)$$

Here, V_k represents the semi-differentiated current function, and the variable x is derived from the potential as follows:

$$x = \frac{(E - E_{1/2})\,nF}{RT} \qquad (7)$$

In Eq. (7), E and E_{1/2} represent, respectively, the measured and half-wave potentials, while n, F, R, and T denote the number of electrons transferred, the Faraday constant, the universal gas constant, and the temperature, respectively. The authors found that the relative errors in the peak height were less than 3%, and those in the peak potential were less than 10%, even when the SNR value was reduced to 0.1. This technique has been applied successfully to analyse the voltammetric signal of a Cd(II)-succinate-oxalate complex system [19,20].

In the past, the Fourier transform, spline function, and Kalman filter were the major techniques for data processing in electroanalytical chemistry [21]. Mo and his co-workers have also compared the application of the Fourier transform (FT) and the wavelet transform (WT) in voltammetric studies. The FT has difficulty in filtering out low-frequency noise. Although this problem can be tackled by increasing the number of sampling points, such treatment may increase the number of computations dramatically [22,23]. Other alternatives, such as lowering the truncation coefficient and operating a "quadratic filter" [24], can partly filter away the low-frequency noise, but these operations have a great effect on the original peak profile and reduce the height of the peak under study [23]. In general, B-spline wavelet analysis performs better than the FT method in de-noising. When signals with singular points are processed with WT, the singular points affect the signal at both ends but do not spread to the other data. For a signal with large differences at the boundaries, singular points will be observed, which cause fluctuation of the signal at the boundaries after the de-noising treatment (Fig. 4). Bao et al. [23] compared the differential pulse stripping voltammetric signal of a Pb(II)-KCl system processed with both the B-spline wavelet and FT methods. Fig. 5 shows the result of the comparison: the signal processed by WT is smooth and very close to the real value.
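The semi-differentiation in Eq. (6) can be sketched with a Grünwald-Letnikov scheme, one standard discrete form of a 0.5-order derivative; NumPy is assumed, and the grid and the choice of this particular fractional-derivative formula are illustrative assumptions.

    import numpy as np

    def gl_fractional_derivative(y, alpha, h):
        # Grünwald-Letnikov estimate: y^(alpha)(x_k) ~ h**-alpha *
        # sum_j w_j * y[k-j], with w_0 = 1 and w_j = w_{j-1}*(j - 1 - alpha)/j.
        n = len(y)
        w = np.cumprod(np.r_[1.0,
                             (np.arange(1, n) - 1.0 - alpha) / np.arange(1, n)])
        return np.array([w[:k + 1] @ y[k::-1] for k in range(n)]) / h ** alpha

    x = np.linspace(-10.0, 10.0, 512)
    F = 1.0 / (1.0 + np.exp(x))                          # current function, Eq. (5)
    V = gl_fractional_derivative(F, 0.5, x[1] - x[0])    # semi-derivative, Eq. (6)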
Fig. 4 Influence of singular points on FT and WT treatment: (a) theoretical curve (solid line); (b) inverse FT curve (dotted line); (c) curve processed by the B-spline WT (dash-dot line). Reproduced from reference [23] with kind permission of the American Chemical Society.
However, the signal processed by FT is oscillatory and gives a lower value for the peak height; the FT result thus leads to unavoidable errors in the analysis. B-spline wavelet analysis is a very good technique for processing voltammetric signals, but it suffers from a drawback: for a signal with a low SNR value, peak-potential shifting is always observed in the de-noised signal [25]. Therefore, Mo and his co-workers developed a new procedure combining both B-spline wavelet analysis and FT to process voltammetric signals [25-28].
Fig. 5 Results of the differential pulse stripping voltammetric signal of the Pb(II)-KCl system processed with (a) wavelet de-noising and (b) Fourier de-noising, respectively. Reproduced from reference [23] with kind permission of the American Chemical Society.
The combined algorithm can compensate for the disadvantages of B-spline wavelet analysis and of FT: FT filtration keeps the original peak position, while WT eliminates the low-frequency noise. The new method involves decomposition of the original signal into a discrete approximation and discrete details of different density by the B-spline WT treatment. The wavelet-transformed signals are then processed further by the modified Fourier method [24,25] to obtain satisfactory results. The conventional Fourier de-noising method usually involves multiplication by a rectangular filter function, f_k, for both the real and the imaginary parts. The quantity f_k is defined as:

$$f_k = \begin{cases} 1, & k = 0, 1, \ldots, i-1 \\ 0, & k = i, i+1, \ldots, \mathrm{Int}\left(\dfrac{N+1}{2}\right) \end{cases} \qquad (8)$$

with i, k and N representing the point of truncation, a running index, and the total number of data points, respectively; the symbol Int( ) denotes the integer function. In the modified Fourier de-noising method, the filter function is instead:

$$f_k = \begin{cases} 1 - (k/i)^2, & k = 0, 1, \ldots, i-1 \\ 0, & k = i, i+1, \ldots, \mathrm{Int}\left(\dfrac{N+1}{2}\right) \end{cases} \qquad (9)$$
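A minimal sketch of the two filters in Eqs (8) and (9), assuming NumPy; applying them to the half-spectrum from a real FFT is an implementation convenience, not part of the published procedure.

    import numpy as np

    def fourier_filter(y, i_trunc, tapered=True):
        # Multiply the Fourier coefficients by f_k of Eq. (8) (rectangular)
        # or Eq. (9) (tapered), then transform back.
        Y = np.fft.rfft(y)
        k = np.arange(len(Y))
        if tapered:
            f = np.where(k < i_trunc, 1.0 - (k / i_trunc) ** 2, 0.0)  # Eq. (9)
        else:
            f = np.where(k < i_trunc, 1.0, 0.0)                       # Eq. (8)
        return np.fft.irfft(Y * f, n=len(y))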
This combined method has been applied successfully to process the differential pulse stripping voltammetric signals of the Zn(II)-KCl (Fig. 6) [25] and formaldehyde-acetaldehyde [26,27] systems. Very recently, Mo and his co-workers developed another combination technique to de-noise voltammetric signals: the spline wavelet and the Riemann-Liouville transform (RLT) were coupled together for the first time to filter out random noise as well as extraneous currents [29-31]. The capacitive current is the most significant extraneous current. The RLT, an integral transformation employed in fractional calculus, is a very effective method for removing some extraneous, regularly changing currents, but it is not good at filtering random noise. Since the RLT procedure is quite complicated, details are not given here; they can be found in references [29,31]. As in the combined B-spline wavelet and FT method, the spline wavelet was applied first to filter random noise from the current signals. Then, the wavelet-processed current curves at different sampling times were treated
Fig. 6 Results of (a) the differential pulse stripping voltammetric signal of the Zn(II)-KCl system and (b) the background current signal of the KCl solution, de-noised with the combined B-spline wavelet and Fourier transform analysis. Reproduced from reference [25] with kind permission of Science in China Press.
with the RLT to remove the capacitive current from these current signals at every step. Fig. 7 shows the result of the combined spline wavelet and RLT treatment on the Cd(II)-KNO3 system [29]. With this new method, the errors associated with the peak current were less than 5.0%, and those of the peak potential were less than 1.0%.

2.2 Other wavelet transform applications in voltammetry
Another research team has also successfully applied WT in voltammetric studies. Specifically, WT was applied to process DPV signals [32], potentiometric titration curves [33], and oscillographic chronopotentiometric signals [34,35]. Chen et al. [32] proposed a new type of wavelet function, the difference of Gaussians (DOG), to process differential pulse voltammetric signals.
Fig. 7 Results of the current curve of the step voltammetric signal of the Cd(II)-KNO3 system (a) without, and (b) after, spline wavelet and RLT processing. Reproduced from reference [29] with kind permission of the Royal Society of Chemistry.
The function is defined as:

$$W(t) = \exp\left(-\frac{t^2}{2}\right) - \frac{1}{2}\exp\left(-\frac{t^2}{8}\right) \qquad (10)$$
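A minimal sketch of a single-scale transform with the DOG function of Eq. (10), assuming NumPy; implementing the transform as a convolution with the dilated, normalized wavelet is an illustrative choice, and the scale grid is arbitrary.

    import numpy as np

    def dog(t):
        # Difference-of-Gaussians mother wavelet, Eq. (10).
        return np.exp(-t**2 / 2.0) - 0.5 * np.exp(-t**2 / 8.0)

    def dog_transform(signal, a, dt=1.0):
        # Wavelet transform at scale a as a convolution with the dilated DOG.
        t = (np.arange(signal.size) - signal.size // 2) * dt
        kernel = dog(t / a) / np.sqrt(a)
        return np.convolve(signal, kernel, mode='same') * dt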
In DPV quantitative analysis, it is very difficult to measure the peak height of a component in a sample at low concentration, because this affects the linear detection range of the DPV system. These workers [32] therefore employed the DOG wavelet function to transform the DPV signal generated from a highly concentrated Cu2+ solution and to determine the scale parameter, a. The DPV signals at other concentrations were then transformed with the pre-determined scale parameter. A new linear calibration curve was obtained from the results deduced from the WT treatment, and this lowered the detection limit of the analysis. Wang et al. [33] reported a novel application of WT to potentiometric titration. They made use of the edge-detection property of WT to determine the end-point of a potentiometric titration. There are two common approaches to end-point determination in potentiometric titration. The first is direct graphical interpretation of the titration curve, using methods such as Behrend's, Brötter's, and Tubbs' methods [36]. The second is mathematical interpretation of the coordinates of the recorded points, using methods such as Gran's [37-39] and the derivative method [40]. Wang et al. [33] proposed the use of the maximum absolute value of the first-order differential function to determine the end-point. A second-order continuously differentiable spline function was chosen
in this work to process the titration curve obtained. After WT treatment of the curve, the maximum absolute value among the wavelet coefficients marks the end-point of the titration (Fig. 8). Oscillographic chronopotentiometry is a new type of electroanalytical technique developed in the P.R. China [35]. The technique is based on the change of the oscillographic signal on a cathode-ray oscilloscope. Haar and Daubechies wavelet functions were employed by another group in the P.R. China to de-noise the oscillographic signals of Pb(II) ions in NaOH solution, and of multi-component systems such as Cu(II) and Al(III) ions in LiCl solution and Cd(II) and In(III) in NaOH solution [34,35]. They found that this method gives a significant reduction in the detection limit when compared with classical signal-processing methods.
Fig. 8 Titration curves of (a) HCl, (b) HOAc, (c) H3PO4, and (d) H2C2O4 with NaOH, and their discrete wavelet coefficients. Reproduced from reference [33] with kind permission of Higher Education Press.
Fang and Chen [41] proposed a new tool for processing electroanalytical signals: an adaptive wavelet filter based on the wavelet packet transform (WPT) technique. They investigated the behaviour of the WPT towards white noise, which arises from random and irregular processes. The decomposition was performed with the B-spline wavelet of order 3. Their results showed that the adaptive wavelet filter can be applied to a system with interference originating from the mains power supply, which is useful for the study of fast electron-transfer processes. Fang et al. [42] also proposed a new algorithm for processing electroanalytical signals whose data length is not equal to 2^P, where P is an integer. Under this method, the signal S = {s_1, s_2, ..., s_n} is broken down into two overlapping parts, S_a and S_b, each with a data length of 2^P:

$$S_a = \{s_1, s_2, \ldots, s_{2^P}\}, \qquad S_b = \{s_{n-2^P+1}, \ldots, s_{n-1}, s_n\} \qquad (11)$$

Then, the wavelet de-noising process is performed separately on S_a and S_b, and the de-noised signals are recombined to regenerate the signal in the original domain.
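A minimal sketch of this split-and-recombine scheme, assuming PyWavelets (pywt); the wavelet, the soft universal threshold, and letting the second segment overwrite the overlap on recombination are illustrative assumptions rather than the authors' published choices.

    import numpy as np
    import pywt

    def denoise_pow2(x, wavelet='sym4'):
        # Soft-threshold de-noising on a 2**P-length segment.
        coeffs = pywt.wavedec(x, wavelet)
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745
        thr = sigma * np.sqrt(2.0 * np.log(len(x)))
        coeffs[1:] = [pywt.threshold(c, thr, mode='soft') for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)

    def denoise_any_length(s, wavelet='sym4'):
        # Eq. (11): split s into two overlapping 2**P-point segments Sa and Sb,
        # de-noise each separately, then recombine in the original domain.
        m = 2 ** int(np.log2(len(s)))
        out = np.empty(len(s))
        out[:m] = denoise_pow2(s[:m], wavelet)    # Sa: first 2**P points
        out[-m:] = denoise_pow2(s[-m:], wavelet)  # Sb: last 2**P points
        return out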
3 Conclusion
In conclusion, researchers in the P.R. China have applied WT successfully in electroanalytical studies, and we hope that this chapter will introduce their work to chemists elsewhere in the world.
4 Acknowledgement
This work was supported by the Research Grants Council (RGC) of the Hong Kong Special Administrative Region (Grant No. HKP 45/94E) and the Research Committee of The Hong Kong Polytechnic University (Grant No. A020).
References

1. W.R. Heineman, Introduction to Electroanalytical Chemistry, in Chemical Instrumentation: A Systematic Approach, Third Edition (H.A. Strobel and W.R. Heineman, Eds), Wiley, New York, (1989), pp. 963-999.
2. J. Osteryoung, Introduction, in Handbook of Instrumental Techniques for Analytical Chemistry (F. Settle, Ed), Prentice-Hall PTR, Upper Saddle River, NJ, (1997), pp. 685-690.
3. S.D. Brown and R.S. Bear, Jr, Chemometric Techniques in Electrochemistry: a Critical Review, Critical Reviews in Analytical Chemistry 24 (1993), 99-131.
4. Y.N. Ni and J.L. Bai, Applications of Chemometrics in Electroanalytical Chemistry, Chinese Journal of Analytical Chemistry 24 (1996), 606-612 (in Chinese).
5. A.K.M. Leung, F.T. Chau and J.B. Gao, A Review on Applications of Wavelet Transform Techniques in Chemical Analysis: 1989-1997, Chemometrics and Intelligent Laboratory Systems 43 (1998), 165-184.
6. A.K.M. Leung, Wavelet Transform in Chemistry, http://fg702-6.abct.polyu.edu.hk/~kmleung/wavelet.html, (accessed January 1999).
7. I. Daubechies, Ten Lectures on Wavelets, SIAM Press, Philadelphia, (1992).
8. C.K. Chui, An Introduction to Wavelets, Academic Press, New York, (1992), p. 49.
9. X.Q. Lu and J.Y. Mo, Wavelet Analysis as a New Method in Analytical Chemometrics, Chinese Journal of Analytical Chemistry 24 (1996), 1100-1106 (in Chinese).
10. X.Q. Lu and J.Y. Mo, Spline Wavelet Multi-resolution Analysis for High-noise Digital Signal Processing in Ultraviolet-visible Spectrophotometry, Analyst (Cambridge, U.K.) 121 (1996), 1019-1024.
11. B.B. Hubbard, The World According to Wavelets: The Story of a Mathematical Technique in the Making, A.K. Peters, Wellesley, MA, (1996).
12. S. Sakakibara, A Practice of Data Smoothing by B-spline Wavelets, in Wavelets: Theory, Algorithms, and Applications (C.K. Chui, L. Montefusco and L. Puccio, Eds), Academic Press, San Diego, CA, (1994), pp. 179-196.
13. X.Y. Zou and J.Y. Mo, Spline Wavelet Analysis for Voltammetric Signals, Analytica Chimica Acta 340 (1997), 115-121.
14. L. Yan and J.Y. Mo, Study on New Real-time Digital Wavelet Filters for Electroanalytical Signals, Chinese Science Bulletin 40 (1995), 1567-1570 (in Chinese).
15. X.Y. Zou and J.Y. Mo, Spline Wavelet Analysis of Step Voltammetry Signals, Chemical Journal of Chinese Universities 17 (1996), 1522-1527 (in Chinese).
16. X.Q. Lu, J.Y. Mo, J.W. Kang and J.Z. Gao, Method of Processing Discrete Data for Deconvolution Voltammetry II: Spline Wavelet Transformation, Analytical Letters 31 (1998), 529-540.
17. X.Y. Zou and J.Y. Mo, Spline Wavelet Multifiltering Analysis, Chinese Science Bulletin 42 (4) (1997), 382-385 (in Chinese).
18. X.Y. Zou and J.Y. Mo, Spline Wavelet Multifiltering Analysis, Chinese Science Bulletin (English Edition) 42 (8) (1997), 640-644.
19. X.Q. Lu and J.Y. Mo, Spline Wavelet Multifrequency Channel Filters for High Noise Digital Signal Processing in Voltammetry, Acta Scientiarum Naturalium Universitatis Sunyatseni 36 (1997), 129-130.
20. X.Q. Lu and J.Y. Mo, Methods of Handling Discrete Data for Deconvolution Voltammetry (I): Wavelet Transform Smoothing, Chemical Journal of Chinese Universities 18 (1997), 49-51 (in Chinese).
21. X.Q. Lu, J.Y. Mo, J.W. Kang and J.Z. Gao, Application of Signal Processing Methods in Electroanalytical Chemistry, Chinese Journal of Analytical Chemistry 26 (1998), 597-602 (in Chinese).
22. L.J. Bao, J.Y. Mo and Z.Y. Tang, Comparative Study on Signal Processing in Analytical Chemistry by Fourier and Wavelet Transforms, Acta Chimica Sinica 55 (1997), 907-914 (in Chinese).
23. L.J. Bao, J.Y. Mo and Z.Y. Tang, The Application in Processing Analytical Chemistry Signals of a Cardinal Spline Approach to Wavelets, Analytical Chemistry 69 (1997), 3053-3057.
24. E.E. Anbanel, J.C. Myland, K.B. Oldham and C.G. Zeski, Fourier Smoothing of Electrochemical Data Without the Fast Fourier Transform, Journal of Electroanalytical Chemistry 184 (1985), 239-255.
25. L.J. Bao and J.Y. Mo, A Modified Fourier Transform Method for Processing High-Noise-Level Electrochemical Signals, Chinese Science Bulletin 43 (1) (1998), 42-45 (in Chinese).
26. L.J. Bao, J.Y. Mo and Z.Y. Tang, The Application of Spline Wavelet and Fourier Transformation in Analytical Chemistry, Chemical Journal of Chinese Universities 19 (1998), 193-197 (in Chinese).
27. L.J. Bao, J.Y. Mo and Z.Y. Tang, Combined Spline Wavelet and Fourier Transform Processing of Analytical Chemistry Signals, Chemistry in Hong Kong 2 (1998), 53-58.
28. L.J. Bao, Z.Y. Tang and J.Y. Mo, The Application of Spline Wavelet and Fourier Transform in Analytical Chemistry, in New Trends in Chemometrics, First International Conference on Chemometrics in China, Zhangjiajie, China, October 17-22, 1997 (Y.Z. Liang, R. Nortvedt, O.M. Kvalheim and H.L. Shen, Eds), Hunan University Press, Changsha, (1997), pp. 197-198.
29. X.P. Zheng, J.Y. Mo and P.X. Cai, Simultaneous Application of Spline Wavelet and Riemann-Liouville Transform Filtration in Electroanalytical Chemistry, Analytical Communications 35 (1998), 57-59.
30. X.P. Zheng and J.Y. Mo, The Coupled Application of B-Spline Wavelet and RLT Filtration in Staircase Voltammetry, in New Trends in Chemometrics, First International Conference on Chemometrics in China, Zhangjiajie, China, October 17-22, 1997 (Y.Z. Liang, R. Nortvedt, O.M. Kvalheim and H.L. Shen, Eds), Hunan University Press, Changsha, (1997), pp. 199-200.
31. X.P. Zheng and J.Y. Mo, Removal of Extraneous Signals in Step Voltammetry, Chinese Journal of Analytical Chemistry 26 (1998), 679-683.
32. J. Chen, H.B. Zhong, Z.X. Pan and M.S. Zhang, Application of Wavelet Transform in Differential Pulse Voltammetric Data Processing, Chinese Journal of Analytical Chemistry 24 (1996), 1002-1006 (in Chinese).
33. H. Wang, Z.X. Pan, W. Liu, M.S. Zhang, S.Z. Si and L.P. Wang, The Determination of Potentiometric Titration End-points by using Wavelet Transform, Chemical Journal of Chinese Universities 18 (1997), 1286-1290 (in Chinese).
34. J.B. Zheng, H.B. Zhong, H.Q. Zhang and D.Y. Yang, Application of Wavelet Transform in the Retrieval of Useful Information from the d2E/dt2-t Signal, Chinese Journal of Analytical Chemistry 26 (1998), 25-28 (in Chinese).
35. H.B. Zhong, J.B. Zheng, Z.X. Pan, M.S. Zhang and H. Gao, Investigation on the Application of Wavelet Transform in Recovering Useful Information from Oscillographic Signals, Chemical Journal of Chinese Universities 19 (1998), 547-549 (in Chinese).
36. K. Ren and A. Ren-Kurc, A New Numerical Method of Finding Potentiometric Titration End-points by Use of Rational Spline Functions, Talanta 33 (1986), 641-647.
37. G. Gran, Determination of the Equivalence Point in Potentiometric Titrations, Part II, Analyst (London) 77 (1952), 661-671.
38. F.T. Chau, Introducing Gran Plots to Undergraduates, Ed. Chem. 27 (1990), 109-110.
39. F.T. Chau, H.K. Tse and F.L. Cheng, Modified Gran Plots of Very Weak Acids on a Spreadsheet, Journal of Chemical Education 67 (1990), A8.
40. W.R. Heineman and H.A. Strobel, Potentiometric Methods, in Chemical Instrumentation: A Systematic Approach, Third Edition (H.A. Strobel and W.R. Heineman, Eds), Wiley, New York, (1989), pp. 1000-1054.
41. H. Fang and H.Y. Chen, Wavelet Analyses of Electroanalytical Chemistry Responses and an Adaptive Wavelet Filter, Analytica Chimica Acta 346 (1997), 319-325.
42. H. Fang, J.J. Xu and H.Y. Chen, A New Method of Extracting Weak Signals, Acta Chimica Sinica 56 (1998), 990-993 (in Chinese).
CHAPTER 11
Applications of Wavelet Transform in Spectroscopic Studies

Foo-tim Chau and Alexander Kai-man Leung
Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, People's Republic of China
1 Introduction
The spectroscopic techniques of ultraviolet-visible (UV-VIS) spectroscopy, infrared (IR) spectroscopy, mass spectrometry (MS), nuclear magnetic resonance (NMR) spectroscopy, and photoacoustic (PA) spectroscopy are widely used in analytical chemistry for both qualitative and quantitative analysis. Nowadays, most analytical instruments in modern laboratories are computerized, partly owing to the rapid development of advanced microelectronics technology. Digitized spectroscopic data can be exported from these instruments very easily for subsequent signal processing. Several types of technique are employed commonly in analytical chemistry for signal processing, including data smoothing (denoising) and data compression. Data smoothing aims to remove high-frequency components of the transformed signal regardless of their amplitudes, whereas data denoising aims to remove small-amplitude components of the transformed signal regardless of their frequencies [1]. Both methods can increase the signal-to-noise ratio (S/N) of signals by eliminating the noise or background via digital filters [2]. On the other hand, data compression aims at reducing the storage space and processing time during and after signal processing [3]. In chemical analysis, data compression is very important, especially in setting up digitized spectroscopic libraries [4]. Before 1989, the Fourier transform (FT) and fast Fourier transform (FFT) were the main tools employed by chemists to manipulate data from analytical studies [5-7]. After the publication of an important paper by Daubechies [8] in 1988, a new transformation algorithm called the wavelet transform (WT) became a popular method for signal processing in various fields of science and engineering. This new technique has been demonstrated to be fast in computation, with localization and quick-decay properties, in contrast to existing methods such as the FFT. A few chemists have applied this new method for
signal processing in chemistry, and satisfactory results have been obtained. Up to December 1998, more than 140 papers had been published in various fields of chemistry and chemical engineering [9,10]. Since few chemists are familiar with wavelet theory and its application in chemistry, we shall present some specific applications of WT in analytical chemistry in this and the following chapters. We will focus our discussion on three major areas in analytical chemistry: spectroscopy, chromatography, and electrochemical studies. The application of WT in chromatographic and electrochemical studies will be discussed in other chapters. In this chapter, selected applications of WT in UV-VIS, IR, MS, NMR and PA spectroscopies will be described.

In spectroscopic measurement, the raw spectral data X are a combination of the true readings and noise in discrete format. In order to extract the true readings from the raw data, a digital processing method such as filtering is commonly employed. In the past, a number of filters of various kinds have been developed in different fields of science and technology. However, only a few, such as the Savitzky-Golay, Fourier and Kalman filters [11,12], are extensively used by chemists. These types of filters are implemented in most modern analytical instruments for data denoising. Since 1989, the development of wavelet theory has had a remarkable impact on analytical chemistry. Wavelet filters have been introduced into this area of chemistry for signal denoising. Recently, WT has been utilized to compress spectral data or to distinguish important properties from the acquired data. Generally, WT is superior to FT in many respects. In Fourier analysis, only sine and cosine functions are available as filters [13]. However, many wavelet filter families have been proposed. They include the Meyer wavelet, the Coiflet wavelet, the spline wavelet, the orthogonal wavelet, and Daubechies' wavelet [14,15]. Both Daubechies' and spline wavelets are widely employed in chemical studies. Furthermore, there is a well-known drawback in Fourier analysis (Fig. 1). Since the filters chosen for Fourier analysis are localized in the frequency domain, the time information is hidden after transformation. It is impossible to tell where a particular signal, for example that shown in Fig. 1(b), takes place [13]. A small frequency change in FT produces changes everywhere in the Fourier domain. On the other hand, wavelet functions are localized both in frequency (or scale) and in time, via dilations and translations of the mother wavelet, respectively. Both time and frequency information are maintained after transformation (Figs. 1(c) and (d)).
Fig. 1 (a) Experimental, (b) Fourier-transformed, and (c), (d) wavelet-transformed IR spectra of benzoic acid. Spectra (c) and (d) were derived from (a) with a Daubechies D16 wavelet filter at resolution levels J − 1 and J − 2, respectively.
Up to December 1998, more than 30 publications had reported spectroscopic studies making use of a WT algorithm [9,10]. Within this work, WT has been utilized in three major areas: data denoising, data compression, and pattern recognition. Two classes of wavelet algorithm, namely the discrete wavelet transform (DWT) and the wavelet packet transform (WPT), have been commonly adopted in the computation. The former is also known as the fast wavelet transform (FWT). The general theory of both FWT and WPT can be found in other chapters of this book and in some chemical journals [16-18], and is not repeated here. In the following sections, selected applications of WT to different spectral techniques will be described.
2
Applications of wavelet transform in infrared spectroscopy
Infrared spectroscopy plays an important role in the identification and characterization of chemicals and is used widely in modern laboratories. So
far, 14 publications have been reported which use WT in IR spectroscopy [9,10]. Besides data compression and data denoising, WT has been applied in some special areas such as wavelet neural networks and standardization in IR studies. New computational algorithms involving WT will also be discussed in this section.
2.1 Novel algorithms for wavelet computation in IR spectroscopy

Traditionally, the number of data points to be processed with WT must equal 2^P, where P can be any positive integer [19]. A data set with 2^P data points can improve the computational efficiency. In real situations, it is not easy for a chemical instrument to generate exactly 2^P pieces of data. As in FFT, a series of zeros can be appended to one end or both ends of the original data set in order to bring the total length to the next power of 2. This method is called the zero-padding method [20,21]. Alternatively, truncation of data at one or both ends of the original data, to the previous power of 2, may be adopted in some cases. In the WT treatment, the data length has another limitation. Owing to the generic nature of WT, the data lengths of the scale coefficients, C_j, and wavelet coefficients, D_j, must be the same after WT treatment at a particular level, j. So, the data length for a basis to be processed at the previous resolution level must be an even number. These constraints seriously limit the application of WT in signal processing. A novel algorithm called the coefficient position retaining (CPR) method has been introduced by our research group to improve the WT computation in IR spectroscopy [22]. This method guarantees smooth operation of the FWT and WPT computations on spectral data of any length. Suppose the original spectral data are represented by C_J, where J represents the highest level in the FWT computation. In this approach, if the data length, n_{c,j}, of a scale coefficient C_j is an even number, FWT is applied as usual. Then, the scale and wavelet coefficients obtained at resolution level (j − 1) will have n_{c,j−1} (= n_{c,j}/2) and n_{d,j−1} (= n_{c,j}/2) coefficients, respectively. On the other hand, if n_{c,j} is an odd number, FWT is applied without using the last coefficient of C_j in the calculation. This coefficient is retained and transferred downward to the same position at the next resolution level, where it becomes the last coefficient of D_{j−1}. As a result, the scale and wavelet coefficients will have n_{c,j−1} (= (n_{c,j} − 1)/2) and n_{d,j−1} (= (n_{c,j} − 1)/2 + 1) elements, respectively. Figs. 2(a)-(c) show schematic diagrams for applying the CPR algorithm in FWT to data sets with 1024, 1531, and 1023 data points, respectively. With the use of the CPR method, it is guaranteed that the total data length remains unchanged and that the quality of the reconstructed spectrum is not affected. This algorithm can also be applied to other chemical systems, such as UV-VIS spectroscopy and chromatography, for WT treatment.
Fig. 2 A schematic diagram showing the operation of the FWT method with a data length of (a) N = 1024, (b) N = 1531 and (c) N = 1023, coupled with CPR treatment. The slanted regions represent coefficients to be stored and the black regions show the positions of the coefficient(s) archived using the CPR method.
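To make the bookkeeping concrete, the following is a minimal Python sketch of a single CPR-augmented FWT step. It is our own illustration rather than the implementation of ref. [22], and it assumes the simple Haar filter, whereas in practice any orthogonal filter could be used.

import numpy as np

# One CPR-augmented FWT step with a Haar filter. If the input length is
# odd, the last coefficient is set aside and appended to the detail
# (wavelet) coefficients of the next level, so the total number of
# stored coefficients never changes.
def fwt_step_cpr(c):
    c = np.asarray(c, dtype=float)
    retained = None
    if len(c) % 2 == 1:                      # odd length: retain last point
        c, retained = c[:-1], c[-1]
    approx = (c[0::2] + c[1::2]) / np.sqrt(2.0)   # scale coefficients C_{j-1}
    detail = (c[0::2] - c[1::2]) / np.sqrt(2.0)   # wavelet coefficients D_{j-1}
    if retained is not None:
        detail = np.append(detail, retained)      # CPR: carry coefficient down
    return approx, detail

def ifwt_step_cpr(approx, detail):
    # Undo one step; a trailing unmatched detail value is the CPR coefficient.
    if len(detail) == len(approx) + 1:
        detail, retained = detail[:-1], detail[-1]
    else:
        retained = None
    c = np.empty(2 * len(approx))
    c[0::2] = (approx + detail) / np.sqrt(2.0)
    c[1::2] = (approx - detail) / np.sqrt(2.0)
    return c if retained is None else np.append(c, retained)

x = np.random.rand(1531)                     # any length, as in Fig. 2(b)
a, d = fwt_step_cpr(x)
assert len(a) + len(d) == len(x)             # total data length unchanged
assert np.allclose(ifwt_step_cpr(a, d), x)   # perfect reconstruction

For a 1531-point input this yields 765 scale and 766 wavelet coefficients, in agreement with Fig. 2(b).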
WT has been proposed in our study as a new method for compressing spectra for storage and library searching. In this kind of work, spectra are reconstructed from time to time from the compressed data. In order to maintain the quality of the reconstructed spectra, we have introduced another technique, called the translation-rotation transformation (TRT) method [23], into the wavelet computation. In the FWT operation, the spectral data vector C_J needs to be extended periodically at the two extremes in the following manner:

C_J^{extend} = {c_{J,n−1}, c_{J,n}, c_{J,1}, c_{J,2}, ..., c_{J,n−1}, c_{J,n}, c_{J,1}, c_{J,2}, ...}    (1)
In practice, the first data point, c_{J,1}, and the last data point, c_{J,n}, at the two extremes do not have a common value. As a result, a small delay, which results from the discontinuity of the spectral data at the boundary, will be observed at both ends of the reconstructed spectrum. Such a phenomenon is known as the side-lobe problem and causes deterioration of the quality of the reconstructed data (Fig. 3) [24]. The TRT algorithm involves subtracting from the data vector C_J selected quantities B = {b_1, b_2, ..., b_k} to give the rotated array

c_{J,k}^{TRT} = c_{J,k} − b_k    (2)

with

b_k = c_{J,1} + (c_{J,n} − c_{J,1})(k − 1)/(n − 1)    (3)
Fig. 3 (a) The reconstructed IR spectrum of benzoic acid produced from the compressed data with the FWT and zero-padding methods. (b) and (c) show magnified plots of the reconstructed spectrum in (a). (d) The reconstructed IR spectrum of benzoic acid produced from the compressed data using the FWT and TRT methods together. (e) and (f) show magnified plots of the reconstructed spectrum in (d).
where k is a running index from 1 to n. After TRT treatment, c_{J,1}^{TRT} and c_{J,n}^{TRT} at the two extremes of C_J^{TRT} are the same, and a smooth extension vector is obtained. Figs. 3(d)-(f) show the reconstructed IR spectrum of benzoic acid with TRT treatment. It is obvious that fewer side-lobes are observed compared with the reconstruction without TRT treatment (Figs. 3(b) and (c)). The CPR and TRT schemes are not limited to IR spectroscopic data; they can be adopted in other areas of analytical chemistry for WT data processing.
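As an illustration of Eqs. (2) and (3), the short Python sketch below removes the straight line joining the two end points of a spectrum before transformation and restores it afterwards; the function names and the synthetic test data are ours, not from ref. [23].

import numpy as np

# TRT correction of Eqs. (2)-(3): subtract the straight line joining the
# first and last data points so that both ends share a common value and
# the periodic extension of Eq. (1) has no discontinuity.
def trt_forward(c):
    c = np.asarray(c, dtype=float)
    k = np.arange(len(c))
    b = c[0] + (c[-1] - c[0]) * k / (len(c) - 1)   # Eq. (3), zero-based k
    return c - b, (c[0], c[-1])                    # Eq. (2), plus line params

def trt_inverse(c_trt, ends):
    first, last = ends
    k = np.arange(len(c_trt))
    return c_trt + first + (last - first) * k / (len(c_trt) - 1)

spectrum = np.linspace(0.9, 0.1, 1024) + 0.05 * np.random.rand(1024)
tilted_out, ends = trt_forward(spectrum)
assert abs(tilted_out[0] - tilted_out[-1]) < 1e-12   # ends now match
assert np.allclose(trt_inverse(tilted_out, ends), spectrum)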
2.2 Spectral compression with wavelet neural networks

The wavelet neural network (WNN), which is a combination of the wavelet transform and a neural network, was proposed as a new algorithm for IR spectral data compression [25]. The neural network has been applied widely in chemistry [26]. It may be considered as a "black box" that transforms m-variable inputs into n-variable outputs [27]. The network is formed by a group of neurons that are organized in one or more layers. Each neuron can accept m-variable inputs with different weighting factors, which are modified during a network training process. After the required computation, each neuron delivers its own output from the current layer to the neurons of the next layer. This process is repeated on each layer until the output layer is reached. Fig. 4(a) shows a typical single-layer neural network with m inputs and one output. Each circle represents a neuron and has a particular weighting factor, w, which is usually derived from the sigmoidal transfer function (SF). The Σ sign and the S-shaped symbol in each neuron represent, respectively, the summation operation and the sigmoidal transfer function. When spectral data X are applied to this neural network, a response or output value, y_{SF}, is obtained through the following expression:

y_{SF} = Σ_{i=1}^{n} w_i x_i    (4)
As suggested in reference [25], the traditional sigmoidal function can be replaced with the Morlet wavelet basis function, F_{DWT}, in neural network analysis (Fig. 4(b)). When spectral data X are applied to this WNN system, a response or output value y_{DWT} is obtained as follows:

y_{DWT} = Σ_{i=1}^{n} w_i F_{DWT}((x_i − b_i)/a_i)    (5)
Fig. 4 The architecture of (a) a single-layer neural network with the sigmoidal transfer function, as well as the wavelet neural network for (b) IR spectral data compression, and (c) pattern recognition in UV-VIS spectroscopy.
and

F_{DWT}(x) = cos(1.75x) exp(−x²/2)    (6)
In the above equation, wi, bi and ai denote the weighting factor, translation coefficient and dilation coefficient, respectively, for each wavelet basis. In Liu's work [25], the wavenumber and transmittance quantities of the IR spectrum were used as the input and target output values, respectively, of the network. Their proposed neural network consisted of a single layer network
with 49 neurons. With proper training, the weighting factor for each neuron and the required parameters in the wavelet function can be optimized. These authors adopted this WNN scheme to compress selected IR spectra: compression ratios of 50% and 80% were reported when wavenumber intervals of 2.0 and 0.1 cm⁻¹, respectively, were used. Their work demonstrated that the original spectra can be represented and compressed by using the optimized WNN parameters, with the features of the IR spectra being well preserved.
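The response of a wavelet neuron, Eqs. (5) and (6), can be evaluated in a few lines. The Python sketch below is a hypothetical illustration of the forward pass for a single scalar input only; the training procedure of ref. [25] that optimizes w, a and b is not reproduced, and all parameter values shown are arbitrary.

import numpy as np

# Wavelet-neuron response of Eqs. (5)-(6): the sigmoid of a conventional
# neuron is replaced by the Morlet wavelet basis function with trainable
# dilation a and translation b.
def morlet(x):
    return np.cos(1.75 * x) * np.exp(-x ** 2 / 2.0)        # Eq. (6)

def wnn_output(x, w, a, b):
    # w, a, b: per-neuron weight, dilation and translation (1-D arrays)
    return np.sum(w * morlet((x - b) / a))                 # Eq. (5)

# Toy evaluation with 49 neurons, the network size quoted from ref. [25];
# in practice w, a and b would be optimized by training.
rng = np.random.default_rng(0)
w = rng.normal(size=49)
a = np.abs(rng.normal(loc=1.0, scale=0.2, size=49))
b = rng.uniform(400.0, 4000.0, size=49)                    # wavenumber range
transmittance = wnn_output(1500.0, w, a, b)                # response at 1500 cm^-1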
2.3 Standardization of IR spectra with wavelet transform

Analytical chemists often face the problem of comparing the performance of different analytical instruments. There is no simple rule to judge which one is better, because of the variations between the instrumental responses. In order to correct for this, a standardization approach is generally adopted. In practice, however, a calibration model developed on one instrument cannot be employed directly on another. Walczak et al. [28] suggested a new standardization method for comparing the performance of two near-infrared (NIR) spectrometers in the wavelet domain. In their proposed method, the NIR spectra from the two spectrometers are transformed to the wavelet domain at resolution level (J − 1). Suppose C^{NIR1} and C^{NIR2} correspond to the NIR spectra from Instruments 1 and 2, respectively, in the wavelet domain. A univariate linear model is applied to determine the transfer parameters t between C^{NIR1} and C^{NIR2}:

C_{J−1,n}^{NIR1} = t_n C_{J−1,n}^{NIR2}    (7)
where the index n runs over the coefficients in C^{NIR1} and C^{NIR2}. Once t has been deduced from the standardization process, any sample spectrum from Instrument 2 in the wavelet domain can be transferred to the corresponding spectrum as acquired on Instrument 1 by:

C_{J−1,n}^{NIR2,new} = t_n C_{J−1,n}^{NIR2}    (8)
Then, the standardized NIR spectrum can be obtained via inverse WT on C_{J−1,n}^{NIR2,new} for subsequent data analysis. The results [28] show that the proposed standardization method in the wavelet domain is superior to the traditional standardization methods.
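A minimal sketch of the coefficient-wise transfer of Eqs. (7) and (8) is given below; it assumes that the spectra have already been transformed to the wavelet domain and uses simulated coefficients in place of real NIR measurements, so it illustrates the algebra rather than the full procedure of ref. [28].

import numpy as np

# Fit one transfer factor per wavelet-domain coefficient from paired
# standardization spectra, then apply it to new spectra from Instrument 2.
# C1 and C2 stand for wavelet-domain representations of the same
# standardization samples measured on Instruments 1 and 2.
def fit_transfer(C1, C2):
    # Univariate least squares through the origin for each coefficient n:
    # C1[:, n] = t[n] * C2[:, n]  (Eq. 7), over all standardization samples.
    return np.sum(C1 * C2, axis=0) / np.sum(C2 * C2, axis=0)

def transfer(c2_new, t):
    return t * c2_new                          # Eq. (8)

rng = np.random.default_rng(1)
true_t = rng.uniform(0.8, 1.2, size=256)
C2 = rng.normal(size=(20, 256))                # 20 samples, 256 coefficients
C1 = true_t * C2                               # simulated instrument difference
t = fit_transfer(C1, C2)
assert np.allclose(t, true_t)
standardized = transfer(rng.normal(size=256), t)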
3
Applications of wavelet transform in ultraviolet-visible spectroscopy
Ultraviolet-visible spectroscopy is another technique that has been used extensively in analytical chemistry for characterization, identification and
quantitative analysis [29]. As compared with IR spectroscopy, only seven applications have been published which employ WT in UV-VIS spectroscopy [9,10]. They can be classified into three major areas, namely pattern recognition, data compression, and data denoising.
3.1 Pattern recognition with wavelet neural networks

Generally speaking, the term pattern recognition refers to the ability to assign an object to one of several possible categories according to the values of some measured parameters [30]. Classification of samples is one of the principal goals of pattern recognition, and can be achieved via unsupervised or supervised approaches [31]. Details of chemical applications of pattern recognition can be found in the literature [32,33]. As mentioned in the previous section, Liu and his co-workers adopted the WNN for IR spectral data compression. They also employed the WNN in their UV-VIS spectroscopic studies [34,35]. These authors tried to determine simultaneously the concentrations of a molybdenum-tungsten mixture and of amino acid mixtures from the corresponding highly overlapped UV-VIS spectra with the WNN. The architecture of the WNN as used in their study is shown in Fig. 4(c). Mathematically, the network can be expressed as:
y_{DWT} = Σ_{j=1}^{k} w_j Σ_{i=1}^{n} x_i F_{DWT}((x_i − b_j)/a_j)    (9)
where k denotes the number of wavelet bases used. The network parameters w_j, b_j and a_j are optimized so as to recognize an individual UV-VIS spectrum in the sample solution as well as possible. With proper training of the network, the performance of the WNN is better than that of the traditional back-propagation neural network. Besides, the WNN has a greater ability to identify minor differences between the UV-VIS spectra of individual components.
3.2 Compression of spectra with wavelet transform

Advances in microelectronics have greatly enhanced mass storage capacity and processing speed. Archiving full spectra, rather than only absorption peak data, has become more feasible. However, the demand for huge storage capacity is still somewhat prohibitive for high-resolution spectra. Even if this problem can be resolved, the computer processing speed and the bandwidth of the telephone line or network is still
a limiting factor in data transmission. To tackle this, spectral data compression techniques play a very important role. When a spectrum is compressed by a certain method, two objectives should be met. First, a high compression ratio should be achieved. Secondly, the reconstructed spectrum should have minimum distortion. The most commonly used compression technique in chemical studies is the Fourier transform and its variants. The advantages of FT are frequency localization, orthogonality, and the availability of fast numerical algorithms. Recently, researchers have proposed making use of WT for UV-VIS spectral data compression [17,24,36,37]. The compression process involves transformation of a spectrum to the wavelet domain. Then, a thresholding criterion is employed to select suitable coefficients in the wavelet domain for storage. The spectrum can be restored to its original form via an inverse WT treatment of the selected (or compressed) data. Chau and his co-workers have proposed some wavelet-based methods to compress UV-VIS spectra [24,37]. In their work, a UV-VIS spectrum was processed with the Daubechies wavelet function D16. Then, all the C_j elements and selected D_j coefficients at different resolution levels j were stored as the compressed spectral data. A hard-thresholding method was adopted for the selection of coefficients from D_j. A compression ratio of up to 83% was achieved. As mentioned in the previous section, the choice of mother wavelets in WT is vast, so one can select the best wavelet function for each application. However, most workers restrict their choices to the orthogonal wavelet bases such as Daubechies' wavelets. Chau et al. chose the biorthogonal wavelet for UV-VIS spectral data compression in another study [37]. Unlike the orthogonal case, which needs only one mother wavelet ψ(t), the biorthogonal one requires two mother wavelets, ψ(t) and ψ̃(t), which satisfy the following biorthogonality property [38]:

∫ ψ_{j,k}(t) ψ̃_{l,m}(t) dt = 1 if j = l and k = m, and 0 otherwise.    (10)
In the biorthogonal WT method, two sets of low-pass filters, l_k and l̃_k, and high-pass filters, h_k and h̃_k, are adopted for the signal decomposition and reconstruction stages, respectively. After applying the biorthogonal WT to a UV-VIS spectrum, a set of scale and wavelet coefficients C_J and D_J, D_{J−1}, ..., D_1 is obtained, usually expressed in floating-point representation. In order to enhance the storage efficiency, we have introduced two extra algorithms, namely the optimal bit allocation (OBA) algorithm [38] and the variable length coding (VLC) algorithm [39], for data compression. As
compared with our previous work [24], the proposed biorthogonal WT method gives better performance.
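The workflow described above (forward transform, hard thresholding of the wavelet coefficients, inverse transform) can be sketched as follows. The example assumes the PyWavelets package and uses its 'db8' filter, which is the 16-tap Daubechies filter referred to as D16 in the text; the 17% retention figure, chosen to mimic the 83% compression ratio quoted above, and all function names are ours.

import numpy as np
import pywt  # assumes the PyWavelets package is installed

# Wavelet compression by hard thresholding, in the spirit of refs.
# [24,37]: keep all scale coefficients, discard the smallest wavelet
# coefficients, and reconstruct by the inverse transform.
def compress(spectrum, keep=0.17, wavelet="db8", level=4):
    coeffs = pywt.wavedec(spectrum, wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    flat = np.concatenate([np.abs(d) for d in details])
    cutoff = np.quantile(flat, 1.0 - keep)          # hard threshold
    details = [pywt.threshold(d, cutoff, mode="hard") for d in details]
    return [approx] + details

def reconstruct(coeffs, wavelet="db8"):
    return pywt.waverec(coeffs, wavelet)

spectrum = np.cos(np.linspace(0, 8 * np.pi, 1024)) * np.exp(-np.linspace(0, 3, 1024))
restored = reconstruct(compress(spectrum))[: len(spectrum)]
print("relative error:", np.linalg.norm(restored - spectrum) / np.linalg.norm(spectrum))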
3.3 Denoising of spectra with wavelet transform

One of the main goals in analytical chemistry is to extract useful information from recorded data. However, the achievement of this goal is usually complicated by the presence of noise. In the past decades, a large number of digital filters have been developed in different fields of science and technology for the reduction of noise. In spite of the existence of diverse filters, only a few, such as the Savitzky-Golay, Fourier, and Kalman filters [11,12], have been used extensively by chemists. Recently, WT has been identified as an effective method for denoising chemical data [1,40-42]. In UV-VIS spectroscopy, only two publications have been found which adopt WT as a denoising technique [43,44]. As stated in the previous section, most workers confine their wavelet functions to the Daubechies wavelet series. For example, we have adopted the Daubechies wavelet function to denoise spectral data from a UV-VIS spectrophotometer [43]. In order to make use of the other available wavelet functions for chemical data analysis, Lu and Mo [44] suggested employing spline wavelets in their work for denoising UV-VIS spectra. The spline wavelet is another commonly used wavelet function in chemical studies. This function has been applied successfully in processing electrochemical signals [9,10], which will be discussed in detail in another chapter of this book. The mth-order basis spline (B-spline) wavelet, N_m, is defined as follows [44]:

N_m(t) = N_{m−1}(t) * N_1(t) = ∫_0^1 N_{m−1}(t − x) dx,   m ≥ 2    (11)
The symbol * denotes a convolution operation between N_{m−1} and N_1. The kth term of N_m is given by

N_m(t_k) = Σ_{n=0}^{k} N_{m−1}(t_n) N_1(t_{k−n})    (12)

The result is equivalent to summing the products of the coefficients of N_{m−1} and N_1 in a shifted manner [13]. The mother wavelet function ψ(t) may be expressed as
ψ(t) = Σ_{n=0}^{3m−2} q_n N_m(2t − n)    (13)

with

q_n = ((−1)^n / 2^{m−1}) Σ_{j=0}^{m} C(m, j) N_{2m}(n − j + 1),   n = 0, 1, 2, ..., 3m − 2    (14)

where C(m, j) denotes the binomial coefficient.
In spline wavelet computation, two parameters need to be optimized, namely the order of the B-spline, m, and the truncation frequency, L, which represents the cut-off frequency between the true signal and the noise. In Lu and Mo's study [44], the best result for denoising UV-VIS spectra with a high noise level was obtained with m = 3 and L = 4. Zhao and Wang [45] proposed a technique called the wavelet transform K-factor three-wavelength method to determine simultaneously the concentrations of vanadium, molybdenum and titanium by UV-VIS spectroscopy. In their study, WT was adopted to denoise the acquired spectra. The concentrations of the individual ions were then determined from the UV-VIS spectra at three selected wavelengths.
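For completeness, a denoising sketch is given below. Note that it substitutes an ordinary Daubechies filter and the universal hard threshold for the B-spline wavelet scheme of Lu and Mo [44], so it illustrates the general workflow rather than that specific method; PyWavelets is assumed to be available.

import numpy as np
import pywt  # assumes the PyWavelets package is installed

def denoise(signal, wavelet="db4", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # noise estimate (MAD)
    thresh = sigma * np.sqrt(2.0 * np.log(len(signal)))   # universal threshold
    coeffs[1:] = [pywt.threshold(d, thresh, mode="hard") for d in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

x = np.linspace(0, 1, 1024)
clean = np.exp(-((x - 0.4) / 0.05) ** 2) + 0.6 * np.exp(-((x - 0.7) / 0.03) ** 2)
noisy = clean + 0.05 * np.random.default_rng(2).normal(size=x.size)
print("input SNR :", np.linalg.norm(clean) / np.linalg.norm(noisy - clean))
print("output SNR:", np.linalg.norm(clean) / np.linalg.norm(denoise(noisy) - clean))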
4
Application of wavelet transform in mass spectrometry
In mass spectrometric studies, WT has been applied mainly in two areas: secondary ion mass spectrometry (SIMS) and instrumentation design. SIMS is a surface technique for trace analysis, for determination of elemental composition, and for establishing the identity and concentrations of adsorbed species and the elemental composition as a function of depth [46]. The application of wavelet denoising techniques to SIMS images has been studied by Grasserbauer et al. [47-50], and details of these studies are presented in another chapter of this book. With regard to instrumentation design, WT has been applied to process real-time signals from the mass spectrometer. Shew [51] invented a new procedure for determining the relative ion abundances in ion cyclotron resonance mass spectrometry, by utilizing WT to isolate the intensity of a particular ion frequency as a function of position or time within the transient ion cyclotron resonance signal. In 1995, this new method was patented in the U.S. Shew explained that the WT intensity corresponding to the frequency of each ion species as a function of time can be fitted by an exponential decay curve. By
extrapolating these curves back in time to the end of the excitation phase, accurate values of the relative abundances of different ions within a sample can be deduced. An ion cyclotron resonance mass spectrometer with a Haar wavelet analysis module was thus set up. The results of Shew's work indicated that WT can provide highly efficient isolation of the individual frequencies in the received signal corresponding to individual species. In another research study, Rying et al. [52] demonstrated the use of WT for automated run-to-run detection of transient events, such as equipment faults, and for automated extraction of features from time-dependent signals from a quadrupole mass spectrometer. These authors employed wavelet analysis techniques to model MS signals for detection and control of run-to-run variability in a semiconductor process. Also, WT was utilized to transform the carrier gas (Ar⁺) and reaction by-product (H⁺) signals into the time-scale space for feature extraction and for statistical discrimination between nominal and induced-fault process runs.
5

Application of wavelet transform in nuclear magnetic resonance spectroscopy

Nuclear magnetic resonance (NMR) spectroscopy is one of the most powerful non-destructive techniques available for probing the structure of matter. So far only a few publications have related to the application of WT in NMR spectroscopy. In 1989, Guillemain et al. [53] became the first research group to do so. They aimed at investigating how an appropriate use of WT could lead to an excellent estimation of the frequency of spectral lines in a signal and provide direct information on the time-domain features of these lines in NMR spectra. These authors reported seven applications of WT in NMR spectroscopy: these included the estimation of frequency- and amplitude-modulation laws in both simple and general cases, spectral line subtraction and re-synthesis, ridge extraction, addition of two sine waves, and of three exponentially decreasing sine waves. Recently, Neue [54] published another paper on an application of WT in dynamic NMR spectroscopy which could simplify the analysis of the free induction decay (FID) signal. Dynamic NMR spectroscopy is a technique used to measure rate parameters for a molecule [55]. The measured resonance frequencies represent the spatial coordinates of spins. Any motion, such as bond rotation and other molecular gymnastics, may change these frequencies as a function of time. The localization property of WT gives a better picture
of the nature of the underlying dynamic process in both the frequency and time domains. Third-order Battle-Lemarié wavelets were employed for crystal rotation and first-order kinetics with NMR spectroscopy in that study, and the author concluded that WT will become a routine method for data analysis in NMR spectroscopy. A similar approach can also be found in a reference book by Hoch and Stern [56], who introduced WT as a new data-processing technique for smoothing NMR data. Recently, Barache et al. [57] proposed the adoption of the continuous wavelet transform (CWT) for removing a large spectral line and re-phasing an NMR signal influenced by eddy currents. In the NMR spectra of polymers or proteins, a large spectral line is always observed which masks some important small lines: CWT was employed in this situation to subtract this large component from the others. The authors also mentioned the application of CWT in pulsed magnetic-field-gradient NMR spectroscopy. Pulsed magnetic-field-gradient NMR is a standard technique for studying both diffusive and coherent molecular motions [57]. When there is a large change in the gradient amplitude, mechanical vibrations and/or eddy currents will be induced both in the probe and in the magnet. These effects introduce large errors into measured diffusion coefficients. CWT can remove the distortions caused by gradient switching, and simplify the analysis procedure.
6
Application of wavelet transform in photoacoustic spectroscopy
Photoacoustic (PA) spectroscopy is a combination of optical spectroscopy and calorimetry [58]. It is a technique for studying those materials that are unsuitable for the conventional transmission or reflection methodologies. It can be used to measure thermal and elastic properties of materials, to study chemical reactions, to measure the thickness of layers and thin films, and to perform a variety of other non-spectroscopic investigations. This technique can be applied to different types of inorganic, organic and biological materials in the gas-, liquid-, or solid phase. Nowadays, PA spectroscopy is mainly employed for material characterization [59]. Compared with other spectroscopic techniques, PA spectroscopy provides a non-destructive analysis and does not require any sample preparation. Recently, some researchers have tried to employ WT in processing signals from PA spectroscopy, and satisfactory results were obtained [60,61]. The major role of WT in PA spectroscopy is to filter simultaneously the noise and
the baseline in PA spectra. The proposed method was applied to analyse the PA spectra of degraded poly(vinyl chloride) (PVC), which is one of the most important commercial polymers. PA spectroscopy encounters some difficulties in characterizing the numbers of polyene sequences in degraded PVC [60]. In the initial stage of degradation, which is a very important process in the PVC industry, the absorption in the visible range is very weak [62]. As a result, the PA spectra of initially degraded PVC are often disturbed by a high level of noise. The irregular baseline also introduces difficulties in determining the PA absorption bands. In this situation, the zoom-in and zoom-out capabilities of WT can help to extract both the noise and the baseline from the PA spectrum. After processing the PA spectrum with WT at a particular resolution level, j, the scale coefficients, C_j, represent the baseline while the wavelet coefficients {D_j, D_{j+1}, ..., D_{J−1}} represent the noise.
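A minimal sketch of this separation is given below. The assignment of the coarsest approximation to the baseline and of the finest detail level to the noise is an illustrative simplification of the level assignment described above; PyWavelets is assumed, and the decomposition depth would in practice be tuned to the spectrum at hand.

import numpy as np
import pywt  # assumes the PyWavelets package is installed

# After a multilevel DWT, reconstruct the baseline from the scale
# coefficients alone and a noise estimate from the finest detail level;
# the remainder carries the absorption band structure.
def split_pa_spectrum(spectrum, wavelet="db4", level=5):
    coeffs = pywt.wavedec(spectrum, wavelet, level=level)
    n = len(spectrum)
    # Baseline: keep only the scale coefficients, zero all details.
    base = [coeffs[0]] + [np.zeros_like(d) for d in coeffs[1:]]
    baseline = pywt.waverec(base, wavelet)[:n]
    # Noise estimate: keep only the finest detail level.
    fine = [np.zeros_like(c) for c in coeffs]
    fine[-1] = coeffs[-1]
    noise = pywt.waverec(fine, wavelet)[:n]
    bands = spectrum - baseline - noise
    return baseline, noise, bands

baseline, noise, bands = split_pa_spectrum(np.random.rand(512))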
7
Conclusion
In conclusion, WT has been applied successfully for data processing in various fields of spectroscopic studies. We expect that more WT-related applications in spectroscopy will be developed in the near future as more spectroscopists become aware of the unusual properties of wavelets. Raman spectroscopy, electronic spectroscopy, rotational spectroscopy, and vibrational spectroscopy could be new areas to explore, as no application of WT has yet been reported in these areas.
8
Acknowledgement
This work was supported by the Research Grants Council (RGC) of the Hong Kong Special Administrative Region (Grant No. HKP 45/94E) and the Research Committee of The Hong Kong Polytechnic University (Grant No. A020).
References
1. V.J. Barclay, R.F. Bonner and I.P. Hamilton, Application of Wavelet Transforms to Experimental Spectra: Smoothing, Denoising, and Data Set Compression, Analytical Chemistry, 69 (1997), 78-90.
2. P.D. Willson and T.H. Edwards, Sampling and Smoothing of Spectra, Applied Spectroscopy Reviews, 12 (1976), 1-81.
3. K.M. Leung and F.T. Chau, A Review on Signal Compression of Spectroscopic Data in Analytical Chemistry, Acta Physico-Chimica Sinica, 13 (1997), 857-864.
4. W.A. Warr, Computer-assisted Structure Elucidation: Library Search and Spectral Data Collections, Analytical Chemistry, 65 (1993), 1045A-1050A.
5. L. Glasser, Fourier Transforms for Chemists. Part I. Introduction to the Fourier Transform, Journal of Chemical Education, 64 (1987), A228-A233.
6. L. Glasser, Fourier Transforms for Chemists. Part II. Fourier Transforms in Chemistry and Spectroscopy, Journal of Chemical Education, 64 (1987), A260-A266.
7. L. Glasser, Fourier Transforms for Chemists. Part III. Fourier Transforms in Data Treatment, Journal of Chemical Education, 64 (1987), A306-A313.
8. I. Daubechies, Orthonormal Bases of Compactly Supported Wavelets, Communications on Pure and Applied Mathematics, 41 (1988), 909-996.
9. A.K.M. Leung, F.T. Chau and J.B. Gao, A Review on Applications of Wavelet Transform Techniques in Chemical Analysis: 1989-1997, Chemometrics and Intelligent Laboratory Systems, 43 (1998), 165-184.
10. A.K.M. Leung, Wavelet Transform in Chemistry, http://fg702-6.abct.polyu.edu.hk/~kmleung/wavelet.html (accessed January 1999).
11. S. Brown, T.B. Blank, S.T. Sum and L.G. Weyer, Chemometrics, Analytical Chemistry, 66 (1994), 315R-359R.
12. S. Brown, S.T. Sum and F. Despagne, Chemometrics, Analytical Chemistry, 68 (1996), 21R-62R.
13. B.B. Hubbard, The World According to Wavelets: The Story of a Mathematical Technique in the Making, A K Peters, Wellesley, MA, (1996).
14. C.K. Chui, An Introduction to Wavelets, Academic Press, Boston, MA, (1992).
15. I. Daubechies, Ten Lectures on Wavelets, SIAM Press, Philadelphia, PA, (1992).
16. B.K. Alsberg, A.M. Woodward and D.B. Kell, An Introduction to Wavelet Transforms for Chemometricians: A Time-frequency Approach, Chemometrics and Intelligent Laboratory Systems, 37 (1997), 215-239.
17. B. Walczak and D.L. Massart, Noise Suppression and Signal Compression using the Wavelet Packet Transform, Chemometrics and Intelligent Laboratory Systems, 36 (1997), 81-94.
18. B. Walczak and D.L. Massart, Wavelets: Something for Analytical Chemistry, Trends in Analytical Chemistry, 16 (1997), 451-463.
19. S.G. Mallat, Multiresolution Approximation and Wavelets, Transactions of the American Mathematical Society, 315 (1989), 69-88.
20. F.R. Verdun, C. Giancaspro and A.G. Marshall, Effects of Noise, Time-domain Damping, Zero-filling and the FFT Algorithm on the "Exact" Interpolation of Fast Fourier Transform Spectra, Applied Spectroscopy, 42 (1988), 715-721.
21. N. Morrison, Introduction to Fourier Analysis, Wiley, New York, (1994), p. 388.
22. A.K.M. Leung, F.T. Chau, J.B. Gao and T.M. Shih, Application of Wavelet Transform in Infrared Spectrometry: Spectral Compression and Library Search, Chemometrics and Intelligent Laboratory Systems, 43 (1998), 69-88.
23. J.W. Hayes, D.E. Glover, D.E. Smith and M.W. Overton, Some Observations on Digital Smoothing of Electroanalytical Data Based on the Fourier Transformation, Analytical Chemistry, 45 (1973), 277-284.
24. F.T. Chau, T.M. Shih, J.B. Gao and C.K. Chan, Application of the Fast Wavelet Transform Method to Compress Ultraviolet-Visible Spectra, Applied Spectroscopy, 50 (1996), 339-349.
25. W. Liu, J.P. Li, J.H. Xiong, Z.X. Pan and M.S. Zhang, The Compression of IR Spectra by Using Wavelet Neural Network, Chinese Science Bulletin (English Edition), 42 (10) (1997), 822-825.
26. J. Zupan and J. Gasteiger, Neural Networks: A New Method for Solving Chemical Problems or Just a Passing Phase?, Analytica Chimica Acta, 248 (1991), 1-30.
27. J. Zupan and J. Gasteiger, Neural Networks for Chemists, VCH, Weinheim, (1993).
28. B. Walczak, E. Bouveresse and D.L. Massart, Standardization of Near-infrared Spectra in the Wavelet Domain, Chemometrics and Intelligent Laboratory Systems, 36 (1997), 41-51.
29. H.H. Perkampus, UV-VIS Spectroscopy and Its Applications, Springer-Verlag, Berlin, (1992).
30. M.J. Adams, Chemometrics in Analytical Spectroscopy, The Royal Society of Chemistry, Cambridge, (1995), pp. 92-154.
31. R.G. Brereton, Chemometrics in Analytical Chemistry: a Review, Analyst, 112 (1987), 1635-1657.
32. D.D. Wolff and M.I.L. Parsons, Pattern Recognition Approach to Data Interpretation, Plenum, New York, (1983).
33. O. Strouf, Chemical Pattern Recognition, Research Studies Press, Letchworth, (1986).
34. W. Liu, Y.M. Wang, Z.X. Pan, W.L. Zhou and M.S. Zhang, Simultaneous Determination of Molybdenum and Tungsten using Wavelet Neural Network, Chinese Journal of Analytical Chemistry, 25 (1997), 1189-1191 (in Chinese).
35. W. Liu, J.H. Xiong, H. Wang, Y.M. Wang, Z.X. Pan and M.S. Zhang, The Recognition of UV Spectra by Using Wavelet Neural Network, Chemical Journal of Chinese Universities, 18 (1997), 860-863 (in Chinese).
36. F.T. Chau, J.B. Gao, T.M. Shih and J. Wang, Infrared Spectral Compression Procedure using the Fast Wavelet Transform Method, Applied Spectroscopy, 51 (1997), 649-659.
37. H.L. Ho, W.K. Cham, F.T. Chau and J.Y. Wu, Application of Biorthogonal Wavelet Transform to the Compression of Ultraviolet-Visible Spectra, Computers & Chemistry, 23 (1999), 85-96.
38. M. Antonini, M. Barlaud, P. Mathieu and I. Daubechies, Image Coding using Wavelet Transform, IEEE Transactions on Image Processing, 1 (1992), 205-218.
39. A. Fournier, SIGGRAPH '94 Course Notes, SIGGRAPH '94, July 24-28, Orlando, FL, Association for Computing Machinery, New York, (1994).
40. C.R. Mittermayr, S.G. Nikolov, H. Hutter and M. Grasserbauer, Wavelet Denoising of Gaussian Peaks: a Comparative Study, Chemometrics and Intelligent Laboratory Systems, 34 (1996), 187-202.
41. C.R. Mittermayr, E. Rosenberg and M. Grasserbauer, Detection and Estimation of Heteroscedastic Noise by Means of the Wavelet Transform, Analytical Communications, 34 (1997), 73-78.
42. S.N. Qian and H. Sun, Data-compression Method Based on Wavelet Transformation for Spectral Information, Spectroscopy and Spectral Analysis (Beijing), 16 (1996), 1-8.
43. J.B. Gao, F.T. Chau and T.M. Shih, Wavelet Transform Method for Denoising Spectral Data from UV-VIS Spectrophotometer, Southeast Asian Bulletin of Mathematics, 20 (1996), 85-90.
44. X.Q. Lu and J.Y. Mo, Spline Wavelet Multi-resolution Analysis for High-noise Digital Signal Processing in Ultraviolet-Visible Spectrophotometry, Analyst, 121 (1996), 1019-1024.
45. K. Zhao and Z.H. Wang, Simultaneous Determination of Vanadium, Molybdenum and Titanium by Wavelet Transform K-factor Three-Wavelength Method, Chinese Journal of Analytical Chemistry, 26 (1998), 620 (in Chinese).
46. H.A. Strobel and W.R. Heineman, Chemical Instrumentation: A Systematic Approach, 3rd Ed., John Wiley & Sons, Inc., New York, (1989), pp. 824-829.
47. H. Hutter, C. Brunner, S.G. Nikolov, C. Mittermayer and M. Grasserbauer, Imaging Surface Spectroscopy for Two- and Three-Dimensional Characterization of Materials, Fresenius' Journal of Analytical Chemistry, 355 (1996), 585-590.
48. S.G. Nikolov, H. Hutter and M. Grasserbauer, De-noising of SIMS Images via Wavelet Shrinkage, Chemometrics and Intelligent Laboratory Systems, 34 (1996), 263-273.
49. M. Wolkenstein, H. Hutter and M. Grasserbauer, Wavelet Filtering for Analytical Data, Fresenius' Journal of Analytical Chemistry, 358 (1997), 165-169.
50. M. Wolkenstein, H. Hutter, S.G. Nikolov and M. Grasserbauer, Improvement of SIMS Image Classification by Means of Wavelet Denoising, Fresenius' Journal of Analytical Chemistry, 357 (1997), 783-788.
51. S.L. Shew, Method and Apparatus for Determining Relative Ion Abundances in Mass Spectrometry Utilizing Wavelet Transforms, US Patent 5,436,477, July 25, Government Printing Office, Washington, DC, (1995).
52. E.A. Rying, R.S. Gyurcsik, J.C. Lu, G. Bilbro, G. Parsons and S.Y. Sorrell, Wavelet Analysis of Mass Spectrometry Signals for Transient Event Detection and Run-to-run Process Control, in Process Control, Diagnostics, and Modeling in Semiconductor Manufacturing, Electrochemical Society Proceedings, Vol. 97-9, Montreal, Canada, May 1997 (M. Meyyappan, D.J. Economou and S.W. Butler, Eds), The Electrochemical Society, Inc., New Jersey, (1997), pp. 37-44.
53. P. Guillemain, R. Kronland-Martinet and B. Martens, Estimation of Spectral Lines with the Help of the Wavelet Transform: Application in NMR Spectroscopy, in Wavelets and Applications, Proceedings of the Second International Conference on Wavelets and Their Applications, Marseilles, France, May 1989 (Y. Meyer, Ed.), Springer-Verlag, Paris, (1992), pp. 38-60.
54. G. Neue, Simplification of Dynamic NMR Spectroscopy by Wavelet Transform, Solid State Nuclear Magnetic Resonance, 5 (1996), 305-314.
55. W. Kemp, NMR in Chemistry: A Multinuclear Introduction, Macmillan, London, (1986), pp. 158-168.
56. J.C. Hoch and A.S. Stern, NMR Data Processing, Wiley-Liss, Inc., New York, (1996), pp. 144-151.
57. D. Barache, J.P. Antoine and J.M. Dereppe, The Continuous Wavelet Transform, an Analysis Tool for NMR Spectroscopy, Journal of Magnetic Resonance, 128 (1997), 1-11.
58. A. Rosencwaig, Photoacoustics and Photoacoustic Spectroscopy, Wiley, New York, (1980), pp. 1-6.
59. M.L. McKelvy, T.R. Britt, B.L. Davis, J.K. Gillie, L.A. Lentz, A. Leugers, R.A. Nyquist and C.L. Putzig, Infrared Spectroscopy, Analytical Chemistry, 68 (1996), 93R-160R.
60. J.J. Mao, Q.D. Su and M.S. Zhang, Wavelet Analysis Applied in Photoacoustic Spectroscopy, in New Trends in Chemometrics, First International Conference on Chemometrics in China, Zhangjiajie, P.R. China, 17-22 October 1997 (Y.Z. Liang, R. Nortvedt, O.M. Kvalheim, H.L. Shen and Q.S. Xu, Eds), Hunan University Press, Changsha, P.R. China, (1997), pp. 197-198.
61. J.J. Mao, P.Y. Sun, Z.X. Pan, Q.D. Su and M.S. Zhang, Wavelet Analysis on Photoacoustic Spectra of Degraded PVC, Fresenius' Journal of Analytical Chemistry, 361 (1998), 140-142.
62. A.A. Yassin and M.W. Sabaa, Degradation and Stabilization of Poly(vinyl chloride), Macromolecular Chemistry and Physics, C30 (1990), 491-558.
CHAPTER 12 Applications of Wavelet Analysis to Physical Chemistry Heshel Teitelbaum Department of Chemistry, University of Ottawa, Ottawa, Ont., Canada K1N 6N5
1

Introduction

The four cornerstones of physical chemistry are quantum mechanics, statistical mechanics, thermodynamics, and kinetics. Within these divisions we can recognize distinct subdivisions. Quantum mechanics, for example, can be divided into sections dealing with the solution of the Schrödinger equation: namely, aspects which deal with the wavefunction and aspects which deal with the associated energy. The former gives us information about the atomic, molecular, and crystalline structure, whereas the latter gives us information about energy levels. Transitions among energy levels, of course, comprise the subject of spectroscopy, treated in its fundamental sense as a probe of the inner structure of molecules (in distinction from the applied or analytical sense, which gives us qualitative and quantitative information about the substance, and which is treated elsewhere in this volume). Thermodynamics describes the equilibrium properties of substances such as the bulk structure of solids, liquids and gases. Finally, for the purposes of this overview we consider chemical kinetics as one section of the broader field of "change", the other section being chemical dynamics.

Physical chemistry is among the youngest of the scientific disciplines to which wavelet analysis has been applied. Thermodynamics and statistical mechanics, for example, are not yet represented. As such, only a portion of physical chemistry is actually addressed. However, common to all of the examples which we do have is an "observed" signal or image which is a complicated function of either time, frequency or space. The use of wavelet analysis is geared to extracting patterned information buried in that signal. This can take the form of deconvolution, signal compression, signal denoising, or simulation by solving the differential equation which governs the observed phenomenon.
2
Quantum mechanics
2.1 Molecular structure
The distribution of electron density in space is a simple means of visualizing the structure of molecules. One needs to calculate the probability distribution, p_i(τ), using the wavefunction Ψ_i:

p_i(τ) = ∫ Ψ_i* Ψ_i dτ

where dτ is the generalized volume element, and Ψ_i is the solution of the time-independent Schrödinger equation

H Ψ_i = E_i Ψ_i

wherein H is the Hamiltonian operator of the molecular system, and E_i is the corresponding eigenvalue or energy of the ith stationary state. The Hamiltonian is a combination of the Laplace and potential operators. The potential energy operator accounts for the interactions between all electrons and nuclei. Often the nuclear motion can be ignored (Born-Oppenheimer approximation); however, one is still left with a pairwise-additive multi-electron interaction described by three co-ordinates for each electron. The resulting partial differential equation can be solved only by approximate numerical means for almost all problems of practical interest in chemistry, because of the difficulty of numerically integrating multi-electron terms with inseparable co-ordinates [1]. One of the ways of simplifying the task is to recognize that the set of momentum co-ordinates, p, is conjugate to the set of position co-ordinates, r. Representing the Schrödinger equation in terms of momentum co-ordinates transforms the problematical terms into one-centre terms, leaving the Hamiltonian invariant. Apart from this advantage, one also obtains a different visualization of the molecule in terms of electronic momenta, which would otherwise be averaged out in the usual representation. This presents an interesting approach which is amenable to analysis by wavelet transforms. This feature was first demonstrated by Fischer and Defranceschi [2,3]. For pedagogical purposes they chose, in their first study, to represent the typical molecular wavefunction, Ψ_i, as a linear combination of atomic orbitals, Φ_i (normally expressed in position space), weighted by coefficients c_i; they then expressed it in momentum space, from which they were able to derive Fourier transforms and compare the results with wavelet transforms. For example, the simplest basis function, i.e. the 1s Gaussian-type orbital,
Φ(x) = (2/π)^{1/4} e^{−x²}

is Fourier-transformed in momentum space to

Φ̃(p) = (1/√(2π)) ∫ Φ(x) e^{−ixp} dx = (2π)^{−1/4} e^{−p²/4}
where x and p are the conjugate position and momentum co-ordinates, respectively. It is this property of the Fourier transform which gives rise to the similarity in forms. Because of the inverse relation between x- and p-spaces, the two representations are complementary [4]. One of the questions one would like to answer for atoms is at what position in space the momentum changes significantly. Unfortunately, Fourier transforms cannot give simultaneous information on position and momentum distributions. Physically this is because the momentum operator is basically a derivative, making the Fourier transform small when x is large, and conversely the momentum density large near the atomic nucleus. However, the wavelet transform permits viewing both aspects simultaneously, and at the same time it can avoid the spurious non-physical oscillations in Ψ which sometimes result from standard quantum chemistry programs [5]. Wavelets are a set of basis functions that are alternatives to the complex exponential functions of Fourier transforms which appear naturally in the momentum-space representation of quantum mechanics. Pure Fourier transforms suffer from the infinite scale applicable to sine and cosine functions. A desirable transform would allow for localization (within the bounds of the Heisenberg Uncertainty Principle). A common way to localize is to left-multiply the complex exponential function by a translatable Gaussian "window", in order to obtain a better transform. However, this is not suitable when the function varies rapidly. Therefore, an even better way is to multiply by a normalized translatable and dilatable window, ψ_{a,b}(x) = a^{−1/2} ψ([x − b]/a), called the analysing function, where b is related to position and 1/a is related to the complex momentum; ψ(x) is the continuous wavelet mother function.¹
¹ The reader should note a possible source of confusion. The traditional symbols for quantum mechanical wavefunctions and their component basis sets are the same as those for the wavelet and scaling functions used in wavelet analysis. In wavelet analysis one multiplies the function of interest, here the wavefunction, by the wavelet and scaling basis sets. In order to avoid confusion, we choose upper-case Ψ and Φ for quantum mechanical wavefunctions and lower-case ψ and φ for wavelet and scaling functions.
The transform itself is now

F_Φ(a,b) = ∫ Φ(x) ψ_{a,b}(x)* dx
~/a,b(X) --
2x
eX2/2
In that case the continuous wavelet transform becomes
(2) 1/4( 2a )3/2 Fo(a,b) - 2b
1 + 2a 2
e-b2/(l+2a):
which displays simultaneously both the momentum and position dependencies. This kind of visualization is useful, for example, when interpreting experimental electron momentum densities and spectroscopy [8,9]. (Note that Fo(a,b) will play, below, the role of a set of coefficients of an orthonormal basis set, which we shall see in an application of discrete wavelet analysis.) Further developments [3] lead naturally to improved solutions of the Schr6dinger equation, at least at the Hartree-Fock limit (which approximates the multi-electron problem as a one-electron problem where each electron experiences an average potential due to the presence of the other electrons.) The authors apply a continuous wavelet mother, qJ(x), to both sides of the Hartree-Fock equation, integrate and iteratively solve for the transform rather than for the wavefunction itself. In an application to the hydrogen atom, they demonstrate that this novel approach can lead to the correct solution within one iteration. For example, when one separates out the radial (one-dimensional)component of the wavefunction, the HartreeFock approximation as applied to the hydrogen atom's doubly occupied orbitals is, in spherical coordinates, 1 d20(x) 2
dx 2
O(x) Ix
=
O(x)
where ~ is the eigenvalue and 9 is the eigenfunction of interest. The transformed equation becomes
267
2 ~2 (q/a'b(X)dxx ) f ~ ( x ) li -- J" d 2dx
q/a'b(X)dx -- 8 f (I)(x)q/a.b(X)dx
The term on the right-hand side is obviously eF,; while the two terms on the left-hand side, can also each be written in terms of F,. As a trial function Fischer and Defranceschi used the following scaled and shifted mother wavefunction:
~/a.b(X)_ _~/ a3 2v/-~(x - b)
e
i,~-bt 2 2a2
in which case the transformed integro-differential equation becomes 3 ) -~
Fo(a,b) -
1 ~Fo(a,b) 2a aa b 2 p2
2 ~j(
+ --~ ~-, v, rCeD
0
dp
/
d[3
efr~-p
F,
a
,
[3
0
Now, instead of solving for the unknown pair, e and ~, one solves for the pair, e and F , . The authors tested the method on the 2p orbital of the hydrogen atom. Using the known eigenvalue of the trial Gaussian type function, O(x) -
xe -x2
they determined the corresponding first analytical approximation to F , . This was substituted into the right-hand side of the transformed Hartree-Fock equation to determine the next approximation to F , . Although, in principle, one could go on to determine a better value for e and then obtain a better F , , the authors stopped at the first iteration, since it already gave a result which was very close to the correct transform of the true solution, 21/2xe-I"l. Of course, the solution of the hydrogen atom problem is known. However, the implication is that the method will work also for more difficult problems with unknown solutions. In yet another development, rather than using continuous wavelet transforms, the same authors investigated the use of orthonormal wavelets in conjunction with the BCR algorithm in order to develop a Fast Wavelet
268
Transform useful for representing the H a r t r e e - F o c k operator for solving large chemical systems [3,10,11]. They chose the discrete wavelet basis sets described by Daubechies [12]"
L-I q/J'k(x) -- Z glq)J-1.2k+l(x)' 1=0
J -- 1. . . . . n
L-1 q)J'k(x) -- Z hlq)J -1.2k+l(x)' 1=0
j -- 1. . . . . n
where L is a limited number of coefficients. The actual coefficients, g and h, are related by gk -- (--1)khL-k-1,
k - 0 , . . . ,L - 1
with hk -- (q), q)-l,k). The value of j defines the coarseness of the scale. The wavelet transforms are
L-1 (I)(x)*j'k(X)dx - Z glCj-l'2k+l 1=0
dj'k -j q,k
m
L-1 (I)(x)q)J.k(X)dx -- Z hlCj-l.2k+l l=O
With an initial set of coefficients, C0.k, these recursion formulae lead to a complete set of coefficients. For hydrogen-like atoms the Schr6dinger equation is reduced to a one-dimensional eigenvalue problem in terms of the radial co-ordinate, as above. Starting with an initial guess to the wavefunction, (I)(x), the solution is obtained by iterative approximations involving the evaluation of the H a r t r e e - F o c k operator operating on ~(x), as above, and generating a better ~(x). What used to be a continuous function, F , , above, is now treated as a set of discrete coefficients, dj.k. When the H a r t r e e - F o c k operator is expressed (in non-standard form) as a sparse matrix, and it is applied to the set of transforms, dj.k and q.k, treated as a vector, the result is a set of transformed coefficients which can be used to calculate the system energy. It is found that, for several trial wavefunctions, (although the expectation values of the potential and kinetic energies are poor) the total energies are within 3% of the theoretical values. It is also shown that greater accuracy results from smaller discretization intervals, i.e. from a larger number of scales (at the expense of making the matrix denser and the calculation more costly). Furthermore, too many iterations lead to a spurious eigenvalue, which can be traced back to the influence of the Coulombic
269 singularity transforming into an undesireable pseudo potential. However, with care the method seems to hold promise for solving structures of complex molecules. In summary, the expected success of wavelet transforms for solving electronic structure problems in quantum mechanics are due to three important properties: (a) the ability to choose a basis set providing good resolution where it is needed, in those cases where the potential energy varies rapidly in some regions of space, and less in others; (b) economical matrix calculations due to their sparse and banded nature; and (c) the ability to use orthonormal wavelets, thus simplifying the eigenvalue problem. Although, in the above examples, it is Coulombic potentials that are involved, methods have also been developed which allow us to deal with arbitrary potentials. The procedure again involves transforming the Schr6dinger equation with the wavefunction expanded in terms of wavelets. Methods developed by Latto et al. [13], Beylkin [14], and Dahmen and Micchelli [15] then allow us to obtain the matrix elements of the potential operator, V, and of the kinetic energy operator, T, usable to evaluate the Hamiltonian matrix elements: (q/j.k]H]q/j,.k,). (q/j.k]H]q)j.k,) and (q)j.k]H]q)j.k,). For the kinetic energy components the procedure reduces to solving a set of coupled linear algebraic equations. This is simplified even further by the relationship due to the scaling between levels of detail. For example, consider the scaling functions, q~. If one scales successively by factors of 2 then one can show that (q0j.klTlq0j.k, } --2-J{q00.klTlq~0.k,), where T is a second derivative which can be applied analytically. For the terms involving potential energy, actual numerical quadrature is required. However, expansion of the potential function in terms of wavelets makes the procedure accurate and efficient. Thus,
J
V(x) j=0 k
k
Sweldens and Piessens [16] have demonstrated how to numerically calculate the expansion coefficients of the potential energy in terms of Daubechies basis functions. Modisette et al were the first to treat complicated potentials
270 using these orthonormal basis functions [17]. They examined the simple harmonic oscillator potential as well as a steep double well potential. Thus they have taken the first step in the transition from the description of atoms to that of molecules. In the harmonic oscillator case, for which V(y) cx 1/2y 2 and for which exact solutions are known, the authors were able to solve for the eigenvalues to an accuracy of one part in 1014, for maximum values of k-,~ 30. This could be obtained only when the Daubechies basis functions were very smooth, i.e. when L was as large as 20. The larger the scale of resolution the quicker the convergence with respect to k; but this came at the price of increased bandedness of the matrices. Note that in this special case, all terms < ~j.k]V >, are identically zero because of the property of vanishing moments for the wavelet functions, DC
J
" xn~(x)dx -- 0
--OC
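This property is easy to check numerically; a minimal sketch (assuming the PyWavelets package, where 'db10', with its filter length of 20, matches the smooth Daubechies basis mentioned above):

import numpy as np
import pywt

# Sample the Daubechies-10 scaling and wavelet functions on a fine grid
phi, psi, x = pywt.Wavelet('db10').wavefun(level=12)

# The low-order moments of psi all vanish (db10 has 10 vanishing moments)
for n in range(4):
    print(n, np.trapz(x**n * psi, x))   # each value is numerically ~0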
So the expansion of the potential only contains contributions from the scaling functions, which are easily evaluated. A better test of the Daubechies basis set's ability to handle complex problems is the pointy double-well potential, showing regions of rapid and slow dependence on y:

V(y) \propto -\frac{1}{\sqrt{(y - 5a)^2 + a^2}} - \frac{1}{\sqrt{(y + 5a)^2 + a^2}}
Not only is there a near-singularity approaching that of a Coulombic centre, but there are two such centres, at y = ±5a. When a = 0.1 atomic units one could model, for example, the H₂⁺ molecular ion. To rewrite the potential in terms of wavelets the authors chose y = 0 as the expansion origin, considered y to extend to ±250a, and subdivided the space into 100 equal intervals around the origin, thus taking 101 scaling functions. The wavelet expansion coefficients, though, decrease rapidly as y increases. Therefore, only six levels of resolution were needed (j = 0-5), associated with spacings of 5a·2^{j-5}. The finer-level wavelets concentrate in the wells. Focusing on the 5th excited energy level (which is influenced by both the rapidly and the slowly varying parts of the potential), the authors solved the Schrödinger equation by the method described above.
This necessitated expanding the quantum wavefunction in wavelets, generating and diagonalizing the Hamiltonian matrix, and determining the eigenvalue for different wavelet bases, retaining as many wavelets as necessary to maintain an accuracy of 1 part in 10^8 at a given level of resolution. Only 25 scaling functions were needed for the coarsest level of resolution. The procedure was then repeated for successively better levels of resolution until convergence to the exact result was reached. A total of 110 basis functions of all types were required for this part of the calculation. We conclude that once other families of wavelets are compared with the Daubechies set, and once criteria are developed for deciding on the origin and scale needed for expanding generalized potentials, the technique will result in accurate calculations of molecular energies. From a modest beginning with H-like atoms and diatomic molecules, the field has now expanded to multidimensional problems. Here density functional theory is most appropriate, and initial studies have emerged (using non-orthonormal wavelets) [18]. Our abilities to calculate the electronic structure of multi-electron substances in cubic lattices [19] and molecular vibrations in four-atom systems [20,21] have been extended by making full use of powerful parallel computers. The approach of Arias et al. [19] to determine the electronic structure of all the atoms in the periodic table is to expand functions, f, in three dimensions as a sum of scaling functions at the lowest resolution plus wavelet functions of all finer resolutions:

f(\mathbf{r}) = \sum_{n} c_{j_0,n}\,\varphi_{j_0,n}(\mathbf{r}) + \sum_{j=j_0}^{J_{\max}}\sum_{n} d_{j,n}\,\psi_{j,n}(\mathbf{r})
As in the work of Fischer and Defranceschi described above, the Schrödinger equation reduces to an eigenvalue problem. Beginning with a subspace containing scaling functions at scale j₀, i.e. φ(2^{j₀}x − l)·φ(2^{j₀}y − m)·φ(2^{j₀}z − n), space is subdivided into more and more detailed lattices of higher and higher resolution. Only the basis functions that have significant coefficients need to be retained. At distances far from the atomic core lower resolution is required. As the core is approached, where the electronic wavefunction oscillates rapidly, finer scales are added as needed, until the calculations converge or the desired accuracy is achieved. In the case of the hydrogen atom the authors find that, to reproduce the energy of the 1s state within 2%, 7 scaling functions are needed for the region bounded by radii of 0.5 to 1 Bohr; an additional 6 are needed for the region 0.25 to 0.5 Bohr; another 6 for 0.125 to 0.25 Bohr; and another 6 for the innermost core. For greater
accuracy, they needed only to increase the size of the basis set modestly. As the atomic number increases, and the Coulomb potential becomes stronger, greater resolution is required; but it is found that only one scale need be added every time the atomic number doubles. Thus for uranium, rather than the basis set of 10^8 plane waves normally required, the wavelet approach requires only 67 basis functions. Using the same fixed basis set, the 1s state's energy could be calculated to within 3% for all atoms of the periodic table. Interestingly, the same procedure as in the H atom can be used to calculate the energy of the molecular ion H₂⁺. The centres of the basis functions do not change as the atomic positions are varied. With no more than 167 basis functions the total energy is within 1% of the exact value for all arbitrary choices of atomic separations. When the need arises to suddenly change the basis set at a particular geometry, an extremely small discontinuity, ~0.3 meV, results. The same authors are continuing the development of the technique to describe the structure and energies of systems consisting of many electrons. They have succeeded in the case of an array of carbon atoms on a small cubic lattice using the local density approximation. The properties of wavelets as basis functions or as tools to visualize position and momentum space simultaneously are only two of several. Others have been barely investigated. In particular, prospects for solving the time-dependent Schrödinger equation [22] are exciting. Also, as has been emphasized by Calais [23], wavelets themselves can be treated as coherent states. The dilation/translation operation cited above, ψ_{a,b}(x) = |a|^{-1/2}ψ([x − b]/a), can be viewed as the application of a unitary operator, U(a,b):

\psi_{a,b}(x) = U(a,b)\,\psi(x).

In this case one can show that U(a,b)U(a',b') = U(aa', b + ab') and that, with a' = 1/a and b' = -b/a,

U(a,b)^{-1} = U(1/a, -b/a) = U(a,b)^{\dagger},

thus satisfying the same properties that, e.g., the coherent states of a harmonic oscillator satisfy:

U(p,q)^{\dagger}\,Q\,U(p,q) = Q + q\,\mathbf{1}, \qquad U(p,q)^{\dagger}\,P\,U(p,q) = P + p\,\mathbf{1},
where (p, q) are the co-ordinates in phase space, and Q and P are the position and momentum operators. Thus the wavelets, ψ_{a,b}(x), form a set of coherent states. Calais also discusses Zak's quantum mechanical kq-representation of Bloch electrons in electric and magnetic fields. It is another example of the concept of wavelet transforms, since the translation and dilation operators which are involved, T(a) and τ(2π/a) respectively, form a special case of the wavelet transform, U(a,b), i.e.

T(na)f(x) = f(x - na), \qquad \tau(2\pi k/a)f(x) = e^{-2\pi i k x/a}\,f(x).

The Zak transform, i.e. the combined operation on the set of Bloch functions, ψ_{kq}(x),

\psi_{kq}(x) = \sqrt{a/2\pi}\,\sum_{n}\delta(x - q - na)\,e^{ikna},
generates a family of coherent states, where n and k are integers and δ is the delta-function. Exploitation of this approach does not seem to have occurred yet in quantum chemistry.
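As a side note, the group law quoted above follows in one line from the definition ψ_{a,b}(x) = |a|^{-1/2}ψ((x − b)/a); a minimal check (ours, not from the original text):

\begin{aligned}
[U(a,b)\,U(a',b')\,\psi](x) &= |a|^{-1/2}\,[U(a',b')\,\psi]\!\left(\frac{x-b}{a}\right) \\
&= |aa'|^{-1/2}\,\psi\!\left(\frac{x-b-ab'}{aa'}\right) = [U(aa',\,b+ab')\,\psi](x),
\end{aligned}

so that setting a' = 1/a and b' = -b/a recovers the identity and gives the inverse quoted above.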
2.2 Spectroscopy
The electromagnetic spectrum of a molecule is essentially a representation of the probability, |a_{n',n}|², for transitions between any initial state, n, and any final state, n', as a function of the frequency, ν, of the incident radiation and time, t. In the limit of small perturbations by the radiation,
a^{*}_{n'}(t)\,a_{n'}(t) = \frac{4}{h^2\Delta^2}\,\left|\langle\Psi_{n'}|\mu|\Psi_{n}\rangle\right|^2\,[E^{0}(\nu)]^2\,\sin^2(\pi\Delta t)
where the transition moment integral, μ_{n',n}, is defined as

\mu_{n',n} = \int \Psi^{*}_{n'}(q)\,\mu(q)\,\Psi_{n}(q)\,dq.
E⁰ is the spatially dependent amplitude of the radiation's electric field, μ the instantaneous dipole moment of the molecule, Ψ the molecular wavefunction, Δ = ε_{n'}/h − ε_n/h − ν (where ε_n is the energy of the system in state n), and the integration is carried out over all co-ordinate space q. If the wave functions of the initial and final states are known, then the transition moment integral is easily evaluated.
Conversely, it is possible, in principle, to deconvolute the experimental ultrahigh-resolution spectrum and obtain the wavefunctions, or at least the effective Hamiltonian. Quack and Jolicard have each written good reviews on the subject [24,25]. We note that the transition moment integral, μ_{n',n}, can be thought of as a transform. If the wavefunctions can be expanded as wavelets, as described above, then it should be possible to reformulate the problem and determine the effective potentials with less effort than usual. This appears to be an ideal approach, considering that the potential has regions of greater or lesser detail (in position space), and so does the spectral transform (in frequency space). The procedure of Wickerhauser for inverting complicated maps of a large p-parameter configuration space to a large d-dimensional measurement space may find important applications here, since the usual analysis requiring an effort of the order of d³ can be replaced by one of the order of d² log d using Haar-Walsh wavelet transforms [26]. However, it seems that this approach has not yet been attempted. Instead, the application of wavelets to spectroscopy has surfaced differently, in terms of signal processing: imaging, decomposition, quantification, filtering, compression, and denoising. Spectroscopies in the ultraviolet [27], infrared [28,29], microwave [30], and radio (NMR) [31] regions of the electromagnetic spectrum have all been investigated. Secondary ion mass spectrometry (SIMS), as an imaging tool, has also profited from wavelet-transform processing [32,33]. Algorithms for fast decomposition and reconstruction of spectra have been described by Depczynski et al. [34], while denoising aspects have been described by Alsberg et al. [29] and by Wickerhauser [26]. These subjects are more fully described in the chapters on applied spectroscopy and on analytical chemistry.
3 Time-series
3.1 Chemical dynamics
Quantum molecular dynamics is a natural offshoot of quantum mechanics, whereby the fate of an encounter between atoms or molecules is determined by experiment or by numerical simulation. Simulation essentially involves solution of the Schrödinger equation. The potential energy of interaction is assumed to have been determined previously by variational techniques, and the initial wavefunctions (energy and geometrical structure) of
the reagents determined by techniques such as those described above. One can either solve the time-dependent Schrödinger equation [22], or else solve the quantum scattering problem in terms of reaction probabilities using the time-independent Schrödinger equation [34-44]. Alternatively, if the encounter is understood at its deepest level of detail (positions and momenta of all atoms as a function of time), then one can invert the problem to determine the interaction potentials. Of course, the dynamics can be studied approximately, by solving the classical Hamiltonian equations of motion [45,46] instead of the Schrödinger equation; and this is the procedure to which wavelet transforms have actually been applied. Essentially a time series of positions and velocities is analysed. The variations with time are complicated, displaying irregular regions of high- and low-frequency components, sometimes buried in noise. This localization is exactly what wavelet analysis is capable of addressing. Second, the method's feature of multiresolution analysis permits it to decompose observations into subsets, remove some, and thus act as a filter, especially of noise. Instead of space and momentum co-ordinates, as in the case of molecular structure, here we have time and frequency co-ordinates (as in the case of signal and image processing). The first applications of wavelet transforms to analyse time series in the field of chemical dynamics were those of Permann and Hamilton [47,48]. Their interest lay in modelling diatomic molecules, close to dissociation, perturbed by a photon. They modelled the reaction using the equation of motion for a forced and damped Morse oscillator, given by:

\mu\,\frac{d^2x}{dt^2} = -2\beta D\,(e^{-\beta x} - e^{-2\beta x}) - C\,\frac{dx}{dt} + (1 - \varepsilon x)\,e^{-\varepsilon x}\,F\,\sin(\omega t)
where μ is the reduced mass, x the atomic separation, β the Morse force constant, C the damping coefficient, F the forcing coefficient (representing the photon electric field intensity), ω the photon frequency, and ε describes the coupling of the photon to the molecular dipole moment. The initial value of x is set to correspond to large energies close to the dissociation limit, D, and the differential equation is solved numerically. This equation is nonlinear; and because the initial departures from equilibrium are "large" and there is positive feedback between oscillator and photon, the solution results in oscillatory or chaotic behaviour [49], depending on the choice of parameters, C and F.
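A minimal sketch of such an integration (Python with SciPy; the parameter values are illustrative placeholders, not those of Permann and Hamilton):

import numpy as np
from scipy.integrate import solve_ivp

mu, beta, D = 1.0, 1.0, 10.0              # reduced mass, Morse constant, well depth
C, F, omega, eps = 0.01, 0.05, 1.0, 0.1   # damping, forcing, frequency, coupling

def rhs(t, y):
    x, v = y
    morse = -2*beta*D*(np.exp(-beta*x) - np.exp(-2*beta*x))
    drive = (1 - eps*x)*np.exp(-eps*x)*F*np.sin(omega*t)
    return [v, (morse - C*v + drive)/mu]

# 4096 equally spaced samples of x(t), as in the analyses described below
t = np.linspace(0.0, 2000.0, 4096)
sol = solve_ivp(rhs, (t[0], t[-1]), [1.5, 0.0], t_eval=t, rtol=1e-8)
x_t = sol.y[0]                            # time series to be wavelet-analysed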
For significant forcing and damping it would be trivial to observe the regimes where oscillation gives way to chattering and then to chaos. One traditional way of analysing for chaos is to draw phase plots of x versus dx/dt and check for banded limit cycles. However, it is not so clear how to investigate the onset of chaos when C or F are small. The authors present an approach to magnify the effects by drawing phase plots involving higher derivatives. This assumes that the unstable phenomenon is not so small that it would be buried within the instability of the numerical integrator itself. Despite success, one is still unable to view the frequency spectrum in such plots by the traditional Fourier transform, because the frequency is highly and irregularly time dependent. In order to overcome this problem, the authors applied a wavelet analysis to decompose the observed x(t). (Similar procedures were simultaneously being developed at the time for chaotic phenomena in other fields [50].) The coefficients, c_{0,k}, were chosen to mimic the sequence of data points obtained from the integration at multiples of the time interval,

c_{0,k} = x(k\tau).

Typically, the time scale was subdivided into 4096 units. The values of c_{0,k} were considered to be coefficients of the basis set, φ_{j,k}, such that the observable, at its finest level of detail, j = 0, could be interpreted as

x(t) = \sum_{k} c_{0,k}\,\varphi_{0,k}(t)
Various levels of approximation to this function can also be written. The finer approximations, x_{j-1}(t), can be written in terms of fuzzier approximations, x_j(t):

x_{j-1}(t) = x_{j}(t) + D_{j}(t) = \sum_{k} c_{j,k}\,\varphi_{j,k} + \sum_{k} d_{j,k}\,\psi_{j,k}
The difference in information between two successive degrees of approximation is the detail, D_j(t), at that level of resolution. Each cruder degree of detail has half of the data points of the finer detail. These terms could be determined using the recursion relations

c_{j+1,k} = \sum_{n} h_{n-2k}\,c_{j,n}, \qquad d_{j+1,k} = \sum_{n} g_{n-2k}\,c_{j,n}

and the properties of the scaling and wavelet basis sets,

\varphi_{j,k}(t) = 2^{-j/2}\,\varphi(2^{-j}t - k), \qquad \psi_{j,k}(t) = 2^{-j/2}\,\psi(2^{-j}t - k)
where the scaling and wavelet mother functions, as well as the high- and low-pass filters, h and g, were obtained from the fairly smooth 8-coefficient expressions and values of Daubechies [51]. The authors focused on the details, D_j(t), in order to probe the chaotic behaviour. In principle, 12 details could be accessed (2^12 = 4096); but in practice, D₃, D₄ and D₅ showed all the interesting results. (Successive details had half the number of data points of the more resolved ones; thus zeroes were added at the ends of the data sets to maintain the total of 4096 points.) In general, D_j increased in magnitude with increasing j; but only one choice of j was ideal for detecting the forcing frequency or its nonlinear effects on the oscillator. The real novelty of the studies, though, was to demonstrate the power of wavelet analysis as a pre-processing transformation for detecting buried information. Instead of using a single forcing frequency, the authors replaced the sinusoidal forcing term by three sinusoids, each with the same tiny amplitude but with different frequencies. The Fast Fourier Transform of x(t), even when amplified, could not reveal the presence of three separate frequencies, because their effect was buried in the relatively monstrous oscillation of the diatomic molecule. Although the appropriate details, D₃ and D₄, could successfully filter out the molecular oscillation, it was not obvious how many separate forcing frequencies were present. However, a combined FFT of the WT revealed the three frequencies most dramatically. Nine separate forcing terms could also be detected; however, in this case additional detail was observable, presumably due to bifurcations. In yet another interesting feature, the authors showed that by deleting those (few) parts of the D_j's which had unusually large amplitudes, the FFT-WT was clarified even more, without affecting the reconstruction of x(t).
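The FFT-of-the-wavelet-details idea can be sketched in a few lines (a hedged illustration using PyWavelets and NumPy, with a synthetic signal standing in for x(t); the filter and level choices are our assumptions, not the authors'):

import numpy as np
import pywt

t = np.arange(4096)
# A strong molecular oscillation with three tiny buried forcing components
x_t = np.sin(0.9*t) + 1e-3*(np.sin(0.050*t) + np.sin(0.071*t) + np.sin(0.110*t))

# Multilevel DWT: coeffs = [cA6, cD6, cD5, ..., cD1]
coeffs = pywt.wavedec(x_t, 'db4', level=6)

# FFT of each detail; the weak lines emerge in the details whose band covers them
for level, d in zip(range(6, 0, -1), coeffs[1:]):
    power = np.abs(np.fft.rfft(d))**2
    print(f"D{level}: dominant frequency bin {power[1:].argmax() + 1}")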
Subsequent to Permann and Hamilton's studies, Askar et al. applied wavelet analysis to two other physical problems: (a) deterministic motion of a three-dimensional polymer; and (b) random motion of a one-dimensional chain [52]. In both cases the motion is ruled by differential equations which are solved numerically. The results are superimposed, in case (a), by high-frequency deterministic components, and in case (b) additionally by noise. In the example of the motion of a 32-atom polymer of masses, m, subject to Newton's laws of motion,

F(q) = m\,\frac{\partial^2 q}{\partial t^2} = -\frac{\partial V}{\partial q},

where the initial positions and velocities are fixed, the potential energy of interaction is given by
V = \sum K_{ij}\,(b_{ij} - b^{0}_{ij})^2 + \sum K^{B}_{ijk}\,(\theta_{ijk} - \theta^{0}_{ijk})^2 + \sum K^{T}_{ijkl}\,[1 - \cos 3(\phi_{ijkl} - \phi^{0}_{ijkl})]
q is the generalized set of 6 centre-of-mass co-ordinates, 29 torsional angles, φ_{ijkl}, 30 bending angles, θ_{ijk}, and 31 interatomic bond separations, b_{ij}. Superscripted co-ordinates denote equilibrium values. They, as well as the force constants, K_{ij}, K^B_{ijk}, and K^T_{ijkl}, are also fixed. Starting with an initial helical conformation and a large impulsive velocity in one of the modes, the equations of motion were integrated, and all 96 co-ordinates were computed at fixed time intervals, kΔt, where k = 1, 2, ..., until the polymer rolled into a ball-like structure. The strategy was to represent the observable positions or angles, f(t), in terms of wavelet functions, as

f(t) = \sum_{j}\sum_{k} d_{j,k}\,\psi_{j,k}(t)
(The authors decomposed the results, iterating up to k = 7 and j = 3.) At the finest resolution it was assumed that

f_{0}(t) = \sum_{k} c_{0,k}\,\varphi_{0,k}(t),

where the scaling coefficients, c_{0,k}, were chosen as c_{0,k} = f(kΔt), k = 1, 2, .... Using this as a starting point the authors evaluated the other scaling and wavelet coefficients from the recursion formulae
c_{j+1,k} = \sum_{n} h_{n-2k}\,c_{j,n}, \qquad d_{j+1,k} = \sum_{n} g_{n-2k}\,c_{j,n}
with the low- and high-pass filters, g and h, derived from the Daubechies wavelets using 8 coefficients [51], and with the basis sets given by

\varphi_{j,k}(t) = 2^{-j/2}\,\varphi(2^{-j}t - k), \qquad \psi_{j,k}(t) = 2^{-j/2}\,\psi(2^{-j}t - k),

where the scaling and wavelet functions are also those given by Daubechies [51]. The authors also made use of the wavelet transform for time-frequency analysis, as in Permann's work above. Noting that the wavelet,
ψ_{j,k}(t) ∝ ψ(2^{-j}t − k), is centred about t = 2^{j}k, they could focus on greater or lesser detail. The higher the value of j, the lower the frequency that the detailed wavelet coefficient could address. The multiple vibrations of bonds and gyrations of groups of atoms in space resulted in observables which varied in a chaotic manner. The torsional motions of two particular dihedral angles were examined: one for its small-amplitude/low-frequency plus large-amplitude/high-frequency character, and the other for its large-amplitude/low-frequency plus small-amplitude/high-frequency character. The scaling coefficients, c_{j,k}, revealed the motions at lower and lower resolution as j increased; whereas the complementary detailed wavelet coefficients, d_{j,k}, revealed the time history of the superimposed oscillations at poorer and poorer resolution as j increased, thus effectively demonstrating the filtering nature of the procedure. An interesting summary of the variability over all time steps was defined by the authors as Σ_k |d_{1,k}|². This quantity can identify those co-ordinates and their neighbours which are most "active".
This technique holds promise for addressing the holy grail of molecular dynamics calculations, the simulation of protein folding over long time scales, by using wavelets to perform the dynamical calculation itself, without necessarily having to focus on fine-scale details. The same approach was taken for the second case, a polymer chain containing 16 atoms driven by random forces. To simplify matters, the bond lengths and angles were fixed. Only torsion was permitted, governed by a double-well potential, V(ξ) = γ(1 − ξ²)², where ξ represented the relative distances between nearest-neighbouring atoms. In this case damping and stress terms were also included, as well as random forces which depended on the temperature, similar to the Langevin description of Brownian motion. A sudden heating at a predetermined time was found to cause the polymer to contract. The atomic motions were all rather equally random; however, the sudden change in randomness at the time of heating was easily detectable, not only among the scaling coefficients, but also among the wavelet coefficients, even at low resolution. In contrast, there were no features in the traditional Fourier transform which one could recognize as a detection of the sudden change in atomic behaviour.
3.2 Chemical kinetics
Chemical kinetics may be considered the macroscopic version of chemical dynamics. Dynamics is concerned with determining the details of
an elementary chemical interaction, whereas kinetics averages out those details on a vastly longer time scale. From an a priori point of view kinetics makes use of information deduced by dynamical calculations (the properties of molecular products averaged over time, over initial orientations of molecular reagents, as well as over their translational, rotational and vibrational energies), interpreted in terms of rate constants, k [53]. The rate of the elementary chemical reaction, A + BC → AB + C, is thus expressed simply as
\text{Rate} = \frac{d[\mathrm{A}]}{dt} = -k\,[\mathrm{A}][\mathrm{BC}]

where [A] and [BC] are the concentrations of species A and BC, respectively, at time t. A complex chemical reaction is composed of several elementary steps, and they each contribute such terms to the overall rate of change of the concentration of each chemical species involved. Rate equations (treated as differential equations) are thus used in a phenomenological sense to describe observed time profiles of concentrations of reagents, intermediates and products. As long as the chemical mechanisms are not too complex, and as long as the chemical system is not too far from equilibrium, the mathematical problem is essentially linear, and solutions of sets of differential rate equations result in smooth variations of concentrations with time. Experimentalists have tended to expect this behaviour. Experimental signals could, at most, be processed for information content buried within noise. However, in recent years it has been recognized that real chemical reactions are not so ideal. The oscillatory behaviour of the hydrogen/oxygen reaction has been known since 1936. Chaotic behaviour was demonstrated for this reaction in 1988 [54], but has been well known for other reactions for a longer time [55-58]. The essential requirements for all chaotic behaviour are a nonlinear deterministic governing set of differential equations along with initial conditions which are far from equilibrium. In this sense, non-linear chemical kinetics can be considered to be another example of non-linear dynamics [49]. The essential observable features of chaotic systems are (a) a large frequency spectrum, (b) extreme sensitivity to initial conditions, and (c) extreme sensitivity to the choice of parameters. Thus there could be a region of parameter space where chaos or oscillations occur between two well-behaved regions [59]. Such situations can occur in such varied chemically driven areas as industrial processes or physiological reactions such as heart attacks [60]. It therefore becomes essential to know how to detect the onset of chaos. Traditional methods of analysis are very valuable [61]. However, because of nonlocal variability or because of small-amplitude effects, techniques such as
Fourier analysis are not as useful. This is where wavelet analysis can make an impact. It appears that only one application of wavelet analysis to the study of chemical chaos has appeared: that of Permann and Teitelbaum [62]. In an experimental study they found that CCl₃F, an erstwhile popular refrigerant, progressed homogeneously towards condensation via a nonlinear process. Electronic signals were generated by laser refractometry of a shock-compressed slug of gas passing by a stationary laser beam. The chemical reaction's history was spread out spatially, and could thus be probed by time-resolved refractometry. The generated signals were proportional to the rate of reaction (in this case production of dimers and trimers, etc., on the microsecond to millisecond time scale). Regular, oscillatory, and chaotic regimes were observed. It was desirable to determine the frequency spectrum of the chaotic signals. Fourier analysis proved to be frustrating because the transience of the signal introduced large uninteresting components (even when the signal is DC-shifted to decrease the amplitude of zero-frequency components). However, wavelet transformation proved to be very helpful. Digital signals were treated as a time series of 32,768 data points. Only every 8th point was retained, and the 4096 remaining points were processed as described above for references [47,48]. Two observations were noted. First, decomposition of a millivolt signal into wavelet components allowed the researchers to modestly edit some of the lower-numbered details (containing high-frequency noise) and thus to reconstruct the signal with a signal-to-noise ratio improved by a factor of 10. This improvement revealed an underlying oscillation, not evident in the original signal. Second, the researchers could perform a Fast Fourier Transform of the wavelet details themselves in order to clearly reveal a crisp frequency spectrum, which in their case consisted of two frequencies, 62.5 and 94.2 kHz, and their harmonics extending up to 1.3 MHz. It was also noted that a wavelet analysis of the original signal where every second point was retained, rather than every eighth point, was less effective in detecting all of the frequencies. Thus preliminary "smoothing" appears to be advantageous. With due care one can therefore extract the forcing frequencies which are characteristic of incipient chaos and determine if bifurcation is present. In addition, the Wavelet-Fast-Fourier-Transform technique essentially performs, for an unrepeatable single-shot experiment, the same task as a box-car averager does for repeatable experimental signals. The de-noising feature can also be applied to traditional linear kinetics. This was demonstrated by Fang and Chen for voltammetry [63]. Despite its use
in analytical chemistry, voltammetry is a technique which determines the current response to a sudden change in voltage and is designed, in principle, to determine the rate of electron transport in an electrochemical system. It is therefore representative of other techniques in kinetics which generate signals suffering from a finite level of white noise. Filtering by selecting an arbitrary frequency cutoff after Fourier analysis is somewhat risky, in general. In addition, there are always the edge effects which introduce spurious Fourier components. In response, Fang and Chen used Walczak's Wavelet Packet Transform procedure (for recognizing patterns in IR spectra by multi-resolution analysis) [64], followed by the application of an adaptive wavelet filter. Edge effects for a differential pulse polarogram could be minimized or eliminated by baseline subtraction prior to applying the wavelet transform. The procedure for low-pass filtering is to reconstruct signals from only those details which pass a test based on the magnitude of the power spectrum of the signal, subject to a minimization criterion. Although the procedure is more objective than that adopted by Permann, above [62], there is still subjectivity in choosing the level of the pass criterion. The authors reported an improvement in signal-to-noise ratio of a factor of 6, with little distortion.
3.3 Fractal structures
Fractals are closely allied to chaos [65-69]. Instead of temporal stability/instability, it is spatial regularity/irregularity which is involved. Although, formally, fractal behaviour is not strictly a phenomenon of kinetics, it does appear quite often in materials science; and since it is also a phenomenon generated by nonlinear deterministic rules, as in the case of nonlinear kinetics, it is considered here in this section. Two applications of wavelet transforms to fractals have appeared. As will be seen below, they demonstrate the power of wavelet analysis for revealing the underlying deterministic rules (cf. rate equations) and for studying time-resolved chemical kinetics. Multifractals are characterized by a spectrum, f(α), of singularities of magnitude, α. This spectrum is often a narrow function of α centred at α = D₀, where D₀ is termed the fractal dimension. When the fractal structure is globally self-similar, i.e. sequential magnifications of the structure are identical to the original structure, then f(α) is sharply peaked with a unique value of the fractal dimension. Otherwise, there is a spectrum of dimensions, D_n (which decrease with increasing n). This number is essentially a summary of the effective number of dimensions needed to describe
the object's complexity; but it does not reveal anything about the spatial location of its singularities. Fourier analysis is limited, as usual, by its inability to localize frequency information. Thus wavelet analyses have been developed for analysing one-dimensional [70] and two-dimensional fractal images [71,72], and the analyses have been tested on several models: Thue-Morse chains, period-doubling chains, snowflakes and diffusion-limited aggregates [73]. In the one-dimensional case Liviotti studied the formation of chains of "atoms" obeying simple deterministic rules. For the Thue-Morse (TM) chain, element A is replaced by the combination AB, while B is replaced by the combination BA. Starting with A, a line of elements is generated which doubles in length at every generation. The order of elements rapidly becomes chaotic. For the period-doubling (PD) chain, the rules are: A is replaced by AB, while B is replaced by AA. Similar chaos results here. At least the TM chain could, for example, form the basis for a model description of polymer growth. A and B have lengths a and b, respectively. The author then defines a structure factor, S_N(q),

S_N(q) = \frac{1}{N}\,\left|\sum_{k=0}^{N-1} e^{iqX_k}\right|^2
where X_k is the position of the k-th element, and N is the number of elements. Long-range order appears as N increases, and S_N displays sharp peaks whose magnitudes are proportional to N^γ, where γ is related to the spectrum of fractal dimensions according to α = 2(1 − γ). To determine the spatial dependence of α the author applied a "Mexican hat" wavelet function, ψ(q), to S_N:

T(s,u) = \frac{1}{s}\int_{-\infty}^{\infty} \psi\!\left(\frac{q-u}{s}\right)\,S_N(q)\,dq
The mathematical properties of fractals result in the following scaling relation,

T(2s,u) \approx 2^{\alpha-1}\,T(s,u),

which is satisfied by T(s,u) ∝ s^{α−1}. Consequently a plot of ln|T(s,u)| vs ln s gives a straight line of slope (α − 1). Varying the scale factor and the translation allows one to zoom in on any particular region of the fractal space and determine α(u).
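The whole procedure fits in a few lines; a hedged sketch (PyWavelets' continuous transform with the Mexican-hat wavelet; the element lengths and scale range are illustrative assumptions):

import numpy as np
import pywt

# Thue-Morse chain: A -> AB, B -> BA, starting from A (10 generations)
s = 'A'
for _ in range(10):
    s = ''.join('AB' if c == 'A' else 'BA' for c in s)

a, b = 1.0, 0.5                                      # element lengths (assumed)
steps = np.array([a if c == 'A' else b for c in s])
X = np.concatenate(([0.0], np.cumsum(steps)[:-1]))   # element positions X_k
N = len(X)

q = np.linspace(0.1, 20.0, 2048)
S = np.abs(np.exp(1j * np.outer(q, X)).sum(axis=1))**2 / N   # structure factor

scales = 2.0 ** np.arange(1, 7)
T, _ = pywt.cwt(S, scales, 'mexh')                   # Mexican-hat wavelet transform

# Slope of log|T(s,u)| vs log s at a fixed translation u estimates alpha - 1
u = 1024
alpha = 1 + np.polyfit(np.log(scales), np.log(np.abs(T[:, u]) + 1e-30), 1)[0]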
The author's results for both models agree with theoretical predictions, opening the way to the analysis of arbitrary fractals. The same procedure has been applied to two-dimensional chains [73]. In order to overcome the large computational effort (an order of magnitude greater than for one-dimensional time-series analysis), Freysz et al. [74] have extended an existing ingenious experimental analogue of the computation. Noting that classical double Fraunhofer optical diffraction is simply a handy implementation of Fourier transformation, the authors added an additional optical filter, equivalent to a wavelet analysing function. The coherently illuminated test object is placed at the front focus of a telescope, at whose internal focus is placed a filter consisting of the image of a transparent ring. The width and diameter of the ring define the scale, a, and shift, b, of the optical wavelet transform. The effect of the filter approximates well the Fourier transform of the radial Mexican hat function. Therefore the procedure is equivalent to performing a wavelet transform of an object. A CCD camera is placed at the back focus of the telescope, and produces a spatially resolved signal which is proportional to the square of the mathematically equivalent wavelet transform, |T(a,b)|², at all local points, x₀. Since the wavelet transform, T(a,b), of an object function, f(x), with respect to a dilated and translated wavelet function, ψ(x), is given in Fourier form by

T(a,b) = \sqrt{a}\int \hat{\psi}^{*}(a\nu)\,\hat{f}(\nu)\,e^{2\pi i b\nu}\,d\nu,
where ψ̂ and f̂ are the Fourier transforms of ψ and f respectively, and since the definition of fractal self-similarity means that scaling by a factor, λ, at point x₀ changes the object density by a factor of λ^α, then

|T(\lambda a, x_0)|^2 \propto \lambda^{2(\alpha-2)}\,|T(a, x_0)|^2

and a plot of the logarithm of the light intensity vs the logarithm of the scale factor, a, gives the scaling exponent 2(α − 2), similar to the one-dimensional chain example above. At two separate points in the images, this procedure results in values for α very close to the theoretical predictions (1.465 for snowflakes and 1.60 for diffusion-limited aggregates), as well as close to those predicted from the numerical wavelet analyses [73]. Although tested on simulated fractals, the implication is that if the images were time-dependent, then time resolution of dynamical phenomena could be probed, such as percolation, colloidal aggregation, crystal nucleation and growth, as well as other phase changes. In essence a powerful new tool in kinetics is now available.
4 Conclusion
The application of wavelet analysis to physical chemistry and chemical physics is in its infancy. The present survey indicates that there are many opportunities remaining to advance the fields of quantum chemistry, chemical dynamics and chemical kinetics by using this powerful new tool.
5 Acknowledgements
This work is supported by the Natural Sciences and Engineering Research Council of Canada. The author is grateful to Dr. Del Permann for valuable discussions and to Prof. Alain St.-Amant for reading the manuscript.
References
1. R. McWeeny, Methods of Molecular Quantum Mechanics, 2nd ed., Academic Press, New York, (1989).
2. P. Fischer and M. Defranceschi, Looking at Atomic Orbitals through Fourier and Wavelet Transforms, International Journal of Quantum Chemistry, 45 (1993), 619-636.
3. P. Fischer and M. Defranceschi, The Wavelet Transform: A New Mathematical Tool for Quantum Chemistry, in Conceptual Trends in Quantum Chemistry, (E.S. Kryachko and J.L. Calais Eds), Kluwer Academic, Dordrecht, (1994), pp. 227-247.
4. A.C. Tanner, The Bond Directional Principle for Momentum Space Wavefunctions: Comments and Cautions, Chemical Physics, 123 (1988), 241-247.
5. L. De Windt, J.G. Fripiat, J. Delhalle and M. Defranceschi, Improving the One-electron States of ab initio LCAO-GTO Calculations in Momentum Space. Applications to Be and B+ Atoms, Journal of Molecular Structure (Theochem), 254 (1992), 145-159.
6. A. Grossman, R. Kronland-Martinet and J. Morlet, Reading and Understanding Continuous Wavelet Transforms, in Wavelet Transforms, (J.M. Combes, A. Grossman and Ph. Tchamitchian Eds), Springer-Verlag, Berlin, (1990), pp. 2-20.
7. Y. Meyer, Ondelettes et Opérateurs I, Hermann, Paris, 1990.
8. B.G. Williams, The Experimental Determination of Electron Momentum Densities, Physica Scripta, 15 (1977), 69-79.
9. I.E. McCarthy and E. Weigold, Electron Momentum Spectroscopy for Atoms and Molecules, Rep. Prog. Phys., 82 (1991), 827-840.
10. P. Fischer and M. Defranceschi, Representation of the Atomic Hartree-Fock Equations in a Wavelet Basis by Means of the BCR Algorithm, in Wavelets: Theory, Algorithms, and Applications, (C.K. Chui, L. Montefusco and L. Puccio Eds), Academic Press, New York, (1994), pp. 495-506.
11. P. Fischer and M. Defranceschi, Numerical Solution of the Schrödinger Equation in a Wavelet Basis for Hydrogen-like Atoms, SIAM Journal of Numerical Analysis, 35 (1998), 1-12.
12. I. Daubechies, Orthonormal Bases of Compactly Supported Wavelets, Comm. Pure Appl. Math., 41 (1988), 909-996.
13. A. Latto, H.L. Resnikoff and E. Tenenbaum, cited in ref. 17 below.
14. G. Beylkin, On the Representation of Operators in Bases of Compactly Supported Wavelets, SIAM Journal of Numerical Analysis, 29 (1992), 1716-1740.
15. W. Dahmen and C.A. Micchelli, Using the Refinement Equation for Evaluating Integrals of Wavelets, SIAM Journal of Numerical Analysis, 30 (1993), 507-537.
16. W. Sweldens and R. Piessens, Quadrature Formulae and Asymptotic Error Expansions for Wavelet Approximations of Smooth Functions, SIAM Journal of Numerical Analysis, 31 (1994), 1240-1264.
17. J.P. Modisette, P. Nordlander, J.L. Kinsey and B.R. Johnson, Wavelet Bases in Eigenvalue Problems in Quantum Mechanics, Chemical Physics Letters, 250 (1996), 485-494.
18. K. Cho, T.A. Arias, J.D. Joannopoulos and P.K. Lam, Wavelets in Electronic Structure Calculations, Physical Review Letters, 71 (1993), 1808-1811.
19. T.A. Arias, K.J. Cho, J.D. Joannopoulos, P. Lam and M.P. Teter, Wavelet Transform Representation of the Electronic Structure of Materials, in Toward Teraflop Computing and New Grand Challenge Applications, (R.K. Kalia and P. Vashishta Eds), Nova Science Publ., Commack, N.Y., 1995, pp. 25-36.
20. M.J. Bramley and T. Carrington Jr., A General Discrete Variable Method to Calculate Vibrational Energy Levels of Three- and Four-Atom Molecules, Journal of Chemical Physics, 99 (1993), 8519-8541.
21. J. Antikainen, R. Friesner and C. Leforestier, Adiabatic Pseudopotential Calculation of Vibrational States of Four Atom Molecules: Application to Hydrogen Peroxide, Journal of Chemical Physics, 102 (1995), 1270-1279.
22. R. Kosloff, Propagation Methods for Quantum Molecular Dynamics, Annual Review of Physical Chemistry, 45 (1994), 145-178.
23. J-L. Calais, Wavelets - Something for Quantum Chemistry?, International Journal of Quantum Chemistry, 58 (1996), 541-548.
24. M. Quack, Spectra and Dynamics of Coupled Vibrations in Polyatomic Molecules, Annual Review of Physical Chemistry, 41 (1990), 839-874.
25. G. Jolicard, Effective Hamiltonian Theory and Molecular Dynamics, Annual Review of Physical Chemistry, 46 (1995), 83-108.
26. M.V. Wickerhauser, Large Rank Approximate Principal Component Analysis with Wavelets for Signal Feature Discrimination and the Inversion of Complicated Maps, Journal of Chemical Information and Computer Science, 34 (1994), 1036-1046.
27. X-Q. Lu and J-Y. Mo, Spline Wavelet Multi-Resolution Analysis for High-Noise Digital Signal Processing in Ultraviolet-Visible Spectrophotometry, Analyst, 121 (1996), 1019-1024.
28. F-T. Chau, J.B. Gao, T.M. Shih and J. Wang, Compression of Infrared Spectral Data Using the Fast Wavelet Transform Method, Applied Spectroscopy, 51 (1997), 649-659.
29. B.K. Alsberg, A.M. Woodward, M.K. Winson, J. Rowland and D.B. Kell, Wavelet Denoising of Infrared Spectra, Analyst, 122 (1997), 645-652.
30. K. Gopalan, N. Gopalsami, S. Bakhtiari and A.C. Raptis, An Application of Wavelet Transforms and Neural Networks for Decomposition of Millimeter-Wave Spectroscopic Signals, Proceedings IEEE 21st International Conference on Industrial Electronics, Control and Instrumentation, Part 2, (1995), IEEE, Los Alamitos, CA, pp. 1411-1414.
31. H. Serrai, L. Senhadji, J.D. De Certaines and J.L. Coatrieux, Time Domain Quantification of Amplitude, Chemical Shift, Apparent Relaxation Time T2*, and Phase by Wavelet-Transform Analysis. Application to Biomedical Magnetic Resonance Spectroscopy, Journal of Magnetic Resonance, 124 (1997), 20-34.
32. H. Hutter, Ch. Brunner, St. Nikolov, Ch. Mittermayer and M. Grasserbauer, Imaging Surface Spectroscopy for Two- and Three-Dimensional Characterization of Materials, Fresenius Journal of Analytical Chemistry, 355 (1996), 585-590.
33. M. Wolkenstein, H. Hutter and M. Grasserbauer, Wavelet Filtering for Analytical Data, Fresenius Journal of Analytical Chemistry, 358 (1997), 165-169.
34. U. Depczynski, K. Jetter, K. Molt and A. Niemöller, The Fast Wavelet Transform on Compact Intervals as a Tool in Chemometrics. I. Mathematical Background, Chemometrics and Intelligent Laboratory Systems, 39 (1997), 19-27.
35. R.D. Levine and R.B. Bernstein, Molecular Reaction Dynamics and Chemical Reactivity, Oxford University Press, Oxford, 1987, pp. 276-289.
36. D.C. Clary (Ed.), The Theory of Chemical Reaction Dynamics, Reidel, Dordrecht, 1985.
37. M. Baer (Ed.), Theory of Chemical Reaction Dynamics, Vols. I-IV, CRC Press, Boca Raton, 1985.
38. J.M. Bowman (Ed.), Advances in Molecular Vibrations and Collision Dynamics, Vols. I-II, JAI Press, Greenwich, 1994.
39. R.E. Wyatt and J.Z.H. Wang (Eds.), Dynamics of Molecules and Chemical Reactions, Dekker, New York, 1994.
40. M.F. Herman, Dynamics by Semiclassical Methods, Annual Review of Physical Chemistry, 45 (1994), 83-111.
41. W.H. Miller, Beyond Transition-State Theory: A Rigorous Quantum Theory of Chemical Reaction Rates, Accounts of Chemical Research, 26 (1993), 174-181.
42. H. Nakamura, Theoretical Studies of Chemical Dynamics, Annual Review of Physical Chemistry, 48 (1997), 299-328.
43. M.S. Child, Molecular Collision Theory, Academic Press, New York, 1994.
44. L.J. Butler, Chemical Reaction Dynamics Beyond the Born-Oppenheimer Approximation, Annual Review of Physical Chemistry, 49 (1998), 125-171.
45. D.L. Bunker, Classical Trajectory Methods, Methods Comput. Physics, 10 (1971), 287-326.
46. J.C. Polanyi and J.L. Schreiber, The Dynamics of Bimolecular Reactions, in: Kinetics in Gas Reactions, Vol. 6 (Part 1) of Physical Chemistry: An Advanced Treatise, (H.W. Eyring, W. Jost and D. Henderson Eds), Academic Press, New York, (1973), pp. 383-487.
47. D. Permann and I. Hamilton, Wavelet Analysis of Time Series for the Duffing Oscillator: The Detection of Order Within Chaos, Physical Review Letters, 69 (1992), 2607-2610.
48. D. Permann and I. Hamilton, Wavelet Analysis of Time Series for the Weakly Forced and Weakly Damped Morse Oscillator, Journal of Chemical Physics, 100 (1994), 379-386.
49. R.L. Devaney, An Introduction to Chaotic Dynamical Systems, Benjamin-Cummings, Toronto, 1986.
50. M.B. Ruskai, Wavelets and their Applications, Jones and Bartlett, Boston, 1992.
51. I. Daubechies, Ten Lectures on Wavelets, SIAM Press, Philadelphia, (1992).
52. A. Askar, A.E. Cetin and H. Rabitz, Wavelet Transform for Analysis of Molecular Dynamics, Journal of Physical Chemistry, 100 (1996), 19165-19173.
53. J.I. Steinfeld, J.S. Francisco and W.L. Hase, Chemical Kinetics and Dynamics, Prentice-Hall, Englewood Cliffs, 1999.
54. P. Gray, Instabilities and Oscillations in Chemical Reactions in Closed and Open Systems, Proc. Roy. Soc. London, 415A (1988), 1-34.
55. B.L. Clarke, Stability of Complex Reaction Networks, Chapter 1 in Vol. 43 of Advances in Chemical Physics, I. Prigogine and S.A. Rice (Eds), 1980, pp. 1-215.
56. R.J. Field and M. Burger, Oscillations and Travelling Waves in Chemical Systems, Wiley, New York, 1985.
57. G. Nicolis and F. Baras, Chemical Instabilities. Applications in Chemistry, Engineering, Geology and Materials Science, Reidel, Dordrecht, 1984.
58. L.E. Reichl and W.C. Schieve, Instabilities, Bifurcations and Fluctuations in Chemical Systems, University of Texas Press, Austin, 1982.
59. K.J. Laidler, Chemical Kinetics, 3rd ed., Harper and Row, New York, 1987.
60. A.T. Winfree, When Time Breaks Down, Princeton University Press, Princeton, 1988.
61. S.K. Scott, Chemical Chaos, Oxford University Press, Oxford, 1991.
62. D.N.S. Permann and H. Teitelbaum, Wavelet Fast Fourier Transform (WFFT) Analysis of a Millivolt Signal for a Transient Oscillating Chemical Reaction, Journal of Physical Chemistry, 97 (1993), 12670-12673.
63. H. Fang and H-Y. Chen, Wavelet Analyses of Electroanalytical Chemistry Responses and an Adaptive Wavelet Filter, Anal. Chim. Acta, 346 (1997), 319-325.
64. B. Walczak, B. Van den Bogaert and D.L. Massart, Application of Wavelet Packet Transforms in Pattern Recognition of Near-IR Data, Analytical Chemistry, 68 (1996), 1742-1747.
65. P. Fischer and T. Smith, Chaos, Fractals and Dynamics, M. Dekker, New York, 1985.
66. F. Barnsley and S.G. Demko, Chaotic Dynamics and Fractals, Academic Press, New York, 1986.
67. E.R. Pike and L.A. Lugiato, Chaos, Noise and Fractals, Hiller, Bristol, 1987.
68. I. Prigogine and S.A. Rice, Evolution of Size Effects in Chemical Dynamics, Wiley, New York, 1988.
69. A.J. Crilly, E.A. Earnshaw and H. Jones (Eds), Fractals and Chaos, Springer, New York, 1991.
70. E. Liviotti, A Study of the Structure Factor of Thue-Morse and Period-Doubling Chains by Wavelet Analysis, Journal of Physics: Condensed Matter, 8 (1996), 5007-5015.
71. A. Arneodo, F. Argoul, J. Elezgaray and G. Grasseau, Wavelet Transform Analysis of Fractals: Application to Nonequilibrium Phase Transitions, in Nonlinear Dynamics, G. Turchetti (Ed.), World Scientific, Singapore, (1988), pp. 130-180.
72. R. Murenzi, Wavelet Transforms Associated to the n-Dimensional Euclidean Group, in Wavelets: Time-Frequency Methods and Phase Space, (J.M. Combes, A. Grossman and Ph. Tchamitchian Eds), Springer-Verlag, Berlin, (1990), pp. 239-246.
73. F. Argoul, A. Arneodo, J. Elezgaray, G. Grasseau and R. Murenzi, Wavelet Transform of Fractal Aggregates, Physics Letters, 135A (1989), 327-335.
74. E. Freysz, B. Pouligny, F. Argoul and A. Arneodo, Optical Wavelet Transform of Fractal Aggregates, Physical Review Letters, 64 (1990), 745-748.
CHAPTER 13
Wavelet Bases for IR Library Compression, Searching and Reconstruction

Beata Walczak¹ and Jan P. Radomski²
¹Institute of Chemistry, Silesian University, 9 Szkolna Street, 40-006 Katowice, Poland
²Interdisciplinary Center for Mathematical and Computational Modeling, Warsaw University, Pawinskiego 5A, 02-106 Warsaw, Poland
1 Introduction
There is a rather substantial difference between the sizes of available, electronically searchable databases of chemical compounds on the one hand, and spectral databases on the other. While Chemical Abstracts contains approx. 14 million entries, and Beilstein more than 7.5 million, the biggest available infrared databases seldom exceed tens of thousands of compounds (e.g. Sadtler's Condensed Phase IR Standards, 75,620 spectra; Nicolet's combined Aldrich and Sigma libraries, approx. 29,000 spectra). For mass spectroscopy the situation is only slightly better: about two hundred thousand spectra in the MS Wiley library. Smaller still are NMR spectral databases. The past few years and the rapid pace of combinatorial chemistry advances have increased this gap even more. The new robotic systems are capable of producing and efficiently screening new lead compounds for drug discovery at a rate of 50,000 a month or more, thus producing in-house structural databases of rapidly growing size. Virtual combinatorial libraries of synthesizable compounds have recently reached the monstrous size of 10^12 entries [1]. At the same time, the use of theoretically calculated IR spectra has been shown to be possible and quite effective in large-scale QSAR studies [2-4], thus indicating new possibilities for virtual screening. It might be of interest to compare results obtained using theoretical and real IR spectra for this purpose. However, the lack of big enough spectral databases poses a serious obstacle for such attempts. Despite recent advances, the problem of spectral library size build-up and search speed still receives a considerable amount of attention. Most of the commercial databases to date use the Fast Fourier Transform (FFT) for spectra compression. However, the past ten years have brought explosive growth of wavelet applications in signal processing.
IR spectra show many absorption bands of local character, which makes wavelets especially well suited for their decomposition. Also, the Wavelet Transform is considerably faster than the FFT. In successful library construction three important factors must be considered, namely: an efficient compression algorithm and ratio, a fast search method, and good spectra reconstruction quality. Fulfilment of all these demands requires some kind of compromise, and there are different possible approaches to this problem. In the presented study a solution based on the joint best-basis, enabling uniform data compression, is applied to a data set containing ca. 3400 IR spectra of both different and closely related compounds. The proposed approach, although applied here to IR library compression and search, is a general one and can be applied without modification to any type of spectral data (e.g. to NMR) as well.
2 Theory
2.1 Wavelet transforms
Wavelets [5], well localized in both the time and frequency (scale) domains, are basis functions ideally suited for the description of non-stationary instrumental signals such as, for instance, IR or NMR spectra. Each discrete spectrum of length L = 2^n can be transformed into the wavelet domain using the Fast Wavelet Transform (FWT) [6] (known as the Mallat algorithm, the Discrete Wavelet Transform, or the pyramid algorithm). In the FWT, the pair of so-called wavelet quadrature mirror filters (a low-pass filter and a high-pass filter) is applied iteratively. The output of the low-pass filter is called the approximation, whereas the output of the high-pass filter is called the 'detail'. The lengths of the approximation and detail are halved after each iteration, and the whole process is continued until only one element remains. The pyramid algorithm can easily be extended to the Wavelet Packet Transform (WPT) [7]. The WPT is a much more flexible tool for signal decomposition, because it allows a data-dependent partition of the time-frequency domain. Among all possible orthogonal bases, the one with coefficients differentiated to the highest degree is of special interest for signal compression. It can be selected based on the entropy criterion [8], and is called the best-basis (BB).
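Both transforms are available in standard software; a minimal sketch (assuming the PyWavelets package, with 'db4' as an arbitrary filter choice):

import numpy as np
import pywt

x = np.random.default_rng(0).standard_normal(1024)   # a stand-in spectrum, L = 2**10

# FWT (pyramid algorithm): iterated low-/high-pass filtering and decimation
level = pywt.dwt_max_level(len(x), 'db4')
coeffs = pywt.wavedec(x, 'db4', level=level)         # [cA_level, cD_level, ..., cD1]

# WPT: both the approximation and the detail are split at every level
wp = pywt.WaveletPacket(x, 'db4', maxlevel=4)
leaf_paths = [node.path for node in wp.get_level(4, order='freq')]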
2.2 Compression of individual signals
Signals such as IR spectra have a sparse representation in the wavelet domain. This means that in the wavelet domain there are many wavelet coefficients with very small amplitude (absolute value), which can be discarded without loss of the essential information carried by a signal. Elimination of small coefficients is equivalent to spectra compression. There are different criteria for thresholding the wavelet coefficients. Among the most popular ones are the universal threshold (Visu), Sure, Min-Max [9] (see Chapter 5 in the Theory part) and Minimum Description Length (MDL) [10-12] criteria. The main idea of the MDL approach can be summarized as follows: the MDL cost function represents two conflicting requirements; we would like to compress the signal to the highest possible degree, but simultaneously we would like to have as small a reconstruction error as possible. These requirements are represented by two terms: the first term describes the reconstruction error, depending on the number of retained coefficients, and the second one is a penalty function, increasing with the number of retained coefficients:

MDL(N^{*}, f) = \min\left[\frac{3}{2}\,N\,\log(L) + \frac{L}{2}\,\log\left\|\left(I - \Theta^{(N)}\right)a_f\right\|^2\right] \quad \text{for } 0 \le N \le L \qquad (1)

where L denotes the length of the signal s, N denotes the number of non-zero elements in the vector a_f, f describes the filter, I is the identity operator (matrix), Θ^{(N)} is a thresholding operation which keeps the N largest (in absolute value) elements intact and sets all other elements to zero, and ||(I − Θ^{(N)})a_f|| represents the error between the original signal and the signal reconstructed with the N largest elements. The basic assumption of this approach, discussed in detail by Saito [13], is that for real signals the MDL cost function reaches a minimum, which indicates the optimal number of retained wavelet coefficients (Fig. 1). The optimal number of retained wavelet coefficients and the reconstruction error strongly depend on the applied filter. This means that the MDL cost function can be used for filter optimization. The filter for which the MDL achieves its minimal value, or the minimal number of retained wavelet coefficients, can be considered the optimal one for data compression.
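For an orthogonal filter the reconstruction error equals the energy of the discarded coefficients, so the MDL curve of Eq. (1) can be evaluated directly; a minimal sketch (PyWavelets assumed, 'db4' an arbitrary choice):

import numpy as np
import pywt

def mdl_best_n(x, wavelet='db4'):
    coeffs = pywt.wavedec(x, wavelet)
    a, _ = pywt.coeffs_to_array(coeffs)       # flatten the coefficient vector a_f
    L = a.size
    sq = np.sort(a**2)                        # squared magnitudes, ascending
    csum = np.concatenate(([0.0], np.cumsum(sq)))
    N = np.arange(L)                          # number of retained coefficients
    resid = csum[L - N]                       # energy of the discarded coefficients
    mdl = 1.5 * N * np.log(L) + 0.5 * L * np.log(resid)
    return int(mdl.argmin())                  # N at the minimum of Eq. (1)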
2.3 Data set (library) compression
While dealing with a data set we are interested in finding a joint basis (or joint best-basis) for compression of the whole set. Only a joint basis allows
Fig. 1 MDL cost function (1) versus the number of retained coefficients; the first term (2) and the second term (3) in Eq. (1).
uniform representation of all signals, which is of primary importance for further data processing. Compression of individual signals could lead to a better compression ratio for individual signals, but then different filters and different bases would be involved, and a uniform representation would not be possible.

2.3.1 Joint best-basis selection
The algorithm for fast approximate factor analysis proposed by Wickerhauser [14] can be used as a data compression algorithm. The main idea of this algorithm is as follows: each individual signal from the set of m signals is decomposed with the predefined filter to get a complete packet analysis. The coefficients of the resulting m binary trees are used to construct the binary tree of variance. The resulting variance tree is then searched for the best-basis according to the Coifman and Wickerhauser best-basis selection algorithm, as sketched below. The best basis of that variance tree is called the joint best-basis for the data set X in the wavelet packet library (for more details see Chapters 6 and 7 in the Theory part).
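A minimal sketch of this variance-tree search (PyWavelets assumed; the Shannon-type entropy cost and the 'db4' filter are illustrative choices, not prescribed by the text):

import numpy as np
import pywt

def entropy(v):
    p = v / (v.sum() + 1e-300)                # normalized "energy" distribution
    return -(p * np.log(p + 1e-300)).sum()

def joint_best_basis(X, wavelet='db4', maxlevel=4):
    # Full wavelet-packet decomposition of every signal in the set
    wps = [pywt.WaveletPacket(x, wavelet, maxlevel=maxlevel) for x in X]
    # Variance tree: coefficient variance across the set, node by node
    var = {}
    for level in range(1, maxlevel + 1):
        for node in wps[0].get_level(level, order='natural'):
            stack = np.array([wp[node.path].data for wp in wps])
            var[node.path] = stack.var(axis=0)
    # Coifman-Wickerhauser search: keep a node if it is cheaper than its children
    def search(path):
        if len(path) == maxlevel:
            return [path], entropy(var[path])
        la, ca = search(path + 'a')
        ld, cd = search(path + 'd')
        own = entropy(var[path])
        return ([path], own) if own <= ca + cd else (la + ld, ca + cd)
    basis_a, _ = search('a')
    basis_d, _ = search('d')
    return basis_a + basis_d                  # node paths of the joint best-basis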
2.4 Compression ratio
The speed of matching for a compressed library depends on the number N of retained coefficients. In the case of individual compression, the average number of wavelet coefficients is taken into account. The compression ratio can be calculated as:

CR = length of the initial signal / number of retained coefficients
2.5 Storage requirements
For a discussion of the storage requirements of applying wavelet methods to compress spectral libraries we ought to consider the following situations:

(A1) individual compression of signals by DWT (iDWT)
(A2) individual compression of signals by WPT (iWPT)
(A3) compression of signals in the joint basis by DWT (JB-DWT)
(A4) compression of signals in the joint best-basis by WPT (JBB-WPT)
Fig. 2 shows the appropriate block diagram for each. When using DWT (A1) or WPT (A2), which treat each signal individually, we need to store m times N wavelet coefficient values, their addresses, and the number of the selected filter, F. Additionally, in the A2 approach, the address of each best-basis needs to be stored as well (A1: matrices R, A and F; A2: matrices R, A, B and F). For the A3 approach one has to store m times N wavelet coefficient values and a vector containing their addresses, i.e. A3: matrix R and vector A. For the A4 approach we need to store m times N wavelet coefficient values, a vector containing their addresses, and a vector containing the addresses of the joint best-basis, i.e. A4: matrix R, and vectors A and B. Both A and B might be stored in a compact, binary form, as we only need to know whether a wavelet basis component is being used or not. That is, one bit
Fig. 2 Schematic representation of the storage requirements for the approaches A1-A4, as described in the text.
will suffice, cutting the storage space needed for A and B by a factor of eight. For storing 1024 addresses (A) we need m times 128 bytes (1024/8), which corresponds to 16 double-precision floating-point numbers. The effective compression ratio can be calculated as:
ECR = number of bytes before compression / number of bytes after compression
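The one-bit-per-address bookkeeping is easy to reproduce; a short sketch (NumPy, with an arbitrary mask of 120 retained positions):

import numpy as np

# Boolean mask marking which of the 1024 basis positions are retained
rng = np.random.default_rng(1)
mask = np.zeros(1024, dtype=bool)
mask[rng.choice(1024, size=120, replace=False)] = True

packed = np.packbits(mask)                     # 128 bytes instead of 1024
restored = np.unpackbits(packed).astype(bool)
assert (restored == mask).all()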
2.6 Matching criteria
A normalized scalar product of two spectral vectors was used as the similarity measure during matching, and sequential searches through the entire library were always performed. Thus, each spectrum in turn was treated as a query. To simulate small variances in data acquisition and/or spectral differences between very similar compounds, 1% or 5% of random white noise was added. The appropriate decomposition (wavelet or PCA) was then performed, and the resulting vector was compared to each of the spectral vectors in the compressed library.
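A hedged sketch of this search loop (NumPy; random data stands in for the compressed coefficient vectors):

import numpy as np

rng = np.random.default_rng(0)
library = rng.standard_normal((3339, 120))     # compressed coefficient vectors
query = library[42].copy()
query += 0.01 * query.std() * rng.standard_normal(query.size)   # 1% white noise

# Normalized scalar product (cosine similarity) against every library entry
L = library / np.linalg.norm(library, axis=1, keepdims=True)
scores = L @ (query / np.linalg.norm(query))
best = int(scores.argmax())                    # index of the best match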
2.7 The data

An IR library containing both closely related and quite different spectra of 3339 organic compounds has been used. The range of 4000-416 cm-1 was converted into 896 bands of 4 cm-1, and augmented to the nearest power of 2, that is, to 1024 wavelengths. The histogram of the RMS pair-wise distances for the whole set of 3339 spectra is shown in Fig. 3. To illustrate the spectral variation in another way, the 'variance spectrum' of the studied library is presented in Fig. 4.
Fig. 3 The histogram of the RMS pair-wise distances for the whole set of 3339 spectra.
3 Results and discussion
There are a few possible strategies of library compression, each with its own advantages and drawbacks. The most efficient method of data set compression, i.e. Principal Component Analysis (PCA), leads to the use of global features. As demonstrated in [15], global features such as PCs (or Fourier coefficients) are not best suited for calibration or classification purposes. Often, quite small, well-localized differences between objects determine the very possibility of their proper classification. For this reason wavelet transforms seem to be promising tools for the compression of data sets which are meant to be further processed. However, even if we limit ourselves to wavelet transforms, the problem remains of selecting the approach optimal for a particular purpose. There is no single method which fulfills at once all the requirements associated with spectral library compression. Here we present a systematic comparison of different methods. The approaches A1-A4 above were applied to library compression using 21 filters (9 filters from the Daubechies family, 5 Coiflets and 7 Symmlets, denoted, respectively, as filters Nos. 2-10, 11-15 and 16-22).

3.1 Principal component analysis applied to IR data compression

The most efficient method of data set compression in the joint basis is Principal Component Analysis (PCA). Principal Components (PCs) are constructed as linear combinations of the original variables so as to maximize the description of the data variance. They are eigenvectors of the auto-covariance matrix of the data set. Each eigenvector is associated with a corresponding eigenvalue, which describes its importance in the description of the data variance. For the studied IR library, 57 eigenvectors (principal components) are necessary to describe 95% of the data variance, whereas as many as 109 eigenvectors are needed to describe 99% of the data variance (see Fig. 5).
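The eigenvalue bookkeeping behind these counts can be sketched as follows (the function name is ours; 57 and 109 are properties of this particular library, not of the code):

import numpy as np

def n_pcs_for_variance(X, fraction):
    # eigenvalues of the auto-covariance matrix via SVD of centred data
    Xc = X - X.mean(axis=0)
    eig = np.linalg.svd(Xc, compute_uv=False) ** 2
    cum = np.cumsum(eig) / eig.sum()
    # number of PCs needed to reach the requested variance fraction
    return int(np.searchsorted(cum, fraction)) + 1

# for the studied IR library: n_pcs_for_variance(X, 0.95) -> 57,
# n_pcs_for_variance(X, 0.99) -> 109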
Fig. 4 'Variance spectrum' of the IR library.
The mean value of RMS calculated for the original spectra and the spectra reconstructed with 57 and 109 principal components equals 0.0298 and 0.0127, respectively. These results are included to give the reader some indication of the possible compression ratio when using global signal features.

3.2 Individual compression of IR spectra in wavelet domain
The spectra contained in any digital library differ to a great extent. In practice this means that the filter optimal for compression of one spectrum is not optimal for compression of another. The efficiency of compression of individual spectra strongly depends on the applied threshold criterion as well. The choice of threshold criterion ought to be problem dependent, but very often the problem at hand involves conflicting demands. Sometimes we would like to have the possibility of good spectra reconstruction, e.g. for visualization. For this purpose the Visu threshold criterion is best suited, as it leads to a small error of spectra reconstruction. But a small reconstruction error is inherently associated with a high number of retained wavelet coefficients. When DWT with the Visu criterion is applied to the studied data set, the average number of retained coefficients equals 257, i.e. the achieved compression ratio of 1024/257 = 3.9844 is not very high. One can, of course, compress the data to a desired ratio, simultaneously optimizing the filters to achieve the smallest RMS error of spectra reconstruction. For the studied data set, if the desired compression ratio equals 10, the mean value of the RMS error of spectra reconstruction equals 0.2, which is too high for visualization and spectra matching. A compromise between these two conflicting demands, i.e. good reconstruction and a high compression ratio, can be achieved using the MDL criterion. MDL takes into account the different noise levels of the individual spectra (similarly to the Visu threshold), and seems the most appropriate for the individual compression of spectra.
Fig. 5 Percentage of the explained variance versus the number of PCs.
DWT with the MDL criterion allows data compression using 144.6 coefficients (mean value) with a mean RMS equal to 0.0024. Some additional calculations were performed for comparison's sake, using the RMS error of spectral reconstruction as the criterion. In particular, each spectrum was compressed to the degree which allows its reconstruction with RMS = 0.0127. This value corresponds to the mean value of RMS observed for the data compressed by PCA, or in the joint basis accounting for 99% of the data variance. DWT with the RMS criterion (where RMS = 0.0127) leads to 66.8 wavelet coefficients. For illustrative purposes, an original spectrum and the spectrum reconstructed with RMS = 0.0127 are presented in Fig. 6. Using DWT with different threshold criteria, and optimizing the filters for each spectrum individually, we can construct histograms of filter frequencies, illustrating how often a particular filter was selected as the optimal one. As shown in Fig. 7, the histogram profiles strongly depend on the applied criterion. The highest differentiation of filter frequencies is observed for the criterion CR. For more than half of the spectra, filter No. 4 was found to be the optimal one, and many filters were never selected at all. For the Visu and MDL criteria, the only filter never selected was filter No. 13. For the Visu criterion, Symmlet filters dominate over the other filters, whereas for MDL, the Daubechies filter No. 2 dominates to a high degree over the remaining filters. In the case of the RMS criterion, both Coiflets and Symmlets (except filter No. 13) dominate over the Daubechies filters. The Visu and CR criteria were used for illustrative purposes only, i.e. in the further discussion only the MDL and RMS criteria will be considered.
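For illustration, individual DWT compression with the Visu (universal) threshold can be sketched as follows. PyWavelets is assumed, and the noise level is estimated from the finest-scale details, a common convention rather than the chapter's exact recipe:

import numpy as np
import pywt

def visu_compress(x, wavelet='db4'):
    coeffs = pywt.wavedec(x, wavelet)
    # robust noise estimate from the finest detail coefficients
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))     # Visu threshold
    kept = [coeffs[0]] + [pywt.threshold(c, thr, mode='hard')
                          for c in coeffs[1:]]
    n_retained = sum(int(np.count_nonzero(c)) for c in kept)
    return pywt.waverec(kept, wavelet), n_retained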
Fig. 6 (a) Original spectrum, and (b) the spectrum reconstructed with RMS = 0.0127.
Fig. 7 Histograms of filters for DWT of the IR library with different threshold criteria (Visu, RMS, MDL, CR = 10).
Both these criteria were applied with DWT and WPT, and the results of these approaches are summarized in Table 1, and presented in Figs. 8 and 9. Let us have a closer look at these results. For DWT and MDL, filter No. 2 is selected most often as the optimal one. The RMS error calculated for the optimal filter varies in the range from 0.0 to 0.0264, whereas the number of retained coefficients changes from 49 to 364. For WPT, filter No. 2 is also selected most often as the optimal one; the RMS error varies from 0.0004 to 0.0514, and the number of retained coefficients varies from 56 to 303. As one could expect, the compression observed for WPT is somewhat better than the compression for DWT, and equals 7.6418 and 7.0621, respectively. Nevertheless, WPT requires additional storage space for the best-basis addresses (see Fig. 2). The total number of bytes required for the compressed library equals m times (145 + 16) for DWT, and m times (134 + 16 + 32) for WPT. Thus, the effective compression ratio equals 6.3602 for DWT, and 5.6264 for WPT. So, although the average number of retained wavelet coefficients for WPT is about 10% lower than for DWT, the memory requirements are lower for DWT, as there is no need to store best-basis addresses for each spectrum. For DWT and WPT with the RMS criterion, the profiles of the filter histograms are also quite similar. In the case of DWT, filter No. 17 is selected most often, followed by filter No. 12. For WPT this order is reversed. In both cases the compression ratio is very high, and equals 15.3248 for DWT and 15.6575 for WPT.
Table 1. Results of the library compression using the iDWT and iWPT approaches, and the MDL and RMS criteria; mean, minimal (min) and maximal (max) values, and standard deviation (std) of the number of retained coefficients (N) and of the Root Mean Squares error (RMS).

              N                          RMS
              Mean   Min  Max  Std       Mean    Min     Max     Std
iDWT-MDL      144.6  49   364  30.2      0.0024  0.0000  0.0264  0.0026
iWPT-MDL      133.8  56   303  24.3      0.0057  0.0004  0.0514  0.0043
iDWT-RMS      66.8   12   186  22.7      0.0126
iWPT-RMS      65.4   12        22.3      0.0126
Fig. 8 Results of the individual compression of 3339 spectra using the MDL criterion for DWT and WPT, presented as frequency histograms for (a) filters selected as the optimal ones, (b) number of the retained coefficients, and (c) RMS error of spectral reconstruction.
Fig. 9 Results of the individual compression of 3339 spectra using the RMS criterion for DWT and WPT, presented as frequency histograms for (a) filters selected as the optimal ones, and (b) number of the retained coefficients; RMS = 0.0126.
This small difference in compression ratios (ca. 2%) leads to a larger difference in the effective compression ratios (due to the need for best-basis address storage), which equal 12.3642 and 9.0300 for DWT and WPT, respectively. To summarize, the MDL criterion is the objective one, whereas the RMS criterion is rather arbitrary, although quite acceptable for visualization purposes and for spectra matching. The main advantage of individual compression is that there is no need to update the compressed library: new spectra can be individually compressed and easily added to the library at any time. The main drawback of this approach is that the compressed spectra do not have a uniform representation, which means there is no easy way to use them for further data processing. Another disadvantage is that each additional new spectrum must be decomposed using all filters, and the matching procedure is complicated. So although a high compression ratio is achieved, the matching procedure is rather time-consuming.

3.3 Joint basis and joint best-basis approaches to data set compression

In the discussed approach to data set compression one needs to predefine the percentage of variance to be preserved by the retained coefficients (or, alternatively, the average RMS of data reconstruction). We performed all calculations for 99% of the variance. The best-basis was calculated for all studied filters using entropy as the basis selection criterion. As expected, the number N of retained wavelet coefficients depends on the applied filter, and varies from 139 to 177. The worst compression is obtained with filter No. 2, and the best with filter No. 15 (see Fig. 10). Final results for the optimal filter (No. 15) are presented in Table 2. It can be noticed that there is no big difference between the results for DWT and WPT. The histograms of filter frequencies and of the RMS errors of spectra reconstruction are presented in Fig. 10. The histogram profiles are very similar for both transforms. This similarity is associated with the fact that the selected best-basis is very similar to the DWT basis (see Fig. 11), and the coefficients of the variance tree are similar to the squared coefficients of the DWT (see Fig. 12(a)). The cumulative percentage of variance for DWT and WPT, presented in Fig. 12(b), can be compared with the analogous figure for PCA compression (Fig. 5). In the discussed approach the compressed data set has a uniform representation (see Fig. 2). The compression ratio equals 7.0621 for DWT, and 7.3669 for WPT. Both the joint basis and the joint best-basis minimize the storage requirements.
Fig. 10 (a) Number of the retained coefficients versus the filter number, and (b) histograms of the RMS error of spectra reconstruction in the optimal bases: JB for DWT and JBB for WPT.
Table 2. Results of the library compression using JB99 and JBB99 approaches; filter No. 15.

         N     RMS
               Mean    Min     Max     Std
JB99     145   0.0124  0.0033  0.0543  0.0064
JBB99    139   0.0126  0.0023  0.0510  0.0068
Fig. 11 Best-basis selected for the variance tree.
The effective compression ratio is almost the same as the compression ratio, and equals 7.0618 and 7.3667 for DWT and WPT, respectively. This means that using joint bases one cannot compress the data set to the same degree as with individual compression, but this approach offers compression comparable with that of PCA, while at the same time allowing one to work with local spectral features.
3.4 Matching performance

The IR library, compressed according to the discussed approaches, was tested for matching performance. For this purpose, the spectra with 1% and 5% of random white noise added (see Fig. 13) were decomposed and represented in the same basis as the compressed library, and then each of them was matched against all spectra from the compressed library. The same operation was performed for the data set in the original domain. Matching performance, expressed as the percentage of mismatch, was the same for all compressed data sets as for the original data.
Fig. 12 (a) The top 200 coefficients (squared) in the joint basis, and the top 200 elements of the 'variance tree' in the joint best-basis, sorted according to their amplitude, and (b) cumulative percentage of the explained variance in the joint basis and in the joint best-basis.
For the data with 1% noise added, the percentage of mismatch equals zero in both domains (original uncompressed, and wavelet). For the data with 5% noise, the percentage of mismatch equals 0.8985%, again the same in both domains. This means that the compression performed does not deteriorate the matching performance observed for the data in the original (uncompressed) domain. These results encouraged us to further reduce the number of wavelet coefficients used for spectra matching. To this end, the matching performance using different numbers of wavelet coefficients was studied for the library compressed with the JBB approach. The results are presented in Fig. 14. As one can notice, when decreasing the number of wavelet coefficients from 139 to 1 we observe at first a slight deterioration of the matching performance, and then a very rapid increase in the percentage of mismatch. For the data with 1% noise, this rapid increase of mismatch is observed at the level of 4 wavelet coefficients, whereas for the data with 5% noise it is observed already at the level of 10 coefficients. The percentage of mismatch for the data with 1% noise added, using 5 wavelet coefficients, equals 0.3314%. For the data with 5% noise, using 5 coefficients, this percentage equals 16.3905%.
Fig. 13 (a) Original spectrum, and spectra with (b) 1% and (c) 5% of noise.
Fig. 14 The percentage of mismatch for the data with (a) 1% and (b) 5% additional noise, observed for different numbers of wavelet coefficients for data compressed by the JBB approach; the horizontal line in (b) represents the percentage of mismatch observed for the data in the original domain.
For comparison purposes, the same study was performed using PCs. To reach the percentage of mismatch achieved with 5 wavelet coefficients from JBB, as many as 17 PCs are required for the data with 1% of additional noise. In contrast, for the data with 5% noise exactly the same percentage of mismatch was achieved with 50 wavelet coefficients as with 50 PCs. The actual percentage of mismatch, based on the top five wavelet coefficients, can be further reduced if the following modification is introduced into the matching procedure (see the sketch below):

1. match the unknown spectrum with all the spectra in the library using the 5 top wavelet coefficients and select the 5 most similar spectra;
2. perform matching of the unknown spectrum with these 5 most similar spectra using 139 wavelet coefficients.

It ought to be pointed out that, whichever matching criterion is used, it cannot distinguish between the situation when a given distance minimum is a sum of many very small differences, and the contrary situation when this minimum distance arises from a single larger difference, or perhaps a few larger differences. Alternatively, the actual decision can be made by an interactive user, based on additional criteria particular to that user's needs, which are difficult to define precisely 'once and for all'. To summarize, when individual compression is applied, all the retained coefficients are used both for spectra matching and for spectra reconstruction. In the case of the proposed approach, two sets of wavelet coefficients have to be taken into account (Fig. 15). The first set, denoted as R1, is required for spectra matching, whereas the second set, denoted as R2, is necessary for spectral reconstruction. A new, unknown signal has to be decomposed using filter F, and represented by the set of wavelet coefficients in the compressed basis.
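A sketch of this two-step search (function and variable names are ours; q_small/lib_small hold the 5 matching coefficients, q_full/lib_full all 139):

import numpy as np

def two_step_match(q_small, q_full, lib_small, lib_full, k=5):
    # step 1: rough ranking with the few matching coefficients
    s1 = (lib_small @ q_small) / (np.linalg.norm(lib_small, axis=1)
                                  * np.linalg.norm(q_small))
    shortlist = np.argsort(s1)[-k:]
    # step 2: re-rank only the k candidates with all coefficients
    s2 = (lib_full[shortlist] @ q_full) / (
        np.linalg.norm(lib_full[shortlist], axis=1) * np.linalg.norm(q_full))
    return int(shortlist[np.argmax(s2)])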
Fig. 15 Schematic representation of the storage requirements for the proposed approach.
The matching procedure is performed based on the set R1 of wavelet coefficients. This set contains all the necessary information about the data diversity. Only when the matching procedure indicates the most similar spectrum in the library are the remaining wavelet coefficients (R2) used for the reconstruction of this spectrum. This means that the spectrum found to be the most similar to an unknown one is reconstructed using the sets R1 and R2 of wavelet coefficients (R = R1 + R2). For the studied IR library, R1 contains only 5 wavelet coefficients. For any other data set the proper number of coefficients in R1 has to be estimated.
4 Conclusions
Compression and effective compression ratios for the approaches studied are summarized in Table 3. Obviously, no single approach fulfills all the requirements associated with library compression. Taking into account all aspects of library compression and searching, we have to find a compromise among the different requirements. This means we should base our final choice of the optimal strategy on the criteria described above, weighting their respective importance. The highest compression and effective compression ratios are achieved with individual compression of spectra. This approach, however, does not allow a uniform data representation for further processing, and requires a time-consuming matching procedure.
Table 3. Compression ratios, and effective compression ratios for the studied approaches.

Approach    Compression ratio    Effective compression ratio
iDWT/MDL    7.0621               6.3602
iWPT/MDL    7.6418               5.6264
iDWT/RMS    15.3248              12.3642
iWPT/RMS    15.6575              9.0300
PCA99       9.3945               7.1896
JB99        7.0621               7.0618
JBB99       7.3669               7.3661
If we assume that the most important considerations are the performance of the matching procedure, the matching speed, the storage requirements, and a uniform representation of the data in the form of local features, then the JBB approach is the optimal one. The joint best-basis, estimated for the 'variance tree', allows data compression comparable with PCA. Taking into account the fact that only a small subset of the retained coefficients is sufficient for the matching procedure (for the studied data set this subset contains five wavelet coefficients only), this approach seems the most interesting for spectral library compression, reconstruction and matching. For the studied data set there is no significant difference between the DWT and WPT approaches, but in general the compression in a WPT basis is much better than in a DWT basis. The only disadvantage of this approach is the necessity of updating the library, i.e. all calculations have to be repeated. This is, however, a common feature of both PCA and the wavelet-based decompositions.
References
1. R.D. Cramer, D.E. Patterson and P. Hecht, Discovery and lead refinement using Chemspace (TM), Abstracts of Papers of the American Chemical Society, 215: 016-COMP, Part 1, April 2, 1998; Virtual compound libraries: a new approach to decision making in molecular discovery research, J. Chem. Inf. Comp. Sci. 38 (1998), 1010-1023.
2. A.M. Ferguson, T. Heritage, P. Jonathon, S.E. Pack, L. Phillips, J. Rogan and P.J. Snaith, EVA: A new theoretically based molecular descriptor for use in QSAR/QSPR analysis, J. Comp. Aid. Mol. Des. 11 (1997), 143-152.
3. T.W. Heritage, A.M. Ferguson, D.B. Turner and P. Willett, EVA: A novel theoretical descriptor for QSAR studies, Perspectives in Drug Discovery and Design 9-11 (1998), 381-398.
4. C.M.R. Ginn, D.B. Turner, P. Willett, A.M. Ferguson and T.W. Heritage, Similarity searching in files of three-dimensional chemical structures: Evaluation of the EVA descriptor and combination of rankings using data fusion, J. Chem. Inf. Comp. Sci. 37 (1997), 23-37.
5. I. Daubechies, Orthonormal bases of compactly supported wavelets, Comm. Pure Applied Math. XLI (1988), 909.
6. S.G. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (7) (1989), 674.
7. M.A. Cody, Dr. Dobb's Journal 17 (1994), 16-28.
8. R.R. Coifman and M.V. Wickerhauser, Entropy-based algorithms for best basis selection, IEEE Transactions on Information Theory 38 (1992), 713-719.
9. D. Donoho, De-noising by soft-thresholding, IEEE Transactions on Information Theory 41 (1995), 613-627.
10. J. Rissanen, A universal prior for integers and estimation by minimum description length, Annals of Statistics 11 (1983), 416-431.
11. J. Rissanen, Universal coding, information, prediction, and estimation, IEEE Transactions on Information Theory 30 (1984), 629-636.
12. J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific, Singapore, 1989.
13. N. Saito, Simultaneous noise suppression and signal compression using a library of orthonormal bases and the minimum description length criterion, available via internet.
14. B. Walczak and D.L. Massart, Noise suppression and signal compression using wavelet packet transform, Chemometrics and Intelligent Laboratory Systems 36 (1997), 81-94.
15. W. Wickerhauser, Adapted Wavelet Analysis from Theory to Software, A.K. Peters, 1994.
16. B. Walczak, B. van den Bogaert and D.L. Massart, Application of wavelet packet transform in pattern recognition of NIR data, Analytical Chemistry 68 (1996), 1742-1747.
17. G. Buckheit, S. Chen, J. Crutchfield, D. Donoho, H. Gao, I. Johnstone, E. Kolaczyk, J. Scargle, K. Young and T. Yu, Wavelab, http://playfair.Stanford.edu/wavelab/, 1996.
CHAPTER 14
Application of the Discrete Wavelet Transformation for Online Detection of Transitions in Time Series
M. Marth
Freiburg Materials Research Center FMF, University of Freiburg, Germany
1 Introduction

Wavelets are very useful for applications that require local and multi-scale information. In this chapter an application of the discrete wavelet transform is discussed where these properties are used for the online detection of transitions in time series. The method, named Early Transition Detection, is demonstrated on data from a chemical sensor array. Wavelets possess two properties that make them especially valuable for data analysis: they reveal local properties of the data and they allow multi-scale analysis. Their locality is useful, e.g., for applications that require an online response to changes. If the typical time scales of these changes are not known in advance, a multi-scale approach is advantageous. In this chapter it will be demonstrated how wavelets can be used in data analysis by discussing a specific data set from a chemical sensor array. The data is a time series where each point in time belongs to a certain unknown class and needs to be classified. It will be discussed what problems arise when a common classifier like SIMCA [1] or Nearest Neighbour [2] is used. Further, it will be shown that an extended classifier named Early Transition Detection (ETD) [3] can be used to overcome these problems. For the construction of the ETD classifier the Discrete Wavelet Transform (DWT) is used. It will be shown that the DWT provides an expedient tool to solve this problem.
2 Early transition detection
Early Transition Detection is an extension of common static classification methods like SIMCA or Nearest Neighbour that enables these methods to process time-dependent data in some cases. ETD is useful, e.g., in the following situation.
Consider a time-dependent Markov process Y(t) (see e.g. [4]) that can take R different discrete values or classes, i.e. y ∈ [1, R]. Let this process be measured with p different sensors at equidistant points in time t_i, so that a p-dimensional time series x(t_i) is obtained. From the sensor values x(t_i) the process value y(t_i) is to be estimated. An example of such a problem is the classification of the quality of the air streaming from the environment into a car: through an opened ventilation flap, environmental air can enter the inside. If the quality of the incoming air becomes bad, e.g. because a tunnel or a narrow busy street is entered, it would be advantageous to close the ventilation flap automatically. In order to check how such a system could be realized, the sensor system KAMINA [5,6] was used to measure the quality of the incoming air. The KAMINA system consists of 40 gas-sensitive tin dioxide semiconductor sensors. Contact with gases leads to a change in conductivity that enables a measurement of gas concentrations. This sensor system was installed in a car and a test drive was undertaken. During this test the sensor signals x(t) and the air quality time series y(t) were measured. As y(t) is usually not known, it was examined whether these values could be estimated from x(t). The estimates were compared with the known values y(t). In Fig. 1 a part of the time series y(t) is shown (solid line). There are two classes (i.e. R = 2): 'good air' (y = 2) and 'bad air' (y = 1). The depicted occurrences of bad air resulted from driving closely behind another car. In Fig. 2 the corresponding time series x(t) is displayed for 10 of the 40 very collinear sensor signals. It can be observed that the occurrence of bad air leads to a decrease of the sensor signals, which enables the detection of bad air. However, it can also be seen that the sensors exhibit a delayed behaviour after transitions (the points in time when the value of y(t) changes). This is especially easy to recognize at the transition at t ≈ 100. Hence, after transitions the ventilation flap gets closed too late. This problem is solved by ETD. How ETD works, and where the wavelets come in, will be described after a quick summary of classification methods. There are a number of different well-known classification methods in chemometrics, e.g. SIMCA [1], DASCO [7], Linear Discriminant Analysis (LDA) [8], Nearest Neighbour [2], ALLOC [9], or Artificial Neural Nets (ANN) [10]. These classification methods (from now on denoted static classification methods) share some similarities [8,11,12].
Fig. 1 y(t) (solid line), ŷ(t) estimated with SIMCA alone (dotted line), and ŷ(t) estimated with ETD and SIMCA (dashed line). The values of the dashed line and the dotted line are slightly shifted upwards to enhance readability. One measurement was taken every second. The estimate without ETD shows a delay and consequently wrong classifications after transitions (at t ≈ 100 s and t ≈ 270 s). The estimate with ETD reduces that delay; however, some wrong transitions (at t ≈ 240 s) are predicted.
The measurement vector x(t_i) is, in all static classification methods, mapped from the p-dimensional sensor space onto R different numbers g(x(t_i), 1), ..., g(x(t_i), R):

g(x(t_i), r) = p_r(x(t_i)),   (1)
where the functions p_r(x(t_i)) characterize how likely it is that measurement x(t_i) results from y(t_i) having taken the value r. The different static classification methods differ in the way this mapping is performed.
Fig. 2 Some of the very collinear sensor signals. The points in time where the value of y(t) changed are marked by the vertical lines. The sensor signals are electric resistances.
Often (e.g. in SIMCA, DASCO, or LDA) it is assumed that measurements that belong to some class r (i.e. where y = r) are normally distributed with covariance Σ_r around the class mean x_r^mean, i.e. p_r(x) ∝ exp(-(x - x_r^mean)' Σ_r^(-1) (x - x_r^mean)). In Fig. 3 this is schematically illustrated for p = R = 2. For some static classification methods p_r(x) can be interpreted as a probability density. Class affiliation, i.e. the value of Y, is in all methods estimated as

ŷ = arg max_r g(x(t_i), r).   (2)
As supervised methods are considered in this chapter (see [13] for applying ETD to unsupervised classification problems), it is assumed that time series are available for the calibration of the static classification method. For these data the class affiliation of each measurement is known, and it is further supposed that no transitions occur during the calibration measurements. Applying, for example, SIMCA classification to the data displayed in Fig. 2 results in the time series g(x(t), r) (with r = 1, 2) displayed in Fig. 4. The corresponding estimate gained from Eq. (2) is displayed in Fig. 1 (dotted line). During the times when the process is in a stable state the classification is correct. However, during transitions wrong classifications occur, because the values of the sensor signals (and therefore also the values g(x(t), r)) converge only slowly to their new values. The estimated value of y(t), however, changes only as soon as the corresponding time series of g(x(t), r) cross (see Eq. (2)).
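Under the normality assumption, a bare-bones static classifier can be written as below. This is a sketch only: SIMCA itself builds per-class principal component models, which are replaced here by full-covariance Gaussian scores; the function names are ours.

import numpy as np

def fit_classes(X_cal, y_cal, R):
    # class means and (pseudo-)inverse covariances from calibration data
    stats = []
    for r in range(1, R + 1):
        Xr = X_cal[y_cal == r]
        stats.append((Xr.mean(axis=0), np.linalg.pinv(np.cov(Xr.T))))
    return stats

def classify(x, stats):
    # Eq. (1): g(x, r) ~ p_r(x); Eq. (2): arg max over r (classes 1..R)
    g = [np.exp(-(x - mu) @ P @ (x - mu)) for mu, P in stats]
    return int(np.argmax(g)) + 1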
Fig. 3 Schematic drawing of a classification problem with p = 2 and R = 2. The two possible classes are illustrated by a contour line representing their corresponding probability densities p_r(x) and their class means (black dots). Some example measurements are marked by crosses and their chronological order is indicated by the dotted line. At first the measurements remain well within the area given by p2(x). After a while the measurements start to systematically approach the area given by p1(x), which leads to a systematic decrease of g(x,2) and increase of g(x,1), respectively.
time[s] Fig. 4 The time series g ( x ( t ) , l ) (dashed line) and g ( x ( t ) , 2 ) (solid line) j o t tile data shown in Fig. 2. The points in time where the value o / y ( t) changed are marked b)' the vertical lines. Especially at t ~ 100 s and t ~ 270 s it can be observed that the sensor signals and correspondingly the vahws o f g converge onh' slowly to their new vahws and therefore lead to a delayed behaviour o f static classifiers.
Therefore, the delayed behaviour of g(x(t), r) results in a delayed detection of transitions when static classification methods are applied. The problem these delays cause is solved by ETD. The idea of ETD is as follows: during a transition the maximum of g(x(t), r) with respect to r will decrease, whereas the value of g(x(t), r) for another r will increase. The requirements for a quantification of a simultaneous occurrence of these two changes are at least:

• In order to detect changes online, a method local in time is needed.
• As transitions might occur on different time scales, a multi-scale approach is needed, i.e. it is necessary to be able to cope with slow and fast transitions while not interpreting noise as a transition.

Both requirements are fulfilled by wavelets. It will be demonstrated in the next section how to apply wavelets to solve this problem.
3 Application of the DWT
The discrete wavelet transform is a transformation that transforms data, e.g. a time series, of length 2^J into 2^J wavelet coefficients d_j (j denotes the resolution or level, j = 0, ..., J). For a DWT down to level j, the first 2^j coefficients d_{j,k} (k = 1, ..., 2^j) are the approximation coefficients, whereas the remaining d_{j,k} (k = 2^j + 1, ..., 2^J) are the detail coefficients at the different levels (see also Fig. 5).
Fig. 5 Schematic drawing of the DWT for a time series of length 2^J. The locations of the coefficients d̃ that are related to the most recent values of the time series are marked by the arrows at the bottom.
The coefficients d_{j,k} with k = 2^j + 1, ..., 2^(j+1) are the details that belong to level j, the coefficients d_{j,k} with k = 2^(j+1) + 1, ..., 2^(j+2) are the details that belong to level j + 1, and so on. The first values of the index k belong to the first points of the time series, the last values of the index k to the last points of the time series. Let us consider the discrete wavelet transform with the Haar wavelet. If the DWT with the Haar wavelet is applied to a time series, the detail coefficients supply information about the temporal change of the time series. The detail coefficients of different levels correspond to changes on different time scales. Hence, these coefficients may serve as a measure of change of a time series. If a transition occurs in a time series, the detail coefficients take values that are larger than the values they take during stable states. Below it is described how this property is used for the detection of transitions. Since online applications are considered in this chapter, only the most recent development of a time series is of interest for our problem. This means that for a DWT down to level j the detail coefficients that measure the most recent change are the coefficients d_{j,k} where k takes the values 2^(j+1), 2^(j+2), ..., 2^J. These are the detail coefficients of the different levels that measure the most recent change of the time series (k = 2^j corresponds to the last approximation coefficient; a time series of length 2^J is assumed).
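With PyWavelets (assumed here), the most recent detail coefficient of each level can be pulled from the current data window as follows:

import numpy as np
import pywt

def recent_details(window, J):
    # Haar DWT of the last 2**J samples; wavedec returns
    # [cA_J, cD_J, cD_(J-1), ..., cD_1], and the last entry of each
    # detail array belongs to the newest part of the window
    data = np.asarray(window[-2 ** J:], dtype=float)
    coeffs = pywt.wavedec(data, 'haar', level=J)
    return np.array([cD[-1] for cD in coeffs[1:]])  # one value per level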
Note that if transitions of a maximum length of 2^J points in time are considered, for online analysis it is only necessary to calculate the DWT for the latest 2^J measurements, as this is the maximum number of measurements necessary for J applications of the DWT (the detail coefficients that correspond to the largest time scales will then consist of only one number). Hence, a window of length 2^J is moved across the data, the right edge of which is the last measurement. For simplicity of notation these coefficients will in the following be denoted d̃_h with h = 1, ..., J (see also Fig. 5). If these coefficients result from an application of the DWT to the corresponding window of a time series z(t), they shall be denoted d̃_1(z(t)), ..., d̃_J(z(t)). As already stated above, a transition is characterized by the simultaneous increase of one value g(x(t), r1) and decrease of another value g(x(t), r2). Assume that the process is in a stable state r1 and a transition from this state to state r2 occurs on a time scale characterized by the wavelet level h. If a wavelet like the Haar wavelet, which characterizes change, is used, such a transition should lead to significantly large (negative) values of the product d̃_h(g(x(t), r1)) · d̃_h(g(x(t), r2)). This is because the decreasing/increasing time series leads to large negative/positive values of d̃_h(g(x(t), r1/2)). Therefore, this is a feature that can be used to detect transitions. As it is usually not known in advance on what time scales transitions can occur, in general an interval of values h has to be considered. If the window on the data is chosen such that a DWT up to level J can be performed, there are J different detail coefficients/levels d̃_1, ..., d̃_J. Therefore two values are to be determined: the largest expected time scale (which corresponds to choosing the value of J, as the data window is of size 2^J) and the time scale L of the shortest transitions (which corresponds to considering not all values d̃_1, ..., d̃_J, but a reduced set d̃_1, ..., d̃_L with L ≤ J). The value of J that characterizes the largest expected time scale is not critical and should be chosen just large enough to allow for the detection of transitions with very large time scales. The value of L that characterizes the smallest expected time scale, however, is critical, as choosing a value too large would result in noise being interpreted as a transition, whereas choosing it too small would worsen the detection of fast transitions. L should be determined from the calibration data, where no transitions but only noise occur. The value has to be chosen as large as possible, but small enough that no transition is detected on the calibration data.
Another quantity that has to be determined from the calibration data is the typical magnitude s^h_{r1,r2} of the noise-induced fluctuations of the products d̃_h(g(x(t), r1)) · d̃_h(g(x(t), r2)) when the process is in state r1. A typical choice is s^h_{r1,r2} = 3 · σ^h_{r1,r2}, where σ^h_{r1,r2} denotes the standard deviation of this product when the process is in stable state r1. It is necessary to determine these values in order to normalize the noise magnitudes on the different levels and thereby make the values of d̃_h(g(x(t), r1)) · d̃_h(g(x(t), r2)) comparable for different h. These considerations allow the construction of an extension of static classification methods for dynamic data. At each point in time the values g(x(t), r) are calculated with a static classification method (see Eq. (1)). From g(x(t), r) an estimate of y(t) can be gained using Eq. (2). However, in order to allow for corrections due to possible transitions, this estimate is taken only as a preliminary estimate

ŷ⁰(t_i) = arg max_r g(x(t_i), r).   (3)
This estimate should be accepted if the process is in a stable state. In order to check whether a transition is occurring, the time-dependent vector q(t) is inspected. This vector is constructed according to the considerations outlined above:

q(t) = ( -d̃_L(g(x(t), ŷ⁰)) · d̃_L(g(x(t), 1)) / s^L_{ŷ⁰,1} ,
         ...
         -d̃_L(g(x(t), ŷ⁰)) · d̃_L(g(x(t), R)) / s^L_{ŷ⁰,R} ,
         ...
         -d̃_1(g(x(t), ŷ⁰)) · d̃_1(g(x(t), 1)) / s^1_{ŷ⁰,1} ,
         ...
         -d̃_1(g(x(t), ŷ⁰)) · d̃_1(g(x(t), R)) / s^1_{ŷ⁰,R} )   (4)

The first R components of q detect transitions towards 1, ..., R on the level that corresponds to fast transitions, i.e. h = L. The next R components detect transitions on level h = L - 1, and so on. A transition leads to a component larger than 1. The final estimate ŷ(t_i) is determined as:

ŷ(t_i) = ((arg max_h q_h(t_i) - 1) mod R) + 1   if max_h q_h(t_i) > 1 and d̃(g(x(t), ŷ⁰)) < 0,
ŷ(t_i) = ŷ⁰(t_i)                                otherwise.   (5)
If the maximum element of q is larger than 1, a transition is detected (additionally, g(x(t), ŷ⁰) needs to be decreasing, as otherwise a movement towards ŷ⁰ would incorrectly be interpreted as a transition). In that case the maximum element of q is chosen (the term arg max_h q_h(t_i) - 1) and converted into an estimate of y(t) using the term ((· - 1) mod R) + 1. If no transition is detected, the preliminary estimate ŷ⁰ is accepted. It can be expected that the use of this classifier reduces the delays after transitions that can be observed in Fig. 1. However, since the future progression of a time series is predicted, it must also be expected that wrong transitions will be predicted. It is examined in the next section whether the overall number of wrong classifications is reduced.
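Putting Eqs. (3)-(5) together, one ETD step might be coded as below. This is our sketch: D[h-1, r-1] is assumed to hold the most recent detail coefficient d̃_h of the series g(x(t), r), s the calibrated magnitudes s^h, and class labels are 0-based.

import numpy as np

def etd_step(D, s, y0):
    # D, s: arrays of shape (L, R); row h-1 = level h, column = class
    L, R = D.shape
    q = -(D * D[:, [y0]]) / s          # the products of Eq. (4)
    q = q[::-1].ravel()                # component order: level L first
    h = int(np.argmax(q))
    level = L - 1 - h // R             # row of D that produced the winner
    if q[h] > 1 and D[level, y0] < 0:  # Eq. (5): transition detected and
        return h % R                   # g(x, y0) itself decreasing
    return y0                          # otherwise keep Eq. (3) estimate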
4 Results and conclusions
Application of the method to the data displayed in Fig. 2, using SIMCA (to determine g(x(t), r)) and the Haar wavelet, yields the following results. Fig. 6 shows the components of the time series q(t) for the data set presented above. The value determined for L was 5; for J the value 6 was chosen. After transitions the corresponding components take values larger than 1. Fig. 1 shows the resulting estimate ŷ(t) (dashed line). It can be observed that transitions are now detected. It can, however, also be seen that there are occurrences of detected transitions where none occurred (t ≈ 240 s). This behaviour is understandable from the comments above regarding the prediction of a time series and the values the g(x(t), r) take at t ≈ 240 s. It depends on the intended application whether errors of this kind can be tolerated. The total number of wrong classifications (i.e., how often it occurred that ŷ(t_i) ≠ y(t_i)) on this data set was reduced from 36 (only SIMCA) to 24 (SIMCA with ETD). It can be concluded that ETD is a beneficial extension of static classification methods when an online response to changes is required and the measurement devices exhibit a delayed behaviour. The DWT provides a useful framework for characterizing changes of a time series locally and on multiple time scales and could therefore beneficially be used for ETD.
Fig. 6 The time series of the coefficients of q for the data shown in Fig. 2. The points in time where the value of y(t) changed are marked by the vertical lines. A value larger than 1 indicates a transition. The components q1 and q3 indicate transitions towards y = 1, the remaining ones towards y = 2. The components q1 and q2 indicate transitions on short time scales, the remaining ones belong to transitions on longer time scales.
5 Acknowledgement
Dr. J. Goschnick and Dr. R. Menzel (both at the Institute of Instrumental Analysis of the Research Center in Karlsruhe, Germany) are thanked for providing the data.
References
1. M. Sjostrom and S. Wold, SIMCA: A pattern recognition method based on principal components models, in Pattern Recognition in Practice (E.S. Gelsema and L.N. Kanal, Eds), North-Holland, Amsterdam (1980), pp. 351-359.
2. L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, Berlin (1997).
3. M. Marth, D. Maier, J. Honerkamp, M. Rupprecht and J. Goschnick, Early Transition Detection - a dynamic extension to common classification methods, Chemometrics and Intelligent Laboratory Systems 43 (1998), 123-133.
4. J. Honerkamp, Statistical Physics, Springer, Berlin (1998).
5. P. Althainz, A. Dahlke, M. Frietsch-Klarhof, J. Goschnick and H.J. Ache, Reception Tuning of Gas Sensor Microsystems by Selective Coatings, Sens. & Act. B, 24-25 (1995), 366-369.
6. P. Althainz, J. Goschnick, S. Ehrmann and H.J. Ache, Multisensor Microsystem for Contaminants in Air, Sens. & Act. B, 33 (1996), 72-76.
7. I.E. Frank and J. Friedman, Classification: Oldtimers and Newcomers, Journal of Chemometrics, 3 (1989), 463-475.
8. R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York (1973).
9. D. Coomans and D.L. Massart, Potential Methods in Pattern Recognition, Analytica Chimica Acta, 133 (1982), 215-224.
10. S. Haykin, Neural Networks - a Comprehensive Foundation, McMillan, London (1994).
11. I.E. Frank and S. Lantieri, Classification models: Discriminant Analysis, SIMCA, CART, Chemometrics and Intelligent Laboratory Systems, 5 (1989), 247-256.
12. T. Naes and U. Indahl, A Unified Description of Classical Classification Methods for Multicollinear Data, Journal of Chemometrics, 12 (1998), 205-220.
13. M. Marth, D. Maier, J. Honerkamp and J. Goschnick, Two Improvements of Early Transition Detection, Journal of Chemometrics, 13 (1999), 1-13.
CHAPTER 15
Calibration in Wavelet Domain
B. Walczak* and D.L. Massart
ChemoAC, Vrije Universiteit Brussel, Pharmaceutical Institute, Laarbeeklaan 103, 1090 Brussels, Belgium
1 Introduction
Spectroscopic methods are increasingly employed for quantitative applications in many different fields, including chemistry [1]. The dimensionality of spectral data sets is basically limited by the number of objects studied, whereas the number of variables can easily reach a few thousand. High-dimensional spectral data are strongly correlated and usually somewhat noisy, so that conventional multiple linear regression (MLR) cannot be applied to this type of data directly: feature selection or reduction procedures are needed [2]. The arsenal of calibration methods also contains methods better suited for modelling any number of correlated variables. The most popular among them are Principal Component Regression (PCR) and Partial Least Squares (PLS) [3]. Their models are based on a few orthogonal latent variables, each of them being a linear combination of all original variables. As all the information contained in the spectra can be used for the modelling, these methods are often called the 'full-spectrum methods'. There are situations when feature selection coupled with MLR can offer some advantages compared to the full-spectrum methods. This can happen, for instance, if there are many redundant X-variables with very different curvature in their relationship to Y. In such a case, the feature selection procedure makes it possible to eliminate those X-variables which are most non-linear in their response, and whose non-linear curvatures may contaminate the full-spectrum calibration models.
* On leave from Silesian University, Szkolna Street 9, 40-006 Katowice, Poland
In principle, the factor-based methods do not require feature selection, but improvements of the PCR and PLS methods coupled with feature selection procedures have been reported in several studies [4-10]. The main goal of feature selection can be formulated as the selection of a subset of the candidate variables to obtain a final model that provides accurate and reliable prediction of future values of the dependent variable Y (e.g. a concentration) for a given set of independent variables X (e.g. optical absorbance at a set of wavelengths).
2 Feature selection coupled with MLR
There is no unique statistical procedure for feature selection, and many different approaches are currently used. They often do not result in the same solution, but usually allow the main goal to be achieved, i.e. improvement of prediction for the constructed regression model. A general review of the subset selection methods can be found, e.g., in [11]. Presently, only the most popular feature selection procedures will be discussed.
2.1 Stepwise selection

In forward selection, the first variable selected for entry into the constructed model is the one with the largest correlation with the dependent variable. Once the variable has been selected, it is evaluated on the basis of certain criteria, the most common ones being Mallows' Cp or Akaike's information criterion. If the first selected variable meets the criterion for inclusion, then the forward selection continues, i.e. the statistics for the variables not in the equation are used to select the next one. The procedure stops when no variables are left that meet the entry criterion. Backward elimination starts with all the variables in the equation and then sequentially removes them. This approach cannot be applied to ill-posed settings, but it can be combined with forward selection (so-called stepwise selection). The difference between forward and stepwise selection is that in stepwise selection, after a variable has been entered, all the already entered variables are examined in order to check whether any of them should be removed according to the removal criteria. This testing of the 'least useful variable currently in the equation' is carried out at every stage of the stepwise procedure. A variable that could have been the best entry candidate at an
earlier stage can, at a later stage, be superfluous because of the relationships between it and the other variables now in the regression. Because the classical F-test greatly underestimates the true probability of chance correlation, Topliss and Edwards [12] investigated this probability by simulation studies. They repeatedly constructed the X matrix with random numbers, then applied a standard stepwise procedure to generate the best linear equation relating one column to a certain subset of the others. The frequency of chance correlation was found to be much higher than might have been expected when applying the F-test. For instance, for a data matrix containing 10 objects and 10 variables, more than half of the runs yielded an r^2 of at least 0.5. According to the results of the study carried out by Derksen and Keselman [13], concerning several automatic variable selection methods, in a typical case 20 to 74 percent of the selected variables are noise variables. The number of noise variables selected varies with the number of candidate predictors and with the degree of collinearity among the true predictors (due to the well-known problem of variance inflation when variables are correlated, any model containing correlated variables is unstable). Screening out noise variables while retaining true predictors seems to be a possible solution to the chance correlation problem in stepwise MLR.
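A minimal forward-selection loop is sketched below; the stop rule is simplified to a fixed number of variables, where a real implementation would test Mallows' Cp, AIC, or an F-to-enter at each step:

import numpy as np

def forward_selection(X, y, max_vars=5):
    m, n = X.shape
    selected, resid = [], y - y.mean()
    for _ in range(max_vars):
        # the candidate most correlated with the current residual enters
        cand = [j for j in range(n) if j not in selected]
        r = [abs(np.corrcoef(X[:, j], resid)[0, 1]) for j in cand]
        selected.append(cand[int(np.argmax(r))])
        A = np.column_stack([np.ones(m), X[:, selected]])
        b = np.linalg.lstsq(A, y, rcond=None)[0]   # refit by least squares
        resid = y - A @ b
    return selected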
2.2 Global selection procedures

To arrive at a truly optimal subset of variables (wavelengths) for a given data set, all possible combinations should in principle be considered, but this is computationally prohibitive. Since each variable can either appear in the equation or not, and since this is true of every variable, there are 2^n possible equations (subsets) altogether. For spectral data containing 500 variables, this means 2^500 possibilities. For this type of problem, i.e. the search for an optimal solution out of the millions possible, stochastic search heuristics such as Genetic Algorithms or Simulated Annealing are the most powerful tools [14,15]. The Genetic Algorithm (GA) requires binary coding of the possible solutions. In the case of feature selection, elements in the bitstring are set to zero for non-selected variables, while elements representing selected features are set to one. The initial population of bitstrings is selected randomly. All strings are then
evaluated and selected proportionally to their fitness, in order to undergo reproduction (via cross-over and mutation operations). Evaluation of the solutions is carried out according to a predefined fitness function. The fitness function is problem-dependent and its formulation determines the final solution. GA can be combined with local search methods (e.g. with the forward stepwise procedure) for the final tuning in a multidimensional search space. Although GA is a global search method, it should be understood that when looking for the best subset of variables from highly correlated data, many equivalent solutions to the problem can be designed. GA is very powerful for optimization problems in general, but can have some drawbacks when applied to a feature selection problem. For instance, there is no way to include in the GA procedure an evaluation of the predictive ability of the model using an independent test set. If the 'independent' set is used to evaluate the fitness function, it is in reality involved in the GA procedure and not completely independent. This means that only the final solution of GA can be evaluated by an independent test set. As demonstrated by Jouan-Rimbaud [16], even a careful design of the fitness function does not prevent some features from being selected by chance. Simulated Annealing (SA) originates from thermodynamics, and is based on the physical annealing process of solids. The key parameter governing the optimization procedure is 'temperature'. Each consecutive solution is compared with the previous one; if it is better, it is accepted, and if it is not, it can still be accepted with a certain probability, controlled by the 'temperature' parameter. At the beginning of an SA run, the high temperature allows escapes from local minima, but as the temperature is lowered, the probability of accepting a worse solution is reduced. The rate of the temperature decrease is very important. If the temperature is lowered too quickly, there is not enough opportunity to escape from local minima, and if the temperature is lowered very slowly, it can take a long time to converge to a final solution. There are many propositions on how to improve the efficiency of SA and how to combine it with local search methods.
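The core of an SA run is the Metropolis-type acceptance rule described above; a sketch:

import numpy as np

def accept(cost_new, cost_old, temperature, rng=np.random.default_rng()):
    # a better subset is always accepted; a worse one with probability
    # exp(-increase/T), which shrinks as the temperature T is lowered
    if cost_new <= cost_old:
        return True
    return rng.random() < np.exp(-(cost_new - cost_old) / temperature)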
3 Feature selection with latent variable methods
There is no single, systematic approach for feature selection coupled with the full-spectrum methods, for instance with PLS, but there are many diverse propositions. To mention at least some of them, we can start with Intermediate Least Squares (ILS), proposed by Frank [17], which calculates an
optimal model in the range between PLS and stepwise regression by cross-validating two parameters, namely the number of components in the model and the number of elements in the weight vector set to zero. A Monte Carlo study shows that, although PLS often gives better prediction than stepwise regression or a model in-between, there are some cases where one can gain by the suggested ILS method. Use of the H-principle, proposed by Hoskuldsson [18], leads to a procedure for variable selection basically the same as ILS. The approach by Lindgren et al. [19,20], called Internal Variable Selection PLS (IVS-PLS), can also be considered a continuation of Frank's approach. In this approach variables are selected dimension-wise. The decision about variable selection or elimination is taken depending on the results of a comparison of the weights (loadings) with a predefined threshold value. Wehrens and van der Linden [21] propose to use bootstrap confidence intervals for the regression coefficients, in order to select a subset of the original variables for inclusion in the regression model. This procedure yields a more parsimonious model with a smaller prediction error. The basic idea of this approach is to generate 'bootstrap samples' by sampling with replacement from the data and then calculating the statistical parameter of interest for each bootstrap sample. This yields estimates that are used to obtain a confidence interval. Navarro-Villoslada et al. [22] studied wavelength selection for PLS and PCR (and other methods) by means of criteria such as the maximum signal-to-noise ratio and the minimum condition number of the calibration matrix. Spiegelman et al. [23] present the mathematical basis of improved calibration through the selection of informative variables for PLS. The authors improve the selection of wavelengths by ranking them according to a modified signal-to-noise ratio. PLS models are generated in an iterative fashion, each loop including the next highest-ranked variable into the test set. The algorithm attempts to minimize prediction errors and therefore continues until all variables are included. The variables providing a minimum prediction error are then selected. Another approach, called Iterative Predictor Weighting (IPW) PLS, proposed by Forina et al. [24], is based on the cyclic repetition of PLS regression, each time multiplying the predictors by their importance (the product of the absolute value of the regression coefficient and the standard deviation of the predictor) computed in the previous cycle. Convergence of the algorithm is observed after 10-20 cycles. The final PLS model usually retains a very small number of predictors, and frequently the model complexity decreases. Clark and Cramer III [25] carried out studies on chance correlation with PLS. For all the studied data dimensionalities and different correlation
328 structures, the risk of chance correlation is much greater with the stepwise MLR, than with PLS. For PLS, a greater risk of overlooking a 'true' correlation was observed in these cases, when the correlation involved a sufficiently small fraction of the total variance among independent variables. 3.1 U V E - P L S
In the approach proposed by Centner et al. [26], experimental variables that are no more important than artificial random variables can be eliminated. As this procedure motivated us to study feature selection in the wavelet domain, we present it in a more detailed way. The PLS model relating a variable y (m, 1) to a set of predictors X (m, n) can be presented in the following form:

y = Xb + e

where b is the vector containing the n regression coefficients of the model with A factors, and e (m, 1) is the vector of errors unexplained by the model. The reliability of the regression coefficients is estimated as the ratio of the regression coefficient b_i and its standard deviation:

stability(b_i) = b_i / std(b_i)

where i denotes the variables, i = 1, ..., n.
The standard deviation of the PLS regression coefficients cannot be computed directly. To overcome this problem, Centner proposed to use leave-one-out cross-validation and to define the stability of the b coefficient associated with the ith variable as:

stability(b_i) = mean(b_i) / std(b_i)    (1)
where the mean and the standard deviation are calculated from the set of b coefficients obtained by jackknifing. To distinguish between stable and unstable regression coefficients, a cut-off value is needed. Centner et al. [26] proposed to calculate this value based on the stability of the regression coefficients associated with artificial random variables introduced into the PLS model. To the original matrix X, another matrix N (containing random variables with a very small amplitude) can be added, and then the stability of the regression coefficients for the experimental variables can be compared with that of the random variables. The matrix N ought to contain at least 300 random variables, to ensure a proper estimation of the cut-off value, and the random variables must be small enough not to influence the PLS model. This requirement can be met by multiplying normally distributed random numbers by a small constant (e.g. 10^(-10)). First, the dimensionality (A) of the PLS model is estimated, using, for instance, RMSEP:

RMSEP = (Σ_i (y_i,pred - y_i,observed)² / m)^(1/2),    for i = 1, ..., m
Then a new matrix, XN (m, n + n*) (where n* is the number of random variables), is used to calculate the regression coefficients of the PLS model with A factors, according to the leave-one-out procedure. The first object is eliminated and the PLS model with A factors is constructed for the remaining (m - 1) objects. The resulting set of regression coefficients is called b_1. Then the second object is left out (and the first object is put back) and a second PLS model with A factors is constructed, so that a second set of regression coefficients, b_2, is obtained. If there are m objects, then m vectors of b-coefficients, b_1, b_2, ..., b_m, are obtained. These vectors are organized into a matrix B (m, n + n*). The ith column of B contains the m regression coefficients associated with the ith variable. The mean values of these coefficients divided by their standard deviations define the stability of the (n + n*) regression coefficients (see Scheme 1). The first n elements of the stability vector are associated with the experimental features, whereas the remaining n* elements are associated with the added artificial random variables. Based on the absolute values of the stability of the coefficients associated with noise, one can calculate the cut-off value for the coefficients of the experimental data. The cut-off can be defined in many ways, for instance, as the maximum of the absolute value of stability calculated for the noisy variables:

cut-off = max(abs(stability*))    (2)

or, instead, one can find the cut-off level among the ranked abs(stability*) values as the value corresponding to the 99% (or 95%, 90%, ..., depending on the chosen alpha) quantile. Another possibility is to replace the stability definition by its robust version:
Scheme 1 Schematic representation of (a) the PLS model, (b) the UVE-PLS model and (c) the matrix B, containing regression coefficients calculated by the leave-one-out cross-validation procedure, together with their mean values, standard deviations, and stability.
stability(b_i) = median(b_i) / interquantile(b_i)    (3)
For illustrative purposes, Fig. 1 shows the stability coefficients calculated according to Eqs. (1) and (3) for the 401 experimental variables of the NIR data set and 500 artificial variables.

Fig. 1 Stability coefficients of experimental and artificial random features calculated for the NIR data set, and the thresholds th1 and th2 calculated according to Eqs. (1) and (3), respectively.
Use of Eq. (1) with criterion (2) (which is strongly dependent on the maximal stability value of the random features) can lead to the elimination of too many variables, whereas use of Eq. (3) with criterion (2) is much less strict, and broader bands from the original data are retained for modelling. After the elimination of the uninformative variables, a PLS model for y and Xnew is constructed, using the leave-one-out cross-validation procedure to estimate its complexity.
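A minimal sketch of the jackknife stability computation and the quantile cut-off described above follows. It is not the authors' original code: scikit-learn's PLSRegression is assumed for the PLS step, and the function name and defaults are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def uve_pls_stability(X, y, n_factors, n_noise=300, quantile=0.99, seed=0):
    """Jackknife stability of PLS b-coefficients for UVE-PLS (Eqs. (1)-(2))."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # append artificial random variables of negligible amplitude
    XN = np.hstack([X, rng.standard_normal((m, n_noise)) * 1e-10])
    B = np.empty((m, n + n_noise))
    for i in range(m):                                # leave-one-out loop
        keep = np.arange(m) != i
        pls = PLSRegression(n_components=n_factors).fit(XN[keep], y[keep])
        B[i] = pls.coef_.ravel()                      # b-coefficients of the ith submodel
    stability = B.mean(axis=0) / B.std(axis=0)        # Eq. (1), column-wise
    cutoff = np.quantile(np.abs(stability[n:]), quantile)  # from the noisy part
    informative = np.abs(stability[:n]) > cutoff
    return informative, stability, cutoff
```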
3.2 Feature selection in the wavelet domain
A similar procedure can be performed in the wavelet domain [27]. In this case, there is no need to add noisy variables; instead, the wavelet coefficients associated with the data noise can be used to calculate the threshold value for the stability of the b-coefficients. Let us consider this approach in a more detailed way, assuming that the Discrete Wavelet Transform (DWT) is used for the data decomposition. Using DWT, we can transform m signals from the time domain to the wavelet domain (see Scheme 2a):
Scheme 2 Schematic representation of (a) the discrete wavelet transform of the data set X from the time domain to the wavelet domain W, and (b) the matrix Wsorted, containing wavelet coefficients sorted according to their contribution to the data variance.
The information content of the matrices X and W is identical, but in the scale-frequency domain the signals have sparse representations, i.e. many wavelet coefficients approach zero. When dealing with a data set, we have to find the set of wavelet coefficients describing most of the data variance. The higher the variance of a given column of matrix W, the more important this column is in the description of the data variability. The variances of all columns can be summarized in a vector v (1, n); the sum of its elements equals the total data variance. The interpretation of the elements of vector v is similar to the interpretation of the eigenvalues associated with the PCs extracted in PCA, and in the same way that PCs are sorted according to their eigenvalues, the columns of W can be sorted according to the values of the elements of vector v. Due to the above-mentioned sparse representation of signals in the wavelet domain, usually only a limited number of columns of W is needed to describe the majority of the data variance. To calculate the number of significant coefficients (i.e. the number of Wsorted columns describing the majority of the data variance), different criteria can be applied: e.g. only the n' largest coefficients that together describe a predefined variance (e.g. 99.9%) can be retained as the important ones, or the Minimum Description Length (MDL) criterion [28-30] can be applied. This means that Wsorted can be divided into two submatrices, Ws and Wn, containing the important and the noisy coefficients, respectively (see Scheme 2b). Then, using a leave-one-out procedure, we can construct the PLS model with A factors and calculate the matrix of b-coefficients, their means and standard deviations, and finally their stability (Scheme 3). The stability of the regression coefficients associated with the noisy features can then be used to calculate a threshold value, which allows relevant and irrelevant features to be distinguished within the group of n original features.

Scheme 3 Schematic representation of the RCE-PLS approach: (a) the PLS model for y and Wsorted, containing significant and insignificant wavelet coefficients; (b) the matrix of regression coefficients, calculated using the leave-one-out cross-validation procedure, and the vectors describing their mean values, standard deviations and stability.
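A minimal sketch of this decomposition-and-sorting step is given below, assuming PyWavelets (pywt) for the DWT. The 99.9% cumulative-variance criterion is used here instead of MDL, and all names are illustrative.

```python
import numpy as np
import pywt

def split_wavelet_features(X, wavelet="db4", frac=0.999):
    """DWT each spectrum, then rank the coefficient columns by variance
    and split them into significant (Ws) and noisy (Wn) submatrices."""
    # decompose every row; coefficients are concatenated per spectrum
    W = np.array([np.concatenate(pywt.wavedec(x, wavelet)) for x in X])
    v = W.var(axis=0)                        # variance vector (one entry per column)
    order = np.argsort(v)[::-1]              # sort columns by decreasing variance
    cum = np.cumsum(v[order]) / v.sum()
    n_sig = int(np.searchsorted(cum, frac) + 1)  # smallest set reaching 'frac'
    Ws = W[:, order[:n_sig]]                 # important coefficients
    Wn = W[:, order[n_sig:]]                 # noise-related coefficients
    return Ws, Wn, order, n_sig
```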
4 Illustrative example
Let us consider a data set containing Near-Infrared (NIR) spectra of 30 gasoline samples and five dependent variables (Y = [y1 y2 y3 y4 y5]) [31]. The original spectra and the so-called std spectrum are presented in Fig. 2. The std, i.e. standard deviation, spectrum is a vector, the elements of which describe the standard deviations of the columns of matrix X. The std spectrum allows visualization of the data variation, i.e. identification of the spectral regions with significant variation.

Fig. 2 (a) NIR spectra of 30 gasoline samples and (b) their std spectrum.

The goal is to construct calibration models (y = f(X)) which allow prediction of the dependent variables for new samples, based on their NIR spectra. The original data set X (30, 256) was divided into two subsets: the model set (20, 256), used to construct the model, and the test set (10, 256), used to evaluate the predictive ability of the model. The splitting into model and test sets was performed according to the Kennard and Stone algorithm [32]. This algorithm allows the selection of objects (samples) which are uniformly distributed over the experimental space and represent all sources of data variance (a minimal sketch of such a split is given after the equations below). Evaluation of the constructed models and of their predictive ability was based on RMSCV and RMSEP, respectively. These parameters are defined as:
RMSCV = (Σ_i (y_pred(i) - y_observed(i))² / m)^(1/2),    for i = 1, ..., m

RMSEP = (Σ_i (y_pred(i) - y_observed(i))² / mt)^(1/2),    for i = 1, ..., mt

where m denotes the number of objects in the model set, mt the number of objects in the test set, and the subscript (i) denotes the object left out. A cross-validation procedure and a randomization test [33] were used to evaluate the complexity of the full-spectra models.
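The following is a minimal sketch of the Kennard-Stone selection mentioned above, under the usual maximin formulation; the function name and the distance choice (squared Euclidean) are illustrative.

```python
import numpy as np

def kennard_stone(X, n_model):
    """Select n_model rows of X that uniformly span the experimental space;
    the remaining objects can serve as the test set."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    i, j = np.unravel_index(d.argmax(), d.shape)
    chosen = [i, j]                          # start with the two most distant samples
    while len(chosen) < n_model:
        rest = [k for k in range(len(X)) if k not in chosen]
        # next sample: the one farthest from its nearest already-chosen neighbour
        nxt = max(rest, key=lambda k: d[k, chosen].min())
        chosen.append(nxt)
    return np.array(chosen)
```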
Let us first apply stepwise MLR (SMLR) with the standard settings, the level of significance for the null hypothesis (i.e. that the regression coefficient of the newly added variable does not significantly differ from 0) set to 5%. Applied to construct five calibration models for y1, y2, ..., y5, it leads to the solutions summarized in Table 1 and presented in Figs. 3 and 4. For modelling y1, no variable was found to fulfil the F-to-enter test criterion. For modelling y2, y3 and y4, respectively three, eight and five variables were selected, and the constructed models have a good predictive ability (see Fig. 4). Considering the SMLR model for y3, one can notice a great difference between RMSCV and RMSEP. The reason probably is that the variables 189, 191, 197, 199 and 201, included in the model, are highly correlated; this inflates the variance of the estimates, and hence the unique contribution of each variable is difficult to assess. For modelling y5, only one variable was selected and the constructed model performs very badly (see Figs. 3 and 4).
Table 1. RMSCV, RMSEP and the selected variables for modelling y1, y2, y3, y4 and y5 by SMLR.

      RMSCV     RMSEP     Selected variables
y1    -         -         none
y2    0.3978    0.3958    58, 169, 64
y3    0.0672    0.2266    199, 189, 116, 197, 80, 217, 191, 201
y4    0.1968    0.1657    234, 256, 61, 240, 92
y5    4.4420    6.0820    213
Fig. 3 Variables selected by the stepwise procedure for modelling y2, y3, y4 and y5.
Fig. 4 The Y_predicted versus Y_observed plots (for the test set) according to the SMLR models.
The results reflect typical situations. When SMLR is applied automatically, models with good predictive ability can be constructed on condition that there are enough calibration samples, the spectra are not noisy, and no extrapolation outside the calibration domain is required. Otherwise, poorer results are obtained; in the extreme case, as for y1, there is no possibility of constructing an SMLR model at all. The results of SMLR ought always to be carefully analysed and interpreted. A Genetic Algorithm applied to the discussed data set leads to different subsets of selected variables. There are many different versions of GA, depending on the way reproduction, cross-over, etc. are performed. The algorithm used in our study, adapted from Leardi et al. [34,35], is particularly directed towards feature selection. In each GA run, a few subsets with similar responses are selected. The final solutions are then evaluated based on the RMSEP for an independent test set. The results are presented in Table 2 and in Figs. 5 and 6.
Table 2. RMSCV, RMSEP and the selected variables for modelling y1, y2, y3, y4 and y5 by GA-MLR.

      RMSCV      RMSEP      Selected variables
y1    0.3245     1.208      32, 44, 56, 115, 116, 129, 147, 172, 232
y2    0.05834    0.1505     61, 84, 92, 107, 116, 127, 140, 193, 197, 235
y3    0.0596     0.09227    46, 70, 134, 169, 180
y4    0.04138    0.05768    48, 83, 116, 139, 174, 181, 183, 190, 199, 235
y5    0.2563     0.5951     6, 26, 39, 70, 114, 168, 188, 189, 242
Fig. 5 Variables selected by GA for modelling y1, y2, y3, y4 and y5.
Fig. 6 The Y_predicted versus Y_observed plots (for the test set) according to the GA-MLR models.
Only for y1 is the predictive ability of the GA-MLR model not satisfactory. For y2, y3, y4 and y5, the predictions of the models are excellent but, as one can notice, a few of the selected variables lie on the data baseline, which suggests that the models can prove unstable. In fact, this is the case with these models: if a small amount of noise is added to the independent test set (simulated as randn(mt,nt)*0.001), the constructed models fail completely. Plots of Y_predicted versus Y_observed for the noisy test set, denoted Xtn, are presented in Fig. 7. These results demonstrate the danger of working with few variables, which can, however, be overcome by applying full-spectra models, or by applying GA to the compressed data set containing only the significant wavelet coefficients. Results of PLS, i.e. of the full-spectrum method, are presented in Table 3. The RMSCV and RMSEP values are much higher than the analogous values observed for the SMLR or GA-MLR models, but one can hope that the PLS models are more stable, e.g. when instrumental problems occur. Still, one can try to lower the model complexity by extracting the relevant information from the original spectra. This can be done, for instance, by using the UVE-PLS or RCE-PLS approaches.
Fig. 7 The Y_predicted versus Y_observed plots for the test set contaminated with white noise, Xtn, according to the GA-MLR models.
Table 3. RMSCV, RMSEP and the number of latent variables (f) for the PLS and UVE-PLS models.

           PLS                         UVE-PLS
      f    RMSCV     RMSEP        f    RMSCV     RMSEP
y1    7    0.5533    1.0006       5    0.9290    0.6406
y2    6    0.2559    0.3393       5    0.5091    0.3479
y3    6    0.1382    0.2121       6    0.1320    0.2145
y4    6    0.2581    0.1409       4    0.5821    0.3418
y5    6    0.5274    0.9899       5    0.7745    0.7740
Results of PLS and UVE-PLS are presented in Table 3. As one can notice, the elimination of uninformative variables leads in the majority of cases (i.e. in 4 out of 5) to models with lower complexity. The RMSEP for y1 and y5 is decreased, for y2 and y3 it is similar to the RMSEP of the PLS models, and for y4 it is higher. The stability of the regression coefficients and the selected variables are presented in Fig. 8.
Fig. 8 Plots of the stability of the regression coefficients and the selected variables for the UVE-PLS models of y1, y2, y3, y4 and y5 (a-e).
The number of selected variables and their positions in the spectra vary, depending on y. Variables relevant for modelling y1 are not necessarily relevant for modelling the other dependent variables. The smallest number of informative variables is observed for y4, the largest for y3. Using DWT, the original spectra can be transformed from the time to the wavelet domain (X → W). This transform does not change the information content of the data, i.e. the full-spectrum PLS models constructed for W and Y are exactly the same as those constructed for X and Y. Spectra transformed into the wavelet domain have a sparse representation, i.e. many elements approach zero (see Fig. 9, where, for illustrative purposes, two decomposed spectra are presented). For the further data processing, we are interested only in those coefficients of the individual spectra which vary within the data set, i.e. for which a high variance is observed. Their identification can be made based on the elements of the variance vector (see Fig. 10).
Fig. 9 Two spectra from the NIR data set decomposed by DWT with the Daubechies filter no. 4.
Fig. 10 The variance vector for the NIR data set decomposed by DWT with the Daubechies filter no. 4.
For the studied data set, 128 wavelet coefficients were identified as significant and the remaining 128 (256 - 128) as insignificant. If the data matrix W is compressed to the matrix Ws (30 x 128), the PLS models are the same as those constructed for the original data, which shows that the coefficients removed from matrix W are uninformative. In this case, the only advantage of the wavelet decomposition is that the data set is compressed [36]. Another possibility is to keep the insignificant coefficients Wn and to perform UVE-like modelling, using these coefficients as irrelevant features to calculate the cut-off value for the stability of the regression coefficients associated with the significant features. This type of modelling is called Relevant Component Extraction-PLS (RCE-PLS), in order to distinguish it from the UVE-PLS approach. Results of RCE-PLS applied to the NIR data set are summarized in Table 4. The complexity of the RCE-PLS models is always lower than the complexity of the PLS models, whereas the RMSEP is lower for y1, y3 and y5, and higher for y2 and y4. The Y_predicted versus Y_observed plots for the test set according to the PLS, UVE-PLS and RCE-PLS models are presented in Fig. 11.
Table 4. Complexity (f), RMSCV and RMSEP for the RCE-PLS models.

      f    RMSCV     RMSEP
y1    4    0.9681    0.5832
y2    5    0.3932    0.4714
y3    4    0.3319    0.1952
y4    4    0.5115    0.2301
y5    5    0.6424    0.7637
Fig. 11 The Y_predicted versus Y_observed plots for the test sets according to the PLS, UVE-PLS and RCE-PLS models.
While constructing the RCE-PLS models, we do not need to reconstruct the spectra at any step of the calibration procedure, but they can be reconstructed for visualization purposes. In Fig. 12, the original spectra (centered), the spectra reconstructed from the relevant coefficients and the spectra reconstructed from the irrelevant coefficients are presented, together with the corresponding std spectra.
Fig. 12 (a) The original spectra (centered), (b) the relevant and (c) the irrelevant components of the spectra, and (d)-(f) the respective standard deviation (std) spectra, for modelling y1, y2, y3, y4 and y5.
These figures illustrate well the difference between the UVE-PLS and the RCE-PLS approaches: in UVE-PLS, variables are selected from the set of original variables, whereas selecting relevant features in the wavelet domain results in a different weighting of the original variables. If the PLS, UVE-PLS or RCE-PLS models are applied to the test set slightly contaminated with white noise (data set Xtn), they still give acceptable predictions, thus giving evidence of their stability. The RMSEP values for Xtn observed for
the GA-MLR, PLS, UVE-PLS and RCE-PLS models are summarized in Table 5, whereas the Y_predicted versus Y_observed plots are presented in Fig. 13. For data highly contaminated with noise, the difference between the UVE-PLS and RCE-PLS approaches becomes even more evident. For illustrative purposes, Fig. 14 shows the spectra (centered) of the test set contaminated with high noise (simulated as randn*0.01, i.e. ten times higher than for Xtn), together with the relevant and irrelevant components extracted by RCE-PLS for modelling y4. As one can easily notice, the majority of the noisy variables are properly identified and removed from the original noisy spectra.
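A minimal sketch of this reconstruction-for-visualization step follows, assuming PyWavelets: the complementary wavelet coefficients are zeroed before the inverse DWT. The function name and arguments are illustrative.

```python
import numpy as np
import pywt

def reconstruct_components(x, relevant_idx, wavelet="db4"):
    """Split one spectrum into its relevant and irrelevant components by
    zeroing the complementary wavelet coefficients before the inverse DWT."""
    coeffs = pywt.wavedec(x, wavelet)
    flat, slices = pywt.coeffs_to_array(coeffs)       # flatten to one vector
    mask = np.zeros(flat.size, dtype=bool)
    mask[relevant_idx] = True                         # positions kept by RCE-PLS
    rel = pywt.waverec(pywt.array_to_coeffs(np.where(mask, flat, 0.0),
                                            slices, output_format="wavedec"),
                       wavelet)
    irr = pywt.waverec(pywt.array_to_coeffs(np.where(mask, 0.0, flat),
                                            slices, output_format="wavedec"),
                       wavelet)
    return rel[:len(x)], irr[:len(x)]                 # trim any padding sample
```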
Table 5. RMSEP for the test set contaminated with white noise (Xtn) for the GA-MLR, PLS, UVE-PLS and RCE-PLS models.

      GA-MLR     PLS       UVE-PLS    RCE-PLS
y1    31.4029    0.8924    0.7401     0.4434
y2    2.1708     0.3428    0.3562     0.4898
y3    1.0911     0.2252    0.2211     0.2156
y4    0.9236     0.1639    0.4148     0.2556
y5    8.4285     0.9500    0.7581     0.5869
Fig. 13 The Y_predicted versus Y_observed plots for the test set contaminated with noise, Xtn, according to the GA-MLR, PLS, UVE-PLS and RCE-PLS models.
Fig. 14 (a) The spectra (centered) of the test set contaminated with noise, (b) the relevant and (c) the irrelevant components extracted by RCE-PLS for modelling y4.
5 Conclusions
To construct parsimonious multivariate models for highly correlated spectral data, one can extract all the relevant information present in the data and eliminate the irrelevant part. This can be done efficiently in the wavelet domain, where it is easy to distinguish between significant features and features associated with noise. The latter can then be used to discriminate between relevant and irrelevant features for data modelling. This approach usually leads to a decrease of the model complexity and to an increase of its stability.
References

1. B.G. Osborne, T. Fearn, P.H. Hindle, Practical NIR Spectroscopy with Applications in Food and Beverage Analysis, Longman Group UK Limited, England, (1993).
2. N.R. Draper, H. Smith, Applied Regression Analysis, Wiley, New York, (1981).
3. H. Martens, T. Naes, Multivariate Calibration, Wiley, New York, (1991).
4. D. Jouan-Rimbaud, D. Massart, R. Leardi, O.E. de Noord, Genetic Algorithm as a Tool for Wavelength Selection in Multivariate Calibration, Analytical Chemistry, 67 (1995), 4295-4301.
5. U. Horchner, J.H. Kalivas, Further Investigation on a Comparative Study of Simulated Annealing and Genetic Algorithm for Wavelength Selection, Analytica Chimica Acta, 311 (1995), 1-13.
6. M.J. Arcos, M.C. Ortiz, B. Villahoz, L.A. Sarabia, Genetic-Algorithm-Based Wavelength Selection in Multicomponent Spectrometric Determinations by PLS: Application on Indomethacin and Acemethacin Mixture, Analytica Chimica Acta, 339 (1997), 63-77.
7. G. Weyer, S.D. Brown, Application of New Variable Selection Techniques to Near Infrared Spectroscopy, Journal of Near Infrared Spectroscopy, 4 (1996), 163-174.
8. J.H. Kalivas (Ed.), Adaption of Simulated Annealing to Chemical Optimization Problems, Elsevier, Amsterdam, in press.
9. J.P. Brown, Measurement, Regression and Calibration, Clarendon Press, Oxford, (1993).
10. P. Salamin, H. Bartels, P. Forster, A Wavelength and Optimal Path Length Selection Procedure for Spectroscopic Multicomponent Analysis, Chemometrics and Intelligent Laboratory Systems, 11 (1991), 57-62.
11. A.J. Miller, Subset Selection in Regression, Chapman & Hall, New York, (1990).
12. J.G. Topliss, R.P. Edwards, Chance Factors in Studies of Quantitative Structure-Activity Relationships, Journal of Medicinal Chemistry, 22 (1979), 1238-1244.
13. S. Derksen, H.J. Keselman, Backward, Forward and Stepwise Automated Subset Selection Algorithms: Frequency of Obtaining Authentic and Noise Variables, British Journal of Mathematical and Statistical Psychology, 45 (1992), 265-282.
14. D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, (1989).
15. P.J.M. van Laarhoven, E.H.L. Aarts, Simulated Annealing: Theory and Applications, Reidel, Dordrecht, (1987).
16. D. Jouan-Rimbaud, D.L. Massart, O.E. de Noord, Random Correlation in Variable Selection for Multivariate Calibration with a Genetic Algorithm, Chemometrics and Intelligent Laboratory Systems, 35 (1996), 213-220.
17. I. Frank, Intermediate Least Squares Regression Method, Chemometrics and Intelligent Laboratory Systems, 1 (1987), 233-242.
18. A. Hoskuldsson, The H-principle in Modelling with Applications to Chemometrics, Chemometrics and Intelligent Laboratory Systems, 14 (1992), 139-153.
19. F. Lindgren, P. Geladi, S. Ranner, S. Wold, Journal of Chemometrics, 8 (1994), 349-363.
20. F. Lindgren, P. Geladi, S. Ranner, S. Wold, Journal of Chemometrics, 9 (1995), 331-342.
21. R. Wehrens, W.E. van der Linden, Bootstrapping Principal Component Regression Models, Journal of Chemometrics, 11 (1997), 157-171.
22. F. Navarro-Villoslada, L.V. Perez-Arribas, M.E. Leon-Gonzalez, L.M. Polo-Diez, Selection of Calibration Mixtures and Wavelengths for Different Multivariate Calibration Methods, Analytica Chimica Acta, 313 (1995), 93-101.
23. C.H. Spiegelman, M.J. McShane, M.J. Goetz, M. Motamedi, Q.L. Yue, G.L. Cote, Theoretical Justification of Wavelength Selection in PLS Calibration: Development of a New Algorithm, Analytical Chemistry, 70 (1998), 35-44.
24. M. Forina, C. Casolino, C. Pizarro Millan, Iterative Predictor Weighting (IPW) PLS: A Technique for the Elimination of Useless Predictors in Regression Problems, Journal of Chemometrics, 13 (1999), 165-184.
25. M. Clark, R.D. Cramer III, The Probability of Chance Correlation Using Partial Least Squares (PLS), Quantitative Structure-Activity Relationships, 12 (1993), 137-145.
26. V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Elimination of Uninformative Variables for Multivariate Calibration, Analytical Chemistry, 68 (1996), 3851.
27. D. Jouan-Rimbaud, B. Walczak, R.J. Poppi, O.E. de Noord, D.L. Massart, Application of Wavelet Transform to Extract the Relevant Component from Spectral Data for Multivariate Calibration, Analytical Chemistry, 69 (1997), 4317-4323.
28. J. Rissanen, A Universal Prior for Integers and Estimation by Minimum Description Length, Annals of Statistics, 11 (1983), 416-431.
29. N. Saito, Simultaneous Noise Suppression and Signal Compression Using a Library of Orthonormal Bases and the Minimum Description Length Criterion, in: Wavelets in Geophysics (eds. E. Foufoula-Georgiou and P. Kumar), Academic Press, New York, (1994).
30. B. Walczak, D.L. Massart, Noise Suppression and Signal Compression Using the Wavelet Packet Transform, Chemometrics and Intelligent Laboratory Systems, 36 (1997), 81-94.
31. B.M. Wise, PLS Toolbox for Use with Matlab, version 1.4 (Eigenvector Technologies, West Richland, WA, USA).
32. R.W. Kennard, L.A. Stone, Computer Aided Design of Experiments, Technometrics, 11 (1969), 137-148.
33. H. van der Voet, Comparing the Predictive Accuracy of Models Using a Simple Randomization Test, Chemometrics and Intelligent Laboratory Systems, 25 (1994), 313-323.
34. R. Leardi, R. Boggia, M. Terrile, Genetic Algorithms as a Strategy for Feature Selection, Journal of Chemometrics, 6 (1992), 267-281.
35. R. Leardi, Application of a Genetic Algorithm to Feature Selection under Full Validation Conditions and to Outlier Detection, Journal of Chemometrics, 8 (1994), 65-79.
36. J. Trygg, S. Wold, PLS Regression on Wavelet Compressed NIR Spectra, Chemometrics and Intelligent Laboratory Systems, 42 (1998), 209-220.
CHAPTER 16
Wavelets in Parsimonious Functional Data Analysis Models

Bjorn K. Alsberg
Department of Computer Science, University of Wales, Aberystwyth, Ceredigion SY23 3DB, UK
e-mail: bka@aber.ac.uk

1 Introduction
Occam's (or Ockham's) razor is a principle attributed to the 14th-century logician William of Occam, which can be stated as follows: "Entities should not be multiplied unnecessarily". It is commonly accepted as a sound working principle for the construction of scientific knowledge. It means that if several possible hypotheses can explain an observed fact, the one with the minimum number of attached assumptions is chosen. Such hypotheses or models are often referred to as parsimonious. These models are often associated with the following properties:

1. Improved prediction.
2. More general.
3. Easier to understand.
4. Few variables/parameters.
Reducing the model complexity often reduces the prediction error of a model [1]; however, this is not always true. It might be acceptable to sacrifice some prediction ability in favour of a less complex model. Historically, an example can be found in astronomy, where the Ptolemaic geocentric cosmology was replaced by the heliocentric Solar system. The first model suggested by Copernicus was based on circular planetary orbits and actually had a higher prediction error than the Ptolemaic model. In spite of this, the main idea of heliocentricity was eventually preferred (together with elliptic orbits) because of its simplicity, generality and explanatory power.

Improvement of understanding and greater generality can often be attributed to the higher abstraction level of the model representation. For instance, a reaction coordinate can be seen to exist on a higher abstraction level than the full set of coordinates of the different atoms involved in a reaction. The abstraction level indicates the degree of detail needed for the model representation; there is an inverse relationship between abstraction and the resolution of the representational detail. How can parsimonious models be constructed? There are several possible approaches; in this chapter a combination of data compression and variable selection is used. Data compression achieves parsimony through the reduction of the redundancy in the data representation. However, compression that does not involve information about the dependent variables will not be optimal. It is therefore suggested that variable selection should be performed on the compressed variables, and not on the original variables, as is the usual strategy. Variable selection has been applied with success in fields such as analytical chemistry [1-4], quantitative structure-activity relationships (QSAR) [5-8] and analytical biotechnology [9-11]. In this chapter, compression is achieved by assuming that the data profiles can be approximated by a linear combination of smooth basis functions. The bases used originate from the fast wavelet transform. The idea that data sets are really functions rather than discrete vectors is the main focus of functional data analysis [12-15], which forms the foundation for the generation of parsimonious models.
2 Functional data analysis
Spectra originating from infrared, Raman and ultraviolet spectroscopy are reasonably approximated by smooth functions. The degree of smoothness is defined by the continuity of the various derivatives of the function: a function f is said to be k-times continuously differentiable, or C^k, if (∂^k/∂t^k) f(t) is continuous for all points t ∈ ℝ. The traditional approach in multivariate data analysis and chemometrics is to consider the data profiles as discrete vectors, where each sampled point along a spectroscopic profile is assigned a unique variable in the analysis. This is here referred to as the sampling point representation (SPR) [14]. SPR is often so simple and intuitive that it is sometimes difficult to see why any alternative representation should even be considered. One aspect of SPR is that information about the continuity between the different vector elements is lost. The apparent continuity is due to the fact that most people tend to organise the sampled data points to be meaningful; however, this information is not contained explicitly in the representation itself. A simple experiment can demonstrate this: consider a data matrix with e.g. 100 spectra, where each spectrum is described by 1000 sampled data points at different wavelengths. The aim is to use principal component analysis. The output loadings vectors from such an analysis will reflect the shapes of the input profiles. Another analysis is possible, where we have randomly permuted the variable columns in the data matrix.¹ All the shapes of the spectral profiles are lost, and this will also be reflected in the loadings profiles. However, mathematically the results of the two analyses are identical, in the sense that they produce the same eigenvector solution with the same eigenvalues. The only difference is a relabelling of the column (or row) vectors, which does not have any effect on the convergence and solution of the PCA model. Thus, the indexing information we took for granted is not accessible to PCA or to any other multivariate method that does not take it into consideration explicitly. Functional data analysis [12,13,16], on the other hand, does make this information directly available to the multivariate methods, by assuming that each data object is a function rather than a discrete vector. The other aspect is that SPR is unnecessarily memory demanding. The digital sampling density of spectroscopic and analytical instruments can often be adjusted by the experimenter. Both the real spectroscopic resolution and the digital sampling density will influence the actual number of variables (intensities at certain wavelengths) used to represent the spectral profile. There is a large redundancy in the data, which can be attributed to the continuity of the data profile. In general, the smoother the data, the smaller the number of bytes needed to store it. By approximating the profiles as actual functions, it is possible to perform an efficient compression of the data set. Obviously, this is going to influence the complexity of the resulting calibration model. If the multivariate model needs additional parameters to handle this redundancy, these will tend to mask the real underlying variations that are more interesting. A compression of the data profiles by a functional approximation can therefore be an efficient way to obtain a better understanding of deeper relationships.
¹ The same also applies to the matrix rows.
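A small numerical illustration of the permutation argument above (a sketch; any PCA route would do, here via the eigendecomposition of X·Xᵀ):

```python
import numpy as np

# 100 smooth "spectra" sampled at 1000 points
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 1000)
X = np.array([np.exp(-(t - p) ** 2 / 0.01) for p in rng.uniform(0.2, 0.8, 100)])
X -= X.mean(axis=0)                        # column-centre as in PCA

Xperm = X[:, rng.permutation(X.shape[1])]  # randomly permute the variable columns

# the eigenvalues of G = X X^T are identical for both data sets
ev = np.linalg.eigvalsh(X @ X.T)
ev_perm = np.linalg.eigvalsh(Xperm @ Xperm.T)
print(np.allclose(ev, ev_perm))            # True: PCA cannot "see" the ordering
```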
2.1 From vectors to functions
This section sets the stage for some of the ideas discussed later. Going from equations with discrete vectors and matrices to equations with functions is, in its basic form, not difficult. Let us demonstrate these ideas by looking at an example where PCA is applied to functions. Let X be a matrix of continuous spectra. This means that the N rows in X are really functions, such that X = [x_1(t)^T; x_2(t)^T; ...; x_N(t)^T]. One way to find the principal components of X is to solve the eigenequation of the covariance matrix G = XX^T. For the discrete case, G can be written in terms of vector inner products:

G_ij = <x_i | x_j>    (1)

where x_i is the ith row in X and x_j is the jth row in X. For the discrete case we define the inner product between two vectors a and b as

<a | b>_discrete = Σ_i a_i b_i    (2)

whereas for the inner product between two functions a(t) and b(t) we write

<a(t) | b(t)>_continuous = ∫ a(t) b(t) dt    (3)

The bracket notation is similar to the one used in quantum mechanics [17]. Basically, the summation signs are replaced with the corresponding integration signs in the equations for PCA (and other similar multivariate algorithms). Thus, the covariance matrix G = XX^T has elements

G_ij = <x_i(t) | x_j(t)> = ∫ x_i(t) x_j(t) dt    (4)

G will here be an N x N matrix, whereas the dual R = X^T X is not a matrix but a 2D function (also referred to as a kernel):

R(u, v) = Σ_i x_i(u) x_i(v)    (5)

It is very common to represent the smooth functions x_i(t) in a finite basis:

x_i(t) = Σ_j c_j φ_j(t)    (6)

where the φ_j(t) are the basis functions. Writing Eq. (4) in matrix notation we get

G_ij = ∫ (Σ_k c_ik φ_k(t)) (Σ_l c_jl φ_l(t)) dt,   i.e.   G = C B^T B C^T    (7)

where B = [φ_1(t); φ_2(t); ...; φ_K(t)]. For some bases the calculation of the matrix U_ij = <φ_i | φ_j> will be easy. As we shall see below, the discrete wavelet transform from the Mallat algorithm produces an orthonormal basis, which makes U equal to the identity matrix. For orthonormal bases, no modification of the original multivariate algorithms is necessary, and we can use the methods directly on the basis coefficients C. The conceptual relationship between function space, sampled data and coefficient space is shown in Fig. 1.
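A quick numerical check of this orthonormality argument, assuming PyWavelets: the eigenvalues of G computed from the raw signals and from their DWT coefficients coincide.

```python
import numpy as np
import pywt

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 256)
X = np.array([np.sin(2 * np.pi * f * t) for f in rng.uniform(1, 5, 20)])
X -= X.mean(axis=0)

# orthonormal DWT (Daubechies-4, periodized): C keeps the same inner products as X
C = np.array([np.concatenate(pywt.wavedec(x, "db4", mode="periodization"))
              for x in X])

ev_x = np.linalg.eigvalsh(X @ X.T)
ev_c = np.linalg.eigvalsh(C @ C.T)
print(np.allclose(ev_x, ev_c))   # True: U = B^T B is the identity
```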
Fig. 1 The basic idea behind the relationships between function space, sampled point space and coefficient space. All three representations describe the shape of the observed spectrum.

2.2 Spline basis

Spline approximations of functions are a logical extension of using simple polynomials P_k(x) = Σ_{i=0}^{k} c_i x^i to fit a curve. It may be possible to find the coefficients c_i of a kth degree polynomial that fits a set of sampled points in a least-squares sense. However, such high-degree polynomials are very unreliable for extrapolation and can contain unrealistically large oscillations. Global polynomial functions are therefore poor at describing local features without using a very large k. In spline theory, the idea is that a function can be approximated by polynomials that are only valid over finite regions or segments. These
segments are defined by points t_j called knots. At the boundary between two regions the function has C^k continuity; C^k with k ≥ 1 prevents the boundaries from introducing artificial sharp edges, which would be detrimental to the approximation of smooth functions. To control the shape of the curve, control points or spline coefficients are used. For a uniform cubic parametric B-spline over a region i we have

Q_i(t) = T M_Bs C_Bs,i    (8)

where T = [(t - t_i)³ (t - t_i)² (t - t_i) 1] contains the polynomials of the parametric variable t, M_Bs is the uniform B-spline basis matrix

M_Bs = (1/6) [ -1  3 -3  1
                3 -6  3  0
               -3  0  3  0
                1  4  1  0 ]    (9)

and C_Bs,i is the geometry matrix which contains the control points

C_Bs,i = [ P_{i-3,x}  P_{i-3,y}  P_{i-3,z}
           P_{i-2,x}  P_{i-2,y}  P_{i-2,z}
           P_{i-1,x}  P_{i-1,y}  P_{i-1,z}
           P_{i,x}    P_{i,y}    P_{i,z} ],    3 ≤ i ≤ m    (10)

which in this case displays a cubic parametric B-spline curve in 3D. For the compression of spectra from infrared or Raman spectroscopy, only one dimension is usually needed. The word "uniform" means that the distance between the individual knots is the same. For B-splines to become a more versatile tool, the assumption of a constant distance between the knots must be relaxed; the blending functions for the different segments will then differ. The only requirement for the knot sequence is that it must be a non-decreasing sequence of numbers. When t_i = t_{i+1}, this indicates a multiple knot, and the segment Q_i is reduced to a point. This is one of the great advantages of non-uniform B-splines, since it offers great flexibility in the representation of functions. For example, [0, 0, 1, 1, 1, 1, 2, 3, 4, 4] is a valid sequence of knots: the knot value 0 has multiplicity 2, the knot value 1 has multiplicity 4, and so on. The multiplicity is used to control the continuity at a point; the higher the multiplicity, the less smooth the spline function becomes at this point. A curve segment Q_i in cubic B-splines is defined by four control points
P_{i-3}, P_{i-2}, P_{i-1}, P_i and four functions: B_{i-3,4}, B_{i-2,4}, B_{i-1,4}, B_{i,4}, where 4 denotes the order of the splines. This order is always one larger than the degree of the spline (here 3). So Q_i(t) can be written as

Q_i(t) = P_{i-3} B_{i-3,4}(t) + P_{i-2} B_{i-2,4}(t) + P_{i-1} B_{i-1,4}(t) + P_i B_{i,4}(t),
         for 3 ≤ i ≤ m, t_i ≤ t ≤ t_{i+1}    (11)
How are the B_{j,k}(t) functions computed? An efficient way is to use the Cox-de Boor algorithm [18,19], which is based on a recursive formulation of the B_{j,k}(t) functions. Using a cubic B-spline to illustrate, it is possible to write out the recursive steps explicitly:

B_{i,1}(t) = 1 if t_i ≤ t < t_{i+1}, 0 otherwise

B_{i,2}(t) = ((t - t_i)/(t_{i+1} - t_i)) B_{i,1}(t) + ((t_{i+2} - t)/(t_{i+2} - t_{i+1})) B_{i+1,1}(t)

B_{i,3}(t) = ((t - t_i)/(t_{i+2} - t_i)) B_{i,2}(t) + ((t_{i+3} - t)/(t_{i+3} - t_{i+1})) B_{i+1,2}(t)

B_{i,4}(t) = ((t - t_i)/(t_{i+3} - t_i)) B_{i,3}(t) + ((t_{i+4} - t)/(t_{i+4} - t_{i+1})) B_{i+1,3}(t)
The basis functions B_j(t) can, in the discrete case, be collected into a matrix B, which allows us to write a spectrum x(t) as

x(t) = Bc    (12)

where B is the basis matrix for all the segments and c is the coefficient vector (the 1D control points).
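A minimal sketch of the Cox-de Boor recursion and the resulting basis matrix B of Eq. (12) follows, assuming knots with no zero-length spans at the evaluated points; all names are illustrative.

```python
import numpy as np

def bspline_basis(knots, k, n_basis, t):
    """Evaluate B_{j,k}(t) for j = 0..n_basis-1 via the Cox-de Boor recursion
    and stack them column-wise into the basis matrix B of Eq. (12)."""
    t = np.asarray(t, dtype=float)

    def B(j, k):
        if k == 1:                                  # order-1 splines are box functions
            return ((knots[j] <= t) & (t < knots[j + 1])).astype(float)
        left = (t - knots[j]) / (knots[j + k - 1] - knots[j])
        right = (knots[j + k] - t) / (knots[j + k] - knots[j + 1])
        return left * B(j, k - 1) + right * B(j + 1, k - 1)

    return np.column_stack([B(j, k) for j in range(n_basis)])

# usage: a cubic (order 4) basis on uniform knots; a spectrum is then x = B @ c
knots = np.arange(12.0)
Bmat = bspline_basis(knots, 4, 8, np.linspace(3, 8, 200))
```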
2.3 Non-linear bases

Often, it is found that certain data profiles are best described by non-linear bases. A typical example is curves from e.g. infrared spectroscopy [20], which are well described by a linear combination of Lorentzian peaks. Sometimes it is also possible to use Gaussian peaks. The Lorentzian peak function is written as

S(t; a, b, c) = a / (4((t - b)/c)² + 1)    (13)

and each spectrum x(t) is written as
x(t) = Σ_{j=1}^{n} S(t; a_j, b_j, c_j)    (14)
where a_j is the height, b_j the position and c_j the width of peak j. The peak parameter representation (PPR) [21] uses only the set of non-linear basis function parameters in the multivariate modelling. The PPR method can be written as follows:

• Perform non-linear curve fitting to each spectrum using n peaks.
• Record the set of non-linear curve parameters a_i, b_i, c_i, i ∈ {1, n}, for each peak i in each spectrum j ∈ {1, N}.
• Construct a vector v_j for each of the N spectra that contains all the parameter triplets for each peak: v_j = [(a_1, b_1, c_1), (a_2, b_2, c_2), ..., (a_n, b_n, c_n)].
• Use the vectors v_j to construct a data matrix, which is analysed with multivariate methods.

The PPR approach produces highly compressed models that are very easy to interpret. However, the approach is limited in practical use by the difficulties involved in obtaining the non-linear peak parameters: the process is computationally demanding and convergence is not guaranteed. In addition, the peak parameters of the different peaks must be comparable across spectra, which makes it difficult to handle cases where peaks in certain spectra appear or disappear. One way to handle this problem is to include all the different peaks of all spectra and their parameters (disappearing peaks are e.g. given zero heights and widths). A sketch of the curve-fitting step is given below.
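A minimal sketch of the fitting step with scipy.optimize.curve_fit, for a two-peak model based on Eq. (13); the starting values and peak count are illustrative and, as noted above, convergence is not guaranteed.

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentz(t, a, b, c):
    """Single Lorentzian peak, Eq. (13): height a, position b, width c."""
    return a / (4.0 * ((t - b) / c) ** 2 + 1.0)

def two_peaks(t, a1, b1, c1, a2, b2, c2):
    return lorentz(t, a1, b1, c1) + lorentz(t, a2, b2, c2)

t = np.linspace(0, 10, 500)
x = two_peaks(t, 1.0, 3.0, 0.5, 0.6, 7.0, 1.2) + 0.01 * np.random.randn(t.size)

p0 = [1, 2.5, 1, 0.5, 7.5, 1]                   # rough starting guesses
params, _ = curve_fit(two_peaks, t, x, p0=p0)   # fitted triplets = PPR vector v_j
```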
2.4 Wavelet bases

The application of wavelet transforms to the analysis of spectra and other chemical data has gained popularity in chemistry [22-30]. Examples of typical chemical signal processing applications are:

• Denoising [24,31-33].
• Data compression [22,34-37].
• Detection of spectral anomalies [38].
• Peak detection [25].
There is a whole family of different wavelet methods available, depending on the signal properties and the type of information that is to be extracted. However, this chapter will only focus on the fast wavelet transform (FWT), which is based on Mallat's algorithm [39,40]. It should be mentioned that the described methods for achieving parsimonious models do not depend on one particular type of wavelet transform; other types can be used. FWT is not always optimal for all types of problems, and other techniques such as wavelet packets [41], continuous transforms [42,43] and biorthogonal transforms [37] should be considered. Some of the properties of the FWT that make it an attractive transform are:

• Orthonormal bases, which make many algebraic operations simple.
• The reconstruction using all wavelet coefficients is lossless.
• The FWT algorithm is very efficient: O(n), which is faster than the FFT's O(n log n).
• FWT can make use of wavelet bases with compact support.

In addition to attractive signal processing properties, wavelet transforms in general are also useful for another reason. It is argued in this chapter that representing spectral profiles in the wavenumber-scale (frequency) domain (referred to as a scalogram) also has an interpretational advantage. A spectrum represented in the 2D wavenumber-scale domain explicitly includes information that is related to the continuity of the profile. The scalogram enables us to regard a signal profile as a collection of broad and narrow features at various wavenumber positions. To understand scalograms, it is useful to keep in mind the similarities of the FWT to other time-frequency domain methods, such as the Short Time Fourier Transform (STFT) [44]. Here a Fast Fourier Transform (FFT) of a small part of the signal is performed (using a window) to produce an estimate of the frequency spectrum over a local time region. An apodizing function such as a Gaussian is usually applied over the window in order to remove ringing (Gibbs) effects. The process is repeated by moving a window of a certain size over the signal. The size of the window is important: a short window will produce an excellent time resolution but a poor frequency resolution, and a long window will produce the opposite. The wavelet transform can be interpreted as an STFT where a short window length is used for the higher frequencies and a long window for the low frequencies. So what are the effects of analysing a single peak? How will the peak shape affect the scalogram? The appearance of the scalogram of a peak can be investigated in a simple example: let us consider varying the width of a single Lorentzian peak. The Lorentzian function is used here since it is commonly used to model peaks in infrared spectra. For each width an FWT is performed and the scalogram plotted, see Fig. 2. The basic idea is that a very broad peak will have very few high-frequency components, whereas a very sharp peak will have contributions from low to high frequencies. Here, as the width of the Lorentzian peak becomes smaller, more of the short-scale coefficients become larger. Stated another way: sharp peaks show coefficients over several scales, whereas broad peaks only have contributions from the lower scales. This interpretation can be used in the analysis of scalograms from regression and classification models. For instance, B-coefficient vectors used in regression, and decision planes in multivariate discrimination, can be displayed as scalograms. Variables in the scalogram that are identified as particularly important for the prediction can then be positioned along the wavenumber axis, in addition to the information about their frequency content. Finding a selected variable at high scales strengthens the assumption that a sharp feature is important for the prediction. Sometimes it is observed that several variables representing different scales of the same peak are selected. So, in addition to generating a parsimonious model from the variable selection process, enhanced qualitative information about the important features is also obtained.

Fig. 2 Demonstration of how the shape of a Lorentzian peak affects the scalogram. Note that as the peak gets more narrow, the scalogram gets contributions from progressively higher scales.
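A sketch of this width experiment, assuming PyWavelets: the energy per detail scale of the DWT of a broad versus a narrow Lorentzian peak.

```python
import numpy as np
import pywt

t = np.linspace(-1, 1, 1024)
lorentz = lambda w: 1.0 / (4.0 * (t / w) ** 2 + 1.0)    # peak at 0, width w

for width in (0.5, 0.05, 0.005):                        # broad -> sharp
    coeffs = pywt.wavedec(lorentz(width), "db4", mode="periodization")
    energy = [float(np.sum(c ** 2)) for c in coeffs[1:]]  # detail levels, coarse->fine
    print(f"width={width}: {np.round(energy, 4)}")
# the sharper the peak, the more energy appears at the fine (high) scales
```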
Choosing optimal wavelet bases. Unlike the Fourier transform, there are several basis functions to select from when using wavelet transforms. This means there must be a criterion for choosing the optimal wavelet. A reasonable criterion is the compression ability of the analysing wavelet: the optimal wavelet is defined as the one that produces the smallest number of coefficients needed to describe the data. The following algorithm is used here:

• A representative spectrum of the data set is chosen. One typical representative is the mean vector, which has a corresponding wavelet coefficient vector w. However, other representative data vectors could be used, e.g. the standard deviation spectrum or the first principal component; the choice depends on the data set at hand.
• Sort (from high to low) the squared elements w_i² and insert the result into a vector f.
• Calculate u_i = log Σ_{j=i}^{N} f_j. Note that it is here assumed that vector indices start at 1 (following MATLAB notation).
• The u vectors are normalised such that the largest element (the grand maximum) over all the vectors is set to one and the lowest (the grand minimum) to zero.
• The sum of all the elements in each of the normalised u vectors is calculated. This gives an indication of the area E_k below each curve. These curves are monotonically decreasing with only positive values, and the one with the smallest area E_k corresponds to the optimal wavelet for the particular data set.

For all data sets in this chapter, the following six wavelets were tested: Haar, Beylkin, Coiflet, Daubechies, Symmlet and Vaidyanathan, with varying numbers of vanishing moments. The method produces plots as shown in Fig. 3.
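A sketch of this selection criterion using the wavelet families available in PyWavelets (the Beylkin and Vaidyanathan filters are not in pywt, so Haar, Daubechies, Coiflet and Symlet are compared here; all names are illustrative).

```python
import numpy as np
import pywt

def compression_curve(xbar, wavelet):
    """u vector: log of the residual energy after keeping the i largest coefficients."""
    w = np.concatenate(pywt.wavedec(xbar, wavelet, mode="periodization"))
    f = np.sort(w ** 2)[::-1]                   # squared coefficients, high -> low
    tail = np.cumsum(f[::-1])[::-1]             # sum_{j >= i} f_j
    return np.log(tail + 1e-30)                 # small offset avoids log(0)

def best_wavelet(xbar, candidates=("haar", "db4", "db8", "coif3", "sym5")):
    curves = {name: compression_curve(xbar, name) for name in candidates}
    lo = min(c.min() for c in curves.values())  # grand min/max over all curves
    hi = max(c.max() for c in curves.values())
    areas = {name: ((c - lo) / (hi - lo)).sum() for name, c in curves.items()}
    return min(areas, key=areas.get)            # smallest area = optimal wavelet
```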
3 Methods for creating parsimonious models

In this section, several different strategies for creating parsimonious models are described.
Fig. 3 The u vectors for different wavelet functions. The curve with the smallest area corresponds to the optimal wavelet in this scheme.
3.1 The simple multiscale approach

Wavelet transforms inspect signals at different scales or resolutions. At the coarse level, only the most prominent large features can be seen, whereas at the higher levels finer details are captured. Multiresolution develops representations of a function f(t) at various levels of resolution, where each level is expanded in terms of translated scaling functions φ(2^j t - k). As mentioned earlier in this book, a sequence of embedded subspaces V_j is created:

V_0 ⊂ V_1 ⊂ ... ⊂ V_{J-1} ⊂ V_J    (15)
where J corresponds to the highest and 0 to the lowest resolution level. f(t) can be approximated by the projection P_J f onto the space V_J:

P_J f = Σ_k c_{J,k} φ(2^J t - k)    (16)

It is here assumed that P_J f = f, i.e. the maximum signal resolution is contained in the original function. The wavelet coefficients correspond to the contributions from the projections onto the spaces of details between two approximation spaces V_j and V_{j+1}. A detail space W_j is spanned by the translated wavelet functions ψ(2^j t - k) and is the difference of detail between the two approximation spaces V_{j+1} and V_j:

V_{j+1} = V_j ⊕ W_j    (17)

where ⊕ is the direct sum. The projection Q_j f of f onto W_j is

Q_j f = Σ_k d_{j,k} ψ(2^j t - k)    (18)

and relates to the approximation projections as follows:

P_j f = P_{j-1} f + Q_{j-1} f    (19)

which means that

P_J f = P_0 f + Σ_{k=0}^{J-1} Q_k f    (20)
An algorithm for the construction of more parsimonious regression and classification models can be based on these formulas:

for s = 0 to J do
    Reconstruct each spectrum at resolution level s (or use the wavelet coefficients directly from scales 0 to s): X^(s) = P_s X
    Perform multivariate analysis on X^(s)
    Record prediction errors etc.
end for

Parsimony is achieved through optimisation of the resolution level of all the spectra in the data set X with respect to the prediction ability for the dependent variable y. The process of changing the resolution level of a spectrum is shown in Fig. 4. An important question is: what resolution level is sufficient to maintain a good prediction model? The new model must be compared to a model using all available variables. If it has a better prediction there is no problem; the problem occurs when the smaller model has a worse prediction error. A general method for the generation of an optimal solution is difficult, because the acceptable lower limit of the prediction errors is problem dependent.
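A sketch of this loop, assuming PyWavelets and scikit-learn's PLSRegression with leave-one-out RMSECV as the recorded error; all names and defaults are illustrative.

```python
import numpy as np
import pywt
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def multiscale_errors(X, y, wavelet="db4", n_factors=4):
    """RMSECV of a PLS model as scales 0..s are cumulatively added."""
    coeffs = [pywt.wavedec(x, wavelet, mode="periodization") for x in X]
    n_scales = len(coeffs[0])
    errors = []
    for s in range(1, n_scales + 1):
        # keep the approximation plus the first s-1 detail levels (coarse -> fine)
        Xs = np.array([np.concatenate(c[:s]) for c in coeffs])
        n_comp = min(n_factors, Xs.shape[1] - 1) or 1   # respect the rank limit
        pred = cross_val_predict(PLSRegression(n_components=n_comp),
                                 Xs, y, cv=LeaveOneOut())
        errors.append(float(np.sqrt(np.mean((pred.ravel() - y) ** 2))))
    return errors   # one RMSECV value per cumulative resolution level
```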
Fig. 4 Illustration of multiresolution reconstruction (P_0 f to P_8 f) applied to a single infrared spectrum.
The scale-error plot. By following the simple multiscale algorithm suggested above, a scale-error² plot is produced that can be used to make decisions about the optimal resolution level. By plotting the scales added versus the prediction errors, it is possible to study how the various levels of cumulative detail influence the prediction ability. Some possible outcomes for such plots are:

• The prediction error increases as more scales are added. This indicates that the underlying structure is present in the very broad features of the signal, and that adding more scales will introduce structures that are detrimental to the calibration model.
• The prediction error decreases as more scales are added. This is the most common case: there is useful information for the prediction of the dependent variable at several scales.
• The prediction error decreases to a minimum before increasing again. This suggests that scales above the minimum have a detrimental effect on the prediction ability.
• The prediction error decreases to a certain level and flattens out. This means that scales over a certain level do not change the prediction error and can therefore be removed.

The other useful information from the scale-error plot is that it tells us something about the width of the features important to the prediction. If it is possible to create a very good model at very low resolution, this indicates that broad features are important for the prediction. If higher resolution is necessary to maintain good prediction, this indicates the importance of narrow features in the data.

² Error types can be e.g. the root mean square error of cross-validation (RMSECV), the root mean square error of prediction (RMSEP) or the predictive residual sum of squares (PRESS).
The scale-error-complexity (SEC) surfaces. Instead of observing the prediction error with respect to resolution only, it is also possible to monitor the complexity of the calibration/classification model. In PLS this can be measured by the number of PLS factors needed. How the error (e.g. RMSECV, RMSEP, PRESS) changes with the added scale and the model complexity can be observed in scale-error-complexity (SEC) surfaces. Here the first axis is the scale, the second axis is the model complexity (for PLS, the number of factors) and the third axis is the error. The complexity dimension is not limited to the number of PLS factors; for classification and regression trees (CART), for example, a measure based on tree depth and branching could be used [45]. With the PLS-based SEC surfaces it is important to remember that parts of such surfaces are not valid due to rank problems. These plots usually have a "rank ridge" which is defined by the theoretically maximum number of PLS factors A that can be extracted. If j is the resolution level, then A must satisfy the following relation:

A ≤ 2^(j+1) − 1

(21)

where 2^(j+1) − 1 is the total number of wavelet coefficients at scale j (i.e. using scales 0, ..., j).
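A sketch of how such a surface could be computed, reusing reconstruct_at_level from the previous example; rmsecv stands for any cross-validation routine returning the error of a PLS model with a factors (both names are illustrative placeholders):

```python
import numpy as np
import pywt

def sec_surface(X, y, rmsecv, wavelet="db4", a_max=15):
    """Scale-error-complexity surface: rows are scales, columns PLS factors."""
    n_scales = len(pywt.wavedec(X[0], wavelet))
    surface = np.full((n_scales, a_max), np.nan)
    for s in range(n_scales):
        Xs = np.array([reconstruct_at_level(row, wavelet, s) for row in X])
        rank_limit = 2 ** (s + 1) - 1        # the "rank ridge" of Eq. (21)
        for a in range(1, min(a_max, rank_limit) + 1):
            surface[s, a - 1] = rmsecv(Xs, y, a)
    return surface
```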
3.2 The optimal scale combination (OSC) method
The simple multiscale approach can be seen as a subset of the optimal scale combination (OSC) method. In the simple multiscale approach, scales were increased in a systematic fashion: {0}, {0 1}, {0 1 2}, {0 1 2 3}, ... In OSC all possible combinations of the J + 1 scales are generated and tested. The combination that gives rise to a regression or classification model with the lowest number of coefficients and the lowest prediction error is selected. Assume i is the number of scales to be selected from a total of K scales. There are

N(K, i) = C(K, i) = K! / (i!(K − i)!)

(22)

ways of selecting i such scales. Note that these are combinations and not permutations, since there is no difference between e.g. selecting scales {1 2 4} and {2 1 4}. The variable i runs from 1 (we are excluding scale 0 here) to J, the number of scales one decides to collect. The total number of scale combinations must therefore be the sum

Ñ(K) = Σ_{i=1}^{K} N(K, i) = 2^K − 1

(23)
The idea is illustrated with a simple example. Assume there are three scales: 1, 2 and 3, so that Ñ(3) = 7. The seven scale combinations are {1}, {2}, {3}, {1 2}, {1 3}, {2 3} and {1 2 3}. If the jth scale combination c^(j) gives rise to a multivariate model, there is a (prediction) error value r_j associated with each combination. c^(j) is a set with members q_1, q_2, ..., q_n, where n is the number of scales in the set (e.g. in the sixth combination above q_1 = 2, q_2 = 3). Of the pairs (c^(j), r_j) we are only interested in those models with a low value of r_j (prediction error). However, when there are several combinations with the same (or comparable) prediction error, we should follow Occam's razor and choose the smallest one. In this case this is easy to define because it can be based on the total number of wavelet coefficients associated with a particular scale combination c^(j). The function Ψ(c^(j)) produces the total number of wavelet coefficients

Ψ(c^(j)) = Σ_{i=1}^{n} 2^{q_i}

(24)

which can be used to determine which model to choose.
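A brute-force sketch of the OSC search; W_scales is assumed to be a list of per-scale wavelet-coefficient blocks (one [N × 2^q] matrix per scale q) and fit_error any cross-validated error routine — both names are illustrative placeholders:

```python
import itertools
import numpy as np

def osc_search(W_scales, y, fit_error):
    """Test every scale combination; prefer low error, then few coefficients."""
    best = None
    scales = range(1, len(W_scales))            # scale 0 excluded, as in the text
    for i in range(1, len(W_scales)):
        for combo in itertools.combinations(scales, i):
            Xc = np.hstack([W_scales[q] for q in combo])
            err = fit_error(Xc, y)
            size = sum(2 ** q for q in combo)   # Psi(c) of Eq. (24)
            if best is None or (err, size) < best[:2]:
                best = (err, size, combo)
    return best                                  # (error, n_coefficients, scales)
```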
3.3 The masking method
Finding only the most important scales with respect to the prediction error does not indicate where in the wavenumber domain these features are located. Often it is observed that going from one resolution level to another has a dramatic effect on the quality of the multivariate model. In such a case it is possible to perform a systematic search for where in the wavenumber region for the particular scale the important features are approximately localised. The method described here is only applied to the cluster analysis example; however, it could be used for other multivariate modelling techniques also. In cluster analysis, instead of prediction error, the interest is focused on properties such as overlap between clusters, cluster separation, cluster variance etc. In general, some interesting property E is observed to change with respect to the resolution level. A scale k (for the dyadic FWT) has 2^k wavelet coefficients. Each coefficient is associated with a wavelet that has a certain localisation along the wavenumber (or time) axis. If the wavelet function for a region is not contributing to the change in E, it is expected that setting the corresponding coefficient to zero will not change E significantly. One way to find the important wavenumber regions is to look through all the possible combinations of wavelet coefficients being multiplied with zero or one for the scale in question. In total there are 2^(2^k) binary mask vectors m^(i) that can be multiplied with a wavelet coefficient vector w at scale k. The structure of the new wavelet coefficient vector w^(i) is

w^(i) = [φ_0 w_0^T w_1^T ... (w_k ⊙ m^(i))^T 0 ... 0]

(25)

where ⊙ indicates element-wise multiplication between two vectors. This masking operation is applied to all the wavelet coefficient vectors of the spectra in the data set, which are then reconstructed. Again, it should be stressed that reconstruction is not necessary; it is sufficient to use all wavelet coefficients up to the current scale. All the wavelet coefficients above resolution level k are set to zero and do not contribute. For each combination i the chosen multivariate method is applied to the masked data. Fig. 5 illustrates the basic idea behind the method.
Fig. 5 Illustration of the masking principle using scale 2 as an example. Shaded areas are where a masking value of 1 is used, i.e. white regions correspond to wavelet coefficients not used in the modelling. Note however that white areas below scale 2 are all used.
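A sketch of the exhaustive mask search for one scale; rebuild (applying a mask to the scale-k coefficient block and zeroing all finer scales) and cluster_measure (returning the monitored property E) are illustrative placeholders:

```python
import itertools
import numpy as np

def mask_search(W, k, rebuild, cluster_measure):
    """Evaluate E for every binary mask over the 2^k coefficients of scale k."""
    results = []
    for bits in itertools.product([0, 1], repeat=2 ** k):
        m = np.array(bits)
        E = cluster_measure(rebuild(W, k, m))   # Eq. (25) applied to all spectra
        results.append((m, E))
    return results
```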
It is now necessary to measure the effect the various combinations have on E. For instance, let E be the degree of cluster overlap. In this case it is possible to observe how different masks affect E. Of course, it is vital to have decided which objects belong to a cluster. In other words, the analysis is temporarily turned into a supervised rather than an unsupervised problem, which for complex cases can be solved by methods like CART [45], discriminant PLS [46-48] and Quinlan's C4.5 algorithm [49]. Masking analysis makes it possible to detect the important variables (i.e. wavelet coefficients localised in the wavenumber domain) that are always associated with certain values of E. In a masking analysis it is not necessary to examine the real values of the wavelet coefficients, only the binary masking matrix. It should also be stressed that it is not necessary to use all 2^k coefficients in the search. For the higher scales the number of possible masking combinations is very large. One way to reduce this number is to look at larger blocks of coefficients and investigate the combinations of whole wavelet coefficient blocks being set to zero or one rather than the individual coefficients. In this study the size of the blocks has been restricted to be a power of 2. Thus, the total number of mask vectors is 2^(2^k/p), where p is the block size.
3.4 Genetic algorithms
Genetic algorithms (GA) [50] were originally invented by Holland [51]; the basic mechanisms of natural evolution are mimicked in order to perform optimisation of very complex problems. There are several different implementations of and approaches to GA today, and they of course do not need to follow the "real" evolutionary principles found in nature to be efficient. However, the following steps can be found in most GA methods:
1. The algorithm is iterative and halts after some pre-defined optimum criterion is satisfied.
2. In each iteration step, the algorithm maintains a population of hypotheses that are coded by a linear string of symbols ("genes").
3. In each iteration step, hypotheses are probabilistically selected according to a fitness function to be allowed to "live" in the next population.
4. Combinations between selected hypotheses are made using several genetic operators such as cross-over and mutation.
A common way to represent a hypothesis is by using binary bit strings (a sketch follows this paragraph). A binary string can be given an arbitrary coding and thus has the potential to represent very complex hypotheses. However, there is no algorithmic need to maintain binary strings for problems where this is not natural. In the examples discussed in this chapter the use of binary strings is most natural, since 1 can represent a selected variable whereas 0 represents a variable that has been removed from the analysis. For the GA applications described in this chapter the fitness function is set to be the cross-validation error from PLS on a data set using the selected variables dictated by the binary string (i.e. the hypothesis is identical to the set of selected variables). For the applications in this chapter the MATLAB gaselctr.m routine from the PLS_Toolbox (from Eigenvector Research, 830 Wapato Lake Road, Manson, WA 98831, USA) is used.
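A bare-bones GA for binary variable selection, assuming cv_error is a cross-validated PLS error routine; this is a generic sketch (truncation selection, single-point cross-over), not a reproduction of the PLS_Toolbox routine:

```python
import numpy as np

def ga_select(X, y, cv_error, pop=64, gens=100, p_mut=0.005, seed=0):
    """Evolve binary selection strings; lower cross-validation error = fitter."""
    rng = np.random.default_rng(seed)
    M = X.shape[1]
    popn = rng.integers(0, 2, size=(pop, M))

    def fitness(s):
        return cv_error(X[:, s.astype(bool)], y) if s.any() else np.inf

    for _ in range(gens):
        order = np.argsort([fitness(s) for s in popn])
        parents = popn[order[: pop // 2]]            # keep the better half
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, M)                 # single-point cross-over
            child = np.concatenate([a[:cut], b[cut:]])
            child = child ^ (rng.random(M) < p_mut)  # mutation flips bits
            children.append(child)
        popn = np.vstack([parents, children])
    return popn[np.argmin([fitness(s) for s in popn])].astype(bool)
```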
3.5 The dummy variables approach
A certain variable selection can be represented by a binary vector d_k:

X_vs^(k) = X diag(d_k)

(26)
The binary codes in such a vector are 0 for excluding a variable and 1 for including it. The matrix X_vs^(k), together with the dependent variable vector (or matrix), is subsequently analysed with the multivariate calibration method, and the prediction error from the analysis is recorded. The goal is to optimise the prediction error by finding the best d_k. Here we will focus on PLS in particular. One method that tries to solve this problem is Generating Optimal Linear PLS Estimations (GOLPE) [8]. The basic idea of GOLPE is to produce an experimental design matrix D for all the variables that specifies the necessary PLS runs for determining the important variables for the prediction. Each row in D is a binary d_k vector, which is chosen according to various experimental design strategies. A cross-validation is performed and the prediction error of the PLS model (the standard deviation of error of predictions, SDEP³) is recorded. If D has dimensions [K × M] (M is the number of variables), there are K different PLS calculations to perform. These calculations produce the SDEP vector, which has dimensions [K × 1]. Using a meta-PLS model with D as the X matrix and the SDEP vector as the dependent y vector, it is possible to localise the variables that are important for the prediction. This is done by detecting the variables with large negative loadings (i.e. variables that make the prediction error smaller). However, as noted in [8], this approach can be problematic due to chance correlations, where it is possible to select variables that do not really have a significant influence on the reduction of the PLS prediction error. The way the authors chose to solve the problem was by introducing several dummy design variables. These design variables are chosen to have random high and low codes. For each variable j (both real and dummy) the effect E_j is calculated:

E_j = (SDEP₊ − SDEP₋)/K

(27)

where SDEP₊ represents the mean SDEP value of all runs where variable j has code 1 and SDEP₋ the mean SDEP value of all runs where it has code 0. From this it is possible to compute a threshold value which a variable must exceed in order to be selected as significant for the prediction model. This threshold is given by

λ = t_crit ( Σ_{j=1}^{N_D} E_j² / N_D )^{1/2}

(28)

³ SDEP = (PRESS/N)^{1/2}.
where t_crit is the Student's t-test critical value for a certain p-value and N_D is the number of dummy variables. In the GOLPE approach the status of a variable falls into three possible categories:
• If |E_j| < λ it cannot be decided whether the variable is significant or not. These are the uncertain variables. GOLPE will start a new iteration to determine the status of these variables.
• If |E_j| > λ and E_j < 0 the variable is selected, since it has a significant effect on the reduction of the prediction error value. These are the "fixed" variables.
• If |E_j| > λ and E_j > 0 the variable is discarded, since it has a significant effect on the increase of the prediction error value.
The GOLPE process is therefore iterative, where uncertain and "fixed" variables are kept in the next step; the variables in the third category above are removed. The GOLPE method is a very powerful and appealing way to select the most important variables. However, there are problems with this method. The most serious is that the algorithm becomes computationally demanding for data matrices containing a large number of variables. GOLPE improves on this problem using an approach based on D-optimal design in the loadings plots. This may be possible for PLS; however, it may not be applicable to other types of regression and classification methods (such as e.g. genetic programming and rule induction). In this chapter GOLPE is not used. Instead, the idea of using dummy variables is employed to refine the results from other variable selection procedures. The main difference is that the variable selection methods presented in this chapter do not perform a design strategy for selecting the variables. Variable selection by genetic algorithms is a good example, where the population of variable selection choices does not follow a systematic design. It is here argued that using dummy variables in non-design variable selection runs (referred to as the dummy variable approach, DVA) can still be useful as a further distillation of the most important variables for prediction. The new dummy variables are used to establish a threshold value for selecting variables. The approach used here is as follows (a sketch of the threshold computation follows the list):
• Initial variable selection is performed (e.g. GA or TPW, see Section 3.7).
• The threshold λ for the E_j values is computed using a binary dummy variable matrix and the prediction errors from the variable selection d_k vectors in step 1.
• The E_j values from the binary matrix of variables selected in step 1 are computed.
• DVA removes the unimportant variables suggested by the threshold.
• A PLS model with the newly selected variables is generated.
The dummy variable method used here is not iterative and only indicates the variables found to have positive effects on the prediction ability.
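A sketch of the effect and threshold computation of Eqs. (27)-(28), assuming scipy is available; D holds the binary selection vectors from the earlier runs and sdep (an array) their prediction errors — the argument names are illustrative:

```python
import numpy as np
from scipy import stats

def dva_threshold(D, sdep, n_dummy, p=0.0005, seed=0):
    """Return effects E_j, threshold lambda and the kept (beneficial) variables."""
    rng = np.random.default_rng(seed)
    K = D.shape[0]
    Dfull = np.hstack([D, rng.integers(0, 2, size=(K, n_dummy))])
    # Eq. (27): difference of mean errors between code-1 and code-0 runs
    E = np.array([(sdep[c == 1].mean() - sdep[c == 0].mean()) / K
                  for c in Dfull.T])
    E_dummy = E[-n_dummy:]
    t_crit = stats.t.ppf(1 - p, df=n_dummy - 1)
    lam = t_crit * np.sqrt(np.sum(E_dummy ** 2) / n_dummy)   # Eq. (28)
    E_real = E[: D.shape[1]]
    return E_real, lam, (np.abs(E_real) > lam) & (E_real < 0)
```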
3.6 Mutual information
The concept of mutual information originates from information theory [52] and can be seen as a generalisation of the correlation coefficient. The mutual information between a class c and an input feature x_j is the amount by which knowledge of the feature decreases the uncertainty about the class. Mutual information is calculated from the probability distributions p(x), p(y) and p(x,y): p(x) is the distribution of the values of a certain variable x, and p(x,y) is the joint probability between the variable x and the dependent variable y. A comparison of the joint probability p(x,y) with p(x)p(y) is made. For statistically independent data [53] we have that:

p(x)p(y) = p(x,y)
(29)
If these probabilities are not the same, there is a dependence between the two distributions, and no prior assumptions about its form are made. The standard way of creating probability distributions from histograms is only optimal for dense data. For this reason, an alternative method is employed that is based on kernel density estimation [54]. The probability distributions estimated by this approach are subsequently used in the formula for the calculation of the mutual information I(x,y) [53]:

I(x,y) = Σ_x Σ_y p(x,y) log₂( p(x,y) / (p(x)p(y)) )

(30)
I(x,y) has a large magnitude if one distribution provides much information about the other, and a small one if it provides little. In a purely linear Gaussian situation, I(x,y) reduces to correlation and provides identical results. The variable selection method used here with mutual information is really univariate, because each individual variable x_i is associated with an I(x_i,y) value independently of the other variables x_j, i ≠ j. The threshold for the MI is calculated by using DVA. Extra dummy variables are included that have random values in the same range as the real data matrix. The mutual information for each of these randomised variables is calculated and averaged. A Student's t-test with a selected p-value determines the threshold for accepting a variable as significant for the prediction.
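A kernel-density-based estimate of Eq. (30) can be sketched as follows; the regular-grid approximation and the use of scipy's gaussian_kde are assumptions, not the authors' exact estimator from [54]:

```python
import numpy as np
from scipy.stats import gaussian_kde

def mutual_information(x, y, grid=64):
    """Approximate I(x, y) in bits from a 2D Gaussian kernel density estimate."""
    xs = np.linspace(x.min(), x.max(), grid)
    ys = np.linspace(y.min(), y.max(), grid)
    XX, YY = np.meshgrid(xs, ys)
    pxy = gaussian_kde(np.vstack([x, y]))(np.vstack([XX.ravel(), YY.ravel()]))
    pxy = pxy.reshape(grid, grid) / pxy.sum()   # discretised joint pmf
    px, py = pxy.sum(axis=0), pxy.sum(axis=1)   # marginals over x and y
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / np.outer(py, px)[mask]))
```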
3.7 Selecting large w coefficients
Regression. For many of the applications described in this chapter partial least squares regression [46,55-57] is used. There are several possible approaches to variable selection in PLS, but here a strategy is used which is very similar to the approach described by Lindgren et al. [58,59]. A central part of the PLS algorithm is the calculation of a weight vector w for each PLS factor. PLS simultaneously finds important and related components in both X and Y space, where the factors are written as linear combinations of the original variables. A central part of the algorithm uses Y-space information to orient the latent variables of the X space:

w = κ X^T u
(31)
where u contains Y-space information and κ is a normalisation factor. w is subsequently used to estimate the X-space scores:

t = Xw
(32)
The variable selection strategy used here tries to simplify the structure in w by keeping only the k largest (in absolute value) elements. The remaining elements are set to zero. The value of k is estimated by cross-validation. In addition it is necessary to ensure the orthogonality/orthonormality of the model matrices (T and W). This variable selection method is hereafter referred to as the Truncating PLS w-coefficient method (TPW).
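The truncation step itself is simple; a minimal sketch (the re-normalisation and re-orthogonalisation of T and W mentioned above must still happen inside the PLS loop):

```python
import numpy as np

def truncate_w(w, k):
    """TPW: keep the k largest |w| entries of a PLS weight vector, zero the rest."""
    w = np.asarray(w, dtype=float).copy()
    small = np.argsort(np.abs(w))[:-k]   # indices of all but the k largest
    w[small] = 0.0
    return w
```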
Classification. PLS applied to multiple dependent y-variables (PLS2) can also be used in classification and it is usually referred to as discriminant PLS (DPLS). In PLS2 one uses linear combinations of the Y-space variables rather than individual y-variables. PLS2 therefore has an iterative stage in each of the PLS2 factor calculations. The equation to solve is Y = XB
(33)
where B is the regression matrix with dimensions Dim(B) = [M × K], M being the number of variables and K the number of classes. B is estimated
by finding a generalised inverse X⁺ provided by the PLS2 algorithm. The Y matrix has to be coded to include information about the hard class memberships associated with the objects. A row vector y_i^T in Y is interpreted as follows:

y_j = 1 if object i belongs to class j, and 0 otherwise

(34)
However, the estimated Y matrix contains real numbers, and one way to convert it to binary form is to find the element in a row with the largest absolute value, assign this element the value 1 and set all other elements to zero. The algorithm for the variable selection version of DPLS (VS-DPLS) as used here is [48]:

for a = 1 to A_max do
  Select a column vector u from matrix F
  t_0 = u
  while not STOP do
    w^T = u^T E
    w^T = Ω(w^T, k)
    if a > 1 then
      w^T = Φ(w^T, W^T)
    end if
    w^T = w^T / (w^T w)^{1/2}
    t = E w
    if a > 1 then
      t = Φ(t, T)
    end if
    Calculate q and u
    if (Σ_i (Δt_i)²) / (t^T t) < conv then STOP = TRUE
    end if
  end while
  Calculate p (storage of matrices)
  Update E and F
end for
Ω is an operator that picks the k largest elements from the |w| vector. Φ is an operator that takes one vector and a matrix and makes the vector orthogonal to the columns in the matrix. Let DPLS(X, y, k) be cross-validated VS-DPLS that selects the k largest |w_j| at each factor. The optimal selection of k (i.e. k_opt) is the one with the lowest PRESS value:

for k = 1 to k_max do
  DPLSmodel_k = DPLS(X, y, k)
  Store PRESS(k)
end for
Select the model corresponding to min(PRESS)

A note of caution should be made in connection with setting k = 1. In this case the selected regression model has full rank (i.e. the number of PLS factors is identical to the number of selected variables). The investigator should be careful not to use a model that might be unstable.
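The outer selection loop translates directly; vs_dpls_press stands for a cross-validation routine returning the PRESS of the VS-DPLS model for a given k (a placeholder for the full algorithm above):

```python
import numpy as np

def select_k(X, y, vs_dpls_press, k_max):
    """Pick the truncation level k with the lowest cross-validated PRESS."""
    press = np.array([vs_dpls_press(X, y, k) for k in range(1, k_max + 1)])
    return int(np.argmin(press)) + 1, press
```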
4
Regression and classification
4.1 Regression
Multiscale regression or wavelet regression [60] is based on the simple idea that the mapping between the independent and dependent variables may involve different resolution levels. Most approaches to multivariate regression and classification only make use of the original data resolution in forming models. The multiscale approach enables the investigator to zoom in and out of the detail structures in the data. Let us now consider regression in general in terms of a matrix formulation of the fast wavelet transform.
The FWT basis matrix. The fast wavelet transform (FWT) can be formulated in terms of matrix algebra by storing each of the wavelet functions in the time/wavelength domain in a matrix B. This matrix contains all the translations and dilations of the wavelet necessary to perform a full transform. One common way to organise this matrix is to sort the sets of shifted basis functions according to their scale. This means that we present all the basis functions that are shifted but have the same scale, followed by the next higher (or lower) scale's shifted basis functions. This organisation is not chosen arbitrarily but is closely related to how Mallat's algorithm [39] for calculating the wavelet coefficients operates. The number of shifts along the x-axis depends on the value of the scale j. Assuming that the total number of elements in our data vector is M = 2^(J+1), the different scales are the integers from 0 to J. The shifting coefficient k has the integer values 0 to 2^j − 1 for each j value. The structure of the basis matrix B is as follows:

B = [B_0; B_1; ...; B_{J−1}; B_J]

(35)
where each submatrix B_j has a diagonally dominant structure for scale j. B_0 is associated with the projection onto the lowest-resolution (j = 0) scaling function. Each basis matrix added is related to the direct sum with the corresponding detail spaces W_j mentioned earlier in the book. The largest submatrices correspond to the shortest scales (dominated by high-frequency components). B is orthonormal and the wavelet transform can be written as

z = Bx
(36)
where z is the vector of wavelet coefficients and x is the vector containing the input data profile. Reconstruction is trivial:

x = B^T z
(37)
In a typical chemical regression problem the X matrix contains the N spectra (as rows) and M wavelengths (as columns), and y is a column vector of a component concentration. Assuming that Beer's law holds, we have that

y = Xb
(38)
where b is the linear regression coefficient vector. To estimate b we need to find a generalised inverse such that

b̂ = X⁺y
(39)
X⁺ is the generalised inverse provided by some regression method (e.g. partial least squares regression). Inserting for X,
X = ZB
(40)
where Z is the matrix of wavelet coefficients. Substituting this into equation 39, one gets

b̂ = B⁺Z⁺y = B^T Z⁺ y = B^T b_w
(41)
where b_w is the resulting B-coefficient regression vector from the direct analysis of the wavelet coefficient matrix Z. A simple multiplication by the transposed basis matrix thus converts this vector into the B-coefficient vector from the analysis of the raw data. This is of course related to the fact that the FWT basis is orthonormal. For wavelet transforms that do not satisfy this criterion the conversion back to the original domain is not equally straightforward.
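These relations are easy to verify numerically for the Haar case; a minimal sketch (haar_basis is an illustrative helper, not from the text):

```python
import numpy as np

def haar_basis(J):
    """Orthonormal Haar basis matrix B for signals of length 2^(J+1), Eq. (35)."""
    M = 2 ** (J + 1)
    rows = [np.full(M, 1.0 / np.sqrt(M))]            # phi_0, the scaling function
    for j in range(J + 1):                           # scales 0..J
        width = M // 2 ** (j + 1)
        for k in range(2 ** j):                      # shifts 0..2^j - 1
            r = np.zeros(M)
            r[2 * k * width:(2 * k + 1) * width] = 1.0
            r[(2 * k + 1) * width:(2 * k + 2) * width] = -1.0
            rows.append(r / np.sqrt(2 * width))      # unit norm
    return np.array(rows)

B = haar_basis(3)                                    # 16 x 16
assert np.allclose(B @ B.T, np.eye(16))              # orthonormal: x = B.T @ z
# For spectra in the rows of X: Z = X @ B.T, and Eq. (41) gives b = B.T @ b_w
```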
4.2 Classification
The idea of performing classification at different levels of resolution can be explained by using an analogy: assume we want to distinguish an elephant from a dog by analysing images taken at different distances. We know that the level of detail obtained from using a magnifying glass is not needed. In fact, it would probably be possible to distinguish these two animals from a distance of more than a kilometre (i.e. whether it is an elephant or not). At large distances we can only see the broad and overall features of these animals, but this level of detail would be sufficient for a classification. However, if we also want to distinguish between e.g. a wolf and a dog, a higher level of detail is necessary. The concept is illustrated in Fig. 6. Looking at this figure we see that it is possible to present diagrammatically the way a discrimination with respect to spatial resolution between these objects works. Starting at the image at the top left and moving right and down, it becomes easier to resolve the image into different objects. Here we are not interested in the actual distance between the objects, but rather in the actual shape. Thus, objects that have the same size and shape are grouped together. Such diagrams are here called scale dendrograms and have a similar interpretation to the classical dendrogram. The scale dendrogram efficiently summarises the separation of objects with respect to spectrum resolution. In this way it is possible to detect when certain patterns change significantly with the addition of a scale. It should be emphasised that the definition of similarity between the objects in the diagram will depend on the problem to solve.
Fig. 6 Classification at different levels of resolution. When the resolution is low it is still possible to distinguish between the elephant and something that is as small as a dog. If we want to discriminate between a dog and a wolf, we need higher resolution. The right-hand side of the figure shows a scale dendrogram which is used to summarise the qualitative properties of the classification of the objects.
Scale dendrograms can in principle be applied to both unsupervised and supervised classification; however, in this chapter only examples from unsupervised classification are included.
Unsupervised classification - Cluster analysis. Unsupervised classification or cluster analysis is a way to find "natural patterns" in a data set. There are no independent "true" answers that can guide the classification, and we are therefore restricted to constructing a set of criteria or general rules that can highlight the "interesting" patterns in a data set. A data set usually consists of a set of objects, each characterised by a feature vector x. To find patterns it is important to establish to what degree vectors are similar to each other. Similarity is often defined using a distance metric between objects i and j:

D_ij = F(x^(i), x^(j))
(42)
where F in principle can be any function operating on the elements of the object vectors. Typically, F is a Minkowski distance
D_ij = ( Σ_{k=1}^{M} |x_ik − x_jk|^p )^{1/p}

(43)
where M is the number of variables. When p = 2 this corresponds to the Euclidean metric. Such a metric for functions is written as:

D_ij = ⟨x_i(t) | x_j(t)⟩

(44)
assuming x_i(t) and x_j(t) are functions. This metric gives each individual infinitesimal point along the two curves x_i(t) and x_j(t) an equal contribution in the calculation. However, it may be that not all features of a curve should be taken into consideration for the detection of clusters. At a certain scale two functions may be very similar, whereas irrelevant scale information makes them far apart in the space defined by Eq. (44). By using an approach similar to the previous sections, it is possible to include scale information in the calculation of the metric. A multiresolution formulation of Eq. (44) becomes:

D_ij = ⟨ c_0^(i) + Σ_{l=1}^{J} d_l^(i)(t) | c_0^(j) + Σ_{l=1}^{J} d_l^(j)(t) ⟩^{1/2}

(45)
where c_0^(k) is a constant and d_l^(k)(t) is a detail function (from the wavelet coefficients) for scale l. In general, one can construct distance matrices that emphasise selected scales:

D_ij(v) = ⟨ c_0^(i) + Σ_{l=1}^{J} v_l d_l^(i)(t) | c_0^(j) + Σ_{l=1}^{J} v_l d_l^(j)(t) ⟩^{1/2}

(46)
A systematic exploration of different v_l parameters would be similar to the OSC approach (see above). However, here restrictions are placed upon the selection of the v weights to produce distance matrices with increased resolution of the spectra (i.e. the simple multiscale approach):

D_ij^(0) = ⟨ c_0^(i) | c_0^(j) ⟩^{1/2}

(47)

D_ij^(k) = ⟨ c_0^(i) + Σ_{l=1}^{k} d_l^(i)(t) | c_0^(j) + Σ_{l=1}^{k} d_l^(j)(t) ⟩^{1/2}

(48)
Each distance matrix D^(k) is then analysed to find groupings of the objects.
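In practice the D^(k) matrices can be obtained from Euclidean distances between partial reconstructions; a sketch reusing reconstruct_at_level from the multiscale regression example (the Symmlet 9 default merely anticipates the Eubacterium analysis below):

```python
import numpy as np
import pywt
from scipy.spatial.distance import pdist, squareform

def multiscale_distance_matrices(X, wavelet="sym9"):
    """One distance matrix D^(k) per number of detail scales kept, Eq. (48)."""
    n_scales = len(pywt.wavedec(X[0], wavelet))
    matrices = []
    for k in range(n_scales):
        Xk = np.array([reconstruct_at_level(row, wavelet, k) for row in X])
        matrices.append(squareform(pdist(Xk)))   # p = 2 case of Eq. (43)
    return matrices
```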
Fourier cluster analysis. In connection with multiscale cluster analysis, it is also interesting to make a comparison with Fourier transform cluster analysis. The idea is to use a subset of all the possible frequencies in the signal in the reconstruction of the spectra. This means investigating how the cluster analysis is affected by applying various bandpass filters to the spectra. There are several ways in which these bandpass filters can be designed, but here a simple approach is employed where a parameter μ indicates the upper limit to the frequencies retained in the filtered signal. For each value of μ all frequencies above it are smoothed to zero using a Gaussian function. A cluster analysis of the filtered data set is performed and the cluster structure is observed. Sometimes global frequencies can affect the cluster pattern; however, it is more likely to observe dependencies of a more localised nature. For such cluster pattern dependencies the Fourier transform is unable to localise the effect, since its basis functions have no localisation in time, only in frequency. For the Fourier cluster analysis performed in this chapter the following filter Q was used:

Q(m; μ) = 1 for m ≤ μ, and e^{−(m−μ)²/δ} for m > μ

(49)

where μ is the location of the Gaussian shape used, m is the frequency and δ is the width of the Gaussian peak. If F_j(m) is the Fourier transform of spectrum j, a set of filtered versions of the spectrum is made from

G_j(m; μ) = F_j(m) Q(m; μ),  μ ∈ [μ_min, μ_max]

(50)
The conceptual relation between classical, Fourier and multiscale cluster analysis is illustrated in Fig. 7.
5
Example applications
5.1 Regression
5.1.1 The simple multiscale approach
Data set. This is one of several test data sets available on the Internet [61]. The data set has been kindly provided by Dr. Windig and is a mixture of:
• 2-butanol;
• methylene chloride;
• methanol;
Fig. 7 Wavelet or multiscale cluster analysis can be seen as something in between classical cluster analysis in the time/wavelength domain and Fourier cluster analysis in the frequency domain.
• dichloropropane;
• acetone.
The mixtures are measured by near infrared (NIR) spectroscopy at wavelengths 1100-2498 nm. There are in total 140 spectra at 700 wavelengths. To enable analysis by FWT it is necessary to convert the data to a length 2^n. A subset of the original data was extracted to make a total of 512 variables. The extracted data correspond to a window between variables no. 147 and 658, see Fig. 8. Only four of the five components are analysed (for display reasons only). For comparison, PLS regression was performed on the four components using all available wavelet coefficients (this is identical to using the original data). The prediction errors using A = {13, 11, 11, 17} PLS factors are 2.2%, 2.1%, 2.3% and 2.3%. To see the relation between the change of scale and PLS factors on the prediction error, it is instructive to plot the SEC calibration surfaces, see Figs. 9 and 10. An automatic determination of the optimal number of PLS factors for each scale is difficult without causing overfitting.
Fig. 8 The variables between the two vertical bars were used in the analyses.
The interesting regions are zoomed so that large PRESS values do not dominate the plot. Note the "rank ridge" at the left part of each plot. By inspecting the SEC surfaces, possible candidates for parsimonious models can be identified, see Table 1. However, plotting the PLS-factor/error plots for selected scales is necessary to find better models. As Table 1 indicates, our first suggestions have relatively high prediction errors compared to the models using all the wavelet coefficients. Fig. 11 plots the prediction errors with respect to PLS factors for selected scales. To achieve prediction errors comparable to those observed when using all the wavelet coefficients, we need to go to at least scale 5 and A = 10 PLS factors for all the components. Another important question is: how does the prediction model itself change with the resolution? When few scales are added we have very smooth representations of the spectra, and it is expected that e.g. the B-coefficient regression vectors will display the same degree of smoothness as the input data. This can be seen in Fig. 12, which shows the PLS B-coefficient regression vectors for the different resolution levels. This type of plot can be used to get
Fig. 9 SEC surfaces for the four components.
a rough idea of where the features important for the prediction may be located. However, it is not possible to obtain a precise localisation, since we are looking at whole scales rather than individual wavelet coefficients. To obtain a more precise location of the features, variable selection is required.
5.1.2 Variable selection by genetic algorithms In the remaining analyses only 2-butanol will be discussed. Variable selection using a genetic algorithm as described above was performed. A maximum population size of 64 and 100 generations was imposed. The mutation rate was set to 0.005 and the maximum number of PLS factors was 30. The GA routine used a double breeding cross-over rule. The cross-validation blocks were randomly selected. This was to ensure that the GA operation would not optimize by locking onto favourable structures in a fixed cross-validation operation. All calculations were carried out on MATLAB5 using an Alpha
Fig. 10 The SEC (RMSEP error) surfaces for the four components.
processor-based machine with the Ultrix operating system. The gaselctr routine in the PLS_Toolbox⁴ performed the GA variable selection. The routine selected 249 variables (A = 14 PLS factors) with a prediction error of 2.1%. The PLS prediction error on the unseen validation set using all the variables is 2.1% (A = 13 PLS factors) for comparison. Thus, the prediction error is approximately the same with an almost 50% reduction in the total number of variables. A scalogram with the selected variables is shown in Fig. 13. Note that the whole of scales 2 and 3 are selected.

⁴ Eigenvector Research, 830 Wapato Lake Road, Manson, WA 98831, USA.
Table 1. Parsimonious models suggested by Fig. 9.

Compound             Scale   Error (validation set)   PLS factors
2-butanol            2       9.2                      3
Methylene chloride   3       6.8                      3
Methanol             3       4.6                      3
Dichloropropane      3       9.0                      5
Fig. 11 PLS factor plots at different scales for the four components.

In spite of the fact that a significant model reduction has been accomplished, the results can be further improved. The next question to ask is whether all the selected variables really are necessary. It would be interesting to find which variables seem to be most beneficial for the prediction ability of the PLS model. In order to shed more light on this problem, DVA was applied to the GA population of selected variables. The number of dummy variables tested for was set to be 60% of the total number of variables (307), and the Student's t-test critical factor was chosen to produce a p-value of 0.0005.
Fig. 12 The PLS B-coefficient regression vector for each component as different scales are added. Note the importance of scales 3 and 4, which introduce localisable peak-like features at several places.
The threshold found using DVA produced 58 variables. These variables are plotted in a scalogram in Fig. 14. The prediction error using these variables is 5.8% (A = 14 PLS factors). Below, a comparison of this result with the TPW analysis is made.
5.1.3 The TPW approach
TPW variable selection is a more rapid method than both GA and GOLPE. TPW selected 228 variables using A = 19 PLS factors. The result from the analysis is shown in Fig. 15. The prediction error on the unseen validation set is 2.1%.
Fig. 13 GA variable selection for 2-butanol. Black areas indicate the (249) selected variables.
Fig. 14 The most important variables after DVA of the GA result. Here 58 variables are selected.
Fig. 15 The results from using TPW variable selection.
The next step was to use DVA on the results. The result of using DVA on the TPW variable selection is shown in Fig. 16, where 61 variables (A = 14 PLS factors) are selected (RMSEP = 2.2%). The plot indicates regions at scales no. 2, 3, 5 and 7 as being particularly important. It is interesting to see that the region defined by bins no. 125-250 and scale 3 overlaps the region found by the GA variable selection procedure described above. By taking the intersection between the two wavelength-scale regions it is possible to see what regions they have in common, see Fig. 17. The prediction error using these 7 variables is 19.0% (A = 4 PLS factors). Among these common variables a relatively narrow region at scale 5 is observed. In [62] the Simplisma method was used to find selective variables for the different components. Of interest are two wavelengths: 2080 nm (bin no. 344) and 1716 nm (bin no. 163). Fig. 17 shows that this wavelet is close to the selective variable at 1716 nm. However, it is not selective for 2-butanol, but for dichloropropane. Windig and Stephenson found that position 2080 nm is a selective variable for 2-butanol, and the variable selection methods do select regions containing this position. In particular, the result from the GA variable
Fig. 16 DVA on the TPW variable selection results.
Fig. 17 Important regions detected by DVA applied to both TPW and GA variable selection. The pure spectrum of 2-butanol is included for comparison.
selection contains a narrow region at scale 6 which includes the wavelength 2080 nm.

5.1.4 Mutual information variable selection
Mutual information as described earlier has been used on the results from the GA and TPW analyses. This means that the data matrix for each of the MI analyses is a binary "design matrix", i.e. the matrix indicating which variables have been selected during the multivariate modelling. The dependent y-variable is the prediction error. Thus the procedure is comparable to how DVA operates. The resolution factor (not related to wavelet resolution) for each run was set to 10. The cut-off threshold for the mutual information computed for each variable is determined by calculating the mutual information for a set of 306 dummy variables (60% of the 512 variables analysed). The results of the analysis are shown in Figs. 18 and 19. MI on the GA results has RMSEP = 2.5% using 90 variables (A = 13 PLS factors), and MI on the TPW results has RMSEP = 2.8% using 109 variables (A = 11 PLS factors).
Fig. 18 MI analysis of the GA results.
5.2 Classification
5.2.1 The simple multiscale approach
Cluster analysis. In this section, the application of the simple multiscale approach to cluster analysis is demonstrated. The masking method will also be used to localise important features. There are several possible cluster analysis algorithms; however, only discriminant function analysis (DFA) will be used here. Before discussing the results from the simple multiscale analysis, this section will first present DFA and how it can be applied to both unsupervised and supervised classification, followed by how the cluster properties E are measured at each resolution level.

Discriminant function analysis. Discriminant function analysis (DFA), which is also referred to as canonical variates analysis, is the cluster analysis method chosen here. DFA is usually used in a supervised mode, but can also be used in an unsupervised way. Here, the unsupervised mode is enabled by direct usage of the replicate information of object samples as classes. The effect of using DFA in this way is that it will reduce the within-replicate-group variance. DFA is in many ways similar to PCA. However, the
Fig. 19 MI analysis of the TPW variable selection results.
eigenvectors do not point along directions corresponding to maximum variance in the data set. They point along directions which minimise the within-group variance and maximise the between-group variance. To accomplish this, the within-sample matrix of sums of squares and cross products W is computed, together with T, which contains the total sample matrix of sums of squares. The eigenvectors of the matrix (W⁻¹T − I) correspond to the DFA latent variables. The eigenvectors are sorted according to the magnitude of the corresponding eigenvalues: λ₁ > λ₂ > ... > λ_r. DFA scores are computed by projecting the data onto these eigenvectors. Due to the inversion step, DFA cannot handle collinearity and therefore can only analyse data matrices containing independent variables. A common way to accomplish this is to perform a PCA on the original data and only use the orthogonal scores vectors in the DFA routine. In the experiments performed, the number of principal components used corresponds to 99.99% of the total variance. Using the same number of PCs for the different wavelet scales is not possible, because data sets reconstructed with very few scales are very smooth and have far fewer significant PCs. The DFA algorithm used here was implemented in MATLAB (The MathWorks, 24 Prime Park Way, Natick, Mass. 01760-1500, USA) following the description by Manly [63].
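A compact sketch of this PCA-then-DFA computation; the function name and the fixed n_pcs argument are illustrative (the original selects PCs by explained variance), and labels is assumed to be a numpy array of group codes:

```python
import numpy as np

def dfa_scores(X, labels, n_pcs):
    """Project PCA scores onto the eigenvectors of W^-1 T - I."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    S = Xc @ Vt[:n_pcs].T                          # orthogonal PCA scores
    Sc = S - S.mean(axis=0)
    T = Sc.T @ Sc                                  # total sums of squares
    W = np.zeros_like(T)
    for g in np.unique(labels):                    # within-group sums of squares
        Sg = S[labels == g] - S[labels == g].mean(axis=0)
        W += Sg.T @ Sg
    evals, evecs = np.linalg.eig(np.linalg.solve(W, T) - np.eye(n_pcs))
    order = np.argsort(evals.real)[::-1]           # sort by eigenvalue magnitude
    return S @ evecs.real[:, order]
```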
Measures of cluster structure. To test the simple multiscale approach to cluster analysis, the independent taxonomic information available for the different objects was used either directly or indirectly in the analyses. This is similar to the situation where a taxonomic expert is faced with a data set without the true classification information. In the process of determining "interesting clusters" the expert is expected to make use of his external knowledge in the assessment of the observed patterns. Thus, here the external class information is used to define a cluster. Having identified taxonomically relevant clusters, the next step is to measure how they relate to each other. The three properties measured for the two data sets analysed were:
• overlap between the clusters;
• relative distances between the clusters;
• relative area of each cluster.
Note that relative rather than absolute areas and distances are used (i.e. compared to the total area and the maximum distance). The reason for this is that
the DFA spaces for the individual wavelet reconstructions have different magnitudes. When fewer than the maximum number of scales are used in a wavelet reconstruction, variance is removed from the data set by making the spectra smoother. Thus, absolute DFA scores cannot be compared directly. This is also the reason why the DFA score plots for the different wavelet reconstructions do not contain axes: they do not really convey any important information for the comparison. Having defined the objects contained in a cluster is not sufficient to describe properties like area and overlap. Here a very simple approach is employed, since the number of significant DFA dimensions for both data sets is only 2. A cluster can be defined from the convex hull enclosing the objects. Computer algorithms for finding 2D convex hulls are readily available. A convex hull can be defined as follows: let P₁ and P₂ be two arbitrary points within the convex hull region; then all points P_j falling on a straight line from P₁ to P₂ must also be within the convex hull region. This means that e.g. a triangle and a circle are convex regions whereas "E" or "T" shaped regions are not. Cluster area is here defined as the area of the 2D convex hull polygon divided by the total sum of cluster areas. A distance between two clusters is defined to be the distance between the centre points of the corresponding convex hull polygons. An overlap between clusters i and j is defined as:

P_ij = #(C_i ∩ C_j) / #(C_i ∪ C_j) × 100%

(51)

where the operator # returns the number of elements in a set C. C_i and C_j designate the sets of objects for clusters i and j, respectively.
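This overlap measure translates directly into a set operation; a minimal sketch, assuming C_i and C_j contain the indices of the objects falling inside each cluster's convex hull:

```python
def cluster_overlap(C_i, C_j):
    """Percentage overlap between two clusters, Eq. (51)."""
    C_i, C_j = set(C_i), set(C_j)
    return 100.0 * len(C_i & C_j) / len(C_i | C_j)
```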
Data collection. Two data sets are used to demonstrate the multiscale cluster analysis method; they have been kindly provided by Dr. Roy Goodacre at the Institute of Biological Sciences, University of Wales, Aberystwyth [64,65]. Ten-microlitre aliquots of bacterial suspensions were evenly applied onto a sand-blasted aluminium plate. Prior to analysis the samples were oven-dried at 50°C for 30 min. Samples were run in triplicate. The FT-IR instrument used was the Bruker IFS28 FT-IR spectrometer (Bruker Spectrospin, Banner Lane, Coventry, UK) equipped with an MCT (mercury-cadmium-telluride) detector cooled with liquid N₂. The aluminium plate was then loaded onto the motorised stage of a reflectance TLC accessory. The wavenumber range is in the mid-IR region: 4000-600 cm⁻¹. Spectra were acquired at a rate of 20 s⁻¹. The spectral resolution used was 4 cm⁻¹. 256 spectra were co-added and averaged. The digital sampling level was set to produce 882 data points.
UTI data set description. The UTI data set contains in total five different bacterial species (classes), but for the multiscale cluster analysis only three of them are actually used. The three bacterial species are E. coli, P. mirabilis and P. aeruginosa, which are referred to as clusters 1, 2 and 3 respectively. Twenty-two E. coli (Ea-Eq), 15 P. mirabilis (Pa-Pj) and 15 P. aeruginosa (Aa-Aj) were isolated from the urine of patients with urinary tract infection (UTI) and prepared as described previously [66]. In total there are 148 (4 × 37) infrared spectra in this data set.

Eubacterium data set description. The Eubacterium data set contains the reflectance infrared spectra of four different bacteria: E. timidum, E. infirmum, E. exiguum and E. tardum, which are referred to as clusters 1, 2, 3 and 4. Four replicates for each sample are used. Four E. timidum (Ta-Te), four E. infirmum (1a-1d), four E. exiguum (2a-2e), five E. tardum (Na-Ne), and five eubacterial hospital isolates (Ha-He) were prepared as described previously [64]. In total there are 88 infrared spectra (4 × 22 samples) in this data set. In order to perform the FWT on the data sets it was necessary to have a data length equal to a power of two. Just adding zeros may introduce ringing effects due to sharp edges. To avoid this, a smoothing operation was performed to ensure a smooth truncation to zero at the low wavenumber end. The two data sets are shown in Fig. 20.
UTI results. The taxonomic information has here been used directly to determine the DFA directions. Since the UTI data set contains three clusters, the number of significant DFA dimensions is two. Using the method described earlier it was found that the optimal wavelet for this data set was Coiflet 5. The results from the multiscale cluster analysis of this data set are summarised in Fig. 21. The interpretation of this multiscale cluster analysis is straightforward: all the clusters are overlapped until scale no. 3 is added, when they all separate.
Fig. 20 The Eubact and UTI data sets.
A scale dendrogram for the same analysis is shown in the upper part of Fig. 22. Where in the spectral domain is the feature located that is responsible for the separation of clusters 1, 2 and 3 after adding scale 3? By employing the method of systematic variation of mask vectors at scale 3, it is found that of the 2⁸ = 256 possible masking vectors there is non-overlap between the clusters in only 48 of the combinations. For these combinations it was observed that wavelet variable 5 in scale 3 (covering the region 2024-1534 cm⁻¹ in the wavenumber domain) was always selected.
Eubact results. For this data set DFA was used in an unsupervised mode. The optimal wavelet was found to be Symmlet 9. Fig. 23 shows the results from the multiscale cluster analysis.
After adding the first scale it is easy to see that cluster 1 is different from the others. Cluster 1 is close to the other clusters, but does not overlap with any of
Fig. 21 The multiscale cluster analysis result on the UTI data set. For each reconstruction a DFA is performed and the 2D scores are plotted.
Fig. 22 Scale dendrograms for each of the two data sets analysed (UTI and Eubacterium). Scale dendrograms as used here efficiently summarise the qualitative change of the overlap structure of the clusters involved after adding the different wavelet scales. Note that measures other than cluster overlap could have been used in the scale dendrogram; another possible measure would be the cluster area or shape.
Fig. 23 The multiscale cluster analysis result on the Eubact data set.
them. This suggests that there must be very broad scale features that make the cluster 1 spectra different from the others. After adding scale 3, cluster 2 separates out from the others. Clusters 3 and 4 remain overlapped until scale 5 is added. This suggests that relatively narrow features make these two sets of spectra different. Adding more scales does not change the overlap structure between the clusters. This means that there is an optimal number of scales needed to achieve the separation of all the clusters. Overlap between the clusters is not the only property of interest. Another possible cluster property is the relative area and how it changes with the addition of scales. The relative area of a cluster is related to the correlation between the objects in the cluster. Table 2 confirms that the relative cluster areas become smaller as more scales are added. However, some scales have a larger impact than others on the change of the relative area of a cluster. The table shows that cluster 1 is more dispersed than the others and needs more scales to become compact. After adding scale 5 there is a significant decrease in the relative area of cluster 1. Below we investigate which regions appear to be associated with this large change of relative area.
Table 2. Areas of clusters in 2D DFA score space for each addition of a wavelet scale. The Eubacterium data set.

Scale added   Cluster 1   Cluster 2   Cluster 3   Cluster 4
1             29          22          27          9
2             24          21          42          10
3             30          9           17          4
4             31          6           8           2
5             9           3           6           3
6             6           2           5           5
7             3           4           5           3
8             3           4           5           3
9             4           4           4           2
Finding important regions. As shown above, cluster 1 is already separated from the other clusters after scale 1. Where are the very broad regions associated with this approximately located? For scale 1 there are two wavelet coefficients representing wavelet functions that each cover half of the spectral region. Using the masking method described in Section 3.3, there are four possible masking vectors for these two regions: {0 0}, {1 0}, {0 1}, {1 1}. For each of these masking combinations a multiscale cluster analysis is performed and the overlap between cluster 1 and the others is recorded. There are only two cases where cluster 1 separates from the other clusters: {0 1} and {1 1}. This means that the presence of the right-hand region is necessary to produce zero overlap between cluster 1 and all the other clusters. This result suggests that there is a feature in the right half of the spectrum that makes cluster 1 different from the others. The wavenumbers for this region are 2024-52 cm⁻¹ (actually the region is 2024-600 cm⁻¹, since 600 cm⁻¹ is the lower detection limit of the IR instrument used). A thick line in Fig. 24A shows where this region is localised. The standard deviation spectrum of the data set is also plotted in this figure. After adding scale 3 it was observed above that cluster 2 separates out from all the others. The next problem is to localise the region(s) responsible. Since scale 3 has 2³ = 8 wavelet coefficients, there are in total 2⁸ = 256 different masking vectors possible to test. For each of these a DFA is performed and the overlap of cluster 2 with the other clusters is recorded. In 74 combinations a non-overlap situation was observed. In all of these combinations, wavelet coefficient no. 3 in scale 3 was always present.
Fig. 24 The standard deviation spectrum of the Eubacterium data set (absorbance versus wavenumber, cm⁻¹); the thick lines mark the important regions discussed in the text (A and B).
This coefficient corresponds to a wavelet function covering the region 3012-2522 cm⁻¹ (see the thick line in Fig. 24B).
Results from Fourier clustering. Fourier clustering produces similar, but not identical, results. The Fourier cluster analysis method is really a continuous method, since the threshold for the frequency cut-off can in theory be varied continuously. A spatial Fourier analysis is performed here. This analysis also confirms that clusters 1 and 2 are separated from the others at low frequencies. As also shown by the multiscale cluster analyses, clusters 3 and 4 appear to be separated at higher frequencies (after frequencies over 0.0177 Hz are added). However, it should be kept in mind that Fourier and multiscale cluster analyses are based on different assumptions related to the resolving of signals in time and frequency, which can explain the discrepancies between the two methods.
5.2.3 Supervised classification
Multiscale DPLS. Before using the VS-DPLS method on the Eubact data set, it is instructive to first use the simple multiresolution approach. The data set was analysed with DPLS for different scale reconstructions (j = {0, ..., 9}). At each reconstruction level, a fully cross-validated DPLS run estimates the optimal number of factors and calculates a regression model. The regression model is subsequently applied to the unseen validation set. Fig. 25 shows the results from the calibration using cross-validation. We see that the calibration error goes to zero after 5 PLS factors. Using five scales (i.e. a total of 2^(5+1) − 1 = 63 variables) the prediction error was 2.8% on the validation set. This is a significant reduction (93%) in the model complexity compared with using 882 variables. To get an indication of how the complexity of the PLS model changes with resolution level, an SEC surface was generated, see Fig. 26. Here it can be seen that the optimal model (scale = 5, i.e. a total of 63 variables, using A = 7 PLS factors) is, as expected, located towards the lower left corner of the SEC surface. For the prediction
Fig. 25 Results from the DPLS calibration using cross-validation at different numbers of scales added. The Eubact data set.
of the validation set, the error drops to zero at 6 factors and RMSEP = 2.8% using 7 PLS factors. The OSC approach was also tested here to complement the simple multiscale approach results. Since the number of scales tested for is 9 ({1, 2, ..., 9}), there are 511 different combinations of scales. 474 of these combinations resulted in perfect prediction in the calibration. Which scales seem to dominate? To answer this question, the relative distribution of the different scale combinations was computed. Scale combinations were grouped according to the error produced by DPLS, and the distribution of the nine scales selected was recorded. Fig. 27 displays the results. Models with high prediction errors all have one thing in common: they do not select scales 5, 6 or 7. What is the most parsimonious model from this analysis? Sorting the scale combinations with respect to the total number of wavelet coefficients, it was found that the smallest model uses only scale 5 (2⁵ = 32 variables, A = 5 PLS factors, RMSEP = 5.6%).
Fig. 26 Scale-error surfaces of the Eubact data set.
Fig. 27 The distribution of the scales selected in models sorted with respect to prediction error.
However, as noted earlier, the simple multiresolution and the OSC approaches are not precise enough to localise where the important features for the prediction appear. In order to do this the VS-DPLS method was employed. By running VS-DPLS on all the wavelet coefficients directly, it is found that seven variables and seven PLS factors were selected as optimal by the cross-validation procedure (RMSEP = 0%). However, models using the maximum number of PLS factors suggest overfitting and might therefore be unstable. Before using another model, it is instructive to observe where these seven variables are located in the wavenumber-scale domain, see Table 3. The VS-DPLS algorithm produces K different B-coefficient vectors, one for each of the K classes. Since each vector is based on wavelet coefficients it is possible to plot each regression vector as a scalogram, see Fig. 28. By comparing one of the class B-coefficient scalograms with a representative raw spectrum it is possible to detect two peaks that appear to be important in distinguishing between the different classes; see Fig. 29.
Table 3. Tabulation of the most important variables selected by VS-DPLS on the Eubact data set.

Wavelet variable no.    Region start (cm⁻¹)    Region end (cm⁻¹)
1                       4000                   52
2                       4000                   52
3                       4000                   2028
8                       1036                   52
28                      1283                   1040
56                      1160                   1040
57                      1036                   916
Noting the particular importance of scale 5, and the high B-coefficient values in VS-DPLS for the selected variables no. 56 and 57 at scale 5, suggests that the right-most peak is particularly important for a correct classification. However, due to possible overfitting, our confidence in this model is not high. It was therefore decided to investigate the second best VS-DPLS model from the cross-validation. This model selects 12 variables (RMSEP = 2.8%) using 8 PLS factors. We still have a problem with a large number of PLS factors compared to the number of selected variables. The four scalograms from the B-coefficient matrix of the new model are shown in Fig. 30. Both VS-DPLS models appear to capture much of the same information, which can be seen from the set of selected variables the models have in common: 1, 2, 3, 56 and 57. The variables unique to the second VS-DPLS model are no. 4 (scale 1, 2024-52 cm⁻¹), 6 (scale 2, 3012-2028 cm⁻¹), 7 (scale 2, 2024-1040 cm⁻¹), 14 (scale 3, 1530-1040 cm⁻¹), 52 (scale 5, 1654-1534 cm⁻¹), 53 (scale 5, 1530-1410 cm⁻¹) and 102 (scale 6, 1715-1657 cm⁻¹). Note that this model selects two additional wavelet coefficients at scale 5 (no. 52 and 53). Wavelet coefficients no. 7 and 14 cover a region overlapping with no. 56, which was selected by the first model. This may indicate that the information related to the same peak at lower scales has also been selected in the second model. For comparison, the regions covered by coefficients 28 and 57 are in the neighbourhood of the wavenumbers 1225 and 947 cm⁻¹ selected by rule induction methods on the differentiated spectra (i.e. no data compression) of the same data set [65].
Fig. 29 The selected variables from the full rank PLS model. The Eubacterium data set.
Fig. 30 The selected variables from the second best PLS model. The number of PLS factors is 8 and the number of selected variables is 12. The Eubacterium data set.
This result may not be surprising, considering that wavelet coefficients contain information similar to a numerical differentiation of the signal at different resolutions.

5.3 Conclusion
FDA is a relatively new area in statistics and its application to chemical problems is relatively limited [14,67-70]. The current results suggest that the use of wavelet bases in FDA provides an excellent starting point for creating parsimonious regression and classification models. Such models might enable us to obtain a deeper understanding of the underlying phenomena under study. Note, however, that this chapter uses only a very small part of FDA theory. It is suggested that other FDA tools, such as annihilating linear differential operators and regularization [12], should be given careful consideration for chemometric analysis.
References

1. M.B. Seasholtz and B. Kowalski, The Parsimony Principle Applied to Multivariate Calibration, Analytica Chimica Acta, 277(2) (1993), 165-177.
2. J.M. Brenchley, U. Horchner and J.H. Kalivas, Wavelength Selection Characterization for NIR Spectra, Applied Spectroscopy, 51(5) (1997), 689-699.
3. U. Horchner and J.H. Kalivas, Simulated-Annealing-based Optimization Algorithms - Fundamentals and Wavelength Selection Applications, Journal of Chemometrics, 9(4) (1995), 283-308.
4. O.E. de Noord, The Influence of Data Preprocessing on the Robustness and Parsimony of Multivariate Calibration Models, Chemometrics and Intelligent Laboratory Systems, 23(1) (1994), 65-70.
5. G. Cruciani, S. Clementi and M. Pastor, GOLPE-guided Region Selection, Perspectives in Drug Discovery and Design, 12 (1998), 71-86.
6. H. Kubinyi, Variable Selection in QSAR Studies. 2. A Highly Efficient Combination of Systematic Search and Evolution, Quantitative Structure-Activity Relationships, 13(4) (1994), 393-401.
7. G. Cruciani and K.A. Watson, Comparative Molecular-field Analysis Using Grid Force-field and GOLPE Variable Selection Methods in a Study of Inhibitors of Glycogen-phosphorylase-b, Journal of Medicinal Chemistry, 37(16) (1994), 2589-2601.
8. M. Baroni, G. Costantino, G. Cruciani, D. Riganelli, R. Valigi and S. Clementi, Generating Optimal Linear PLS Estimations (GOLPE): An Advanced Chemometric Tool for Handling 3D-QSAR Problems, Quantitative Structure-Activity Relationships, 12 (1993), 9-20.
9. H.M. Davey, A. Jones, A.D. Shaw and D.B. Kell, Variable Selection and Multivariate Methods for the Identification of Microorganisms by Flow Cytometry, Cytometry, 35(2) (1999), 162-168.
10. D. Broadhurst, R. Goodacre, A. Jones, J.J. Rowland and D.B. Kell, Genetic Algorithms as a Method for Variable Selection in Multiple Linear Regression and Partial Least Squares Regression, with Applications to Pyrolysis Mass Spectrometry, Analytica Chimica Acta, 348(1-3) (1997), 71-86.
11. A.D. Shaw, A. Dicamillo, G. Vlahov, A. Jones, G. Bianchi, J. Rowland and D.B. Kell, Discrimination of the Variety and Region of Origin of Extra Virgin Olive Oils Using 13C NMR and Multivariate Calibration with Variable Reduction, Analytica Chimica Acta, 348(1-3) (1997), 357-374.
12. J.O. Ramsay and B.W. Silverman, Functional Data Analysis, Springer Series in Statistics, Springer, New York, (1997).
13. J.O. Ramsay, When the Data are Functions, Psychometrika, 47(4) (1982), 379-396.
14. B.K. Alsberg, Representation of Spectra by Continuous Functions, Journal of Chemometrics, 7 (1993), 177-193.
15. P. Besse and J.O. Ramsay, Principal Components-analysis of Sampled Functions, Psychometrika, 51(2) (1986), 285-311.
16. J.O. Ramsay and X. Li, Curve Registration, Journal of the Royal Statistical Society, Series B (Statistical Methodology), 60(Part 2) (1998), 351-363.
17. R. Shankar, Principles of Quantum Mechanics, Plenum Press, New York, (1994).
18. C. de Boor, A Practical Guide to Splines, Applied Mathematical Sciences, Springer, New York, (1978).
19. G. Farin, Curves and Surfaces for Computer Aided Geometric Design, a Practical Guide: Second Edition, Academic Press, Boston, (1990).
20. P.R. Griffiths, Fourier Transform Infrared Spectrometry, Chemical Analysis, Vol. 83, Wiley, (1986).
21. B.K. Alsberg, M.K. Winson and D.B. Kell, Improving the Interpretation of Multivariate and Rule Induction Models by Using a Peak Parameter Representation, Chemometrics and Intelligent Laboratory Systems, 36(2) (1997), 95-109.
22. B.R. Bakshi and G. Stephanopoulos, Compression of Chemical Process Data by Functional Approximation and Feature Extraction, AIChE Journal, 42 (1996), 477-492.
23. P. Bury, N. Ennode, J.M. Petit, P. Bendjoya, J.P. Martinez, H. Pinna, J. Jaud and J.L. Balladore, Wavelet Analysis of X-ray Spectroscopic Data 1. The Method, Nuclear Instruments and Methods in Physics Research Section A, 383 (1996), 572-588.
24. C.R. Mittermayr, S.G. Nikolov, H. Hutter and M. Grasserbauer, Wavelet Denoising of Gaussian Peaks: A Comparative Study, Chemometrics and Intelligent Laboratory Systems, 34 (1996), 187-202.
25. F. Ehrentreich, S.G. Nikolov, M. Wolkenstein and H. Hutter, The Wavelet Transform: A New Preprocessing Method for Peak Recognition of Infrared Spectra, Mikrochimica Acta, 128 (1998), 241-250.
26. F. Flehmig, R. Vonwalzdorf and W. Marquardt, Identification of Trends in Process Measurements Using the Wavelet Transform, Computers and Chemical Engineering, 22 (1998), S491-S496.
27. A.K.M. Leung, F.T. Chau and J.B. Gao, A Review on Applications of Wavelet Transform Techniques in Chemical Analysis: 1989-1997, Chemometrics and Intelligent Laboratory Systems, 43 (1998), 165-184.
28. X.G. Shao, W.S. Cai and P.Y. Sun, Determination of the Component Number in Overlapping Multicomponent Chromatograms Using Wavelet Transform, Chemometrics and Intelligent Laboratory Systems, 43 (1998), 147-155.
29. B.R. Bakshi, Multiscale Analysis and Modeling Using Wavelets, Journal of Chemometrics, 13 (1999), 415-434.
30. U. Depczynski, K. Jetter, K. Molt and A. Niemoller, Quantitative Analysis of Near Infrared Spectra by Wavelet Coefficient Regression Using a Genetic Algorithm, Chemometrics and Intelligent Laboratory Systems, 47 (1999), 179-187.
31. B.K. Alsberg, A.M. Woodward, M.K. Winson, J. Rowland and D.B. Kell, Wavelet Denoising of Infrared Spectra, Analyst, 122(7) (1997), 645-652.
32. E. Rosenberg, C.R. Mittermayr, B. Lendl and M. Grasserbauer, The Application of the Wavelet Power Spectrum to Detect and Estimate 1/f Noise in the Presence of Analytical Signals, Analytica Chimica Acta, 388 (1999), 303-313.
33. L. Pasti, B. Walczak, D.L. Massart and P. Reschiglian, Optimization of Signal Denoising in Discrete Wavelet Transform, Chemometrics and Intelligent Laboratory Systems, 48 (1999), 21-34.
34. F.T. Chau, T.M. Shih, J.B. Gao and C.K. Chan, Application of the Fast Wavelet Transform Method to Compress Ultraviolet-visible Spectra, Applied Spectroscopy, 50 (1996), 339-349.
35. F.T. Chau, J.B. Gao, T.M. Shih and J. Wang, Infrared Spectral Compression Procedure Using the Fast Wavelet Transform Method, Applied Spectroscopy, 51 (1997), 649-659.
36. A.K.M. Leung, F.T. Chau, J.B. Gao and T.M. Shih, Application of Wavelet Transform in Infrared Spectrometry: Spectral Compression and Library Search, Chemometrics and Intelligent Laboratory Systems, 43 (1998), 69-88.
37. H.L. Ho, W.K. Cham, F.T. Chau and J.Y. Wu, Application of Biorthogonal Wavelet Transform to the Compression of Ultraviolet-visible Spectra, Computers and Chemistry, 23 (1999), 85-96.
38. C.L. Stork, D.J. Veltkamp and B.R. Kowalski, Detecting and Identifying Spectral Anomalies Using Wavelet Processing, Applied Spectroscopy, 52 (1998), 1348-1352.
39. S. Mallat, A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7) (1989), 674-693.
40. S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, San Diego, (1998).
41. R. Coifman, Y. Meyer and M.V. Wickerhauser, Wavelet Analysis and Signal Processing, pp. 153-178, in Wavelets and Their Applications (M.B. Ruskai, G. Beylkin, R. Coifman, I. Daubechies, Y. Meyer and L. Raphael, Eds), Jones and Bartlett, New York, (1992).
42. A. Grossmann, J. Morlet and T. Paul, Transforms Associated to Square Integrable Group-representations. 1. General Results, Journal of Mathematical Physics, 26(10) (1985), 2473-2479.
43. I. Daubechies, Ten Lectures on Wavelets, CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 61, SIAM, Philadelphia, Pennsylvania, (1992).
44. D. Gabor, Theory of Communication, Journal of the IEE, 93 (1946), 429-457.
45. L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California, (1984).
46. H. Martens and T. Naes, Multivariate Calibration, Wiley, New York, (1989).
47. B.K. Alsberg, R. Goodacre, J.J. Rowland and D.B. Kell, Classification of Pyrolysis Mass Spectra by Fuzzy Multivariate Rule Induction - Comparison with Regression, K-nearest Neighbour, Neural and Decision-tree Methods, Analytica Chimica Acta, 348(1-3) (1997), 389-407.
48. B.K. Alsberg, D.B. Kell and R. Goodacre, Variable Selection in Discriminant Partial Least Squares Analysis, Analytical Chemistry, 70(19) (1998), 4126-4133.
49. J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, (1993).
50. L. Davis (Ed.), Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, (1991).
51. J.H. Holland, Adaption in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, (1975).
52. C.E. Shannon, The Mathematical Theory of Communication, Bell System Technical Journal, 27 (1948), 379-423, 623-656.
53. R. Battiti, Using Mutual Information for Selecting Features in Supervised Neural-net Learning, IEEE Transactions on Neural Networks, 5(4) (1994), 537-550.
54. C.C. Beardah and M.J. Baxter, in Interfacing the Past, Computer Applications and Quantitative Methods in Archaeology, CAA95 (H. Kammermans and K. Fennema, Eds), Analecta Prehistorica Leidensia, Vol. 28, (1996).
55. S. Wold, A. Ruhe, H. Wold and W.J. Dunn III, The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses, SIAM Journal on Scientific and Statistical Computing, 5 (1984), 735-743.
56. A. Lorber, L. Wangen and B.R. Kowalski, A Theoretical Foundation for PLS, Journal of Chemometrics, 1 (1987), 19-31.
57. A. Höskuldsson, PLS Regression Methods, Journal of Chemometrics, 2 (1988), 211-228.
58. F. Lindgren, P. Geladi, S. Rannar and S. Wold, Interactive Variable Selection (IVS) for PLS. 1. Theory and Algorithms, Journal of Chemometrics, 8(5) (1994), 349-363.
59. F. Lindgren, P. Geladi, A. Berglund, M. Sjostrom and S. Wold, Interactive Variable Selection (IVS) for PLS. 2. Chemical Applications, Journal of Chemometrics, 9(5) (1995), 331-342.
60. B.K. Alsberg, A.M. Woodward, M.K. Winson, J.J. Rowland and D.B. Kell, Variable Selection in Wavelet Regression Models, Analytica Chimica Acta, 368(1-2) (1998), 29-44.
61. W. Windig, Near Infrared Data Set, ftp://ftp.clarkson.edu/pub/hopkepk/Chemdata/Windig/, (1998).
62. W. Windig and D.A. Stephenson, Self-modelling Mixture Analysis of Second-derivative Near-infrared Spectral Data Using the Simplisma Approach, Analytical Chemistry, 64 (1992), 2735-2742.
63. B.F.J. Manly, Multivariate Statistical Methods: A Primer, Chapman & Hall, London, (1994).
64. R. Goodacre, S.J. Hiom, S.L. Cheeseman, D. Murdoch, A.J. Weightman and W.G. Wade, Identification and Discrimination of Oral Asaccharolytic Eubacterium spp. by Pyrolysis Mass Spectrometry and Artificial Neural Networks, Current Microbiology, 32 (1996), 77-84.
65. B.K. Alsberg, W.G. Wade and R. Goodacre, Chemometric Analysis of Diffuse Reflectance-absorbance Fourier Transform Infrared Spectra Using Rule Induction Methods: Application to the Classification of Eubacterium Species, Applied Spectroscopy, 52(6) (1998), 823-832.
66. R. Goodacre, E.M. Timmins, R. Burton, N. Kaderbhai, A. Woodward, D.B. Kell and P.J. Rooney, Rapid Identification of Urinary Tract Infection Bacteria Using
Hyperspectral, Whole Organism Fingerprinting and Artificial Neural Networks, Microbiology, 144 (1998), 1157-1170.
67. Z.P. Chen, J.H. Jiang, Y. Li, H.L. Shen, Y.Z. Liang and R.Q. Yu, Smoothed Window Factor Analysis, Analytica Chimica Acta, 381 (1999), 233-246.
68. C. Goutis, Second-derivative Functional Regression with Applications to Near Infra-red Spectroscopy, Journal of the Royal Statistical Society, B 60(Part 1) (1998), 103-114.
69. C. Goutis and T. Fearn, Partial Least Squares Regression on Smooth Factors, Journal of the American Statistical Association, 91(434) (1996), 627-632.
70. B.K. Alsberg and O.M. Kvalheim, Compression of nth-order Data Arrays by B-splines. Part 1. Theory, Journal of Chemometrics, 7 (1993), 61-73.
CHAPTER 17

Multiscale Statistical Process Control and Model-Based Denoising

Bhavik R. Bakshi

Department of Chemical Engineering, The Ohio State University, Columbus, OH 43210, USA
1 Introduction

Data from most processes are inherently multiscale in nature due to events at different locations and with different localization in time, space and frequency. This common occurrence of multiscale data has encouraged the development of data analysis and empirical modeling methods that can exploit the multiscale nature of data. Over the last decade, the development of wavelets has provided further impetus to research on multiscale methods. As described in this and other books and many papers, multiscale methods have been developed for solving a variety of data analysis and modeling tasks, including compression, filtering or denoising, pattern recognition and trend analysis, linear and nonlinear regression, noise removal and estimation with linear and nonlinear models, and univariate and multivariate statistical process control. Most existing methods for data analysis and modeling are best for extracting information from data and variables that contain contributions at a single scale or localization in time and frequency. For example, existing methods for statistical process control (SPC) represent the data at the scale of the measurements in a Shewhart chart, or at a coarser scale by a moving average (MA), exponentially weighted moving average (EWMA), or cumulative sum (CUSUM) chart. The single-scale nature of these charts makes them best for detecting changes over a narrow range of scales or localization. Consequently, Shewhart charts are best for detecting large shifts, whereas MA, EWMA, and CUSUM charts are best for detecting smaller changes. A multiscale approach to SPC can combine the best features of these charts and result in a method that is good at detecting various types of changes in the measured data. Such a method for both univariate and multivariate SPC has been developed by Bakshi and coworkers [1,2], and is described in this chapter.
Denoising or filtering is among the most popular applications of wavelets. The wavelet thresholding approach [3] removes stochastic noise from a deterministic underlying signal by eliminating the small wavelet coefficients. This approach is univariate in nature and does not use any information about the relationship between the measured variables. Furthermore, multiscale filtering by thresholding is best only for signals where the underlying signal is deterministic. A multiscale approach for removing random errors from fractal underlying signals has been developed based on the ability of wavelets to decorrelate fractal processes [4,5]. This approach is Bayesian in nature, but does not utilize the relationship between the variables given by the process model, and is best for stochastic underlying signals. A multiscale filtering approach that exploits the multivariate nature of measured process data, and can denoise measurements with either deterministic or stochastic underlying signals, is described in this chapter [6]. This approach represents model-based denoising as a constrained Bayesian optimization problem. Information about the nature of the underlying variables is provided by selecting an appropriate distribution function for the prior probability, and the process model is represented as the constraint. The multiscale model-based denoising approach described in this chapter is an error-in-variables approach and focuses on systems where the multivariate model is linear and describes a steady-state relationship between the variables. Extension of this approach to denoising or estimation with dynamic linear models and nonlinear models has also been developed [7]. The rest of this chapter is organized as follows. A very brief introduction to the relevant notation and properties of wavelets is provided in Section 2. This is followed by the description of a general methodology for multiscale analysis, modeling and optimization in Section 3. This methodology provides greater insight into various multiscale approaches, and permits the development of new multiscale methods. The technique of multiscale statistical process control (MSSPC) and some recent developments in this area are described in Section 4. Finally, multiscale denoising with linear steady-state models is the subject of Section 5.
2 Wavelets

If the translation parameter in a family of wavelets is discretized dyadically as b = 2^{-j}k, the wavelet decomposition downsamples the coefficients at each scale. Any signal can be decomposed into its contributions at multiple scales as a weighted sum of dyadically discretized orthonormal wavelets,
$$y(t) = \sum_{j=j_0}^{J-1} \sum_{k=1}^{N_j} d_{jk}\,\psi_{jk}(t) + \sum_{k=1}^{N_{j_0}} c_{j_0 k}\,\phi_{j_0 k}(t)$$

where y is the measurement, d_jk are the wavelet coefficients or detail signal, c_j0k are the scaled signal coefficients at the coarsest scale, j0, and J = log2 N. The wavelet decomposition of each measured variable in an N × P matrix, Y, results in an N × P matrix of coefficients,

$$\begin{pmatrix}
c_{j_0,1,1} & c_{j_0,1,2} & \cdots & c_{j_0,1,P} \\
c_{j_0,2,1} & c_{j_0,2,2} & \cdots & c_{j_0,2,P} \\
d_{j_0,1,1} & d_{j_0,1,2} & \cdots & d_{j_0,1,P} \\
d_{j_0,2,1} & d_{j_0,2,2} & \cdots & d_{j_0,2,P} \\
\vdots & \vdots & & \vdots \\
d_{j,k,1} & d_{j,k,2} & \cdots & d_{j,k,P} \\
\vdots & \vdots & & \vdots \\
d_{J-1,N/2,1} & d_{J-1,N/2,2} & \cdots & d_{J-1,N/2,P}
\end{pmatrix}
=
\begin{pmatrix}
C_{j_0}Y \\ D_{j_0}Y \\ \vdots \\ D_j Y \\ \vdots \\ D_{J-1}Y
\end{pmatrix} \qquad (1)$$

The matrix of coefficients at each scale, D_jY, is of size N_j × P, where N_j = 2^{J-j} is the number of coefficients at the jth scale and P is the number of variables. The decomposition of a signal by wavelets with downsampling, shown in Fig. 1(a), shows that every measurement cannot be decomposed as soon as it is obtained. This can cause a time delay in many on-line applications of wavelets such as on-line filtering and SPC. This time delay can be eliminated
Fig. 1 Wavelet decomposition: (a) with downsampling, (b) without downsampling.
by decomposing the signal without downsampling, by discretizing the translation parameter as b = k, resulting in the decomposition shown in Fig. 1(b). The wavelets lose their orthonormality, but permit the development of truly on-line multiscale methods [8].
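As an illustration of the two discretizations, the following sketch decomposes a signal both with downsampling (Fig. 1(a)) and without (Fig. 1(b)) using the PyWavelets package; the Haar wavelet, the signal, and the decomposition depth are arbitrary choices for illustration:

import numpy as np
import pywt

rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(256))  # N = 256, so J = log2(N) = 8

# Dyadic discretization (b = 2^{-j} k): coefficients are downsampled,
# giving N_j = 2^{J-j} coefficients at scale j (orthonormal transform).
coeffs = pywt.wavedec(y, 'haar', level=3)  # [c_{j0}, d_{j0}, ..., d_{J-1}]
print([len(c) for c in coeffs])            # [32, 32, 64, 128]

# Integer discretization (b = k): the stationary wavelet transform keeps
# N coefficients at every scale; orthonormality is lost, but each new
# measurement can be decomposed on-line without a time delay.
swt_coeffs = pywt.swt(y, 'haar', level=3)
print([d.shape for (_, d) in swt_coeffs])  # [(256,), (256,), (256,)]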
3 General methodology for multiscale analysis, modeling, and optimization

All the wavelet-based multiscale methods for data analysis, empirical modeling and optimization may be represented by the approach shown in Fig. 2. The variables are first decomposed on the selected set of basis functions, which may be orthonormal or non-orthonormal. Each variable may be decomposed independently on the same one-dimensional wavelet, or the entire data matrix may be decomposed using two-dimensional wavelets. The resulting wavelet and scaling function coefficients at each scale are processed by the appropriate analysis, modeling or optimization operator. Thus, for multiscale filtering of deterministic signals [3], the thresholding operator is applied to the coefficients at each scale. For multiscale PCA [1], the coefficients at each scale are subjected to PCA. For multiscale optimization [6], the optimization problem is solved at each scale. The information used by the operator, such as the value of the threshold, the number of components to be selected, and the constraints, needs to be represented at the appropriate scale, and may change with scale. For example, multiscale filtering of autocorrelated noise uses a different value of the threshold at each scale [3]. If the relationship between the variables is linear, and orthonormal wavelets are used, the coefficients at each scale can be analyzed, modeled or optimized independently of the other scales, and the final result may be obtained by reconstructing the result at each scale.
Fig. 2 General methodology for multiscale analysis, modeling and optimization.
In such cases, there is no need to iterate between the scales, as demonstrated by the approaches for multiscale filtering, linear regression, and optimization with linear steady-state models. In contrast, if the relationship between the variables is nonlinear, obtaining the optimum solution requires iteration between the solutions at each scale. The techniques discussed in this chapter are based on linear models and do not require any iteration between the solutions at each scale.
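The decompose-operate-reconstruct pattern of Fig. 2 can be written generically. The sketch below uses simple hard thresholding as the per-scale operator; the operator, wavelet and threshold value are illustrative assumptions, not a prescription from the chapter:

import numpy as np
import pywt

def multiscale_apply(y, operate, wavelet='haar', level=4):
    # Decompose, operate on the coefficients at each scale, reconstruct.
    # For linear settings with orthonormal wavelets, no iteration between
    # scales is needed.
    coeffs = pywt.wavedec(y, wavelet, level=level)
    # coeffs[0] is the last scaled signal; coeffs[1:] are the detail scales.
    new_coeffs = [coeffs[0]] + [operate(d, j) for j, d in enumerate(coeffs[1:])]
    return pywt.waverec(new_coeffs, wavelet)

def threshold_op(d, j, t=1.0):
    # Example per-scale operator: hard thresholding (t could vary with j).
    return pywt.threshold(d, t, mode='hard')

y = np.sin(np.linspace(0, 6, 256)) + 0.3 * np.random.default_rng(1).standard_normal(256)
y_filtered = multiscale_apply(y, threshold_op)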
4 Multiscale statistical process control
Statistical process control (SPC) is the task of detecting abnormal process operation from the statistical behavior of variables. SPC determines the region of normal variation of a measured variable, and indicates abnormal operation if the measurements lie outside the normal region. A variety of control charts have been developed for SPC, including Shewhart, moving average (MA), exponentially weighted moving average (EWMA), and cumulative sum (CUSUM) charts. SPC of multivariate data is performed by reducing the dimensionality of the data matrix by principal component analysis or partial least squares regression, followed by monitoring the process in the reduced dimension space. The univariate filtering methods have also been extended for multivariate SPC, but the resulting multivariate EWMA and multivariate CUSUM are usually not as practical or popular as multivariate SPC by empirical modeling methods such as PCA and PLS [9]. Existing methods for univariate and multivariate SPC suffer from several limitations. Various control charts are best only for detecting certain types of changes. For example, a Shewhart chart can detect large changes quickly, but is slow in detecting small shifts in the mean, whereas CUSUM, MA and EWMA charts are better at detecting a small mean shift, but may be slow in detecting a large shift, and require tuning of their filter parameters [10]. This limitation may be overcome by using heuristics such as the Western Electric rules [10], or Shewhart and CUSUM charts together [11]. Another limitation of existing SPC methods is that they require the measurements to be uncorrelated, or white, whereas, in practice, autocorrelated measurements are extremely common. A common approach for decorrelating autocorrelated measurements is to approximate the measurements by a time series model, and monitor the residual error. Unfortunately, this approach is not practical, particularly for multivariate processes with hundreds
of measured variables. Other approaches for decorrelating autocorrelated measurements without time-series modeling include taking the batch-means [12], and finding the residuals between the measurements and their one-step-ahead prediction by an EWMA model [13]. Unfortunately, neither of these approaches is broadly applicable to a wide variety of stochastic processes, and they lack multivariate generalizations. For multivariate SPC, the measurements may be decorrelated by augmenting the data matrix by lagged values of the variables so that the linear modeling by PCA or PLS implicitly extracts the time-series model. This approach often works better than SPC by steady-state PCA or PLS, but suffers from the limitations of a Shewhart chart. These limitations of existing methods are due to a mismatch between the nature of the measured data and the nature of existing SPC methods. Measured data are inherently multiscale in nature due to contributions from multiscale deterministic or stochastic events. In contrast, existing SPC methods are inherently single-scale in nature. The filter used by existing SPC charts is shown in Fig. 3. The fixed localization of the filter for each method indicates that the corresponding SPC method is single-scale in nature. Furthermore, Fig. 3 shows that these SPC methods differ only in the scale at which they represent the measurements. Thus, Shewhart charts represent data at the scale of the sampling interval, which is the finest scale, MA and EWMA charts represent data at a coarser scale, determined by the filter parameter, and CUSUM charts represent data at the scale of all the measurements, which is the coarsest scale. The disadvantages of existing methods may be overcome by developing a multiscale approach for SPC. Such an approach for multivariate SPC based on multiscale PCA was described by Bakshi [1], and shown to perform better than multivariate SPC by conventional PCA and dynamic PCA. Further insight into the statistical properties of multiscale SPC (MSSPC) was provided by Top and Bakshi [2].

4.1 MSSPC methodology
The methodology for MSSPC is obtained from the general methodology by setting up univariate or multivariate SPC charts for the coefficients at each scale.
Fig. 3 Filters for various SPC methods: (a) Shewhart, (b) MA, (c) EWMA, (d) CUSUM.
An illustration of univariate MSSPC is shown in Fig. 4. The measurements under normal operation are uncorrelated and Gaussian with unit variance. A mean shift of size 2 occurs at the 110th measurement. The Shewhart chart for these data is shown on the extreme left in Fig. 4. Decomposition of each measured variable on the selected wavelet results in decomposition of the variance of the data matrix into its contributions at multiple scales. Thus, for a mean-centered data matrix,

$$Y^T Y = (C_{j_0}Y)^T(C_{j_0}Y) + (D_{j_0}Y)^T(D_{j_0}Y) + \cdots + (D_j Y)^T(D_j Y) + \cdots + (D_{J-1}Y)^T(D_{J-1}Y) \qquad (2)$$
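For an orthonormal wavelet and a mean-centered data matrix, the decomposition of Eq. (2) can be verified numerically; a minimal sketch (column-wise Haar decomposition of random data, an assumption for illustration):

import numpy as np
import pywt

rng = np.random.default_rng(2)
Y = rng.standard_normal((128, 3))
Y -= Y.mean(axis=0)  # mean-centre each variable

total = np.trace(Y.T @ Y)
energy = 0.0
for p in range(Y.shape[1]):
    for c in pywt.wavedec(Y[:, p], 'haar', level=4):
        energy += np.sum(c ** 2)

# Orthonormality ensures the scale-wise energies sum to the total variance.
assert np.isclose(total, energy)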
The detection limits and scores for the control charts at each scale are determined from the covariance and PCA or PLS model for the data at the corresponding scale. If the measurements representing normal process operation are uncorrelated and Gaussian, the coefficients at each scale will also be uncorrelated Gaussian with almost equal variance. If the normal data are autocorrelated Gaussian, then the coefficients of an orthonormal wavelet decomposition at each scale will be uncorrelated Gaussian, with the variance changing according to the power spectrum of the measurements. If wavelets with integer discretization are used, then the decorrelation ability of orthonormal wavelets is lost, but the variance at each scale is still proportional to the power spectrum of the measured data. The wavelet decomposition in Fig. 4 uses Haar wavelets, and the Shewhart chart at each scale uses equal detection limits since the normal measurements are uncorrelated Gaussian. These charts in the middle of Fig. 4 show that the mean shift is first detected
Fig. 4 Detection of mean shift by MSSPC. Wavelet decomposition is from j = 6 to j = 4.
by the chart at scale j = 5. Subsequently, the mean shift is detected only in the last scaled signal at j = 4. For on-line process monitoring, scales at which the most recent coefficients violate the detection limit are selected as being relevant for SPC at the current time. The signal and covariance at the selected scales are reconstructed by the inverse wavelet transform, and the state of the process is confirmed by checking if the current value of the reconstructed signal violates the corresponding detection limit. The signal reconstruction is essential for efficient and fast detection of a sustained shift. If the signal is not reconstructed, then there is no way to tell whether a wavelet coefficient outside its detection limits is due to an outlier or a sustained shift, or due to a return to normal operation, or another shift away from normal. The reconstructed signal clearly shows the process behavior, and whether it is violating the detection limits or not. Furthermore, since the signal is reconstructed based on the large coefficients, it automatically extracts the features representing abnormal operation. This simultaneous feature extraction with SPC can ease the task of determining the root cause of the abnormal operation. The example in Fig. 4 shows that the reconstructed signal during normal operation is zero, since no scale has a coefficient outside the limits. When the shift is first detected at scale m = 2, the corresponding point and detection limits in the reconstructed signal are obtained only from the coefficient that violates the limit at this time. Later, when the shift is only detected at the coarsest scale, the corresponding points and detection limits in the reconstructed signal are obtained based only on the last scaled signal. Thus, MSSPC adapts the scale and corresponding detection limits according to the nature of the measured data and the scale at which the abnormal event occurs.
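A minimal univariate MSSPC sketch along these lines is shown below, with Haar wavelets, equal ±3σ Shewhart limits at every scale, and reconstruction from only the coefficients that violate their limits; the calibration of the limits and the on-line bookkeeping of the full method are simplified away:

import numpy as np
import pywt

def msspc(y, wavelet='haar', level=3, k=3.0):
    # Threshold the coefficients at each scale against Shewhart-type limits
    # and reconstruct the signal from the violating coefficients only.
    coeffs = pywt.wavedec(y, wavelet, level=level)
    kept = [np.where(np.abs(c) > k, c, 0.0) for c in coeffs]  # unit variance
    recon = pywt.waverec(kept, wavelet)
    return recon, np.abs(recon) > 1e-12  # nonzero only near violations

rng = np.random.default_rng(3)
y = rng.standard_normal(256)
y[110:] += 2.0                      # mean shift of size 2 (cf. Fig. 4)
recon, alarm = msspc(y)
print(np.where(alarm)[0][:5])       # first flagged samples near the shift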
4.2 MSSPC performance
The wavelet decomposition in MSSPC may be performed with dyadic sampling (with downsampling) or integer sampling (without downsampling), depending on the nature of the monitoring task. If the objective is to monitor the process without any time delay, and if the measurements are uncorrelated, the signal may be decomposed without downsampling, as shown in Fig. 1(b). In this case, the wavelet coefficients will be autocorrelated, and the detection limits may be adjusted based on knowledge of this correlation [2]. The resulting approach is equivalent to adapting the filter for each measurement to the scale that is best for detecting abnormal operation. Thus,
MSSPC with Haar wavelets subsumes SPC by MA charts, while MSSPC with smoother, boundary-corrected wavelets approximately subsumes SPC by EWMA [8]. In contrast, if decorrelating the measurements is important, the signals are decomposed by wavelet decomposition with downsampling, as shown in Fig. 1(a). In this case, there is a time delay in decomposing the measurements, but the detection limits for uncorrelated measurements may be used directly. Fortunately, if the degree of autocorrelation is high, the time delay in decomposing the measurements need not translate into a time delay in detecting small shifts, as illustrated by the average run length calculations described in this section. The performance of Shewhart, MA, and MSSPC charts is compared in Fig. 5 based on the average number of samples required to detect a mean shift of different sizes. The measurements are uncorrelated and are decomposed using Haar wavelets without downsampling. In each case, the parameters are adjusted so that the in-control run lengths, or the average number of samples before a measurement violates the detection limit in the absence of a shift, are equal. This figure shows that if the objective of SPC is to detect only small shifts, it is best to use an MA control chart, and if the objective is to detect only large shifts, it is best to use a Shewhart chart. If the objective of SPC is to have a general method that can detect both small and large shifts, and provide better performance on the average, it is best to use MSSPC. For SPC of highly autocorrelated measurements, since it is essential to decorrelate the data, MSSPC with dyadic downsampling is used. The nature of the wavelet filters and the downsampling can decorrelate a wide variety of stochastic processes. Fig. 6 depicts the ARL (average run length) for an AR(1) process given by
Fig. 5 Average run length for SPC of uncorrelated data without downsampling.
Fig. 6 ARL of AR(1) process.
x(t) = 0.9 x(t-1) + e(t)
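Average run lengths of this kind are typically estimated by Monte-Carlo simulation. A sketch for a simple Shewhart-type chart on this AR(1) process (the k-sigma limit and the number of replications are illustrative assumptions):

import numpy as np

def run_length(shift, phi=0.9, k=3.0, n_max=10_000, rng=None):
    # Count samples until an observation exceeds k-sigma limits, where
    # sigma is the stationary standard deviation of the AR(1) process.
    rng = rng if rng is not None else np.random.default_rng()
    sigma = 1.0 / np.sqrt(1.0 - phi ** 2)
    x = 0.0
    for t in range(1, n_max + 1):
        x = phi * x + rng.standard_normal()  # x(t) = 0.9 x(t-1) + e(t)
        if abs(x + shift) > k * sigma:
            return t
    return n_max

rng = np.random.default_rng(4)
for shift in (0.0, 1.0, 2.0, 3.0):
    arl = np.mean([run_length(shift, rng=rng) for _ in range(500)])
    print(f"mean shift {shift}: ARL ~ {arl:.1f}")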
Weighted batch means [12] decorrelate the data by taking a weighted sum of the measurements in a window of fixed size. The weights are determined so as to decorrelate the measurements. This approach works best for detecting small shifts, and always has a run length greater than the length of the window. Moving center line EWMA (MCEWMA) [13] fits an EWMA to the measurements to minimize the one-step-ahead prediction error. The results in Fig. 6 indicate that MSSPC performs well as a general method for detecting shifts of various sizes in stationary correlated measurements. Nonstationary stochastic processes present special challenges for SPC, since their mean tends to change over time. The ARL performance of MSSPC and MCEWMA is compared in Fig. 7. In this case, the stochastic process is IMA(1,1) given by
Fig. 7 ARL of IMA(1,1) process.
x(t) = x(t-1) + e(t) - 0.5 e(t-1)
which can be modeled optimally by an EWMA. In this case, MCEWMA performs better than MSSPC for detecting large shifts, since it is the optimal approach for decorrelating an IMA(1,1) time series. The time delay in MSSPC for detecting large shifts is due to the downsampling at coarser scales. Using wavelets without downsampling is not feasible for SPC of such nonstationary measurements, since the high autocorrelation in the non-downsampled wavelet coefficients will increase the rate of false alarms for the same fault detection ability. The performance of multivariate SPC by MSPCA is illustrated based on simulated data from a fluidized catalytic cracker unit. This simulation was provided by Honeywell to the abnormal situation management consortium. The data consist of 110 measured variables and several types of process faults. Only three components are enough to capture most of the variation in the data. The results of multivariate SPC by PCA and MSPCA are compared in Fig. 8 for a slow drift in the slurry pump-around. This drift is present in variable numbers 55 and 97, and starts at 5 min and ends at 65 min. Conventional PCA is unable to detect the shift with more than 99% confidence, whereas MSPCA detects the shift consistently with 99% confidence after 24 min. The contribution plots for this fault at 20 min, shown in Fig. 9, clearly indicate that MSPCA identifies the contributing variables, whereas PCA does not. Further theoretical comparisons based on the average run length of steady-state PCA, dynamic PCA and MSPCA are also available [14].
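A sketch of the MSPCA idea for multivariate monitoring is given below: a PCA model is fitted to the coefficients of normal operating data at each scale, and new data are scored scale by scale with Hotelling's T². The detection-limit calibration and the reconstruction step of the published procedure [1] are omitted for brevity:

import numpy as np
import pywt

def scale_matrices(Y, wavelet='haar', level=3):
    # Decompose each column of Y; return one N_j x P coefficient matrix
    # per scale, last scaled signal first.
    cols = [pywt.wavedec(Y[:, p], wavelet, level=level) for p in range(Y.shape[1])]
    return [np.column_stack([c[j] for c in cols]) for j in range(level + 1)]

def fit_pca(D, n_comp=2):
    D = D - D.mean(axis=0)
    _, s, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:n_comp], (s[:n_comp] ** 2) / (len(D) - 1)  # loadings, eigenvalues

def t2_per_scale(Y0, Ynew, **kw):
    scores = []
    for D0, D1 in zip(scale_matrices(Y0, **kw), scale_matrices(Ynew, **kw)):
        P, lam = fit_pca(D0)                         # model of normal data
        t = (D1 - D0.mean(axis=0)) @ P.T             # scores of new data
        scores.append(np.sum(t ** 2 / lam, axis=1))  # Hotelling T^2 per scale
    return scores

rng = np.random.default_rng(5)
Y0 = rng.standard_normal((256, 5))                   # normal operating data
Ynew = rng.standard_normal((256, 5))
Ynew[128:, 0] += 2.0                                 # fault in variable 1
t2 = t2_per_scale(Y0, Ynew)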
Fig. 8 Multivariate SPC by PCA and MSPCA [1].
Fig. 9 Source plots for the sample at 20 min for monitoring by MSPCA and PCA.
5 Multiscale denoising with linear steady-state models

The data obtained from many processes are multivariate in nature, and have an empirical or theoretical model that relates the variables. Such measurements can be denoised by minimizing a selected objective function subject to the process model as the constraint. This approach has been very popular in the chemical and minerals processing industries under the name data rectification, and in electrical, mechanical and aeronautical fields under the names estimation or filtering. In this chapter all the model-based denoising methods are referred to as data rectification.

5.1 Single-scale model-based denoising
Data rectification involves solving the following optimization problem:

$$\min_{\hat{Y}} \; \Phi(\hat{Y}, Y) \qquad (3)$$

$$\text{subject to} \quad f(d\hat{Y}/dt, \hat{Y}, \ldots, U, t) = 0, \quad h(\hat{Y}, t) = 0, \quad g(\hat{Y}, t) \geq 0 \qquad (4)$$

where Φ is the objective function, Y the N × P matrix of N samples of P measured variables, U the matrix of inputs, Ỹ the matrix of noise-free variables, and Ŷ is the matrix of rectified variables. Each column of Y represents the measurements for a variable, and is represented as y_j. Each row of Y is the value of the variables at a given instant and is represented as y_i. The errors are usually considered to be additive, with

$$Y = \tilde{Y} + \varepsilon$$

The process model is represented by f, the equality constraints by h, and the inequality constraints by g. A common representation of the objective function in Eq. (3) minimizes the mean-square error of approximation as

$$\min_{\hat{y}_i} \; (y_i - \hat{y}_i)^T Q_\varepsilon^{-1} (y_i - \hat{y}_i) \qquad (5)$$

subject to the constraints, where Q_ε is the P × P covariance matrix of the errors. If the process model is linear, as given by Eq. (6),

$$A \hat{y}_i = 0 \qquad (6)$$

with no inequality constraints, the maximum likelihood estimate, ŷ_i, of ỹ_i is

$$\hat{y}_i = P_{ML}\, y_i \qquad (7)$$

where

$$P_{ML} = I - Q_\varepsilon A^T (A Q_\varepsilon A^T)^{-1} A \qquad (8)$$

In terms of the N × P data matrix, Eq. (7) may be written as

$$\hat{Y} = Y P_{ML}^T \qquad (9)$$

This approach is equivalent to maximum likelihood rectification for data contaminated by Gaussian errors. The likelihood function is proportional to the probability of realizing the measured data, y_i, given the noise-free data, ỹ_i,

$$L(y_i; \tilde{y}_i) \propto P(y_i | \tilde{y}_i) \qquad (10)$$

where y_i is the vector of variables obtained at the ith sampling instant. For data corrupted by additive, Gaussian errors, maximizing the logarithm of the likelihood function is equivalent to minimizing the mean-squared error [15], and results in the solution given by Eqs. (8) and (9). In most rectification problems, information about the nature of the underlying noise-free variables is available, or can be determined from historical data. The Bayesian approach uses such prior information in the form of a probability distribution to improve the smoothness and accuracy of the
rectified signal. Thus, the Bayesian approach aims to maximize the probability of the rectified variables given the measured data, P(ŷ_i | y_i). According to Bayes' rule,

$$P(\hat{y}_i | y_i) = \frac{P(y_i | \hat{y}_i)\, P(\hat{y}_i)}{P(y_i)}$$

The probability of the rectified variables, P(ŷ_i), is referred to as the prior, and reflects the prior information about the nature of the rectified variables. The term P(ŷ_i | y_i) is called the posterior, since it represents the probability distribution after the measurements have been collected. The denominator, P(y_i), need not be considered in the optimization since it is independent of ŷ_i. Thus, the objective function for the Bayesian data rectification problem may be stated as

$$\max_{\hat{y}_i} \; P(y_i | \hat{y}_i)\, P(\hat{y}_i) \qquad (11)$$
The term P(y_i | ŷ_i) is equal to the likelihood function in Eq. (10), and is the probability distribution of the contaminating errors. The same problem solved by the Bayesian approach yields a different result, as discussed here. If the errors and variables are assumed to follow a Gaussian distribution, the objective function for Bayesian data rectification may be written as

$$\min_{\hat{y}_i} \; (y_i - \hat{y}_i)^T Q_\varepsilon^{-1} (y_i - \hat{y}_i) + (\hat{y}_i - \mu_{\hat{y}})^T Q_{\hat{y}}^{-1} (\hat{y}_i - \mu_{\hat{y}}) \qquad (12)$$

where μ_ŷ and Q_ŷ are the mean and variance of the prior probability distribution. For a linear process model given by Eq. (6), the closed form solution for Eq. (12) may be derived as the maximum a posteriori (MAP) solution

$$\hat{y}_i = P_{MAP}\,(y_i + Q_\varepsilon Q_{\hat{y}}^{-1} \mu_{\hat{y}}) \qquad (13)$$

where

$$P_{MAP} = [I - D^{-1} Q_\varepsilon A^T (A D^{-1} Q_\varepsilon A^T)^{-1} A]\, D^{-1} \qquad (14)$$

$$D = (I + Q_\varepsilon Q_{\hat{y}}^{-1}) \qquad (15)$$
Eq. (12) reduces to Eq. (5) if the prior is a uniform distribution, that is, as Q_ŷ → ∞. For data rectification with linear process models, the matrix Q_ŷ will be singular, and cannot be inverted. Consequently, Eqs. (14) and (15) need to be modified by replacing Q_ŷ^{-1} by α Q̃_ŷ^{-1} α^T, where α are the eigenvectors corresponding to non-zero eigenvalues of Q_ŷ, and Q̃_ŷ is the corresponding eigenvalue matrix with non-zero diagonal terms. Unlike maximum likelihood rectification, Bayesian rectification can remove errors even in the absence of process models. Another useful feature of the Bayesian approach is that if the probability distributions of the prior and noise are Gaussian, the error of approximation between the noise-free and rectified measurements can be estimated before rectifying the data as

$$\mathrm{cov}(\tilde{Y} - \hat{Y}) = [I - P_{MAP}]\, Q_{\hat{y}}$$

where Ỹ is the matrix of the noise-free variables, and P_MAP is given by Eq. (14). This ability of the Bayesian approach to estimate the error without processing the measurements is useful for controlling the complexity of multiscale Bayesian rectification, as illustrated by the examples in this chapter. The quality of any Bayesian approach depends on the accuracy of the prior probability distribution, and a variety of methods have been devised for determining the distribution [6,16,17]. The Bayesian or maximum likelihood approach may also be used for simultaneous removal of random and gross errors by using a probability distribution that represents both types of errors, such as a weighted sum of broad and narrow Gaussian distributions [18]. The narrow Gaussian represents the random errors, and the broad Gaussian represents the gross errors. A closed-form solution like Eq. (8) or (14) can no longer be derived for non-Gaussian errors, and quadratic or nonlinear optimization methods are necessary for rectification.
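The closed-form solutions of Eqs. (8) and (14) translate directly into a few matrix computations. A minimal sketch for the Gaussian case is given below; the balance matrix, covariances, and measurement vector are illustrative assumptions, and the prior mean is taken as zero:

import numpy as np

def p_ml(A, Qe):
    # Maximum likelihood projection, Eq. (8): I - Qe A^T (A Qe A^T)^{-1} A
    I = np.eye(Qe.shape[0])
    return I - Qe @ A.T @ np.linalg.solve(A @ Qe @ A.T, A)

def p_map(A, Qe, Qy_inv):
    # Bayesian (MAP) filter matrix, Eqs. (14)-(15), given a prior precision.
    I = np.eye(Qe.shape[0])
    Dinv = np.linalg.inv(I + Qe @ Qy_inv)               # D^{-1}, Eq. (15)
    M = Dinv @ Qe @ A.T
    return (I - M @ np.linalg.solve(A @ M, A)) @ Dinv   # Eq. (14)

A = np.array([[1., -1., 0., 0., 0.],                    # illustrative 3 x 5 model
              [0., 1., -1., 1., 0.],
              [0., 0., 1., 0., -1.]])
Qe = np.diag([1., 16., 16., 9., 1.])                    # error covariance
Qy_inv = np.eye(5) / 100.0                              # broad Gaussian prior
y = np.array([10., 9., 29., 21., 31.])                  # one noisy sample
print(p_ml(A, Qe) @ y)                                  # rectified, Eq. (7)
print(p_map(A, Qe, Qy_inv) @ y)                         # rectified, zero-mean prior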
5.2 Multiscale Bayesian data rectification

Existing methods for data rectification with process models, including maximum likelihood and Bayesian methods, are inherently single-scale in nature, since they represent the data at the same resolution everywhere in time and frequency. The multiscale Bayesian data rectification method developed in this section combines the benefits of Bayesian rectification and multiscale filtering using orthonormal wavelets. The methodology for multiscale Bayesian rectification is a special case of the general multiscale analysis and modeling methodology shown in Fig. 2. Each
variable is decomposed on the selected family of orthonormal wavelet basis functions. The coefficients at each scale are rectified based on the prior and noise probability distributions at that scale, as discussed in this section. If the noise and prior are Gaussian, the coefficients at each scale may be rectified efficiently by a closed form solution. If the noise or prior is non-Gaussian, the resulting nonlinear or quadratic optimization problem needs to be solved to obtain the rectified coefficients at each scale. This method is analogous to multigrid methods for solving differential equations [19], and multiscale principal component analysis [1]. The multiscale Bayesian method requires an estimate of the prior at each scale. Fortunately, the ability of wavelets to compress deterministic features, and to approximately diagonalize linear operators [20] and decorrelate stochastic processes, often makes it easier to estimate the prior of the wavelet coefficients, as compared to estimating the prior of the time-domain signal. Many common stochastic processes that are non-Gaussian in the time domain follow a Gaussian distribution if the measurements are transformed by an appropriate mathematical operator. Examples of such data include common non-stationary stochastic processes such as ARIMA and fractal time series. An ARIMA process can be transformed to a stationary Gaussian stochastic process by successive differencing [21]. Decomposition of such stochastic processes on orthonormal wavelets results in approximately white and Gaussian wavelet coefficients, since wavelets are their approximate eigenfunctions [22]. Thus, the wavelet decomposition eliminates the need to find the appropriate mathematical operator. Furthermore, the wavelet coefficients are approximately white, with the variance at each scale changing according to the power spectrum of the time-domain signal in the range of frequencies corresponding to each scale. Wavelets also approximately decorrelate other autocorrelated stochastic processes, such as ARMA processes, while maintaining their Gaussian distribution. These properties of wavelets permit the prior for the coefficients at each scale to be represented as a Gaussian, and the objective function for Bayesian rectification at each scale may be written as

$$\min_{\hat{d}_{jk}} \; (d_{jk} - \hat{d}_{jk})^T Q_{\varepsilon j}^{-1} (d_{jk} - \hat{d}_{jk}) + \hat{d}_{jk}^T Q_{\hat{D}j}^{-1} \hat{d}_{jk} \qquad (16)$$
where d_jk is the vector of wavelet coefficients for all variables at scale j and position k, d̂_jk is the corresponding vector of rectified coefficients, and Q_εj and Q_D̂j are the covariance matrices of the error and the coefficients at scale j. Eq. (16) is similar to Eq. (12), but allows the error and prior covariance to change with scale to reflect the behavior of scale-varying signals. Thus, the rectified coefficients at each scale may be computed by modifying Eq. (14) to use a filter matrix, P_MAP,j, computed from the covariance matrices at the selected scale as

$$\hat{d}_{jk} = P_{MAP,j}\, d_{jk} \qquad (17)$$
The linear process model used at each scale remains unchanged. Like the single-scale Bayesian approach, the multiscale Bayesian approach with Gaussian error and prior also provides an estimate of the covariance of the error of approximation at each scale as

$$\mathrm{cov}(\tilde{D}_j - \hat{D}_j) = [I - P_{MAP,j}]\, Q_{\hat{D}j} \qquad (18)$$

Decomposition of the variables on orthonormal wavelets decomposes the error covariance at all scales as shown in Eq. (2), since

$$(\tilde{D}_j - \hat{D}_j)^T (\tilde{D}_j - \hat{D}_j) = (N_j - 1)\,\mathrm{cov}(\tilde{D}_j - \hat{D}_j) \qquad (19)$$
This multiscale decomposition of the estimated error covariance is analogous to an eigenvalue decomposition of the error covariance matrix at multiple scales. It can be used to save computation by eliminating coefficients at less important scales from the rectification, as described later in this section, and illustrated by the examples. If the distributions of the error and the prior are represented as Gaussian, the estimate of the error covariance at each scale can be used to decrease the computational complexity of multiscale Bayesian rectification. For a non-stationary stochastic process corrupted by white noise, the energy of the underlying signal decreases at finer scales, while that of the noise remains constant at all scales. Thus, the signal-to-noise ratio decreases with increasing frequency. Since the accuracy of the rectification decreases with decreasing signal-to-noise ratio, coefficients at finer scales may not contribute as much towards reducing the overall error of rectification as the coefficients at coarser scales. If the spectra of the noise and underlying signal are similar, it may not be possible to eliminate any scales from the rectification. If the finest scale is eliminated, the rectification will save 50% of the total computation. If the finest two scales can be eliminated, the computational saving will be 75%. The relevance of the coefficients at any scale to reducing the approximation error may be determined by

$$R_j = \frac{(N_j - 1)\,(I - P_{MAP,D_j})\, Q_{\hat{D}_j}}{(N_{j_0} - 1)\,(I - P_{MAP,A_{j_0}})\, Q_{\hat{A}_{j_0}} + \sum_{i=j_0}^{J-1} (N_i - 1)\,(I - P_{MAP,D_i})\, Q_{\hat{D}_i}} \qquad (20)$$

where N_j = 2^{-j}N is the number of coefficients at scale j. Eq. (20) represents the relative improvement in the error of approximation by introducing the coefficients at a finer scale, j. This equation is analogous to the relative error covariance matrix used by Miller and Willsky [5] for multiscale sensor fusion. If the estimate of the relative improvement in the error covariance is small, excluding the finer scales from the rectification will not have much effect on the quality of the rectified data. For data containing deterministic features such as mean shifts or oscillations, determining the probability distribution may require nonparametric estimation, since the distribution usually cannot be represented by a standard distribution such as a Gaussian. Fortunately, estimating the distribution of the wavelet coefficients of such a signal is likely to be easier since wavelets capture deterministic features as a few relatively large wavelet coefficients. Thus, the probability distribution of the absolute value of the wavelet coefficients of a signal containing deterministic features consists of a large number of small coefficients, and a small number of large coefficients. The prior for the wavelet coefficients may be approximated quite well by an exponential distribution. Consequently, the objective function for Bayesian rectification of the wavelet coefficients at scale j with an exponential prior and Gaussian errors becomes

$$\min_{\hat{d}_{jk}} \; \left[(d_{jk} - \hat{d}_{jk})^T Q_{\varepsilon j}^{-1} (d_{jk} - \hat{d}_{jk}) + \lambda_j^T |\hat{d}_{jk}|\right] \qquad (21)$$
where λ_j is a scaling vector proportional to the reciprocal of the variance of the exponential distribution. Eq. (21) is a multivariate version of the objective function used for basis pursuit denoising [23]. Unlike basis pursuit or wavelet denoising, data rectification by satisfying Eq. (21) subject to the model constraints takes advantage of both the multiscale representation and the multivariate process model. Consequently, the multiscale Bayesian approach performs better than existing multiscale filtering methods for rectification of deterministic underlying signals. The optimization problem with the exponential prior lacks a closed form solution, but can be converted to a quadratic programming problem [23].
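For a diagonal error covariance and no active model constraints, the L1-penalized problem of Eq. (21) separates by coefficient and has the familiar soft-thresholding solution; the sketch below covers only that special case, while the general constrained problem requires quadratic programming [23]:

import numpy as np

def l1_map(d, var_e, lam):
    # Componentwise minimiser of (d - dh)^2 / var_e + lam * |dh|,
    # i.e. soft thresholding with threshold lam * var_e / 2.
    t = lam * var_e / 2.0
    return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

d = np.array([0.1, -0.4, 3.2, -2.7, 0.05])  # coefficients at one scale
print(l1_map(d, var_e=0.25, lam=4.0))       # small coefficients shrink to zero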
In many practical rectification problems, the error or prior may not follow a Gaussian distribution, and may not permit the application of Eq. (20) to eliminate less important scales from the rectification. For example, if a signal is contaminated by both random and gross errors, the error distribution may be represented as a weighted sum of two Gaussians. Similarly, if the underlying signal contains deterministic features, the prior may be represented as an exponential function. Fortunately, if a Gaussian provides a reasonable approximation of the non-Gaussian error or prior, Eq. (20) may still be used to determine whether the nonlinear optimization at finer scales will be worth the computational effort. Rectification of the coefficients of the last scaled signal may require a different approach from that used for rectification of the wavelet coefficients. This is because the last scaled signal has the lowest frequencies, which may have different statistical properties than the wavelet coefficients. Nonparametric methods for determining the prior, and nonlinear optimization methods for accurate Bayesian rectification of the last scaled signal, may be needed even when the wavelet coefficients follow a Gaussian distribution or when their distribution can be determined by parametric methods. For example, the wavelet coefficients of a nonstationary stochastic process will be Gaussian, but the last scaled signal will be non-Gaussian, like the measured data. Similarly, the wavelet coefficients of a signal with deterministic features may be represented as an exponential function, but the last scaled signal may require the use of nonparametric methods for estimating the prior. Fortunately, the need for nonlinear optimization or nonparametric estimation of the last scaled signal may be eliminated by decomposing the signal to an adequately coarse scale to minimize the contribution of the noise. Subsequently, the maximum likelihood approach may be used to rectify the last scaled signal, without much loss in accuracy, but with significant saving in computation. This is analogous to not thresholding the last scaled signal, which is suggested in multiscale filtering [3]. Alternatively, for the same data, the coefficients of the last scaled signal may also be approximated as a Gaussian distribution, without much effect on the quality of the rectification. In contrast, a conventional single-scale Bayesian approach may not be able to avoid nonparametric estimation or nonlinear optimization. Thus, the multiscale rectification method is likely to be computationally more efficient than a time-domain rectification method for many different types of signals. The multiscale rectification approach may be used for both maximum likelihood and Bayesian methods by solving the corresponding optimization
problem at each scale. A multiscale maximum likelihood rectification approach will perform better than a single-scale maximum likelihood approach for data that are contaminated by scale-dependent errors such as autocorrelated stochastic processes. In either case, if the rectification parameters are scale-invariant, the result of multiscale rectification is identical to that of single-scale rectification [6].
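Putting the pieces together, multiscale Bayesian rectification decomposes each variable, applies the scale-dependent MAP filter of Eq. (17) to the coefficient vectors at every scale, and reconstructs. A sketch under Gaussian assumptions follows; the naive estimation of the scale-wise prior covariance from the noisy coefficients themselves, and the example model, are simplifications, not the procedures of [6]:

import numpy as np
import pywt

def multiscale_rectify(Y, A, Qe, wavelet='haar', level=4):
    # Rectify an N x P data matrix against the linear model A y = 0 by
    # applying a MAP filter (Eq. 17) at each scale.
    P = Y.shape[1]
    cols = [pywt.wavedec(Y[:, p], wavelet, level=level) for p in range(P)]
    out = [[] for _ in range(P)]
    for j in range(level + 1):
        Dj = np.column_stack([c[j] for c in cols])   # N_j x P coefficients
        Qdj = np.cov(Dj, rowvar=False) + 1e-6 * np.eye(P)  # naive prior (assumed)
        Dinv = np.linalg.inv(np.eye(P) + Qe @ np.linalg.inv(Qdj))  # Eq. (15)
        M = Dinv @ Qe @ A.T
        Pj = (np.eye(P) - M @ np.linalg.solve(A @ M, A)) @ Dinv    # Eq. (14)
        Dj_hat = Dj @ Pj.T                           # Eq. (17), row-wise
        for p in range(P):
            out[p].append(Dj_hat[:, p])
    return np.column_stack([pywt.waverec(c, wavelet) for c in out])

# Illustrative use: each row y of the noise-free F satisfies A y = 0.
A = np.array([[1., -1., 0., 0., 0.],
              [0., 1., -1., 1., 0.],
              [0., 0., 1., 0., -1.]])
Qe = np.diag([1., 16., 16., 9., 1.])
rng = np.random.default_rng(6)
f1, f4 = rng.standard_normal(256).cumsum(), rng.standard_normal(256).cumsum()
F = np.column_stack([f1, f1, f1 + f4, f4, f1 + f4])
Y = F + rng.standard_normal(F.shape) * np.sqrt(np.diag(Qe))
Y_hat = multiscale_rectify(Y, A, Qe)

In the Gaussian case the per-scale filter matrices can be precomputed, so the cost beyond single-scale rectification is essentially one decomposition and reconstruction per variable.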
5.3 Performance of multiscale model-based denoising

The performance and properties of multiscale Bayesian rectification are compared with those of other methods in the following examples. The examples compare the performance for rectification of Gaussian errors with steady-state and dynamic linear models for stochastic and deterministic underlying signals. Three independent material balance equations can be written for the flowsheet shown in Fig. 10 as [24]
$$\begin{pmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 1 & 0 \\ 0 & 0 & 1 & 0 & -1 \end{pmatrix} \begin{pmatrix} F_1 \\ F_2 \\ F_3 \\ F_4 \\ F_5 \end{pmatrix} = 0$$
The performance of the various rectification methods is compared for noise-free underlying signals represented as a uniform distribution, a non-stationary stochastic process, and data with deterministic features.
Uniform distribution. The data used for this illustration are similar to those used by Johnston and Kramer [24]. The noise-free measurements for the flowrates F1 and F4 are uniformly distributed in the intervals [1,5] and [15,40], respectively. The flowrates F1 through F5 are contaminated by independent Gaussian errors with standard deviations 1, 4, 4, 3, and 1, respectively.
Fig. 10 Flowsheet for multiscale model-based denoising.
The performance of maximum likelihood, single-scale Bayesian, and multiscale Bayesian rectification is compared by Monte-Carlo simulation with 500 realizations of 2048 measurements for each variable. The prior probability distribution is assumed to be Gaussian for the single-scale and multiscale Bayesian methods. The normalized mean-square error of approximation is computed as

$$\mathrm{MSE} = \frac{1}{NP} \sum_{i=1}^{N} (\tilde{y}_i - \hat{y}_i)^T Q_\varepsilon^{-1} (\tilde{y}_i - \hat{y}_i)$$
The mean and standard deviation of the MSE for 500 realizations of the 2048 measurements per variable are summarized in Table 1, and are similar to those of Johnston and Kramer. The average and standard deviation of the mean-squared errors of single-scale and multiscale Bayesian rectification are comparable, and smaller than those of maximum likelihood rectification. The Bayesian methods perform better than the maximum likelihood approach, since the empirical Bayes prior extracts and utilizes information about the finite range of the measurements. In contrast, the maximum likelihood approach implicitly assumes all values of the measurements to be equally likely. If information about the range of variation of the rectified values is available, it can be used for maximum likelihood rectification, leading to more accurate results. For this example, since the uniformly distributed uncorrelated measurements are scale-invariant in nature, the performance of the single-scale and multiscale Bayesian methods is comparable.
Table 1. Rectification of uniform distribution. Mean and standard deviation of MSE based on 500 realizations.

Rectification method              MSE Mean   S.D.
None                              0.9991     0.0440
Maximum likelihood                0.3998     0.0141
Bayesian, single scale            0.2492     0.0044
Bayesian, multiscale, j0 = 6      0.2495     0.0044

Non-stationary stochastic process. The noise-free measurements for this illustration are generated as integrated white noise. The variables are contaminated by independent, identically distributed Gaussian errors with standard deviations 5, 20, 20, 15, and 5. The multiscale Bayesian approach is
particularly well-suited for rectification of such data due to the decorrelation ability of wavelets, and the strongly scale-dependent nature of non-stationary stochastic processes. The mean and standard deviation of the mean-squared errors for 500 realizations shown in Table 2 confirm that the multiscale Bayesian approach performs significantly better than existing methods. In this example, the prior for the single-scale Bayesian approach and for the last scaled signal in the multiscale Bayesian rectification is approximated as a Gaussian. The Monte-Carlo simulation results show a greater variation of the mean-squared error for single-scale Bayesian rectification than for the multiscale approach. This is due to the significant variation of the probability distribution of each realization in the time domain. The percentage reduction in error obtained by including finer-scale coefficients indicates that the finest scales do not contribute much to decreasing the error, and the finest scale, J-1, may be ignored with virtually no loss of accuracy. This is confirmed by the error for rectification without the finest scale in Table 2. If the finest two scales, J-1 and J-2, are eliminated, the reduction in computation is 75%, and the mean-square error increases only slightly, as shown in Table 2. If the correct mathematical transformation to convert the underlying signal to a Gaussian distribution is used, the single-scale Bayesian approach may give results comparable to the multiscale approach, but it cannot save computation by eliminating the less important scales.
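The bookkeeping behind eliminating the finest scales is easy to see in code. The sketch below assumes the PyWavelets package and the Haar filter purely for illustration; it decomposes a signal, zeroes the two finest detail scales and reconstructs. In the dyadic case these two scales hold 1/2 + 1/4 = 75% of the coefficients, which is the computational saving quoted above. In the Bayesian method the retained coefficients would of course be rectified scale by scale rather than left untouched.

    import numpy as np
    import pywt  # PyWavelets; an assumption, not a package used by the chapter

    y = np.random.randn(2048)                  # stand-in for one measured variable
    coeffs = pywt.wavedec(y, 'haar', level=5)  # [cA5, cD5, cD4, cD3, cD2, cD1]
    coeffs[-1] = np.zeros_like(coeffs[-1])     # drop finest scale, j = J-1
    coeffs[-2] = np.zeros_like(coeffs[-2])     # drop next finest scale, j = J-2
    y_rec = pywt.waverec(coeffs, 'haar')       # reconstruct from remaining scales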
Signal with deterministic changes. In this example, the noise-free signal is deterministic with some sudden changes in the mean. The variables are contaminated by iid Gaussian errors of standard deviation 0.5, and the results are summarized in Fig. 11.

Table 2. Rectification of non-stationary stochastic process.

Rectification method                                                  MSE Mean   S.D.
None                                                                  0.9999     0.0463
Maximum likelihood                                                    0.3993     0.0150
Bayesian, single scale                                                0.3825     0.0697
Bayesian, multiscale, j0 = 6 (all scales)                             0.0538     0.0030
Bayesian, multiscale, j0 = 6 (without finest scale, j = 10)           0.0539     0.0030
Bayesian, multiscale, j0 = 6 (without finest two scales, j = 10, 9)   0.0544     0.0030
Fig. 11 Data rectification of a signal with deterministic features. Dashed line is noisy data. (a) Original and noisy data, (b) wavelet thresholding, (c) wavelet thresholding after maximum likelihood rectification, (d) multiscale Bayesian rectification.
Wavelet thresholding of the result of maximum likelihood rectification gives a smaller error than either method alone. The threshold is determined by the median absolute deviation method [3]. The smallest error is obtained for multiscale Bayesian rectification with the distribution of the wavelet coefficients represented as an exponential function, and the last scaled signal rectified by the maximum likelihood approach. If the distribution at each scale is approximated by a Gaussian, the quality of rectification is slightly worse than that for the exponential distribution, but requires much less computation [6].
6 Conclusions

This chapter presented a general methodology for multiscale analysis, modeling, and optimization. Specific applications of this general method were developed for multiscale statistical process control and multiscale model-based denoising with linear steady-state models.

The properties of MSSPC studied in this chapter indicate that MSSPC is an excellent general method for monitoring of both univariate and multivariate processes. It can perform better than existing methods for processes where it is essential to detect shifts of different sizes. Furthermore, MSSPC can also be easily applied to monitoring of data with any type of autocorrelation, due to the ability of wavelets to approximately decorrelate stochastic processes. MSSPC is certainly not a panacea, and if the objective of monitoring is to detect specific types of changes only, then an existing single-scale method can be tailored for this task to provide the best possible performance. For example, if the objective of monitoring is to detect only small shifts, an optimum MA, EWMA, or CUSUM method can be designed. Similarly, if the measurements can be modeled as an IMA(1,1) process, a moving center line EWMA can result in excellent performance. In practice, since it is usually essential to detect all types of changes, and since the stochastic nature of the measurements need not follow an IMA(1,1) model, MSSPC is a method that can be superior to existing methods on the average.

A multiscale Bayesian approach for data rectification of Gaussian errors with linear steady-state models was also presented in this chapter. This approach provides better rectification than maximum likelihood rectification and single-scale Bayesian rectification for measured data where the underlying signals or errors are multiscale in nature. Since data from most chemical and manufacturing processes are usually multiscale in nature, due to the presence of deterministic and stochastic features that change over time and/or frequency, the multiscale Bayesian approach is expected to be beneficial for rectification of most practical data. The improved performance of the multiscale approach is due to the ability of orthonormal wavelets to approximately decorrelate most stochastic processes, and to compress deterministic features into a small number of large wavelet coefficients. These properties permit representation of the prior probability distribution of the variables at each scale as a Gaussian or exponential function for stochastic and deterministic signals, respectively. Consequently, computationally expensive non-parametric methods need not be used for estimating the probability distribution of the coefficients at each scale. If the probability distribution of the contaminating errors and the prior can be represented as a Gaussian, the multiscale Bayesian approach provides
an estimate of the error at each scale, before data rectification. This estimate can be used to eliminate scales that have an insignificant contribution towards decreasing the overall error, resulting in significant computational savings. It is expected that the general multiscale methodology and the multiscale methods for SPC and model-based denoising will help in developing multiscale methods for other tasks.
7 Acknowledgements
Financial support from the National Science Foundation through grant CTS-9733627, and the donors of the Petroleum Research Fund, administered by the American Chemical Society through grant 30523-G9 are gratefully acknowledged.
References
1. B.R. Bakshi, Multiscale PCA with Application to Multivariate Statistical Process Monitoring, AIChE Journal, 44(7) (1998), 1596-1610.
2. S. Top and B.R. Bakshi, Improved Statistical Process Control Using Wavelets, in: Third International Conference on Foundations of Comp. Aided Proc. Oper. (J.F. Pekny and G.E. Blau, Eds), AIChE Symposium Series, 94(320) (1998), 332-337.
3. D.L. Donoho, I.M. Johnstone, G. Kerkyacharian, D. Picard, Wavelet Shrinkage: Asymptopia?, Journal of the Royal Statistical Society, Series B, 57 (1995), 41.
4. G.W. Wornell, A.V. Oppenheim, Wavelet-based Representations for a Class of Self-similar Signals with Application to Fractal Modulation, IEEE Transactions on Information Theory, 38 (1992), 785.
5. E. Miller, A.S. Willsky, A Multiscale Approach to Sensor Fusion and the Solution of Linear Inverse Problems, Applied and Computational Harmonic Analysis, 2 (1995), 127.
6. B.R. Bakshi, M.N. Nounou, P.K. Goel, X. Shen, Multiscale Bayesian Data Rectification with Linear Steady-State Models, Ind. Eng. Chem. Res., submitted (1999).
7. S. Ungarala and B.R. Bakshi, Multiscale Bayesian Rectification of Linear and Nonlinear Processes, AIChE Annual Meeting, Dallas, TX (also available as technical report) (1999).
8. M.N. Nounou and B.R. Bakshi, Online Multiscale Filtering of Random and Gross Errors Without Process Models, AIChE Journal, 45(5) (1999), 1041-1058.
9. J.F. MacGregor, Statistical Process Control of Multivariate Processes, in: Proceedings of the IFAC ADCHEM, Kyoto, Japan (1994).
10. D.C. Montgomery, Introduction to Statistical Quality Control, Wiley, New York (1996).
11. J.M. Lucas, Combined Shewhart-CUSUM Quality Control Schemes, Journal of Quality Technology, 14(2) (1982), 51-59.
12. G.C. Runger and T.R. Willemain, Model-Based and Model-Free Control of Autocorrelated Processes, Journal of Quality Technology, 27(4) (1995), 283-292.
13. C.M. Mastrangelo and D.C. Montgomery, SPC with Correlated Observations for the Chemical and Process Industries, Quality and Reliability Engineering International, 11 (1995), 79-89.
14. B.R. Bakshi, H. Aradhye and R. Strauss, Process Monitoring by PCA, Dynamic PCA, and Multiscale PCA - Theoretical Analysis and Disturbance Detection in the Tennessee Eastman Process, AIChE Annual Meeting, Dallas, TX (1999).
15. A.H. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press, New York (1970).
16. J.O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer, New York (1985).
17. C.P. Robert, The Bayesian Choice: A Decision-Theoretic Motivation, Springer, New York (1994).
18. I.B. Tjoa, L.T. Biegler, Simultaneous Strategies for Data Reconciliation and Gross Error Detection of Nonlinear Systems, Computers & Chemical Engineering, 15 (1991), 679.
19. W.L. Briggs, A Multigrid Tutorial, SIAM, Philadelphia (1987).
20. G. Beylkin, R. Coifman, V. Rokhlin, Fast Wavelet Transforms and Numerical Algorithms I, Communications on Pure and Applied Mathematics, XLIV (1991), 141.
21. G.E.P. Box, G.M. Jenkins, G.C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, NJ (1994).
22. G.W. Wornell, A Karhunen-Loeve-like Expansion for 1/f Processes via Wavelets, IEEE Transactions on Information Theory, 36 (1990), 859.
23. S.S.B. Chen, D.L. Donoho, M.A. Saunders, Atomic Decomposition by Basis Pursuit, SIAM Journal on Scientific Computing, 20 (1999), 33.
24. L.P.M. Johnston, M.A. Kramer, Maximum Likelihood Data Rectification: Steady-State Systems, AIChE Journal, 41 (1995), 2415.
CHAPTER 18
Application of Adaptive Wavelets in Classification and Regression
Y. Mallet, D. Coomans and O. de Vel
Statistics and Intelligent Data Analysis Group, School of Computer Science, Mathematics and Physics, James Cook University, Townsville, Australia
1 Introduction

This chapter demonstrates how the adaptive wavelet algorithm of Chapter 8 can be implemented in conjunction with classification analysis and regression methods. The data used in each of these applications are spectral data sets, where the reflectance/absorbance of substances is measured at regular increments in the wavelength domain.
2 Adaptive wavelets and classification analysis

2.1 Review of relevant classification methodologies

Discriminant analysis techniques (also called classification techniques) are concerned with classifying objects into one of two or more classes. Discriminant techniques are considered to be learning procedures. Given a set of objects whose class identity is known, a model "learns" from the variables which have been measured for each of the objects a procedure which can be used to assign a new object, whose class identity is unknown, to one of the predefined classes. Such a procedure is performed using a well-defined discriminatory rule. In many instances one will be given a set of training data consisting of n_r objects x_{i(r)} from class r ∈ {1, 2, ..., R}, giving a total of n = Σ_{r=1}^{R} n_r objects. Each object x_i consists of measurements made on p variables and can be represented as a data vector of the form x_i = (x_{1i}, x_{2i}, ..., x_{pi})^T, where p also indicates the dimensionality of the data set. In the case of a spectral data set, each object will represent a spectrum. For each training object x_i the class identity y_i ∈ {1, 2, ..., R} is known. The training objects are stored as
columns in the p × n data matrix X = (x_1, x_2, ..., x_n), and we prefer that the class labels are stored in the n × 1 column vector y = (y_1, y_2, ..., y_n)^T. The reason for defining X to be a p × n matrix, which is in slight contrast to the dimension of y, is to allow for a simplification of notation when the DWT of the data matrix is performed.

A discriminant model which is assessed using the same training data that were used to estimate the parameters in the model will usually reflect overly optimistic results. It can be appropriate to use an independent test set for assessing the validity of the model. Let X' define the testing data, which contains n' objects with n'_r objects from class r such that n' = Σ_{r=1}^{R} n'_r, and let y' = (y'_1, ..., y'_{n'}) denote the vector of true class labels of the testing data.
The discriminatory rule that we consider is based on Bayes decision rule [1]. An object x is assigned to the class r which maximizes the posterior probability

\[ P(r|\mathbf{x}), \quad \text{for } r = 1, \ldots, R \tag{1} \]
By a direct application of Bayes theorem, the posterior probability in Eq. (1) can be written as

\[ P(r|\mathbf{x}) = \frac{p(\mathbf{x}|r)P(r)}{p(\mathbf{x})} \tag{2} \]

where P(r) is the a priori probability of belonging to class r, p(x) is the probability density of x, and

\[ p(\mathbf{x}|r) = (2\pi)^{-p/2} |S_r|^{-1/2} \exp\left[-0.5(\mathbf{x} - \bar{\mathbf{x}}_r)^T S_r^{-1} (\mathbf{x} - \bar{\mathbf{x}}_r)\right] \tag{3} \]
is the class probability density function, which measures the probability of x arising from group r. It is assumed that p(x|r) follows a multivariate normal distribution. Commonly, the class covariance matrices S_r and the class mean vectors x̄_r are calculated using the maximum likelihood estimates

\[ S_r = \frac{1}{n_r} \sum_{i=1}^{n_r} (\mathbf{x}_{i(r)} - \bar{\mathbf{x}}_r)(\mathbf{x}_{i(r)} - \bar{\mathbf{x}}_r)^T, \qquad \bar{\mathbf{x}}_r = \frac{1}{n_r} \sum_{i=1}^{n_r} \mathbf{x}_{i(r)}. \]

Since p(x) is independent of r, we can disregard the denominator in Eq. (2), and the classification problem can be reformulated as: assign object x to the group r which maximizes the classification score
\[ g(\mathbf{x}, r) = p(\mathbf{x}|r)P(r) = (2\pi)^{-p/2} |S_r|^{-1/2} \exp\left[-0.5(\mathbf{x} - \bar{\mathbf{x}}_r)^T S_r^{-1} (\mathbf{x} - \bar{\mathbf{x}}_r)\right] P(r), \quad r = 1, \ldots, R. \tag{4} \]

The particular Bayesian classifier that we consider in this chapter is Bayesian linear discriminant analysis (BLDA). For BLDA one assumes that the class covariance matrices S_r are equal. A pooled covariance matrix is constructed as follows:

\[ S_{\mathrm{pooled}} = \frac{1}{n} \sum_{r=1}^{R} n_r S_r = S_W \]

and then substituted into Eq. (4). Upon taking the natural logarithm of Eq. (4) and ignoring the constants, the following classification rule for BLDA results:

\[ g_{\mathrm{BLDA}}(\mathbf{x}, r) = -0.5(\mathbf{x} - \bar{\mathbf{x}}_r)^T S_{\mathrm{pooled}}^{-1} (\mathbf{x} - \bar{\mathbf{x}}_r) + \ln P(r). \tag{5} \]
An advantage of using Bayesian linear discriminant analysis is that it allows for easy implementation of probability-based criteria (see Section 2.3). A disadvantage of using BLDA is that it does not have a graphical element such as Fisher's linear discriminant analysis (FLDA). Whilst in this chapter we use BLDA for predicting the class membership of spectra, we will use FLDA as a data exploratory technique for viewing the separation among the classes with the aid of discriminant plots (see Section 2.5). For this reason we briefly describe Fisher's linear discriminant analysis. FLDA seeks the linear transformation z = X^T v which maximizes v^T S_B v subject to v^T S_W v = 1, where v = (v_1, ..., v_p)^T is a vector of discriminant coefficients and S_B is the between-classes covariance matrix defined by S_B = (1/n) Σ_{r=1}^{R} n_r (x̄_r − x̄)(x̄_r − x̄)^T, with x̄ = (1/n) Σ_{i=1}^{n} x_i. The solution to the maximization problem reduces to solving (S_W^{-1} S_B − λI)v = 0. Notice that there will be s_0 = min(R−1, p) eigenvalues λ_1, ..., λ_{s0} and s_0 corresponding eigenvectors v_1, ..., v_{s0}, which produce s_0 discriminant variables z_1, ..., z_{s0}. The first discriminant variable gives the largest measure of the discriminant criterion; the second gives the next largest measure, such that z_2 is uncorrelated with z_1, and so on. Section 2.5 plots the first two discriminant variables, or variates, z_1 and z_2 against each other. (One data set has only one discriminant variable since min(R−1, p) = 1.)
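The BLDA rule of Eq. (5) reduces to a few lines of linear algebra. The following Python sketch assumes the class means, pooled covariance and priors have already been estimated from the training data; the function and variable names are ours, not the chapter's.

    import numpy as np

    def blda_scores(x, means, S_pooled_inv, priors):
        # means: list of R class mean vectors; priors: list of P(r)
        scores = []
        for m, p in zip(means, priors):
            d = x - m
            scores.append(-0.5 * d @ S_pooled_inv @ d + np.log(p))
        return np.array(scores)

    # assign x to the class with the largest classification score:
    # r_hat = np.argmax(blda_scores(x, means, S_pooled_inv, priors))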
2.2 Classification assessment criteria

The correct classification rate (CCR) or misclassification rate (MCR) are perhaps the most favoured assessment criteria in discriminant analysis. Their widespread popularity is obviously due to their ease of interpretation and implementation. Other assessment criteria are based on probability measures. Unlike correct classification rates, which provide a discrete measure of assignment accuracy, probability based criteria provide a more continuous measure and reflect the degree of certainty with which assignments have been made. In this chapter we present results in terms of correct classification rates, for their ease of interpretation, but use a probability based criterion function in the construction of the filter coefficients (see Section 2.3). Whilst we speak of correct classification rates, misclassification rates (MCR = 1 − CCR) would equally suffice.

The correct classification rate is typically formulated as the ratio of the number of correctly classified objects to the total number of objects in the test set. More formally, if we let ŷ = (ŷ_1, ..., ŷ_n) be the vector of predicted class labels with ŷ_i ∈ {1, ..., R}, the correct classification rate can then be expressed as follows:

\[ \mathrm{CCR} = \frac{1}{n} \sum_{i=1}^{n} \delta(y_i, \hat{y}_i) \tag{6} \]

Here δ is an indicator variable such that δ(y_i, ŷ_i) = 1 if y_i = ŷ_i and zero otherwise. (For an interesting documentation of error-rate estimation procedures applied to simulated data, the reader is referred to [2].) Eq. (6) is based on the training data, and as mentioned earlier, this result is likely to give an overly optimistic impression of the classification model. The correct classification rate for the testing data, which is defined by

\[ \mathrm{CCR}_{\mathrm{test}} = \frac{1}{n'} \sum_{i=1}^{n'} \delta(y'_i, \hat{y}'_i), \]

should also be considered.
2.3 Classification criterion functions for the adaptive wavelet algorithm

A correct classification rate is a discrete measure whose calculation is based upon which side of a decision boundary the observations lie. It does not reflect how "close" or how "far away" the observations lie from the decision boundary, and hence how clearly the assignments are made. It is still possible to have a high classification rate where many assignments have lain close to decision boundaries. An advantage of using probabilistic based classification methods, such as those based on Bayes decision rule, is that it is possible to obtain more information than just the correct classification rate. Probabilistic measures provide information about the assignment accuracy, but they also reflect the degree of certainty with which assignments have been made. Due to their continuous nature and ability to measure the distinctness of class predictions, we consider a probability based criterion function for the adaptive wavelet algorithm. Most probabilistic discriminatory measures have the basic form

\[ P = \frac{1}{n} \sum_{i=1}^{n} a(\mathbf{x}_i) \]

where a(x_i) is an appreciation function which produces an appreciation score for x_i. The correct classification rate for BLDA has the simple appreciation function

\[ a_{\mathrm{CCR}}(\mathbf{x}_i) = \begin{cases} 1 & \text{if } P(r|\mathbf{x}_{i(r)}) > P(\tilde{r}|\mathbf{x}_i) \text{ for every other class } \tilde{r} \\ 0 & \text{otherwise} \end{cases} \]

P(r|x_{i(r)}) denotes the posterior probability for the true class of x_i, since the notation x_{i(r)} indicates that the true class of x_i is class r. Another simple probabilistic measure results when the appreciation score is a_A(x_i) = P(r|x_{i(r)}).
The associated probabilistic measure is the average probability that an object is assigned to the correct class:

\[ P_A = \frac{1}{n} \sum_{i=1}^{n} a_A(\mathbf{x}_i). \]

The quadratic appreciation score which we apply in this chapter, and which is briefly described in Chapter 8, is formulated as follows:

\[ a_Q(\mathbf{x}_i) = \frac{1}{2} + P(r|\mathbf{x}_{i(r)}) - \frac{1}{2} \sum_{r=1}^{R} P(r|\mathbf{x}_i)^2. \]

The quadratic probabilistic measure is then defined

\[ \mathrm{QPM} = \frac{1}{n} \sum_{i=1}^{n} a_Q(\mathbf{x}_i) = P_{\mathrm{QPM}}. \]
The quadratic probability measure is related to the Brier quadratic score, which is a loss function for comparing two probability vectors, and is used for the elucidation of probabilities [3,4,5]. The QPM ranges from 0 to 1, with values closer to 1 being preferred, since this implies the classes can be differentiated with a higher degree of certainty.

We extend the QPM by seeking filter coefficients which optimise a cross-validated quadratic probability measure. A cross-validated measure was chosen with the aim of reducing overfitting on the training data. The CVQPM criterion function based on a band of coefficients X^{[j]}(τ) is defined as follows:

\[ \Phi_{\mathrm{CVQPM}}(X^{[j]}(\tau)) = \frac{1}{n} \sum_{i=1}^{n} a_Q(\mathbf{x}_i^{[j]}(\tau), -i) \]

where

\[ a_Q(\mathbf{x}_i^{[j]}(\tau), -i) = \frac{1}{2} + P_{-i}(r|\mathbf{x}_{i(r)}^{[j]}(\tau)) - \frac{1}{2} \sum_{r=1}^{R} P_{-i}(r|\mathbf{x}_i^{[j]}(\tau))^2. \]

The notation P_{-i}(r|x_{i(r)}^{[j]}(τ)) refers to the posterior probability for the true class of x_{i(r)}^{[j]}(τ), which is computed in the absence of x_i^{[j]}(τ).
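Given the matrix of posterior probabilities, the QPM is a one-line computation. In the hedged Python sketch below, post is an (n, R) array holding P(r|x_i) and y holds the true class indices (0-based); both names are illustrative. For the CVQPM, the posteriors would simply be recomputed with x_i left out of the parameter estimates.

    import numpy as np

    def qpm(post, y):
        # a_Q(x_i) = 1/2 + P(true class | x_i) - (1/2) * sum_r P(r | x_i)^2
        n = post.shape[0]
        a = 0.5 + post[np.arange(n), y] - 0.5 * (post ** 2).sum(axis=1)
        return a.mean()   # values near 1 indicate confident, correct assignments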
2.4 Explanation of the data sets

The adaptive wavelet algorithm is applied to three spectral data sets. The dimensionality of each data set is p = 512 variables. The data sets will be referred to as the seagrass, paraxylene and butanol data. The number of training and testing spectra in the group categories is listed in Table 1 for each set of data.

Seagrass data

The seagrass data were provided by Lem Aragones and Dr Bill Foley, from the Department of Zoology at James Cook University. The training seagrass data set contains 165 digitized spectra, for which log 1/reflectance was measured for the 512 wavelengths 400, 404, ..., 2444 nm. The data consist of three classes of seagrass species: Halophila ovalis (class 1), a mixture of Halodule uninervis and Halodule pinifolia (class 2), and Halophila spinulosa (class 3).
Table 1. Description of the spectral data sets used for classification.

Data Set           Class 1  Class 2  Class 3  Total
Seagrass    Train    55       55       55      165
            Test     34       34       34      102
Paraxylene  Train    25       25       25       75
            Test     25       25       25       75
Butanol     Train    21       27        -       48
            Test     21       26        -       47
The training seagrass data comprise 55 spectra in each group and the testing data have 34 spectra in each class. Fig. 1 shows five sample spectra from each of the classes. With the naked eye, there appear to be some striking similarities between the spectra from the different seagrass species.
Fig. 1 Five sample spectra from the seagrass data.
Paraxylene data

The paraxylene data were provided by Professor Massart at the Pharmaceutical Institute, the Free University of Brussels. The data were produced by Dr Wim Penninckx at the same institute. The training paraxylene data set contains 75 digitized spectra, for which absorbance was measured at the 512 wavelengths 1289, 1291, ..., 2311 nm. The data consist of three groups: pure paraxylene (class 1), paraxylene plus 10% orthoxylene (class 2), and paraxylene plus 20% orthoxylene (class 3). The training and testing data comprise 25 spectra in each of the classes. Although it appears as if the same spectra are presented, Fig. 2 actually shows five sample spectra from each of the classes. There appears to be some slight variation exhibited at the peak near 1700 nm and in the 2100 nm region.
Butanol data

The butanol data were accessed from Professor Massart and Dr Wu Wen at the Pharmaceutical Institute, the Free University of Brussels. The training butanol data set contains 48 digitized spectra, for which absorbance was measured for the 512 wavelengths in the range of 1200-2400 nm. The data consist of two groups: pure butanol (class 1) and butanol containing various concentrations of water (class 2). Class 1 in the training set contains 21 spectra and class 2 in the training set contains 27 spectra. Class 1 in the test set contains 21 spectra and class 2 in the test data has 26 spectra. Fig. 3 shows five sample spectra from each of the classes.
2.5 Results

In this section, we design our own task-specific filter coefficients using the adaptive wavelet algorithm of Chapter 8. The idea behind the adaptive wavelet algorithm is to avoid the decision of which set of filter coefficients, and hence which wavelet family, would be best suited to our data. Instead, we design our own wavelets, or more specifically the filter coefficients which define the wavelet and scaling function. This is done to suit the current task at hand, which in this case is discriminant analysis. The discriminant criterion function implemented by the adaptive wavelet algorithm is the CVQPM criterion function discussed in Section 2.3. The adaptive wavelet algorithm is applied using several settings of the m, q and j0
Fig. 2 Five sample spectra from the paraxylene data.

Fig. 3 Five sample spectra from the butanol data.
parameters. The particular (m, q, j0) triplets used were (4,3,2), (4,2,2), (8,1,1), (2,5,3), (2,5,4), (2,7,3), and (2,7,4). These settings were chosen because (i) they provide suitable ratios of the dimensionality of the wavelet bands to the sample size, and (ii) the number of filter coefficients is Nf = 12 or Nf = 16. Chapter 8 describes some heuristics for choosing values for these parameters.

Note. Since log(p)/log(m) is not an integer for the case m = 4, we would like to clarify our definition of J, the highest level in the DWT (which is the original data). We let J = ceil(log(512)/log(m)). For the case m = 4, the highest level in the DWT is 5, as demonstrated in Fig. 4. At the highest level there are 512 coefficients; at level 4 there are 512/4 = 128 coefficients in each band; at level 3 there are 128/4 = 32 coefficients in each band; and, for the level which we consider, there are 32/4 = 8 coefficients in each band.

For each (m, q, j0) triplet, τ was chosen as the band which produced the largest Φ_CVQPM(X^{[j0]}(τ)) at initialization. The coefficients in band (j0, τ) are then supplied to BLDA. In some cases the algorithm chose to optimize over a scaling band. This would occur if the discriminant criterion for a scaling band was higher than that for the wavelet bands (at initialization). We have discussed earlier that the scaling coefficients may prove to be useful when the basic shape or low frequency event contains discriminatory information. If a scaling band (i.e. τ = 0) was selected for a particular setting, then for the same (m, q, j0) settings it was decided to repeat the experiment and optimize over the wavelet band having the largest discriminant measure at initialization.

Some stopping rules were applied to the optimization routine. The optimization routine halted if 2000 iterations of the optimization routine had been
Fig. 4 Demonstration of the m = 4 band DWT where p = 512.
performed, or sooner if an optimal value was obtained. For the seagrass data we found it was necessary to have only 500 iterations, since the discriminant measure was already quite high in the early stages of the AWA. Whilst having a preset number of iterations does not allow for the best optimal value to be found, from an applied point of view it is more practical in our experimentations.

The (m, q, j0) settings which produced the highest test CCR are displayed in Table 2 for each of the data sets. Also shown are the number of filter coefficients (Nf) used in computing the DWT and the number of coefficients (Ncoef) in each of the bands for the respective (m, q, j0) settings. Perfect classification results are obtained for the seagrass data. The next best performance was with the butanol data, followed by the paraxylene data.

Discriminant plots were obtained for the adaptive wavelet coefficients which produced the results in Table 2. Although the classifier used in the AWA was BLDA, it was decided to supply the coefficients available upon termination of the AWA to Fisher's linear discriminant analysis, so we could visualize the spatial separation between the classes. The discriminant plots are produced using the testing data only. There is a good deal of separation for the seagrass data (Fig. 5), while for the paraxylene data (Fig. 6) there is some overlap between the objects of classes 1 and 3. Quite clearly, the butanol data (Fig. 7) pose a challenge in discriminating between the two classes.
2.5.1 Classification using Daubechies' wavelets

One might be interested in how the adaptive wavelet performs against predefined filter coefficients. In this section, we perform the 2-band DWT on each data set using filter coefficients from the Daubechies family with Nf = 16. The coefficients from some band (j, τ) are supplied to BLDA. We consider four bands: band(3,0), band(3,1), band(4,0) and band(4,1). The results for the training and testing data are displayed in Table 3. The test CCR rates are the same for the seagrass and butanol data, but the AWA clearly produces superior results for the paraxylene data.
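A comparison of this kind can be reproduced along the following lines. The sketch assumes the PyWavelets package, in which the Daubechies filter with Nf = 16 coefficients is named 'db8'. In pywt.wavedec the first element of the returned list is the scaling band (τ = 0) and the second is the wavelet band at the same scale (τ = 1); the decomposition depth passed as level stands in for the chapter's level index j under its labelling.

    import numpy as np
    import pywt  # PyWavelets; an assumption for illustration

    def band_coefficients(X, depth, tau):
        # X: (n, p) matrix of spectra, one spectrum per row
        rows = [pywt.wavedec(x, 'db8', level=depth)[tau] for x in X]
        return np.vstack(rows)   # supply this matrix to BLDA

    # e.g. Z = band_coefficients(X, depth=6, tau=0)  # a scaling band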
Table 2. Percentage of correctly classified objects for AWA.

Data        m  q  j0  Nf  Ncoef  τ  Train   Test
Seagrass    4  3  2   16    8    1  100     100
Paraxylene  2  5  4   12   16    1  94.67   86.67
Butanol     2  5  3   12    8    1  93.75   87.23
Fig. 5 Discriminant plots for the seagrass data produced by supplying the coefficients resulting from the AWA to Fisher's linear discriminant analysis.
3 Adaptive wavelets and regression analysis

3.1 Review of relevant regression methodologies

Let y be an n × 1 response vector containing n measurements, such that y = (y_1, y_2, ..., y_n)^T. The p variables in the p × n matrix X will be referred to as predictors or independent variables, and the response vector y may be referred to as the dependent variable. We will assume that the predictor matrix and response vector have been appropriately centred prior to the regression analysis, thus ensuring the y-intercept term is zero. The general form of the multiple linear regression model is written as
\[ y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i. \]

Here, y_i is the response measurement for the ith object x_i = (x_{1i}, x_{2i}, ..., x_{pi})^T, ε_i is the residual or prediction error for the ith observation, and β_1, β_2, ..., β_p are the regression coefficients. The multiple linear regression model can also be described in terms of matrices as follows
Fig. 6 Discriminant plots for the paraxylene data produced by supplying the coefficients resulting from the AWA to Fisher's linear discriminant analysis.

Fig. 7 Discriminant plots for the butanol data produced by supplying the coefficients resulting from the AWA to Fisher's linear discriminant analysis.
Table 3. Classification results for wavelet and scaling coefficients produced using filter coefficients from the Daubechies family with Nf = 16.

Data              X[3](0)  X[3](1)  X[4](0)  X[4](1)
Seagrass    Train  98.79    99.39    100      100
            Test   100      98.04    100      99.02
Paraxylene  Train  62.67    68.00    81.33    80.00
            Test   50.67    58.67    56.00    61.33
Butanol     Train  85.42    87.50    93.75    87.50
            Test   82.98    82.98    76.60    87.23
\[ \mathbf{y} = X^T \boldsymbol{\beta} + \boldsymbol{\varepsilon} \]

with β = (β_1, β_2, ..., β_p)^T and ε = (ε_1, ε_2, ..., ε_n)^T. In practice, the vector of regression coefficients β is usually unknown and is typically estimated by the least squares method. The least squares method calculates regression coefficients so that the residual sum of squares ε^T ε is minimized. The least squares solution is

\[ \mathbf{b} = (XX^T)^{-1} X \mathbf{y} \]

where b = (b_1, ..., b_p)^T is the estimate of the true regression coefficients β. The estimated response is then ŷ = X^T b. The MLR model assumes the residuals are independent and ε_i ~ N(0, σ²).

3.2 Regression assessment criteria
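With the chapter's p × n orientation of X, the least squares solution is obtained directly; a minimal Python sketch follows (the names are ours, and a linear solver is preferable to an explicit inverse).

    import numpy as np

    def mlr_fit(X, y):
        # X: (p, n) centred predictors; y: (n,) centred response
        return np.linalg.solve(X @ X.T, X @ y)   # b = (X X^T)^{-1} X y

    # b = mlr_fit(X, y); y_hat = X.T @ b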
In this section, we will describe three regression criteria relevant to Section 3.5. These criteria can be used to assess how well a model is performing. The three criteria are the residual sum of squares (RSS), the R-squared (R²) measure and the predictive residual sum of squares (PRESS). The residual sum of squares and R-squared criteria both measure how well the model fits the data. These criteria are respectively defined

\[ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \]

where ȳ = Σ_{i=1}^{n} y_i / n is the mean response and the total sum of squares TSS = Σ_{i=1}^{n} (y_i − ȳ)².

The RSS measures the sum of squared deviations between the actual and predicted values of the response. A lower measure of the RSS is preferred. The R² criterion ranges from zero to one, with values closer to one being preferred, provided that a high R² is not a consequence of overfitting.

We will test the performance of the adaptive wavelet algorithm for regression purposes using an independent test set. For this reason we have decided to formulate an R² measure for the test set, which is denoted by

\[ R^2_{\mathrm{test}} = 1 - \frac{\mathrm{RSS}_{\mathrm{test}}}{\mathrm{TSS}_{\mathrm{test}}}. \]

The residual and total sum of squares for the testing data are defined, respectively, to be

\[ \mathrm{RSS}_{\mathrm{test}} = \sum_{i=1}^{n'} (y'_i - \hat{y}'_i)^2, \qquad \mathrm{TSS}_{\mathrm{test}} = \sum_{i=1}^{n'} (y'_i - \bar{y}')^2 \]

where y' = (y'_1, ..., y'_{n'})^T is the response vector of the independent test set, ŷ' = (ŷ'_1, ..., ŷ'_{n'})^T are the predicted test response values, n' is the number of objects in the test data set, and ȳ' = Σ_{i=1}^{n'} y'_i / n' is the mean of the test responses.

Define the PRESS statistic to be PRESS = Σ_{i=1}^{n} (y_i − ŷ_{-i})². Here, ŷ_{-i} is the predicted value for y_i, where object x_i was "left out" when estimating the parameters in the regression model. Another way of calculating the PRESS statistic is simply by using

\[ (y_i - \hat{y}_{-i}) = \frac{y_i - \hat{y}_i}{1 - h_{ii}} \]

where h_{ii} is the ith element along the diagonal of the hat matrix H = X^T (XX^T)^{-1} X. This avoids the need to leave out observations in turn.
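The hat-matrix shortcut makes PRESS a single fitting operation rather than n refits. A sketch under the same p × n convention, with illustrative names:

    import numpy as np

    def press(X, y):
        # X: (p, n) centred predictors; y: (n,) centred response
        H = X.T @ np.linalg.solve(X @ X.T, X)   # H = X^T (X X^T)^{-1} X
        e_loo = (y - H @ y) / (1.0 - np.diag(H))
        return np.sum(e_loo ** 2)

    # leave-one-out cross-validated R-squared (see Section 3.3):
    # cvrsq = 1.0 - press(X, y) / np.sum((y - y.mean()) ** 2)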
3.3 Regression criterion functions for the adaptive wavelet algorithm

A suitable criterion function for regression analysis should reflect how well the response values are predicted. In the adaptive wavelet algorithm, the criterion function considered for regression is based on the PRESS statistic, which is converted to a leave-one-out cross-validated R-squared measure as follows:

\[ \mathrm{CVRSQ} = 1 - \frac{\mathrm{PRESS}}{\mathrm{TSS}} \tag{6} \]

The formulation of Eq. (6) using the hat matrix makes the leave-one-out method of cross-validation quite a useful and relatively inexpensive procedure to employ. The cross-validated R-squared criterion function is defined as

\[ \Phi_{\mathrm{CVRSQ}}(X^{[j]}(\tau)) = 1 - \frac{\mathrm{PRESS}}{\mathrm{TSS}}. \]

The actual regression model used for predicting the response is ŷ = (X^{[j]}(τ))^T b.
3.4 Explanation of the data sets

Two data sets and three responses were used for evaluating the performance of the various regression procedures. These data sets will be referred to as the sugar and wheat data. A summary of each of these data sets is presented in Table 4. Here the number of spectra in each training and test set is displayed, as well as the response(s) which are to be modelled from each spectral data set. The dimensionality of both data sets is p = 512.

Table 4. Description of the spectral data sets used for regression.

Data Set  Train  Test  Responses
Sugar     100    89    brix, fibre
Wheat     60     40    protein

Sugar data

The sugar data were supplied by Dr Nils Burding at the Bureau of Sugar Experiment Station in Gordonvale. The training sugar data contain 100
digitized spectra for which log 1/reflectance was measured at the 512 wavelengths 916, 918, ..., 1938 nm. The test set contains 89 spectra. Fig. 8 shows five sample spectra from the sugar training data, which were used to model the responses brix and fibre. At 1100 nm there is a distortion which arises from a change in instrumentation: one detector is used to measure the radiation reflected at wavelengths below 1100 nm and another detector is used to measure the radiation reflected at wavelengths of 1100 nm and above. The change in detectors gives rise to the jump.

Wheat data
The wheat data set was accessed from Professor Philip K. Hopke and has previously been discussed in the literature; see for example [6]. The training wheat data contain 60 spectra for which log 1/reflectance was measured at the 512 wavelengths 1100, 1102, ..., 2122 nm. The test set contains 40 spectra. Fig. 9 shows five sample spectra from the wheat training data. The wheat training data were used to model protein content.

3.5 Results
The adaptive wavelet algorithm (AWA) is applied to the regression spectral data sets described in Section 3.4. The AWA is applied with similar settings to those used for classification. The (m, q, j0) settings for which the AWA is applied are again (4,3,2), (4,2,2), (8,1,1), (2,5,3), (2,5,4), (2,7,3), and (2,7,4). The most notable difference in the AWA when applied for regression (as opposed to classification) is the criterion function which is implemented. Here, the cross-validated R-squared criterion, which is based on the PRESS statistic, is the regression criterion function which is implemented
Fig. 8 Five sample spectra from the sugar data.
by the AWA. A similar band selection strategy to that used for classification is used for regression. Here, the band τ at some level j0 in the DWT which produces the largest regression criterion measure Φ_CVRSQ(X^{[j0]}(τ)) forms the basis of the optimization routine. The same coefficients are later supplied to MLR. If the algorithm chose to optimize over a scaling band (i.e. τ = 0), then for the same (m, q, j0) settings the experiment was repeated, where optimization was over the wavelet band producing the largest CVRSQ measure at initialization. The optimization routine halted if 2000 iterations of the optimization routine had been performed, or sooner if an optimal value was obtained.

The (m, q, j0) settings which produced the highest test R-squared measures are displayed in Table 5 for each of the data sets. Also shown are the number of filter coefficients (Nf) used in computing the DWT and the number of coefficients (Ncoef) in each of the bands for the respective (m, q, j0) settings. It seems that brix achieved the highest test R-squared measure, followed by protein and then fibre. For the brix response the (2,5,5) setting produced the best results. When the fibre response was modelled using the AWA, the best setting in terms of the R²_test measure was (8,1,1). The best results for the wheat data were also obtained with the (2,5,5) setting, where optimization was over a wavelet band.
Fig. 9 Five sample spectra from the wheat data.
Table 5. R-squared values resulting from the AWA.

Response  m  q  j0  Nf  Ncoef  τ  Train  Test
Brix      2  5  5   12   16    1  0.975  0.971
Fibre     8  1  1   16    8    6  0.872  0.801
Protein   2  5  5   12   16    2  0.975  0.825
Table 6. Regression results for wavelet and scaling coefficients produced using filter coefficients from the Daubechies family with Nf = 16.

Data           X[4](0)  X[4](1)  X[3](0)  X[3](1)
Brix     Train  0.975    0.961    0.740    0.525
         Test   0.973    0.949    0.753    0.530
Fibre    Train  0.781    0.797    0.647    0.707
         Test   0.692    0.723    0.533    0.569
Protein  Train  -        0.952    0.763    0.795
         Test   -        0.704    0.263    0.108
3.5.1 Regression using Daubechies' wavelets

This section is similar to Section 2.5.1 in that we perform the 2-band DWT on each data set using filter coefficients from the Daubechies family with Nf = 16. The coefficients X[4](0), X[4](1), X[3](0) and X[3](1) are supplied to MLR. The DWT was performed on the original (uncentred) data, but the coefficients and response variables were centred prior to entering the MLR model. The R²_train and R²_test for each response are displayed in Table 6.
Due to numerical instabilities it was not possible to obtain regression results for the protein model when the scaling coefficients from band(4,0) were supplied to MLR. This problem arises because the condition number of the matrix (X[4](0)^T X[4](0)) is quite large (3.133e+17). Care should also be taken when interpreting the results for the scaling coefficients from the wheat data in band(3,0), for the same reason. A higher R²_test is obtained for the brix response using Daubechies wavelets, whilst the AWA produces a higher R²_test value for the fibre and protein responses.
References
1. G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York (1992).
2. D. Hirst, Error-rate Estimation in Multiple-Group Linear Discriminant Analysis, Technometrics, 38 (1996), 389-399.
3. G.W. Brier, Monthly Weather Review, 78 (1950), 1-3.
4. D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making, Research Studies Press, Wiley, Chichester (1986).
5. Y. Mallet, D. Coomans, J. Kautsky and O. de Vel, Classification Using Adaptive Wavelets for Feature Extraction, IEEE-PAMI, 10 (1997), 1058-1066.
6. J. Kalivas, Two Reference Data Sets of Near Infrared Spectra, Chemometrics and Intelligent Laboratory Systems, 37 (1997), 255-259.
CHAPTER 19
Wavelet-Based Image Compression
O. de Vel, D. Coomans and Y. Mallet
Statistics and Intelligent Data Analysis Group, School of Computer Science, Mathematics and Physics, James Cook University, Townsville, Australia
1 Introduction

Many applications generate an exponentially increasing amount of information or data which needs to be stored, processed and transmitted in an efficient way. Typical information-intensive applications include spectral and high-resolution image analysis. For example, a computerised axial tomography (CAT) image slice of size 512 × 512 and pixel depth (i.e. number of possible colours or grey-levels) of 8 bits occupies 0.25 MB of storage memory. For 60 such slices in a patient scan used in 3-D reconstruction, the total storage requirements are of the order of 15 MB. As a result of the possibly many stages involved in image analysis, each image in itself may generate other images, thereby further increasing the storage requirements for the image analysis procedure. For example, the raw CAT image slices can be processed to create a set of segmented images used for interpretation, such as volumetric analysis. Unfortunately, current storage hardware is inadequate for storing large amounts of such data as might be found in a patient database. Furthermore, if these data were to be transmitted over a network, the effective transmission times can be large. A solution is to employ compression techniques, which may be capable of achieving a reduction in storage and transmission demands by a factor of more than 20 without significant loss in perceived image quality.

Much of the information in a smooth image is highly correlated, by virtue of the fact that, for example, pixel values are not spatially random and that the value of one pixel indicates the likelihood of its neighbours' values. Several types of correlation exist in an image:
1. Spatial correlation: Pixel values in a neighbourhood of a given pixel are generally similar. Exceptions include pixels in the neighbourhood of a pixel which forms the edge of an object in the image.
2. Sequential correlation: This occurs when two or more images are taken at different times (e.g. as a set of video frames) or different spatial positions (e.g. CAT image slices). The same pixel in adjacent image frames or slices is generally strongly correlated.

3. Spectral correlation: The spectral decomposition (Fourier transform) of an image is often smooth. Rapid fluctuations in the energy content of adjacent frequencies are uncommon. That is, spectral frequencies in a neighbourhood of frequencies are correlated.

The presence of one or more of spatial, spectral and temporal correlations (and, therefore, the existence of an inherently high degree of redundancy) indicates that there exists a description of the image that has a significantly lower rank (for the definition of rank, see Chapter 4). That is, there exists in the image a set of features that captures most of the independent features. This suggests that an image is a good candidate for compression.

Compression schemes can be broadly classified as loss-less or lossy. Loss-less compression schemes assume no loss of information during a compression-decompression cycle. This is most suited to data that need to be reconstructed exactly. Lossy compression schemes allow a certain error during a compression-decompression cycle, as long as the information loss is tolerable (i.e. the quality of the data is acceptable). The degree of tolerance to information loss is dictated by the particular application, and some distortion metric appropriate to the application at hand is employed to measure the quality of the compression (see Section 2.1). For example, images which are used for simple visual display purposes can tolerate some loss as long as the images are psycho-visually acceptable. However, images that are used for segmentation or classification (e.g. medical or micro-fractographic industrial X-ray images) may not tolerate much information loss, particularly in the region of interest in the image. Lossy compression schemes have the advantage that a higher compression can be achieved compared with loss-less compression schemes. Most compression algorithms generally use a combination of both lossy and loss-less compression schemes, with some facility made available to select the degree of loss of quality.

In Section 2 we introduce the fundamentals of image compression and overview the various compression algorithms. We review the transformation techniques used in image compression in Section 3. Section 4 describes image compression using optimal task-based and best-basis image compression algorithms.
2 Fundamentals of image compression

The standard procedure used in image compression algorithms comprises three stages, namely an invertible transformation, quantisation and redundancy removal (see Fig. 1). The decompression phase usually involves the reverse procedure followed in the compression phase. In the case of multidimensional (spatial or temporal) imagery, the compression algorithm usually includes additional encoding/decoding algorithms to exploit the inherent spatio-temporal correlation in the set of images. In such cases the overall compression/decompression is generally asymmetric: that is, the space-time complexity of the compression and decompression phases is different, to allow for fast visualisation.

The invertible transformation stage uses a different mathematical basis of features in an attempt to decorrelate the data. The resulting data will have a set of features that capture most of the independent features in the original data set. Typical features used include frequency and spatial location. The transformation is nearly loss-less, as it is implemented using real arithmetic and is subject to (small) truncation errors. Examples of invertible transforms include the discrete cosine transform (DCT), the discrete wavelet transform (DWT) and the wavelet packet transform (WPT). We will investigate these transforms later.
Fig. 1 Stages of image compression.
If the transformation stage is effective in decorrelating the data, then the transformed image pixel data will have a large number of features with small real number values. The quantisation stage performs the essential rank reduction by replacing the transformed data stream of real numbers by a stream of reduced length with lower-precision coefficients or symbols that can be coded using a finite number of digits. The higher the compression required, the smaller the number of coefficients generated.

Two kinds of quantisation can be performed, namely scalar and vector quantisation. Scalar or regular quantisation partitions the real axis into non-overlapping intervals and associates each real number with a coefficient or symbol associated with the interval to which it belongs. Prior to scalar quantisation, the decorrelated features are mapped onto the real axis (e.g. the DCT maps its 2-D block structure onto the real line by "zig-zagging" through the block from low to high frequencies, to exploit the fact that much of the relevant information contained in most image typologies is described by the set of lower frequency features). A quantisation table is used to store the pairs of intervals and symbols. Vector quantisation replaces a group of features (the real numbers) with a symbol. For example, wavelet compression can use groups of wavelet coefficients that are associated with the same spatial location. The fewer the number of groups, the higher the compression.

The stream of coefficients emanating from the quantisation stage may still be redundant. The redundancy removal stage replaces the coefficients by a more efficient alphabet of variable-length characters. For example, some coefficients may be more frequent than others, and these are allocated shorter-length codes compared with infrequent coefficients that are allocated longer codes. Variable-length coding algorithms are also called entropy coding algorithms, and examples include the efficient Huffman and arithmetic coding [1]. Unfortunately, entropy codes require the statistics (probabilities) of the coefficients to be known a priori. Universal coding (UC) algorithms attempt to measure the statistics during the actual coding operation and adapt themselves in order to maximise the compression. Example UC algorithms include substitutional (or dictionary) methods [2]. The resulting compression from the redundancy removal stage is loss-less.

Compression algorithms that do not conform to the above three-stage scenario also exist. One popular set of algorithms is based on the observation that natural images exhibit self-similarity at different scales. That is, information of an object at one level of magnification or resolution is repeated at a different level of magnification. A portion of an image at one
scale may be approximated by another portion of the image at a different scale. For example, the land/sea coastline has a similar appearance at different map scales. Compression is then achieved by only storing the non-similar parts of the image. Algorithms which exploit the self-similarity property of images include fractal algorithms [3], weighted finite state automata [4] and generalised stochastic automata [5]. Fractal image compression algorithms generally perform better at higher compression ratios compared with the more traditional DCT-based algorithms (e.g. JPEG). NB: the compression ratio is defined as the number of bits required to store the original image divided by the number of bits required to store the compressed image.

2.1 Performance measures for image compression
When comparing different lossy image compression algorithms one usually desires the compressed image to be of the same visual quality as the original image. The most common measure of quality is the mean square error (MSE) or distortion, defined as

\[ \mathrm{MSE} = \frac{1}{N} \sum_{i=0}^{N-1} |x_i - \hat{x}_i|^2 \]

where x_i and x̂_i are the input and reconstructed image pixel values. Alternatively, the peak signal-to-noise ratio (PSNR), measured in decibels (dB), is defined as

\[ \mathrm{PSNR} = 10 \log_{10} \frac{M^2}{D^2} \]

where M is the maximum peak-to-peak value in the signal and D is the noise level. For an 8-bit image M = 256.

In general, however, distortion measures based on squared error are not satisfactory when assessing the quality of an image, particularly at high compression ratios. An important consideration in determining the required image quality is the task or application for which the image is to be used. Each image application may require a different quality measure. For example, an image broadcast would be more concerned with tonal reproduction, whereas a task involving the interpretation of X-rays would be more interested in image sharpness, etc. In many cases a perceptually weighted MSE may be more appropriate. It is known that, based on studies in visual
psychometrics, the human visual system has a reduced sensitivity at low frequencies and a very marked insensitivity at high frequencies. So, for example, errors are less visible in bright and "busy" (in terms of edges and discontinuities) areas of the image.
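Both quality measures are straightforward to compute. The sketch below takes D² to be the mean square error, which is the usual reading of the PSNR definition above; the function names are illustrative.

    import numpy as np

    def mse(x, x_hat):
        return np.mean((x.astype(float) - x_hat.astype(float)) ** 2)

    def psnr(x, x_hat, M=256.0):
        # M: maximum peak-to-peak value (256 for an 8-bit image)
        return 10.0 * np.log10(M ** 2 / mse(x, x_hat))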
3 Image decorrelation using transform coding
As mentioned previously, smoothly varying images are generally characterised by a high degree of redundancy, due to the presence of one or more of spatial, sequential and spectral correlation. We can apply a change of mathematical basis in an attempt to decorrelate the image data, resulting in data that will have features that capture most of the independent features in the original image. We consider three transformations that have been shown to decorrelate smooth images: the Karhunen-Loeve Transform (KLT), the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT). We first briefly describe the KLT and DCT.
3.1 The Karhunen-Loeve transform (KLT)

The basic idea of the Karhunen-Loeve Transform (KLT) is that, if the correlation in the image is known, then it is possible to calculate the optimal mathematical basis by using an eigen-decomposition. The optimal basis is defined here as the one that minimises the overall root-mean-square distortion. Consider an image I = I(X), where X = (x_1, x_2, x_3, ..., x_N)^T is the vector of N image pixels. The intensity (pixel value) of the jth pixel, x_j, is assumed to be a wide-sense stationary random variable with a non-negative value. We calculate the positive definite autocovariance matrix A_I = E[XX^T], and we can find the orthogonal matrix U that diagonalises A_I. That is, UA_IU^T is diagonal, with the diagonal values, referred to as the eigenvalues, being the uncorrelated coefficients or features in the transformed space. The optimal basis is given by the associated set of eigenvectors. The decorrelated image corresponds to the KLT basis transform Y = UX and, since the image is completely decorrelated (there are no off-diagonal values), it is considered to be an optimal transform. The autocovariance matrix of Y is given as E[YY^T] = E[UXX^TU^T] = UA_IU^T which, as stated above, is diagonal. Besides decorrelating the image, the KLT has another useful property: the
KLT coefficients (eigenvalues) are ordered according to decreasing variance and compact the energy of the image into a few large coefficients. This allows the compression ratio to be set a priori by simply selecting the appropriate number of coefficients. Unfortunately, this approach has some significant disadvantages:

• The time-complexity is generally O(N³) for the diagonalisation algorithm and O(N²) for the basis transformation.

• The basis is a function of the image data, since it depends on the autocovariance matrix for the given image. That is, each image will have its own basis transform.

• The autocovariance matrix varies considerably from image to image, though the KLT assumes statistical stationarity.

For these reasons the KLT is seldom used in practice. To circumvent these problems, sub-optimal basis transforms are employed which effectively decorrelate the image but are image-independent and have a reduced (linear or linear-log) time-complexity.
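The KLT computation itself is a textbook eigen-decomposition. The sketch below estimates the autocovariance empirically from a collection of images, since the expectation E[XXᵀ] is not available in practice; this also illustrates why the basis is data-dependent. The names are ours.

    import numpy as np

    def klt(X):
        # X: (n_images, N) matrix, each row an image unrolled to N pixels
        A = np.cov(X, rowvar=False)        # N x N autocovariance estimate
        w, U = np.linalg.eigh(A)           # eigh: A is symmetric
        order = np.argsort(w)[::-1]        # decreasing variance
        return X @ U[:, order], w[order]   # decorrelated features, eigenvalues

    # keeping only the leading columns of the transformed data sets the
    # compression ratio a priori, at O(N^3) cost for the diagonalisation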
3.2 The discrete cosine transform (DCT)
Very often image statistics are assumed to be isotropic (though this is not quite correct, since there are vertical correlations between pixels in successive image scan lines) and the autocovariance matrix is modelled as some decreasing function of the geometric distance between any two pixels in the image. The autocovariance matrix then has the following Toeplitz form (assuming unit variance and zero mean):
$$A_I = \begin{pmatrix}
1 & \rho & \rho^2 & \rho^3 & \cdots \\
\rho & 1 & \rho & \rho^2 & \cdots \\
\rho^2 & \rho & 1 & \rho & \cdots \\
\rho^3 & \rho^2 & \rho & 1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}$$

where ρ is the correlation coefficient, with the condition that ρ is large (ρ ≈ 1). This model gives reasonably good agreement with experimental data for natural images (generally, ρ is found to be greater than 0.9).
The DCT is defined as an inner product with cosine basis functions:

$$y_0 = \frac{1}{\sqrt{N}} \sum_{i=0}^{N-1} x_i, \qquad
y_j = \sqrt{\frac{2}{N}} \sum_{i=0}^{N-1} x_i \cos\!\left(\frac{2\pi(2i+1)j}{4N}\right), \quad j = 1, 2, \ldots, N-1.$$
The DCT was developed as an approximation to the KLT: for large values of ρ, the DCT approximately diagonalises the above matrix A_I. In fact, the DCT is asymptotically equivalent to the KLT of a stationary process as the image block size tends to infinity (N → ∞). Even for small values of N (say, N = 64), the basis functions of the KLT and DCT for many natural images look remarkably similar. Also, the DCT can be computed very efficiently, with a time-complexity of O(N log N) as opposed to O(N^3) for the KLT. The computation can be made even more efficient by partitioning the image into K^2 square blocks, each of √N' × √N' pixels, where N = K^2 N' (i.e. disjoint sub-images of, say, 8 × 8 or 16 × 16 pixels), thereby restricting the DCT computation to each block. This makes the DCT the preferred algorithm in many standard commercial compression schemes such as JPEG. Unfortunately, the DCT has some shortcomings. Blocking can create annoying high-frequency artifacts at the block boundaries, due to the inherent discontinuities there, and it also reduces the compression of the entire image, since the correlation across image block boundaries is not removed. The DFT can be used in lieu of the DCT; however, the DFT has rather severe blocking effects, more noticeable than with the DCT, making the DCT the preferred option. To attenuate the blocking effect, smoothly overlapping blocks can be used rather than disjoint blocks. Orthogonality of the overlapping blocks can still be achieved by using the lapped orthogonal transform (LOT), an elegant extension of the DCT. While the blocking effects are attenuated with the LOT, other artifacts, such as ringing, appear around the block edges, and its increased complexity makes LOTs less attractive. An alternative approach is to use wavelet transform coding schemes.
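Before moving on to wavelets, here is a minimal sketch of the blocked DCT coding just described, using the DCT routines in scipy (an assumed dependency; JPEG's quantisation tables and entropy coding are omitted, and small coefficients are simply zeroed):

```python
import numpy as np
from scipy.fft import dctn, idctn

def block_dct_code(image, block=8, thresh=10.0):
    """Blocked 2-D DCT coding sketch (assumes dimensions divisible by `block`)."""
    out = np.zeros_like(image, dtype=float)
    h, w = image.shape
    for i in range(0, h, block):
        for j in range(0, w, block):
            c = dctn(image[i:i + block, j:j + block], norm="ortho")
            c[np.abs(c) < thresh] = 0.0       # discard small coefficients
            out[i:i + block, j:j + block] = idctn(c, norm="ortho")
    return out
```

Because each block is transformed independently, correlation across block boundaries is untouched, which is the source of the blocking artifacts discussed above.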
3.3 Wavelet transform coding
An easy way to construct a multi-dimensional (e.g. 2-D) wavelet transform is to implement tensor products of the 1-D counterparts; that is, we apply the 1-D wavelet transform separately along one dimension at a time. This, as we shall see shortly, results in one scaling function and three different "mother" wavelet functions. Although simple, such a separable decomposition has some drawbacks. For example, the number of free parameters available in the design of a separable 2-D wavelet transform is very much reduced (though this could also be seen as an advantage!). Also, only a rectangular partitioning is possible with a separable decomposition, and the same limitations appearing in one dimension will appear in two dimensions. To overcome such drawbacks, non-separable decompositions are necessary. One non-separable technique is to sub-sample separably as before, but use non-separable filters (four different filters for two dimensions). True multi-dimensional treatment of wavelets, leading to a single scaling function and a single wavelet function, is possible but carries a significant increase in implementation time-complexity, e.g. quincunx lattice (along the diagonal) and hexagonal lattice sub-sampling [6]. For this reason we only discuss the separable decomposition technique, where both the filtering and the sub-sampling are separable. We recall from Chapter 8 that the separable wavelet decomposition involves applying the 1-D wavelet decomposition to the rows and columns of the image matrix I(x,y), where the image I is a finite sequence indexed by the two Cartesian coordinates x and y, with N_x column-pixels and N_y row-pixels, respectively. This decomposition results in a partially ordered set of subspaces I_{s,t}(x,y), called sub-bands, where t ≥ 0 indicates the level of the decomposition and 0 ≤ s < (m^2)^t is the sub-band index at the given decomposition level. At a given level t, the orthogonal rectangular sub-bands correspond to disjoint covers of the wavenumber space, and the dimensionality of each sub-band I_{s,t} is (N_x m^{-t}) × (N_y m^{-t}) owing to the m-way decomposition. Normally m = 2 (the dyadic decomposition scheme) and we have four image sub-bands resulting from each octave sub-band in successive decomposition steps, as shown in Fig. 2. At the image boundaries, periodic extension is normally used (see Chapter 4).
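For the dyadic separable case, a single decomposition step can be illustrated with PyWavelets (an assumed package, not the software used in the chapter); `dwt2` filters along rows and columns and returns the four sub-bands:

```python
import numpy as np
import pywt

image = np.random.rand(256, 256)   # stand-in for the image matrix I(x, y)

# One level of the separable 2-D DWT: 1-D filtering along rows and columns
# with periodic extension, followed by dyadic sub-sampling.
cA, (cH, cV, cD) = pywt.dwt2(image, wavelet="db4", mode="periodization")
# cA ~ I_{L,L} ("smoothed" image); cH ~ I_{L,H} (horizontal edges);
# cV ~ I_{H,L} (vertical edges);   cD ~ I_{H,H} (diagonal details).
# Each sub-band is 128 x 128, i.e. (N_x / 2) x (N_y / 2).
```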
Fig. 2 A 2-D recursive separable wavelet decomposition (shown for m = 2).

The sub-bands in each successive dyadic decomposition step are generally labelled simply as I_{L,L}, I_{L,H}, I_{H,L} and I_{H,H} ("L" for "low" frequency and "H" for "high" frequency). The I_{L,L} sub-band corresponds to an "average" or "smoothed" image, whereas the sub-bands I_{L,H}, I_{H,L} and I_{H,H} are the "detailed" images. The I_{L,H} and I_{H,L} sub-bands capture the horizontal and vertical edges, respectively, whereas the I_{H,H} sub-band captures the diagonal details. The image sub-band decomposition scheme for a single decomposition level is shown in Fig. 3. We observe that the lowest band (I_{L,L}), being a low-pass, down-sampled version of the original input image, has many characteristics of the original image. Most of the image correlation, except along some of the edges, is removed from the I_{L,H}, I_{H,L} and I_{H,H} sub-bands. This is due to the two-stage (L and H) directional filtering, in which edges in the image are confined to certain directions in a given sub-band. Wavelet-based software packages generate the sub-bands in the form of a grouped display of the smoothed and detailed sub-bands. For example, the grouped display of the sub-bands in a single decomposition step, as generated by S+WAVELETS™, is shown in Fig. 4. Consistent with the notation used in this book, the "c1" and "d1" labels represent the smoothed and detailed coefficients, respectively, so that "c1-c1" corresponds to the smoothed (I_{L,L}) wavelet coefficients, whereas "c1-d1", "d1-c1" and "d1-d1" correspond to the detailed wavelet coefficients (the I_{L,H}, I_{H,L} and I_{H,H} sub-bands, respectively). NB: in some wavelet software packages (e.g. S+WAVELETS™), the convention is to place the origin of the grouped display in the lower left-hand corner, whereas in other packages (such as MATLAB™) it is placed in the upper left-hand corner.

Fig. 3 Image wavelet decomposition for a single (dyadic) decomposition level.

With a larger number of decomposition levels, the 2-D DWT is displayed as shown (for the case of three levels) in Fig. 5.
Fig. 4 Image wavelet coefficient matrices (shown for a single decomposition level).
Fig. 5 A 2-D DWT image wavelet coefficients for multiple decomposition levels (shown for three levels).
The labels "LLHH", "LLHL", etc. in Fig. 5 indicate the corresponding wavelet coefficients in each decomposition-level sub-band. As mentioned before, for values of m greater than 2 the decomposition scheme is slightly more complicated, because there are more than three sub-bands of wavelet coefficients at each level of the decomposition scheme. In fact, there are m^2 sub-bands at each decomposition level of the 2-D DWT. For the case of h levels, we have a total of (m^2 - 1)h + 1 DWT sub-bands or, for the dyadic case, a total of 3h + 1 sub-bands.
Fig. 6 shows the decomposition scheme for two levels and for the case of m = 4. Note that we have used the same labelling format for the smooth and detailed sub-bands as in Chapter 8. That is, for level j, the smoothed sub-band coefficients are labelled "c_j" and the m - 1 detailed sub-band coefficients "d_j^(1), d_j^(2), ..., d_j^(m-1)".
Whilst the 2-D DWT provides an efficient space-frequency characterisation of a given image, it only uses a fixed decomposition of the pixel space. As in the case of the 1-D wavelet packet transform, we can extend wavelet packets to two dimensions. That is, the 2-D wavelet packet transform (2-D WPT) generates a more general, full m^2-ary tree representation with a total of m^2 + m^4 + ... + m^{2h} sub-bands for h levels. Each sub-band in a given level of the tree splits into a smoothed sub-band and m^2 - 1 detailed sub-bands, resulting in a tree that resembles an m-way pyramidal "stack" of sub-bands. For the case of a dyadic decomposition scheme, this corresponds to a pyramidal sub-band structure where each sub-band is decomposed into 2^2 = 4 sub-bands at each successive (higher) level (see Fig. 2). Fig. 7 shows the results of the third level of the 2-D WPT for the dyadic case: a total of (2^2)^3 = 64 sub-bands at the third level, where each sub-band block consists of 32 × 32 pixels for an image of size 256 × 256. We note in passing that the maximum number of independent m^2-ary orthonormal tree representations grows extremely rapidly with h, a potentially very large number of trees. The set of such trees is often referred to as a bases dictionary.
Fig. 6 Image wavelet coefficients for a multi-band decomposition scheme (shown for m = 4 and for the case of two levels).
Fig. 7 2-D WPT image wavelet coefficients for level 3 (for the dyadic case, m = 2).
Looking at either the 2-D DWT or the WPT sub-band images, it is clear that the sub-bands are related; that is, they are not independent. Of particular interest in image compression are horizontal or vertical edges, which will appear in the smoothed sub-band image as well as in every detailed sub-band that was generated by horizontal filtering (for horizontal edges) or vertical filtering (for vertical edges). It may be more appropriate to select only a useful subset of sub-bands. A variety of algorithmic approaches may be used. A quite useful algorithm is the embedded zero-tree wavelet encoding (EZW) algorithm, which exploits the self-similarity between the different wavelet bands [7]. It uses a simple heuristic: if a wavelet coefficient on one decomposition level is set to zero, it is likely that the wavelet coefficients corresponding to the same locations on the finer decomposition levels can be set to zero as well. That is, for natural images with rapidly decaying spectra, it is
unlikely to find significant high-frequency energy if there is little low-frequency energy at the same spatial location. The EZW algorithm produces good compression results without requiring a priori knowledge of the image statistics. An alternative approach is to choose the best full or partial (i.e. pruned) m^2-ary tree representation. In this case, we have to search the bases dictionary for the "optimal" full or partial tree, subject to minimising some information theory-based (or other) cost function; this is the best-basis search algorithm due to Coifman and Wickerhauser [8]. The search algorithm used is generally based on the dynamic programming paradigm, and the cost function used is determined by the application at hand. This "adaptive" best-basis search algorithm is computationally efficient and has also been used in a variety of applications, including classification/discriminant analysis, regression and dimensionality reduction. In the case of image compression, a simple cost function often used is thresholding: sub-band coefficients with values less than some predetermined threshold are ignored. Other best-basis approaches have also been developed, including tree pruning using a Lagrangian cost function [9] and tree building based on the local transform gain metric (the ratio of the sum-to-product of the wavelet sub-band variances) [10]. Fig. 8 shows the set of best-basis sub-band blocks obtained for the case of a threshold cost function (using the source image in Fig. 3) and with an eight-point symmlet.
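The best-basis idea with a threshold cost can be sketched as a recursive comparison of a parent node's cost against the total cost of its four children. The snippet below uses PyWavelets' 2-D wavelet packets and is our own illustrative reconstruction (fewer levels than the six used for Fig. 8, to keep the stand-in image small), not the algorithm's original implementation:

```python
import numpy as np
import pywt

def threshold_cost(data, thresh=0.2):
    # "Threshold" cost: the number of coefficients above the threshold.
    return int(np.sum(np.abs(data) > thresh))

def best_basis(wp, node, max_level):
    # Return (cost, paths) of the cheapest cover of `node` by sub-bands.
    if node.level >= max_level:
        return threshold_cost(node.data), [node.path]
    child_cost, child_paths = 0, []
    for c in "ahvd":                     # the four 2-D sub-bands per split
        cost, paths = best_basis(wp, wp[node.path + c], max_level)
        child_cost += cost
        child_paths += paths
    own_cost = threshold_cost(node.data)
    if own_cost <= child_cost:
        return own_cost, [node.path]     # keep the larger block
    return child_cost, child_paths       # or split it further

image = np.random.rand(256, 256)
wp = pywt.WaveletPacket2D(image, wavelet="sym8", maxlevel=4)
total_cost, basis_paths = best_basis(wp, wp, max_level=4)
```

Because the cost is additive over sub-bands, this bottom-up comparison yields the globally optimal tree in a single pass, which is the dynamic-programming property mentioned above.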
Fig. 8 2-D WPT best-basis of the image in Fig. 3 (for the dyadic case, m = 2, using a symmlet of width equal to 8, a "threshold" cost function, and a maximum of six levels).
The best-basis search chooses smaller sub-band blocks to capture the detailed features in the image and larger blocks to represent lower-frequency information. Many of the sub-band blocks in Fig. 8 are grouped smoothed sub-bands (lower left of the display), as one would expect with a normal 2-D DWT. However, some sub-bands are also present in other parts of the WPT, indicating that some of the information in the image is best captured by the detailed coefficients. By overlaying the WPT (for each level) with the best-basis, we can easily identify the wavelet coefficients that best capture the different regions of the image. More details on the best-basis search algorithm are given in Chapter 6.
One inherent problem with the best-basis technique is the choice of the wavelet type to use. Generally, a standard ("off-the-shelf") wavelet, for example a coiflet or symmlet, is chosen prior to the best-basis search. This choice is made independently of the best-basis search and of the application at hand. A more flexible approach involves designing the wavelet in conjunction with the best-basis search: the wavelet is customised to the task at hand and integrated with the best-basis search. The reader is directed to Chapter 8 for more information on custom, task-specific wavelets. Different ways of combining wavelet customisation with the best-basis search are possible, namely the wavelet customisation is made (i) independently of (see Fig. 9), or (ii) integrated with, the best-basis search (see Fig. 10). The former case is simpler to implement and has reduced computational requirements, whereas the latter is better adapted to the task at hand (i.e. should give optimal image compression performance) but requires larger computational resources. We present results for the latter (integrated) case in the next section (Section 4). From Chapter 8 we note that the construction of a task-specific wavelet proceeds by generating a normalised vector v of dimensionality m - 1 and a further N_f/m - 1 normalised vectors u_i, each of length m (N_f is the number of wavelet filter coefficients). The total number of free parameters required to construct the wavelet is therefore

$$N_{par} = \frac{(N_f - m)(m - 1)}{m} + m - 2.$$
For the dyadic case (m = 2), N_par = N_f/2 - 1. For short, compact wavelets (i.e. small N_f), the number of parameters is small and, consequently, the search space has a low dimensionality. Furthermore, we shall show experimentally that the hyper-surface in the search space is smooth.
[Flowchart: a wavelet-construction loop (construct task-specific wavelet, calculate wavelet parameters, evaluate compression ratio, update wavelet parameters) followed by the best-basis search on the image data.]
Fig. 9 Independent wavelet construction and best-basis search.
Therefore we can easily determine the optimal values using a simple hill-climbing procedure.
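Because the hyper-surface is smooth and near-symmetric, even a very simple stochastic hill-climber suffices. The sketch below is hypothetical: the `evaluate_compression` callback, standing in for the wavelet-construction and compression pipeline, is not specified in the text.

```python
import numpy as np

def hill_climb(evaluate_compression, n_par, step=0.05, iters=200, seed=0):
    """Stochastic hill-climbing over the N_par wavelet parameters."""
    rng = np.random.default_rng(seed)
    params = rng.uniform(-1.0, 1.0, size=n_par)   # random starting point
    best = evaluate_compression(params)
    for _ in range(iters):
        trial = params + rng.normal(0.0, step, size=n_par)
        score = evaluate_compression(trial)
        if score > best:               # keep moves that improve the ratio
            params, best = trial, score
    return params, best
```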
4 Integrated task-specific wavelets and best-basis search for image compression
As stated in the previous section, optimal wavelet image compression can be achieved by integrating the process of wavelet construction with the best-basis search. The best-basis search using a standard "off-the-shelf" Coifman wavelet (N_f = 12) for four levels is shown in Fig. 11. In this case the threshold cost function used was simply the constant value 0.2. The resulting compression ratio was 9.50. Fig. 12 shows the result when the task-specific wavelet construction is integrated with the best-basis search (with the same threshold cost function). Here, the compression ratio is 9.71, an improvement of 2.2% compared with
[Flowchart: the wavelet-construction loop and the best-basis search are merged, with the compression ratio evaluated on the image data inside the combined loop.]
Fig. 10 Integrated wavelet construction and best-basis search.
the standard wavelet. We note that the wavelet basis sub-band block distribution is similar except at the lowest level (the wavelet coefficients in each sub-band block will also differ). The complexity of the search space for wavelet construction depends on the choice of the number of parameters, N_par, as this variable determines the dimensionality of the search space. The search complexity is exponential in N_par, so the value of N_par should be kept as small as possible. For the case of N_f = 12 and m = 2 used in Figs. 11 and 12, the dimensionality is N_par = 5. Fig. 13 shows a simple search space for N_f = 6 and m = 2 (and, therefore, N_par = 2), where we have chosen 50 points along each parameter dimension, i.e. a total of 2500 points on the search hyper-surface. Observe the highly regular and near-symmetric distribution of the search space, enabling the use of simple hill-climbing search algorithms (the parameter values corresponding to the white parts of the surface represent the optimal wavelet parameters). Such regularity and symmetry properties have been observed for a wide range of image typologies
Fig. 11 Results of the best-basis search for a standard Coifman wavelet (four levels, m = 2, filter length N_f = 12).
Fig. 12 Results of the integrated task-specific wavelet construction and best-basis search (four levels, m = 2, filter length N_f = 12).
Fig. 13 Search space for task-specific wavelet construction (m = 2 and Nf = 6).
[11]. Furthermore, it is conjectured that these properties of the search space scale up well to higher dimensions (i.e. N_par > 2), thereby significantly reducing the computational time of the search.
5 Acknowledgements
The authors would like to thank Dr S. Aeberhard for generating the results for the integrated task-specific wavelet construction and best-basis search.
References
1. I. Witten, R. Neal and J. Cleary, Arithmetic coding for data compression, Communications of the ACM, 30 (1987), 520-540.
2. J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23 (1977), 337-343.
3. Y. Fisher (Ed.), Fractal Image Compression: Theory and Application, Springer-Verlag, Berlin (1994).
4. K. Culik and J. Kari, Image compression using weighted finite automata, Computers and Graphics, 17 (1993), 305-313.
5. B. Litow and O. de Vel, On digital images which cannot be generated by small generalised stochastic automata, in Mathematical Foundations of Computer Science Workshop on Randomised Algorithms (R. Freivalds, Ed.), RWTH, Aachen (1998).
6. M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice-Hall (1995).
7. J. Shapiro, Embedded image coding using zerotrees of wavelet coefficients, IEEE Transactions on Signal Processing, 41 (1993), 3445-3462.
8. R. Coifman and M. Wickerhauser, Entropy-based algorithms for best basis selection, IEEE Transactions on Information Theory, 38 (1992), 713-718.
9. K. Ramchandran and M. Vetterli, Best wavelet packet bases in a rate-distortion sense, IEEE Transactions on Image Processing, 2 (1993), 160-175.
10. M. Mandal, S. Panchanathan and T. Aboulnasr, Choice of wavelets for image compression, in Information Theory and Applications II, Lecture Notes in Computer Science 1133 (P. Fortier, J.Y. Chouinard and T. Gulliver, Eds), Springer-Verlag (1996), 239-249.
11. O. de Vel and S. Aeberhard, Image-specific adaptive wavelet compression, submitted to IEE Journal of Vision, Image and Signal Processing (1998).
CHAPTER 20
Wavelet Analysis and Processing of 2-D and 3-D Analytical Images
S.G. Nikolov¹*, M. Wolkenstein² and H. Hutter²
¹ Image Communications Group, Centre for Communications Research, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol BS8 1UB, UK; e-mail: stavri.nikolov@bristol.ac.uk
² Research Group on Physical Analysis and Computer Based Analytical Chemistry, Institute of Analytical Chemistry, Vienna University of Technology, Getreidemarkt 9/151, Vienna 1060, Austria; e-mail: mwolken@mail.zserv.tuwien.ac.at, h.hutter@tuwien.ac.at
1 Introduction
The rapid progress in high technology poses new challenges to analytical chemistry. Besides the development of new or improved techniques, the general trend in the development of analytical methods and instrumentation is to increase the information content extracted from analytical signals and analytical images. By analytical images, in this chapter, we mean all images acquired by any of the analytical chemistry techniques described below. Many of these new techniques often reach out to the very limits of physics: when individual atoms are observed, when single ions are detected, or when monolayers on the surface of materials are selectively analysed. A wide variety of scientific instruments directly produce images in a form suitable for computer acquisition and computer analysis. The majority of these images are two-dimensional (2-D) images. Imaging has played a major role for a very long time in biology, chemistry and physics, if one considers the widespread use of microscopic techniques like light microscopy or electron microscopy. The most common type of images obtained in microscopy shows the intensity of light, or of any other radiation, that has come through the sample. These images are called transmission images and they are generated
* Member of the Research Group on Physical Analysis and Computer Based Analytical Chemistry, IAC, Vienna University of Technology, from 1993 until 1996.
by techniques such as light microscopy or transmission electron microscopy. In transmission images, the absorption of the radiation at each point is a measure of the density of the specimen along the radiation path. Some radiation energies may be selectively absorbed by the sample, according to its composition. Some other techniques used in analytical chemistry are based on a completely different principle of operation. Images are acquired by scanning devices, where an analysing beam, either radiation or particles, is scanned in a raster pattern over the specimen, and the interaction with the sample is measured by a detector. Examples of such techniques are Electron Probe Microanalysis (EPMA), Secondary Ion Mass Spectrometry (SIMS), Auger Electron Spectrometry (AES) and Confocal Scanning Light Microscopy (CSLM). The instruments used in these techniques provide a time-varying signal, which can be related to spatial locations on the sample through the scanning speed and parameters. The interaction may differ selectively according to the composition of the analysed spot. A completely different class of analytical techniques, e.g. Atomic Force Microscopy (AFM) or Scanning Tunnelling Microscopy (STM), generates images where the pixel brightness is used to record distances. Many instruments used in analytical chemistry capture more than one single image. Multiple images may constitute a series of views of the same area using different radiation wavelengths, or they may be images of different elemental distributions on the specimen surface. Such images are often called multispectral images. Another scenario is when several different techniques provide complementary information, often in the form of images, about one and the same specimen. The collection of such images is called a multimodality image. A different multiple-image case is a time sequence, where one and the same specimen is imaged at consecutive moments in time. Some of the instruments described above can also produce three-dimensional (3-D) images of the specimen. These images usually comprise series of parallel slices through the specimen. Common methods capable of generating 3-D images are: various serial sectioning methods used in microscopy; Computed Tomography (CT) or Magnetic Resonance Imaging (MRI) used in medicine; or methods such as SIMS, where a collection of images is produced by physically eroding the specimen and capturing 2-D images at different depths.
Automated processing and computerised measurement of analytical images can be employed to extract specific information very accurately and reproducibly. The processing of analytical images is used for two main purposes: (a) to improve the visual appearance of images for enhanced interpretation by a human observer; and (b) to prepare the images for quantitative measurement of the inherent features and structures. As stated before, the main goal of the analytical chemist is the extraction of interesting information from the measured data. The achievement of this goal is often complicated by the presence of noise in the data. Processing digital images to tackle the noise reduction problem is one of the main applications of image processing. Since this problem is common to various fields of science and technology, a large number of noise reduction techniques have been proposed. Yet a severe problem of many of the classical smoothing operations is the loss of resolution. A possible solution to this problem is the use of wavelet de-noising. Wavelet de-noising of analytical images results in increased noise suppression compared to other state-of-the-art filtering algorithms, while the most important features in the input image are well preserved. Several publications show that wavelet de-noising produces better reconstruction results than most traditional linear smoothing methods, especially in cases of high spatial variability of the original data. Besides improving the visual perception of the image, the main purpose of any de-noising technique is always to enhance further image processing. Many automated image processing and evaluation methods may lead to very poor results, or may not be applicable at all, if the noise variance in the image is too high. Therefore, some pre-processing steps have to be performed to reduce the noise, enhance the image quality and, most of all, enable further processing. In many cases, wavelet pre-processing of the input image results in superior performance of the subsequent processing steps. One such example is the classification of analytical images, which is significantly improved when wavelet de-noising is applied prior to classification. Another application inherently associated with image de-noising is image compression. Image compression, similar to de-noising, removes unimportant or undesired details from an image, and thus compresses it. There exist a variety of data compression techniques in different application areas, many of which have been well standardised. Data compression techniques can be
divided into two groups: lossless and lossy data compression. Examples of lossy data compression are the Joint Photographic Experts Group (JPEG) and the Motion Picture Experts Group (MPEG) standards for still images and movies, respectively. Data compression is another very successful application of wavelets. Many studies show the superiority of wavelet compression algorithms over other compression methods. Other interesting applications of wavelet analysis are the extraction of features from analytical images. Wavelets can be used for edge detection and texture analysis. Some of the extracted features can be used to align (register) multimodality analytical images. Image registration is the first step in the process of combining the information from the various modalities, i.e. image fusion. With the availability of several different instruments, used in everyday analysis of different specimens in one and the same chemical laboratory, image fusion is becoming a very important and active field of research. Wavelet transform fusion, or the fusion of images in the wavelet domain, provides us with additional tools to combine analytical images. This chapter describes several applications of wavelet methods to analyse and process analytical images. A short list of online resources on wavelets and wavelet analysis, with a focus on their application to analytical images, is included at the end of the chapter, together with an extensive bibliography on the subject and the names of the software programs used to process the images displayed in this study.
2 The 2-D and 3-D wavelet transform
The one-dimensional (1-D) discrete wavelet transform (DWT) defined in the first part of the book can be generalised to higher dimensions. The most general case has been studied by Lawton and Resnikoff [1]. An N-dimensional (N-D) DWT is also described in [2]. The separable extension of the wavelet transform (WT) to three dimensions, for example, is explained in [2,3,4]. In this chapter, for simplicity and because of the problems studied, only the theory of the 2-D and 3-D DWT will be outlined, and only separable 2-D and 3-D wavelets will be considered. These wavelets are constructed from one-dimensional wavelets. Separable wavelets are most frequently used in practice, since they lead to a significant reduction in the computational complexity.
In this chapter a 2-D image refers to any 2-D intensity function I(x,y), where x and y denote spatial coordinates and the value of I at any point (x,y) is proportional to the brightness (or grey level) of the image at that point. Similarly, a 3-D image refers to any three-dimensional intensity function I(x,y,z), where x, y and z denote spatial coordinates. A digital image is an image I(x,y) or I(x,y,z) that has been discretised in both the spatial coordinates and the intensity. The theory of the 2-D and 3-D DWT presented in this chapter closely follows [4]. To extend the 1-D wavelet transform to 2-D and 3-D, we have to find the multiresolution approximations of L²(R²) and L²(R³), where I(x,y) ∈ L²(R²) and I(x,y,z) ∈ L²(R³), respectively. Separable versions of the approximations can be defined, where each vector space V_{2^j} is decomposed as a tensor product of two identical subspaces of L²(R²), or of three identical subspaces of L²(R³). Let us first study the 2-D case. We can define a 2-D scaling function

$$\Phi(x,y) = \phi(x)\phi(y), \qquad (1)$$

where φ(x) is the 1-D scaling function of V_{2^j}. The three wavelets

$$\Psi^1(x,y) = \phi(x)\psi(y), \quad \Psi^2(x,y) = \psi(x)\phi(y), \quad \Psi^3(x,y) = \psi(x)\psi(y) \qquad (2)$$

can be used to build an orthonormal basis of L²(R²). This orthonormal basis is

$$\left\{ 2^{-j}\Psi^1_{2^j}(x - 2^{-j}n,\, y - 2^{-j}m),\; 2^{-j}\Psi^2_{2^j}(x - 2^{-j}n,\, y - 2^{-j}m),\; 2^{-j}\Psi^3_{2^j}(x - 2^{-j}n,\, y - 2^{-j}m) \right\}, \qquad (3)$$

where (n,m,j) ∈ Z³. Here Z denotes the set of integer numbers. Now we explain how to compute the DWT of a 2-D image using a pyramidal algorithm, i.e. a filter pyramid with quadrature mirror filters (QMF) L and H. This method is usually used in signal and image processing to compute a DWT. Let the image I(x,y) be a square matrix of dimensions N × N, where N is a power of two. The low-pass filter L and the high-pass filter H are applied first to the matrix rows and the output is down-sampled by two. This results in two new matrices L_rI and H_rI (where F_r means that the
filter is applied to the matrix rows, while F_c means that it is applied to the matrix columns), both having dimensions N × (N/2). Next, H and L are applied to the columns of the matrices L_rI and H_rI, resulting in matrices L_cL_rI, H_cL_rI, L_cH_rI and H_cH_rI, all of dimensions (N/2) × (N/2). The input matrix I is thus divided into four matrices or channels. These channels are LLI, HLI, LHI and HHI, when we omit the row and column indexes. The matrix LLI is a smoother copy of the image I, while the matrices HLI, LHI and HHI contain the vertical, horizontal and vertical-horizontal high frequencies. Thus, one band of the DWT is computed. The same procedure continues with the matrix LLI, producing another band of the wavelet decomposition, and so on, until a single number, the average of the whole original matrix I, is obtained. Fig. 1 illustrates the two-dimensional pyramidal algorithm. Generally, each smoother approximation S_{2^{j+1}}I of I at scale 2^{j+1} is decomposed into a low-pass subimage S_{2^j}I (or the LL channel) and three detail subimages D¹_{2^j}I (HL channel), D²_{2^j}I (LH channel) and D³_{2^j}I (HH channel). The corresponding inner products are

$$\begin{aligned}
S_{2^j}I &= \{ (I(x,y) * \phi_{2^j}(-x)\,\phi_{2^j}(-y))(2^{-j}n, 2^{-j}m) \}_{(n,m)\in Z^2} \\
D^1_{2^j}I &= \{ (I(x,y) * \phi_{2^j}(-x)\,\psi_{2^j}(-y))(2^{-j}n, 2^{-j}m) \}_{(n,m)\in Z^2} \\
D^2_{2^j}I &= \{ (I(x,y) * \psi_{2^j}(-x)\,\phi_{2^j}(-y))(2^{-j}n, 2^{-j}m) \}_{(n,m)\in Z^2} \\
D^3_{2^j}I &= \{ (I(x,y) * \psi_{2^j}(-x)\,\psi_{2^j}(-y))(2^{-j}n, 2^{-j}m) \}_{(n,m)\in Z^2}
\end{aligned} \qquad (4)$$
Fig. 1 Two-dimensional pyramidal wavelet decomposition: each of the four channels in one band of the 2-D WT can be named using the following notation: LLI, LHI, HLI, HHI, where I is the 2-D image, L stands for a low-pass filter, H stands for a high-pass filter, and the filters are applied first along the y direction (right position) and then along the x direction (left position). Three bands of the wavelet decomposition are displayed.
To compute the inverse DWT, at each scale 2^{j+1} we can reconstruct the approximation S_{2^{j+1}}I of I(x,y) by

$$\begin{aligned}
S_{2^{j+1}}I = &\sum_{m,n} (S_{2^j}I)_{m,n}\, \Phi(2^j x - n,\, 2^j y - m) \\
+ &\sum_{m,n} \big[ (D^1_{2^j}I)_{m,n}\, \Psi^1(2^j x - n,\, 2^j y - m) \\
&\quad + (D^2_{2^j}I)_{m,n}\, \Psi^2(2^j x - n,\, 2^j y - m) \\
&\quad + (D^3_{2^j}I)_{m,n}\, \Psi^3(2^j x - n,\, 2^j y - m) \big]. \qquad (5)
\end{aligned}$$
Thus, the whole image I(x,y) can be reconstructed from the pyramid by using shifted and dilated versions of the four functions Φ, Ψ¹, Ψ² and Ψ³ (see Eqs. (1) and (2)). Now we briefly discuss the 3-D case. Here, we define a separable version of V_{2^j} as a multiresolution approximation of L²(R³). In this case the scaling function can be defined as

$$\Phi(x,y,z) = \phi(x)\phi(y)\phi(z)$$

and the corresponding wavelets as

$$\begin{aligned}
\Psi^1(x,y,z) &= \phi(x)\phi(y)\psi(z), \\
\Psi^2(x,y,z) &= \phi(x)\psi(y)\phi(z), \\
\Psi^3(x,y,z) &= \phi(x)\psi(y)\psi(z), \\
\Psi^4(x,y,z) &= \psi(x)\phi(y)\phi(z), \\
\Psi^5(x,y,z) &= \psi(x)\phi(y)\psi(z), \\
\Psi^6(x,y,z) &= \psi(x)\psi(y)\phi(z), \\
\Psi^7(x,y,z) &= \psi(x)\psi(y)\psi(z).
\end{aligned}$$

Fig. 2 shows one band of the 3-D pyramidal decomposition. Here, a 3-D image (volume) I(x,y,z) is decomposed into eight 3-D subvolumes (channels). Each channel of the decomposition is labelled by a three-letter label, where each letter denotes the filter type (L or H) in the x, y and z directions. The volume is decomposed into a low-pass subvolume S_{2^j}I (or the LLL channel) and seven detail subvolumes {D^c_{2^j}I}, c = 1, ..., 7 (the high-frequency channels). The whole process can be repeated with the low-frequency subvolume, in order to compute another band of the 3-D wavelet decomposition. Similar to the 2-D
Fig. 2 Three-dimensional pyramidal wavelet decomposition. Each of the eight channels in one band of the 3-D WT can be named using the following notation: LLLI, LLHI, LHLI, LHHI, HLLI, HLHI, HHLI, HHHI, where I is the 3-D image, L stands for a low-pass filter, H stands for a high-pass filter, and the filters are applied first along the z direction (right-most position), then along the y direction (middle position), and finally along the x direction (left-most position).
case, we can reconstruct the approximation S_{2^{j+1}}I of I(x,y,z) at scale 2^{j+1} by computing

$$S_{2^{j+1}}I = \sum_{m,n,k} (S_{2^j}I)_{m,n,k}\, \Phi(2^j x - n,\, 2^j y - m,\, 2^j z - k) + \sum_{m,n,k} \sum_{c=1}^{7} (D^c_{2^j}I)_{m,n,k}\, \Psi^c(2^j x - n,\, 2^j y - m,\, 2^j z - k).$$

More details about the theory of the 2-D and 3-D wavelet transform can be found in [4].
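One band of this 3-D decomposition can be computed with PyWavelets (an assumed package, not the software used for this chapter); `dwtn` returns the eight channels keyed by the filter applied along each axis ('a' for the low-pass L, 'd' for the high-pass H):

```python
import numpy as np
import pywt

volume = np.random.rand(64, 64, 64)     # stand-in 3-D image I(x, y, z)

# One band of the 3-D pyramidal decomposition: separable filtering and
# dyadic sub-sampling along each of the three axes.
band = pywt.dwtn(volume, wavelet="haar")
print(sorted(band.keys()))
# ['aaa', 'aad', 'ada', 'add', 'daa', 'dad', 'dda', 'ddd']
# band['aaa'] is the smooth LLL subvolume; recursing on it yields the
# next band, as in Fig. 2.
```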
3 Mathematical measures
For assessing the de-noising and compression performance of the WT, a quantitative evaluation of the reconstructed or decompressed images was carried out. Two mathematical measures were used to evaluate the output results after applying different filtering algorithms. The mean square error (MSE) is an estimator showing how close the reconstructed image Î (e.g. the de-noised image or the decompressed image) is to the original input image I. The MSE is defined as

$$\mathrm{MSE} = \frac{1}{M \cdot N} \sum_{x=1}^{M} \sum_{y=1}^{N} \big( I(x,y) - \hat{I}(x,y) \big)^2 \qquad (6)$$

or

$$\mathrm{MSE} = \frac{1}{M \cdot N \cdot K} \sum_{x=1}^{M} \sum_{y=1}^{N} \sum_{z=1}^{K} \big( I(x,y,z) - \hat{I}(x,y,z) \big)^2 \qquad (7)$$

in the 2-D or 3-D case, respectively. The peak signal-to-noise ratio (PSNR) is another figure of merit, which can be derived from the mean square error. The PSNR is defined as follows:

$$\mathrm{PSNR} = 10 \log_{10} \frac{I_{\max}^2}{\mathrm{MSE}}, \qquad (8)$$

where I_max is the maximum grey level of the image. In this chapter, whenever we refer to the signal-to-noise ratio (SNR) we actually mean the PSNR.
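Both measures are direct to implement; the following numpy sketch mirrors Eqs. (6)-(8) (the helper names are ours):

```python
import numpy as np

def mse(original, reconstructed):
    # Eq. (6) for 2-D arrays; np.mean also covers the 3-D case of Eq. (7).
    diff = original.astype(float) - reconstructed.astype(float)
    return float(np.mean(diff ** 2))

def psnr(original, reconstructed):
    # Eq. (8), with I_max taken as the maximum grey level of the image.
    i_max = float(original.max())
    return 10.0 * np.log10(i_max ** 2 / mse(original, reconstructed))
```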
4 Image acquisition
4.1 SIMS images
The instrument used to acquire the SIMS images included in this chapter is a double-focusing Secondary Ion Microscope CAMECA IMS3f, with a typical lateral resolution of 1-3 μm and a typical depth resolution of 5 nm. An intensive primary beam (primary ion O₂, primary beam intensity 2 μA, primary beam energy 5.5 keV) homogeneously illuminates the sample by scanning rapidly over an area of up to 500 × 500 μm². The ion optical system of the mass spectrometer produces a mass-filtered secondary ion image S(x,y) of the surface, which is registered using a CCD camera system (Pulnix TM 760) in combination with a double micro-channel-plate fluorescent screen
assembly (Galileo HOT). The camera signal is digitised by an ITI 151 image processor and is stored on the controlling computer [5]. Under the bombardment with the primary ions the surface of the sample is etched. The typical erosion rate is approximately three atomic layers per second. The measurement of the lateral distributions over time allows the determination of the 3-D elemental distributions S(x,y,z), yielding a signal with N chemical dimensions (the number of masses or elements measured) and three spatial dimensions.
4.2 EPMA images
A JEOL JSM 6400 scanning electron microscope (SEM) and a LINK eXL EDX energy-dispersive spectrometer were used for this work. A fine electron beam (acceleration voltage 20 kV, working distance 39 mm) is scanned in a raster pattern (512 × 512 scanning steps) over the surface of the sample, producing secondary or backscattered electrons and X-rays. The X-ray images are formed by selecting an energy (energy resolution 20 eV/channel) corresponding to a particular element, and then registering all detected X-rays in an image E(x,y), in which the brightness of each pixel is proportional to the X-ray intensity of the element.
5 Wavelet de-noising of 2-D and 3-D SIMS images
Images captured by analytical techniques are usually noisy. Noisy images may arise for various reasons, such as counting statistics in the image detector due to a small number of incident particles (photons, electrons, ions) in techniques such as SEM or SIMS, or instability of the light source or of the detector. The noise pattern depends on the phenomena under consideration and the instruments used, with common noise models being Gaussian noise and Poisson noise. De-noising is the process of reconstructing the underlying original signal from the noisy one, with the objective of removing as much of the noise as possible while preserving the major signal features.
5.1 De-noising via thresholding
Unlike the sine and cosine functions in Fourier analysis, which are localised in frequency but not in time (a small frequency change in the Fourier transform (FT) produces changes everywhere in the time domain), wavelets are localised both in frequency/scale (via dilations of the mother wavelet)
and in time (via translations of the mother wavelet). This leads to a very compact representation of large classes of functions and operators in the wavelet domain. Images with sharp spikes and edges, for instance, are well approximated by substantially fewer wavelet basis functions than sine and cosine functions. In the wavelet decomposition of signals and images, as described before, the filter L is an averaging or smoothing filter (low-pass filter), while its mirror counterpart H produces details (high-pass filter). With the exclusion of the last remaining smooth components, all wavelet coefficients in the final decomposition correspond to details. If the absolute value of a detail is small, omitting it (setting it to zero) does not change the general picture much. Therefore, thresholding of the wavelet coefficients is a good way of removing unimportant or undesired details from a signal (see Fig. 3). Thresholding techniques are successfully used in numerous data processing domains, since in most cases a small number of wavelet coefficients with large amplitudes preserves most of the information about the original data set. Different thresholding methods, such as
• hard thresholding (HT),

$$\hat{D}_{2^j}I = \begin{cases} 0 & \text{if } |D_{2^j}I| < \tau \\ D_{2^j}I & \text{if } |D_{2^j}I| \geq \tau; \end{cases} \qquad (9)$$

• soft thresholding (ST),

$$\hat{D}_{2^j}I = \mathrm{sign}(D_{2^j}I)\,(|D_{2^j}I| - \tau)_+, \qquad (10)$$

where (x)₊ = x when x ≥ 0 and (x)₊ = 0 when x < 0;

• universal thresholding (UT), hard or soft thresholding where

$$\tau = \sigma \sqrt{2 \log(N)} / \sqrt{N} \qquad (11)$$

and σ is a robust estimate of the standard deviation of the noise and N is the number of pixels/voxels in the image,

have been used to solve problems ranging from Gaussian noise reduction to density estimation and inverse problems [6].
Fig. 3 Multiresolution Analysis plot of a one-dimensional DWT: noisy data (top left); wavelet shrinkage reconstruction (top right); MRA plot of the DWT of the noisy data (bottom left); MRA plot of the thresholded wavelet coefficients (bottom right). Figure courtesy of Prof. David Donoho, Stanford University.
In all wavelet thresholding algorithms, the magnitude of the wavelet coefficients, or more precisely the magnitude of the wavelet details D_{2^j}I, is compared to a threshold τ, which is either set manually (as in HT and ST) or derived from the data (as in UT). All coefficients of the WT whose magnitudes are smaller than τ are either set to zero or shrunk towards zero. In this chapter we have used the algorithm initially proposed by Donoho and Johnstone [7,6], used by Nikolov et al. [8] for de-noising of 2-D SIMS images and by Wolkenstein et al. [9] for de-noising of 3-D SIMS images. Below is the wavelet de-noising algorithm, as described in [8]. Let us assume the following image model: a two-dimensional image G(x,y) = I(x,y) + σ z(x,y), where z(x,y) is unit-variance white Gaussian noise and σ is the noise standard deviation. Then, we can use the 2-D DWT and universal soft thresholding (see Eqs. (10) and (11)) to de-noise the image in the following three steps:
1. perform the forward DWT of the image G(x,y), yielding noisy wavelet coefficients W_{2^j,k};
2. apply universal thresholding (soft thresholding with τ = σ√(2 log(N · N))/N) to the noisy wavelet coefficients, obtaining the estimates Ŵ_{2^j,k};
3. set all wavelet coefficients Ŵ_{2^j,k} = 0 for j > J and apply the inverse DWT, producing the image estimate Î(x,y).
This filter in the wavelet domain shrinks the wavelet coefficients towards zero. Because the few large wavelet coefficients preserve almost the whole energy of the signal, the shrinkage reduces the noise without distorting the image features much (see Fig. 4). In the reconstructions of images resulting from this algorithm the noise is significantly suppressed, while sharp features in the original are still sharp in the reconstruction [8,10].
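A compact sketch of these three steps using PyWavelets is given below. This is an assumption on our part (the chapter's results were produced with other software), and note that with PyWavelets' orthonormal coefficient convention the universal threshold is written σ√(2 log n) for an image of n pixels, rather than the continuous-normalisation form of Eq. (11):

```python
import numpy as np
import pywt

def wavelet_denoise_2d(g, wavelet="coif3", level=4):
    """Universal soft thresholding of a noisy 2-D image g = i + sigma * z."""
    # Step 1: forward DWT, yielding the noisy wavelet coefficients.
    coeffs = pywt.wavedec2(g, wavelet, level=level)
    # Robust noise estimate from the finest diagonal sub-band (MAD / 0.6745).
    sigma = np.median(np.abs(coeffs[-1][2])) / 0.6745
    # Step 2: universal threshold, n = total number of pixels.
    tau = sigma * np.sqrt(2.0 * np.log(g.size))
    shrunk = [coeffs[0]]                     # keep the smooth coefficients
    for (ch, cv, cd) in coeffs[1:]:
        shrunk.append(tuple(pywt.threshold(c, tau, mode="soft")
                            for c in (ch, cv, cd)))
    # Step 3: inverse DWT produces the image estimate.
    return pywt.waverec2(shrunk, wavelet)
```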
5.2 Gaussian and Poisson distributions
All measurement data produced by counting single events are characterised by Poisson statistics. Let us have a 2-D SIMS image I(x,y). In this case, we can apply the Anscombe [11] variance-stabilising transformation to the image, i.e. P(x,y) = 2√(I(x,y) + 3/8), and then use the de-noising algorithm proposed above, as if the whole image had Gaussian white noise with σ = 1. As investigations by Starck et al. [12] show, the variance of the stabilised Poisson image P(x,y) is, from a practical point of view, equal to 1 irrespective of the mean value of I(x,y). However, in cases where the mean value of the Poisson parameter is under 10, a generalisation of the Anscombe formula should be preferred [12].
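As a one-line numpy sketch (illustrative only):

```python
import numpy as np

def anscombe(img):
    # Variance-stabilising transformation for Poisson-distributed counts:
    # the result is approximately Gaussian with unit standard deviation.
    return 2.0 * np.sqrt(np.asarray(img, dtype=float) + 3.0 / 8.0)
```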
5.3 Wavelet de-noising of 2-D SIMS images
To demonstrate the result of the wavelet shrinkage algorithm, several examples with different measurement times, i.e. different SNRs, were measured. In Fig. 5 an Al distribution is displayed. More examples can be found in [8]. Various wavelets, including the Haar wavelet and Daubechies and Coiflet wavelets, were tested. The SIMS image shown in Fig. 5 was de-noised using a Coiflet with 3 vanishing moments. The top-right and bottom-right images in Fig. 5 present close-ups of regions of interest and the corresponding de-noised close-ups. Soft thresholding was applied to the wavelet coefficients.
Fig. 4 De-noising via wavelet shrinkage: Lena image, 512 × 512 pixels, 256 grey levels (top left); Lena image, close-up (top right); Lena image with additive Gaussian noise, standard deviation σ = 10, close-up (bottom left); de-noised Lena image, universal soft thresholding (Coiflet wavelet with 3 vanishing moments), close-up (bottom right).
For assessing the performance of the above-described wavelet de-noising algorithm, a quantitative evaluation of the reconstruction was carried out. As figures of merit, the MSE (Eq. (6)) and the SNR (Eq. (8)) were used. Wavelet de-noising was compared with the optimal MSE Wiener filter [2]. Wiener filter reconstructions were calculated using the wiener2 function from the
Fig. 5 De-noising of a SIMS image: (a) original image, 512 × 512 pixels, 256 grey levels; (b) original image, close-up; (c) de-noised image, universal soft thresholding; (d) de-noised image, universal soft thresholding, close-up. Measurement parameters: primary ions: Cs⁺; primary intensity: 1 nA; primary beam diameter: 0.3 μm; primary ion energy: 6.5 keV; scanning steps: 512 × 512; step width: 0.1 μm; analytical area: 51.2 × 51.2 μm; measurement time per pixel: 1 ms; measurement time per image: 256 s; detected secondary ions: ²⁷Al⁻.
MATLAB Image Processing Toolbox [13]. The block size of the Wiener filter was tuned to find the least-MSE reconstruction. Since quantification requires true images for comparison, the evaluation was carried out on the basis of a simulated image (Fig. 6), which has simple features, such as rectangular bumps with increasing widths, resembling structures in some real
SIMS images. A simulated image with Poisson statistics and an SNR = 3.8 was created. The Anscombe transform was applied prior to filtering. The simulated image was processed by both wavelet shrinkage and Wiener filtering. The results obtained from 100 noisy replicates are presented in [8]. We calculated not only the MSE of the whole image but also the MSE of one cross-section (along column 80) and the MSE of the individual bumps. Thus, a better quantification of how features with different widths are reconstructed was achieved. Generally, it can be concluded that wavelets give an MSE comparable to the Wiener filter while at the same time achieving a much better SNR improvement, though mainly because they smooth the background and the top plateaus of the rectangular bumps more. To verify this last statement we calculated the gradients and additionally measured the reconstruction quality of the bump edges alone. The Wiener filter gives a 5-10% better reconstruction of the bump edges, which confirms the statement. Another easily observed trend is that the wider the bumps, the smaller the wavelet MSE becomes. By comparing the standard deviations of the MSE, it can be seen that the wavelet MSE tends to be more stable than the corresponding Wiener MSE. Another comparison between wavelets and two state-of-the-art adaptive filters (one based on fitting splines with adaptively chosen tension, the other using adaptive truncation of the empirical Fourier series) applied to various artificially generated signals may be found in [6]. An extensive comparison of wavelet filtering with various other widely used filtering techniques can be found in [16]. Simulated images of point sources and an elliptical galaxy were processed by a wavelet image restoration technique with a multiresolution support in [12]. All investigations clearly show that wavelet shrinkage algorithms produce better reconstructions than most traditional linear smoothing methods, especially in cases of high spatial variability of the original data. In all wavelet reconstructions the noise is efficiently suppressed and most of the image features are well preserved after processing.
Fig. 6 Simulated Bumps image, Poisson statistics: (a) original image, 256 × 256 pixels, 256 grey levels, background grey level = 10, bump plateau at grey level 20; (b) noisy image; (c) original image, close-up; (d) noisy image, close-up; (e) wavelet de-noised image (Coiflet 3), close-up; (f) Wiener de-noised image, 3 × 3 window, close-up.
Since a wavelet basis is not unique, finding the optimal wavelet for a specific problem is often a difficult task. Some wavelet properties, such as the smoothness of the wavelet or the number of vanishing moments, may point in the right direction towards the optimal wavelet. Usually, one either uses a wavelet from a library of wavelets, as has been done in this chapter, or constructs new wavelets with the desired characteristics. Determining the optimal threshold in the de-noising process is usually the result of careful exploration of the data. Multiresolution Analysis (MRA) plots (Fig. 3) reveal the structure of the data at different scales and thus help the observer to find the best threshold for a certain data set. Some thresholding methods, like universal thresholding, derive a nearly optimal threshold (in the case of universal thresholding, nearly optimal in the minimax sense) from the data, provided some initial normalisation conditions are met.
5.4 Wavelet de-noising of 3-D SIMS images
The multiscale wavelet transform of a signal contains all frequency information in the different scales of the transform. High-frequency information resides in the fine levels and low frequencies in the coarse levels. By analogy with the Fourier transform, the narrower a peak, the higher the frequencies required to describe it. Thus, optimal de-noising depends on the amount of noise and on the size and shape of the features of interest. This explains previous publications on de-noising of Gaussian-shaped peaks [14,15,16] reporting the optimal filter width to be between one and two times the full width at half maximum of the data features. Although wavelet de-noising via thresholding does not have a parameter such as the filter width, it does have a parameter with similar characteristics, i.e. the level of decomposition of the wavelet transform. The values of this parameter correspond to the filter width parameter of other de-noising filters. The wavelet de-noising algorithm described above for 2-D images can be extended to process 3-D images [9] using a 3-D DWT. Keeping in mind that SIMS images have a resolution between slices (z axis) which is different from the resolution within one slice (xy plane), optimal de-noising should therefore use different coarse levels (levels of decomposition) of the WT for the three spatial dimensions (strictly speaking, a different coarse level for the z axis). This assumption is proven by the quantitative evaluation of the reconstruction we carried out. As in the 2-D case, since the quantification
of the reconstruction performance of a filter requires true data for comparison, the evaluation was first carried out using a simulated 3-D image comprising several features often found in real 3-D SIMS images, such as low pixel intensities, peak areas of varying sizes, and edges showing a Gaussian-like shape (as a result of the cross-section of the electron beam). To simulate Poisson noise, each pixel was replaced by a random number, chosen from the Poisson distribution, with the pixel intensity as parameter. After applying the wavelet de-noising, a quantitative evaluation of the filtered images was made using the MSE (see Eq. (7)) as the figure of merit. The simulated volume (Fig. 7) used in the quantitative evaluation is a set of 128 images, each of size 256 × 256 pixels, comprising spherical features with different diameters (8 to 38 pixels) and different feature intensities (SNR ≈ 0.9 to SNR ≈ 5). Three-dimensional SIMS image sets normally have different resolutions within a slice and between slices. The pixel distances within a slice are normally between 1 and 3 μm; the distance between slices can vary from 10 to 100 nm. These different resolutions were taken into account when sampling the simulated volume and generating the simulated images. Round features were chosen because they show no preferred edge orientation and bear a great resemblance to structures in captured SIMS images. These features were smoothed using a Gaussian weighted filter with a standard deviation of one pixel, which simulates the cross-section of the electron beam. Fig. 7 shows the 'true' 3-D volume and one of the 128 slices. The Poisson noise was created using a MATLAB Statistics Toolbox routine [13].
Fig. 7 Simulated SIMS volume: original 3-D image (left); one representative z slice (right). The whole image set comprises 128 slices, each of size 256 × 256 pixels, 256 grey levels, background grey level = 12, feature grey levels 15 to 30.
The optimal level of decomposition within one slice was found to be identical to the one found in a previous publication [16]. The optimal level of decomposition for the z axis is higher, due to the higher resolution between the slices. The results of the evaluation for different combinations of the coarse level (levels of decomposition) of the WT are plotted in Fig. 8. The question then arises whether these different resolutions not only demand different coarse levels for optimal de-noising, but also call for the use of different filter banks. To answer this question we de-noised our simulated SIMS volume using all possible combinations of wavelets. The results of this investigation are summarised in Table 1. They show that this assumption cannot be confirmed. This seems to be in correspondence with the results reported by Wang and Huang [17], which were obtained for the compression of medical images using a separable 3-D wavelet transform.
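For illustration, a 3-D version of the shrinkage sketch given earlier can be written with PyWavelets' n-dimensional transform (again an assumption, not the software used for these results); note that this simple version uses the same decomposition level along all three axes, whereas the evaluation above argues for a deeper level along z:

```python
import numpy as np
import pywt

def wavelet_denoise_3d(vol, wavelet="sym8", level=3):
    """3-D universal soft thresholding (same level along x, y and z)."""
    coeffs = pywt.wavedecn(vol, wavelet, level=level)
    # Robust noise estimate from the finest all-detail ('ddd') channel.
    sigma = np.median(np.abs(coeffs[-1]["ddd"])) / 0.6745
    tau = sigma * np.sqrt(2.0 * np.log(vol.size))
    shrunk = [coeffs[0]]
    for band in coeffs[1:]:
        shrunk.append({key: pywt.threshold(c, tau, mode="soft")
                       for key, c in band.items()})
    return pywt.waverecn(shrunk, wavelet)
```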
Fig. 8 Wavelet filter reconstructions using different combinations of the level of decomposition of the WT. MSE is the mean square error between the original and the reconstructed volume.
Table 1. Evaluation of all possible combinations of wavelets: MSE between the original and the reconstructed volume (rows: wavelet used along the z axis; columns: wavelet used in the xy plane).

z-axis wavelet   Haar     Coiflet 2  Daubechies 4  Daubechies 6  Symmlet 6  Symmlet 8  Antonini
Haar             1.3411   1.0259     1.1362        1.0512        1.0179     1.0112     1.0251
Coiflet 2        1.2199   0.9241     1.0275        0.9484        0.9197     0.9152     0.9236
Coiflet 3        1.2209   0.9201     1.0247        0.9450        0.9148     0.9121     0.9200
Daubechies 4     1.2642   0.9603     1.0630        0.9825        0.9557     0.9506     0.9607
Daubechies 6     1.2324   0.9385     1.0399        0.9589        0.9334     0.9296     0.9374
Symmlet 6        1.2269   0.9258     1.0299        0.9504        0.9202     0.9178     0.9253
Symmlet 8        1.2202   0.9162     1.0207        0.9360        0.9093     0.9041     0.9119
Villasenor 2     1.2206   0.9192     1.0224        0.9380        0.9125     0.9066     0.9142
Villasenor 6     1.1850   0.9182     1.0144        0.9364        0.9133     0.9092     0.9173
Antonini         1.2155   0.9200     1.0222        0.9387        0.9141     0.9084     0.9163
The effect of wavelet de-noising is demonstrated in Figs. 9 and 10. The first of the two figures shows our simulated SIMS volume contaminated with Poisson noise. The noise severely degrades the rendering of the 3-D image. After wavelet de-noising (Fig. 10), the noise in the reconstruction is suppressed to a large extent and most of the image features are only slightly distorted. A real 3-D SIMS image is presented in Fig. 11. The measured specimen is a high-speed steel S 6-5-2 (W 6%, Mo 5%, V 2%). The SIMS volume displayed in Fig. 11 was then de-noised using the filter which gave the best results in
Fig. 9 Simulated SIMS volume: the 3-D image in Fig. 7 has been degraded with Poisson noise: noisy 3-D image (left); one representative z slice (right).
Fig. 10 Simulated SIMS volume: the 3-D image in Fig. 9 has been processed using wavelet de-noising: 3-D reconstruction (left), compare with Fig. 7; one representative z slice (right).
Fig. 11 Real SIMS volume: 3-D image (left); one representative z slice (right).
our evaluation of the de-noising performance on the simulated volume, i.e. a Symmlet with eight vanishing moments. The left image in Fig. 12 shows the 3-D reconstruction of the whole de-noised volume; the right image is one representative slice. Again, as with the simulated image, the noise in the reconstruction of the wavelet-processed volume is significantly reduced. Additionally, 3-D wavelet de-noising was compared to traditional 2-D wavelet de-noising of separate volume slices. Three-dimensional de-noising
Fig. 12 Real SIMS volume: the 3-D image in Fig. 11 has been processed using wavelet de-noising: 3-D reconstruction (left); one representative z slice (right).
easily outperforms its 2-D counterpart (Fig. 13) at low additional computational cost: the de-noising time for 3-D wavelet de-noising is only 50 percent longer than for 2-D wavelet de-noising.
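In outline, the de-noising procedure is a separable wavelet decomposition, a thresholding of the detail coefficients, and a reconstruction. The following sketch uses the PyWavelets package; the soft threshold value and the decomposition level are illustrative choices, not the settings used in the evaluation above (which, in particular, found a deeper optimal decomposition along the z axis).

```python
import numpy as np
import pywt  # PyWavelets

def denoise_nd(data, wavelet="sym8", level=3, threshold=1.0):
    """Separable n-D wavelet de-noising: decompose, soft-threshold all
    detail coefficients, reconstruct. 'sym8' stands in for the
    Symmlet-8 filter that performed best in the evaluation above."""
    coeffs = pywt.wavedecn(data, wavelet, level=level)
    coeffs = [coeffs[0]] + [
        {k: pywt.threshold(v, threshold, mode="soft") for k, v in d.items()}
        for d in coeffs[1:]
    ]
    return pywt.waverecn(coeffs, wavelet)

volume = np.random.default_rng(1).poisson(12.0, size=(32, 64, 64)).astype(float)

# 3-D de-noising of the whole volume ...
rec3d = denoise_nd(volume, level=3)
# ... versus 2-D de-noising of each slice separately.
rec2d = np.stack([denoise_nd(s, level=3) for s in volume])
```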
6
Improvement of image classification by means of de-noising
6.1 Classification

6.1.1 Basics
Classification is a procedure utilised in aerial and medical imaging, as well as in microscopy, in order to extract features of interest from multispectral images. Numerous classifiers, such as the k-nearest neighbour classifier [18], neural networks [19] and fuzzy c-means clustering [20], to mention just a few, have been described and used by the scientific community. Analytical tools like multidimensional histograms [21], scatter diagrams, and concentration histogram imaging [22] have been applied to Scanning Auger Microscopy (SAM) [23,24] and SIMS images [25,26,27].
Fig. 13 Wavelet filter reconstruction using two- and three-dimensional wavelet de-noising. MSE is the mean square error between the original and the reconstructed volume.
The inherent presence of noise in analytical images often leads to false clusters in the classified images or to misclassification of some features. This section investigates the extent to which de-noising algorithms improve the subsequent classification of images. Geometric features in digital images, such as texture and shape, lead to pixel populations in coherent clusters and can therefore be treated further by multivariate statistical means to extract information, i.e. image segmentation for correlation of positional data can be performed [24]. A following step is the classification of the image features. In order to identify different objects in feature space, it is necessary to establish their frequency distribution. The resulting clusters ideally represent the relationship of the constituents in the original image. Each picture element can be assigned to an object employing certain classification strategies. Classification algorithms which have been used for analytical images include neural networks and fuzzy c-means clustering [28].

6.1.2 Scatter diagrams

A scatter diagram is used to represent the frequency distribution of grey levels, which points out the position of the objects in two-dimensional space [21,23,25]. Scatter diagrams for higher-dimensional spaces can also be computed [22]. Fig. 14 demonstrates the construction principle for a two-dimensional scatter diagram; a short sketch of this construction is given after the figure caption below. Many pixels in these diagrams tend to pile up at the same spots, as they possess the same relative frequency distribution of grey levels in both input images. Therefore, the scatter plot allows the determination of pixel clusters, outliers and gradients in terms of their density. Classification of images of analytical samples assigns each picture element of the image to a chemical phase, which is of great importance for several major analytical techniques such as EPMA and SIMS. Fig. 15 displays two elemental distributions of a soldered industrial metal sample acquired with SIMS and a classified image showing the different chemical phases of the sample. Fig. 16 shows two noiseless images and the same images with added Gaussian noise; the corresponding scatter diagrams and the classified images are shown as well.

6.2 Results

The scatter plot allows the determination of pixel clusters (representing single sample phases), outliers and gradients in terms of their density. These clusters can be separated using different classification strategies [28,20]. Then, the separated picture elements are projected back onto the original images to
Fig. 14 Schematic construction of a scatter diagram: image A (top left); image B (bottom left); scatter diagram (right). The pixels at location (205,37) have intensities 41 and 21, respectively, and therefore histogram bin (41,21) is incremented.
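A minimal sketch of this construction, assuming NumPy and 8-bit grey levels, is:

```python
import numpy as np

def scatter_diagram(image_a, image_b, n_levels=256):
    """Two-dimensional grey-level histogram: bin (a, b) counts the
    pixels whose intensities are a in image A and b in image B."""
    hist, _, _ = np.histogram2d(
        image_a.ravel(), image_b.ravel(),
        bins=n_levels, range=[[0, n_levels], [0, n_levels]],
    )
    return hist

rng = np.random.default_rng(2)
a = rng.integers(0, 256, size=(256, 256))
b = rng.integers(0, 256, size=(256, 256))
diagram = scatter_diagram(a, b)

# The example from Fig. 14: a pixel with intensities 41 and 21 in the
# two images increments histogram bin (41, 21).
print(diagram[41, 21])
```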
display a new classified image. Fig. 17 shows the application of wavelet de-noising and subsequent classification of two SIMS images. The specimen is a soldering alloy used to join steel and chromium. The solder material is a nickel-base alloy (Cr 7.0%, Fe 3.0%, B 3.0%, Ni 82.5%, Si 4.5%) in the form of a foil. The scatter plot of the noisy images in Fig. 17 (middle left) shows four distinct peaks, where the clusters overlap due to the noise, and therefore class assignment is uncertain. The classified image shows many misclassified pixels and the phase boundaries are blurred. De-noising decreases the noise variance. Thus, it both reduces the extension of the clusters in the scatter plot and increases their separability, which consequently improves the classification performance (Fig. 17, bottom). Generally, de-noising the images prior to classification substantially improves the classification results [29,30]. The number of misclassified pixels (MCP) rapidly decreases when Fourier or wavelet de-noising filters are applied prior to classification.
Fig. 15 Classification of two elemental distributions: ion micrograph of B (top left); ion micrograph of Cr (top right); scatter diagram (bottom left); classified image (bottom right). Measurement parameters: primary ions: O+; primary beam intensity: 2 µA; primary beam energy: 5.5 keV; scanned area: 500 x 500 µm; analysed area diameter: 400 µm.
In [30] it was concluded that the self-organising map (SOM) classifier and the fuzzy c-means clustering classifier show the same trends when the effect of de-noising on the subsequent classification is investigated. The better the reconstruction is with respect to the MSE, the smaller the number of misclassified pixels. Although this is generally true, our study of sub-optimal Fourier filtering (FF) showed that some commonly used methods for determining the FF cut-off frequency, such as the Kirmse algorithm, do not necessarily lead to optimal classification. A further line of investigation within the general framework of multiscale methods [12] and scale-space theory [31,32] is classification at different scales. If, for some types of images, classification gives the same or very similar results at several scales, then it may be performed on smaller, smoothed copies of the
Fig. 16 Simulated noiseless and noisy images, together with the corresponding scatter plots and classified images: the two noiseless images having two intensity value areas (60 and 70) (top); the scatter plot shows four clearly separated clusters. The same images as on the top, but with additive Gaussian noise (var = 20) (bottom); the four clusters in the scatter diagram are not discernible, and therefore automatic classification fails.
original images. This will considerably diminish the computation time for classification, which is of great importance, especially in the case of nonlinear classifiers such as neural networks.
7
Compression of 2-D and 3-D analytical images
7.1 Basics
In chemical analysis, many types of instruments now provide far more information than the integrated properties of a homogeneous sample. The correlation of local spatial and chemical information produces pictures revealing the composition and structure of non-homogeneous samples. In materials science, new imaging techniques lead to 3-D structural representations of material objects and their inner structure. Some of these advanced analytical methods, e.g. SIMS, are capable of producing series of 2-D sections, resulting in 3-D spatially resolved information about element distributions of significantly
Fig. 17 Classification of two SIMS images: noisy Si image (top left); noisy Ni image (top right); scatter diagram of the original noisy images (middle left); classified image from the original noisy images (middle right); scatter diagram of the reconstructed images (bottom left); classified image from the reconstructed (wavelet de-noised) images (bottom right). De-noising prior to classification leads to significantly better classification results.
large and representative volumes (10^6 µm^3) in a relatively short time (about 1 h). However, very large amounts of data are often obtained. SIMS images are typically digitised at a minimum resolution of 256 x 256 pixels with 16 bits per pixel. A single 2-D image therefore occupies at least 0.125 MB of storage space. A typical 3-D SIMS image set consists of 64 slices, thus requiring a minimum of 8 MB of storage. If we take into account that SIMS instruments capture images of four to eight 3-D distributions simultaneously, this results in 32 to 64 MB of data for a single analysis of one specimen.
Obviously, data compression is important to bring these numbers down and make 3-D SIMS image analysis and processing a more manageable task. Several methods have been proposed to reduce the storage space for image data. The discrete cosine transform (DCT), which is the basis for the JPEG standard, has been widely used for still image compression. Although it can be efficiently implemented and performs well for high bit-rate compression, serious blocking artefacts are a well-known disadvantage of DCT-based coding. As an alternative transform, the discrete wavelet transform not only overcomes the blocking artefacts, but also achieves better overall performance in most cases. An effective approach to data compression using wavelets was introduced by Wickerhauser [33]. Data coding is one of the most visible applications of wavelets: compression ratios of about 10:1 can be achieved without significant loss of visual detail. The FBI has adopted a standard for digital fingerprint image compression based on wavelet compression algorithms (see Fig. 18). This standard is described in the work of Bradley [34] and Brislawn and Hopper [35,36]. Compressing an image set with multiple slices is different from compressing only a single 2-D image. Multiple slices are normally correlated with each other; in other words, there are structural similarities between adjacent slices. Although it is possible to compress an image set slice by slice, more efficient compression can be achieved by exploiting the correlation between slices. In
Fig. 18 FBI-digitised left thumb fingerprint: original image (left); compressed image with compression ratio of 26:1 (right). Figure courtesy of Chris Brislawn, Los Alamos National Laboratory.
this section, a separable 3-D DWT with varying wavelet filter banks is applied to compress a 3-D SIMS image.
7.2 Quantisation

After the DWT of the image has been computed, the second step in the image compression process is quantisation. The purpose of quantisation is to reduce the data entropy by compromising the precision of the data: the quantisation step maps a large number of input values onto a smaller set of output values. This step is not invertible; it introduces the so-called quantisation noise, so the original data cannot be recovered exactly after quantisation. Hence, it is very important to design a quantisation strategy which selectively quantises the wavelet coefficients and preserves the image quality. In the wavelet decomposition of signals and images, as described in the previous sections, the filter L is a low-pass filter while its mirror counterpart H is a high-pass filter. Wavelet-transformed data thus consists of two types of channels: a single low-resolution channel, which contains most of the energy, and multiple high-resolution channels, which contain the edge information. All transformed data are represented by floating point values. A quantiser operates on these channels to produce a sequence of symbols. From an MSE perspective, the minimum entropy for a given distortion is approximately achieved by uniform quantisation [37]. However, if the quantiser is applied to the high-frequency channels, where the sample values are often small, the coding efficiency can be improved by using a larger quantisation interval around zero. In this study, we have used the embedded family of quantisers described in [38].
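The enlarged zero interval can be illustrated as follows. This is only a sketch of a uniform quantiser with a "dead zone", not the embedded family of quantisers of [38]; the step size and the width of the dead zone are illustrative parameters.

```python
import numpy as np

def deadzone_quantise(coeffs, step, zero_width=2.0):
    """Uniform quantiser with an enlarged interval around zero:
    small coefficients map to the symbol 0, the rest are rounded
    to multiples of the step size."""
    symbols = np.zeros(coeffs.shape, dtype=int)
    outside = np.abs(coeffs) >= zero_width * step / 2  # outside the dead zone
    symbols[outside] = np.round(coeffs[outside] / step).astype(int)
    return symbols

def dequantise(symbols, step):
    # The inverse mapping recovers only interval representatives,
    # which is why the quantisation noise cannot be removed later.
    return symbols * step

c = np.random.default_rng(3).normal(0, 1, 10_000)  # mock detail coefficients
q = deadzone_quantise(c, step=0.5)
print("quantisation MSE:", np.mean((c - dequantise(q, 0.5))**2))
```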
7.3 Entropy coding

After quantisation, the channels with discrete levels are represented by integers. In the third step these data are further entropy coded to reduce the bit rate. Entropy coding assigns fewer bits to integers with a higher frequency of occurrence and more bits to integers with a lower frequency of occurrence. This fully invertible step allows us to represent the data in even less space than the quantised data occupy. For a detailed description see [39].
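The achievable bit rate can be estimated from the first-order entropy of the quantised symbols; the sketch below computes this lower bound for an artificial, strongly zero-dominated symbol stream.

```python
import numpy as np

def empirical_entropy(symbols):
    """First-order entropy in bits/symbol: a lower bound on what an
    ideal entropy coder could achieve on this symbol stream."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

symbols = np.random.default_rng(4).poisson(0.2, 100_000)  # mostly zeros
print(f"{empirical_entropy(symbols):.3f} bits/symbol vs 16 bits raw")
```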
7.4 Results

The measured specimen is a high-speed steel S 6-5-2 (W 6%, Mo 5%, V 2%). Two different test volumes, i.e. 3-D SIMS images of the sample, were
recorded. One of them is displayed in Fig. 19. The definition of a cutting function produces a better view of the internal structures and the intensity distribution inside the overall volume [40]. Fig. 20 shows the PSNR versus the compression ratio for both test volumes. Even at very low bit rates, i.e. high compression ratios, a relatively high PSNR can be achieved.
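For reference, the PSNR figure of merit can be computed as follows; this is a sketch consistent with the usual definition (the exact form of Eq. (8) is given earlier in the chapter).

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB; 'peak' is the maximum
    possible pixel value, e.g. 255 for 8-bit grey levels."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float))**2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)
```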
Fig. 19 3-D SIMS image: the whole image set consists of 64 2-D images (256 x 256 pixels), each having 256 grey levels.
Fig. 20 3-D SIMS image compression at different compression ratios, where PSNR is the peak signal-to-noise ratio.
To compare 3-D wavelet compression with 2-D image compression methods, both test volumes were compressed using (a) 3-D wavelet compression; (b) a 2-D wavelet compression method; and (c) standard 2-D JPEG compression. More details about the results reported in this section can be found in [41]. The 2-D wavelet compression algorithm used was similar to the 3-D compression algorithm, except that the 2-D WT of each slice was computed. Multiple slices were compressed slice by slice with the 2-D method; the same applies to the compression using the JPEG algorithm. Fig. 21 shows the PSNR (Eq. (8)) versus the compression ratio of the 3-D and 2-D methods. We only show the results for one of the test volumes here; the results for the second volume are very similar. The compression ratio of the 3-D wavelet method is much higher than that of the 2-D wavelet method at comparable PSNR. For very low compression ratios the JPEG algorithm yields a slightly higher PSNR, but for higher compression ratios 3-D wavelet compression easily outperforms both 2-D methods. Fig. 22 presents the original and decompressed images of the first image set at a ratio of 1:32, using the 2-D and 3-D compression methods. At the same compression ratio, Fig. 22 shows very little difference between the 3-D wavelet decompressed image and the original, whereas both 2-D methods reveal some clearly visible artefacts.
Fig. 21 3-D SIMS image compression using two- and three-dimensional wavelet compression and JPEG compression.
Fig. 22 3-D SIMS image compression using two- and three-dimensional wavelet compression and JPEG compression: original image (top left); JPEG compression (top right); 2-D wavelet compression (bottom left); 3-D wavelet compression (bottom right).
Tables 2 and 3 in [41] summarise the results of applying the different wavelet filters at a compression ratio of 1:32. Three-dimensional SIMS images normally have different resolutions within a slice and between slices. As already mentioned in the de-noising section, Wang and Huang [17] used a separable 3-D WT for the compression of medical images and proposed applying a second wavelet filter bank in the slice direction to take into account the different correlations between the slices and within one slice. They concluded that, in general, this gives better results only if the distance between slices is much greater than the pixel distance within a slice. Since in the case of SIMS volumes the distance between slices is much smaller than the lateral resolution within a slice, the application of a different wavelet filter in the z direction should not yield better performance. Although the quantitative evaluation of the reconstruction we carried out does not prove this assumption, the results for a combination of different filters do not significantly differ from the results obtained when using the same filter in all directions.
Table 2. Advantages and disadvantages of EPMA and SIMS.

                EPMA                                            SIMS
Advantages      Good quantification and high resolution,       High local detection power (typically <1 µg/g),
                particularly for electron signals (SE, BSE).   all elements (also H), isotopes, suitable for
                                                               3-D imaging and 3-D image analysis.
Disadvantages   Low local detection power (typically 0.1%),    Poor quantification, expensive.
                poor depth resolution in X-ray analysis
                (typically 1 µm).

8
Feature extraction from analytical images
8.1 Edge detection

8.1.1 Basics

The first step in image analysis is usually image segmentation. Segmentation subdivides an image into its constituent parts or objects. Segmentation algorithms are generally based on one of the two primary properties of the image grey level values: discontinuity and similarity. According to John Russ [42], a feature in an image may be defined either as a region of contiguous pixels that share some property (similarity), or as a region inside some boundary or adjacent to some other feature (discontinuity). The approach of the second category is to partition the image depending on abrupt changes in some physical aspect of the image, such as intensity, colour or texture. The principal areas of interest within this category are the detection of isolated points, lines and edges. An edge is the boundary between two regions which have relatively distinct properties. Since edges represent the basic structure of an image, detecting edges is of profound importance for image analysis. Many edge detection techniques have been proposed, most of which look for local maxima of the image intensity gradients. If we consider noisy images,
however, it may be shown that the gradient magnitude method is sensitive to the variance of the noise [43]. Zero crossings and false edges (Fig. 23) are likely to be detected in noisy images: the larger the variance of the noise, the more false edges will be detected. This is a main obstacle when dealing with images captured by analytical techniques, as they are usually rather noisy and have a very low SNR. There exist many variations of standard edge detection algorithms designed to reduce the noise sensitivity. Smoothing may be applied prior to edge detection. Thresholding the local variance of the image leads to a significant decrease in the number of false edges, yet a severe problem of many of these classical methods is, as mentioned before, the loss of resolution. Mallat and Zhong [44] have proposed a multiscale edge representation of images based on the multiscale gradient maxima using the wavelet transform.
Fig. 23 Sensitivity of the gradient magnitude edge detector to noise: original image after edge detection (top); original image with added Gaussian noise (sigma = 20) after edge detection (bottom).
8.1.2 1-D gradient magnitude and Laplacian methods

How does an edge detector work? Let us first illustrate two standard edge detection algorithms, the gradient magnitude method and the Laplacian method, in the 1-D case. Let f(t) be a nonnegative function which has a continuous second derivative. If $|f'(t)|$ is very large, then f(t) is changing very rapidly. If $|f'(t)|$ is greater than a certain threshold, the point t is a candidate edge point. The gradient magnitude method is derived from this observation: the task of detecting all edges in a signal is to find all values of t for which $|f'(t)|$ is larger than a threshold. On the other hand, $|f'(t)|$ is large when it reaches a local extremum, i.e. when $f''(t)$ has a zero crossing. Therefore, the edges in a signal may also be detected by finding the zero crossings of $f''(t)$. This is the main idea behind the Laplacian method.
8.1.3 2-D gradient method

The 2-D counterpart of $f'(t)$ is the gradient of I(x,y), given by

$$\nabla I(x,y) = \vec{i}\,\frac{\partial I(x,y)}{\partial x} + \vec{j}\,\frac{\partial I(x,y)}{\partial y}. \qquad (12)$$
Detecting the edges in a 2-D image I(x,y) is equivalent to finding all values of (x,y) where the magnitude of $\nabla I(x,y)$ is greater than a certain threshold. For computer calculations, the directional gradient $\partial I(x,y)/\partial x$ may be replaced by one of the following discrete analogues:

$$I(x_m, y_n) - I(x_{m-1}, y_n), \qquad (13)$$

$$\frac{I(x_{m+1}, y_n) - I(x_{m-1}, y_n)}{2}, \qquad (14)$$

or

$$I(x_{m+1},y_{n+1}) + I(x_{m+1},y_n) + I(x_{m+1},y_{n-1}) - I(x_{m-1},y_{n+1}) - I(x_{m-1},y_n) - I(x_{m-1},y_{n-1}). \qquad (15)$$

The other directional gradient $\partial I(x,y)/\partial y$ can be discretised in a similar manner. From these discrete forms we are able to calculate the gradient magnitude at each image point and compare it with the threshold; thus, we can locate all edge points.
8.1.4 2-D Laplacian method

One possible generalisation of the second derivative to a 2-D function I(x,y) is the Laplacian $\nabla^2 I(x,y)$, given by

$$\nabla^2 I(x,y) = \frac{\partial^2 I(x,y)}{\partial x^2} + \frac{\partial^2 I(x,y)}{\partial y^2}. \qquad (16)$$

As mentioned above, detecting the edge points of an image is equivalent to finding all zero crossings of $\nabla^2 I(x,y)$. For numerical implementations, if we use Eq. (13) for $\partial I(x,y)/\partial x$, then for the discrete analogue of $\partial^2 I(x,y)/\partial x^2$ we have

$$I_x(x_{m+1}, y_n) - I_x(x_m, y_n) = I(x_{m+1}, y_n) - 2I(x_m, y_n) + I(x_{m-1}, y_n). \qquad (17)$$

From Eq. (17), and by discretising $\partial^2 I(x,y)/\partial y^2$ in a similar way, we obtain the following discrete form of the Laplacian $\nabla^2 I(x,y)$:

$$I(x_{m+1}, y_n) + I(x_{m-1}, y_n) + I(x_m, y_{n+1}) + I(x_m, y_{n-1}) - 4I(x_m, y_n). \qquad (18)$$

Many edge detection operators on the discrete image $I(x_m, y_n)$ can be expressed as a linear convolution of $I(x_m, y_n)$ with a convolution kernel c:

$$(c * I)(x_m, y_n) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} c(x_{m-k}, y_{n-l})\, I(x_k, y_l). \qquad (19)$$
For example, the convolution kernel which corresponds to the operator in Eq. (15) is

$$\begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix}, \qquad (20)$$

while the kernel for the Laplacian (Eq. (18)) is

$$\begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}. \qquad (21)$$
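Both operators can be applied with any discrete convolution routine. The sketch below uses scipy.ndimage.convolve; the sign convention of the gradient kernel follows the reconstruction of Eq. (20) above.

```python
import numpy as np
from scipy.ndimage import convolve

# Kernel of Eq. (20): directional gradient over a 3x3 neighbourhood.
GRADIENT_X = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]], dtype=float)

# Kernel of Eq. (21): the discrete Laplacian of Eq. (18).
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

image = np.zeros((64, 64)); image[16:48, 16:48] = 100.0  # a bright square

grad_x  = convolve(image, GRADIENT_X)  # Eq. (19) with kernel (20)
laplace = convolve(image, LAPLACIAN)   # Eq. (19) with kernel (21)

# Gradient-magnitude edges: threshold |grad|; Laplacian edges: zero
# crossings of the 'laplace' response.
```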
8.1.5 Edges of noisy images

If we consider noisy images, however, it may be shown that the gradient magnitude method is sensitive to the variance of the noise [43]. More precisely, the mean square of the gradient magnitude is proportional to the variance of the noise. Since each gradient magnitude is compared with a given threshold, and a point is considered to be an edge point only if the magnitude is larger than the threshold, it follows that the larger the variance of the noise, the more false edges will be detected (see Fig. 23).
The Laplacian method is also sensitive to noise; zero crossings and false edges are likely to be detected by the Laplacian edge detector. There exist many variations of both methods designed to reduce the noise sensitivity. In the case of the gradient magnitude method, smoothing may be applied prior to edge detection. For the Laplacian method, since many false edges are generated by even small perturbations, thresholding the local variance of the image leads to a significant decrease in the number of false edges.
8.1.6 Multiscale edge detection

In the late seventies, David Marr and his group at the Artificial Intelligence Laboratory at MIT created a unified framework for the science now called robotic vision [45]. One of their goals was to develop computer algorithms for object recognition similar to those in human vision. A fundamental question was how to distinguish the boundaries of blurred shadow regions and the contours of fine-detail regions present in the same image. Marr and Hildreth [46] studied the intensity changes of smoothed versions of an image at different scales and argued that these changes contain important information about the image. They suggested processing the original image with a low-pass filter at several different scales, and then finding the edges at each scale. The contours of various types of regions could be determined by examining the edge maps at different scales. Generally, the edges of large blurred regions can be found in the edge maps of large scales, while the edges of fine details can be found in those of small scales. This is the basic concept of multiscale edge detection. The low-pass filter proposed in [46] is the discretised Gaussian function defined by

$$h_{\sigma}(x,y) = e^{-(x^2+y^2)/2\pi\sigma^2}, \qquad (22)$$
where $\sigma$ is the scale parameter. For the smoothed image, Marr and Hildreth used the Laplacian method to detect edges at each scale. In their approach, however, the values of $\partial^2 (h_{\sigma} * I)(x,y)/\partial x^2$ and $\partial^2 (h_{\sigma} * I)(x,y)/\partial y^2$ were not saved. They conjectured that an image can be perfectly reconstructed from the multiscale edge locations, but they did not give an explicit reconstruction algorithm. This conjecture was disproved by Meyer [47]. Nevertheless, the question whether an image can be perfectly reconstructed if not only the edge locations but also some other information, such as the first or second derivatives at the edge locations, is used remained open.
A major disadvantage of the Gaussian function, when used as a smoothing kernel, is the fact that it does not have compact support in either the time/space or the frequency domain. This limitation makes its use undesirable in many image processing applications. Since the Gaussian function does not have compact support in the space domain, if the original image is noisy, the noise at each point will spread to many points after applying the Gaussian low-pass filter. Another disadvantage of the approach used by Marr and Hildreth is the detection of false edges in noisy images when utilising the Laplacian edge detector. In order to overcome these drawbacks, Stephane Mallat [48,44] proposed an edge detection scheme based on the modulus maxima of the image wavelet transform (see Fig. 24). Essentially, this new idea is a gradient magnitude method for multiscale edge detection, which uses wavelets rather than the Gaussian kernel as the filter at different scales.
8.1.7 Wavelets and multiscale edge detection

Stephane Mallat refined the conjecture by Marr and Hildreth described in the previous section. He conjectured that a signal in $L^2$ can be perfectly reconstructed from the locations of the modulus maxima of its wavelet transforms at different scales, together with the values of the wavelet coefficients at the modulus maxima locations [48]. This hypothesis was also disproved by Meyer [47] in its most general form. In spite of that, many experiments indicate that the reconstruction of a signal from the modulus maxima locations and values leads to an excellent approximation of the original signal [48]. Furthermore, Mallat and Zhong not only made this conjecture, but provided a reconstruction algorithm as well [48,49,44]. Below, we briefly outline the algorithm proposed by Mallat and Zhong in the one-dimensional case; more details can be found in [44]. Let us have a one-dimensional signal f(t) sampled as $f(t_n)$, where $n = 1, \ldots, N$ and $N = 2^m$. Here, the WT details of f(t) at scale $2^j$ are denoted by $W_{2^j}f$. We can apply Mallat's algorithm by making the following steps:
1. calculate the WT $\{(W_{2^j}f(t))_{j=1,2,\ldots,J};\ S_{2^J}f(t)\} = \{Hf,\ LHf,\ \ldots,\ L^{J-1}Hf,\ L^{J}f\}$ of the signal f(t), where $L^{j}$ denotes j successive applications of the low-pass filter L and H is the high-pass filter;

2. for each scale $2^j$, $1 \le j \le J$, and each location $t_n$, $1 \le n \le N$, determine whether the location is a modulus maximum location by comparing $W_{2^j}f(t_n)$ with $W_{2^j}f(t_{n+1})$ and $W_{2^j}f(t_{n-1})$. The point at location $t_n$ is an edge point at scale $2^j$ if

$$|W_{2^j}f(t_n)| > |W_{2^j}f(t_{n-1})| \quad \text{and} \quad |W_{2^j}f(t_n)| \ge |W_{2^j}f(t_{n+1})|,$$

or

$$|W_{2^j}f(t_n)| > |W_{2^j}f(t_{n+1})| \quad \text{and} \quad |W_{2^j}f(t_n)| \ge |W_{2^j}f(t_{n-1})|;$$

3. obtain an estimate $\widetilde{W}_{2^j}f(t_n)$ of $W_{2^j}f(t_n)$ for each n and j from the wavelet coefficients at the edge locations calculated in Step 2, by using interpolation;

4. obtain the estimate $\tilde{f}$ of f by taking the inverse WT of the $\widetilde{W}_{2^j}f(t_n)$ computed in Step 3.

Fig. 24 Multiscale edge detection: test image (top), size 256 x 256 pixels. Only the wavelet edge detector (middle) is able to detect all edges, while the Canny edge detector (bottom) misses some important features. The test image was kindly provided by Prof. J.M.H. du Buf, Vision Laboratory, Department of Electronics and Computer Science, University of Algarve.
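Step 2 is easy to implement directly. The sketch below detects the modulus maxima of one detail signal; it is not the Mallat-Zhong code, and border samples are simply skipped.

```python
import numpy as np

def modulus_maxima(detail):
    """A sample is a modulus maximum when |W f(t_n)| is not smaller
    than both neighbours and strictly larger than at least one of
    them (the two-sided condition of Step 2)."""
    a = np.abs(detail)
    left, mid, right = a[:-2], a[1:-1], a[2:]
    is_max = ((mid > left) & (mid >= right)) | ((mid > right) & (mid >= left))
    return np.flatnonzero(is_max) + 1  # offset for the trimmed border

detail = np.array([0.1, 0.9, 0.2, -0.3, -1.5, -0.2, 0.0])
print(modulus_maxima(detail))  # -> [1 4], the edge points at this scale
```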
It is difficult to estimate the approximation error between $\tilde{f}$ and f analytically. The primary approach to minimising this error is to minimise the error caused by the interpolation. Mallat and Zhong selected the interpolation function used in Step 3 based on the minimisation of the interpolation error in the Sobolev norm. In the numerical computations presented in this chapter, the same interpolation scheme was used. The extension of the above algorithm to two dimensions is straightforward. Let $\Phi(x,y)$ be a separable spline scaling function, which plays the role of a smoothing filter. We can construct two oriented wavelets by taking the partial derivatives

$$\psi^{1}(x,y) = \frac{\partial \Phi(x,y)}{\partial x} \quad \text{and} \quad \psi^{2}(x,y) = \frac{\partial \Phi(x,y)}{\partial y}. \qquad (23)$$
The 2-D dyadic WT of a function $I(x,y) \in L^2(\mathbb{R}^2)$ at scale $2^j$ is defined as in Eqs. (4) and (5), only the x and y directions have been swapped. In this section we keep the mathematical notation proposed by Mallat, so here the vertical details $D^{v}_{2^j}I(x,y)$ are denoted by $W^{1}_{2^j}I(x,y)$ and the horizontal details $D^{h}_{2^j}I(x,y)$ are denoted by $W^{2}_{2^j}I(x,y)$. Also note that in Sections 8.1.7 and 9.1.2 a redundant (non-decimated) wavelet transform is used, as opposed to the DWT used in the rest of the chapter. It can be shown that the 2-D WT defined in this way gives the gradient of I(x,y) smoothed by $\Phi(x,y)$ at the dyadic scales:

$$\mathbf{W}_{2^j}I(x,y) = \left(W^{1}_{2^j}I(x,y),\, W^{2}_{2^j}I(x,y)\right) = \frac{1}{2^j}\,\nabla\!\left(I * \Phi_{2^j}\right)(x,y). \qquad (24)$$

The multiscale gradient representation of I is complete. The wavelets used by Mallat and Zhong are not orthogonal; nevertheless, I may be recovered from $\mathbf{W}_{2^j}I(x,y)$ through the use of an associated family of synthesis wavelets (see [49,48] for details). If we want to locate the positions of rapid variation of I(x,y), we should consider the local maxima of the gradient magnitude at various scales. The gradient magnitude at scale $2^j$ is given by

$$\rho_{2^j}I(x,y) = \left\| \mathbf{W}_{2^j}I(x,y) \right\| = \sqrt{\left(W^{1}_{2^j}I(x,y)\right)^2 + \left(W^{2}_{2^j}I(x,y)\right)^2}. \qquad (25)$$
More precisely, a point (x,y) is a multiscale edge point at scale $2^j$ if the magnitude of the gradient $\rho_{2^j}I$ attains a local maximum there along the gradient direction $\theta_{2^j}I$, defined by

$$\theta_{2^j}I(x,y) = \arctan\!\left(\frac{W^{2}_{2^j}I(x,y)}{W^{1}_{2^j}I(x,y)}\right). \qquad (26)$$

For each scale, we can collect the edge points together with the corresponding values of the gradient at that scale. The resulting set of local gradient maxima at scale $2^j$ is

$$A_{2^j}(I) = \left\{ \left[(x_m,y_n);\ \mathbf{W}_{2^j}I(x_m,y_n)\right] :\ \rho_{2^j}I(x_m,y_n)\ \text{has a local maximum at}\ (x_m,y_n)\ \text{along the direction}\ \theta_{2^j}I(x_m,y_n) \right\}. \qquad (27)$$

For a J-level 2-D WT, the set

$$\left\{ S_{2^J}I(x,y),\ \left[A_{2^j}(I)\right]_{1 \le j \le J} \right\} \qquad (28)$$

is called a multiscale edge representation of the image I(x,y). Here $S_{2^J}I(x,y)$ is the low-pass approximation of I(x,y) at the coarsest scale $2^J$. Mallat and Zhong showed in [48] that an image can be effectively reconstructed from its multiscale edge representation (Eq. (28)) alone. Fig. 25 displays a SIMS micrograph of an Fe ion distribution. The multiscale edge representation of the image in Fig. 25 is shown in Fig. 26.
8.2 Wavelets for texture analysis
Texture can be defined as the set of local neighbourhood properties of the grey levels of an image. It includes intuitive properties like roughness, granularity and regularity. Texture analysis has proven to be a very important tool for image segmentation. A great variety of texture analysis methods have been proposed in the past [50,51]. Statistical features, structural methods, and more recently fractal models, wavelets, and Markov random fields have been used in texture analysis. The ability to effectively segment and classify images based on textural features is exploited in many areas such as remote sensing, medical image analysis, computer vision, etc. The objective of dividing an image into homogeneous regions remains a challenge, especially when the image is composed of both complex textures and smoother regions. Therefore, it is clearly very desirable to have some means of feature
Fig. 25 SIMS micrograph of an Fe ion distribution; image size: 128 x 128 pixels. The image is plotted on a log scale.
selection prior to segmentation. In this way, highly textured regions can be segmented using spatial frequency-based features, whereas smooth regions can be segmented using local grey level statistics, such as mean and variance. Such an approach was first suggested by Porter and Canagarajah [52]. Here, we will outline the scheme they have proposed. In [52] the wavelet transform was used both to analyse the image prior to segmentation, enabling feature selection, as well as to provide spatial frequency-based descriptors as features for segmenting textures (see Fig. 27). Smooth and textured images can easily be distinguished from each other by examining their wavelet transforms. Fig. 28 shows the 2-D discrete wavelet transforms of a smooth image (top) and a textured image (bottom). Both images have had their mean values subtracted prior to computing their wavelet transforms. The 2-D plots in Fig. 28 display the magnitudes of the wavelet coefficients across the various frequencies and orientations of the 2-D DWT. Smooth images have large wavelet coefficients only in the low frequencies, while textured images have large wavelet coefficients in a wide frequency/scale range. A three-level wavelet decomposition of a 2-D image results in 10 WT channels (Fig. 29). The channels are numbered from 1 to 10. The energy of each channel is calculated by finding the mean magnitude of the wavelet coefficients in this channel. The use of energy is sound and has an obvious physical interpretation. Moreover, it is additive and its total is conserved by
Fig. 26 The multiscale edge representation of the Fe SIMS image I(x,y) from Fig. 25. The first column displays $\{S_{2^j}I(x,y)\}_{1 \le j \le 4}$ (the low-pass approximation of I(x,y) at scale $2^j$). The scale increases from top to bottom. The second and third columns show $\{W^{1}_{2^j}I(x,y)\}_{1 \le j \le 4}$ and $\{W^{2}_{2^j}I(x,y)\}_{1 \le j \le 4}$. Black, grey, and white pixels stand for negative, zero, and positive values, respectively. The fourth column displays the wavelet modulus maxima $\{A_{2^j}I(x,y)\}_{1 \le j \le 4}$. Here, black pixels indicate zero values, whereas white ones correspond to maxima points. The multiscale edge representation of the image in Fig. 25, i.e. $\{S_{2^4}I(x,y), [A_{2^j}I(x,y)]_{1 \le j \le 4}\}$, consists of the bottom image in the first column and the four images in the fourth column.
the transformations. Several studies [53,54] investigate alternative measures, but no general conclusion in favour of a particular measure can be drawn from them. In order to successfully discriminate between smooth and textured images using the energy of the different channels of their wavelet transforms, Porter and Canagarajah [52] have further grouped the ten channels into low
Fig. 27 Image segmentation in the wavelet domain: original image -> 1) wavelet analysis -> 2) optimal feature selection -> 3) clustering -> segmented image.
Fig. 28 Typical 2-D WT of a smooth image (top) and a textured image (bottom). The low frequencies are displayed closest to the viewer.
(channels 1-4), middle (channels 5-7), and high (channels 8-10) frequency bands. They proposed the ratio $\mu$ of the energy in the low frequency channels (1-4) to the energy in the middle frequency channels (5-7), i.e.

$$\mu = \frac{\sum_{c=1}^{4} e_c}{\sum_{c=5}^{7} e_c}, \qquad (29)$$
Fig. 29 The 10 main channels of a three-level wavelet decomposition of a 2-D image.
as a criterion for optimal feature selection, where $e_c$ is the energy of channel number c of the DWT, given by

$$e_c = \frac{1}{N_x N_y} \sum_{x=1}^{N_x} \sum_{y=1}^{N_y} \left| w^{c}_{x,y} \right|, \qquad (30)$$

where $N_x \times N_y$ is the size of channel c and $w^{c}_{x,y}$ is a wavelet coefficient within channel c. If the ratio $\mu$ is above a certain threshold, i.e. $\mu > \tau$, the pixel (or block of pixels) is labelled as smooth; otherwise it is labelled as textured. After all the areas of the image have been labelled as either smooth or textured, different features are used for segmentation in different areas of the image. Local grey level statistics, such as mean and variance, were used in [52] to segment the smooth areas of the image. Textured regions were segmented using the energy values in the 10 channels of the DWT as features for segmentation. An example test image, made up of both smooth and textured regions taken from the Brodatz album [55], has been segmented using the algorithm above. The results are displayed in Fig. 30. More details about the segmentation procedure can be found in [52].
Fig. 30 Optimal feature segmentation: test image (top left); ground truth segmentation (top right); segmentation using only local grey level statistics as features (bottom left); segmentation using only wavelet features (bottom middle); segmentation using the automatic feature selection algorithm described in [52], where optimal features are used in different areas of the image. The images were kindly provided by Dr. Nishan Canagarajah, Department of Electrical and Electronic Engineering, University of Bristol.
coming from a wide variety of applications, e.g. medical image analysis (segmentation of ultrasonic images), remote sensing [57], and material science (characterisation of corrosion [58], segmentation of marble images [59]). In a growing number of areas, wavelet-based texture methods are being investigated and successfully used in practice.
9
Registration and fusion of analytical images
9.1 Image registration

9.1.1 The problem

The first step in studying an object or scene is gathering as much information as possible, either from one source under various conditions, or from different sources. Then we usually want to put all the available information together. In order to compare and evaluate the data acquired during the first step, a correspondence must be established between the respective data sets. Image registration is a fundamental task in various fields of science. Having a set of images (for simplicity, in this section we will
consider only the case of two images), we want to combine them in a meaningful way and eventually find out how similar they are. This is usually achieved by establishing a correspondence between important features of the two images. This general problem has various names in different fields: it is called image registration or image alignment in image processing, automatic fusion in computer vision and particularly in stereoscopy, section alignment in three-dimensional microscopy, etc. A vast number of substantially different matching techniques can be found in the literature [60,61]. All classical approaches to image registration can be loosely divided into the following basic algorithmic paradigms: algorithms that use image pixel values directly, e.g. correlation methods [62,63]; algorithms that work in the frequency domain, e.g. fast Fourier transform-based methods [64]; algorithms that use low-level features such as edges and corners, i.e. feature-based methods [60]; and algorithms that use high-level features such as objects or relations between features [60]. Here we will discuss the general problem of 2-D image registration. Let us have two images, the first of which we will name the reference image R(x,y) and the second the input image I(x,y). In many cases the input image can be considered a distorted or transformed 'copy' of the reference image, although it may contain very different information from the reference image. Often, it is very difficult, if not impossible, to establish a direct correspondence between the input image and the reference image. Therefore, what we need is to generate a new, corrected copy $\tilde{I}(x,y)$ of the input image, which can easily be compared and combined with the reference image. There are two major questions to be studied. The first one is how to determine the important features of both images. They vary substantially from one problem to another, common important features being edges, corners, region boundaries, line segments, texture, etc. Depending on the choice of features, different image processing operators can be applied in order to extract those features from the reference image and the input image. The second question is how to establish a correspondence between the two sets of selected features, and what kind of correspondence this should be. What we are looking for is a transform that will map the input image to the reference image, or more precisely the set of important features of I(x,y) to the set of important features of R(x,y), and will produce a new image $\tilde{I}(x,y)$. Unfortunately, a perfect fit is almost never possible or is very difficult to achieve. Therefore, a simplified problem is considered instead, where the transform has some specific form, e.g. it is a composition of translation,
rotation and scaling operators. Global distortions of the image can be described by much simpler transforms than local distortions. Often, when distinctive features cannot be determined, some artificial marks are introduced prior to image acquisition. These special marks, which are usually called control points, can easily be tracked down. The number and position of the control points $P_k$, $k = 1, \ldots, Q$, depend on the density of the features of interest and the transform type. The control points can either be automatically detected by a feature detection algorithm, or be manually selected using a pointing device. A major advantage of the manual selection of important features is that human beings can easily and very quickly determine a set of corresponding features. The principles behind this process have been investigated by numerous computer vision researchers [45,65]. A certain drawback, however, is that this process is very subjective and depends strongly on the operator and his experience. Here we will not go further in trying to explain how the human visual system matches features or scenes, but will rather illustrate how wavelet-based algorithms can be employed to register multimodality or multispectral analytical images. In the investigation of materials with a microstructure, it is useful to compare the two-dimensional distributions of features measured with complementary techniques at the same sample position. Let us consider the problem of matching 2-D EPMA and SIMS micrographs. Both of these analytical techniques have distinct advantages and disadvantages, which are shown in Table 2. The problem of matching images with chemical content has been studied only recently, and only a few papers have been published on this subject. Boehmig et al. [66] presented a five-parameter automatic matching algorithm for SAM, SIMS, and EPMA multispectral images. The parameters include global translation, rotation and scaling. An improved hierarchical version of this method is described in [67,68]; the multiscale representation used is an image pyramid. There are several problems encountered in the combination of SIMS and EPMA images. Due to the change of the sample holder, the images sometimes have different orientations. SIMS and EPMA have different projection functions, i.e. the projection of different concentrations to the resulting image intensity values is non-monotonic and non-linear. EPMA and SIMS micrographs also exhibit various artefacts, e.g. lateral distortion of the SIMS distributions [26]. Most of these distortions are both local and non-linear. All the above-
described problems have to be taken into account when looking for an image registration method which will allow the successful registration of 2-D SIMS and EPMA images. As already mentioned, among the features of profound importance in visual perception are the image edges. Indeed, when experiments were conducted, most of the important features selected by a human operator were points either lying on, or very close to, an edge. Corners, i.e. points where the curvature of the edge changes, were preferred locations as well. The main obstacle when dealing with analytical images is that they often have a very low SNR. In the presence of strong noise, even our eyes have great difficulty detecting some distinct features. Therefore, the use of image registration methods based on the multiscale edges of images significantly improves the matching process.
9.1.2 Registration of SIMS and EPMA images based on their wavelet transform maxima

Next, we will illustrate the application of the multiscale edge representation outlined in the previous section to the registration of SIMS and EPMA images. More technical details about this algorithm can be found in [69]. Our method is similar to the approaches described in [70,71] for the registration of remote sensing data. Both of these methods utilise the wavelet transform maxima as control points (reference points), but the two approaches differ in several respects: (a) the type of wavelet decomposition; (b) the search strategy, i.e. whether the transformation to obtain the registered image is computed globally over the whole image [71] or by matching the control points one by one [70]; and (c) the transformation type: an affine transformation in [71] and a polynomial transformation in [70]. The method used in this chapter is more similar to the algorithm presented in [70]; however, there are a number of differences. The most significant difference is that while Djamdji et al. [70] presented a procedure for automatic registration of satellite images, where the best match of the thresholded maxima of the wavelet transform is first found at the coarsest scale and is then further refined towards the finest scale, we have found that fully automatic feature detection is impossible in the case of EPMA and SIMS images, and have therefore proposed a method which relies on manual selection of important features. For comparison, the same micrographs of a chromium distribution obtained with EPMA and SIMS (see Fig. 31) as in [72,26] will be used.
Fig. 31 Chromium distribution obtained with EPMA and SIMS in a soldering layer system consisting of chromium/solder (Ni-Fe-Cr-Si-B)/steel: the EPMA image E(x,y) (left), size 256 x 256 pixels; the SIMS image S(x,y) (right), size 256 x 256 pixels.
We choose the reference image R(x,y) to be the SIMS image S(x,y) and the input image I(x,y) to be the EPMA image E(x,y); the corrected or registered image is then $\tilde{I}(x,y) = \tilde{E}(x,y)$. Fig. 32 displays the multiscale edges of E(x,y) and S(x,y). These edges depict the image discontinuities at different scales of investigation. The coarser the scale becomes, the smaller the number of detected edges gets. Scale $2^1$ is dominated by edges due to the noise. At scales greater than $2^3$, most of the fine image structures are already significantly distorted by the smoothing process. Therefore, we selected scales $2^2$ and $2^3$ as most suitable for the image matching; they present a trade-off between local feature matching and global image matching. A close-up of the low-pass approximations ($\{S_{2^j}E(x,y)\}_{j=2,3}$ and $\{S_{2^j}S(x,y)\}_{j=2,3}$) at scales $2^2$ and $2^3$ is shown in Fig. 33. The corresponding multiscale edges $\{A_{2^j}E(x,y)\}_{j=2,3}$ and $\{A_{2^j}S(x,y)\}_{j=2,3}$ are plotted in Fig. 34. Even at these two scales, it is still very difficult for the human eye to determine the important features and to define a one-to-one correspondence between the two sets of selected features. In order to reduce the number of edges and to preserve only the most pronounced discontinuities, the wavelet gradient maxima were thresholded. Fig. 35 illustrates the result of the thresholding procedure. Only wavelet maxima greater than a threshold $\tau_{2^j}$ are retained:
Fig. 32 Multiscale edge representation of the Cr distribution in Fig. 31. The first and second columns display $\{S_{2^j}E(x,y)\}_{1 \le j \le 4}$ and $\{S_{2^j}S(x,y)\}_{1 \le j \le 4}$, respectively. The scale increases from top to bottom. The third and fourth columns show the wavelet modulus maxima $\{A_{2^j}E(x,y)\}_{1 \le j \le 4}$ and $\{A_{2^j}S(x,y)\}_{1 \le j \le 4}$ of the EPMA and SIMS micrographs, respectively. Black pixels indicate zero values, whereas white ones correspond to maxima points.
$$\tilde{A}_{2^j}I(x,y) = \left\{ A_{2^j}I(x,y) > \tau_{2^j} \right\}. \qquad (31)$$
The threshold $\tau_{2^j}$ is different for each scale. After the thresholding, the number of edges is substantially diminished. It is now clear that the images were captured from the same sample position. Furthermore, various local distortions can be observed in both images. Hence, the transform which maps the EPMA micrograph to the SIMS micrograph must be non-linear, and it should take into account as many local feature differences as possible.
Fig. 33 $\{S_{2^j}I(x,y)\}_{j=2,3}$: the low-pass approximations at scales $2^2$ and $2^3$ of the images in Fig. 31. The two columns show the approximations of the EPMA and SIMS images, respectively. The scale increases from top to bottom.
In order to overcome these obstacles we have used an algorithm similar to the one reported in [26]. A weighted least squares transform with selectable order [72] is used to register the images. The corrected image is calculated on the basis of a one-to-one correspondence between two sets of control points $P_k$. The control points $\{P_k\}_{k=1,\ldots,Q}$, which lie on the image edges at scale $2^3$, are determined manually by an operator (see Fig. 36, top row). Scale $2^3$ was selected as optimal for the image registration. The results reported below were obtained by image registration at only one scale of the image decomposition. Multiscale image registration, similar to that in [67,70], where the best match is first found at the coarsest scale and is then refined towards the finest scale, results in improved performance in some cases when applied to EPMA and SIMS images, but at the cost of increased computation. Because of the inaccuracy of the pointing device, a special function was written to snap the selected point to the nearest wavelet gradient maximum.
Fig. 34 $\{A_{2^j}I(x,y)\}_{j=2,3}$: the wavelet gradient maxima at scales $2^2$ and $2^3$ of the images in Fig. 31. The first and second columns show the wavelet gradient maxima of the EPMA and SIMS micrographs. The scale increases from top to bottom. Here, black pixels correspond to wavelet maxima points.
The number of control points Q is usually chosen between 10 and 30. The more local distortions there are, the greater the number of control points should be. The least squares method is localised by introducing a weighting function $\omega_k$ which represents the contribution of the control point $P_k = (x_k, y_k)$ to the point (x,y), i.e.

$$\omega_k(x,y) = \frac{1}{\sqrt{\delta + (x - x_k)^2 + (y - y_k)^2}}. \qquad (32)$$

The parameter $\delta$ defines the influence of distant control points. As $\delta$ approaches zero, distant points have less influence and the approximating mapping function follows the local control points more closely. As $\delta$ becomes larger, it dominates the distance measurement and curtails the ability of the weighting function to discriminate between local and more distant control points. This algorithm has proved to be very efficient, since it calculates the polynomial coefficients for each pixel individually using the weighting function $\omega_k$, and
Fig. 35 $\{A_{2^j}I(x,y) > \tau_{2^j}\}_{j=2,3}$: the thresholded wavelet gradient maxima at scales $2^2$ and $2^3$ of the images in Fig. 31. The first and second columns display the thresholded multiscale edges of the EPMA and SIMS micrographs. The scale increases from top to bottom. Black pixels correspond to wavelet maxima points. The corresponding thresholds at scales $2^2$ and $2^3$ are $\tau_{2^2} = 12$ and $\tau_{2^3} = 3.5$ for the EPMA image, and $\tau_{2^2} = 5$ and $\tau_{2^3} = 2$ for the SIMS image, respectively.
therefore is able to correct local non-linear distortions very well. A major drawback of the algorithm, however, is that it requires a lot of computation. The computational complexity of the weighted least squares algorithm is O(NQK^3), where N is the number of pixels in the reference image, Q is the number of control points, and K = (n + 1)(n + 2)/2 is the number of polynomial coefficients for a polynomial of order n. Nevertheless, the quality of the fit is often worth the extensive computations. For the registration result displayed in Fig. 36, a polynomial of order n = 3 was used. The bottom row of Fig. 36 displays the corrected EPMA image $\tilde{E}(x,y)$. After the two images have been registered, the information gathered by the two analytical techniques can be visually compared, and eventually combined, in a straightforward manner. Actually, it would be much better to swap the two images, making the SIMS image S(x,y) the input image and the EPMA image E(x,y) the reference image, because the EPMA image is much less distorted than the SIMS image.
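A sketch of the per-pixel weighted fit is given below. It follows Eq. (32) and the complexity estimate above, but omits the refinements of the published algorithm [72]; the control points used in the demo are synthetic.

```python
import numpy as np

def poly_terms(x, y, order):
    """All monomials x^i * y^j with i + j <= order
    (K = (order + 1)(order + 2)/2 terms)."""
    return np.array([x**i * y**j
                     for i in range(order + 1)
                     for j in range(order + 1 - i)], dtype=float)

def weighted_lsq_warp(points_in, points_ref, shape, order=3, delta=1e-3):
    """For every output pixel, fit a local polynomial mapping from the
    reference frame to the input image, weighting each control point
    pair by 1/sqrt(delta + squared distance), as in Eq. (32).
    Requires Q >= K control points; cost is O(N Q K^2) and up."""
    A = np.array([poly_terms(x, y, order) for x, y in points_ref])
    warp = np.zeros(shape + (2,))
    for yy in range(shape[0]):
        for xx in range(shape[1]):
            d2 = ((points_ref - (xx, yy))**2).sum(axis=1)
            w = 1.0 / np.sqrt(delta + d2)                 # Eq. (32)
            coef, *_ = np.linalg.lstsq(A * w[:, None],
                                       points_in * w[:, None], rcond=None)
            warp[yy, xx] = poly_terms(xx, yy, order) @ coef
    return warp  # input-image coordinates for each output pixel

rng = np.random.default_rng(6)
ref_pts = rng.uniform(0, 32, size=(12, 2))                 # points in R
in_pts = ref_pts + rng.normal(0, 0.5, size=(12, 2))        # points in I
warp = weighted_lsq_warp(in_pts, ref_pts, shape=(32, 32), order=3)
```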
Fig. 36 The registered (matched) EPMA and SIMS images of the Cr distribution in Fig. 31: $\{A_{2^3}E(x,y) > 3.5\}$, the thresholded wavelet gradient maxima of the EPMA image at scale $2^3$ (top left); $\{A_{2^3}S(x,y) > 2\}$, the thresholded wavelet gradient maxima of the SIMS image at scale $2^3$ (top right); the EPMA image E(x,y) (middle left); the SIMS image S(x,y) (middle right); the corrected (transformed) EPMA image $\tilde{E}(x,y)$ (bottom right). The dark crosses mark the positions of the control points $\{P_k\}_{k=1,\ldots,Q}$; in this case Q = 15. The selection of the control points was made manually using the top row images.
Again, the same method as above can be used to compute the corrected SIMS image $\tilde{S}(x,y)$ and to compare and combine it with E(x,y).

9.2 Image fusion

9.2.1 Basics
The successful fusion of images acquired from different modalities or instruments is of great importance in many applications, such as medical imaging, microscopic imaging, remote sensing, computer vision, and robotics.
536
Image fusion can be defined as the process by which several images, or some of their features, are combined together to form a single image. Let us consider the case where we have only two original images I1 and I2, which have the same size and which are already aligned using some image registration algorithm. Our aim is to combine the two input images into a single fused image I. Image fusion can be performed at different levels of the information representation. Four different levels can be distinguished according to [73], i.e. signal, pixel, feature and symbolic levels. To date, the results of pixel level image fusion in areas such as remote sensing and medical imaging are primarily intended for presentation to a human observer for easier and enhanced interpretation. Therefore, the visual perception of the fused image is of paramount importance when evaluating different fusion schemes. In the case of pixel level fusion, some generic requirements can be imposed on the fusion result: (a) the fused image should preserve, as closely as possible, all relevant information contained in the input images; (b) the fusion process should not introduce any artefacts or inconsistencies, which can distract or mislead the human observer, or any subsequent image processing steps [74]; (c) in the fused image, irrelevant features and noise should be suppressed to a maximum extent. Pixel level fusion algorithms vary from very simple, e.g. image averaging, to more complex, e.g. Principal Component Analysis (PCA), pyramid based image fusion and wavelet transform fusion. Several approaches to pixel level fusion can be distinguished, depending on whether the images are fused in the spatial domain (spatial domain image fusion), or they are transformed into another domain, and their transforms are fused (frequency domain image fusion or image transform fusion). After the fused image is generated, it may be further processed and some features of interest may be extracted.
9.2.2 Wavelet transform image fusion

The general idea of all wavelet based image fusion schemes is that the wavelet transforms of the two registered input images I1 and I2 are computed, and these transforms are combined using some kind of fusion rule χ (see Fig. 37). Then the inverse wavelet transform W_{2^J}^{-1} is computed, and the fused image I is reconstructed as

I = W_{2^J}^{-1}(χ[W_{2^J}I1, W_{2^J}I2]),   (33)

where W_{2^J}I1 and W_{2^J}I2 are the wavelet transforms of the two input images. The fusion rule χ is actually a set of fusion rules χ = {χ_j^c}, which define the fusion of each pair of corresponding channels c for each band j of the wavelet transforms (Fig. 38).
[Fig. 37 diagram: the registered input images are decomposed by the DWT, the wavelet coefficients are fused, and the inverse DWT of the fused coefficients gives the fused image.]
Fig. 37 Fusion of the wavelet transforms of two 2-D images.
[Fig. 38 diagram: the HL, LH and HH channels of each band of I1 and I2, and the corresponding channels of their LL sub-images (HL(LL), LH(LL), HH(LL)), are paired for fusion.]
Fig. 38 Fusion of the different bands and channels of the WT of two 2-D images.
The wavelet transform offers several advantages over similar pyramid-based techniques when applied to image fusion:
(a) the wavelet transform is a more compact representation than the image pyramid, which becomes especially important for the fusion of 3-D and 4-D images: the size of the WT is the same as the size of the image, whereas the size of a Laplacian pyramid, for instance, is 4/3 of the size of the image (see the worked sum after this list);
(b) the wavelet transform provides directional information, while the pyramid representation does not introduce any spatial orientation into the decomposition process [75];
(c) in pyramid based image fusion, the fused images often contain blocking effects in the regions where the input images differ significantly; no such artefacts are observed in comparable wavelet based fusion results [75];
(d) images generated by wavelet image fusion have a better SNR than images generated by pyramid image fusion when the same fusion rules are used [76], and according to [75,76] wavelet fused images are also better perceived when subjected to human analysis.
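The 4/3 figure quoted in (a) follows from summing the geometric series of successively quarter-sized levels of a Laplacian pyramid built on an N-pixel image:

N (1 + 1/4 + 1/16 + ...) = N Σ_{k=0}^{∞} 4^{-k} = (4/3) N,

whereas the decimated wavelet transform stores exactly N coefficients.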
A number of wavelet based techniques for the fusion of 2-D images [71,74-78] and of 3-D images [79] have been proposed in the literature. Some of the fusion rules that can be used to combine the wavelet coefficients of two images are presented below (a sketch of rules 1-3 and 5 follows the list):
1. fusion by averaging [71] - for each band of the decomposition and for each channel the wavelet coefficients of the two images are averaged;
2. fusion by maximum [75,71,74] - for each band of the decomposition and for each channel, the wavelet coefficient with the larger absolute value is kept;
3. high/low fusion [71] - the high frequency information is kept from one of the images, while the low frequency information is taken from the other;
4. composite fusion [79] - various combinations of the different channels and bands of W_{2^J}I1 and W_{2^J}I2 are composed;
5. de-noising and fusion [79] - the wavelet coefficients of the high frequency channels are thresholded, by either hard or soft thresholding, and are then combined using one of the other fusion rules;
6. fusion of the WT maxima [79] - the WT maxima (the multiscale edges of the image) can be linked to construct graphs, and these graphs can be combined instead of combining all the wavelet coefficients. This fusion technique is based on the multiscale edge detection results reported in [44].
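The following is a minimal sketch of rules 1-3 and 5 for 2-D images, assuming NumPy and the PyWavelets package rather than the software used in this chapter; the function name, the weight alpha and the threshold t are illustrative assumptions.

import numpy as np
import pywt

def fuse_images(im1, im2, rule="max", wavelet="db4", level=3, alpha=0.5, t=None):
    # Fuse two registered, equally sized images in the wavelet domain (cf. Eq. (33)).
    c1 = pywt.wavedec2(im1, wavelet, level=level)
    c2 = pywt.wavedec2(im2, wavelet, level=level)

    def chi(d1, d2, lowpass):
        if t is not None and not lowpass:           # rule 5: de-noise the details first
            d1 = pywt.threshold(d1, t, mode="soft")
            d2 = pywt.threshold(d2, t, mode="soft")
        if rule == "average":                       # rule 1: (weighted) averaging
            return alpha * d1 + (1.0 - alpha) * d2
        if rule == "max":                           # rule 2: keep the larger-magnitude coefficient
            return np.where(np.abs(d1) >= np.abs(d2), d1, d2)
        if rule == "highlow":                       # rule 3: highs from im1, lows from im2
            return d2 if lowpass else d1
        raise ValueError("unknown fusion rule: %s" % rule)

    fused = [chi(c1[0], c2[0], lowpass=True)]        # the coarse (LL) band
    for b1, b2 in zip(c1[1:], c2[1:]):               # the (LH, HL, HH) channels of each band j
        fused.append(tuple(chi(d1, d2, lowpass=False) for d1, d2 in zip(b1, b2)))
    return pywt.waverec2(fused, wavelet)

With rule "average" and alpha = 0.5 this reduces to plain averaging; alpha = 0.75 would correspond to the weighted averaging shown in Fig. 41.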
Several other 2-D WT image fusion algorithms based on principles of visual perception have been published, e.g. fusion using an area-based selection rule with consistency verification [75], or contrast sensitivity fusion [76]. Since these methods have been designed specifically to enhance the perception and interpretation of fused 2-D images, their three-dimensional analogues are difficult to construct. An extensive study of the application of wavelet transform fusion to 2-D SIMS images can be found in [80]. Here, only two examples of wavelet transform image fusion are included. In each case three 2-D SIMS images, i.e. a multispectral SIMS image, are combined to produce a single image (see Figs. 39 and 40). In both figures the input images display a steel alloy containing about 1% Al. The alloy was produced by hot isostatic pressing and is intended to show better high-temperature oxidation resistance, as well as a more homogeneous distribution of all alloy compounds, compared to conventionally treated materials. In Fig. 40, weighted averaging in the wavelet domain yields a better combination of the input images than simple averaging in the spatial domain.
Fig. 39 Wavelet transform fusion of 2-D SIMS images. Input images at location A: mass 12 (C) (top left); mass 43 (AlO) (top middle); mass 68 (CrO) (top right). The fused image, produced by fusion by averaging in the wavelet domain (Daubechies 4 wavelet), is shown below. The negatives of the input images and the fused image are displayed. All three input images are of a steel alloy produced by hot isostatic pressing and containing about 1% Al. Measurement parameters for all three input images: a Cs+ primary ion beam (primary energy: 5.5 kV, primary ion current: 2 pA) was applied to sputter the sample; diameter of the imaged surface area: 150 μm. The images shown in this figure and in Fig. 40 were kindly provided by Thomas Stubbings, Vienna University of Technology.
Some of the fusion rules proposed above have been compared visually in [80] to find the best fusion results for a large class of multispectral SIMS images. Fig. 41 shows the fusion of two 3-D phantom images, i.e. a solid cube and a 3-D texture consisting of a grid of lines parallel to the three axes. Cross-sections of the input images and the fused image are also displayed in Fig. 41.

Some multimodality or multispectral images are made up of both smooth and textured regions. Such images can be segmented into smooth and textured regions by analysing their wavelet transforms, as already described in this chapter, and a different fusion rule can then be used for each pair of regions to be combined (i.e. smooth with smooth, smooth with textured, textured with textured).
Fig. 40 Wavelet transform fusion of 2-D SIMS images. Input images at location B: mass 26 (CN) (top left); mass 43 (AlO) (top middle); mass 68 (CrO) (top right). Two fused images, produced by averaging in the spatial domain (bottom left) and by weighted averaging in the wavelet domain (Daubechies 4 wavelet) (bottom right), are also shown. The negatives of the input images and the fused images are displayed. All three input images are of the same steel alloy as in Fig. 39; only a different location is imaged. The measurement parameters are the same as for the images in Fig. 39.
A very important advantage of 3-D WT image fusion over alternative image fusion algorithms is that it may be combined with other 3-D image processing algorithms working in the wavelet domain, such as 'smooth versus textured' region segmentation [52]; 3-D image registration based on features extracted in the wavelet domain; volume compression [3,4], where only a small part of all the wavelet coefficients is preserved; and volume rendering [3,4], where the volume rendering integral is approximated using multiresolution spaces. The integration of 3-D WT image fusion into the broader framework of 3-D WT image processing and visualisation will bring new benefits to the research community and will show the full potential of wavelet based methods in volumetric image analysis.
10 Computation and wavelets
All computations were performed on Pentium II PCs running Windows 95/98/NT, or on a Silicon Graphics (SGI) Power Challenge L computer and an SGI O2 computer running IRIX 6.2. A number of standard MATLAB [13] toolboxes, e.g. the Statistics Toolbox, the Image Processing Toolbox, and the Fuzzy Logic Toolbox, were used to obtain some of the results in this chapter. Several wavelet toolboxes for MATLAB were also used, including WaveLab
Fig. 41 Wavelet transform (Daubechies 4 wavelet) fusion of two 3-D phantom images: first input image (top left) and second input image (top right). Cross-sections at y = 32: first input image (second left), second input image (second right); fused image using averaging (third left); fused image using weighted averaging (α = 0.75) (third right); fused image using maximum (bottom left); fused image using the high frequency wavelet coefficients of the second image and the low frequency wavelet coefficients of the first image (bottom right). In all cases the fusion was done in the wavelet domain. Volume size: 64 x 64 x 64 voxels.
[81] and WaveBox [82]. In addition, the Wavelet Workbench [83] toolbox for IDL, which is actually WaveLab ported to IDL, and the Wavelet Fusion Toolbox for IDL [84], were used to do some of the calculations on SGI computers. The SOM package [85], originally written by Teuvo Kohonen
and co-workers, was ported to Windows by the authors and used to classify some of the images. A library of functions for the calculation of scatter diagrams and statistical measures was written by the authors. Some of the figures in this chapter were generated by two image processing programs, MIE [86] and X-image [87], which were developed by the authors. Some of the filtering algorithms were implemented in C and C++ as stand-alone programs. The quantisation and entropy coding algorithms used for the wavelet compression of 2-D and 3-D SIMS images were adapted from the Wavelet Image Compression Construction Kit written by Geoff Davis. Finally, the wave2 program [88], developed by Mallat et al., was used on SGI machines to compute the wavelet transform maxima (multiscale edges) of the SIMS and EPMA images presented in this study. Various mother wavelets (e.g. Haar; Daubechies 4, 6, 8, 12 and 20; Symmlet 6 and 8; Coiflet 2 and 3; Villasenor 1 to 5; Antonini; spline wavelets) and different levels of decomposition were employed in the computations.
11 Conclusions
In this chapter we have summarised some recent results concerning the application of wavelet analysis and processing to 2-D and 3-D analytical images. We have introduced the reader to the theory of the 2-D and 3-D separable discrete wavelet transform, and several applications of algorithms based on the wavelet transform have been described: (a) wavelet de-noising of 2-D and 3-D SIMS images; (b) improvement of analytical image classification by means of wavelet de-noising; (c) compression of 2-D and 3-D analytical images; (d) feature extraction from analytical images, i.e. edge detection and texture analysis in the wavelet domain; (e) registration of images obtained by complementary techniques (EPMA and SIMS), based on their wavelet transform maxima (multiscale edges); (f) wavelet fusion of multispectral SIMS images. Throughout the chapter, parallels to applications of wavelet techniques in microscopy and medical imaging have been drawn and some ideas for future research have been outlined. These successful new applications clearly show that wavelet methods have already found their place in analytical chemistry in general, and in analytical image processing in particular. In the near future, the integration of some of the above-described techniques within the general framework of wavelet analysis will certainly lead to improved interpretation and quantification of analytical images.
12 Acknowledgements
The research presented in this chapter was supported by the Austrian Scientific Research Council (projects S5902, S6205), the Jubilee Fund of the Austrian National Bank (project 6176), and the Austrian Society for Microelectronics. The authors would like to thank Wen Liang Hwang, Stéphane Mallat, and Sifen Zhong, from New York University, for designing, writing and providing the wave2 program; David Donoho and his team from Stanford University, for writing and providing the WaveLab toolbox for MATLAB; Amara Graps and Research Systems Inc., for porting WaveLab to IDL; and Geoff Davis, for providing the Wavelet Image Compression Construction Kit. Special thanks also go to Hans du Buf, Nishan Canagarajah, and Thomas Stubbings for some of the images used in the chapter.
13 Online resources
The following is a list of online resources on wavelets, wavelets and image processing, and wavelet analysis of analytical images. The list is by no means complete, or even comprehensive; its aim is to provide the reader of this chapter with a good starting point for further investigations in the field of wavelet analysis and processing of analytical images. This collection of online resources was compiled at the time of publication of the book. A more up-to-date version of the list can be found at: www.iac.tuwien.ac.at/webfac/WWW/waveletbook.html
13.1 Wavelets and wavelet analysis
www.wavelet.org (The Wavelet Digest - a free monthly newsletter edited by Wim Sweldens)
www.amara.com/current/wavelet.html (Amara's wavelet page)
www.mat.sbg.ac.at/~uhl/wav.html (Andreas Uhl's wavelet page)
www.mame.syr.edu/faculty/lewalle/tutor/tutor.html (Jacques Lewalle's tutorial on wavelet analysis of experimental data)
paos.colorado.edu/research/wavelets (Christopher Torrence's practical guide to wavelet analysis)
www.mathsoft.com/wavelets.html (the MathSoft wavelet resources page, with links to many reprints available as PostScript files on the Internet)
13.2 Wavelets and image processing
www.multiresolution.com (Image Processing and Data Analysis: The Multiscale Approach (book) + MR/1 software)
www.cs.nyu.edu/cs/faculty/mallat (Stéphane Mallat's group at New York University)
www.summus.com/publish/wavelets/wavelet.htm (Summus' image compression technology using wavelets)
research.microsoft.com/~geoffd (the Wavelet Image Compression Construction Kit)
zeus.ruca.ua.ac.be/VisionLab/wta/wta.html (wavelets for texture analysis)
vivaldi.ece.ucsb.edu/projects/registration/registration.html (multisensor image fusion using the wavelet transform)
www.eecs.lehigh.edu/~zhz3/zhz3.htm (Zhong Zhang's web page about registration and fusion of multisensor images using the wavelet transform)
www.fen.bris.ac.uk/elec/research/ccr/imgcomm/fusion.html (image fusion using a 3-D WT)
13.3 Wavelet analysis and processing of analytical images
www.iac.tuwien.ac.at/webfac/WWW/cobac.html (wavelet processing of 2-D and 3-D analytical images)
www.iac.tuwien.ac.at/~cmitter/wavebib.html (Christian Mittermayr's bibliography on wavelets in analytical chemistry)
About the authors

Stavri Nikolov was born in Sofia, Bulgaria, in 1967. At present he is a
researcher in medical imaging at the Image Communications Group, Centre for Communications Research, University of Bristol, UK. He obtained an MSc in computer science from 'St. Kliment Ohridski' Sofia University and a PhD (Microscopy Image Processing) from Vienna University of Technology in 1992 and 1996, respectively. His research interests include wavelet image analysis, medical image analysis, image fusion, and volumetric data processing and visualisation. He is currently developing new algorithms for the fusion of medical images and for automatic navigation in volumetric images. He is a member of the British Machine Vision Association and an associate member of the IEEE.

Martin Wolkenstein was born in Vienna, Austria, in 1971. He received his
diploma and PhD degrees in chemistry from Vienna University of Technology in 1996 and 1998, respectively. In 1995 he joined the Physical Analysis Group at the Institute of Analytical Chemistry, Vienna University of Technology, where he worked on the application of various image processing techniques, i.e. neural networks, fuzzy logic and wavelets, to analytical data. His main research interests are 3-D microscopy, imaging, image processing, and data visualisation. He is currently completing his military service at the NBC defence school in Vienna, Austria.

Herbert Hutter was born in Bregenz, Austria, in 1962. He received his
diploma in physics and his PhD degree in analytical chemistry from Vienna University of Technology in 1985 and 1990, respectively. Since 1992 he has been the head of the Physical Analysis Group at the Institute of Analytical Chemistry, Vienna University of Technology. His main field of research is the characterisation of trace element distributions in materials with Secondary Ion Mass Spectrometry (SIMS). In addition, he and his group have contributed to various methods for the data analysis of multispectral three-dimensional distributions. His main research interests are materials science, 3-D microscopy, and image processing. Current projects, in close cooperation with several industrial partners, investigate the influence of trace elements on material characteristics.
References

1. W.M. Lawton and H.L. Resnikoff, Multidimensional Wavelet Bases, Aware Report AD910130, Aware Inc., Cambridge, MA, 1991.
2. W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge University Press, Cambridge, UK, 1992.
3. S. Muraki, Approximation and Rendering of Volume Data Using Wavelet Transforms, Proc. Visualization '92, 1992, pp. 21-28.
4. M. Gross, Visual Computing, Springer-Verlag, Berlin, 1994.
5. H. Hutter and M. Grasserbauer, Three-Dimensional Stereometric Analysis of Materials with SIMS, Mikrochimica Acta, 107 (1992), 137-148.
6. D.L. Donoho, Nonlinear Wavelet Methods for Recovery of Signals, Densities and Spectra from Indirect and Noisy Data, in: Different Perspectives on Wavelets, Proceedings of Symposia in Applied Mathematics, American Mathematical Society, 47 (1993), 173-205.
7. D.L. Donoho and I. Johnstone, Ideal Spatial Adaptation Via Wavelet Shrinkage, Technical Report, Department of Statistics, Stanford University, Stanford, 1992.
8. S.G. Nikolov, H. Hutter and M. Grasserbauer, De-Noising of SIMS Images via Wavelet Shrinkage, Chem. and Intell. Lab. Systems, 34 (1996), 263-273.
9. M.G. Wolkenstein and H. Hutter, De-noising Secondary Ion Mass Spectrometry Image Sets Using a Three-Dimensional Wavelet Transformation, submitted to Analytical Chemistry (May 1998).
10. M. Wolkenstein, Optimization of the Visualization Process for Three-Dimensional Analytical Data, PhD Thesis, Vienna University of Technology, Vienna, Austria, 1998.
11. F.J. Anscombe, The Transformation of Poisson, Binomial and Negative-Binomial Data, Biometrika, 35 (1948), 246-254.
12. J.-L. Starck, F. Murtagh and A. Bijaoui, Image Processing and Data Analysis: The Multiscale Approach, Cambridge University Press, Cambridge, UK, 1998.
13. MATLAB 5.0, The MathWorks Inc., Natick, MA.
14. C.G. Enke and T.A. Nieman, Signal-to-Noise Ratio Enhancement by Least-Squares Polynomial Smoothing, Analytical Chemistry, 48 (1976), 705a-712a.
15. C.R. Mittermayr, S.G. Nikolov, H. Hutter and M. Grasserbauer, Wavelet Denoising of Gaussian Peaks: A Comparative Study, Chem. and Intell. Lab. Systems, 34 (1996), 187-202.
16. M. Wolkenstein, H. Hutter, S.G. Nikolov, I. Schmitz and M. Grasserbauer, Comparison of Wavelet Filtering with Established Techniques for EPMA Image De-Noising, J. Trace and Microprobe Techniques, 15 (1) (1997), 33-49.
17. J. Wang and H.K. Huang, Medical Image Compression by Using Three-Dimensional Wavelet Transformation, IEEE Trans. Med. Imag., 15 (4) (1996).
18. T. Cover and P. Hart, Nearest Neighbor Pattern Classification, IEEE Transactions on Information Theory, IT-13 (1967), 21-27.
19. T. Kohonen, Self-Organization and Associative Memory, Third Edition, Springer-Verlag, Berlin, 1989.
20. J.C. Bezdek and P.F. Castelaz, Prototype Classification and Feature Selection with Fuzzy Sets, IEEE Transactions on Systems, Man, and Cybernetics, SMC-7 (1977), 87-92.
21. D.S. Bright, D.E. Newbury and R.B. Marinenko, Concentration-Concentration Histograms: Scatter Diagrams Applied to Quantitative Compositional Maps, in: Microbeam Analysis (D.E. Newbury, Ed), San Francisco Press Inc., 1988.
22. D.S. Bright and D.E. Newbury, Concentration Histogram Imaging, Analytical Chemistry, 63 (4) (1991), 243-250.
23. M.M. El Gomati, D.C. Peacock, M. Prutton and C.G. Walker, Scatter Diagrams in Energy Analysed Digital Imaging: Application to Scanning Auger Microscopy, Journal of Microscopy, 147 (1987), 149-158.
24. S.D. Boehmig and B.M. Reichl, Segmentation and Scatter Diagram Analysis of Scanning Auger Images - A Critical Comparison of Results, Fresenius Journal of Analytical Chemistry, 346 (1993), 223-226.
25. D.E. Newbury and D.S. Bright, Concentration Histogram Images: A Digital Imaging Method for Analysis of SIMS Compositional Maps, in: Secondary Ion Mass Spectrometry (SIMS VII) (A. Benninghoven et al., Eds), Wiley, NY, 1990, pp. 929-933.
26. H. Hutter and M. Grasserbauer, Chemometrics for Surface Analysis, Chemometrics and Intelligent Laboratory Systems, 24 (1994), 99-116.
27. C. Latkoczy, H. Hutter and M. Grasserbauer, Classification of SIMS Images, Mikrochim. Acta, 352 (1995), 537-543.
28. M. Wolkenstein, H. Hutter, C. Mittermayr, W. Schiesser and M. Grasserbauer, Classification of SIMS Images Using a Kohonen Network, Analytical Chemistry, 69 (1997), 777ff.
29. M.G. Wolkenstein, H. Hutter, S.G. Nikolov and M. Grasserbauer, Improvement of SIMS Image Classification by Means of Wavelet De-Noising, Fresenius J. Analytical Chemistry, 357 (1997), 783-788.
30. S.G. Nikolov, M.G. Wolkenstein, H. Hutter and M. Grasserbauer, Improving Image Classification by De-Noising, submitted to Mikrochim. Acta.
31. A.P. Witkin, Scale-Space Filtering, Proceedings of the 8th International Joint Conference on Artificial Intelligence, 1983, pp. 1019-1022.
32. T. Lindeberg, Scale-Space: A Basis for Early Vision, Technical Report CVAP 120, Royal Institute of Technology, Stockholm, Sweden, 1989.
33. M. Wickerhauser, Adapted Wavelet Analysis from Theory to Software, IEEE Press, 1994.
34. J.N. Bradley, C.M. Brislawn and T. Hopper, The FBI Wavelet/Scalar Quantization Standard for Gray-Scale Fingerprint Image Compression, in: SPIE Proc. Visual Info. Processing II, Orlando, FL, 1992, pp. 293-304.
35. T. Hopper, C.M. Brislawn and J.N. Bradley, WSQ Gray-Scale Fingerprint Image Compression Specification, Technical Report IAFIS-IC-0110-v2, FBI, 1993.
36. T. Hopper, Compression of Gray-Scale Fingerprint Images, in: SPIE Proc. Wavelet Applications, Orlando, FL, Vol. 2242, 1994, pp. 180-187.
37. N. Jayant and P. Noll, Digital Coding of Waveforms, Prentice Hall, Englewood Cliffs, NJ, 1984.
38. D. Taubman and A. Zakhor, Multirate 3-D Subband Coding of Video, IEEE Transactions on Image Processing, 3 (5) (1994).
39. T. Bell, J. Cleary and I. Witten, Text Compression, Prentice Hall, 1990.
40. M. Wolkenstein, H. Hutter and M. Grasserbauer, Visualization of N-Dimensional Analytical Data on Personal Computers, Trends in Analytical Chemistry, 17 (3) (1998), 120-128.
41. M.G. Wolkenstein and H. Hutter, Compression of Secondary Ion Microscopy Image Sets Using a Three-Dimensional Wavelet Transformation, submitted to Microscopy and Microanalysis (March 1999).
42. J.C. Russ, Computer-Assisted Microscopy: The Measurement and Analysis of Images, 1st Edition, Plenum Press, New York, NY, 1990.
43. B. Lin, Wavelet Phase Filter for Denoising in Tomographic Image Reconstruction, PhD Thesis, Illinois Institute of Technology, 1994.
44. S. Mallat and S. Zhong, Characterisation of Signals from Multiscale Edges, IEEE Transactions on PAMI, 14 (7) (1992), 710-732.
45. D. Marr, Vision, W.H. Freeman and Company, New York, 1982.
46. D. Marr and E. Hildreth, Theory of Edge Detection, in: Proc. R. Soc. Lond., Vol. 207, 1980, pp. 180-217.
47. Y. Meyer, Wavelets: Algorithms and Applications, Society for Industrial and Applied Mathematics, Philadelphia, 1993.
48. S. Mallat and S. Zhong, Complete Signal Representation with Multiscale Edges, Robotics Report No. 219 483, Courant Institute of Mathematical Sciences, New York University, 1989.
49. S. Mallat and S. Zhong, Wavelet Transform Maxima and Multiscale Edges (Coifman et al., Eds), Jones and Bartlett, 1990.
50. T.R. Reed and J.M.H. du Buf, A Review of Recent Texture Segmentation and Feature Extraction Techniques, Computer Vision, Graphics and Image Processing: Image Understanding, 57 (3) (1993), 359-372.
51. S. Livens, P. Scheunders, G. van de Wouwer and D. van Dyck, Wavelets for Texture Analysis, Technical Report, VisieLab, Department of Physics, University of Antwerp, 1997.
52. R. Porter and N. Canagarajah, A Robust Automatic Clustering Scheme for Image Segmentation Using Wavelets, IEEE Transactions on Image Processing, 5 (4) (1996), 662-665.
53. A. Laine and J. Fan, Texture Classification by Wavelet Packet Signatures, IEEE Trans. on PAMI, 15 (11) (1993), 1186-1190.
54. W.Y. Ma and B.S. Manjunath, Texture Features and Learning Similarity, in: Proc. IEEE Computer Vision and Pattern Recognition Conference, 1996, pp. 425-430.
55. P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover, New York, 1966.
56. R. Porter and N. Canagarajah, Robust Rotation-Invariant Texture Classification: Wavelet, Gabor Filter and GMRF Based Schemes, IEE Proc.-Vis. Image Signal Process., 144 (3) (1997), 180-188.
57. D.A. Clausi, Texture Segmentation of SAR Sea Ice Imagery, PhD Thesis, University of Waterloo, 1996.
58. S. Livens, P. Scheunders, G. van de Wouwer, D. van Dyck, H. Smets, J. Winkelmans and W. Bogaerts, A Texture Analysis Approach to Corrosion Image Classification, Microsc., Microanal., Microstruct., 7 (2) (1996), 143-152.
59. F. Lumbreras and J. Serrat, Wavelet Filtering for Segmentation of Marble Images, Technical Report No. 5, Univ. Autonoma de Barcelona, 1996.
60. L.G. Brown, A Survey of Image Registration Techniques, ACM Computing Surveys, 24 (4) (1992), 325-376.
61. P.A. van den Elsen, E. Pol and M. Viergever, Medical Image Matching - A Review with Classification, Eng. Med. Biol., 12 (1) (1993), 26-39.
62. D.I. Barnea and H.F. Silverman, A Class of Algorithms for Fast Digital Registration, IEEE Trans. Comput., C-21 (1972), 179-186.
63. W.K. Pratt, Correlation Techniques for Image Registration, IEEE Trans. on Aerospace and Electronic Systems, AES-10 (1974), 353-358.
64. B.S. Reddy and B.N. Chatterji, An FFT-Based Technique for Translation, Rotation, and Scale-Invariant Image Registration, IEEE Trans. Imag. Proc., 5 (8) (1996), 1266-1271.
65. R. Watt, Understanding Vision, Academic Press Limited, London, UK, 1991.
66. S.D. Boehmig, B.M. Reichl, H. Stoeri and H. Hutter, Automatic Matching of SAM, SIMS and EPMA Images, Fresenius J. Analytical Chemistry, 349 (1993).
67. S.D. Boehmig, B.M. Reichl, M.M. Eisl and H. Stoeri, A Template Matching Technique Based on Segmented Images Using Pyramids, in: Proceedings of RECPAD 94, 1994.
68. S.D. Boehmig, Bild- und Signalrestauration in der Oberflaechenanalytik, PhD Thesis, Vienna University of Technology, Vienna, Austria, 1995.
69. S.G. Nikolov, Wavelet Transform Algorithms for Analytical Data Processing, PhD Thesis, Vienna University of Technology, Vienna, Austria, 1996.
70. J.P. Djamdji, A. Bijaoui and R. Maniere, Geometrical Registration of Images: The Multiresolution Approach, Photogrammetry and Remote Sensing Journal, 59 (5) (1993), 645-653.
71. J. Le Moigne and R.F. Cromp, The Use of Wavelets for Remote Sensing Image Registration and Fusion, TR-96-171, NASA Goddard Space Flight Center, 1996.
72. G. Wolberg, Digital Image Warping, IEEE Computer Society Press, Los Alamitos, California, 1990.
73. M.A. Abidi and R.C. Gonzalez (Eds), Data Fusion in Robotics and Machine Intelligence, Academic Press, 1992.
74. O. Rockinger, Pixel-Level Fusion of Image Sequences Using Wavelet Frames, in: Proceedings in Image Fusion and Shape Variability Techniques, Leeds, UK (K.V. Mardia, C.A. Gill and I.L. Dryden, Eds), Leeds University Press, 1996, pp. 149-154.
75. H. Li, B.S. Manjunath and S.K. Mitra, Multisensor Image Fusion Using the Wavelet Transform, Graphical Models and Image Processing, 57 (3) (1995), 235-245.
76. T.A. Wilson, S.K. Rogers and L.R. Myers, Perceptual Based Hyperspectral Image Fusion Using Multiresolution Analysis, Optical Engineering, 34 (11) (1995), 3154-3164.
77. I. Koren, A. Laine and F. Taylor, Image Fusion Using Steerable Dyadic Wavelet Transforms, in: Proceedings 1995 IEEE International Conference on Image Processing, IEEE, Washington, DC, 1995, pp. 232-235.
78. L.J. Chipman and T.M. Orr, Wavelets and Image Fusion, in: Proceedings 1995 IEEE International Conference on Image Processing, IEEE, Washington, DC, 1995, pp. 248-251.
79. S.G. Nikolov, D.R. Bull, C.N. Canagarajah, M. Halliwell and P.N.T. Wells, Image Fusion Using a 3-D Wavelet Transform, in: Proceedings of the Seventh International Conference on Image Processing and its Applications, Manchester, UK, IEE, 1999, pp. 235-239.
80. T. Stubbings, S.G. Nikolov and H. Hutter, Fusion of 2-D SIMS Images Using the Wavelet Transform, Microchimica Acta, 624 (2000), 1-6.
81. WaveLab, D. Donoho et al., Stanford University.
82. WaveBox, C. Taswell, Stanford University.
83. Wavelet Workbench for IDL, A. Graps and Research Systems Inc.
84. Wavelet Fusion Toolbox for IDL, S.G. Nikolov et al., Bristol University, UK, 1999.
85. SOM (Self-Organizing Maps), T. Kohonen et al., Helsinki University of Technology, Helsinki, Finland.
86. MIE (Multispectral Image Enhancement), S.G. Nikolov, Vienna University of Technology, Vienna, Austria, 1992.
87. X-image, S.G. Nikolov, Vienna University of Technology, Vienna, Austria, 1995.
88. Wave2, W.L. Hwang, S. Mallat and S. Zhong, New York University.
Index

abstraction level 352
adaptive wavelet algorithm (AWA) 177, 189, 194, 199, 200, 440, 442
adaptive wavelets 437, 448
analytical images 479
apodisation 28-31
approximate derivative calculation 211
autocorrelation function 124
banded matrix 87
band-pass filter 23
base frequency 16
baseline drift 207
bases dictionary 469
basis 12, 13
basis transformation 9, 12, 13
best-basis 155, 473
biorthogonal wavelets 79, 252
boundary handling 99, 111, 113, 114, 116, 117
B-spline wavelets 226
calibration 323
chemical dynamics 274
chemical kinetics 279
Chebyshev polynomial 12, 13
chromatography 205
circulant matrix 87
classification 437, 502
cluster analysis 378
coefficient position retaining method (CPR) 244
compact support 76
compression 126, 128
compression ratio 294
constant-padding extension 113
continuous wavelet transform (CWT) 59, 62
control charts 415
control points 528
convolution 5, 7
convolution integral 7
correct classification rate (CCR) 440
cost function 160, 162
cut-off frequency 24
Daubechies
  family 63
  wavelets 76
denoising 126, 488
derivative technique 211
determinant 89
diagonal matrix 86
difference of Gaussians (DOG) 233, 234
dilation equation 70
discriminant analysis 437, 439
discriminant function analysis (DFA) 391
discriminant partial least squares (DPLS) 373
discrete cosine transform (DCT) 463
discrete wavelet transform (DWT) 65, 91, 97
dynamic nuclear magnetic resonance spectroscopy 255
early transition detection (ETD) 311
edge detection 513
electrochemistry 225
electron probe microanalysis (EPMA) 488
embedded zero-tree wavelet encoding (EZW) 470
end effects 132
entropy 160, 171, 174, 192
entropy coding 509
factor analysis (FA) 213
fast Fourier transform (FFT) 14, 15
fast wavelet transform (WT) 74
feature
  extraction 513
  selection 324, 326, 331
filter coefficients 185
filter coefficients conditions 185
filter matrix 96
finite impulse response (FIR) filter 101, 126
Fourier
  basis 13
  domain 14
  integral 4
  polynomial series 13
  transform (FT) 3, 9, 14, 18, 60
fractal structures 282
frequency
  analysis 124
  domain 14
  localisation 38
  resolution 16
functional data analysis 352
fusion rules 538
Gabor transform 39
generating optimal linear PLS estimations (GOLPE) 370
genetic algorithm 325, 369
geometry matrix 356
Haar
  transform 52
  wavelet 77
hard thresholding 132
higher multiplicity wavelets 179
high-pass filter 23, 92
high-pass filter coefficients 73
hyphenated instruments 213
identity matrix 86
image
  classification 502
  compression 459
  decorrelation 462
  fusion 536
  transform fusion 536
immune neural network (INN) 220
impulse response 6
infrared spectroscopy (IR) 243
inner product 354
inverse wavelet transform 63
joint basis 167
joint best-basis 171, 294
Karhunen-Loeve transform (KLT) 462
Laplacian method 515
lapped orthogonal transform (LOT) 464
Lawton matrix 185
library compression 293
linear filtering 126
linear regression 10
low-pass filter 23, 92
low-pass filter coefficients 73
masking method 367
mass spectrometry 254
matrix 85
  addition 88
  multiplication 88
  operations 88
  polynomial product 89
  product 88
  properties 89
  rank 90
  theory 85
  transpose 86
m-band discrete wavelet transform 180
mean square error (MSE) 461, 487
median filter 129, 138
minimum description length (MDL) 293
minor 89
misclassification rate (MCR) 440
molecular structure 264
'mother' wavelet 59
moving average 25
multiple linear regression (MLR) 448
multiscale
  denoising 422
  edge detection 517, 518
  edge point 521
  edge representation 521
  filtering 130
  median filtering 138
  representation 121
  statistical process control (MSSPC) 415
multiresolution 68, 69, 362
multiresolution analysis (MRA) 65, 91
multi-tree wavelet packet transform 94
mutual information 372
near-infrared spectroscopy (NIR) 333
neural networks (NNs) 166, 219
noise
  characterization 123
  suppression 208
non-linear
  basis 357
  filtering 128
non-singular matrix 90
nuclear magnetic resonance spectroscopy (NMR) 255
on-line multiscale (OLMS) filtering 139
optimal bit allocation (OBA) algorithm 252
optimal scale combination method (OSC) 366
oscillographic chronopotentiometry 235
partial least squares (PLS) 323, 373
parsimonious models 361
patterned matrix 86
pattern recognition 219, 251
peak detection 210
periodic extension 111
periodisation 103
periodised wavelets 99
permutation matrix 87
phase 15
photoacoustic spectroscopy (PA) 256
polynomial
  approximation 9
  extension 114
  product 89, 186
power spectrum 17, 124
principal component analysis (PCA) 166, 297
principal component regression (PCR) 323
pyramid algorithm 43, 75
quantization 509
quantum mechanics 264
rectification 425
regression 448
relevant component extraction-partial least squares (RCE-PLS) 341
resolution enhancement 210
sampling point representation 352
scale dendrogram 377
scale-error
  complexity (SEC) 365
  plot 364
scale filter 92
scaling coefficients 70, 72, 75, 92
scaling function 69
scalogram 359
scanning electron microscope (SEM) 488
scatter diagram 503
secondary ion mass spectrometry (SIMS) 254, 487, 488, 491, 496, 507
segmentation 513
semiorthogonal wavelets 79
separable wavelets 482
shift matrix 87
short time Fourier transform 35, 60
signal enhancement 208
signal-to-noise ratio (SNR) 487
simulated annealing 326
singular matrix 90
smoothing 209
smoothness 77
soft thresholding 132, 489
spline basis 355
spline wavelet 226
standardization 250
statistical process control (SPC) 415
symmetric extension 111
symmetry 76
task-specific wavelets 473
texture analysis 521
threshold selection 131
three-dimensional (3-D)
  images 480, 483
  wavelet transform 482
time
  domain 14
  series 274
time-frequency
  analysis 35
  domain 38
Toeplitz matrix 88
translation-invariant (TI) filtering 133
translation-rotation transformation (TRT) 246
two-dimensional (2-D)
  images 483
  scaling function 483
  wavelet packet transform 469
  wavelet transform 482
universal threshold 131
ultraviolet-visible (UV-VIS) spectroscopy 250
vanishing moments 77
variable length coding (VLC) algorithm 252, 460
variables selection 383, 386, 390
variance tree 172
vector 86
voltammetry 225, 233
wavelet
  basis functions 59, 72
  coefficients 73, 75, 92
  coefficient method 213
  decomposition 72
  families 76
  filter 92
  matrix 95
  neural network (WNN) 248, 251
  on an interval 116
  packet coefficients 155
  packet decomposition 156
  packet functions 154
  packet transform 53, 94, 151
  properties 76, 80
  series 72
  spectrum 126
window factor analysis (WFA) 216, 217
windowed Fourier transform 60
zero frequency 16
zero padding 113, 244