R. Todd Ogden
Essential Wavelets for Statistical Applications and Data Analysis
Birkhauser Boston • Basel • Berlin
R. Todd Ogden Department of Statistics University of South Carolina Columbia, SC 29208
Library of Congress Cataloging-in-Publication Data
Ogden, R. Todd, 1965-
Essential wavelets for statistical applications and data analysis / R. Todd Ogden.
p. cm.
Includes bibliographical references (p. 191-198) and index.
ISBN 0-8176-3864-4 (hardcover : alk. paper). -- ISBN 3-7643-3864-4 (hardcover : alk. paper)
1. Wavelets (Mathematics) 2. Mathematical statistics I. Title.
QA403.3.O43 1997 519.5--dc20 97-27379 CIP
Printed on acid-free paper © 1997 Birkhauser Boston
Copyright is not claimed for works of U.S. Government employees. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior permission of the copyright owner. Permission to photocopy for internal or personal use of specific clients is granted by Birkhauser Boston for libraries and other users registered with the Copyright Clearance Center (CCC), provided that the base fee of $6.00 per copy, plus $0.20 per page, is paid directly to CCC, 222 Rosewood Drive, Danvers, MA 01923, U.S.A. Special requests should be addressed directly to Birkhauser Boston, 675 Massachusetts Avenue, Cambridge, MA 02139, U.S.A. ISBN 0-8176-3864-4 ISBN 3-7643-3864-4 Typeset in LaTeX by ShadeTree Designs, Minneapolis, MN. Cover design by Spencer Ladd, Somerville, MA. Printed and bound by Maple-Vail, York, PA. Printed in the U.S.A.
9 8 7 6 5 4 3 2 1
To Christine
Contents
Preface  ix
Prologue: Why Wavelets?  xiii

1  Wavelets: A Brief Introduction  1
   1.1  The Discrete Fourier Transform  1
   1.2  The Haar System  7
        Multiresolution Analysis  14
        The Wavelet Representation  16
        Goals of Multiresolution Analysis  22
   1.3  Smoother Wavelet Bases  23

2  Basic Smoothing Techniques  29
   2.1  Density Estimation  29
        Histograms  31
        Kernel Estimation  32
        Orthogonal Series Estimation  35
   2.2  Estimation of a Regression Function  38
        Kernel Regression  39
        Orthogonal Series Estimation  42
   2.3  Kernel Representation of Orthogonal Series Estimators  45

3  Elementary Statistical Applications  49
   3.1  Density Estimation  49
        Haar-Based Histograms  49
        Estimation with Smoother Wavelets  52
   3.2  Nonparametric Regression  54

4  Wavelet Features and Examples  59
   4.1  Wavelet Decomposition and Reconstruction  59
        Two-Scale Relationships  60
        The Decomposition Algorithm  62
        The Reconstruction Algorithm  63
   4.2  The Filter Representation  66
   4.3  Time-Frequency Localization  69
        The Continuous Fourier Transform  69
        The Windowed Fourier Transform  72
        The Continuous Wavelet Transform  74
   4.4  Examples of Wavelets and Their Constructions  79
        Orthogonal Wavelets  81
        Biorthogonal Wavelets  83
        Semiorthogonal Wavelets  87

5  Wavelet-based Diagnostics  89
   5.1  Multiresolution Plots  89
   5.2  Time-Scale Plots  92
   5.3  Plotting Wavelet Coefficients  95
   5.4  Other Plots for Data Analysis  100

6  Some Practical Issues  103
   6.1  The Discrete Fourier Transform of Data  104
        The Fourier Transform of Sampled Signals  104
        The Fast Fourier Transform  105
   6.2  The Wavelet Transform of Data  107
   6.3  Wavelets on an Interval  110
        Periodic Boundary Handling  111
        Symmetric and Antisymmetric Boundary Handling  112
        Meyer Boundary Wavelets  113
        Orthogonal Wavelets on the Interval  114
   6.4  When the Sample Size is Not a Power of Two  115

7  Other Applications  119
   7.1  Selective Wavelet Reconstruction  119
        Wavelet Thresholding  124
        Spatial Adaptivity  126
        Global Thresholding  128
        Estimation of the Noise Level  131
   7.2  More Density Estimation  132
   7.3  Spectral Density Estimation  133
   7.4  Detection of Jumps and Cusps  140

8  Data Adaptive Wavelet Thresholding  143
   8.1  SURE Thresholding  144
   8.2  Threshold Selection by Hypothesis Testing  149
        Recursive Testing  151
        Minimizing False Discovery  154
   8.3  Cross-Validation Methods  156
   8.4  Bayesian Methods  161

9  Generalizations and Extensions  167
   9.1  Two-Dimensional Wavelets  167
   9.2  Wavelet Packets  173
        Wavelet Packet Functions  174
        The Best Basis Algorithm  177
   9.3  Translation Invariant Wavelet Smoothing  180

Appendix  185
References  191
Glossary of Notation  199
Glossary of Terms  201
Index  205
Preface

I once heard the book by Meyer (1993) described as a "vulgarization" of wavelets. While this is true in one sense of the word, that of making a subject popular (Meyer's book is one of the early works written with the nonspecialist in mind), the implication seems to be that such an attempt somehow cheapens or coarsens the subject. I have to disagree that popularity goes hand-in-hand with debasement. While there is certainly a beautiful theory underlying wavelet analysis, there is plenty of beauty left over for the applications of wavelet methods. This book is also written for the non-specialist, and therefore its main thrust is toward wavelet applications. Enough theory is given to help the reader gain a basic understanding of how wavelets work in practice, but much of the theory can be presented using only a basic level of mathematics. Only one theorem is formally stated in this book, with only one proof. And these are only included to introduce some key concepts in a natural way.
Aim and Scope

This book was written to become the reference that I wanted when I began my own study of wavelets. I had books and papers, I studied theorems and proofs, but no single one of these sources by itself answered the specific questions I had: In order to apply wavelets successfully, what do I need to know? And why do I need to know it? It is my hope that this book will answer these questions for others in the same situation. In keeping with the title of this book, I have attempted to pare down the possible number of topics of coverage to just the essentials required for statistical applications and analysis of data. New statistical applications are being developed quickly, so due to the combination of careful choosing of topics and natural delays in writing and printing, this book is necessarily incomplete. It is hoped, however, that the introduction provided in this text will provide a suitable foundation for readers to jump off into other wavelet-related topics. I am of the opinion that basic wavelet methods of smoothing functions, for example, should be as widely understood as standard kernel methods are now. Admittedly, understanding wavelet methods requires a substantial amount of overhead, in terms of time and effort, but the richness of wavelet
applications makes such an investment well worth it. This modest work is thus put forward to widen the circle of wavelet literacy. It is important to point out that I am not at all advocating the complete abandonment of all other methods. In a recent article, Fan, et al. (1996) discuss local versions of some standard smoothing techniques and show that they provide a good alternative to wavelet methods, and in fact may be preferred in many applications because of their familiarity. This book was written primarily to increase the familiarity of wavelets in data analysis: wavelets are simply another useful tool in the toolbag of applied statisticians and data analysts. The treatment of topics in this book assumes only that the reader is familiar with calculus and linear algebra, with a basic understanding of elementary statistical theory. With this background, this book is essentially self-contained, with other topics (Fourier analysis, L^2 function space, function estimation, etc.) treated when introduced. A brief overview of L^2 function space is given as an appendix, along with glossaries of notation and terms. Thus, the material is accessible to a wide audience, including graduate students and advanced undergraduates in mathematics and statistics, as well as those in other disciplines interested in data analysis. Mathematically sophisticated readers can use this reference as quick reading to gain a basic understanding of how wavelets can be used.
Chapter Synopses

The Prologue gives a basic overview of the topic of wavelets and describes their most important features in nonmathematical language. Chapter 1 provides a fundamental introduction to what wavelets are, with brief hints as to how they can be used in practice. Though the results of this chapter apply to general orthogonal wavelets, the material is presented primarily in terms of the simplest wavelet: the Haar basis. This greatly simplifies the treatment in introducing wavelet features, and once the basic Haar framework is understood, the ideas are readily extended to smoother wavelet bases. Leaving the treatment of wavelets momentarily, Chapter 2 gives a general introduction to fundamental methods of statistical function estimation in such a way that will lead naturally to basic applications of wavelets. This will of course be review material for readers already familiar with kernel and orthogonal series methods; it is included primarily for the non-specialist. Chapter 3 treats the wavelet versions of the smoothing methods described in Chapter 2, applied to density estimation and nonparametric regression. Chapter 4 returns to describing wavelets, continuing the coverage of Chapter 1. It covers more details of the earlier introduction to wavelets, and treats wavelets in more generality, introducing some of the fundamental properties of wavelet methods: algorithms, filtering, wavelet extension of the Fourier transform, and examples of wavelet families. This chapter is not,
strictly speaking, essential for applying wavelet methods, but it provides the reader with a better understanding of the principles that make wavelets work well in practice. Chapters 6-9 deal with applying wavelet methods to various statistical problems. Chapter 5 describes diagnostic methods essential to a complete data analysis. Chapter 6 discusses the important practical issues that arise in wavelet analysis of real data. Chapter 7 extends and enhances the basic wavelet methods of Chapter 3. Chapter 8 gives an overview of current research in data dependent wavelet threshold selection. Finally, Chapter 9 provides a basic background into wavelet-related methods which are not explicitly treated in earlier chapters. The information in this book could have been arranged in a variety of orders. If it were intended strictly as a reference book, a natural way to order the information might be to place the chapters dealing primarily with the mathematics of wavelets (Chapters 2, 5, and 10) at the beginning, followed by the statistical application chapters (Chapters 4, 8, and 9), with the diagnostic chapter last, the smoothing chapter being included as an appendix. Instructors using this book in a classroom might cover the topics roughly in the order given, but with the miscellaneous topics in Chapter 4 distributed strategically within subsequent applications chapters. The current order was carefully selected so as to provide a natural path through wavelet introduction and application to facilitate the reader's first learning of the subject, but with like topics grouped sufficiently close together so that the book will have some value for subsequent reference.
Supplements on the World Wide Web

The figures in this book were mostly generated using the commercial S-Plus software package, some using the S-Plus Wavelet Toolkit, and some using the freely available set of S-Plus wavelet subroutines by Guy Nason, available through StatLib (http://lib.stat.cmu.edu/). To encourage readers' experimentation with wavelet methods and facilitate other applications, I have made available the S-Plus functions for generating most of the pictures in this book over the World Wide Web (this is in lieu of including source code in the text). These will be located both on Birkhauser's web site (http://www.birkhauser.com/books/isbn/0-8176-3864-4/),
and as a link from my personal home page (http://www.stat.sc.edu/~ogden/), which will also contain errata and other information regarding this book. As they become available, new routines for wavelet-based analysis will be included on these pages as well. Though I have only used the S-Plus software, there are many other software packages available, such as WaveLab, an extensive collection of MATLAB-based routines for wavelet analysis which is available free from Stanford's Statistics Department WWW site. Vast amounts of wavelet-related material are available through the WWW,
including technical reports, a wavelet newsletter, Java applets, lecture notes, and other forms of information. The web pages for this book, which will be updated periodically, will also describe and link relevant information sites.
Acknowledgments

This book represents the combination of efforts of many different people, some of whom I will acknowledge here. Thanks are due to Manny Parzen and Charles Chui for their kind words of encouragement at the outset of this project. I gratefully acknowledge Andrew Bruce, Hong-Ye Gao and others at StatSci for making available their S-PLUS Wavelet software. The suggestions and comments by Jon Buckheit, Christian Cenker, Cheng Cheng, and Webster West were invaluable in improving the presentation of the book and correcting numerous errors. I am deeply indebted to each of them. Mike Hilton and Wim Sweldens have the ability to explain difficult concepts in an easily understandable way; my writing of this book has been motivated by their examples in this regard. Carolyn Artin read the entire manuscript and made countless excellent suggestions on grammar and wording. Joe Padgett, John Spurrier, Jim Lynch, and my other colleagues at the University of South Carolina have been immensely supportive and helpful; I thank them as well. Thanks are also due to Wayne Yuhasz and Lauren Lavery at Birkhauser for their support and encouragement of the project. Finally, my deepest thanks go to my family: my wife Christine and daughter Caroline, who stood beside me every word of the way.
PROLOGUE
Why Wavelets?

The development of wavelets is fairly recent in applied mathematics, but wavelets have already had a remarkable impact. A lot of people are now applying wavelets to a lot of situations, and all seem to report favorable results. What is it about wavelets that makes them so popular? What is it that makes them so useful? This prologue will present an overview in broad strokes (using descriptions and analogies in lieu of mathematical formulas). It is intended to be a brief preview of topics to be covered in more detail in the chapters. It might be useful for the reader to refer back to the prologue from time to time, to prevent the possibility of getting bogged down in mathematical detail to the extent that the big picture is lost. The prologue describes the forest; the trees are the subjects of the chapters. Broadly defined, a wavelet is simply a wavy function carefully constructed so as to have certain mathematical properties. An entire set of wavelets is constructed from a single "mother wavelet" function, and this set provides useful "building block" functions that can be used to describe any in a large class of functions. Several different possibilities for mother wavelet functions have been developed, each with its associated advantages and disadvantages. In applying wavelets, one only has to choose one of the available wavelet families; it is never necessary to construct new wavelets from scratch, so there is little emphasis placed on construction of specific wavelets. Roughly speaking, wavelet analysis is a refinement of Fourier analysis. The Fourier transform is a method of describing an input signal (or function) in terms of its frequency components. Consider a simple musical analogy, following Meyer (1993) and others. Suppose someone were to play a sustained three-note chord on an organ. The Fourier transform of the resulting digitized acoustic signal would be able to pick out the exact frequencies of the three component notes, and the chord could be analyzed by studying the relationships among the frequencies. Suppose the organist plays the same chord for a measure, then abruptly changes to a different chord and sustains that for another measure. Here, the classical Fourier analysis becomes confused. It is able to determine the frequencies of all the notes in either chord, but it is unable to distinguish which frequencies belong to the first chord and which are part of the second. Essentially, the frequencies are averaged over the two measures, and the
Fourier reconstruction would sound all frequencies simultaneously, possibly sounding quite dissonant. While usual Fourier methods do a very good job at picking out frequencies from a signal consisting of many frequencies, they are utterly incapable of dealing properly with a signal that is changing over time. This fact has been well-known for years. To increase the applicability of Fourier analysis, various methods such as "windowed Fourier transforms" have been developed to adapt the usual Fourier methods to allow analysis of the frequency content of a signal at each time. While some success has been achieved, these adaptations to the Fourier methods are not completely satisfactory. Windowed transforms can localize simultaneously in time and in frequency, but the amount of localization in each dimension remains fixed. With wavelets, the amount of localization in time and in frequency is automatically adapted, in that only a narrow time-window is needed to examine high-frequency content, but a wide time-window is allowed when investigating low-frequency components. This good time-frequency localization is perhaps the most important advantage that wavelets have over other methods. It might not be immediately clear, however, how this time-frequency localization is helpful in statistics. In statistical function estimation, standard methods (e.g., kernel smoothers or orthogonal series methods) rely upon certain assumptions about the smoothness of the function being estimated. With wavelets, such assumptions are relaxed considerably. Wavelets have a built-in "spatial adaptivity" that allows efficient estimation of functions with discontinuities in derivatives, sharp spikes, and discontinuities in the function itself. Thus, wavelet methods are useful in nonparametric regression for a much broader class of functions. Wavelets are intrinsically connected to the notion of "multiresolution analysis." That is, objects (signals, functions, data) can be examined using widely varying levels of focus. As a simple analogy, consider looking at a house. The observation can be made from a great distance, at which the viewer can discern only the basic shape of the structure-the pitch of the roof, whether or not it has an attached garage, etc. As the observer moves closer to the building, various other features of the house come into focus. One can now count the number of windows and see where the doors are located. Moving closer still, even smaller features come into clear view: the house number, the pattern on the curtains. Continuing, it is possible even to examine the pattern of the wood grain on the front door. The basic framework of all these views is essentially the same using wavelets. This capability of multiresolution analysis is known as the "zoom-in, zoom-out" property. Thus, frequency analysis using the Fourier decomposition becomes "scale analysis" using wavelets. This means that it is possible to examine features of the signal (the function, the house) of any size by adjusting a scaling parameter in the analysis. Wavelets are regarded by many as primarily a new subject in pure mathematics. Indeed, many papers published on wavelets contain esoteric-looking theorems with complicated proofs. This type of paper might scare away people who are primarily interested in applications, but the vitality of wavelets lies in their applications and the diversity of these applications. The objective of this book is to introduce wavelets with an eye toward data analysis, giving only the mathematics necessary for a good understanding of how wavelets work and a knowledge of how to apply them. Since no wavelet application exists in complete isolation (in the sense that substantial overlap can be found among virtually all applications), we review here some of the ways wavelets have been applied in various fields and consider how specific advantages of wavelets in these fields can be exploited in statistical analysis as well. Certainly, wavelets have an "interdisciplinary" flavor. Much of the early development of the foundations of what is now known as wavelet analysis was led by Yves Meyer, Jean Morlet, and Alex Grossman in France (a mathematician, a geophysicist, and a theoretical physicist, respectively). With their common interest in time-frequency localization and multiresolution analysis, they built a framework and dubbed their creation ondelette (little wave), which became "wavelet" in English. The subject really caught on with the innovations of Ingrid Daubechies and Stephane Mallat, which had direct applicability to signal processing, and a veritable explosion of activity in wavelet theory and application ensued.
What are Wavelets Used For?

Here, we describe three general fields of application in which wavelets have had a substantial impact, then we briefly explore the relationships these fields have with statistical analysis.
1. Signal processing

Perhaps the most common application of wavelets (and certainly the impetus behind much of their development) is in signal processing. A signal, broadly defined, is a sequence of numerical measurements, typically obtained electronically. This could be weather readings, a radio broadcast, or measurements from a seismograph. In signal processing, the interest lies in analyzing and coding the signal, with the eventual aim of transmitting the encoded signal so that it can be reconstructed with only minimal loss upon receipt. Signals are typically contaminated by random noise, and an important part of signal processing is accounting for this noise. A particular emphasis is on denoising, i.e., extracting the "true" (pure) signal from the noisy version actually observed. This endeavor is precisely the goal in statistical function estimation as well-to "smooth" the noisy data points to obtain an estimate of the underlying function. Wavelets have performed admirably in both of these fields. Signal processors now have new, fast tools at their disposal that are
well-suited for denoising signals, not only those with smooth, well-behaved natures, but also those signals with abrupt jumps, sharp spikes, and other irregularities. These advantages of wavelets translate directly over to statistical data analysis. If signal processing is to be done in "real time," i.e., if the signals are treated as they are observed, it is important that fast algorithms are implemented. It doesn't matter how well a particular de-noising technique works if the algorithm is too complex to work in real time. One of the key advantages that wavelets have in signal processing is the associated fast algorithms-faster, even, than the fast Fourier transform.
2. Image analysis

Image analysis is actually a special case of signal processing, one that deals with two-dimensional signals representing digital pictures. Again, typically, random noise is included with the observed image, so the primary goal is again denoising. In image processing, the denoising is done with a specific purpose in mind: to transform a noisy image into a "nice-looking" image. Though there might not be widespread agreement as to how to quantify the "niceness" of a reconstructed image, the general aim is to remove as much of the noise as possible, but not at the expense of fine-scale details. Similarly, in statistics, it is important to those seeking analysis of their data that estimated regression functions have a nice appearance (they should be smooth), but sometimes the most important feature of a data set is a sharp peak or abrupt jump. Wavelets help in maintaining real features while smoothing out spurious ones, so as not to "throw out the baby with the bathwater."
3. Data compression

Electronic means of data storage are constantly improving. At the same time, with the continued gathering of extensive satellite and medical image data, for example, amounts of data requiring storage are increasing too, placing a constant strain on current storage facilities. The aim in data compression is to transform an enormous data set, saving only the most important elements of the transformed data, so that it can be reconstructed later with only a minimum of loss. As an example, Wickerhauser (1994) reports that the United States Federal Bureau of Investigation (FBI) has collected 30 million sets of fingerprints. For these to be digitally scanned and stored in an easily accessible form would require an enormous amount of space, as each digital fingerprint requires about 0.6 megabytes of storage. Wavelets have proven extremely useful in solving such problems, often requiring less than 30 kilobytes of storage space for an adequate representation of the original data, an impressive compression ratio of 20:1. How does this relate to problems in statistics? To quote Manny Parzen, "Statistics is like art is like dynamite: The goal is compression." In multiple
linear regression, for example, it is desired to choose the simplest model that represents the data adequately, to achieve a parsimonious representation. With wavelets, a large data set can often be summarized well with only a relatively small number of wavelet coefficients. To summarize, there are three main answers to the question "Why wavelets?":
1. good time-frequency localization,
2. fast algorithms,
3. simplicity of form.
This chapter has spent some time covering Answer 1 and how it is important in statistics. Answer 2 is perhaps more important in pure signal processing applications, but it is certainly valuable in statistical analysis as well. Some brief comments on Answer 3 are in order here. An entire set of wavelet functions is constructed by means of two simple operations on a single prototype function (referred to earlier as the "mother wavelet"): dilation and translation. The prototype function need never be computed when taking the wavelet transform of data. Just as the Fourier transform describes a function in terms of simple functions (sines and cosines), the wavelet transform describes a function in terms of simple wavelet component functions. The nature of this book is expository. Thus, it consists of an introduction to wavelets and descriptions of various applications in data analysis. For many of the statistical problems treated, more than one methodology is discussed. While some discussion of relative advantages and disadvantages of each competing method is in order, ultimately, the specific application of interest must guide the data analyst to choose the method best suited for his/her situation. In statistics and data analysis, there is certainly room for differences of opinion as to which method is most appropriate for a given application, so the discussion of various methods in this book stops short of making specific recommendations on which method is "best," leaving this entirely to the reader to determine. With the basic introduction of wavelets and their applications in this text, readers will gain the necessary background to continue their study of other applications and more advanced wavelet methods. As increasingly more researchers become interested in wavelet methods, the class of problems to which wavelets have application is rapidly expanding. The References section at the end of this book lists several articles not covered in this book that provide further reading on wavelet methods and applications. There are many good introductory papers on wavelets. Rioul and Vetterli (1991) give a basic introduction focusing on the signal processing uses
of wavelets. Graps (1995) describes wavelets for a general audience, giving some historical background and describing various applications. Jawerth and Sweldens (1994) give a broad overview of practical and mathematical aspects of wavelet analysis. Statistical issues pertaining to the application of wavelets are given in Bock (1992), Bock and Pliego (1992), and Vidakovic and Muller (1994). There have been many books written on the subject of wavelets as well. Some good references are Daubechies (1992), Chui (1992), and Kaiser (1994); these are all at a higher mathematical level than this book. The book by Strang and Nguyen (1996) provides an excellent introduction to wavelets from an engineering/signal processing point of view. Echoing the assertion of Graps (1995), most of the work in developing the mathematical foundations of wavelets has been completed. It remains for us to study their applications in various areas. We now embark upon an exploration of wavelet uses in statistics and data analysis.
CHAPTER ONE
Wavelets: A Brief Introduction
This chapter gives an introductory treatment of the basic ideas concerning wavelets. The wavelet decomposition of functions is related to the analogous Fourier decomposition, and the wavelet representation is presented first in terms of its simplest paradigm, the Haar basis. This piecewise constant Haar system is used to describe the concepts of the multiresolution analysis, and these ideas are generalized to other types of wavelet bases. This treatment is meant to be merely an introduction to the relevant concepts of wavelet analysis. As such, this chapter provides most of the background for the rest of this book. It is important to stress that this book covers only the essential elements of wavelet analysis. Here, we assume knowledge of only elementary linear algebra and calculus, along with a basic understanding of statistical theory. More advanced topics will be introduced as they are encountered.
1.1 The Discrete Fourier Transform
Transformation of a function into its wavelet components has much in common with transforming a function into its Fourier components. Thus, an introduction to wavelets begins with a discussion of the usual discrete Fourier transform. This discussion is not by any means intended to be a complete treatment of Fourier analysis, but merely an overview of the subject to highlight the concepts that will be important in the development of wavelet analysis. While studying heat conduction near the beginning of the nineteenth century, the French mathematician and physicist Jean-Baptiste Fourier discovered that he could decompose any of a large class of functions into component functions constructed of only standard periodic trigonometric functions. Here, we will only consider functions defined on the interval [-π, π]. (If a particular function of interest g is defined instead on a different finite interval [a, b], then it can be transformed via f(x) = g(2πx/(b-a) - (a+b)π/(b-a)).) The sine and cosine functions are defined on all of ℝ and have period 2π, so the Fourier decomposition can be thought of either as representing all such periodic functions, or as representing functions defined only on [-π, π] by simply restricting attention to only this interval. Here, we will take the latter approach. The Fourier representation applies to square-integrable functions. Specifically, we say that a function f belongs to the square-integrable function space L^2[a, b] if

\int_a^b f^2(x)\, dx < \infty.
Fourier's result states that any function f ∈ L^2[-π, π] can be expressed as an infinite sum of dilated cosine and sine functions:

f(x) = \frac{1}{2} a_0 + \sum_{j=1}^{\infty} \bigl( a_j \cos(jx) + b_j \sin(jx) \bigr),     (1.1)
for an appropriately computed set of coefficients {a_0, a_1, b_1, ...}. A word of caution is in order about the representation (1.1). The equality is only meant in the L^2 sense, i.e.,

\int_{-\pi}^{\pi} \Bigl( f(x) - \tfrac{1}{2} a_0 - \sum_{j=1}^{\infty} \bigl( a_j \cos(jx) + b_j \sin(jx) \bigr) \Bigr)^2 dx = 0.
It is possible that f and its Fourier representation differ on a few points (and this is, in fact, the case at discontinuity points). Since this book is concerned primarily with analyzing functions in L^2 space, this point will usually be neglected hereafter in similar representations. It is important to keep in mind, however, that such an expression does not imply pointwise convergence. The summation in (1.1) is up to infinity, but a function can be well-approximated (in the L^2 sense) by a finite sum with upper summation limit index J:
S_J(x) = \frac{1}{2} a_0 + \sum_{j=1}^{J} \bigl( a_j \cos(jx) + b_j \sin(jx) \bigr).     (1.2)
[Figure 1.1: The first three sets of basis functions for the discrete Fourier transform]

This Fourier series representation is extremely useful in that any L^2 function can be written in terms of very simple building block functions: sines and cosines. This is due to the fact that the set of functions {sin(j·), cos(j·), j = 1, 2, ...}, together with the constant function, form a basis for the function space L^2[-π, π]. We now examine the appearance of some of these basis functions and how they combine to reconstruct an arbitrary L^2 function. Figure 1.1 plots the first three pairs of Fourier basis elements (not counting the constant function): sine and cosine functions dilated by j for j = 1, 2, 3. Increasing the dilation index j has the effect of increasing the function's frequency (and thus decreasing its period). Next, we examine the finite-sum Fourier representation of a simple example function, as this will lead into the discussion of wavelets in the next section.
[Figure 1.2: An example function and its Fourier sum representations (reconstructions with J = 1, 2, and 3)]
The truncated Fourier series representations (1.2) for J = 1, 2, and 3 are displayed in Figure 1.2 for the piecewise linear function

f(x) = \begin{cases} x + \pi, & -\pi \le x \le -\pi/2, \\ \pi/2, & -\pi/2 < x \le \pi/2, \\ \pi - x, & \pi/2 < x \le \pi. \end{cases}     (1.3)
Figure 1.2 shows the original example function and the three representations of it. As the summation limit J gets larger, more terms are included in the reconstruction, so the resulting sum does a better job of approximating f. In this simple example, using three pairs of basis functions in the reconstruction gives a fairly good representation of the original function. Of course, even this good representation could be improved by allowing J to increase even more. The next issue to consider is calculation of the coefficients {a_0, a_1, b_1, a_2, b_2, ...}. For j ≥ 1, the Fourier coefficients can be computed by taking the inner product of the function f and the corresponding basis functions:
a_j = \frac{1}{\pi} \langle f, \cos(j\cdot) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(jx)\, dx, \qquad j = 0, 1, \ldots,     (1.4)

b_j = \frac{1}{\pi} \langle f, \sin(j\cdot) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(jx)\, dx, \qquad j = 1, 2, \ldots.     (1.5)
The coefficients a_j and b_j are said to measure the "frequency content" of the function f at the level of resolution j. Examining the set of Fourier coefficients can aid in understanding the nature of the corresponding function. The coefficients in (1.4) and (1.5) are given in terms of the L^2 inner product of two functions:

\langle f, g \rangle = \int f(x)\, g(x)\, dx,

where the integral is taken over the appropriate subset of ℝ. The L^2 norm of a function is defined to be

\| f \| = \sqrt{\langle f, f \rangle} = \sqrt{\int f^2(x)\, dx}.
Let us return to our earlier example and look at some of the coefficients, which are given in Table 1.1. First, note that all the b_j's (corresponding to the sine basis functions) are zero. The reason for this is that the example function is an even function, so the inner product of f with each of the odd sine functions is zero. From inspection of Table 1.1, we note that the even-index cosine coefficients are also zero (for j ≥ 4) and that odd-index coefficients are given by a_j = 2/(j^2 π), with coefficients a_j becoming small quickly as j gets large. This indicates that most of the frequency content of this example function is concentrated at low frequencies, which can be seen in the reconstructions in Figure 1.2. The only relatively large coefficients are a_0, a_1, a_2, and a_3, so the third reconstruction (J = 3) does a very good job at piecing f back together. By increasing J further, the approximation will only improve (in the L^2 sense), but the amount of the improvement will be smaller.
Table 1.1: Fourier coefficients for the example function.

  j    a_j        b_j        |   j    a_j        b_j
  0    3π/4       --         |   5    2/(25π)    0
  1    2/π        0          |   6    0          0
  2    -1/π       0          |   7    2/(49π)    0
  3    2/(9π)     0          |   8    0          0
  4    0          0          |   9    2/(81π)    0
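The coefficients in Table 1.1 can be verified numerically. The short sketch below is only an illustration under stated assumptions (Python with NumPy available; the integrals in (1.4) and (1.5) approximated by Riemann sums on a fine grid); it is not code from the book.

```python
import numpy as np

def f(x):
    # Piecewise linear example function (1.3) on [-pi, pi].
    return np.where(x <= -np.pi / 2, x + np.pi,
           np.where(x <= np.pi / 2, np.pi / 2, np.pi - x))

# Fine grid on [-pi, pi]; Riemann-sum approximation of the integrals in (1.4)-(1.5).
n = 200_000
x = np.linspace(-np.pi, np.pi, n, endpoint=False)
dx = 2 * np.pi / n

for j in range(5):
    a_j = np.sum(f(x) * np.cos(j * x)) * dx / np.pi   # equation (1.4)
    b_j = np.sum(f(x) * np.sin(j * x)) * dx / np.pi   # equation (1.5)
    print(f"j={j}:  a_j ~ {a_j:+.4f}   b_j ~ {b_j:+.4f}")

# Expected from Table 1.1: a_0 = 3*pi/4 ~ 2.3562, a_1 = 2/pi ~ 0.6366,
# a_2 = -1/pi ~ -0.3183, a_3 = 2/(9*pi) ~ 0.0707, a_4 = 0; all b_j = 0.
```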
The representation (1.1) holds uniformly for all x ∈ [-π, π] under certain restrictions on f (for instance, if f has one continuous derivative, f(π) = f(-π), and f'(π) = f'(-π); see, e.g., Dym and McKean (1972)). The example function in Figure 1.2 has discontinuities in its derivative, but the Fourier representation will converge at all other points. For any L^2[-π, π] function, the truncated representation (1.2) converges in the L^2 sense:

\int_{-\pi}^{\pi} \bigl( f(x) - S_J(x) \bigr)^2 dx \longrightarrow 0
as J → ∞. In practical terms, this means that many functions can be described using only a handful of coefficients. The extension of this to wavelets will become clear in the following section. Though not mentioned previously, the Fourier basis has an important property: It is an orthogonal basis.

Definition 1.1  Two functions f_1, f_2 ∈ L^2[a, b] are said to be orthogonal if ⟨f_1, f_2⟩ = 0.

The orthogonality of the Fourier basis can be seen through orthogonality properties inherent in the sine and cosine functions:
\langle \sin(m\cdot), \sin(n\cdot) \rangle = \int_{-\pi}^{\pi} \sin(mx)\,\sin(nx)\, dx = \begin{cases} 0, & m \ne n, \\ \pi, & m = n > 0, \end{cases}

\langle \cos(m\cdot), \cos(n\cdot) \rangle = \int_{-\pi}^{\pi} \cos(mx)\,\cos(nx)\, dx = \begin{cases} 0, & m \ne n, \\ \pi, & m = n > 0, \\ 2\pi, & m = n = 0, \end{cases}

\langle \sin(m\cdot), \cos(n\cdot) \rangle = \int_{-\pi}^{\pi} \sin(mx)\,\cos(nx)\, dx = 0 \quad \text{for all } m, n \ge 0.
The three expressions can be verified easily by applying the standard trigonometric identities for sin α sin β, cos α cos β, and sin α cos β. A minor modification of the sine and cosine functions will yield an orthonormal basis with another important property.

Definition 1.2  A sequence of functions {f_j} is said to be orthonormal if the f_j's are pairwise orthogonal and ‖f_j‖ = 1 for all j.
The orthogonality requirement is already satisfied with the sine and cosine functions. Defining g_j(x) = π^{-1/2} sin(jx) for j = 1, 2, ... and h_j(x) = π^{-1/2} cos(jx) for j = 1, 2, ..., with the constant function h_0(x) = 1/\sqrt{2π} on x ∈ [-π, π], makes the set of functions {h_0, g_1, h_1, ...} orthonormal as well. Normalizing the basis in this manner allows us to write the Fourier representation (1.1), along with the expressions for computing the coefficients (1.4) and (1.5), as

f(x) = \langle f, h_0 \rangle h_0(x) + \sum_{j=1}^{\infty} \bigl( \langle f, g_j \rangle g_j(x) + \langle f, h_j \rangle h_j(x) \bigr).
Definition 1.3  A sequence of functions {f_j} is said to be a complete orthonormal system (CONS) if the f_j's are pairwise orthogonal, ‖f_j‖ = 1 for each j, and the only function orthogonal to each f_j is the zero function.

Thus defined, the set {h_0, g_j, h_j : j = 1, 2, ...} is a complete orthonormal system for L^2[-π, π]. The Fourier basis is not the only CONS for intervals. Others include Legendre polynomials and wavelets, the latter to be studied in detail.
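As a quick numerical illustration of Definitions 1.2 and 1.3 for the normalized Fourier basis, the following minimal sketch (not from the book; it assumes NumPy and approximates the inner products by Riemann sums) checks a few norms and inner products:

```python
import numpy as np

n = 200_000
x = np.linspace(-np.pi, np.pi, n, endpoint=False)
dx = 2 * np.pi / n

def inner(u, v):
    # Riemann-sum approximation of the L2 inner product on [-pi, pi].
    return np.sum(u * v) * dx

h0 = np.full_like(x, 1.0 / np.sqrt(2 * np.pi))   # constant basis function h_0
g = lambda j: np.sin(j * x) / np.sqrt(np.pi)     # normalized sines g_j
h = lambda j: np.cos(j * x) / np.sqrt(np.pi)     # normalized cosines h_j

print(inner(h0, h0), inner(g(1), g(1)), inner(h(2), h(2)))    # each ~ 1
print(inner(h0, h(3)), inner(g(1), h(1)), inner(g(1), g(2)))  # each ~ 0
```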
1.2 The Haar System
The extension from Fourier analysis to wavelet analysis will be made via the Haar basis. The Haar function is a bona fide wavelet, though it is not used much in current practice. The primary reason for this will become apparent. Nevertheless, the Haar basis is an excellent place to begin a discussion of wavelets. This section will begin with a definition of the Haar wavelet and go on to derive the Haar scaling function. Following this development, we will begin with the Haar scaling function and then rederive the Haar wavelet. Of course, terms like "wavelet" and "scaling function" have not yet been defined. Their meaning will become clear as we progress through a discussion of issues associated with wavelets. The Haar wavelet system provides a paradigm for all wavelets, so it is important to keep in mind that the simple developments in this chapter have much broader application: All the principles discussed in this chapter pertaining to the Haar wavelet hold generally for all orthogonal wavelets. The Haar wavelet is nothing new, having been developed in 1910 (Haar, 1910), long before anyone began speaking of "wavelets." The Haar function, given by

\psi(x) = \begin{cases} 1, & 0 \le x < 1/2, \\ -1, & 1/2 \le x < 1, \\ 0, & \text{otherwise,} \end{cases}     (1.6)
[Figure 1.3: The Haar function]
is better expressed by a picture, shown in Figure 1.3. The Haar function is not particularly awe-inspiring, either in appearance or in expression (1.6). It is piecewise constant over intervals of length one-half. What can be so important about these wavelets? The Haar function ψ defined above is called a mother wavelet. The mother wavelet "gives birth" to an entire family of wavelets by means of two operations: dyadic dilations and integer translations. Let j denote the dilation index and k represent the translation index. Each wavelet born of the mother wavelet will be indexed by both of these indices:

\psi_{j,k}(x) = 2^{j/2}\, \psi(2^j x - k),
for integer-valued j and k. As in the Fourier series, dilation by larger j "compresses" the function on the x-axis. Altering k has the effect of sliding the function along the x-axis. Some of these dilated and translated wavelet functions are plotted in Figure 1.4. The primary importance of this set of functions is expressed in the following theorem. Another very useful property is brought out in its proof, which is sketched here informally only to help in the understanding of the Haar system and to bring out some new notation in a natural way. Though the ideas in this development are not particularly difficult, the notation gets fairly complex, so repeated reading of this section might help with understanding. Eventually, the Haar system will be extended to general wavelet bases, and all the same principles will apply.
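For readers who want to experiment, here is a minimal sketch (not from the book; the function names are illustrative, and NumPy is assumed) of the Haar mother wavelet (1.6) and of the dilated and translated wavelets ψ_{j,k}(x) = 2^{j/2} ψ(2^j x - k):

```python
import numpy as np

def haar_psi(x):
    # Haar mother wavelet (1.6): +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere.
    x = np.asarray(x, dtype=float)
    return np.where((0.0 <= x) & (x < 0.5), 1.0,
           np.where((0.5 <= x) & (x < 1.0), -1.0, 0.0))

def psi_jk(x, j, k):
    # Dilated and translated Haar wavelet: 2^(j/2) * psi(2^j x - k).
    return 2.0 ** (j / 2) * haar_psi(2.0 ** j * np.asarray(x, dtype=float) - k)

x = np.linspace(0.0, 1.0, 9)
print(psi_jk(x, 0, 0))    # the mother wavelet itself
print(psi_jk(x, 3, 2))    # supported on [1/4, 3/8), values +/- 2^(3/2)
```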
[Figure 1.4: Haar wavelet examples -- plots of ψ_{j,k} for (j, k) = (0, 0), (3, 2), and (4, 13)]
Theorem 1.1  The set {ψ_{j,k} : j, k ∈ ℤ} constitutes a complete orthonormal system for L^2(ℝ).

The informal proof of this theorem generally follows that given in Chapter 1 of Daubechies (1992). This is the only "proof" given in this book, included
here because it leads into a discussion of the principles of wavelet analysis. To establish the theorem's result, it is necessary to show two things:

1. The set is orthonormal;
2. Any function f ∈ L^2(ℝ) can be approximated arbitrarily well by a finite linear combination of the ψ_{j,k}'s.
First, we note that each Haar wavelet satisfies

\int_{-\infty}^{\infty} \psi_{j,k}(x)\, dx = 0,
for all j, k ∈ ℤ. To show (1), note that the support of the wavelet function ψ_{j,k} is

\mathrm{supp}\, \psi_{j,k} = [2^{-j} k,\ 2^{-j}(k+1)].
Two wavelets with the same dilation index j but differing k can never have overlapping support and are therefore orthogonal. If two wavelets have different dilation indices, say j' < j, then supp ψ_{j,k} is in a region in which the other wavelet is constant, so they are also orthogonal. Since ‖ψ_{j,k}‖ = 1 for all j, k ∈ ℤ, the set is orthonormal.
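These orthonormality claims are easy to check numerically. The sketch below (an illustration, not the book's code; inner products are approximated by Riemann sums on a fine grid) evaluates a few of the inner products ⟨ψ_{j,k}, ψ_{j',k'}⟩:

```python
import numpy as np

def haar_psi(x):
    return np.where((0.0 <= x) & (x < 0.5), 1.0,
           np.where((0.5 <= x) & (x < 1.0), -1.0, 0.0))

def psi_jk(x, j, k):
    return 2.0 ** (j / 2) * haar_psi(2.0 ** j * x - k)

# Fine grid covering the supports of the wavelets tested below.
n = 2 ** 16
x = np.linspace(-4.0, 4.0, n, endpoint=False)
dx = 8.0 / n

def inner(j1, k1, j2, k2):
    return np.sum(psi_jk(x, j1, k1) * psi_jk(x, j2, k2)) * dx

print(inner(0, 0, 0, 0))    # ~ 1: unit norm
print(inner(0, 0, 0, 1))    # ~ 0: same level, disjoint supports
print(inner(2, 1, 0, 0))    # ~ 0: finer wavelet sits where the coarser one is constant
print(inner(-1, 0, 0, 0))   # ~ 0: different dilation levels
```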
[Figure 1.5: A function and its approximations of increasing resolution (original function and approximations with j = 4, 5, 6). At each step, the width of the constant intervals is halved.]
f
by a
as J 1 -+ oo, we can approximate f arbitrarily well in the £ 2 sense by choosing an integer J 1 to be large; the first approximation off is thus the restriction of f to the interval [- 2J1 , 2J1 ), which we will denote !I [_2 Jl , 2 J1). This first approximation is approximated further by a function that is piecewise constant over all intervals of the form [RzJo, (f + 1)2-J0 ), where the integer J 0 is chosen to be large enough to make the approximation as good as desired. (This approximation of a smooth function by a piecewise constant function will be illustrated in Figure 1.5.) Since these approximations can be made for any function in £ 2 ( JR), we now restrict attention to such piecewise constant functions with compact support. Let JJo represent a function that is piecewise constant on intervals of length z-Jo as described above, which is understood to have support [-2J1 , 2J1 ]. Then let 0 represent the constant value of the function on the interval
J/
[Rz-Jo, (f + l)z-Jo), i.e.,
Wavelets: A Brief Introduction
11
It is now possible to write JJo as the sum of two functions: (1.7)
where JJo- 1 is an approximation to JJo that is piecewise constant over intervals oflength z-(Jo- 1), twice as large as before, i.e., t an t-jJo-1 - f . ! Jo-11 [f2-(Jo-I),(f+l)z-(J0-I)) =cons
The values of this coarser approximation to JJo are obtained by averaging the two corresponding constant values of the function JJo; 0 - 1 = 4(![£ +
f/
! Jo )
2£+1 ·
We can now define a "detail function" gJo- 1 , which is piecewise constant over the same intervals as those for JJo. Following the subscripting conventions established earlier, it follows that
and
This detail function, which is constant over intervals of length z-Jo, is the part that must be added to the coarser approximation JJo to get a finer approximation JJo. The decomposition in (1. 7) is illustrated in Figure 1.6. Using the Haar function (1.6), we can now get a useful expression for the detail function gJo- 1 , in terms of dilated and translated Haar wavelets:
Hence the original piecewise constant function can be written
L.....t Jo-1,£ nf, 'f/Jo-1,£, ! Jo _- JJo-1 + ""'d f_
where
THE HAAR SYSTEM
12
JJo 2l
g:{rl
j
Jjo-1
JJo 2£+1
:M:h 2
f_
2Jo-l
0
Jo-1
g2l
0
----------------------------1
Figure 1.6: The decomposition of a piecewise constant approximation function into a coarser approximation and a detail function
The approximation function
JJo =
JJo- 1
can be broken down again, giving
JJo-1 + gJo-1 JJo-2 + gJo-2 + gJo-1.
Note that JJo- 2 has the same support as JJo but is piecewise constant over widerintervals: [£2-(Jo- 2), (£+1)2-(Jo- 2)). Using the wavelet representation of gJo- 2 , the function JJo can be written
JJo = jJo- 2 +
L f_
dJ0 -2,t'I/JJ0 -2,f +
L
dJ0 -1,£'1/JJo-1,£,
f_
where the dj,k 's may be computed according to (1.8).
Wavelets: A Brief Introduction
13
Continuing in this way, we obtain Jo-1 JJo
= ~-JI +
L L
j=-J!
dj,k'I/Jj,k.
f_
The coarsest approximation f-J 1 has two constant pieces: f-J 1 I[o, 2 J 1 ) = f 0 J 1 which is the average of JJo over [0,2J1 ), and f-J 1 1[- 2 JI,o) !~(1 , the average of JJo over [-2J1 ,0). This represents a fundamental concept of wavelet analysis-breaking a function down into a very coarse approximation, with an ordered sequence of detail functions {g-J 1 , g-J 1 + 1 , ... , gJo-1} making up the difference. Even though the entire support of JJo has been represented, it is possible to double the support of the approximation to JJo, say from -2J1 + 1 to 2J1 + 1 . Then f-J 1 can be broken down as well: f-J 1 = f-(J 1 + 1) + g-(J1 + 1), where
=
and
Repeating this process K times results in
! Jo = ~-(J!+K) +
Jo-1
""' L-
j=-(JI+K)
""'d L- j,k .!, 'f/j,k' k
Using only the sequence of detail functions to approximate pute the resulting £ 2 error of approximation: Jo-1 II JJo -
L
L
j=-(JI+K)
k
dj,k '1/Jj,k 11
2
=
IIJ-(JI+K) 11
JJo,
we can com-
2
This error can be made as small as may be desired by choosing K large enough. Since the approximation in the previous expression is written only
in terms of dilated and translated ψ functions, (2) is established and the "proof" is complete. □ This "proof" brings out an important property of wavelets, that of the multiresolution analysis, which is discussed in the next section.
Multiresolution Analysis

In the proof of Theorem 1.1, it was seen that any function f ∈ L^2(ℝ) can be approximated by a piecewise constant function f^j, and that as j gets larger, the approximation gets better (at least in the L^2 sense). Figure 1.5 illustrates a smooth function and three such approximations. Once the principles of the multiresolution analysis are presented, the example function from Section 1.1 will be approximated also using piecewise constant functions to allow comparison with the Fourier series approximations in Section 1.1. From the proof of Theorem 1.1, it is seen that, using the Haar function and the corresponding piecewise constant approximations, for each level j, one can construct f^j, an approximation of the original function. This approximation can be written as the sum of the next coarser approximation f^{j-1} and a detail function g^{j-1}. As the index j runs from small to large, the corresponding approximations run from coarse to fine. Each detail function g^j can be written as a linear combination of the corresponding ψ_{j,k} functions. This illustrates the basic principles of multiresolution analysis, which will be discussed more formally later in this section. To make this argument more rigorous, define a function space V_j, j ∈ ℤ, to be
V_j = \{ f \in L^2(\mathbb{R}) : f \text{ is piecewise constant on } [k 2^{-j}, (k+1) 2^{-j}),\ k \in \mathbb{Z} \}.
Then the sequence of spaces (V_j)_{j∈ℤ} represents a ladder of subspaces of increasing resolution (as j increases). Each subspace V_j consists of functions that are piecewise constant over intervals of exactly half the length of those for V_{j-1} (and twice the length of those for V_{j+1}). This sequence of subspaces possesses the following properties:

1. ... ⊂ V_{-2} ⊂ V_{-1} ⊂ V_0 ⊂ V_1 ⊂ V_2 ⊂ ...;
2. ∩_{j∈ℤ} V_j = {0},  \overline{∪_{j∈ℤ} V_j} = L^2(ℝ);
3. f ∈ V_j if and only if f(2·) ∈ V_{j+1};
4. f ∈ V_0 implies f(· - k) ∈ V_0 for all k ∈ ℤ.
The third property demonstrates that each space V_j is a scaled version of the original space V_0. In the proof of the theorem, it was seen that each approximation can be written as the sum of a coarser approximation and a detail function. Using the notation P^j f to denote the projection of a function f onto the space V_j (the "best" approximation to f in V_j), this is expressed

P^j f = P^{j-1} f + g^{j-1}.
The detail function g^{j-1} (representing the "residual" between two approximations) can be written in terms of dilated and translated wavelets, giving

P^j f = P^{j-1} f + \sum_{k \in \mathbb{Z}} \langle f, \psi_{j-1,k} \rangle\, \psi_{j-1,k}.     (1.9)
It is important to note that {ψ_{j,k}, k ∈ ℤ} is not a CONS in V_j, but is actually a set of functions that are orthogonal to each function in V_j. The significance of this will be discussed more later in this section. As in the previous section, the decomposition can be extended recursively:

P^j f = P^{J_0} f + \sum_{\ell=J_0}^{j-1} g^\ell = P^{J_0} f + \sum_{\ell=J_0}^{j-1} \sum_k \langle f, \psi_{\ell,k} \rangle\, \psi_{\ell,k}.
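For the Haar multiresolution analysis, the projection P^j f simply averages f over the dyadic intervals [k2^{-j}, (k+1)2^{-j}). The sketch below (illustrative, not from the book; the averages are approximated numerically with NumPy) computes the constant values of P^0 f for the example function, the kind of piecewise constant approximation shown in Figure 1.8:

```python
import numpy as np

def f(x):
    # The running example function (1.3), extended by zero outside [-pi, pi].
    inside = (x >= -np.pi) & (x <= np.pi)
    y = np.where(x <= -np.pi / 2, x + np.pi,
        np.where(x <= np.pi / 2, np.pi / 2, np.pi - x))
    return np.where(inside, y, 0.0)

def haar_projection_values(func, j, ks, points_per_interval=10_000):
    # Average func over each dyadic interval [k 2^-j, (k+1) 2^-j); these averages
    # are the constant values of the projection P^j f onto V_j.
    width = 2.0 ** (-j)
    out = {}
    for k in ks:
        t = np.linspace(k * width, (k + 1) * width, points_per_interval, endpoint=False)
        out[k] = func(t).mean()
    return out

# Constant values of P^0 f on the intervals [k, k+1), k = -4, ..., 3 (compare Figure 1.8).
print(haar_projection_values(f, 0, range(-4, 4)))
```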
The concept of multiresolution analysis relates back to wavelets by noting that whenever there is a sequence of spaces (V_j)_{j∈ℤ} that satisfies the four properties above along with

5. there exists a function φ ∈ V_0 such that the set {φ_{0,k} = φ(· - k), k ∈ ℤ} constitutes an orthonormal basis for V_0,

then there exists a function ψ such that (1.9) is true! In the Haar case considered in this chapter, it is clear that one choice for φ is

\phi(x) = 1_{[0,1)}(x),     (1.10)
where 1_A(·) is the indicator function of the set A. The function φ is called the scaling function since its dilates and translates constitute orthonormal bases for all V_j subspaces, which are simply scaled versions of V_0. The scaling function is also referred to as the father wavelet. Since this concept will be used quite a bit throughout the rest of this book, a formal definition is in order.

Definition 1.4  Closed subspaces (V_j)_{j∈ℤ} that satisfy properties (1)-(5) above are said to form a multiresolution analysis (MRA) of L^2(ℝ).
If a function φ can be used to form spaces

V_j = \mathrm{span}\{ \phi_{j,k},\ k \in \mathbb{Z} \}

such that (V_j)_{j∈ℤ} constitute an MRA, then the (scaling) function φ is said to generate a multiresolution analysis.
The Wavelet Representation

In the Haar example, it is clear that the subspace sequence (V_j)_{j∈ℤ} generated by the function φ defined in (1.10) satisfies properties (1)-(5) above. In fact, for any arbitrary (not necessarily piecewise constant) scaling function φ which generates a sequence of subspaces V_j satisfying the five properties above, it is possible to construct a "wavelet" function ψ so that (1.9) holds. To illustrate this fact, we will start with the Haar scaling function and show how the Haar wavelet can be derived from it. Here, we begin with the Haar scaling function as defined in (1.10) with

\phi_{j,k}(x) = 2^{j/2}\, \phi(2^j x - k),
defining each space V_j to be the span of the set of functions {φ_{j,k}, k ∈ ℤ}. By previous arguments, it is clear that properties (1)-(5) hold and that the spaces discussed in Section 1.2 exactly correspond with those just defined. The projection of an L^2(ℝ) function onto an approximation space V_j is done by writing it in terms of the appropriately dilated and translated scaling functions:

P^j f = \sum_k c_{j,k}\, \phi_{j,k}.     (1.11)
Since the set {φ_{j,k}, k ∈ ℤ} is an orthonormal basis for V_j, the scaling function coefficients can be computed:

c_{j,k} = \langle f, \phi_{j,k} \rangle,
just as in Section 1.1 with the Fourier basis on [-π, π]. Now that we have a multiresolution analysis and a scaling function φ, it is desired to find a function ψ such that (1.9) holds. First, we note that φ ∈ V_0 and also that φ ∈ V_1. Since {φ_{1,k}, k ∈ ℤ} is a basis for V_1, we can write φ in terms of the φ_{1,k}'s. In this example, it is clear that

\phi(x) = \frac{1}{\sqrt{2}}\, \phi_{1,0}(x) + \frac{1}{\sqrt{2}}\, \phi_{1,1}(x).     (1.12)
A formula analogous to (1.12) can be used to define the Haar wavelet:

\psi(x) = \frac{1}{\sqrt{2}}\, \phi_{1,0}(x) - \frac{1}{\sqrt{2}}\, \phi_{1,1}(x),     (1.13)

which corresponds with the definition in (1.6). The construction in (1.13) is a simple example of a more general wavelet construction described in Section 4.1. For now, it is sufficient to know that there is a very specific relationship between a scaling function and a wavelet, enabling the derivation of a wavelet function ψ from any scaling function φ (with its corresponding multiresolution analysis). We now proceed to enhance our discussion of the MRA structure using wavelets. In (1.9), it is seen that the approximation of f at any resolution level j is composed of the approximation of the function at the next lower level and a detail signal. We have seen that the approximation at level j can be expressed as a linear combination of the elements in the basis of appropriately scaled functions {φ_{j,k}, k ∈ ℤ}, and also that the detail signal is a linear combination of wavelet functions {ψ_{j-1,k}, k ∈ ℤ}. We have already established that the ψ_{j,k} functions are mutually orthogonal. Thus, it is possible to define a "detail space" for which the set of wavelets with a single dilation index forms an orthonormal basis:
W_j = \mathrm{span}\{ \psi_{j,k},\ k \in \mathbb{Z} \}.

In the Haar example, it is seen easily that ⟨φ_{j,k}, ψ_{j',k'}⟩ = 0 whenever j ≤ j' (j, j', k, k' ∈ ℤ); that is, a scaling function and a wavelet are orthogonal whenever the scaling function is of lower resolution. Hence, an important property of the multiresolution can be written as

V_j = V_{j-1} \oplus W_{j-1},     (1.14)

where ⊕ represents the orthogonal sum of two subspaces. Since the W_j subspaces are created from wavelets (which can be written in terms of scaling functions; see (1.13)), the W_j spaces also inherit property (3) of the multiresolution analysis, namely that
f ∈ W_j if and only if f(2·) ∈ W_{j+1}. Equation (1.14), along with (1.9), expresses much of the underlying philosophy of wavelet analysis: It is possible to construct approximations at increasing levels of resolution that are linear combinations of dilations and translations of a scaling function φ, the differences in approximations being expressed as linear combinations of dilations and translations of a wavelet function ψ. Furthermore, the scaling function and the wavelet (their dilates and translates) are orthogonal.

[Figure 1.7: Relation of approximation and detail spaces]

To relate (1.14) and (1.9) to the ideas expressed in Theorem 1.1, we note that the subspaces W_j and W_{j'}, j ≠ j', are orthogonal. (Two spaces R and S are said to be orthogonal, written R ⊥ S, if every element of R is orthogonal to every element of S.) With this inter-space orthogonality, we further note that (1.14) can be extended recursively starting with

V_j = V_{j-1} \oplus W_{j-1} = V_{j-2} \oplus W_{j-2} \oplus W_{j-1}.
Continuing in this way, for any arbitrary integer j, it is seen that

V_j = V_{j_0} \oplus \bigoplus_{\ell=j_0}^{j-1} W_\ell,

for j > j_0; thus for any j ∈ ℤ,

V_j = \bigoplus_{\ell=-\infty}^{j-1} W_\ell,

and hence

L^2(\mathbb{R}) = \bigoplus_{j \in \mathbb{Z}} W_j,
which corresponds to Theorem 1.1, since each of the (mutually orthogonal) spaces Wj has orthonormal basis {'1/Jj,k, k E 7.l}. The ideas of the multiresolution analysis with detail spaces between successive levels of approximation are expressed in Figure 1. 7, in which the arrows denote composition: Vj is composed of Vj _ 1 and Wi _ 1 , etc. Though illustrated in this section using only the piecewise constant MRA, these same principles hold for any orthogonal MRA system. The example function from Section 1.1 can be approximated according to these piecewise linear spaces. Figure 1.8 provides a demonstration of this approximation in action. Each approximation is written as a linear combination
Figure 1.8: An example function and its piecewise constant approximations
of the appropriate basis functions $\phi_{J,k}$ as described in (1.11). As in Fourier analysis, the scaling function coefficients are simply the inner products of $f$ with the corresponding basis functions:
$$c_{J,k} = \langle f, \phi_{J,k} \rangle = \int f(x)\,\phi_{J,k}(x)\,dx. \qquad (1.15)$$
These coefficients are easily computed when $\phi$ is the Haar scaling function (1.10). A table of the scaling function coefficients for the piecewise linear example function is given in Table 1.2. Exact values for the coefficients are typically in terms of $\pi$ and $\pi^2$, and so the values in the table are just approximate numerical equivalents. Just as each Fourier coefficient gives information about the frequency content of a function, scaling function coefficients (and thus also wavelet coefficients) contain information about the function. The primary difference is that information in scaling function/wavelet coefficients is localized; the amount of localization is controlled by the dilation index $j$. This concept is covered in more detail in Section 4.3. One feature in particular of Table 1.2 is of interest. This is pointed out briefly here and developed more fully in Chapter 4. Given the coefficients for $j = 1$, it is straightforward to compute the coefficients for $j = 0$. This is not
apparent at first glance, but note that, for instance, $(c_{1,4} + c_{1,5})/\sqrt{2} = c_{0,2}$. In the Haar case, the reason for this is easily seen. The scaling function $\phi_{0,2}$ has constant value 1 over its support $[2, 3)$. Similarly, $\phi_{1,4}$ has value $\sqrt{2}$ over $[2, 2.5)$ and $\phi_{1,5}$ has the same value over $[2.5, 3)$. In terms of Haar scaling function coefficients,
$$c_{0,2} = \int_2^3 f(x) \cdot 1 \, dx = \left( \int_2^{2.5} f(x) \cdot \sqrt{2}\, dx + \int_{2.5}^{3} f(x) \cdot \sqrt{2}\, dx \right) \Big/ \sqrt{2} = (c_{1,4} + c_{1,5})/\sqrt{2}.$$
In fact, it is easy to generalize this and derive that for any Haar scaling function coefficient,
$$c_{j,k} = \left( c_{j+1,2k} + c_{j+1,2k+1} \right)/\sqrt{2}. \qquad (1.16)$$
This is an elementary example of the decomposition algorithm inherent in wavelets, which is developed fully in Section 4.1. It may be seen that the expression in (1.16) has strong ties to that in (1.12). This correspondence will also be treated in more detail in Section 4.1. The nice thing to notice about (1.16) is that integration need take place only at the highest level of
Table 1.2: Haar scaling function coefficients for the example function
  k     j = -1    j = 0     j = 1
 -7     0         0         0.0142
 -6     0         0         0.2769
 -5     0         0         0.6305
 -4     0         0.0100    1.1107
 -3     0         0.6416    1.1107
 -2     0.4608    1.4787    1.1107
 -1     2.1563    1.5708    1.1107
  0     2.1563    1.5708    1.1107
  1     0.4608    1.4787    1.1107
  2     0         0.6416    1.1107
  3     0         0.0100    1.1107
  4     0         0         0.6305
  5     0         0         0.2769
  6     0         0         0.0142
approximation, and that a formula like (1.16) can be used recursively to compute all lower-level scaling function coefficients. According to the theory of multiresolution analysis, the differences in the successive levels of approximation can be expressed as linear combinations of the $\psi_{j,k}$ functions. Again, the wavelet coefficients can be computed by taking the inner products of the function with the wavelet functions:
$$d_{j,k} = \langle f, \psi_{j,k} \rangle = \int f(x)\,\psi_{j,k}(x)\,dx. \qquad (1.17)$$
The coefficients of these detail signals are tabulated in Table 1.3. Graphs of the detail functions for the example function are provided in Figure 1.9. The coefficient values can be verified by integration of the product of the wavelets and the function $f$. Again, there are some interesting things to note from Table 1.3 as it relates to Table 1.2. For example, note that the coefficient $d_{0,2} = (c_{1,4} - c_{1,5})/\sqrt{2}$. The reason for this is that
$$\psi_{0,2} = \left( \phi_{1,4} - \phi_{1,5} \right)/\sqrt{2},$$
and as a result it is straightforward to see that in general,
$$d_{j,k} = \left( c_{j+1,2k} - c_{j+1,2k+1} \right)/\sqrt{2}, \qquad (1.18)$$
which is analogous to (1.16). This is a result of the relationship between the Haar scaling function and the Haar wavelet, expressed in (1.13). Thus, each scaling function coefficient at level $j$ is seen to be a weighted sum of scaling function coefficients at level $j + 1$; each wavelet coefficient at level $j$ is a weighted difference of scaling function coefficients at level $j + 1$. This simple observation on the Haar example is meant to provide a
Table 1.3: Haar wavelet coefficients for the example function
  k     j = -1    j = 0
 -4     0        -0.0100
 -3     0        -0.2500
 -2    -0.4466    0
 -1    -0.0651    0
  0     0.0651    0
  1     0.4466    0
  2     0         0.2500
  3     0         0.0100
Figure 1.9: The piecewise constant detail functions between successive approximations for the example function

glimpse into interesting properties that general wavelets share, which will be discussed in Chapter 4. In the Haar case, it is clear that there must be a corresponding reconstruction algorithm relating higher-level scaling function coefficients to lower-level wavelet and scaling function coefficients. The existence of such an algorithm is merely mentioned here; it is not difficult to derive in the Haar case and it will be treated fully later in the text.
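As a concrete illustration of the Haar relations (1.16) and (1.18) (an addition of mine, not part of the original text), the following Python sketch implements one step of the decomposition and the corresponding reconstruction step; the function names and the use of NumPy are my own choices.

```python
import numpy as np

def haar_decompose_step(c_fine):
    """One Haar decomposition step, as in (1.16) and (1.18):
         c[j, k] = (c[j+1, 2k] + c[j+1, 2k+1]) / sqrt(2)
         d[j, k] = (c[j+1, 2k] - c[j+1, 2k+1]) / sqrt(2)
    `c_fine` holds the level j+1 scaling coefficients (even length)."""
    c_fine = np.asarray(c_fine, dtype=float)
    even, odd = c_fine[0::2], c_fine[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_reconstruct_step(c_coarse, d_coarse):
    """The corresponding reconstruction step: recover level j+1 coefficients."""
    c_coarse, d_coarse = np.asarray(c_coarse), np.asarray(d_coarse)
    c_fine = np.empty(2 * c_coarse.size)
    c_fine[0::2] = (c_coarse + d_coarse) / np.sqrt(2.0)
    c_fine[1::2] = (c_coarse - d_coarse) / np.sqrt(2.0)
    return c_fine

# The pair c_{1,4} = 0.6305, c_{1,5} = 0.2769 from Table 1.2 gives
# c_{0,2} = 0.6416 and d_{0,2} = 0.2500, matching Tables 1.2 and 1.3.
c, d = haar_decompose_step([0.6305, 0.2769])
print(c, d)                          # ~[0.6416], ~[0.2500]
print(haar_reconstruct_step(c, d))   # recovers [0.6305, 0.2769]
```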
Goals of Multiresolution Analysis

The goal of wavelets and multiresolution analysis in fields such as signal processing is to get a representation of a function (signal) that is written in a parsimonious manner as a sum of its essential components. That is, a parsimonious representation of a function will preserve the interesting features of the original function, but will express the function in terms of a relatively small set of coefficients. An example of a situation in which such a representation would be desirable is in the field of image analysis, in which researchers have been making widespread use of multiresolution analysis for several years. A black-and-white image can be expressed in a numerical form as a function $f(x, y)$ over two dimensions in which the function value $f(x_0, y_0)$ represents the "gray scale" value of the image at the point $(x_0, y_0)$. This is then discretized to a relatively fine grid. In many images of interest, there are only a few locations in which greater detail would be desired; the greater part of many images consists of large, fairly homogeneous areas of a particular shade of gray. One could "sharpen" the image without increasing the amount of information to be stored by using only a very few components to represent the large homogeneous areas, leaving a great many components with which to represent the "action" areas of the image. The piecewise constant MRA described in Section 1.2 in terms of the Haar wavelet illustrates the foundation of multiresolution analysis. At each step of increasing (decreasing) resolution a finer (coarser) approximation of the original function is created. Moving from a coarse to a fine approximation, or from a fine to a coarse approximation, is known as the "zoom-in, zoom-out" feature of multiresolution analysis. The "analysis" consists of studying the "detail signals," or the difference in approximations made at adjacent resolution levels. A wavelet (which, when appropriately dilated, forms the basis for the detail spaces) must be localized in time, in the sense that $\psi(x) \to 0$ quickly as $|x|$ gets large. The wavelet should also be oscillating about zero so that $\int_{-\infty}^{\infty} \psi(x)\,dx = 0$, and the first $m$ moments are also zero: $\int_{-\infty}^{\infty} x^k \psi(x)\,dx = 0$, $k = 1, \ldots, m - 1$. The oscillating property makes the function a wave, but because it is localized, it becomes a wavelet. So as a mother wavelet $\psi$ is dilated to $\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k)$, it oscillates more quickly, so it is better able to represent finer details in the signal. Note that $\psi_{j,k}(x)$ is localized about the point $x = 2^{-j}k$. The wavelet coefficient $d_{j,k} = \langle f, \psi_{j,k} \rangle$ measures the amount of the fluctuation in the function about the point $x = 2^{-j}k$, with a frequency determined by the dilation index $j$. In application, it is typical to start with a low-level approximation and then add in only the higher-level wavelets that correspond to relatively large wavelet coefficients. This is a general description of how a wavelet-based parsimonious representation can be obtained, which will be covered in more detail in Chapter 7.
1.3 Smoother Wavelet Bases

So far, the only wavelet treated is the Haar wavelet, with only vague hints as to the existence of smoother wavelets. The good news is that once a full understanding and appreciation of multiresolution analysis using the Haar basis is achieved, extension to other wavelet bases is just a matter of changing the functions $\phi$ and $\psi$ and a few details of the decomposition. The fundamental principles remain the same. There are a myriad of other wavelet bases. This section will only present a few examples and discuss their general features; a fuller treatment and development is reserved for Section 4.4. Several families of orthonormal wavelet bases have been developed in recent years. Stromberg (1982) developed a family of wavelets that was not noticed much at the time, and, a few years later, unaware of Stromberg's work, Meyer (1985) introduced the system now known as the Meyer basis. Soon thereafter, Battle (1987) and Lemarie (1988) each independently proposed the same new family of orthogonal wavelets. Two members of the Battle-
Lemarie family are illustrated in Figure 1.10.

Figure 1.10: Two examples of scaling function/wavelet sets from the Battle-Lemarie family

This family is indexed by an integer $p$: scaling functions and wavelets are constructed from polynomial splines of degree $2p + 1$. Thus, the first example in Figure 1.10 is piecewise linear, and the second is based on cubic polynomials. Though not obvious from the picture, these scaling functions and wavelets each have support over the entire real line, although they do have exponential decay. In statistics, we deal with finite data sets, so we might tend to prefer wavelets with compact support. The Haar wavelet does have compact support, but an ideal wavelet would probably be something a bit smoother. Like the wavelets of the Battle-Lemarie family, the members of the Chui-Wang family of wavelets are based on polynomial B-splines, but they have compact support. This family is indexed by $m$, where $m - 1$ is the order of the polynomial. Elements of the family are shown in Figure 1.11. Note that the scaling function and wavelet corresponding to $m = 2$ are piecewise linear, those for $m = 3$ are piecewise quadratic, and so on. Again, a full treatment is reserved for Chapter 4. These Chui-Wang wavelets are easy to compute and to work with, but a substantial disadvantage (at least in standard statistical applications) is that they do not form an orthonormal basis for $L^2(\mathbb{R})$. They form instead a semi-orthogonal wavelet basis, but discussion of this is deferred to Chapter 4 as well.
Figure 1.11: Three examples of scaling function/wavelet sets from the Chui-Wang cardinal spline family

Ingrid Daubechies introduced a family of wavelets that are not only orthonormal, but also have compact support (Daubechies (1988)). The development of this family made a profound impact, and these wavelets are used extensively in practice. Several members of this family are displayed in Figure 1.12. As with the other two families displayed in this section, these are also indexed by an integer $N$, and the smoothness of the functions increases as $N$ increases. With so many wavelet families from which to choose, it is only natural to ask which one is preferred in most cases. In statistical applications, one important quality that a wavelet basis should possess is that of orthonormality (this results in independence of empirical coefficients; see Chapter 7 for details). As mentioned before, finite data sets are often scaled to live on an interval like $[0, 1]$, so it would be natural to want a wavelet with compact sup-
Figure 1.12: Three examples of scaling function/wavelet sets from the Daubechies compactly supported family
port. Of the wavelet bases discussed here, these two restrictions leave us only with the two Daubechies families of wavelets. In practical application, the choice of wavelet family and index N is not terribly important, provided that the corresponding basis is fairly smooth. All examples in this text will use the usual Daubechies family with N = 5. The example function is approximated according to the multiresolution ideas for three levels of resolution using Daubechies wavelets and scaling functions with index N = 5. The results are displayed in Figure 1.13. The scaling function coefficients for this example (computed by numerical integration) are tabulated in Table 1.4. Note that the effective support of Daubechies' N = 5 wavelet is roughly [0, 5] (the actual support is [0, 9], but the function is very near zero outside of [0, 5]). Thus, there are about ten non-
dilated scaling functions ($j = 0$) with effective support overlapping $[-\pi, \pi]$, the support of the example function, and thus ten coefficients at that level. However, only a handful of these are big enough to matter much in the reconstruction. Similar observations can be made of the two lower-level reconstructions as well.

Figure 1.13: An example function and representation using Daubechies wavelets with N = 5

Table 1.4: Scaling function coefficients for the Daubechies N = 5 representation of the rescaled example function
  k     j = -2    j = -1    j = 0
 -7     0         0        -0.0078
 -6     0        -0.0006    0.0243
 -5    -0.0214    0.0006   -0.0749
 -4     0.1226   -0.0296    0.3756
 -3    -0.4581    0.1030    1.3702
 -2     1.5109    1.8404    1.6069
 -1     2.4652    2.5179    1.5847
  0     0.1536    0.8906    1.6640
  1     0         0.0155    0.9159
  2     0         0         0.0950
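The remark about the effective support of the Daubechies $N = 5$ scaling function can be checked numerically. The sketch below is my own illustration and assumes the PyWavelets package (pywt), which postdates this text; in that package the usual Daubechies wavelet with $N = 5$ is named "db5".

```python
import numpy as np
import pywt  # PyWavelets (an assumption; not used in the text itself)

w = pywt.Wavelet("db5")
phi, psi, x = w.wavefun(level=10)   # fine dyadic tabulation of phi and psi on [0, 9]

# Nominal support is [0, 9]; measure how little of phi's energy lies beyond x = 5.
dx = x[1] - x[0]
tail = np.sum(phi[x > 5.0] ** 2) * dx
print("approximate share of squared norm of phi beyond x = 5:", tail)
```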
CHAPTER TWO

Basic Smoothing Techniques

In order to appreciate fully the application of wavelets to various function estimation problems in statistics, it is first useful to examine some of the existing techniques in use. Though there are many methods currently in use for these applications, this chapter will focus on only two of them: kernel smoothing and orthogonal series estimation. This background should provide a useful lead-in to a discussion of wavelet methods for function estimation, since standard techniques in these two areas can be modified in a straightforward manner to use wavelets. These two methods are considered in two contexts: that of estimating a probability density function (pdf), and that of nonparametric estimation of a regression function. The extension of these methods to other types of estimation (for example, spectral density estimation, considered in Chapter 7) is straightforward.

2.1 Density Estimation
An important and fundamental concept in probability is the notion of the cumulative distribution function (cdf) (or, simply, distribution function) associated with a random variable $X$. The distribution function, denoted $F(x)$ and defined as
$$F(x) = P(X \le x),$$
completely specifies the behavior of the associated random variable. The probability of $X$ falling into any half-open interval $(a, b]$ with $a < b$ can be calculated:
$$P[a < X \le b] = F(b) - F(a). \qquad (2.1)$$
Clearly, any valid distribution function must be non-decreasing, right-continuous, and have limits 0 and 1 as $x \to -\infty$ and $x \to \infty$, respectively.
If the distribution function $F$ has a continuous derivative, the associated random variable is called absolutely continuous, and the probability density function (pdf) (or, simply, density) of the random variable $X$ is defined as
$$f(x) = \frac{d}{dx} F(x), \qquad x \in \mathbb{R}.$$
By the Fundamental Theorem of Calculus, $F$ is the antiderivative of $f$:
$$F(x) = \int_{-\infty}^{x} f(y)\, dy. \qquad (2.2)$$
Thus the probability in (2.1) can be expressed in terms of the density
$$P[a < X \le b] = F(b) - F(a) = \int_a^b f(x)\, dx,$$
so the density also completely specifies the behavior of the random variable. Often, therefore, attention is focused primarily on the density. Note that when such a density exists, it makes no difference in calculating the probability of the random variable falling into an interval, whether the interval is open, half-open, or closed. Clearly, for a function $f$ to be a valid density function, it must satisfy $f(x) \ge 0$ for all $x \in \mathbb{R}$ and the "total area under the curve" must be 1:
$$\int_{-\infty}^{\infty} f(x)\, dx = 1.$$
The density function is very important in statistics and data analysis as well. If a set of observed data values $X_1, \ldots, X_n$ are regarded as being independent observations from an unknown distribution with density $f(x)$, it is often of interest to use the data to make inferences about the density function. Traditionally, this has been accomplished by assuming a particular form of the density, so that the function $f(x)$ is known up to the values of a few parameters. For instance, it is often assumed that the data follow a normal distribution with unknown parameters $\mu$ and $\sigma^2$, representing the mean and variance of the distribution, respectively. In such a situation, inference upon the density $f$ can be accomplished through estimating the parameter values from the data. This parametric approach to making inference on a density function is supplemented by various methods of nonparametric density estimation. These more modern techniques offer the advantage of much greater flexibility. If the form of the distribution is misspecified in a parametric analysis, the results could be quite severe, as when forcing a square peg into a round hole.
Nonparametric density estimation requires only relatively weak assumptions to be placed upon the distribution of the data, thereby avoiding the problems that result from misspecification. This additional flexibility comes with a price tag, in that much greater computational effort is required. Modern gains in computing power have made these methods viable, so they have come into standard usage. This section will trace the development of three basic methods used in density estimation. This discussion will eventually lead (Chapter 3) to the application of wavelets to the problem. Thus, the treatment of density estimation is not intended to be exhaustive, but only to provide enough introduction to current methods so that the reader can appreciate the benefits of applying wavelets to the problem. A more complete reference is given in the monograph of Silverman (1986). Other references include texts by Prakasa Rao (1983), Wertz (1978), and Delacroix (1982).
Histograms

In exploratory data analysis, it is of interest to get an idea of the "shape" of the data distribution, with the hope that interesting features, such as multiple modes or skewness, will make themselves evident. An informal yet very practical approach to such exploration is presented in virtually all introductory statistics classes: to describe the shape of the data distribution using a histogram. These simple graphical displays are easily understood by virtually everyone, with no training in statistics needed. A histogram is constructed from continuous data by placing the data points into bins or classes. Each bin is represented graphically by a rectangle with width equal to that of the bin, and with height proportional to the number of data points falling into the corresponding bin. Bins can be determined from the choice of an origin $x_0$ and a bin width $\lambda$. For any integer $\ell$, a single bin consists of the half-open interval $[x_0 + \ell\lambda,\ x_0 + (\ell + 1)\lambda)$. The value of the histogram at any point $x$ can be expressed as
$$\hat{f}(x) = \frac{1}{n\lambda} \cdot \#\{X_i \text{ in the same bin as } x\}.$$
For any particular data set, the appearance of the histogram is affected primarily by the choice of binwidth $\lambda$: a small value for $\lambda$ will result in a choppy-looking histogram with many thin bars; a larger $\lambda$ might oversmooth the data by using too few bars. Various rules of thumb are given for automatic choice of binwidth, such as that given by Rudemo (1982), but in practice, $\lambda$ is often chosen by constructing several histograms and making a somewhat subjective choice on which appears to represent the data best. To illustrate the qualitative effect of binwidth choice, histograms for various values of $\lambda$ are plotted in Figure 2.1 for a classic data set consisting of 107
eruption length times (in minutes) for the Old Faithful geyser in Yellowstone National Park. The original data is given in Silverman (1986).

Figure 2.1: Histograms of the Old Faithful geyser data with varying binwidths

It should be noted also that the choice of origin $x_0$ also has some effect on the appearance of the resulting histogram. Moving $x_0$ slightly to either side may redistribute data points in bins enough to cause an appreciable difference in the histogram's appearance. Admittedly, the effect of this choice is not as great as that of binwidth, but Silverman provides a particular example where two different choices of $x_0$ produce rather dissimilar pictures. A natural estimator of $F(x)$ is the corresponding sample quantity: the piecewise constant
empirical distribution function
$$\tilde{F}(x) = \frac{1}{n} \cdot \#\{X_i \le x\} = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x). \qquad (2.3)$$
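As a small illustration (mine, not from the book), the histogram estimator above and the empirical distribution function (2.3) can be computed directly; the sample values and the origin/binwidth choices here are arbitrary placeholders.

```python
import numpy as np

def histogram_density(x, data, origin=0.0, binwidth=0.25):
    """Histogram estimate at x: (1 / (n * binwidth)) * #{X_i in the same bin as x}."""
    data = np.asarray(data)
    bin_index = np.floor((x - origin) / binwidth)            # bin containing x
    in_bin = np.floor((data - origin) / binwidth) == bin_index
    return in_bin.sum() / (len(data) * binwidth)

def empirical_cdf(x, data):
    """Empirical distribution function (2.3): proportion of observations <= x."""
    return np.mean(np.asarray(data) <= x)

data = np.array([1.8, 1.9, 2.0, 3.3, 3.5, 3.6, 4.1, 4.5])    # toy data
print(histogram_density(3.4, data), empirical_cdf(3.4, data))
```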
Kernel Estimation

A histogram provides an important look at the "shape" of the data, and is therefore a very valuable tool in data analysis. In many situations, we might believe that the true density function is smooth and thus desire a smooth estimator of it. For this, we might turn to kernel density estimation. This section
will outline some of the development of kernel estimators, beginning with the paper by Parzen (1962). Recall that the distribution function $F(x)$ gives the probability of an observation falling below (or equal to) a point $x$:
$$F(x) = P[X \le x].$$
Since $f(x)$ is defined to be the derivative of $F(x)$, it can be written
$$f(x) = \lim_{\lambda \to 0} \frac{1}{2\lambda}\left( F(x + \lambda) - F(x - \lambda) \right).$$
For a suitably chosen $\lambda$, a natural estimator of the density would result from replacing $F$ in the expression above with the empirical distribution function and disregarding the limit:
$$\hat{f}(x) = \frac{1}{2\lambda}\left( \tilde{F}(x + \lambda) - \tilde{F}(x - \lambda) \right) = \frac{1}{2\lambda n} \cdot \#\{X_i \text{ in } (x - \lambda,\ x + \lambda]\}. \qquad (2.4)$$
We will refer to this expression as the "naive" estimator. Note that for any $x$, this estimator counts only the points that lie within a bandwidth $\lambda$ of $x$. The naive estimator can be written in another form by defining a particular weight function or kernel function
$$K(x) = \begin{cases} \tfrac{1}{2}, & -1 < x \le 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (2.5)$$
Using this kernel function, (2.4) can be written
$$\hat{f}(x) = \frac{1}{n\lambda} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{\lambda} \right). \qquad (2.6)$$
The resolution of this estimator can be adjusted by changing the bandwidth $\lambda$, with a small choice of $\lambda$ giving a narrow "window" with more localization, and a larger choice of $\lambda$ giving a wider window and less localization. The idea of giving weight only to data points in the vicinity of $x$ is a very natural one, but it needs to be refined slightly for the estimator (2.6) to be really useful. The problem with the naive estimator is that the "all-or-nothing" nature of the associated weight function makes for a jagged estimator: As the reference point $x$ is moved continuously along the real line $\mathbb{R}$, data points are abruptly included or excluded from the domain of the weight function, giving sharp discontinuities in the resulting estimator $\hat{f}(x)$.
This is easily remedied by considering smoother weight functions in place of the piecewise constant function (2.5). A suitable kernel function should satisfy
$$\int_{-\infty}^{\infty} K(x)\, dx = 1,$$
so it is common to use probability density functions as kernels, typically those that are symmetric about zero. The Gaussian kernel (normal pdf) is a popular choice:
$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$
Other kernel functions in common use include the triangular kernel
$$K(x) = 1 - |x|,$$
the Epanechnikov kernel
$$K(x) = \frac{3}{4}\left(1 - x^2\right),$$
and the biweight kernel
$$K(x) = \frac{15}{16}\left(1 - x^2\right)^2,$$
each of these last three defined to be zero outside of $[-1, 1]$. Figure 2.2 gives a plot of these four kernel functions. For kernel density estimation, it is not required that the kernel function be nonnegative. In fact, kernels that take on negative values in places are often used to reduce the asymptotic bias of the resulting estimator. Using such higher-order kernels was originally considered by Parzen (1962) and Bartlett (1963); an overview of these methods is given in Chapter 3 of Silverman (1986). In practice, such kernels are often avoided, since they can occasionally give density estimates that are negative in places. As mentioned before, the naive estimator (2.4) was dismissed because of its jagged appearance. This jagged nature is inherited from the shape of its boxy weight function. In the same way, when a smoother kernel function is used, the resulting estimator is also smooth, since it is simply a sum of smooth functions. Analogous to the selection of a binwidth in constructing a histogram, the most important factor in determining the appearance of the density estima-
Figure 2.2: Four kernel functions
tor is the choice of a bandwidth $\lambda$. A very small choice of $\lambda$ will localize the estimator too much: the resulting estimator will consist of a lot of sharp bumps centered at the data values. Conversely, using a very large value of $\lambda$ will not allow enough localization to occur: the estimator will "smooth over" any local features, completely obscuring the true picture. The effect of the choice of bandwidth on the appearance of resulting estimators is shown in Figure 2.3.
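For concreteness, here is a minimal NumPy sketch of the kernel density estimator (2.6) with the Gaussian kernel; it is my own illustration, and the data values and bandwidth are placeholders rather than the Old Faithful data.

```python
import numpy as np

def kernel_density(x, data, bandwidth, kernel=None):
    """Kernel density estimate (2.6): (1/(n*lambda)) * sum_i K((x - X_i)/lambda)."""
    if kernel is None:
        kernel = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    data = np.asarray(data, dtype=float)
    u = (x - data) / bandwidth
    return kernel(u).sum() / (len(data) * bandwidth)

data = np.array([1.8, 1.9, 2.0, 3.3, 3.5, 3.6, 4.1, 4.5])    # toy sample
grid = np.linspace(1.0, 5.0, 9)
estimate = [kernel_density(x, data, bandwidth=0.25) for x in grid]
```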
Orthogonal Series Estimation

First considered by Cencov (1962), using orthogonal series provides another useful tool for estimating the probability density function from data. The general approach is illustrated by using the Fourier basis; other orthogonal series estimators are essentially the same in principle. This discussion will lead naturally into an introduction of wavelet density estimators in the next chapter. Walter (1994) discusses density estimation with orthogonal series, with an aim toward extending them to include wavelet estimators. Suppose for the moment that the density of interest has support $[-\pi, \pi]$. (If the density has support on any other finite interval, it can be transformed easily enough.) Recall from Section 1.1 that if $f \in L^2[-\pi, \pi]$, then
Figure 2.3: Kernel density estimates for the Old Faithful geyser data, using the Gaussian kernel and varying bandwidths
$$f(x) = \frac{1}{2}a_0 + \sum_{j=1}^{\infty} \left( a_j \cos(jx) + b_j \sin(jx) \right), \qquad (2.7)$$
and that the density $f$ will be well-approximated by replacing the sum in (2.7) by a finite sum up to a large integer $J$. With knowledge of $f$, the coefficients could be computed according to
$$a_j = \frac{1}{\pi} \langle f, \cos(j\cdot) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(jx)\, dx, \qquad j = 0, 1, \ldots \qquad (2.8)$$
$$b_j = \frac{1}{\pi} \langle f, \sin(j\cdot) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(jx)\, dx, \qquad j = 1, 2, \ldots. \qquad (2.9)$$
Note that
$$a_j = E[\cos(jX)]/\pi, \quad \text{and} \quad b_j = E[\sin(jX)]/\pi. \qquad (2.10)$$
A very "raw" estimator of the density function can be given in terms of the Dirac delta function $\delta(x)$, defined to be infinite for $x = 0$ and zero for $x \ne 0$, with the property that
$$\int_{-\infty}^{\infty} \delta(x)\, dx = 1.$$
Allowing the bandwidth $\lambda$ to go to zero in (2.6) will result in the degenerate situation
$$\tilde{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - X_i). \qquad (2.11)$$
This function is constantly zero except for infinitely high spikes at the data points. This expression is interesting because the empirical distribution function $\tilde{F}$ defined in (2.3) can be written in terms of $\tilde{f}(x)$:
$$\tilde{F}(x) = \int_{-\infty}^{x} \tilde{f}(y)\, dy,$$
the same relationship between the "true" distribution function and density function (2.2). Without knowledge of $f$, the coefficients could be estimated by plugging the raw function $\tilde{f}$ in place of the true density $f$ in (2.8) and (2.9):
$$\hat{a}_j = \frac{1}{\pi} \int_{-\pi}^{\pi} \tilde{f}(x) \cos(jx)\, dx = \frac{1}{n\pi} \sum_{i=1}^{n} \cos(jX_i), \qquad j = 0, 1, \ldots$$
$$\hat{b}_j = \frac{1}{\pi} \int_{-\pi}^{\pi} \tilde{f}(x) \sin(jx)\, dx = \frac{1}{n\pi} \sum_{i=1}^{n} \sin(jX_i), \qquad j = 1, 2, \ldots.$$
It is straightforward to show that these coefficient estimators are unbiased:
$$E[\hat{a}_j] = \frac{1}{n\pi} \sum_{i=1}^{n} E[\cos(jX_i)] = E[\cos(jX)]/\pi = a_j,$$
which follows trivially from (2.10). A similar calculation gives that $\mathrm{Var}[\hat{a}_j] = O(\tfrac{1}{n})$, so that $\hat{a}_j \to a_j$ as $n \to \infty$ for all $j$ (similarly for the $\hat{b}_j$'s). Thus, a natural estimator of the density function would be
$$\hat{f}_J(x) = \frac{1}{2}\hat{a}_0 + \sum_{j=1}^{J} \left( \hat{a}_j \cos(jx) + \hat{b}_j \sin(jx) \right).$$
The relative smoothness of the resulting estimator is controlled by the limit of summation J. Choosing a large value of J will include many terms in the reconstruction, giving a rather wiggly estimate. Conversely, using a small J will give a smooth estimate, even if the underlying density is not smooth.
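A direct implementation of these empirical Fourier coefficients and the resulting density estimate might look like the following sketch (my own illustration; the data are assumed to have already been rescaled to $[-\pi, \pi]$).

```python
import numpy as np

def fourier_density(x, data, J):
    """Orthogonal series density estimate on [-pi, pi]:
       f_J(x) = a0_hat/2 + sum_{j=1}^J (aj_hat*cos(jx) + bj_hat*sin(jx)),
       with aj_hat = mean(cos(j*X_i))/pi and bj_hat = mean(sin(j*X_i))/pi."""
    data = np.asarray(data, dtype=float)
    est = np.mean(np.cos(0.0 * data)) / np.pi / 2.0        # a0_hat / 2 = 1/(2*pi)
    for j in range(1, J + 1):
        a_j = np.mean(np.cos(j * data)) / np.pi
        b_j = np.mean(np.sin(j * data)) / np.pi
        est += a_j * np.cos(j * x) + b_j * np.sin(j * x)
    return est

data = np.pi * (2.0 * np.random.rand(200) ** 2 - 1.0)      # toy sample on [-pi, pi]
grid = np.linspace(-np.pi, np.pi, 101)
estimate = np.array([fourier_density(x, data, J=5) for x in grid])
```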
2.2 Estimation of a Regression Function

Introductory courses in statistics generally include some treatment of regression. The typical case is where one has bivariate data $(x_i, Y_i)$ and wants to use the data to quantify the relationship between the two variables. In particular, the standard regression model is
$$Y_i = f(x_i) + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (2.12)$$
where the $\epsilon_i$'s are independent and identically distributed (iid) $N(0, \sigma^2)$ random variables. For the time being, it will be assumed that $(x_1, \ldots, x_n)$ are fixed design points. The primary interest is in recovering (estimating) the regression function (or underlying function) $f$ from the data. Approaches to this problem vary according to the assumptions that are put on the underlying function. The most common assumption is that $f$ is linear in $x$, giving a parameterized version of (2.12):
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, \ldots, n. \qquad (2.13)$$
If the errors are assumed to be iid $N(0, \sigma^2)$ random variables, then the maximum likelihood estimator for the function $f$ involves computing least-squares estimators for $\beta_0$ and $\beta_1$. If $f$ is assumed to be a polynomial or any other function involving a finite number of parameters, the model (2.13) can be extended to a multiple regression problem, in which statistical inference on the regression function $f$ amounts to inference on the unknown parameters. Although parametric regression is used widely in practice, in many situations one might be reluctant to choose a specific form of a model to fit a particular set of data. The field of nonparametric regression has developed to fit a curve to data without assuming any particular structure on the underlying function. Techniques in nonparametric regression each come with their own sets of assumptions, typically regarding the smoothness of $f$, such as specifying that $f$ have at least one continuous derivative. These assumptions are always much less restrictive than those in the parametric regression setting. For the sake of the developments in this section, it will be assumed that the design points $x_1, \ldots, x_n$ are equally spaced, and without further loss of generality, that they lie on the unit interval: $x_i = i/n$, $i = 1, \ldots, n$. Procedures similar to the ones discussed here can be used for the non-equally spaced case.
Figure 2.4: The piecewise constant "raw" approximator of the underlying regression function
Kernel Regression

Just as smooth kernel functions were introduced into density estimation to "smooth out" the naive estimator (2.4), applying kernels to the estimation of regression functions has proved quite useful and become very popular. In the same vein as the "raw" density estimator, a "raw" estimator of the regression function $f$ is the piecewise constant function
$$\tilde{f}(u) = \begin{cases} Y_i, & \dfrac{i-1}{n} < u \le \dfrac{i}{n}, \quad i = 1, \ldots, n, \\[4pt] 0, & u = 0. \end{cases} \qquad (2.14)$$
The $\tilde{f}(u)$ function for an example set of data is plotted in Figure 2.4. This raw function must now be "smoothed out" in some way to get a final estimator of $f$. This can be accomplished in general by using localized weighted averages of the $Y_i$'s:
$$\hat{f}(u) = \sum_{i=1}^{n} p_i(u)\, Y_i, \qquad (2.15)$$
where the weights sum to one: $\sum_{i=1}^{n} p_i(u) = 1$ for all $u \in [0, 1]$. The dependence of the weights on $u$ allows us to localize the smoothing, by giving more weight to data points that occur near the point $u$ and less weight to points further away. These weights can be obtained by applying a kernel function, as was done in the previous section. Though there are many forms of the kernel estimator, the one used here is of the form
$$\hat{f}(u) = \sum_{i=1}^{n} \left( \frac{1}{\lambda} \int_{(i-1)/n}^{i/n} K\!\left( \frac{u - v}{\lambda} \right) dv \right) Y_i, \qquad (2.16)$$
representing one way in which the $p_i(u)$'s can be computed. This estimator was originally studied by Gasser and Müller (1979) and Cheng and Lin (1981). Other forms of the kernel estimator are presented and compared in Eubank (1988) and other sources. As in kernel density estimation, this estimator also depends on the choice of a kernel function $K$ and a bandwidth $\lambda$. Any of the functions discussed in Section 2.1 are appropriate for use in kernel regression as well. When the particular kernel estimator (2.16) is used to generate the weights in (2.15), the weights will sum to one as long as the support of the function $K((u - v)/\lambda)$ is completely contained within $[0, 1]$. In mathematical analysis, the convolution operator is often applied to "smooth out" a rough function. It is only natural to apply this technique in statistics, where the goal is the same. The particular form of the kernel estimator (2.16) is called the convolution-type kernel estimator. The reason for this is that it can be derived by convolving the raw estimator $\tilde{f}$ given in (2.14) with a dilated version of the kernel function. Defining, for the moment, $K_\lambda(t) = (1/\lambda)K(t/\lambda)$,
$$\hat{f}(u) = (K_\lambda * \tilde{f})(u) = \int_0^1 \frac{1}{\lambda} K\!\left( \frac{u - v}{\lambda} \right) \tilde{f}(v)\, dv = \sum_{i=1}^{n} \int_{(i-1)/n}^{i/n} \frac{1}{\lambda} K\!\left( \frac{u - v}{\lambda} \right) \tilde{f}(v)\, dv, \qquad (2.17)$$
which agrees with (2.16), since $\tilde{f}(u)$ is constant over each interval of the form $[(i-1)/n,\ i/n)$. Key in the application of kernel regression is the choice of the bandwidth $\lambda$. Note that for a kernel function with support on $[-1, 1]$, the support of $K(u/\lambda)$ is $[-\lambda, \lambda]$. A relatively small value of $\lambda$ gives a narrow smoothing window, so the estimate of the function at $u$ depends primarily on observations near $u$. Thus, estimators with small bandwidths tend to be rather wiggly. Conversely, a larger choice of bandwidth gives a wider smoothing window, resulting in more averaging and a smoother estimate. The qualitative effect of bandwidth choice upon kernel estimators is illustrated in Figure 2.5. Three choices of bandwidth are used, with the Epanechnikov kernel and the simulated data plotted in Figure 2.4. A scaled version of the appropriately dilated kernel is superimposed on each plot. Small values of $\lambda$ give estimators with smaller bias but larger variance; selecting a larger bandwidth will decrease the variance but increase the bias.
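The convolution-type estimator (2.16) is easy to code directly. The sketch below is my own illustration (not code from the book); it approximates the inner integral with a midpoint rule and uses the Epanechnikov kernel.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, zero outside [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def gasser_muller(u, y, bandwidth, kernel=epanechnikov, subgrid=20):
    """Convolution-type kernel regression estimate (2.16) at a point u,
    for equally spaced designs x_i = i/n on [0, 1]."""
    n = len(y)
    est = 0.0
    for i in range(1, n + 1):
        # midpoint-rule approximation of the integral over [(i-1)/n, i/n]
        v = (i - 1) / n + (np.arange(subgrid) + 0.5) / (n * subgrid)
        w = np.mean(kernel((u - v) / bandwidth)) / (n * bandwidth)
        est += w * y[i - 1]
    return est

n = 64
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + 0.2 * np.random.randn(n)          # toy data
fit = np.array([gasser_muller(u, y, bandwidth=0.1) for u in x])
```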
Figure 2.5: Kernel estimates for various choices of the bandwidth $\lambda$

A data analyst might use several bandwidths on any particular data set, perhaps selecting a final estimator based on a qualitative assessment of the resulting estimates. It is also possible to compare estimates quantitatively as well, based on some objective function. In the linear (parametric) regression model (2.13), the estimate $\hat{f}$ is chosen so as to minimize the mean squared error (MSE) criterion:
$$\mathrm{MSE}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{f}(x_i) \right)^2.$$
Obviously, this approach cannot be used without some modification for nonparametric function estimation, as any estimator that interpolates the data points will have an MSE of zero. Using a very small value for $\lambda$ in (2.16) will
do just this (in fact, as $\lambda \to 0$, the kernel estimator will converge to the raw estimator $\tilde{f}$). Clearly, the MSE criterion by itself is not a good criterion. It is reasonable, however, to consider a related quantity, the quadratic loss for an estimator $\hat{f}$ that is estimating a function $f$:
$$L(\hat{f}, f) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{f}(x_i) - f(x_i) \right)^2.$$
Related to this is the expected loss, or risk:
$$R(\hat{f}, f) = \frac{1}{n} \sum_{i=1}^{n} E\left( \hat{f}(x_i) - f(x_i) \right)^2. \qquad (2.18)$$
If we are willing to make certain assumptions about the underlying function $f$, then we can choose the bandwidth that will minimize the risk over the class of functions to which $f$ belongs. Typically, it is assumed that the function has at least two continuous derivatives. The asymptotically "optimal" bandwidth, in the sense of minimizing risk, depends on a few properties of the kernel function and the sample size $n$, as well as on some unknown quantities, such as $\sigma^2$ and the integral of the square of the second derivative of $f$. There are various ways of estimating these values for use in computing the optimal bandwidth, and it can be shown that the resulting estimator is consistent, not only in terms of (2.18), but also pointwise. These ideas are discussed in more detail in Eubank (1988) with a more complete set of references.
Orthogonal Series Estimation

Another popular method for estimating regression functions uses orthogonal series. In the past, common families of orthogonal functions have included Legendre polynomials, Hermite polynomials, and the trigonometric functions. Each of these families constitutes a complete orthogonal system for the interval $[0, 1]$. This brief introduction to using orthogonal series in nonparametric regression will build on the principles set forth in Section 1.1, focusing only on the Fourier basis. Procedures for the other families of orthogonal functions are similar. In Section 1.1 it was seen that any function $f \in L^2([-\pi, \pi])$ can be approximated arbitrarily well by a sum of the form
$$S_J(x) = \frac{1}{2} a_0 + \sum_{j=1}^{J} \left( a_j \cos(jx) + b_j \sin(jx) \right) \qquad (2.19)$$
by choosing $J$ large enough. The coefficient sequences $\{a_j\}$ and $\{b_j\}$ are computed by taking the inner product of the function $f$ with the corresponding basis function as in (1.4) and (1.5). In the previous section, the interval of interest was $[0, 1]$. A function defined on this interval can easily be rescaled to lie on the interval $[-\pi, \pi]$ to fit into the natural domain of the trigonometric functions, so that the results of Section 1.1 can be applied directly. The remainder of this section will presuppose that this rescaling has been done for $f$ and $\tilde{f}$. In nonparametric regression, the underlying function $f$ is unknown, so the coefficients $a_j$ and $b_j$ must be estimated rather than computed directly. This could be accomplished by replacing $f(x)$ in (1.4) and (1.5) by its raw estimator $\tilde{f}$ given in (2.14):
$$\hat{a}_j = \frac{1}{\pi} \langle \tilde{f}, \cos(j\cdot) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} \tilde{f}(x) \cos(jx)\, dx, \qquad j = 0, 1, \ldots$$
$$\hat{b}_j = \frac{1}{\pi} \langle \tilde{f}, \sin(j\cdot) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} \tilde{f}(x) \sin(jx)\, dx, \qquad j = 1, 2, \ldots.$$
It can be shown that these estimated coefficients converge to the "true" coefficients, just as was done in the density estimation situation. Plugging these estimates into expression (2.19) gives a Fourier series estimator for $f$:
$$\hat{f}_J(x) = \frac{1}{2} \hat{a}_0 + \sum_{j=1}^{J} \left( \hat{a}_j \cos(jx) + \hat{b}_j \sin(jx) \right). \qquad (2.20)$$
This expression depends on the choice of a maximum index $J$ in a manner similar to a kernel estimator depending on the choice of bandwidth. A small value of $J$ will result in a relatively smooth estimator (small variance, possibly large bias), and a larger $J$ will give a more wiggly estimator (small bias, but large variance). Three Fourier series estimates are plotted in Figure 2.6 for another simulated data set. Note that as $J$ increases, the estimate tends to follow the local features of the data better, at the expense of smoothness. Using orthogonal series estimators is similar to parametric regression in the sense that the underlying function $f$ is written in terms of some building-block functions, with the coefficients of the building-block functions being estimated from the data. Since each of the orthogonal systems mentioned earlier forms a basis for a much broader class of functions, this approach is much more generally applicable, especially in cases when a specific model is not known.
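A small illustration of (2.20) (my own sketch, not from the text): because the raw estimator is piecewise constant, the empirical coefficients reduce to exact integrals of the basis functions over each design interval, weighted by the observations.

```python
import numpy as np

def fourier_series_fit(y, J):
    """Fourier series regression estimate (2.20) for equally spaced data on [0, 1],
    rescaled internally to [-pi, pi].  Returns a function of u in [0, 1]."""
    n = len(y)
    edges = -np.pi + 2.0 * np.pi * np.arange(n + 1) / n      # design intervals on [-pi, pi]
    a = np.zeros(J + 1)
    b = np.zeros(J + 1)
    a[0] = np.sum(y * np.diff(edges)) / np.pi                # (1/pi) * integral of f_tilde
    for j in range(1, J + 1):
        # exact integrals of cos(jx), sin(jx) over each design interval
        a[j] = np.sum(y * (np.sin(j * edges[1:]) - np.sin(j * edges[:-1])) / j) / np.pi
        b[j] = np.sum(y * (-np.cos(j * edges[1:]) + np.cos(j * edges[:-1])) / j) / np.pi

    def fhat(u):
        x = -np.pi + 2.0 * np.pi * np.asarray(u)             # map [0, 1] -> [-pi, pi]
        out = a[0] / 2.0
        for j in range(1, J + 1):
            out = out + a[j] * np.cos(j * x) + b[j] * np.sin(j * x)
        return out
    return fhat

n = 100
x = np.arange(1, n + 1) / n
y = np.where(x < 0.5, 2 * x, 2 - 2 * x) + 0.1 * np.random.randn(n)   # toy data
fit = fourier_series_fit(y, J=5)(x)
```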
Figure 2.6: Orthogonal series estimates for various choices of the summation limit J
For a given function $\tilde{f}(x)$, as $J$ goes to infinity, the estimate (2.20) is a legitimate Fourier series that converges to a version of $\tilde{f}(x)$. The limiting function is equal to $\tilde{f}(x)$ where $\tilde{f}(x)$ is continuous; at points of discontinuity, the estimate converges to the average of the limits from either side. The smoothing parameter of interest in nonparametric regression using orthogonal series is $J$, related to the number of terms included in the reconstruction. In the same way that an "optimal" bandwidth $\lambda$ can be chosen in kernel regression to minimize expected loss, an "optimal" choice for $J$ can be obtained. In particular, under certain assumptions on $f$, the optimal value for $J$ in terms of risk is proportional to $n^{1/(5+\delta)}$ for some constant $\delta > 0$. Again, this section is intended only to give a broad overview of orthogonal series estimation. For more details, the interested reader is referred to Eubank (1988).

2.3 Kernel Representation of Orthogonal Series Estimators
Kernel estimation and orthogonal series estimation were developed independently and seem, at first glance, to be completely unrelated methods. In fact, the two methods are closely related, and can be shown to be essentially equivalent under certain circumstances. This equivalence lends additional and valuable insight into the workings of each method. This general relationship is illustrated here by means of a single example, in the framework of nonparametric regression; the equivalence in density estimation can be shown in a similar manner. Here, we will consider a particular case of nonparametric regression using orthogonal series and show that the series estimator can be written in the form of a kernel estimator. This can be accomplished for any orthogonal series estimator in general. A common technique in dealing with boundary effects in nonparametric regression is "reflection boundary handling." (This will be discussed in terms of wavelets in Chapter 6.) Suppose we are interested in estimating a function $f$ with domain $[0, 1]$. We then impose a boundary condition by reflecting the function about 0: $f(u) = f(-u)$ for $u \in [-1, 0]$, so that the "extended function" has support $[-1, 1)$. We can now estimate $f$ on $[-1, 1)$ by standard orthogonal series methods. The basis we will use is the Fourier basis scaled to live on $[-1, 1]$: $\{\tfrac{1}{2}, \cos \pi u, \sin \pi u, \cos 2\pi u, \sin 2\pi u, \ldots\}$. By reflecting the function $f$ about zero, we force it to be an even function and thus we need only the constant function and the cosine functions in the decomposition, since
$$\int_{-1}^{1} f(u) \sin(j\pi u)\, du = 0, \quad \text{for } j = 1, 2, \ldots.$$
We can reflect the raw estimator $\tilde{f}$ about zero as well: $\tilde{f}(u) = \tilde{f}(-u)$ for $u \in [-1, 0)$. The orthogonal series estimator for maximum index of summation $J$ is
$$\hat{f}_J(u) = \frac{1}{2}\hat{a}_0 + \sum_{j=1}^{J} \hat{a}_j \cos(j\pi u), \qquad (2.21)$$
where
$$\hat{a}_0 = \int_{-1}^{1} \tilde{f}(u)\, du = 2 \int_0^1 \tilde{f}(u)\, du \qquad (2.22)$$
and
$$\hat{a}_j = \int_{-1}^{1} \tilde{f}(u) \cos(j\pi u)\, du = 2 \int_0^1 \tilde{f}(u) \cos(j\pi u)\, du, \qquad j = 1, 2, \ldots, J. \qquad (2.23)$$
We can thus restrict our attention to $[0, 1]$ again, keeping in mind the imposed reflection about zero. Substituting the coefficient estimates (2.22) and (2.23) into the expression (2.21), and performing some algebraic manipulation eventually leads to
$$\hat{f}_J(u) = \int_{-1}^{1} \tilde{f}(v) \left( \frac{1}{2} + \sum_{j=1}^{J} \cos(j\pi(u - v)) \right) dv. \qquad (2.24)$$
We now apply the identity
$$D_J(x) = \frac{1}{2} + \sum_{j=1}^{J} \cos(j\pi x) = \frac{\sin\left[ \left(J + \tfrac{1}{2}\right)\pi x \right]}{2 \sin\left( \frac{\pi x}{2} \right)}. \qquad (2.25)$$
The expression (2.25) is known as the Dirichlet kernel, and it satisfies all the necessary requirements for a kernel function. Examples of the Dirichlet kernel for various choices of $J$ are plotted in Figure 2.7. (Note that this kernel is negative over part of its domain.) By combining (2.25) and (2.24), the series estimator can be written to look like a kernel estimator:
$$\hat{f}_J(x) = \int_{-1}^{1} \tilde{f}(u)\, D_J(x - u)\, du. \qquad (2.26)$$
To show the correspondence between bandwidth in kernel regression and choice of $J$ in orthogonal series regression in this example, note that applying a Taylor series expansion on $D_J(x)$ in (2.25) gives
$$D_J(x) \approx \frac{\sin\left[ \left(J + \tfrac{1}{2}\right)\pi x \right]}{\pi x}.$$
Taking out the $\left(J + \tfrac{1}{2}\right)$ factor, we can define a basic kernel as
$$K(x) = \frac{\sin(\pi x)}{\pi x},$$
so that an approximation to the orthogonal series estimator (2.26) can be
Figure 2.7: Plots of the Dirichlet kernel for various choices of J
written
$$\hat{f}_J(u) = \int_0^1 \tilde{f}(v)\, \frac{1}{\lambda} K\!\left( \frac{u - v}{\lambda} \right) dv,$$
which corresponds exactly to the expression for the standard Gasser-Müller kernel smoother (2.16). To make these expressions equivalent, we must set the bandwidth to be
$$\lambda = \frac{1}{J + \tfrac{1}{2}},$$
which makes explicit the relationship between the bandwidth $\lambda$ in kernel smoothing and the maximum index of summation $J$ in orthogonal series estimation: Roughly speaking, they are reciprocals of each other. Wavelets seem more applicable to standard orthogonal series estimation, but in fact, the wavelet versions of the standard methods presented in the next chapter are essentially equivalent to corresponding versions of kernel estimators. This can be shown using arguments analogous to those presented in this section.
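To see the identity (2.25) and the bandwidth correspondence numerically, the following sketch (mine, not the book's) evaluates the Dirichlet kernel both as a cosine sum and in closed form.

```python
import numpy as np

def dirichlet_sum(x, J):
    """Dirichlet kernel as a cosine sum: 1/2 + sum_{j=1}^J cos(j*pi*x)."""
    return 0.5 + sum(np.cos(j * np.pi * x) for j in range(1, J + 1))

def dirichlet_closed(x, J):
    """Closed form of (2.25): sin((J + 1/2)*pi*x) / (2*sin(pi*x/2))."""
    return np.sin((J + 0.5) * np.pi * x) / (2.0 * np.sin(np.pi * x / 2.0))

x = np.linspace(-0.4, 0.4, 81)
x = x[x != 0.0]                      # avoid the removable singularity at x = 0
for J in (4, 10):
    assert np.allclose(dirichlet_sum(x, J), dirichlet_closed(x, J))
    print(J, 1.0 / (J + 0.5))        # the "equivalent bandwidth" lambda = 1/(J + 1/2)
```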
CHAPTER THREE

Elementary Statistical Applications
With the basic introduction of wavelets in Chapter 1 and the overview of standard smoothing techniques in Chapter 2, we are ready to combine these concepts and examine some fundamental applications of wavelets in function estimation. This chapter will focus on wavelet versions of the two types of estimators discussed in Chapter 2 (kernels and orthogonal series), as they are applied to density estimation and nonparametric regression.
3.1 Density Estimation
Given the framework for density estimation by orthogonal series described in Section 2.1, it is relatively straightforward to adapt these methods to the use of wavelets. As before, it is natural to begin our discussion of density estimation with the construction of histograms.
Haar-Based Histograms

The piecewise constant nature of the histograms described in Section 2.1 might be vaguely reminiscent of the Haar wavelet system. Specialized versions of histograms can in fact be constructed using the Haar basis. An approach to this construction is described in Chapter 12 of Walter (1994). Some of the important theoretical properties of such an estimator are discussed by Engel (1990). This interesting application of the Haar wavelet system is considered here as well, not only for the new interpretation that it lends to histogram estimators, but also because it leads naturally to other density estimators with smoother wavelet bases. Recall the Haar scaling function
$$\phi(x) = \begin{cases} 1, & 0 \le x < 1, \\ 0, & \text{otherwise.} \end{cases}$$
Applying the usual dilation and translation operations gives
$$\phi_{j,k}(x) = 2^{j/2}\, \phi(2^j x - k).$$
In light of this, we can count up the number of data points that lie within a particular interval $[2^{-j}k,\ 2^{-j}(k+1))$ using the quantity
$$2^{-j/2} \sum_{i=1}^{n} \phi_{j,k}(X_i).$$
Now for any $x \in \mathbb{R}$ and $j \in \mathbb{Z}$,
$$x \in \left[ 2^{-j}[2^j x],\ 2^{-j}([2^j x] + 1) \right)$$
(where $[x]$ denotes the greatest integer function of $x$), so the number of data points that lie in the same interval as any real number $x$ can be computed by
$$2^{-j/2} \sum_{i=1}^{n} \phi_{j,[2^j x]}(X_i) = \sum_{i=1}^{n} \phi\left(2^j X_i - [2^j x]\right).$$
The histogram density estimator with origin 0 and bins of width $2^{-J}$ is thus given by
$$\hat{f}_J(x) = \frac{2^J}{n} \sum_{i=1}^{n} \phi\left( 2^J X_i - [2^J x] \right). \qquad (3.1)$$
This estimator can be regarded as being the "best" estimator of the density f on the approximation space VJ. Incrementing J by 1 has the effect of halving each bin, and, similarly, decreasing J by 1 collapses two adjacent bins into one. Constructing histograms using the Haar basis does not allow as much flexibility as the more general histogram estimator discussed in Section 2.1, but it does provide an interesting application of the Haar basis, and, as we will see below, leads to more general wavelet density estimators. Note that the expression (3.1) can also be written as a decomposition into scaling function components:
$$\hat{f}_J(x) = \sum_{k \in \mathbb{Z}} \hat{c}_{J,k}\, \phi_{J,k}(x), \qquad (3.2)$$
where the estimates of the scaling function coefficients are given by
$$\hat{c}_{J,k} = \langle \tilde{f}, \phi_{J,k} \rangle = \frac{1}{n} \sum_{i=1}^{n} \phi_{J,k}(X_i), \qquad (3.3)$$
the "raw" density estimate $\tilde{f}$ having been defined in (2.11). For any finite data set, the coefficients $\hat{c}_{J,k}$ will all be zero for sufficiently large $|k|$, so a suitably truncated version of (3.2) will do. Computing the estimated coefficient $\hat{c}_{J,k}$ according to (3.3) is equivalent to
$$\hat{c}_{J,k} = \frac{2^{J/2}}{n} \cdot \#\{X_i \in [2^{-J}k,\ 2^{-J}(k+1))\}. \qquad (3.4)$$
Only the coefficients at the highest level of resolution of interest need to be calculated according to (3.3) or (3.4). Lower-level empirical scaling function coefficients $\hat{c}_{j,k}$ (and empirical wavelet coefficients $\hat{d}_{j,k}$) can be computed by the fast decomposition algorithms to be described in Section 4.1. Thus, histograms at many levels can be displayed for any data set without requiring a great deal of computational effort. The decomposition algorithm can be applied until reaching an appropriately "coarse" scale $j_0$. The histogram can then be written in terms of the Haar wavelets as follows:
$$\hat{f}_J(x) = \sum_k \hat{c}_{j_0,k}\, \phi_{j_0,k}(x) + \sum_{j=j_0}^{J-1} \sum_k \hat{d}_{j,k}\, \psi_{j,k}(x).$$
Also in Chapter 12, Walter (1994) discusses an automatic algorithm to choose the "best" level $J$ for giving a Haar-based histogram for a set of data, using the integrated mean square error criterion
$$\mathrm{IMSE} = \int_{-\infty}^{\infty} E\left( \hat{f}_J(x) - f(x) \right)^2 dx.$$
This algorithm begins by computing the $\hat{c}_{j,k}$ coefficients (and thus the histogram estimate) at the highest level of interest, estimating its error, and then recursively computing lower-level coefficients and the associated estimated error. Walter suggests using the level $J$ at which the estimated error increases most rapidly when moving from level $J$ to $J - 1$. This idea is related to the problem of choosing the optimal number of predictors in multiple regression: adding predictors until the improvement in $R^2$ slows down. Some wavelet histogram version of the adjusted $R^2$ criterion might be appropriate here.
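A minimal sketch of the Haar-based histogram (my own illustration of (3.2)-(3.4), with toy data rather than the Old Faithful values): compute the empirical coefficients at the finest level of interest and evaluate the estimate bin by bin.

```python
import numpy as np

def haar_coeffs(data, J):
    """Empirical Haar scaling coefficients at level J, as in (3.4):
       c_hat[J, k] = 2^(J/2) * #{X_i in [k/2^J, (k+1)/2^J)} / n."""
    data = np.asarray(data, dtype=float)
    bins = np.floor(data * 2.0**J).astype(int)       # bin index of each observation
    ks = np.arange(bins.min(), bins.max() + 1)
    counts = np.array([(bins == k).sum() for k in ks])
    return ks, 2.0**(J / 2.0) * counts / len(data)

def haar_histogram(x, ks, c, J):
    """Evaluate the level-J Haar histogram estimate (3.2) at the point x."""
    k = int(np.floor(x * 2.0**J))
    if k < ks[0] or k > ks[-1]:
        return 0.0
    return c[k - ks[0]] * 2.0**(J / 2.0)              # c_hat[J,k] * phi_{J,k}(x)

data = np.array([1.8, 1.9, 2.0, 3.3, 3.5, 3.6, 4.1, 4.5])    # toy sample
ks, c = haar_coeffs(data, J=2)
print([haar_histogram(x, ks, c, J=2) for x in (1.9, 3.4)])
```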
Figure 3.1: Haar-based histograms for the Old Faithful geyser data for varying resolution levels
Examples of the Haar-based histograms are given in Figure 3.1 for levels 1, 2, 3, and 4, using the Old Faithful data set from Chapter 2. It is informative to compare the histograms in Figure 2.1 with these density estimates based on the Haar basis. Engel (1990) considers a similar construction of Haar-based histograms for data on the interval [0, 1], and derives rates of convergence in terms of integrated mean absolute error:
$$\mathrm{IMAE} = \int \left| \hat{f}_J(x) - f(x) \right| dx.$$
He also briefly considers convergence in terms of IMSE. His results indicate that good rates of convergence can be obtained using these Haar histograms with less restrictive assumptions (as compared to standard density estimation techniques) on the true density f.
Estimation with Smoother Wavelets

Estimating density functions using smooth wavelets can be done essentially in the same way as is done using any orthogonal series. A natural application of wavelets, this estimation procedure results from a straightforward extension of the Haar-based histogram approach of the previous section. Among the first to consider density estimation using wavelets are Doukhan and Leon (1990), Kerkyacharian and Picard (1992, 1993), and Walter (1992). The same approach that was used to estimate a density in terms of the Haar basis can be used with smoother wavelet bases as well. Let $\phi$ and $\psi$ be an orthonormal scaling function and mother wavelet pair that generate a series of approximating spaces $\{V_j\}_{j \in \mathbb{Z}}$. Then, if $f(x)$ represents a square-integrable density function, it can be represented by
$$f(x) = \sum_k c_{j_0,k}\, \phi_{j_0,k}(x) + \sum_{j > j_0} \sum_k d_{j,k}\, \psi_{j,k}(x), \qquad (3.5)$$
where $j_0$ represents a "coarse" level of approximation. The first part of (3.5) is the projection of $f$ onto the coarse approximating space $V_{j_0}$, and the second part represents the details. The first issue in estimating $f$ involves estimating the coefficients in the above decomposition. This can be accomplished in essentially the same way that the Haar coefficients were estimated. In particular,
$$\hat{c}_{j,k} = \langle \tilde{f}, \phi_{j,k} \rangle = \frac{1}{n} \sum_{i=1}^{n} \phi_{j,k}(X_i), \qquad (3.6)$$
$$\hat{d}_{j,k} = \langle \tilde{f}, \psi_{j,k} \rangle = \frac{1}{n} \sum_{i=1}^{n} \psi_{j,k}(X_i). \qquad (3.7)$$
Again, as in the Haar histogram case, these coefficients only need to be computed this way at the highest level of interest, and then all other coefficients are computed using the decomposition algorithm. This represents one instance of orthogonal series estimation of densities, and all the results of orthogonal series apply. Using the estimated coefficients given above, the wavelet estimator for $f$ at level $J \ge j_0$ is simply
$$\hat{f}_J(x) = \sum_k \hat{c}_{j_0,k}\, \phi_{j_0,k}(x) + \sum_{j_0 \le j < J} \sum_k \hat{d}_{j,k}\, \psi_{j,k}(x) = \sum_k \hat{c}_{J,k}\, \phi_{J,k}(x). \qquad (3.8)$$
The smoothing parameter in this estimation scheme is the index J of the highest level to be considered. The issue of choosing J here is the same as choosing J in constructing Haar-based histograms.
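For a smoother basis, the same computation only requires a way to evaluate $\phi_{j,k}$. The sketch below is my own illustration under that assumption: it takes a callable phi (for instance the Haar indicator used here, or a tabulated Daubechies scaling function) and forms the level-J estimate (3.8) directly from the empirical coefficients (3.6).

```python
import numpy as np

def wavelet_density(x, data, J, phi, k_range):
    """Level-J orthogonal-series density estimate (3.8), computed directly from the
    empirical scaling coefficients (3.6): c_hat[J,k] = mean(phi_{J,k}(X_i))."""
    data = np.asarray(data, dtype=float)
    est = 0.0
    for k in k_range:
        phi_Jk = lambda t: 2.0**(J / 2.0) * phi(2.0**J * t - k)   # dilation/translation
        c_hat = np.mean(phi_Jk(data))
        est += c_hat * phi_Jk(x)
    return est

haar = lambda t: np.where((t >= 0.0) & (t < 1.0), 1.0, 0.0)        # Haar scaling function
data = np.array([0.18, 0.19, 0.20, 0.33, 0.35, 0.36, 0.41, 0.45])  # toy data on [0, 1]
print(wavelet_density(0.34, data, J=3, phi=haar, k_range=range(0, 8)))
```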
Figure 3.2: Smooth wavelet-based density estimates for the Old Faithful data set using the Daubechies wavelet with N = 5 and four choices of J
As pointed out by Janssen (1992), there is a potential problem with using arbitrary wavelets for estimation of a density. If the Haar basis were used in constructing (3.8), the resulting estimator is guaranteed never to be negative, since the Haar scaling function is always nonnegative. This is not the case, though, with general scaling functions (see the examples shown in Chapter 1). In fact, among the orthogonal wavelets, only use of the Haar basis will guarantee positive density estimates. Walter (1994) considers estimating the density function indirectly, by using wavelets to estimate the Fourier transform of the density (i.e., the characteristic function) and then transforming back. This will give a nonnegative estimate of f for any orthogonal wavelet basis. Smooth wavelet-based density estimates for the Old Faithful data set are plotted in Figure 3.2 for four choices of the level j. Note that the estimates are indeed negative in places.
3.2 Nonparametric Regression

The basic methods of kernel regression and orthogonal series estimation are described in Chapter 2. This section will discuss the wavelet version of these basic methods as applied to nonparametric regression, i.e., recovery of a regression function $f(x)$ given only bivariate data $(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)$, where
$$Y_i = f(x_i) + \epsilon_i, \qquad i = 1, \ldots, n,$$
and $\epsilon_i \sim$ iid $N(0, \sigma^2)$. To simplify the treatment of these methods, we will only treat the special case of equally spaced design points. As was done in Chapter 2, it will be supposed without loss of generality that these are on the unit interval $[0, 1]$: $x_i = i/n$, $i = 1, \ldots, n$. In one of the early papers combining wavelets and statistics, Antoniadis, Gregoire, and McKeague (1994) describe a technique for the estimation of a regression function. Here, we briefly review this paper. The projection of $f$ onto the approximation space $V_J$ may be written
(PJ f) (u)
= l:cJ,kc/JJ,k(u),
(3.9)
k
where the coefficients are computed by 1
CJ,k = (/, c/JJ,k) = { f(u)c/JJ,k(u) du.
lo
(3.10)
Without complete knowledge of the function f, the "true" coefficients in (3.9) must be replaced with estimated coefficients giving a wavelet estimator of the projection to be
jJ(u)
= 2: CJ,kc/JJ,k(u),
(3.11)
k
where the estimated coefficients are computed as in (3.10), substituting the raw estimator f- defined in (2 .14) in place of f: (3.12)
This is just the wavelet version of the classic orthogonal series estimator, using the basis { ¢J,k, k E Z}. As was shown for one particular example in Section 2.3, this wavelet orthogonal series estimator is equivalent to a wavelet-based kernel estimator. Substituting (3.12) into (3.11) and rearranging terms gives n
]J(u) =
i/n
L Yi 1 i= 1
(i-1)/n
EJ(u, v) dv,
(3.13)
56
NONPARAMETRIC REGRESSION
II II II I I I I I I I I I I I I
I I I
...-·······...
,·:;·
I I
0.0
0.2
·....
\.......
0.4
0.6
Figure 3.3: Wavelet-based kernels for J with N = 5 for three values of u
I I I
I I I I
I I I I
I I I I
I I I I
I
I
I I I I I
I
I I I I
I
I I
I
I I;
0.8
1.0
= 4 based on the Daubechies family
where the function Em(u, v), as defined by Meyer (1990), can be written
EJ(u, v) =
L c/JJ,k(u)c/JJ,k(v),
(3.14)
k
where ¢J,k ( x) is the appropriately dilated and translated scaling function for a specified family of wavelets. The standard kernel estimator described in Section 2.2 is a fixed kernel estimator, but it can be seen clearly that the kernel (3.14) is variable-that its form depends on u. Variable kernels are common in practice, for example, when dealing with boundary effects (see, for example, Gasser and Mi.iller (1979) or Gasser, Mi.iller, and Mammitzsch (1985)). Antoniadis, Gregoire, and McKeague (1994) suggest that this changing kernel allows the wavelet-based estimator to adapt itself automatically to local features of the data. Figure 3.3 displays three versions of the kernel used in such wavelet smoothing. The kernels shown are those for u = 0.1, u = 0.5, and u = 0.9, plotted using a solid line, a dotted line, and a dashed line, respectively. These kernels are specific to J = 4 using the Daubechies wavelet family with N = 5. Again, note that these kernels are negative in places.
Elementary Statistical Applications
.
~
CD
-:!" C\1
0
~
~
...
CX)
0.2
0.4
.
~
0.6
0.8
CD
-:!" C\1
0
~
1.0
0.0
0.4
0.6
0.8
1.0
J=5 ~ CX)
CD
CD
-:!"
-:!"
C\1
C\1
0
0
~
~
0.2
0.2
J=4
CX)
0.0
J=3
CX)
.... .. ,.. ,· ... ........ . .. 0.0
.
J=2
57
0.4
0.6
0.8
1.0
.. .. 0.0
0.2
0.4
0.6
0.8
1.0
Figure 3.4: Wavelet estimates of a simulated data set for varying values of J
It is not apparent that the expression (3.13) depends explicitly on the choice of a bandwidth in the same way that other kernel estimators do. The user still has control over the resulting smoothness of the estimator, but, as was seen in the kernel representation of the Fourier series, this control comes in the form of choosing the level J. As in other orthogonal series estimators, increasing J amounts to decreasing the amount of smoothing, so In the same way, the kernel EJ(u, v) becomes narrower for larger J, affecting the estimate the same way as using a smaller bandwidth in a "standard" kernel estimator. The wavelet estimator (3.13) has distinct advantages over classical nonparametric regression techniques. One of these is that the asymptotic rates of convergence hold for weaker conditions on the underlying function than must be assumed in obtaining similar results for other types of smoothing. Figure 3.4 displays a simulated data set, and the resulting wavelet estimator for four choices of J. The mean function is the same as that used in the examples in Figure 2.6: a linear trend upward, a linear trend downward, and a flat portion. Like standard kernel and orthogonal series methods, this wavelet estimator tends to undershoot the peak for small values of J (corresponding to large bandwidth). The development of this estimator made it clear that it is at the same time both an orthogonal series and a kernel estimator. Viewed as a series esti-
58
NONPARAMETRIC REGRESSION
mator, this method captures the essence of the multiresolution analysis in a function estimation framework. The estimator }J represents the projection of the function f- onto the approximating space VJ as defined in Chapter 1. Analogous to nonparametric regression with orthogonal series, increasing the smoothing parameter J allows additional detail in the estimated reconstruction (at the expense of adding greater variability of the resulting estimator). Though the simple situation considered in this chapter requires that the design points be fixed and equally spaced, analogous estimators can be constructed under more general conditions. Estimators similar to (3.13) with non-equally spaced Xi's and random design points are considered in the paper by Antoniadis, Gregoire, and McKeague (1994).
CHAPTER
FOUR
Wavelet Features and Examples
Chapter 1 presented the bare necessities for understanding the basic principles of wavelet analysis, presenting the concepts through the simplest example, the Haar system. With a good understanding of these principles, it is possible to skip forward to later chapters dealing with statistical analysis, but before treating more advanced statistical applications of wavelet analysis, a more thorough and general treatment of wavelets is useful. This chapter will give more insight into some of the advantages inherent in wavelet analysis, describing basic algorithms, and time-frequency localization concepts. It finishes up with a more complete development of the wavelet examples mentioned in Section 1.3. It should be emphasized here that while perhaps none of the topics in this chapter are essential to applying wavelets in data analysis, a good working knowledge of the relevant concepts will greatly aid in appreciation and understanding. Section 4.4 describes a few example wavelet bases, including those in the more general class of biorthogonal wavelets. The first three sections of this chapter (and most of the rest of this book) are concerned only with orthogonal wavelets.
4.1
Wavelet Decomposition and Reconstruction
In Section 1.2, it was seen that both the Haar scaling function ¢>(x) and the Haar wavelet '1/J ( x) can be written in terms of Haar scaling functions at level 1: ¢ 1 ,0 (x) and ¢ 1 , 1 . (See equations (1.12) and (1.13).) This provided a simple example of the two-scale relationships of wavelets. Later in the same section, it was shown that, at least in the Haar case, a decomposition algorithm existed (equations (1.16) and (1.18)) that allow us to express the wavelet and scaling function coefficients at any level of resolution in terms of the scaling function coefficients at the next higher level. At that time, it was hinted that the two concepts (two-scale relationships and decomposition algorithms) were
60
WAVELET DECOMPOSITION AND RECONSTRUCTION
related, and that they existed in some form for all sets of wavelets. This section, tracing some of the work in Mallat (1989a), will develop these ideas further and with more generality.
Two-Scale Relationships We begin with an MRA (see properties (1)-(5) in Section 1.2) consisting of spaces {Vj, j E .72:} with each Vj having orthonormal basis {¢J,k, k E .72:} where, as before, ¢1,k(x) = 211 2 ¢(21x- k). From this, we will present an expression for a wavelet function 'ljJ, define W1 spaces based on 'ljJ, and show that this leads to a CONS for L 2 (JR). Note that¢ E V(l and therefore also ¢ E VI since V0 C VI. Since {¢I ,k, k E .72:} is an orthonormal basis for VI, there exists a sequence { hk} such that
¢(x)
= ~ hk¢I,k(x)
(4.1)
kEZ
and that the sequence elements may be written (4.2) This sequence { hk} is a square-summable sequence: We say { hk} E £2 ( .72:) if I:kEZ h~ < oo. The two-scale relationship (4.1), relating functions with differing scaling factors, is also known as the dilation equation or the refinement equation. For the Haar basis, it was seen in (1.12) that this sequence is
h k-- {
)z' 0,
k = 0, 1 otherwise.
(4.3)
In this multiresolution context, this same sequence that relates scaling functions at two levels of hk 's can be used to define the mother wavelet: 'ljJ(x) = ~(-l)kh_k+I¢I,k(x).
(4.4)
kEZ
A special case of this construction was seen in (1.13) for the Haar wavelet. The reason for such a construction is to ensure that the scaling function and wavelet will be orthogonal:
Wavelet Features and Examples
('l/;, ¢)
=I
61
I (~(-t)•h-•wh•(x)) I
'lj;(x)¢(x) dx
l:(-1)kh_k+l
<jJ(x)dx
c/Jt,k(x)¢(x) dx
k
~(-1)kh_k+Ihk k
0.
The last step follows since the summand for k is the opposite of the summand for 1- k, so each term is negated, convergence holding since {hk} E £2 (.72:). It can be seen similarly that each integer translation of the mother wavelet 'ljJ is also orthogonal to¢:
I I
'lj;(x- k)¢(x) dx l:(-1)fh_f+I¢I,f(x- k)¢(x) dx fEZ
l:(-1)fh_f+l fEZ
I
¢I,2k+f(x)¢(x)dx
I:( -1)fh_f+Ih2k+f fEZ
o, the last step following because the summands for f and for 1 - f - 2k cancel, and convergence holds because of the square summability of the sequence { hk}. A straightforward extension of this argument will show that 'l/Jo,k .l c/Jo,f for all k, f E .7Z and, further, that 'l/JJ,k .l ¢J,f for all j, k, f E .72:. Thus, if we define the space W0 to be the span of the set of wavelets {'l/Jo,k, k E .72:}, then it is clear that V0 .l W0 , and it follows readily that Yj .l W1 for all j E .72:. Now, to show that V(l .l W1 , we must first show that, for each k, f E .72:, c/Jo,k .l 'l/Jt,f· This is straightforward given what we know already: The result follows by using (4.1) to express ¢o,k in terms of ¢ 1 ,m's and applying earlier results. This argument can be extended recursively to show that Vj .l WJ+m for all m ~ O,j E .72:. From this, it can be seen that
'l/;J,k .l 'l/Jj' ,k' for all j, j', k, k' E .72:, j =/:- j', k
f:-
k'
(write either wavelet according to ( 4.4), and apply known results), so that the wavelet spaces {Wj, j E .72:} are mutually orthogonal as well. Thus, from the MRA structure of the Yj 'sand from the results derived in this section, we have
62
WAVELET DECOMPOSITION AND RECONSTRUCTION
shown that the set of all wavelets {'ljJJ,k, j, k E .72:} is a complete orthonormal system for L 2 (IR).
The Decomposition Algorithm In Section 1.2, the two-scale relationships (1.12) and (1.13) were converted to decomposition algorithms (1.16) and (1.18). This is now accomplished in more generality. As before, let {c1,k, j,k E .72:} and {d1,k, j,k E .72:} represent the scaling function and wavelet coefficients of a function f respectively. These can be computed as CJ,k
=I
f(x)¢J,k(x) dx
(4.5)
f(x)'ljJJ,k(x) dx
(4.6)
and dj,k
=I
as before. Section 6.2 will discuss computation of scaling function and scaling function coefficients that don't involve integration, but regarding these coefficients conceptually as being computed according to ( 4.5) and ( 4.6) will help in their interpretation. Given the two-scale relationship ( 4.1), an expression for any scaling function in terms of higher-level scaling functions can be derived: Since ¢ 1,k(x) = 2il 2¢(2ix- k),
I: he 2i1 2¢l,t(2lx- k) I: he 2U+I)/ ¢(2J+I x fEZ
2
2k- £)
fEZ
2: he ¢J+l,f+Zk(x) fEZ
2: he-zk ¢J+I,e(x).
(4.7)
fEZ
Substituting this result into the definitional formula (4.5) for c1,k, then interchanging the sum and the integral, gives the general decomposition algorithm for scaling function coefficients: Cj,k
= L ht-2k Cj+I,f·
(4.8)
f
Similarly, a two-scale relationship relating any 'ljJJ,k to the ¢J+t,e's can be
Wavelet Features and Examples
+--
CJ-M+1,·
+--
+--
CJ-2,-
+--
63
CJ-1,-
+--
C J,.
Figure 4.1: Schematic representation of the decomposition algorithm
derived using ( 4.4), which leads to the wavelet coefficient portion of the decomposition algorithm: dj,k
= :~::::) -1 )f h-£+2k+l
Cj+I,f·
(4.9)
fEZ
Thus, given scaling coefficients at any level J, all lower-level scaling function coefficients for j < J can be computed recursively using (4.8), and all lower-level wavelet coefficients (j < J) can be computed from the scaling function coefficients using (4.9). Defining Cj,· and dj,· to represent the sets of scaling function and wavelet coefficients at level j respectively, this decomposition algorithm is represented schematically in Figure 4.1. The arrows represent the decomposition computations: CJ- 2 ,. and dJ- 2 ,. can be computed using only the coefficients CJ- 1 ,., for instance. The algorithms given in ( 4.8) and ( 4.8) share an interesting feature. Note that in either equation, if the dilation index k is increased by one, the indices of the {he} sequence are all offset by two. Thus, in computing either decomposition, there is an inherent down-sampling of coefficients. Roughly speaking, this means that if there are only finitely many non-zero elements in the {he} sequence, then applying the decomposition algorithm to a set of nonzero scaling function coefficients at level j + 1 will yield only half as many nonzero scaling function coefficients at level j. Similarly, there will only be half as many non-zero wavelet coefficients at level j. Computing the decomposition algorithm recursively yields fewer coefficients at each level. This structure led Mallat (1989a) to refer to the decomposition of coefficients as the "pyramid algorithm"; Daubechies (1992) terms it the "cascade algorithm." This downsampling is the key to the fast wavelet algorithms, which will be discussed thoroughly in Chapter 6.
The Reconstruction Algorithm Though perhaps not as apparent, it is also possible to move back up the ladder, starting with low-level coefficients and computing higher-level coefficients. This is known as the reconstruction algorithm, which is derived here for general orthogonal wavelet bases.
64
WAVELET DECOMPOSITION AND RECONSTRUCTION
Again, start with an MRA with { c/>j,k, k E .72:} and {'1/Jj,k, k E .72:} forming orthonormal bases for Vj and Wj, respectively, and the Wi spaces representing the orthogonal detail spaces as in Section 1.2. Since ¢I,o E VI and since VI = V0 EB Wo, we know that ¢I ,o can be written as a linear combination of the ¢o,k 's (basis for V{>) and the '1/Jo,k 's (basis for W0). The same can be said for ¢ 1, 1 . As before, the coefficients of the linear combinations will be computed by taking inner products. To allow a later unification of treatment, we adopt an unusual numbering scheme, and we define, for k E .72:,
Thus,
c/>1,o(x) =
L
+ bzk'I/Jo,k(x))
(4.10)
+ bzk-1'1/Jo,k(x)).
(4.11)
(azk¢>o,k(x)
kEZ
and
¢1,I(x) =
L
(azk-I¢>o,k(x)
kEZ
Then using (4.10) and (4.11), we can write a similar expression for any ¢ 1,k· We derive first the formula for even k: k ¢>I,o(x- 2)
L
k
au¢>o,f(x - 2)
+ bu'I/Jo,f(x -
k
2)
fEZ
L azf¢>o,t+f(x) + bzf'I/Jo,t+f(x) L azf-k¢>o,f(x) + bu-k'I/Jo,f(x). fEZ
(4.12)
fEZ
Working through a formula for odd k gives precisely the same formula, so (4.12) holds for all k E .72:. For odd (even) k, only the odd-indexed (evenindexed) elements of the sequences {at} and {be} are accessed.
Wavelet Features and Examples
CJ,.
--+
--+
CJ+1,·
--+
CJ+M-1,·
65
--+
CJ+M,·
Figure 4.2: Schematic representation of the reconstruction algorithm
Following similar arguments, an expression relating each scaling function ¢J,k to scaling functions and wavelets at level j - 1 can be derived:
;j,k(x) =
L au-k
(4.13)
fEZ
As in the previous subsection, the expression ( 4.13) can be applied to get the reconstruction algorithm for the scaling function and wavelet coefficients: Cj,k =
L au-kcj-I,f + bu-kdj-1,f·
(4.14)
fEZ
Thus we see that the scaling function coefficients at any level can be computed from only one set of low-level scaling function coefficients and all the intermediate wavelet coefficients by applying (4.14) recursively. This concept is represented schematically in Figure 4.2. These two reconstruction sequences, though initially presented as if they had to be computed from scratch, are written easily in terms of the (now familiar) two-scale sequence { hk}. As an example, the coefficient bzk- 1 is written bzk-1
(¢1,1,
'l/Jo,k)
/_: hq,(zx- 1)'1/J(x- k) dx /_: hq,(zx- (1- Zk)),P(x) dx
L( -1 )f h-f+1
(¢1,1-2k' ¢1,f)
fEZ
-hzk·
By similar arguments it can be seen that ak = h-k and that bk = (-1)khk+ 1 for all k E 7L. In terms of the sequence {hk}, the reconstruction algorithm
66
THE FILTER REPRESENTATION
can be written Cj,k
= L hk-UCj-I,f + (-1)kh2f-k+Idj-I,f·
(4.15)
fEZ
Thus, the two-scale sequence { hk} completely characterizes each wavelet basis. More of this will be seen in Section 4.4. For compactly supported wavelets, it is easily seen from the definitions of the sequences that only finitely many sequence elements will be non-zero. For wavelets with support on all of IR, each sequence element hk is, in general, non-zero, but elements decay exponentially as Ik I gets large. These decomposition and reconstruction algorithms are extremely useful in the application of wavelets, providing a fast algorithm with which to transform the data. This will be discussed in detail in Chapter 6.
4.2
The Filter Representation
Concepts already discussed in this modest introduction to wavelets can also be expressed in terms used by signal processors. Indeed, issues in this field of application have provided much of the impetus for the eventual emergence of wavelet analysis, so it is only proper to pause for a moment and reconsider these developments in a new light. In Section 4.1, algorithms giving decompositions and reconstructions of wavelet and scaling function coefficients were presented. Additional insight into some of the practical issues in wavelet analysis can be gained by regarding these algorithms as examples of signal processing.filters. Just as a filter is used in a laboratory either to purify a liquid from solid impurities (if the liquid is of interest) or to remove a solid from suspension in a liquid (if the solid is of interest), a filter applied to a pure signal contaminated with noise might attempt either to isolate the pure signal or to extract the noise, depending on which (signal or noise) is of primary interest. It should be noted that what follows in this section is certainly not meant to be a complete treatise on signal filtering, but merely intended to provide a brief introduction to the specialized situation of subband filtering and how it relates to the wavelet concepts discussed previously. More detailed information may be found (among many other sources) in Daubechies (1992) and Meyer (1993). In general, an observed (discrete) signal f is represented by a sequence {fk}kEZ· To conform with earlier treatment, we will assume that the signal f E €2 (.72:). A.filter may also be represented by an €2 (.72:) sequence {akhEz and is denoted by A. Applying a filter to a signal results in another signal. In the previous section, we mentioned that applying the decomposition algorithm involved a down-sampling operation. The special case of subband filtering that we consider in this section operates exactly that way-applying such a filter to a signal results in a signal with length half that of the original
Wavelet Features and Examples
67
signal. (To be precise, since signals in general may have an infinite number of non-zero elements, it should be said that the resultant signal consists of a half-sampling of a filtered signal: Either the odd indices or the even indices are retained before reordering indices. In practice, however, we often deal only with signals in which only a finite number of indices correspond to non-zero elements, in which case the original assertion is adequate.) The filtering process consists of a discrete convolution of the filter sequence with the signal: Applying the filter A to the signal f is written
(Af)k
= I: ae-2k h, fEZ
yielding a new signal indexed by k, which ranges over 7L In what follows, usual operator notation will be followed: AB f denotes the result from first applying a filter B to the signal f and then applying the filter A to the intermediate result; A 2 f represents applying the filter A twice, etc. For a given wavelet basis, represent the filter H by the sequence { hk} kEZ given in (4.2). The concepts of Section 4.1 may be expressed in terms of these filtering operations by regarding the set of scaling function coefficients at a particular level as a signal: {Cj,khEz· (In cases when there are only a finite number of scaling function coefficients, the sequence is understood to be filled with zeroes on either side so that the indices may range over all of 7L) Scaling function coefficients at the next lower level are obtained by applying the filter H to the signal Cj,· = {Cj,k}kEz: Cj-1,·
= H Cj,·'
which corresponds to (4.8). In fact, scaling function coefficients at any level can be obtained by repeatedly applying the filter H:
Now define a new filter G by (4.16) where the hk 's are again defined as in ( 4.2). Wavelet coefficients at level j - 1 can be obtained from scaling function coefficients at level j via this filter:
which corresponds to (4.9). By combining these two filters, wavelet coefficients at any lower level can be computed from scaling function coefficients
68
THE FILTER REPRESENTATION
at level j:
Now that the decomposition algorithms of the previous section have been written in terms of these filtering concepts, it is natural to inquire into the nature of the filters Hand G. These filters are examples of what are called quadrature mirror filters by signal processors. The filter H is known as a low pass filter, while G is an example of a high pass filter. Roughly speaking, low pass filters correspond to averaging operations, and high pass filters correspond to differencing. We now turn again to the Haar wavelet example in the continued development of these ideas. Recall from earlier sections that what has now become known as the low pass filter H for the Haar system consists of only two nonzero elements: h0 = h1 = 1/.Ji. For a signal f = {ikhEz, the elements of the filtered signal f' = H f using the Haar filter are
which is proportional to the average of adjacent elements. Applying the G filter corresponding to the Haar system to the same signal f would give a signal j* with elements
which is proportional to the difference between adjacent elements. This example is fairly straightforward (and simplistic), but, again, the Haar system provides an important paradigm for all wavelet-based filters. In general, applying the H filter results in a signal composed of localized weighted averages performed on the original signal. Applying the G filter results in a signal whose elements are contrasts of localized elements of the original signal. This idea is reinforced by summing the coefficients of the filter representation. For a low pass (averaging, or smoothing) filter H, (4.17)
for a high pass (contrasting, or detail) filter G, (4.18)
Wavelet Features and Examples
69
This is seen trivially for the Haar example, but it can be seen in general for the filters that correspond to any wavelet decomposition sequence as follows. Since J ¢(x) dx = 1 (and as a result, J c/Jj,k(x) dx = 1 for any j, k E 7L), integrating both sides of the two-scale relationship ( 4.1) yields ( 4.17). The identity ( 4.18) can be shown by substituting ( 4.16) into ( 4.4) and integrating. As will be seen in Section 6.2, the wavelet decomposition of a data set is accomplished by applying these filtering operations to the vector of data values. This provides the basis for the fast wavelet decomposition algorithm.
4.3
Time-Frequency Localization
In the previous section, we discussed the treatment of sequences that were regarded as being discrete signals (observed only at discrete time points). Here, we return to L 2 (IR) function space and the treatment of functions. This development can be tied to that of the previous chapter by regarding the L 2 (JR) function f as a signal observed continuously over time. From the properties of wavelets that we have studied so far, there is little to recommend using a wavelet basis of L 2 (IR) over any other complete orthogonal system. The simplicity of the Haar basis certainly recommends it, but are there any advantages to using other, more complicated, wavelets besides that they just "look nicer"? These are the issues that this section will address.
The Continuous Fourier Transform First, we begin with a brief background on the Fourier transform required for a full appreciation of the developments in this section. Chapter 1 discussed the representation of periodic functions in terms of cosine and sine functions with varying frequencies. It was also pointed out that any function defined on an interval can be extended periodically to have support on lR and be analyzed the same way. Further, we noted that each Fourier coefficient gives some information about the frequency content of the function at the corresponding frequency. In many situations, it is desirable to extend the analysis to nonperiodic functions defined on JR. Recall that an appropriate periodic function f can be represented by 1
f(x) =
2a + :~:::)ai cos(jx) + b sin(jx)). 00
1
0
(4.19)
j=l
This representation can be rewritten by applying the well-known formula of Euler:
TIME-FREQUENCY LOCALIZATION
70
= cosw + i sinw,
eiw
(4.20)
where the imaginary unit i represents the square root of - 1. As we move to complex-valued functions, the definition of the inner product is extended to
(j, g)
=
J
f(x)g(x) dx,
where g(x) represents the complex conjugate of the complex-valued function. Note that this extended definition does not change anything for realvalued functions. Manipulating ( 4.20) allows us to write the trigonometric functions as
cosw sinw Substituting these relationships into (4.19) gives 00
f(x)
1 "" (
1
2ao + 2 ~
( i].X
aj e
+ e -iJ.X)
-
z"b j (e i]"X -
e
-i]"X))
j=l 00
L.....t (( aJ 21 ao + 21 ""
z"b j ) e i 1·x
+ ( aJ + z"bJ) e- iJ"x)
.
j=l
A new set of Fourier coefficients can now be defined: 1
Po
2ao
Pi
~(aj- ibJ), j = 1, 2, .. .
P-J
~(aj + ibJ), j = 1, 2, ... .
In terms of these coefficients, the Fourier representation can be written 00
f(x) =
L
Pi
eijx.
(4.21)
j=-oo
Combining these definitions with (1.4) and (1.5) gives a universal formula:
/_71" f(x )e-z..1 x dx, 27r -71"
Pi = -
1
(4.22)
Wavelet Features and Examples
71
which holds for all j E 7L. Fourier methods are also applicable to nonperiodic functions defined on all of JR. Extending ( 4.22) to a continuous set of frequencies and changing the limits of integration to all of JR gives the standard continuous Fourier transform of a function f E L 2 ( JR), expressed here as a formal definition.
Definition 4.1 The continuous Fourier transform of an L 2 (JR) function
f
is a function offrequency w, defined as
}(w)
=
L:
f(x)e-""" dx,
forw E JR. It should be noted here that there is an unfortunate commonality in standard notation for the Fourier transform of a function and for the estimate of a function: Both are typically written}. Not to depart from either of these deeply ingrained conventions, and to avoid cumbersome alternative notation, this book will use j in both situations. It is unlikely to result in any serious confusion, however, as the treatments of these two concepts have no overlap, and it will generally be clear from the context which meaning is intended. For the remainder of this chapter, j will denote the Fourier transform. The function j (which is also in L 2(JR)) gives information on the frequency content of the functions at all possible frequencies, not only on frequencies proportional to 1 / j as in the Fourier sum. The original function can be represented in terms of its Fourier transform by means of the inverse
Fourier transform:
f(x) = 27r
L:
}(w) eiwx dw,
(4.23)
for x E JR, which is the continuous analogue of the Fourier sum representation of a periodic function. Thus an L 2 ( JR) function is completely characterized by its Fourier transform j (w), w E JR. This section ends with Parseval's identity, which describes an important relationship between functions and their Fourier transforms: For j, g E
£2(JR), (j, g)
=
1
A
271" (j, g).
Applied to a single function, Parseval's formula shows that the £ 2 norm of a function is proportional to the £ 2 norm of its Fourier transform. This identity has an analogous representation in the discrete Fourier transform case as well,
72
TIME-FREQUENCY LOCALIZATION
relating the norm of a periodic function to its Fourier coefficients:
(4.24)
The Windowed Fourier Transform The continuous Fourier transform is not entirely satisfactory for many applications. Thinking for the moment of a function f (t) representing a signal in time, the continuous Fourier transform gives information on the average frequency content over the entire signal for the frequency w. To extract frequency information at even a single w requires the computation of an integral over an infinite interval of time, making it impractical for many applications. Another shortcoming of standard Fourier methods is that they are quite inadequate for dealing with signals whose frequency content changes over time. These observations on the limitation of Fourier methods are not by any means new. Following the development of Chui (1992), the attempt of Gabor (1946) to correct these deficiencies is examined. Gabor introduced a timelocalizing window function to broaden the application of Fourier methods. In an attempt to extract local frequency information from the signal, Gabor proposed a windowed Fourier transform:
T 9 j(w, to)
= /_: f(t)g(t- t 0 )e-iwt dt.
The windowing function Gabor used was the Gaussian function
which is proportional to the standard normal probability density function. Using the Gaussian window function, the Gabor transform is defined to be
WI: f)(w)
= /_: J(t)ga(t- b)e-iwt dt,
(4.25)
which localizes the Fourier transform of the signal f about the point t =b. As with the continuous Fourier transform, there exists an inverse Gabor transform that will recover the original signal from its transform (see Corollary 3.8 in Chui (1992)).
Definition 4.2 A function w is a window function if both w(t) and tw(t) are in L 2 (1R).
Wavelet Features and Examples
73
It is natural to speak of the "center" and the "width" of a window function. The window center is defined as
which acts like the "mean" of the window. As would be expected, the Gaussian function has center equal to zero. The radius of a window function is based on the "standard deviation" of the function:
The width of the window is defined to be twice the radius. For Gabor's Gaussian window function ga, the window width turns out to be 2fo.. The time-localizing transform of Gabor may be interpreted in a new way, by defining a new function
and rewriting (4.25) as
(Qf: f)(w) = (j, Gf),w) =
l:
f(t)Gf),w(t) dt.
(4.26)
The reason to write the Gabor transform in this way is to give insight into the implicit relationship between the Gabor transform off and the Gabor transform of j. It can be shown that the Fourier transform of Gf),w is
which follows from the fact that the Fourier transform of the Gaussian function is itself proportional to a Gaussian function. This function can itself be regarded as a window function in the frequency domain. Applying this result gives
(Qf:f)(w)
74
TIME-FREQUENCY LOCALIZATION
;;;, L: e''' -ibw
_e_
2foQ
(gtf4a w
}(€)9tf4a(€- w) d€
J) (-b).
(4.27)
Thus, the equivalence of ( 4.26) and ( 4.27) indicates that the windowed Fourier transform centered at b is proportional to the window inverse Fourier transform centered at w, so the Gabor transform is simultaneously localizing both in time and in frequency! The window width in the time domain is, as mentioned before, 2fo, and the window width in the frequency domain is 1/ fo. This simultaneous localizing in both time and frequency can be illustrated by considering a two-dimensional window in the time-frequency plane. Figure 4.4 gives two such time-frequency localization windows. The relative widths and heights of the windows can be adjusted by allowing a to vary, but the "area" of the two-dimensional window is constant at 2. It turns out that this area is optimal according to the Heisenberg Uncertainty Principle, well known in quantum mechanics; see Messiah (1961). This principle states that for any suitably chosen window function w (whose Fourier transform is also a suitable window), the product of the window widths in the time-frequency plane satisfies
Equality is attained if and only if the window w is the Gaussian function. For comparison purposes, if an attempt were made to sketch time-frequency localization boxes for the standard continuous Fourier transform, the boxes would be infinite in width (time domain) and infinitesimal in height (frequency domain). So the Gabor transform succeeds admirably in accomplishing the goal of localizing the frequency information of the signal in time, and also of localizing the time information in the frequency domain. One disadvantage of the Gabor transform is that once the parameter a is selected, the shape of the time-frequency windows does not change as w or t changes. If we are analyzing the low-frequency content of a signal, we might desire a wide window in time. Conversely, if we are interested in highfrequency phenomena, a narrower time window would be preferred. The rigidity of the Gabor transform does not allow this desired flexibility, but, as we will see, wavelets give a framework for which this is automatic.
The Continuous Wavelet Transform Just as we moved from the discrete Fourier transform to the continuous Fourier transform earlier in this section, we now describe an analogous adaptation of the discrete wavelet basis.
Wavelet Features and Examples
75
~~ -----------------@ ~: -------------------!---------------~ t
Figure 4.3: Time-frequency localization windows for the continuous wavelet transform
Up to now, we have defined wavelets with two integer subscripts as
for a suitably defined mother wavelet 'lj;. The content of an L 2 (JR) function corresponding to the wavelet with dilation index j and translation index k is summarized in the form of the corresponding wavelet coefficient:
Rather than restricting dilation and translation of the mother wavelet by the set of integers, we can allow them to vary continuously. For a > 0, b E JR, define
'1/Jca,b) (x) =a -1/2 '1/J
(X-a-b) .
The continuous wavelet transform is thus defined to be, for 00
(WV;f)(a, b)
= (j, '1/Jca,b))
=a- 1/21-oo f(t)'lj;
(t- b) -a-
f
E L 2 (JR),
dt.
(4.28)
76
TIME-FREQUENCY LOCALIZATION
Note that the discrete wavelet coefficient
is related by
dj,k
Suppose that both '1/J and '1/J are suitable window functions and that the window function 'lj; has center t* and radius ~'1/J· Then '1/J(a,b) is also a window function, with center b + at* and radius a~'I/J. The continuous wavelet transform (4.28) gives information about the signal using the time window
which is relatively narrow for small a and wide for large a. To see the localization occurring in the frequency domain, first we derive an expression for the Fourier transform of '1/Jca,b): 00
,j;(a,b) (w)
-oo
/_
a-1/2'1/J
VO,e-iwb
(t- b)
l:
-a-
e-iwt dt
'1j;(t)e-iawt dt
VO,e-iwb,J;(aw). Applying Parseval's identity to ( 4.28) and using the above result gives 1
A
Val 2
A
71" (j, '1/J(a,b))
271"
j(w)etwb'lj;(aw) dw. A
•
- A
(4.29)
If{/; is a window function with center w* and radius ~;p, it can be shown that {/;(a,b) is also a window function, with center w* fa and radius ~;pfa. Thus, aside from a constant and the phase-shift, it is seen from ( 4.29) that the continuous wavelet transform also gives local information about the signal f using the frequency window
This frequency window automatically widens for small a and becomes narrow for small a. Similar to Figure 4.4, we can construct a picture giving some idea of the simultaneous time-frequency localization that takes place when applying the
Wavelet Features and Examples
77
w UJ]
~~~~~~~~~~~~~~~
Ulz
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~~~~~~~ ~ ~ j ~~~~ ~~ ~ ~~ ~ ~ ~ ~ ~~ I
I
t
Figure 4.4: Time-frequency localization windows for the Gabor transform
continuous wavelet transform. Figure 4.3 gives a rough idea of the timefrequency localization corresponding to the continuous wavelet transform. The areas of the time-frequency windows are constant: 2atl1/J · 2/l-J) a = 4/l'I/Jtl,j;, but they become long and short for low frequencies and tall and thin for high frequencies, as would be desired. Just as it is possible to reconstruct a signal f (t) given its Fourier transform j (w) by means of the inverse Fourier transform, it is also possible to reconstruct a signal given only its continuous wavelet transform (W'l/1 f) (a, b). For an appropriate wavelet '1/J and for f E L 2 (JR), 00
f(t)
= -02
/_
1/1
-oo
1
00
o
1 dadb (W'I/Jf) (a,b)'I/Jca,b)(t)-z a
(4.30)
for every t E lR at which f is continuous. The constant C'I/J is given by
c'I/J
= 2 foo l~(w)lz dw = 2 foo I~( -w)lz dw. }0
w
}0
w
(4.31)
(The additional condition on '1/J for (4.30) to hold is that the two right-hand quantities in (4.31) are in fact equal and finite.) See Chui (1992) for the proof of these results.
78
TIME-FREQUENCY LOCALIZATION
Haar scaling function
Daubechies scaling function N=S
'V
'V
n
o
0
M
0
C\1
C\1
0
0
~
0 0
0
0
0 -30
-10
0
10
20
30
-"'--/.1\J -30
-10
0
10
20
30
Figure 4.5: The modulus of the Fourier transform of two scaling functions
Thus we see that the continuous wavelet transform not only corrects the noted deficiencies of the continuous Fourier transform, but also offers a vast improvement over previous attempts to localize the Fourier transform of a signal. At the beginning of this section, the question was posed as to why we should use other, more complicated, wavelets rather than the Haar wavelet. It turns out that the time-frequency localization results of the continuous wavelet transform do not apply to the Haar wavelet. The reason for this is that the Fourier transform of the Haar scaling function is not localized enough to qualify as a window function. This poor frequency localization requires us to consider other functions with better properties in the frequency domain. The poor frequency localization for the Haar basis is illustrated in Figure 4.5, which plots the modulus of the continuous Fourier transform of both the Haar scaling function and the Daubechies scaling function with N = 5. The Fourier transform for the Daubechies wavelet clearly decays much faster and thus is more appropriate for time-frequency localization. Parseval's identity ( 4.24), relating the norm of a function to its Fourier coefficients, has an important analogue in the wavelet domain as well. In particular, (4.32)
This relationship underlies much of the way wavelets are applied in statistics. Functions (or data) are transformed into the wavelet domain, manipulated,
Wavelet Features and Examples
79
and then inverse-transformed into the original domain. The expression ( 4.32) dictates that estimation can be done in either domain and that the squared error of estimation is maintained.
4. 4
Examples of Wavelets and Their Constructions
Aside from a brief discussion of some other wavelet families in Section 1.3, we have only considered orthogonal wavelets, the primary focus being upon the Haar wavelet system. Of the many wavelet families in existence, for statistical applications, we would like to choose a wavelet basis with orthogonality, compact support, and symmetry. But, as Daubechies (1988) points out, the only real-valued wavelet that has all three of these properties is the Haar wavelet, with its poor frequency localization. The purpose of this section is to expand on the discussion given in Section 1.3. Here, we review various families of wavelets, trace their developments, and briefly touch on their constructions. We begin with a discussion of orthogonal wavelets, focusing upon the Battle-Lemarie family and the Daubechies family. Symmetry might be more important than orthogonality for some applications, so we then consider a more general class of wavelets, the so-called biorthogonal wavelets, which have been developed to allow greater application of wavelet methods. The section closes with a treatment of a particular special case of biorthogonal wavelets, the Chui-Wang
semiorthogonal wavelets. Before beginning this discussion, we pause momentarily to give a brief introduction to B-splines. This concept will be used in the construction of some of the wavelet bases described in this chapter, so it seems appropriate to insert this treatment here. First, a function f is said to be a spline function of order m with knot sequence {x 1 , ... , Xk} if
1. f is a piecewise polynomial of order m on each interval [xi, xi+ 1), 2. f has m- 2 continuous derivatives, and 3. the (m - l)st derivative off is a piecewise constant function with jumps at the Xi's.
Spline functions are used extensively in nonparametric regression, density estimation, and other statistical applications (see Eubank (1988)). We will focus here only on a special case of spline functions: the cardinal B-splines. These will prove useful in constructing many of the wavelet families described in this section.
80
EXAMPLES OF WAVELETS AND THEIR CONSTRUCTIONS
m=1
m=2
c:o ci
""" ci
~ 0.0
0.4
0.0
0.8
0.5
1.0
1.5
2.0
m=4
co ci
co ci
""" ci
""" ci
C\1
C\1
ci
ci 0
ci
0
~~----~~--~~ ci
0.0
1.0
2.0
3.0
~--~--~--~--~
2
0
3
4
Figure 4.6: The first four cardinal B-spline functions
The first cardinal B-spline is identical to the Haar scaling function, namely, 0 ~X< 1, otherwise.
Cardinal B-splines of higher order are defined recursively in terms of lowerorder B-splines by means of the convolution operator:
Nm(x) = (Nm-l
* Nl)(x) =
1'
Nm-l(x- t) dt,
for m ~ 2. It can be seen easily that the function N m is a piecewise polynomial of order m with knots at the integers. The first four cardinal B-splines are plotted in Figure 4.6. It is interesting to note also that the function N m ( x) is also the probability density function for the sum of m independent uniform (0, 1) random variables. Therefore, by the Central Limit Theorem, as m gets large, the spline function N m (X) approaches the probability density function of a normal random variable. The interested reader is referred to texts by de Boor (1978) and Schumaker (1981) for more about spline functions.
Wavelet Features and Examples
81
Orthogonal Wavelets Here, we outline the development of two examples of orthogonal wavelets: the Battle-Lemarie family and the Daubechies family. It turns out that the Haar wavelet is the first member of a family of wavelets studied in Battle (1987) and Lemarie (1988). The Battle-Lemarie wavelets are formed from the B-spline functions just introduced. The first member of the Battle-Lemarie family (as seen in Section 1.3) has scaling function equal to the piecewise constant spline: ¢( x) = N 1 ( x). It was seen before that the wavelet system based on this scaling function (the Haar system) is orthonormal. The next member of this family begins with the piecewise linear spline N 2 ( x). This function has the nice two-scale relationship
N2(x)
1
1
= -N2(2x + 1) + N2(2x) + -N 2(2x- 1). 2 2
Unfortunately, the function N 2 ( x) is not orthogonal to its integer translates, so it is unsuitable for direct use in constructing an orthonormal wavelet system. First, this function must be "orthogonalized" by means of a trick described in Chapter 5 of Daubechies (1992) to give an appropriate scaling function. There is no closed form expression for the resulting scaling function, but it is given only in terms of its Fourier transform:
(4.33)
The ¢ function with Fourier transform (4.33) is also piecewise linear with knots at the integers, but it is not compactly supported, although it does decay exponentially. It generates an appropriate multiresolution analysis, so, once the filter coefficients { hk} are determined, the corresponding wavelet function can be defined according to (4.4). Mallat (1989a) gives general formulas for constructing wavelets and scaling functions for the Battle-Lemarie family, indexed by a smoothness parameter p, as well as for computing the associated filter { hk}. The resulting scaling function ¢ will be a polynomial spline of order 2p + 1 with knots at the integers. Figure 1.10 is a graph of two scaling function/wavelet pairs from this family. Each member pair of this family constitutes an orthogonal wavelet system. Except for the Haar system, all Battle-Lemarie wavelets are symmetric and have infinite support and exponential decay. Another important family of wavelets is the Daubechies family of orthonormal and compactly supported wavelets, introduced in Daubechies (1988). Without going into detail in the derivation of these formulas, the Daubechies
82
EXAMPLES OF WAVELETS AND THEIR CONSTRUCTIONS
family will be introduced here. For a full outline of the development of such wavelets, the interested reader is referred to Chapter 6 of Daubechies (1992). As we've seen in this section, the development of the Battle-Lemarie family began with defining an appropriate scaling function ¢ that generates a multiresolution analysis, and then constructed corresponding wavelets by a method such as (4.4). The derivation of the Daubechies family of wavelets represents quite a different approach. Here, we begin with the filters, and then we derive from the filter coefficients both the scaling function and the wavelet. Recall again the definition of the filter elements from (4.2):
hk
=
J
¢(x)V2¢(2x- k) dx.
Clearly, for compactly supported ¢(x), only finitely many of the hk 'swill be non-zero. Daubechies' approach to constructing orthogonal compactly supported wavelets begins with defining the 21r-periodic trigonometric polynomial
associated with the filter { hk}. A new family of wavelets is obtained by constraining this function to give orthonormality and smoothness. The scaling functions and wavelets are written in terms of this m 0 ( w) function, and, again, they can only be expressed in terms of their Fourier transforms
and
~(w)
= -e-iwfzmo (~ + 7r) J> (~) .
There are several possibilities for the function m 0 ( w), depending on the number of non-zero elements in the filter { hk}. Naturally, the more non-zero filter elements, the smoother the resulting scaling functions and wavelets (see the examples in Figure 1.12). In its brief presentation of the Daubechies family, Section 1.3 mentioned a filter index N that controls the smoothness of the resulting scaling functions and wavelets. In fact, choosing a filter with 2N non-zero coefficients (the Haar wavelet corresponds toN = 1) will give a corresponding scaling function 'lj; with support [0, 2N -1] (the corresponding mother wavelet¢ has
Wavelet Features and Examples
83
support [- N + 1, N]). Also, the choice of filter length 2N dictates that the resulting wavelet will have N vanishing moments:
J
x' 'lj; (x) dx = 0, f = 0, 1, ... , N - 1.
This implies that polynomials of degree up to N - 1 can be written exactly in terms of the appropriately translated scaling functions. Note that these wavelets are neither symmetric nor antisymmetric (except in the Haar case). They are fairly easy to implement, and with the additional advantages of compact support and orthonormality, they are preferred in many applications. So far in this section, we have discussed only two examples of the many orthogonal wavelet systems in existence. Some other families are briefly mentioned now. As mentioned in Chapter 1, the first among the modern family of wavelets is due to Stromberg (1982), whose orthogonal wavelets have infinite support, exponential decay, and an arbitrarily large number of continuous derivatives. The Meyer wavelets, introduced in Meyer (1985), are also orthogonal with infinite support and exponential decay. The Fourier transforms of the Meyer wavelets are compactly supported, which implies that the wavelets and scaling functions themselves are infinitely differentiable. We have noted before that the Daubechies wavelets are quite asymmetric. In view of this, Daubechies derived another family of compactly supported orthogonal wavelets, often called symmlets, which are "least asymmetric." This construction is also described in Daubechies (1992). Example scaling functions and wavelets from this family are plotted in Figure 4. 7. A third family of wavelets constructed by Daubechies is the family of coijlets, named by her in honor of wavelet researcher Ronald Coifman, who suggested a wavelet-based multiresolution analysis in which the scaling function also has vanishing moments. The resulting system, particularly useful in numerical analysis applications, is described in Daubechies (1993).
Biorthogonal Wavelets As mentioned before, it is not possible to combine compact support, orthogonality, and symmetry in a single wavelet construction (except in the Haar case). Here, we describe a more general class of wavelets, which features compact support and symmetry, but at the expense of orthogonality. The first example of a biorthogonal wavelet basis was constructed by Tchamitchian (1987). A complete development of biorthogonal wavelets with compact support is given in Cohen, Daubechies, and Feauveau (1992). It may seem strange to call a wavelet "biorthogonal" when it is not orthogonal. This term refers to two separate multiresolution analyses in L 2 ( IR) that
84
EXAMPLES OF WAVELETS AND THEIR CONSTRUCTIONS
Scaling function, N=4
C!
C!
co ci 0
ci
N
ci
C!
C\1
9
2
0
4
-2
6
0
Scaling function, N=5
2
4
Wavelet, N=5
C!
C!
co ci 0
ci
N
ci N
9
C! 2
0
6
4
-2
-4
8
2
0
Scaling function, N=7
4
Wavelet, N=7
C!
C!
co ci 0
ci
N
ci N
C!
9 0
2
4
6
8
10
12
-6
-4
-2
0
2
4
6
Figure 4. 7: Three examples of scaling function/wavelet sets from Daubechies' "least asymmetric" family
correspond to a biorthogonal wavelet, one "dual" to the other:
... c
V_z
c v_l c
Vo
c
... c
if_z
c if_t c
Vo
c if1 c ....
VI
c ...
and
The first multiresolution analysis is generated by the ~caling function ¢, and the second is generated by a "dual" scaling function¢. As in the orthogonal case, each of these sequences of approximation spaces has a sequence of successive detail spaces: (Wi )jEZ and (W)jEZ,
Wavelet Features and Examples
85
respectively. Furthermore, these detail spaces are generated by two mother wavelet functions 'ljJ and {;, which are also dual to one another. The biorthogonality is expressed through relationships between the dual multiresolution analyses:
and
In terms of scaling functions and wavelets, this biorthogonality is expressed as
tPj,k
tPJ,k
j_
'l/Jj' ,k''
¢J,k' for k f; k',
j_
and
'¢J,k
j_
{;j' ,k' for j f; j' or k f; k',
for j, j', k, k' E 7L.. As in the orthogonal case, the projection of an L 2 (JR) function approximation space Vj is written
f
onto an
""""c. k,J,.. k
P 1J = 6
J, 'fJJ,'
k
but the coefficients are computed in terms of the dual scaling function:
L:
c;,k = (!, ~;,k) = The decomposition of the function
f(x) = L
f
f(x)~;,k (x) dx.
into its wavelet components is
LdJ,k'¢J,k(x),
j
k
where the coefficients are computed also according to the corresponding dual wavelets:
d;,k = (!, ;fi;,k) =
L:
f(x);fi(x) dx.
86
EXAMPLES OF WAVELETS AND THEIR CONSTRUCTIONS
In terms of the du~MRA, an L 2 (JR) function approximation space Vj as follows:
f
can be projected onto the
1 ·k PiJ = """'c· 6 J, k 'PJ,' k
where the coefficients are computed using the scaling function ¢: Cj,k
= (f, ¢J,k) =
L:
f(x)¢j,k (x) dx.
The representation of the function in terms of the dual wavelets is similar to that above, the coefficients being defined in terms of inner products off with the corresponding 'lj;j,k 's. It can thus be seen that an orthogonal wavelet is a special case of biorthogonal wavelets, in which ¢ = ¢ and {; = 'lj;, and thus the dual multiresolution analyses coincide. In the orthogonal wavelet case, the set of wavelet functions {'lj;j,k, k E Zl} formed an orthonormal basis for WJ, and { c/>j,k, k E Zl} formed an orthonormal basis for Vj . By generalizing to the biorthogonal case, these sets no longer form an orthogonal basis for their spaces, but instead they form a more general Riesz basis. For completeness' sake, a definition is included here.
Definition 4.3 The set {¢( · - k), k E Zl} is said to form a Riesz basis for a function space Vo if the ¢( · - k) 's span V0 and there exist constants 0 < A~ B < oo such that for all sequences {PkhEz E f 2 (Zl),
Riesz bases are examples of general frames discussed in the wavelet literature. It is clear, for example, in the (orthogonal) Haar case that ¢(x) = N 1 (x) produces a Riesz basis for the space of piecewise constant functions with A = B = 1. Scaling functions that correspond to other orthogonal wavelet systems also form Riesz bases. By generalizing from orthogonal wavelets to biorthogonal wavelets, we sacrifice two useful properties in the generated multiresolution analysis First, the set { ¢( ·- k), k E Zl} no longer forms an orthonormal basis for V0 , but only a Riesz basis (similarly for the wavelets and W0 ). Second, the detail spaces are no longer orthogonal to the approximation spaces: We can still write
Wavelet Features and Examples
87
but the sum is no longer a direct sum, as it was in (1.14). In some situations, one might be reluctant to give up the second of these properties, so semiorthogonal wavelets would be appropriate.
Semiorthogonal Wavelets It is possible to construct scaling functions and wavelets from splines without using the orthogonalization trick previously discussed, thereby retaining compact support of the scaling functions and wavelets. This approach gives the semiorthogonal Chui-Wang B-wavelets described by Chui and Wang (1991) and Auscher (1989), who construct a family of orthogonal wavelet bases that retain compact support, symmetry, and orthogonality between detail levels. These Chui-Wang wavelets or pre-wavelets are based upon the cardinal B-splines, as were the Battle-Lemarie wavelets. For index m, define the scaling function to be
¢(x)
= Nm(x).
The dilated and translated set of scaling functions { >j,k = 2i1 2 ¢(2i x-k), k E 7L} forms a Riesz basis for Vj, the function space consisting of all splines of order m- 1 with knots at k2-i, k E 7L. This is the same space as that associated with the Battle-Lemarie family of wavelets, but the Chui-Wang basis is not orthogonal. The general two-scale relationship for these scaling functions is given by the well-known B-spline identity
Nm(x)
=
f
rm+l (
7) Nm(2x- k),
k=l
which gives us the elements of the (finite-length) filter { hk}. The mother wavelet 'ljJ can then be written in terms of the scaling function via ( 4.4). An advantage of this construction is that the wavelets and scaling functions are defined explicitly in terms of B-splines, rather than only indirectly through their Fourier transforms, as are many of the other wavelet bases presented here. Associated with ¢ and 'ljJ are dual functions ¢ and 'ljJ. The functions { 'l/Jj,k, k E Z} form a Riesz basis for the detail space Wj, and just as in the orthogonal wavelet case,
thus the Wj spaces are mutually orthogonal. In this formulation, both { >j,k, k E Z} and { J;j,k, k E 7L} are Riesz bases for the space Vi; similarly,
88
EXAMPLES OF WAVELETS AND THEIR CONSTRUCTIONS
{ 'I/Ji,k k E Z} and {~j,k, k E Z} are Riesz bases for Wj (i.e., Wj = Wi 2 and Vj = Vj). A consequence of this construction is that although ¢ and 'lj; are compactly supported, their dual functions are not. The decomposition into either basis is done just as for any biorthogonal wavelet basis, the coefficients for the wavelets computed in terms of their duals, and vice versa. For index m, the scaling function has support [0, m] and the mother wavelet has support [0, 2m- 1]. When m = 1, the result is the Haar system. For m > 1, the wavelets are constructed of splines of degree m - 1. Wavelets are symmetric for m even and antisymmetric for m odd.
CHAPTER
FIVE
Wavelet-based Diagnostics In a real data analysis, an essential component is a thorough graphical study of the data. It is not uncommon for graphical data analysis to turn up some interesting (even vital!) aspect of the data set that might be completely overlooked by applying some canned "black box" statistical inference procedure. Here, we describe some of the plots that are commonly used in waveletbased statistical analysis, giving some examples as to what to look for in various situations. It should be noted here that some of the details of material presented in this chapter are not covered thoroughly until later in this book. Enough explanation is provided in each case, however, to allow readers to gain a basic appreciation for each type of diagnostic plot, the details being left for full treatment in subsequent chapters.
5.1 Multiresolution Plots
Perhaps the oldest wavelet diagnostic plot is the multiresolution approximation plot of the original function, as was done in Mallat (1989a). For an input signal, this plot consists of approximations of the signal in various V_j spaces. This is illustrated in Figure 5.1 using an example function consisting of a linear trend and portions of sinusoids of differing periods. The multiresolution plot consists of the original data on top, with successive approximations given underneath. For the example function used in Figure 5.1, n = 1024, so the approximations correspond to projecting onto the spaces V_9, ..., V_4, respectively. This figure was generated using the mra command in the S-Plus Wavelet toolkit, using the Daubechies N = 5 compactly supported wavelet basis and periodic boundary handling. It is interesting to compare the various levels of approximation that appear in Figure 5.1: The high frequency burst near the beginning of the data is completely smoothed over for all but the highest-level approximations. Even the lower-frequency sinusoid component near the end of the signal is almost completely damped by the lowest-level approximations. This multiresolution diagnostic plot is useful for noisy data as well. Figure 5.2 plots the corresponding multiresolution analysis for the same function, but contaminated with Gaussian noise. This plot is simply the projection of the noisy data onto the various V_j spaces; there is no smoothing or noise removal done.
Figure 5.1: Multiresolution approximation of test function
For this example, the signal-to-noise ratio was set at 4. Comparing Figure 5.2 with Figure 5.1, it is seen that the lower-level projections coincide very closely. This is due to the high-frequency error term being averaged out. Higher-level reconstructions retain much of the random noise component. Rather than displaying a sequence of successive approximations, another way to examine the multiresolution composition of a function or signal is to display the sequence of detail signals. Recall from Chapter 1 that an L² function f can be written as a low-level (coarse) approximation plus the sum of all successive detail functions:
f = f_{j_0} + \sum_{j \ge j_0} g_j,
where f_j represents the projection of f onto an approximating space V_j and g_j is the projection onto a detail space W_j. Figure 5.3 shows multiresolution information about the same example function, but displays it in terms of the detail functions instead. As in Figure 5.1, the last function displayed is the approximation at the smoothest level considered, and the top function is the input function. Each of the functions in between is simply the difference between the corresponding approximations plotted in Figure 5.1. The input function can be reconstructed by adding together the smoothest approximation and all the detail functions.
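A display of this kind can be produced with any standard wavelet software. The sketch below is a minimal illustration, not the S-Plus code used for the figures; it assumes the Python package PyWavelets, and the test signal is an invented stand-in for the book's example function.

```python
import numpy as np
import pywt

def multiresolution(signal, wavelet="db5", levels=5):
    """Successive approximations (projections onto coarser V_j) and detail signals g_j."""
    coeffs = pywt.wavedec(signal, wavelet, mode="periodization", level=levels)
    approximations, details = [], []
    for j in range(1, levels + 1):
        # Zero out the j finest detail levels to obtain the next coarser approximation.
        approx_coeffs = [c if i < len(coeffs) - j else np.zeros_like(c)
                         for i, c in enumerate(coeffs)]
        approximations.append(pywt.waverec(approx_coeffs, wavelet, mode="periodization"))
        # Keep only one detail level to obtain the corresponding detail signal.
        detail_coeffs = [c if i == len(coeffs) - j else np.zeros_like(c)
                         for i, c in enumerate(coeffs)]
        details.append(pywt.waverec(detail_coeffs, wavelet, mode="periodization"))
    return approximations, details

# Illustrative input (n = 1024): a linear trend plus sinusoidal bursts,
# loosely mimicking the test function of Figure 5.1 (not the book's exact function).
t = np.linspace(0, 1, 1024)
f = t + np.where(t < 0.2, np.sin(200 * np.pi * t), 0.0) + np.where(t > 0.7, np.sin(20 * np.pi * t), 0.0)
approxs, dets = multiresolution(f)
```

Plotting the elements of `approxs` one under another reproduces a multiresolution approximation plot; plotting the elements of `dets` gives the decomposition-style display.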
Figure 5.2: Multiresolution approximation of noisy test function
Figure 5.3: Multiresolution decomposition of test function
Figure 5.4: Multiresolution decomposition of noisy test function
It is interesting to see in Figure 5.3 at exactly which level each component of the test function is added into the approximation. The very high frequency burst near the beginning of the function is concentrated primarily into the two highest levels of detail. This display of the multiresolution decomposition can of course be done for functions observed with noise as well. Figure 5.4 displays this decomposition for the same test signal with noise added, as in Figure 5.2.
5.2 Time-Scale Plots
In Section 4.3, we compared the wavelet transform of a function with the Gabor transform, concluding that the wavelet transform did a better job of representing functions with non-constant (over time) frequency content. This is due to the ability of wavelets to localize in time the characterization of the frequency behavior of a function. Here, we discuss a diagnostic plot that can help identify the way the frequency content is changing over time. The spectrogram is a plot related to the Gabor transform
(\mathcal{G}_b f)(\omega) = \int_{-\infty}^{\infty} f(t)\, g_a(t - b)\, e^{-i\omega t}\, dt,
which gives information on the frequency content of the function f(t) near the frequency ω and near the point t = b (see Section 4.3). The spectrogram for a continuous function f is defined to be the square modulus of the Gabor transform, |(G_b f)(ω)|². This is simply a function in two variables, ω and b, so it can be plotted in three dimensions. More typically, the spectrogram is plotted over the b-ω plane by varying the intensity of each point according to the grey scale: for (G_b f)(ω) near zero, the point should be close to white; for relatively large values of the Gabor transform, the point should be almost black. The wavelet analogue of the spectrogram is the wavelet scalogram (with wavelets, we speak of scale rather than frequency). The scalogram consists simply of the square of the continuous wavelet transform (4.28): |(W_ψ f)(a, b)|², where b represents the location in time and a represents the scaling factor. This can be represented either as a three-dimensional plot or as a two-dimensional grey-scale image. The spectrogram and the scalogram based on the respective continuous transforms are useful objects in analyzing a function defined continuously, but, in most applications, the function is only observed at a few discrete points. In this case, the discrete analogues of the Gabor transform and the wavelet transform are computed, and the plots are adjusted accordingly. Recall that both the Gabor transform and the wavelet transform divide the time-frequency plane into blocks measuring local (in time) frequency content of the signal. For the wavelet transform (see Figure 4.3), these blocks are short and wide for analyzing low-frequency (large-scale) content and tall and narrow for analyzing high-frequency (small-scale) phenomena. For the Gabor transform (see Figure 4.4), these windows all have constant shape. In plotting either the spectrogram or the scalogram in the discrete case, each of these blocks in the time-frequency (or time-scale) plane is shaded in the grey scale according to the magnitude of the corresponding coefficient. The wavelet scalogram for the test function is given in Figure 5.5. In the plot, the Y-axis is actually the reciprocal of the scale factor, so that small-scale content (corresponding to high frequency) is represented near the top of the plot, and large-scale objects (low frequency) are toward the bottom. For the test function, it is seen that the function consists primarily of coarse-scale content. The sine curve, added with increasing amplitude near the latter part of the function, is manifested by gradually darkening blocks (from left to right) in the less coarse area. The burst of high frequency near 0.2 shows up in the time-scale plot as well. This can be accomplished for the noisy version of the test function as well. Figure 5.6 gives the wavelet scalogram for the same function, again contaminated with noise. Comparing the scalograms for both the noisy and the true function, it is seen that, at least in this example, the added high-frequency noise does not change the basic appearance of the scalogram much. An interesting example function to illustrate how the scalogram works is
Figure 5.5: Scalogram of test function
Figure 5.6: Scalogram of noisy test function
Figure 5.7: The Doppler function
the Doppler function, plotted in Figure 5.7. The Doppler function, one of the test functions used in Donoho and Johnstone (1994), increases in both amplitude and frequency as time t increases. The scalogram for the Doppler function is plotted in Figure 5.8. This plot indicates how the frequency content of the function is changing over time. Collineau (1994) and Ariño and Vidakovic (1995) consider an alternative version of the wavelet scalogram. In the nonparametric regression situation, they define the scalogram to be a plot of the vector of the energy at each level (E(0), E(1), ..., E(J − 1)), where the energy for level j is defined to be
E(j) = \sum_k d_{j,k}^2. \qquad (5.1)
This could be applied either with the true wavelet coefficients as defined in (5.1), or with the true coefficients replaced by their empirical counterparts computed from the noisy function f̃. This plot will indicate at which levels of resolution the energy of the function is concentrated. A relatively smooth function will have most of its energy concentrated in large-scale levels, yielding a scalogram that is large for small j and small for large j. A function with a lot of high frequency oscillations will have a large portion of its energy concentrated in high resolution wavelet coefficients.
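As a sketch of how the level-energy version of the scalogram might be computed from data, the following assumes PyWavelets; the function name and test signal are illustrative and not taken from the text.

```python
import numpy as np
import pywt

def level_energies(data, wavelet="db5"):
    """Energy E(j) = sum_k d_{j,k}^2 of the empirical wavelet coefficients, by level."""
    coeffs = pywt.wavedec(data, wavelet, mode="periodization")
    details = coeffs[1:]          # detail coefficients, ordered coarsest to finest level
    return np.array([np.sum(d ** 2) for d in details])

# Example: a smooth signal plus noise; most energy should sit at the coarse levels.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024)
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)
print(level_energies(y))
```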
5.3 Plotting Wavelet Coefficients
Since a set of wavelet coefficients completely describes a function or a set of data, a well-designed graphical display of these coefficients provides a useful diagnostic plot for data analysis. Wavelet coefficients are typically displayed
Figure 5.8: Scalogram of the Doppler function
in a plot like Figure 5.9, which plots the wavelet coefficients for the blocky function of Donoho and Johnstone (1994), plotted in Figure 5.10. Figure 5.9 is the plot that results from applying the discrete wavelet transform (see Section 6.2 for details) to a set of 1024 equally spaced values from the blocky function, using Daubechies' wavelet with N = 5. Because of the down-sampling mentioned in Chapter 4, the number of coefficients at each level is exactly half that of the next higher level, so there are 512 coefficients at level 9, 256 coefficients at level 8, and so on. Each coefficient is plotted as a line, up or down corresponding to the sign of the coefficient, from a reference line for each level. The length of each line reflects the magnitude of the coefficient relative to the others at the same level (each level possibly scaled differently). The coefficients are spaced at each level so as to represent their localizations. Such a plot provides a good description of where significant change is taking place in the function. The locations of the abrupt jumps in the blocky function can be spotted by looking for vertical (between levels) clusterings of relatively large coefficients. Large jump sizes correspond to proportionately larger coefficients. In this (noiseless) example, the finest-scale coefficients indicate with a fair amount of precision where the jumps in the function occur. When the signal is contaminated with additive noise, the noise is distributed evenly among all wavelet coefficients. In Figure 5.9, there are many coefficients at the higher levels that are exactly zero, because the support of
Figure 5.9: Wavelet coefficients for the blocky function
the wavelet falls entirely within a portion of the blocky function that is completely flat. When noise is added, the resulting "zero" coefficients will actually be N(0, σ²/n) random variables. Figure 5.12 shows the plot of the empirical coefficients when N(0, σ²) noise is added to the function. For the plot, the value of σ was chosen to give a signal-to-noise ratio of 3. For reference, the
Figure 5.10: The blocky function of Donoho and Johnstone (1994)
Figure 5.11: The blocky function with added noise, n = 1024
plot of the input data is given in Figure 5.11; it is still fairly easy to identify the points of discontinuity just by looking at the plot. In Figure 5.12, the larger jumps are still evident by clusters of large coefficients, but the higher-level coefficients are almost completely obscured by the random noise, making it much more difficult to pinpoint the jump locations. In a typical statistical application, we will observe data Y_1, ..., Y_n with an unknown mean function f: Y_i = f(i/n) + ε_i, where ε_i represents independent mean-zero noise. The main question of interest is in describing how (if at all) the mean function changes on the interval (0, 1). This question can be addressed by means of the discrete wavelet transform. If the function f is completely flat, then the wavelet coefficients of f will all be identically zero; the coefficients corresponding to the noisy observations Y_1, ..., Y_n should behave as a random sample of zero-mean random variables. It is apparent that there is plenty of noise in the plot in Figure 5.12, but some signal is also visible. For comparison, Figure 5.13 displays the plot of the wavelet coefficients when the input signal is a pure white noise sequence, i.e., f(u) = 0, 0 ≤ u ≤ 1. In Chapter 7, we will consider in detail the estimation of a regression function by using only the largest coefficients in reconstructing a noisy function (by setting the other coefficients to zero). This is mentioned here only to illustrate another use of the coefficient plots: to visualize the results of such coefficient selection. As an example, Figure 5.14 displays two coefficient plots side-by-side that correspond to the data used in an example in Section 7.1. (The raw data and the smooth estimate of f are shown in the upper right-hand plot in Figure 7.4.) The first plot in Figure 5.14 is that of the raw data; the
Figure 5.12: Wavelet coefficients for the noisy blocky function
Figure 5.13: Wavelet coefficients for a pure white noise sequence
Figure 5.14: Coefficient plots for the simulated data from Figure 7.4: the left-hand side plots the raw coefficients, and the right-hand side displays the coefficients that remain after applying the minimax coefficient selection rule
second plot represents the wavelet coefficients that remain after applying the particular coefficient selection rule to the set of empirical coefficients. Note that most of the coefficients have been set to zero and that only the largest (in absolute value) remain. There is one very large coefficient at the highest resolution level that remains after applying the selection rule; this corresponds to the rapid fluctuation in the reconstruction of the function around u = 0.4. Such coefficient selection rules will be covered in detail in Chapters 8 and 9.
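A coefficient display of this general kind is straightforward to produce from any discrete wavelet transform output. The sketch below is only an illustration in the spirit of Figure 5.9 (it assumes PyWavelets and matplotlib; the layout conventions are my own, not those of the S-Plus module): each level's coefficients are drawn as vertical lines from a per-level baseline, with each level scaled separately.

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt

def plot_dwt_coefficients(data, wavelet="db5"):
    """Stem-style display of empirical wavelet coefficients, one row per resolution level."""
    coeffs = pywt.wavedec(data, wavelet, mode="periodization")
    details = coeffs[1:]                         # coarsest to finest detail levels
    fig, ax = plt.subplots()
    for row, d in enumerate(details):
        scale = np.max(np.abs(d)) or 1.0         # per-level scaling, as in the text
        x = (np.arange(len(d)) + 0.5) / len(d)   # spread coefficients over [0, 1]
        ax.vlines(x, row, row + 0.45 * d / scale)
        ax.axhline(row, linewidth=0.3)
    ax.set_xlabel("location")
    ax.set_ylabel("resolution level (coarse at bottom)")
    return fig
```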
5.4 Other Plots for Data Analysis
There are of course a multitude of other plots that can be constructed for data analysis purposes. The ones considered here are all contained among the various options of the S-Plus Wavelet module in the exploratory data analysis function eda.plot. One method is to construct a Q-Q plot of the set of wavelet coefficients, as is done by Daniel (1959) for the cell means of a 2^k factorial design. Such a plot (see Chambers et al. (1983), for example) provides an excellent diagnostic tool for looking at the data to examine their distribution. The Q-Q plot consists of a scatterplot of the order statistics of a data set (the empirical quantiles) plotted against a corresponding set of true normal quantiles. If the underlying function f is flat, all empirical wavelet coefficients are identically distributed, and this will result in all the points in the Q-Q plot lying roughly along a straight line. If the function is not flat, however, changes in the function will be represented by large wavelet coefficients, and these points
Figure 5.15: Q-Q plots for empirical wavelet coefficients for the noisy blocky function and for white noise
typically lie off the line formed by the "small" coefficients. Figure 5.15 displays the Q-Q plots associated with the noisy blocky function in Figure 5.11 and the white noise sequence in Figure 5.13. It is clear from this plot that there is significant change in the function associated with the first data set, and the number of significant coefficients can be approximated by counting the number of points off the line. The coefficients from the second data set behave just as a random sample of normal random variables should, so there is no evidence of any significant wavelet coefficients. Another standard method for graphical data analysis is the box plot. Since the wavelet decomposition orders coefficients according to their dilation index, it is natural to consider the sets of coefficients grouped by scale. A set of side-by-side boxplots for the wavelet coefficients of the five highest levels associated with the noisy function in Figure 5.11 is plotted in Figure 5.16. For comparison purposes, the box plots for the coefficients of the white noise sequence in Figure 5.13 are also plotted. The corresponding wavelet levels are listed below each boxplot. Recall that there are 2^j coefficients at level j. For a set of pure noise, all the coefficients should have the same distribution, which seems to hold for the right-hand plot in Figure 5.16. It is clear that this is not the case for the left-hand plot, however, as can be seen by the widely differing widths of the five boxplots. This is related to the sparsity of representation idea inherent in wavelet analysis: At high levels, there are many coefficients, but only a very few correspond to actual signal; the rest are just noise. At low levels, however, a larger proportion of coefficients is needed to represent the function adequately. This is illustrated well by the first plot in
Figure 5.16: Box plots by wavelet level for empirical wavelet coefficients for the noisy blocky function and for white noise
Figure 5.16, as the coefficients at level 9 all behave as usual normal random variables, except for a handful of outliers. A reasonable rule of thumb might be to regard coefficients appearing as outliers in such a boxplot as containing significant signal.
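Both of these diagnostics are easy to reproduce. The sketch below assumes PyWavelets, SciPy, and matplotlib (all names here are illustrative, not the eda.plot implementation): it pools the empirical wavelet coefficients for a normal Q-Q plot and draws one box plot per resolution level.

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt
from scipy import stats

def coefficient_diagnostics(data, wavelet="db5"):
    """Q-Q plot of all empirical wavelet coefficients plus box plots grouped by level."""
    coeffs = pywt.wavedec(data, wavelet, mode="periodization")
    details = coeffs[1:]                       # detail coefficients, coarse to fine
    fig, (ax_qq, ax_box) = plt.subplots(1, 2)
    # Normal Q-Q plot of the pooled coefficients.
    stats.probplot(np.concatenate(details), dist="norm", plot=ax_qq)
    # Side-by-side box plots, one box per resolution level (coarse on the left).
    ax_box.boxplot(details)
    return fig
```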
CHAPTER
SIX
Some Practical Issues
In earlier chapters it may have seemed that wavelets and multiresolution analysis were treated more as a collection of interesting mathematical ideas than as a powerful method for solving practical problems. In fact, much of the beauty of wavelet analysis lies in its widespread applications. This chapter will discuss important issues that arise in moving wavelets from theory to practice. Earlier chapters have treated the Fourier transform and the wavelet transform at length, breaking down a known function f into its components. In statistical function estimation problems, f is never known, so coefficients (Fourier or wavelet) must be computed from discrete data points. The first two sections of this chapter deal with computing these transforms using observed data and the associated algorithms. There are various issues to consider when applying these algorithms to statistical problems. For instance, most of the development of wavelet analysis thus far has dealt with functions defined on all of IR, but in application we are often interested only in functions that are defined on a finite interval, over which data has been collected. Section 6.3 will discuss adapting wavelet analysis to "life on the interval." Finally, Section 6.4 discusses another practical matter: applying the discrete wavelet transform to data with any sample size. These points are all key in making wavelet methods applicable to a broad class of statistical problems.
6.1 The Discrete Fourier Transform of Data
Wavelet methods offer improvements over classical Fourier methods in both time-frequency localization and speed of computation. To give a basis for comparing these algorithms, this section discusses the Fourier algorithms, and the next deals with the wavelet transform.
The Fourier Transform of Sampled Signals
Recall that the usual definition of the Fourier transform,
\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt, \qquad (6.1)
requires knowledge of the "signal" f over all continuous time. In practice, we commonly have only the value of the signal at discrete points in time (and only a finite number of them). It is necessary to adapt the usual Fourier transform for this situation as well. Let Y_1, Y_2, ..., Y_n represent the sampled signal. Define a piecewise constant approximation of the continuous signal f with support on (0, n] by
\tilde{f}(t) = Y_k \quad \text{for } t \in (k-1, k], \; k = 1, \ldots, n, \qquad (6.2)
where the time unit is rescaled if necessary. This approximation is fairly good if the signal function f is reasonably smooth and n is large. Using this function in place of f in (6.1) gives an approximate Fourier transform:
\hat{f}(\omega) \approx \int_{-\infty}^{\infty} \tilde{f}(t)\, e^{-i\omega t}\, dt = \int_{0}^{n} \tilde{f}(t)\, e^{-i\omega t}\, dt \approx \sum_{k=1}^{n} \tilde{f}(k)\, e^{-i\omega k}.
The frequency domain is discretized as well, the Fourier transform being estimated only at the "natural frequencies" ω_j = 2π(j − 1)/n for j = 1, ..., n, giving the discrete Fourier transform of Y_1, ..., Y_n:
z_j = \sum_{k=1}^{n} Y_k\, e^{-2\pi i (j-1)k/n}. \qquad (6.3)
Each z_j gives information on the frequency content of the set of observations Y_1, Y_2, ..., Y_n at the frequency ω_j. The discrete Fourier transform is used extensively in time series analysis to look for hidden periodicities. For instance, in monthly data, an annual cycle will manifest itself in terms of a large z_{12} coefficient. By using such a transform and examining the relative sizes of the Fourier coefficients, it is possible to gain valuable insight into modeling the original signal. Clearly, the straightforward method for computing all coefficients in the Fourier transform (6.3) involves computing n sums, each sum requiring n multiplications. As the sample size gets large, the number of sums increases proportionally to n and the number of multiplications per sum also increases as n, so the computational cost of the algorithm (corresponding to the number of required operations) is O(n²).
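For concreteness, a direct implementation of (6.3) might look like the sketch below; the double loop makes the O(n²) cost explicit.

```python
import numpy as np

def dft(y):
    """Discrete Fourier transform as in (6.3): z_j = sum_k Y_k exp(-2*pi*i*(j-1)*k/n)."""
    n = len(y)
    z = np.zeros(n, dtype=complex)
    for j in range(1, n + 1):          # frequency index, 1-based as in the text
        for k in range(1, n + 1):      # time index, 1-based as in the text
            z[j - 1] += y[k - 1] * np.exp(-2j * np.pi * (j - 1) * k / n)
    return z
```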
The Fast Fourier Transform
Cooley and Tukey (1965) adapted ideas of Good (1958) to the Fourier situation and introduced an algorithm to compute the Fourier transform of data that had a reduced computational cost of only O(n log n), a significant improvement over the naive algorithm. This important contribution has become standard in implementation and has helped widen the application of Fourier methods. With modern computers, capable of executing hundreds of millions of instructions per second, it might seem that the improvement in computing speed is not worth the added complexity of the algorithm. With expanded computing power and speed, however, have come increasingly larger data sets, maintaining the premium on efficient algorithms. The development by Cooley and Tukey (1965) has become known as the Fast Fourier Transform (FFT). It should be noted here that, despite its name, the fast Fourier transform is not a transformation in its own right, but refers to an efficient algorithm for computing the usual discrete Fourier transform. This designation now includes a variety of related methods, the original of which is briefly described here. The main idea is to reduce the number of computations made by recursively computing the DFT of subsets of the data. This is done by reordering the data subsets so as to take advantage of some redundancies in the usual DFT algorithm. Descriptions of the fast Fourier transform are given in many sources. The approach used here parallels that of Wei (1990); other good sources include Bloomfield (1976) and Brigham (1988). Consider first the special case that n is even. The first simplification step involves writing the discrete Fourier transform of the entire data set in terms of the discrete Fourier transforms of the two halves of the data set: the even-indexed and the odd-indexed. In particular, we begin with (6.3) and split it into two halves:
z_j = \sum_{k=1}^{n} Y_k\, e^{-2\pi i (j-1)k/n} \qquad (6.4)
    = \sum_{k=1}^{n/2} Y_{2k}\, e^{-2\pi i (j-1)(2k)/n} + \sum_{k=1}^{n/2} Y_{2k-1}\, e^{-2\pi i (j-1)(2k-1)/n}
    = \sum_{k=1}^{n/2} Y_{2k}\, e^{-2\pi i (j-1)k/(n/2)} + e^{2\pi i (j-1)/n} \sum_{k=1}^{n/2} Y_{2k-1}\, e^{-2\pi i (j-1)k/(n/2)}
    = z_j^{e} + e^{2\pi i (j-1)/n}\, z_j^{o}, \qquad (6.5)
for j = 1, ..., n, where z_j^e denotes the discrete Fourier transform of the even-indexed data and z_j^o that of the odd-indexed data. Though z_j in (6.4) is defined for all j = 1, ..., n, we would normally compute z_j^e and z_j^o only for j = 1, ..., n/2, according to the usual definition (6.3), since there are only n/2 values in each half-sample of the original data set. We will see that there is some redundancy in (6.5), so that we will only be required to compute the DFT of the half-samples at n/2 different frequencies. First, we begin with the argument of the exponential function in (6.5), multiply by e^{2\pi i k} = 1, and rearrange the expression:
e^{-2\pi i (j-1)k/(n/2)} = e^{-2\pi i (j-1)k/(n/2)}\, e^{2\pi i k} = e^{-2\pi i k (j - n/2 - 1)/(n/2)}.
Applying this simple result to the definition of the DFT of the even- and odd-indexed points gives that, for j = n/2 + 1, ..., n,
z_j^{e} = z_{j-n/2}^{e} \qquad \text{and} \qquad z_j^{o} = z_{j-n/2}^{o}.
Thus, the expression (6.4) for z_j can be written in terms of the usual DFTs of the even- and odd-indexed subsequences as follows:
z_j = \begin{cases} z_j^{e} + e^{2\pi i (j-1)/n}\, z_j^{o}, & j = 1, \ldots, n/2, \\ z_{j-n/2}^{e} + e^{2\pi i (j-1)/n}\, z_{j-n/2}^{o}, & j = n/2 + 1, \ldots, n. \end{cases} \qquad (6.6)
Now, computing the DFT of Y_1, ..., Y_n according to (6.6) saves some computer time. Computing the DFT for two sequences of length n/2 requires 2(n/2)^2 operations using the definition formula (6.3), and computing z_j (j = 1, ..., n) according to (6.6) requires an additional n multiplications and additions. The total number of operations required is now
2\left(\frac{n}{2}\right)^2 + n = \frac{n^2}{2} + n, \qquad (6.7)
which represents a substantial savings if n is large. If n/2 is also even, then the even- and odd-indexed sequences can each be split in half as well, each of those DFTs being computed in terms of the DFTs of sequences of length n/4. Following the same reasoning as before, the number of operations in (6.7) is reduced further to
4\left(\frac{n}{4}\right)^2 + 2n = \frac{n^2}{4} + 2n.
If n = 2^m for some integer m, this algorithm can be iterated m times, giving the DFT of the original sequence (Y_1, ..., Y_n) with an approximate number of operations equal to
2^m \left(\frac{n}{2^m}\right)^2 + \underbrace{n + \cdots + n}_{m\ \text{terms}} = n + mn \approx n \log_2 n.
This represents an enormous reduction in computation time. For instance, with m = 11 (n = 2048), computing according to the straightforward formula (6.3) would take over 4 million operations; the fast Fourier transform algorithm would require fewer than 23,000 operations, a ratio of approximately 186:1. Of course, as n gets even larger, the comparison becomes even more extreme. The algorithm presented here is a special-case algorithm for n equal to a power of two, with each step consisting of breaking the sequence into two subsequences. Similar algorithms exist for other cases as well. Brigham (1988) and Bloomfield (1976) work out the more general case in which n = r_1 · r_2, where r_1 and r_2 are integers. It is not required, then, that n be a power of two; efficient FFT algorithms exist whenever n is highly composite, i.e., n is the product of small prime numbers. If this condition is not met (e.g., if n is itself prime), then an efficient algorithm is still possible, but the sequence must be "tweaked" slightly first. These issues will be discussed further in Section 6.4.
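A recursive radix-2 version of the splitting just described can be sketched as follows, under the assumption that n is a power of two and using the text's 1-based indexing and the combining factor e^{2πi(j−1)/n} from (6.6). This is an illustration of the structure, not an optimized implementation.

```python
import numpy as np

def fft2(y):
    """Radix-2 FFT for len(y) a power of two, following the splitting in (6.6)."""
    y = np.asarray(y, dtype=complex)
    n = len(y)
    if n == 1:
        return y
    z_even = fft2(y[1::2])   # DFT of Y_2, Y_4, ... (even-indexed in 1-based numbering)
    z_odd = fft2(y[0::2])    # DFT of Y_1, Y_3, ... (odd-indexed)
    j = np.arange(n // 2)    # j - 1 for j = 1, ..., n/2
    twiddle = np.exp(2j * np.pi * j / n)
    top = z_even + twiddle * z_odd                     # j = 1, ..., n/2 in (6.6)
    bottom = z_even - twiddle * z_odd                  # j = n/2 + 1, ..., n (factor flips sign)
    return np.concatenate([top, bottom])
```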
6.2 The Wavelet Transform of Data
Similar to what we did in taking approximate Fourier transforms, we could take approximate wavelet transforms by replacing the function f in the
definition of a wavelet coefficient by an estimate such as (6.2):
\tilde{d}_{j,k} = \int_{-\infty}^{\infty} \tilde{f}(x)\, \psi_{j,k}(x)\, dx.
Of course, the same idea can be used to estimate scaling function coefficients as well:
\tilde{c}_{J,k} = \int_{-\infty}^{\infty} \tilde{f}(x)\, \phi_{J,k}(x)\, dx \approx \sum_{\ell=1}^{n} Y_\ell\, \phi_{J,k}(\ell). \qquad (6.8)
Computing coefficients this way would, of course, require numerical integration, and, as we saw in the previous section, implementing this approximation directly would result in an O(n) algorithm to compute each (wavelet or scaling function) coefficient. We know from Section 4.1 that if we are given scaling function coefficients at a particular level, there exist good algorithms for computing all lower-level scaling function and wavelet coefficients. For the moment, let us consider the algorithm only after top-level coefficients are provided. Thus we begin with a set of high-level scaling function coefficients. Suppose for the moment that we have only a finite number of non-zero coefficients. This assumption is not unreasonable, as any signal f ∈ L²(ℝ) must decay in both directions, so scaling function coefficients c_{J,k} will be negligible for large |k|. Rescale the original function f if necessary so that the scaling function coefficients at level J are given by c_{J,0}, ..., c_{J,M−1}. Computing scaling function and wavelet coefficients at level J − 1 is accomplished via the decomposition algorithm formulas given in Section 4.1, specifically
c_{j,k} = \sum_{\ell \in \mathbb{Z}} h_{\ell - 2k}\, c_{j+1,\ell} \qquad (6.9)
and
d_{j,k} = \sum_{\ell \in \mathbb{Z}} (-1)^{\ell}\, h_{-\ell + 2k + 1}\, c_{j+1,\ell}. \qquad (6.10)
As noted in Section 4.1, the {h_k} sequence used in these computations has only finitely many non-zero values if the wavelets are compactly supported; otherwise the h_k values decay exponentially, so the sequence can be well approximated by finitely many terms. In either case, let K denote the number of non-zero terms used in the (possibly truncated) sequence. Computing a single coefficient at level J − 1 according to either (6.9) or (6.10) would take at most K operations. If scaling function coefficients
c_{J,k} for k ∉ {0, ..., M − 1} are set to zero, giving exactly M non-zero coefficients at level J, then the number of non-zero scaling function coefficients at level J − 1 would be at most
\left[\frac{M + K}{2}\right], \qquad (6.11)
where [x] denotes the greatest integer function of x. Note also that, if M is large, the number of non-zero wavelet coefficients is no more than the number in (6.11). Thus, the total number of non-zero scaling function coefficients at level J − 1 is approximately M/2, and the total number of operations required to compute the one-level-down wavelet and scaling function coefficients is approximately 2K · M/2. Let M_1 be the number of non-zero scaling function coefficients at level J − 1. Applying the decomposition again requires no more than
K \left[\frac{M_1 + K}{2}\right]
operations to compute the scaling function coefficients, and no more than the same number of operations to compute the wavelet coefficients. There will be approximately M_1/2 ≈ M/4 non-zero scaling function coefficients at level J − 2, and the computation will require approximately 2K · M_1/2 ≈ 2K · M/4 operations. Continuing in this way, we see that the total number of operations required to do all the decompositions is approximately
2K\left(\frac{M}{2} + \frac{M}{4} + \frac{M}{8} + \cdots\right) \le 2KM.
Given a set of top-level scaling function coefficients, this decomposition is very efficient, better even than the O(M log M) operations required for the fast Fourier transform algorithm. Of course, everything we have discussed thus far has assumed that the top-level scaling function coefficients are provided as input. The remarkable efficiency of the decomposition algorithm is moot without an efficient algorithm for computing the top-level coefficients. Recall from Section 4.2 that computing the decomposition procedure (6.9) amounts to applying a low-pass "smoothing" filter to the discrete signal {c_{j+1,k}}_{k∈ℤ}, resulting in a new signal {c_{j,k}}_{k∈ℤ}. This signal can be "smoothed" again by means of the same operation, and this can continue recursively as long as might be desired. At each level, the "detail signal" {d_{j,k}}_{k∈ℤ}, computed via (6.10), can be computed as well. Recall from Section 4.1 that
by combining the smoothed signal and the detail signals, the original unsmoothed signal can be recovered. In nonparametric regression, for instance, the goal is often to start with data Y_1, ..., Y_n (usually regarded as signal plus noise) and then "smooth" the sequence, in hopes of retaining the signal and removing the noise. The filtering ideas of Section 4.2 fit in nicely with this scenario. Regarding the original data Y_1, ..., Y_n as a signal, applying a low-pass filter H will result in a smoother version of the signal. This is precisely what is done in the standard discrete wavelet transform of data: the filters H and G are applied to Y_1, ..., Y_n. The same filters are then applied recursively to the smoothed versions of the data. The standard wavelet decomposition thus regards the input data Y_1, ..., Y_n as the highest-level scaling function coefficients, or, equivalently, the least-smoothed signal. Thus, there is no computational expense at all in computing the {c_{J,k}} sequence, and the entire wavelet decomposition requires only O(n) operations. In practice, then, it is possible to do a full wavelet decomposition and reconstruction without performing any numerical integration, indeed, without ever computing a single value of any wavelet function. Though not specifically discussed, it should be noted here that the wavelet reconstruction algorithm is also fast, requiring O(n) operations.
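One level of this decomposition is just filtering followed by keeping every other output. The sketch below applies the structure of (6.9)-(6.10) to a periodically extended data vector, with the loops written out to make the O(n) cost visible; PyWavelets is used only to supply the filter taps, and sign/alignment conventions differ across software, so this is an illustration rather than a drop-in replacement for a library DWT.

```python
import numpy as np
import pywt

def decompose_one_level(c, wavelet="db5"):
    """One decomposition step in the spirit of (6.9)-(6.10), with periodic boundaries."""
    w = pywt.Wavelet(wavelet)
    h = np.array(w.dec_lo)     # low-pass ("smoothing") filter taps
    g = np.array(w.dec_hi)     # high-pass ("detail") filter taps
    n, K = len(c), len(h)
    smooth = np.zeros(n // 2)
    detail = np.zeros(n // 2)
    for k in range(n // 2):
        for ell in range(K):                   # at most K operations per coefficient
            smooth[k] += h[ell] * c[(ell + 2 * k) % n]
            detail[k] += g[ell] * c[(ell + 2 * k) % n]
    return smooth, detail
```

Applying this step recursively to the smoothed output gives the full pyramid decomposition; since the output length halves at each step, the total work is proportional to n.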
6.3 Wavelets on an Interval
One topic of interest in general statistical function estimation is the handling of the boundaries. A vast literature deals with this problem in all varieties of smoothing methods. Some proposed techniques are readily adaptable to wavelet analysis, and these will be discussed briefly in this section. Up to now, the function space of interest has been L²(ℝ). In nonparametric regression and many other statistical applications, the data are gathered not over the entire real line, but over a finite interval. Thus, the space of real interest is actually L²(I), where I represents a finite interval. Since any finite interval can be translated and rescaled, we consider here only the unit interval [0, 1) without loss of generality. The Haar wavelet is most easily adapted to the unit interval. Note that the "mother wavelet" ψ defined in (1.6) has support [0, 1), as does the corresponding scaling function φ defined in (1.10). Dilates and translates of these two functions also fit nicely into the unit interval: Combining the supports of ψ_{1,0} and ψ_{1,1} also makes up the interval [0, 1); four wavelets at level 2 are required, and so on. Thus, the set of wavelets that forms a CONS for the interval [0, 1) is
{ψ_{j,k} : k = 0, ..., 2^j − 1, \; j = 0, 1, ...}.
Taking the discrete Haar transform on a set of n = 2^J (where J is a positive integer) data values gives wavelet coefficients at levels 0, 1, ..., J − 1, with 2^j
wavelet coefficients at level j. This gives a total of 1 + 2 + 4 + ... + 2^{J−1} = n − 1 wavelet coefficients. Including the lowest level scaling function coefficient c_{0,0} along with the wavelet coefficients gives an orthogonal transformation of the original data. Note that the coefficient c_{0,0} represents a "final smoothing" of the data; in the Haar case, this is just the sample mean. Adapting other wavelets to the interval is a little more difficult. Certainly, if a function f has domain [0, 1], techniques discussed previously could be applied directly if the domain of f were extended to include all of ℝ, with the expanded version of f defined to be zero outside of [0, 1]. Since this can introduce artificial discontinuities at the endpoints of the interval, this approach is not entirely satisfactory. Various other techniques have been proposed, and these will each be explored briefly. Some good discussions of this adaptation are given in Unser (1996), Jawerth and Sweldens (1994), and Chapter 10 of Daubechies (1992), among other sources. The discussion here begins with two very straightforward (and quite useful) solutions: periodic and symmetric boundary handling.
Periodic Boundary Handling
A very useful approach to adapting wavelet analysis to the unit interval is accomplished through periodic boundary handling. The function f defined on [0, 1] could be expanded to live on the real line by regarding it as a periodic function with period one: f(x) = f(x − [x]) for x ∈ ℝ. As might be expected, this approach introduces a periodicity in the wavelet coefficients as well. Computing a coefficient d_{j,k} with j ≥ 0 is accomplished via
d_{j,k} = \int_{-\infty}^{\infty} f(x)\, \psi_{j,k}(x)\, dx = \int_{-\infty}^{\infty} f(x)\, 2^{j/2}\, \psi(2^j x - k)\, dx.
Applying a change of variables and the fact that f(x) = f(x − 1) for all x ∈ ℝ gives further that
d_{j,k} = \int_{-\infty}^{\infty} f(x - 1)\, 2^{j/2}\, \psi(2^j x - k - 2^j)\, dx = \int_{-\infty}^{\infty} f(x)\, \psi_{j,k+2^j}(x)\, dx = d_{j,k+2^j}.
Thus, there are only 2^j unique wavelet coefficients at any level j ≥ 0, so the set of wavelet coefficients for j ≥ 0 can be indexed the same as the Haar
basis on the unit interval. Restricting j to be nonnegative provides only for the detail at higher resolution than the approximation space V_0. By forcing the periodicity of f and considering the resulting multiresolution analysis on [0, 1], we are actually adapting the scaling functions and wavelets to live on the unit interval as well. In essence, these functions are "wrapped around" as well, so that the portions of the functions on the intervals [m, m + 1), m ∈ ℤ, are all combined together. Thus, the appearance of these functions will be altered considerably, but the ideas of the multiresolution analysis are the same. This style of boundary handling is somewhat problematic in that, unless the function is truly periodic, it introduces artificial singularities by pasting the function together at the interval's endpoints. This can result in several large coefficients for wavelets centered near the boundaries that have no real interpretation in terms of the function f. Applying this approach to computing the discrete wavelet transform from data Y_1, ..., Y_n involves "wrapping around" the original sequence (for i = n + 1, ..., 2n, define Y_i = Y_{i−n}, with a similar treatment for negative i) and applying the usual filters to the expanded data. The decomposition relation can be applied in the same way, keeping in mind the imposed periodicity of the coefficients. This method has been used a great deal in statistical applications. This is true partly because the implementation is very straightforward, and partly because the resulting empirical wavelet coefficients are independent with identical variances, whenever an orthogonal wavelet family is used and the noise is Gaussian (see Chapter 7 for more details).
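In implementation, the "wrapping around" amounts to nothing more than indexing modulo n before the filters are applied; a minimal sketch:

```python
import numpy as np

def periodic_extend(y, pad):
    """Extend the data Y_1, ..., Y_n periodically by `pad` values on each side."""
    y = np.asarray(y)
    n = len(y)
    idx = np.arange(-pad, n + pad) % n   # ..., Y_{n-1}, Y_n, Y_1, ..., Y_n, Y_1, Y_2, ...
    return y[idx]
```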
Symmetric and Antisymmetric Boundary Handling
Another technique, used extensively in kernel function estimation, involves reflecting the function of interest about the boundaries. A symmetric reflection would require extending the domain of the function beyond [0, 1] and defining f(x) = f(−x) for x ∈ [−1, 0) and f(x) = f(2 − x) for x ∈ (1, 2]. This has an advantage over periodic boundary handling in that it preserves the continuity of the function, though discontinuities in the derivative of f may be introduced. Another way in which the function can be reflected about the endpoints is antisymmetrically: f(x) = 2f(0) − f(−x) for x ∈ [−1, 0) and f(x) = 2f(1) − f(2 − x) for x ∈ (1, 2]. This is able to preserve continuity in both the function and its first derivative (and all odd derivatives). As with periodic boundary handling, these methods impose their own alterations of the usual multiresolution analysis. Applying the reflected boundary handling method to the discrete wavelet transform of data would require reflecting the data in the appropriate way, and then applying the decomposition filter directly. This method will result in a few more than 2^j wavelet coefficients at level j, since a few extra
coefficients (the number depending on the length of the filter) are kept at the ends of the usual set of 2^j coefficients. Implementing this procedure is a little more involved than implementing the periodic boundary handling scheme. Having more than 2^j wavelet coefficients at each level gives more wavelet coefficients than data points, so some dependencies are introduced. Unser (1996) discusses some of the practical problems that arise when implementing these boundary conditions. Care must be given in coding the reconstruction algorithm to ensure that the original data can be recovered exactly. These three methods are quite useful and produce the desired effect: coercing the multiresolution analysis built for L²(ℝ) to live on the unit interval. It is possible, however, to construct wavelets specifically for L²[0, 1]. Several approaches to this are briefly presented here.
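Before turning to those constructions, here is a minimal sketch of the reflection preconditioning just described, with f(0) and f(1) approximated by the first and last observations; the reflection conventions chosen here are one reasonable option among several.

```python
import numpy as np

def symmetric_extend(y, pad):
    """Reflect the data about both ends: ..., Y_2, Y_1 | Y_1, ..., Y_n | Y_n, Y_{n-1}, ..."""
    y = np.asarray(y)
    return np.concatenate([y[pad - 1::-1], y, y[:-pad - 1:-1]])

def antisymmetric_extend(y, pad):
    """Reflect through the boundary values, preserving continuity of the first derivative."""
    y = np.asarray(y)
    left = 2 * y[0] - y[pad:0:-1]            # 2*f(0) - f(-x) on the left
    right = 2 * y[-1] - y[-2:-pad - 2:-1]    # 2*f(1) - f(2 - x) on the right
    return np.concatenate([left, y, right])
```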
Meyer Boundary Wavelets
As described by Meyer (1992), usual wavelet functions can be adapted so as to have their natural domain on an interval, thereby giving an orthonormal basis for L²[0, 1]. This development is outlined here without going into much detail. As an example, let us consider applying this adaptation to the Daubechies orthogonal wavelets with compact support. Recall that as the dilation index j gets larger, the supports of the corresponding wavelets and scaling functions diminish, so that for large enough j, each function φ_{j,k} will have support including at most one of the endpoints 0 and 1. Thus for sufficiently large j_0, it is possible to label each of the scaling functions φ_{j_0,k} as being a "0-intersecting," an "interior" (meaning that the support of the function is entirely contained within [0, 1]), or a "1-intersecting" scaling function. If we consider only the restriction of these scaling functions to [0, 1], they will form a basis for a subspace V_{j_0} ⊂ L²[0, 1]. If we consider making a similar restriction for j ≥ j_0, we can similarly create another approximation space V_j. These spaces will naturally inherit the multiresolution structure generated on the real line by the original scaling functions, so that an analogous form of Definition 1.4 holds for L²[0, 1], with the added restriction that j ≥ j_0. Thus a basis for the spaces V_j can be formed simply by restricting the support of the original scaling functions, but some work remains to make this basis orthogonal. Note that for any j ≥ j_0, the interior scaling functions are still mutually orthogonal, and, in fact, each interior scaling function is orthogonal to each 0-intersecting and each 1-intersecting scaling function. To see this latter fact, note that for any interior scaling function φ_{j,k} and any 0-intersecting or 1-intersecting scaling function φ_{j,k'}, the L²[0, 1] inner product between the two is
\int_0^1 \phi_{j,k}(x)\, \phi_{j,k'}(x)\, dx = \int_{-\infty}^{\infty} \phi_{j,k}(x)\, \phi_{j,k'}(x)\, dx = 0,
since the original version of the interior scaling function vanishes outside of [0, 1]. By a similar argument, it can be shown that for j ≥ j_0, each 0-intersecting scaling function is orthogonal to each 1-intersecting scaling function. Thus, it remains only to orthogonalize separately the sets of 0-intersecting and 1-intersecting scaling functions, and this can be accomplished using the Gram-Schmidt procedure for functions. Detail spaces and wavelets are added to the construction by noting that a set of similarly restricted versions of the original (defined on ℝ) wavelets will span a space W_j ⊂ L²[0, 1], which satisfies
V_{j+1} = V_j \oplus W_j.
There are more wavelet functions that have support overlapping [0, 1] than the dimension of W_j, however, so that these restricted wavelets do not form a basis. Further, these wavelets are not all mutually orthogonal, and they are not all orthogonal to the basis functions of V_j. It is necessary then to work out the linear dependence among the restricted wavelets to reduce the number of functions to obtain a basis, then to apply the Gram-Schmidt procedure again to orthogonalize the set of 0-intersecting and 1-intersecting wavelets and to force them to be orthogonal to each function in V_j. This useful construction suffers from a problem of numerical instability inherent in the orthogonalization procedure, which makes it difficult to implement. Other problems with this procedure are pointed out by Cohen, Daubechies, and Vial (1993), as they introduce an alternative solution to the construction of orthogonal wavelet bases on the interval.
Orthogonal Wavelets on the Interval
Another approach to adapting wavelets to live on the interval [0, 1] was proposed independently by Anderson et al. (1993) and Cohen, Daubechies, and Vial (1993). A short description of the method was given in Cohen et al. (1993). This section will describe only the main idea of this construction, with the interested reader referred to any of these articles for more details on the construction. Using the Daubechies family of compactly supported wavelets with filter length 2N (see Section 4.4), this scheme uses the fact that polynomials of degree up to N − 1 can be expressed in terms of the appropriately translated scaling functions alone. Certainly, then, the restrictions of all polynomials of degree up to N − 1 will be in V_j ⊂ L²[0, 1] for any j. As in Meyer's construction, we begin by choosing some integer j_0 large enough so that no scaling functions φ_{j_0,k}
have support overlapping both endpoints, and construct nested approximation spaces V_{j_0} ⊂ V_{j_0+1} ⊂ ... ⊂ L²[0, 1]. The "interior" scaling functions will be kept intact, but the boundary scaling functions are altered in a manner quite different from that used in the previous subsection. For any j ≥ j_0, an edge function can be created for any polynomial of degree ℓ by writing the monomial x^ℓ in terms of its scaling function representation:
x^\ell = \sum_k \langle x^\ell, \phi_{j,k} \rangle\, \phi_{j,k}(x).
A "left-edge" function ¢>e(x) is given by the above expression restricted to [0, 1] except that the sum is only over the k's that correspond to 0-intersecting scaling functions. The set of left-edge functions for 0 ~I!~ N- 1 is already mutually orthogonal. (The same is true of the set of "right-edge" functions, created in a similar manner.) Each of these sets is also orthogonal to the set of interior scaling functions, so it is only necessary to orthogonalize the functions in each edge set using Gram-Schmidt. Thus constructed, the union of these sets forms an orthonormal basis for Vj, and the construction is numerically stable. The Wj detail space is still defined as the difference in successive approximation spaces, and a basis for it is made up of the corresponding "interior wavelets," along with a few edge wavelets that are constructed similarly to the edge scaling functions. A nice feature of this construction (not shared by Meyer's method) is that there are 2i basis functions for each Vj and also 2i for each Wj, just as in the simple periodic MRA of described earlier in this section. Implementing the discrete wavelet transform using either the method of this section or that of Meyer is considerably more involved than implementing the transform resulting from imposing the simpler periodic or symmetric conditions. The filters are different from level to level, so rather than just using a handful of coefficients, substantial tables of edge coefficients must be stored. Examples of such tables are given in Cohen, Daubechies, and Vial (1993). Still, in many applications, having an orthogonal wavelet transform with no unpleasant edge effects is well worth the additional effort in implementation.
6.4 When the Sample Size is Not a Power of Two
In our discussion of adjusting the wavelets to live on a finite interval, we have been assuming that the sample size of the data set is a power of two, i.e., n = 2^J for some positive integer J. This is not a particularly restrictive assumption in fields such as signal processing and image analysis, as such sample sizes are often due to the natural sampling rate. In statistics, however,
we often have no control over the size of the sample, thus it is only in relatively rare situations that n is a power of two. Adapting the usual discrete wavelet decomposition and reconstruction algorithms with their inherent 2^J structure to arbitrary-length data sets is not trivial. In considering taking the discrete wavelet transform of an ordered data set Y_1, ..., Y_n, there are a number of properties we would want the transform to have. Among the most important might be the following:
1. ease of implementation,
2. orthogonality,
3. adaptation to a finite interval.
The orthogonality of the transform is perhaps more important in statistical applications than in other fields. Orthogonality ensures mutual independence of empirical coefficients (scaling function and wavelet) when the original data are independent with Gaussian errors. (This is discussed at length in Chapter 7.) There are, of course, other considerations, such as computational efficiency, exact reconstruction, and vanishing moments, but the three listed above are perhaps most important for typical applications to statistics. We might be willing to compromise somewhat on these conditions, but for statistical application, it is imperative that a wavelet transform be available for arbitrary sample sizes. The paper by Cohen, Daubechies, and Vial (1993) develops orthogonal wavelets on the unit interval. It concentrates its treatment on the 2^J case (which is of primary interest to engineers), but the general methodology is also given to compute the discrete wavelet transform on the interval for any sample size. This satisfies Properties 2 and 3 above perfectly, but the implementation would be quite involved. Though this is certain to appear in standard wavelet software packages in the future, at the time of this writing it has not yet been implemented for arbitrary sample sizes either in the S+Wavelets module or in WaveLab. When the sample size is a power of two, a wide variety of boundary treatments and wavelet families is available in virtually all currently available wavelet software packages. A natural approach would be to precondition the original data set somehow to get a set of values with length 2^J for some positive integer J. The resulting preconditioned data could then be plugged in directly to any standard discrete wavelet transform routines. A comparison of some of these methods is given in Ogden (1997). A brief discussion is given here as well. In Section 6.1, it was noted that the fast Fourier transform algorithm is computationally efficient whenever the sample size n is highly composite. In practice, a common way to precondition the data when this is not true is to "pad with zeroes," i.e., to increase the size of the data set to the next larger
power of two (or some other highly composite number), and then apply the FFT algorithm. This has been used in applying the discrete wavelet transform as well. This is certainly a reasonable solution, but it is somewhat problematic in that, to some extent, it "dilutes" the signal near the end of the original data set, since coefficients will have zeroes averaged into their computation. Also, since the filters are not applied evenly (multiplying by a signal element constrained to have magnitude zero is equivalent to omitting the filter coefficient), the orthogonality of the transform is not strictly maintained. Another possibility would be to regard the data as being observations of a function with domain [0, 1] with equal spacing 1/n, then interpolating the function to a grid with spacing 1/2^J for some positive integer J. The usual discrete wavelet transform (DWT) would then be applied to the interpolated data points. This approach also seems reasonable, and it works tolerably well in practice, but it has some of the same problems as padding with zeroes. Orthogonality of the transform is lost, so correlations between empirical coefficients are introduced (which don't go to zero even asymptotically). A third possibility is to abandon the fast algorithms discussed in Section 6.2 and compute the top-level empirical scaling function coefficients by numerical integration, as in (6.8). This is related to the à trous (with holes) algorithm discussed by Dutilleux (1989) and Shensa (1992). Again, this procedure gives reasonable results, but the strict orthogonality is lost. For any finite sample size, this approach will introduce autocorrelations between coefficients, but the amount of autocorrelation becomes small as n gets large. Rather than manipulating the data set and applying a standard orthogonal transform to the result, one might apply the decomposition corresponding to a "nearly orthogonal" biorthogonal wavelet transform. This is discussed by Cohen, Daubechies, and Feauveau (1992) and in Chapter 8 of Daubechies (1992). Such biorthogonal wavelet transforms are readily adapted to the interval, and the discrete wavelet transform can be applied to data of any length. The resulting coefficients will not be strictly uncorrelated, but they may be treated as such for some applications.
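The zero-padding preconditioner is the simplest of these to write down; a minimal sketch (truncating back to the original length after reconstruction is left to the user):

```python
import numpy as np

def pad_to_power_of_two(y):
    """Append zeroes so that the length of y becomes the next power of two."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    m = 1 << (n - 1).bit_length()          # smallest power of two >= n
    return np.concatenate([y, np.zeros(m - n)]), n   # keep n to undo the padding later

# Example: a sample of size 1000 is padded to length 1024 before applying the DWT.
padded, original_n = pad_to_power_of_two(np.random.randn(1000))
```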
CHAPTER
SEVEN
Other Applications
With a basic understanding of wavelet theory and a knowledge of the practical issues involved in applying wavelets to observed data, we are now ready to extend the basic methods of Chapter 3 to more sophisticated techniques on a wide variety of applications. Perhaps the most common wavelet application in statistics is nonparametric regression, which is covered in some depth in Section 7.1. This will serve as a groundwork for other applications treated later in this chapter: density estimation, estimation of the spectral density in time series, and the general change-point problem. Extensions of these methods will be given in the context of nonparametric regression in Chapter 8.
7.1 Selective Wavelet Reconstruction
As stated in Section 3.3, the approach considered by Antoniadis, Grégoire, and McKeague (1994) consists of projecting the raw estimator f̃ onto the approximating space V_J for any choice of the smoothing parameter J, which represents a linear estimation procedure (in that it operates linearly on the data). In contrast to this, David Donoho and Iain Johnstone, in a seminal series of papers (see References), offer a non-linear wavelet-based approach to nonparametric regression. This type of approach has received a great deal of attention from statisticians, signal processors, and image analysts alike. The Donoho-Johnstone approach begins with computing the discrete wavelet transform of the data Y_1, ..., Y_n, thereby creating a new data set of (noisy) empirical wavelet coefficients, as outlined in Chapter 6. The basic idea behind selective wavelet reconstruction is to choose a relatively small number of wavelet coefficients with which to represent the underlying regression function f. It will be assumed throughout the treatment of this subject that the function f has been scaled to live on the unit interval [0, 1] and that the wavelet coefficients are computed according to some family of orthogonal wavelets with an appropriate method for dealing with the boundaries. For simplicity, it may be assumed that there are 2^J equally spaced data values for some positive integer J and that periodic boundary handling is used. More details about the discrete wavelet transform under these (and more general) conditions are included in Chapter 6.
This selective reconstruction idea is based on the premise that virtually any regression function f can be well-represented in terms of only a relatively small number of wavelet components at various resolution levels, the same general idea that drives the use of wavelets in data compression. A more precise definition of what is meant by virtually any is given later in this chapter. A heuristic justification of this claim is presented here, along with an example to illustrate the "sparsity of representation" property of wavelets. Any smooth function f can be approximated well (in the L² sense) by its projection onto a V_j space for relatively small j, requiring only a small number of coefficients. All the higher-level wavelet coefficients in the discrete wavelet decomposition could be regarded as being set to zero. Adding some unusual feature (such as a discontinuity in f' or in the function itself) would require additional higher-level wavelet components to represent the function f well. Such a localized phenomenon, however, would require only a small number of additional coefficients, the number of needed coefficients at level j decreasing as j increases. Adding any finite number of such unusual features would still give only a relatively small number of significantly non-zero coefficients, illustrating the idea of the sparse wavelet representation of functions. The justification of this approach begins with the decomposition of the function f into its wavelet components, as described in Chapter 1. If the function f were known, its wavelet coefficients could be computed according to
$$ \theta_{j,k} \;=\; \int_0^1 f(u)\,\psi_{j,k}(u)\,du. $$
The notation $\theta_{j,k}$ is used in this statistical context to emphasize that each wavelet coefficient can be regarded as a parameter. Donoho and Johnstone (1994) point out that coefficients computed in such a manner can be used to answer the question "Is there a significant change in the function near t?" for t close to $2^{-j}k$. If there is a large change, it will be manifested by a large (positive or negative) value of $\theta_{j,k}$; if the function is nearly flat near $2^{-j}k$, then the coefficient $\theta_{j,k}$ will be close to zero. As the level j increases, the information about f that is contained in the wavelet coefficients becomes more localized, allowing one to do a better job of pinpointing exactly where the change is taking place. The wavelet components corresponding to coefficients that are close to zero can be neglected in the reconstruction of the function, with only a negligible loss of information. The notion of selective wavelet reconstruction is illustrated in Figure 7.1 using the example function from Chapter 1. The example function is approximated using four different reconstructions, each using only a relatively small number of the largest (in absolute value) wavelet coefficients. The coefficients included in the reconstruction
[Panels of Figure 7.1: reconstructions using 4, 12, and 16 of the largest coefficients.]
Figure 7.1: Selective wavelet reconstructions (using only the largest coefficients in absolute value) for the example function in Figure 1.2
are chosen irrespective of their resolution level; magnitude is the only criterion. Generally speaking, the statistical methods of function estimation parallel this deterministic approach to function approximation. It is informative to compare Figure 7.1 with Figure 1.2 (the Fourier reconstruction) and also with Figure 1.13 (the projection of the example function onto spaces of varying resolution). In nonparametric regression, the parameter values $\{\theta_{j,k}\}$ are unknown, so they must be estimated from the data. Define the empirical wavelet coefficient corresponding to the true coefficient $\theta_{j,k}$ as
$$ w^{(n)}_{j,k} \;=\; \int_0^1 \tilde f(u)\,\psi_{j,k}(u)\,du. \tag{7.1} $$
(In practice, the empirical coefficients are computed according to the algorithms discussed in Section 6.2, but they are written here in terms of (7.1) to emphasize the correspondence between the empirical and true coefficients.) Given data values $Y_1, \ldots, Y_n$, which are distributed as
$$ Y_i \sim N\!\left(f(i/n),\, \sigma^2\right), \qquad i = 1, \ldots, n, \ \text{independently}, $$
it is relatively straightforward to derive the approximate distribution of $w^{(n)}_{j,k}$ from (7.1). In particular, $w^{(n)}_{j,k}$ is normal with mean
$$
\begin{aligned}
E\!\left[w^{(n)}_{j,k}\right] &= E\!\left[\int_0^1 \tilde f(u)\,\psi_{j,k}(u)\,du\right] \\
&= \sum_{i=1}^{n} E[Y_i]\int_{(i-1)/n}^{i/n} \psi_{j,k}(u)\,du \\
&= \frac{1}{n}\sum_{i=1}^{n} f(i/n)\,\psi_{j,k}(i/n) + O\!\left(\frac{1}{n^2}\right) \\
&= \int_0^1 f(u)\,\psi_{j,k}(u)\,du + O\!\left(\frac{1}{n}\right) + O\!\left(\frac{1}{n^2}\right) \\
&= \theta_{j,k} + O\!\left(\frac{1}{n}\right).
\end{aligned}
$$
The variance of $w^{(n)}_{j,k}$ can be computed in a similar way:
$$ \mathrm{Var}\!\left[w^{(n)}_{j,k}\right] \;=\; \frac{\sigma^2}{n} + O\!\left(\frac{1}{n^2}\right). $$
The above results hold true provided that the wavelet $\psi_{j,k}$ and the function f are sufficiently smooth. In particular, this result holds if the mother wavelet has one continuous derivative and if f is piecewise continuous and piecewise smooth with a finite number of discontinuities. (This smoothness assumption of the wavelets is satisfied, for example, by the Daubechies wavelets with sufficiently large N. This can be seen by applying results in Daubechies and Lagarias (1991, 1992) to the Daubechies families of wavelets.) These conditions could be relaxed and similar results would hold, but the assumptions given above are sufficient for the purposes of this discussion. See Ogden (1994) for more details. In a manner similar to the two computations given above and under the same assumptions, it can be shown that the empirical wavelet coefficients are (at least asymptotically as $n \to \infty$) independent.
Alternatively, the distribution of the empirical wavelet coefficients can be derived using a matrix representation of the discrete wavelet transform of data. This more closely represents the way the decomposition is done in practice; the above results are included to give a more intuitive idea of computing the wavelet transform of data. Let $W_n$ represent the $n \times n$ orthogonal matrix associated with the orthonormal wavelet system of choice. Let Y (no subscript) denote the vector of data values: $Y = (Y_1, \ldots, Y_n)'$. Then we can write
$$ w \;=\; \frac{1}{\sqrt{n}}\, W_n\, Y, \tag{7.2} $$
in which the vector w (no subscripts) is the vector of wavelet coefficients
$$ w \;=\; \left( w^{(n)}_{-1,0},\ w^{(n)}_{0,0},\ w^{(n)}_{1,0},\ w^{(n)}_{1,1},\ \ldots,\ w^{(n)}_{J-1,\,2^{J-1}-1} \right)'. $$
Here the extra coefficient denoted $w^{(n)}_{-1,0}$ is actually the lowest-level scaling function coefficient. As noted in Section 6.3, this represents a "final smoothing" of the data. It is included here to allow an invertible $n \times n$ transformation. The factor $1/\sqrt{n}$ is included in (7.2) so as to unify the two representations of the discrete wavelet transform (7.1) and (7.2). Due to the orthogonality of the matrix $W_n$, it is clear from (7.2) that the vector w will have a multivariate normal distribution with variance-covariance matrix $\sigma^2 I_n/n$, where $I_n$ represents the $n \times n$ identity matrix. The mean of the vector w is the vector that would result from applying the transform in (7.2) to the mean vector $(f(1/n), f(2/n), \ldots, f(1))'$. This vector is approximately equal to the vector of $\theta_{j,k}$'s, indexed the same way as w, the approximation error being of order $O(1/n)$. Since these two sets of means for the w vector are essentially equivalent, $\theta_{j,k}$ will be used to indicate the result of either decomposition interchangeably. Since $W_n$ is an orthonormal transform, the data vector can be reconstructed exactly via the inverse transform
$$ Y \;=\; \sqrt{n}\, W_n'\, w. $$
The empirical coefficient $w^{(n)}_{j,k}$ is thus an estimator of the true coefficient $\theta_{j,k}$, converging at the usual parametric rate:
$$ w^{(n)}_{j,k} \;\dot\sim\; N\!\left(\theta_{j,k},\ \frac{\sigma^2}{n}\right), \tag{7.3} $$
which can be expressed as
$$ w^{(n)}_{j,k} \;\approx\; \theta_{j,k} + \frac{1}{\sqrt{n}}\, z_{j,k}, $$
where the $z_{j,k}$'s are a set of (unobservable) n independent $N(0, \sigma^2)$ random variables. Thus, as pointed out by Donoho and Johnstone (1994), each empirical wavelet coefficient consists of a certain amount of noise, but only relatively few consist of significant signal. The noise in the original sequence $Y_1, \ldots, Y_n$ is spread out uniformly among all empirical wavelet coefficients. The natural question to ask is, "Which of the coefficients contain significant signal, and which are mostly noise?" Then, once we have chosen the set of coefficients containing significant signal, some attempt might be made to remove the noise component from each empirical coefficient. This is the heuristic idea underlying the Donoho-Johnstone method. Large "true" coefficients $\theta_{j,k}$ will typically have large corresponding empirical coefficients $w^{(n)}_{j,k}$, and so it is natural to reconstruct the function using only the largest empirical coefficients in an attempt to estimate f. The idea of wavelet thresholding represents a very useful method for selective wavelet reconstruction using only noisy (empirical) coefficients.
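As a concrete numerical illustration of the matrix form (7.2) and of the way the noise spreads evenly across the empirical coefficients while the signal concentrates in only a few of them, the following sketch builds an orthonormal Haar transform matrix explicitly and applies it to noisy observations of a piecewise constant function. The construction of the matrix, the example function, and the noise level are illustrative assumptions, not the algorithms of Chapter 6.

```python
# A minimal numerical sketch of w = (1/sqrt(n)) W_n Y with an explicit
# orthonormal Haar matrix; the test function and noise level are assumptions.
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix for n a power of two."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                    # coarser-level rows
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # finest-level differences
    m = np.vstack([top, bottom])
    return m / np.sqrt((m ** 2).sum(axis=1, keepdims=True))

rng = np.random.default_rng(0)
n, sigma = 256, 0.5
t = np.arange(1, n + 1) / n
f = np.where(t < 0.4, 1.0, np.where(t < 0.7, 3.0, 0.5))   # piecewise constant f
y = f + sigma * rng.standard_normal(n)

W = haar_matrix(n)
assert np.allclose(W @ W.T, np.eye(n))              # W_n is orthogonal
theta = (W @ f) / np.sqrt(n)                        # "true" coefficients
w = (W @ y) / np.sqrt(n)                            # empirical coefficients, eq. (7.2)

# Only a handful of true coefficients are non-negligible, while every empirical
# coefficient carries noise of standard deviation roughly sigma/sqrt(n).
print(np.sum(np.abs(theta) > 1e-8), "nonzero true coefficients out of", n)
print("noise sd per coefficient:", sigma / np.sqrt(n), "vs", np.std(w - theta))
```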
Wavelet Thresholding

A technique for selective wavelet reconstruction similar to the general approach presented here was proposed by Weaver, et al. (1991) to remove random noise from magnetic resonance images. Donoho and Johnstone (1994) develop the technique from a rigorous statistical point of view, by considering selective wavelet reconstruction as a problem in multivariate normal decision theory. DeVore and Lucier (1992) also developed the same approach independently. Since the largest "true" coefficients are the ones that should be included in a selective reconstruction, in estimating an unknown function it is natural to include only coefficients larger than some specified threshold value. Here (and throughout this chapter), a "large" coefficient is taken to mean one that is large in absolute value. For a given threshold value λ, such an estimator can be written
$$ \hat f(x) \;=\; \sum_{j}\sum_{k} w^{(n)}_{j,k}\, I_{\left\{\left|w^{(n)}_{j,k}\right| > \lambda\right\}}\, \psi_{j,k}(x), \tag{7.4} $$
where $I_A$ represents the indicator function of the set A. This represents a "keep or kill" wavelet reconstruction, where the large coefficients (relative to the threshold λ) are kept intact and the small coefficients are set to zero. This thresholding can be thought of as a nonlinear operator on the vector of coefficients, resulting in a vector $\hat\theta$ of estimated coefficients that are then plugged into the inverse transform algorithm. Such a thresholding scheme is designed to distinguish between empirical coefficients that belong in the reconstruction (corresponding, one would
hope, to true coefficients which contribute significant signal) and those that do not belong (corresponding to negligibly small true coefficients). In making this decision, we should account for the two factors that affect the precision of the estimators: the sample size n and the noise level $\sigma^2$. All other things being held equal, a coefficient is a strong candidate for inclusion if the sample size is large and/or if the noise level is small. Based on the result in (7.3), the thresholding will be performed on $\sqrt{n}\, w^{(n)}_{j,k}/\sigma$, since this quantity is normally distributed with variance one for all values of n and σ. The thresholding estimator of the true coefficient $\theta_{j,k}$ can thus be written
$$ \hat\theta_{j,k} \;=\; \frac{\sigma}{\sqrt{n}}\, \delta_\lambda\!\left( \frac{\sqrt{n}\, w^{(n)}_{j,k}}{\sigma} \right), \tag{7.5} $$
where the function $\delta_\lambda$ in (7.5) is the hard thresholding function
$$ \delta^{H}_{\lambda}(x) \;=\; \begin{cases} x, & \text{if } |x| > \lambda, \\ 0, & \text{otherwise.} \end{cases} \tag{7.6} $$
This "keep or kill" hard thresholding operation is not the only reasonable way to estimate wavelet coefficients. Recognizing that each empirical coefficient consists of both a signal portion and a noise portion, it might be desired to attempt to isolate the signal portion by removing the noisy part. This idea leads to the soft thresholding function also considered by Donoho and Johnstone (1994):
$$ \delta^{S}_{\lambda}(x) \;=\; \begin{cases} x - \lambda, & \text{if } x > \lambda, \\ 0, & \text{if } |x| \le \lambda, \\ x + \lambda, & \text{if } x < -\lambda, \end{cases} \tag{7.7} $$
which can also be used in (7.5). When the soft thresholding operator is applied to a set of empirical wavelet coefficients, only coefficients greater than the threshold (in absolute value) are included in the reconstruction, but their values are "shrunk" toward zero by an amount equal to the threshold λ. These two thresholding functions are displayed in Figure 7.2. Clearly, in using either type of wavelet thresholding, the choice of a threshold is a fundamental issue. Choosing a very large threshold will make it very difficult for a coefficient to be judged significant and included in the reconstruction, consequently resulting in an oversmoothing. Conversely, choosing a very small threshold value will allow many coefficients to be included in the reconstruction, giving a wiggly, undersmoothed estimate. The proper choice of threshold involves a careful balance of these principles.
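In code, the two rules (7.6) and (7.7) are one-liners; the sketch below is an illustrative implementation (not taken from the book) that applies both to the same small set of values.

```python
# Hard and soft thresholding, eqs. (7.6) and (7.7); a small illustrative sketch.
import numpy as np

def hard_threshold(x, lam):
    """Keep-or-kill: zero out entries with |x| <= lam, keep the rest intact."""
    x = np.asarray(x, dtype=float)
    return x * (np.abs(x) > lam)

def soft_threshold(x, lam):
    """Shrink toward zero by lam; entries with |x| <= lam become exactly zero."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([-3.0, -1.2, -0.4, 0.3, 0.9, 2.5])
print(hard_threshold(x, 1.0))   # [-3.  -1.2  0.   0.   0.   2.5]
print(soft_threshold(x, 1.0))   # [-2.  -0.2  0.   0.   0.   1.5]
```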
Figure 7.2: The hard and soft thresholding functions
Figure 7.3 illustrates the effect of varying the threshold value on the resulting estimator. The underlying function for the simulated data in the left-hand column is a simple sine curve; that on the right-hand side is a piecewise constant function with jumps at 0.25 and 0.75. For each data set, three values of λ were used as the hard thresholding operator was applied to wavelet coefficients at all levels.
Spatial Adaptivity

One drawback of some of the standard methods for function estimation discussed in Chapter 2 is that they are not spatially adaptive. Some functions might require a greater amount of smoothing in some portions of the domain (where f is relatively flat, for example), and less smoothing in other places (where f has one or more finer-scale features). A spatially adaptive estimator is one with the ability to discern from the data where more smoothing is needed, and where less will suffice, and then to apply the needed amount of smoothing. Variations of some of the methods discussed earlier have been developed to make them more spatially adaptive. For a kernel regression estimator, it is common to use a variable bandwidth, replacing the fixed bandwidth λ in (2.16) with one that varies with u: λ(u). For a point u at which f(u) is fairly smooth, the bandwidth should be relatively large, allowing a greater amount of averaging. Similarly, at locations where f(u) is less regular, a smaller band-
[Panels of Figure 7.3: thresholds 1.75, 2.75, and 3.75 applied to each of the two simulated data sets.]
Figure 7.3: Simulated data sets smoothed by hard wavelet thresholding, with varying values of the threshold
width should be used, using the more localized data for estimation. Selecting an appropriate bandwidth λ(u) for use in estimating f(u), of course, begs the question of the smoothness of the function f near u. Variable bandwidth estimators are thus heavily data dependent. A thorough discussion of variable bandwidth estimation is given by Müller and Stadtmüller (1987). Donoho and Johnstone (1994) discuss the built-in spatially adaptive properties of wavelet threshold estimators, compared with other popular spatially adaptive nonparametric regression procedures. In this landmark paper, Donoho and Johnstone compare theoretical properties associated with each method supposing an "oracle" is present: The oracle provides additional information to aid in estimation, not revealing the actual values of the function, but, in the case of variable bandwidth kernel estimation, giving the optimal bandwidth λ(u) with which to estimate f(u). The first conclusion drawn is that selective wavelet thresholding, when equipped with such an oracle, is competitive with each of the other spatially adaptive smoothing techniques, similarly equipped. Such theoretical results are of little practical import in real-data function estimation situations, in which oracles are unavailable, so Donoho and Johnstone go on to consider using the data to mimic the information that an oracle would provide. They conclude that thresholding the empirical wavelet coefficients as described above will estimate the function almost as well as if an oracle were actually present supplying information. Since methods for effectively imitating oracles (using only the data) in the other spatially adaptive estimation methods are unknown, their conclusion is that wavelets provide an extremely powerful tool in statistical function estimation. The above is a very brief synopsis of the results of theorems contained in Donoho and Johnstone's paper. For the technical details of these results, the interested reader is referred to their original paper and to Donoho, et al. (1995).
Global Thresholding
Given the basic framework of function estimation using wavelet thresholding, there are a variety of methods to choose the threshold λ in (7.5) for any given situation. These can be grouped into two categories: global thresholding, by which we mean choosing a single value of λ to be applied globally to all (or nearly all, as explained later) empirical wavelet coefficients, and level-dependent thresholding, by which is meant that a possibly different threshold value $\lambda_j$ is chosen for each wavelet level j. In this chapter we consider only global thresholding; the many proposals of level-dependent thresholding for nonparametric regression problems will be treated separately in Chapter 8. In global thresholding, it is typical to threshold only coefficients at the higher levels of resolution. Since the lower-level coefficients correspond to "macro"
Table 7.1: Minimax thresholds for various sample sizes, from Donoho and Johnstone (1994).

      n      λ          n       λ
     64    1.474      2048    2.414
    128    1.669      4096    2.594
    256    1.860      8192    2.773
    512    2.047     16384    2.952
   1024    2.232     32768    3.131
features of the data, these are often included automatically in the reconstruction. Thus, in global thresholding, the raw estimator $\tilde f$ is projected onto an approximation space $V_{j_0}$ for some small $j_0$ (which represents the smooth components of the data) and then coefficients at higher resolution are thresholded, so that the noise is suppressed but the fine-scale details are included. By so doing, the thresholding estimator (7.4) is generalized to
$$ \hat f(x) \;=\; \sum_{k} c^{(n)}_{j_0,k}\,\phi_{j_0,k}(x) \;+\; \sum_{j=j_0}^{J-1}\sum_{k} \frac{\sigma}{\sqrt{n}}\,\delta_\lambda\!\left(\frac{\sqrt{n}\,w^{(n)}_{j,k}}{\sigma}\right)\psi_{j,k}(x), $$
i.e., the scaling function coefficients at level $j_0$ are kept intact while the wavelet coefficients at levels $j_0$ and above are thresholded according to (7.5),
where $\delta_\lambda$ is either thresholding function. Note that this procedure requires us to know σ, the value of the standard deviation of the data values. In practice, σ is often unknown, but these techniques can still be applied simply by replacing σ above by an estimate $\hat\sigma$. Estimation of σ is the topic of the next subsection. Donoho and Johnstone (1994) propose two methods of global wavelet thresholding. The first one considered here is labeled minimax thresholding, which applies the optimal threshold in terms of $L^2$ risk. This optimal (minimax) threshold depends on the sample size n and is derived to minimize the constant term in an upper bound of the risk involved in estimating a function. It does not exist in closed form, but can be approximated numerically. The minimax threshold values in Table 7.1 are taken from the table presented in Donoho and Johnstone's paper. Various examples of using this minimax threshold in nonparametric regression are given in Figure 7.4. These estimators are designed to perform well in terms of mean-square-error performance.
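A minimal sketch of this global scheme using the PyWavelets package follows; the 'db6' wavelet, the cut-off level, the known σ, and the tabled value λ = 2.232 for n = 1024 are illustrative assumptions. Since an orthonormal DWT of data with noise standard deviation σ produces detail coefficients whose noise standard deviation is also σ, thresholding $\sqrt{n}\,w^{(n)}_{j,k}/\sigma$ at λ amounts to thresholding the raw DWT coefficients at σλ.

```python
# Global thresholding: keep the coarsest blocks, hard-threshold the finer levels.
# Assumptions: 'db6' wavelet, known sigma, minimax value 2.232 for n = 1024.
import numpy as np
import pywt

rng = np.random.default_rng(1)
n, sigma = 1024, 1.0
t = np.arange(1, n + 1) / n
f = np.sin(2 * np.pi * t) + (t > 0.6)             # smooth signal plus one jump
y = f + sigma * rng.standard_normal(n)

lam = 2.232                                        # minimax threshold for n = 1024
coeffs = pywt.wavedec(y, "db6", mode="periodization")
keep_levels = 3                                    # coarsest blocks kept "as is"
new_coeffs = list(coeffs[:keep_levels])
for c in coeffs[keep_levels:]:                     # hard-threshold the finer levels
    new_coeffs.append(pywt.threshold(c, sigma * lam, mode="hard"))
f_hat = pywt.waverec(new_coeffs, "db6", mode="periodization")
print("RMSE:", np.sqrt(np.mean((f_hat - f) ** 2)))
```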
Figure 7.4: Minimax and universal thresholding applied to the simulated data sets in Figure 7.3
In the same paper, Donoho and Johnstone (1994) present another rule for thresholding that has become very common in practice. Calling it the universal thresholding method, the authors propose setting $\lambda = \sqrt{2\log n}$. This alternative to the minimax procedure is also asymptotically optimal and simpler to implement. The universal threshold value $\sqrt{2\log n}$ is substantially larger than its minimax counterpart for any particular value of n. As a result, the reconstruction will include fewer coefficients, resulting in an estimate that is a good deal smoother than the minimax estimate. Since a smoother estimator is often considered to be more visually appealing, this method is called the "VisuShrink" method. The advantages to using the universal threshold listed above are partially offset by worse mean square error performance for small and moderate samples. Examples of data sets smoothed using the universal threshold policy are also given in Figure 7.4. The simulated data sets are the same as those used to illustrate the differences in choices of threshold in Figure 7.3. The differences between the two global thresholding methods are readily apparent in the figure. The minimax method does a better job at picking up abrupt jumps, at the expense of smoothness; the universal policy gives smooth estimates that don't pick up jumps or other such features as well.
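The gap between the two rules is easy to see numerically; the snippet below simply evaluates $\sqrt{2\log n}$ for the sample sizes of Table 7.1 and compares it with the tabled minimax values (the comparison itself is from the text; the code is only an illustration).

```python
# Compare the universal threshold sqrt(2 log n) with the minimax values of Table 7.1.
import numpy as np

minimax = {64: 1.474, 128: 1.669, 256: 1.860, 512: 2.047, 1024: 2.232,
           2048: 2.414, 4096: 2.594, 8192: 2.773, 16384: 2.952, 32768: 3.131}
for n, lam_minimax in minimax.items():
    lam_universal = np.sqrt(2 * np.log(n))
    print(f"n = {n:6d}:  minimax = {lam_minimax:.3f},  universal = {lam_universal:.3f}")
# For every n in the table the universal threshold is substantially larger,
# e.g. 3.723 vs 2.232 at n = 1024, so fewer coefficients survive thresholding.
```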
As mentioned earlier, the typical sparsity of the $\theta_{j,k}$ sequence ensures that most of the appropriately scaled $\sqrt{n}\,w^{(n)}_{j,k}$ coefficients are essentially white noise. The Donoho-Johnstone universal thresholding method safeguards against allowing spurious noise into the reconstruction. This is due to the fact that if $Z_1, \ldots, Z_n$ represent an iid N(0, 1) sequence, then
$$ P\!\left[\,|Z_i| \le \sqrt{2\log n}\ \ \text{for all } i = 1, \ldots, n\,\right] \;\longrightarrow\; 1 \tag{7.8} $$
as n goes to infinity. Essentially, (7.8) says that the probability of all noise being shrunk to zero is very high for large samples. Since the universal thresholding procedure is based on this asymptotic result, it does not always perform well in small sample situations.
Estimation of the Noise Level

The thresholding rules given earlier all require at least an estimate of σ, the standard deviation of the observations $Y_1, \ldots, Y_n$. While estimating the noise level in any parametric setting is relatively simple, in nonparametric regression, for example, it is considerably more involved. Taking the usual standard deviation of the data values is clearly not a good idea: this will only work well if the underlying function is reasonably flat. Otherwise, the "signal" in the data will cause s to grossly overestimate the true value of σ. Instead, consider estimating σ in the wavelet domain. Recall that each empirical wavelet coefficient $w^{(n)}_{j,k}$ has mean equal to $\theta_{j,k}$, its corresponding "true" coefficient, and variance $\sigma^2/n$, and that the coefficients are independent when an orthogonal wavelet is used. Donoho and Johnstone (1995) propose an estimate of the noise level that is based only on the empirical wavelet coefficients at the highest level of resolution. The reason for considering only the top level of coefficients is that these tend to consist mostly of noise; often, all but the finest details in the function are accounted for in coefficients at lower levels. Since there typically is some signal present even at the highest level, Donoho and Johnstone propose a robust estimator of scale, the median absolute deviation (MAD):
$$ \hat\sigma \;=\; \frac{\mathrm{median}\!\left(\,\left| w^{(n)}_{J-1,k} - \mathrm{median}\!\left(w^{(n)}_{J-1,k}\right) \right|\,\right)}{0.6745}, $$
where $J = \log_2(n)$. Since the "true" coefficients $\theta_{J-1,k}$, $k = 0, \ldots, 2^{J-1}-1$, are thought to be mostly zero, it would be reasonable to replace $\mathrm{median}(w^{(n)}_{J-1,k})$ above with zero.
This MAD estimator has become very common in practice. Other possibilities would be to take the usual standard deviation of the top-level coefficients, but the signal still contained in the coefficients will tend to make the estimate upwardly biased.
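Putting the pieces together, the following sketch (using the PyWavelets package; the 'db8' wavelet, the test signal, and the choice of the soft rule are illustrative assumptions) estimates σ by the MAD of the finest-level coefficients and then applies the universal "VisuShrink" threshold.

```python
# MAD noise estimate from the finest detail level plus universal soft
# thresholding; a sketch with PyWavelets, not the book's own code.
import numpy as np
import pywt

rng = np.random.default_rng(2)
n, sigma = 2048, 0.4
t = np.arange(1, n + 1) / n
f = np.piecewise(t, [t < 0.25, (t >= 0.25) & (t < 0.75), t >= 0.75], [0.0, 2.0, 0.5])
y = f + sigma * rng.standard_normal(n)

coeffs = pywt.wavedec(y, "db8", mode="periodization")
finest = coeffs[-1]                                # level J-1 detail coefficients
sigma_hat = np.median(np.abs(finest - np.median(finest))) / 0.6745
lam = sigma_hat * np.sqrt(2 * np.log(n))           # universal threshold

denoised = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
f_hat = pywt.waverec(denoised, "db8", mode="periodization")
print("sigma_hat =", round(sigma_hat, 3), " true sigma =", sigma)
```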
7.2 More Density Estimation
The wavelet density estimators discussed in Section 3.1 are linear estimators, as they are linear functions of the empirical coefficients. Just as in nonparametric regression we moved from linear estimators in Section 3.2 to non-linear estimators earlier in this chapter, we now consider applying a nonlinear thresholding operation to the empirical coefficients in density estimation as well. A theoretical comparison of linear and nonlinear wavelet estimators is made by Donoho, et al. (1993) and Johnstone, Kerkyacharian, and Picard (1992), showing that nonlinear methods are always superior in terms of the rate of convergence. The basic idea here is the same as it was in Section 7.1: The unknown density function f is assumed to be square-integrable, and the representation of f in the wavelet domain is assumed to be sparse (only a few coefficients are needed for a good reconstruction of f). Thus, in reconstructing the function, only the large coefficients $d_{j,k}$ are needed, so it would make sense to apply the thresholding operators from Section 7.1 to the empirical coefficients $\hat d_{j,k}$ as well:
$$ \hat f_J(x) \;=\; \sum_{k} \hat c_{j_0,k}\,\phi_{j_0,k}(x) \;+\; \sum_{j'=j_0+1}^{J-1}\sum_{k} \delta_\lambda\!\left(\hat d_{j',k}\right)\psi_{j',k}(x), $$
where $\delta_\lambda$ represents either the hard or the soft thresholding function (7.6) or (7.7). The question remains of choosing the threshold λ. Though on the surface this threshold selection problem has much in common with the corresponding thresholding problem in nonparametric regression, the two are quite different in nature. Though there is some common ground, the problems should be considered separately, with threshold selection procedures only applicable for use in the problem for which they are designed. The global thresholding results of Section 7.1 are based upon each empirical coefficient $w^{(n)}_{j,k}$ having an independent normal distribution with mean equal to the corresponding "true" coefficient $\theta_{j,k}$ and variance $\sigma^2/n$, which results from applying the signal-plus-noise model to the data: $Y_i \sim N(f(i/n), \sigma^2)$, $i = 1, \ldots, n$. This distributional assumption is, of course, not satisfied for the empirical wavelet coefficients computed according to (3.6) and (3.7), so the thresholding problem must be reevaluated in this context.
In addition to choosing a threshold λ, another issue to be considered in wavelet density estimation is the choice of J, the highest level of resolution to be considered. With regression data, the discrete wavelet transform (DWT) algorithm yields coefficients at resolution only as high as level $J - 1 = \log_2(n) - 1$, so a natural upper limit is in place. One approach to selection of the threshold and the maximum level of resolution is proposed by Donoho, et al. (1995). They suggest choosing $J = [\log_2 n] - 1$ and applying the threshold
$$ \lambda \;=\; \frac{2C\log n}{\sqrt{n}} $$
to the empirical coefficients, where $C = \sup_{j,k}\sup_x 2^{-j/2}\,|\psi_{j,k}(x)|$. Another method is due to Donoho, et al. (1993). They propose using $J = [\log_2 n - \log_2(\log n)]$ as the maximum level of resolution to be considered, and a level-dependent threshold $\lambda_j = K\sqrt{j/n}$ for a suitably chosen constant K. Level-dependent thresholding will be treated at length for nonparametric regression in Chapter 8. It is possible to modify the density estimation problem to more closely coincide with the Section 7.1 results by binning the data. This approach is also described by Donoho (1993), with the theoretical arguments put forth in Donoho, et al. (1993). Assume that the data $X_1, \ldots, X_n$ are a random sample from a density f on (0, 1]. The unit interval is partitioned into $M = 2^{[\log_2 n]-2}$ equally spaced intervals (actually, the number of intervals should be about n/4; taking M to be a power of two simplifies the computation), and $N_i$ is the number of observations falling into interval i, $i = 1, \ldots, M$. Setting each $Y_i$ to a suitably normalized square root of the bin count $N_i$ gives that the $Y_i$'s are approximately (for large n) independent, with an approximate normal distribution centered at the square root of the density at the corresponding bin
(see Donoho, et al. (1993)). By making this approximation, we can estimate the square root of the density by applying results from Section 7.1, then square the result to get a final estimate for f.
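A sketch of this binned "root-unroot" route is given below. The particular normalization $Y_i = \sqrt{M N_i / n}$ (so that $Y_i$ is roughly $\sqrt{f}$ at the bin centres), the 'db4' wavelet, and the MAD-plus-universal threshold are assumptions of the sketch rather than the exact recipe of the references.

```python
# Binned wavelet density estimation: bin the sample, take a normalized square
# root of the counts, shrink in the wavelet domain, square back, renormalize.
import numpy as np
import pywt

rng = np.random.default_rng(3)
n = 4096
x = rng.beta(2.0, 5.0, size=n)                    # sample from a density on (0, 1]

M = 2 ** (int(np.log2(n)) - 2)                    # about n/4 bins, a power of two
counts, edges = np.histogram(x, bins=M, range=(0.0, 1.0))
y = np.sqrt(M * counts / n)                       # roughly sqrt(f) at bin centres

coeffs = pywt.wavedec(y, "db4", mode="periodization")
sigma_hat = np.median(np.abs(coeffs[-1] - np.median(coeffs[-1]))) / 0.6745
lam = sigma_hat * np.sqrt(2 * np.log(M))
shrunk = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
root_hat = pywt.waverec(shrunk, "db4", mode="periodization")

f_hat = np.maximum(root_hat, 0.0) ** 2            # square back to the density scale
f_hat /= np.trapz(f_hat, (edges[:-1] + edges[1:]) / 2)   # integrate to one
```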
7.3 Spectral Density Estimation
Another basic area of statistical estimation is spectral density estimation in time series analysis. Here, we will give a brief description of the problem, then describe the applications of wavelets to this problem. Good references
on general time series analysis include Anderson (1971), Priestley (1981), and Wei (1990). The model for this section involves data $Y_1, \ldots, Y_n$, which represent univariate observations that are equally spaced over time (or space). This type of data differs from the nonparametric regression data considered in Section 7.1 in that it is assumed here that the data have a constant mean, and the earlier assumption of independence is dropped. Thus, we consider here only weakly stationary time series, which means that $E[Y_i] = \mu$ for all i and that there exists a function $R(\ell)$ such that $\mathrm{Cov}(Y_i, Y_j) = R(|j - i|)$. That is, stationarity requires that the covariance of any pair of observations depends only on the time between the observations. The function R is known as the autocovariance function (or, simply, the covariance function) of the time series. Note that $R(\ell) = R(-\ell)$ for all $\ell \in \mathbb{Z}$. Often, we focus attention instead on the autocorrelation function of the time series, defined as
$$ \rho(\ell) \;=\; \mathrm{Corr}(Y_i, Y_{i+\ell}). $$
This can be computed from the covariance function by
$$ \rho(\ell) \;=\; \frac{R(\ell)}{R(0)}. $$
Note that R(0) is simply the variance of a single observation and also that $\rho(0) = 1$.
In time series analysis, the primary interest is often to study the periodic behavior of the data. For instance, economic time series often show significant seasonal (yearly) cycles. The Fourier transform has thus become a common tool in time series analysis. The frequency content of a time series can be analyzed through the spectral density function, which results from regarding the covariance function values $\ldots, R(-1), R(0), R(1), \ldots$ as a set of Fourier coefficients. The Fourier representation (4.21) gives an expression for the spectral density function:
$$
\begin{aligned}
f(\omega) &= \sum_{j=-\infty}^{\infty} R(j)\, e^{ij\omega} \\
          &= \sum_{j=-\infty}^{\infty} R(j)\cos(j\omega) \;+\; i\sum_{j=-\infty}^{\infty} R(j)\sin(j\omega) \\
          &= \sum_{j=-\infty}^{\infty} R(j)\cos(j\omega) \\
          &= R(0) + 2\sum_{j=1}^{\infty} R(j)\cos(j\omega),
\end{aligned} \tag{7.9}
$$
the imaginary term dropping out since $R(j) = R(-j)$ and $\sin(j\omega) = -\sin(-j\omega)$ for all $j \in \mathbb{Z}$. Note that $f(\omega)$ is defined for $\omega \in \mathbb{R}$, but that it is periodic with period $2\pi$, so we need only consider f on, say, the interval $[-\pi, \pi]$. Note further that f is symmetric about zero: $f(\omega) = f(-\omega)$. Therefore, the spectral density is usually only considered on the interval $[0, \pi]$. The argument ω indicates the frequency value, thus the spectral density at any value ω analyzes the frequency content of the time series at frequency ω. Note that if the data are independent, then $R(\ell) = 0$ for $\ell \neq 0$, so the spectral density is flat: $f(\omega) \equiv \sigma^2$. Very wiggly time series are characterized by mostly high-frequency components, which is manifested by a spectrum that is small for small ω and large for large ω. Conversely, a very smooth time series will contain an abundance of low-frequency components and an absence of high-frequency content. Periodicities manifest themselves as one or more spikes in the spectral density. The locations of the spikes give information about the periodicities present in the time series. Four typical spectral density functions are plotted in Figure 7.6. The upper left-hand plot corresponds to a pure white noise (uncorrelated) time series; the others correspond to an autoregressive process. Briefly, a time series is autoregressive with order p, denoted AR(p), if it follows the model
$$ Y_t \;=\; r_1 Y_{t-1} + r_2 Y_{t-2} + \cdots + r_p Y_{t-p} + \epsilon_t, $$
in which the $\epsilon_t$'s are independent $N(0, \sigma^2)$ random variables. Wei (1990) gives expressions for the spectral density associated with general AR processes. The next two spectral densities in Figure 7.6 correspond respectively to an AR(1) process with positive $r_1$ (an abundance of low-frequency content) and an AR(1) process with negative $r_1$ (an abundance of high frequency). The last process corresponds to a seasonal time series with period 12. The spikes correspond to the seasonal harmonic frequencies $\omega = 2\pi j/12$ for $j = 1, 2, 3, 4, 5, 6$. Corresponding to Figure 7.6 are sets of simulated time series data, plotted in Figure 7.5. The high-frequency and low-frequency content of the respective AR(1) processes are somewhat evident from the time series plots. The seasonality of the last time series is somewhat obscured by the noise. The spectral density $f(\omega)$ represents the complete theoretical information about a time series. In statistical practice, this must be estimated from the data. Given an estimate of the covariance function, a raw estimator of $f(\omega)$
Figure 7.5: Four simulated time series corresponding to the spectral densities in Figure 7.6: pure white noise, AR(1) with $r_1 = 0.5$, AR(1) with $r_1 = -0.5$, seasonal time series with period 12.
can be obtained by plugging $\hat R(j)$ in place of $R(j)$ in (7.9). A natural way to estimate $R(j) = \mathrm{Cov}(Y_i, Y_{i+j})$ is simply
$$ \hat R(j) \;=\; \frac{1}{n}\sum_{i=1}^{n-j} (Y_i - \bar Y)(Y_{i+j} - \bar Y), \qquad j = 0, 1, \ldots, n-1. \tag{7.10} $$
Defining $\hat R(j)$ this way, the estimate of the autocorrelation function $\rho(j)$ is
$$ \hat\rho(j) \;=\; \frac{\hat R(j)}{\hat R(0)}. $$
Note that this is essentially the standard sample correlation coefficient computed on $(Y_1, Y_{1+j}), (Y_2, Y_{2+j}), \ldots, (Y_{n-j}, Y_n)$. Also, note that $\hat R(0) = \hat\sigma^2$ is the usual maximum likelihood estimator of the variance under the AR model. A plot of $\hat\rho(j)$ vs. j is known as the correlogram of the data. The sample spectral density function is obtained by plugging the estimate (7.10) of $R(j)$ into the definition of $f(\omega)$ in (7.9):
$$ \hat f(\omega) \;=\; \hat R(0) + 2\sum_{j=1}^{n-1} \hat R(j)\cos(j\omega). $$
An alternative form of the sample spectrum is given in terms of the discrete
Fourier transform of the original data (see Bloomfield, 1976):
$$ \hat f(\omega) \;=\; \frac{1}{n}\left| \sum_{\ell=1}^{n} Y_\ell\, e^{-i(\ell-1)\omega} \right|^{2}. $$
The sample spectral density function is typically computed only at the "natural frequencies" $\omega_j = 2\pi j/n$, $j = 0, \ldots, [n/2]$. A plot of $\hat f(\omega_j)$ vs. $\omega_j$ is known as the periodogram. The periodogram is, first of all, a raw estimate of the spectral density. Its initial purpose was to search for "hidden periodicities" in the data: if there is a (deterministic) periodic component in the data, perhaps obscured with noise, this will be manifested in the form of a sharp spike in the true spectral density at the frequency of the periodic component. It would be hoped that this spike would also appear in the periodogram. The periodograms that correspond to the example time series plotted in Figure 7.5 are plotted in Figure 7.7. The periodogram by itself is a useful and interesting diagnostic tool for time series data analysis. By comparing Figure 7.7 with Figure 7.6, it is possible to see a correspondence, but the very wiggly nature of the periodogram renders it unfit for estimation of the true spectral density function, which is typically assumed to be mostly smooth, possibly with some sharp spikes. Priestley (1981) calls the periodogram "an extremely poor (if not useless) estimate of the spectral density function." It is not consistent (as n gets large), and its peculiar covariance structure ensures that the periodogram will have a wildly erratic behavior. In order to get a more appropriate estimate of the spectral density, many methods have been proposed to "smooth" the periodogram. One method, from Parzen (1974), involves fitting a parametric autoregressive model to the data and estimating the parameters. The resulting estimate is just the spectral density for an AR process, with the estimated parameter values plugged in. By applying the general methods described in Chapter 2, model-free estimates of the spectrum can be obtained. In particular, a kernel function K can be applied to smooth $\hat f(\omega)$ by averaging over neighboring frequencies just as was done for smoothing regression functions in (2.17):
$$ \tilde f(\omega) \;=\; \int_0^{\pi} \frac{1}{\pi\lambda}\, K\!\left(\frac{\omega - \tau}{\lambda}\right) \hat f(\tau)\, d\tau. $$
The kernel K in the above expression is known as a spectral window, since the averaging is being done in the frequency domain. This smoothed spectral
Figure 7.6: Spectral densities for four different time series: pure white noise, AR(1) with $r_1 = 0.5$, AR(1) with $r_1 = -0.5$, seasonal time series with period 12.
density estimator can be translated back to the time domain through Parseval's identity, giving an alternative representation that averages over adjacent values of the estimated autocovariance function:
$$ \tilde f(\omega) \;=\; \sum_{j=-(n-1)}^{n-1} W(j)\, \hat R(j)\, \cos(j\omega). $$
Here, the function W is called the lag window since it controls the averaging over various lags of the covariance function. (This dual representation in both the time and the frequency domains is perhaps reminiscent of the discussion in Section 4.3.) The lag window and the spectral window are in fact a Fourier pair: the spectral window is the Fourier transform of the lag window, and the lag window is the inverse Fourier transform of the spectral window. Wei (1990) gives several examples of lag-spectral window pairs. Since the spectral density is typically thought to be "mostly smooth," with the possible exception of one or more very sharp spikes, it is natural to desire a spatially adaptive procedure such as that offered by wavelet shrinkage. The application is not straightforward, however. Wahba (1980) suggests taking the log of the periodogram to stabilize the variance. Smoothing the log periodogram by wavelet shrinkage and then exponentiating to get an estimate of the spectral density is the approach taken by Moulin (1993a, b) and Gao (1993).
Figure 7.7: Periodograms for the simulated time series plotted in Figure 7.5: pure white noise, AR(1) with $r_1 = 0.5$, AR(1) with $r_1 = -0.5$, seasonal time series with period 12.

Even with the Wahba log transformation, results for wavelet shrinkage of additive Gaussian noise discussed in Section 7.1 cannot be applied here. Gao (1993) applies results from Taniguchi (1979, 1980) and Wahba (1980) to obtain two approximate distributions for the resulting wavelet coefficients. For "coarse" coefficients (small j), the empirical coefficient of the log-periodogram is approximately normally distributed about
$\theta_{j,k}$, where $\theta_{j,k}$ represents the "true" wavelet coefficient of the log-spectrum. For "fine" coefficients (large j), the Gaussian approximation does not fit well, so an approximation based on weighted sums of observations with $\chi^2$ noise is used instead. Gao (1993) considers these cases separately and derives thresholds that correspond to the universal VisuShrink method of Donoho and Johnstone (1994). The threshold for wavelet level j, for $j = 0, \ldots, J-2$, is thus
$$ \lambda_j \;=\; \max\!\left( \pi\sqrt{\log(n)/3},\ \ \frac{\log(2n)}{2^{(J-j-3)/4}} \right), \tag{7.11} $$
Figure 7.8: The wavelet spectral density estimates for the simulated time series plotted in Figure 7.5: pure white noise, AR(1) with $r_1 = 0.5$, AR(1) with $r_1 = -0.5$, seasonal time series with period 12.
where $J = \log_2 n$. (Note that since there are only n/2 points at which the sample spectral density is computed, the highest wavelet level computed by the usual DWT algorithm is $J - 2$.) The first element in the max function in (7.11) will be used for coarse levels, and the second element will be used for large values of j. Figure 7.8 plots the estimated spectral densities for the simulated data examples used in this section.
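A sketch of the log-periodogram route is given below: compute the periodogram at the natural frequencies, take logs, shrink with a wavelet threshold, and exponentiate. The AR(1) simulation, the 'db5' wavelet, and the use of the single threshold $\pi\sqrt{\log(m)/3}$ at every level (in place of the level-dependent rule (7.11)) are simplifying assumptions; this is not the exact Gao (1993) procedure.

```python
# Log-periodogram wavelet smoothing: log transform, shrink, exponentiate.
# Assumptions: simulated AR(1) data, 'db5' wavelet, a single threshold.
import numpy as np
import pywt

rng = np.random.default_rng(4)
n, r = 1024, 0.5
eps = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):                              # simulate an AR(1) series
    y[t] = r * y[t - 1] + eps[t]

# periodogram at the natural frequencies 2*pi*j/n, j = 1, ..., n/2
dft = np.fft.rfft(y)
pgram = (np.abs(dft) ** 2 / n)[1: n // 2 + 1]
m = len(pgram)                                     # n/2 ordinates, a power of two
logp = np.log(pgram)

coeffs = pywt.wavedec(logp, "db5", mode="periodization")
lam = np.pi * np.sqrt(np.log(m) / 3.0)             # log-periodogram noise sd is about pi/sqrt(6)
shrunk = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
f_hat = np.exp(pywt.waverec(shrunk, "db5", mode="periodization"))
```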
7.4 Detections of Jumps and Cusps

As mentioned before, wavelets are most useful in function estimation when usual smoothness conditions required of kernel and other standard methods are not satisfied; wavelet methods are appropriate even when the underlying function exhibits sharp spikes and jumps. In many applications, the primary interest lies in estimating the locations and sizes of these jumps, with only secondary interest in the actual recovery of the function. This is the primary focus of the general change-point problem. The change-point problem has been considered as early as the mid-1950's (Page, 1954, 1955) and has become gradually more popular since then, as researchers discover new methods and applications for such a problem. The
standard single change-point problem for observed data Y1 , ... , Yn deals with testing the hypotheses
$$ H_0:\ E[Y_1] = \cdots = E[Y_n] $$
$$ H_a:\ E[Y_1] = \cdots = E[Y_k] \;\neq\; E[Y_{k+1}] = \cdots = E[Y_n], $$
for unknown $k \in \{1, \ldots, n-1\}$. This is the usual set of hypotheses in classical quality control, where interest is in detecting whether a manufacturing process has gone "out of control" at an unknown time point k. In addition to the testing problem described above, it is often of interest to estimate the location and size of possible jump points. Parametric methods assume models for the problem, and then adapt standard maximum likelihood or Bayesian methods for testing and estimating change-points. A variety of nonparametric methods have been proposed as well (see Csörgő and Horváth (1988) for a good review of such methods). Many of these methods involve taking a cumulative sum (CUSUM) of some transformation of the data points; test statistics and estimators are based on this process. The simple model described above can be generalized to include multiple changes, changes in scale as well as in location, smooth (rather than abrupt) changes, and many other possibilities. Lombard (1988) applies Fourier methods to the change-point problem. His method decomposes the CUSUM into its Fourier components, analyzing each component separately, along with a smoothed version of the CUSUM process. It is only natural to apply wavelet methods to the problem as well. Though there has been a great deal of effort put into applying wavelets in general function estimation, as yet, relatively little work has been done applying wavelets to the statistical change-point problem. The general change-point problem in statistics has close ties to problems in signal processing. In particular, image analysts are very interested in edge detection, locating areas of sharp contrast in digital pictures. Wavelets have been used successfully in this context for quite some time now. Many of the proposed methods that have appeared in the signal processing literature have sophisticated statistical components to them. These methods typically involve the two-dimensional wavelet transform and will not be considered in detail here. In image analysis, the primary purpose is to de-noise an image, giving a nice-looking picture. This aim is achieved in part whenever the image's edges are detected successfully. One might argue that the problem of edge detection differs fundamentally from the change-point problem, in which the interest lies less in recovering the unknown function and more in making statistical inference on the number, location, and magnitude of the jump points.
Nevertheless, methods developed for edge detection could probably be applied with success to the change-point problem (and vice versa). Of course, the statistical properties of such techniques should be investigated for their suitability in inference. Hu (1994) describes an application of the Haar wavelet basis to the general change-point problem. This approach, most suitable for a model with several change-points, involves choosing a piecewise constant regression function based on the Haar wavelet. Similar in concept to a backward deletion scheme in multiple regression, Hu's approach begins with the Haar representation of f(x) as in (6.2), recursively eliminating coefficients (not necessarily just one at a time) until the conditions of some stopping rule are met. Various stopping rules are considered, such as one based on a wavelet version of Mallows' $C_p$, and a levelwise test for white noise. Wang (1995) describes using the wavelet transform to detect abrupt jumps and cusps in an otherwise smooth function. A function is said to have a β-cusp at a point $x_0$ if there exists a constant $K > 0$ such that
$$ |f(x_0 + h) - f(x_0)| \;\ge\; K|h|^{\beta} $$
as h goes to zero from either side. (Note that the point $x_0$ corresponds to an abrupt jump if $\beta = 0$.) If the model $Y_i = f(i/n) + \epsilon_i$ is assumed, with $\epsilon_i$ iid $N(0, \sigma^2)$, then the discrete wavelet transform of the observed data is comprised of two parts: the DWT of white noise, and the DWT of the signal. Tests of hypotheses regarding the existence of cusps (and/or jumps) are based upon the fact that the wavelet transformation of the observed data is dominated by the noise portion when f is smooth, but characterized by large coefficients at points of irregularity. Wang's approach uses the discrete wavelet transform of a data set, focusing on the finest level of coefficients. A large coefficient in that level (greater than, say, the universal threshold of Donoho and Johnstone (1994)) indicates the presence of a cusp. Wang gives convergence results on the estimation of such jumps. Richwine (1996) takes a Bayesian approach for estimating the location of change-points by placing a prior distribution on a change-point τ on the interval [0, 1) and examining the posterior distribution of τ given the empirical wavelet coefficients. Richwine's procedure is robust against model misspecification and represents an important diagnostic tool in a variety of applications. By considering smoother wavelets and removing the smooth trend portions of the data, this method is extended in Ogden and Richwine (1996). Clearly, there is room for more work to be done in this area. Initial results indicate that wavelets provide useful tools for detecting and estimating change-points.
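The following sketch works in the spirit of Wang's finest-level rule: estimate σ from the finest-level coefficients by the MAD, flag coefficients exceeding the universal threshold, and translate their indices into approximate locations. The Haar wavelet, the simulated step function, and the size of the jumps are assumptions of the sketch (a jump that falls exactly between two finest-level Haar pairs shows up at the next coarser level instead).

```python
# Flagging jumps from the finest-level wavelet coefficients; an illustrative
# sketch, not Wang's (1995) exact procedure.
import numpy as np
import pywt

rng = np.random.default_rng(5)
n, sigma = 1024, 0.3
t = np.arange(1, n + 1) / n
f = 3.0 * (t > 0.3) - 4.0 * (t > 0.8)              # jumps at 0.3 and 0.8
y = f + sigma * rng.standard_normal(n)

_, d1 = pywt.dwt(y, "haar", mode="periodization")  # finest-level detail coefficients
sigma_hat = np.median(np.abs(d1 - np.median(d1))) / 0.6745
lam = sigma_hat * np.sqrt(2 * np.log(n))           # universal threshold

flagged = np.nonzero(np.abs(d1) > lam)[0]
locations = (2 * flagged + 1) / n                  # approximate position of each flagged coefficient
print("estimated jump locations:", locations)
```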
CHAPTER EIGHT
Data Adaptive Wavelet Thresholding
Thresholding of empirical wavelet coefficients was discussed in Section 7.1, along with some examples of global thresholds. This chapter will pick up where Section 7.1 left off, focusing on the nonparametric regression situation. The ideas described here could also be adapted for use in density estimation or other types of function estimation. To focus attention on the methods described in this chapter, it will be assumed throughout that an orthogonal wavelet transform on the unit interval is used, and that the sample size is a power of two: $n = 2^J$ for some integer $J > 0$. When this condition is not met, the methods described herein may be adapted, using techniques described in Chapter 6. To reiterate the problem studied in Section 7.1, suppose we are given data $Y_1, \ldots, Y_n$ which we regard to be noisy observations of a function, equally spaced on the interval [0, 1]: $Y_i = f(i/n) + \epsilon_i$, where $\epsilon_1, \ldots, \epsilon_n$ are iid normal random variables with mean zero. Taking the discrete wavelet transform of the data gives $n - 1$ wavelet coefficients, $\{w_{j,k},\ j = 0, \ldots, J-1,\ k = 0, \ldots, 2^j - 1\}$, along with the coarsest-level scaling function coefficient, which will be labeled $w_{-1,0}$, giving n coefficients in all. Let $\{\theta_{j,k},\ j = -1, \ldots, J-1,\ k = 0, \ldots, 2^j - 1\}$ represent the "true" coefficients that would result from applying the same transform to the expected values of $Y_1, \ldots, Y_n$. By the arguments given in Section 7.1, we assume that the majority of these $\theta_{j,k}$'s are essentially zero, so our first task is to determine which coefficients to include in the reconstruction, our second task being the estimation of each included $\theta_{j,k}$. We will consider here the soft thresholding operator (7.7) (though the methods discussed could be applied with any reasonable thresholding function), and introduce some data-dependent schemes for choosing the threshold λ. From the description of the global thresholding schemes in Section 7.1, recall that all coefficients above a certain resolution level were shrunk according to a single value of λ, with the lower-level coefficients being included "as
is." If we consider that the threshold might change depending on the level of resolution, these thresholding schemes might be denoted
$$ \lambda_j \;=\; \begin{cases} \lambda, & j \ge j_0, \\ 0, & j < j_0, \end{cases} $$
for some choices of the global threshold λ and the cut-off level $j_0$. Since the coefficients are naturally indexed by a resolution index j, which controls the frequency content, it is only natural to group the coefficients by level and treat each level separately. (See Johnstone and Silverman (1995) and Donoho and Johnstone (1992) for further discussion.) In this chapter, we will explore this more general thresholding idea and consider several data-dependent schemes for selecting the threshold at each level. The choice of threshold is a very fundamental one in wavelet smoothing, just as is the choice of bandwidth in kernel smoothing. A vast volume of literature has been devoted to this second issue, and the first has received a fair amount of attention as well. Nason (1995) gives a brief overview of some data-dependent threshold selection methods, focusing mainly on cross-validation techniques. To further simplify the presentation of these methods, it will be assumed that $\sigma^2$, the variance of the original data values, is known, and, without further loss of generality, that the data are suitably normalized to have variance one. In practice, it is rarely the case that σ is known, so the data set is typically normalized by dividing by $\hat\sigma$ as described in Section 7.1.
8.1 SURE Thresholding
Donoho and Johnstone (1995) introduce a scheme that uses the wavelet coefficients at each wavelet level j to choose a threshold $\lambda_j$ with which to shrink the coefficients at that level. This is perhaps the most popular data-dependent threshold selection procedure, so we examine its development in some detail and include examples. The basic idea behind this Donoho-Johnstone scheme is to find an estimator $\hat f$ for f that will have small $L^2$ risk:
$$ R(\hat f, f) \;=\; E\!\left[ \frac{1}{n}\sum_{i=1}^{n} \left( \hat f(i/n) - f(i/n) \right)^{2} \right]. $$
This quantity can be expressed in terms of the wavelet coefficients by the wavelet version of Parseval's identity given in (4.32): Since the transform is orthogonal,
$$ R(\hat f, f) \;\propto\; E\!\left[ \sum_{j}\sum_{k} \left( \hat\theta_{j,k} - \theta_{j,k} \right)^{2} \right]. $$
The significance of this is that it is possible to transform the original data into its wavelet coefficients, and then attempt to minimize risk in the wavelet domain; doing so will automatically minimize risk in the original domain. In practical situations, the risk $R(\hat f, f)$ must be estimated from the data. This method employs an unbiased estimate of risk that is due to Stein (1981), the acronym SURE being derived from Stein's Unbiased Risk Estimate. Stein's result has been applied to choose the smoothing parameter value data-adaptively in other nonparametric situations as well (see Li and Hwang (1984) and Li (1985)). As we will see, this result works well in wavelet threshold selection as well. Minimization of estimated risk is done by choosing a threshold value for each wavelet level. This method is illustrated by considering the following equivalent problem. Suppose $X_1, \ldots, X_d$ are independent observations with $X_k \sim N(\mu_k, 1)$. The problem is to estimate the mean vector $\mu = (\mu_1, \ldots, \mu_d)'$ with minimum risk. This represents estimation of true wavelet coefficients at any level j, with $X_k = \sqrt{n}\, w^{(n)}_{j,k}$ and $d = 2^j$. Since it is thought that most of the coefficients are zero, the estimator will be the soft thresholding function (7.7): $\hat\mu_k = \delta^S_\lambda(X_k)$. Denoting the vector of observations X (no subscripts) and letting $\hat\mu$ represent the resulting estimator of μ, the result of Stein (1981) states that the $L^2$ loss can be estimated unbiasedly for any estimator of μ that can be written $\hat\mu(X) = X + g(X)$, where the function $g: \mathbb{R}^d \to \mathbb{R}^d$ is weakly differentiable:
$$ E_\mu \left\| \hat\mu(X) - \mu \right\|^{2} \;=\; d + E_\mu \left\{ \|g(X)\|^{2} + 2\,\nabla\!\cdot g(X) \right\}, $$
where $\nabla\!\cdot g = \sum_{k=1}^{d} \frac{\partial}{\partial x_k} g_k(X)$, defining $g = (g_1, \ldots, g_d)$. Using the soft thresholding function gives that
$$ g_k(X) \;=\; \begin{cases} -\lambda, & X_k > \lambda, \\ -X_k, & |X_k| \le \lambda, \\ \lambda, & X_k < -\lambda, \end{cases} $$
so $\|g(X)\|^{2} = \sum_{k=1}^{d} \min(|X_k|, \lambda)^{2}$. Note also that $\nabla\!\cdot g = -\sum_{k=1}^{d} 1_{[-\lambda,\lambda]}(X_k)$, so that Stein's estimate of risk applied to this situation can be written for any set of observed data $x = (x_1, \ldots, x_d)'$:
$$
\begin{aligned}
\mathrm{SURE}(\lambda;\, x) \;&=\; d \;-\; 2\cdot\#\{k: |x_k| \le \lambda\} \;+\; \sum_{k=1}^{d} \min(|x_k|, \lambda)^{2} \\
&=\; -d \;+\; 2\cdot\#\{k: |x_k| > \lambda\} \;+\; \sum_{k=1}^{d} \min(|x_k|, \lambda)^{2},
\end{aligned} \tag{8.1}
$$
where $\#S$ for a set S denotes the cardinality of the set. Here, $E_\mu\|\hat\mu^{(\lambda)}(X) - \mu\|^{2} = E_\mu\, \mathrm{SURE}(\lambda; X)$. The threshold level is set so as to minimize the estimate of risk for given data $x_1, \ldots, x_d$:
$$ \lambda \;=\; \arg\min_{t \ge 0}\ \mathrm{SURE}(t;\, x). $$
Such a method can reasonably be expected to do well in terms of minimizing risk, since for large sample sizes the Law of Large Numbers will guarantee that the SURE criterion is close to the true risk. The SURE criterion is written in the form (8.1) to show its relation to Akaike's Information Criterion (AIC), introduced by Akaike (1973) for time series modeling: It consists of a function to be minimized ($\sum_{k=1}^{d}\min(|x_k|, \lambda)^{2}$) and a penalty term consisting of twice the number of estimated parameters included in the reconstruction (only the observations with $|x_k| > \lambda$ will be nonzero after the shrinking). The computational effort involved with minimizing the SURE criterion is light: if the observations are re-ordered in order of increasing $|x_k|$, then the criterion function SURE(t; x) is strictly increasing between adjacent values of the $|x_k|$'s. It is also strictly increasing between 0 and the smallest $|x_k|$, as well as for $t > \max_k |x_k|$, so the minimum must occur at 0 or at one of the $|x_k|$'s. Thus, the criterion must only be computed for $d + 1$ values of t, and, in practice, there is no need to order the $|x_k|$'s. Figure 8.1 illustrates this method in action. This figure displays plots of $\sqrt{n}\,|w^{(n)}_{j,k}|$ for levels 10, 9, and 8 for the blocky function shown in Figure 5.10, normalized to have signal-to-noise ratio 5 with n = 2048. Signal-to-noise ratio (SNRatio) for a set of means $\mu_1, \ldots, \mu_d$ with additive noise is defined to be the ratio of the standard deviation of the mean vector to the standard deviation of the noise. In the first column of plots, the absolute values of $\sqrt{n}$ times the coefficients are plotted in increasing order. In the second column, the SURE criterion is plotted as a function of t, evaluated for each $t = \sqrt{n}\,|w^{(n)}_{j,k}|$ at the current level. The dashed line in the first column of plots indicates the value of the threshold selected by the SURE criterion; all points below this line will be shrunk to zero, and all points above will be shrunk toward zero by that amount.
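The criterion (8.1) and its minimization can be coded directly; the sketch below is an illustrative implementation that assumes the coefficients have already been scaled to unit noise variance, evaluates SURE at t = 0 and at each $|x_k|$, and returns the minimizer.

```python
# Stein's unbiased risk estimate (8.1) and its minimizing threshold;
# a direct implementation sketch for coefficients with unit noise variance.
import numpy as np

def sure(t, x):
    """SURE(t; x) = d - 2*#{|x_k| <= t} + sum(min(|x_k|, t)^2)."""
    ax = np.abs(x)
    return len(x) - 2.0 * np.sum(ax <= t) + np.sum(np.minimum(ax, t) ** 2)

def sure_threshold(x):
    """Minimize SURE over t in {0} union {|x_k|}; the minimum occurs at one of these."""
    candidates = np.concatenate(([0.0], np.abs(x)))
    values = np.array([sure(t, x) for t in candidates])
    return candidates[np.argmin(values)]

rng = np.random.default_rng(6)
d = 512
mu = np.zeros(d)
mu[:8] = 5.0                                       # a few large means, the rest zero
x = mu + rng.standard_normal(d)                    # X_k ~ N(mu_k, 1)
print("SURE-selected threshold:", round(sure_threshold(x), 3))
```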
Figure 8.1: Plots of wavelet coefficients and the SURE function for levels 8-10, from the blocky function with SNRatio = 5 and n = 2048.
The global thresholding procedures discussed in Chapter 7 applied thresholding only to higher-level coefficients, preferring to leave the lower-level coefficients (which would correspond to "macro" features of the function) intact. The data adaptive scheme described in this section is often applied to the coefficients at all levels, allowing the coefficients themselves to determine if any shrinking is needed. Indeed, for many examples, the SURE criterion chooses 0 as the best threshold for the low-level coefficients. Looking at the plots on the left-hand side of Figure 8.1, a data analyst might point out that each level of wavelet coefficients consists of a few "large" coefficients and many "small" ones, and that the cut-off point may be a bit low, in the sense that there are quite a few seemingly "small" coefficients that will be included in the reconstruction. A related observation might result from look-
ing at the right-hand side of the plots and noticing that (especially for j = 10) the SURE(t; x) function is relatively flat near the place where it achieves its minimum. This would indicate that there is a wide range of possible thresholds, the choice of which would make relatively little difference quantitatively (in terms of estimated risk), but may have a significant difference qualitatively (in terms of the relative smoothness of the resulting estimator). This apparent problem is also addressed by Donoho and Johnstone (1995), noting that the SURE method does not perform well in cases where the wavelet representation at any level is very sparse, i.e., when the vast majority of coefficients are (essentially) zero. This is due to the noise from the essentially zero coefficients overwhelming what little signal is contributed from the nonzero coefficients. Thus, Donoho and Johnstone suggest a hybrid scheme to get around this issue. The heuristic idea behind this hybrid method is to test the coefficients for sparsity at each level. If the set of coefficients is judged to be sparsely represented, then the hybrid scheme defaults to the universal threshold $\sqrt{2\log d}$; otherwise the SURE criterion is used to select a threshold value. The criterion used is related to the usual sample variance of the data if the true mean were known to be zero:
$$ s_d^{2} \;=\; \frac{1}{d}\sum_{k=1}^{d} x_k^{2}. $$
The representation at the current level is judged to be sparse if
$$ s_d^{2} \;\le\; 1 + \frac{(\log_2 d)^{3/2}}{\sqrt{d}}\,; $$
otherwise, the threshold is selected by SURE. Originally, to aid in proving the relevant theorems, the hybrid method proposed by Donoho and Johnstone (1995) broke the data $x_1, \ldots, x_d$ randomly into two subsets of equal size, and each half-sample was used to choose a threshold for the other half-sample. Donoho and Johnstone made a note in the manuscript proofing stage that this subsampling is unnecessary, that the theory holds when the threshold is chosen from all coefficients at the current level. The result of applying this hybrid scheme is to get around the problem noted previously of allowing too many coefficients in the reconstruction and thereby producing an estimate that is far too noisy. When there are only a very few non-zero coefficients at a particular level, the scheme detects this and applies the universal threshold, giving a much less noisy-looking reconstruction and simultaneously maintaining good MSE performance.
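The hybrid switch is a one-line test on top of the SURE routine; the sketch below assumes the same unit-variance scaling and reuses the hypothetical sure_threshold function from the previous example.

```python
# Hybrid SURE: fall back to the universal threshold when a level looks sparse.
# Assumes unit noise variance and reuses sure_threshold() from the sketch above.
import numpy as np

def hybrid_sure_threshold(x):
    d = len(x)
    s2 = np.sum(x ** 2) / d                        # sample variance about a known zero mean
    if s2 <= 1.0 + (np.log2(d) ** 1.5) / np.sqrt(d):
        return np.sqrt(2.0 * np.log(d))            # sparse level: universal threshold
    return sure_threshold(x)                       # otherwise let SURE choose
```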
Figure 8.2: The blocky function with simulated data, n 2048, and SNRatio = 5, estimated by both the SURE and the hybrid SURE methods
Examples of the SURE-based thresholding procedures are shown in Figure 8.2 and Figure 8.3 for two example functions: the piecewise constant blocky function (with n = 2048 and a signal-to-noise ratio of 5) and a sine function (with n = 512 and a signal-to-noise ratio of 2). The differences between the regular SURE estimator and the hybrid are clearly demonstrated in the second example: Since the function is smooth, the true coefficients at higher levels of resolution are almost entirely white noise. The hybrid method recognizes this and thus shrinks most of these coefficients to zero.
8.2
Threshold Selection by Hypothesis Testing
The primary goal in data dependent threshold selection is the division of wavelet coefficients into a group of "small" coefficients (those consisting primarily of noise) and one of "large" coefficients (those containing significant signal). A reasonable way for a statistician or a data analyst to go about this is to utilize statistical tests of hypotheses, the large coefficient group consisting only of coefficients that "pass the test" of significance. This is the general approach taken by Abramovich and Benjamini (1995) and Ogden and Parzen (1996a, b).
150
THRESHOLD SELECTION BY HYPOTHESIS TESTING
C')
"
t\1
t\1
0
0
'7
~
~
"'!"
cr 0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
SURE estimator C')
C')
t\1
t\1
0
0
'7
'7
~
~
cr
cr 0.0
0.2
0.4
0.6
0.8
1.0
Figure 8.3: A sine function with simulated data, n = 512, and SNRatio estimated by both the SURE and the hybrid SURE methods
= 2,
For a data set of length n, we consider a set of n parameters {Oj,k,j = -1, ... , J - 1, k = 0, ... , 2i - 1 }, most of which are thought to be (essentially) zero. The maximum likelihood estimator for each Oj,k is the corresponding empirical coefficient and we saw in Section 7.1 that
w)1,
w)1
and that the 's are mutually independent when an orthonormal wavelet basis is used on Gaussian data. A test for the hypotheses
Ho: Bj,k Ha : Oi,k
=0 "# 0
for any fixed j and k would naturally recommend rejecting the null hypothesis if lwj,k I/ y'n > Za, where Za represents the upper-a critical point of the standard normal distribution.
Data Adaptive Wavelet Thresholding
151
The above test is certainly appropriate for any fixed choices of j and k, but it would be problematic to apply such a testing procedure to all n coefficients: If all of the null hypotheses are true (the true function f is identically zero on [0, 1]), we could expect no:. of the coefficients to be falsely declared significantly different from zero, where a is the common level of the tests. The result of this will be too many coefficients included in the reconstruction, giving an undersmoothed estimate of f. This is precisely the problem that one faces in multiple comparisons in an ANOVA setting, except that in the wavelet case, all tests are independent. One way to account for this would be to apply the standard Bonferroni correction, and adjust the level of the tests so as to control the probability that any of the zero-coefficients are included in the reconstruction. For even moderate n, however, this will be overly conservative, making it very difficult for any coefficients to be judged significant and typically resulting in an oversmoothing. While neither of these extremes are particularly useful in themselves, there are a number of ways to strike a compromise.
Recursive Testing The general approach taken by Ogden and Parzen (1996a, b) operates on a level-by-level basis, as does the SURE approach. At any particular level, a single test is performed to determine if the set of coefficients at that level behave as white noise or if there is significant signal present. If it is determined that there is signal present, the most likely candidate (the largest coefficient in absolute value) is removed from consideration, and the test is repeated. Continuing recursively, at each level one will be left with two sets of coefficients: "large" coefficients thought to contain significant signal, and a set of "small" coefficients which is indistinguishable from pure white noise. More precisely, let X 1 , ... , X d represent the empirical wavelet coefficients at level j = log2 d, as in Section 8.1. Suppose further that these coefficients have means J.Lt, ... , /-Ld respectively. Initially, interest is in testing the null hypothesis that all the means are zero vs. a general alternative that some of the J.Li 's are non-zero. Specifically, let Id represent a non-empty subset of the indices { 1, ... , d}. Then the hypotheses could be expressed as
Ho : /-LI
= ... = /-Ld = 0 (8.2)
H a : /-Li =J 0 for all i E I d; /-Li
= 0 for all i ¢ I d.
A fundamental question that must be addressed in this approach is how to test the above set of hypotheses. The approach of Ogden and Parzen (1996b) proceeds as follows: If the cardinality of the set Id is not known, the standard likelihood ratio test for
152
THRESHOLD SELECTION BY HYPOTHESIS TESTING
these hypotheses would be based on the test statistic L:f=t Xf, which has a x2 distribution with d degrees of freedom when the null hypothesis is true. Note that this is also the test statistic that would be used if it were known that I d = {1, ... , d}. This is not the most appropriate test statistic for this situation, especially because it is usually believed that very few, if any, of the /-Li 's are non-zero. The result of applying this test statistic would be poor power of detection when I d contains only a few coefficients, since the noise of the zero coefficients will tend to overwhelm the signal of the non-zero coefficients. If the cardinality of the set Id were known to be, say, m, then the standard likelihood ratio test statistic would be the sum of squares of them largest Xi's in absolute value. In practice, m is not known, so the Ogden-Parzen approach consists of a recursive testing procedure for I d containing only one element each time. Thus, the appropriate test statistic is the largest of the squared Xi's. The a-critical point of this distribution is worked out to be
(8.3)
The recursive method for choosing a threshold at each level consists of the following steps: 1. Compare the largest Xf with the critical point xd. 2. If the Xf is larger, this indicates that there is still significant signal among the coefficients. Remove the Xi with the largest absolute value from consideration, set d to d - 1, and return to Step 1. 3. If Xf < xd, then there is no strong evidence of strong signal among the (remaining) coefficients. The threshold for the current level is set equal to the largest remaining xi in absolute value. By following this algorithm, we are throwing out "large" coefficients from the data set X 1 , ... , Xd until everything left (the set of "small" coefficients) is not distinguishable from pure noise. By setting the threshold equal to the maximum absolute value of the "small" coefficients, we are ensuring that they will all be shrunk to zero, and that each "large" coefficient will be included in the reconstruction, but shrunk toward zero by the same amount. Ogden and Parzen (1996a) point out that existing thresholding techniques (including the one just described) take into account only the relative magnitudes of the wavelet coefficients, but that there is also some information to be gained (as to whether signal is present in a particular set of coefficients) by the relative position of large coefficients. Figure 8.4 illustrates this point: It is a plot of the "true" wavelet coefficients at level j = 5 from a function with jumps at 0.25 and 0.75. Notice that while there is only one "very large"
Data Adaptive Wavelet Thresholding
153
~~
.20 co >
]3
~
•·+·+··•··•··•··;;··············-~-- ...................................~---····················•··•··•··•
"(3
:eQ) 0 (.)
0.0
0.2
0.4
0.6
0.8
1.0
Figure 8.4: Coefficients at level 5 of a function (no noise added) with jumps at 0.25 and 0.75 coefficient per jump, they are flanked on both sides by other "large" coefficients. The approach of Ogden and Parzen (1996a) adapts standard change-point methods (which are closely related to classical goodness-of-fit techniques) to test the hypotheses given in (8.2). In the change-point problem with data X 1 , ... , X d, nonparametric test statistics for the general hypotheses
Ho : E[Xt) Ha : E[Xt]
= E[X2] = ... = E[Xd] = ... = E[Xm] -# E[Xm+t] = ... = E[Xd)
are based on the mean-corrected cumulative sum (CUSUM) process (see Csorgo and Horvath (1988) for a review of nonparametric change-point procedures). Typical functionals of this cumulative sum process are the maximum of the absolute value (Kolmogorov-Smirnov), the integral of the square (Cramer-von Mises), and a weighted integral of the squared process (Anderson-Darling). These test statistics in the change-point problem are the same as those used in goodness-of-fit situations. These tests are examples of omnibus tests that can be used to test the null hypothesis of equal means vs. a very wide variety of possible alternatives. The generality of the alternative hypothesis in this wavelet thresholding situation suggests that such an omnibus test would be appropriate. Thus, the approach of Ogden and Parzen (1996a) is to base the test for the hypotheses in (8.2) on the following process, which depends on the choice of a univariate function g: ~ ~ u ~ 1,
0 ~ u
< ~'
(8.4)
154 where
THRESHOLD SELECTION BY HYPOTHESIS TESTING
a(g)
represents the standard deviation of the random variable g(Xi):
a{9 ) =
1:
2
g (x),P(x) dx-
{1:
2
g(x),P(x) dx}
Under the null hypothesis, the process (8.4) converges in distribution to a Brownian bridge stochastic process { B (t), 0 ~ t ~ 1}, a continuous Gaussian process with B(O) = B(1) = 0, mean zero, and Cov[B(s), B(t)] = min(s, t) -st (see Ross (1983) or Karlin and Taylor(1975) for more discussion on Gaussian stochastic processes). The heuristic of this approach is that one or more groups of large coefficients clustered together would cause B 1t) to exhibit an appreciable divergence from the typical behavior of a Brownian bridge process. It would be hoped that this atypical behavior would be detected by any of the omnibus test statistics mentioned previously. Though a wide range of choices for g(·) and for the Brownian bridge-based test statistic is possible, the paper by Ogden and Parzen (1996a) focuses on using g( x) = x 2 and the KolmogorovSmimov test statistic SUPo
Minimizing False Discovery
The approach to the threshold selection problem taken by Abramovich and Benjamini (1995), representing an interesting _application of concepts develo ed b Ben"amini and Hochberg (1995), has as its goal the control of the
Data Adaptive Wavelet Thresholding
155
false discovery rate (FDR). This approach does not distinguish between coefficients at different levels, so it represents a data-dependent method for selecting a global threshold with which to shrink the empirical wavelet coefficients. Considering all n - 1 wavelet coefficients, there are n - 1 null hypotheses H 0 : Oj,k = 0 to be considered in such an approach. As stated before, in most applications, it is assumed that most of these null hypotheses are true (or, to be more precise, most of these hypotheses are "approximately true," in the sense that most coefficients are only negligibly different from zero, and thus are not needed in the reconstruction of j). The aim of this procedure is to limit the possibility of a coefficient being included in the reconstruction erroneously. In this context (in which each null hypothesis has a two-tailed counterpart as its alternative), a coefficient is said to be erroneously included if either H 0 is true (or approximately true) and the corresponding coefficients is included in the reconstruction, or if H 0 is false and the corresponding coefficient is included, but with the wrong sign. For any data dependent global thresholding scheme, define the random variable R to be the number of coefficients that are not shrunk to zero, or, equivalently, the number of coefficients that are included in the reconstruction. These included coefficients consist of two types: those included correctly, and those included erroneously, as defined in the previous paragraph. If Q is defined to be the number of coefficients incorrectly kept, then Abramovich and Benjamini (1995) attempt to control the proportion of erroneous conclusions Q I R, which is defined to be zero whenever Q = R = 0. The aim of this approach is to include as many coefficients as possible, provided that the expected value of Q I R is kept below a user-specified value q. The authors point out that when the data are pure white noise (all "true" wavelet coefficients are identically zero), this approach amounts to controlling the probability of even one coefficient being kept in the reconstruction. Thus, in their examples and simulations, Abramovich and Benjamini (1995) use values for q that are commonly used for the Type I error probability in usual testing situations: 0.05 and 0.01. In view of this, one might consider using a to denote this criterion value rather than q. The Abramovich-Benjamini method for choosing a data-dependent global threshold .A consists of the following four steps: 1. Compute the usual p-value associated with each of the n - 1 sets of hypotheses:
2. Place then- 1 Pi,k 'sin increasing order: P
1
~P
2
~
· · ·
~P
n-1 ·
156
CROSS-VALIDATION METHODS
3. Stepping through these ordered p-values beginning with i
=
1, let m
be the largest i for which P(i) ~
~
-q. m
4. Set
which is equal to p-value P(m)·
fo times the coefficient that corresponds with the
This procedure differs fundamentally from the other hypothesis testing-based approaches in that the Abramovich-Benjamini approach seeks to include as many coefficients as possible subject to a constraint. By contrast, the OgdenParzen procedures include a coefficient only when there is strong evidence that it is needed in the reconstruction. Abramovich and Benjamini (1995) report on a simulation study that compares their method (with both q = 0.05 and q = 0.01) with the VisuShrink algorithm of Donoho and Johnstone (1994) (with three different choices of the low-level cutoff j 0 ) for a variety of functions. They conclude that this FDR approach works well (in terms of MSE) in comparison with the VisuShrink methods when the function of interest has several abrupt local changes; as might be expected given its nature, VisuShrink performs better when the underlying function is relatively smooth. As in the previous section, the user retains some control over the amount of smoothing that will be done by choosing the parameter q-a small value for q will allow only few coefficients to be included, giving a relatively smooth result. Conversely, large values for q will give a smoother estimate.
8.3
Cross-Validation Methods
The notion of cross-validation has emerged as a very useful data-driven method to choose smoothing parameters in a wide variety of estimation procedures. The main idea of cross-validation is to choose the smoothing parameter that gives the best estimator for predicting new observations. In practice, we often have only a single data set to use, with no immediate expectation of new data, so the data is reused to simulate new data being observed. A typical way to apply this principle is the standard "leave one out" crossvalidation algorithm. For each value of a general smoothing parameter ,\ (which can be regarded as a bandwidth in kernel smoothing), single data values are left out in turn, and the remaining n - 1 data points are used to obtain an estimate off and then a "prediction" of the left out data point. A measure
Data Adaptive Wavelet Thresholding
157
of this success in prediction is given by 1 CV(,\) = -
n
~
2
L (Yi- Yi) ' n i=l
Yi
where fi denotes the "prediction" of using only the other n -1 data points. The smaller the cross-validation function, the closer the predictions are to the actual observed data values. In practice, CV( ,\) is computed for a grid of possible ,\values, and the final smoothing parameter value Acv is chosen to be the minimizer of the CV function. More details can be found in Stone (1978), who reviews cross-validation techniques in practice. These ideas are extended to generalized cross-validation by Craven and Wahba (1979). Lately, a good deal of attention has been focused on applying general crossvalidation ideas to wavelet regression. Early approaches include those by Weyrich and Warhola (1994) and Nason (1994). Following Nason (1994), we are interested in choosing a global threshold to minimize the £ 2 risk in recovering the unknown function f. Thus, the objective function is given by
M(t) = E [ [ (i,(x)- f(x)
f dxl ,
(8.5)
where ft is the wavelet estimator that results from applying a threshold t globally to all wavelet coefficients. In practice, this can never be computed (as f is unknown), so M(t) must be estimated in some way. In the model considered in this chapter, we are content to consider only the discretized version of (8.5):
m(t) = E
[t. (it(~)-!(~))
2 ]
.
By the wavelet version of Parseval's identity (4.32), this can be expressed in the wavelet domain:
m(t) = E [n }(
~ (ti],k- 8j,k)
2 ]
,
where Bj,k represents the estimate of the "true" coefficient Oj,k computed from the soft thresholding function with threshold t:
158
CROSS-VALIDATION METHODS
The first question of interest in this approach involves the behavior of the objective function m(t): In particular, given knowledge off (equivalently, given knowledge of the Bj,k 's), is it possible to minimize the function m(t)? Standard calculus concepts would suggest taking the first derivative of m(t), solving for the to which gives m'(to) = 0, then verifying that t 0 is indeed a minimum by checking the sign of the second derivative of mat t 0 . Unfortunately, this cannot be used, as the soft thresholding function bt applied to data values is not everywhere differentiable with respect tot. Nason (1994) examines this question at length, noting that the derivative m' (t) is piecewise linear with discontinuities at each lfow)~JI. With high probability, however, the jumps at these points are small in the area where the minimum is likely to occur. By this and additional arguments about the behavior of m (t) fort = 0, Nason concludes that the function m(t) is "almost convex" and can thus be effectively minimized. The usual method of cross-validation cannot be applied directly to estimation with wavelets because efficient algorithms are not yet available for computing the discrete wavelet transform using an orthogonal wavelet with non-uniform designs. Nason (1994) suggests breaking the original data set Y1 , . . . , Yn into two subsets of equal size: one containing only the evenindexed data, and the other, the odd-indexed data. The odd data will be used to "predict" the even data, and vice versa. To be more specific, we must introduce some notation. Let Y1 Y;j 2 represent the (re-ordered) odd data points and Y1E, .•. , 12 the similarly renumbered even data values. The usual wavelet
°, Y?, ... ,
Y;
estimator based on the even-indexed points, denoted jE, consists of estimates of the function fat the points 2/n, 4/n, ... , (n- 2)/n, 1; that using the odd data points j 0 estimates fat 1/n, 3/n, ... , (n- 1)/n. To compare these estimated points directly with original data values, we employ "wrapped around" interpolated versions of each subset:
t(Y2i-l + Y2i+I), i=1, ... ,~-1, i!! z(Yn-1 + Y1), - 2' for the odd data, and
t(Yn + Y2), 2 (Y2i-2 + Y2i),
i i
= 1, = 2, ... ' ~.
for the even data. Note that the indices on these interpolated versions of the data coincide with the indexing of the subsets of the data, and hence, of the estimators resulting from the subsets. Also, note that the interpolation scheme described above wraps the data around the interval [0, 1], which corresponds to periodic boundary handling. The cross-validation approach will
Data Adaptive Wavelet Thresholding
159
minimize the following estimate ofm(t): n/2 { (
m(t)=tt
2. 1 2} ) 2 ( . + if'( ' : J-Y.•) . Ji'(~)-Y,O
(8.6)
The Parseval relation can be applied to this expression as well. Defining for the moment {w],k} and {w7,k} to be the wavelet coefficients for the evenand odd-indexed data respectively, with {w],k} and {w7,k} representing the wavelet coefficients resulting from the respective interpolated sequences, the expression (8.6) can be rewritten
This function can then be minimized overt, giving the "best" threshold for half-sample prediction: , n/2
"'cv
. m (t) . = arg nunt2:o A
(8.7)
This minimum must be computed numerically, and though any of a number of algorithms could be applied to find the minimum, Nason recommends the simple golden section search given in Press, et al. (1992). The selected threshold (8. 7) is not the best to apply to the full data set, however. Note the dependence of the universal threshold of Donoho and Johnstone (1994) on the sample size n:
Au= Since
j21ogn.
Arj 2 = J2log(n/2), the relationship between the two thresholds is An
u
= (1-
1
og 2 logn
)
-1/2
An . u
(8.8)
The final threshold to use with the full data set results from applying the correction (8.8) to the cross-validation threshold for the half-samples. Nason (1996) describes an alternative scheme that uses the more traditional leave-one-out cross-validation idea, which can be applied to a data set of any size (not necessarily a power of two). As mentioned before, efficient algorithms do not exist for computing the discrete wavelet transform for unequally spaced data, or when n is not a power of two. Nason gets around this restriction by applying two of the methods described in Section 6.4: reflection and padding. The basic idea behind this approach, described in more
160
CROSS-VALIDATION METHODS
detail presently, is that, for each data value li, the data set is broken into two pieces around it, and each piece is used separately to predict li. To predict li, break the data set into a left half and a right half:
= {Yt, ... , li-d R = {li+I, · · ·, Yn}.
L
The sets L and R are reflected completely at the left and right ends respectively (giving 2i - 2 points in the reflected left set and 2n - 2i points in the reflected right set). These are each extended to the next larger power of two by repeating the extreme values to obtain L*
= {li-I, ... li-t, li-2, ... , Y2, Yt, Yt, Y2, ... , li-2, li-d
R* = {li+I, li+2, · · ·, Yn-1, Yn, Yn, Yn-1, · · ·, li 2 , li+I, · · ·, li+d· The usual wavelet threshold estimators are formed for each half of the data using the global threshold value t, giving and ftR• respectively. The prediction of the left-out point li is the average of the right-most value of and the left-most value of ftR. Denoting this prediction value the estimate of m (t) in this case is given by
if
Yi,
n-1
m(t)
= 2: (Yi- li)
if
2
i=2
The leave-one-out cross-validation threshold is simply the minimizer of m(t) (no adjustment needed). Nason (1995) points out that these cross-validation methods work well when the assumption of Gaussian and independent noise holds. In cases for which this does not hold (if the error distribution has heavier tails and/or correlated noise), these wavelet cross-validation techniques do not work well. It is well documented (see, for example, Altman (1990) and Hart (1994)) that usual cross-validation for other smoothing methods have problems when the errors are correlated, so it is perhaps no surprise to find this holds in the wavelet shrinkage situation as well. Wang (1996) studies the problem with correlated errors in depth, proposing cross-validation methods for use with long-memory data, and laying the theoretical framework of such methods. It is generally agreed (see Nason (1995) and Johnstone and Silverman (1995)) that threshold selection should be done level-by-level for correlated data. Thus, Wang (1996) proposes two such methods: a level-specific universal-type threshold resulting from multiplying J2 log n by a correction factor to account for the level and the noise
Data Adaptive Wavelet Thresholding
161
correlation structure; and a level-dependent cross-validation method that represents an extension of Nason's methods involving removing more than half the data each time. More details are given in the paper by Wang and in Nason (1995).
8.4 Bayesian Methods Recently, methods have been proposed in wavelet function estimation that involve Bayesian principles. These offer an interesting and useful alternative to the methods presented in earlier sections. It should be noted that Bayesian methods for function estimation with wavelets involve a more involved problem than simple threshold selection, in the sense that new shrinkage functions result from the Bayesian approach, different from either the soft or hard thresholding functions discussed previously. Vidakovic (1994) describes two related Bayesian approaches for "thresholding" the empirical coefficients. The first is based on Bayesian estimation using each w)~2 to estimate the corresponding OJ,k· This approach results in a Bayes rule for the estimation. The second, a true threshold selection scheme for the hard thresholding operator, involves Bayesian testing of each hypothesis H 0 : OJ,k = 0 vs. a two-sided alternative for each j and k similar in spirit to the methods of Section 8.2. These two methods will be summarized here briefly. For simplification of notation, we drop the subscripts, considering the estimation or testing of each coefficient separately. The distribution of the empirical coefficient w is
(Note that in terms of our earlier notation, ry 2 = a 2 jn.) An exponential prior distribution is placed on ry 2 ,
and to focus attention on 0, we integrate out ry 2 to give that the marginal distribution of w conditioned on 0 is double exponential:
wiO
~ Vt: (o, ~) .
Applying a symmetric prior distribution on 0: 0,...., 1r(O) where
1r(0)
= 1r( -0),
0
E
JR,
162
BAYESIAN METHODS
allows estimation of the "true" coefficient () using the empirical coefficient w according to the following Bayes rule, based on squared error loss:
where I1 1 and II 2 are the Laplace transforms of 1r(O +w) and 1r(O- w) respectively:
and similarly for II2. There are many possibilities for the choice of 1r(O). Vidakovic recommends using the Student t distribution rather than the normal:
where 7 is a scaling hyperparameter and v is the degree of freedom index. The Bayes rules thus derived are simply smooth shrinkage functions and can be thought of as competitors to the soft and hard thresholding shrinkage functions discussed earlier. The behavior of these functions depends upon the hyperparameters used in describing the distributions of ry 2 and (). For any choice of these hyperparameters, a commonality of these shrinkage functions and the hard and soft thresholding functions is that t5(w)
~
0, for w near zero
and t5(w) ~ w for lwllarge. This only makes sense-in general "small" coefficients should be shrunk to near zero, and "large" coefficients should be retained more or less intact. An advantage of Vidakovic's Bayes rules is that these shrinkage functions can be fine-tuned by the choice of hyperparameter values. In particular, 1 is related to the precision of the original data: E[ry 2 ] = 1 /1. Increasing 1 is analogous to decreasing the threshold. The hyperparameter 7 controls the shrinking for small values of w: small values of 7 yield estimators very close to zero for small w. Vidakovic (1994) gives several illustrations of the Bayes rules that correspond to various choices of these hyperparameters. As mentioned earlier, the second method proposed by Vidakovic (1994) is a Bayesian method for selecting the threshold value for use with the usual hard thresholding operator. This involves testing the precise hypothesis H 0 :
Data Adaptive Wavelet Thresholding
163
() = 0 vs. a two-sided alternative according to the Bayesian framework, which requires that point mass of the distribution of() be placed on zero. Thus, the prior on () becomes () rv
1r(O)
= p~(O) + (1 -
p){(O),
where ~ is the Dirac delta function introduced in Section 2.1. Note that P[O = 0] = p and the density~ describes the behavior of() when() is non-zero (which occurs with probability 1- p). As before, using the double exponential distribution of w conditioned on () and applying the usual Bayes methods for hypothesis testing dictates that the empirical coefficient will be "threshaided" if
where II 1 ( ·) and II 2 ( ·) are as before. As might be expected, increasing p has the effect of increasing the threshold value, as it corresponds to increasing surety about the sparsity of the wavelet representation of the data. In practice, it is believed that most of the () j,k 's are zero, so the hyperparameter p should be chosen to be close to one to represent this. Vidakovic suggests using the Student's t distribution for {in the Bayes factor setting as well. In either of these two schemes, the hyperparameter 'Y can be estimated from the data, giving an empirical Bayes procedure: 'Y n /8- 2 for some esti2 mator of a . Another approach to Bayesian estimation of wavelet coefficients is due to Chipman, Kolaczyk, and McCulloch (1995). Their approach, which gives level-dependent Bayes rules for shrinking coefficients, has an advantage over Vidakovic's method in that their shrinkage functions are in closed form. Their approach assumes the noise level a to be a known constant, and they apply a normal mixture distribution to each coefficient Oj,k:
=
The mixing parameters are assumed to have independent Bernoulli(pj) distributions with P['Yj,k
= 1] =Pi= 1- P['Yj,k = 0).
Note that the prior distributions are identical for all coefficients in any resolution level j. The hyperparameter Tj is taken to be small, so that the N(O, rJ) portion of the prior on Bj,k represented concentrating mass near zero, which
164
BAYESIAN METHODS
corresponds to a coefficient being "negligible." Then ci is taken to be substantially larger, so that the N(O, c]1j) component allows "significant" coefficients. The Bernoulli probability Pi plays essentially the same role here as it did in the model of Vidakovic (1994), corresponding to the a priori knowledge about the degree of sparsity in the wavelet representation. More will be said on particular choices of the hyperparameters later. Again, the distribution of the empirical coefficient w)~2 conditional on the "true" coefficient oi,k is
where ry 2 = a 2 / n as before. Again, dropping coefficients to simplify notation, Chipman, et at., use w to estimate () by means of the posterior mean of 0:
EOw-
[ I ]-
( C7) { ry2
2
2
+ (C7) 2
p 7 ·--+ ·p-+1-1 } p+1 "72 + 72
·w
'
(8.9)
where p=
p1r(wb
= 1)
~--~~~--~
(1- p)7r(wh'
= 0)
and
7r(wb = 1)
rv
N(O, ry 2
+ c27 2 )
1r(wb = 0)
rv
N(O, ry 2
+ 1 2 ).
Note that (8.9) looks like a simple multiplication of the coefficient w by the factor
s=
(c1) 2 ry2
+ (C7) 2
p
12
1
·---+ ·p-+-1 ' p+1 'r/2 + 72
where lsi ~ 1, but, in fact, s is itself a function of w, so (8.9) is indeed a nonlinear shrinkage function of w. As in the Bayesian approach described earlier in this chapter, this method depends heavily upon the choice of values used for the hyperparameters 1i, ci, and Pi. These must be chosen to quantify the notions of "small," "large," and "most," respectively, in order to correspond with the sparsity of representation belief that "most of the coefficients are small (negligible); the others are large." Thus 1i must be chosen small enough that a coefficient Oi,k E ( -37i, 37i) is negligible. At the same time, ci must be chosen large enough to model plausible values of Oi,k reasonably. The probability Pi represents the expectation of the proportion of significant coefficients at level
Data Adaptive Wavelet Thresholding
165
J. Chipman, et al., suggest that the sequence PJ be chosen to decrease as j increases, since, at high levels, relatively fewer coefficients are needed to represent the function well. Automatic rules depending on the wavelet used and the data are given in their paper for use as default values for these hyperparameters. An advantage to this approach is that Bayesian confidence bands can be derived by considering the variance (as well as the mean) of the posterior distribution of the (} J,k 's. If the vector(} represents the discrete wavelet transformation of the sequence f = (j(l/n), ... , f((n- l)/n, f(l))': (} = Wj,
then the Bayesian estimate of the sequence
j
=
f
will be
W'E[Oiw].
Now the variance-covariance matrix of the resulting estimator is given by Var(]) = ~ = W'diag{Var(Oiw)}W, and Bayesian posterior intervals for
f
are given by
f ± 3Vdiag(~).
CHAPTER
NINE
Generalizations and Extensions
Previous chapters have given the hows and whys of wavelets, and have discussed wavelet applications in statistics. The one-dimensional wavelet transform that has been the primary focus is only a small portion of all waveletbased methods available. In this chapter, we give an overview of some of the important extensions of standard wavelet methods, and briefly consider their uses in statistics.
9.1
Two-Dimensional Wavelets
This book has dealt only with unidimensional wavelet methods, but most of these methods have multidimensional counterparts. This section describes the construction of two-dimensional wavelets. The extension to more than two dimensions can be accomplished by similar means. Here, we consider analyzing two-dimensional signals f (x, y) which are square integrable over the real plane: f(x, y) E L 2 (JR 2 ). The simplest way of constructing a wavelet basis for L 2 (JR) is to take the simple product of unidimensional wavelets:
It is straightforward to show that the 'li's as defined above are indeed wavelets
and that they form an orthonormal basis for L 2 (JR 2 ).
168
TWO-DIMENSIONAL WAVELETS
But this approach is too simplistic to retain the nice features of the multiresolution analysis. Let us begin extending our one-dimensional MRA by defining the two-dimensional space V 0 as the tensor product of two onedimensional spaces:
Yo= Vo 0 Vo = span{f(x)g(y), j,g E Vo}. For any j, define Y i = Vj 0 Vj , and then this set of subspaces inherits the two-dimensional versions of the MRA properties listed in Section 1.2:
1.
...
c
Y -2
2.
njEzYj
3.
f f
4.
c
Y -1
c
= {(o,o)},
Yo
c
Y1
c
Y2
c ... ;
2
ujEzYj = L (JR 2 );
E Yj if and only if j(2·, 2·) E Yj+l; E Yo implies f(·- k1, ·- k2) E Yo for all k1, k2 E 7L
Also, defining
(x, y)
= ¢(x)¢(y) and (9.1)
we have that
5.
The set { cl>j,k 1 ,k 2 , j, k1, k2 E .7l.} constitutes an orthonormal basis for
Yo. Notice that this setup differs from the previously mentioned one by allowing only a single dilation index. To continue this extension into two-dimensions, we would next like to find complimentary spaces that represent the "detail signal" between successive approximations. As in the one-dimensional case, define W i to be the orthogonal complement of Y i in Y i+I· Breaking down this MRA gives more insight into the nature of the space W i:
Yj+l
VJ+l 0 VJ+I (Vj EB Wi) 0 (Vj EB Wi)
Vj 0 Vj EB ((Vj 0 Wi) EB (Wi 0 Vj) EB (Wi 0 Wj))
Yi EB Wi.
Generalizations and Extensions
169
It can be seen that the "detail space" Wj is itself made up of three orthogonal subspaces. Bases for these three sub-detail spaces are the corresponding tensor products of their components, so define the following twodimensional wavelets:
\ll 1 (x,y) 2
¢(x)'lj;(y);
\ll (x, y)
'l/J(x)¢(y);
w
'ljJ(x)'ljJ(y).
3 (x,
y)
Dilating and translating as in (9 .1 ), it is clear that an orthonormal basis for W is
j
where
It follows that an orthonormal basis for L 2 (JR 2 ) is
{lJI 1r:"k k , m ' I, 2
= 1, 2, 3, j, kt, kz
E Z}.
Mallat (1989a) notes that the three sets of wavelets correspond to specific spatial orientations. The "detail image" associated with each of these three orientations will give emphasis to edges in the image in the indicated direction. Specifically, the wavelet lJ! 1 corresponds to the horizontal direction, the wavelet lJ! 2 with the vertical direction, and w3 with the diagonal. Naturally, the decomposition and reconstruction algorithms for the two dimensional case are closely related to the corresponding one-dimensional algorithms. By applying the two-scale relationship (4.7) twice to the twodimensional wavelet in (9.1), we arrive at the two dimensional two-scale relationship:
Cl?j,ki,kz(x,y) =
:2::: :2::: hei-2kihfz-2kzcpj+I,fi,fz(x,y). £1EZfzEZ
Applying the result in (4.13) twice gives an analogous result in two dimensions, relating a scaling function at any level to scaling functions and wavelets at a coarser level:
170
TWO-DIMENSIONAL WAVELETS
D~J- 1
H (columns)
,....-------..;:__--~---
D
2
j-1
D~J- 1
Figure 9.1: Schematic diagram of the two-dimensional wavelet decomposition
:2::: :2::: {au1-k1 au2-k2 cl>j-1,£1,£2 (x, Y) f1EZ£2EZ
+ au1-k1 bu2-k2WJ-1,£ 1,£ 2 (x, Y) + bu1-k1 au2-k2lJ!~-1,£1 ,£2 (x, Y) +
bu1-k1 bu2-k2WJ-1,£ 1,£ 2 (x,
Y)} ·
Naturally, analogous versions of these formulas are applied to give the decomposition and reconstruction algorithms for coefficients, which again can be represented in terms of the high pass and low pass filters discussed in Section 4.2. The decomposition is the result of a two-step process, which is represented schematically in Figure 9.1. To begin with, we regard the matrix of scaling function coefficients as a two-dimensional signal, each row being thought of as a separate (one-dimensional) signal. The first step consists of applying filters H and G to each row of the matrix, the intermediate results being two matrices with the same number of rows but half as many columns as the original matrix. Regarding each of these matrices as consisting of columns of (one-dimensional) signals, the filters H and G are applied to the columns, giving four final square matrices, each with half as many rows and half as many columns as the original matrix. The four resultant matrices correspond to the scaling function and the three wavelets described earlier. The matrix C J _ 1 is a "smoothing" of the higher-level scaling coefficients, and the matrices D} _1 ,
Generalizations and Extensions
Dj+3
DJ+z
171
Dl+3
DJ+z
n;+3 Dj+1 DJ+1
cj
DJ+1
DJ+z
Figure 9.2: Two-dimensional signal decomposition
DJ _1 , D] _1 , represent the horizontal, vertical, and diagonal detail components, respectively. A very common example of a two-dimensional signal is a digital image, consisting of a matrix of pixel values, to be plotted on the grey-scale. The wavelet decomposition of an image can begin by regarding the original matrix elements as the top-level scaling function coefficients, as was done in the onedimensional case in Section 6.2. The matrix of scaling function coefficients that result from a one-level decomposition is a "smoothing" of the original image. The other three matrices represent "detail images," each according to its directional orientation. The results of the two-dimensional wavelet decomposition are conveniently stored in the format displayed in Figure 9.2, similar to that of Mallat (1989b). If the decomposition were to proceed further, the square in the lower left-hand corner of the figure would be replaced by four equally sized squares, the lower left-hand square in turn containing the matrix of scaling function coefficients at level j - 1. An example of the 2D wavelet decomposition, displayed as in Figure 9.2, is given for the "Lena" image in Figure 9.3. Note that the vertical details (side of arm, side of face) are represented in the lower right-hand portion of the
172
TWO-DIMENSIONAL WAVELETS
Figure 9.3: Wavelet decomposition of the Lena image
plot, the horizontal details (top of shoulder, mouth) can be seen in the upper left-hand corner, and the diagonal details (hat brim, diagonal part of shoulder) are in the top right-hand corner. Figure 9.4 shows the original Lena image and its projection onto three V j approximation spaces. Note that reducing the resolution level by one results in a smoothed image that requires only 1/4 the number of coefficients to be stored. The general ideas behind extending wavelets from one to two dimensions could be employed to produce scaling functions, wavelets, and decomposition/reconstruction sequences for any number of dimensions, although the algorithms and interpretations get a little messy. Again, the focus of this book has been entirely upon one-dimensional procedures, but it is important
Generalizations and Extensions
173
Figure 9.4: Image multiresolution example with the Lena image: original image (upper left); reconstruction with J = 6 (upper right); reconstruction with J = 5 (lower left); reconstruction with J = 4 (lower right)
to keep in mind that analogous techniques could be developed for higherdimensional problems.
9.2
Wavelet Packets
The usual wavelet bases based on dilations and translations as described in Chapter 1 and used in various applications throughout this text provide an extremely useful tool for describing non-smooth functions (or signals). But these bases represent only one possibility. In this section, we discuss the notion of wavelet packets, and how they can be used to enhance the usual wavelet analysis.
174
WAVELET PACKETS
Wavelet Packet Functions Wavelet packets, a generalization of wavelets bases, are alternative bases that are formed by taking linear combinations of usual wavelet functions. These bases inherit properties such as orthonormality and smoothness from their corresponding wavelet functions. The basic idea is that by using wavelet packets, a large class of possible bases can be constructed (of which the usual wavelet basis is only one), and that for any particular application, the "best basis" can be chosen from this "library" of basis functions. Comparing one basis to another is done according to some user-defined criterion function, and a best basis algorithm exists for doing the automatic basis selection. Wavelet packets were originally constructed by Coifman and Meyer. Early references include Coifman, et at. (1994), Coifman, Meyer, and Wickerhauser (1992), Coifman and Meyer (1991). A wavelet packet function is a function with three indices: wJ:k (t). Just as with usual wavelets, integers j and k index dilation and translation operations respectively:
where wm (no subscripts) is understood to have j = k = 0. The extra index m = 0, 1, ... is called the modulation parameter or the oscillation parameter. The first two wavelet packet functions are known to us as the usual scaling function (father wavelet) and the mother wavelet respectively:
w 0 (x) w 1 (x).
¢(x)
'lj;(x) Wavelet packet functions form tionships
= 2, 3, ... are defined by the recursive rela-
w 2 m(x)
=L
hkw~k(x)
(9.2)
k
and
w 2 m+ 1 (x)
= :~:::)-1)kh_k+IW~k(x)
(9.3)
k
for an appropriately chosen square-summable sequence { hk}. This is exactly the sequence used to express the two-scale relationships of wavelets and scaling functions in Section 4.1. In fact, note that form = 0, equation (9 .2) is just the two-scale relationship ( 4.1); form = 1, the expression (9.3) is just the formula ( 4.4) expressing the mother wavelet in terms of the father wavelet.
Generalizations and Extensions
175
It is instructive to pause here and consider the wavelet packet functions associated with the Haar basis. Recall that for the Haar system, the only nonzero coefficients in the { h k} sequence are h 0 = h 1 = 1/ V2. The Haar-based wavelet packet functions form= 0, ... , 7 are plotted in Figure 9.5. The primary purpose of constructing such packets is to provide alternative bases for the same spaces that were spanned by dilated and translated scaling functions and wavelets. The first important obsenTation to make is that the set {w[fk, k E Z, m = 0, 1, ... } forms an orthonormal basis for L 2 ( JR). In this r~presentation, the dilation index is not needed at all. Alternative bases can be constructed for other subspaces of interest as well. Note in Figure 9.5 that as m increases (j = k = 0 for all the packet functions in the plot), the number of oscillations increases, too. This is the reason behind constructing a basis for L 2 (JR) without adjusting the dilation index, which might seem to imply that higher resolution can be attained without using the j dilation index. This is in fact the case. We saw in Chapter 1 that if Vj C L 2 (JR) is the space of functions that are piecewise constant with jumps at k2-j, k E Z, then an orthonormal basis for this space is the set of scaling functions { c/>j,k, k E Z}. Another possible orthonormal basis for Vj is the set of wavelet packet functions
Since VJ+I = Vj space" Wj is
+ Wj, it can be seen that an orthonormal basis for the "detail (x) ' 2j { wm O,k
< m < 2j+I k E Z}. -
The orthonormal bases we are familiar with for these spaces are written in terms of this new notation as {wJ,k, k E Z} for Vj and {w],k, k E Z} for Wj. In addition to these two examples of possible orthonormal bases, there are many others that can be used, the elements of which result from the appropriate choice of various combinations of the indices m, j, and k. To be more precise, a basis for L 2 ( JR) can be formed by allowing k to range over Z, and choosing an index set I = {(m0 ,j0 ), (m 1,it), ... } such that the intervals [2jimi, 2ji (mi + 1)) are disjoint and "cover" the entire interval [0, oo): 00
ur2jimi, 2ji (mi
+ 1)) = [0, oo).
(9.4)
i=l
This can be thought of as covering the entire time-frequency plane with windows of various shapes. It is easily shown that the usual wavelet basis forms such a cover: let (m0 , j 0 ) = (0, 0), and then set m1 = m2 = ... = 1 and let ji = i, for i = 1, 2, ....
176
WAVELET PACKETS
Haar scaling function (m=O) co 0
Haar mother wavelet (m=1) L()
0 0
0
00
~-~-~-~---~-~
0.0
0.2
0.4
0.6
0.8
1.0
Packet function (m=2)
~l,r---..---r---=:::;:::===;::==~
";"" 0.0
0.2
0.4
0.6
0.8
1.0
Packet function (m=3)
~~====~---~~====~
~ ~.=====~-~====~~-,
L()
L()
0
0
0
0
0
0
~ ~-~~~==~~~..--~
";"" 0.0
0.2
0.4
0.6
0.8
1.0
~~-~~~~~-~~==~
";"" 0.0
Packet function (m=4} ,..---
0.2
0.4
0.6
0.8
1.0
Packet function (m=S) ,..---
,..---
L()
L()
0
0
,..---
0
0
0
0
'--
";"" 0.0
0.2
0.4
0.6
0.8
1.0
";"" 0.0
,..-
..--
-
..--
L()
L()
0
0
0
'---
0.6
0.8
1.0
..--
..--
..--
0
0
0
-
'---
";"" 0.0
0.4
Packet function (m= 7)
Packet function (m=6) -
0.2
0.2
0.4
0.6
0.8
1.0
";"" 0.0
0.2
0.4
..____
0.6
0.8
1.0
Figure 9. 5: Wavelet packet functions corresponding to the Haar system
Generalizations and Extensions
177
Though the previous discussion was given in terms of the Haar basis, the same results hold for all sets of wavelet packet functions and their associated subspaces Vj and Wj for j E 7L The collection of all wavelet packet functions {wrk' j, k E Z, m = 0, 1, ... } contains far too many elements to form an orthonormal basis. Care must be taken in choosing a subset of this collection in order to obtain a proper basis. Denoting by I a suitably chosen set of indices, decomposition of an L 2 (1R) function f into its wavelet packet components is given by f(x)
= LL
L:aJ:kwrk(x),
{m,j)E/ kEZ
where the coefficients are computed via
Thus, wavelet packets offer an enormous amount of flexibility in possible sets of basis functions. The grouping of all possible bases is called a library of bases. For the idea of wavelet packets to be really useful in practical situations, there must be some good adaptive way to choose the most appropriate set of basis functions with which to represent a particular function. This is the aim of the best basis algorithm.
The Best Basis Algorithm As we moved from the discussion of the wavelet decomposition of continuous functions to the decomposition of discrete data earlier in this text, we do so now as well in our discussion of wavelet packets. This is perhaps a more natural way to describe the main conceptual points of wavelet packets and the associated best basis algorithm. Recall from Section 4.1 that a decomposition algorithm exists to compute scaling function and wavelet coefficients at level j from the scaling function coefficients at level j + 1, specifically Cj,k
= :2::: ht-2k Cj+I,f, f
dj,k
= L:(-1)fh_t+2k+I
Cj+I,k·
fEZ
Recall also from Section 6.2 that this algorithm was begun by regarding the data values Y1 , ... , Yn as the highest level scaling function coefficients from which all lower-level coefficients are ultimately computed. In Section 4.2, we
178
WAVELET PACKETS
noted that such a decomposition algorithm is a pair of filtering operations, and thus the two decomposition expressions above can be expressed
Cj,·
= H Cj+l,·
dj,·
= Gci+I,·,
where H and G represent the low-pass and high-pass filters associated with the respective decomposition formulas. Thus, if the data points are regarded as the scaling function coefficients at level J, then all scaling function coefficients are obtained by repeated application of the filter H:
where Y = (Y1 , ... , Yn)'. Similarly, wavelet coefficients are computed by applying the filter G after successive applications of H:
By recalling the usual wavelet decomposition of data, we are better equipped to describe the organization structure inherent in the wavelet packet decomposition.
HY y
HGY GY
Q3y
Figure 9.6: Tree diagram of the usual wavelet decomposition algorithm
Generalizations and Extensions
179
The usual wavelet decomposition is displayed in a tree diagram in Figure 9.6. This idea is generalized to describe the wavelet packet decomposition. Each set of coefficients is subject to either of the filters H and G. Computing the full wavelet packet decomposition involves applying both filters to the Yi values and then recursively to each intermediate signal, giving the tree diagram in Figure 9. 7. The decomposition of each signal at each node of the tree by applying the two filters is known as the splitting algorithm. By computing the full wavelet packet decomposition on a data vector Y with n = 2J points depicted in Figure 9. 7 for r resolution levels, the result is a group of 2 + 4 + 8 + ... + 2r = 2r+ 1 - 2 sets of coefficients. At each level, note that the downsampling inherent in the filtering ensures that there are n total coefficients among all the sets at that level. The total number of coefficients (including the original data values) is thus n(r+ 1), which is obviously a highly redundant way to represent n data values. Choosing a particular basis of wavelet packet functions amounts to "pruning" the decomposition tree. In the usual wavelet decomposition algorithm shown in Figure 9.6, each right-hand node is "pruned," meaning that lowerlevel decompositions are not computed from the right-hand branches. The best basis algorithm, developed by Coifman and Wickerhauser (1992), consists of traveling down the tree structure, making a data-based decision at each node as to whether or not to split. The result (when all nodes are split as far
H 3Y HY 2
GH 2 Y HY HGHY GHY G2 HY y
H 2 GY HGY GHGY GY HG 2 Y azy Q3y
Figure 9.7: Tree diagram of the full wavelet packet decomposition algorithm
180
TRANSLATION INVARIANT WAVELET SMOOTHING
0.0
0.2
0.4
0.6
0.8
1.0
Figure 9.8: Schematic design of the result of the best basis algorithm
as they will be split) represents the "best" basis (according to the criterion in use) for representing the particular set of data in question. This tree-based algorithm will automatically ensure that the resulting index set { (m 0 , j 0 ), (m 1 , j 1 ), ... } will "cover" the interval [0, oo) in the best possible way (see (9.4)), guaranteeing an orthonormal basis for L 2 (JR). This tree-based approach is illustrated in two figures, Figure 9.8 denoting an arbitrary wavelet packet basis, and Figure 9.9 representing the usual wavelet basis. Note that the shaded boxes taken together should "cover" the entire width of the figure. In these figures, an unshaded box indicates that the box was split, so that it does not correspond directly to any element of the final basis. Of course, choosing a best basis begs the question as to what "best" means: the criterion function. Coifman and Wickerhauser (1992) focus primarily on the Shannon entropy measure. Other possibilities include counting the number of coefficients greater in absolute value than a given threshold.
9.3
Translation Invariant Wavelet Smoothing
One problem that wavelet bases have is the lack of translation in variance. To illustrate this point by example, consider the Haar basis decomposition of
Generalizations and Extensions
0.0
0.2
0.4
0.6
0.8
181
1.0
Figure 9.9: Schematic design of the usual wavelet decomposition
the function f(x) = 'lj;(x) (the Haar wavelet). It is clear that d 0 ,0 = 1 and all the other wavelet coefficients are identically zero. Now consider translating f to the right by a small amount 8: f(x) = '1/J(x- 8). This new function is of course still in L 2 (JR) and thus can still be decomposed into its wavelet components, but it is clear to see that the nice coefficient structure of the original decomposition is lost. For the shifted function, there are two non-zero coefficients at level 0: (d0 ,0 , d0 ,t) = (1 - 28, 8), three non-zero coefficients at level 1: (d 1 ,0 , d 1 , 1 , d 1 , 2 ) = v'2( -8, 8, 8) and so on. If the shift 8 is taken to be an integer, then the nice structure is preserved: Now do,J = 1 and all the other coefficients are zero, so the Haar wavelet decomposition is translation invariant under integral shifts, but not in general. No matter how an £ 2 function f(x) is shifted, it is still in L 2 (1R), and it can be written in terms of its wavelet components. Furthermore, the wavelet Parseval identity guarantees that the energy in the function is preserved in the total set of wavelet coefficients, regardless of how the energy is distributed among coefficients. It is true that the lack of translation invariance is not a real problem in this sense. It is a significant weakness, though, when applying wavelet methods to finite data sets, especially those with small or moderate sample sizes.
182
TRANSLATION INVARIANT WAVELET SMOOTHING
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Figure 9.10: Wavelet reconstructions of translated versions of a data set
A simulated example of how this can affect a statistical estimator in practice is shown in Figure 9.10. The first data set consists of 64 N(2, 1) randomvariablesfollowedby64N(-2, 1) observations(j(x) = 2'1/J(x), theHaar wavelet, with a signal-to-noise ratio of 2). The other two sets of data are just translated versions of the first: zi = Y(i+h)modn for h = 42 (corresponding to a leftward shift of 1/3) and h = 32 (corresponding to a shift of 1/4). These versions of the original data are actually "wrapped-around" translations of the original data, as if we were applying periodic boundary handling. For all three data sets, the universal threshold was applied to all levels of coefficients using the hard thresholding operator. The resulting wavelet estimator is shown along with each data set. The errors (1/n) 'E~=I (j(i/n) - f(i/n)) 2 were computed for the three estimates, giving 0.031, 0.045, and 2.157, respectively. It should come as no
Generalizations and Extensions
0.0
0.2
0.4
0.6
0.8
183
1.0
Figure 9.11: Translation-invariant wavelet estimator of simulated data using the Haar function, hard thresholding, and the universal threshold across all levels
surprise that the first estimate is quite good- the abrupt jump at 112 lines up exactly with the middle jump of the Haar wavelet '1/Jo,o. The second estimate is not quite as good as the first, but is still quite good, since the pair of jumps at 1 I 4 and 3I 4 line up exactly with the jumps in the wavelets 'lj; 1 ,o and 'ljJ 1 , 1 . The third estimate illustrates what can go wrong in general. Shifting by 1I 3 ensures that none of the wavelets will line up exactly. The estimation procedure does the best it can, but fails miserably. For general wavelet bases, the same phenomenon holds. Though it is not disastrous in every case, the wavelet thresholding estimator does not perform well for arbitrary translations. Coifman and Donoho (1995) make note of this problem, and also that the lack of translation invariance causes various spurious artifacts in the reconstruction of functions, such as Gibbs-type phenomena (rapid oscillations of high amplitude which are typical in Fourier reconstructions near jumps) near jump discontinuities. They propose an ingenious, yet simple solution, which is described in this section. One possibility, of course, would be to impose a shift on the data set before the decomposition takes place (in order to align apparent features of the data to avoid the problem seen in the third plot in the example), then decompose, shrink, reconstruct, and inverse-shift. In practice, however, it would be quite difficult to know the exact amount to shift the data (if at all). Instead, Coifman and Donoho propose to compute the wavelet estimator for all possible shifts, then inverse-shift them and take as a final estimate the average of the estimates resulting from all shift values. By "all possible shifts," we mean all n shifts of the data - considered on the unit interval, this will be shifts off-by amounts iln, fori = 1, ... , n. Though, as in the example,
Figure 9.12: Translation-invariant wavelet estimator of simulated data using Daubechies' N = 5 wavelet, soft thresholding, and the SURE thresholding scheme

Though, as in the example, some shifts will likely give poor results, these can reasonably be expected to average themselves out over all possible shifts. This "Spin Cycle" algorithm is demonstrated in Figure 9.11 for the example data set from Figure 9.10. Note that since the second two data sets are just translated versions of the first, all three versions will give the same estimate under this translation-invariance scheme. The estimation scheme used to produce Figure 9.11 was the same as that used in Figure 9.10: the Haar wavelet basis with the universal threshold applied to all levels. The error for this example was 0.411, much worse than the 0.031 and 0.045 for the first two plots of Figure 9.10, but a great deal better than the third plot. Note that even though the Haar wavelet was used, averaging over 128 different estimators gave a fairly smooth final estimate. Figure 9.12 is a plot of the same data set, smoothed by applying the Spin Cycle algorithm with a smoother wavelet, soft thresholding, and the SURE threshold selection scheme. This estimator does a good job of picking up the jump at the middle (and the jump at the edges which is induced by periodic boundary handling), but is a little wavy through the "flat" parts of the data. The error for this estimate is 0.273.
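A minimal sketch of the idea, assuming the PyWavelets package (the signal, noise, wavelet, and threshold choices below are illustrative and only approximate the estimators discussed in this chapter): threshold each circular shift of the data, unshift, and average the results.

import numpy as np
import pywt

def denoise_once(y, wavelet="haar"):
    """Hard-threshold all wavelet levels with the universal threshold and reconstruct."""
    n = len(y)
    coeffs = pywt.wavedec(y, wavelet, mode="periodization")
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # MAD estimate of the noise level
    lam = sigma * np.sqrt(2.0 * np.log(n))                # universal threshold
    thresholded = [coeffs[0]] + [c * (np.abs(c) > lam) for c in coeffs[1:]]
    return pywt.waverec(thresholded, wavelet, mode="periodization")

def spin_cycle(y, wavelet="haar"):
    """Average the estimates obtained from all n circular shifts of the data."""
    n = len(y)
    estimates = [np.roll(denoise_once(np.roll(y, -h), wavelet), h) for h in range(n)]
    return np.mean(estimates, axis=0)

# Illustrative data: a step function plus N(0, 1) noise, in the spirit of the chapter's example
rng = np.random.default_rng(0)
f = np.where(np.arange(128) < 64, 2.0, -2.0)
y = f + rng.standard_normal(128)
fhat = spin_cycle(y)
print(np.mean((fhat - f) ** 2))   # empirical mean squared error

The brute-force loop above costs roughly n times the work of a single estimate; Coifman and Donoho (1995) describe a fast implementation, but the loop is enough to convey the idea.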
Appendix
This book is concerned with L² function space, and while the notion of function spaces may not be familiar to the reader at first, it can be readily understood by relating it to vector spaces in linear algebra. (It is presupposed that the reader has had some exposure to linear algebra.) Most of the specific material on L² function space needed for this book is introduced as it is needed in Chapter 1. The following pages, while certainly not intended to be a complete discussion of vector spaces and function spaces, are devoted to briefly reviewing some basic concepts from linear algebra and then extending them to general Hilbert spaces. In linear algebra, a vector in ℝ^k is an ordered k-tuple of real numbers x = (x_1, x_2, ..., x_k) which is viewed as a directed line segment from 0 = (0, ..., 0) to x. Two vectors x and y are said to be equal if x_i = y_i for each i = 1, ..., k. Two fundamental algebraic operations may be applied to vectors. Vector addition is the elementwise sum of two k-tuples:

x + y = (x_1 + y_1, x_2 + y_2, ..., x_k + y_k),
while the scalar multiplication of a vector x and a scalar a ∈ ℝ is

ax = (ax_1, ax_2, ..., ax_k).
Note that both of these operations result in a new k-tuple. From these two basic properties, it is easily shown that addition of k-tuples is commutative and associative, and that various other algebraic properties hold. With these basic ideas, we turn next to the idea of vector spaces. The set of all ordered k-tuples is said to form the vector space ℝ^k. The space ℝ² is often represented by the usual x-y plane. Three-dimensional space corresponds with ℝ³, but higher-order vector spaces are difficult to visualize. Formally, a vector space is any set of vectors V which is closed under vector addition and scalar multiplication, i.e., for all x, y ∈ V, a ∈ ℝ,
x + y ∈ V and ax ∈ V.
These two operations must also satisfy a set of standard postulates, including commutativity, associativity, existence of a zero vector, etc. These postulates are listed in any basic linear algebra book. A subspace of a vector space V is a subset of vectors in V which is itself closed under addition and scalar multiplication. A subspace is also a vector space, so it must also include the zero vector and satisfy the other necessary postulates. In the vector space ℝ³, some examples of subspaces are the set consisting only of the zero vector; all vectors of the form (c, 0, 2c) for c ∈ ℝ; and in fact any plane or any line which passes through the origin. To discuss a basis for a vector space, we need a few preliminary definitions. A vector y is a linear combination of the vectors x_1, x_2, ..., x_m if it can be expressed

y = a_1 x_1 + a_2 x_2 + ... + a_m x_m for some scalars a_1, ..., a_m.
A set of vectors {x_1, x_2, ..., x_m} is said to be linearly dependent if the zero vector is a non-trivial linear combination of the x_i's (non-trivial means that not all the a_i's can be zero). Thus, if a set of non-zero vectors is linearly dependent, then at least one of the vectors can be written as a linear combination of the others. If a set of vectors is not linearly dependent, then it is linearly independent, which means that none of the vectors in the set can be written as a linear combination of the others. If every vector in a vector space V can be written as a linear combination of a set of vectors {x_1, x_2, ..., x_n}, then it is said that these vectors span V. A set of vectors {x_1, x_2, ..., x_m} is said to be a basis for a vector space V if the vectors are linearly independent and span V. The concept of a basis is essential to a discussion of linear algebra. For a particular basis x_1, x_2, ..., x_m, each vector y in the space can be written in terms of the x_i's:

y = a_1 x_1 + a_2 x_2 + ... + a_m x_m,
and furthermore, the representation is unique. There are many possible bases (infinitely many, in fact) for each non-trivial vector space. A simple example of a basis for ℝ^k is the standard basis: x_1 = (1, 0, 0, ..., 0)', x_2 = (0, 1, 0, ..., 0)', ..., x_k = (0, 0, 0, ..., 1)'. In fact, any set of k linearly independent vectors in ℝ^k constitutes a basis for ℝ^k, and every possible basis for ℝ^k will have exactly k vectors. The number of basis vectors for any vector space is known as the dimension of the space, with the dimension of the space {0} defined to be zero. A basis can be thought of geometrically as a set of coordinate axes. The standard basis is represented by the usual Euclidean axes. Any vector in the space has a unique representation in terms of these axes or bases. In Euclidean geometry, the well-known formula for the squared length of
a vector x, a generalization of the Pythagorean theorem, is given by

||x||² = x_1² + x_2² + ... + x_k².
Using the usual notation for the dot product (or scalar product) between two vectors x and y,

x · y = x_1 y_1 + x_2 y_2 + ... + x_k y_k,

the angle θ between the vectors x and y can be computed according to

cos θ = (x · y) / (||x|| ||y||).        (9.5)

To allow ready extension to other types of vector spaces, we will use the term inner product in place of dot product and write, for example, for k-tuples x and y,

⟨x, y⟩ = x_1 y_1 + x_2 y_2 + ... + x_k y_k.
In terms of the inner product, the length of a vector x, which we will henceforth refer to as the norm of the vector, is given by

||x|| = ⟨x, x⟩^{1/2} = (x_1² + x_2² + ... + x_k²)^{1/2}.

From (9.5) it is seen that if two vectors have an inner product of zero, the angle between them is 90 degrees (π/2 radians), and the vectors are said to be perpendicular, or orthogonal. Orthogonality may be difficult to visualize in more than three dimensions, but it is a key concept for this book. A set of vectors {x_1, x_2, ..., x_m} forms an orthogonal basis for a vector space V if the vectors are a basis for V and if each pair of basis vectors is orthogonal. If each vector of an orthogonal basis for V is normalized to have length (norm) one:

y_i = x_i / ||x_i||,   i = 1, ..., m,
then the resulting set of vectors {y_1, y_2, ..., y_m} constitutes an orthonormal basis for V. The notion of orthogonality extends to subspaces as well. Two subspaces V and W (both in the same vector space) are said to be orthogonal if every vector in V is orthogonal to every vector in W. If each vector of a basis for V is orthogonal to each vector of a basis for W, then the subspaces V and W are orthogonal.
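These finite-dimensional notions are easy to check numerically; the following sketch (with arbitrarily chosen vectors, using NumPy) verifies an inner product of zero, computes the angle in (9.5), and constructs an orthonormal basis of ℝ³ by a QR decomposition.

import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, -2.0, 1.0])

inner = np.dot(x, y)                              # inner product <x, y>
angle = np.arccos(inner / (np.linalg.norm(x) * np.linalg.norm(y)))   # equation (9.5)
print(inner, angle)                               # inner product 0, so the angle is pi/2: x and y are orthogonal

# The columns of Q form an orthonormal basis of R^3, so Q'Q is the identity matrix
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
print(np.allclose(Q.T @ Q, np.eye(3)))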
Every subspace W of a vector space V has an orthogonal complement in V, which consists of the set of all vectors in V which are orthogonal to W. It is straightforward to show that the orthogonal complement of a subspace is also a subspace. Given a vector x and a subspace W, the projection of x onto W is a vector y such that y ∈ W and x − y is in the orthogonal complement of W in V. The projection operation is denoted y = P_W x. The projection of a vector x onto a subspace W is the vector in W that is "closest" to x, in the sense that the magnitude of the "error" ||x − y|| is minimized when y = P_W x. From the vector space ℝ^k, we can extend to the infinite-dimensional space ℝ^∞, which contains all infinite-length vectors x = (x_1, x_2, x_3, ...)' with finite norm:

||x||² = Σ_{i=1}^∞ x_i² < ∞.

Though infinite-dimensional vector space might be hard to conceptualize, ℝ^∞ defined this way does form a bona fide vector space, since adding any two vectors with finite norm, or multiplying one by a finite scalar, results in another vector with finite norm. It is possible now to move from the countably infinite-dimensional vector space to uncountably infinite-dimensional vector spaces, which are simply spaces of functions. An element of such a function space is a function f(x) defined on a continuous subset of the real line. The notions of inner product and norm extend to function space as well, where the summation in vector space is replaced by its continuous counterpart, the integral. The inner product of two functions is given by

⟨f, g⟩ = ∫ f(x) g(x) dx,        (9.6)
the range of the integration determined by the definition of the particular space. The treatment of vector spaces and function spaces can be unified by considering the more general framework of Hilbert spaces. A Hilbert space is simply a complete¹ vector space (finite- or infinite-dimensional) which has an inner product defined. This book is primarily concerned with the particular Hilbert space known as L² function space. With the inner product defined as in (9.6) (integration taking place over some specified interval I ⊂ ℝ), this function space consists of all functions that are square-integrable:

∫_I f²(x) dx < ∞.

Clearly, this space is closed under addition and scalar multiplication, so it is indeed a valid Hilbert space since it is also complete.

¹ Completeness is a closure condition on the space, requiring that all Cauchy sequences converge to a limit that is also in the space.
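As a rough numerical check of the inner product (9.6), the sketch below (assuming SciPy is available; the Haar wavelet is hard-coded here purely for illustration) verifies that the Haar mother wavelet has unit L² norm and is orthogonal to one of its dilated translates.

import numpy as np
from scipy.integrate import quad

def psi(x):
    """Haar mother wavelet supported on [0, 1)."""
    if 0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1:
        return -1.0
    return 0.0

def psi_jk(x, j, k):
    """Dilated and translated Haar wavelet 2^{j/2} psi(2^j x - k)."""
    return 2 ** (j / 2) * psi(2 ** j * x - k)

def inner(f, g):
    """L2 inner product over [0, 1], approximated by adaptive quadrature."""
    value, _ = quad(lambda x: f(x) * g(x), 0.0, 1.0, limit=200)
    return value

print(inner(psi, psi))                          # approximately 1: unit norm
print(inner(psi, lambda x: psi_jk(x, 1, 0)))    # approximately 0: orthogonality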
All the concepts discussed earlier in terms of the usual k-tuple vector space extend to L² function space as well. The norm of a vector in L² space is defined to be

||f||² = ⟨f, f⟩ = ∫ f²(x) dx.

We can also speak of subspaces in L² function space. For example, the span of a set of L²(ℝ) functions {f_1, ..., f_m} is a subspace of L², defined to be²

{f ∈ L²(ℝ) : f(x) = Σ_{i=1}^m a_i f_i(x), for some constants a_1, ..., a_m}.        (9.7)
Other concepts that extend immediately to L² function space are orthogonality, bases, orthonormal bases, projections, etc.
² To be precise, the representation (9.7) of a function f in terms of a linear combination of other functions need hold only "almost everywhere" (a.e.), i.e., ||f − Σ_{i=1}^m a_i f_i|| = 0.
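As one concrete illustration of these ideas, the following sketch (which replaces exact integrals with averages over a discrete grid, so it is only an approximation) projects a function onto the span of two orthonormal Haar-type functions and checks that the residual is orthogonal to the spanning functions.

import numpy as np

# Discretized [0, 1) grid; exact L2 integrals are replaced by averages over the grid
n = 256
x = (np.arange(n) + 0.5) / n

phi = np.ones(n)                           # scaling function on [0, 1)
psi = np.where(x < 0.5, 1.0, -1.0)         # Haar wavelet
basis = [phi, psi]                         # orthonormal with respect to the discretized inner product

def inner(f, g):
    return np.mean(f * g)                  # approximates the integral of f*g over [0, 1)

f = np.sin(2 * np.pi * x)                  # an arbitrary function to project
proj = sum(inner(f, b) * b for b in basis) # projection onto span{phi, psi}

print([inner(f - proj, b) for b in basis]) # residual is (numerically) orthogonal to the subspace
print(inner(f - proj, f - proj) <= inner(f, f))   # the projection never increases the norm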
References
Abramovich, F., and Benjamini, Y. (1995). Thresholding of wavelet coefficients as multiple hypotheses testing procedure. In Wavelets and Statistics. Antoniadis, A., and Oppenheim, G. (eds.). New York: SpringerVerlag. pp. 5-14. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory. Petrov, B. N., and Csaki, F. (eds.). Akademiai Kiado: Budapest. Altman, N. S. (1990). Kernel smoothing of data with correlated errors. journal of the American Statistical Association 85: 749-759. Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley: New York. Anderson, L., Hall, N., Jawerth, B., and Peters, G. (1993). Wavelets on closed subsets of the real line. In Recent Advances in Wavelet Analysis. Schumaker, L. L., and Webb, G. (eds.). Academic Press: New York. Antoniadis, A., Gregoire, G., and McKeague, I. W. (1994). Wavelet methods for curve estimation. journal of the American Statistical Association 89: 1340-1353. Ariiio, M.A., and Vidakovic, B. (1995). On wavelet scalograms and their applications in economic time series. Discussion Paper 95-21, ISDS, Duke University, Durham, North Carolina. Auscher, P. (1989). Ondelettes fractales et applications. Ph.D. Thesis, Universite Paris, Dauphine, Paris. Bartlett, M. S. (1963). Statistical estimation of density functions. Sankhya Series A 25: 245-254. Battle, G. (1987). A block spin construction of ondelettes. part 1: Lemarie functions. Communications in Mathematical Physics 100: 601-615.
Benjamini, Y, and Hochberg, Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. journal of the Royal Statistical Society, Series B 57: 289-300. Bloomfield, P. (1976). Fourier Analysis of Time Series: An Introduction. Wiley: New York. Bock, M. E. (1992). Estimating functions with wavelets. Statistical Computing and Statistical Graphics Newsletter 4-8. Bock, M. E., and Pliego, G.]. (1992). Estimating functions with wavelets Part II: Using a Daubechies wavelet in nonparametric regression. Statistical Computing and Statistical Graphics Newsletter 27-34. Brigham, E. 0. (1988). The Fast Fourier Transform and Its Applications. Prentice-Hall: Englewood Cliffs, New Jersey. Cencov, N. N. (1962). Evaluation of an unknown distribution density from observations. Soviet Mathematics 3: 1559-1562. Chambers,]. M., Cleveland, W. S., Kleiner, B., and Thkey, P. A. (1983). Graphical Methods for Data Analysis. Wadsworth: Belmont, California. Cheng, K. F., and Lin, P. E. (1981). Nonparametric estimation of a regression function. Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete 57: 223-233. Chipman, H. A., Kolaczyk, E. D., and McCulloch, R. E. (1995). Adaptive Bayesian wavelet shrinkage. Technical Report, University of Chicago, Chicago, Illinois. Chui, C. K. (1992). An Introduction to Wavelets. Academic Press: New York. Chui, C. K., and Wang,]. Z. (1991). A cardinal spline approach to wavelets. Proceedings of the American Mathematical Society 113: 785-793. Cohen, A., Daubechies, 1., and Feauveau, J. C. (1992). Biorthogonal bases of compactly supported wavelets. Communications in Pure and Applied Mathematics 45: 485-560. Cohen, A., Daubechies, 1., and Vial, P. (1993). Wavelets on the interval and fast wavelet transforms. Applied and Computational Harmonic Analysis 1: 54-81. Cohen, A., Daubechies, 1., Jawerth, B., and Vial, P. (1993). Multiresolution analysis, wavelets and fast algorithms on an interval. Comptes Rendus des Seances de l'Academie des Sciences, Serie I 316: 417-421. Coifman, R. R., and Donoho, D. L. (1995). Translation-invariant de-noising. In Wavelets and Statistics. Antoniadis, A., and Oppenheim, G. (eds.). New York: Springer-Verlag. pp. 125-150. Coifman, R. R., and Meyer, Y (1991). Remarques sur !'analyse de Fourier fenetre. Comptes Rendus des Seances de l'Academie des Sciences, Serie I 312: 259-261. Coifman, R. R., and Wickerhauser, M. W. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory 38: 713-718. Coifman, R., Meyer, Y, and Wickerhauser, M. V. (1994).Wavelet analysis and
signal processing. In Wavelets and Their Applications, Ruskai, M. B., Beylkin, G., Coifman, R., Daubechies, 1., Mallat, S., Meyer, Y., and Raphael, L. (eds.). Jones and Bartlett: Boston. Coifman, R. R., Meyer, Y., Quake, S., and Wickerhauser, M. W (1994). Signal processing and compression with wavelet packets. In Wavelets and Their Applications, Byrnes, J. S., Byrnes,]. L., Hargreaves, K. A., and Berry, K. (eds.). Kluwer Academic Publications: Dordrecht, The Netherlands. Collineau, S. (1994). Some remarks about the scalograms of wavelet transform coefficients. In Wavelets and Their Applications, Byrnes,]. S., Byrnes,]. L., Hargreaves, K. A., and Berry, K. (eds.). Kluwer Academic Publications: Dordrecht, The Netherlands. Cooley,]. W, and Thkey, J. W (1965). An algorithm for the machine calculation of complex Fourier seriew. Mathematics of Computation 19: 297-301. Craven, P., and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik 31: 377-403. Csorg6, M., and Horvath, L. (1988). Nonparametric methods for changepoint problems. In Handbook of Statistics, Volume 7. Krishnaiah, P.R., and Rao, C. R. (eds.). Elsevier: Amsterdam. Daniel, C. (1959). Use of half-normal plots in interpreting factorial two-level experiments. Technometrics 1: 311-341. Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Communications in Pure and Applied Mathematics 41: 909-996. Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM: Philadelphia. Daubechies, I. (1993). Orthonormal bases of compactly supported wavelets II. Variations on a theme. SIAM]ournal on Mathematical Analysis 24: 499-519. Daubechies, 1., and Lagarias, J. (1991). Two-scale difference equations I. Existence and global regularity of solutions. SIAM journal on Mathematical Analysis 22: 1388-1410. Daubechies, 1., and Lagarias,]. (1992). Two-scale difference equations II. Local regularity, infinite products of matrices and fractals. SIAM journal on Mathematical Analysis 23: 1031-1079. Delacroix, M. (1983). Histogrammes et Estimation de la Densite Que saisje? # 2055. Presses Universitaires de France: Paris. de Boor, C. (1978). A Practical Guide to Splines. Applied Mathematical Sciences, Volume 27. Springer-Verlag: London. DeVore, R. A., and Lucier, B.]. (1992). Fast wavelet techniques for nearoptimal processing. In Proceedings of the IEEE Military Communications Conference 48.3.1-48.3.7. New York. Donoho, D. L. (1993). Nonlinear wavelet methods for recovery of signals, densities, and spectra from indirect and noisy data. Proceedings of Symposia in Applied Mathematics 47: 173-205.
Donoho, D. L., and Johnstone, I. M. (1992). Nonlinear solution for linearinverse problems by wavelet-vaguelet decomposition. Technical Report 403. Stanford University Department of Statistics, Stanford, California. Donoho, D. L., and Johnstone, I. M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81: 425-455. Donoho, D. L., and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. journal of the American Statistical Association 90: 1200-1224. Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1993). Density estimation by wavelet thresholding. Technical report, Stanford University Department of Statistics, Stanford, California. Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1995). Wavelet shrinkage: Asymptopia? journal of the Royal Statistical Society, Series B 57: 301-369. Doukhan, P., and Leon, J. (1990). Deviation quadratique d'estimateur de densite par projection orthogonale. Comptes Rendus des Seances de l'Academie des Sciences, Serie I 310: 424-430. Dutilleux, P. (1989). An implementation of the "algorithme trous" to compute the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space. Combes, J. M, Grossman, A., and Tchamitchian, Ph. (eds.). Springer-Verlag: New York. Dym, H., and McKean, H. P. (1972). Fourier Sums and Integrals. Academic Press: New York. Engel,]. (1990). Density estimation with Haar series. Statistics and Probability Letters 9: 111-117. Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker: New York. Fan, J., Hall, P., Martin, M., and Patil, P. (1996). On local smoothing of nonparametric curve estimators. journal of the American Statistical Association 91: 258-266. Gabor, D. (1946). Theory of communications. journal of the Institute of Electrical Engineering, London III 93: 429-457. Gao, H.-Y. (1993). Choice of threshold for wavelet estimation of the log spectrum. Technical Report 438. Stanford University Department of Statistics, Stanford, California. Gasser, Th., and Muller, H. G. (1979). Kernel estimation of regression functions. In Smoothing Techniques for Curve Estimation. Gasser, Th. and Rosenblatt, M. (eds.). Heidelberg: Springer. Gasser, Th., Muller, H. G., and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. journal of the Royal Statistical Society B 47: 238-252. Good, I.]. (1958). The interaction algorithm and practical Fourier analysis. journal of the Royal Statistical Society, Series B 20: 361-372.
Graps, A. (1995). An introduction to wavelets. IEEE Computational Science and Engineering 2. Haar, A. (191 0). Zur Theorie der orthoganalen Funktionen-Systeme. Annals of Mathematics 69: 331-371. Hart,]. D. (1994). Automated kernel smoothing of dependent data by using time series cross-validation. journal of the Royal Statistical Society, Series B 56: 529-542. Hu, Y.-S. (1994). Wavelet approach to change-point detection with application to density estimation. Ph.D. thesis, Texas A&M University, College Station, Texas. Janssen, A.]. E. M. (1992). The Smith-Barnwell condition and non-negative scaling functions. IEEE Transactions in Information Theory 38: 884886. Jawerth, B., and Sweldens, W. (1994). An overview of wavelet based multiresolution analysis. SIAM Review 36: 3 77-412. Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1992). Estimation d'une densite de probabilite par methode d'ondellettes. Comptes Rendus des Seances de l'Academie des Sciences, Serie I 315: 211-216. Johnstone, I. M., and Silverman, B. W. (1995). Wavelet threshold estimators for data with correlated noise. Technical report, Stanford University Department of Statistics, Stanford, California. Kaiser, G. (1994). A Friendly Guide to Wavelets. Birkhauser: Boston. Karlin, S., and Taylor, H. (1975). A First Course in Stochastic Processes, 2nd Edition. Academic Press: New York. Kerkyacharian, G., and Picard, D. (1992). Density estimation in Besov spaces. Statistics and Probability Letters 13: 14-24. Kerkyacharian, G., and Picard, D. (1993). Density estimation by kernel and wavelets methods: Optimality of Besov spaces. Statistics and Probability Letters 18: 327-336. Lemarie, P. G. (1988). Une nouvelle base d'ondelettes de L 2 (1Rn). journal de Mathematiques Pures et Appliquees 67: 227-236. Li, K. C. (1985). From Stein's unbiased risk estimates to the method of generalized cross-validation. Annals of Statistics 13: 1352-1377. Li, K. C. and Hwang,]. (1984). The data-smoothing aspect of Stein estimates. Annals of Statistics 12: 887-897. Lombard, E (1988). Detecting change points by Fourier analysis. Technometrics 30: 305-310. Mallat, S. G. (1989a). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11: 674-693. Mallat, S. G. (1989b). Multifrequency channel decomposition of images and wavelet models. IEEE Transactions on Acoustics, Speech, and Signal Processing 37: 2091-2110. Messiah, A. (1961). Quantum Mechanics. North-Holland: Amsterdam.
Meyer, Y (1985). Principe d'incertitude, bases hilbertiennes et algebres d'operateurs, Seminaire Bourbaki, 1985-1986, No. 662. Meyer, Y (1990). Ondelettes et Operateurs L· Ondelettes. Hermann: Paris. Meyer, Y (1992). Ondelettes sur l'intervalle. Revista Matematica lberoamericana 7: 115-133. Meyer, Y (1993). Wavelets: Algorithms and Applications. SIAM: Philadelphia. Moulin, P. (1993a). A wavelet regularization method for diffuse radar-target imaging and speckle-noise reduction. journal of Mathematical Imaging and Vision, Special Issue on Wavelets 3: 123-134. Moulin, P. (1993b). Wavelet thresholding techniques for power spectrum estimation. IEEE Transactions on Signal Processing 42: 3126-3136. Muller, H.-G., and Stadtmuller, U. (1987). Variable bandwidth kernel estimators of regression curves. Annals of Statistics 15: 182-201. Nason, G. (1994). Wavelet regression by cross-validation. Technical Report 447, Department of Statistics, Stanford University, Stanford California. Nason, G. P. (1995). Choice of the threshold parameter in wavelet function estimation. In Wavelets and Statistics. Antoniadis, A., and Oppenheim, G. (eds.). New York: Springer-Verlag. pp. 261-280. Nason, G. (1996). Wavelet shrinkage using cross-validation. journal of the Royal Statistical Society, Series B 58: 463-479. Ogden, R. T. (1994). Wavelet thresholding in nonparametric regression with change-point applications. Ph.D. thesis, TexasA&M University, College Station, Texas. Ogden, R. T. (1997). On preconditioning for the discrete wavelet transform when the sample size is not a power of two. Communications in Statistics B: Simulation and Computation, to appear. Ogden, R. T., and Parzen, E. (1996a). Change-point approach to data analytic wavelet thresholding. Statistics and Computing 6: 93-99. Ogden, R. T., and Parzen, E. (1996b). Data dependent wavelet thresholding in nonparametric regression with change-point applications. Computational Statistics and Data Analysis 22: 53-70. Ogden, R. T., and Richwine,]. (1996). WaveletsinBayesianchange-pointanalysis. Technical report, University of South Carolina, Columbia, South Carolina. Page, E. S. (1954). Continuous inspection schemes. Biometrika 41: 100115. Page, E. S. (1955). A test for a change in a parameter occurring at an unknown point. Biometrika 42: 523-526. Parzen, E. (1962). On estimation of a probability density function. Annals of Mathematical Statistics 31: 1065-1076. Parzen, E. (1974). Some recent advances in time series modelling. IEEE Transactions on Automatic Control19: 723-729. Prakasa Rao, B. L. S. (1983). Nonparametric Functional Estimation. Academic Press: New York.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Nu-
merical Recipes in C, the Art of Scientific Computing, 2nd edition. Cambridge University Press: Cambridge. Priestley, M. B. (1981). Spectra/Analysis and Time Series. Academic Press: New York. Richwine, J. (1996). Bayesian estimation of change-points using Haar wavelets. Master's thesis, University of South Carolina Department of Statistics, Columbia, South Carolina. Rioul, 0., and Vetterli, M. (1991). Wavelets and signal processing. IEEE Signal Processing Magazine 14-38. Ross, S. (1983). Stochastic Processes. Wiley: New York. Rudemo, H. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian journal of Statistics 9: 65-78. Schumaker, L. L. (1981). Spline Functions: Basic Theory. Wiley-Interscience: New York. Shensa, M.]. (1992). The discrete wavelet transform: Wedding the trous and Mallat algorithms. IEEE Transactions on Signal Processing 40: 24642482. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall: London. Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics 10: 1135-1151. Stone, M. (1978). Cross-validation: A review. Statistics 9: 127-140. Strang, G., and Nguyen, T. (1996). Wavelets and Filter Banks. WellesleyCambridge Press: Wellesley, MA. Stromberg, ]. 0. (1982). A modified Franklin system and higher order spline systems on IRn as unconditional bases for Hardy spaces. In Conference in Honor ofA. Zygmund, Vol. II. Beckner, A. et al. (eds.). Wadsworth Mathematics Series, pp. 475-493. Taniguchi, M. (1979). On estimation of parameters of Gaussian stationary processes. journal of Applied Probability 16: 575-591. Taniguchi, M. (1980). On estimation of the integrals of certain functions of spectral density. journal of Applied Probability 17: 73-83. Tchamitchian, Ph. (1987). Biorthogonalite et theorie des operateurs. Revista Matemdtica Iberoamericana 3: 163-189. Unser, M. (1996). A practical guide to the implementation of the wavelet transform. In Wavelets in Medicine and Biology. Aldroubi, A., and Unser, M. (eds.). CRC Press: Boca Raton, Florida. Vidakovic, B. (1994). Nonlinear wavelet shrinkage with Bayes rules and Bayes factors. Discussion Paper 94-A-24, ISDS, Duke University, Durham, North Carolina. Vidakovic, B., and Muller, P. (1994). Wavelets for kids: A tutorial introduction. Discussion Paper 94-A-13, ISDS, Duke University, Durham, North Carolina.
Wahba, G. (1980). Automatic smoothing of the log periodogram. journal of the American Statistical Association 75: 122-132. Walter, G. G. (1992). Approximation of the delta function by wavelets. journal of Approximation Theory 71: 329-343. Walter, G. G. (1994). Wavelets and Other Orthogonal Systems With Applications CRC Press: Boca Raton, Florida. Wang, Y (1995). Jump and sharp cusp detection by wavelets. Biometrika 82: 385-397. Wang, Y (1996). Function estimation via wavelet shrinkage for long-memory data. Annals of Statistics, to appear. Weaver, J. B., Yansun, X., Healy, D. M., Jr., and Cromwell, L. D. (1991). Filtering noise from images with wavelet transforms. Magnetic Resonance in Medicine 24: 288-295. Wei, W. W. S. (1990). Time Series Analysis: Univariate and Multivariate Methods. Addison-Wesley: Redwood City, California. Wertz, W. (1978). Statistical Density Estimation: A Survey. Vandenhoeck and Ruprecht: GOttingen. Weyrich, N., and Warhola, G. T. (1994). De-noising using wavelets and crossvalidation. Technical Report AFIT/EN!fR/94-01, Department of Mathematics and Statistics, Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio. Wickerhauser, M. V. (1994). Adapted Wavelet Analysis: From Theory to Software. AK Peters: Boston.
Glossary of Notation
ℝ  the set of real numbers (−∞, ∞).

ℤ  the set of integers: ℤ = {..., −1, 0, 1, ...}.

L²(I)  the set of square-integrable functions on the interval I: {f : ∫_I f²(x) dx < ∞}.

⟨f, g⟩  the L² inner product: ⟨f, g⟩ = ∫ f(x)g(x) dx.

||f||  the L² norm: ||f||² = ⟨f, f⟩.

S̄  the closure of the set S.

f ⊥ g  the functions f and g are orthogonal, i.e., ⟨f, g⟩ = 0.

V ⊥ W  the subspaces V and W are orthogonal, i.e., for every f ∈ V and g ∈ W, f ⊥ g.

⊕  the sum of orthogonal subspaces.

ℓ²(ℤ)  the set of square-summable sequences: {a_k : Σ_{k∈ℤ} a_k² < ∞}.

f|_A  the restriction of the function f to the set A.

supp f  the support of the function f.

[x]  the greatest integer function.

O(g(n))  a sequence a_n = O(g(n)) for a function g(n) if there exists some M > 0 such that |a_n|/g(n) ≤ M for all n = 1, 2, ....

f̂  alternatively (depending on context) the continuous Fourier transform of f or an estimate of f.

iid  independent and identically distributed.

[a, b)  the half-open interval {x : a ≤ x < b} (similarly for (a, b), (a, b], [a, b]).

ā  the complex conjugate of a: for a = b + ci, ā = b − ci.

δ(x)  the Dirac delta function.

→_p  convergence in probability: X_n →_p μ if for each ε > 0, P[|X_n − μ| > ε] → 0 as n → ∞.
Glossary of Terms

This glossary is included here for ready reference while reading through this book. Listings here should not, in general, be taken to be strict definitions but as basic descriptions of concepts.
approximation space  any of the V_j spaces.

basis  A collection of functions {f_1, ..., f_m} forms a basis for a space V if the f_i's span V and they are linearly independent.

cascade algorithm  see decomposition algorithm.

coefficient  A function g ∈ V can be written as a linear combination of functions in a basis {f_1, ..., f_m} for V: g = Σ a_i f_i; the constants a_1, ..., a_m are the coefficients of g with respect to the basis {f_1, ..., f_m}. If the basis is an orthonormal basis, then the coefficients are computed by a_i = ⟨g, f_i⟩, i = 1, ..., m. Each coefficient gives some information about the nature of the function g, e.g., a Fourier coefficient specifies the amount of the frequency content in g at the specified frequency.

complete orthonormal system (CONS)  A sequence of functions {f_i} is a complete orthonormal system if the f_i's are pairwise orthogonal and the only function orthogonal to each f_i is the zero function.

decomposition algorithm  fast algorithm for computing lower-level scaling function and wavelet coefficients given scaling function coefficients at a higher level (see Section 4.1).

detail function (or detail signal)  a function in the space W_j.

detail space  any of the W_j spaces.

dilation  The dilation of a function f(x) is given by f(ax) for a > 0. If a > 1, the function is stretched out over the real line; if 0 < a < 1, the function is compacted.

dilation equation  see two-scale relationship.

Dirac delta function  the function δ(x) which is defined to be infinite for x = 0 and zero for all other x's, with the property that ∫ δ(x) dx = 1. The Dirac delta function is the function for which ⟨f, δ⟩ = ∫ f(x)δ(x) dx = f(0) for any function f.

even function  a function f for which f(x) = f(−x) for all x.

father wavelet  see scaling function.
Fourier coefficient  see coefficient.

Fourier transform  The continuous Fourier transform of a function f ∈ L²(ℝ) is given by

f̂(ω) = ∫_{−∞}^{∞} f(x) e^{−iωx} dx.

The discrete Fourier transform of a function f ∈ L²(ℝ) refers to the set of Fourier coefficients {a_0, a_1, b_1, a_2, b_2, ...} from the representation

f(x) = a_0/2 + Σ_{j=1}^{∞} (a_j cos(jx) + b_j sin(jx)).
function space  a set of functions that is complete and closed under addition and scalar multiplication.

Haar wavelet  the mother wavelet defined by

ψ(x) = 1 for 0 ≤ x < 1/2,  ψ(x) = −1 for 1/2 ≤ x < 1,  ψ(x) = 0 otherwise.
inner product  The L² inner product of two functions f and g is ⟨f, g⟩ = ∫ f(x)g(x) dx.

inverse Fourier transform  A function f ∈ L²(ℝ) can be recovered from its Fourier transform f̂ by means of the inverse Fourier transform

f(x) = (1/2π) ∫_{−∞}^{∞} f̂(ω) e^{iωx} dω.

level  The wavelet level j refers to wavelets, scaling functions, and their coefficients with first subscript (dilation parameter) j.

mother wavelet  see wavelet.

modulus  The modulus of a complex-valued number a = b + ci is √(a ā) = √(b² + c²).

multiresolution analysis (MRA)  see Definition 1.4.

norm  The L² norm of a function f is given by ||f|| = √⟨f, f⟩.

odd function  a function f for which f(x) = −f(−x) for all x.

orthogonal  Two functions f and g are orthogonal if ⟨f, g⟩ = 0.

orthogonal basis  a basis whose elements are orthogonal.

orthonormal basis  a basis whose elements are orthogonal and ||f_i|| = 1, i = 1, ..., m.
projection  Let V be a subspace of L²(ℝ) and let W denote its orthogonal complement in L²(ℝ). Any function f ∈ L²(ℝ) can be uniquely decomposed as f = g + h, with g ∈ V and h ∈ W. The function g is known as the projection of f on V.

pyramid algorithm  see decomposition algorithm.

reconstruction algorithm  algorithm for computing higher-level scaling function coefficients given lower-level wavelet coefficients (see Section 4.1).

refinement equation  see two-scale relationship.

scaling function  a function φ whose translates {φ(· − k), k ∈ ℤ} form an orthonormal basis for the space V_0 in the usual multiresolution context; also called the father wavelet.

translation  The translation of a function f(x) by an amount a is f(x − a); if a > 0, the function is moved to the right; if a < 0, it is moved to the left.

two-scale relationship  equation relating a scaling function φ to its dilated versions:

φ(x) = √2 Σ_{k∈ℤ} h_k φ(2x − k).

The sequence {h_k} is known as the two-scale sequence.

vector space  a set of vectors that is complete and closed under addition and scalar multiplication.

wavelet  a function ψ whose translates {ψ(· − k), k ∈ ℤ} form an orthonormal basis for the space W_0, where V_1 = V_0 ⊕ W_0 in the usual multiresolution context. The term wavelet is often used generically to refer to any dilated and translated version ψ_{j,k} = 2^{j/2} ψ(2^j · − k) of the mother wavelet ψ.

wavelet coefficient  see coefficient.
Index

a trous algorithm 117 Akaike Information Criterion (AIC) 146 approximation 9, 10, 11, 13, 14, 89 approximation spaces 15, 18, 84, 115, 201 dual 84 two-dimensional 172 autocorrelation function 134 autocovariance function 134
cumulative distribution function (cdf) 29 cumulative sum (CUSUM) process 141, 153 cusps 140-142
B-splines 24, 79 bandwidth 33, 35, 36, 40, 44, 47, 144 basis 3, 186, 201 orthogonal6, 187 orthonormal 187 Battle-Lemarie wavelet family 23-24, 81-82 Bayes rule 162 best basis 174, 177-180 binwidth 31-32 blocky function 97-99, 149 Bonferroni correction 151 boundary handling periodic 111-112, 119 symmetric 112-113 boxplot 101-102 Brownian bridge 154
Daubechies wavelet family 25-27, 82-83 decomposition algorithm 20, 51, 53, 59-66, 108, 170, 178, 201 two-dimensional 170 density estimation 29-38, 119, 143 histograms 31-32 kernel 32-35 naive 33, 34 orthogonal series 35-38 "raw" 36, 37, 50 wavelet 49-54, 132-133 detail function 11, 13, 14, 90, 201 detail image 169 detail signal 23, 90, 168 detail spaces 17, 18, 84, 86, 114, 169, 175,201 dilation 8, 174, 201 equation 60 index 8 Dirac delta function 36-37, 163, 201 Doppler function 95 down-sampling 63, 66, 96, 179
cascade algorithm 63 change-point problem 140-142 Chui-Wang wavelet family 24-25 coiflets 83 complete orthonormal system (CONS) 7,9, 15,42,60,62,110,201 convolution 40 discrete 67 coefficient 201 correlogram 136 covariance function 134 cross-validation 156-161
edge detection 141 empirical Bayes 163 empirical distribution function 32, 37 even function 5, 45, 201 fast Fourier transform 105-107, 109, 116 father wavelet (see scaling function) 15 filters 66-69, 82, 109, 170, 178 high pass 68 low pass 68, 109 quadrature mirror 68
Fourier coefficients 4-5, 69, 70, 104-105 Fourier series representation 3, 4, 35, 70, 134 Fourier transform continuous 69-72, 202 discrete 1-7, 202 of data 104-107, 134 windowed 72-74 frames 86 function space 188, 202 Gabor transform 72-74, 92-93 Gibbs phenomenon 183 gray scale 22, 171
library of basis functions 174, 177 loss 42 mean square error (MSE) 41-42, 129, 148 median absolute deviation (MAD) 131-132 Meyer boundary wavelets 113-114 Meyer wavelet basis 23, 83 modulation parameter 174 multiple comparisons 151 multiresolution analysis (MRA) 14-16, 22-23,60,168,202 multiresolution approximation plot 89-91 multiresolution decomposition plot 90-92
Haar function 7, 11, 14 scaling function 49, 59 system 7-23, 175-176 transform 110-111 wavelet 9, 17, 59, 78, 184, 202 Heisenberg Uncertainty Principle 74 Hermite polynomials 42 Hilbert spaces 188 histograms 31-32 wavelet (Haar) based 49-52
natural frequencies 104-1 OS noise level, estimation of 131-132 nonparametric regression 3s-44, 110, 119, 143-165 Fourier series 43 kernel39-42,45-47 orthogonal series 42-44, 45-47 "raw" 39,43 wavelet 54-58 norm 5, 202
image analysis 22-23, 141, 171 indicator function 15 inner product 5, 187, 202 integrated mean absolute error (IMAE) 52 integrated mean squared error (IMSE) 51
odd function 5, 202 Old Faithful data 32, 36, 52, 54 orthogonal 6, 202 orthogonality of scaling function and wavelet 60-62 orthonormality 6, 174 oscillation parameter 174
jumps, detecting 140-142
Parseval identity 71-72, 76, 138 wavelet version 78, 144, 157, 181 parsimonious representation 22, 23 periodogram 137, 138-139 probability density function (pdf) (see density) 30 projection 14-15, 188, 202 onto approximation space 14-15, 16, 53, 55,89,172 pruning 179 pyramid algorithm 63
kernel function 33, 34, 39 biweight 34 Dirichlet 46 Epanechnikov 34, 40 Gaussian 34 higher-order 34 triangular 34 wavelet-based 56 L2 function space 2, 188 Legendre polynomials 7, 42 "Lena" 171-173 level 202
Q-Q plot 100-101 quality control141 quantiles 100
reconstruction algorithm 22, 63--66, 170,202 coefficients 16, 19 two-dimensional 170 refinement equation 60 reflection boundary handling 45, 112-113 regression function 38 estimation (see nonparametric regression) Riesz basis 86, 87 risk 42, 144, 157 sample size 143 not a power of two 115-117 selective wavelet reconstruction 120-121 semiorthogonal wavelets 24 scaling function 15, 16, 174, 202 coefficients 62, 65, 108, 170 scalogram 93-96 spatial adaptivity 126-128 span 16, 186 sparsity of wavelet representation 101, 120, 132, 148 spectral density 119, 134-135, 138 estimation 133-140 sample 136-137 spectrogram 92-93 Spin Cycle 184 splitting algorithm 179 subspace 186, 202 support 202 symmlets 83 tensor product 168 thresholding 124-126, 133, 143-165, 144 Bayesian methods 161-165 cross-validation 156-161 global128-131, 147, 155 hard 125-126 hypothesis testing 149-156 false discovery rates 154-156
recursive 151-154 minimax 100, 129-130 soft 125-126, 143 SURE 144-149, 151, 184, hybrid 148-149 universal (VisuShrink) 130-131, 156, 159, 160, 184 time-frequency localization 69-79 time-frequency plane 17 5 time-scale plots 92-95 time series 119, 134-140 translation 8, 174, 202 index 8 translation invariance 180-184 two dimensional wavelets 167-174 two-scale sequence 60, 65, 174 two-scale relation 59, 60-62, 87, 169, 174, 202 vector space 185, 202 wavelet 16, 18, 23, 60, 75, 202 biorthogonal 79, 83--87, 117 coefficients 21, 62, 65, 95-102, 108, 112-113 empirical121-122, 123-124, 132, 143, 150, 151, 161, 163 plotting 95-102 mother 8, 85, 174 on an interval 110-115 orthogonal 25-26, 81-83 representation 12, 16-22 semiorthogonal 79, 87-88 wavelet packets 173-180 wavelet transform continuous 7 4-79 of data 107-110, 117 matrix representation 123 orthogonality 116 window 72-73 lag 138 spectral 13 7 "zoom-in, zoom-out" 23