Feedforward Neural Network Methodology
Terrence L. Fine
Springer
Preface
The decade prior to publication has seen an explosive growth in computational speed and memory and a rapid enrichment in our understanding of artificial neural networks. These two factors have cooperated to at last provide systems engineers and statisticians with a working, practical, and successful ability to routinely make accurate, complex, nonlinear models of such ill-understood phenomena as physical, economic, social, and information-based time series and signals and of the patterns hidden in high-dimensional data. The models are based closely on the data itself and require little prior understanding of the stochastic mechanisms underlying these phenomena. Among these models, the feedforward neural networks, also called multilayer perceptrons, have lent themselves to the design of the widest range of successful forecasters, pattern classifiers, controllers, and sensors. In a number of problems in optical character recognition and medical diagnostics, such systems provide state-of-the-art performance, and such performance is also expected in speech recognition applications. The successful application of feedforward neural networks to time series forecasting has been demonstrated repeatedly, quite visibly in the formation of market funds in which investment decisions are based largely on neural network–based forecasts of performance. The purpose of this monograph, accomplished by exposing the methodology driving these developments, is to enable you to engage in these applications and, by being brought to several research frontiers, to advance the methodology itself. The focus on feedforward neural networks was also chosen to enable a coherent, thorough presentation of much of what is currently known—the rapid state of advancement of this subject precludes a comprehensive and final disposition.

Chapter 1 provides some of the historical background, a rapid survey of successful applications, a transition from neurobiological origins to a mathematical model, and four questions that provide an organizational theme of this monograph. In Chapter 2 we treat the case of a single hard-limiting neuron, the traditional perceptron of the pioneer Frank Rosenblatt, and then proceed in Chapter 3 to the multilayer perceptron with several hard-limiting nodes. The material in these first chapters, except Sections 3.2 and 3.5, is largely independent of that covered in the remainder. Chapter 4 introduces the mathematical properties of multilayer perceptrons with the differentiable, saturating node functions that have enabled the successful applications of neural networks starting in the late 1980s. Chapter 5 addresses the variety of computer-intensive training algorithms needed to estimate the network parameters by minimizing a quadratic sum of approximation errors to a training set. Chapter 6 considers several approaches to the thorny problem of network architecture selection, a pressing problem in the methodology of neural networks. Chapter 7 presents a variety of results on the generalization behavior of trained neural networks, an area that is still developing. We conclude with an appendix on our experience teaching this material to senior and first-year graduate electrical engineering students. MATLAB programs, placed in appendices, are provided to illustrate the algorithms in detail. While we use these programs in our applications and research, their inclusion here is intended only for instructional purposes.

We commonly proceed at a mathematical level appropriate to a substantial first-year graduate-level course for students with systems and statistical (as distinct from neurobiology or cognitive psychology) orientations to the area of neural networks. Such readers (e.g., senior and first-year graduate students in engineering, especially electrical and systems engineering, statistics, operations research, and computer science students oriented toward machine intelligence) are interested in pursuing substantial applications drawn from pattern classification, signal processing, time-series forecasting, and regression problems or in developing further the methodology by which such networks are designed. It is assumed that the reader has had a course in probability at an undergraduate level and has the mathematical maturity to follow an argument when provided with the definitions of unfamiliar terms and statements of unfamiliar theorems.

My colleague Thomas W. Parks first aroused my interest in the renaissance that neural networks were undergoing in the late 1980s, and our research in this area was encouraged by early support from Dr. Barbara Yoon’s program in neural networks at DARPA. As part of this effort, I originated a course, EE577 Artificial Neural Networks, in the School of Electrical Engineering, Cornell University, whose evolution and need for a text resulted in this monograph. Regular discussions of research on neural networks and their applications with Jen-Lun Yuan, Michael John Turmon,
Chiu-Fai (Wendy) Wong, and Sayandev Mukherjee have contributed the largest part of what I have learned about the subject. I am truly grateful to my editor, John Kimmel, for securing constructive readings of portions of earlier versions of this book by the outstanding scholars and researchers Shun-Ichi Amari, Andrew Barron, Peter Bartlett, Eduardo Sontag, and Mathukumalli Vidyasagar. Ronald DeVore and Kenneth Constantine provided important comments on elements of the manuscript. All of the valued assistance I have received notwithstanding, the omissions and errors that surely remain are my responsibility. Accepting this, for at least a year following publication, I will maintain a web site at http://www.ee.cornell.edu/~tlfine to post notices of errata and improvements of results. Readers are invited to send to
[email protected] findings of errors or citations to better results for posting to the web site; credit will be given unless the reader requests otherwise. Much of the material covered is still in a state of flux, with new research contributions appearing each month. I expect that the subject will continue to develop significantly long after this monograph is published, and I hope that you, the reader, will be among those contributing to our understanding and to successful and novel applications of feedforward neural networks.
Terrence L. Fine
School of Electrical Engineering
Cornell University
Contents
Preface

List of Figures

1 Background and Organization
  1.1 Objectives and Setting
  1.2 Motivation: Why Care about Artificial Neural Networks?
    1.2.1 Biological and Philosophical
    1.2.2 Computational
    1.2.3 Systems Applications
  1.3 Historical Background
  1.4 Biological Bases for Neuronal Modeling
    1.4.1 Neuronal Properties
    1.4.2 Mathematical Model of a Neuron
  1.5 Organizational Approach
    1.5.1 Role of Networks in Function Representation
    1.5.2 Issues in Function Representation
    1.5.3 Order of Treatment

2 Perceptrons—Networks with a Single Node
  2.1 Objectives and Setting
  2.2 Linear Separability
    2.2.1 Definitions
    2.2.2 Characterizations through Convexity
    2.2.3 Properties of Solutions
  2.3 The Number of Learnable Training Sets
    2.3.1 The Upper Bound D(n, d)
    2.3.2 Achieving the Upper Bound
  2.4 Probabilistic Interpretation, Asymptotics, and Capacity
  2.5 Complexity of Implementations
  2.6 Perceptron Training Algorithm in the Separable Case
    2.6.1 The PTA
    2.6.2 Convergence of the PTA
    2.6.3 First Alternatives to PTA
  2.7 The Nonseparable Case
    2.7.1 Behavior of the PTA
    2.7.2 Further Alternatives to PTA
  2.8 The Augmented Perceptron—Support Vector Machines
  2.9 Generalization Ability and Applications of Perceptrons
  2.10 Alternative Single-Neuron Models
    2.10.1 Binary-Valued Inputs
    2.10.2 Sigmoidal Nodes and Radial Basis Functions
  2.11 Appendix 1: Useful Facts about Binomial Coefficients and Their Sums
  2.12 Appendix 2: Proofs of Results from Section 2.3
  2.13 Appendix 3: MATLAB Program for Perceptron Training

3 Feedforward Networks I: Generalities and LTU Nodes
  3.1 Objectives and Setting
  3.2 Network Architecture and Notation
    3.2.1 Node Types
    3.2.2 Architecture
    3.2.3 Notation
    3.2.4 Purpose of the Network
  3.3 Boolean Functions
  3.4 Learning Arbitrary Training Sets
    3.4.1 Error-Free Learning
    3.4.2 Learning with Errors
  3.5 The Number of Learnable Training Sets
    3.5.1 Growth Function and VC Capacity
    3.5.2 VC Capacity of Networks
  3.6 Partitioning the Space of Inputs
    3.6.1 Enumeration
    3.6.2 Construction of Partitions
    3.6.3 Limitations of a Single Hidden Layer Architecture
  3.7 Approximating to Functions
  3.8 Appendix: MATLAB Program for the Sandwich Construction
  3.9 Appendix: Proof of Theorem 3.5.1

4 Feedforward Networks II: Real-Valued Nodes
  4.1 Objectives and Setting
    4.1.1 Objectives
    4.1.2 Setup
    4.1.3 Single Hidden Layer Functions
    4.1.4 Multiple Hidden Layers–Multilayer Perceptrons
  4.2 Properties of the Representation by Neural Networks
    4.2.1 Uniqueness of Network Representations of Functions
    4.2.2 Closure under Affine Combinations and Input Transformations
    4.2.3 Regions of Constancy
    4.2.4 Stability: Taylor's Series
    4.2.5 Approximating to Step, Bounded Variation, and Continuous Functions
  4.3 Formalization of Function Approximation
  4.4 Outline of Approaches
  4.5 Implementation via Kolmogorov's Solution to Hilbert's 13th Problem
  4.6 Implementation via Stone-Weierstrass Theorem
  4.7 Single Hidden Layer Network Approximations to Continuous Functions
  4.8 Constructing Measurable etc. Functions
    4.8.1 Measurable Functions
    4.8.2 Enumerating the Constructable Partition Functions
    4.8.3 Constructing Integrable Functions
    4.8.4 Implementing Partially Specified Functions
  4.9 Achieving Other Approximation Objectives
    4.9.1 Approximating to Derivatives
    4.9.2 Approximating to Inverses
  4.10 Choice of Node Functions
  4.11 The Complexity of Implementations
    4.11.1 A Hilbert Space Setting
    4.11.2 Convex Classes of Functions
    4.11.3 Upper Bound to Approximation Error
    4.11.4 A Class of Functions for which d_ā = 0
  4.12 Fundamental Limits to Nonlinear Function Approximation
  4.13 Selecting an Implementation: Greedy Algorithms
  4.14 Appendix 4.1: Linear Vector Spaces
  4.15 Appendix 4.2: Metrics and Norms
  4.16 Appendix 4.3: Topology
  4.17 Appendix 4.4: Proof of Theorem 4.11.1

5 Algorithms for Designing Feedforward Networks
  5.1 Objectives and Setting
    5.1.1 Error Surface
    5.1.2 Taylor's Series Expansion of the Error Function
    5.1.3 Multiple Stationary Points
    5.1.4 Outline of Algorithmic Approaches
  5.2 Backpropagation Algorithm for Gradient Evaluation
    5.2.1 Notation
    5.2.2 Gradients for a Single Hidden Layer Network
    5.2.3 Gradients for MLP: Backpropagation
    5.2.4 Calculational Efficiency of Backpropagation
  5.3 Descent Algorithms
    5.3.1 Overview and Startup Issues
    5.3.2 Iterative Descent Algorithms
    5.3.3 Approximate Line Search
    5.3.4 Search Termination
  5.4 Steepest Descent
    5.4.1 Locally Optimal Steepest Descent
    5.4.2 Choice of Constant Learning Rate α
    5.4.3 Learning Rate Schedules
    5.4.4 Adaptively Chosen Step Size
    5.4.5 Summary of Steepest Descent Algorithms
    5.4.6 Momentum Smoothing
  5.5 Conjugate Gradient Algorithms
    5.5.1 Conjugacy and Its Implications
    5.5.2 Selecting Conjugate Gradient Directions
    5.5.3 Restart and Performance
    5.5.4 Summary of the Conjugate Gradient Directions Algorithm
  5.6 Quasi-Newton Algorithms
    5.6.1 Outline of Approach
    5.6.2 A Quasi-Newton Implementation via BFGS
  5.7 Levenberg-Marquardt Algorithms
  5.8 Remarks on Computing the Hessian
  5.9 Training Is Intrinsically Difficult
  5.10 Learning from Several Networks
  5.11 Appendix 1: Positive Definite Matrices
  5.12 Appendix 2: Proof of Theorem 5.5.3
  5.13 Appendix 3: MATLAB Listings of Single-Output Training Algorithms
    5.13.1 1HL Network Response, Gradient, and Sum-Squared Error
    5.13.2 1HL Steepest Descent
    5.13.3 1HL Conjugate Gradient
    5.13.4 1HL Quasi-Newton
    5.13.5 1HL Levenberg-Marquardt
  5.14 Appendix 4: MATLAB Listings of Multiple-Output Quasi-Newton Training Algorithms

6 Architecture Selection and Penalty Terms
  6.1 Objectives and Setting
    6.1.1 The Issue
    6.1.2 Formulation as Model Selection
    6.1.3 Outline of Approaches to Model Selection
  6.2 Bayesian Methods
    6.2.1 Setting
    6.2.2 Priors
    6.2.3 Likelihood and Posterior
    6.2.4 Bayesian Model and Architecture Selection
  6.3 Regularization
    6.3.1 General Theory for Inverse Problems
    6.3.2 Lagrange Multiplier Formulation
    6.3.3 Application to Neural Networks
  6.4 Information-Theoretic Complexity Control
    6.4.1 Kolmogorov Complexity
    6.4.2 Application to Neural Networks
  6.5 Stochastic Complexity
    6.5.1 Code Length Measure of Complexity
    6.5.2 Application to Neural Networks
  6.6 Overfitting Control
  6.7 Growing and Pruning Architectures
    6.7.1 Growing a Network
    6.7.2 Pruning a Network

7 Generalization and Learning
  7.1 Introduction and Network Specification
  7.2 Empirical Training Set Error
  7.3 Gradient-Based Training Algorithms
  7.4 Expected Error Terms
  7.5 Data Sets
  7.6 Use of an Independent Test Set C_k
    7.6.1 Limiting Behavior
    7.6.2 Central Limit Theorem Estimates of Fluctuation
    7.6.3 Upper Bounds to Fluctuation Measures
  7.7 Creating Independence: Cross-Validation and Bootstrap
    7.7.1 Cross-Validation
    7.7.2 Bootstrap
  7.8 Uniform Bounds—VC Approach
    7.8.1 Introduction
    7.8.2 The Method of Vapnik-Chervonenkis
    7.8.3 Bounds Uniform Only over the Network Parameters
    7.8.4 Fat-Shattering Dimension
  7.9 Convergence of Parameter Estimates
    7.9.1 Outline of the Argument
    7.9.2 Properties of the Minima of Generalization Error
    7.9.3 Vapnik-Chervonenkis Theory and Uniform Bounds
    7.9.4 Almost Sure Convergence of Estimated Parameters into the Set of Minima of Generalization Error
  7.10 Asymptotic Distribution of Training Algorithm Parameter Estimates
  7.11 Asymptotics of Generalization Error: Learning Curves
  7.12 Asymptotics of Empirical/Training Error
  7.13 Afterword
  7.14 Appendix 1: Proof of Theorem 7.8.2

A Note on Use as a Text
  A.1 Overview
  A.2 Exercises for the Chapters
  A.3 MATLAB Program Listings

References

Index
List of Figures
1.1 Logistic function.
1.2 Percentage error in approximating logistic by linear function.

2.1 Linear separability and nonseparability.
2.2 Probability mass function for d* when n = 50.
2.3 Probability mass function for N* when d = 10.
2.4 PTA convergence behavior.
2.5 PTA behavior under nonseparability.
2.6 Ridge function.
2.7 Projection construction.

3.1 Network elements.
3.2 Notation for neural networks.
3.3 Sandwich construction.
3.4 Polyhedral region recognition.
3.5 Gibson's inconsistent hyperplane.

4.1 Examples of functions in N_{1,σ} for d = 1 and s_1 = 1, 2, 3, 4.
4.2 Examples of functions in N_{1,σ} for d = 2.
4.3 Kolmogorov theorem network.
4.4 Approximations to a quadratic for s_1 = 2, 4 nodes.
4.5 Approximations to a sinusoid for s_1 = 4, 8 nodes.

5.1 Two views of an error surface for a single node.
5.2 Contour and gradient plots for quadratics.
5.3 Information flow in backpropagation.
5.4 Training E_T (solid) and validation E_v (dashed) error histories.
5.5 Optimal steepest descent on quadratic surface.
5.6 Descent behavior with small learning rate.
5.7 Descent behavior with large learning rate.
5.8 Sum-squared error and learning rate history.
5.9 Variation of parameter value and sum-squared error.
5.10 Conjugate gradient trajectory, r = 5.

7.1 Training and Generalization Errors Showing Overfitting.
7.2 Log upper bound for ε = .1 and v = 50.
1 Objectives, Motivation, Background, and Organization
1.1 Objectives and Setting

The study of neural networks originated in attempts to understand and construct mathematical models for neurobiology and cognitive psychology, and their current development continues to shed light in these areas. However, our focus will be on their role in a methodology for treating large classes of statistical and engineering problems. Feedforward artificial neural networks and their computer-intensive design (training) algorithms have, for the first time, provided the engineering and statistics communities with an effective methodology for the construction of truly nonlinear systems accepting large numbers of inputs and achieving marked success in applications to engineering and statistics problems of classification, regression, and forecasting. Neural networks are fitted to applications through a computer-intensive methodology of learning from examples that requires relatively little prior knowledge—as opposed to fitting through incorporation of expert beliefs or heuristic programs that require the network designer to understand the essentials of the application. Of course, some prior understanding of the nature of the application is necessary and is reflected in the choice of input variables and network architecture. However, as with other statistical nonparametric methods, we do not need to assume a finitely parameterized probability model (e.g., multivariate normal/Gaussian). Usually little more is assumed known than that the data (training set) comprises independent pairs of network inputs (measurements that characterize the realm of interest, feature vectors) and desired outputs (targets, pattern classes, future
values). Our goal is to expose the mathematical properties of these systems so as to delineate their capabilities, and to present the methodology by which they are deployed to confront applications. We assume knowledge of probability and statistics at the level of a good undergraduate course (somewhat more being desirable to follow the material of Chapters 6 and 7) and require a degree of mathematical maturity, perhaps first available at a first-year graduate level, construed as the ability to follow an abstract argument into unfamiliar terrain when definitions of new terms and relevant unfamiliar theorems are provided. Familiarity with the elements of real analysis (e.g., as presented in Royden [203] or Rudin [204] ) would be helpful with the material of Chapter 4. For purposes of initial orientation (we return to this in Section 3.2), we assert that artificial neural networks are networks or systems formed out of many highly interconnected nonlinear memoryless computing elements or subsystems; biological neurons have refractory periods and therefore have memory. The pattern of interconnections can be represented mathematically as a weighted, directed graph in which the vertices or nodes represent basic computing elements (functions), the links or edges represent the connections between elements, the weights represent the strengths of these connections, and the directions establish the flow of information and more specifically define inputs and outputs of nodes and of the network. The pattern of interconnections considered without specification of the weights is referred to as the neural network architecture. Architectures are loosely modeled on, and motivated by, those of the human brain in which the brain itself or a selected functional area (e.g., vision, hearing, balance) is the neural network, the elements/nodes are individual neurons, accepting generally many inputs at dendrites, emitting single outputs at axons, and connecting outputs to inputs of likely many other neurons at unidirectional junctions called synapses. Neuroscience and neurobiology concern themselves with the actual structure and function of living nervous systems and will not be addressed beyond a few words in this chapter. The philosophy motivating the study of neural-based intelligent systems is referred to as connectionism—intelligent responses emerging from the complexity of interconnections between many simple elements, and it is a form of emergentism. The work of Hopfield in [106] helped ignite the current interest in neural networks, and his views are expressed in the following: Computational properties of use to biological organisms or to the construction of computers can emerge as collective properties of systems having a large number of simple equivalent components (or neurons). . . The collective properties are only weakly sensitive to details of the modeling or the failure of individual devices. . .
A study of emergent collective effects and spontaneous computation must necessarily focus on the nonlinearity of the input-output relationship.

Nonlinearity is essential (see Theorem 4.7.2) if we are to be able to create increasingly complex systems by adding components/neurons. A linear system with a fixed number d of inputs and o of outputs can always be represented by a d × o matrix, no matter how “complex” or redundant we make the actual linear mapping. Hence, the need for nonlinearity.

Interesting and important though they are, we pay only scant attention to the biological, psychological, and philosophical motivations for, and the implications that can be drawn from, the study of such networks. Our aim is a mathematical treatment that is of value to those wishing to understand the capabilities and limitations of an important class of artificial neural networks and to apply them to problems of engineering and statistical significance. We will gain an understanding of the capabilities of a feedforward neural network, also called a multilayer perceptron, a network in which the directed graph establishing the interconnections has no closed paths or loops. Such networks have significant computational powers but no internal dynamics. In the interests of a coherent development, we will restrict attention to those networks having several (denoted throughout by d) real-valued inputs x = {x_1, . . . , x_d} ∈ IR^d. Results that are specialized for, say, Boolean-valued inputs (x ∈ {0, 1}^d) will not be treated beyond brief observations made in Sections 2.10 and 3.3. Such results are available from discussions of circuit complexity (e.g., [201]) and the synthesis of Boolean functions (e.g., [253]).

To accomplish our aims we need both mathematical analyses and opportunities for experience gained through simulations. In our teaching of this material (see Appendix A Note on Use as a Text) we have used MATLAB for our simulations because it is easily learned by engineering students and is easily adapted to our instructional needs. There is a large literature on the subject of artificial neural networks. Many of the books have been either at a broad but rudimentary level or edited collections that lack coherence and finish, but the last few years have provided several substantial books that are broader in scope than this one (see the review by Fine [74]). Noteworthy among these latter books are Bishop [29], Hassoun [97], Haykin [100], and Hertz et al. [103]. Journals devoted to neural networks include [40, 112, 178, 179]. Conference proceedings are plentiful; the most carefully refereed is [1].
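To see concretely why nonlinearity is needed, consider the following short MATLAB fragment (our own illustrative sketch, not drawn from the text; the dimensions and weights are arbitrary). Cascading any number of purely linear layers collapses into multiplication by a single matrix, so adding linear components cannot enlarge the class of mappings realized.

% Cascading linear "layers" collapses to a single matrix: no gain in expressive power.
d = 4; o = 2;                      % numbers of network inputs and outputs
W1 = randn(6, d);                  % first linear "layer": d inputs -> 6 intermediate units
W2 = randn(o, 6);                  % second linear "layer": 6 units -> o outputs
x  = randn(d, 1);                  % an arbitrary input vector

y_cascade = W2 * (W1 * x);         % response of the two-layer linear cascade
M = W2 * W1;                       % the equivalent single linear map
fprintf('discrepancy between cascade and single matrix: %.2e\n', norm(y_cascade - M*x));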
1.2 Motivation: Why Care about Artificial Neural Networks?
The original interest in these systems arises from the belief that they may enable us to better understand the brain, human cognition, and perception and from the hope that these brainlike systems may enjoy greater success than has been possible at such tasks as pattern classification, decision-making, forecasting, and adaptive control. We sketch reasons for an interest in artificial neural networks that are drawn from the performance of brains, high-speed computation, and systems applications.
1.2.1 Biological and Philosophical
Human brains provide proof of the existence of neural networks that can succeed at those cognitive, perceptual, and control tasks in which humans are successful. Rough arguments from neurobiology suggest that the cycle time of an individual human neuron is O(10^-3) seconds, for a clock rate of less than 1 kHz. This compares with current computers operating on a cycle time of O(10^-9) seconds, for a clock rate of about 1 GHz, a factor of more than 10^6 faster. Nevertheless, the brain is capable of computationally demanding perceptual acts (e.g., recognition of faces and words) and control activities (e.g., walking, control of body functions) that are only now on the horizon for computers. An advantage of the brain is its effective use of massive parallelism, although there are other features of the brain (e.g., the large numbers of different kinds of neurons, chemical processes that modulate neuronal behavior) that are probably also essential to its effective functioning. Although the brain is an argument for the effectiveness of massively parallel processors, it is not certain that in neural networks we have abstracted the essential features that make for effectiveness. However, that we may succeed without closely imitating nature is evident when we recall that we did not succeed in flying by imitating birds flapping their wings.

Artificial neural networks provide fruitful models and metaphors to improve our understanding of the human brain. The familiar serial computer with its precise spatially allocated functions of memory, computation, control, and communications is a poor metaphor for a brain. Memory in a brain is distributed, with your memory of, say, a face not precisely allocated to a small group of neurons in the way that data are assigned to specific memory locations on a workstation. Nor are the control and clock-slaved timing of a serial computer apt models for the asynchronous operation of a brain. Insofar as we believe artificial neural networks to have abstracted essential elements of brain functioning, we are free to explore their consequences analytically and through simulations without the constraints and confusions attendant upon studying living organisms having rights and interests that compromise experimentation.
The philosophical themes of connectionism (e.g., Churchland [38]), with knowledge stored in the pattern of connection strengths, and of emergentism, in which intelligent behavior emerges through the collective action of large numbers of simple and unintelligent elements, gain concrete expression through their embodiment in the software and hardware of artificial neural networks, and this embodiment provides insight into the nature of cognition and perception. Neural networks prove rigorously that you do not need a detailed master program or “homunculus” to develop intelligent behavior. This is not to say that the abstraction to neural networks has captured most of what is essential to cognition and perception. Rather, enough has been abstracted to enable the construction of systems capable of performing some of the tasks assigned to cognition and perception, and the limits to this have yet to be reached.
1.2.2 Computational
If one judges computational ability on the basis of storage, as measured by the number of interconnects (synapses) or weights at which a value can be stored, and on the basis of speed, as measured by the rate of change of storage per second, then the following estimates have been given of the computational abilities of various organisms:

Organism    Storage    Speed
Worm        5000       500,000
Fly         10^8       10^9
Bee         5×10^9     5×10^11
Human       10^14      10^16
These numbers compare with storage of less than 10^9 bytes in personal computer and workstation RAM and a clock speed of about 10^9 Hz for the best current single processor computers and suggest that the biological organization of computation, as reflected in neural networks, is worthy of notice and respect. A key element is that computation in a neural network is distributed throughout a network so that it might be carried out nearly simultaneously. The artificial neural network architecture is inherently massively parallel and should execute at high speed—if it can be implemented in hardware. In a multilayer network there is the delay encountered in feeding a signal forward through the individual processing layers. However, this delay is only the product of the number of layers, generally no more than three, and the processing delay in a given layer. Processing within a layer should be rapid as it is generally only a memoryless nonlinear operation on a scalar input that is the scalar product of the vector or array of network inputs with an array of connection weights.

Neural networks are expected to possess the property of graceful degradation of performance with component failure and robustness with respect to variability of component characteristics. The property of robustness with respect to component variability is widely believed, but has been little studied (e.g., [126, 167, 228]). In part as a consequence of the training algorithms through which the network “learns” its task (suggestive anthropomorphic language such as “the network learns” or “understands” or “recognizes” is common in discussions of artificial neural networks, and it should be treated circumspectly), a network with many nodes or elements in a given layer has a response that is usually not highly dependent on any individual node. In such a case, failure of a few nodes or of a few connections to nodes should only have a proportionate effect on the network response—it is not an all-or-nothing architecture. This property holds true of the brain, as we know from the death of neurons over the human lifetime and from the ability of humans to function after mild strokes in which a small portion of the brain is destroyed. It would also appear to be the case in sensory signal processing; there are many sensory inputs (say visual or auditory) and the absence of a few inputs is unlikely to confuse the organism. For example, in the optic nerve bundle there are O(10^6) neurons from the retina, and visual discrimination is maintained even with the failure of many of these neurons. Of course, it is possible to set up a network having a critical computing path, and in such a case we do not expect graceful degradation.

Admittedly, neural network applications are still commonly executed in emulations in FORTRAN or C and occasionally in MATLAB, and these emulations cannot enjoy some of the advantages just enumerated. However, parallel processing languages are becoming available for use with multiprocessor computing environments (e.g., supercomputers with hundreds of processors) and modern operating systems support program processes containing multiple threads that can be run separately on multiprocessor computers. There are now special-purpose truly parallel hardware implementations of neural networks in VLSI; this is an area in which citations are quickly outdated, but see the special issue of the IEEE Trans. on Neural Networks [208], Mead [162], Shibata et al. [216], Platt et al. [185]. These implementations provide the highest computational speeds, but they are as yet insufficiently flexible to accommodate the range of applications. This situation, however, is changing.
1.2.3 Systems Applications
The strong interest in neural networks in the engineering community is fueled by the many successful and promising applications of neural networks to tasks of pattern classification, signal processing, forecasting, regression, and control. Statistical theory has always encompassed optimal nonlinear processing systems. For example, the optimality of the conditional expectation E(Y|X) has long been known in the context of the least mean square criterion min_f E||Y − f(X)||^2 when we attempt to infer Y from a well-
selected function f of the data/observations X. However, there has been little practical implementation of such nonlinear processors and none when the dimension d of the inputs X is large compared to unity. Actual implementations have generally been linear in X, linear in some simple fixed nonlinear functions of the components of X, or linear followed by a nonlinear scaling. This has been especially true when confronted with problems having a large number of input variables (e.g., econometric models, image classification, perception). In such cases, the usual recourse is to linear processing based on knowledge of only means and covariances or correlations. Nonlinear processor design usually requires knowledge of higher-order statistics such as joint distributions, and this knowledge is rarely available. Nonparametric, robust, and adaptive estimation techniques have attempted to cope with only partial statistical knowledge but have had only limited success in real applications of any complexity. The neural networks methodology enables us to design useful nonlinear systems accepting large numbers of inputs, with the design based solely on instances of input-output relationships (e.g., pairs {(x_i, t_i)} of feature vector x and pattern category t).

For purposes of motivation, we now list a few applications where neural networks have achieved success. Appropriate applications for neural networks are indicated by the presence of many possible input variables (e.g., large numbers of image pixels, time or frequency samples of a speech waveform, historical variables such as past hourly stock prices over a long period of time), such that we do not know a priori how to restrict attention to a small subset as being the only variables relevant to the classification or forecasting task at hand. Furthermore, we should anticipate a nonlinear relationship between the input variables and the variable being calculated (e.g., one of finitely many pattern categories such as the alphanumeric character set or a future stock price). In some instances (e.g., see below for optical character recognition) these neural network systems provide state-of-the-art performance in areas that have been long-studied. Neural networks make accessible in practice what has hitherto been accessible only in well-studied principle. It is characteristic of human sensory processing systems that they accept many inputs of little individual value and convert them into highly reliable perceptions or classifications of their environment. Classification problems suitable for neural network applications may be expected to share this characteristic. Examples of pattern classification applications include (specific citations are likely to be surpassed in performance):

• Classification of handwritten characters (isolated characters or cursive script) (e.g., LeCun et al. [138], Knerr et al. [127]);
• Image classification (e.g., classification of satellite photographs as to land uses [54], face detection [200]);

• Sound processing and recognition (e.g., speech recognition of spoken digits, phonemes and words in continuous speech; see Lawrence et al. [136], Zavaliagkos et al. [260]);

• Target recognition (e.g., radar and sonar signal discrimination and classification).

Neural network–based approaches have achieved state-of-the-art performance in applications to the pattern classification problem of the recognition or classification of typed and handwritten alphanumeric characters that are optically scanned from such surfaces as envelopes submitted to the US Postal Service (Jackel et al. [113]), application forms with boxes for individual alphanumeric characters (Shustorovich and Thrasher [218]), or input from a touch terminal consisting of a pad and pen that can record the dynamical handwriting process (Guyon et al. [86]), and in identifying and reading information-bearing fields in documents (LeCun et al. [141]). In such applications, it is common to have several hundred input variables.

Regression or forecasting problems also confront us with a variety of possible variables to use in determining a response variable such as a signal at a future time. Particular examples of forecasting in multivariable settings that have been the subject of successful neural network-based approaches include forecasting demand for electricity (e.g., Yuan [257, 258]), forecasting fluctuations in financial markets (e.g., currency exchanges), and modeling of physical processes (e.g., Weigend and Gershenfeld [246]).

Control of dynamical systems or plants requires rapid estimation of the state variables governing the dynamics and the rapid implementation of a control law that may well be nonlinear. Rapid implementation is particularly essential in control because control actions must be taken in “real time”. Control that is delayed can yield instabilities and degrade performance. As there can be many state variables, we need to implement functions with many inputs. Frequently there is little prior knowledge of the system structure and the statistical characteristics of the exogenous forces acting on the system. Such a situation suggests a role for neural network methodology. One area of successful application has been in the economically important area of semiconductor manufacturing using computer integrated manufacturing (CIM). Neural networks have been used to model the response surface of processes in which desired results (e.g., index of refractivity, permittivity, stress) are nonlinearly related to control variables (temperatures, RF powers, silane flow rates, etc.) and to perform process diagnostics for such complex processes as plasma etching of integrated circuits (see May [158]).

Other applications of neural networks, such as associative memories for recall of partially specified states or combinatorial optimization based on
minimization of a quadratic form in a state vector, can be made using feedback, recurrent, or Hopfield networks. However, these networks and the issues of their dynamical behavior and design are not treated in this monograph (e.g., see Hassoun [97] and Haykin [100]).
1.3 Historical Background

An overview of the historical background to artificial neural networks is available in Cowan [49], and easy access to the important original papers is provided in [10]. We distinguish a “prehistory” covering the 1940s and early 1950s in which neurobiologists such as McCulloch, Pitts, and Hebb [101] attempted to make abstract mathematical models of the nervous system in an attempt to understand intelligence and learning. We do not consider a separate mathematical development that led to the Hodgkin–Huxley equations describing the detailed biochemical and biophysical behavior of a neuron. From the mid-1950s to the early 1960s the dominant figure was Cornell’s Frank Rosenblatt, who approached modeling brain function as a mathematical psychologist and reported much of his work in Principles of Neurodynamics [199]. His outlook is reflected in [198, p. 387]:

The perceptron is designed to illustrate some of the fundamental properties of intelligent systems in general, without becoming too deeply enmeshed in the special, and frequently unknown, conditions which hold for particular biological organisms.

The current notion of an artificial neural network owes more to the pioneering work of Rosenblatt than to any other individual. In particular, Rosenblatt developed the perceptron, a term that perhaps first appeared in a journal in [198], and its training algorithm, as well as the study of networks of such perceptrons, although he and his co-workers (e.g., [30, 31]) were unable to find effective training algorithms for more complex networks.

Many date the latency period of artificial neural networks from the critical work Perceptrons in which Minsky and Papert ([168]), originally writing in 1969, vigorously presented the limitations of a single artificial neuron or perceptron. Rosenblatt had considered more complex networks, but there was too little known about training them to render them useful. Interest in artificial neural networks waned following Perceptrons. However, a few individuals (including Amari, Barron, Cowan, Widrow) continued working on neural networks throughout the 1970s, and they are honored today for their contributions and vision.

The 1980s saw the re-emergence of the study and applications of artificial neural networks initiated by the work of Hopfield ([106, 107]), the Parallel Distributed Processing Group, reported in Rumelhart et al. [205], and others. Hopfield introduced the so-called Hopfield net, also referred to as a feedback or recurrent network, and
demonstrated, by relating its behavior to the statistical mechanics study of spin-glasses, that these networks could be designed to function as associative or content addressable memories. In the presentation of these networks as real analog circuits, a departure was initiated from thinking of network node functions only as binary-valued discontinuous step functions. Furthermore, an algorithm was provided for the encoding of such memories. A reinterpretation, by Hopfield and Tank in [108], of the dynamical behavior of these networks also showed that they could be used as combinatorial optimizers; their intrinsic dynamics amounted to the maximization of a certain quadratic form and you needed to recode your problem as such a maximization. We will not discuss this class of networks because they raise a rather different class of issues from the ones important for the feedforward networks we treat. The contribution of the PDP group was to show that if you replaced step function nodes by smooth, monotonically increasing, but bounded functions, then steepest descent algorithms, particularly as implemented by backpropagation, provided the hitherto missing effective training or learning procedure for complex neural networks. The success of this approach was critically dependent on the availability, starting in the mid-1980s, of cheap, powerful workstations and personal computers. Large-scale nonlinear optimization problems could now be solved routinely even when they involved tens of millions of floating-point computations. What had been impenetrable terrain for applications of neural networks immediately became so accessible that there were many papers and applications that exhibited little more thought than turning on a personal computer or workstation and bringing up some package to carry out “backpropagation”. By the late 1980s, the renascent subject of neural networks was acquiring a worrisome reputation of hype, ill-thought-out applications, and a common belief in the “magical” properties of neural networks. In the 1990s this excess was moderated, much was learned about the capabilities and limitations of neural networks, and sober, encouraging experience was gained from a variety of applications. It is this still-evolving position that we undertake to describe.
1.4 Biological Bases for Neuronal Modeling

1.4.1 Neuronal Properties
In order to construct a mathematical model of a neuron, to be called a “node” or an “artificial neuron”, we consider some of the characteristics reported for human neurons. Historically, the mathematical abstraction of biological neuronal performance into an artificial neuron was first attempted by McCulloch and Pitts in [159], from whom we quote:
Many years ago one of us, by considerations impertinent to this argument, was led to conceive of the response of any neuron as factually equivalent to a proposition which proposed its adequate stimulus. He therefore attempted to record the behavior of complicated nets in the notation of the symbolic logic of propositions. The “all-or-none” law of nervous activity is sufficient to insure that the activity of any neuron may be represented as a proposition. Physiological relations existing among nervous activities correspond, of course, to relations among the propositions; ... We shall make the following physical assumptions for our calculus.

1. The activity of the neuron is an “all-or-none” process.

2. A certain fixed number of synapses must be excited within the period of latent addition in order to excite a neuron at any time, and this number is independent of previous activity and position of the neuron.

3. The only significant delay within the nervous system is synaptic delay.

4. The activity of any inhibitory synapse absolutely prevents excitation of the neuron at that time.

5. The structure of the net does not change with time.
3. The dendrites (and a neuron may have as many as 150,000 of them) are the major source of inputs, although it is noted that inputs can also be made through the soma (body) and axon, and the neuron as a whole can be responsive to fields in its environment (see West [249, p. 985]). Hence, the model has multiple inputs, and there are many of them.

4. The neuronal response depends upon a summation of inputs. If two subliminal volleys are sent in over the same nerve, each volley produces an effect which is manifested by an EPSP [excitatory postsynaptic potential]. The EPSPs will then sum, and if the critical level of depolarization is reached, an impulse will be set off. (p. 50)

5. The neuronal response is “all or none” in that the characteristics of the pulse (referred to as the action potential) do not depend on detailed characteristics of the input. Thus, the propagated disturbance established in a single nerve fiber cannot be varied by grading the intensity or duration of the stimulus, i.e., the nerve fiber under a given set of conditions gives a maximal response or no response at all. (pp. 35, 42) Hence, our model should exhibit thresholding and a limited range of possible outputs. In the single-pulse case there are only two possible outputs, with “1” representing the action potential and “0” or “-1” representing the quiescent state.

6. The above refers only to an individual action potential or spike. Characteristics of stimuli are neuronally encoded into collections of spikes or pulse trains. The encoding of additional stimulus characteristics leads us to also consider nodes with graded responses that can reflect such features as firing rate. Firing rate is but one means of signal coding by neurons (West [249, p. 985]).

7. Is there a memoryless mapping from excitation to response? It would appear to be so if we examine a neuron only over moderate time periods that are too short for there to have been long-term adaptation but long enough that the neuron has passed through its refractory period (approximately 10–30 ms to recover to 95% of its quiescent threshold level). Over short time periods the neuron seems to have the memory of a first-order (RC) system. Hence, we can model the transformation as memoryless only over an appropriately chosen time scale.
8. Harris-Warwick [92] notes that mechanisms for changing the response of a neural network include the use of chemicals like calcium and neurotransmitters to change the characteristics of a neuron (open and close selected ion channels) as well as changing the synaptic strengths. Although only the second possibility is exploited in artificial neural networks, Harris-Warwick believes that the first mechanism is the most important.
1.4.2 Mathematical Model of a Neuron
The preceding facts about the behavior of neurons motivate the following mathematical model, while also causing us to be cautious about its neurobiological validity. We may not have captured the elements of neuronal behavior that are essential to the computational powers of the brain. The single output y is related to the multiple inputs [x_1, . . . , x_d], encoded as a column vector x, and to the strengths of the synaptic connections w = [w_1, . . . , w_d], and firing threshold τ through a memoryless nonlinear function f,

y = f( Σ_{i=1}^{d} w_i x_i − τ ) = f(w · x − τ).

Choices for f that reflect both the “thresholding” behavior of a neuron and the “all or none” principle are the sign function

f(z) = sgn(z − τ) = 1 if z ≥ τ, and −1 otherwise,

and the unit-step function

f(z) = U(z − τ) = 1 if z ≥ τ, and 0 otherwise,

with the correspondence sgn(z) = 2U(z) − 1. When we turn to consider neurons possessing graded responses, perhaps reflecting a firing rate, we will adopt a so-called sigmoidally shaped nonlinearity for f, i.e., a function that is continuously differentiable, increasing, and has a range of [0, 1] or [−1, 1]. Examples include the logistic function,

σ(u) = 1 / (1 + e^{−αu}),

shown in Figure 1.1 for α = 1, and the hyperbolic tangent function (a simple rescaling of the logistic)

tanh(αu) = (e^{αu} − e^{−αu}) / (e^{αu} + e^{−αu}) = 2σ(2u) − 1.
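As a minimal illustration of the node model (a sketch of our own; the particular inputs, weights, and threshold are invented for the example), the following MATLAB lines evaluate the response of a single node under the sign, unit-step, and logistic choices of f.

% Response y = f(w'*x - tau) of a single node for three choices of node function f.
x     = [0.2; -1.0; 0.7];           % column vector of inputs (d = 3)
w     = [1.5;  0.3; -0.8];          % synaptic connection strengths
tau   = 0.1;                        % firing threshold
alpha = 1;                          % logistic slope parameter

z = w' * x - tau;                   % net excitation presented to the node function

y_sign = sign(z); if z == 0, y_sign = 1; end   % sgn convention: +1 when excitation >= 0
y_step = double(z >= 0);                        % unit-step node U
y_logistic = 1 / (1 + exp(-alpha*z));           % graded (sigmoidal) response

fprintf('excitation %.3f: sgn %d, step %d, logistic %.3f\n', z, y_sign, y_step, y_logistic);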
FIGURE 1.1. Logistic function.

Mathematical properties of the logistic sigmoidal function are presented in Minai et al. [166]. The logistic function is a bounded, monotonic increasing, analytic (which implies unlimitedly differentiable) function. Elementary properties include

lim_{u→−∞} σ(u) = 0,    lim_{u→∞} σ(u) = 1,

σ'(u) = dσ(u)/du = ασ(1 − σ),    σ''(u) = α^2 σ(1 − σ)(1 − 2σ),    |σ'| ≤ α/4,

and, for small u,

σ(u) ≈ σ(0) + σ'(0)u = 1/2 + (α/4)u.

For small or moderate values of u, the logistic function is nearly linear. This implies that networks with logistic or tanh nodes using small enough weights, so that the total excitation to the nodes is small, will act as nearly linear systems. A plot of the percentage error incurred in the linear approximation to the logistic (α = 1) is shown in Figure 1.2.

FIGURE 1.2. Percentage error in approximating logistic by linear function.
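The identities above are easy to check numerically. The following MATLAB fragment (our own check, not part of the book's listings) compares a central-difference estimate of σ' with the closed form ασ(1 − σ) and evaluates the percentage error of the linear approximation 1/2 + (α/4)u, the quantity plotted in Figure 1.2.

% Numerical check of the logistic derivative identity and its linear approximation.
alpha = 1;
sigma = @(u) 1 ./ (1 + exp(-alpha*u));

u     = -4:0.01:4;
h     = 1e-6;                                    % step for a central-difference derivative
dnum  = (sigma(u + h) - sigma(u - h)) / (2*h);
dform = alpha * sigma(u) .* (1 - sigma(u));      % closed form: sigma' = alpha*sigma*(1 - sigma)
fprintf('max derivative mismatch: %.2e\n', max(abs(dnum - dform)));

linapprox = 0.5 + (alpha/4) * u;                 % linear approximation valid near u = 0
pcterr    = 100 * abs(linapprox - sigma(u)) ./ sigma(u);
fprintf('percentage error at u = 1: %.2f%%\n', 100*abs(0.5 + alpha/4 - sigma(1))/sigma(1));
plot(u, pcterr), xlabel('u'), ylabel('percent error')    % cf. Figure 1.2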
1.5 Organizational Approach

1.5.1 Role of Networks in Function Representation
Throughout we will have in mind for neural networks the two closely related problem areas of approximating either to given functions or to noisy data sets. In common practice the functions will be specified not by algorithms but by a table or training set T consisting of n argument-value pairs. We will be given a d-dimensional argument x and an associated target value t that is our goal, and t will be approximated by a network output y. The function to be constructed will be fitted to T = {(x_i, t_i) : i = 1 : n}. In most applications the training set T is considered to be noisy and our goal is not to reproduce it exactly but rather to construct a network function η(x, w) that produces a smoothed reconstruction that generalizes (learns) well to new function values.

FIGURE 1.2. Percentage error in approximating logistic by linear function.
1.5.2 Issues in Function Representation
If feedforward neural networks are to be our tools, what functions can be built with them? How hard is it to build with them? How do you do the construction? How well does the construction work? What are examples of successful constructions using these tools? Throughout our study of neural networks we will be guided by attempts to respond to the preceding rephrased as the following four questions.

Q1. What are the functions implementable or representable by a particular network architecture?

Q2. What is the complexity (e.g., as measured by numbers of weights or nodes) of the network needed to implement a given class of functions?

Q3. How can we select the architecture, weights, and node characteristics to achieve an implementable function?
Q4. What is the capability of the network and selection/training algorithm for learning or generalizing from training samples?

The rapidly changing applications of neural networks and our focus and attempt at coherence have combined to have us pay little attention to the important issue of applications for a given family of networks.
1.5.3 Order of Treatment
In treating Q1 we will examine both exact implementations and approximate ones, for both functions and tables {(x_i, t_i = f(x_i))} of pairs of arguments and function values called training sets. Answers to Q2 will vary in their terms and in the extent to which we can be informative. Chapters 2, 3, and 4 focus on Q1 and Q2. Chapter 2 treats a single node, known as a perceptron, having the classical binary-valued response of early neural network models. Chapter 3 introduces network architectures and notation and presents results on networks of binary-valued nodes. Chapter 4 analyzes the abilities of networks of real-valued nodes to approximate to functions that are fully specified. Little is known about selecting node functions, and practice relies on three such functions (logistic, hyperbolic tangent, step function). Chapter 5 is devoted to the nonlinear optimization methods of steepest descent, conjugate gradient, quasi-Newton, and Levenberg–Marquardt that are used by practitioners to respond successfully to Q3. These gradient-based methods enable the efficient, but computationally intensive, determination of the weights of a neural network given the architecture and node functions. Chapter 6 outlines the less satisfactory means available for architecture selection, although there are interesting theoretical responses and at least two methods suited to practice. Q4 is properly considered in a probabilistic or statistical setting that enables us to relate the data used in generating the network (training data) to the data to which it will be applied; this is the subject of Chapter 7. The wide range of applications outlined in Section 1.2.3 provides substantial motivation for an interest in neural networks, and these applications are not addressed significantly otherwise.
2 Perceptrons—Networks with a Single Node
2.1 Objectives and Setting

We commence by studying the capabilities of the simplest networks, those comprising a single node. We distinguish four cases: Inputs are in either IR^d or {−1, 1}^d and outputs are in either IR or {−1, 1}. Our focus is on networks having inputs in IR^d rather than on the synthesis of switching circuits or Boolean functions. The case of a real-valued output will be treated beginning with Chapter 4. The case of binary-valued output y and real vector-valued input x, related through the equation

y = sgn(w · x − τ) = η(x, w),   (2.1.1)
was called a perceptron by the pioneer Frank Rosenblatt in [199] and earlier works, and the node is referred to as a linear threshold unit (LTU) or a linear threshold gate. Equation 2.1.1 is well-founded on the neurobiological considerations presented in Section 1.4 and was the first network node model to be introduced. The "all-or-none" response y (the so-called action potential, a pulse propagated along an axon) is determined by the neuron's excitations x = {x_i} received at the dendrites across a synaptic cleft with transmissivity w. The integrated excitation is

w · x = ∑_i w_i x_i = w^T x = x^T w.
This excitation (assumed to be presented on an appropriate time scale) is then compared to the threshold τ that must be exceeded for the action
potential to be initiated. Our objective in this chapter is a thorough exploration of the properties and capabilities of a perceptron. Our approach is both geometrical and probabilistic. A useful geometric perspective (e.g., see Halmos [91] or Sommerville [223] for treatments of n-dimensional geometry) on Eq. 2.1.1 is that {x : y = 1}, {x : y = −1} are the half-spaces

H^+ = {x : w · x ≥ τ},   H^− = {x : w · x < τ},

respectively, with boundary the hyperplane (affine subspace of dimension d − 1)

H = {x : w · x − τ = 0}.

The homogeneity of the system given by Eq. (2.1.1) allows us to standardize on τ = ±1 if τ ≠ 0 and on τ = 0 otherwise. A useful device in this connection is to realize that we can always take the threshold to be zero if we augment the dimensions of the input x and weight vector w by 1, as follows:

x ∈ IR^d → x̃ ∈ IR^{d+1} through (∀i ≤ d) x̃_i = x_i, x̃_{d+1} = −1;
w ∈ IR^d → w̃ ∈ IR^{d+1} through (∀i ≤ d) w̃_i = w_i, w̃_{d+1} = τ.

It is now evident that

w · x − τ = w̃ · x̃,

and the threshold has been absorbed or, equivalently, taken to be zero. We have embedded the set of threshold functions of a d-dimensional variable x = (x_1, ..., x_d) and arbitrary threshold τ into the set of functions of a (d + 1)-dimensional variable x̃ = (x_1, ..., x_d, −1), with τ̃ = 0 implicit. The process of embedding carries H, a hyperplane in IR^d, into a hyperplane H̃ in IR^{d+1} that passes through the origin 0.

What are the capabilities and uses of such a device? Because a perceptron can yield only two responses, it can only be asked to recognize a set and its complement, not to make finer distinctions such as recognizing, say, the ten members of the set of printed decimal digits. At best we can dichotomize (partition into two parts) a set S of vectors in IR^d. However, not all dichotomies are possible. For example, a dichotomy of the plane IR^2 in which we assign +1 to the unit disk centered at the origin and −1 to its complement cannot be implemented by a single perceptron node. Clearly there is no hyperplane boundary H separating the unit disk from the rest of IR^2. In almost all applications of artificial neural networks, however, we are not provided with a known mathematical rule or function to implement or approximate, or in this case a mathematically defined set S^+ on which y = 1 and y = −1 on its complement S^−. Rather we focus on learning from examples. We are provided with a training set T that is a finite collection of input-output pairs {(x_1, t_1), ..., (x_n, t_n)}. Our objective is to
learn from this so that given a (possibly new) input x we can likely infer the correct output or target t. No effort is made to parallel the efforts in artificial intelligence to come up with "heuristic rules" or in expert systems to extract the rules governing the behavior of human experts. The four questions listed in Section 1.5.2 guide our explication of the perceptron. We refine Question 1 as:

Q1(a) Can a given training set be learned without error by a perceptron?

Q1(b) How many of the 2^n possible ±1-valued assignments of values t_1, . . . , t_n can be made by a perceptron to the set S = {x : (∃(x_i, t_i) ∈ T) x = x_i} of input variables recorded in T?

Q1(c) How well can we approximate a training set that cannot be learned exactly?

We will examine our conclusions asymptotically in the size n of the training set T and in a probabilistic setting wherein we assume our inputs to be randomly selected vectors with randomly assigned binary classifications as targets. In this we follow Cover [46] and Nilsson [180]. A review of results is given by Takiyama [233] and Hertz et al. [103, Ch. 5]; Nilsson [180] (a reissue of a 1965 work) provides a textbook-level discussion.
2.2 Linear Separability

2.2.1 Definitions
In this section we will answer Q1(a) through Theorems 2.2.1 and 2.2.2 and we will provide insight into the nature of linear separability. The implementation of a binary-valued (±1) function η is equivalent to partitioning the possible input vectors into two families,

{x : η(x) = 1},   {x : η(x) = −1}.

For a node of the form of Eq. (2.1.1) these two sets are half-spaces

H^+ = {x : y = 1} = {x : w · x − τ ≥ 0},   H^− = {x : y = −1} = {x : w · x − τ < 0},

with hyperplane boundary H = {x : x · w − τ = 0}; in two dimensions (d = 2) the hyperplane boundary is a straight line that divides the plane IR^2 into two halves, above and below the line. Given a training set T = {(x_i, t_i), i = 1 : n}, we identify the finite sets of inputs

S^+ = {x : (∃i) t_i = 1, x = x_i},   S^− = {x : (∃i) t_i = −1, x = x_i}.

(In MATLAB, for example, this is easily done using the find command.)
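As a concrete illustration of the preceding remark, the fragment below (a sketch; the variables X and t are our own names for the inputs and targets) uses find to extract S^+ and S^− from a training set.

% Sketch: extracting S+ and S- from a training set with find
% X is an n-by-d matrix whose rows are the inputs x_i; t is an n-by-1
% vector of +/-1 targets.  (Example data; any training set will do.)
X = [ 2 1;  1 3; -1 -2; -2 1;  0 -3];
t = [ 1;    1;   -1;    -1;   -1];

plus_idx  = find(t ==  1);   % indices i with t_i = +1
minus_idx = find(t == -1);   % indices i with t_i = -1

Splus  = X(plus_idx,  :);    % rows of X forming S+
Sminus = X(minus_idx, :);    % rows of X forming S-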
Definition 2.2.1 (Linear Separability) We say that S^+, S^− are linearly separable if there exists H specified by w, τ such that S^+ ⊂ H^+, S^− ⊂ H^− or, equivalently,

(∀x ∈ S^+) w · x − τ ≥ 0,   (∀x ∈ S^−) w · x − τ < 0.   (2.2.1)
The case of separation with τ = 0 is referred to as homogeneous linear separability. Typically we require strict linear separability, meaning that no points lie on H and both of the inequalities in Eq. 2.2.1 are strict.
FIGURE 2.1. Linear separability and nonseparability.
Question 1(a) then becomes whether S + , S − are (strictly) linearly separable. Geometrically, we ask whether there is an H such that the points in S + lie above H whereas the points in S − lie below H.
2.2.2 Characterizations through Convexity
To understand the conditions for the existence of linear separability, recall the following elements of convex analysis (see Rockafellar [197]).

Definition 2.2.2 (Convex Combination) The convex combination of vectors x_1, . . . , x_n, with respect to non-negative scalars λ_1, . . . , λ_n satisfying ∑_{i=1}^{n} λ_i = 1, is the vector

∑_{i=1}^{n} λ_i x_i.
If n = 2, for example, then the convex combination of x1 , x2 is a vector with tip on the line segment joining x1 , x2 .
Definition 2.2.3 (Convex Set) A subset S of IR^d is convex if it is closed under finite convex combinations. Informally, a subset of a linear vector space is convex if, for each pair of points in the subset, the line segment joining the pair is wholly contained in the subset.

Definition 2.2.4 (Convex Hull) The convex hull C(S) of a set of vectors S is the set of all finite convex combinations of vectors in S,

C(S) = {z : (∃n)(∃x_1, . . . , x_n ∈ S)(∃λ_1, . . . , λ_n) λ_i ≥ 0, ∑_{i=1}^{n} λ_i = 1, z = ∑_{i=1}^{n} λ_i x_i}.
The convex hull can also be understood as the smallest (in the sense of set inclusion) convex set containing S. See Rockafellar [197] for other relevant definitions and results from convex analysis. If S is a finite set, then C(S) is a convex polyhedron, a convex figure whose bounding faces are hyperplanes. A necessary and sufficient condition for two finite sets of vectors S^+, S^− to be linearly separable is given by the following:

Theorem 2.2.1 (Separating Hyperplane/Hahn-Banach) The finite sets S^+, S^− are linearly separable if and only if the convex hulls C(S^+), C(S^−) of each of the two classes of vectors do not intersect.

The Separating Hyperplane or Hahn-Banach Theorem appears in different forms, one version of which is given by Rockafellar [197, p. 99]. This is a necessary and sufficient condition for a finite training set T that has inputs in S to be learnable without error by a single node of the form of Eq. (2.1.1). This theorem extends to the case of infinite sets S^+, S^− with the addition of the condition that we use the closed convex hulls. The determination of whether there exist w, τ linearly separating two given finite sets of vectors S^+, S^− is facilitated by first standardizing the problem. Embed the vectors, as described earlier, so as to eliminate the threshold, thereby obtaining S̃^+, S̃^− as subsets of IR^{d+1}. Replace vectors in S̃^− by their negatives (in effect, we replace all augmented vectors x̃_i by t_i x̃_i) and thereby produce the set F̃ = S̃^+ ∪ {−S̃^−} of vectors with the property that the original sets S^+, S^− are linearly separable if and only if there exists w̃ such that

(∀x̃ ∈ F̃) w̃ · x̃ > 0.   (2.2.2)
Theorem 2.2.2 (Theorem of the Alternative [197, p. 198]) Either there exists w̃ satisfying Eq. 2.2.2 for F̃ or there exist k, x̃_{i_1}, . . . , x̃_{i_k}, and λ_j ≥ 0 with ∑_{j=1}^{k} λ_j = 1 such that

∑_{j=1}^{k} λ_j x̃_{i_j} = 0.   (2.2.3)
In the latter case the convex combination need not be taken over more than k ≤ d + 1 terms when the vectors are in IR^d. In other words, either there is a hyperplane separating the vectors in F̃ from the origin 0 or the origin can be produced by a convex combination of such vectors. If there is a solution to the system of equalities in Eq. (2.2.3), then the two sets cannot be linearly separated or learned by a perceptron without error. If there is no such solution, then the system of linear inequalities given by Eq. (2.2.2) has a solution. In principle, the existence of a solution to Eq. 2.2.3 can be determined algorithmically by searching on k and then on k-tuples of elements of the finite set F̃, solving the resulting system of linear equations for λ_i, and checking to see whether the λ_i are the weights in a convex combination. In Section 2.6 we will present Rosenblatt's Perceptron Training Algorithm for determining w, τ when the sets are linearly separable. However, an industry-standard approach for dealing with the system of linear inequalities given by Eq. 2.2.2, when not in the context of learning paradigms and neural networks, would be to use the well-tested and highly developed simplex algorithm (see Chvatal [39]). The simplex algorithm allows you to test for linear separability. In the linear programming problem treated by the simplex algorithm one wishes to minimize a linear objective function subject to linear inequality constraints. The simplex algorithm finds an initial feasible point (one satisfying the inequalities) and then iterates to reduce the objective function. In our case we can set the objective function to zero and just use the process for finding a feasible point.
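The feasibility computation just described can be sketched as follows in MATLAB, assuming the Optimization Toolbox routine linprog is available; the right-hand side of 1 below is only a convenient normalization of the strict inequality in Eq. 2.2.2, and the data are illustrative.

% Sketch: testing strict linear separability with a feasibility LP.
% X: n-by-d inputs (rows), t: n-by-1 labels in {-1,+1}.
X = [ 2 1;  1 3; -1 -2; -2 1;  0 -3];
t = [ 1;    1;   -1;    -1;   -1];
[n, d] = size(X);

Ztilde = diag(t) * [X, -ones(n,1)];   % augmented, label-reflected vectors forming F~
A = -Ztilde;                          % want Ztilde*wtilde >= 1, i.e. -Ztilde*wtilde <= -1
b = -ones(n,1);
f = zeros(d+1,1);                     % objective is irrelevant; we only need feasibility

opts = optimoptions('linprog','Display','none');
[wtilde,~,exitflag] = linprog(f, A, b, [], [], [], [], opts);

if exitflag == 1
    w   = wtilde(1:d);                % separating weights
    tau = wtilde(d+1);                % threshold (recall x_{d+1} = -1 absorbs tau)
    fprintf('separable: w = [%s], tau = %g\n', num2str(w.'), tau);
else
    disp('no separating hyperplane exists (or the solver failed)');
end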
2.2.3 Properties of Solutions
Further observe that a solution w, τ that achieves linear separability is not unique. An important duality property is that the number of ways in which a set S of n d-dimensional input vectors can be given a binary classification by hyperplanes having a common threshold τ is equal to the number of regions into which IR^d is partitioned by n hyperplanes all with the same threshold τ. This observation is immediate from examination of the family of equations

sgn(w · x_i − τ) = t_i,   i = 1, ..., n;   (2.2.4)
either consider the weight w as partitioning the space IRd of the points {x1 , . . . , xn } in S or the points in S as partitioning the space IRd of the
weight vectors. If we consider w-space IR^d, then each hyperplane x_i, τ divides this space into two half-spaces with the choice of half-space containing w being determined by t_i. The intersection of all n such half-spaces is a convex polyhedral region. Thus, a given assignment of {t_i} to S that can be implemented by a weight vector w can also be implemented by any weight vector lying in the convex polyhedral region containing w and having faces/hyperplanes determined by the vectors in S and sides (up or down) determined by the chosen assignments {t_i}. Finally, we observe that if there is a separating hyperplane, then one can be chosen in the form

H = {x : ∑_{j=1}^{d} α_j t_{i_j} x_{i_j} · x − τ = 0},   α_j > 0,

and this linear combination {α_j} can be chosen convex. In effect, the weight vector w can be chosen to be in the form

w = ∑_{j=1}^{d} α_j t_{i_j} x_{i_j},   α_j ≥ 0.

In terms of the transformed problem treated in Theorem 2.2.2, we have

w̃ = ∑_{j=1}^{d} α_j x̃_{i_j},   α_j ≥ 0.   (2.2.5)
The weight vector can be taken to be a convex combination of the transformed input vectors that are to be classified. The Perceptron Training Algorithm will exploit this fact by searching for a separating hyperplane of this form, and Theorem 2.6.1 establishes the truth of Eq. 2.2.5. This observation is also essential to the idea of a support vector machine, to be discussed in Section 2.8. The xi entering in the determination of w are the support vectors.
2.3 The Number of Learnable Training Sets

2.3.1 The Upper Bound D(n, d)
To respond to Question 1(b) we inquire into the capabilities of the perceptron architecture by assessing how difficult it is for a training set T to be learnable. We focus on classifying a training set of n elements. Given distinct inputs S = {x_1, ..., x_n}, there are 2^n possible assignments of the corresponding classes t_1, t_2, . . . , t_n. Each assignment provides a dichotomy of S in that it partitions the elements of S into two disjoint subsets. We
follow Cover [46, 47] and Nilsson [180] in determining the number L(S) of these dichotomies that are linearly separable (including the threshold) and therefore can be implemented by a perceptron. In Theorem 2.3.1 we establish an upper bound D(n, d) to the number of linear dichotomies of S having n points in IR^d. This result is essentially a special case of a more general result due to Vapnik and Sauer given in Theorem 3.5.1; a different boundary condition (D(n, 1) = 2n vs. m_N(n) = n + 1 when v = 1) combines with the same difference equation to yield a larger upper bound in Theorem 3.5.1. Because the upper bound is the most important result, one could rely on Theorem 3.5.1 and the assessment of the VC capacity of a perceptron as d + 1 when there is a possibly nonzero threshold. However, one can establish even more in the case of the perceptron. In Theorem 2.3.2 we show that this upper bound is achieved by any set S satisfying a condition of being "in general position" that is specified in Definition 2.3.1. Finally, in Theorem 2.3.3 we will show that if S fails to be in general position, then it admits strictly fewer linear dichotomies than D(n, d). Our derivation follows that of previous expositors, with care being given to the assumption that all sets of points in general position share the same number of linear dichotomies. In the interest of maintaining the flow of ideas, we will postpone proofs of results to Appendix 2 and provide only the sequence of results here.

Let L(S) denote the number of linearly separable dichotomies of S implementable by a perceptron. We will obtain a recursion relating L(S) to the count for the smaller set S − {x_n}. Let L_{x_0}(S) denote the number of such dichotomies of S implementable by a perceptron with the constraints that the point x_0 lies on the defining hyperplane H and x_0 ∉ S.

Lemma 2.3.1 ([46, p. 327]) The dichotomies {S^+, S^− ∪ {x_0}}, {S^+ ∪ {x_0}, S^−} are both implementable by a perceptron if and only if {S^+, S^−} is linearly separable by a hyperplane containing x_0.

Proofs of results asserted in Section 2.3 are provided in Appendix 2.

Lemma 2.3.2 ([180, p. 34]) If S = {x_1, . . . , x_n},

L(S) = L(S − {x_n}) + L_{x_n}(S − {x_n}).   (2.3.1)
We now need to evaluate L_{x_n}(S − {x_n}), assuming that no three points of S are collinear. We establish that for any S − {x_n} and point x_n, there is a set P of the same size n − 1 that is completely contained in a hyperplane or space IR^{d−1}, and there is a 1:1 correspondence between the dichotomies of S − {x_n} by hyperplanes constrained to contain x_n and of P by arbitrary hyperplanes.

Lemma 2.3.3 L_{x_n}(S − {x_n}) = L(P).
Introduce

L̂(n, d) = max_{S : ||S|| = n, S ⊂ IR^d} L(S)

as the maximum number of dichotomies that can be formed by a perceptron when we consider all sets S of size n with elements drawn from IR^d. Clearly,

L(S − {x_n}) ≤ L̂(n − 1, d),

and viewing P as a subset of IR^{d−1} yields

L(P) ≤ L̂(n − 1, d − 1).

Combining results yields

L(S) ≤ L̂(n − 1, d) + L̂(n − 1, d − 1).

It is immediate that

L̂(n, d) ≤ L̂(n − 1, d) + L̂(n − 1, d − 1).

Introduce the finite difference equation system

D(n, d) = D(n − 1, d) + D(n − 1, d − 1),   (2.3.2)

with D(n, d) satisfying D(n, 1) = L̂(n, 1), D(1, d) = L̂(1, d). Induction on n, d easily establishes that

(∀ n ≥ 1, d ≥ 1) D(n, d) ≥ L̂(n, d) ≥ L(S).

It remains to specify the boundary conditions

D(1, d) = L̂(1, d) = 2,   D(n, 1) = L̂(n, 1) = 2(n + 1) − 2 = 2n,   (2.3.3)

to derive the upper bound.

Theorem 2.3.1 (Upper Bound to L) The unique solution to Eqs. (2.3.2) and (2.3.3) and upper bound to L(S) is

D(n, d) = 2 ∑_{i=0}^{d} \binom{n−1}{i}, if n > d + 1;   2^n, if n ≤ d + 1.   (2.3.4)

Appendix 1 contains a collection of facts about binomial coefficients and their sums that will be useful to us. A rough sense of our results is provided by the following table for d = 2, 3, 4.
n:        2    3    4    5    6    7    8
2^n:      4    8   16   32   64  128  256
D(n,2):   4    8   14   22   32   44   58
D(n,3):   4    8   16   30   52   84  128
D(n,4):   4    8   16   32   62  114  198
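The table is easily regenerated from Eq. (2.3.4); in the MATLAB sketch below the function handle countD is our own name.

% Sketch: evaluating D(n,d) of Eq. (2.3.4) and reproducing the table above
countD = @(n,d) (n <= d+1) * 2^n + ...
         (n >  d+1) * 2 * sum(arrayfun(@(i) nchoosek(n-1,i), 0:min(d,n-1)));

for n = 2:8
    fprintf('n = %d:  2^n = %4d   D(n,2) = %4d   D(n,3) = %4d   D(n,4) = %4d\n', ...
            n, 2^n, countD(n,2), countD(n,3), countD(n,4));
end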
So long as D(n, d) < 2^n, we cannot train a perceptron to learn all training sets. Observe that even for n = 4 (5) (6) a perceptron can no longer learn all training sets in IR^2 (IR^3) (IR^4). From Appendix 1 we learn that

∑_{k=0}^{d} \binom{n}{k} < \binom{n}{d} [1 + d/(n + 1 − 2d)] if n ≥ 2d.

Combining this result with Eq. 2.3.4 yields the useful upper bound

D(n, d) < 2 \binom{n−1}{d} [1 + d/(n + 1 − 2d)] if n ≥ 2d.

The elementary inequality

\binom{n}{d} ≤ n^d / d!

(which is a strict inequality for d > 1) allows us to assert that

D(n, d) < 2 (n^d / d!) [1 + d/(n + 1 − 2d)] if n ≥ 2d.

An additional result from Appendix 1 is that

(1/(n + 1)) 2^{nH(d/n)} ≤ \binom{n}{d} ≤ 2^{nH(d/n)},

where H is the binary entropy function, H(x) = −x log_2 x − (1 − x) log_2 (1 − x) for 0 ≤ x ≤ 1. Hence, for large d and letting β = d/n,

D(n, d) ≤ [1 + β/(1 − 2β)] 2^{nH(β)} for β < 1/2.
If we specialize the preceding assessments of the number of linearly separable dichotomies to an assessment of the number of homogeneous linearly separable dichotomies, we see that the answer to this case is provided by L0 , which is upper bounded by D(n, d − 1) + 1; the additional “1” coming from the trivial classifier with zero weight vector. An argument based on the notion of duality suggested at the close of Section 2.2 can be used to directly argue to D(n, d − 1) + 1 as the upper bound in the homogeneous case.
2.3.2 Achieving the Upper Bound
Previous work on the enumeration problem (the earliest such being Schlafli in 1857 as reprinted in Schlafli [211]) averred that the upper bound of D is tight in that it is achieved precisely whenever S satisfies a condition of its points being in general position. However, the proofs available do not establish this; they start by assuming two such sets admit the same number of linear dichotomies. We treat this question because it is uncommon in the more general theory discussed in Section 3.5 to know that the so-called VC bounds are tight. We introduce the hypothesis that the n points in S, considered as points in IR^d, are in general position as given in

Definition 2.3.1 (General Position) We say that n points in IR^d are in general position if for no k, 2 ≤ k ≤ d + 1, do k of these points lie in an affine subspace (hyperplane) of dimension k − 2.

Hence, on a k-dimensional hyperplane (or flat) there are no more than the determining number of k + 1 points from S.

Theorem 2.3.2 (L) If S, S′ are both sets of n points in general position in IR^d, then

L(S) = L(S′) = L̂(n, d) = D(n, d).   (2.3.5)

A converse to Theorem 2.3.2 is provided next.

Theorem 2.3.3 If the n points in S are not in general position in IR^d, then L(S) < D(n, d).

In what follows we accept that the enumeration by the upper bound D given by Eq. 2.3.4 is exact for sets of points in general position, and it informs us as to the number of training sets T with given S that are learnable by a perceptron. We now turn to examine some of the implications of this conclusion.
2.4 Probabilistic Interpretation, Asymptotics, and Capacity

We follow Cover [46] and introduce a probabilistic interpretation based on establishing a random selection mechanism for the training set. This interpretation will enable us to make statements about the likelihood of a perceptron being able to correctly learn a randomly generated training set. Furthermore, moving into a probabilistic framework makes available well-developed analytical tools with which to explore neural networks. We assume that elements of T are generated independently and identically
distributed, abbreviated i.i.d., according to some probability measure/law P (generally unknown to us). We further assume that P selects input vectors x that are in general position with probability one; this is equivalent to requiring that (∀w, τ) P(w · x = τ) = 0; this result would hold, for example, if x is chosen according to P described by a continuous density function in IR^d. In addition, the classification t of the input vector is made independently with classes being equally likely to be drawn from either category {−1, 1};

P(t = 1|x) = 1/2.

A training set T = {(x_i, t_i)} of size n generated by this prescription then has inputs S = {x_i} that are in general position with probability one and associated classifications t_1, . . . , t_n that are equally likely to take on any of the 2^n possible binary assignments. Hence, the probability that a randomly selected T will be classifiable by a node satisfying Eq. 2.1.1 is given by

P(n, d) = D(n, d)/2^n = (1/2)^{n−1} ∑_{k=0}^{d} \binom{n−1}{k}.   (2.4.1)

Curiously, P immediately interprets as the probability that in n − 1 tosses of a fair coin we will obtain no more than d "heads"—the two formulas are identical. This interpretation enables us to study D(n, d) by invoking results from the well-studied subject of probabilistic coin-tossing. The probabilistic weak law of large numbers (e.g., Loeve [151] or any of a large number of probability texts) can now be invoked for d = βn to evaluate

lim_{n→∞} P(n, βn) = U(β − 1/2) = { 1, if β > 1/2; 0, if β < 1/2 }.

Hence, P thresholds at β = 1/2. We can pursue this further in the vicinity of β = 1/2 by using the Central Limit Theorem to conclude that

(d − n/2) / ((1/2)√n) ∼ N(0, 1),

i.e., the left-hand side is asymptotically normally distributed with mean 0 and variance 1,

P(d ≤ n/2 + (x/2)√n) = (1/√(2π)) ∫_{−∞}^{x} e^{−z²/2} dz.

Hence, for large d, with high probability we can correctly classify about 2d training samples with input vectors of dimension d. This result is sometimes stated as the capacity of an LTU node with d inputs is 2d. Thus the greater the number of input variables (d), the greater the capacity or memory of an LTU node, and we exploit this in Section 2.7.
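The thresholding of P(n, d) near n = 2d can be checked numerically; the sketch below evaluates Eq. (2.4.1) for an illustrative choice d = 25 (the use of gammaln is only for numerical stability).

% Sketch: P(n,d) of Eq. (2.4.1) as a function of n for fixed d
d = 25;
nvals = d+1 : 4*d;
P = zeros(size(nvals));
for j = 1:numel(nvals)
    n = nvals(j);
    k = 0:d;
    % P(n,d) = (1/2)^(n-1) * sum_{k=0}^{d} C(n-1,k), evaluated via log-gamma
    logterms = gammaln(n) - gammaln(k+1) - gammaln(n-k) - (n-1)*log(2);
    P(j) = sum(exp(logterms));
end
plot(nvals/d, P); xlabel('n/d'); ylabel('P(n,d)');
title('Probability that a random dichotomy of n points in R^d is linearly separable');
% P stays near 1 for n < 2d and falls toward 0 for n > 2d, i.e., capacity ~ 2d.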
2.5 Complexity of Implementations

In response to Question 2 concerning the complexity of implementations, we first inquire into the minimum number of dimensions d* needed to separate n randomly selected and classified inputs. We assume that x ∈ IR^d (d ≥ n > 1), and we wish to select a lower-dimensional feature vector that still allows us to learn the training set T without error. The distribution of d* is given by the binomial B(n − 1, 1/2)

P_{d*}(d) = P(n, d) − P(n, d − 1) = \binom{n−1}{d} (1/2)^{n−1},   d = 0, ..., n − 1.   (2.5.1)

Eq. 2.5.1 provides the probability that a randomly selected T with n elements can be classified correctly by a perceptron having no fewer than d* = d inputs. The input dimension d* interprets as the number of heads in n − 1 tosses of a fair coin. This conclusion is consistent with the fact that D(n, n − 1) = 2^n because any n points in general position in IR^{n−1} can be classified arbitrarily by a perceptron.
FIGURE 2.2. Probability mass function for d∗ when n = 50.
We can also ask for the distribution P_{N*} of the largest number N* of randomly generated input vectors in IR^d that can be correctly classified by a perceptron. For example, Figure 2.1 includes an example of five points in IR^2 that are not linearly separable. Depending on the order of presentation of these points, N* is either 3 or 4. This random variable has a Pascal/negative-binomial distribution corresponding to the event of the waiting time N* to the appearance of the (d + 1)-st head in successive tosses of a fair coin,

P(N* = n) = P_{N*}(n) = P(n, d) − P(n + 1, d) = \binom{n−1}{d} / 2^n for n > d.   (2.5.2)
Note that PN ∗ (d) = 0, reflecting the fact that more than d patterns can always be correctly learned with S in general position in IRd . We also observe that there is no upper limit to the largest number of patterns that can be learned, albeit with diminishing probability. Eq. 2.5.2 provides the probability that as we increase the number of training samples provided to a perceptron with d inputs, the largest number we can classify correctly is n.
FIGURE 2.3. Probability mass function for N ∗ when d = 10.
One can think of N* as arising from a sum of d + 1 i.i.d. geometrically distributed random variables {G_i}, where G_i is the waiting time between the (i − 1)-st and ith occurrences of heads. The characteristic function φ_{N*}(u) (e.g., see Loeve [151]) is then given by

( e^{iu} / (2 − e^{iu}) )^{d+1}.

Using cumulants, or directly from the interpretation as a sum of i.i.d. random variables having mean 2 and variance 2, we readily conclude that

E N* = 2(d + 1),   VAR(N*) = 2(d + 1).
Once again we see that the maximum size training set that can be correctly classified by a perceptron with d inputs is about 2d. We can roughly summarize the preceding in the following practical advice: Each weight/connection stores about two patterns.
2.6 Perceptron Training Algorithm in the Separable Case

2.6.1 The PTA

Question 3 invites us to examine how to select w, τ to correctly learn a training set. This problem can be thought of readily as finding a solution to a system of linear inequalities given by Eq. 2.2.4. Such systems are generally best treated by the methods of linear programming, and in particular, by approaches such as the simplex algorithm to finding feasible points to be used in initializing such algorithms ([39], [76, pp. 162–168]). However, linear programming, while effective, is a form of batch processing that does not reflect a process of "learning". Rosenblatt's Perceptron Training/Learning Algorithm is a form of iterative or online updating that corrects itself by repeatedly examining individual elements drawn from the training set. The resulting solution is of the form given by Eq. 2.2.5, wherein the separating hyperplane w̃ for the standardized vectors {z̃_i} can be written as a non-negatively weighted sum of these vectors. Issues include:

1. the convergence of this algorithm when the two classes are linearly separable;
2. the behavior of the algorithm when the classes are not linearly separable;
3. the speed of convergence; and
4. its performance relative to other algorithmic methods for solving the same "learning" problem.

We slightly generalize [168], Ch. 11 (see also Siu et al. [221]) to introduce the Perceptron Training Algorithm (PTA):

STANDARDIZE: As described in Section 2.2, reduce the search for a hyperplane described by w, τ in IR^d to dichotomize S into S^+, S^− to the search for a hyperplane w̃ in IR^{d+1} containing the origin (homogeneous case) such that when we augment to S̃^+, −S̃^−, we can form F̃ = S̃^+ ∪ {−S̃^−},

(∀z̃ ∈ F̃) w̃ · z̃ > 0.

START: Choose any z̃ ∈ F̃ as an initial value for w̃. Adopt a process for selecting elements of F̃ that generates an infinite sequence {z̃_{i_j}} of elements of F̃ such that each element of F̃ appears infinitely often. For example, for finite F̃ having n elements, simply choose i_j − 1 to be the remainder of j after division by n to cycle through all elements or select the next element at random from F̃.
TEST: Apply the selection process to obtain z̃_{i_j} ∈ F̃. If w̃ · z̃_{i_j} > 0, go to TEST. Otherwise go to ADD.

ADD: Select a sequence φ_j of values that are bounded below by a positive constant φ and above by a finite constant φ̄. The element φ_j may depend on the history of the PTA up to the jth recourse to the ADD step. Replace w̃ by w̃ + φ_j z̃_{i_j}. Go to TEST.

HALT: If you have cycled through F̃ without going to ADD, or have exceeded the run time limit.

REMARK: In the basic PTA φ_n = 1. A useful alternative is to choose φ_n = 1/||z̃_{i_n}||² so that we have in effect scaled all of F̃ to contain unit length vectors.

It is clear from the ADD step, if we initialize by choosing a vector in F̃, that the solution w̃ will be of the form ∑_j α_j z̃_{i_j} for α_j > 0 as in Eq. 2.2.5. Another argument for the ADD step is that

w̃_j · z̃_{i_j} = w̃_{j−1} · z̃_{i_j} + φ_j ||z̃_{i_j}||² > w̃_{j−1} · z̃_{i_j}.

Hence, the ADD step is sensible in that it increases the value of the inner product in an attempt to reach a positive value from the current negative value (we only have recourse to ADD when it is negative). Appendix 2 contains a MATLAB implementation of this pseudo-code that contains some batch processing elements to exploit MATLAB's capabilities. Figure 2.4 illustrates the behavior of the PTA on a linearly separable set of 100 vectors in IR^4. To show convergence properties, we plot the sequence of the magnitudes of weight vectors scaled by their final value and the cosines of the angles between pairs of successive weight vectors.
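The MATLAB program of Appendix 2 is not reproduced here; the following minimal sketch of the basic PTA (φ_n = 1, cyclic selection, illustrative data, and our own variable names) simply transcribes the pseudo-code above.

% Sketch: basic Perceptron Training Algorithm (phi_n = 1, cyclic selection)
% X: n-by-d inputs, t: n-by-1 labels in {-1,+1}; produces w, tau when it halts.
X = [ 2 1;  1 3; -1 -2; -2 1;  0 -3];
t = [ 1;    1;   -1;    -1;   -1];
[n, d] = size(X);

F = diag(t) * [X, -ones(n,1)];   % STANDARDIZE: augmented, reflected vectors z~
wt = F(1,:)';                    % START: initialize with an element of F~
max_passes = 1000;

for pass = 1:max_passes
    changed = false;
    for i = 1:n                  % cycle through F~
        if F(i,:) * wt <= 0      % TEST fails ...
            wt = wt + F(i,:)';   % ... so ADD with phi = 1
            changed = true;
        end
    end
    if ~changed, break; end      % HALT: a full cycle with no ADD
end

w = wt(1:d); tau = wt(d+1);
fprintf('passes used: %d,  w = [%s],  tau = %g\n', pass, num2str(w.'), tau);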
2.6.2 Convergence of the PTA
That this algorithm works, in the sense of finding a separating hyperplane when one exists, is established in the next result.

Theorem 2.6.1 (Perceptron Training Convergence) If the set F̃ is such that

(∃w̃*, δ > 0)(∀z̃ ∈ F̃) w̃* · z̃ > δ,   (2.6.1)

(∃b)(∀z̃ ∈ F̃) |z̃| < b,   (2.6.2)

and the choice process in the Perceptron Training Algorithm is such that each element of F̃ will be chosen arbitrarily often, then the step ADD will be visited only finitely often. Hence, the sequence of weight vectors converges after finitely many iterations to a vector w̃ such that

(∀z̃ ∈ F̃) w̃ · z̃ > 0.
FIGURE 2.4. PTA convergence behavior.
Proof. Introduce the cosine of the angle between w̃*, w̃,

G(w̃) = (w̃* · w̃) / (|w̃*| |w̃|),

and note that by the Schwarz inequality |G| ≤ 1. We will show that the numerator grows at least linearly in the number n of recourses to the step ADD, and the denominator grows as √n. Hence, if there were an unbounded number of recourses to ADD, |G| would exceed unity and there would be a contradiction. Let {w̃_n} denote the subsequence of weight vectors in which there are changes between successive terms. Hence

w̃_n = w̃_{n−1} + φ_n z̃_{i_n},   w̃_{n−1} · z̃_{i_n} < 0;

the latter condition is the one that governs the ADD step. The vector w̃* is unchanging. For notational convenience, let

G_n = G(w̃_n) = N_n / D_n,   N_n = w̃* · w̃_n.

Then using the hypothesis that w̃* · z̃_j > δ, we see that

N_n = w̃* · (w̃_{n−1} + φ_n z̃_{i_n}) > w̃* · w̃_{n−1} + δφ_n > 0.

Simple iteration reveals that

N_n > w̃* · w̃_0 + δ ∑_{j=1}^{n} φ_j ≥ (n + 1)δ ( (1/(n+1)) ∑_{j=0}^{n} φ_j ) > 0,
with the last inequality holding if w̃_0 ∈ F̃. These inequalities suggest that N_n grows at least linearly in n because N_n ≥ (n + 1)δφ. Turning now to D_n, we examine

|w̃_n|² = w̃_n · w̃_n = (w̃_{n−1} + φ_n z̃_{i_n}) · (w̃_{n−1} + φ_n z̃_{i_n}) = |w̃_{n−1}|² + φ_n² |z̃_{i_n}|² + 2φ_n w̃_{n−1} · z̃_{i_n}.

Recall that we only use ADD when w̃_{n−1} · z̃_{i_n} < 0 and that by hypothesis |z̃_{i_n}|² < b² to obtain

|w̃_n|² < |w̃_{n−1}|² + b² φ_n².

Iterate to obtain

|w̃_n|² < |w̃_0|² + b² ∑_{j=1}^{n} φ_j² < (n + 1) b² ( (1/(n+1)) ∑_{j=0}^{n} φ_j² ),

with the last inequality holding if w̃_0 ∈ F̃. Hence,

0 ≤ D_n < |w̃*| √( (n + 1) b² ( (1/(n+1)) ∑_{j=0}^{n} φ_j² ) ) ≤ |w̃*| b φ̄ √(n + 1).

Thus we find that

G_n > δφ √(n + 1) / (b φ̄ |w̃*|),

and G_n diverges with increasing n, as claimed. This conclusion can only be reconciled with |G_n| ≤ 1 if n is bounded above.

Although not needed in the proof, we can also upper bound by n* the number of recourses to ADD by solving for n* such that G_{n*} ≥ 1, its maximum possible value. Hence, it suffices if

n* > |w̃*|² (b/δ)² (φ̄/φ)² = |γ*|² (φ̄/φ)²,

to guarantee that a solution is found by the PTA in the linearly separable case. This result provides an indication of the difficulty of implementing a linearly separable dichotomy. If we normalize Eq. 2.6.1 by dividing through by δ and normalize Eq. 2.6.2 by dividing through by b, then we obtain the normalized system γ* · x > 1 subject to |x| < 1. We estimate that the number of recourses to the ADD step grows as |γ*|². Hence, a large value of |γ*| leads us to expect a long running time for the
PTA. Unfortunately, we usually have no way to assess |γ*| without first running the PTA. Minimizing the lower bound to n* might seem like a good idea and would lead us to φ̄ = φ, or the constant value of φ_n assumed in the basic PTA. However, experience suggests that when the vectors to be classified contain outliers of large magnitude, then there is a speedup gained from scaling the vectors to common unit length. The maximum running time of the PTA can be explored further if we move into a probabilistic setting with the two classes considered to be drawn in i.i.d. fashion from two different probability distributions µ_{−1}, µ_1. If we can find two hyperplanes w, τ_{−1} and w, τ_1, differing only in their threshold values, with |τ_1 − τ_{−1}| > δ, such that

µ_1({x : w · x ≥ τ_1}) = µ_{−1}({x : w · x ≤ τ_{−1}}) = 1,

then we are assured that no matter which random samples are generated they will be linearly separable in a time upper bounded by a multiple of 1/δ². However, in general, the two probabilistic models or statistical hypotheses generating the two classes will overlap. In this case, the randomly generated training set T will fail to be linearly separable, with probability converging to one, as its size n increases. The limiting case of a hyperplane separating the support sets of the two probability models with δ = 0 has been studied in [25].
2.6.3 First Alternatives to PTA
The preceding can be generalized somewhat further (e.g., [6, p. 118]). Examination of the convergence proof reveals that we can ensure the divergence of G_n, and hence convergence after a finite number n* of changes to the weight vector, if we require that the ratio of numerator N_n to denominator D_n diverges. This in turn requires that

lim_{n→∞} ( ∑_{j=0}^{n} φ_j )² / ∑_{j=0}^{n} φ_j² = ∞.

Because we are requiring that φ_j ≥ 0 and be positive for at least one j, this implies that the numerator itself must diverge

lim_{n→∞} ∑_{j=0}^{n} φ_j = ∞.
Divergence of the numerator N_n is also necessary if we are to ensure that starting from any initial vector w̃_0 we can reach any other solution vector w̃*. The boundedness and bounded away from zero conditions we imposed guarantee the satisfaction of these more general conditions.

A PTA variation is to adjust

φ_j = (ε − w̃_{j−1} · z̃_{i_j}) / ||z̃_{i_j}||²
in the ADD step to ensure that w̃_j · z̃_{i_j} = ε > 0. Hence, in one step we ensure a positive dot product for z̃_{i_j} with the revised weight vector w̃_j. Another variation using the batch processing approach is to ADD all of the vectors that have a negative dot product with w̃_j. A trivial training algorithm (see Minsky and Papert [168], pp. 180, 181) can be constructed by observing that if the set F̃ is separable as assumed in the convergence theorem, then so long as its elements x̃ ∈ IR^d there exists a separating weight vector having only rational elements; rational approximations to the components of w* can be made accurately enough to ensure that we can have separation to within, say, δ/2. Because there are only countably many rationals and only countably many d-tuples of rationals, we need only consider countably many possible weight vectors. Enumerate the vectors and test them in sequence, changing to the next vector only when a vector in F̃ that is misclassified is chosen. This algorithm will also converge to a solution in a finite time, although any bound to this time would depend on the enumeration process. Finally, of course, there is the option of the simplex algorithm and the techniques of linear programming.
2.7 The Nonseparable Case

2.7.1 Behavior of the PTA

Question 1(c) has us enquire into the case of a training set T that is not linearly separable and therefore can only be approximated by a perceptron. In the nonseparable case, the PTA will never halt, although it will orbit among a set of weight vectors of bounded length (see [168, Secs. 11.8–11.10]). Figure 2.5 shows the ratios of the magnitudes ||w_n|| of successive weight vectors and the cosines w_n · w_{n−1} / (||w_n|| ||w_{n−1}||) of the angles between successive weight vectors when we have 100 training samples in IR^4, and the collection is not linearly separable. A different analysis of the behavior of the PTA for two models of nonseparable data is provided by Shynk and Bershad [219]. They assume asymptotic convergence in distribution of the weight vector and solve for the possible limit points identified as stationary points. In their formulation of the PTA, letting t_n denote the correct classification of x_n,

y_n = sgn(w_{n−1} · x_n),   e_n = t_n − y_n,   w_n = w_{n−1} + 2α e_n x_n.

Assuming a random pattern selection mechanism, we can take expectations and write

E w_n = E w_{n−1} + 2α E e_n x_n.

The stationary points are characterized by

E w_n = E w_{n−1} ⇒ E e_n x_n = 0.
FIGURE 2.5. PTA behavior under nonseparability.
They explore this latter condition under their models of nonseparable data to determine the values of w for which it holds. In their second nonseparable case (the outputs of two perceptrons excited by i.i.d. Gaussian inputs are then multiplied together) they find a unique stationary point of 0 and have supporting simulations.
2.7.2 Further Alternatives to PTA
In the nonseparable setting we might adopt Definition 2.7.1 (Optimal Separation) A weight vector and threshold pair w∗ , τ ∗ is optimal for a finite sized T if this pair correctly classifies as many points in T as can be correctly classified by any other pair. No consideration of how far points in T are from the separating hyperplane is included in this definition of optimal separation. As will be noted in the next section, this can be important for good generalization behavior (performance on new data). Clearly the optimal pair must correctly classify at least half of the elements of T . One can adapt the PTA by keeping track of the best hyperplane to date and the number of errors it makes, although this requires batch evaluation of the performance of each hyperplane. This is done in the algorithm given in Appendix 2. Remarkably, achievement of an optimal perceptron is intrinsically difficult and has been shown to be an NP-complete problem in Amaldi [6]. Hoffgen and Simon [105] provide a fundamental look at the difficulty of training a perceptron when linear separability fails. Heuristic algorithms have been proposed by Gallant [81], Amaldi [6, Ch. 5], and by Roychowdhury et al. [202]. An online revision of the PTA by [81], called a pocket algorithm, yields linearly separable solutions when they exist and approximately optimal
38
2. Perceptrons—Networks with a Single Node
solutions when the two families are not linearly separable. The basic idea is to keep in the "pocket" both that weight vector w* that has had the longest run of consecutive correct classifications and its run-length score s*. The contents of the "pocket" are revised whenever a new weight vector generates a longer run of correct classifications. Gallant assumes that the training vector selection process is that of i.i.d. choices made at random. He claims that if one iterates the PTA long enough, then with probability as large as desired, the weight vector in the pocket will be optimal. Frean [78] proposed a thermal perceptron algorithm in which the scaling φ_n in the PTA is chosen to have a slowly decreasing magnitude of the form

(T_n / T_0) exp( w̃_n · z̃_{i_n} / T_n ),

where the "temperature" T_n decreases linearly from T_0 to 0. This algorithm favors making corrections only when the error w̃_n · z̃_{i_n} is small. The performance of this algorithm is reported by Amaldi in [6, Ch. 5], to be sensitive to the choice of initial temperature T_0, and some variants are presented there. Amaldi proposes a probabilistic perceptron algorithm in which we make a change to the weight vector, when one is indicated by a misclassification, only with probability

P(w̃_n = w̃_{n−1} + φ_n z̃_{i_n}) = exp( w̃_{n−1} · z̃_{i_n} / T_n ).
The temperature schedule {Tn } should be chosen to be slowly decreasing (e.g., logarithmically) and to be scaled so that initially the probability of making a change is close to unity. As the number of visits to ADD increase, we become increasingly unlikely to change the weight vector. It is asserted in Amaldi’s Corollary 5.5.3 [6, p. 129] that for any given probability π there exists a number of iterations iπ such that when we have exceeded this number of iterations we will have achieved an optimal weight vector with probability at least π.
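The bookkeeping behind the pocket idea can be sketched as follows in MATLAB; the random i.i.d. selection, the fixed iteration budget, and the data are our own illustrative choices, and the fragment is not intended to reproduce Gallant's published algorithm in detail.

% Sketch: pocket algorithm -- keep the weights with the longest run of
% consecutive correct classifications under random i.i.d. selection.
X = [ 2 1;  1 3; -1 -2; -2 1;  0 -3; -1 -1];   % the last +1 point makes the set nonseparable
t = [ 1;    1;   -1;    -1;   -1;    1];
[n, d] = size(X);
F = diag(t) * [X, -ones(n,1)];    % standardized, reflected vectors

wt = F(1,:)';  run_len = 0;       % current perceptron weights and their run length
pocket_w = wt; pocket_run = 0;    % contents of the "pocket"

for iter = 1:5000
    i = randi(n);                 % random i.i.d. selection of a training vector
    if F(i,:) * wt > 0
        run_len = run_len + 1;
        if run_len > pocket_run   % longer run than the pocket's: swap it in
            pocket_w = wt; pocket_run = run_len;
        end
    else
        wt = wt + F(i,:)';        % ordinary PTA correction, run restarts
        run_len = 0;
    end
end

errors = sum(F * pocket_w <= 0);  % training errors of the pocketed weights
fprintf('pocketed weights misclassify %d of %d points\n', errors, n);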
2.8 The Augmented Perceptron—Support Vector Machines

The classification power of a perceptron operating on a training set of size n depends on the dimension d of the space of inputs. We learned in Section 2.3 that the number D(n, d) of dichotomies that can be induced by a perceptron is order of n^d/d!, which, from Stirling's formula in Appendix 1, is approximately

(ne/d)^d.

For n >> d, the typical case, this number, while very large, is but a very small fraction of the total number 2^n of possible dichotomies. In Section
2.5 we considered the input space dimension d* that would be needed to classify all of the training samples, and we found the distribution P_{d*}(d) in Eq. 2.5.1, when we assume that all input vectors are in general position and the classes are assigned at random. We further observed that when n/d < 2 then there is a very high probability that we will be able to correctly classify the training set. If we can augment the dimension of the input space, then we will have an increasing probability of successful classification, and for d = n − 1 we are guaranteed to succeed, so long as the input vectors remain in general position: perceptrons are powerful devices in high-dimensional spaces. These considerations suggest the desirability of augmenting the input vectors to increase their dimension while preserving their remaining in general position in the augmented space of enlarged dimension d′. Traditionally, this augmentation was accomplished by introducing polynomial functions of x as additional variables, e.g., x_i^k, ∏_{j=1}^{k} x_{i_j}, etc. For example, if we wish to recognize (assign value 1 to the set and −1 to its complement) an elliptical disk in IR^2, then adding to the original variables x_1, x_2 the three new variables x_1², x_2², x_1x_2 enables us to use the perceptron in IR^5 to form quadratic boundaries in IR^2. It is known that such so-called polynomial threshold gates can synthesize all Boolean functions. More generally, one augments x ∈ IR^d to z ∈ IR^{d′} by selected functions ψ_1(x), . . . , ψ_{d′}(x) of the original vector. A training set T_n that was not linearly separable can become so when augmented to {(z_i, t_i)}. Vapnik [241, Ch. 5], in an original discussion of this old approach, which is developed in detail in Vapnik [243, Part II], distinguishes the two issues of computing the perceptron function in a high-dimensional space and of the generalization ability of the augmented perceptron. Turning first to computing or representing the high-dimensional perceptron, Vapnik notes that the usual inner product w · x that we have been employing for vectors in IR^d, when thought of as a function of w, x, can be generalized to a continuous, symmetric function K(w, x) that is also positive definite. A function is positive definite over some set A ⊂ IR^d if for all nonzero, integrable-square functions f on A,

∫_A ∫_A K(w, x) f(w) f(x) dw dx > 0.
Such a function K, called a kernel, arises in linear integral equations of the form
αψ(w) = ∫_A K(w, x) ψ(x) dx.   (2.8.1)

The solutions to Eq. 2.8.1 are the eigenfunctions {ψ_i} and the corresponding eigenvalues {α_i}.

Theorem 2.8.1 (Mercer's Theorem ([45])) If the kernel K is symmetric, continuous, and positive definite with eigenvalues {α_i} and eigenfunctions {ψ_i}, then the orthonormal expansion ∑_i α_i ψ_i(w) ψ_i(x) converges
absolutely and uniformly on A, the {α_i} are positive, and

K(w, x) = ∑_i α_i ψ_i(w) ψ_i(x)   (2.8.2)

is valid. Hence, K(w, x) is a conventional positively weighted inner product in the augmented representations

w → {ψ_i(w)},   x → {ψ_i(x)}.   (2.8.3)
Interestingly, the function K can also be thought of more familiarly as a correlation function for a spatial random process indexed by the two points w, x ∈ IR^d, rather than by the more familiar indexing by a scalar time variable. To relate such a correlation function K to our generalization of the perceptron, we use Eqs. 2.8.2 and 2.8.3. The inner product is now given by the (positively weighted) sum

w · x → ∑_i α_i ψ_i(w) ψ_i(x) = K(w, x).
However, instead of having to evaluate this large (potentially infinite) sum, we need only evaluate the correlation function K. The original inner product corresponds, of course, to

K(w, x) = ∑_{i=1}^{d} w_i x_i,

where ψ_i(x) = x_i, for i = 1 : d. In this case the orthonormality condition for {ψ_i} holds if we restrict the components x_i, w_i ∈ [−a, a]. For polynomials of degree k, Vapnik suggests using K(w, x) = [1 + w · x]^k. We can now embed our original inputs into spaces of much higher dimension, with the high-dimensional inner product easily computed from a spatial correlation function. The resulting augmented perceptron is described by

y(x) = sgn(K(w, x) − τ).   (2.8.4)
If the embedding space is of high enough dimension, then it is likely that we will achieve separability for the augmented training set—the training set will now be learnable by a perceptron. There remain, however, the issues of the choice of embedding and of the generalization ability of the augmented perceptron. The choice of embedding to ensure good generalization led Vapnik [241, Ch. 5], to the concept of support vector machines

y(x) = sgn( ∑_i α_i K(x_i, x) − τ ),   (2.8.5)
with {x_i} the support vectors. Support vector machines have attracted significant interest (e.g., Vapnik et al. [242], Vapnik [243, Part II]). Given a linearly separable training set, good generalization performance is achieved by the separating hyperplane that maximizes the minimum distance to training vectors of either of the two classes (see also Bartlett [21] and Section 7.8). The solution for such a weight vector w can be expressed in terms of a weighted sum of certain training set input vectors, called support vectors. Hence, in this case, as in Section 2.2.3,

y(x) = sgn( ∑_i α_i t_i x_i · x − τ ).
Given a training set, not necessarily linearly separable, a family of nonlinear decision surfaces is introduced by embedding (replacing xi · x by K(xi , x)), as above. The choice of specific surface that is made trades off minimization of the length of a normalized normal vector w to the hyperplane (in the possibly infinite dimensional embedding variable space) and the number of classification errors made by this choice. That shorter parameter vectors correspond to better generalization performance will be defended in Section 7.8. This optimization problem, the details of which can be found in Pontil and Verri [190], is transformed into a dual problem in which in place of w we need only deal with the inner or dot product between training vectors given by the kernel K. Again the solution can be expressed in terms of the kernel evaluated at certain pairs of training input vectors, as shown in Eq. 2.8.4.
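A small numerical check of the embedding idea: the MATLAB sketch below verifies that the polynomial kernel K(w, x) = [1 + w · x]² coincides with an ordinary inner product of explicitly augmented vectors in IR^6 when d = 2; the particular feature map psi is one standard choice and is not taken from the text.

% Sketch: the polynomial kernel [1 + w.x]^2 as an explicit inner product in R^6
psi = @(x) [1, sqrt(2)*x(1), sqrt(2)*x(2), x(1)^2, x(2)^2, sqrt(2)*x(1)*x(2)];

w = [0.7; -1.2];
x = [2.0;  0.5];

K_direct   = (1 + w.'*x)^2;          % kernel evaluated in the original space
K_embedded = psi(w) * psi(x).';      % same value via the 6-dimensional embedding

fprintf('kernel: %.6f   embedded inner product: %.6f\n', K_direct, K_embedded);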
2.9 Generalization Ability and Applications of Perceptrons

The issue of learning/generalization raised by Question 4 can be framed as, "What is the probability that a new (n + 1)-st randomly selected and classified input vector will be correctly classified by the node?" Cover in [46] does not address this question but rather discusses the probability that the (n + 1)-st input vector will be unambiguously classified by the node trained on the previous n inputs. By Lemma 2.3.1, the input x_{n+1} is ambiguous if and only if there is a hyperplane through it that correctly dichotomizes the preceding n inputs. If points are in general position, then the number of such dichotomies is D(n, d − 1), whereas without the restriction to x_{n+1} being ambiguous, the number of dichotomies of the first n points is D(n, d). Hence, the probability A of ambiguous classification (i.e., there exist two nodes, each correctly classifying the training set, but disagreeing in their classification of the new input) is given by

A(n, d) = D(n, d − 1)/D(n, d),
with limiting behavior given by

lim_{n→∞, d=βn} A(n, d) = β/(1 − β) for β ≤ 1/2.

Vapnik [241, Ch. 5] addresses the generalization ability of the augmented perceptron in some detail. His analysis enables him to provide advice about a choice of the separating hyperplane that will generalize well to new data. Recall our observation in Section 2.2 that if there is a separating hyperplane, then there is one that can be expressed in the form

H = {x : ∑_{j=1}^{k} α_j t_{i_j} x_{i_j} · x − τ = 0}.
Vapnik's advice is that we choose such a separating hyperplane involving the fewest (smallest k) training set vectors and maximizing its minimum distance to the two convex hulls C(S^+), C(S^−). We postpone detailed study of generalization/learning to the treatment of multinode networks (see also Anthony and Biggs [11]). It is well-known in statistics that if we have an input vector x coming from a multivariate normal distribution with mean vector m_i and covariance matrix R, x ∼ N(m_i, R), i = −1, 1, then the optimal statistical classifier is the Fisher linear discriminant function and is precisely in the form of a perceptron. Because there are many applications of this statistical model (e.g., to detection of binary signals in noise in communications or radar), there are many applications of the perceptron.
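For concreteness, the sketch below forms this linear discriminant in MATLAB under the additional assumption of equal prior class probabilities; the expressions for w and the threshold are the standard ones for this Gaussian model, not formulas quoted from the monograph.

% Sketch: Fisher linear discriminant for two Gaussian classes N(m_i, R),
% equal priors, realized as a perceptron y = sgn(w.x - tau).
m_pos = [ 2; 1];                % mean of class +1
m_neg = [-1; 0];                % mean of class -1
R     = [ 2 0.5; 0.5 1];        % common covariance matrix

w   = R \ (m_pos - m_neg);      % w = R^{-1}(m_1 - m_{-1})
tau = 0.5 * (m_pos + m_neg).' * w;   % threshold for equal priors

x = [1.5; 0.2];                 % a test input
y = sign(w.' * x - tau);
fprintf('classified as %+d\n', y);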
2.10 Alternative Single-Neuron Models

2.10.1 Binary-Valued Inputs
If the inputs to the perceptron are binary-valued, then Question 1 concerns the implementation of Boolean functions, the synthesis of switching circuits. We can adapt the results of Section 2.3.3 to the implementation of a Boolean function of d variables by choosing the training set size n to be the total number 2^d of inputs to a Boolean function of d variables. Because the 2^d binary-valued sequences that now comprise the input set S are not in general position as points in IR^d, our evaluation of D(2^d, d) is only an upper bound to the exact number B(d) of Boolean functions that are implementable by a perceptron. A comparison of the number D(2^d, d), assisted by the bound

B(d) ≤ D(2^d, d) < 2^{d²}/(d − 1)!,
with the exact count of 2^{2^d} possible Boolean functions reveals that asymptotically very few Boolean functions can be implemented by a single perceptron. The logarithm of the upper bound to B(d) is asymptotically accurate.

Theorem 2.10.1 (Theorem 2.13 of Siu et al. [222])

lim_{d→∞} log_2 B(d) / d² = 1.
Winder in [253] has calculated B(d); a few exact values are:

d:       1    2     3      4        5            6
B(d):    4   14   104   1882   94,572   15,028,134
If we look at the d = 2 case in detail, then we can identify XOR and EQ as the only two nonimplementable Boolean functions of the 16 cases. However, as NAND is known to be complete for Boolean functions, we expect that combinations of nodes in a network will be able to implement all Boolean functions.
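A small numerical comparison (a sketch, not from the text) of Winder's exact counts B(d), the perceptron bound D(2^d, d), and the total count 2^{2^d} of Boolean functions can be made directly in MATLAB; note that the bound is attained exactly for d = 1, 2, consistent with the XOR/EQ exclusion above.

% Compare Winder's exact counts B(d), the bound D(2^d, d), and 2^(2^d).
D = @(n, d) 2 * sum(arrayfun(@(i) nchoosek(n-1, i), 0:min(d, n-1)));  % D(n,d), n > d
B = [4 14 104 1882 94572 15028134];           % Winder's exact values for d = 1:6
for d = 1:6
    n = 2^d;
    fprintf('d=%d  B(d)=%10d  D(2^d,d)=%12.4g  2^(2^d)=%10.4g\n', ...
            d, B(d), D(n, d), 2^n);
end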
2.10.2 Sigmoidal Nodes and Radial Basis Functions
Two very important alternatives to the linear threshold unit discussed earlier are:

1. change the function from signum to, say, sigmoidal or more particularly logistic;

2. change the argument of the function from the affine form w · x − τ to some other function of the input.

Many applications, say, in forecasting and control, require real-valued responses, and this suggests the desirability of choosing a continuous nonlinearity y = σ(w · x − τ). This is a form of regression using ridge functions (not to be confused with what is known as "ridge regression"). The response is constant along ridges defined by the hyperplane w, τ, as illustrated in Figure 2.6. The choice of a nonlinearity with range [0,1] has led some workers to identify the response with a probability (e.g., of correct classification). They interpret σ(w · x − τ) as the probability that x ∈ S+, thereby permitting overlap between S+, S−. Most of our discussion of networks of nodes will take place in the context of continuously valued node functions. An example of item (2) is provided by the discussion on augmentation of Section 2.7. Another example is provided by the concept of radial basis functions. This concept is discussed in Hassoun [97] and Haykin [100] and leads to approximations to a function f(x) of the form

f(x) ≈ \sum_{i=0}^{m} α_i φ(||x − c_i||),
FIGURE 2.6. Ridge function.
where {c_i} are the centers of the radial basis function φ, and a typical choice for φ would be the Gaussian

φ(z) = e^{−αz^2}.

The radial basis function representation can also be established, as noted by Vapnik, through the correlation function K augmentation of a perceptron.
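The following minimal sketch (the one-dimensional target function, the centers, and the width α are made up, and the coefficients are simply fit by least squares) illustrates such a Gaussian radial basis function approximation.

% Sketch of a 1-D Gaussian radial basis function approximation with
% least-squares coefficients; target, centers, and width are illustrative.
phi    = @(z) exp(-4 * z.^2);                 % Gaussian basis, alpha = 4
c      = linspace(-1, 1, 9);                  % centers c_i
x      = linspace(-1, 1, 101)';               % sample points
f      = sin(pi * x);                         % target function (made up)
Phi    = phi(abs(x - c));                     % design matrix Phi(j,i) = phi(|x_j - c_i|)
coef   = Phi \ f;                             % least-squares coefficients alpha_i
maxerr = max(abs(Phi * coef - f))             % approximation error on the samples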
2.11 Appendix 1: Useful Facts about Binomial Coefficients and Their Sums

We assume that d < n/2:

n! ∼ e^{−n} n^{n} \sqrt{2πn}   (Stirling's formula);

\binom{n}{k} = \binom{n}{n−k};

\binom{n}{k} ↑ for k < n/2;

\binom{n}{k} < \frac{n^k}{k!};

\frac{1}{n+1} 2^{nH(d/n)} ≤ \sum_{k=0}^{d} \binom{n}{k} ≤ 2^{nH(d/n)},

where H(p) = −p \log_2 p − (1 − p)\log_2(1 − p) is the binary entropy function;

\binom{n}{d} < \sum_{k=0}^{d} \binom{n}{k} < (d + 1)\binom{n}{d};

\sum_{k=0}^{d} \binom{n}{k} < \binom{n}{d}\left[1 + \frac{d}{n + 1 − 2d}\right];

\lim_{n \to \infty,\, d = \beta n} \frac{\sum_{k=0}^{d} \binom{n}{k}}{\binom{n}{d}} = \frac{1 − \beta}{1 − 2\beta}   for β < 1/2   (Bahadur/Cover);

\sum_{k=0}^{d} \binom{n}{k} < 1.5\, \frac{(n+1)^d}{d!}   for n > d   (Vapnik [240, p. 166]).
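As a quick sanity check (a sketch, for one made-up pair (n, d) with d < n/2), several of these bounds can be evaluated numerically:

% Numerical check of several of the binomial-sum bounds above for n=30, d=8.
n = 30; d = 8;
S  = sum(arrayfun(@(k) nchoosek(n, k), 0:d));        % sum_{k=0}^{d} C(n,k)
H  = @(p) -p*log2(p) - (1-p)*log2(1-p);              % binary entropy function
fprintf('entropy bounds : %g <= %g <= %g\n', 2^(n*H(d/n))/(n+1), S, 2^(n*H(d/n)));
fprintf('C(n,d) bounds  : %g <  %g <  %g\n', nchoosek(n,d), S, (d+1)*nchoosek(n,d));
fprintf('Bahadur/Cover  : %g vs (1-b)/(1-2b) = %g\n', S/nchoosek(n,d), (1-d/n)/(1-2*d/n));
fprintf('Vapnik bound   : %g <  %g\n', S, 1.5*(n+1)^d/factorial(d));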
2.12 Appendix 2: Proofs of Results from Section 2.3

Lemma 2.3.1 The dichotomies {S+, S− ∪ {x0}}, {S+ ∪ {x0}, S−} are both implementable by a perceptron if and only if {S+, S−} is linearly separable by a hyperplane containing x0.

Proof. If {S+, S−} is linearly separable by a hyperplane w, τ containing x0,

w · x0 = τ,   x± ∈ S± ⇒ w · x± − τ ≷ 0,

then by slightly perturbing this hyperplane, by replacing τ by τ ± ε for

0 < ε < \min_{x ∈ S} |w · x − τ|,
the classification of points in {S+, S−} is unchanged and that of x0 can be made ∓1. On the other hand, if the dichotomies {S+ ∪ {x0}, S−}, {S+, S− ∪ {x0}} are both linearly separable, say, by (w1, τ1), (w2, τ2), respectively, then we can construct a hyperplane,

w3 = \frac{w_1}{w_1 · x_0 − τ_1} + \frac{w_2}{−(w_2 · x_0 − τ_2)},   τ3 = \frac{τ_1}{w_1 · x_0 − τ_1} + \frac{τ_2}{−(w_2 · x_0 − τ_2)},

that contains the point x̃0 yet continues to separate {S̃+, S̃−}. To verify this, note that −(w2 · x0 − τ2) > 0, w1 · x0 − τ1 > 0, and (w1, τ1), (w2, τ2), by hypothesis, separate {S̃+, S̃−}. □

Lemma 2.3.2 If S = {x1, . . . , xn},

L(S) = L(S − {x_n}) + L_{x_n}(S − {x_n}).     (2.3.1)
Proof. To see this, note that the possible linearly separable dichotomies of the full set S are those dichotomies of just S − {xn } for which the assignment to xn is uniquely specified as a consequence of the other assignments, together with those dichotomies of S − {xn } that can be refined by either
of the possible assignments to x_n. Invoking Lemma 2.3.1, we see that the number of those dichotomies of S − {x_n} that can be augmented by either of the possible assignments to x_n is L_{x_n}(S − {x_n}). Hence, the number of dichotomies of S − {x_n} for which the assignment to x_n is uniquely specified as a consequence of the other assignments is L(S − {x_n}) − L_{x_n}(S − {x_n}). Thus the total number of linearly separable dichotomies of S is

L(S) = L(S − {x_n}) − L_{x_n}(S − {x_n}) + 2L_{x_n}(S − {x_n}) = L(S − {x_n}) + L_{x_n}(S − {x_n}). □

We now evaluate L_{x_n}(S − {x_n}), assuming that the points of S are in general position. Following Nilsson [180, p. 34], we construct points P = {p_1, . . . , p_{n−1}} from S as follows. Introduce the n − 1 lines l_i connecting x_n to each of x_1, . . . , x_{n−1},

l_i = {a(x_i − x_n) + x_n : a ∈ IR}.

By the assumption of general position, it follows that no k ≤ d + 1 points of S are such that any of them is in the convex hull of the other k − 1 points of S. Hence, these lines intersect only at x_n, and we can select a (d − 1)-dimensional hyperplane H0 = (w0, τ0) intersecting all of the lines (we need only ensure that H0 is not parallel to any of the lines) and not containing x_n, w0 · x_n ≠ τ0. The set P is the set {p_i} of intersections between H0 and the n − 1 lines,

p_i = x_n + \frac{τ_0 − w_0 · x_n}{w_0 · (x_i − x_n)}\,(x_i − x_n),

and P will be viewed either as a subset of H0 or as a subset of IR^{d−1}.

Lemma 2.12.1 If the points of S are in general position, then L_{x_n}(S − {x_n}) = L(P).

Proof. Consider the set P′ = {z_1, . . . , z_{n−1}} where (p_i − x_n) · (x_i − x_n) > 0 implies that z_i = x_i, otherwise z_i = 2x_n − x_i. Thus z_i is either x_i or the reflection of x_i about x_n on the line joining the two. Any hyperplane H through x_n produces a linear dichotomization of S − {x_n} that agrees with the linear dichotomization it produces of P′, except that if z_i ≠ x_i then the class assigned to z_i is the negative of the class assigned to x_i. Hence, there is a one-to-one correspondence between linear dichotomizations, produced by hyperplanes containing x_n, of S − {x_n} and those of P′. This verifies that L_{x_n}(S − {x_n}) = L_{x_n}(P′). Observe that x_i, x_n, p_i, and z_i are all collinear, and in particular that z_i − x_n = α(p_i − x_n) with α > 0.
FIGURE 2.7. Projection construction.
Hence, for all i, z_i lies on the line l_i joining x_i and x_n and is on the same side of x_n as is p_i. The point of this is that z_i and p_i lie on the same side of any hyperplane through x_n, and therefore must be classified identically by any such hyperplane,

w · z_i − τ = w · (z_i − x_n) = αw · (p_i − x_n) = α(w · p_i − τ).

From α > 0, it follows that z_i, p_i enjoy the same classification by any hyperplane for which w · x_n = τ. Hence, dichotomizations of P′ by hyperplanes containing x_n are equivalent to dichotomizations of P. If we can show now that L_{x_n}(P) = L(P), then it will follow that L_{x_n}(S − {x_n}) = L(P), as desired to establish the lemma. The linear dichotomization of P in H0 by H containing x_n can also be thought of as being accomplished by the (d − 2)-dimensional hyperplane H1 in H0 that results from the intersection of H with H0. Given the point x_n and the hyperplane H0 there is a one-to-one correspondence between (d − 1)-dimensional hyperplanes H in IR^d containing x_n and arbitrary (d − 2)-dimensional hyperplanes H1 in H0. Hence, there is a one-to-one correspondence between the linear dichotomizations
of P by a hyperplane through x_n and the linear dichotomizations of P by an arbitrary hyperplane H1 in H0, and the lemma is proven. □

Theorem 2.3.1. The unique solution to Eqs. 2.3.2 and 2.3.3 and upper bound to L(S) is

D(n, d) = 2\sum_{i=0}^{d} \binom{n−1}{i}, if n > d;   2^n, if n ≤ d.     (2.3.4)

Proof. This solution can be verified by substitution, use of the identity

\binom{n}{k} = \binom{n−1}{k} + \binom{n−1}{k−1},

and recalling that \binom{a}{b} = 0 if b > a. □

To prove Theorem 2.3.2, we return to the projection operation introduced earlier and establish

Lemma 2.12.2 If the n points in S are in general position in IR^d then the projected points P are in general position in IR^{d−1}.

Proof. Assume, to the contrary of our expectations, that general position fails. Then there is a k ≤ d and points p_{i_1}, . . . , p_{i_k} ∈ P that lie in an affine subspace (hyperplane) H′ of dimension k − 2. Select a (k − 1)-dimensional hyperplane H in IR^d containing x_n and containing (extending) H′. H now contains the lines joining x_n to each of the projected points p_{i_j} and therefore also contains the original points x_{i_j} of which they were the projections. Hence, H is now a (k − 1)-dimensional hyperplane on which we find k + 1 points, x_n together with the points x_{i_1}, . . . , x_{i_k} ∈ S, thereby contradicting the assumption that S itself is in general position. □

We now have the basis for an induction proof to establish the following:

Theorem 2.3.2. If S, S′ are both sets of n points in general position in IR^d then

L(S) = L(S′) = \hat{L}(n, d) = D(n, d).     (2.3.6)

Proof. For the induction hypothesis let D(m, d) = L(S) when ||S|| = m < n, S ⊂ IR^d, S has points in general position, and we have proven that for those values of m < n and all d all such sets can be dichotomized in the same number of ways. Clearly, in agreement with Eq. 2.3.3,

||S|| = 1 ⟹ L(S) = D(1, d) = 2,
d = 1 =⇒ L(S) = D(n, 1) = 2n,
the latter holding so long as the points in S are distinct (are in general position in IR1 ). Hence, in the two cases of n = 1 and all d or of d = 1 and all n, we know that L(S) = D(n, d). Assume now that n > 1, d > 1. We will proceed by induction on n. For n = 1 we know D(1, d) for each value
of d. The induction hypothesis, true for n = 2, is that for any m < n and any d, and any S̃ ⊂ IR^d with ||S̃|| = m,

L(S̃) = D(m, d) = 2\sum_{i=0}^{d} \binom{m−1}{i}, if m − 1 > d;   2^m, if m − 1 ≤ d.

Our induction step is provided by the recursion of Eq. 2.3.1. Assume ||S|| = n. Then as S̃ = S − {x_n} is in general position, it follows by the induction hypothesis for m = n − 1 that L(S − {x_n}) = D(n − 1, d) for any set S of points in general position in IR^d. By Lemma 2.3.3 we see that L_{x_n}(S − {x_n}) is precisely the number of ways of dichotomizing the n − 1 projected points P that by the preceding lemma are in general position in IR^{d−1}. By the induction hypothesis this number is D(n − 1, d − 1). Hence, invoking Lemma 2.3.2 in the induction step we see that

L(S) = L(S − {x_n}) + L_{x_n}(S − {x_n}) = D(n − 1, d) + D(n − 1, d − 1).

We have then proven that L(S) depends only on the parameters n, d with the solution for D(n, d) equal to \hat{L}(n, d) and given by Eq. 2.3.4. The induction argument is complete. □

We provide a converse to Theorem 2.3.2, with elements of the proof suggested by K. Constantine [41].

Theorem 2.3.3. If the n points in S are not in general position in IR^d, then L(S) < D(n, d).

Proof. If S is not in general position, then for some 2 ≤ k ≤ d + 1 we find k points of S lying in a (k − 2)-dimensional subspace. Let k be the smallest such positive integer. If k = 2, then we have fewer than n distinct points and thus L(S) ≤ D(n − 1, d) < D(n, d). If k > 2, then we can select k − 2 of these points (not necessarily uniquely), say, u_1, . . . , u_{k−2}, such that the hyperplane (of dimension k − 3) defined by them in the subspace of dimension k − 2 separates the remaining two points, say, y, w. Any hyperplane dichotomization of S induces a hyperplane of dimension k − 3 in the subspace of dimension k − 2 that we are discussing and that then dichotomizes the k points we have identified. Among the dichotomizations of S there are ones in which the k − 2 points u_1, . . . , u_{k−2} share the same classification, say, 1, and the point w has the different classification −1. It follows that in this case the remaining point y must be classified as a 1, as are the u_j. Because there are only finitely many dichotomizations of S by hyperplanes, say, {H_i}, and inequalities are all
strict, there is a minimum positive distance, say, δ, between any point of S and all of the chosen hyperplanes {Hi }. If we displace y by a distance of, say, δ/3 from a linear subspace (hyperplane) containing the k points we have identified, then y continues to lie on the same side of all the dichotomizing hyperplanes {Hi }. Hence, all the old dichotomies are preserved by this small displacement. However, now there exists a hyperplane that classifies u1 , . . . , uk−2 as, say 1, and w, y both as −1, and a new dichotomy has been created. Because we have created at least one new dichotomy by displacing y without eliminating any of the older dichotomies, the original number of dichotomies could not have been maximal. 2
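As a quick numerical sanity check (a sketch, not part of the text), the recursion of Eq. 2.3.1 with the boundary conditions D(1, d) = 2 and D(n, 1) = 2n can be tabulated and compared against the closed form of Eq. 2.3.4:

% Check that the recursion D(n,d) = D(n-1,d) + D(n-1,d-1) reproduces Eq. 2.3.4.
closed = @(n, d) (n <= d) * 2^n + (n > d) * 2 * sum(arrayfun(@(i) nchoosek(n-1, i), 0:min(d, n-1)));
N = 12; Dmax = 6;
D = zeros(N, Dmax);
D(1, :) = 2;  D(:, 1) = 2 * (1:N)';       % boundary conditions
for n = 2:N
    for d = 2:Dmax
        D(n, d) = D(n-1, d) + D(n-1, d-1);
    end
end
err = max(max(abs(D - arrayfun(closed, repmat((1:N)', 1, Dmax), repmat(1:Dmax, N, 1)))))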
2.13 Appendix 3: MATLAB Program for Perceptron Training

pta_program
%perceptron Training Algorithm for arbitrary threshold
%training set matrix T is configured with
%each sample a column and first d rows are input vector components and
%(d+1)-st row is class t in -1,1
%uses batch processing to exploit MATLAB capabilities
%calls phi.m, written by the reader, to supply weights in ADD step
%set time clock and flop count to determine speed of algorithm
t0=cputime;
f0=flops;
%determine input dimension and sample size
[d n]=size(T);
d=d-1;
%set maximum number of iterations
maxcycle=500;
%augment T to TT to handle threshold
TT=[T(1:d,:);-ones(1,n);T(d+1,:)];
%standardize TT to input vectors all of class +1
tset=TT(1:d+1,:).*(ones(d+1,1)*TT(d+2,:));
%initiate weight vector choice to first column
w=tset(:,1)';
index=1;
errcount=n;
for k=1:maxcycle, %takes a batch step
  if w*tset > 0
    disp('Training set is linearly separable with separating weights:')
    wfinal=w(1:d)
    threshold=w(d+1)
    cycle=k
    break
  else
    I=find(w*tset <= 0);
    %save best performer if not linearly separable
    err=length(I);
    if err < errcount,
      wbest=w;
      errcount=err;
    end
    %find first input vector that fails and has higher index than last such
    %if there is one with a higher index
    J=find(I>index);
    if (length(J)>0)
      index=I(J(1));
    else
      index=I(1);
    end
    w=w+phi(k)*tset(:,index)';
  end
  if k==maxcycle,
    disp('Training set is not linearly separable, best weights are:')
    threshold=wbest(d+1)
    wbest=wbest(1:d)
    errcount
  end
end
t1=cputime-t0, f1=flops-f0
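The program calls a step-size function phi.m that, as its header comment notes, is to be written by the reader. One possible (hypothetical) choice, giving the classical fixed-increment training rule, is simply:

function y = phi(k)
%PHI  step size applied at iteration k in the ADD step of pta_program
%     (one plausible choice; any positive schedule, e.g. 1/k, may be substituted)
y = 1;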
3 Feedforward Networks I: Generalities and LTU Nodes
3.1 Objectives and Setting As we saw in Chapter 2, a single node cannot learn all training sets. This motivates us to consider interconnections of such nodes in an artificial neural network in the expectation that we will greatly enlarge the set of tasks that can be performed. Philosophically we are motivated by “connectionism” and the hope of better understanding the emergence of intelligence from the actions of a large collection of simple elements. Biologically, the human brain provides us with a complex instance of such a cooperating interconnection of a large number of relatively simple elements. In the brain there is no specific site of “intelligence” nor a ruling “master” program that is detailed. Self-organization is central (e.g., Pagels [183].) However, the very complexity of successful biological systems and our rudimentary current state of knowledge of what makes them successful makes us share the following cautionary remark by J.T. Schwartz in Graubard [84], Thus theorists who take some hypothesis about learning as their starting point are choosing to begin in a particularly dark area of neuroscience. Instead we shall proceed mathematically, hopeful that such an analysis can provide metaphors to enable us to better understand biological phenomena and philosophical positions. We organize our approaches to artificial neural networks by the: (a) type of neurons;
(b) type of network architecture; and (c) purpose of the network.
3.2 Network Architecture and Notation

3.2.1 Node Types
We distinguish four types of nodes corresponding to input and output variables being either binary (B) or real (R). Real-valued outputs can be generated by nodes having graded responses that henceforth will be assumed to be restricted to a finite interval I, typically [0,1]. Nodes with binary-valued inputs and outputs (BB nodes) arise in the design of computing and switching circuits and form a classical chapter in electrical engineering. Although there has been interesting recent work on the complexity (numbers of layers and numbers of nodes in each layer) needed to implement various Boolean functions (e.g., Anthony and Biggs [11], Roychowdhury et al. [201], Siu et al. [222]) we treat such networks here only briefly in Section 3.3. Nodes with binary-valued inputs and real-valued outputs (BR nodes) have been little studied beyond treating the inputs as if they were real-valued. The output reals do not make a good match to the finitely many responses of such a node. RB nodes occur commonly in pattern classification and decision-making applications. RR nodes occur widely in control, estimation, and optimization applications, and networks with such nodes have been the most thoroughly studied and widely applied.
3.2.2 Architecture
Mathematically, a network is represented by a weighted, directed graph with elements shown in Figure 3.1. The graph, a collection of nodes or vertices connected by directed or oriented links (also known as edges or arcs) that carry associated weights, establishes the network topology. To complete the specification of the network, we need to declare how the nodes process information arriving at the incoming links and disseminate information on the outgoing links. The nodes of a network are either input variables, computational elements, or output variables. A weighted directed edge or link is a two-terminal device in which the real-valued signal variable x at the input end is directed to the other end and established at the value wx. Information flows unidirectionally from input to output and the direction of flow is indicated by an arrow on the link symbol. The links connected to an input variable node are all directed away from that node and none are incoming. An input variable is determined independently of the network (it is an exogenous variable). An output variable node can be connected to a single incoming link and is designated as a variable of
interest or, more conveniently, it may be the value established by a linear node performing a weighted summation of its inputs; strictly speaking such a node is a computational node, but this distinction is usually ignored in practice. Output variables are either certain input variables (a degenerate case) or the responses of selected computational elements. A computational element is a function of those variables to which it is connected by incoming links, and its unique output value is disseminated by outgoing links. These outgoing links may be to output nodes or to the inputs of other computational elements. Generally we assume a network has only a single
FIGURE 3.1. Network elements.
kind of nonlinear computational element (e.g., LTU or sigmoid) and that we have available linear summation elements. It is clear that the brain uses many different kinds of neurons, and the evolutionary parsimony of nature suggests that this proliferation has advantages. However, current practice respects this restriction, and it may be justified by the possibilities for hardware implementations of artificial neural networks. Definition 3.2.1 (Architecture) An architecture is a family N = {η} of networks having the same directed graph and node functions but with possibly different weights on the links. The class of architectures we will treat is referred to as the feedforward neural network (FFNN). To introduce the appropriate class of graphs we need to recall that an acyclic directed graph is one in which for no node is there a directed path that departs from that node and returns to it—there are no closed “cycles”. (By our definitions of network elements, it is only
the computational element nodes that are candidates for the termini of a cycle.) Definition 3.2.2 (Feedforward Network) The architecture of a feedforward net is defined by a directed, acyclic graph and a choice of node functions. A directed, acyclic graph can be levelized into layers. The initial or zeroth layer L0 contains the input nodes. These nodes have no incoming links attached to them. The first layer L1 consists of those nodes with inputs provided by links from L0 nodes. The ith layer Li consists of those nodes having inputs provided by links from Lj nodes for j < i. The final layer contains the output nodes. An important special case of an FFNN is the multilayer perceptron (MLP) in which the links to the ith layer Li come only from the immediately preceding layer Li−1 . This is the most commonly used feedforward architecture. Formally, by introducing “null computing nodes” that act only to pass forward their input variables, we can embed the more general FFNN in the class of MLP. The two types of LTU-based network architectures that we discuss in this chapter will have a first hidden layer of (generally many) LTU nodes followed either by: (a) a single LTU node in a second layer or (b) a somewhat narrower second hidden layer of LTU nodes followed by a single LTU node in the third layer. Although we shall not discuss Hopfield (also known as feedback or recurrent) networks, they are easily defined as an architecture whose graph contains cycles. The presence of cycles endows these networks with dynamics (e.g., the capabilities to sustain oscillations and generate signals persisting in time even in the absence of exogenous inputs) and enables their study as a species of dynamical system. As such they have attracted the attention of physicists and some mathematicians and circuit theorists. The dynamical state of the network consists of all the node outputs at a given time. Inputs are provided to such a net by “clamping” the node outputs at desired values at the initial time. An output can also be considered to be the sequence of states. The computational process in such an architecture is either synchronous or asynchronous. In the former case all nodes are updated at the same time; this is clearly not the situation in the human brain. In the latter case, nodes, possibly chosen at random, are updated one at a time. Feedback networks have been of interest as socalled associative or content addressable memories (e.g., McEliece et al. [160], Dembo [55]) and for their limited ability to solve certain optimization problems (e.g., Hopfield and Tank [108], Schaffer and Yannakakis [210]).
3.2.3 Notation
Enumerating the number of layers in a feedforward network has not yet been standardized. Hence, one should be cautious in interpreting a statement that a network has k layers until one can discern what is being counted. Any node whose output is not connected by a single link to an output node having no other incoming links is referred to as a hidden node—"hidden" in that observation of the network output does not tell us the response of a hidden node. The perceptron discussed in Chapter 2 is an example of a feedforward network having d input nodes in Level 0, an LTU computational element in Level 1, a single output node in Level 2, and there are no hidden nodes. The notational system we adopt for thresholds or biases (a bias b is the negative of a threshold τ, b = −τ) and nodes is that of a double subscript i : j, where i specifies that the object is at Level i and j specifies the particular object at that level. The notational system for weights is that of a triple subscript i : j, k, where i specifies Level i and j, k specifies the weight connecting node k in Level i − 1 to node j in Level i. The output of a node function F_{i:j} (or often just σ when all hidden layer nodes are the same) is denoted by a_{i:j} and the input to that node, its argument, by c_{i:j}. The number of nodes in Level i is s_i and the total number of parameters (weights plus biases) describing the network is p. The inputs to the network are either a_{0:i}, i = 1 : d, or more familiarly, x_i, i = 1 : d. When we are considering a variety of possible inputs {x^m, m = 1 : n}, we indicate this, if necessary, by a superscript as in a^m_{0:i}, x^m_i, a^m_{i:j}, c^m_{i:j}. An example of a network with a single hidden layer is shown in Figure 3.2. Layer 0 contains the four inputs x_{0:1}, . . . , x_{0:4}, although for input variables we generally just write x_1, . . . , x_4. Layer 1 contains three computational elements F_{1:1}, F_{1:2}, F_{1:3} with, for example, F_{1:1} having inputs x_1, x_2, x_3. The outputs {a_{1:i}} of the individual nodes {F_{1:i}} are given by

a_{1:i} = F_{1:i}\left(\sum_{j=1}^{4} w_{1:i,j} x_j − τ_{1:i}\right) = F_{1:i}(c_{1:i}).
The Layer 1 computational elements are hidden nodes because their outputs are not directly observable at the network output z. Layer 2 contains the single (nonhidden) computational element F_{2:1} having inputs from each of the computational elements in Layer 1, and Layer 3 contains the single output node with output variable

a_{3:1} = z = w_{3:1,1} F_{2:1}\left(\sum_{j=1}^{3} w_{2:1,j} a_{1:j} − τ_{2:1}\right).
For convenience in analysis and training explorations, we commonly aggregate all the preceding parameters into a single-column vector w of
FIGURE 3.2. Notation for neural networks.
“weights”, and assume that all connections between adjacent layers are present, with individual links that should be absent being given zero weight. An example of a specific encoding or stacking of parameters into w is given in the MATLAB program for ntout1 provided in Section 5.13.
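As a small illustration of this notation (a sketch only, with made-up weights and a logistic node function σ; the actual parameter encoding used by ntout1 appears in Section 5.13), a forward pass through a network patterned on Figure 3.2 can be written directly, with absent links given zero weight as just described:

% Forward pass through a 4-input, 3-hidden-node network with a second-layer
% node and a linear output link; all numerical values are illustrative.
sigma = @(c) 1 ./ (1 + exp(-c));       % node function F_{i:j}
x     = [0.5; -1; 2; 0.3];             % inputs x_1 ... x_4
W1    = [0.2 -0.1 0.4 0;               % w_{1:i,j}: node i of Layer 1, input j
         0.7  0.3 0   0.5;             % (zeros stand for links absent in the figure)
         0   -0.6 0.1 0.9];
tau1  = [0.1; -0.2; 0.3];              % thresholds tau_{1:i}
c1    = W1 * x - tau1;                 % c_{1:i}
a1    = sigma(c1);                     % a_{1:i}
w2    = [0.6 -0.4 0.8];  tau2 = 0.05;  % w_{2:1,j} and tau_{2:1}
a2    = sigma(w2 * a1 - tau2);         % a_{2:1} = F_{2:1}(c_{2:1})
z     = 1.3 * a2                       % z = w_{3:1,1} a_{2:1}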
3.2.4 Purpose of the Network
As in the case of the Perceptron, we will develop our analysis of nets around the responses to the four questions posed in Section 1.5.2. We refine Question 1 as to implementable functions by dividing it into the following subquestions: (a) implement a Boolean function;
(b) learn a training set T—classify a given set S of n vectors in IR^d into M classes; (c) provide a function δ to partition IR^d into M subsets or recognize a set; and (d) implement an arbitrary continuous function f from IR^d into IR^m. We can treat (a) as a special case of (b), although this will not generally be an efficient use of network hardware. Nonetheless, it establishes that any Boolean function can be implemented by a feedforward neural network. However, it is informative to show how familiar Boolean operations such as OR and AND can be implemented easily, and we start our discussion here. To treat (b) we need only deal with the binary M = 2 case—for larger values of M we can use the binary representation of M requiring ⌈log_2 M⌉ (⌈x⌉ is the smallest integer greater than or equal to x) such nets sharing the inputs. Each net is trained to achieve a single binary digit in the expansion of M. This reduction to the M = 2 case, of course, does not deal with issues of efficient use of network hardware. Another representation for this problem, frequently used in pattern classification, is to have a network with M real-valued output nodes and to decide that the overall output is the index of the node having the largest response. In effect, one attaches another network at the M output nodes to do this determination. Although we say little about networks having multiple outputs, an illustrative suite of MATLAB programs is provided in Section 5.14. Finally, it is also possible to use a thermometer representation, in which we use a single real-valued output node with its output range I partitioned into M subintervals I_1, . . . , I_M. A numerical output falling in I_j is considered as corresponding to class j. We will treat (c) exactly for LTUs and as an approximation to (d). To deal with (d) we need only deal with m = 1 and have m such nets sharing the inputs. Generally we will only produce approximate solutions, where the approximation can be made as close as desired through increased network complexity. We return to this topic in the next chapter, when we treat real-valued nodes. The discontinuous function δ will then be observed to be approximable by a continuous function f, except on sets of arbitrarily small measure.
3.3 Boolean Functions

We assume binary-valued variables x_1, . . . , x_d with each variable taking values in {0, 1}, rather than the less traditional {−1, 1}. There is an extensive literature on Boolean functions in the context of switching circuits and threshold logic (e.g., Muroga [176], Winder [253]) and it has been revisited
under the impetus of neural networks (e.g., Anthony and Biggs [11], Nilsson [180], Siu, et al. [222]). Boolean OR(x) takes on the value 1 if and only if at least one of the components of x is itself 1. It is simplest to assume the LTU nodes now take on the appropriate values of {0, 1} rather than their common range of {−1, 1} and are represented by the unit-step function U. Throughout this subsection let 1^T = (1, 1, . . . , 1) and 0 < τ < 1. In this case it is easy to verify that

OR(x) = U\left(\sum_{i=1}^{d} x_i − τ\right) = U(1 · x − τ).
Hence, OR is implemented by a single node or perceptron. Boolean AND(x) takes on the value 1 if and only if every component of x has the value 1. It is easily verified that

AND(x) = U\left(\sum_{i=1}^{d} x_i − d + τ\right) = U(1 · x − d + τ),
and the perceptron suffices. A well-known instance where the perceptron fails to represent a Boolean function is that of XOR. The function XOR(x) takes on the value 1 if and only if exactly one component of x takes on the value 1. In other words, XOR(x) = 1 ⟺ 1 · x = 1. Hence, it suffices to have

XOR(x) = U(1 · x − 1 + τ) − U(1 · x − 2 + τ),

a network having a first layer of two nodes whose outputs are differenced by a linear node in the second layer. If the representation is to be entirely in terms of LTU nodes, then

XOR(x) = U(U(1 · x − 1 + τ) − U(1 · x − 2 + τ) − τ).

For completeness, we summarize the answer to Question 1(a).

Theorem 3.3.1 (Implementation of Boolean Functions) Any Boolean function f defined either on the set {−1, 1}^d and taking values ±1, or on the set {0, 1}^d and taking values 0, 1, can be implemented by a feedforward neural network composed of LTU or unit-step nodes.

Proof. To implement any Boolean function f on {−1, 1}^d consider the sets S+ = {x : f(x) = 1}, S− = {x : f(x) = −1}. Assume, without loss of generality, that ||S+|| ≤ ||S−||. Given each x+ of the no more than 2^{d−1} inputs in S+, we create a unit-step node with
threshold τ+ = d − 1/2 and weight vector w+ = x+. These nodes act as so-called "grandmother nodes" (nodes designed for quick recognition of a specific familiar individual, say, your grandmother) and respond with a 1 if and only if the actual net input x = x+. Thus we have a wide first layer of width ||S+|| and need only introduce a single unit-step output node to OR the responses from the first hidden layer. If f is defined instead to take values in {0, 1}, then a slight revision in the above argument leads to a "grandmother" node for recognition of x+ with parameters

τ+ = d − 1/2 + 1^T(2x+ − 1) = 2 · 1^T x+ − 1/2,
w+ = 2(2x+ − 1) = 4x+ − 2 · 1.

An alternative proof will be provided in the next section, where we show how to learn any training set. Implementing a Boolean function amounts to learning a training set T = {(x, t)} of size 2^d, where S = {0, 1}^d and we have to learn a dichotomy S+ = {x : t = 1}, S− = {x : t = 0}. □

A response to Question 2 of Section 1.5, on complexity of implementations, is provided by Arai [12]. We defer our consideration to Section 3.4.1, where an answer to the complexity of implementing a Boolean function is given by Eq. 3.4.2. The Question 3 issue of the selection of the parameters of a neural network to implement a given Boolean function is answered by the constructive proofs. Question 4 on generalization is inapplicable to our formulation because we have presented complete learning of a Boolean function. Issues of generalization will be left to analyses to be presented in subsequent chapters.
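A small numerical check (a sketch, not part of the text) of the OR, AND, and two-layer XOR constructions above, taking τ = 0.5 and d = 2:

% Verify the OR, AND, and LTU-only XOR formulas on all four binary inputs.
U   = @(z) double(z >= 0);
tau = 0.5;  d = 2;
X   = [0 0; 0 1; 1 0; 1 1]';                          % inputs as columns
OR  = U(sum(X, 1) - tau);
AND = U(sum(X, 1) - d + tau);
XOR = U( U(sum(X,1) - 1 + tau) - U(sum(X,1) - 2 + tau) - tau );
disp([OR; AND; XOR])       % rows: OR, AND, XOR on inputs 00, 01, 10, 11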
3.4 Learning Arbitrary Training Sets

3.4.1 Error-Free Learning
We examine Question 1(b) for a feedforward net composed of RB nodes. The basic answer was established by Nilsson [180] in 1965 and reworked in Baum [23].

Theorem 3.4.1 (Sandwich Construction) Any n vectors S in general position (see Definition 2.3.1) in IR^d can be arbitrarily dichotomized into S+, S− by means of a feedforward neural network composed of no more than 2⌈n/(2d)⌉ LTU computational elements lying in Layer 1 and a single LTU node in Layer 2.

Proof. Identify the smaller of S+, S−, say, S+. Partition the vectors in S+ into groups of size d (the last group may have fewer than d vectors) and
select hyperplanes containing these individual groups and no others (this is possible by the hypothesis of general position). Then split the hyperplane (w, τ) containing a group into two parallel hyperplanes, of opposite orientation, (w, τ − ε), (−w, −τ − ε), to form a sandwich containing the given group of d vectors and no others; select ε > 0 small enough to avoid including other points. If the hyperplanes in a pair are oriented to have their positive side facing the group of vectors from S+ then a vector in the group will be given a weight of 2 and a vector not lying in such a sandwich will be given a weight of 0.
FIGURE 3.3. Sandwich construction.

Hence, we need only sum the outputs from all of the no more than ⌈n/(2d)⌉ pairs of hyperplanes and threshold the sum at 1, that is, we implement the Boolean function OR. This sandwich construction is developed into a MATLAB program in the appendix. □

Theorem 3.4.1 answers Question 1(b) and establishes that any group of vectors in general position can be dichotomized arbitrarily. In fact, we do not require the hypothesis of general position. If need be, we can select a hyperplane containing a given vector and no others in the set and expand to ⌈n/2⌉ pairs of hyperplanes. At worst, this will multiply the number of connections or weights by d. This provides an alternative constructive proof for Theorem 3.3.1. Hence, any training set T assigning vectors to one of two classes can be learned (memorized) by a single hidden layer network of LTU nodes. Furthermore, our constructive proof provides answers to Question 2 (complexity) and Question 3 (selection of the net) as well. To respond to Question 2, we estimate the complexity of a feedforward net by counting the number p of parameters (connection weights
plus thresholds) needed to provide any desired dichotomy of n vectors in general position. An upper bound is

p ≤ (d + 1)\frac{n}{d} + \left(\frac{n}{d} + 1\right) = n\left(1 + \frac{2}{d}\right) + 1 ≈ n,     (3.4.1)
and it is derivable from the Nilsson solution sketched earlier, assuming general position. In the worst case, if general position is not satisfied, then we need only increase our upper bound by a factor of d. The practical advice that emerges from these calculations is that it takes about one connection (weight) for each input vector that we wish to classify. The corresponding conclusion for the perceptron that one connection could learn two patterns was, of course, only true with high probability. That the upper bound is tight is suggested by Baum in [23, Lemma 1, p. 198]. Baum divides the family of vectors into two equal-sized subfamilies and considers the graph formed by having the vectors as nodes and the links being between nearest neighbors that are then constrained to be of different families; e.g., consider a “ring” of points in IRd that alternate in their class assignments so that nearest neighbors are of another class. It is possible to arrange this so that there are n such links, all of which must be cut by hyperplanes if the first hidden layer is to preserve the separation between classes. In general position a hyperplane can cut only d links in IRd . Hence, you need n/d hyperplanes or computing nodes in the first hidden layer. One needs to go from this count of nodes to a count of weights (d + 1)( nd ) to derive a lower bound on the complexity required of a general-purpose LTU-based feedforward network. One concludes that the complexity of the sandwich construction is not excessive and cannot be uniformly improved. Baum, on p. 207, shows that for specially generated problems (e.g., vectors selected i.i.d. and uniformly distributed over I d ) you can expect to correctly classify with far fewer nodes and links. In the special case of (partially defined) Boolean functions, the input vectors being the vertices of the hypercube {0, 1}d are not in general position as envisaged above. Arai in [12] considers this special case in detail and notes that we can follow the sandwich construction and always arrange matters so that we can include any three vertices within a sandwich and no other vertices. This yields the complexity estimate for a fully defined Boolean function of p ≤ (d + 1)
\frac{2n}{3} + \left(\frac{2n}{3} + 1\right) ≈ \frac{2n}{3}d.     (3.4.2)
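To make the sandwich construction of Theorem 3.4.1 concrete, the following minimal two-dimensional sketch (the points and their dichotomy are made up and chosen to be in general position) builds the pairs of parallel hyperplanes explicitly, using ±1-valued LTU nodes so that a point inside a sandwich contributes 2 and a point outside contributes 0, with the output node thresholding the sum at 1.

% Two-dimensional sandwich construction on a small made-up training set.
Splus  = [0 0; 2 1; 3 3; 5 1]';          % columns: the +1 class (the smaller class)
Sminus = [1 2; 4 0; 1 4]';               % columns: the -1 class
d = 2; ep = 0.05;
groups = {[1 2], [3 4]};                 % partition of S+ into groups of size d
W = []; T = [];
for g = 1:numel(groups)
    P = Splus(:, groups{g});
    w = null((P(:, 2:end) - P(:, 1) * ones(1, size(P, 2) - 1))');  % normal of the hyperplane through the group
    w = w(:, 1);  tau = w' * P(:, 1);
    W = [W,  w, -w];                     % the pair (w, tau - ep), (-w, -tau - ep)
    T = [T,  tau - ep, -tau - ep];
end
L   = @(z) 2 * double(z >= 0) - 1;                                 % +/-1-valued LTU
net = @(X) double(sum(L(W' * X - T' * ones(1, size(X, 2))), 1) >= 1);
disp([net(Splus), net(Sminus)])          % expect 1 1 1 1  0 0 0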
In response to Question 4 concerning learning and generalization, this sandwich construction cannot be expected to have good generalization properties. By choosing ε small, we are implying that there is only a small
likelihood of gaining observations from the +1 class; the probability of the sandwich regions is likely to be small given that the volume they occupy is small and assuming a bounded density of feature vectors from either class. Hence, good generalization properties will require other methods for learning training sets. In a broad sense, methods that learn a training set exactly cannot be expected to generalize well for training sets of stochastic origin. Such methods, by fitting the data too closely, cannot learn the underlying stochastic law or model. This problem of overfitting is endemic and will be discussed at some length later. There have been a number of other architectures proposed for errorfree learning by LTU nodes, several of which are discussed in Gallant [82, Ch. 10]. As an example, we note the iteratively constructed pyramid architecture in which we sequentially add a single LTU node to a network that has been constructed with this node connected directly to all input variables as well as to the outputs of all previous nodes (an FFNN but not an MLP). Keeping the parameters of all previous nodes fixed, we train only the most recently added node to achieve error-free learning of T . It is easy to prove that with the addition of a new node we can continue to correctly classify all the training samples that were correctly classified previously and add at least one new correctly classified sample. Hence, a sufficiently tall pyramid network will learn any T . We can understand the pyramid architecture as expanding the dimension of the space of input variables by adjoining to the actual inputs x ∈ IRd those functions of these inputs computed by the intermediate level nodes. The last node that we add then acts to classify the n training samples that now lie in a higher-dimensional space, and we know that increasing the dimension of the input space increases the implementational ability of a perceptron or LTU node. A rigorous argument for the representational power of the pyramid architecture can also be founded on the observation that by setting the weights transferring all but the inputs to zero for all but the last node and the weights transferring the inputs to the last node to zero we have constructed an arbitrarily wide single hidden layer network of LTU nodes having a single LTU node output. The various proposals share the characteristic of being trained by greedy algorithms. That is, we add individual nodes recursively and train the most recent node to achieve the best possible current performance without regard to whether this is optimal with respect to the addition of future nodes.
3.4.2 Learning with Errors
The design of LTU-based networks, when their architecture is fixed to avoid overfitting and we are presented with a training set that cannot be learned without error by such an architecture, has proven to be an intrinsically difficult problem. The otherwise thorough study of discrete-valued networks in Siu et al. [222] neglects issues of learning except in the case of a perceptron. The absence of a solution to this problem in the early days of neural
networks discouraged the pursuit of neural networks. There is still no satisfactory resolution in the context of LTU-based networks. Brent [33] designs a neural network by first constructing a binary decision tree. In such a tree the two-way split at a nonterminal node is determined by whether the input vector lies above or below a particular hyperplane and hence can be implemented by an LTU node. The terminal nodes or leaves of the tree correspond to the different classifications of the input vector, and we can accommodate more than two output classes. The critical element in the design of the tree is the selection of the hyperplanes, and Brent approaches this by adopting a criterion to determine a good split into sets S0 , S1 of the subset S of the training set that is available at that nonterminal node. The criterion he adopts is to select Si to maximize
\log \frac{\prod_{k=1}^{K} m_{0,k}!\, m_{1,k}!}{\left(\sum_{k=1}^{K} m_{0,k}\right)!\,\left(\sum_{k=1}^{K} m_{1,k}\right)!},

where m_{i,k} is the number of training points of class k in S_i. To relate the decision tree to an LTU-based network, we observe that all the hyperplanes determining the nonterminal nodes can be placed in a first hidden layer of the network and then a second layer of nodes can identify the output classes or represent the leaves, by forming a disjunction of the appropriate first layer responses; such a disjunction can be formed using only binary-valued weights. This approach to the design of LTU nets does not have the immediate objective of producing a close approximation to a given training set in that the number of incorrect classifications is minimized. Nonetheless, Brent asserts it to be successful in yielding a good fit to the training set. In a different approach, Grossman et al. [85] proceed by guessing the internal representations (node outputs) for LTUs in hidden layers and using the perceptron training algorithm on the output layer to see if these guesses lead to a set of weights that solve the classification problem. Key to this approach is a guided process for choosing new guesses for the internal representation.

Hence the problem of learning becomes one of searching for proper internal representations, rather than one of minimization. Failure of the PLR [Perceptron Learning Rule] to converge to a solution is used as an indication that the current guess of internal representations needs to be modified. [85, p. 74]

Nonetheless, the guidance provided is insufficient to eliminate the need for exponentially large searches if a good fit is desired.
3.5 The Number of Learnable Training Sets

3.5.1 Growth Function and VC Capacity
Paralleling our work in Section 2.3 (with the same title), we inquire into how many of the 2n possible ±1-valued assignments of values t1 , . . . , tn can be made by a network of LTU nodes of fixed architecture to the set S = {x : (∃(xi , ti ) ∈ T ) x = xi } of input variables recorded in T . So long as the output node of the network is an LTU, the network can only dichotomize S. We see from Section 3.4.1 that we must restrict the architecture or complexity of the network to make this question interesting. The answer we provide here will be based on the concept of Vapnik-Chervonenkis (VC) dimension/capacity (e.g., see Vapnik [240, 241], Devroye et al. [61], Kearns and Vazirani [125], Valiant [239], Vidyasagar [244]). We introduce here this concept and the important bound of Theorem 3.5.1, and we will return to it in Chapter 7 when we discuss the generalization ability of networks. Given a fixed architecture N = {η}, a particular network produces a subset Sη+ of S through Sη+ = S ∩ {x : η(x) > 0}. Our subsequent study of networks with real-valued nodes will have access to the theory we are now developing through this understanding that such networks can dichotomize according to whether or not their output is positive. Let LN (S) denote how many subsets of S can be produced by the action of the classifiers in the family N , LN (S) = ||{Sη+ : ∃η ∈ N }||. Although there are generally infinitely many networks in N , there can be only 2||S|| subsets of S, and this upper bounds LN (S). Our notation now departs from that of Chapter 2 to follow that of the VC literature. The generalization of D in Section 2.3 is given by the growth function. Definition 3.5.1 (Growth Function) The growth function mN (n) =
\max_{\{S : ||S|| = n\}} L_N(S).
The growth function m_N(n) will be seen in Section 7.8 to be a key characteristic determining the generalization ability of the combination of a network architecture N and a training algorithm yielding η* ∈ N that is a best fit to the training set of size n. When we are dealing with a perceptron of fixed input dimension d, then the results of Section 2.3 establish that

m_N(n) = D(n, d) = 2\sum_{i=0}^{d} \binom{n−1}{i}, if n > d + 1;   2^n, if n ≤ d + 1.
To proceed, we need to be able to evaluate the growth function mN for more general networks. The approach that we take follows that of VapnikChervonenkis theory in that we relate mN to the size of the largest set that can be learned without error, no matter what the assignments of classes. Definition 3.5.2 (Shattering) We say that a set A can be shattered by a family of networks or functions N if for each subset A+ ⊂ A there is a network η ∈ N that selects precisely this subset in that A+ = Sη+ . Clearly, a training set Tn , built on a set Sn of feature vectors {xi } that can be shattered by N , can be learned no matter what the assignment of classes. Definition 3.5.3 (VC capacity) The Vapnik-Chervonenkis capacity v is the size of the largest set S that can be shattered by N . In the case of a perceptron, v = d + 1, and by Theorem 2.3.2 this holds for any set S of d + 1 points in general position in IRd . It is immediate from the definition of capacity that if ||S|| = n ≤ v, then mN (n) = 2n . If n ≥ v, then
m_N(n) ≥ 2^v,
and it need not be any larger; e.g., N might only induce the 2^v subsets of a fixed set of v points. An upper bound on the VC capacity v implies that there cannot be a very large number m_N of subsets of S generated by N.

Theorem 3.5.1 (Vapnik/Sauer, VC Bound) If v is the VC capacity of N, then

m_N(n) ≤ 2^n, if n ≤ v;   \sum_{k=0}^{v} \binom{n}{k}, if n > v.

Proof. See the appendix for a proof. □

Note the similarity to the expression for D with v = d + 1. Sauer shows that this upper bound is tight in that there are families of functions N in which it is achieved. An example of such a family is the one in which for any set of v or fewer points there is a function in the family that takes the value +1 on the set and −1 on its complement, and the family contains only these functions. Vapnik provides the convenient bound

m_N(n) ≤ \sum_{k=0}^{v} \binom{n}{k} < 1.5\, n^v / v!.
For large enough v less than n, use of Stirling's formula for v! yields the convenient upper bound

m_N(n) ≤ \left(\frac{ne}{v}\right)^{v}.

It is immediate from this upper bound that for finite VC capacity v, the growth function grows no more rapidly than a polynomial n^v of degree v in set size n. As the number of subsets of a set of size n is given by the exponential 2^n, it is evident that a family of sets of VC capacity v can shatter only a vanishingly small fraction of all the exponentially many subsets of a set of size n for n much larger than v. Although it is common in the neural network community to analyze m_N in terms of v, we see from the preceding that for n > v, the upper and lower bounds (e.g., 2^v) to m_N can differ greatly. If, as is generally the case, our interest is in m_N, then we may be misled by analyses based on v.
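A numerical look (a sketch, with made-up values of n and v) at how these bounds compare to one another and to the total number 2^n of dichotomies:

% Compare the Vapnik/Sauer sum, Vapnik's 1.5*n^v/v! bound, (ne/v)^v, and 2^n.
v = 10;  n = 100;
sauer  = sum(arrayfun(@(k) nchoosek(n, k), 0:v));
vapnik = 1.5 * n^v / factorial(v);
crude  = (n * exp(1) / v)^v;
fprintf('Sauer sum = %.3g   1.5 n^v/v! = %.3g   (ne/v)^v = %.3g   2^n = %.3g\n', ...
        sauer, vapnik, crude, 2^n);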
3.5.2 VC Capacity of Networks
Results of Vapnik [240, 241], Devroye et al. [61], and Wenocur and Dudley [248] are helpful in evaluating the capacity v for a given family N. We learn there that the capacity for a single node (with adjustable threshold) in IR^d is d + 1. Furthermore, if N is a d-dimensional real vector space of functions, e.g.,

η(x) = y = \sum_{i=1}^{d} α_i φ_i(x),
then its capacity is d in terms of sets it generates of the form {x : η(x) > 0}. In particular, the family of polynomials in a single variable (d = 1) of degree n have capacity n + 1. Furthermore, if N is a family of nested sets (i.e., linearly ordered by inclusion) then v = 1. If N contains intersections of sets drawn from m such families then v ≤ m. We now follow Cover [47] to estimate v for a more general family N of networks composed of µ nodes σi having individual capacities vi (in the case of LTU nodes, vi is determined by their having a fan-in of vi − 1 inputs) (see also Sontag [224], Maass [154]). Given n possible inputs x1 , . . . , xn and fixed weights to all but node σi , there are at most mσi (n) possible functions or subsets that can be generated by node σi as we vary its input weights and apply the n different net inputs. In fact, we may generate fewer functions or subsets because the part of the net preceding node σi may compress these n distinct inputs into fewer inputs to node σi , they may be in some special position such that node σi cannot generate its maximal number of functions/subsets mσi (n), or the remainder of the net may suppress the variations generated by this node from appearing at the net output (e.g., zero output weights attached to the node). Hence, for each µ-tuple of functions or subsets specified by the µ nodes in the net, there
are at most \prod_{i=1}^{µ} m_{σ_i}(n) possible µ-tuples of functions that comprise the components of the net. Thus the net can form at most

m_N(n) ≤ \prod_{i=1}^{µ} m_{σ_i}(n)     (3.5.1)
distinct functions/subsets from the n possible input sequences to the net. Alternatively, we can apply this argument to the number of subsets that can be generated by just the s_1 nodes in the first hidden layer. If the net has only one hidden layer, then the analysis is as above with µ = s_1 + 1, where we account for the final output node. However, if the net has more than one hidden layer then the subsequent layers can at most implement any of the 2^{2^{s_1}} possible Boolean functions in the s_1 variables corresponding to the outputs of the first hidden layer. In this case we have

m_N(n) ≤ 2^{2^{s_1}} \prod_{i=1}^{s_1} m_{σ_i}(n).     (3.5.2)
From Vapnik we have learned that

n > v_i ⇒ m_{σ_i}(n) < 1.5\, \frac{n^{v_i}}{v_i!}.
Combining this result with Eq. 3.5.1 (or with the second factor on the right-hand side of Eq. 3.5.2 for µ replaced by s_1) yields

n > \max_i v_i ⇒ m_N(n) < \prod_{i=1}^{µ} 1.5\, \frac{n^{v_i}}{v_i!}.

Letting w = \sum_{i=1}^{µ} v_i (the number of weights or parameters in an LTU network) and \bar{v} = w/µ represent the average node capacity, and using the combinatorial fact (derivable through the fact that a multinomial discrete distribution on µ equally probable categories has most probable outcome when all µ categories occur equally often in the w trials) that for integer \bar{v}

\prod_{i=1}^{µ} v_i! ≥ [\bar{v}!]^{µ},
we can assert that m_N(n) < [1.5/\bar{v}!]^{µ} n^{w}. An appeal to Stirling's formula for the factorial enables us to claim that Eq. 3.5.1 yields

m_N(n) < \left(\frac{1.5}{\sqrt{2π\bar{v}}}\right)^{µ} \left(\frac{en}{\bar{v}}\right)^{w} < \left(\frac{en}{\bar{v}}\right)^{w}.     (3.5.3)

This latter result was obtained, in a slightly different manner, in Baum and Haussler [24] and originally by Cover in [47].
Similarly, letting w_1 = \sum_{i=1}^{s_1} v_i and \bar{v}_1 = w_1/s_1, we derive from Eq. 3.5.2

m_N(n) < 2^{2^{s_1}} \left(\frac{1.5}{\sqrt{2π\bar{v}_1}}\right)^{s_1} \left(\frac{en}{\bar{v}_1}\right)^{w_1} < 2^{2^{s_1}} \left(\frac{en}{\bar{v}_1}\right)^{w_1}.     (3.5.4)
The minimum of the estimates of Eqs. 3.5.3 and 3.5.4 provides the desired estimate for the growth function. We provide an upper bound V to the capacity v.

Lemma 3.5.1 (VC Upper Bound for LTU Nets) Let an LTU-based neural network have s_1 nodes in a first hidden layer, s nodes in total, and node σ_i have individual VC capacity v_i. Define w, w_1 as above. An upper bound V to the VC capacity v of the network is given by the minimum of the following two expressions:

v ≤ w(\log_2 eµ + 2 \log_2 \log_2 eµ);

v ≤ w_1\left(\frac{1}{w_1} 2^{s_1} + \log(e s_1) + 2 \log\left(\frac{1}{w_1} 2^{s_1} + \log(e s_1)\right)\right).
Proof. The definition of capacity implies

m_N(v + 1) < 2^{v+1} ⇒ m_N(V) ≤ 2^{V}.     (3.5.5)

Thus, it suffices from Eq. 3.5.3 that V satisfies

\left(\frac{eV}{\bar{v}}\right)^{w} = 2^{V},     (3.5.6)

or from Eq. 3.5.4 that it satisfies

\left(\frac{eV}{\bar{v}_1}\right)^{w_1} = 2^{V − 2^{s_1}}.     (3.5.7)

Letting γ = V/w, γ_1 = V/w_1, we see that Eqs. 3.5.6 and 3.5.7 transform into

eµγ = 2^{γ},     (3.5.8)

e s_1 γ_1 = 2^{γ_1 − 2^{s_1}/w_1}.     (3.5.9)

Upper bounds to the solutions of these transcendental equations, for reasonable parameter values, are given by

γ ≤ \log_2 eµ + 2 \log_2 \log_2 eµ,     (3.5.10)

γ_1 ≤ \frac{1}{w_1} 2^{s_1} + \log(e s_1) + 2 \log\left(\frac{1}{w_1} 2^{s_1} + \log(e s_1)\right).     (3.5.11)

We conclude with an upper bound to the dimension or capacity v of a feedforward net that is the minimum of the following two expressions:

v ≤ w(\log_2 eµ + 2 \log_2 \log_2 eµ);     (3.5.12)

v ≤ w_1\left(\frac{1}{w_1} 2^{s_1} + \log(e s_1) + 2 \log\left(\frac{1}{w_1} 2^{s_1} + \log(e s_1)\right)\right). □     (3.5.13)

This result compares favorably with the bound of 2w \log_2 eµ provided by Baum and Haussler. Observe that if µ = 1 then our bound exceeds the correct capacity of w by a small factor of about 2.5. Noting that µ ≤ w, we restate the results of the preceding calculations (which could have been simplified by simply upper bounding m_N(n) ≤ n^v).
Theorem 3.5.2 (LTU Net VC Capacity Bound) The VC capacity v_w of an LTU-based network having w parameters is such that

\limsup_{w \to \infty} \frac{v_w}{w \log_2 w} ≤ 1.

Maass [154] shows that the upper bound of Theorem 3.5.2 can be nearly achieved.

Theorem 3.5.3 (Maass) There exist LTU-based networks having s_1 input nodes and at most 17 s_1^2 weights such that v ≥ s_1^2 \log_2 s_1.

Hence, the capacity v for a net with w weights can be Ω(w log w). The notation Ω(f(n)) represents an asymptotic lower bound for a function, say, v(n), in that there is an a > 0 and b such that for every n it follows that v(n) ≥ af(n) + b. Thus the upper bound of Theorem 3.5.2 is tight to within a constant factor. Based on the possible superlinear growth of VC capacity with the number of network parameters, Maass makes the interesting observation: a network of neuron-like elements is more than just the sum of its elements. Of course, the significance of this observation depends on the significance of the VC capacity. It is still the case, as shown in Eq. 3.5.3, that the more fundamental concept of the growth function m_N(n) grows in n only polynomially, with an exponent w that is the sum of the VC capacities of the individual nodes making up the network.
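As an illustration only (a sketch for a made-up single-hidden-layer LTU net, reading the unsubscripted log of Lemma 3.5.1 as log base 2, and taking each node's individual capacity to be its fan-in plus one), the two expressions of Lemma 3.5.1 can be evaluated and their minimum taken:

% Evaluate the two capacity bounds of Lemma 3.5.1 for an example LTU network.
d = 10;  s1 = 20;  mu = s1 + 1;           % inputs, hidden nodes, total computational nodes
w  = s1 * (d + 1) + (s1 + 1);             % total parameters (weights plus thresholds)
w1 = s1 * (d + 1);                        % first-hidden-layer parameters
b1 = w  * (log2(exp(1) * mu) + 2 * log2(log2(exp(1) * mu)));
b2 = w1 * (2^s1 / w1 + log2(exp(1) * s1) + 2 * log2(2^s1 / w1 + log2(exp(1) * s1)));
fprintf('bound 1 = %.0f   bound 2 = %.3g   min = %.0f\n', b1, b2, min(b1, b2))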
3.6 Partitioning the Space of Inputs

Networks of LTU nodes can only partition IR^d into connected components that are polyhedra. Many other sets, of course, can be arbitrarily well-approximated by polyhedra. As we shall see, any partition into finitely many polyhedra can be implemented by a network with two hidden layers. Necessary and sufficient conditions are yet to be determined for partitions to be created by a single hidden layer network.
3.6.1 Enumeration
We take IRd to be the space of the d-dimensional input x to a network composed of LTU or unit-step nodes. The action of the first layer of LTU or unit-step nodes sets the limits to the resolution to which the network can partition the space. Subsequent layers of nodes cannot partition regions that were not partitioned by the first layer—all that later layers can do is form logical combinations of the regions created by the first layer. We enumerate by M (s1 , d) the number of connected regions into which s1 nodes in a first layer can divide IRd . This enumeration parallels the developments given in Section 2.3 and results in the same recursion as Eq. 2.3.2, M (s1 , d) = M (s1 − 1, d) + M (s1 − 1, d − 1), the same first boundary condition of M (1, d) = 2, but the modified second boundary condition M (s1 , 1) = s1 + 1. The solution M (s1 , d) =
\sum_{k=0}^{d} \binom{s_1}{k},
enables us to estimate the width s1 of the first layer of computational elements needed to implement a partition into a given number of regions. The Vapnik upper bound to sums of binomial coefficients provided in Appendix 1 of Chapter 2 yields

$$M(s_1, d) < 1.5\,\frac{s_1^d}{d!},$$

and for our purposes this is not very different from the lower bound of $s_1^d/d!$. Using this estimate and Stirling's formula approximation to d! shows that the generation of M regions requires

$$s_1 \approx \frac{d}{e}\,[2M\sqrt{d}\,]^{1/d}$$

nodes in the first layer. If we assume, as is reasonable, that the desired number of regions M might grow with input dimension as α^d (e.g., we quantize each dimension into α levels, and this yields α^d quantization cells), then

$$s_1 \approx \frac{d\alpha}{e}\,[2\sqrt{d}\,]^{1/d} < \alpha d.$$

Hence, the required number of nodes grows linearly with the dimension of the input space and the required number of connections or parameters grows as O(d^2) (see also Cosnard et al. [42], Kowalczyk [131]). Fortunately, the growth of complexity is no worse than quadratic in the dimension of the input.
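As a quick numerical check of the preceding estimates, the following minimal MATLAB sketch (our own illustration; the dimensions and variable names are assumptions, not taken from the appendices) tabulates M(s1, d), the bound 1.5 s1^d/d!, and the implied width estimate.

% Sketch: regions cut out of IR^d by s1 hyperplanes, and the width estimate.
d  = 4;                                    % input dimension (illustrative)
s1 = 20;                                   % number of first-layer LTU nodes
M  = 0;
for k = 0:d
  M = M + nchoosek(s1, k);                 % M(s1,d) = sum_{k=0}^{d} C(s1,k)
end
Mbound = 1.5 * s1^d / factorial(d);        % upper bound 1.5 s1^d / d!
s1est  = (d/exp(1)) * (2*M*sqrt(d))^(1/d); % width needed to generate M regions
disp([M Mbound s1est])                     % s1est should come out close to s1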
FIGURE 3.4. Polyhedral region recognition.
3.6.2 Construction of Partitions
Lippmann [148] shows that an LTU net containing a single hidden layer can compute a classification or indicator function η(x) that is +1 on an arbitrary convex polyhedral region (such a region is a convex set as specified in Definition 2.2.3 that has a boundary composed of hyperplane faces) and 0 on its complement. This is achieved by having one hidden layer node for each of the s1 hyperplane boundary faces of the polyhedral region; the hyperplane boundary w_i, τ_i is represented by the node function a_{1:i}(x) = U(w_i · x − τ_i), where w_i is chosen so as to have points in the convex figure on the positive side of the hyperplane. We then use a single output node to AND all the hidden layer responses:

$$\eta_1(x) = a_{2:1}(x) = U\Big(\sum_{i=1}^{s_1} a_{1:i}(x) - \big(s_1 - \tfrac{1}{2}\big)\Big).$$
A classification or indicator function that takes on the value +1 on a union of convex polyhedral regions can be implemented by a net having two hidden layers. First define each of the s2 component convex regions with individual networks as above and then have an output node that computes the OR of the individual network responses:

$$\eta_2(x) = a_{3:1}(x) = U\Big(\sum_{i=1}^{s_2} a_{2:i}(x) - \tfrac{1}{2}\Big).$$
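To make the construction concrete, here is a minimal MATLAB sketch (ours) of the AND/OR network just described for two hand-chosen convex polygons in IR^2; the polygons, their hyperplane parameters, and the convention U(0) = 1 are our illustrative assumptions.

% Sketch: two-hidden-layer LTU net recognizing a union of convex polygons.
U = @(z) double(z >= 0);                       % unit-step node, U(0) = 1
% Polygon 1: unit square 0<=x1<=1, 0<=x2<=1 (4 faces), weights point inward.
W1 = [ 1 0; -1 0; 0 1; 0 -1];  tau1 = [0; -1; 0; -1];
% Polygon 2: triangle x1>=2, x2>=0, x1+x2<=4 (3 faces).
W2 = [ 1 0; 0 1; -1 -1];       tau2 = [2; 0; -4];
andnode = @(a) U(sum(a) - (numel(a) - 0.5));   % AND of the face LTUs
ornode  = @(a) U(sum(a) - 0.5);                % OR of the polygon recognizers
eta = @(x) ornode([andnode(U(W1*x - tau1)), andnode(U(W2*x - tau2))]);
disp([eta([0.5;0.5]) eta([3;0.5]) eta([1.5;0.5])])   % expect 1 1 0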
Because most sets can be approximated by a union of convex polyhedral sets, the preceding construction using two hidden layers can approximate to most decision regions. However, the number of nodes might be prohibitively large (at least one for each face) if one requires close approximation. In any
event, we can estimate the complexity of the implementation and thereby respond to the second of our basic questions. While the preceding results on the ability of LTU networks to partition their input space indicate the power of such networks in a constructive manner (answering to Question 3), they do not tell us what the full reach of such networks might be. Are there partitions that cannot be achieved by such networks? Clearly, the answer to this question is affirmative. Simply consider a nonpolyhedral figure such as a sphere in IR^d. LTU networks are constrained to generate partitions having hyperplane boundaries. However, can these networks generate all partitions of input space having hyperplane boundaries? Any such region is a union of polyhedra. Every polyhedron can be partitioned as a union of convex polyhedra. Hence, any such region can be represented as a union of convex polyhedra and thus can be implemented by an LTU network as described above.

Theorem 3.6.1 (Finite Unions of Polyhedra) If IR^d is partitioned into a finite union of polyhedra, then this partitioning can be implemented by a network having sufficiently wide LTU Layers 1 and 2 followed by a single LTU node in Layer 3.

An alternative argument is that given a partition in which all regions have hyperplane boundaries, we create a first hidden layer in which each LTU node corresponds to one of the boundary hyperplanes. The partition then amounts to a Boolean function of the binary responses of this first hidden layer. Because Theorem 3.3.1 assures us that all such Boolean functions can be implemented by a network of LTU or unit-step nodes, we create this network to process the outputs of the first hidden layer. Although s1 hyperplanes in the first layer create no more than M = 1.5 s1^d/d! regions, they generate a binary vector of length s1 for each point x in the input space. There are 2^{s1} such binary vectors, and this number will generally greatly exceed the number M of regions. Hence, we do not need to implement a full Boolean function in the remaining layers. For example, if d = 2 then the number of regions is exactly (s1^2 + s1 + 2)/2 and is much smaller than 2^{s1}.
3.6.3 Limitations of a Single Hidden Layer Architecture
If one restricts attention to single hidden layer LTU networks composed of a first layer of LTUs followed by a second layer containing a single LTU (as in our “sandwich” construction) then Shonkwiler [217] provides the following characterization. Theorem 3.6.2 (Shonkwiler) It suffices for a region A to be of the following form for it to be classifiable by a single hidden layer network of LTU
nodes:

$$(\exists\, C_1 \supset C_2 \supset \cdots \supset C_{2k},\ \hbox{convex polyhedra})\qquad A = \bigcup_{j=1}^{k}\,(C_{2j-1} - C_{2j}).$$
For example, if the convex polyhedron C_j is defined by f_j hyperplane faces represented by nodes of the form a_{1:i}(x) = U(w_i · x − τ_i), with x ∈ C_j ⇒ a_{1:i}(x) = 1, then

$$\eta_1(x) = U\Big(\sum_{i=1}^{f_1} a_{1:i}(x) - \frac{1}{f_2+1}\sum_{i=f_1+1}^{f_1+f_2} a_{1:i}(x) - f_1 + \frac{f_2 - \tfrac{1}{2}}{f_2+1}\Big)$$
recognizes C1 − C2 when they satisfy the above-mentioned conditions. Shonkwiler was unable to establish necessary conditions for a partition to be implementable by such a restricted architecture.

Gibson [83] identifies a characteristic of polyhedra that require two hidden layers for their recognition in terms of what he calls an inconsistent hyperplane. This is illustrated in Figure 3.5, in which the polyhedron has three connected components A1, A2, A3, with the quadrilateral A1 and the triangle A2 lying on opposite sides of a common hyperplane H1 with parameters w1, τ1. If, contrary to our expectation, there is a single hidden layer representation, then it is of the form

$$\eta_1(x) = a_{2:1}(x) = U\Big(\sum_{i=1}^{s_1} w_{2:1,i}\,a_{1:i}(x) - \tau_{2:1}\Big), \qquad a_{1:i}(x) = U(w_i\cdot x - \tau_{1:i}).$$

Choose points x_1^+, x_1^- such that x_1^+ ∈ A1, x_1^- ∉ A1 and these two points are separated from each other only by the inconsistent hyperplane H1. Similarly choose x_2^+ ∈ A2, x_2^- ∉ A2, also separated from each other only by H1. Because η1 is hypothesized to represent the set A,

$$\eta_1(x_1^+) - \eta_1(x_1^-) = 1 - 0 > 0 \;\Rightarrow\; w_{2:1,1}\,[a_{1:1}(x_1^+) - a_{1:1}(x_1^-)] = w_{2:1,1}\,[\;]_1 > 0.$$

Similarly,

$$\eta_1(x_2^+) - \eta_1(x_2^-) = 1 - 0 > 0 \;\Rightarrow\; w_{2:1,1}\,[a_{1:1}(x_2^+) - a_{1:1}(x_2^-)] = w_{2:1,1}\,[\;]_2 > 0.$$

However, because x_1^+ and x_2^+ lie on opposite sides of H1, as do x_1^- and x_2^-, it follows that [ ]_2 = −[ ]_1, and we have reached the desired contradiction, with both terms required to be positive no matter what the sign of w_{2:1,1}.
Theorem 3.6.3 (Gibson, Theorem 4) Let S ⊂ IR2 be a finite union of bounded polyhedral sets for which no three essential hyperplanes (lines) intersect. Then S is realizable if and only if no essential hyperplane (a hyperplane having a nontrivial one-dimensional intersection with the boundary of S) is inconsistent.
FIGURE 3.5. Gibson’s inconsistent hyperplane.
It is not known whether the existence of inconsistent hyperplanes is the only reason for the failure of single hidden layer representations in higher-dimensional spaces.
3.7 Approximating to Functions

We will treat Question 1(d) in more detail in the next chapter, but it is possible to approximate a given function f(x) by unit-step–based networks. We first approximate to the range (set of values) of f by a finite set f1, ..., f_m. For example, if f takes values in [0, 1], then we might choose f_i = i/(m+1). We next identify the function arguments for which the best approximation to the function value is f_i through the inverse-image set

$$S_i = \Big\{x : \frac{f_{i-1}+f_i}{2} < f(x) \le \frac{f_i+f_{i+1}}{2}\Big\}, \qquad i = 1 : m,$$

where f_0 = −∞, f_{m+1} = ∞. If the sets {S_i} are finite unions of polyhedra, then Theorem 3.4.1 assures us that we can construct {0, 1}-valued networks N_i that represent them exactly. In this unlikely case we would have the approximation

$$\tilde{f}(x) = \sum_{i=1}^{m} f_i\,N_i(x).$$
In effect, we combine the m individual networks using a final linear output node with the weights being given by {fi }. Because it is unlikely that the sets {Si } are unions of polyhedra, we must use the same technique to approximate them by such unions and proceed as above. Hence, there are two sources of approximation error: quantizing the range of the function into m levels and approximating the inverse-image sets {Si } by finite unions of polyhedra.
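For a scalar argument the two-stage approximation is easy to visualize. The sketch below (ours, with an arbitrary example target) quantizes the range of an f with values in [0, 1] into m levels and sums level-weighted indicators of the inverse-image sets S_i; each indicator stands in for a {0,1}-valued network N_i of the kind discussed above.

% Sketch: range quantization f~(x) = sum_i f_i N_i(x) for a scalar argument.
f  = @(x) (1 + sin(2*pi*x))/2;            % example target with values in [0,1]
m  = 8;
fi = (1:m)/(m+1);                         % quantized range values f_i = i/(m+1)
x  = linspace(0, 1, 1000);
fx = f(x);
fext   = [-inf fi inf];                   % f_0 = -inf, f_{m+1} = +inf
ftilde = zeros(size(x));
for i = 1:m
  lo = (fext(i)   + fext(i+1))/2;         % (f_{i-1}+f_i)/2
  hi = (fext(i+1) + fext(i+2))/2;         % (f_i+f_{i+1})/2
  Ni = (fx > lo) & (fx <= hi);            % indicator of the inverse-image set S_i
  ftilde = ftilde + fi(i)*Ni;
end
disp(max(abs(fx - ftilde)))               % range-quantization error, about 1/(m+1)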
The issue of the generalization ability of LTU networks is deferred to Chapter 7, where we will treat networks with LTU and with real-valued nodes.
3.8 Appendix: MATLAB Program for the Sandwich Construction

%Implements Sandwich Construction to learn a training set T
%each sample in matrix T is a column
%first d rows are input vector components x and
%d+1st row is class y in -1,1
t0=clock;
[r n]=size(T);
d=r-1;
%identify smallest of S+,S- as F
Iplus=find(T(r,:)==1);
Iminus=find(T(r,:)==-1);
[a plus]=size(Iplus);
if (plus <=(n/2))
  F=T(1:d,Iplus); indicator='plus'; ind=1; sizef=plus;
else
  F=T(1:d,Iminus); indicator='minus'; ind=-1; sizef=n-plus;
end
%randomly augment F so that sizef is divisible by d
cycle=floor(sizef/d);
if sizef>d*cycle, cycle=cycle+1; F=[F randn(d,d*cycle-sizef)]; end
%solve for hyperplanes wt, with tau=1, containing d vectors
wt=zeros(cycle,d);
for k=1:cycle,
  ff=F(:,(k-1)*d+1:k*d);
  wt(k,:)=ones(1,d)/ff;
end
%identify max epsilon for sandwich width
z=abs(wt*F-ones(cycle,d*cycle));
epsilon=sort(z);
epsilon=sort(epsilon(2,:));
epsilon=epsilon(1)/2;
%provide hyperplanes (wt,tau-epsilon),(-wt,-tau-epsilon)
W=[wt ones(cycle,1)*(1.0-epsilon);-1*wt ...
   ones(cycle,1)*(-1.0-epsilon)];
disp('Positive response for class:'),disp(indicator)
%check result
check=ind*sum(W*T).*T(d+1,:);
t1=etime(clock,t0)
3.9 Appendix: Proof of Theorem 3.5.1

Theorem 3.5.1 (Vapnik/Sauer, VC Bound) If v is the VC capacity of N, then

$$m_N(n) \le \begin{cases} 2^n, & \text{if } n \le v; \\[2pt] \displaystyle\sum_{k=0}^{v} \binom{n}{k}, & \text{if } n > v. \end{cases}$$
This upper bound can be achieved.

Proof. Somewhat different proofs are provided by Vapnik [240, Section 6.A2], Sauer [209], Pollard [188, pp. 19, 20], and Devroye et al. in [61]. We follow closely Pollard [188]. Assume a family of binary-valued functions N (if the functions are real-valued, as they will be in the remaining chapters, then take the sign of their value to create a binary-valued function). Hence, these functions dichotomize sets of points. Consider a set S = {x1, ..., x_n} of any n > v points. N applied to S creates a family A of subsets of S, with A ∈ A if there is an η ∈ N such that A = {x : x ∈ S, η(x) = 1}. Under the hypothesis that v is the VC capacity, no subset of S of size greater than v can be shattered. Thus, for any F ⊂ S, ||F|| = v + 1, the family of sets A ∩ F = {G : (∃A ∈ A) G = A ∩ F} omits at least one subset H of F. Enumerate the $\binom{n}{v+1}$ subsets of S of size v + 1 as F_1, F_2, ... and let H_i^{(0)} denote a (nonunique) corresponding omitted subset of F_i. Introduce the family of subsets of S
$$\mathcal{C}_0 = \{C : (\forall i)\; C \cap F_i \ne H_i^{(0)}\}.$$

Clearly C0 ⊃ A and we need only upper bound the cardinality of C0 to upper bound that of A and thence upper bound m_N(n). If, by chance, (∀i) H_i^{(0)} = F_i, then no set C in C0 can have at least v + 1 elements; if it did, there would be an F_i = C ∩ F_i = H_i^{(0)}, contradicting the property that C ∩ F_i ≠ H_i^{(0)}. Hence, in this case all the sets in C0 have no more than v elements, and there can be at most $\sum_{k=0}^{v}\binom{n}{k}$ such subsets of S, as claimed in the theorem.
We construct a collection of sets {C_i, i = 1 : n} of nondecreasing cardinalities, ||C0|| ≤ ||C1|| ≤ · · · ≤ ||C_n|| (they will not be nested), such that the final collection C_n will be seen to be the special case with (∀i) F_i = H_i^{(n)}. Because it will have cardinality upper bounded by $\sum_{k=0}^{v}\binom{n}{k}$, the upper bound of the theorem will follow.

Enumerate the elements of S = {x1, ..., x_n}. Consider the sets {H_i^{(0)}} appearing in C0 and augment them to {H_i^{(1)}}, where H_i^{(1)} = H_i^{(0)} ∪ {x1} provided that x1 ∈ F_i, and H_i^{(1)} = H_i^{(0)} otherwise. We establish that C1 has at least as many elements as C0 (not necessarily the same elements) by comparing the differences C1 − C0 and C0 − C1. To each C ∈ C0 − C1 we shall show that C − {x1} ∈ C1 − C0. Hence, there will be at least as many elements in C1 − C0 as in C0 − C1.

To verify this, select any C ∈ C0 − C1. From C ∈ C0, the set C must satisfy (∀i) C ∩ F_i ≠ H_i^{(0)}, while from C ∉ C1 there must exist j such that C ∩ F_j = H_j^{(1)}. It follows that H_j^{(1)} ≠ H_j^{(0)}, and x1 is therefore an element of C, F_j, and H_j^{(1)} but not of H_j^{(0)}. We now use this observation to verify that C − {x1} ∈ C1 − C0. Because

$$(C - \{x_1\}) \cap F_j = H_j^{(1)} - \{x_1\} = H_j^{(0)},$$

it follows from the definition of C0 that C − {x1} ∉ C0. It remains to verify that C − {x1} ∈ C1. If F_i contains x1 then, by construction, so must H_i^{(1)}. However, C − {x1} cannot contain x1, and the condition for membership in C1, that (C − {x1}) ∩ F_i ≠ H_i^{(1)}, is satisfied. On the other hand, if F_i does not contain x1, then neither does H_i^{(1)}, making it equal to H_i^{(0)}. Thus

$$(C - \{x_1\}) \cap F_i = C \cap F_i \ne H_i^{(0)} = H_i^{(1)},$$

where the inequality follows from the hypothesis that C ∈ C0. Hence, we conclude that in either case, C − {x1} ∈ C1. Combining results yields C − {x1} ∈ C1 − C0. We have established a one-to-one mapping of the elements of C0 − C1 into C1 − C0, and thus that C0 has no more elements than C1.

We construct the next family C2 by creating H_i^{(2)} by adjoining x2 to H_i^{(1)} if x2 ∈ F_i. Repeating the preceding analysis, we conclude that ||C1|| ≤ ||C2||. Proceeding through the n elements of S yields the conclusion that ||A|| ≤ ||C0|| ≤ ||C_n||.
To wrap up this proof, we need only observe that to each H_i^{(n)} event appearing in the definition of C_n we have adjoined each element of S that was also in the corresponding F_i event. Hence, in C_n we have that F_i = H_i^{(n)}, and we are in the special case treated at the outset. To show that the upper bound can be achieved, consider A = {A : ||A|| ≤ v}.
This family of sets contains all finite sets of at most v points and only those sets. It is clear that v_A = v. Because there are precisely $\binom{n}{i}$ subsets of size i of a set of size n, the upper bound is achieved precisely. □
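The polynomial character of the Vapnik/Sauer bound is easy to see numerically; the following minimal sketch (ours) evaluates the bound for a fixed capacity v and compares it with 2^n.

% Sketch: the Vapnik/Sauer growth-function bound as a function of n for fixed v.
v = 5;
n = 1:30;
bound = zeros(size(n));
for j = 1:numel(n)
  if n(j) <= v
    bound(j) = 2^n(j);
  else
    for k = 0:v
      bound(j) = bound(j) + nchoosek(n(j), k);   % sum_{k=0}^{v} C(n,k)
    end
  end
end
disp([n(10) bound(10) 2^n(10)])    % polynomial growth in n versus 2^n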
4 Feedforward Networks II: Real-Valued Nodes
4.1 Objectives and Setting

4.1.1 Objectives
We continue the discussion of Chapter 3 on the representational powers of feedforward neural networks by shifting our focus from networks of LTU nodes to those composed of real-valued nodes. The internal values and outputs of these networks are now real-valued and thus capable of representing or implementing the usual range of functions encountered in forecasting, estimation, signal processing, and control problems. In common practice we are usually given a training set T consisting of n input-output pairs {(x_i, f(x_i))} that only partially describes a function to be learned. We do not have complete knowledge of the function itself. We will treat this important case in Chapter 5. In this chapter we look at the fundamental question of approximating as closely as desired to a fully specified function. An understanding of this issue is central to assessing the abilities of neural networks. In essence, we shall see that single hidden layer networks using almost any nonpolynomial node function can approximate arbitrarily closely, in a variety of senses, to almost any given function f by using sufficiently many nodes with properly chosen weight vectors and thresholds. We first explore the representational abilities (our Question 1) of a single output, single hidden layer feedforward network and then consider the complexity required for a given degree of approximation (our Question 2). The actual selection of an approximation (Question 3) is addressed somewhat cursorily in closing and in detail for training sets in Chapter 5.
Perforce, this chapter makes use of analytical methods (e.g., see Ash [13], Royden [203], Rudin [204], Simmons [220]) that may be unfamiliar to many readers of this monograph. Although we provide rigorous and general results, the statements of such results will require explanation. In some instances we can provide correspondences to more familiar results by using finite-dimensional spaces as analogies to infinite-dimensional function spaces. We offer a compromise between the needs of those to whom real analysis is unfamiliar and the expository preferences of those who are comfortable with these ideas.
4.1.2 Setup
There are three elements to be considered:

• the family of functions N1,σ that are the single output, single hidden layer neural networks;

• the family L of functions to which we wish to be able to approximate closely; and

• the quality of approximation of one function by another as measured by a norm-based metric.

We discuss these elements in the remainder of this section and in Section 4.3. Typically the set X = {x} of inputs to a neural network is a compact (closed and bounded) subset of the d-dimensional reals (d-tuples of real numbers, a vector of dimension d) and more specifically often the d-dimensional cube [a, b]^d with side the interval [a, b]. We will only consider networks that have a single output y and that compute a real-valued function. Vector-valued functions taking values in IR^m can be implemented by m separate networks, one for each of the m components, albeit with a possibly less efficient use of network resources and potential loss of generalization ability due to the increased number of parameters or weights (e.g., see Caruana [36, 37]).
4.1.3 Single Hidden Layer Functions
The mathematical structure of a representation of a function through a single hidden layer network having s1 nodes is characterized as follows. Definition 4.1.1 (1HL Class) The class N1,σ = {η1 } of real-valued functions exactly constructable by a single hidden layer (1HL) feedforward net with a single linear output node and hidden layer node nonlinearity σ is the linear span of functions of x of the form σ(w · x − τ ).
Restated, N1,σ is the set of functions η1 of x specified by:

$$\eta_1(x, w) = a_{2:1}(x) = \sum_{i=1}^{s_1} w_{2:1,i}\,\sigma(c_{1:i}) - \tau_{2:1}, \qquad c_{1:i}(x) = \sum_{k=1}^{d} w_{1:i,k}\,x_k - \tau_{1:i}.$$
A given function η1 (·, w) ∈ N1,σ is described by p parameters assembled into a column vector w (e.g., see the MATLAB program for ntout1 in Section 5.13). In the fully connected case of a single hidden layer, p = (d + 2)s1 + 1, and in practice p is at least several hundred. Examples of functions in N1,σ , using the logistic node, for d = 1 are shown in Figure 4.1 and for d = 2 in Figure 4.2. In these figures, nodes are added successively to the previous sum of nodes to improve approximation ability by allowing construction of more complicated functions.
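A minimal MATLAB sketch of the evaluation of η1(x, w) for a batch of inputs follows; it is our own illustration (it is not the book's ntout1 routine), with randomly chosen weights purely to show the computation.

% Sketch: evaluate a single hidden layer (1HL) network with logistic nodes.
d = 2; s1 = 3; n = 5;                       % sizes chosen only for illustration
X   = randn(d, n);                          % n input columns
W1  = randn(s1, d);  tau1 = randn(s1, 1);   % first-layer weights and thresholds
w2  = randn(1, s1);  tau2 = randn;          % linear output node
sig = @(z) 1 ./ (1 + exp(-z));              % logistic node function
C1  = W1*X - tau1*ones(1, n);               % c_{1:i}(x) for every input column
A1  = sig(C1);                              % hidden layer responses
eta1 = w2*A1 - tau2;                        % eta_1(x,w), one output per column
disp(eta1)                                  % p = (d+2)*s1 + 1 parameters in all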
FIGURE 4.1. Examples of functions in N1,σ for d = 1 and s1 = 1, 2, 3, 4.
FIGURE 4.2. Examples of functions in N1,σ for d = 2.
Generally, we have little interest in the members of N1,σ for themselves. They are highly unlikely to provide exact models for phenomena of interest. For example, if the common node function σ is logistic or tanh then the polynomials and trigonometric functions cannot be found in N1,σ. If σ is itself a polynomial of degree k then all the functions in N1,σ are also only polynomials of degree k. We are interested in the ability of network functions in N1,σ to approximate to families of functions of interest. Familiar examples of such families of interesting functions are the following:

continuous C(X): the family of continuous functions on compact (e.g., closed and bounded) X;

integrable L^p(X): the family of pth power Lebesgue integrable functions,

$$\int_{\mathcal{X}} |f(x)|^p\,dx < \infty;$$

rth differentiable D^r(X): the family of functions having r continuous derivatives, with D^0 = C;

piecewise-constant P-constant: the family of functions on X that are piecewise constant; and
piecewise-continuous P-continuous: the family of functions on X that are piecewise continuous. There now exist theorems, a few of which we will present, on the ability of the members of N1,σ to approximate to each of the preceding families of functions. Before presenting these results, we will address some of the basic mathematical properties of the representations of functions in N1,σ and the meaning we will give to approximation.
4.1.4 Multiple Hidden Layers–Multilayer Perceptrons
Multiple hidden layers enable us to construct compositions of functions in which the outputs of earlier layers become the inputs to later layers. An example of a representation by a two hidden layer (2HL) network in the family N2,σ with a single linear output node is

$$\eta_2(x, w) = a_{3:1}(x) = -\tau_{3:1} + \sum_{i=1}^{s_2} w_{3:1,i}\,\sigma_{2:i}(\eta_1(x, w_i)) = -\tau_{3:1} + \sum_{i=1}^{s_2} w_{3:1,i}\,\sigma_{2:i}\!\left(-\tau_{2:i} + \sum_{j=1}^{s_1} w_{2:i,j}\,\sigma_{1:j}\Big(\sum_{k=1}^{d} w_{1:j,k}\,x_k - \tau_{1:j}\Big)\right),$$
where η1 (x, wi ) is the response of a single hidden layer network with wi a temporary shorthand for the parameters defining the ith such node. As will be seen in this chapter, single hidden layer networks suffice to achieve arbitrarily close approximations for many combinations of family of functions and measures of approximation. However, this is not true for the sup-norm metric and functions that are discontinuous, perhaps being piecewise continuous. As observed in Section 3.6.3, a function that is piecewise constant and has hyperplane boundaries for its level sets (regions of constant value) can be constructed exactly using a two hidden layer network, but not always using a single hidden layer net (see Gibson [83]). The need to construct such functions arises in certain inverse problems of control theory, and the resulting controller design may require sup-norm approximation by a two hidden layer network (e.g., see Sontag [224]). In image and speech recognition problems (see Section 2.3), multiple hidden layer architectures are motivated by attempts to incorporate spatially or temporally localized features that are expected to be helpful (e.g., see Le Cun and Bengio [137]). For example, in image processing applications, it is common to have a first layer of nodes that are connected only to small, local regions of the image. These node responses are then aggregated in succeeding layers leading to a final layer of multiple outputs, with each output corresponding to a possible image pattern class.
4.2 Properties of the Representation by Neural Networks

Five of the mathematical properties of neural networks represented as in N1,σ, or indeed those with multiple hidden layers (Nk,σ denotes the family of k hidden layer networks with node function σ in the hidden layers and a linear output node), are indicated below. These properties help us understand what happens when we turn to training algorithms in Chapter 5.
4.2.1 Uniqueness of Network Representations of Functions
The specification of N1,σ or Nk,σ is such that different neural networks do not necessarily implement different functions from their inputs in IR^d to a scalar output. Some understanding of this is important for its implications for the existence of multiple approximations of equal quality and its eventual implications for the existence of multiple minima when one comes to training to reduce approximation error. The following are examples of conditions under which distinct networks implement the same function:

1. Permuting the nodes in a given hidden layer (e.g., interchanging the weights and thresholds corresponding to the first and second nodes) will not change the function. In, say, a single hidden layer network, the network output is Σ_i w_{2:1,i} σ(c_{1:i}), and the value of the sum is invariant with respect to reordering the summands.

2. If the node function σ is an odd function, σ(−z) = −σ(z), such as the commonly used tanh(z), then negating the input weights and thresholds to a node and negating the output weights from that same node will introduce sign changes that cancel and leave the network output invariant; e.g., w_{i:j,k} σ(c_{i−1:k}) = −w_{i:j,k} σ(−c_{i−1:k}).

3. If all the weights w_{i:1,k}, ..., w_{i:s_i,k} leaving the kth node in layer i − 1 are zero, then the function is invariant with respect to all of the inputs to that node σ_{i−1:k}; e.g., to the values of the weight vectors {w_{i−1:k,1}, ..., w_{i−1:k,s_{i−2}}}. If all the weights {w_{L:1,1}, ..., w_{L:1,s_{L−1}}} to the output are 0, then the function is identically 0 and its structure cannot be further determined from its response.

4. If two nodes, say, σ_{i:1}, σ_{i:2}, in a hidden layer have the same thresholds τ_{i:1} = τ_{i:2} and the same input connection weights, w_{i:1,j} = w_{i:2,j} for j = 1 : s_{i−1}, then c_{i:1} = c_{i:2} and the node responses a_{i:1}, a_{i:2} will be identical. Hence, what is transferred to the next layer is w_{i+1:j,1} a_{i:1} + w_{i+1:j,2} a_{i:2} = (w_{i+1:j,1} + w_{i+1:j,2}) a_{i:1}, and all that matters is the sum w_{i+1:j,1} + w_{i+1:j,2} and not the individual weight values.
This question has been examined by Albertini et al. [3], Fefferman and Markel [71], Fefferman in [70], Kurkova and Kainen [134], and Sussman [232], among others, for a single hidden layer network. Albertini et al. observe the need for the node function σ to have a linear independence property: for every s and every choice of pairs (w1, τ1), ..., (w_s, τ_s), the functions 1, σ(w1·x − τ1), ..., σ(w_s·x − τ_s) are linearly independent functions of x so long as no two pairs satisfy (w_i, τ_i) = ±(w_j, τ_j). Kurkova and Kainen establish that for a network with d inputs and s1 nodes in a single hidden layer and a parameter vector u ∈ IR^{s1(d+2)} describing this network, two such parameter vectors u, u′ yield the same function if u′ can be derived from u by the composition of permutations of hidden units and a generalized notion of sign changes. Their theorems require that the node function have certain properties that hold for the familiar node functions (e.g., logistic, tanh).
4.2.2 Closure under Affine Combinations and Input Transformations

It is clear from Definition 4.1.1 that N1,σ is closed under linear combinations, and indeed under affine combinations:

(∀η1, η1′ ∈ N1,σ)(∀a, a′, b ∈ IR)   η1″ = aη1 + a′η1′ + b ∈ N1,σ.

In the 1HL case η1″ is again 1HL with a number s1″ of hidden layer nodes no more than the sum of the numbers of nodes in each of η1, η1′. Hence, if σ is a polynomial of degree k, then so are all the functions in N1,σ. If one considers networks with more than one hidden layer, then it is still true that there is closure under affine combinations; a function η that is a sum of functions η_{k_i}, each represented by k_i hidden layers, has a network representation using no more than (max_i k_i) hidden layers.

Given an invertible affine transformation of the input variables x^0 = Ax + b, det(A) ≠ 0, x = A^{−1}(x^0 − b), and a network η_k(x, w), there is a new network η_k(x^0, w^0) such that

(∀A nonsingular, b)(∀w)(∃w^0)(∀x)   η_k(x^0, w^0) = η_k(x, w).

To verify this claim, observe that we need only modify the first hidden layer weights and biases to ensure that for all k = 1 : s1, we have c_{1:k}(x^0) = c_{1:k}(x). Hence, we require

$$\sum_{j=1}^{d} w_{1:k,j}\,x_j - \tau_{1:k} = \sum_{j=1}^{d} w^0_{1:k,j}\,x^0_j - \tau^0_{1:k},$$
or

$$w_{1:k,\cdot}\,x - \tau_{1:k} = w^0_{1:k,\cdot}\,x^0 - \tau^0_{1:k}.$$

This is achieved by the choices

$$w^0_{1:k,\cdot} = w_{1:k,\cdot}\,A^{-1}, \qquad \tau^0_{1:k} = \tau_{1:k} + w^0_{1:k,\cdot}\,b.$$
We will make use of this observation to allow us to standardize training data (e.g., convert it to sample means of zero and sample standard deviations of unity) to expedite training, without any loss of generality.
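The following minimal sketch (ours; data and weights are illustrative) makes this concrete: the raw inputs are standardized to zero mean and unit standard deviation, and the first-layer weights and thresholds are adjusted so that c_{1:k}(x^0) = c_{1:k}(x) exactly.

% Sketch: standardize inputs and compensate in the first-layer parameters.
X   = [1 2 3 4; 10 20 30 40];                % raw training inputs, one column per sample
mu  = mean(X, 2);  sd = std(X, 0, 2);
A   = diag(1 ./ sd);  b = -mu ./ sd;         % x0 = A*x + b has zero mean, unit std
X0  = A*X + b*ones(1, size(X,2));
W1  = [1 -2; 0.5 3];  tau1 = [0.2; -0.4];    % original first-layer parameters (illustrative)
W10   = W1 / A;                              % w0 = w A^{-1}
tau10 = tau1 + W10*b;                        % tau0 = tau + w0 . b
disp(max(max(abs((W1*X - tau1*ones(1,4)) - (W10*X0 - tau10*ones(1,4))))))  % ~0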
4.2.3 Regions of Constancy
If, as is commonly the case in applications such as character recognition, the number s1 of nodes in the first hidden layer (and we may allow more than one hidden layer in this discussion) is less than the dimension d of the inputs to the network, then in effect the response of the first hidden layer is determined by s1 hyperplanes specified by the individual node weight vectors w_{1:i,·} and thresholds τ_{1:i}. The locus of constant response from the first hidden layer is then the intersection of these hyperplanes. Explicitly, we can identify the set of input values that share the same function value η(x^0, w) as a particular input x^0:

$$\{x : \eta(x, w) = \eta(x^0, w)\} \supset \bigcap_{i=1}^{s_1} \{x : w_{1:i,\cdot}\cdot(x - x^0) = 0\}.$$
If s1 < d then this intersection is a nonempty linear manifold. Insofar as the first hidden layer responses are constant over this manifold of inputs, all subsequent network responses, including the output, must also be constant over this manifold (and possibly constant over an even larger set). An example of this constancy is available from Figure 4.3 for the single-node case; it is evident that there are lines along which the network response is constant. Input values that are not distinguished by at least one node in the first hidden layer can never be distinguished by subsequent network nodes.
4.2.4 Stability: Taylor's Series
It is typically the case for real-valued networks that the selected nonlinear node function σ is continuously differentiable and even analytic in the most common cases of the logistic and hyperbolic tangent functions. For such a choice of node function, the network response η(x, w) is also a differentiable function in both its parameters w and its arguments x. If each σ_{1:i} is r-times (continuously) differentiable then so is η in each of the components
of x, w. The d-dimensional gradient vector of a scalar function η(x) of a vector argument is a column vector

$$\nabla_x \eta = [g_i], \qquad g_i = \frac{\partial \eta}{\partial x_i}.$$

The gradient with respect to input variables for a 1HL network is given by

$$\nabla_x \eta_1(x, w) = \sum_{i=1}^{s_1} w_{2:1,i}\,\sigma'(w_{1:i,\cdot}\cdot x - \tau_{1:i})\,w_{1:i,\cdot}.$$
Here we consider the Taylor’s series expansion in the argument x for the function η1 (x, w). Insofar as σ is once continuously differentiable, we have that η ∈ D1 and Taylor’s Theorem with remainder enables us to write η(x, w) = η(x0 , w) + (∇Tx0 η)(x − x0 ) + o(||x − x0 ||), showing a remainder that is of higher order (o(z)/z converges to 0 as z converges to 0) in the difference between input vectors. Thus nearby inputs give rise to nearby network values—a property of stability of the network representation. If σ is twice continuously differentiable, as is commonly the case with real-valued nodes, then η ∈ D2 and we can extend the representation to second order by introducing the Hessian matrix H of mixed second partial derivatives H = [Hij ], Hij =
∂ 2 η(x, w) . ∂xi ∂xj
In the case of a 1HL network Hij =
s1
w2:1,k σ (w1:k,· · x − τ1:k )w1:k,i w1:k,j .
k=1
Taylor’s Theorem now yields 1 η(x, w) = η(x0 , w) + ∇Tx0 η(x − x0 ) + (x − x0 )T H(x − x0 ) + o(||x − x0 ||2 ). 2 Hence, neural network representations using differentiable node functions are smooth or stable functions of their input variables—similar inputs give rise to similar outputs. The computationally effective evaluation of the gradient ∇w η with respect to the parameter vector is discussed at length in Section 5.2 and is known as backpropagation. Calculation and storage of the Hessian can be problematic, given the typically high dimension of input vectors and especially of parameter vectors. A Taylor’s series to second order in the parameter vector w will be the basis of much of the analysis of training algorithms in Chapter 5.
4.2.5 Approximating to Step, Bounded Variation, and Continuous Functions

Although we will provide formal analyses of the approximation ability of single hidden layer neural networks in the following sections, that these networks enjoy the ability to arbitrarily closely approximate the commonly encountered functions can be seen from the demonstration we now provide of their ability to approximate to common functions of a scalar (d = 1) argument. A direction in which to extend these results to higher dimensional inputs is provided by Dahmen and Micchelli [51]. If we look ahead to Definition 4.7.2, we see that a commonly selected nonlinear node function σ has the properties of being sigmoidal or S-shaped in that it is such that

$$-\infty < \underline{\sigma} = \lim_{z\to-\infty}\sigma(z) < \bar{\sigma} = \lim_{z\to\infty}\sigma(z) < \infty.$$
For a sigmoidal node function and any ε > 0, |z0| > 0, we can find large enough a > 0 such that the product a|z0| is large enough that σ(±az0) is within a small fraction ε of its limiting values σ̲, σ̄:

$$(\forall z > |z_0|)\quad |\underline{\sigma} - \sigma(-az)| < \epsilon, \qquad |\bar{\sigma} - \sigma(az)| < \epsilon.$$

Thus

$$\hat{U}_a(z) = \frac{\sigma(az) - \underline{\sigma}}{\bar{\sigma} - \underline{\sigma}}$$

is approximately a unit-step function U(z),

$$(\forall |z| > |z_0|)\quad |\hat{U}_a(z) - U(z)| < \epsilon.$$

Restated,

$$\lim_{a\to\infty}\;\sup_{\{z : |z| > |z_0|\}} |\hat{U}_a(z) - U(z)| = 0,$$

and we can approximate arbitrarily closely to unit-step functions. If the function f of interest is bounded, monotone, and, say, continuous from the right, then it is easily approximated by a sum of step functions (U(0) = 1). Because f can only have countably many jump discontinuities, we first identify these discontinuities f(d_i) − f(d_i^−) = J_i, and form the discrete part f_d of f through

$$f_d(x) = \sum_i J_i\,U(x - d_i).$$

Now form the remaining continuous part f_c of f through f_c(x) = f(x) − f_d(x).
The continuous part f_c can be uniformly approximated to within any selected ε > 0 by selecting the (possibly doubly infinite) countable sequence {τ_i} of approximation points through τ_0 = 0, |f_c(τ_i) − f_c(τ_{i−1})| < ε, and defining the stepwise approximation

$$\hat{f}_c(x) = \lim_{i\to-\infty} f_c(\tau_i) + \sum_{i=-\infty}^{\infty} \big(f_c(\tau_i) - f_c(\tau_{i-1})\big)\,U(x - \tau_i).$$

We can then conclude that

$$\sup_x |f(x) - f_d(x) - \hat{f}_c(x)| < \epsilon,$$

and we can approximate arbitrarily closely to right-continuous, bounded, monotone functions.

A function f of bounded variation is one for which the Stieltjes integral

$$\int_{\mathcal{X}} |df(x)| < \infty.$$

When f is differentiable with derivative f′, then it is of bounded variation if

$$\int_{\mathcal{X}} |f'(x)|\,dx < \infty.$$

Recall that a function of bounded variation can be written as a difference of two monotone functions. Hence, the elements of the large class of functions of bounded variation are also easily approximated by weighted sums of step functions. Because the functions of bounded variation on a compact set (finite interval) include the trigonometric functions, we see that we can approximate to trigonometric functions and thence to those other functions representable as Fourier sums of trigonometric functions.

Given that we can approximate to unit-step functions we can also approximate to unit-pulse functions through p_τ(z) = U(z) − U(z − τ) being a pulse of unit height and width τ. A function f is Lipschitz if

$$(\exists K)(\forall x, y)\quad |f(x) - f(y)| < K|x - y|.$$

Thus a function f with derivative magnitude |f′| uniformly bounded by K is Lipschitz. That a Lipschitz function can be uniformly arbitrarily closely approximated by a weighted sum of unit-pulse functions follows from

$$\sup_z \Big|f(z) - \sum_{k=-\infty}^{\infty} f(k\tau)\,p_\tau(z - k\tau)\Big| \le K\tau.$$
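The two approximation facts just used, that σ(az) approaches a unit step as a grows and that a Lipschitz function is within Kτ of a weighted sum of width-τ pulses, can be checked numerically; the sketch below (ours, with arbitrary illustrative choices of f, K, and τ) does so.

% Sketch: sigma(az) -> unit step as a grows; pulse approximation of a Lipschitz f.
sig = @(z) 1 ./ (1 + exp(-z));
z   = linspace(-2, 2, 4001);
U   = double(z >= 0);                                 % unit step with U(0) = 1
for a = [1 10 100]
  mask = abs(z) > 0.1;                                % stay away from z = 0
  disp([a max(abs(sig(a*z(mask)) - U(mask)))])        % error shrinks as a grows
end
f    = @(z) sin(3*z);                                 % Lipschitz with K = 3
tau  = 0.01;
zz   = linspace(-1, 1, 2001);
k    = ceil(-1/tau):floor(1/tau);
fhat = zeros(size(zz));
for j = 1:numel(k)
  p = (zz - k(j)*tau >= 0) & (zz - k(j)*tau < tau);   % unit pulse p_tau(z - k tau)
  fhat = fhat + f(k(j)*tau) * p;
end
disp(max(abs(f(zz) - fhat)))                          % bounded by K*tau = 0.03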
4.3 Formalization of Function Approximation

If σ is a continuous function or one that is r-times differentiable, then so are the functions in Nk,σ, and they form a linear subspace of a linear vector space (see Appendix 4.1) of continuous functions C(X) on X or of r-times differentiable functions D^r(X), respectively. That Nk,σ is a proper (strictly smaller) subspace of C(X), when σ is continuous, is easily verified by noting that polynomials are not contained in Nk,σ unless σ is itself a polynomial. However, if σ is a polynomial of degree q then such networks can only implement polynomials of degree q^k, and they are severely limited in their approximation abilities. We inquire into the ability of neural networks (especially those in N1,σ) to approximate to arbitrary continuous functions and subsequently to other families.

To present results on the ability of feedforward networks to approximate to functions, and eventually to training sets, we recall (see Appendix 4.2) that a metric d is used to measure the dissimilarity or distance between two functions. For the space of continuous functions C(X) on X we typically use the worst-case sup-norm metric

$$d(f, g) = \sup_{x\in\mathcal{X}} |f(x) - g(x)|.$$
A set of functions F can arbitrarily closely approximate to a set of functions G, in the sense of a given metric d, if to any g ∈ G and to any positive ε there is an f ∈ F that is at least that close to g. This notion is captured in the following

Definition 4.3.1 (F Dense in G) F is dense in G if (∀g ∈ G, ε > 0)(∃f ∈ F) d(f, g) < ε.

When F is a subset of G, then the functions in G that can be arbitrarily closely approximated by the functions in F are the ones in the subset F together with the limit points of this set of functions. The set F and its limit points is known as the closure F̄ of F in G with respect to the metric d and is specified as follows:

Definition 4.3.2 (Closure) Given a metric d on a space of functions G, the closure F̄ of a family F ⊂ G of functions is

F̄ = {g ∈ G : (∀ε > 0)(∃f ∈ F) d(g, f) < ε}.

In other words, F̄ is precisely the set of functions in G that can be arbitrarily closely approximated by functions in F. Given its importance, we set off the immediate consequence:

N̄1,σ is precisely the set of functions that can be arbitrarily closely approximated by single hidden layer neural networks with a linear output node and hidden layer node function σ.
Hence, we need to understand N̄1,σ. We shall see in Section 4.7 that when we have σ continuous and use the metric induced by the sup-norm, then N1,σ ⊂ C and N̄1,σ = C; hence a single hidden layer neural network can approximate any continuous function. Finally, we will note that N1,σ, calculated using different metrics and with node functions that are not necessarily continuous, also provides approximations to (is dense in) the classes of integrable, measurable, and piecewise continuous or constant functions. These results will provide a very satisfactory answer to our first question about the representational power of feedforward neural networks.
4.4 Outline of Approaches

A number of approaches that demonstrate the power of neural networks have been published, largely in 1989, with a few partial precursors and several subsequent refinements. Almost all approaches rely on some previously established method for representing or approximating a function f(x) through an expansion in terms of a given set of basis functions. If one can show that a neural network can arbitrarily closely approximate any of the basis functions and that there is an architecture that enables us to form the desired expansion by combining the individual neural networks approximating the basis functions, then a theorem that allows us to control all errors of approximation is a theorem that demonstrates that a neural network can approximate f. These approaches can be grouped according to their basic technique as follows:

(a) Kolmogorov's solution to Hilbert's 13th Problem of representing multivariate continuous functions (Section 4.5);

(b) Stone-Weierstrass Theorem on approximation of continuous functions by subalgebras of continuous functions (Section 4.6);

(c) Fourier or integral equation (e.g., see Wiener [252]) representations of integrable functions of several variables;

(d) trigonometric, polynomial, and spline approximations to functions of several variables; and

(e) analytic existence proofs of neural network representations of functions of several variables (Section 4.7).

Approach (a), originally suggested by Hecht-Nielsen in [102], was followed by several others [80, 133, 227, 122] and is discussed in Section 4.5. It is based on Kolmogorov's exact representation for a continuous function of several variables in terms of sums of a composition of a continuous function of sums of continuous functions of single variables. At first glance, this result suggests that a neural network with finitely many nodes arranged in
two hidden layers can exactly represent a given function. However, this is only true if the network nodes in the second hidden layer can be designed to fit the particular function being represented. Because this possibility is not practical, this theorem has been adapted to prove the existence of approximate representations. Approach (b) is discussed in Section 4.6 and provides immediate proofs of the ability of a single hidden layer network to arbitrarily closely approximate a given continuous function of several variables provided the node functions satisfy a condition of generating a subalgebra. While there are node functions that satisfy this condition, most of them actually considered in the practice of neural networks violate this condition. Approach (c), mentioned only here, is the basis of the work of Funahashi [80] and of Koiran [128]. Given a function f of several variables with Fourier transform Ff = F and a kernel function ψ with Fourier transform Ψ we have the relations
$$f(x) = \frac{1}{(2\pi)^n}\int_{-\infty}^{\infty} F(w)\,e^{iw\cdot x}\,dw, \qquad \int_{-\infty}^{\infty} \psi(w\cdot x - w_0)\,e^{iw_0}\,dw_0 = \Psi(1)\,e^{iw\cdot x}.$$

Hence,

$$f(x_1, \dots, x_n) = \frac{1}{(2\pi)^n\Psi(1)} \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \psi(x\cdot w - w_0)\,F(w)\,e^{iw_0}\,dw_0\,dw = \frac{1}{(2\pi)^n\Psi(1)} \int_{-\infty}^{\infty} F(w)\,\Psi(1)\,e^{iw\cdot x}\,dw.$$
Funahashi truncates the intervals of integration to finite limits and shows first that the truncated integrals can be made to approximate f as closely as desired by proper choice of the finite limits, and second that the integrals can be approximated well enough by finite Riemann sums. The Riemann sums then interpret directly as neural networks with node function ψ. More generally, given the integral equation
$$f(x) = \int_{W} K(x, w)\,\lambda(w)\,dw,$$

with known kernel K that is capable of representing f ∈ F by choice of λ ∈ L, we can argue, as above, to approximate the multidimensional integral by finite Riemann sums to find

$$f(x) \approx \delta \sum_{i=1}^{s_1} K(x, w_i)\,\lambda(w_i).$$
If the kernel K is of the form σ(w·x − τ), then we have the usual 1HL neural network representation. If K(x, w) = σ(||w − x||), then we have a radial basis function representation. We will make limited use of these ideas in Section 4.11.4.
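A minimal sketch (ours) of the Riemann-sum idea for d = 1: the kernel is taken to be a narrow Gaussian bump and λ(w) = f(w), so the integral is a slight smoothing of f and the finite sum is a small radial-basis-style network; the kernel, its width, and the grid of centers are our illustrative assumptions.

% Sketch: Riemann-sum version of an integral (kernel) representation, d = 1.
f     = @(x) sin(2*pi*x) + 0.5*x;
h     = 0.02;                                   % kernel width
K     = @(x, w) exp(-(x - w).^2/(2*h^2)) / (sqrt(2*pi)*h);
delta = 0.005;                                  % Riemann-sum spacing
wi    = -0.5:delta:1.5;                         % grid of kernel centers (nodes)
x     = linspace(0, 1, 400);
fhat  = zeros(size(x));
for i = 1:numel(wi)
  fhat = fhat + delta * K(x, wi(i)) * f(wi(i)); % delta * sum_i K(x,w_i) lambda(w_i)
end
disp(max(abs(f(x) - fhat)))                     % small smoothing plus discretization error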
Approach (d), followed by many [110, 118, 128, 144, 163], shows that the usual neural networks can arbitrarily closely approximate the functions {φ_k(x)} in a familiar basis set and the basis set functions can in turn arbitrarily closely approximate the desired function f. These approaches are generally constructive and provide estimates of the complexity (number of nodes needed for a given degree of approximation) needed to approximate f. Approach (d) will be touched on in Section 4.11, when we turn to assess the complexity of a neural network approximation to a function f. Approach (e), followed by Cybenko [50], provides an elegant proof of the universal approximation ability of a single hidden layer network using almost any sigmoidal node function; it is discussed in Section 4.7. The results presented below need to be supplemented by results giving reasons for choosing particular node functions (Section 4.10) and providing a qualitative understanding of the advantages that might accrue from using multiple hidden layers. In practice many applications have been treated with multiple hidden layer networks, with the layering incorporating the insights of the practitioner.
4.5 Implementation via Kolmogorov's Solution to Hilbert's 13th Problem

The synthesis of a continuous real-valued function on, say, the cube I^d = [a, b]^d in IR^d with sides the intervals I = [a, b], using a neural network–like representation admits an exact solution as a consequence of Kolmogorov's solution to Hilbert's 13th problem. The subsequently simplified form of this solution was given by Lorentz [152] and Sprecher [227].

Theorem 4.5.1 (Kolmogorov/Sprecher) Let λ_k be a sequence of positive integrally independent numbers; i.e., Σ_k r_k λ_k ≠ 0 for any integers r_k that are not all 0. Letting D = [0, 1 + 1/5!], there exists a continuous monotonically increasing function ψ : D → D having the following property: For every real-valued continuous function f : [0, 1]^d → IR with d ≥ 2 there is a continuous function Φ and constants {a_d}, {b_k} such that

$$f(x_1, \dots, x_d) = \sum_{k=0}^{2d} \Phi\Big(b_k + \sum_{j=1}^{d} \lambda_j\,\psi(x_j + k a_d)\Big).$$

In effect, if we embed x ∈ IR^d in the 2d + 1 quantities b_k + Σ_{j=1}^{d} λ_j ψ(x_j + k a_d) (a so-called Whitney embedding into IR^{2d+1}), then a continuous function of d real variables can be written as a sum of 2d + 1 continuous functions, differing only in that each of the 2d + 1 arguments is a single real variable b_k + Σ_{j=1}^{d} λ_j ψ(x_j + k a_d). The Kolmogorov solution can be implemented by a net with two hidden layers; the first layer contains d(2d + 1) copies of ψ, and the second layer
FIGURE 4.3. Kolmogorov theorem network.
contains 2d + 1 copies of the function Φ. The scalar output is achieved by summing all of the outputs of the second hidden layer. This network is shown in Figure 4.3. A crucial problem with this solution is that the node function Φ in the second hidden layer depends on the choice of function f, and while Φ is continuous, it is complicated. Thus the synthesis of different functions of several variables requires us to synthesize a function of a single variable and do more than just adapt weights as given by the {λ_j}, {a_d}, {b_k}. Kurkova [133] and Katsuura and Sprecher [122] have adapted the proof of Theorem 4.5.1 to show that we can approximate the continuous functions in the theorem by ones made out of a sigmoidal function through Σ_{i=1}^{k} a_i σ(b_i x + c_i) and thereby approximately synthesize a continuous function f of d variables with a two hidden layer neural network. Kurkova [133] then estimates the numbers of nodes needed in each hidden layer to achieve a given degree of approximation; in principle, this result can be compared to the Hilbert space estimates of Section 4.11. A further adaptation is presented by Katsuura and Sprecher [122]. Cotter and Guillerm [44] argue that the networks derived from the Kolmogorov representation are similar to those of the CMAC (Cerebellar Model Articulation Controller) architecture. However, we do not pursue this representation further because we will show that we need only a single hidden layer network to achieve arbitrarily close approximations to f.
4.6 Implementation via Stone-Weierstrass Theorem

The easiest access to a result that indicates the power of a single hidden layer network is provided by the celebrated Stone-Weierstrass theorem (see
also Cotter [43], Hornik et al. [109]). The original form of this theorem, due to Weierstrass, established that continuous functions defined on an interval could be uniformly approximated by polynomials. The Weierstrass theorem could be given a network representation if we allowed our nonlinear nodes to be polynomials of arbitrary degree. This would have the significant disadvantage of making a network out of a wide variety of neurons rather than out of a single kind of nonlinear neuron. The generalization of the celebrated Weierstrass theorem allows us to choose a single kind of nonlinear neuron and yet achieve uniform (sup-norm) approximation to continuous functions. However, the general theorem requires several preliminary definitions, and those concerning vector spaces can be found in Appendix 4.1.

Definition 4.6.1 (Subalgebra) A subalgebra A of a set C of real-valued functions is a subset of C that is closed under the operations of products, f, g ∈ A ⇒ fg ∈ A, and linear combinations, f, g ∈ A ⇒ (∀α, β) αf + βg ∈ A ⊂ C.

Clearly the set of continuous functions is itself a subalgebra, as is the set of polynomials. The results on approximation that we develop all assume that the set of possible input values (the domain of the function being approximated) is a closed (contains its limit points) and bounded set, typically [a, b]^d = I^d. The appropriate analytical setting for such issues is that of a topological space as reviewed in Appendix 4.3. We are now prepared to state a result on approximation of functions by members of a subalgebra that can be found in Ash [13, p. 393].

Theorem 4.6.1 (Stone-Weierstrass) If C(X) is the space of continuous functions on a compact Hausdorff space X (e.g., I^d) and

1. A is a subalgebra of C;
2. such that a nonzero constant function is in A;
3. and if x, y ∈ X, x ≠ y, then there is an f ∈ A such that f(x) ≠ f(y),

then the sup-norm closure Ā of A is C.

If X = I^d ⊂ IR^d, then the polynomials in d variables clearly form a subalgebra of C, contain the constants and are such that for any pair of elements x^0, x^1 there is a polynomial (e.g., the polynomial of degree 1 in the variable corresponding to a coordinate in which x^0, x^1 disagree) that takes on distinct values for the distinct arguments. Hence, the polynomials satisfy the
conditions of the Stone-Weierstrass theorem, and thereby it incorporates the original Weierstrass theorem. Of greater interest to us is that we learn that we can approximate any continuous function f on I^d through linear combinations of exponentials of the form

$$f(x) \approx f_n(x) = \sum_{i=1}^{n} \alpha_i\,e^{w_i\cdot x}. \qquad (4.6.1)$$
It is immediate that the class of representations of the form of Eq. 4.6.1 is closed under linear combinations. That the products of such representations are in the same family follows from the product of exponentials being an exponential.

1. Hence, we have a subalgebra.
2. Take n = 1, α1 ≠ 0, w1 = 0 for the nonzero constant function.
3. Take n = 1, α1 = 1, w1 = x − y to distinguish f(x) from f(y). Even more simply, if x ≠ y, then there is a component, say, the jth, at which x_j ≠ y_j. Let w1 have all zero entries, except at the jth position, where it has the entry 1.

Thus the conditions of the Stone-Weierstrass Theorem are met, and Eq. 4.6.1 gives us a family of approximations that can be made arbitrarily close to any continuous function and that can be represented by a single hidden layer neural network with node function σ(z) = e^z. Furthermore, we can form a subalgebra from the family of weight vectors restricted to have integer or rational coordinates and thereby the network parameters need only be of limited precision, although this precision will vary depending on the problem at hand.

We can also use trigonometric functions sin(w·x), cos(w·x) as node functions when X = [a, b]^d. That the trigonometric functions form a subalgebra is readily seen when one recalls that a product of two such functions can be re-expressed using trigonometric identities in terms of sums and differences. If one recalls the Euler relation e^{iθ} = cos(θ) + i sin(θ), then it is easy to verify the conditions of the Stone-Weierstrass theorem, although we have formally introduced complex-valued functions in place of our real-valued functions. The sums Σ_{k=1}^{n} α_k e^{iw_k·x} form a subalgebra that contains the nonzero constant function. The separation property for x ≠ y is ensured when we identify at least one component x_j ≠ y_j and choose w such that k ≠ j → w_k = 0, w_j = 1/(b−a); we need to exercise care in this periodic case that w·(x−y) ≠ 2mπ. We conclude that a trigonometric node function will also satisfy the Stone-Weierstrass theorem and enable a single
hidden layer neural network composed of such node functions to arbitrarily closely approximate any continuous function f(x) on I^d in sup-norm. However, if we take for our node function the usual sigmoidal function 1/(1 + e^{−x}) then the functions in N1,σ do not form a subalgebra and the Stone-Weierstrass Theorem is not directly applicable. Hornik et al. [109] deal with this by forming the algebra of sums of products of node functions. This algebra clearly satisfies the conditions of the Stone-Weierstrass theorem. However, they then need to show that a product of node functions can be arbitrarily well-approximated by a network which is restricted to be a sum of node functions. They use Stone-Weierstrass to show that sums of products of trigonometric functions can arbitrarily closely approximate continuous functions on I^d. This would seem not to apply to neural networks because they form sums, but not products. A product of trigonometric functions, however, can in turn be expressed as a sum of trigonometric functions using the usual identities (e.g., 2 cos(θ1) cos(θ2) = cos(θ1 − θ2) + cos(θ1 + θ2)). Hence, to approximate a sum of products of trigonometric functions we must be able to approximate a sum of trigonometric functions. It remains only to show that a single hidden layer neural network can uniformly approximate a given trigonometric function with argument a linear combination of the elements of x.
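As a numerical illustration of Eq. 4.6.1 (ours; the target function, the fixed weight vectors w_i, and the least-squares fitting of the α_i are illustrative assumptions, not part of the theorem), a continuous f on [0, 1]^2 can be fit on a grid by a modest number of exponential nodes:

% Sketch: fit f on a grid by a sum of exponential nodes, Eq. 4.6.1.
f  = @(x1, x2) cos(pi*x1) .* x2;                % target continuous function on [0,1]^2
[x1, x2] = meshgrid(linspace(0, 1, 25));
n  = 30;
W  = 4*rand(n, 2) - 2;                          % fixed weight vectors w_i (illustrative)
Phi = exp([x1(:) x2(:)] * W');                  % columns are e^{w_i . x} on the grid
% Phi can be ill-conditioned; this is only a demonstration of representational reach.
alpha = Phi \ f(x1(:), x2(:));                  % least-squares coefficients alpha_i
disp(max(abs(Phi*alpha - f(x1(:), x2(:)))))     % grid fit error; more nodes, smaller error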
4.7 Single Hidden Layer Network Approximations to Continuous Functions

We follow Cybenko [50] and show that for "almost all" node nonlinearities σ, N1,σ is dense in C(X) and that for σ continuous the closure N̄1,σ = C(X). Hence, we can arbitrarily closely approximate any continuous function f by a function in N1,σ. The method of proof for σ continuous is to assume, contrary to our expectations, that there is a function in C(X) that is not in N̄1,σ. While the proof sketched below invokes results beyond the level assumed of a reader of this monograph, the argument will be restated in terms of familiar geometrical ideas drawn from IR^d.
Infinite and Finite Dimensional Analogies

  Function space F                    IR^d
  F = C(X)                            H_d
  f                                   x
  Linear subspace N1,σ                Hyperplane H_k through 0
  N̄1,σ                                H_k
  Linear functional L : F → IR        w · x
  Integral representation of L        Σ_i w_i x_i
We noted in Section 4.2.2 that N1,σ is a linear subspace of functions. If instead of considering the infinite-dimensional function spaces of actual interest to us, we turn to the d-dimensional space IR^d, then a linear subspace H_k of IR^d is a k-dimensional hyperplane including the origin; e.g., consider the line {(x1, x2) : x2 = ax1} through the origin in the plane IR^2 as an example of a one-dimensional linear subspace. We discussed aspects of such finite-dimensional linear subspaces in Chapter 2 (and more abstractly in Appendix 4.1) and know that they can be represented in terms of d − k weight vectors w_i through

$$H_k = \bigcap_{i=1}^{d-k} \{x : w_i\cdot x = 0\}.$$
The whole space itself is a d-dimensional hyperplane H_d. We make the analogies that the space of continuous functions C(I^d) corresponds to IR^d, in that a function corresponds to a point, and the linear subspace N1,σ corresponds to a k-dimensional hyperplane. The issue as to whether N̄1,σ = C becomes by analogy the issue as to whether k = d. If we assume to the contrary, then k < d, and there exists a point y ∈ H_d = IR^d not on a hyperplane H_{d−1} ⊃ H_k characterized by a single weight vector (normal to the hyperplane) w. In our straight line linear subspace example the point y = (1, −sgn(a)) is in IR^2 but not on the line and w = (1, −sgn(a)) is such that it is orthogonal to every point on the line. Equivalently, if k < d then there exists a nonzero linear functional L given by L(x) = w · x that is identically zero on H_{d−1}, and thus on H_k. To prove that k = d we must show that this is a contradiction and that the only linear functional of H_k that is zero there must be zero everywhere. This can be done by selecting d linearly independent points x_i and establishing that L(x_i) = 0; there cannot be a nonzero vector w ∈ IR^d that is orthogonal to each of d linearly independent vectors. This completes our finite-dimensional analogy to the actual infinite-dimensional proof developed by Cybenko and sketched below.
(∃g ∈ C) L(g) = 0.
One then uses a Riesz Representation Theorem (Ash, [13], p.184]) to assert that L can be given an integral representation (an analog to the dot product
4.7 Single Hidden Layer Network Approximations to Continuous Functions
representation)
101
f (x)µ(dx),
L(f ) = Id
where µ is a finite, signed, regular measure. Informally, one can think of this integral representation more familiarly as a multidimensional integral
L(f ) = f (x)µ (x)dx, Id
in which the weighting corresponding to w in the analog dot product representation is given by the derivative (density) µ (x). The specific type of integral chosen is determined by C and would be different for different function spaces. In order to provide a condition under which µ is forced to be zero-valued (an analog of the orthogonality of w to each of d linearly independent vectors), we consider that L(f ) = 0, f ∈ N1,σ means that L(f ) = L(
n
αi σ(wi · x − τi )) =
i=1
n
αi L(σ(wi · x − τi )) = 0.
i=1
Because this must hold for all choices of parameters, it holds if and only if
$$(\forall w, \tau)\quad L(\sigma(w \cdot x - \tau)) = 0.$$

Definition 4.7.1 (Discriminatory σ) σ is discriminatory if for any finite, signed, regular measure µ,
$$(\forall w \in \mathbb{R}^d)(\forall \tau \in \mathbb{R})\quad \int_{I^d} \sigma(w \cdot x - \tau)\,d\mu(x) = 0$$
implies that µ = 0.

For example, if $\sigma(z) = e^{iz}$, then by the unicity of Fourier transforms (or, in probabilistic terms, characteristic functions) we expect that µ = 0 whenever its Fourier transform (we can even set τ = 0) is 0. We can now conclude that if the nonlinearity σ is continuous and discriminatory, then L must be identically zero, in contradiction to Hahn-Banach. Hence, the desired contradiction is reached and $\bar N_{1,\sigma} = C$.

Theorem 4.7.1 (Cybenko, Theorem 1, p. 306) If σ is continuous and discriminatory, then the functions $N_{1,\sigma}$ implemented by a single hidden layer net are dense in $C(I^d)$, or, equivalently stated, $\bar N_{1,\sigma} = C$.

We now identify an important class of discriminatory node functions.
Definition 4.7.2 (Sigmoidal Function) A sigmoidal function σ is one such that
$$\infty > \lim_{z \to \infty} \sigma(z) = \bar\sigma > \underline{\sigma} = \lim_{z \to -\infty} \sigma(z) > -\infty.$$
Lemma 4.7.1 (Cybenko, Lemma 1, p. 307) Any bounded, measurable sigmoidal function σ is discriminatory. Thus, any bounded, piecewise continuous sigmoidal function (e.g., LTU, logistic, tanh) is discriminatory.

Proof Sketch. This conclusion can be made to follow from the observations of Section 4.2.5 that when σ is sigmoidal it can approximate a step function. Furthermore, step functions can approximate trigonometric functions. Finally, by the uniqueness of Fourier transforms, trigonometric functions are discriminatory. □

Illustrations of approximations by single hidden layer neural networks are provided in Figures 4.4 and 4.5.
FIGURE 4.4. Approximations to a quadratic for s1 = 2, 4 nodes.
FIGURE 4.5. Approximations to a sinusoid for s1 = 4, 8 nodes.
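To make the constructive idea behind such approximations concrete, here is a minimal Python sketch (my own illustration, not the book's MATLAB code): steep logistic nodes stand in for unit steps, and a staircase of $s_1$ such nodes tracks f(x) = x² on [0, 1]. The function name one_hidden_layer_fit, the steepness constant, and the knot placement are illustrative choices only; the sup-norm error printed at the end shrinks as the hidden layer widens.

```python
# A constructive sketch of a single hidden layer approximation: steep logistic
# nodes emulate unit steps, and a staircase of steps tracks f(x) = x**2 on [0, 1].
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hidden_layer_fit(f, s1, steepness=200.0):
    """eta(x) = b + sum_k alpha_k * sigma(steepness * (x - tau_k)) approximating f."""
    taus = (np.arange(s1) + 0.5) / s1      # step locations (interval midpoints)
    knots = np.arange(s1 + 1) / s1         # points where the staircase matches f
    alphas = np.diff(f(knots))             # jump sizes of the staircase
    b = f(knots[0])
    return lambda x: b + np.sum(
        alphas[:, None] * sigma(steepness * (x[None, :] - taus[:, None])), axis=0)

f = lambda x: x ** 2
x = np.linspace(0.0, 1.0, 501)
for s1 in (2, 4, 8):
    eta = one_hidden_layer_fit(f, s1)
    print(s1, "nodes, sup-norm error ~", round(np.max(np.abs(eta(x) - f(x))), 3))
```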
A characterization of the acceptable node functions that does not require the assumptions of bounded and sigmoidal is given by Leshno et al. [144]. They have shown that virtually any nonpathological and nonpolynomial function σ that is locally bounded and whose points of discontinuity are a set of measure (volume) zero will serve to generate networks with the
universal approximation property. That σ not be a polynomial is clearly a necessary condition, because a single hidden layer network with polynomial nodes of degree p can only generate a polynomial of degree p no matter what the width $s_1$ of the hidden layer. Hopfield's previously quoted remark about the importance of nonlinearity needs to be slightly modified to understand nonlinearity as nonpolynomial behavior. To report their somewhat technical results we need to first define the set M of node functions.

Definition 4.7.3 (Leshno et al.) Let M = {σ} denote the set of node functions such that
1. The closure of the set of points of discontinuity of any σ ∈ M has zero Lebesgue measure (length).
2. For every compact set $K \subset \mathbb{R}$, the essential supremum of σ on K, with respect to Lebesgue measure ν, is bounded:
$$\mathrm{ess\,sup}_{x \in K} |\sigma(x)| = \inf\{\lambda : \nu\{x : x \in K,\ |\sigma(x)| \ge \lambda\} = 0\} < \infty.$$

For example, property (1) is satisfied if the points of discontinuity have only finitely many limit points, and property (2) is satisfied if σ is bounded almost everywhere.

Theorem 4.7.2 (Leshno et al., Theorem 1, p. 863) If σ ∈ M, then the linear span of σ(w · x − τ) is dense in $C(\mathbb{R}^d)$ if and only if σ is not almost everywhere an algebraic polynomial.

Leshno et al. remark on the necessity for including the threshold τ for this to be true.
4.8 Constructing Measurable, Partition, Integrable, and Partially Specified Functions

4.8.1 Measurable Functions
Cybenko extends these results from constructing functions in C to those in L1 (absolutely integrable functions) and summarizes his results and those of others in a table on [50, p. 312]. Hornik et al. [109] also established the ability of a single hidden layer network with sigmoidal nodes to arbitrarily closely approximate well-behaved functions of several variables. To make these generalizations plausible we invoke the following characterization of measurable functions (e.g., Ash [13, pp. 186 and 187], Rudin [204, p. 53]). Theorem 4.8.1 (Lusin) If f is a measurable function on X and µ a regular, finite measure on appropriate subsets of X (e.g., a probability measure
on the event (Boolean σ-) algebra for $\mathcal X$), then for any $\epsilon > 0$ there exists a closed set A and a continuous function g such that f and g agree on A and $\mu(A^c) < \epsilon$.

Basically, measurable functions on a compact domain $\mathcal X = I^d$ can be arbitrarily closely approximated by continuous functions in the sense that they will agree with a continuous function except on a set of arbitrarily small measure (probability). Note that this sense of approximation is not in sup-norm but is compatible with $L^p$ norms. Because a continuous function can be closely approximated by functions (networks) in $N_{1,\sigma}$, it follows that a measurable function can be closely approximated over all but an arbitrarily small part of its domain. In particular, a discontinuous partition, classification, or decision function can be closely approximated by a continuous function, which can in turn be closely approximated by a function in $N_{1,\sigma}$. Of course, a two hidden layer network with LTU nodes can perhaps better approximate such a classification function. As discussed in Section 3.6, a function that is piecewise-constant over finitely many polyhedral regions (and the complement of their union) can be realized exactly by such a network.

The practical conclusion is that single hidden layer neural networks can be chosen to approximate arbitrarily closely to any continuous or integrable function or, indeed, any function of practical interest. An accurate approximation, however, may require a prohibitively large number of nodes and therefore much implementation hardware. However, because we only require a single hidden layer, any such network can then compute the desired function very rapidly, with speed limited primarily by the need to calculate the weighted sum of the potentially very large number of individual node responses.
4.8.2 Enumerating the Constructable Partition Functions
We examine the construction of partition or classification functions in more detail by enumerating the number of such functions that can be implemented by a network of a given complexity. As in Chapters 2 and 3, we restrict attention to dichotomizations. Given a network function η(w, x) parameterized by weights (and thresholds) w, we create a dichotomization through S + = {x : η(w, x) > 0}, S − = {x : η(w, x) ≤ 0}. In effect, we can imagine following the network output by an LTU. Because the family of networks is typically uncountable, we can expect to achieve uncountably many different dichotomizations of IRd by a given architecture. To make the enumeration question more interesting and relevant to the issues that arise in the practice of neural networks, we need to shift to dichotomizations of a finite training set Tn = {(xi , ti ) : i = 1 : n}. The
enumeration of the number of dichotomizations that can be produced by a family of networks of a given architecture parallels that given in Section 3.5, where we introduced the Vapnik-Chervonenkis capacity or dimension v as the size of the largest set that could be shattered (all subsets generated) by the architecture. Results that are helpful in evaluating the capacity v for a given family N can be found in the literature [60, 224, 240, 248] and Devroye et al. [61, Section 13.2]. We learn from this literature that if N is a d-dimensional real vector space of functions, e.g.,
$$\eta(x) = y = \sum_{i=1}^{d} \alpha_i \phi_i(x),$$
then its capacity is d in terms of the sets it generates of the form $\{x : \eta(x) > 0\}$. In particular, the family of polynomials of degree n in a real variable has capacity n + 1. Furthermore, if N is a family of nested sets (i.e., linearly ordered by inclusion), then v = 1 (see the enumeration sketch following Theorem 4.8.2). If N contains intersections of sets drawn from m such families, then v ≤ m. We noted in Section 3.5 that LTU-based networks had a VC dimension upper bounded by O(w log w), but it turns out that this is no longer true for networks using more general node functions.

Theorem 4.8.2 (Koiran and Sontag, [129, Theorem 1]) For every n ≥ 1, there is a network architecture with inputs in $\mathbb{R}$ and $O(\sqrt{N})$ weights that can shatter a set of size $N = n^2$. This architecture is made only of linear and threshold gates.

The same authors in [129] note that this theorem extends to networks made of the usual logistic nodes. Hence, a lower bound on the maximal VC dimension of a network with w parameters is of order $w^2$. It appears from other work (Karpinski and MacIntyre [121]) that the upper bound to VC dimension may be $O(w^4)$.
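As a concrete illustration of counting dichotomizations, the following sketch (an illustrative toy of my own, not from the text) enumerates the labelings of a finite point set on the line produced by the nested family $S_\tau = \{x : x > \tau\}$; only n + 1 of the $2^n$ labelings are realized, consistent with a capacity of v = 1 for nested families.

```python
# Count the dichotomizations of a finite point set produced by the nested family
# S_tau = {x : x > tau} on the line; a set is shattered only if all 2^n labelings
# appear, which fails here for n >= 2.
def dichotomies_by_threshold(points):
    pts = sorted(points)
    taus = [pts[0] - 1.0]
    taus += [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    taus += [pts[-1] + 1.0]
    return {tuple(x > tau for x in points) for tau in taus}

points = [0.3, 1.1, 2.4, 3.8]
realized = dichotomies_by_threshold(points)
print(len(realized), "of", 2 ** len(points), "dichotomies realized")  # 5 of 16
```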
4.8.3 Constructing Integrable Functions
We now consider approximating functions that are not necessarily continuous but are absolutely integrable. The family of absolutely Lebesgue integrable functions on a set X is denoted by L1 (X ). The appropriate approximation metric is the L1 metric (see Appendix 4.2) given by
$$\|f - g\| = \int_{\mathcal X} |f(x) - g(x)|\,dx.$$
Theorem 4.8.3 (Cybenko Theorem 4, p. 310) If X is compact and the node function σ is bounded, measurable, and sigmoidal, then the set of functions N1,σ implementable by a single hidden layer network is dense in L1 (X ) with respect to the L1 metric.
Thus, for example, piecewise continuous, bounded, sigmoidal nodes generate networks that can approximate any L1 function in the sense of the L1 metric. That this cannot be true for the sup-norm metric considered earlier for continuous functions is apparent from the irreducible sup-norm error incurred when, say, σ is continuous but the function f being approximated has a jump discontinuity at x0 . In that case the functions in N1,σ are required to be continuous and at the point of discontinuity of f must perforce incur an irreducible sup-norm error with respect to the values of f when limits are taken at x0 from the left and the right. Leshno et al. [144] also provide a general result on approximation to integrable functions and treat the general case of functions in Lp (µ) that have integrable pth powers with respect to non-negative measures µ that are absolutely continuous with respect to Lebesgue measure (µ has a density) and have compact support (assign full measure to a compact set),
$$L^p(\mu) = \Big\{f : \int_{\mathcal X} |f(x)|^p\,\mu(dx) < \infty\Big\}.$$
The appropriate metric is now
$$d(f, g) = \Big(\int_{\mathcal X} |f(x) - g(x)|^p\,\mu(dx)\Big)^{1/p}.$$
Theorem 4.8.4 (Leshno et al. [144, Proposition 1, p. 863]) Given µ as above and σ satisfying Definition 4.7.3, then N1,σ is dense in Lp (µ) in the sense of the Lp metric, so long as σ is not almost everywhere a polynomial and 1 ≤ p < ∞. The essence of the preceding technical discussion is: Single hidden layer neural networks are universal approximators in that they can arbitrarily closely approximate, in the appropriate corresponding metric, to continuous or to pth-power integrable functions so long as the node function satisfies the necessary condition of not being a polynomial.
4.8.4 Implementing Partially Specified Functions
We now turn to the problem of approximating closely to a partially specified function. The problem format is that we are given a training set T = {(xi , ti ), i = 1 : n} of input-output pairs that partially specify t = f (x), and we wish to select a net η(·, w) so that the output y i = η(xi , w) is close to the desired output
$t_i$ for the input $x_i$. This is the typical situation in applications of neural networks—we do not know f but we have points on its graph. If instead you are fortunate enough to be given the function f relating t to x, but need a rapid means of computing it to a good approximation, then you can generate arbitrarily large training sets by sampling the function domain, either deterministically or randomly, and calculating the corresponding responses, thereby reducing this problem to the one we will treat in detail in the next chapter. The notion of "closeness" on the training set T is typically formalized through an error or objective function or metric of the form
$$E_T = \frac{1}{2}\sum_{i=1}^{n} \|y_i - t_i\|^2.$$
Note that the dependence of y on the parameters w defining the selected network η implies that $E_T = E_T(w)$ is a function of w. Although there are infinitely many other measures of closeness (e.g., metrics such as sup-norm for real-valued functions and cross-entropy for the discrete-valued targets found in pattern classification), it is usually more difficult to optimize for these other metrics through calculus methods, and virtually all training of neural networks takes place using the quadratic metric, even in cases where eventual performance is reported for other metrics. A continuous function can interpolate a training set T, provided that the inputs $\{x_i\}$ are all distinct. Hence, the results in Section 4.7 establish that a single hidden layer network can approximate arbitrarily closely to any given training set T of size n if it is wide enough ($s_1 \gg 1$); a sketch of the training-error computation follows. An appropriate measure of the complexity of a network that relates to its ability to approximate closely to a training set is given by the notion of Vapnik-Chervonenkis (VC) dimension or capacity. Discussion of VC dimension is provided in Sections 3.5 and 7.8.2 and is available from Devroye et al. [61], Kearns and Vazirani [125], Vapnik [241], and Vidyasagar [244]. Although the discussion in Section 3.5 applies to networks with only binary-valued outputs, the discussion in Section 7.8.2 shows how to extend this to networks with real-valued outputs. Some studies of network generalization ability (see Section 7.8) also rely on VC dimension.
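The sketch below is a hypothetical construction of my own (the network eta, its parameter names, and the data are assumptions, not the author's code); it shows the quadratic training error $E_T(w)$ being evaluated for a single-hidden-layer tanh network on a partially specified training set.

```python
# Quadratic training error E_T(w) = 1/2 * sum_i (y_i - t_i)^2 for a single
# hidden layer network eta(x, w) = sum_j alpha_j * tanh(w_j . x - tau_j) + b.
import numpy as np

def eta(X, W, tau, alpha, b):
    """X is (n, d); W is (s1, d); tau, alpha are (s1,); b is a scalar bias."""
    return np.tanh(X @ W.T - tau) @ alpha + b

def training_error(X, t, W, tau, alpha, b):
    y = eta(X, W, tau, alpha, b)
    return 0.5 * np.sum((y - t) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # n = 20 inputs in R^3
t = np.sin(X[:, 0]) + 0.1 * X[:, 1]     # partially specified target values
s1 = 8                                   # hidden-layer width
W, tau = rng.normal(size=(s1, 3)), rng.normal(size=s1)
alpha, b = rng.normal(size=s1), 0.0
print("E_T at a random w:", training_error(X, t, W, tau, alpha, b))
```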
4.9 Achieving Other Approximation Objectives

4.9.1 Approximating to Derivatives
We may wish to approximate to f as well as to several of its derivatives. A network with LTU nodes can arbitrarily closely approximate to a continuous or integrable f through a piecewise constant network function. However, the derivatives of this approximation are almost everywhere zero,
unlike what is likely the case with the function f itself. Results on simultaneous approximation to a function as well as its derivatives are available from a number of sources, including Hornik [110], Hornik et al. [111], and Yukich et al. [259]. Approximation to derivatives is incorporated through norms that take them into account. Let $a = (a_1, \ldots, a_d)$ denote a so-called multi-index indicating that the operator $D^a$ takes an $a_i$-fold derivative with respect to the variable $x_i$,
$$D^a f(x) = \frac{\partial^{a_1 + \cdots + a_d} f(x)}{\partial x_1^{a_1} \cdots \partial x_d^{a_d}}.$$
The total order $\bar a$ of the derivative is given by
$$\bar a = \sum_{i=1}^{d} a_i.$$
Let $C^m(\mathcal X)$ denote the space of functions on compact $\mathcal X$ having continuous derivatives of all orders less than or equal to m. An extension of the sup-norm for functions in $C^m(\mathcal X)$ is given by
$$\|f\|^\infty_{m,\mathcal X} = \max_{\bar a \le m}\ \sup_{x \in \mathcal X} |D^a f(x)|,$$
and an extension of the $L^p$ norm is given by
$$\|f\|^p_{m,\mathcal X} = \Big(\sum_{\{a : \bar a \le m\}} \int_{\mathcal X} |D^a f|^p\,\mu(dx)\Big)^{1/p}.$$

Theorem 4.9.1 (Hornik, [110, Theorem 3, p. 253]) If $\sigma \in C^m(\mathbb{R})$ is nonconstant and bounded, then for all finite measures µ on compact $\mathcal X$,
$$\bar N_{1,\sigma} = \{f : f \in C^m(\mathcal X),\ \|f\|^p_{m,\mathcal X} < \infty\}.$$
Furthermore, $\bar N_{1,\sigma} = C^m(\mathcal X)$ with respect to $\|f\|^\infty_{m,\mathcal X}$.

Hence, with reasonable assumptions on the node function, we find that a single hidden layer network can again arbitrarily closely approximate to f as well as its derivatives up to any given order, provided that they exist appropriately as measured by either of the norms given above.
4.9.2 Approximating to Inverses
The results displayed thus far demonstrate the abilities of a single hidden layer network as a universal approximator when approximation is understood in its standard sense as formalized through a norm-based metric. However, applications sometimes require us to construct functions to a different standard. We have shown that a single hidden layer suffices for arbitrarily accurate approximations, and we will make some arguments about
the complexity of such representations in Section 4.11, but we cannot argue for the optimality of this architecture. There are simple problems that have exact solutions with multiple hidden layers but only awkward approximations with a single hidden layer. If we return to LTU-based networks, then the function on $\mathbb{R}^2$ that is 0 in the first and third quadrants and 1 in the second and fourth quadrants can be implemented exactly using two hidden layers of LTU nodes (see the sketch following Theorem 4.9.3). However, it can only be approximated by any finite single hidden layer network. Indeed, if one measures approximation by sup-norm, then one cannot approximate arbitrarily closely with a single hidden layer network and must use two hidden layers. This point is given relevance and detail by Sontag [225] in a discussion of the requirements for feedback stabilization of continuous control systems.

Control problems typically involve the implementation of an inverse function φ to a forward mapping f that represents the so-called plant dynamics. The function f maps from the state variables, exogenous environmental inputs or disturbances, a time variable t, and a control φ to the target variables $x \in \mathbb{R}^p$. For example, a target variable when I am driving a car may be my position measured laterally across the highway, perpendicular to my direction of travel. My control variable is the angle φ of the steering wheel. The plant dynamics take into account the dynamics of my car and the wind and road surface inputs that perturb my position. Of course, in a car I also have the controls of gas and brake pedals that determine my velocity along the direction of travel. The control law φ maps from the desired target set (I may be satisfied to have x ∈ T corresponding to any lateral position so long as my car is contained within my lane) to the excitation or control applied to the plant f needed to secure my entry into the target set T. Stripped of extraneous details, φ is an inverse to f in that we desire
$$f(\phi(x)) \approx x, \quad x \in T.$$
More formally, for a given $\epsilon > 0$ we desire, in sup-norm (so as to ensure we do not leave a target set, e.g., the road), an approximate inverse $\hat\phi$ satisfying
$$\|f(\hat\phi(x)) - x\| < \epsilon. \tag{4.9.1}$$
However, even in the case of continuous plants f, Sontag [225] has shown that this inverse φ may be discontinuous and fail to be approximable by a single hidden layer network so as to satisfy the criterion of Eq. 4.9.1.

Theorem 4.9.2 (Sontag, [225, Proposition 2.5, p. 983]) The set of functions $N_{1,\sigma}$ computable by a single hidden layer network with node function either an LTU or continuous (e.g., the logistic) is not such that for any continuous f and for all $\epsilon > 0$ there will exist a sup-norm approximation $\hat\phi \in N_{1,\sigma}$ to the inverse φ satisfying
$$(\forall x \in T)\ \|f(\hat\phi(x)) - x\| < \epsilon.$$
However, if we allow networks with two hidden layers, then we have the following.

Theorem 4.9.3 (Sontag, [225, Proposition 2.4, p. 983]) The set of functions computable by a two hidden layer network with LTU nodes is such that for any continuous f and $\epsilon > 0$ there exists a member $\hat\phi$ such that
$$(\forall x \in C)\ \|f(\hat\phi(x)) - x\| < \epsilon.$$
Hence, the nature of approximation in applications may dictate the use of multiple hidden layers.
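The quadrant function mentioned above has the following exact two-hidden-layer LTU realization (my own construction, offered only for illustration): the first layer extracts the two sign bits and the second layer computes their exclusive-or, which the output LTU then reads off.

```python
# Exact two-hidden-layer LTU realization of the function on R^2 that is 1 in the
# second and fourth quadrants and 0 in the first and third: XOR of the sign bits.
def ltu(z):
    return 1 if z > 0 else 0

def quadrant_net(x1, x2):
    u1, u2 = ltu(x1), ltu(x2)        # hidden layer 1: the two sign bits
    a = ltu(u1 - u2 - 0.5)           # hidden layer 2: u1 AND NOT u2
    b = ltu(u2 - u1 - 0.5)           #                 u2 AND NOT u1
    return ltu(a + b - 0.5)          # output LTU: OR of the two

for x in [(1, 1), (-1, 1), (-1, -1), (1, -1)]:
    print(x, quadrant_net(*x))       # 0, 1, 0, 1
```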
4.10 Choice of Node Functions

The important issue of architecture selection involves the choice of network graph as well as the choice of node function. The choice of network graph is commonly treated as just the choice of the number of hidden layers, with the layers assumed to be fully connected to their predecessor layer. Particular applications, such as optical image processing, may adopt more specific network configurations based on an understanding of signal processing, say, through local receptive fields whose responses are then combined in succeeding layers. These issues are addressed somewhat in Chapter 6. It must be admitted that much remains to be done before we have rational and computationally applicable methods of network graph selection.

Nor is the situation much better when it comes to the choice of node nonlinearity. It is clear that no one node function or architecture can be best in terms of always yielding accurate implementations with minimal numbers of parameters. The uniqueness of neural network representations discussed in Section 4.2.1 shows us that the most parameter-efficient accurate implementation of a function that is itself defined by a particular neural network is given by that particular network. Asking for nodes that are efficient in terms of yielding accurate approximations to a given class of functions using a minimal number of such nodes can lead to amusing consequences. Kurkova [135] suggested that if d = 1, $\mathcal X = [0,1]$, and the class of functions $\mathcal F$ being considered has a countable dense set $\{f_i\}$, then we can construct a single node function σ that by itself is universal. Simply take
$$\sigma(x) = \sum_{i=0}^{\infty} f_i(x - i)\,[U(x - i) - U(x - i - 1)].$$
Given any $f \in \mathcal F$ and $\epsilon > 0$, by assumption there exists $f_j \in \mathcal F$ with $d(f, f_j) < \epsilon$. Hence, $\sigma(x + j)$, a single node with unit weight and threshold $-j$, approximates to f to within $\epsilon$ for $x \in [0, 1]$, as the sketch below illustrates.
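The following toy sketch is entirely illustrative (the three functions merely stand in for a countable dense family, and the names are mine); it mimics Kurkova's splicing construction and checks that a single shifted evaluation of σ recovers a chosen member of the family on [0, 1].

```python
# Kurkova's trick: splice a countable family {f_i} on [0, 1] into one node sigma
# by placing f_i on [i, i+1); then sigma(x + j) reproduces f_j on [0, 1] exactly.
import math

family = [lambda x: x,                 # f_0
          lambda x: x ** 2,            # f_1
          lambda x: math.sin(6 * x)]   # f_2 (stand-ins for a dense family)

def sigma(x):
    i = int(math.floor(x))
    return family[i](x - i) if 0 <= i < len(family) else 0.0

j = 2                                   # recover f_2 via a node with threshold -j
for x in [0.0, 0.25, 0.5, 0.75]:
    assert sigma(x + j) == family[j](x)
print("sigma(x + 2) reproduces f_2 on [0, 1)")
```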
Mhaskar and Micchelli [164, 163, 165] provide a discussion of the capabilities for approximation of a variety of node functions. They consider node functions that are unbounded, in contrast to the assumption made in Definition 4.7.3.

Definition 4.10.1 (Sigmoidal of order k) σ is sigmoidal of order k if it is a measurable function satisfying
$$\lim_{x \to -\infty} \frac{\sigma(x)}{x^k} = 0, \qquad \lim_{x \to \infty} \frac{\sigma(x)}{x^k} = 1, \qquad (\exists K)(\forall x)\ |\sigma(x)| \le K(1 + |x|^k).$$

The case of k = 0 corresponds to our previous notion of a sigmoidal node function. Although unbounded node functions are unrealistic in terms of hardware synthesis, results obtained with them are informative as to what is possible. In effect, a node of order k is nearly the positive part
$$x_+^k = \begin{cases} x^k & \text{if } x \ge 0,\\ 0 & \text{if } x < 0,\end{cases}$$
of a polynomial of degree k (scaling through $\alpha^{-k}\sigma(\alpha z)$, $\alpha \gg 1$, can make this more precise, just as scaling through $\sigma(\alpha z)$, $\alpha \gg 1$, can convert a node of order 0 into a unit step function approximation), and a multiple hidden layer structure enables us to form compositions of positive parts of polynomials of a given order to achieve arbitrary polynomials of any order. These polynomials can then be used to represent spline functions (see [212, 245]). Basically, splines are piecewise polynomials that are joined together smoothly at points called knots.

Definition 4.10.2 ([163, p. 67]) Given a partition P of X and an integer r ≥ 0, a spline function S(r, P) is an r − 1 times continuously differentiable function of Q whose restriction to any hyperrectangle in P is a polynomial of degree at most r.

Certain optimality properties for this family of nodes will be noted in Section 4.12. The basic argument is that free knot (knot location can be selected for best fit) splines have optimal approximation properties for differentiable functions. In turn, one can approximate arbitrarily closely to the spline functions by positive part polynomials of order ≤ r. At root we are approximating an r-times differentiable function over the cells of a fine partition, with the splines being polynomials over the cells.
4.11 The Complexity of Implementations

4.11.1 A Hilbert Space Setting
The preceding results do not enable us to respond to our second basic question which can be rephrased as asking how complex a net is required
to secure the necessary degree of approximation. Work by Jones [116, 117] and Barron [18], subsequently enlarged on by DeVore and Temlyakov [59], Donahue et al. [63], Lee et al. [142], and Yukich et al. [259], has provided an attractive response to these questions. We first discuss approximation in Hilbert space and then relate these results to neural networks. Extensions to other norms are given in Donahue et al. [63] and Yukich et al. [259].

An answer to our second question can be given if we endow the space of functions $\mathcal F$ whose elements we are considering approximating with an inner product structure $\langle\cdot,\cdot\rangle : \mathcal F \times \mathcal F \to \mathbb{R}$. A first example of an inner product is Euclidean space $E^d$, composed of $\mathbb{R}^d$ with the inner product between two vectors being the familiar dot product
$$\langle x, y\rangle = x \cdot y = \sum_{i=1}^{d} x_i y_i.$$
A second example, suited to functions that are square-integrable (in $L^2$), is
$$\langle f, g\rangle = \int_{-\infty}^{\infty} f(x) g(x)\,dx.$$
A third example takes into account the compactness of the domain and allows for nonuniform weighting over the domain. Choose a compact subset $B \subset \mathcal X$ and a probability measure µ on appropriate subsets of $\mathcal X$ and define
$$\langle f, g\rangle = \int_{B} f(x) g(x)\,\mu(dx) = E_\mu(f g I_B).$$
An inner product has formal properties of symmetry and linearity explicated by
$$\langle f, g\rangle = \langle g, f\rangle, \qquad \langle \alpha f_0 + \beta f_1, g\rangle = \alpha\langle f_0, g\rangle + \beta\langle f_1, g\rangle,$$
and satisfies
$$\langle f, f\rangle \ge 0, \qquad \langle f, f\rangle = 0 \iff f = 0.$$
From an inner product we can define a norm through $\|f\|^2 = \langle f, f\rangle$. The usual distance formula, familiar from vector spaces, is then
$$\|f - g\|^2 = \langle f - g, f - g\rangle = \|f\|^2 + \|g\|^2 - 2\langle f, g\rangle.$$
The set $\mathcal F$ of functions whose approximation properties we will study is introduced in the following.
Definition 4.11.1 (Hilbert space) A Hilbert space $\mathcal F$ is a linear vector space that has been endowed with an inner product and is such that limits of Cauchy sequences of points in the space are also in the space (completeness or closure).

A finite-dimensional example of a Hilbert space is provided by the usual Euclidean space consisting of $\mathbb{R}^d$ and the familiar dot product between vectors, with the size of a vector measured by the usual square root of the sum of squares of components. What follows assumes that the functions to be approximated are drawn from a Hilbert space.
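As a quick numeric aside (an assumed grid discretization of my own, not part of the text), the sketch below realizes the third inner product $\langle f, g\rangle = E_\mu(f g I_B)$ and checks the distance formula $\|f - g\|^2 = \|f\|^2 + \|g\|^2 - 2\langle f, g\rangle$.

```python
# Discretized inner product <f, g> = E_mu(f g I_B) on B = [0, 1] with uniform
# probability weights; verify the Hilbert-space distance formula numerically.
import numpy as np

x = np.linspace(0.0, 1.0, 201)
mu = np.full_like(x, 1.0 / len(x))      # uniform probability weights on the grid

def inner(f, g):
    return np.sum(f(x) * g(x) * mu)

f = lambda s: np.sin(2 * np.pi * s)
g = lambda s: s ** 2
lhs = inner(lambda s: f(s) - g(s), lambda s: f(s) - g(s))
rhs = inner(f, f) + inner(g, g) - 2 * inner(f, g)
print(np.isclose(lhs, rhs))             # True
```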
4.11.2 Convex Classes of Functions
Given that $f \in \mathcal F \subset \bar N_{1,\sigma}$, it is the case that f can be well-approximated by a function $h \in N_{1,\sigma}$,
$$(\forall \delta > 0)(\exists h_\delta \in N_{1,\sigma})\ \|f - h_\delta\| < \delta.$$
When $h_\delta \in N_{1,\sigma}$ we can write a representation in terms of a single hidden layer network having the appropriate $s_\delta$ nodes,
$$h_\delta(x) = \sum_{i=1}^{s_\delta} \alpha_i \sigma(w_i \cdot x - \tau_i),$$
where the additive bias term can be absorbed into the sum by choosing, say, $w_1 = 0$ and $\tau_1$ such that $\sigma(-\tau_1) \neq 0$. We rewrite this as follows:
$$a_\delta = \sum_{i=1}^{s_\delta} |\alpha_i| > 0, \quad \lambda_i = \frac{|\alpha_i|}{a_\delta} \ge 0, \quad \sum_{i=1}^{s_\delta} \lambda_i = 1, \quad g_i(x) = \mathrm{sgn}(\alpha_i)\,\sigma(w_i \cdot x - \tau_i),$$
$$h_\delta(x) = a_\delta \sum_{i=1}^{s_\delta} \lambda_i g_i.$$
The point of this manipulation is to present $h_\delta$ as a positive constant $a_\delta$ multiplied by a convex combination, with weights $\{\lambda_i\}$, of functions $\{g_i\}$ that are the node functions multiplied by ±1. With this motivation for our interest in such representations, we introduce the "dictionary"
$$G_{\bar a} = \{a\,\sigma(w \cdot x - \tau) : |a| \le \bar a\}$$
of node functions with bounded output weights; so long as $\bar a \ge a_\delta$, we can approximate to f as above. $G_{\bar a}$ forms a basis for convex representations of approximations. The family of n-node approximations with upper bound $\bar a$
to the sum of magnitudes of output weights is given by the n-dimensional convex hull of $G_{\bar a}$,
$$\mathcal F_{n,\bar a} = \Big\{h : (\exists\{\lambda_i\}),\ \lambda_i \ge 0,\ \sum_{i=1}^{n} \lambda_i = 1,\ (\exists\{g_i\} \subset G_{\bar a}),\ h = \sum_{i=1}^{n} \lambda_i g_i\Big\}.$$
The full convex hull $\mathcal F_{\bar a}$ of $G_{\bar a}$ is then taken without limit on the number of nodes,
$$\mathcal F_{\bar a} = \bigcup_{n=1}^{\infty} \mathcal F_{n,\bar a}.$$
The functions in the closure $\bar{\mathcal F}_{\bar a} \subset \bar N_{1,\sigma}$ can be arbitrarily closely approximated by single hidden layer networks with a bounded sum of the magnitudes of the output weights. That such a restriction is desirable can be seen from the discussion of generalization error in Section 7.8.4. Clearly, $\bar{\mathcal F}_{\bar a}$, the closed convex hull of $G_{\bar a}$, is a proper subset of $\bar N_{1,\sigma}$. In what follows we will require that the node functions and ranges of parameter values satisfy
$$\sup_{w,\tau} \|\sigma(w \cdot x - \tau)\|^2 \le b^2 < \infty.$$
This condition on σ is satisfied if, for example, we choose the logistic or tanh node functions and an approximation norm based on the third example of an inner product given in Section 4.11.1 for a compact set B of input values. Hence,
$$g \in G_{\bar a} \Rightarrow \|g\| \le \gamma = \bar a\, b.$$
4.11.3 Upper Bound to Approximation Error
We consider approximating to any f in the Hilbert space $\mathcal F$ through elements of a convex subset $\mathcal F_{\bar a}$ of single hidden layer networks having bounded output weights. Let
$$d_{\bar a}(f) = \inf_{h \in \mathcal F_{\bar a}} \|f - h\|.$$
Any given $f \in \mathcal F$ need not be arbitrarily closely approximable ($d_{\bar a}(f) > 0$) by functions in $\mathcal F_{\bar a}$. For a given tolerance δ > 0, let $f_\delta \in \mathcal F_{\bar a}$ denote a close approximation to f,
$$\|f - f_\delta\| \le \delta + d_{\bar a}(f).$$
We claim that a simple iterative (in the number n of nodes of the network) construction of a series of greedy (at each iteration we make a locally best choice without regard to its impact on future choices) approximations $\{f_n,\ f_n \in \mathcal F_{\bar a}\}$, in which $f_n$ has n nodes at the nth iteration and $f_n$ is chosen to approximate to f, will converge in that
$$\|f_n - f\|^2 \le d_{\bar a}^2(f) + O(1/n).$$
Motivated largely by Lee et al. [142], building on prior work of Barron [18], we formulate these results in the following.

Theorem 4.11.1 (Complexity of Iterative Approximation) Let $\mathcal F$ be a Hilbert space with norm $\|\cdot\|$, $G_{\bar a}$ a subset of $\mathcal F$ with $g \in G_{\bar a}$ implying $\|g\| \le \gamma$ ($G_{\bar a}$ is the set of nodes with output weight bounded by $\bar a$), and $\mathcal F_{\bar a}$ denote the convex hull of $G_{\bar a}$. Let
$$0 < \epsilon_n \le \rho/n^2, \quad \beta = 1 - 2/n, \quad \bar\beta = 1 - \beta, \quad c = \rho + 4\gamma^2.$$
Let $f_n$ denote a function formed by a convex combination of n elements of $G_{\bar a}$. Choose $f_1 \in G_{\bar a}$ and $f_n$ to satisfy
$$\|f_1 - f\|^2 \le \inf_{g \in G_{\bar a}} \|g - f\|^2 + \epsilon_1, \qquad \|f_n - f\|^2 \le \inf_{g \in G_{\bar a}} \|\beta f_{n-1} + \bar\beta g - f\|^2 + \epsilon_n.$$
It follows that
$$\|f_n - f\|^2 - d_{\bar a}^2(f) \le \frac{c}{n}.$$

Proof. See Appendix 4.4. Our proof owes much to DeVore and Temlyakov [59]. □

Theorem 4.11.1 not only provides us with convergence rates in terms of numbers of nodes, it also suggests a relatively easy way of achieving these rates. In each step of the iteration we must perform a nonlinear minimization. However, this minimization is only over the d + 1 parameters defining a single node and not over all of the p parameters defining the n-node function $f_n$. Moreover, $\epsilon_n$ provides us with a tolerance to judge when we are close enough to the minimum. Interestingly, the rate of decrease of approximation error in n is independent of the input dimension d. These results contrast with our prior expectations as stated by Cybenko [50, p. 313]:

  At this point, we can only say that we suspect quite strongly that the overwhelming majority of approximation problems will require astronomical numbers of terms. This feeling is based on the curse of dimensionality that plagues multidimensional approximation theory and statistics.

However, freedom from this curse can only be partial, and we will return to this issue in Section 4.12.
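A schematic rendering of the iteration in Theorem 4.11.1 is sketched below. It is my own code, not the author's: a crude random search over one node's parameters stands in for the one-node nonlinear minimization, and the target function, the bound a_bar, and the grid are illustrative assumptions.

```python
# Greedy iteration f_n = (1 - 2/n) f_{n-1} + (2/n) g_n with g_n = a*tanh(w*x - tau),
# |a| <= a_bar, approximating a 1-D target in a discretized L2 norm.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 400)
f = np.sin(3 * x)                        # target function, sampled on a grid
a_bar = 4.0

def sq_norm(h):                          # discretized squared L2 norm
    return np.mean(h ** 2)

fn = np.zeros_like(x)
for n in range(1, 31):
    beta, beta_bar = (1 - 2.0 / n, 2.0 / n) if n > 1 else (0.0, 1.0)
    best, best_err = None, np.inf
    for _ in range(2000):                # random search over a single node
        a = rng.uniform(-a_bar, a_bar)
        w, tau = rng.uniform(-8, 8), rng.uniform(-8, 8)
        g = a * np.tanh(w * x - tau)
        err = sq_norm(beta * fn + beta_bar * g - f)
        if err < best_err:
            best, best_err = g, err
    fn = beta * fn + beta_bar * best
    if n % 10 == 0:
        print(n, "nodes, squared error", round(sq_norm(fn - f), 4))
```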
4.11.4 A Class of Functions for which $d_{\bar a} = 0$
If we desire arbitrarily close approximation, then we need da¯ (f ) = 0. In this subsection we identify a family Φa¯ of functions in F for which da¯ (f ) = 0
holds. Introduce a kernel function κ(θ, x) that is parameterized by θ ∈ Θ and is such that
$$\sup_{\theta \in \Theta} \|\kappa(\theta, \cdot)\|^2 < \infty.$$
We can think of κ(θ, x) as our neural network node function σ(w · x − τ) indexed by θ = (w, τ) ∈ Θ. Generalize the notion of a finite convex combination to an uncountable convex combination through a density function
$$\lambda(\theta) \ge 0, \qquad \int_\Theta \lambda(\theta)\,d\theta = 1;$$
λ(θ) plays the role of the weight in the convex combination. We now specify
$$\Phi_{\bar a} = \Big\{f : (\exists a < \bar a)\ f(x) = a \int_\Theta \lambda(\theta)\,\kappa(\theta, x)\,d\theta\Big\}.$$
If the κ are not our node functions, we can still proceed if we can give these kernel functions a similar convex representation in terms of a weighting γ(θ) ≥ 0 of our actual node functions,
$$(\forall \theta \in \Theta)(\exists \alpha \le c_1)\quad \kappa(\theta, x) = \alpha \int_\Theta \gamma(\theta)\,g(w \cdot x - \tau)\,d\theta.$$
Barron [18] provides an example of a class of functions satisfying these schema. Let $\tilde f$ be the d-dimensional Fourier transform of f, and define
$$\Gamma_c = \Big\{f : \int_{\mathbb{R}^d} |\omega|\,|\tilde f|\,d\omega < c\Big\}. \tag{4.11.1}$$
This condition requires the absolute integrability of the Fourier transform of the gradient ∇f of f. From the Fourier inversion formula we have
$$f(x) - f(0) = (2\pi)^{-d} \int_{\mathbb{R}^d} \tilde f(\omega)\,(e^{i\omega\cdot x} - 1)\,d\omega.$$
Equivalently, if
$$\tilde f(\omega) = e^{i\phi(\omega)} |\tilde f(\omega)|,$$
then
$$f(x) - f(0) = (2\pi)^{-d} \int_{\mathbb{R}^d} |\tilde f(\omega)|\,|\omega|\,\frac{e^{i\phi}(e^{i\omega\cdot x} - 1)}{|\omega|}\,d\omega.$$
If $f(x) - f(0) \in \Gamma_c$ then its Fourier transform, also denoted by $\tilde f$, satisfies
$$\int_{\mathbb{R}^d} |\omega|\,|\tilde f|\,d\omega = c_f \le c,$$
and we see that f(x) − f(0) has been given a convex representation with weights $|\omega|\,|\tilde f|$ of kernel functions of the form
$$\theta = (\omega, \phi), \qquad \kappa(\theta, x) = \frac{\cos(\omega\cdot x - \phi) - \cos(\phi)}{|\omega|},$$
where we have used the fact that f is real to conclude that we can rewrite the inverse Fourier transform in terms of cosine functions. One then argues that these kernel functions can themselves be written as convex combinations of step functions, so that f(x) − f(0) has a convex representation in terms of step functions. Finally, one argues that one can approximate step functions by the usual sigmoidal functions, and that convex combinations of sigmoidal functions can approximate to functions in $\Gamma_c$. Hence, we have identified classes of functions that can be approximated as specified by Theorem 4.11.1, with the approximation error being independent of the input dimension d. The preceding has been extended to approximation in the stronger sense of sup-norm by Yukich et al. [259] and for several norms by Donahue et al. [63].
4.12 Fundamental Limits to Nonlinear Function Approximation

The remarkable independence from dimension d identified earlier suggests that neural network approximations are somewhat free from Richard Bellman's curse of dimensionality. However, as we will see, this cannot be true for all classes of functions being approximated. An understanding of the limitations of neural networks in function implementation can be gained from the theory of nonlinear approximation, and in this we take DeVore et al. [58] as our guide. We proceed informally and let $\mathcal F$ be the set of both the approximating functions and the functions defined on the d-dimensional unit cube that we seek to approximate. We assume that this set $\mathcal F$ is a compact set in a normed (to measure approximation error), linear vector space containing its limit points—a Banach space. Let $\mathbb{R}^p$ denote a p-dimensional parameter space that will index our family of approximating functions. Thus, if we are considering a single hidden layer neural network having $s_1$ nodes with inputs drawn from $\mathbb{R}^d$, then for each node we have d input weights, one threshold, and one output weight, for a total of $p = s_1(d + 2) + 1$ parameters describing the particular neural network architecture. We have a mapping
$$M_p : \mathbb{R}^p \to \mathcal F, \qquad \mathcal M_p = M_p(\mathbb{R}^p) \subset \mathcal F,$$
that establishes the correspondence between the p-dimensional parameter values and the approximating functions, assumed to be selected from the same set $\mathcal F$. $M_p$ describes an architecture insofar as it is the correspondence between the parameters and the actual functions. The collection $\mathcal M_p$ is the set of approximating functions. The error incurred in approximating a single $f \in \mathcal F$ is $\inf_{w \in \mathbb{R}^p} \|f - M_p(w)\|$. It is customary to look at the
worst-case error over all of $\mathcal F$, and this is given by
$$\tilde e(\mathcal F, M_p) = \sup_{f \in \mathcal F}\ \inf_{w \in \mathbb{R}^p} \|f - M_p(w)\|.$$
At this level of generality the approximation problem admits a trivial solution in which $\tilde e(\mathcal F, M_p) = 0$ in such cases as $\mathcal F$ being separable (i.e., having a countable subset $\{f_i\}$ that is dense in $\mathcal F$). An additional constraint is to introduce a continuous mapping $A : \mathcal F \to \mathbb{R}^p$, having the role that we approximate f by $M_p(A(f))$. Such a restriction corresponds well with the problem-solving attitude with which we approach the implementation of functions in the neural network setting. Given a function f to be implemented and a neural network architecture having p parameters, we would like to have a training algorithm A that converts f into a specification of network parameters in some regular way. With this in mind we introduce the error term
$$\tilde e(\mathcal F, A, M_p) = \sup_{f \in \mathcal F} \|f - M_p(A(f))\|,$$
reflecting the worst-case error over the family $\mathcal F$ of functions to be approximated by means of an architecture $M_p$ described by p parameters that are selected by a continuous training algorithm A. If we now ask how well we can do with the best choice of training algorithm and architecture, we are led to the continuous nonlinear p-width of $\mathcal F$,
$$d_p(\mathcal F) = \inf_{A, M_p} \tilde e(\mathcal F, A, M_p).$$
Estimates of $d_p$ tell us how well we can possibly estimate all functions in $\mathcal F$ when we restrict ourselves to architectures described by p parameters, without regard to the particular way in which we convert these parameters into approximating functions in $\mathcal M_p$. Neural networks provide one such conversion, and other methods include familiar expansions in terms of sums of orthogonal functions (e.g., truncated Fourier series) and the use of splines. DeVore et al., in [58, p. 475], provide a lower bound to $d_p$ for $\mathcal F$ a Sobolev space of functions of d arguments that are in $L^q$ (qth power Lebesgue integrable). They assume that the arguments of the functions in $\mathcal F$ are drawn from $[0,1]^d$, the unit hypercube in $\mathbb{R}^d$. Although we do not pursue the details here, a Sobolev space $W^r_q$ of functions f on $\mathbb{R}$ is a set of functions that have absolutely continuous derivatives $D^{r-1} f$ of order r − 1 and all derivatives of order less than or equal to r have an absolutely integrable qth power (are in $L^q$). The norm associated with the Sobolev space is
$$\|f\|_{W^r_q} = \sum_{i=1}^{r} \|D^i f\|_{L^q};$$
that is, the Sobolev norm is the sum of the d-dimensional $L^q$ norms of the first r derivatives of f. We then restrict our attention to those functions $\mathcal F$ for which the Sobolev norm is less than or equal to a prescribed constant, which we take to be 1—the unit ball in the Sobolev space $W^r_q$. In this case, DeVore et al. in [58, Theorem 4.2] establish that
$$d_p(\mathcal F) \ge C_r\, p^{-r/d}, \tag{4.12.1}$$
where $C_r$ depends only on r. Hence, to achieve $d_p(\mathcal F) \le \epsilon$ requires an architecture with at least
$$p \ge \Big(\frac{C_r}{\epsilon}\Big)^{d/r} \tag{4.12.2}$$
parameters (hence, at least $(C_r/\epsilon)^{d/r}/d$ nodes); a toy calculation appears below. For a class of functions that is only assumed to be once differentiable (r = 1), we require at least $O(\epsilon^{-d})$ parameters to obtain an error upper bound of ε. Of course, if the class of functions is highly differentiable (at least d times), then the estimate of $d_p \le \epsilon$ can yield parameter estimates no larger than $p = O(1/\epsilon)$, independent of the input dimension d. A related result by Mhaskar [163] establishes that if $\mathcal F = C$ is the set of continuous functions, without any assumption of differentiability, then $(d+1)\epsilon^{-d}$ sigmoidal nodes suffice to approximate to within O(ε). Mhaskar and Micchelli [165, p. 325] conclude that nodes that are sigmoidal of order k (see Definition 4.10.1) with k ≥ 2 (composition of order k = 1 polynomials only results in further polynomials of order 1) allow us to achieve the lower bound to the continuous nonlinear p-width given by Eq. 4.12.1, provided the network has $O(\log r/\log k)$ layers. This provides another argument for the use of more than one hidden layer.

The appearance of the dimension of the input as an exponent in Eq. 4.12.2 shows us that we cannot altogether avoid the curse of dimensionality, no matter how ingenious we are in selecting architectures and training algorithms. Neural networks provide powerful representations for function classes but have no magic. Unless the function class is carefully limited, as discussed in the preceding section, there will be a dependence among the input dimension, the number of network parameters, and the approximation error. We will revisit this dependence in Chapter 7, when we address issues of statistical generalization. Finally, the dependence on input dimension may not be a serious obstacle when we consider that problems usually come to us not as fully specified functions, with or without regularity properties, but as a finite training set of input-output pairs. Issues of orders of differentiability have no natural place in such a discrete setting.
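The toy calculation below uses illustrative constants only (the true $C_r$ is unknown in general); it evaluates the lower bound of Eq. 4.12.2 for r = 1 and several input dimensions, making the exponential growth in d explicit.

```python
# Lower bound p >= (C_r / eps)^(d/r) of Eq. 4.12.2, evaluated with illustrative
# constants to show how fast the required parameter count grows with dimension d.
C_r, eps, r = 1.0, 0.1, 1
for d in (1, 2, 5, 10):
    p_min = (C_r / eps) ** (d / r)
    print(f"d = {d:2d}: p >= {p_min:.0f} parameters")
# d =  1: p >= 10
# d =  2: p >= 100
# d =  5: p >= 100000
# d = 10: p >= 10000000000
```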
4.13 Selecting an Implementation: Greedy Algorithms

Our third question addresses the actual choice of a neural network approximation, given that we have answered our first two questions as to which functions are approximable and the complexity (number of nodes) required for a given degree of approximation. It is uncommon in applications of neural networks to be given, either analytically or algorithmically, the function one wishes to implement. Generally one is given only a training set containing input-output pairs. Approximation to a training set, with error measured by the usual sum-of-squares training error (see Chapter 5), fits well within the Hilbert space setting of Section 4.11. Of course, if one were given the function, then one could compute arbitrarily large training sets, and, moreover, take care to space the input arguments appropriately. However, if one has the function f specified analytically, then one can proceed through simple iterative algorithms to find a network implementation. As noted by Jones [117], DeVore and Temlyakov [59], and Donahue et al. [63], in the setting of Section 4.11, an approximation can be achieved in a computationally plausible iterative fashion by optimizing the weight vectors and thresholds for the individual nodes one at a time, as specified in Theorem 4.11.1.

An early approach to the existence and selection of network implementations was taken by Carroll and Dickinson [35]. They use Radon transform techniques to solve for the approximating net and to provide an indication of the complexity of such approximations. As they note in their conclusion on page I-609:

  The method proposed here . . . provides an algorithm for reducing the design of such nets to the design of segments of real-valued functions of a single variable. There are efficient solutions to this simpler problem for some classes of sigmoids, so we regard this problem as well-solved.

We refer the reader to Jones [118], Mhaskar and Micchelli [164], and Mhaskar [163] for a treatment of the selection of an implementation in terms of Fourier and spline approximations that does not require iterative optimization. The next chapter will treat basic algorithms for approximating to training sets rather than to known functions.
4.14 Appendix 4.1: Linear Vector Spaces

If $f, g \in N_{1,\sigma}$ and α, β are constants, then the function $h = \alpha f + \beta g \in N_{1,\sigma}$. Hence, $N_{1,\sigma}$ is closed under linear combinations of its members and forms a
linear subspace that includes the origin or zero-valued constant function (simply take α = 0 in αf). $N_{1,\sigma}$ is an example of a linear vector space.

Definition 4.14.1 (Linear Vector Space) A linear vector space L is a nonempty set of elements (vectors or functions) in which we have defined an operation of addition x + y between elements x, y of the space and an operation of multiplication αx of an element x of the space by a scalar α.

Although these operations can be given a formal definition, it is simplest to just keep in mind the usual finite-dimensional vector spaces in which we have (componentwise) vector addition,
$$(x + y)_i = x_i + y_i, \qquad x = [1\ 2\ 3],\ y = [3\ 5\ 2] \Rightarrow x + y = [4\ 7\ 5],$$
and (componentwise) multiplication by a scalar (that can be complex-valued, although we shall only need real scalars), $(\alpha x)_i = \alpha x_i$. Such a linear vector space has a distinguished zero element or origin 0. For function spaces of functions defined on a domain $\mathcal X$ we define addition through
$$f, g \in L,\quad h = f + g \iff (\forall x \in \mathcal X)\ h(x) = f(x) + g(x),$$
and scalar multiplication by
$$h = \alpha f \iff (\forall x \in \mathcal X)\ h(x) = \alpha f(x).$$
The dimension of the function space is that of the cardinality (number of elements) of the domain $\mathcal X$ and is typically uncountably infinite.
4.15 Appendix 4.2: Metrics and Norms

To measure the approximation error of f by g we need to determine their dissimilarity. A numerical measure of the dissimilarity or distance between two functions is provided by the concept of a metric.

Definition 4.15.1 (Metric) Given a function space $\mathcal F$, a metric d measuring the dissimilarity between functions $f, g \in \mathcal F$ satisfies
(a) (nonnegative) $d(f, g) \ge 0$;
(b) $d(f, g) = 0 \iff f = g$;
(c) (symmetry) $d(f, g) = d(g, f)$;
(d) (triangle inequality) $(\forall f, g, h \in \mathcal F)\ d(f, g) \le d(f, h) + d(g, h)$.
The preceding are all properties of the familiar idea of Euclidean distance between two points. If we weaken (b) to the unidirectional implication f = g ⇒ d(f, g) = 0, then d is a pseudometric. A common manner in which to specify a metric is in terms of a measure of the size of a function given by the concept of a norm.

Definition 4.15.2 (Norm) A norm $\|\cdot\|$ on a linear vector space L satisfies
(a) $\|f\| \ge 0$;
(b) $\|f\| = 0 \iff f = 0$;
(c) $\|\alpha f\| = |\alpha|\,\|f\|$;
(d) (triangle inequality) $\|f + g\| \le \|f\| + \|g\|$.

Examples of commonly encountered norms on finite-dimensional vector spaces are
$$(\ell^2\text{ norm})\quad \|x\|_2 = \Big(\sum_{i=1}^{d} x_i^2\Big)^{1/2}, \qquad (\ell^1\text{ norm})\quad \|x\|_1 = \sum_{i=1}^{d} |x_i|, \qquad (\text{sup-norm})\quad \|x\| = \max_{1 \le i \le d} |x_i|.$$
If we weaken (b) to the unidirectional implication f = 0 ⇒ ‖f‖ = 0, then ‖·‖ is a pseudonorm. Counterparts of the above norms for function spaces of suitably integrable functions are the following pseudonorms:
$$(L^p\text{ norm})\quad \Big\{\int_{x \in \mathcal X} |f(x)|^p\,\mu(dx)\Big\}^{1/p}, \qquad (L^2\text{ norm})\quad \|f\|_2 = \Big(\int_{\mathcal X} f^2(x)\,\mu(dx)\Big)^{1/2}, \qquad (L^1\text{ norm})\quad \|f\|_1 = \int_{\mathcal X} |f(x)|\,\mu(dx).$$
The $L^2$ norm is familiar in engineering approximation contexts where the Lebesgue measure µ is just the familiar length or volume. An example of a norm, known as the sup-norm, is
$$(\text{sup-norm})\quad \|f\| = \sup_{x \in \mathcal X} |f(x)|,$$
and it is the usual choice when we deal with a linear vector space of continuous functions on a set (domain) $\mathcal X$. The sup-norm does not commit us to integrability (only an issue if $\mathcal X$ is not compact). We now relate metrics to norms, although this is not the only way to define a metric.
Definition 4.15.3 (Norm-based Metric) The norm-based metric d in a normed, linear vector space is given by $d(f, g) = \|f - g\|$.

For the space of continuous functions $C(\mathcal X)$ on a domain $\mathcal X$ we typically use
$$d(f, g) = \sup_{x \in \mathcal X} |f(x) - g(x)|,$$
where distance or error here is worst-case error. For functions that are square-integrable we might use
$$d(f, g) = \Big(\int_{\mathcal X} (f(x) - g(x))^2\,dx\Big)^{1/2}.$$
These metrics provide us with a measure of the degree to which one function approximates another.
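As a small aside (an assumed grid discretization of my own, not the book's code), the sketch below computes the two metrics just defined for a function and a slight perturbation of it on X = [0, 1].

```python
# Compare the sup-norm and L2-based metrics between two functions on [0, 1],
# approximating the integral by trapezoidal quadrature on a grid.
import numpy as np

x = np.linspace(0.0, 1.0, 1001)
f = np.sin(2 * np.pi * x)
g = np.sin(2 * np.pi * x) + 0.05 * x            # a small perturbation of f

sup_metric = np.max(np.abs(f - g))              # worst-case (sup-norm) error
l2_metric = np.sqrt(np.trapz((f - g) ** 2, x))  # root of integrated squared error
print(sup_metric, l2_metric)
```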
4.16 Appendix 4.3: Topology

Definition 4.16.1 (Topological Space) A topological space $\mathcal X$ consists of a set (usually denoted again by $\mathcal X$) and a nonempty collection $\mathcal T$ of subsets of $\mathcal X$, called the topology, that are the open sets and satisfy the three properties
$$\mathcal X, \emptyset \in \mathcal T, \qquad T_0, T_1 \in \mathcal T \Rightarrow T_0 \cap T_1 \in \mathcal T, \qquad T_\alpha \in \mathcal T \Rightarrow \bigcup_\alpha T_\alpha \in \mathcal T.$$

The open sets have the properties that an intersection of finitely many of them, or a union of arbitrarily many of them, is again an open set. A common topological space based on the reals $\mathbb{R}$ is generated by the open intervals (a, b) (intervals with the endpoints omitted) by letting $\mathcal T$ be the smallest topology containing the open intervals. The collection of open intervals is called a base for the topology in that any set in $\mathcal T$ can be expressed as a union of sets in the base. This generalizes immediately to $\mathbb{R}^d$ with base the set of d-dimensional rectangles (Cartesian products of open intervals) $\times_{i=1}^{d}(a_i, b_i)$ with open sides. An important example of a topology for a normed linear vector space is to have $\mathcal T$ be the smallest topology (satisfying the three properties given in Definition 4.16.1) containing all the open balls
$$(\forall f \in L)(\forall \delta > 0)\quad B_{f,\delta} = \{g : g \in L,\ \|f - g\| < \delta\}.$$
The set $B_{f,\delta}$ is the open ball centered at f with radius δ. In particular, the topology described for $\mathbb{R}$ can be so created.
The theorems establishing the universal approximation abilities of neural networks require the following kind of topological space.

Definition 4.16.2 (Compact Space) A set A (e.g., a topological space $\mathcal X$) is compact if, given the topology $\mathcal T$ (set of open sets—e.g., the open intervals when $\mathcal X = \mathbb{R}$), any open covering of A by sets $\{T_\alpha\}$ in $\mathcal T$,
$$A = \bigcup_\alpha T_\alpha,$$
has a finite subcovering
$$A = \bigcup_{i=1}^{m} T_{\alpha_i}.$$

A subset of $\mathbb{R}^d$ is compact in the topology of open intervals if it is closed and bounded; this result is known as the Heine-Borel theorem. Thus, the interval [a, b] is compact, but $\mathbb{R}$ is not bounded and therefore not compact. While not needed beyond Section 4.6, we introduce the following definition.

Definition 4.16.3 (Hausdorff space) A Hausdorff space $\mathcal X$ is a topological space in which for any two distinct elements $x_0, x_1 \in \mathcal X$ there exist open sets $S_0, S_1$ such that $x_i \in S_i$ and $S_0 \cap S_1 = \emptyset$.

In a Hausdorff space any two distinct points are separated in that they have neighborhoods (open sets) that do not overlap. $\mathbb{R}^d$ with the usual topology is an example of a Hausdorff space.
4.17 Appendix 4.4: Proof of Theorem 4.11.1

Theorem 4.11.1 (Complexity of Iterative Approximation) Let $\mathcal F$ be a Hilbert space with norm $\|\cdot\|$, $G_{\bar a}$ a subset of $\mathcal F$ with $g \in G_{\bar a}$ implying $\|g\| \le \gamma$ ($G_{\bar a}$ is the set of nodes with output weight bounded by $\bar a$), and $\mathcal F_{\bar a}$ denote the convex hull of $G_{\bar a}$. Let
$$0 < \epsilon_n \le \rho/n^2, \quad \beta = 1 - 2/n, \quad \bar\beta = 1 - \beta, \quad c = \rho + 4\gamma^2.$$
Let $f_n$ denote a function formed by a convex combination of n elements of $G_{\bar a}$. Choose $f_1 \in G_{\bar a}$ and $f_n$ to satisfy
$$\|f_1 - f\|^2 \le \inf_{g \in G_{\bar a}} \|g - f\|^2 + \epsilon_1, \qquad \|f_n - f\|^2 \le \inf_{g \in G_{\bar a}} \|\beta f_{n-1} + \bar\beta g - f\|^2 + \epsilon_n.$$
It follows that
$$\|f_n - f\|^2 - d_{\bar a}^2(f) \le \frac{c}{n}.$$
Proof. Consider an arbitrary $f \in \mathcal F$ that we wish to approximate by a network in $\mathcal F_{\bar a}$. Because f need not be in the closure of $\mathcal F_{\bar a}$, select a small δ > 0 and $f^* \in \mathcal F_{\bar a}$ such that
$$\|f - f^*\|^2 \le d_{\bar a}^2(f) + \delta.$$
In effect, we approximate to f by approximating to $f^*$. Introduce the n-node approximant function $f_n \in \mathcal F_{\bar a}$. Define
$$\theta_n = \|f - f_n\|^2 - \|f - f^*\|^2$$
as the excess or avoidable approximation error. It is our goal to show that under the hypotheses of the theorem, $\theta_n = O(1/n)$. We accomplish this by deriving a recursive inequality relating $\theta_n$ to $\theta_{n-1}$.

From the elementary properties of inner products and norms (for example, as exhibited in the usual vector algebra),
$$\|f - f_n\|^2 = \|f - f^* + f^* - f_n\|^2 = \|f - f^*\|^2 + \|f^* - f_n\|^2 + 2\langle f - f^*, f^* - f_n\rangle.$$
Hence,
$$\theta_n = \|f - f_n\|^2 - \|f - f^*\|^2 = \|f^* - f_n\|^2 + 2\langle f - f^*, f^* - f_n\rangle,$$
a fact that will be used again below for $f_{n-1}$. Recall that in our iterative formulation
$$f_n = \beta f_{n-1} + \bar\beta g_n, \qquad \bar\beta = \frac{2}{n} = 1 - \beta.$$
Thus,
$$\|f^* - f_n\|^2 = \|\beta(f^* - f_{n-1}) + \bar\beta(f^* - g_n)\|^2 = \beta^2\|f^* - f_{n-1}\|^2 + \bar\beta^2\|f^* - g_n\|^2 + 2\beta\bar\beta\langle f^* - f_{n-1}, f^* - g_n\rangle,$$
and
$$\theta_n = \beta^2\|f^* - f_{n-1}\|^2 + \bar\beta^2\|f^* - g_n\|^2 + 2\beta\bar\beta\langle f^* - f_{n-1}, f^* - g_n\rangle + 2\langle f - f^*, f^* - f_n\rangle.$$
Rewrite
$$\|f^* - g_n\|^2 = \|f^*\|^2 + \|g_n\|^2 - 2\langle f^*, g_n\rangle.$$
The iterative algorithm is also greedy in that it chooses the optimal $g_n$ without regard to future consequences. Hence, including the tolerance $\epsilon_n$,
$$\theta_n \le \epsilon_n + \inf_{g \in G_{\bar a}}\Big\{\beta^2\|f^* - f_{n-1}\|^2 + \bar\beta^2\big[\|f^*\|^2 + \|g_n\|^2 - 2\langle f^*, g_n\rangle\big] + 2\beta\bar\beta\langle f^* - f_{n-1}, f^* - g_n\rangle + 2\langle f - f^*, f^* - f_n\rangle\Big\}.$$
Recalling that $0 \le \beta < 1$ implies $\beta^2 \le \beta$, and by assumption $\|g_n\|^2 \le \gamma^2$, we conclude that
$$\theta_n \le \epsilon_n + \beta\|f^* - f_{n-1}\|^2 + \bar\beta^2(\|f^*\|^2 + \gamma^2) + \inf_{g \in G_{\bar a}}\Big\{-2\bar\beta^2\langle f^*, g_n\rangle + 2\beta\bar\beta\langle f^* - f_{n-1}, f^* - g_n\rangle + 2\langle f - f^*, f^* - f_n\rangle\Big\}.$$
Use $f_n = \beta f_{n-1} + \bar\beta g_n$ to rewrite
$$\langle f - f^*, f^* - f_n\rangle = \beta\langle f - f^*, f^* - f_{n-1}\rangle + \bar\beta\langle f - f^*, f^* - g_n\rangle,$$
and substitute in the above to find
$$\theta_n \le \epsilon_n + \beta\|f^* - f_{n-1}\|^2 + \bar\beta^2(\|f^*\|^2 + \gamma^2) + 2\beta\langle f - f^*, f^* - f_{n-1}\rangle + \inf_{g \in G_{\bar a}}\Big\{-2\bar\beta^2\langle f^*, g_n\rangle + 2\beta\bar\beta\langle f^* - f_{n-1}, f^* - g_n\rangle + 2\bar\beta\langle f - f^*, f^* - g_n\rangle\Big\}.$$
The infimum with respect to $g_n$ is now being taken over a group of terms, all of which are linear in $g_n$—$g_n$ appears in only one argument of each inner product. Hence, the infimum with respect to $g_n$ is also equal to the infimum with respect to any convex combination of $g \in G_{\bar a}$. Thus, the infimum with respect to $g_n \in G_{\bar a}$ can be replaced by an infimum with respect to $h \in \mathcal F_{\bar a}$. An upper bound results by then selecting any element of $\mathcal F_{\bar a}$. We choose $h = f^*$ to find that
$$\theta_n \le \epsilon_n + \beta\|f^* - f_{n-1}\|^2 + \bar\beta^2(\|f^*\|^2 + \gamma^2) + 2\beta\langle f - f^*, f^* - f_{n-1}\rangle - 2\bar\beta^2\langle f^*, f^*\rangle + 2\beta\bar\beta\langle f^* - f_{n-1}, f^* - f^*\rangle + 2\bar\beta\langle f - f^*, f^* - f^*\rangle.$$
Simplifying yields
$$\theta_n \le \epsilon_n + \beta\|f^* - f_{n-1}\|^2 + 2\beta\langle f - f^*, f^* - f_{n-1}\rangle + \bar\beta^2(\gamma^2 - \|f^*\|^2).$$
Observe that, as at the outset of our proof,
$$\theta_{n-1} = \|f - f_{n-1}\|^2 - \|f - f^*\|^2 = \|f^* - f_{n-1}\|^2 + 2\langle f - f^*, f^* - f_{n-1}\rangle.$$
Hence,
$$\theta_n \le \epsilon_n + \beta\theta_{n-1} + \bar\beta^2(\gamma^2 - \|f^*\|^2).$$
Note that $\beta = 1 - 2/n$, recall that $\epsilon_n \le \rho/n^2$, and let $c = \rho + 4\gamma^2 \ge \rho + 4(\gamma^2 - \|f^*\|^2)$ to obtain
$$\theta_n \le \Big(1 - \frac{2}{n}\Big)\theta_{n-1} + \frac{c}{n^2}.$$
To draw out the implications of the preceding inequality, first note that, irrespective of the values of $\theta_0$, $\theta_1$, we find that
$$\theta_2 \le \frac{c}{4}.$$
Consider the sequence defined by the recursive equality
$$a_2 = \frac{c}{4}, \qquad a_n = \Big(1 - \frac{2}{n}\Big)a_{n-1} + \frac{c}{n^2}.$$
We claim that for any n ≥ 2, $a_n \ge \theta_n$. This follows from $a_2 = c/4 \ge \theta_2$ providing a dominating initial condition and from $a_n$ satisfying with equality the recursion that upper bounds $\theta_n$. We need only show that $a_n \le c/n$ to conclude, as desired, that $\theta_n \le c/n$. Proceed by induction to note that $a_2 = c/4 \le c/2$, and take as the induction hypothesis that $a_{n-1} \le c/(n-1)$. We need to verify that
$$\Big(1 - \frac{2}{n}\Big)a_{n-1} + \frac{c}{n^2} \le \Big(1 - \frac{2}{n}\Big)\frac{c}{n-1} + \frac{c}{n^2} \le \frac{c}{n}.$$
Elementary algebraic rearrangements verify this claim that $a_n \le c/n$. Hence, we conclude that under the hypotheses of the theorem,
$$\theta_n = \|f - f_n\|^2 - \|f - f^*\|^2 \le \frac{c}{n}.$$
We are free to take δ arbitrarily small, so we can replace $\|f - f^*\|^2$ by $d_{\bar a}^2(f)$,
$$\theta_n = \|f - f_n\|^2 - d_{\bar a}^2(f) \le \frac{c}{n}. \qquad \Box$$
5 Algorithms for Designing Feedforward Networks
5.1 Objectives and Setting

5.1.1 Error Surface
We address a version of Question 3 concerning the problem of selecting the weights and thresholds to learn a training set. In the algorithms to follow, the weights and thresholds defining the various layers are referred to simply as weights and are listed for convenience as a column vector w; an explicit encoding of the parameters specifying the elements in a 1HL is given in Appendix 3. The problem formulation is that we are given a training set $T = \{(x_i, t_i),\ i = 1 : n\}$ and wish to select a net η so that the output $y_i = \eta(x_i, w)$ is "close" to the desired output $t_i$ for the input $x_i$. If, instead, you are fortunate enough to be given the function f relating t to x, then you can generate arbitrarily large training sets by sampling the function domain, either deterministically or randomly, and calculating the corresponding responses, thereby reducing this problem to the one we will treat. The notion of closeness on the training set T is typically formalized through an error or objective function or metric of the form
$$E_T = \frac{1}{2}\sum_{i=1}^{n} \|y_i - t_i\|^2.$$
The error $E_T = E_T(w)$ is a function of w because $y = \eta$ depends upon the parameters w defining the selected network η. Of course, there are infinitely many other measures of closeness (e.g., metrics such as the sup-norm discussed in Chapter 4 and those discussed in Bishop [29, Ch. 6]). However, it is more difficult to optimize for these other metrics through calculus methods, and virtually all training of neural networks takes place using the quadratic or squared-error metric, even in some cases where eventual performance is reported for other metrics. In Section 7.5 we will suggest an adaptation in which the companion validation error is measured by a metric that best reflects the actual application. A variation on this is that of regularization (see Section 6.3), in which a penalty term is added to the performance objective function $E_T(w)$ to discourage excessive model complexity (e.g., the length of the vector of weights w describing the neural network connections).

Thus we are confronted with a nonlinear optimization problem: minimize $E_T(w)$ by choice of $w \in \mathcal W \subset \mathbb{R}^p$. The inherent difficulty of such problems is aggravated by the typically very high dimension of the weight space $\mathcal W$; networks with hundreds or thousands of weights are commonly encountered in image processing and optical character recognition applications. To develop intuition, it is helpful to think of w as being two-dimensional and determining the latitude and longitude coordinates for position on a given portion $\mathcal W$ (e.g., your county) of the surface of the earth. The error function $E_T(w)$ can be thought of as the altitude of the terrain at that location. We seek the point on $\mathcal W$ of lowest elevation. Clearly we could proceed by first mapping the terrain, in effect by evaluating $E_T$ at a closely spaced grid of points, and then selecting the mapped point of lowest elevation. The major difficulty with this approach is that if grid points are evenly spaced, then the number of required grid points grows exponentially in the dimension p of $\mathcal W$ (the number of parameter coordinates). What might be feasible on a two-dimensional surface (e.g., a map of a 1 km × 1 km region with grid points spaced 1 m apart, requiring $10^6$ points) quickly becomes impossible when we have, as we might well, a 100-dimensional surface where even bisection in each dimension is computationally impossible; the arithmetic sketch after Figure 5.1 makes this explicit.

One expects that the objective function $E_T(w)$ for a neural network with many parameters defines a highly irregular surface with many local minima, large regions of little slope (e.g., directions in which a parameter is already at a large value that saturates its attached node for most inputs), and symmetries (see Section 4.2). It would be useful, in developing and understanding a search algorithm, to know more than we do about the qualitative characteristics of the error surface. In this regard, it is more helpful to think of the error surface as being as irregular as the surface of a sheet of paper that has been unfolded after being crumpled into a tight ball, a deeply and irregularly wrinkled surface, rather than as some smooth quadratic surface.
The surface is technically smooth (continuous first derivative) when we use the usual indefinitely differentiable node functions. However, thinking of it as smooth is not a good guide to our intuition about the behavior of search/optimization algorithms. Some idea of the nature of this surface in high-dimensional space can be gleaned by plotting sections along two directions in parameter/weight space as shown in Figure 5.1. Figure 5.1 shows two views of a single three-dimensional plot of the error surface of a single-node network having three inputs and trained on ten input-output pairs.
FIGURE 5.1. Two views of an error surface for a single node.
5.1.2 Taylor's Series Expansion of the Error Function
The common node functions (e.g., logistic, tanh, but not unit step) are differentiable to arbitrary order, and through the chain rule of differentiation, this implies that the error function is also differentiable to arbitrary order. Hence, we are able to make a Taylor's series expansion in w for E_T. We shall first discuss the algorithms for minimizing E_T by assuming that we can truncate a Taylor's series expansion about a point w_0 that is possibly a local minimum. Introduce the gradient vector, the vector of first partial derivatives,
g(w) = \nabla E_T \big|_{w} = \left[ \frac{\partial E_T}{\partial w_i} \right] \bigg|_{w}. (5.1.1)
The gradient vector points in the direction of steepest increase of E_T and its negative points in the direction of steepest decrease. Also introduce the symmetric Hessian matrix of second partial derivatives,
H(w) = [H_{ij}(w)] = \nabla^2 E_T(w), \qquad H_{ij} = \frac{\partial^2 E_T(w)}{\partial w_i \partial w_j} = H_{ji}. (5.1.2)
The Taylor's series for E_T, assumed twice continuously differentiable about w_0, can now be given as
E_T(w) = E_T(w_0) + g(w_0)^T (w - w_0) + \frac{1}{2} (w - w_0)^T H(w_0) (w - w_0) + o(\|w - w_0\|^2), (5.1.3)
where o(δ) denotes a term of smaller order than δ,
\lim_{δ \to 0} \frac{o(δ)}{δ} = 0.
If, for example, there is a continuous third derivative at w_0, then the remainder term is of order \|w - w_0\|^3. Most of our analyses of training algorithms will assume an approximating quadratic model
m(w) = E_T(w_0) + g(w_0)^T (w - w_0) + \frac{1}{2} (w - w_0)^T H(w_0) (w - w_0). (5.1.4)
Figure 5.2 shows contour plots for two quadratic surfaces, with the two plots differing in their condition number r (the ratio of largest to smallest eigenvalues).
FIGURE 5.2. Contour and gradient plots for quadratics.
Reliance upon this quadratic approximation to ET is equivalent to assuming a linearization in w of the network η(x, w) about the parameter point w0 . Insofar as the node functions are differentiable, this approximation will be valid only in a sufficiently small neighborhood of w0 .
Lemma 5.1.1 (Minimum of a Quadratic) The model m of Eq. 5.1.4 has a unique minimum if and only if H is a positive definite matrix (see Appendix 1).

Proof. If H is only positive semidefinite, then it has at least one zero eigenvalue and associated eigenvector e_0. By properly choosing the scalar multiplier α, the choice w - w_0 = α e_0 can be made to yield as negative a value of m as desired, provided only that g^T e_0 ≠ 0. If the latter condition is not satisfied, then the minimum exists but is not unique. If H is indefinite, then it has negative eigenvalues and m can be made arbitrarily negative by selecting an arbitrarily large multiple of an eigenvector corresponding to a negative eigenvalue. The sufficiency of the condition of a positive definite H is verified by the fact that expressing w - w_0 = \sum_i α_i e_i yields \partial^2 m / \partial α_i^2 = λ_i > 0. □

Taking the gradient in the quadratic model of Eq. 5.1.4 yields
\nabla m = g(w_0) + H(w - w_0). (5.1.5)
Setting the gradient equal to zero and solving for the minimizing w^* yields
w^* = w_0 - H^{-1} g. (5.1.6)
The model m can be re-expressed in terms of the minimum value w^* and
m(w^*) = m(w_0) - \frac{1}{2} g(w_0)^T H^{-1} g(w_0),
through
m(w) = m(w^*) + \frac{1}{2} (w - w^*)^T H(w^*) (w - w^*),
a result that follows from Eq. 5.1.4 by completing the square or recognizing that g(w^*) = 0. Hence, starting from any initial value of the weight vector, we can in the quadratic case move in one step to the minimizing value when it exists. This is known as Newton's method and can be used in the nonquadratic case where H is the Hessian and is positive definite. We will return to this when we discuss quasi-Newton methods in Section 5.6. We shall have to deal with the consequences of departures from the quadratic model as well as the computational burdens of computing the Hessian when, as is common today, the neural network has hundreds or thousands of parameters. All these methods require us to evaluate E_T(w) and g(w) repeatedly as we iteratively construct a sequence {w_k} of weight vector approximations to a minimizing value w^* of E_T. Fortunately, the widely relied-on process of backpropagation is an effective method for calculating such values and will be described in the context of neural networks in the next section. The Hessian is not used in steepest descent methods
but is implicit in the other methods. The computational burden of the Hessian being large, clever ways have been found to either approximate it (e.g., quasi-Newton and Levenberg-Marquardt methods of Sections 5.6 and 5.7) or calculate what is needed indirectly (e.g., conjugate gradient methods of Section 5.5).
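To make Eq. 5.1.6 concrete, here is a minimal MATLAB sketch (ours, not one of the appendix programs) in which the quadratic model, its Hessian, and its minimizer are made-up values; it checks that a single Newton step from an arbitrary starting point reaches the minimizer exactly.

% Sketch: one Newton step minimizes an exact quadratic (Eq. 5.1.6).
p = 5;                                   % number of parameters (arbitrary)
A = randn(p); H = A'*A + eye(p);         % a positive definite Hessian
wstar = randn(p,1);                      % the true minimizer (made up)
m = @(w) 0.5*(w - wstar)'*H*(w - wstar); % quadratic model with minimum value 0
w0 = randn(p,1);                         % arbitrary starting point
g0 = H*(w0 - wstar);                     % gradient at w0, Eq. 5.1.5
w1 = w0 - H\g0;                          % Newton step, Eq. 5.1.6 (H\g = inv(H)*g)
disp([m(w0) m(w1)])                      % m(w1) is zero to machine precision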
5.1.3 Multiple Stationary Points
A long-recognized bane of analyses of the error surface and the performance of training algorithms is the presence of multiple stationary points, including multiple minima. The discussion in Section 4.2.1 of the lack of uniqueness of a neural network representation of a function establishes that some multiple minima will occur due to the same symmetries that cause the nonuniqueness (e.g., relabeling of nodes); several parameter vectors give rise to the same function and hence to the same value of error. The multiplicity of minima will exhibit itself when we initialize a neural network training algorithm as discussed in Section 5.3.1. Different randomly selected initial parameter values will almost always result in convergence to different neural network designs. Analyses of the behavior of training algorithms generally use the Taylor's series expansions discussed above, typically with the expansion about a local minimum w_0. However, the multiplicity of minima confuses the analysis because we need to be assured that we are converging to the same local minimum as used in the expansion. Failing this, truncated expansions will yield wrong conclusions as they will be evaluated far away from their expansion point. How likely are we to encounter a sizable number of local minima? Empirical experience with training algorithms shows that different initializations yield different resulting networks. Hence, the issue of many minima is a real one. A recent construction by Auer et al. [14] shows that one can construct training sets of n pairs, with the inputs drawn from IR^d, for a single-node network with a resulting number (n/d)^d of local minima! Hence, not only do multiple minima exist, but there may be huge numbers of them. The saving grace in applications is that we often attain satisfactory performance at many of the local minima and have little incentive to persevere to find a global minimum. Recent techniques involving the use of families of networks trained on different initial conditions also enable us, through either linear combinations of the trained networks (see Hashem [93]) or a process of ensemble pruning (see Mukherjee [172]), to improve performance.
5.1.4 Outline of Algorithmic Approaches
In the Rosenblatt era of the late 1950s and 1960s the problem of finding the weights that best fit the training data was construed as the problem
of assigning credit to each weight value buried inside the network for the resulting response error e = y − t. Rosenblatt's inability to solve this still unsolved problem of credit assignment for LTU-based networks limited the development of neural networks and forced him to rely on networks of limited capabilities (the single-node perceptron). The forceful display of the limitations of these networks by Minsky and Papert [168] led to a significant loss of interest in neural networks. The current strong interest in feedforward neural networks is largely due to an approach to this problem by Rumelhart, McClelland, Hinton, Williams, and others as reported in Rumelhart et al. [206]. The critical steps were the replacement of discontinuous node functions (LTUs) by smooth node functions and the introduction of the gradient-based backpropagation algorithm (BPA) that provided a workable solution to the credit assignment problem even for very large networks. Although it has been argued that the BPA, by making locally determined corrections to network parameters, has some biological plausibility, it seems clear that it has little basis in actual brain operation. There is no "best" algorithm for finding the weights and thresholds, for solving the credit assignment problem that is now also called the loading problem—the problem of loading the training set T into the network parameters. Indeed, as will be noted in Section 5.9, this problem is provably intrinsically difficult. Hence, different algorithms have their staunch proponents, who can always construct instances in which their algorithm performs better than most others. In practice today there are four types of optimization algorithms that are used to select network parameters to minimize E_T(w). Good overviews are available in Battiti [22], Bishop [29], Fletcher [76], and Luenberger [153]. The first three methods, steepest descent, conjugate gradients (e.g., Møller [169]), and quasi-Newton (see preceding references), are general optimization methods whose operation can be understood in the context of minimization of a quadratic error function. Although the error surface is surely not quadratic, for differentiable node functions it will be so in a sufficiently small neighborhood of a local minimum, and such an analysis provides information about the behavior of the training algorithm over the span of a few iterations and also as it approaches its goal. The fourth method, of Levenberg and Marquardt (e.g., Hagan and Menhaj [89], Press et al. [191]), is specifically adapted to minimization of an error function that arises from a squared-error criterion of the form we are assuming. Backpropagation calculation of the gradient can be adapted easily to provide the information about the Jacobian matrix J needed for this method, albeit for large p, n the Jacobian becomes unwieldy and this method is inapplicable. All these methods require efficient, repeated calculation of gradients, and backpropagation is the most commonly relied on organization of the gradient calculation.
Appendices 3 and 4 contain listings of MATLAB programs to carry out neural network training by each of the aforementioned methods. These listings are meant to clarify the structure and details of the algorithms. However, they carry no guarantees of correctness in detail or efficiency. Although we have attempted to provide correct illustrative programs, and they have been tested on examples, we have learned that these algorithms have a robustness that enables them to work well even when programming errors have been made.
5.2 Backpropagation Algorithm for Gradient Evaluation

5.2.1 Notation
Backpropagation provides an effective method for evaluating the gradient vector needed to implement the steepest descent, conjugate gradient, and quasi-Newton algorithms. BPA differs from straightforward gradient calculations using the chain rule for differentiation in the way it organizes efficiently the gradient calculation for networks having more than one hidden layer. In the words of Rumelhart et al. [206, pp. 326 and 327]:

The application of the generalized delta rule [BPA], thus, involves two phases: During the first phase the input is presented and propagated forward through the network to compute the output value o_{pj} for each unit. This output is then compared with the targets, resulting in an error signal δ_{pj} for each output unit. The second phase involves a backward pass through the network (analogous to the initial forward pass) during which the error signal is passed to each unit in the network and the appropriate weight changes are made. This second, backward pass allows the recursive computation of δ . . .

Further exposition of BPA in the context of feedforward neural networks (FFNN) requires us to recall the notation introduced in Section 3.2.3, and illustrated in Figure 3.2, to describe a multiple-layer FFNN.
1. The collection of inputs forms a d × n matrix S whose mth column is x_m = {x^m_i}.
2. Let i generically denote the ith layer, with the inputs occurring in the 0th layer and the last layer being the Lth and containing the outputs.
3. The layer will be indexed as the first subscript and separated from other subscripts by a colon (:).
4. It is common for the last layer node to be linear in approximation problems (e.g., estimation and forecasting) and nonlinear in pattern classification problems.
5. The number of nodes in the ith layer is given by the "width" s_i.
6. The jth node function in layer i is F_{i:j}; alternatively we also use σ_{i:j} or simply σ when there is little room for confusion.
7. The argument of F_{i:j}, when x_m is the input to the net, is denoted c^m_{i:j}.
8. The value F_{i:j}(c^m_{i:j}) = a^m_{i:j}, when the net input is x_m = {x^m_j = a^m_{0:j}}.
9. The s_i-dimensional vector a^m_i denotes the responses of nodes in the ith layer to the mth input vector.
10. The derivative of F_{i:j} with respect to its scalar argument is denoted by either f_{i:j} or σ'_{i:j}.
11. The biases (negative of the thresholds) for nodes in the ith layer are given by the s_i-dimensional vector b_i = {b_{i:j}} and appear additively.
12. The weight w_{i:j,k} assigned to the link connecting the kth node output in layer i−1 to the jth node input in layer i is an element of an s_{i−1} × s_i matrix W_i.
Hence, in this notation the basic neural network operating equations are:
a^m_{0:j} = (x_m)_j = x^m_j, (5.2.1a)
c^m_{i:j} = \sum_{k=1}^{s_{i-1}} w_{i:j,k} a^m_{i-1:k} + b_{i:j}, \quad i > 0, (5.2.1b)
a^m_{i:j} = F_{i:j}(c^m_{i:j}), \quad i > 0. (5.2.1c)
Finally, we shall follow common practice and simplify our exposition by assuming that the network has a single output so that a^m_{L:1} = y_m,
\frac{1}{2} \|y_m - t_m\|^2 = \frac{1}{2}(y_m - t_m)^2 = \frac{1}{2} e_m^2 = E_m.
The extension of our development for vector-valued outputs is straightforward but obscures the exposition. In Section 5.14 we provide MATLAB programs for training on a network with multiple outputs, and these programs provide details of the extension. An iterative development for the responses from the various layers is given by
a^m_{0:j} = x^m_j, \qquad a^m_{i:j} = F_{i:j}\left( \sum_{k=1}^{s_{i-1}} w_{i:j,k} a^m_{i-1:k} + b_{i:j} \right). (5.2.1d)
For i ≥ 1, the responses a^m_{i−1} in layer i − 1 are forward propagated to determine the responses a^m_i in layer i. All the training algorithms that select network parameters to yield a small error in fitting a given training set make extensive use of gradients of the network function with respect to its defining parameters. Such calculations must be carried out efficiently because they are repeated many times (millions of times, in many cases) in the course of operation of a training algorithm. We present two approaches to gradient calculation, the first based on the familiar univariate chain rule and the second that of backpropagation.
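As a minimal illustration of the forward propagation of Eq. 5.2.1d, the following MATLAB sketch (ours, with made-up sizes and data; tanh hidden nodes and a linear output node are assumptions) computes the layer excitations, responses, and training error for a single-hidden-layer network.

% Sketch of the forward pass of Eq. 5.2.1d for a 1HL net (assumed setup).
d = 3; s1 = 4; n = 10;                    % illustrative sizes (made up)
S = randn(d,n); T = randn(1,n);           % inputs as columns of S; targets T
W1 = 0.1*randn(s1,d); b1 = zeros(s1,1);   % first-layer weights and biases
W2 = 0.1*randn(1,s1); b2 = 0;             % output weights and bias
C1 = W1*S + b1*ones(1,n);                 % excitations c^m_{1:j}, one column per input
A1 = tanh(C1);                            % hidden responses a^m_{1:j}
Y  = W2*A1 + b2;                          % linear output y_m = a^m_{2:1}
ET = 0.5*sum((Y - T).^2);                 % quadratic training error E_T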
5.2.2 Gradients for a Single Hidden Layer Network
The univariate chain rule of differentiation is
\frac{d f(g(x))}{dx} = \frac{df(z)}{dz}\bigg|_{z=g(x)} \frac{dg(x)}{dx}.
Consider a network with linear output node generating a scalar output produced by weighted summation of the responses from a single hidden layer having s_1 nodes. We first address the gradient with respect to the output weights w_{2:1,j},
\frac{\partial E_T}{\partial w_{2:1,j}} = \sum_{m=1}^{n} (y_m - t_m) \frac{\partial y_m}{\partial w_{2:1,j}} = \sum_{m=1}^{n} e_m \frac{\partial y_m}{\partial w_{2:1,j}},
where e_m = y_m − t_m is the error in the network response to input x_m. Observe that with a linear output node
y_m = a^m_{2:1} = c^m_{2:1} = \sum_{p=1}^{s_1} w_{2:1,p} a^m_{1:p} + b_{2:1},
so that
\frac{\partial y_m}{\partial w_{2:1,j}} = a^m_{1:j}.
Hence,
\frac{\partial E_T}{\partial w_{2:1,j}} = \sum_{m=1}^{n} e_m a^m_{1:j}.
The gradient term for the output bias b_{2:1} is given by
\frac{\partial E_T}{\partial b_{2:1}} = \sum_{m=1}^{n} e_m \frac{\partial y_m}{\partial b_{2:1}} = \sum_{m=1}^{n} e_m.
Turning now to the first layer weights,
\frac{\partial E_T}{\partial w_{1:j,k}} = \sum_{m=1}^{n} e_m \frac{\partial y_m}{\partial w_{1:j,k}},
\frac{\partial y_m}{\partial w_{1:j,k}} = w_{2:1,j} \frac{\partial a^m_{1:j}}{\partial w_{1:j,k}}.
Recall that a^m_{1:j} = F_{1:j}(c^m_{1:j}). Hence, letting f_{1:j}(z) = \frac{dF_{1:j}(z)}{dz},
\frac{\partial a^m_{1:j}}{\partial w_{1:j,k}} = f_{1:j}(c^m_{1:j}) \frac{\partial c^m_{1:j}}{\partial w_{1:j,k}} = f_{1:j}(c^m_{1:j}) a^m_{0:k} = f_{1:j}(c^m_{1:j}) x^m_k.
From this, we conclude that
\frac{\partial E_T}{\partial w_{1:j,k}} = \sum_{m=1}^{n} e_m w_{2:1,j} f_{1:j}(c^m_{1:j}) a^m_{0:k}.
Finally, we turn to the gradient with respect to the first layer biases,
\frac{\partial E_T}{\partial b_{1:j}} = \sum_{m=1}^{n} e_m w_{2:1,j} f_{1:j}(c^m_{1:j}).
In summary, the gradient components are
\frac{\partial E_T}{\partial w_{2:1,j}} = \sum_{m=1}^{n} e_m a^m_{1:j}, \qquad \frac{\partial E_T}{\partial b_{2:1}} = \sum_{m=1}^{n} e_m,
\frac{\partial E_T}{\partial w_{1:j,k}} = \sum_{m=1}^{n} e_m w_{2:1,j} f_{1:j}(c^m_{1:j}) a^m_{0:k}, \qquad \frac{\partial E_T}{\partial b_{1:j}} = \sum_{m=1}^{n} e_m w_{2:1,j} f_{1:j}(c^m_{1:j}).
We have now calculated the full gradient for this network.
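A minimal MATLAB sketch of these summary formulas, continuing the assumptions of the forward-pass fragment given earlier (tanh hidden nodes, linear output, and the same made-up variables), is:

% Sketch of the 1HL gradient components, using Y, A1, S, T, W2 from the
% assumed forward pass above (tanh hidden nodes, so f_{1:j} = 1 - a^2).
em  = Y - T;                        % errors e_m, 1 x n
F1p = 1 - A1.^2;                    % f_{1:j}(c^m_{1:j}) for tanh hidden nodes
gW2 = em*A1';                       % dE_T/dw_{2:1,j} = sum_m e_m a^m_{1:j}
gb2 = sum(em);                      % dE_T/db_{2:1}   = sum_m e_m
D1  = (W2'*em).*F1p;                % e_m w_{2:1,j} f_{1:j}(c^m_{1:j}), s1 x n
gW1 = D1*S';                        % dE_T/dw_{1:j,k} = sum_m (...) x^m_k
gb1 = sum(D1,2);                    % dE_T/db_{1:j}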
5.2.3 Gradients for MLP: Backpropagation
Although the direct chain rule approach is clear enough for a single hidden layer network, it soon becomes murky when there is more than one hidden layer. A systematic organization of the calculation of the gradient for a multilayer perceptron is provided by the celebrated backpropagation algorithm. This organization is also efficient with respect to the number
of floating-point operations (flops) needed to calculate the gradient (see the next subsection). As before, define
E_m(w) = \frac{1}{2}(a^m_{L:1} - t_m)^2, \qquad E_T(w) = \sum_{m=1}^{n} E_m(w).
To relate this to our interest in the gradient of E_m with respect to a weight w_{i:j,k} or bias b_{i:j}, note that these parameters affect E_m only through their appearance in
c^m_{i:j} = \sum_{k=1}^{s_{i-1}} w_{i:j,k} a^m_{i-1:k} + b_{i:j}. (5.2.1b)
Hence, an expression for all the elements of the gradient vector is
\frac{\partial E_m}{\partial w_{i:j,k}} = \frac{\partial E_m}{\partial c^m_{i:j}} \frac{\partial c^m_{i:j}}{\partial w_{i:j,k}} = \frac{\partial E_m}{\partial c^m_{i:j}} a^m_{i-1:k}, (5.2.2a)
\frac{\partial E_m}{\partial b_{i:j}} = \frac{\partial E_m}{\partial c^m_{i:j}}. (5.2.2b)
We supplement our notation by introducing
δ^m_{i:j} = \frac{\partial E_m(w)}{\partial c^m_{i:j}}, (5.2.3)
the rate of contribution to the error term E_m by the excitation c^m_{i:j} to the jth node in the ith layer. This term is the error signal "delta" that is backpropagated from the output to node j in layer i in response to the input x_m. We can now re-express
\frac{\partial E_m}{\partial w_{i:j,k}} = δ^m_{i:j} a^m_{i-1:k}, (5.2.4a)
\frac{\partial E_m}{\partial b_{i:j}} = δ^m_{i:j}. (5.2.4b)
It remains to evaluate δ^m_{i:j}. Note that because E_m depends on c^m_{i:j} only through a^m_{i:j},
δ^m_{i:j} = \frac{\partial E_m}{\partial a^m_{i:j}} \frac{\partial a^m_{i:j}}{\partial c^m_{i:j}} = f_{i:j}(c^m_{i:j}) \frac{\partial E_m}{\partial a^m_{i:j}}. (5.2.5)
If layer i is hidden, then E_m depends on a^m_{i:j} only through its effects on the layer i + 1 to which it is an input. Hence,
\frac{\partial E_m}{\partial a^m_{i:j}} = \sum_{k=1}^{s_{i+1}} \frac{\partial E_m}{\partial c^m_{i+1:k}} \frac{\partial c^m_{i+1:k}}{\partial a^m_{i:j}} = \sum_{k=1}^{s_{i+1}} δ^m_{i+1:k} w_{i+1:k,j}. (5.2.6)
Combining Eqs. 5.2.5 and 5.2.6 yields the backward recursion
δ^m_{i:j} = f_{i:j}(c^m_{i:j}) \sum_{k=1}^{s_{i+1}} δ^m_{i+1:k} w_{i+1:k,j}, (5.2.7a)
for i < L. This equation can be rewritten in matrix-vector form using W_{i+1} = [w_{i+1:k,j}], δ^m_i = [δ^m_{i:j}], f^m_i = [f_{i:j}(c^m_{i:j})],
δ^m_i = \{ (δ^m_{i+1})^T W_{i+1} \} .* f^m_i,
where .* is the Hadamard product (MATLAB elementwise multiplication of matrices of the same dimensions). The "final" condition, from which we initiate the backward propagation, is provided by the direct evaluation of
δ^m_{L:1} = f_{L:1}(c^m_{L:1})(a^m_{L:1} - t_m). (5.2.7b)
For the common case of a linear output node we can write, without loss of generality, δ^m_{L:1} = e_m.
FIGURE 5.3. Information flow in backpropagation.
Thus the evaluation of the gradient is accomplished by:
1. a forward pass of the training data through the network to determine the node outputs a^m_{i:j} and inputs c^m_{i:j};
2. a backward pass through the network to determine the δ^m_{i:j} through Eq. 5.2.7; and
3. a combination of these results to determine the gradient through Eq. 5.2.4.
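The following MATLAB sketch assembles these three steps for a network with tanh hidden nodes and a single linear output. It is ours, not one of the appendix programs; note that, for vectorization, it stores the layer-i weights as an s_i × s_{i−1} array W{i}, the transpose of the orientation given for W_i in the notation list, and all sizes and data are made up.

% Sketch of the forward/backward passes and gradient assembly (Eqs. 5.2.1d,
% 5.2.4, 5.2.7) for an L-layer net with tanh hidden nodes, linear output.
d = 3; n = 50; widths = [d 6 4 1]; L = numel(widths)-1;  % layer widths (made up)
S = randn(d,n); T = randn(1,n);                          % inputs and targets
W = cell(L,1); b = cell(L,1); gW = cell(L,1); gb = cell(L,1);
for i = 1:L
  W{i} = 0.1*randn(widths(i+1), widths(i));              % here W{i} is s_i x s_{i-1}
  b{i} = zeros(widths(i+1), 1);
end
A = cell(L+1,1); A{1} = S;                               % A{i+1} holds the a^m_i
for i = 1:L                                              % forward pass, Eq. 5.2.1d
  C = W{i}*A{i} + b{i}*ones(1,n);
  if i < L, A{i+1} = tanh(C); else A{i+1} = C; end       % linear output node
end
delta = A{L+1} - T;                                      % delta^m_{L:1} = e_m
for i = L:-1:1                                           % backward pass, Eq. 5.2.7
  gW{i} = delta*A{i}';                                   % Eq. 5.2.4a summed over m
  gb{i} = sum(delta,2);                                  % Eq. 5.2.4b summed over m
  if i > 1, delta = (W{i}'*delta).*(1 - A{i}.^2); end    % backward recursion
end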
This calculation is nicely modular with respect to the number of hidden layers. The role of δ^m_{i:j} is to distribute the credit for the error term E_m back down the network to the excitation c^m_{i:j} at node F_{i:j}. This credit for the error term is then shared by the weights and bias contributing to the summands of c^m_{i:j} proportionately to their multiplicative excitations, a^m_{i−1:k} for w_{i:j,k} and 1 for b_{i:j}. In effect, a weight is adjusted in a manner that depends only on the "local" delta error signal at the output end of the weighted, directed link and on the activation at the input end. This viewpoint suggests variations on backpropagation in which this distribution of error to weights is carried through either a function monotonic in both δ_{i:j}, a_{i−1:k} that is not their product or in which the weights attached to a given node share the burden of adjustment by means other than proportionality to their inputs. The resulting process is somewhat in accord with the Hebbian model of neural learning, although a meaningful concordance between BPA and biological processes of learning is doubtful (Stork [231]).
5.2.4 Calculational Efficiency of Backpropagation
An estimate can be formed of the number of flops (floating-point operations; following MATLAB usage we use this term to count the number of operations rather than their per-second rate) required to calculate all the p partial derivatives of the error E_T with respect to each of the p parameters (weights and biases) by starting with the forward propagation equation of Eq. 5.2.1d. No effort is expended in Layer 0 where the inputs are given. In Layer i, for each of the s_i nodes, we need to evaluate c^m_{i:j}, a^m_{i:j} = F_{i:j}(c^m_{i:j}), f_{i:j}(c^m_{i:j}). Evaluation of c^m_{i:j} involves s_{i−1} sums and s_{i−1} products, for a total of 2 s_i s_{i−1} flops. We assume that the evaluation of both the node functions F_{i:j} and f_{i:j} costs φ flops, independent of i, j, except for the last layer (i = L) where they cost no flops. If we count, as MATLAB does, the evaluation of an exponential as a single flop, then evaluation of a logistic function requires three flops for a division, addition, and exponentiation. Thus the flop count fl_f needed to forward propagate a single input is
fl_f = \sum_{i=1}^{L} s_i (2 s_{i−1} + φ) − φ,
where s_0 = d. To compare this to the number of parameters describing the network, note that to add an ith layer requires us to specify s_i biases and an s_{i−1} × s_i weight matrix. Hence,
p = \sum_{i=1}^{L} s_i (1 + s_{i−1}).
If we let s = \sum_{i=1}^{L} s_i denote the total number of nodes in the network, then we find that
fl_f = 2p + s(φ − 2) − φ ≈ 2p.
That this cannot be significantly improved follows from the observation that the output a_{L:1} does depend on all of the p network parameters (so long as they are all nonzero), and including each parameter can be expected to take at least one flop/parameter. This yields a lower bound of fl_f ≥ p. Turning to the complexity of the backpropagation calculation, we see from Eq. 5.2.7(a) that the calculation of δ^m_{i:j} requires s_{i+1} − 1 additions and s_{i+1} + 1 multiplications, for a total of 2 s_{i+1} flops. This evaluation must be done for each of the s_i such terms in Layer i. The evaluation of δ^m_{L:1} requires only one subtraction for a linear output node. Hence, the number fl_b of flops consumed by the backpropagation for a single input is
fl_b = 1 + 2 \sum_{i=1}^{L−1} s_i s_{i+1} = 2 \sum_{i=1}^{L} s_i s_{i−1} + 1 − 2 s_1 d = 2p + 1 − 2(s + s_1 d).
Finally, as can be seen from Eq. 5.2.4, the gradient components due to a single input specialized for the biases require no further computation. The gradient components for the weights require a single multiplication for each of the p − s weights. Thus the computation of gradients for a single input requires a flop count of
fl = fl_f + fl_b + (p − s) = 5p + s(φ − 5) + 1 − φ − 2 s_1 d ≈ 5p.
This result compares favorably with a naive approach that would evaluate each of the p derivatives with order of p flops, for a total estimate of order of p^2 flops. We now conclude by noting that all of this computational effort must be expended n times, once for each of the training set elements, and the resulting n terms must be added together for each of the p parameters. Hence, the backpropagation evaluation of all p derivatives, taking into account all n training set elements, results in the total flop count
totfl = n · fl + np ≈ 6np.
Thus a 1HL with d = 20, s_1 = 10 and training set size n = 1000 yields a computational estimate of about 1 megaflop (1 Mfl). As we shall see, in iterative training it is common to perform on the order of 1000 iterations, for a total computational effort of about 1 gigaflop (1 Gfl). It becomes evident that the current successes of neural network applications very much depend on the availability of inexpensive, high-speed computing.
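A small arithmetic check of these counts for the example just quoted (d = 20, s_1 = 10, n = 1000, logistic nodes with φ = 3) can be carried out in MATLAB:

% Arithmetic check of the flop estimates for the worked example above.
d = 20; s1 = 10; n = 1000; phi = 3;
s  = s1 + 1;                         % total nodes (hidden plus single output)
p  = s1*(1+d) + 1*(1+s1);            % p = 221 parameters
flf = 2*p + s*(phi-2) - phi;         % forward-pass flops per input
flb = 2*p + 1 - 2*(s + s1*d);        % backward-pass flops per input
fl  = flf + flb + (p - s);           % gradient flops per input
totfl = n*fl + n*p                   % roughly 6np, about 1 Mfl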
5.3 Descent Algorithms

5.3.1 Overview and Startup Issues
The backpropagation algorithm (BPA), in common usage, refers to a descent algorithm that iteratively selects a sequence of parameter vectors {w_k, k = 1 : T} for a moderate value of running time T, with the goal of having {E_T(w_k) = E(k)} converge to a small neighborhood of a good local minimum rather than to the usually inaccessible global minimum E_T^* = min_{w ∈ W} E_T(w). Issues that need to be addressed are:
(a) preprocessing or standardization of the training data;
(b) initialization of the algorithm;
(c) choice of online (stochastic) versus batch processing;
(d) iterative or recursive algorithm search for an error surface minimum;
(e) selection of parameters of the algorithm;
(f) rules for terminating the algorithmic search; and
(g) convergence behavior (e.g., selection of local versus global minima, rates of convergence).
Typical node functions like the logistic and tanh have magnitudes lying in [0, 1]. If the training data target variables take values in a very different range, then a portion of the iterative training will be devoted to achieving the correct output scale. It is more efficient to simply linearly transform the variables {t_i, i = 1 : n} to take values in approximately the [−1, 1] range through, say,
\bar{t} = \max_i t_i, \quad \underline{t} = \min_i t_i, \quad \tilde{t} = \frac{1}{2}(\bar{t} + \underline{t}), \quad t'_i = 2 \frac{t_i - \tilde{t}}{\bar{t} - \underline{t}}.
Alternatively, and more commonly, one may use the sample mean m and standard deviation σ (square root of the sample variance) to rescale to the standardized random variable
t' = \frac{t - m}{σ}.
After training, the net outputs are rescaled back to the original range by rescaling the final layer weights and bias. Specifically, if a_{L:1} = c_{L:1} is the response of the network with a linear output node to the scaled target set, then we need to undo the scaling to achieve
t = σ t' + m \Longleftrightarrow a'_{L:1} = σ a_{L:1} + m.
This is easily accomplished by assigning new output weights and bias as follows:
w'_{L:1,j} = σ w_{L:1,j}, \qquad b'_{L:1} = σ b_{L:1} + m.
Similar scale changes should be used on the d × n matrix S of training set inputs having x_i as its ith column, although care must be taken that whatever transformation is used on, say, input x_i is also used on input x_j. The rows (features) of S can be transformed independent of each other, but whatever transformation is made of the ith element of a row must also be made of the jth element of the same row. Rescaling helps correct a situation in which the different components of the input vector have different meanings, and therefore they may be measured on wildly incomparable scales. Values of inputs that are near zero will require large weight values to make them effective in controlling node outputs. Values of inputs that are very large in magnitude will saturate the nodes and make gradient-based learning proceed very slowly. In changing scale in S one can independently rescale each row (corresponding to a feature or input variable), but one
should not rescale each column independent of the others. If, say, we train on rescaled data in which the jth component x_j of the input/feature vector is rescaled to
x'_j = \frac{x_j - m_j}{σ_j},
then to preserve the same excitations c_{1:k} to the kth first layer node when we use x_j in place of x'_j, we must have the new first layer weights and biases connected to the kth node satisfy
c_{1:k} = \sum_{j=1}^{d} w'_{1:k,j} x_j + b'_{1:k} = \sum_{j=1}^{d} w_{1:k,j} x'_j + b_{1:k}.
This condition is easily achieved by the assignments
w'_{1:k,j} = \frac{1}{σ_j} w_{1:k,j}, \qquad b'_{1:k} = b_{1:k} - \sum_{j=1}^{d} \frac{m_j}{σ_j} w_{1:k,j}.
While linear scale changes are generally considered to be innocuous, and in these cases positively beneficial, there is also the possibility of a nonlinear scale change of the input variables. The neural network, being a general nonlinear system, has the capability of taking advantage of good nonlinear rescalings of x and of overcoming unfortunate choices of rescaling. An example of such a rescaling in the context of blind equalization is given by Wong [255]. In this case it was desired to separate groups of inputs, in effect belonging to different pattern or transmitted message classes. A quantization of the inputs brought them to uniform separation and made the classification task easier to learn. However, if you nonlinearly transform x_j to x'_j through some function x'_j = g_j(x_j), then to use the network trained on x'_j you need to again preprocess the true input x_j by use of g_j. The search algorithm is usually initialized with a choice w_0 of parameter vector that is selected at random to have moderate or small values. The random choice is made to prevent inadvertent symmetries in the initial choice from being locked into all the subsequent iterations from that starting point. Moderate weight values are selected to avoid initially saturating the node nonlinearities; gradients are very small when S-shaped nodes are saturated and convergence will be slow. It has been argued in Kolen and Pollack [130] that the performance of steepest descent for neural networks is very sensitive to the choice of w_0. They provide evidence of a fractal-like partition of the set W of possible initial vectors in that initial vectors that are very close to each other can yield radically different convergence behavior. In practice, one often trains several times, starting from different initial conditions. One can then select the solution with the smallest minimum or make use of a combination of all the solutions found.
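The following MATLAB sketch collects the standardizations just described; it is illustrative only, the training step is elided, and W1, b1, W2, b2 stand for the parameters of the trained single-hidden-layer net (assumed names, not appendix code).

% Sketch of target and input standardization and the post-training rescaling
% of the trained parameters (assumed names; training itself is elided).
[d, n] = size(S);                    % S is d x n, t is 1 x n
mt = mean(t); st = std(t);
tprime = (t - mt)/st;                % standardized targets used for training
mx = mean(S,2); sx = std(S,0,2);     % per-feature mean and std, d x 1
Sprime = (S - mx*ones(1,n))./(sx*ones(1,n));
% ... train on (Sprime, tprime), yielding W1 (s1 x d), b1, W2 (1 x s1), b2 ...
W2 = st*W2;  b2 = st*b2 + mt;        % outputs now on the original t scale
b1 = b1 - W1*(mx./sx);               % b'_{1:k} = b_{1:k} - sum_j (m_j/sigma_j) w_{1:k,j}
W1 = W1./(ones(size(W1,1),1)*sx');   % w'_{1:k,j} = w_{1:k,j}/sigma_j; net now accepts raw x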
A descent algorithm can be developed in either a batch mode or an online/stochastic mode. In the batch mode we attempt at the (k + 1)st step of the iteration to reduce the total error over the whole training set, E_T(w_k) = E(k), to a lower value E_T(w_{k+1}) = E(k + 1). In the online mode we attempt at the (k + 1)st step of the iteration to reduce a selected component E_{m_{k+1}}, the error in the response to excitation x_{m_{k+1}}, of the total error. Over the course of the set of iterations, all components will be selected, usually many times. Each version has its proponents. As we shall see, to achieve true steepest descent on E_T(w) we must do the batch update in which the search direction is evaluated in terms of all training set elements. In practice, the most common variant of BPA is online and adjusts the parameters after the presentation of each training set sample. It is then necessary to pass repeatedly through the training set to achieve the same results as in the batch mode. The online process fails to perform steepest descent on the total error E_T(w). However, it is reputed to have performance advantages over batch training (e.g., LeCun et al. [140, p. 157]) for large-scale problems. The operation of the online search is more stochastic than that of the batch search because directions depend upon the choice of training set term. The online mode replaces the large step size taken by the batch process (online mode steps summed over each training sample) by a sequence of smaller step sizes in which you continually update the weight vectors as you iterate. This mode makes it less likely that you will degrade performance by a significant erroneous step. There is a belief that this enables the algorithm to find better local minima through a more random exploration of the parameter space W.
5.3.2 Iterative Descent Algorithms
We generically denote all network parameters (link weights and biases) by a vector w ∈ W ⊂ IRp . Specifically, for a single hidden layer (1HL) network in MATLAB notation (see Appendix 3) we create the column vector w by first reshaping the matrix W1: of weights connecting the inputs to the first layer of nodes into a column vector, appending the vector of first layer biases, further appending the transpose of the row vector that is the matrix of weights connecting the first layer nodes to the single output, and finally appending the scalar bias supplied to the output variable,
w = [reshape(W1:, s1*d, 1); b1; W2:'; b2].
The basic iterative recursion, common to all the training methods in widespread use today, is a two-stage process that determines a new parameter vector w_{k+1} in terms of the current vector w_k through first selecting a search direction d_k and then a learning rate or step size α_k,
w_{k+1} = w_k + α_k d_k. (5.3.1)
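A minimal MATLAB sketch of this packing, together with the inverse unpacking, follows; it is our illustration, the sizes are made up, and the unpacking layout is only one plausible choice consistent with the packing just described.

% Sketch: pack the 1HL parameters into w and unpack them again.
s1 = 4; d = 3;                                   % illustrative sizes
W1 = randn(s1,d); b1 = randn(s1,1); W2 = randn(1,s1); b2 = randn;
w = [reshape(W1, s1*d, 1); b1; W2'; b2];         % pack, as described in the text
W1u = reshape(w(1:s1*d), s1, d);                 % unpack: first-layer weights
b1u = w(s1*d+1 : s1*d+s1);                       % first-layer biases
W2u = w(s1*d+s1+1 : s1*d+2*s1)';                 % output weights (row vector)
b2u = w(end);                                    % output bias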
A search through p-dimensional space for a minimum is conducted by searching sequentially along individual directions {d_k} with the distance searched along d_k being determined by the learning rate or step size α_k. Typically, descent algorithms are Markovian in that a state can be defined so that its future state depends only on its current state and not the succession of past states that led up to it. The state after iteration k might consist of (w_k, d_k, g_k). In the case of basic steepest descent, this state is simply the current value of the parameter w_k and gradient g_k. In the variation on steepest descent using momentum smoothing, the state depends on the current parameter value and gradient and the most recent parameter value. Each algorithm we will present determines the next search point by looking locally at the error surface. We can explore the choice of search direction by considering the following first-order approximation (i.e., f(x) − f(x_0) ≈ f'(x_0)(x − x_0)) to successive values of the objective/error function,
E(k + 1) − E(k) ≈ g(w_k)^T (w_{k+1} − w_k). (5.3.2)
If we wish our iterative algorithm to yield a steady descent, then we must reduce the error at each stage. For increments w_{k+1} − w_k that are not so large that our first-order Taylor's series approximation of Eq. 5.3.2 is invalid, we see that we must have
g(w_k)^T (w_{k+1} − w_k) = g(w_k)^T (α_k d_k) = α_k g_k^T d_k < 0 (descent condition). (5.3.3)
The simplest way to satisfy the descent condition of Eq. 5.3.3 is to have
α_k > 0, \qquad d_k = −g_k. (5.3.4)
This particular choice of descent direction of Eq. 5.3.4 is the basis of steepest descent algorithms; the search direction is taken in the direction of steepest descent of the error surface. More generally, if {Mk } is a sequence of positive definite matrices, then dk = −Mk g k also satisfies the descent condition. The Newton method chooses Mk = H−1 (wk ), which is generally unknown to us. Quasi-Newton and Levenberg-Marquardt methods make other choices for Mk that are positive definite approximations to the inverse of the Hessian H. Yet another choice of descent direction is discussed in Section 5.5 on conjugate gradient methods.
An "optimal" choice α_k^* for the learning rate α_k for a given choice of descent direction d_k is the one that minimizes E(k + 1),
α_k^* = \mathrm{argmin}_α E(w_k + α d_k).
To carry out the minimization we use
\frac{∂E(w_{k+1})}{∂α}\bigg|_{α=α_k^*} = \frac{∂E(w_k + α d_k)}{∂α}\bigg|_{α=α_k^*} = 0.
To evaluate this equation, note that
\frac{∂E(w_k + α d_k)}{∂α} = g_{k+1}^T d_k,
and conclude that for the optimal learning rate we must satisfy the orthogonality condition
g_{k+1}^T d_k = 0; (5.3.5)
the gradient of the error at the end of the iteration step is orthogonal to the search direction along which we have changed the parameter vector. Hence, in the case of steepest descent (Eq. 5.3.4), successive gradients are orthogonal to each other. When the error function is not specified analytically, then its minimization along d_k can be accomplished through a numerical line search for α_k^* (to be discussed in the next subsection) or through numerical differentiation as noted herein. We have used "optimal" in quotes to underscore the fact that this choice of learning rate is truly optimal only if we are at the final stage of iteration. If, as is most often the case, we are at an intermediate stage of iteration, then the above "greedy" choice of learning rate is not optimal with respect to eventual minimization of the error on completion of the iterative descent process. Analyses of such algorithms often examine their behavior when the error function is truly a quadratic, as given in Eqs. 5.1.4 and 5.1.5. In our current notation, g_{k+1} − g_k = α_k H d_k. Hence, the optimality condition for the learning rate α_k derived from the orthogonality condition Eq. 5.3.5 becomes
α_k^* = \frac{-d_k^T g_k}{d_k^T H d_k}. (5.3.6)
When search directions are chosen via d_k = −M_k g_k, with M_k symmetric, then the optimal learning rate is
α_k^* = \frac{g_k^T M_k g_k}{g_k^T M_k H M_k g_k}. (5.3.7)
In the particular case of steepest descent for a quadratic error function, M_k is the identity and
α_k^* = \frac{g_k^T g_k}{g_k^T H g_k}. (5.3.8)
One can think of α_k^* as the reciprocal of an expected value of the eigenvalues {λ_i} of the Hessian, with probabilities determined by the squares of the coefficients of the gradient vector g_k expanded in terms of the eigenvectors {e_i} of the Hessian:
\frac{1}{α_k^*} = \sum_{i=1}^{p} q_i λ_i, \qquad q_i = \frac{(g_k^T e_i)^2}{g_k^T g_k}.
In Newton's method, M_k = H(w_k)^{-1} and α_k^* = 1. Unfortunately, Eq. 5.3.7 involves the Hessian. Although the Hessian is calculable, its computation is often considered too time-consuming and its storage requirements excessive. For p network parameters there are \frac{1}{2}p(p + 1) distinct elements in the Hessian—a number that can easily exceed 10^6 and strain current computational resources; although computational resources may continue to follow Moore's law and grow exponentially in time, our appetite for ever larger models will keep pace and maintain a stressed frontier. There are now several proposals for effectively calculating approximate versions of the Hessian (see Section 5.8). Of course, if we are not in the vicinity of a minimum, then H may not be positive definite and a Newton's method choice may lead in the wrong direction. A practical alternative is to assess α_k^* by evaluating the vector Hd_k rather than the matrix H (see also LeCun et al. [140], where α_k^* is approximated by an estimate of the reciprocal of the maximum eigenvalue of H). To do this, note that
\frac{∂\nabla E_T(w_k + α d_k)}{∂α} = H d_k.
Hence, for small ε we have the approximation from numerical differentiation that
g(w_k + ε d_k) − g_k ≈ ε H d_k.
Thus, by evaluating an additional gradient at w_k + ε d_k we can approximate the desired Hd_k and thereby evaluate the "optimal" learning rate α_k. An exact calculation of Hd_k, if desired, is provided by Pearlmutter [184] based on a rewriting of the backpropagation equations. Because the Hessian might not be positive definite and thereby lead to a negative value of learning rate, we make an ad hoc correction to this estimate to ensure a positive learning rate. A simple correction is to select γ > 0 and use
d_k^T H d_k ≈ γ \|d_k\|^2 + d_k^T \frac{g(w_k + ε d_k) − g_k}{ε}. (5.3.9)
Møller [169], in the context of the method of conjugate gradients, provides an adaptive method for choosing γ. An alternative is to check the sign of α_k and reset it to a small positive value if it is negative. This formulation enables us to evaluate the "optimal" learning rate at the expense of an additional gradient evaluation at w_k + ε d_k, and such gradients are efficiently calculated by backpropagation.
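A minimal MATLAB sketch of this learning-rate estimate (Eqs. 5.3.6 and 5.3.9), worked on a made-up two-parameter quadratic so that gradfun is simply an assumed handle standing in for a backpropagation gradient evaluation, is:

% Sketch: estimate the "optimal" learning rate with one extra gradient call.
H0 = [3 1; 1 2];                            % toy quadratic: E(w) = 0.5*w'*H0*w
gradfun = @(w) H0*w;                        % its gradient (an assumed handle)
wk = [1; -2]; gk = gradfun(wk); dk = -gk;   % current point, gradient, direction
eps0  = 1e-4;                               % small differencing step
gamma = 1e-3;                               % gamma > 0 keeps the estimate positive
Hd    = (gradfun(wk + eps0*dk) - gk)/eps0;  % finite-difference estimate of H*dk
curv  = gamma*(dk'*dk) + dk'*Hd;            % approximate d'Hd as in Eq. 5.3.9
alpha = -(dk'*gk)/curv;                     % learning rate from Eq. 5.3.6
if alpha <= 0, alpha = 1e-3; end            % ad hoc reset if still negative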
5.3.3 Approximate Line Search
The one-step optimum choice of α_k for a nonquadratic E_{T_n} can be determined by searching along the line α d_k to locate the minimum of E_{T_n}(w_k + α_k d_k). This search does not require calculation of the Hessian. Because it is not worth expending sizable computational resources doing a precise line search when the local minimum is unlikely to be precisely in the direction d_k, we conduct an approximate line search. A line search approximation algorithm is satisfactory if the combination of descent direction and termination criteria yields convergence of the overall algorithm for any initial condition (e.g., Luenberger [153, Section 7.5], Fletcher [76, Theorem 2.5.1], or Battiti [22, p. 150]). Search termination criteria that guarantee convergence are given by a pair of inequalities proposed in various versions by Armijo, Goldstein, Wolfe, and Powell (e.g., see Luenberger [153, Section 7.5], Fletcher [76, Section 2.5]). The inequality of Eq. 5.3.10a assures the achievement of a significant decrease in the error term with the chosen step size α, if it is not too small, and the inequality of Eq. 5.3.10b prevents the step size from being too small:
E(w_{k+1}) ≤ E(w_k) + αρ \frac{∂E(w_{k+1})}{∂α}\bigg|_{α=0} = E(w_k) + αρ g_k^T d_k, \quad for 0 < ρ < \frac{1}{2}, (5.3.10a)
\frac{∂E(w_{k+1})}{∂α} = g_{k+1}^T d_k ≥ σ \frac{∂E(w_{k+1})}{∂α}\bigg|_{α=0} = σ g_k^T d_k, \quad for ρ < σ < 1. (5.3.10b)
Fletcher recommends replacing Eq. 5.3.10b by
\left| \frac{∂E(w_{k+1})}{∂α} \right| ≤ −σ \frac{∂E(w_{k+1})}{∂α}\bigg|_{α=0}, \quad for ρ < σ < 1.
Of course, for these termination criteria to work we have to know that they are always achievable for proper choice of α_k. That this is the case for any descent direction d_k can be seen more readily if we rewrite the inequalities of Eq. 5.3.10 by introducing the scalar function of a single scalar argument h(α) = E(w_{k+1}), h(0) = E(w_k). Equations 5.3.10 can be rewritten more transparently as
h(α) ≤ h(0) + ρ h'(0) α, (5.3.11a)
h'(α) ≥ σ h'(0). (5.3.11b)
For a descent direction d_k we have that h'(0) < 0. Hence, Eq. 5.3.11a requires a choice of step size α > 0 in that h(α) < h(0) by an amount that depends on α. To verify that Eq. 5.3.11a is achievable, rewrite it as
\frac{h(α) − h(0)}{α} ≤ ρ h'(0).
From the Theorem of the Mean for once continuously differentiable functions we see that the left-hand side is precisely h'(α') for some 0 ≤ α' ≤ α, and Eq. 5.3.11a becomes h'(α') ≤ ρ h'(0). As 0 < ρ < 1, this termination criterion has us select a value of slope that is less than something larger than h'(0), and this can always be achieved. Because the non-negative error function E_{T_n} is bounded below, we are assured that a value of zero derivative can be approached eventually. Hence, there exist choices for α such that the slope h'(α) of h(α) is strictly closer to the desired value of 0 than is the negative value of σ h'(0), σ > ρ. Therefore, the condition of Eq. 5.3.11b or Eq. 5.3.10b is achievable.

We first consider three forms of approximate line search that differ in their choice of quadratic or cubic interpolating function. The quadratic interpolation uses three function values; the cubic interpolation makes use of either two function values and their corresponding gradients (four values to establish the four constants that define a cubic) (see Luenberger [153], Dennis and Schnabel [57]) or three function values and a single gradient (see Press et al. [191]). To initialize a quadratic-based search we start with several error function evaluations to bracket the location of the minimum in the given direction. For example, we halve and/or double an initial step size until, in addition to the current (starting) point p_1, we have identified two other points p_1 < p_2 < p_3, p_i = w_k + α_k^i d_k, all lying along the search direction d_k and such that E(p_1) > E(p_2) < E(p_3). We now have a set of three points with a minimum (of value no greater than E(p_2)) lying between the extreme points. We iterate with quadratic interpolation to better identify the location of the minimum. More specifically, following Luenberger [153], in quadratic interpolation we assume that at stage l we have available the last three selected points p_{l−2}, p_{l−1}, p_l that bracket a minimum as well as the associated values of the error function E_i = E(p_i), i = l − 2, l − 1, l. We consider the error function to be a function of the scalar argument α and construct a quadratic
q(α) = \sum_{j=0}^{2} \frac{\prod_{i \neq j} (α − α_k^{l−i})}{\prod_{i \neq j} (α_k^{l−j} − α_k^{l−i})} E_{l−j}
passing through these three points. It is easy to see that q is a quadratic in α and that q(α_k^{l−j}) = E_{l−j}. The new point α_k^{l+1} is determined by setting q'(α) = 0. Letting
a_{ij} = α_k^i − α_k^j, \qquad b_{ij} = (α_k^i)^2 − (α_k^j)^2,
we obtain
α_k^{l+1} = \frac{1}{2} \, \frac{b_{12} E_{l−2} + b_{20} E_{l−1} + b_{01} E_l}{a_{12} E_{l−2} + a_{20} E_{l−1} + a_{01} E_l}. (5.3.12)
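A one-line MATLAB rendering of this update (an illustration, not the appendix code; quadstep is a hypothetical name) shows that it recovers the minimizer of a quadratic exactly:

% Sketch of one quadratic-interpolation step: given three step sizes
% a0 < a1 < a2 bracketing a minimum, with error values E0 > E1 < E2,
% return the minimizer of the interpolating quadratic (Eq. 5.3.12).
quadstep = @(a0,a1,a2,E0,E1,E2) 0.5 * ...
    ((a1^2-a2^2)*E0 + (a2^2-a0^2)*E1 + (a0^2-a1^2)*E2) / ...
    ((a1-a2)*E0 + (a2-a0)*E1 + (a0-a1)*E2);
% Example: the quadratic E(a) = (a-1)^2 has its minimum recovered exactly:
quadstep(0, 0.5, 2, 1, 0.25, 1)     % returns 1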
A higher rate of convergence, of order 2, can be obtained by using a cubic interpolation. The form requiring two gradients can be implemented in the neural network setting through backpropagation calculations of derivative information. To initialize a cubic interpolation we supplement our starting point p_1 and the gradient g_1 at that point by another pair p_2, g_2 with the property that along the search direction d_k we have that d_k · g_2 > 0. The descent condition ensures that d_k · g_1 < 0. Hence, requiring d_k · g_2 > 0 ensures that there is a minimum between p_1 and p_2. The iteration using cubic interpolation proceeds (e.g., Luenberger [153, p. 206]) in terms of the two most recent function values E_l, E_{l−1} and their corresponding directional derivatives d_l = d_k^T g_l with respect to α along the search direction d_k, through
u_1 = d_{l−1} + d_l − 3 \frac{E_{l−1} − E_l}{α_k^{l−1} − α_k^l}, \qquad u_2 = [u_1^2 − d_{l−1} d_l]^{1/2},
α_k^{l+1} = α_k^l − (α_k^l − α_k^{l−1}) \left[ \frac{d_l + u_2 − u_1}{d_l − d_{l−1} + 2 u_2} \right]. (5.3.13)
A simpler implementation of a cubic line search follows Press et al. [191, pp. 384–385], and uses the known error and gradient evaluated at w_k, or α_k = 0, and an iterative selection of two additional error values at nonzero step sizes. In effect, we expand h(α) = E(w_k + α d_k) as a cubic in α,
h(α) = a α^3 + b α^2 + h'(0) α + h(0),
where h(0) = E(w_k), h'(0) = d_k^T g_k. We solve for the parameters a, b by evaluating h(α_0), h(α_1). We terminate this line search at l = L once the condition on error decrease of Eq. 5.3.11a is met, say, for ρ = .01,
E(w_k + α_k^L d_k) ≤ E(w_k) + α_k^L ρ g_k^T d_k.
We avoid the problems of step sizes that are too small by forcing the iterate α_k^l ≥ .1 α_k^{l−1}, and Press et al. also recommend α_k^l ≤ .5 α_k^{l−1}. Details are provided in the appended MATLAB program cubic1.m. In either quadratic or cubic interpolation we repeat the iteration until we satisfy the stopping conditions of Eqs. 5.3.10a,b for inexact or approximate line search. The advantage of the cubic interpolation scheme is that it has rapid second-order convergence to the minimum. An even simpler line search algorithm is that of Fibonacci search (Press et al. [191, Section 10.1]), although this requires the error function to be
unimodal over the search interval. One can also, at the expense of an additional gradient calculation for each step in the line search, simply bisect a given initial interval (say, α ∈ [0, 2] to allow for the choice α = 1 that is best in Newton search) on the basis of whether the gradient g at a current interval midpoint has a positive or negative scalar product with the search direction d_k. Approximate line searches to determine the learning rate do not seem necessary in steepest descent searches but are necessary to obtain good results in quasi-Newton and Levenberg-Marquardt searches. Failure to carry out a line search in applications of quasi-Newton or Levenberg-Marquardt typically results in training periods of increasing error.
5.3.4 Search Termination
Finally, we need to determine when to terminate the search for a minimum of E. Five commonly relied-on stopping conditions to establish termination are:
(a) preassigned upper bound (stopping time) T to the number of iterations;
(b) achievement of a preassigned satisfactory value E_final of E_T(w_k);
(c) successive changes in parameter values fall below a preassigned threshold, \|w_k − w_{k−1}\| < ε;
(d) the magnitude \|g_k\| of the current gradient is small, \|g_k\| < ε; and
(e) increasing estimate (e.g., by cross-validation or an independent test set) of generalization error.
Several of these conditions are often used simultaneously. Computational limits generally impose a running time bound T, and some form of item (a) should always be supplied. It may not be clear what a reasonable E_final is, as required by item (b), unless prior experimentation has provided some indication of what is achievable and the problem is understood well enough that acceptable performance can be identified. Items (c) and (d) are attempts to judge when convergence is near. In real applications of some complexity, descent algorithms cannot be expected to converge to a global minimum. Convergence of the algorithm, even to a local minimum, is problematic, although there is some treatment of this in a statistical setting (see Section 7.10). Nor can we readily tell when convergence is nearly achieved. There can be plateaus in the error surface that eventually lead to good minima. Evaluation of the Hessian would be indicative of convergence but is computationally expensive. There is ample empirical experience that the algorithms we present in this chapter will generally converge to a local minimum. At least for noisy training data, where good generalization behavior usually requires that we not fit the
training data too closely, convergence to a well-selected local minimum is generally satisfactory. Nonetheless, an argument for the use of Item (d) is presented in Sections 7.10–7.11 where the long-run statistical behavior of the parameter estimates is analyzed. Item (e) is discussed in Sections 7.5–7.7. In the neural network community, frequent reference is made to cross-validation, although this usually turns out to mean the use of an independent test set (e.g., see Kearns [123]). As discussed in Chapter 7, a validation error E_v is computed, say, on the basis of a validation or test set D_m that is independent of the training set T. E_v(w_k) is determined by running the network with parameters w_k on D_m and evaluating the sum-squared error incurred in fitting D_m. This calculation is repeated as we progress through the sequence w_k of parameter values returned by our iterative training algorithm. Training halts when E_v(w_k) reaches its first minimum. The behavior of the validation error E_v and the training error E_T as steepest descent training progresses is shown in Figure 5.4 for a particular training set and validation set, each of size 1000. The objective is to guard against overtraining, a condition in which the network overfits the training set and fails to generalize well. One explanation for the failure of generalization when overtraining occurs is that overtraining renders accessible the more complex members of the excessively flexible family of neural networks being deployed. Hence, we may end up fitting the data with a more complex function than the true relationship (e.g., a higher-degree polynomial can fit the same sample points as a lower-degree polynomial). A more common explanation observes that the target variables often contain noise as well as signal—there is usually only a stochastic relationship between feature vector x and target t, with repetitions of the same feature vector often corresponding to different target values. Fitting too closely to the training set means fitting to the noise as well and thereby doing less well on new inputs that will have noise independent of that found in the training set.
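The following MATLAB sketch shows one simple way to organize such validation-based stopping. It is schematic: w0 is an initial parameter vector, trainstep (one descent update on the training set) and errfun (sum-squared error of the net on a data set) are assumed placeholder functions, and the 5% slack used to declare that E_v has passed its first minimum is an arbitrary choice.

% Sketch of early stopping on a validation set (item (e)); placeholder handles.
Tmax = 5000; bestEv = inf; bestw = w0; w = w0;
for k = 1:Tmax
  w  = trainstep(w, Strain, Ttrain);       % one iteration of the descent algorithm
  Ev = errfun(w, Sval, Tval);              % validation error E_v(w_k)
  if Ev < bestEv
    bestEv = Ev; bestw = w;                % remember the best net found so far
  elseif Ev > 1.05*bestEv
    break                                  % E_v appears to have passed its first minimum
  end
end
w = bestw;                                 % the selected network parameters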
5.4 Steepest Descent

5.4.1 Locally Optimal Steepest Descent
Steepest descent algorithms follow the recursion of Eq. 5.3.1, specialized to the choice of steepest descent search direction d_k = −g_k,
w_{k+1} = w_k − α_k g_k.
The learning rate α_k is ideally the value achieving the minimum value of E in the chosen direction and can be determined by either a one-dimensional line search (Section 5.3.3) or the solution given by Eq. 5.3.8 and approximated in Eq. 5.3.9. Let \bar{λ} denote the largest eigenvalue of the Hessian H
[Log-scale error versus training iteration number, 0–5000 iterations.]
FIGURE 5.4. Training ET (solid) and validation Ev (dashed) error histories.
and \underline{λ} denote the smallest eigenvalue; then from the optimality condition of Eq. 5.3.8, we see that
\frac{1}{\bar{λ}} ≤ α_k^* ≤ \frac{1}{\underline{λ}}.
An illustration of the parameter trajectory in optimal steepest descent search on a true quadratic surface in two parameters is provided in Figure 5.5. Introduce the condition number as the ratio r = \bar{λ}/\underline{λ} of the largest and smallest eigenvalues of the Hessian. If r ≫ 1, then we are searching in a narrow valley (e.g., see the two contour plots in Section 5.1.1). In such a case the steepest descent algorithm tends to take small zig-zag steps (e.g., see Figure 5.5). Luenberger [153, pp. 218–219] establishes that for a true quadratic and the choice of the locally "optimal" learning rate given by Eq. 5.3.8,
E_T(w_{k+1}) − E_T(w^*) ≤ \left( \frac{\bar{λ} − \underline{λ}}{\bar{λ} + \underline{λ}} \right)^2 (E_T(w_k) − E_T(w^*)) = \left( \frac{r − 1}{r + 1} \right)^2 (E_T(w_k) − E_T(w^*)).
Hence, if r ≫ 1 then there will be very slow convergence to the minimum of the error as we go from the kth to the (k + 1)st iterate. Convergence is linear in the error for steepest descent. Discussion of steepest descent in the context of neural networks, and insights into its limitations, are provided by the discussion in Chapter 7 of Luenberger [153], Chapters 4 and 7 of Bishop [29], and Battiti [22], as well as by some of the contributors to the annual Advances in Neural Information Processing Systems (collections of papers presented at the annual Neural Information Processing Systems (NIPS) conferences) and the journal Neural Computation. Steepest descent, even in the context of a truly
quadratic error surface and with line search, suffers from greed. The successive directions do not generally support each other in that after two steps, say, the gradient is usually no longer orthogonal to the direction taken in the first step (e.g., see the contour plot of training trajectory in Figure 5.5).
FIGURE 5.5. Optimal steepest descent on quadratic surface.
In the quadratic case there exists a choice of learning rates that will drive the error to its absolute minimum in no more than p + 1 steps where p is the number of parameters. To see this, note that
E(w) = E(w^*) + \frac{1}{2}(w − w^*)^T H (w − w^*) = E(w^*) + \frac{1}{2} g^T H^{-1} g.
It is easily verified that if g_k = g(w_k), then
g_k = \left[ \prod_{j=1}^{k} (I − α_j H) \right] g_0.
Hence, for k ≥ p we can achieve g k = 0 simply by choosing α1 , . . . , αp any permutation of 1/λ1 , . . . , 1/λp , the reciprocals of the eigenvalues of the Hessian H; the resulting product of matrices is a matrix that annihilates each of the p eigenvectors and therefore any other vector that can be represented as their weighted sum. Of course, in practice, we do not know the values of the eigenvalues and cannot implement this algorithm. However, this observation points out the distinction between optimality when one looks ahead only one step and optimality when one adopts a more distant horizon.
5.4.2 Choice of Constant Learning Rate α
In the basic descent algorithm we follow the preceding process with the major exception that the step size is held at a constant value α_k = α. The simplicity of this approach is belied by the need to carefully select the learning rate. If the fixed step size is too large, then we leave ourselves open to overshooting the line search minimum, may engage in oscillatory or divergent behavior, and lose guarantees of monotone reduction of the error function E_T. For large enough α the algorithm will diverge. If the step size is too small, then we may need a very large number of iterations before we achieve a sufficiently small value of the error function. To better understand the implications of the size of α_k, we assume that we have reached the neighborhood of a local minimum w* and can make a quadratic approximation to the error

E(w) = E(w*) + (1/2)(w - w*)^T H(w*)(w - w*).

The gradient is given by Eq. 5.1.5 as g(w) = H(w - w*), and again {λ_i} are the eigenvalues of the Hessian.

Theorem 5.4.1 (Bounds on Constant Learning Rate) Under the assumption of the quadratic model, and when the limit exists,

0 < lim_{k→∞} α_k < min_m (2/λ_m) = 2/λ̄

is necessary and sufficient to guarantee that for all w_k

lim_{l→∞} w_{k+l} = w*.
Proof. From the quadratic approximation and the expression for the gradient we can write

w_{k+1} - w* = w_k - w* - α_k H(w_k - w*), or w_{k+1} - w* = (I - α_k H)(w_k - w*).

Assuming the validity of this approximation we can iterate it l times to get

w_{k+l} - w* = [ ∏_{j=k}^{k+l-1} (I - α_j H) ] (w_k - w*).

Convergence of w_{k+l} to the local minimum w*, for arbitrary w_k, is equivalent to ∏_{j=k}^{k+l-1} (I - α_j H) converging to the zero matrix as l → ∞. This
latter condition is equivalent to all eigenvalues of ∏_{j=k}^{k+l-1} (I - α_j H) converging to zero. Note that the eigenvectors of ∏_{j=k}^{k+l-1} (I - α_j H) are just the eigenvectors of H. The eigenvalues of ∏_{j=k}^{k+l-1} (I - α_j H) are then easily expressed as ∏_{j=k}^{k+l-1} (1 - α_j λ_m), m = 1, . . . , p, in terms of the eigenvalues λ_m, m = 1, . . . , p, of H. Hence, when the limit exists, convergence to zero occurs if and only if

0 < lim_{k→∞} α_k < min_m (2/λ_m) = 2/λ̄.  □

One can avoid assuming the limit exists by appropriate recourse to lim inf_k α_k, lim sup_k α_k. If α_k stays above 2/λ̄, then w_{k+l} must diverge in magnitude. We illustrate these results with plots of the steepest descent trajectory calculated for 25 iterations on a quadratic surface in two dimensions with eigenvalues of 1, 5, and constant α_k = α. Hence, the bound on convergence for α is .4. In Figures 5.6 and 5.7 we present four plots with α taking on the values .02, .1, .35, .45. In Figure 5.6 we see that a very small learning rate does not allow us to reach the minimum in the allotted training time, whereas a moderately small value enables a smooth approach to the minimum. In Figure 5.7 we see that a large value of the learning rate enables us to converge to the minimum in an erratic fashion. However, a too-large learning rate leads to the predicted divergence. It is clear that a usable fixed choice of learning rate requires experimentation with short trial runs of the training algorithm applied to the specific problem at hand.
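The dichotomy at 2/λ̄ is easy to reproduce; the fragment below uses a made-up quadratic with the same eigenvalues 1 and 5 (so the critical rate is .4) and reports the distance from the minimum after 25 constant-rate iterations at the four learning rates used in the figures.

% Constant learning rate steepest descent on a quadratic with minimum at 0.
H = diag([1 5]); w0 = [50; 5];
for alpha = [0.02 0.1 0.35 0.45],
   w = w0;
   for k = 1:25, w = w - alpha*(H*w); end
   fprintf('alpha = %4.2f   ||w_25|| = %g\n', alpha, norm(w));
end

The rates below .4 shrink the distance to the minimum (ever more erratically as .4 is approached), whereas .45 exceeds 2/λ̄ and the iterates grow.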
FIGURE 5.6. Descent behavior with small learning rate (steepest descent trajectories on the quadratic surface for alpha = 0.02 and alpha = 0.1).
Of course, once the magnitudes of the increments become large, the quadratic approximation to the error fails and the Hessian changes as we move over the error surface. We will subsequently note a possibility for adaptively selecting this step size. Some have suggested that we allow the step size to vary with the parameter components. Following this suggestion, of course, means giving up on moving in the direction of steepest descent; we move at different rates in the different components and hence "angle" downhill.
FIGURE 5.7. Descent behavior with large learning rate (steepest descent trajectories on the quadratic surface for alpha = 0.35 and alpha = 0.45).
The other optimization methods, discussed in the following sections, adjust parameters by moving in nonsteepest descent directions.

It is productive of insight into the behavior of the steepest descent algorithm to linearly transform the parameter vector w. In terms of the (positive) eigenvalues and eigenvectors of H, let D be the diagonal matrix with D_jj = λ_j and U the orthogonal matrix with jth column the eigenvector e_j;

H = Σ_{j=1}^{p} λ_j e_j e_j^T = U D U^T.

Let v = U^T(w - w*). Then

E(v) = E(w*) + (1/2) v^T D v.

The descent algorithm becomes v_{k+1} = v_k - α_k U^T g_k. The gradient h of E with respect to v is h = Dv = U^T g. Hence,

v_{k+1} = v_k - α_k D v_k = (I - α_k D) v_k.

It follows that in this approximation

v_{k+l} = [ ∏_{j=k}^{k+l-1} (I - α_j D) ] v_k.

It is clear that convergence to a local minimum w* occurs if and only if v_{k+l} converges to 0. The advantage of this representation follows from the diagonal character of D; changes in, say, the jth component of v_{k+1} depend only on the corresponding jth component of v_k.
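The decoupling is easily verified numerically; the 2 × 2 Hessian, minimizer, and step size below are arbitrary illustrative choices.

% One descent step computed in the original coordinates w and in the rotated
% coordinates v = U'*(w - wstar); the two answers agree componentwise.
H = [3 1; 1 2]; wstar = [1; -1]; alpha = 0.2;
[U, D] = eig(H);                        % H = U*D*U' for symmetric H
w = [4; 3]; v = U'*(w - wstar);
w_new = w - alpha*H*(w - wstar);        % steepest descent step in w
v_new = (eye(2) - alpha*D)*v;           % the same step, component by component, in v
disp([U'*(w_new - wstar), v_new])       % the two columns coincide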
5.4.3 Learning Rate Schedules
A variation on the constant learning rate is to adopt a deterministic learning rate schedule that varies the learning rate with the iteration number. Learning rate schedules are typically used in conjunction with an online or stochastic gradient search. A schedule proposed by Darken and Moody [52] is given in terms of three user-settable parameters α_0, c, τ through

α_t = α_0 (1 + (c/α_0)(t/τ)) / (1 + (c/α_0)(t/τ) + τ(t/τ)^2).

If t << τ, then the learning rate is nearly constant at its initial value of α_0, whereas for t >> τ the learning rate becomes c/t. The latter condition is argued to lead to optimally rapid convergence provided that, when λ is the smallest eigenvalue of the Hessian (assumed constant), c > 1/(2λ), but to poor performance when c is chosen smaller. Because the value of this minimum eigenvalue is usually unknown to us, Darken and Moody [52] suggest a method for estimating when c is too small from the performance of the algorithm.
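The shape of this schedule is easy to visualize; the settings of α_0, c, and τ below are arbitrary illustrative values, not recommendations.

% "Search then converge" schedule of Darken and Moody with made-up parameters.
alpha0 = 0.5; c = 1.0; tau = 100;
t = 0:1000;
alpha = alpha0 * (1 + (c/alpha0)*(t/tau)) ./ (1 + (c/alpha0)*(t/tau) + tau*(t/tau).^2);
% For t << tau the rate stays near alpha0; for t >> tau it approaches c./t.
semilogy(t(2:end), alpha(2:end), t(2:end), c./t(2:end), '--')

The dashed curve c/t is the asymptote approached once t greatly exceeds τ.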
5.4.4 Adaptively Chosen Step Size
The choice of fixed step size or learning rate is largely dictated by the following qualitative considerations: (a) α not be so small that convergence to satisfactory performance is unreasonably delayed; and (b) α not be so large that we either oscillate about the optimum parameter values or repeatedly select parameter values that are so large that we saturate many of the node nonlinearities and may induce instabilities that manifest themselves in a diverging error E. The determination of a fixed step size typically requires some initial experimentation; such a choice cannot be expected to yield rapid convergence. Combining Eqs. 5.3.8 and 5.3.9 provides an effective method for calculating a good approximation to the one-step optimal variable learning rate. However, the additional computational cost incurred is only somewhat less than that of an additional search step. An ad hoc adaptive algorithm for determining step size can be based on the overshooting considerations by examining the agreement in the direction of successive gradients to detect oscillations. For example, if sign[g_t · g_{t-1}] is positive then we can increase α, whereas if it is negative we can decrease α at the (t+1)st stage of iteration. A different adaptive algorithm for step size, called the bold driver method, was proposed by Battiti [22, p. 161]. In this algorithm we introduce three parameters: error ratio er > 1, learning rate decrease dm < 1, and learning
rate increase im > 1. Let Ek denote the error after the kth iteration of steepest descent. If with the current learning rate the error is increasing, Ek > er Ek−1 , then we decrease the learning rate for the next iteration by setting αk+1 = dm αk . If Ek < Ek−1 , then we increase the learning rate by setting αk+1 = im αk . In the remaining case of Ek−1 < Ek < er Ek−1 , we leave the learning rate unchanged. Typically, er , im are only a few percent greater than 1, but dm is of the order of 1/2. An example of the behavior of this adaptive algorithm in training a single hidden layer network with three nodes on a training set with n = 50, d = 2 is provided in Figure 5.8; the two plots show the variation of error (sse) and learning rate with iteration number. Figure 5.9 shows the variation of error and the value of a single network parameter w1:1,1 as training proceeds. It is clear that even when the error performance seems to have settled down, the network parameters can still change significantly.
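A minimal MATLAB rendering of the bold driver rule is given below; the function name and the particular parameter values are ours and merely illustrate typical magnitudes (er and im a few percent above 1, dm near 1/2).

%adjust the learning rate from the current and previous errors (illustrative)
function alpha = bold_driver(alpha, Ek, Ekm1)
er = 1.04; dm = 0.5; im = 1.03;   % error ratio, decrease factor, increase factor
if Ek > er*Ekm1,                   % error grew appreciably: back off
   alpha = dm*alpha;
elseif Ek < Ekm1,                  % error fell: push the rate up a little
   alpha = im*alpha;
end                                % otherwise alpha is left unchanged

At the end of each iteration one would call alpha = bold_driver(alpha, Ek, Ekm1) with the current and previous sum-squared errors.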
FIGURE 5.8. Sum-squared error and learning rate history (log-scale plots of sum-square error/n, top, and learning rate lr, bottom, versus iteration number).
5.4.5 Summary of Steepest Descent Algorithms
We summarize the several versions of an iterative steepest descent algorithm that use the backpropagation approach of Section 5.2.3 for calculating gradients.
1. (Section 5.4.2) Use a fixed learning rate α in steepest descent—so-called "vanilla" backpropagation.
2. (Sections 5.4.2 and 5.4.3) Select the learning rate adaptively on the basis of the current and most recent performances (sum-squared errors).
3. (Section 5.3.3) Use an approximate line search to determine the learning rate or step size.
4. (Eqs. 5.3.8 and 5.3.9) Use the Hessian to solve for the best learning rate at each iteration.
FIGURE 5.9. Variation of parameter value and sum-squared error (plot of SSE (-) and W_{1:1,1} (.) versus iteration number).
5.4.6 Momentum Smoothing
An ad hoc departure from steepest descent is to add memory to the recursion through a momentum term. Now the change in parameter vector w depends not only on the current gradient but also on the most recent change in parameter vector,

∆_{k+1} = w_{k+1} - w_k = β∆_k - αg_k for k ≥ 0.

What we gain is a high-frequency smoothing effect through the momentum term. To show this we solve this difference equation system to find

∆_k = β^k ∆_0 - α Σ_{j=0}^{k-1} β^{k-1-j} g_j for k ≥ 1.
The change in parameter vector depends not only on the current gradient g k−1 but also, in an exponentially decaying fashion (provided that 0 ≤ β < 1), on all previous gradients. If the succession of recent gradients has tended to alternate directions, then the sum will be relatively small and we will make only small changes in the parameter vector. This could occur if we are in the vicinity of a local minimum; successive changes would just serve to bounce us back and forth past the minimum. If, however, recent gradients tend to align, then we will make an even larger change in the parameter vector and thereby move more rapidly across a large region of descent and possibly cross over a small region of ascent that screened off a deeper local minimum. Of course, if the learning rate α is well chosen, then successive gradients will tend to be orthogonal and a weighted sum will not cancel itself out. The use of ad hoc adaptive or fixed learning rate versions of steepest descent is not likely to be preferred to the iterative methods involving some use of the Hessian which are discussed in the next sections.
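The smoothing effect can be seen directly in a small experiment; the quadratic surface and the values of α and β below are made up for illustration.

% Gradient descent with a momentum term on a made-up quadratic (minimum at 0).
H = diag([1 5]); w = [50; 5]; delta = zeros(2,1);
alpha = 0.1; beta = 0.9;
for k = 1:200,
   g = H*w;                          % gradient at the current point
   delta = beta*delta - alpha*g;     % momentum-smoothed parameter change
   w = w + delta;
end
disp(norm(w))                        % close to the minimum after 200 steps

Setting β = 0 in the same loop recovers plain constant-rate steepest descent.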
5.5 Conjugate Gradient Algorithms
5.5.1 Conjugacy and Its Implications
The motivation behind the conjugate gradient algorithm is that we wish to iteratively select search directions {dk} that are noninterfering in the sense that successive minimizations along these directions do not undo the progress made by previous minimizations. This contrasts with the situation in steepest descent, where later search directions can undo the effects of earlier ones. We now select search directions such that at each iteratively selected parameter value wk, the current gradient gk is orthogonal to all previous search directions d1, . . . , dk-1. Hence, at any given step in the iteration, the error surface has a direction of steepest descent that is orthogonal to the linear subspace of parameters spanned by the prior search directions. Steepest descent merely assured us that the current gradient is orthogonal to the last search direction (Eq. 5.3.5). We assume, as in the case of steepest descent, that the new weight vector is given recursively by

w_{k+1} = w_k + α_k d_k = w_0 + Σ_{i=0}^{k} α_i d_i.  (5.5.1)
The condition for optimality of the linear weighting coefficients {α_i} and search directions is that the error function gradient g_{k+1} = ∇E_T(w_{k+1}) achieved at the new location be orthogonal to all of the previous search directions,

(∀i ≤ k) g_{k+1}^T d_i = 0;  (5.5.2)

there is no remaining projection of the error on the linear space spanned by the search directions d_0, . . . , d_k.
For purposes of this analysis we assume that the error function is truly quadratic and thus has a constant Hessian H that is also assumed to be positive definite (so that there is a unique minimum). Observe then that

g_{k+1} - g_k = H(w_{k+1} - w_k) = α_k H d_k.  (5.5.3)
For the actual nonquadratic error function E_T, Eq. 5.5.3 is only an approximation based on a first-order Taylor's series expansion for the gradient. Proceeding nevertheless,

d_k^T (g_{k+1} - g_k) = α_k d_k^T H d_k,

and choosing α_k to satisfy Eq. 5.3.6 yields

d_k^T g_{k+1} = 0.  (5.5.4)
Continuing with i < k, we have

d_i^T (g_{k+1} - g_k) = α_k d_i^T H d_k.

If i < k, then the requirement of orthogonality implies that

(∀i < k) d_i^T H d_k = 0,  (5.5.5)
and Eq. 5.5.5 being true for all p > k > 1 is the defining condition for {d_i} being H-conjugate directions. The upper bound of p arises because you cannot have more than p pairwise orthogonal, or similarly H-conjugate, vectors in IR^p. Eq. 5.5.5 is then a necessary condition for optimality of the search directions. To verify sufficiency consider, for k > i,

d_i^T (g_k - g_i) = d_i^T H (w_k - w_i).

Note that from Eq. 5.5.1,

w_k - w_i = Σ_{j=i}^{k-1} α_j d_j.

Hence,

d_i^T (g_k - g_i) = Σ_{j=i}^{k-1} α_j d_i^T H d_j.

The H-conjugacy condition of Eq. 5.5.5 allows us to simplify this to

d_i^T (g_k - g_i) = α_i d_i^T H d_i,

and Eq. 5.3.6 for α_i yields

d_i^T g_k = 0.
Theorem 5.5.1 (H-conjugacy) If the error function E_T(w) is quadratic with positive definite Hessian H, choosing the search directions {d_i} to be H-conjugate and the {α_i} to satisfy Eq. 5.3.6 is equivalent to the orthogonality between the current gradient and the past search directions given by

(∀i < k < p) d_i^T g_k = 0.

It is easily verified that conjugate directions {d_i} also form a linearly independent set of directions in weight space. If weight space has dimension p then of course there can only be p linearly independent directions or vectors. To verify linear independence we need only establish that if there is a set of coefficients {γ_i} for which

Σ_{i=0}^{p-1} γ_i d_i = 0,

then these coefficients must be identically 0. To see this, form

d_j^T H Σ_{i=0}^{p-1} γ_i d_i = γ_j d_j^T H d_j = 0.

The first equality is a consequence of conjugacy and the second one comes from the preceding hypothesis. The positive definiteness of H enables us to conclude that each γ_j = 0, and linear independence is affirmed for conjugate directions. Hence, we can represent any point as a linear combination of no more than p of the conjugate directions, and in particular if w* is the sought location of the minimum of the error function, then there exist coefficients such that

w* - w_0 = Σ_{i=0}^{p-1} α_i* d_i.
Theorem 5.5.2 (Efficiency of Conjugate Gradients) If the error function is quadratic with a positive definite Hessian, then selecting H-conjugate search directions and learning rates according to Eq. 5.3.6 is guaranteed to achieve the minimum in no more than p iterations. Of course, in a neural network setting the parameter p can be quite large, and the error surface is palpably not quadratic for large enough learning rates.
5.5.2 Selecting Conjugate Gradient Directions
To be able to apply the method of conjugate gradients we must be able to determine such a set of directions and then solve for the correct coefficients. The solution for the learning rate coefficients is determined by a line search
or by Eq. 5.3.6 as evaluated by a finite difference, as illustrated in Eq. 5.3.9. The use of a line search to find the minimizing step, as discussed in Section 5.3.3, obviates the need for the Hessian and enables us to properly extend the method of conjugate directions to the real case where the error function E is not a quadratic surface and the step size formula of Eq. 5.3.6 is no longer correct for the minimizing step. A version of the conjugate gradient algorithm that does not require line searches is developed by Møller in [169] and uses the finite difference method for estimating Hd_k that was shown in Eq. 5.3.9. A widely used means for establishing the conjugate gradient directions is to initialize with d_0 = -g_0, introduce a scaling β_k to be determined, and then iterate with the simple recursion

d_{k+1} = -g_{k+1} + β_k d_k.  (5.5.6)

The conjugacy condition Eq. 5.5.5 and the recursion of Eq. 5.5.6 yield

d_k^T H d_{k+1} = 0 = d_k^T H(-g_{k+1} + β_k d_k).

Solving yields the necessary condition that

β_k = (d_k^T H g_{k+1}) / (d_k^T H d_k).  (5.5.7)
Induction can be used to establish that this recursive definition of conjugate gradient search directions does indeed yield a fully conjugate set when the error function is quadratic, although our derivation of Eq. 5.5.7 only established that dk , dk+1 are conjugate.
Theorem 5.5.3 (Establishing Conjugacy) Let the error surface be quadratic in a p-dimensional parameter with Hessian H. If we define the search directions {d_k} through Eqs. 5.5.6 and 5.5.7, the parameter increments through Eq. 5.5.1, and the learning rate α_k through Eq. 5.3.6, then these are conjugate gradient directions and satisfy for any k′ < p:

conjugacy:              (∀ j < k ≤ k′) d_k^T H d_j = 0,
noninterference:        (∀ j < k ≤ k′) g_k^T d_j = 0,
gradient orthogonality: (∀ j < k ≤ k′) g_k^T g_j = 0.
Proof. See Appendix 2. □

We can re-express the β coefficient given by Eq. 5.5.7, to eliminate the dependence on the Hessian, by multiplying numerator and denominator by α_k, reordering the quadratic forms, and recalling Eq. 5.5.3, to find that

β_k = g_{k+1}^T (g_{k+1} - g_k) / d_k^T (g_{k+1} - g_k).

Of course, from Theorem 5.5.3 we have that d_k^T g_{k+1} = 0. To re-express the term d_k^T g_k, use the recursive construction to write it as (-g_k + β_{k-1} d_{k-1})^T g_k and use Theorem 5.5.3 to conclude that d_k^T (g_{k+1} - g_k) = g_k^T g_k and hence that

β_k = g_{k+1}^T (g_{k+1} - g_k) / (g_k^T g_k).  (5.5.8)
The advantage of this expression, known as the Polak-Ribiere formula, is that it does not require knowledge of the Hessian. Another formulation uses the orthogonality of the successive gradients given by Theorem 5.5.3 to write more simply

β_k = (g_{k+1}^T g_{k+1}) / (g_k^T g_k),  (5.5.9)

and this is known as the Fletcher-Reeves formula. However, Kramer and Sangiovanni-Vincentelli [132], Fletcher [76, p. 85], and others suggest that when we return to the real problem of a nonquadratic error function, for which these expressions are no longer equivalent, the Polak-Ribiere formula, in which small gradient differences lead to steepest descent training, is preferred. With these choices for conjugate gradient search we can expect to satisfy the descent condition of Eq. 5.3.3 and ensure that each iteration reduces the error function.

Theorem 5.5.4 (Conjugacy and Descent) Conjugate gradient search directions {d_k} generated by Eq. 5.5.6, with β_k given by Eq. 5.5.9 and learning rate α_k given by a line search in the nonquadratic case, satisfy the descent condition g_k^T d_k < 0.

Proof. By Eq. 5.5.6, g_k^T d_k = g_k^T (-g_k + β_{k-1} d_{k-1}). However, by Theorem 5.5.3, g_k^T d_{k-1} = 0. Hence,

g_k^T d_k = -g_k^T g_k < 0.  □
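For reference, the two β formulas can be written side by side as anonymous MATLAB functions; the helper names and the sample gradients below are ours, purely for illustration.

% The two choices of beta from successive gradients gk and gk1.
beta_pr = @(gk, gk1) (gk1'*(gk1 - gk)) / (gk'*gk);   % Polak-Ribiere, Eq. 5.5.8
beta_fr = @(gk, gk1) (gk1'*gk1) / (gk'*gk);          % Fletcher-Reeves, Eq. 5.5.9
gk = [1; 2]; gk1 = [0.5; -0.4];                      % made-up gradient values
[beta_pr(gk, gk1), beta_fr(gk, gk1)]

On a truly quadratic surface the two values coincide; away from that ideal the Polak-Ribiere value collapses toward zero when successive gradients barely differ, nudging the search back toward steepest descent.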
5.5.3 Restart and Performance
Periodically restarting the conjugate gradient algorithm is recommended by many as a means to cope with departures from the ideal. There can be no more than p linearly independent and H-conjugate vectors in IR^p. Of course, this is not an issue if the error function is truly quadratic; in that case we are guaranteed to identify the minimum in no more than p iterations, and we will halt. In the case of actual (nonquadratic) error functions, acceptable performance may not have been achieved by the pth iteration and the algorithm will continue. However, we can no longer maintain conjugacy over more than p directions. Furthermore, because the error function is not quadratic, the true Hessian has been varying as we iterate and the assumptions underlying the method of conjugate gradients become less realistic. Lastly, the choice of nearly optimal learning rate is difficult to achieve, but essential for conjugacy. These factors combine in real applications to make it difficult to maintain a good approximation to conjugacy after many iterations. Restarting helps to repair this situation. The performance of the conjugate gradient algorithm with restart at r, and under the idealized assumption of a quadratic error, can be indicated in terms of λ_j, the jth largest eigenvalue of the Hessian, through (Luenberger [153, p. 250])

|E_T(w_{k+1}) - E_T(w*)| ≈ ((λ_{p-r+1} - λ_1)/(λ_{p-r+1} + λ_1))^2 |E_T(w_k) - E_T(w*)|.
When r = 1 we have steepest descent and when r = p we have full convergence, as expected from the method of conjugate gradients. Both the steepest descent and conjugate gradient algorithms exhibit only linear convergence in that the magnitude of the error in the (k + 1)st step of iteration is proportional to the magnitude of the error in the kth step. However, the conjugate gradient algorithm can better condition its performance by acting as if it had eliminated the r − 1 largest eigenvalues of the Hessian, thereby reducing the prefactor multiplying the error magnitude term by effectively improving the condition number of the Hessian. An illustration of the performance of conjugate gradient search in the case of a two-dimensional quadratic with condition number 5 is given in Figure 5.10—compare this to the same plot for steepest descent in Figure 5.5.
5.5.4 Summary of the Conjugate Gradient Directions Algorithm

We summarize:
1. Select an initial point w_0, likely chosen at random and with high probability to have values small enough not to saturate the network nodes.
2. Use backpropagation to evaluate the error function E_0 = E_T(w_0) and its gradient g_0 = ∇E_T(w_0).
3. Compute the first search direction d_0 = -g_0, taken initially as if this were steepest descent, and set the cycle index k = 0.
4. Evaluate the step size α_k by conducting an approximate line search for the minimum over α of E_T(w_k + αd_k) (Section 5.3.3) using either quadratic or cubic interpolation until the stopping conditions of Eq. 5.3.10 are satisfied and w_{k+1} = w_k + α_k d_k has been determined. An alternative worth considering is the self-scaling process described by Møller [169] and based on the approximate evaluation of Hd_k as discussed in Section 5.3.2. (The choice of good step size is critical to achieving conjugacy.)
5. In applying the cubic interpolation or evaluating the stopping condition Eq. 5.3.10b, you will have already used backpropagation to evaluate E_{k+1}, g_{k+1}, the value and gradient of the error function at w_{k+1}.
6. Compute β_k using the Polak-Ribiere formula of Eq. 5.5.8 and the previously calculated gradients g_k, g_{k+1}.
7. Compute the next search direction d_{k+1} = -g_{k+1} + β_k d_k.
8. Return to the operations of Step 4 after incrementing the index k by 1.
9. If the iteration index k is a multiple of a chosen m less than or equal to the dimension of w, then restart this algorithm by returning to Step 2 and setting d_k = -g_k.
10. Halt according to any of the stopping criteria listed in Section 5.3.4.
5.6 Quasi-Newton Algorithms
5.6.1 Outline of Approach
If the error function is truly quadratic, then as pointed out in Section 5.1.2, we can solve for the minimizing weight vector in a single step through Newton’s method. Reproducing Eq. 5.1.6, w∗ = w0 − H−1 g. This solution requires knowledge of the Hessian and assumes it to be constant and positive definite. We need a solution method that can take into account:
1. the variation of H(w) with w;
2. the fact that the error function is at best only approximately quadratic;
3. at some remove from a local minimum the approximating quadratic surface is likely to have a Hessian that is not positive definite; and
4. that the evaluation of the true Hessian is computationally too expensive.
FIGURE 5.10. Conjugate gradient trajectory, r = 5.
The quasi-Newton methods address themselves to these tasks by first generalizing the iterative algorithm to the form

w_{k+1} = w_k - α_k M_k g_k.

The choice of step size α_k to use with a search direction d_k = M_k g_k is determined by an approximate line search, and use of line search is essential to the success of this method. When w_k is close to a local minimum w* of E, then we expect that α_k ≈ 1, M_k ≈ H^{-1}(w*). From the discussion in Section 5.3.2, we recall that the descent condition is met if we select M_k to be a positive definite matrix. An example of a family of such matrices is

M_k = [ε_k I + H(w_k)]^{-1}.

If ε_k = 0 then we have the Newton formulation. If ε_k is large, then this is approximately steepest descent. In the event that H_k is not positive definite, then M_k can be made so by choosing ε_k greater than max_i(-λ_i) for {λ_i} the eigenvalues of the Hessian. This formulation interpolates between steepest descent and Newton's method and allows us to treat a Hessian
that is not positive definite. However, it still requires us to determine the Hessian.

The family of quasi-Newton methods iteratively tracks the inverse of the Hessian without ever computing it directly. To understand why this is possible, let q_k = g_{k+1} - g_k, and consider the expansion for the gradient (exact in the quadratic case)

q_k = H_k(w_{k+1} - w_k) = H_k p_k.

If we can evaluate the difference of gradients for p linearly independent increments p_0, . . . , p_{p-1} in the weight vectors, then we can solve for the Hessian (assumed constant). To do so, form the matrices P with ith column the vector p_{i-1} and Q with ith column the vector q_{i-1}. Then we have the matrix equation Q = HP, which can be solved for the Hessian, when the columns of P are linearly independent, through H = QP^{-1}. For example, when we use conjugate gradient directions in the quadratic error function case, we know from Section 5.5.1 that the search directions, and hence the p_k, are linearly independent. Thus from the increments in the gradient induced by increments in the weight vectors as training proceeds, we have some hope of being able to track the Hessian. An approximation to the inverse M of the Hessian is achieved by interchanging q_k and p_k in an approximation to the Hessian itself,

M = PQ^{-1}.

Hence, the information is available in the sequence of gradients that determine the {q_k}, and the sequence of search directions and learning rates that determine the {p_k}, to infer the inverse of the Hessian, particularly if it is only slowly varying. In outline, a quasi-Newton algorithm operates as follows:

(a) construct a positive definite approximation M_k to the inverse of the Hessian at w_k that satisfies p_k = M_k q_k;
(b) select a search direction d_k = -M_k g_k;
(c) perform an approximate line search to determine the learning rate α_k in the recursion w_{k+1} = w_k + α_k d_k;
(d) re-evaluate the approximate inverse Hessian M_{k+1} at w_{k+1}; and
(e) iterate on the preceding.
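The claim that the Hessian can be recovered, and hence tracked, from gradient differences is easily checked on a synthetic quadratic; every quantity below is randomly generated purely for illustration.

% Recover a constant Hessian from p linearly independent weight increments.
p = 3;
A = randn(p); H = A*A' + p*eye(p);      % hypothetical constant Hessian
W = randn(p, p+1);                      % p+1 weight vectors give p increments
P = diff(W, 1, 2);                      % columns p_i = w_{i+1} - w_i
Q = H*P;                                % columns q_i = g_{i+1} - g_i (quadratic case)
H_est = Q/P;                            % solves Q = H*P when P is invertible
disp(norm(H_est - H))                   % essentially zero
disp(norm(P/Q - inv(H)))                % and P*Q^{-1} recovers the inverse Hessian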
5.6.2 A Quasi-Newton Implementation via BFGS

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton algorithm implements these ideas as follows (e.g., Luenberger [153, Sec. 9.4], Battiti [22]). The update for the approximate inverse M of the Hessian is given by

M_{k+1} = M_k + (1 + (q_k^T M_k q_k)/(q_k^T p_k)) (p_k p_k^T)/(p_k^T q_k) - (p_k q_k^T M_k + M_k q_k p_k^T)/(q_k^T p_k).  (5.6.1)
This recursion is initialized by starting with a positive definite matrix such as the identity, M_0 = I. The BFGS quasi-Newton algorithm then proceeds as outlined above. The determination of the learning rate is critical, as was the case for the method of conjugate directions. Improvements to the performance of this algorithm can be gained through preconditioning to obtain more favorable sets of eigenvalues, but we leave the details to the references. Fletcher [76, pp. 54–56] asserts that the preceding BFGS quasi-Newton method enjoys the property of superlinear convergence for general (nonquadratic) functions. Superlinear convergence is one in which, for E* the value at an appropriate local minimum,

lim_{k→∞} |E_{k+1} - E*| / |E_k - E*| = 0.
Hence, quasi-Newton methods enjoy asymptotically more rapid convergence than that of steepest descent or conjugate gradient methods.
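For convenience, Eq. 5.6.1 can be isolated as a one-line MATLAB function (the function name and argument order are ours); it is the same update that appears in the listing of Section 5.13.4.

%BFGS update of the inverse-Hessian estimate from one parameter/gradient increment
function M = bfgs_update(M, p, q)
% p = w_{k+1} - w_k, q = g_{k+1} - g_k, M = current inverse-Hessian estimate
M = M + (1 + (q'*M*q)/(q'*p))*(p*p')/(p'*q) - (p*q'*M + M*q*p')/(q'*p);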
5.7 Levenberg-Marquardt Algorithms

The Levenberg-Marquardt algorithm (LMA) (e.g., Press et al. [191]) exploits the fact that the error function is a sum of squares,

E_T = (1/2) Σ_{i=1}^{n} (y_i - t_i)^2 = (1/2) Σ_{i=1}^{n} e_i^2.
Introduce the following notation for the error vector and its Jacobian with respect to the network parameters w,

e = [e_i], J = [J_ij], J_ij = ∂e_j/∂w_i, i = 1 : p, j = 1 : n.
The Jacobian J is a large p × n matrix, all of whose elements are calculated directly by backpropagation as presented in Section 5.2.3. We can now express the p-dimensional gradient g for the quadratic error function as

g(w) = Σ_{i=1}^{n} e_i ∇e_i(w) = Je.
Proceeding to the Hessian we find H = [H_ij],

H_ij = ∂²E_T/(∂w_i ∂w_j) = (1/2) Σ_{k=1}^{n} ∂²e_k²/(∂w_i ∂w_j) = Σ_{k=1}^{n} [ e_k ∂²e_k/(∂w_i ∂w_j) + (∂e_k/∂w_i)(∂e_k/∂w_j) ] = Σ_{k=1}^{n} [ e_k ∂²e_k/(∂w_i ∂w_j) + J_ik J_jk ].
Hence, defining

D = Σ_{i=1}^{n} e_i ∇²e_i
yields the expression

H(w) = JJ^T + D.

The key to Levenberg-Marquardt is to approximate this expression for the Hessian by replacing the matrix D involving second derivatives by the much simpler positively scaled unit matrix εI. The LMA is a descent algorithm using this approximation in the form

M_k = [JJ^T + εI]^{-1}, w_{k+1} = w_k - α_k M_k g(w_k).

Successful use of LMA requires approximate line search to determine the learning rate α_k. The matrix JJ^T is automatically symmetric and nonnegative definite. The typically large size of J may necessitate careful memory management in evaluating the product JJ^T. Hence, any positive ε will ensure that M_k is positive definite, as required by the descent condition. Moreover, by choosing ε large enough we can ensure that the matrix whose inverse we are calculating is also well-conditioned (i.e., ratio of largest to smallest eigenvalue is not too large). The performance of the algorithm does depend on the choice of ε. Hagan and Menhaj [89] propose to increase ε by a multiplicative factor β when E_T(w_{k+1}) > E_T(w_k), and to decrease ε by the same factor when the reverse is true. Of course, for large ε we are effectively doing steepest descent. Another view of the role of the scaling parameter ε is obtained if we consider adding regularization (see Section 6.3) in the form of the term
(1/2)λ w^T w added to the empirical error E_T(w). The gradient g_λ of the sum of terms is the gradient g of the empirical error plus λw,
g_λ(w) = g(w) + λw = Je + λw.

The Hessian H_λ of the regularized expression is given by

H_λ(w) = H + λI ≈ JJ^T + εI + λI.

Thus, addition of the regularization term affects the Levenberg-Marquardt approximation to the Hessian simply by increasing the scaling from ε to ε + λ. LMA with regularization can be written as

w_{k+1} = w_k - α_k [(ε + λ)I + JJ^T]^{-1} (Je + λw_k).

This suggests that we can interpret the scaling ε as a regularization parameter if we correct the expression for the gradient and approximate the unregularized Hessian just by JJ^T. There remains the need to calculate the inverse of the p × p matrix JJ^T + εI. This can be accomplished by a Cholesky decomposition in which we adopt the representation

JJ^T + εI = GG^T

for G a lower triangular matrix with positive diagonal elements. Such a decomposition is possible if and only if JJ^T + εI is positive definite, and this is assured by ε > 0. The inverse of JJ^T + εI is then readily constructed in terms of the easily obtained inverse G^{-1} of a lower triangular matrix.
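A sketch of a single regularized Levenberg-Marquardt step, using the Cholesky factor for the solve, is given below; the Jacobian, error vector, and parameter settings are random stand-ins for what backpropagation and the training loop would supply, and the variable names are ours.

% One regularized LM step on made-up data; epsk plays the role of the scaling.
p = 6; n = 20;
J = randn(p, n); e = randn(n, 1); w = randn(p, 1);
epsk = 0.01; lambda = 0.001; alpha = 1;
g = J*e + lambda*w;                     % gradient of the regularized error
A = J*J' + (epsk + lambda)*eye(p);      % LM approximation to the Hessian
G = chol(A, 'lower');                   % A = G*G', possible since A is pos. definite
dw = -(G'\(G\g));                       % two triangular solves give A\g
w = w + alpha*dw;
% Hagan and Menhaj adjust the scaling multiplicatively: increase epsk when the
% error rises, decrease it by the same factor when the error falls.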
5.8 Remarks on Computing the Hessian

With time and the exponential growth in computing power, what was once out of reach becomes mundane. Hessians of moderate size can now be computed using brute force, and the sizes for which this is possible will continue to grow. However, there are a number of more efficient approximate approaches. A useful discussion of backpropagation-based calculational methods for the Hessian is provided by Bishop [29, Sec. 4.10], and an overview is given by Buntine and Weigend [34]. Until recently the primary approach in neural network training has been to approximate the Hessian by calculating only the diagonal entries (e.g., Hassibi et al. [94, 95, 96]). The realization that we often want Hd rather than H itself leads to the use of finite difference estimates, as outlined in Section 5.3.2, and to the exact method of Pearlmutter [184]. More complex approximations have been used in such second-order optimization methods as the BFGS version of quasi-Newton (wherein an iterative approximation is constructed for the Hessian based
on what is being learned from the gradient) and the Levenberg-Marquardt algorithms (wherein an approximation is made to the Hessian that drops certain terms) (see Battiti [22], Press et al. [191]). It seems a safe prediction that state-of-the-art neural network training algorithms will come to rely substantially on the Hessian in second-order nonlinear optimization algorithms, and that for moderate-sized networks this Hessian will be computed fully.
5.9 Training Is Intrinsically Difficult

There is little reason to expect that one can find a uniformly best algorithm for selecting the weights in a feedforward net. Blum and Rivest note in the abstract of [32] that:

We consider a 2-layer, 3-node, n-input neural network whose nodes compute linear threshold functions of their inputs. We show that it is NP-complete to decide whether there exist weights and thresholds for the three nodes of this network so that it will produce output consistent with a given set of training examples. . .

This result suggests that those looking for perfect training algorithms cannot escape inherent computational difficulties just by considering only simple or very regular networks. A similar conclusion of the training problem being NP-complete for a network of three sigmoidal nodes is reached by Hoffgen [104]. Judd [120] summarizes his earlier work of 1987 and 1988, establishing that it is an NP-complete problem to find a set of weights for a given neural network and a set of training examples to classify even two-thirds of them correctly. He notes with reference to his work of 1987, "This process of loading information into adaptive networks is problematical and has been shown in the general case to be NP-complete. Hence there is little hope of finding 'learning rules' to load arbitrary feed-forward nets with arbitrary training data in feasible time." While these results have been established for LTU nodes, there is other work on the intractability of continuous node functions (e.g., Das Gupta et al. [53]). Jones [119] studied the problem of accurate training for a single hidden layer network of s monotone sigmoidal nodes when the output layer weights {w_{2:1,i}} are constrained to satisfy the convexity condition

w_{2:1,i} ≥ 0, Σ_{i=1}^{s} w_{2:1,i} = 1,

and he established the following theorem.
Theorem 5.9.1 (Training is NP-hard (Jones [119])) Consider a family N_s of single hidden layer networks having s ≥ 3 nodes, where s = s(d) may vary with the dimension d of the input x, and satisfying the above convexity condition on the output layer weights. Assume all expansions for the data and weights are finite and in base s. If s(d) is bounded by a polynomial in d, then the problem of determining, for a given training set T_n, if a network in N_s exists achieving quadratic error (not divided by n)

E_{T_n} < 1/(4 s^{5/2}),

is NP-hard. With the additional assumption that the node function σ is piecewise-linear, the problem is NP-complete.

As Blum and Rivest [32, pp. 494–495] note:

While NP-completeness does not render a problem totally unapproachable in practice, it usually implies that only small instances of the problem can be solved exactly, and that large instances can at best only be solved approximately, even with large amounts of computer time.

Further discussion of this issue is found in Vidyasagar [244, Sec. 10.2], where it is noted that limiting the input dimension d does avoid the problem of NP-completeness. In sum, one should be skeptical of claims in the literature on training algorithms that the one being proposed is substantially better than most others. Such claims are often defended through a handful of simulations in which the proposed algorithm performed better than some familiar alternative.
5.10 Learning from Several Networks

Although the preceding discussion has focused on identifying a single good network yielding a low value of training error, the algorithms themselves identify large sequences of networks. Furthermore, as noted in Section 5.3.1, the outcome of training is strongly dependent on the choice of starting point. Hence, when one repeats training with a different, randomly chosen initial condition, one rarely ends with the same network. Thus we are encouraged to explore a variety of initial conditions if we are to achieve a good minimum training error. Can we do better by combining several of the networks encountered in our training explorations than by simply choosing the single best network? Hashem [93] establishes that this is the case for even linear combinations of networks. However, a linear combination of given networks results in a new, more complex network, and it is this gain in network complexity that can explain the improved fit to the training
data. The problem of combining the outputs of several neural networks can be related to prior studies on combining the opinions of experts (e.g., see Jacobs and Jordan [114], Lindley [147]). Drucker et al. [64] compare a number of approaches to combining the outputs of several neural networks, with emphasis on the idea of boosting. An alternative is to use wisely our limited computational resources to explore an ensemble of networks only so far as necessary to identify the good ones. Methods are discussed in Maron and Moore [157] and Mukherjee [172] for pruning an ensemble of networks, derived from, say, different initial conditions, as training proceeds. The methods of Mukherjee show only a limited ability to prune accurately while conserving computational effort, although more may eventually be achieved by other means.
5.11 Appendix 1: Positive Definite Matrices

Definition 5.11.1 (Positive Definite Matrix) A matrix H is positive definite if it is square, say, of dimension p × p, and for any nonzero real vector x of dimension p the quadratic form

Q(x) = x^T H x = Σ_{i=1}^{p} Σ_{j=1}^{p} H_ij x_i x_j > 0.
When H is positive definite, then the locus of points for which Q(x) is a positive constant is an ellipsoid.

Theorem 5.11.1 (Positive Definite Matrices) A symmetric matrix H (= H^T) is positive definite if and only if its p eigenvalues λ_1, . . . , λ_p are all positive.

Given any collection of distinct column vectors e_1, . . . , e_p and positive numbers λ_1, . . . , λ_p we can form a positive definite matrix

H = Σ_{i=1}^{p} λ_i e_i e_i^T.
Definition 5.11.2 (Non-Negative Definite Matrix) A matrix H is nonnegative definite if it is square, say, of dimension p × p, and for any real vector x of dimension p xT Hx ≥ 0. Theorem 5.11.2 (Non-Negative Definite Matrices) A symmetric matrix H is non-negative definite if and only if its p eigenvalues are all nonnegative.
Definition 5.11.3 (Indefinite Matrix) A symmetric matrix H is indefinite if neither H nor -H is non-negative definite.
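These facts are easy to probe numerically; the orthonormal vectors and the positive numbers below are arbitrary illustrative choices.

% Build H = sum_i lambda_i e_i e_i' from orthonormal columns and check definiteness.
p = 4;
[U, R] = qr(randn(p));                  % columns of U are orthonormal e_1,...,e_p
lams = [0.5 1 2 7];                     % arbitrary positive numbers
H = U*diag(lams)*U';
min(eig(H))                             % all eigenvalues positive (Theorem 5.11.1)
x = randn(p, 1); x'*H*x                 % a sampled quadratic form, positive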
5.12 Appendix 2: Proof of Theorem 5.5.3

Theorem 5.5.3 (Establishing Conjugacy) Let the error surface be quadratic in a p-dimensional parameter with Hessian H. If we define the search directions {d_k} through Eqs. 5.5.6 and 5.5.7, the parameter increments through Eq. 5.5.1, and the learning rate α_k through Eq. 5.3.6, then these are conjugate gradient directions and satisfy for any k′ < p:

conjugacy:              (∀ j < k ≤ k′) d_k^T H d_j = 0,
noninterference:        (∀ j < k ≤ k′) g_k^T d_j = 0,
gradient orthogonality: (∀ j < k ≤ k′) g_k^T g_j = 0.
Proof. The proof is by induction on k and is initialized by verifying directly the k = 2 cases that

d_2^T H d_1 = d_2^T H d_0 = d_1^T H d_0 = 0, g_2^T d_1 = g_2^T d_0 = g_1^T d_0 = 0, g_2^T g_1 = g_2^T g_0 = g_1^T g_0 = 0.

In this regard it is useful to reorder Eq. 5.5.6 as g_{k+1} = -d_{k+1} + β_k d_k, and recall that for a quadratic error function g_{k+1} - g_k = H(w_{k+1} - w_k) = H(α_k d_k). Note that

d_{k+1}^T H d_k = (-g_{k+1} + β_k d_k)^T H d_k = 0

follows immediately from the solution for β_k from Eq. 5.5.7. Hence, d_2^T H d_1 = d_1^T H d_0 = 0 is true. The claim that g_2^T d_1 = g_1^T d_0 = 0 follows from the optimal choice of learning rate yielding successively orthogonal gradients and search directions, as concluded in Eq. 5.3.5. The claim of g_2^T d_0 = 0 follows from the rewrite

(g_2^T - g_1^T + g_1^T) d_0 = (α_1 d_1^T H + g_1^T) d_0

and from what has just been established. The claim g_1^T g_0 = 0 follows from the initialization d_0 = -g_0 and the preceding. Furthermore,

g_2^T g_1 = g_2^T(-d_1 + β_0 d_0) = 0,
by the preceding. In addition, g_2^T g_0 = -g_2^T d_0 = 0 follows by the preceding. The remaining case is

d_2^T H d_0 = d_2^T(g_1 - g_0)/α_0 = 0,

by the orthogonality just established between the search directions and gradients. We continue the induction proof by assuming the conclusions of the theorem as hypotheses and establishing them with k replaced by k + 1:

for j < k + 1, d_{k+1}^T H d_j = 0,  (5.12.1a)
for j < k + 1, g_{k+1}^T d_j = 0,  (5.12.1b)
for j < k + 1, g_{k+1}^T g_j = 0.  (5.12.1c)

The optimal choice of learning rate immediately yields Eq. 5.12.1b for j = k. Hence, consider j ≤ k - 1,

g_{k+1}^T d_j = (g_{k+1} - g_k + g_k)^T d_j = (α_k H d_k + g_k)^T d_j.

The hypotheses of conjugacy and noninterference show the preceding to be 0, thereby completing the verification of Eq. 5.12.1b. Rewrite Eq. 5.12.1c;

g_{k+1}^T g_j = g_{k+1}^T(-d_j + β_{j-1} d_{j-1}).

Invoking the already established Eq. 5.12.1b verifies Eq. 5.12.1c. Proceeding now to Eq. 5.12.1a, note again that for j = k,

d_{k+1}^T H d_k = (-g_{k+1} + β_k d_k)^T H d_k = 0,

by the solution of Eq. 5.5.7 for β_k. Hence, assume j < k and use conjugacy to derive

d_{k+1}^T H d_j = -g_{k+1}^T H d_j = -g_{k+1}^T(g_{j+1} - g_j)/α_j.

Invoking Eq. 5.12.1c now enables us to conclude with Eq. 5.12.1a. □
5.13 Appendix 3: MATLAB Listings of Single-Output Training Algorithms
5.13 Appendix 3: MATLAB Listings of Single-Output Training Algorithms For illustrative purposes and to clarify the discussion in the previous sections on particular training algorithms, we append MATLAB listings of steepest descent with one-step optimal learning rate, quasi-Newton, and Levenberg-Marquardt training algorithms, some with regularization. These programs are developed primarily for the basic case of a single hidden layer (1HL) with linear output node. We also provide the subprograms (e.g., backpropagation-based gradient evaluation) that they call. There is no warranty as to the accuracy or efficiency of these programs. We assume throughout that the training data Tn is represented as a (d + 1) × n matrix T with feature vector xi occupying the first d elements of the ith column and scalar target ti the (d + 1)st element of the ith column. The d × n matrix S = T (1 : d, :) of inputs is all but the last row of T . All of our training algorithms require us to evaluate the network response y = η(x, w) to a given input x, the sum-squared (empirical) error ET (w) over the training set, and the gradient of the empirical error ET (w) with respect to the parameters w. We present these evaluations for a 1HL network first with the parameters enumerated as w1, b1, w2, b2, corresponding to separate specifications for the first layer weights w1 and biases b1 connecting inputs to hidden layer nodes and weights w2 connecting the hidden layer nodes to the single linear output with bias b2. We then develop programs with all these parameters stacked together as a single column vector w = ww as specified in the listing for ntout1. This latter specification is less economical in terms of computation (but not significantly so), but it is much easier to use in network design experiments.
5.13.1 1HL Network Response, Gradient, and Sum-Squared Error
The function netout1 calculates the responses of the network, to each of the inputs, at each layer, including the derivatives of the responses in the first layer nodes:

%compute 1HL network response
function [a1, d1, a2]=netout1(node,w1,b1,w2,b2,S)
%specification of the network parameters
%w1 is s1xd, b1 is s1x1, w2 is 1xs1, S is dxn
a1=w1*S+b1*ones(1,size(S,2));
if strcmp(node,'logistic'),
   a1=1./(1+exp(-a1));
   d1=a1.*(1-a1);
elseif strcmp(node,'tanh'),
   a1=tanh(a1);
   d1=1-a1.^2;
else,
   error('Improper node specification');
end
a2=w2*a1+b2;

The program grad1 uses backpropagation to calculate the gradients of the squared-error function with respect to the network parameters:

%calculate 1HL gradients
function [gw1, gb1, gw2, gb2]=grad1(node,w1,b1,w2,b2,T)
[c,n]=size(T);
S=T(1:c-1,:);
[a1 d1 a2]=netout1(node,w1,b1,w2,b2,S);
delta2=a2-T(c,:);
delta1=d1.*(w2'*delta2);
gb1=delta1*ones(n,1);
gb2=delta2*ones(n,1);
gw1=delta1*S';
gw2=delta2*a1';

The program sse1 determines performance through sum-squared error on T:

%calculate sum-squared error for 1HL
% this error is not scaled
function [err]=sse1(node,w1,b1,w2,b2,T)
[a,b]=size(T);
S=T(1:a-1,:);
[a1 d1 a2]=netout1(node,w1,b1,w2,b2,S);
err=a2-T(a,:);
err=err*err';

It is often more convenient to consider the network parameters as stacked into a single vector ww. Versions of the preceding programs that accept ww as input are the following. The mapping between ww and w1, b1, w2, b2 is spelled out in the next program.

%compute 1HL network output
function [a1, d1, a2]=ntout1(node,ww,S)
%ww is p=s1(d+2)+1 by 1
[d,n]=size(S);
p=length(ww);
s1=(p-1)/(d+2);
q=s1*d;
w1=reshape(ww(1:q),s1,d);
b1=ww(q+1:q+s1);
w2=ww(q+s1+1:p-1)';
b2=ww(p);
[a1 d1 a2]=netout1(node,w1,b1,w2,b2,S);
%calculate 1HL gradients
%regularization terms added in training programs
function [gww]=grd1(node,ww,T)
[c,n]=size(T);
t=T(c,:);
S=T(1:c-1,:);
pp=length(ww);
s1=(pp-1)/(c+1);
[a1 d1 a2]=ntout1(node,ww,S);
%unstack ww
w1=reshape(ww(1:s1*(c-1)),s1,c-1);
b1=ww(s1*(c-1)+1:s1*c);
w2=ww(s1*c+1:s1*(c+1));
b2=ww(pp);
delta2=a2-t;
delta1=d1.*(w2*delta2);
gb1=delta1*ones(n,1);
gb2=delta2*ones(n,1);
gw1=delta1*S';
gw2=a1*delta2';
%stack gradients
gww=reshape(gw1,(c-1)*s1,1);
gww=[gww;gb1;gw2;gb2];
%calculate sum-squared error for 1HL
function [err]=ss1(node,ww,T)
c=size(T,1);
[a1 d1 a2]=ntout1(node,ww,T(1:c-1,:));
err=a2-T(c,:);
err=err*err';
5.13.2 1HL Steepest Descent
The preceding programs are then called by the steepest descent training algorithm SD1.m. This program estimates the one-step optimal learning rate by finite difference approximation to differentiation; the parameter probe, used to approximate a derivative by a finite difference, should be set to a small value (e.g., 1e-4). The output wwbest is the best parameter vector found in the search over the runlim iterations. Training is assisted by internally linearly rescaling the matrix T to standardize it.

%1HL Steepest Descent training, optimal alpha,
%with regularization
%calls grd1.m, ntout1.m
function [wwbest,err,gdnorm,alpha,fl,cptime]=SD1(runpar,node,ww,T)
%runpar sets training parameters steps, runlim;
%node specifies the node function %set probe increment probe=runpar(1); %set multiplicative modification factor modify=runpar(2); runlim=runpar(3); %regularization multiplier lambda=runpar(4); cptime=cputime; fl=flops; [c,n]=size(T); d=c-1; S=T(1:d,:); t=T(c,:); %standardize t and undo on conclusion m=mean(t); st=std(t); t=(t-m)./st; %standardize S mm=mean(S’); S=S-mm’*ones(1,n); ss=std(S’); S=S./(ss’*ones(1,n)); T=[S;t]; pp=length(ww); s1=(pp-1)/(d+2); err=ones(1,runlim); gdnorm=zeros(1,runlim); [a1 d1 a2]=ntout1(node,ww,S); er=t-a2; err(1)=er*er’; minerr=err(1); wwbest=ww; alpha=zeros(1,runlim); %initialize %weight decay regularization %grd=grd1(node,ww,T)+lambda*ww; %Bartlett regularization
grd=grd1(node,ww,T)+lambda*sum(sign(ww(pp-s1:pp))); %start training iterations for k=1:runlim, dir=-grd; %weight decay regularization %grdincr=grd1(node,ww+probe*dir,T)+lambda*probe*dir; %Bartlett regularization grdincr=grd1(node,ww+probe*dir,T)+... lambda*sum(sign(ww(pp-s1:pp)+probe*dir(pp-s1:pp))); alpha(k)=probe/(1-((grdincr’*grd)/(grd’*grd))); if alpha(k)<=0, alpha(k)=.001; end alpha(k)=modify*alpha(k); p=alpha(k)*dir; ww=ww+p; %weight decay regularization %grd=grd1(node,ww,T)+lambda*ww; %Bartlett regularization grd=grd1(node,ww,T)+lambda*sum(sign(ww(pp-s1:pp-1))); gdnorm(k)=grd’*grd; [a1 d1 a2]=ntout1(node,ww,S); er=t-a2; err(k)=er*er’; if err(k)<minerr, minerr=err(k); wwbest=ww; end end %undo standardization on S w1=reshape(wwbest(1:d*s1),s1,d); w1=w1./(ones(s1,1)*ss); b1=wwbest(d*s1+1:d*s1+s1); b1=b1-w1*mm’; wwbest(1:d*s1)=reshape(w1,d*s1,1); wwbest(d*s1+1:d*s1+s1)=b1;
%undo standardization on t wwbest(pp-s1:pp-1)=st.*wwbest(pp-s1:pp-1); wwbest(pp)=st*wwbest(pp)+m; scale=(st^2)/n;
err=scale*err; fl=flops-fl; cptime=cputime-cptime;
subplot(3,1,1), semilogy(err),grid,title([’SD sum-squared Training Errors/sample,... probe=’,num2str(probe),... ’, modify=’,num2str(modify)]) subplot(3,1,2), semilogy(alpha),grid,title(’Learning Rate’), subplot(3,1,3), semilogy(gdnorm),grid, title(’Squared Norm of Gradient’)
5.13.3 1HL Conjugate Gradient
Eq. 5.3.9 is used in evaluation of learning rate in support of the conjugate gradient algorithm of Section 5.5. An alternative is to use the line search algorithms presented later. %conjugate gradient search for 1HL function [wwbest,err,gdnorm,fl,cptime]=CG1(runpar,node,wwinit,T) %choose a restart fraction of pp=length(wwinit) restart=runpar(1); %choose a runlim as a multiple of restart runmult=runpar(2); %needed to calculate learning rate probe=runpar(3); gamma=runpar(4); cptime=cputime; fl=flops; [c,n]=size(T); d=c-1; S=T(1:d,:); t=T(c,:); %standardize t and undo on conclusion m=mean(t); st=std(t); t=(t-m)./st; %standardize S mm=mean(S’);
S=S-mm’*ones(1,n); ss=std(S’); S=S./(ss’*ones(1,n)); T=[S;t]; pp=length(wwinit); s1=(pp-1)/(d+2); restart=floor(restart*pp); runlim=runmult*restart; err=ones(1,runlim); gdnorm=zeros(1,runlim); alpha=zeros(1,runlim); beta=zeros(1,runlim); [a1 d1 a2]=ntout1(node,wwinit,S); er=t-a2; err(1)=er*er’; minerr=err(1); wwbest=wwinit; ww0=wwinit; for j=1:runmult, %initialize gradient and search direction grd=grd1(node,ww0,T); grdprev=grd; dir=-grd; grdincr=grd1(node,ww0+probe*dir,T); %denom=gamma*(dir’*dir)+dir’*(grdincr-grd)/probe; denom=dir’*(grdincr-grd)/probe; alpha0=-dir’*grd/denom; p=alpha0*dir; ww=ww0+p;
%start training iterations for k=1:restart, kk=(j-1)*restart+k; grd=grd1(node,ww,T); %use Polak-Ribiere form beta(kk)=grd’*(grd-grdprev)./(grdprev’*grdprev); dir=-grd+beta(kk)*dir; grdincr=grd1(node,ww+probe*dir,T);
% denom=gamma*(dir’*dir)+dir’*(grdincr-grd)/probe; denom=dir’*(grdincr-grd)/probe; alpha(kk)=-dir’*grd/denom; p=alpha(kk)*dir; ww=ww+p; grdprev=grd; grd=grd1(node,ww,T); gdnorm(kk)=grd’*grd; [a1 d1 a2]=ntout1(node,ww,S); er=t-a2; err(kk)=er*er’; if err(kk)<minerr, minerr=err(kk); wwbest=ww; end end ww0=ww; end %undo standardization on S w1=reshape(wwbest(1:d*s1),s1,d); w1=w1./(ones(s1,1)*ss); b1=wwbest(d*s1+1:d*s1+s1); b1=b1-w1*mm’; wwbest(1:d*s1)=reshape(w1,d*s1,1); wwbest(d*s1+1:d*s1+s1)=b1;
%undo standardization on t wwbest(pp-s1:pp-1)=st.*wwbest(pp-s1:pp-1); wwbest(pp)=st*wwbest(pp)+m; scale=(st^2)/n; err=scale*err; fl=flops-fl; cptime=cputime-cptime;
subplot(3,1,1), semilogy(err),grid,title([’CG sum-squared Training Errors/sample, probe=’,num2str(probe)]) subplot(3,1,2), semilogy(alpha),grid,title(’Learning Rate’), subplot(3,1,3), semilogy(gdnorm),grid, title(’Norm Sq. Gradient’)
5.13.4 1HL Quasi-Newton
Both quasi-Newton and Levenberg-Marquardt algorithms require a line search (see Section 5.3.3) to determine a good value of learning rate α. We first present the simpler, but less efficient, bisection version and then a cubic interpolation version. %line search based on bisection and gradients for 1HL %with weight decay regularization function [alpha]=bisectreg1(node,steps,lambda,dir,ww,T) %node is node type, dir is search direction, T is training set %initial range of alpha is [0,2], steps is number of searches beta(1)=0; beta(2)=1; beta(3)=2;
for k=1:steps, if (dir’*(lambda*(ww+beta(2)*dir)+grd1(node,ww+beta(2)*dir,T))>=0 beta(3)=beta(2); else beta(1)=beta(2); end beta(2)=(beta(1)+beta(3))/2; end alpha=beta(2); A cubic interpolation line search algorithm incorporating weight decay regularization is provided next: %implement cubic line search with regularization for 1HL %based on Press et al. pp.384-385 function [alpha1]=cubic1(node,lambda,dir,ww,T) %rho sets the required performance improvement of Eq.5.3.10a rho=.1; [c n]=size(T); d=c-1; S=T(1:d,:); t=T(c,:); %initialize search using a quadratic determined by %h(0),h’(0), and h(1) d0=dir’*(grd1(node,ww,T)+lambda*ww);
[a d a2]=ntout1(node,ww,S); er=t-a2; errorig=er*er’ +lambda*ww’*ww; err0=errorig; %try Newton optimal step of 1 alpha1=1; ww1=ww+alpha1*dir; d1=dir’*(grd1(node,ww1,T)+lambda*ww1); [a d a2]=ntout1(node,ww1,S); er=t-a2; err1=er*er’+lambda*ww1’*ww1; if err1<=errorig+rho*d0*alpha1, break; end alpha0=alpha1; %using quadratic only for initialization alpha1=-d0/(2*(err1-errorig-d0)); %iterate until new error is small enough while err1>errorig+rho*d0*alpha1, %find parameters of cubic and its minimum alpha1 par=(1/(alpha0^2 *alpha1^2 *(alpha0-alpha1)))*... [alpha1^2 -alpha0^2 ;-alpha1^3 alpha0^3]*... [err0 -d0*alpha0-errorig; err1 -d0*alpha1-errorig]; alpha0=alpha1; a=par(1); b=par(2); alpha1=(-b+sqrt(b^2 -3*a*d0))/(3*a); %set limits to change in alpha, version of Eq.5.3.10b if alpha1>.5*alpha0, alpha1=.5*alpha0; elseif alpha1<.2*alpha0, alpha1=.2*alpha0; end err0=err1; ww1=ww+alpha1*dir; [a d a2]=ntout1(node,ww1,S); er=t-a2; err1=er*er’+lambda*ww1’*ww1; end
1HL quasi-Newton QN1reg with weight decay regularization is listed next. If regularization is not wanted, then set lambda = 0. Although it
is set to select the BFGS update, the choice of Davidon-Fletcher-Powell is also included.
%1HL Quasi-Newton training with regularization %calls grd1.m, ntout1.m, cubic1.m function [wwbest,err,gdnorm,fl,cptime]=QN1reg(runpar,node,ww,T) %runpar sets training parameters steps, runlim; %node specifies the node function steps=runpar(1); runlim=runpar(2); %set regularization coefficient of w’w lambda=runpar(3); cptime=cputime; fl=flops; [c,n]=size(T); d=c-1; S=T(1:d,:); t=T(c,:); %standardize t and undo on conclusion m=mean(t); st=std(t); t=(t-m)./st; %standardize S mm=mean(S’); S=S-mm’*ones(1,n); ss=std(S’); S=S./(ss’*ones(1,n)); T=[S;t]; pp=length(ww); s1=(pp-1)/(c+1); err=ones(1,runlim); gdnorm=zeros(1,runlim); [a1 d1 a2]=ntout1(node,ww,S); er=t-a2; err(1)=er*er’;
minerr=err(1); wwbest=ww; %initialize M=eye(pp); grd=grd1(node,ww,T)+lambda*ww; %start training iterations for k=2:runlim, q0=grd; dir=-M*q0; %alpha=bisectreg1(node,steps,lambda,dir,ww,T); alpha=cubic1(node,lambda,dir,ww,T); p=alpha*dir; ww=ww+p; grd=grd1(node,ww,T)+lambda*ww; gdnorm(k)=grd’*grd; q=grd-q0; %update M by BFGS M=M+((1+(q’*M*q)./(q’*p))*(p*p’)/(p’*q))-((p*q’*M +M*q*p’)/(q’*p)); %update M by Davidon-Fletcher-Powell %M=M+((p*p’)/(p’*q))-((M*q*q’*M’)/(q’*M*q)); [a1 d1 a2]=ntout1(node,ww,S); er=t-a2; err(k)=er*er’; if err(k)<minerr, minerr=err(k); wwbest=ww; end end %undo standardization on S w1=reshape(wwbest(1:d*s1),s1,d); w1=w1./(ones(s1,1)*ss); b1=wwbest(d*s1+1:d*s1+s1); b1=b1-w1*mm’; wwbest(1:d*s1)=reshape(w1,d*s1,1); wwbest(d*s1+1:d*s1+s1)=b1;
%undo standardization on t wwbest(pp)=st*wwbest(pp)+m’; wwbest(pp-s1:pp-1)=st.*wwbest(pp-s1:pp-1); scale=(st^2)/n;
err=scale*err; fl=flops-fl; cptime=cputime-cptime; subplot(2,1,1), semilogy(err),grid,title([’QN SSE/sample, alpha=’, num2str(alpha),’, lambda=’,num2str(lambda)]), subplot(2,1,2), semilogy(gdnorm),grid,title(’Norm of Gradient’)
5.13.5 1HL Levenberg-Marquardt
Levenberg-Marquardt requires the calculation of a large Jacobian matrix. The size of this matrix can militate against choice of this algorithm. However, it has the advantage of being designed for a quadratic error expression. %compute LM Jacobian and gradient, without regularization %Jacobian J is pxn and grad=Je’ %arrange parameters (rows) as w1,b1,w2,b2 %w1 is s1xd, w2 is 1xs1, b1 is s1,1, b2 is 1x1 %enumerate w1 by columns %ww is stack w1,b1,w2,b2 %program calls ntout1.m function [J,gww]=jacobian1(node,ww,T) [c,n]=size(T); d=c-1; S=T(1:d,:); pp=length(ww); s1=(pp-1)/(c+1); s0=d; s2=1; [a1 d1 a2]=ntout1(node,ww,S); w2=ww(s1*c+1:s1*(c+1))’; delta2=a2-T(c,:); delta1=d1.*(w2’*delta2); gb1=delta1;
gb2=delta2;
for k=1:d,
  gw1((k-1)*s1+1:k*s1,:)=delta1.*(ones(s1,1)*S(k,:));
end
gw2=(ones(s1,1)*delta2).*a1;
%stack
gww=[gw1;gb1;gw2;gb2];
%eliminate e_m multiplicative factor for Jacobian
J=gww./(ones(pp,1)*delta2);
%calculate gradient summed over training
gww=gww*ones(n,1);

The version LM1.m includes weight decay regularization. Cubic line search is used to select the learning rate.

%1HL Levenberg-Marquardt training with regularization
%calls jacobian1.m, ntout1.m, cubic1.m
function [wwbest,err,fl,cptime]=LM1(runpar,node,ww,T)
%runpar sets training parameters
scaling=runpar(1); lambda=runpar(2); runlim=runpar(3);
cptime=cputime; fl=flops;
[c,n]=size(T); d=c-1; S=T(1:d,:); t=T(c,:);
pp=length(ww); s1=(pp-1)/(c+1);
%standardize t and undo on conclusion
m=mean(t); st=std(t); t=(t-m)./st;
%standardize S
mm=mean(S'); S=S-mm'*ones(1,n);
ss=std(S'); S=S./(ss'*ones(1,n));
T=[S;t];
err=ones(1,runlim);
[a1 d1 a2]=ntout1(node,ww,S); err(1)=(t-a2)*(t-a2)';
minerr=err(1); wwbest=ww;
%start training iterations
for k=2:runlim,
  clear J;
  [J,grd]=jacobian1(node,ww,T);
  %add in weight decay regularization to gradient
  grd=grd+lambda*ww;
  M=J*J';
  %add in regularization to Hessian estimate
  M=M+(scaling+lambda)*eye(pp);
  dir=-M\grd;
  alpha=cubic1(node,lambda,dir,ww,T);
  ww=ww+alpha*dir;
  [a1 d1 a2]=ntout1(node,ww,S); err(k)=(t-a2)*(t-a2)';
  if err(k)<minerr, minerr=err(k); wwbest=ww; end
end
%undo standardization on S
w1=reshape(wwbest(1:d*s1),s1,d); w1=w1./(ones(s1,1)*ss);
b1=wwbest(d*s1+1:d*s1+s1); b1=b1-w1*mm';
wwbest(1:d*s1)=reshape(w1,s1*d,1);
wwbest(d*s1+1:d*s1+s1)=b1;
%undo standardization on t
wwbest(pp)=st*wwbest(pp)+m;
wwbest(pp-s1:pp-1)=st.*wwbest(pp-s1:pp-1);
scale=(st^2)/n; err=scale*err;
fl=flops-fl; cptime=cputime-cptime;
semilogy(err), grid,
title(['LM sum-squared Training Errors/sample, scaling=',num2str(scaling)])
5.14 Appendix 4: MATLAB Listings of Multiple-Output Quasi-Newton Training Algorithms

This section provides the programs needed to run quasi-Newton for a network having multiple outputs.

%compute 1HL network response, multiple outputs
function [a1, d1, a2]=netout_mult1(node,w1,b1,w2,b2,S)
%w1 is s1xd, b1 is s1x1, w2 is tout by s1, b2 is tout by 1,
%S is dxn
a1=w1*S+b1*ones(1,size(S,2));
if strcmp(node,'logistic'),
  a1=1./(1+exp(-a1)); d1=a1.*(1-a1);
elseif strcmp(node,'tanh'),
  a1=tanh(a1); d1=1-a1.^2;
else, error('Improper node specification');
end
a2=w2*a1+b2*ones(1,size(S,2));

%compute 1HL network multiple outputs in ww specification
function [a1, d1, a2]=ntout_mult1(node,tout,ww,S)
%tout is target variable dimension
%w1 is s1xd, b1 is s1x1, w2 is tout x s1, b2 is tout x 1
%ww is pp=(d+tout+1)*s1+tout by 1 with tout output variables
d=size(S,1); pp=length(ww); s1=(pp-tout)/(d+tout+1);
q=s1*d;
w1=reshape(ww(1:q),s1,d);
b1=ww(q+1:q+s1);
w2=reshape(ww(q+s1+1:pp-tout),tout,s1);
b2=ww(pp-tout+1:pp);
[a1 d1 a2]=netout_mult1(node,w1,b1,w2,b2,S);

%calculate the gradient for multiple outputs in a 1HL
function [gww]=grd_mult1(node,tout,ww,T)
[c,n]=size(T); t=T(c-tout+1:c,:); d=c-tout; S=T(1:d,:);
pp=length(ww); s1=(pp-tout)/(c+1);
[a1 d1 a2]=ntout_mult1(node,tout,ww,S);
%unstack ww
w1=reshape(ww(1:s1*d),s1,d);
b1=ww(s1*d+1:s1*(d+1));
w2=reshape(ww(s1*(d+1)+1:pp-tout),tout,s1);
b2=ww(pp-tout+1:pp);
delta2=a2-t;
delta1=d1.*(w2'*delta2);
gb1=delta1*ones(n,1);
gb2=delta2*ones(n,1);
gw1=delta1*S';
gw2=delta2*a1';
%stack gradients
gww=reshape(gw1,d*s1,1);
gww=[gww;gb1];
gw2=reshape(gw2,tout*s1,1);
gww=[gww;gw2;gb2];

%calculate scalar sum-squared error for 1HL
%multiple outputs
function [err]=ss_mult1(node,tout,ww,T)
a=size(T,1); S=T(1:a-tout,:);
[a1 d1 a2]=ntout_mult1(node,tout,ww,S);
er=a2-T(a-tout+1:a,:);
err=0;
for k=1:tout,
  err=err+er(k,:)*er(k,:)';
end

%implement cubic line search with
%weight decay regularization for 1HL
%based on Press et al. pp.384-385
function [alpha]=cubic_mult1(node,tout,lambda,dir,ww,T)
%zeta sets required improvement in performance
zeta=.05;
[c n]=size(T); d=c-tout; S=T(1:d,:); t=T(d+1:c,:);
ww0=ww;
d0=dir'*(grd_mult1(node,tout,ww0,T)+lambda*ww0);
[a1 d1 a2]=ntout_mult1(node,tout,ww0,S); er=t-a2;
err0=sum(diag(er*er'))+lambda*ww0'*ww0;
errorig=err0; alpha0=0;
%try Newton optimal step of 1
alpha1=1; ww1=ww0+alpha1*dir;
d1=dir'*(grd_mult1(node,tout,ww1,T)+lambda*ww1);
[a d a2]=ntout_mult1(node,tout,ww1,S); er=t-a2;
err1=sum(diag(er*er'))+lambda*ww1'*ww1;
if err1<=errorig+zeta*d0*alpha1, alpha=alpha1; break; end
alpha0=alpha1;
alpha1=-d0/(2*(err1-err0-d0));
%iterate only until new error is small enough
while err1>errorig+zeta*d0*alpha1,
  %find parameters of cubic and its minimum alpha1
  par=(1/(alpha0^2*alpha1^2*(alpha0-alpha1)))*...
    [alpha1^2 -alpha0^2; -alpha1^3 alpha0^3]*...
    [err0-d0*alpha0-errorig; err1-d0*alpha1-errorig];
  alpha0=alpha1; a=par(1); b=par(2);
  alpha1=(-b+sqrt(b^2-3*a*d0))/(3*a);
  %set limits to change in alpha
  if alpha1>.5*alpha0, alpha1=.5*alpha0;
  elseif alpha1<.2*alpha0, alpha1=.2*alpha0;
  end
  err0=err1; ww1=ww0+alpha1*dir;
  [a d a2]=ntout_mult1(node,tout,ww1,S); er=t-a2;
  err1=sum(diag(er*er'))+lambda*ww1'*ww1;
end
alpha=alpha1;
%1HL Quasi-Newton training with regularization
%calls grd_mult1.m, ss_mult1.m, cubic_mult1.m
%multiple outputs
function [wwbest,err,fl,cptime]=QN_mult1(runpar,node,tout,ww,T)
%runpar sets training parameters steps, runlim;
%node specifies node function
steps=runpar(1); runlim=runpar(2);
%set regularization coefficient of w'w
lambda=runpar(3);
cptime=cputime; fl=flops;
[c,n]=size(T); d=c-tout; S=T(1:d,:); t=T(d+1:c,:);
%standardize t and undo on conclusion
m=mean(t'); %st=std(t);
t=t-m'*ones(1,n);
%standardize S
mm=mean(S'); S=S-mm'*ones(1,n);
ss=std(S'); S=S./(ss'*ones(1,n));
T=[S;t];
pp=length(ww); s1=(pp-tout)/(c+1);
err=ones(1,runlim);
err(1)=ss_mult1(node,tout,ww,T);
minerr=err(1); wwbest=ww;
%initialize
M=eye(pp);
grd=grd_mult1(node,tout,ww,T)+lambda*ww;
%start training iterations
for k=2:runlim,
  q0=grd; dir=-M*q0;
  alpha=cubic_mult1(node,tout,lambda,dir,ww,T);
  p=alpha*dir; ww=ww+p;
  grd=grd_mult1(node,tout,ww,T)+lambda*ww;
  q=grd-q0;
  %update M by Davidon-Fletcher-Powell
  %M=M+((p*p')/(p'*q))-((M*q*q'*M')/(q'*M*q));
  %update M by BFGS (Press et al. Eq. 10.7.8--10)
  u=(p/(p'*q))-((M*q)/(q'*M*q));
  M=M+((p*p')/(p'*q))-((M*q*q'*M')/(q'*M*q))+(q'*M*q)*u*u';
  err(k)=ss_mult1(node,tout,ww,T);
  if err(k)<minerr, minerr=err(k); wwbest=ww; end
end
%undo standardization on S
w1=reshape(wwbest(1:d*s1),s1,d); w1=w1./(ones(s1,1)*ss);
b1=wwbest(d*s1+1:d*s1+s1); b1=b1-w1*mm';
wwbest(1:d*s1)=reshape(w1,d*s1,1);
wwbest(d*s1+1:d*s1+s1)=b1;
%undo standardization on t
wwbest(pp-tout+1:pp)=wwbest(pp-tout+1:pp)+m';
err=err/n;
fl=flops-fl; cptime=cputime-cptime;
semilogy(err), grid,
title(['QN sum-squared Training Errors/sample, alpha=',num2str(alpha)])
6 Architecture Selection and Penalty Terms
6.1 Objectives and Setting

6.1.1 The Issue
When we use the methods of Chapter 5 to design a neural network, having a known number of outputs, say, 1, we are confronted at the outset by the need to choose an architecture. The architecture is defined first by a choice of node function $\sigma \in \Sigma$, where $\Sigma$ might be the family of sigmoidal functions that are nondecreasing and bounded in [0,1], but is more commonly limited to a few possibilities such as the logistic and hyperbolic tangent functions. We then select a number d of net inputs in the first layer, L − 1 hidden layers, and numbers $s_i$ of nodes in layer i. These choices in turn determine the number p of real-valued parameters that is the dimension of the weight vector w. We then use a training set T and training algorithm A, as discussed in Chapter 5, to select $\hat{w}_n$ to minimize a training error objective function $E_T(w)$. As we are aware from the results of Chapter 4, if the architecture is impoverished then we will not achieve good performance either on the training set or on new data, while on the other hand it is possible with a rich enough architecture (e.g., a wide enough single hidden layer network) to make $E_T(\hat{w}_n)$ as small as desired and possibly 0. However, we will learn from Chapter 7 that by doing so we will generally make our generalization error worse if we approximate too well to T; having a lower training set error $E_T$ than what would be achieved by the (unknown) Bayes estimator is an indication of loss of generalization ability. Thus is born the need to rationally select the architecture to achieve the minimum generalization
error, knowing that this will not correspond to the minimum training set error. The difficulty is that while we know the training set error, we do not know the generalization error. Neural networks confront us, in a sharper manner than is common in traditional statistical regression, with the need to balance complexity against prior knowledge. Few users of statistical regression procedures would be inclined to use a regression family characterized by a number of parameters that is comparable to the number of training samples. However, this is almost the rule among users of neural networks. There have been (early) applications in which the number of adjustable network parameters far exceeded the number of training pairs. There have been successful character recognition applications in which the number of adjustable parameters is in the thousands and half as large as the number of training pairs. A factor of three times as many training samples as network parameters has also been suggested as adequate to achieve good generalization when the number of parameters is large. If one adopts a classical statistical viewpoint in which one must estimate each of the parameters defining the given regression family, then one would be quite reluctant to have them number in the thousands without an astronomical amount of training data. Neural network training algorithms and ever-improving computational resources, however, have permitted practitioners to design such complex networks. What is there to guide us in this terrain of architecture or model selection? The problem of architecture selection (e.g, the numbers of inputs, hidden layers, and nodes), more commonly referred to in statistics and pattern classification as model selection, is one that has attracted substantial effort in several directions. Although there has been little agreement as to how to proceed in specific cases without incurring prohibitive computational burdens and adopting suspect philosophical positions, we recommend to practitioners the regularization methods discussed in Sections 6.3.3 and 6.5.2.
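The bookkeeping behind the parameter count p is simple enough to set down in a few lines. The following MATLAB fragment is only an illustrative sketch: the architecture chosen and the variable names are arbitrary, and the factor of three is the rule of thumb quoted above, not a guarantee.

%rough bookkeeping aid: parameter count for layer widths [d s1 ... sL]
%(inputs first, output last), counting weights and biases
layers=[8 10 5 1]; %an illustrative architecture, not a recommendation
p=sum((layers(1:end-1)+1).*layers(2:end)); %p = sum over layers of (s_{i-1}+1)*s_i
nsuggested=3*p; %the factor-of-three heuristic mentioned above

For the single-hidden-layer, single-output networks of Chapter 5 this reduces to p = (d + 2)s1 + 1, the relation the listed programs use in the form s1 = (pp − 1)/(c + 1).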
6.1.2 Formulation as Model Selection
Typically one has a countable family of models $\mathcal{M} = \{M\}$, each of which represents an uncountably infinite indexed or parameterized family of functions, e.g., $M = \{\eta_M(\cdot, w) : w \in \mathbb{R}^{p_M}\}$. The models are commonly distinguished from each other by the number $p_M$ of real parameters they depend on and by the nature of that dependence (e.g., arrangements of different numbers of nodes in different numbers of hidden layers). The usual estimation procedures require that the functional dependence specified by a model be known, and one then sets out to use data T to estimate the $p_M$ real parameters w for model M. For example, one might have a family of models $\mathcal{M}$ indexed by p in which $M_p$ is the family of all polynomials of degree p in the variables of x. Following common custom, we
have asserted that each model is an uncountably infinite family of functions, but it can be argued that this is an idealization and that we should only be considering countably infinite families of parameterized functions:

(i) We cannot specify a parameter vector $w \in \mathbb{R}^p$ to arbitrary precision. If we are computing solutions then we only have finite precision and in effect we are choosing from only a finite set of possibilities (e.g., vectors with components specified to 16 decimal digits). Near-term hardware implementations of neural networks are likely to provide only a few bits to specify each weight.

(ii) From a theoretical viewpoint, there are only countably infinitely many computable functions. In essence the specification of a computable function requires a program, and programs consist of finite strings of characters drawn from the finite alphabet of a programming language (e.g., there are only finitely many C programs written using no more than $10^{10}$ characters). Hence, there are only countably many programs and thus no more than countably many computable functions (some programs or character strings may not compute any function; other programs may appear to be different but in fact compute the same function).

(iii) From a practical viewpoint, the phenomena we encounter can be well-approximated by the countable collection of computable functions.

In what follows we will assume countably many models, with each model containing either countably or uncountably many functions parameterized by a finite-dimensional vector. In the latter case we may then quantize the parameter vector to regain countability (e.g., see Section 6.5), albeit the fineness of the quantization will depend on the amount of training data.
6.1.3 Outline of Approaches to Model Selection
The bounds to generalization error given in Chapter 7 can in principle guide the choice of architecture. However, in current practice they are either too loose or require substantial computation in their own right and cannot be relied on yet to solve the architecture selection problem. Cross-validation provides an estimate of generalization error that is frequently used as a rough guide to architecture selection. The several methods suited to application and having a sound basis essentially adapt the objective function of the training algorithm by adding a term to it that penalizes architecture complexity. Training, with an appropriate penalty term added to a term measuring the degree to which the network approximates the data, automatically balances complexity against a close fit to the data. Three approaches to selection of the penalty term will be discussed in Sections 6.2.4, 6.3–6.5, and in 7.8.4. Finally, there are other methods that have been
proposed to automatically select the proper architecture through pruning an overly rich initial choice, growing a too-restricted initial choice, or limiting training of an overly rich initial choice so as to not fully exploit its capabilities. In sum, we have methods (e.g., those of Chapter 5) that are effective in selecting a member of a properly specified family of functions (e.g., those given by a specific model or architecture), whether that family be countably or uncountably infinite, so that the choice meets some criterion of optimality (e.g., small approximation error on a training set). We need methods that are as effective when we attempt to select from a family that combines functions from several models. Although one might adopt a two-stage process in which in the first stage we select a model M ∈ M, by a process to be discussed, and in the second stage we select a specific function or member of that model by the methods that were discussed in Chapter 5, we favor alternative selection strategies that integrate these two stages. We will discuss the following five types of approach: (i) Bayesian calculations of posteriors; (ii) regularization; (iii) complexity and minimum description length (MDL) principles; (iv) cross-validation and generalization error bound to control overfitting through control of overtraining; and (v) ad hoc methods for growing or pruning architectures. Of course, there is always the sixth alternative of preliminary training of a variety of networks having a variety of architectures chosen in an ad hoc fashion, and then fully training only the one that performed the best. Although this approach is straightforwardly implemented and used, in practice it consumes computational resources and is likely to lead to overfitting and poor generalization performance. Finally, bear in mind that if we knew the generalization error eg (w) (see Section 7.4) for each architecture, then we would not need any of the methods in this chapter. Hence, methods requiring us to know elements that are as hard to learn as eg are not likely to be helpful in practice, whatever their contribution to a more fundamental understanding.
6.2 Bayesian Methods

6.2.1 Setting
Bayesian methods provide the most principled approach to neural network design (and to the general choice of statistical estimators and decision rules) but one that paradoxically results in the most ad hoc and unprincipled
of practical applications. It is the method that is most consistent with a probabilistic outlook; one inquires into how probable a network is rendered by information based on training data. This method is discussed in the setting of neural networks by MacKay [155, 156], Neal [177], Bishop [29], Ripley [192], and Wolpert [254], among others. Significant attention is paid to effective computation using Markov chain–based Monte Carlo methods in Neal [177] and Muller and Insua [174]. Bayesian methods have been well-studied by statisticians (e.g., Berger [27], Bernardo and Smith [28], West and Harrison [250]). The setting of neural networks, as we shall see, presents some difficulties with application of the basic ideas of Bayesian analysis, and these difficulties need more careful consideration than they have been given in the neural network literature. The conveniences and admirable properties (e.g., statistical admissibility) of the Bayes approach seduce users into applying it even when there is a chasm between what we know and what the Bayes approach requires us to know. This chasm is too often bridged by willful assertion. For simplicity, we will assume that the usual parameter vector w is augmented by a few additional components to contain within itself a specification of the architecture graph (number of layers and number of nodes within each layer) as well as the weights or parameters describing a specific function within that architecture. This allows us to consider a model M as just a collection of such vectors. While we do not do so here, we might consider augmenting the information in w ∈ W about the network by information v ∈ V about the probability mechanism or law P generating the data x, t. This information or data model v ∈ V has not been made explicit in Bayesian approaches to neural networks. In respect of the fully probabilistic perspective of the Bayesian position, in our discussion of this position we adopt the convention that W , T , T, X denote random variables taking on respective values w, T , t, x; we depart from our convention only in that the training set notation T does double duty as both a specific value of the training set and a random variable. Given a network and data model W playing the role of a “hypothesis” and the training set or data T playing the role of “evidence”, one would like to evaluate the conditional density or posterior fW |T (w|T ) of hypothesis given evidence, and thereby determine the support, lent by the evidence data to the hypothesis that W = w is the correct choice of parameter vector. If we can evaluate fW |T , then we can compare any two networks w, w as to which is more likely to be the “correct” function. We can also compare models or architectures through evaluation of the posterior probability for a family M of networks
$$M \in \mathcal{M}, \qquad P(M|T) = \int_{w \in M} f_{W|T}(w|T)\,dw. \qquad (6.2.1)$$
Given that this is possible, we can also assess the expected generalization error induced by a model selection algorithm by averaging over the generalization error incurred by the function chosen from each model.
Although the Bayesian position is the best position to be in, it will do us no good if we cannot reach it. (For example, an even better position is to be clairvoyant and just pick the right network. If you think you can do this, you need read no further!) How then are we to determine the necessary posterior density $f_{W|T}$? Even a confirmed adherent of such methods would be embarrassed at postulating directly the desired posterior density. Rather, one constructs the posterior density from a prior density $\pi$ over W and a likelihood, that is, a conditional density $g_{T|W}(T|w)$ for the evidence given the hypothesis, through the familiar Bayes theorem

$$f_{W|T}(w|T) = \frac{g_{T|W}(T|w)\,\pi(w)}{\int_W g_{T|W}(T|w')\,\pi(w')\,dw'}. \qquad (6.2.2)$$

6.2.2 Priors
The prior π can be thought of as represented conditionally upon the models through individual conditional priors {πW |M (w|M ), M ∈ M} that are conditional over subsets corresponding to individual models and a collection of conditional probabilities {PM (M ), M ∈ M} of particular network models being the ‘true’ families. While it would be best to think of the prior π as defined over a function space that contains all of the network functions, we do not do so. As we are aware from Chapter 4, our construction of networks from their parameterization W means that different parameters may define the same network function. Nevertheless, choices of networks and training algorithms proceed based upon this parameterization W and not based upon a function space representation. We can, if we wish, think of W as a union of collections of strings (vectors) of different lengths. The models M ∈ M typically involve parameterizations of different lengths corresponding to networks with different architectures. In this section we will assume that a model M assumes a parameterization by vectors w of length pM . This decomposition into prior probabilities {PM (M ), M ∈ M} over models and conditional densities {πW |M (w|M ), M ∈ M} for the networks in a given model makes it easier to assess the prior in a two-stage process. The prior π represents our knowledge, in advance of consulting the training data, as to the probabilities of sets of hypotheses/parameters containing the “correct” hypothesis. In practice, it must be admitted that such knowledge is never more than vague and certainly too imprecise to yield a probability measure π over the complicated set (function space) of all network and data models. There are those committed Bayesians who argue that failure to come up with π, even in some indirect fashion, exposes oneself to making incoherent decisions—there can exist sequences of decisions such that the overall outcomes are sure losses. Although this is true, incoherence is no more to be feared than the false “knowledge” invented to avoid it, the old problem of making the foot fit the shoe. A precise answer to the
wrong problem is not necessarily to be preferred to a flawed answer to the right problem. A somewhat more worldly response by Bayesians is to assert robustness: the posterior is not that sensitive to the prior, at least in the presence of substantial training data that makes the likelihood gT |W (T |w) very sharply peaked over a small subset of possible model parameters. A precise prior is not required. It is often a short step from this maneuver to assuming a “tractable” Gaussian prior (e.g., MacKay [155]). A detailed argument for constructing a prior that is invariant under relabelings of the parameterization is provided by Balasubramanian [15] and is in the spirit of much earlier work by Jeffreys [115]. A further objection, at least in the majority of neural network applications, is that the true underlying stochastic mechanism generating T is not based on a particular, unknown neural network. The prior probability πW |M , PM cannot be understood as the probability of the “true” model generating the training data being indexed by w ∈ M . At best, this prior can be understood as our initial beliefs ordering the various possible networks according to how likely they are to succeed at forecasting the training data. We think of using neural networks not because they are true models of the random phenomena being investigated but rather because they are successful models in terms of yielding acceptably low prediction errors. Nor do these prediction errors need to be the minimum possible (Bayes) error; it suffices for applications that the error be low enough. This argument is developed further in the next subsection.
6.2.3 Likelihood and Posterior
How can we construe the likelihood gT |W (T |w)? In common Bayesian statistical applications the likelihood is uncontroversial. For example, if we have an unusual coin with unknown probability p for ‘heads’, the coin is tossed n times, and we observe the total number H of outcomes that are ‘heads’, then the familiar binomial distribution B(n, p) supplies the likelihood. The only controversial element is what prior π to assign to p ∈ [0, 1]. The situation is not so straightforward in the application of Bayesian ideas to neural network design. The Bayesian view most expressed in the neural network community assumes that w indexes a specific probability law governing the production of the data X, T . On this account, the evaluation of the likelihood is straightforward. However, as this is not at all our case, it marks the need for a more careful analysis. The vector w indexes directly a neural network function η(x, w) rather than a probability law. What is the relationship between the likelihood gT |W and η? Although occasionally η is assumed to provide an estimate of the conditional probability P (T = t = 1|X = x) (e.g., see Denker and LeCun [56]), the much more common case is that the network function η is used to predict the target T = t from X = x. Ideally, in the limit of infinite training data or of knowledge of the underlying probability law P or density fX,T (x, t) generating
the data, we would choose w to minimize the generalization error $e_g(w) = E(\eta(X,w) - T)^2$. However, we should not forget that knowledge of $e_g$ is so substantial that it would obviate the need for the training set T and for the discussion in Chapters 6 and 7 and require only the nonlinear optimization methods of Chapter 5 to locate a good value of w. For notational convenience, let $c(x) = E(T|X = x)$, and write

$$e_g(w) = E(\eta(X,w) - T)^2 = E(\eta(X,w) - c(X) + c(X) - T)^2 = E(c - T)^2 + E(\eta - c)^2 + 2E((\eta - c)(c - T)).$$

The last term in this equation is well-known to be zero (evaluate the expectation by first conditioning on X = x) and the first term does not depend on w. Hence, an ideal "best fit" network is one that minimizes $E(\eta - c)^2$. Thus the information supplied by w is information about the conditional expectation. To be more precise, on a "best fit" interpretation, knowledge of W = w amounts to knowing that the function c(x) is closer to $\eta(x, w)$ than it is to any other network function. This might encourage us to assume that $c(x) = E(T|X = x) \approx \eta(x, w)$. Indeed, for a rich set W we have learned in Chapter 4 that there will be network functions that approximate to reasonably-behaved c as closely as desired. Thus it is plausible to interpret $w \in W$ as specifying the conditional mean c(x). However, it is apparent that knowledge of w falls as far short of that needed to specify the full probability model needed to specify $g_{T|W}$ as does knowledge of the mean when we specify an otherwise arbitrary probability distribution. This fact has often been ignored in Bayesian approaches to neural network modeling. What then are we to conclude about the likelihood $g_{T|W}$? We have throughout assumed that $T = \{(x_i, t_i)\}$ are i.i.d. Hence, we may write

$$g_{T|W}(T|w) = \prod_{i=1}^{n} g_{X,T|W}(x_i, t_i|w).$$
To continue, we consider

$$g_{X,T|W}(x,t|w) = g_{T|X,W}(t|x,w)\,f_{X|W}(x|w) \approx g_{T|X,W}(t|x,w)\,f_X(x).$$

The last approximation of $f_{X|W}(x|w)$ by $f_X(x)$ assumes that W is uninformative about (independent of) X. This appears plausible when one considers that W is chosen to approximate to $E(T|X)$ and this conditional expectation does not depend on $f_X(x)$. The mean square error measure of approximation used to select W does require averaging over X using $f_X$,
but this need not imply a strong dependence between W and X. W tells us about the link between T and X but not about X itself. We now need to evaluate $g_{T|X,W}(t|x,w)$. One direction is to consider the regression model

$$T = \Delta + c(X) \approx \Delta + \eta(X, W),$$

where the residual is $\Delta$ and $E(\Delta|X = x) = 0$. Of course, this representation is only of interest if $\Delta$ is in some further way restricted; otherwise it is a mere tautology. While this is not the general case, it is plausible to restrict $\Delta$ to be independent of W, X and thereby to assert that the conditional expectation captures the full linkage between the target random variable T and the data X. Making this assumption yields

$$g_{T|X,W}(t|x,w) = f_\Delta(t - c(x)) \approx f_\Delta(t - \eta(x,w)).$$

Combining results yields the desired likelihood function

$$g_{T|W}(T|w) \approx \prod_{i=1}^{n} f_\Delta(t_i - \eta(x_i,w))\,f_X(x_i).$$
It is evident that we cannot go further and specify $f_\Delta$ without making substantive and limiting statements about the generation of the data. There is certainly no epistemic (knowledge-based) justification for assuming $\Delta$ to have a normal/Gaussian distribution and much by way of non-Gaussian applications (e.g., the target T is bounded or multimodally distributed) to contradict such an assumption. Thus epistemically we cannot support the approaches leading to a Gaussian likelihood that have been taken by, among others, those approaching this problem from a statistical physics viewpoint. One alternative is to place a prior over the distributions of $\Delta$, a complex maneuver given that all of this is in aid of selecting w based on the mean $e_g$. An instrumentalist defense of the use of the Gaussian will be given in Section 6.5.2 when we show that use of the Gaussian and regularization does lead to reasonable performance: 'it works'. From the evaluation of $g_{T|W}$ we see that the desired posterior conditional density for w is given by

$$f_{W|T}(w|T) = \frac{\pi(w)\prod_{i=1}^{n} f_\Delta(t_i - \eta(x_i,w))\,f_X(x_i)}{\int_W \pi(w')\prod_{i=1}^{n} f_\Delta(t_i - \eta(x_i,w'))\,f_X(x_i)\,dw'} = \frac{\pi(w)\prod_{i=1}^{n} f_\Delta(t_i - \eta(x_i,w))}{\int_W \pi(w')\prod_{i=1}^{n} f_\Delta(t_i - \eta(x_i,w'))\,dw'}.$$

We now make the unwarranted assumption that $\Delta \sim N(0, \sigma_\Delta^2)$ and conclude that

$$f_{W|T}(w|T) = \frac{e^{-\frac{E_T(w)}{2\sigma_\Delta^2}}\,\dfrac{\pi(w)}{\sigma_\Delta^n}}{\int_W e^{-\frac{E_T(w')}{2\sigma_\Delta^2}}\,\dfrac{\pi(w')}{\sigma_\Delta^n}\,dw'}, \qquad (6.2.3)$$
albeit with a poor foundation in probabilistic reality. Nor do we know the variance $\sigma_\Delta^2$, conveniently assumed not to depend upon w.
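To see concretely how Eq. 6.2.3 leads to the penalized objectives of Section 6.3 (and to the parameter lambda in the listings of Section 5.13), suppose, purely for illustration, that the prior is taken to be a zero-mean Gaussian, $\pi(w) \propto e^{-\|w\|^2/(2\sigma_w^2)}$; this choice of prior and the symbol $\sigma_w$ are assumptions made here for the example. Then, up to terms not depending on w,

$$-\log f_{W|T}(w|T) = \frac{E_T(w)}{2\sigma_\Delta^2} + \frac{\|w\|^2}{2\sigma_w^2} + \text{const},$$

so that maximizing the posterior is the same as minimizing $E_T(w) + \lambda\|w\|^2$ with $\lambda = \sigma_\Delta^2/\sigma_w^2$, the weight decay objective.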
6.2.4 Bayesian Model and Architecture Selection
How are we to use the posterior derived in Eq. 6.2.3? A thorough-going Bayesian would use the posteriors to calculate posterior expectations. Thus, rather than selecting a specific network $w^*$, the committed Bayesian would predict T through $E(\eta(X, W)|X = x, T)$, an average over all networks, and not just accept the prediction $\eta(x, w^*)$. Evaluating such expectations can be a surprisingly onerous task, one requiring use of sophisticated Monte Carlo techniques (see Neal [177], Muller and Insua [174]) to approximate by sample averages to the integrals in question. Instead, we focus on extracting information from the posteriors by studying the selection of a network $w^*$ that maximizes the posterior density as given in Eq. 6.2.3. Such a selection can be made over all models because we are no longer just trying to fit closely to the training data. We are also forced to take into account the prior. This will be drawn out more clearly later through the terms to be introduced in Eq. 6.2.5 (see also the analysis in Balasubramanian [15]). Returning to our focus on evaluating models, combine Eq. 6.2.1 and Eq. 6.2.3 to derive

$$\frac{P(M|T)}{P(M'|T)} = \frac{\int_{w \in M} \dfrac{\pi(w)}{\sigma_\Delta^n}\, e^{-\frac{E_T(w)}{2\sigma_\Delta^2}}\, dw}{\int_{w' \in M'} \dfrac{\pi(w')}{\sigma_\Delta^n}\, e^{-\frac{E_T(w')}{2\sigma_\Delta^2}}\, dw'}. \qquad (6.2.4a)$$

Restated, introducing the function h, whose details will not concern us, the ratio

$$\frac{P(M|T)}{\int_{w \in M} \dfrac{\pi(w)}{\sigma_\Delta^n}\, e^{-\frac{E_T(w)}{2\sigma_\Delta^2}}\, dw} = e^{h(T)} \qquad (6.2.4b)$$
depends upon the data T but does not depend upon the choice of model $M \in \mathcal{M}$. We undertake to carry out approximately the integrations over models required in Eqs. 6.2.4. Assume, as usually intended, that a given model M contains w that are of the same dimension $p_M$ and share the same architecture. In effect, within a model we will ignore the terms that were added in Section 6.2.1 to specify the architecture and take gradients as needed with respect to w, now construed as just the original weights of the network of a given architecture. The integrands in each of the forms of Eqs. 6.2.4 consist of a product of an exponential and the two terms $\pi$ and $1/\sigma_\Delta^n$. Putting aside the prior term $\pi$ and looking at the negative logarithm of the remaining integrands (proportional to the likelihoods) yields

$$Q(w) = n\log\sigma_\Delta(w) + \frac{E_T(w)}{2\sigma_\Delta^2(w)}, \qquad w_M = \operatorname*{argmin}_{w \in M} Q(w). \qquad (6.2.5)$$
The term $w_M$ is a random (function of T) maximum likelihood estimator for the network that is best within a given model M. Paralleling the Taylor's series expansions for $E_T$ given in Eq. 5.1.3, we see that for $H_M(w_M) = \nabla\nabla Q$ the non-negative definite Hessian matrix of second derivatives of Q at the minimizing (stationary) point $w_M$ we have that

$$Q(w) \approx Q(w_M) + \frac{1}{2}(w - w_M)^T H_M(w_M)(w - w_M); \qquad (6.2.6)$$

the terms linear in $w - w_M$ are absent as $w_M$ was chosen as a point of zero gradient of Q. The matrices $H_M$ are closely related to the important statistical concept of a Fisher information matrix $\mathbf{F}$ defined for a family of density functions $g_{X|W}$ parameterized by $w \in M$ by $\mathbf{F}(w) = -\nabla_w \nabla_w E \log g_{X|W}$. For example, if $X \sim N(w, \mathbf{C})$ then $\mathbf{F} = \mathbf{C}^{-1}$. In our case, $\mathbf{F}(w_M) = E H_M(w_M)$. A typical ratio of posterior model probabilities can now be expressed as

$$\frac{P(M|T)}{P(M'|T)} \approx \frac{\int_{w \in M} e^{-Q(w_M) - \frac{1}{2}(w - w_M)^T H_M(w_M)(w - w_M)}\, \pi(w)\, dw}{\int_{w' \in M'} e^{-Q(w_{M'}) - \frac{1}{2}(w' - w_{M'})^T H_{M'}(w_{M'})(w' - w_{M'})}\, \pi(w')\, dw'}. \qquad (6.2.7)$$

The exponential factor is proportional to the density of a normal random vector with mean the maximum likelihood estimate $w_M$ and covariance matrix $H_M^{-1}$, where we now assume that $H_M$ is positive definite. Given that we expect to have little detailed prior knowledge about w, its representation should be through a diffuse prior $\pi$ that is only slowly varying over M. The normal density is well-known to be rapidly varying. Hence, over the region of significant contribution of the normal density, we expect the prior to be effectively constant at $\pi(w_M)$. Making the reasonable assumption that a "large" ball centered at $w_M$ is contained in M, we have

$$\frac{P(M|T)}{P(M'|T)} \approx (2\pi)^{\frac{1}{2}(p_M - p_{M'})}\, \frac{\pi(w_M)}{\pi(w_{M'})}\, e^{Q(w_{M'}) - Q(w_M)}\, \sqrt{\frac{\det(H_{M'}(w_{M'}))}{\det(H_M(w_M))}}, \qquad (6.2.8a)$$

where Q is defined in Eq. 6.2.5. Restated, to within the above approximations and following the line of Eq. 6.2.4b, the ratio

$$\frac{P(M|T)}{(2\pi)^{\frac{p_M}{2}}\, \pi(w_M)\, e^{-Q(w_M)} \big/ \sqrt{\det(H_M(w_M))}} \approx e^{h(T)} \qquad (6.2.8b)$$

depends upon the data T but does not depend upon the choice of model $M \in \mathcal{M}$.
In effect, we replace a model M by a representative value $w_M$, one that represents a best fit in that model to the training data. The prior probability density $\pi(w_M)$ acts to moderate our interest in models that otherwise have very high likelihood because they fit T closely and $Q(w_M)$ is small. This is a desired outcome insofar as it guards against overfitting by limiting the favor bestowed on models containing nets that by happenstance fit the data closely. We can now compare the probabilities of two models, given the training data, provided we can calculate various characteristics of $E_T$ that are, in principle, available to us (albeit, as it takes the training methods of Chapter 5 to approximate to these quantities, the computational burden is quite large) and provided that we can assess the problematic prior probability density $\pi$. If the best performance using model M is much better than the best performance using model M', then we may be able to ignore their prior probabilities if they are somewhat comparable. However, in practice we are likely to be assessing models in which there is something of a performance improvement but it is unlikely to overwhelm the prior. In that case we are being directed by our prior beliefs, however ill-founded they are. In defense of the use of a prior, even one ill-founded, any such prior on the countable collection $\mathcal{M}$ of neural network models guards against overfitting:

$$(\forall \pi)(\forall \delta > 0)(\exists \bar{p})\quad \pi(\{M : p_M > \bar{p}\}) < \delta.$$

In other words, overly complex models are given little credence. Combining Eqs. 6.2.5 and 6.2.8, making the obvious definitions for $R_M$ and letting h(T) be the needed function of the data, yields the following expression for the conditional probability density of model M given the training data T:
$$\log(P(M|T)) = -\frac{E_T(w_M)}{2\sigma_\Delta^2} - n\log(\sigma_\Delta) + \log(\pi(w_M)) + \frac{p_M}{2}\log(2\pi) - \frac{1}{2}\log(\det(H_M)) + h(T) = -\frac{E_T(w_M)}{2\sigma_\Delta^2} + R_M + h(T). \qquad (6.2.9)$$

An approximation to the variance is given by

$$\sigma_\Delta^2 \approx E(T - c(X))^2,$$

where c(X) = E(T|X). The matrices $H_M$ were assumed positive definite. The geometric mean $\lambda_{g,M}$ of the eigenvalues of $H_M$ is given by

$$\log \lambda_{g,M} = \frac{1}{p_M}\log(\det(H_M)).$$

This enables us to rewrite Eq. 6.2.9 as

$$\log(P(M|T)) = -\frac{E_T(w_M)}{2\sigma_\Delta^2} - n\log(\sigma_\Delta) + \log(\pi(w_M)) + \frac{p_M}{2}\log\frac{2\pi}{\lambda_{g,M}} + h(T). \qquad (6.2.10)$$
Thus a maximum likelihood estimate of the model amounts to choosing the model M ∗ that maximizes the right-hand side of Eq. 6.2.10, excluding the term h(T ) that is constant over all models. A model with a higher prior probability is preferred to one with a lower prior, other things being equal. Eq. 6.2.10 asserts that a smaller value of λg,M is also desirable in comparing models. We also find that when the geometric mean eigenvalue λg,M of HM is less than 2π, a model with more parameters is favored over one with fewer parameters, although the impact of the other terms may reverse this conclusion. Of course, the evaluations of these terms are quite difficult to carry out: all of the efforts of Chapter 5 aim to evaluate something similar to wM . Finally, we are now encouraged by Eq. 6.2.9 to minimize not just the errors on a training set but such a term together with a penalty term RM . This adjustment in attitude introduces the notion of regularization that will yield more practical methods of network design.
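To make the bookkeeping of Eq. 6.2.10 concrete, the following MATLAB sketch scores a single candidate model from quantities that the discussion above treats as given: a best-fit training error $E_T(w_M)$ (e.g., obtained with the programs of Section 5.13), an estimate of $\sigma_\Delta$, a prior density value $\pi(w_M)$, and a positive definite Hessian estimate $H_M$. The function name bayes_score and the manner of supplying its inputs are assumptions made for this illustration; as noted above, obtaining $H_M$ and $\pi(w_M)$ is itself a substantial task.

%score a model by the right-hand side of Eq. 6.2.10, omitting the common term h(T)
%ET is E_T(wM), sd is sigma_Delta, pwM is pi(wM), HM is the Hessian of Q at wM,
%n is the number of training pairs; illustrative sketch only
function [score]=bayes_score(ET,sd,pwM,HM,n)
pM=size(HM,1); %number of parameters in the model
lg=exp(mean(log(eig(HM)))); %geometric mean eigenvalue of HM
score=-ET/(2*sd^2)-n*log(sd)+log(pwM)+(pM/2)*log(2*pi/lg);

Comparing two models then amounts to comparing their scores, the term h(T) being common to both.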
6.3 Regularization

6.3.1 General Theory for Inverse Problems
Regularization adds to the usual data-fitting error terms a term that penalizes model complexity. In the setting of neural networks, regularization is discussed by Poggio and Girosi [187] and Moody [171], among others. Typically, instead of selecting w to minimize $E_T(w)$ we attempt to select w to minimize $E_T(w) + \lambda C(w)$, where a common choice is $C = \|w\|^2$. In our Bayesian analysis such a "penalty" term is generated naturally from the logarithm of the prior $\pi$. The expectation is that adding C(w) discourages the training algorithm from overfitting the training data through excessively complex networks. Small $\|w\|$ are encouraged, and weight vector components that are very small can be taken to be zero. The resulting network is then simpler than might have been permitted at the outset. A more precise understanding is available from Theorem 7.8.5. However, as we shall see, the goal of regularization is to ensure a smooth inference from T to $w^*$, and implications for simplicity are incidental consequences. We undertake to sketch this approach in some generality because it provides a perspective from which to understand and adapt the specific methods of regularization that have been pursued in the neural network context. A motivation for, and description of, this process in terms of ill-posed (inverse) problems is given by Tikhonov and Arsenin in [235] and reproduced in part in Vapnik [240, Chs. 1 and 9]. A recent treatment is in Baumeister [26]. We rephrase the network learning problem as an inverse problem and use Tikhonov's notation in which a known continuous (nonlinear) functional A maps from a domain metric space F with metric (distance mea-
sure) ρF to a range metric space U with metric ρU , z ∈ F ⇒ A(z) = u ∈ U. From an observed u we wish to infer back to z. In terms of our neural network problems we can make the correspondences that U ⊂ IRn is the space of vectors with ith component the desired target ti and F ⊂ IRs is the space of weight or parameter vectors, with z = w. If we assume that the training set {(xi , ti )} is actually generated by a network (which is not truly our case), then the nonlinear functional or operator A corresponds to a vector-valued function with ith component η(·, xi ); A(z) is the vector with ith component η(xi , z). The network design or training problem is the one of finding the inverse to this mapping that recovers the correct w from knowledge of the target vector u and the inputs {xi } embedded in A. The selection of a network w = z given the architecture and training data then amounts to finding an inverse functional A−1 to map from U back into F : z = A−1 (u). We distinguish three kinds of inverse problems: (i) there is no zT that can give rise to a given uT through A(zT ); (ii) there are many (perhaps infinitely many) zT such that A(zT ) = uT ; and (iii) there is a unique such zT . The first case can arise in the neural network setting either because the training data is not itself generated by a network, and this is usually the case, or because there is so much noise that even though the underlying regression is in the network family it cannot fit the training data exactly. However, if the architecture is made complex enough, then we expect to be able to fit the training data exactly, the problem of overfitting that we are seeking to avoid. Thus if we approach regularization with a complex enough architecture, then we will not have to be concerned with the first case. The second case becomes a likelihood once the architecture is complex enough to avoid the first case. As we noted in Chapter 4, neural networks generally admit several implementations of the same function and thus fall under the second case. Although the treatment of regularization can accommodate this, it is clearest to proceed as if the third case holds and A is 1:1 and we have a unique inverse A−1 , A(z) = u ⇔ A−1 u = z. In the neural network context this assumes that we have accounted for the equivalence classes of networks generated by, say, appropriate permutations of the weights assigned to different nodes (see Section 4.2.1).
We cannot rest here because in the neural network case we expect there to be a noisy relationship between input variables (e.g., feature vector) x and response variable (e.g., pattern category) t. Outputs are observed under noisy conditions that prohibit arbitrarily precise evaluations. We need to know that inferring back from an output $u_\delta$, approximating to $u_T$, will yield an input $z_\delta$ that approximates to $z_T$. This can be formulated as a condition of smoothness or stability of the inverse:

$$(\forall z_1, z_2 \in F)(\forall \epsilon > 0)(\exists \delta > 0)\quad \rho_U(A(z_1), A(z_2)) < \delta \Rightarrow \rho_F(z_1, z_2) < \epsilon.$$

Stability ensures that sufficiently similar outputs imply similar inputs. Such a condition is clearly essential in our neural network application. We might hope that asserting the continuity of A will guarantee the continuity of $A^{-1}$ and hence guarantee a stable inverse. Although A will be continuous when $\eta$ is continuous as a function of w, this unfortunately does not imply the continuity of $A^{-1}$. Revising this inversion or training problem to achieve a stable inverse is the goal of regularization. A basic device to ensure stability is to restrict the selection or parameter set F to a compact (closed and bounded in the case of subsets of $\mathbb{R}^s$) set $C \subset F$.

Lemma 6.3.1 (Continuity of Inverses [235, p. 29]) If a continuous 1:1 operator A is defined on a compact metric space $C \subset F$, then the inverse operator $A^{-1}$ is continuous on its domain, the image set $AC \subset U$.

In our neural network case the forward functional A will be continuous if $\eta$ is continuous, which is the usual case. One might then select a sequence of nested, increasing compact sets $C_n$ with $\cup_n C_n = F$ and study the convergence of the sequence of resulting stable solutions to the true solution. Vapnik [240] develops a version of regularization under the name structural risk minimization (Guyon et al. [87]) that pursues this direction. One identifies a structure consisting of a hierarchy of increasingly complex models (e.g., networks with increasingly many nodes) and attempts to solve for the structure and associated parameters. An alternative to the direct selection of such compact sets is provided through the idea of a regularizing operator to approximate to $A^{-1}$. Tikhonov and Arsenin introduce the idea of a regularizing operator $R(u, \lambda)$ to provide the desired stable inverse in the following sense. Given any $u_T$ and its true inverse (we assume that A is 1:1) $z_T$, we want to be able to approximate $z_T$ to within any desired $\epsilon$ by some solution $z_\lambda$, provided we are supplied with a value $u_\delta$ sufficiently close to $u_T$, $\rho_U(u_T, u_\delta) \le \delta(\epsilon)$. This solution is to be $z_\lambda = R(u_\delta, \lambda(\delta, u_\delta))$.

Definition 6.3.1 (Regularizing Operator [235, p. 47]) An operator $R(u, \lambda)$ depending on a parameter $\lambda$ is called a regularizing operator for the equation A(z) = u in a neighborhood of $u = u_T$ if
1. there exists a $\delta_1 > 0$ such that R is defined for all $\lambda > 0$, $u \in U$ for which $\rho_U(u, u_T) \le \delta_1$; and

2. there exists a function $\lambda = \lambda(\delta, u_\delta)$ such that for all $\epsilon > 0$ there exists $\delta(\epsilon) \le \delta_1$ and $u_\delta \in U$, $\rho_U(u_T, u_\delta) \le \delta(\epsilon) \Rightarrow z_\lambda = R(u_\delta, \lambda(\delta, u_\delta)), \ \rho_F(z_T, z_\lambda) \le \epsilon$.

Clearly, if we can identify one of the many possible such regularizing operators, then we can find good inverses (weight vectors) in that a small perturbation in the target vector u produces a correspondingly small perturbation in the inverse. A subclass of regularizing operators is given by the following.

Theorem 6.3.1 (Class of Regularizing Operators [235, p. 49]) Let $\bar{R}(u, \lambda)$ denote an operator from U into F that is defined for every element of U and every $\lambda > 0$ and that is continuous with respect to u. If for every $z \in F$

$$\lim_{\lambda \to 0} \bar{R}(A(z), \lambda) = z,$$

then $\bar{R}$ is a regularizing operator for A(z) = u.

We turn to methods for constructing such continuous regularizing operators.

Definition 6.3.2 (Stabilizing Functional) A stabilizing functional C(z) has the property that $(\forall d > 0)$ $\{z : C(z) \le d\}$ is a compact subset of F (e.g., a closed and bounded set when $F \subset \mathbb{R}^s$).

An example of such a stabilizing functional for vector z = w is $\|w\|^2$, the squared length of the vector corresponding to the Euclidean metric $\rho_F$. Consider any $u_\delta$ that approximates to the true $u_T$ through $\rho_U(u_\delta, u_T) \le \delta$. If we now choose $z_\delta$ as a point in the set of approximations to $u_\delta$, $\{z : \rho_U(u_\delta, A(z)) \le \delta\}$, that minimizes C(z), then we can think of $z_\delta$ as given by a mapping from $u_\delta, \delta$, that can be represented as a functional $z_\delta = \tilde{R}(u_\delta, \delta)$. It can be shown (Tikhonov et al. [235, pp. 52–54]) that $\tilde{R}$ is also a regularizing operator, and thus that $z_\delta$ is an approximation to $z_T$. Thus a constrained minimization gives rise to a stable solution to the inverse problem. Whereas radically different values z, z' can give rise to values u = A(z), u' = A(z') that are both close to $u_T$, this being the nature of the unstable or discontinuous inverse, the stabilizing functional C is likely to assign dissimilar values C(z), C(z') to radically different arguments and thereby eliminate one of them from consideration in achieving a minimum.
6.3.2 Lagrange Multiplier Formulation
One can proceed further and rephrase the preceding minimization problem as one of minimizing C(z) for those z satisfying the equality constraint $\rho_U(A(z), u_\delta) = \delta$, in place of the inequality constraint. A minimization with an equality constraint can be approached using Lagrange multipliers and recast as the minimization of the functional

$$M^\lambda(z, u_\delta) = \rho_U^2(A(z), u_\delta) + \lambda C(z), \qquad (6.3.1)$$

where $\lambda$ is the Lagrange multiplier that then needs to be chosen to satisfy the equality constraint. In Eq. 6.3.1 we now encounter a more familiar formalization of regularization in terms of an error function $\rho_U^2$ determining the squared distance between the element A(z) achieved by z and the approximate goal of $u_\delta$ plus a penalty term C. Tikhonov and Arsenin call $M^\lambda$ a smoothing functional. Once again we can interpret the minimizing value $z_\lambda$ of $M^\lambda$ as arising from a functional through $z_\lambda = R_1(u_\delta, \lambda(\delta, u_\delta))$. It needs to be shown that with a proper choice of the Lagrange multiplier $\lambda$, $R_1$ is again a regularizing operator and the resulting $z_\lambda$ provides a stable inverse that yields the approximations

$$\rho_U(A(z_\lambda), u_\delta) = \delta, \quad \rho_U(u_T, u_\delta) \le \delta, \quad \rho_U(A(z_\lambda), u_T) \le 2\delta.$$

This is verified in Tikhonov et al. [235, pp. 65–67]. The original inverse problem, likely to be unstable because $A^{-1}$ is not likely to be continuous, has been turned into a stable problem in Eq. 6.3.1 because the minimum of $M^\lambda$ is stable under perturbation of $u_\delta$. The choice of $\lambda$ is addressed in the following.

Theorem 6.3.2 (Constraints on Lagrange Multiplier [235, p. 65]) Let $A(z_T) = u_T$. Then for all $\epsilon > 0$ and non-negative, nondecreasing, continuous functions $\beta_1(\delta), \beta_2(\delta)$ satisfying

$$\beta_2(0) = 0, \qquad \frac{\delta^2}{\beta_1(\delta)} \le \beta_2(\delta),$$

there exists $\delta_0(\epsilon, \beta_1, \beta_2)$ such that

$$\rho_U(\tilde{u}, u_T) \le \delta \le \delta_0 \Rightarrow \tilde{z}_\lambda = R_1(\tilde{u}, \lambda), \quad \rho_F(z_T, \tilde{z}_\lambda) \le \epsilon,$$

for all $\lambda$ satisfying

$$\frac{\delta^2}{\beta_1(\delta)} \le \lambda \le \beta_2(\delta).$$

For example, we can take $\beta_1(\delta) = 1$, $\beta_2(\delta) = \delta$, and then $\delta^2 \le \lambda \le \delta$.
6.3.3 Application to Neural Networks
In so-called weight decay or weight elimination regularization, the penalty term C(w) is the sum of squares of the weights and biases $\|w\|^2$ or a variation thereof. For example, Weigend et al. [247] adopt the overall objective function

$$E = \sum_{i=1}^{n}\left(\eta(x_i, w) - t_i\right)^2 + \lambda\left[\,C(w) = \sum_{j=1}^{p}\frac{w_j^2/w_0^2}{1 + w_j^2/w_0^2}\,\right], \qquad (6.3.2)$$
where λ, w0 are adjustable to vary the effects of the regularization. w0 provides a scale to distinguish between “large” and “small” weights. Note that for large |wj | the summand is approximately 1, and for small |wj | the summand is approximately wj2 /w02 . This weight penalization term can be interpreted as the negative logarithm of a prior probability density for weights that is roughly normal for small weights and uniform for large weights. The parameter λ on this account is interpretable as the reciprocal of a variance (e.g., see Eq. 6.2.9). The set {w : C(w) ≤ τ } is a compact set and hence C is a stabilizing functional. Hence, the set {w : E ≤ τ } is a subset of a compact set and will be compact if it is closed. Closure will follow from the continuity of the objective function E, and continuity of E will follow from the continuity of the network function η(x, w) considered as a function of w. In most cases, the network function is a continuous function of w and compactness follows. Hence, Tikhonov’s condition for stability is satisfied by the addition of a regularization term C. It was first noted by Vapnik in [240], elaborated by him in [241, Theorem 5.1], and ensuing discussion, and discussed in greater generality by Bartlett [21] that the size of the parameter vector is key to the generalization performance of a network. More precisely, the l1 -norm A of the final layer weights is shown in Theorem 7.8.5 to play a key role in determining generalization behavior. In view of these results, regularization using A appears well-justified by its emphasizing solutions with small values of A that then generalize well. It remains to choose the regularization parameter λ, and this is a nontrivial task. It is clear from Theorem 6.3.2 that λ can be neither too large nor too small. If λ is very large, then the regularized network is just the “smoothest” or “simplest” network with very little regard for its performance. If λ is very small, then we are nearly in the unregularized case of seeking only to minimize error on the training set with its attendant likely failure to generalize. Another way to view the effect of varying λ is that when it is very large we generate estimates with high bias but low variance, but when it is very small we generate estimates with low bias and high variance. As the expected squared error is the sum of the variance and the square of the bias, there is some choice of λ that yields a minimum of this expected squared error. Finally, the interpretation of λ as a Lagrange multiplier in Eq. 6.3.1 shows that varying λ amounts to varying the distance
$\rho_U(A(z), u_\delta)$ between the approximate target $u_\delta$ and the network response A(z) to a weight vector z. Choosing $\lambda$ gives us some ability to steer the solution to some point that may be desired for inadequate reasons, while inappropriately cloaking ourselves in the mantle of regularization. One approach to the selection of $\lambda$ is to estimate the generalization error as a function of just the $\lambda$ parameter (having solved for the best network for a given value of $\lambda$) and then to choose $\lambda$ to minimize the generalization error estimate. Typically, this generalization error estimate is a 'cross-validation' estimate (see Section 7.7.1). We are reduced from seeking a minimum of cross-validation error over a high-dimensional parameter space W to seeking its minimum as a function of a scalar. This reduction in dimension may make this process more reliable. However, as Hastie and Tibshirani note in [98, p. 52], "Although cross-validation and the other automatic methods for selecting a smoothing parameter seem well-founded, their performance in practice is sometimes questionable."
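In practice the selection of $\lambda$ just described can be organized as a simple sweep over a grid of candidate values, scoring each trained network on data held out from training. The following MATLAB sketch uses the QN1reg and ntout1 programs of Section 5.13; the grid of values, the variable names Ttrain, Tval, and ww0, and the use of a single held-out set in place of full cross-validation are all illustrative assumptions, and, as listed, QN1reg relies on the older flops counter, so it runs as written only under a MATLAB version that still provides it.

%sketch: choose lambda by a validation sweep using QN1reg and ntout1 of Section 5.13
%Ttrain and Tval are (d+1) x n arrays [inputs; target]; ww0 is an initial weight vector
lambdas=[0 .001 .01 .1 1]; node='tanh'; steps=20; runlim=200;
Sval=Tval(1:end-1,:); tval=Tval(end,:);
valerr=zeros(size(lambdas));
for j=1:length(lambdas),
  runpar=[steps runlim lambdas(j)];
  [wwbest,err]=QN1reg(runpar,node,ww0,Ttrain); %train with this value of lambda
  [a1 d1 a2]=ntout1(node,wwbest,Sval); %network response on the held-out inputs
  valerr(j)=(tval-a2)*(tval-a2)'/size(Sval,2); %held-out SSE per sample
end
[mn,jbest]=min(valerr); lambda=lambdas(jbest);

The chosen value is then the one whose trained network generalizes best on the held-out data, in the spirit of the cross-validation estimate of Section 7.7.1.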
6.4 Information-Theoretic Complexity Control

6.4.1 Kolmogorov Complexity
The complexity approach attempts to formalize the medieval principle, known as Occam’s Razor, of adherence to the simplest, well-founded explanation. Actually, Occam, or William of Ockham (see Safire [207]), is reputed to have announced in Latin that “entities should not be multiplied.” However, this is doubtful; a more accurate attribution in E. Moody [170] is either, “Plurality is not to be assumed without necessity,” or “What can be done with fewer [assumptions] is done in vain with more.” There are many formulations of such an approach, with the most fundamental ones due to Solomonoff, Chaitin, and Kolmogorov (e.g., see Li and Vitanyi [146], Cover and Thomas [48]) and ones more capable of implementation having been developed by Rissanen in [193, 194, 195, 196] and Barron in [16, 17, 20] (see Section 6.5). In these latter approaches, one is directed to minimize an expression that can often be construed as a cost function measuring the degree of approximation of the model to the data that is added to a term that measures the complexity of the model. Some of the first work in this direction is that of Akaike [2] who proposed a simple and well-known criterion called AIC for model selection that was analyzed by Shibata [214, 215] and revised to make it consistent by Schwartz [213]. These approaches can also be thought of as other means for determining the smoothing functional (e.g., Eq. 6.3.1) of regularization. The most fundamental complexity approach makes use of what has become known as
Kolmogorov complexity and suffers from the critical practical disadvantage of leading to ineffectively computable prescriptions. The approach based on Kolmogorov complexity requires us to introduce a representation of the class of effectively computable functions. There are several common ways to do so, but perhaps the most intuitive is through the notion of a Turing machine. A Turing machine (TM) is a basic representation of a digital computer. It has a single, doubly infinitely long tape that is ruled into squares. The tape is processed by a tape head that can read and overwrite on an individual tape square and a finite state controller to direct the tape head. Adopt a symbol alphabet of blank b, zero 0, and one 1. The tape is initially blank. The finite state controller can move the tape one square at a time either left (L) or right (R) or leave the tape unmoved (U). The tape head is always positioned over a square, and it can read what is written on the square and overprint one of the three symbols on the square or do nothing (halt). The controller state set $S = \{s_i\}$ is a finite set of distinct elements. The controller is specified by a (necessarily) finite collection $C = \{\langle s, P, P', M, s' \rangle\}$ of quintuples having the meaning that if you are currently in state $s \in S$ and the tape head is reading the symbol $P \in \{0, 1, b\}$ on the tape square underneath it, then you overprint the symbol P', move the tape according to $M \in \{L, R, U\}$, and go to controller state $s' \in S$. A TM is a deterministic model of computing. It accepts an input finite string p that is the program, written on the tape as a consecutive string of 0s and 1s without intervening blanks; it starts with the tape head positioned at the left end of this string. The TM has a distinguished start state $s_0 \in S$ and a distinguished halt state $s_h$. Computation starts with the controller state being $s_0$. The halt state is such that there is no quintuple whose first element is $s_h$. Hence, if the controller ever makes a transition to $s_h$, it can take no subsequent action and has halted. The TM(p) output, when it has halted, is then the finite string of 0s and 1s that are on the tape, without regard to any intervening blanks. While it is clear that a Turing machine is a well-defined digital computing device, it is less clear that all reasonable computations can be carried out by such a machine with an appropriate controller. However, several different attempts to define the class of computable functions all were shown to define the same class of partial recursive functions. Church's thesis maintains that any of these representations does indeed define what can be computed. Hence, restricting our attention to TMs loses no generality in modeling computable functions and processes. Because there can only be countably infinitely many controllers specified by the finite sets C, it is clear that there can only be countably infinitely many distinct TMs, and hence only countably many computable functions. We can enumerate the set of TMs by assigning to each an integer such that given the integer we can reproduce the controller C. Such an enumeration is called a Gödel numbering. Hence, the Gödel number g uniquely describes a specific $TM_g$.
For convenience, we assume that g is given a binary representation, and we denote the length (number of bits) of this representation by |g|. Furthermore, there can only be countably infinitely many programs because they are all finite strings of characters from a finite (two elements in this case) alphabet. Clearly then, there are only countably infinitely many computable real numbers (values of computable functions), and therefore almost all irrational numbers are incomputable. Thus, say, in dealing with a training set T of real-valued vectors, we need to assume that they are all computable. This can be achieved, from a practical viewpoint, simply by assuming that all of these real numbers are given only to finite precision.

Let a binary string p that is a potential program have its length in bits denoted by |p|. The set of all finite binary strings is denoted by {0, 1}*. Not all such strings are actual programs. For some of them, the TM_g(p) will never enter the halt state and thus never reach the end of a computation (e.g., the computer might loop, although this is not the only way a TM can fail to halt). We can compute a function f on a TM if we can find the appropriate TM_g, say, such that when the program p_x encodes the input x to f then TM_g(p_x) = f(x). A function that can be computed by some TM is said to be effectively computable; otherwise it is said to be ineffective. There is no TM that can, given the Gödel number g of a TM and a program p, the two binary strings being uniquely encoded as a single binary string, determine whether TM_g(p) halts; this is known as the undecidability of the halting problem. There is no effective means of always determining whether a program p runs on a given machine. Obviously, in specific cases we can answer this question by running TM_g(p), waiting a specified time τ, and learning that the machine did halt in this time or that it was observed to be in a loop and could never halt.

There exist infinitely many universal Turing machines (UTMs) that have the property that, when presented with an appropriately encoded pair of Gödel number g and program p, they can simulate TM_g and then run p on TM_g. One way to encode the pair g and p is to first encode the length |g|, and this requires log(|g|) bits. However, we also need to separate g from p. For easy decodability we repeat each bit of |g| once and then terminate the expression by a pair of different bits, say, 01. The pair 01 uniquely marks the end of the encoding of |g|, and this encoding requires approximately 2 log(|g|) bits. We then concatenate this encoding of |g| with the binary string g and follow it with the binary string p. Although this encoding can be shortened, it cannot be shortened by more than log(|g|) bits, assuming (as would typically be the case) that p is significantly longer than g.

A first attempt to define the Kolmogorov complexity K of a training set T would be to count, say, the number of bits required to define a specific Turing machine together with its program such that this machine would
yield as output a complete description of T. Because we want the most efficient such description, we introduce the following.

Definition 6.4.1 (Kolmogorov Complexity)
K(T) = min{|p| + |g| + 2 log(|g|) : (∃p, g ∈ {0, 1}*) TM_g(p) = T}.

In effect, the complexity of a training set or data is the length of the shortest code word that could be used to reproduce the data by first constructing a machine TM_g and then providing it with a program of length |p|. We can interpret this result as the term |g| + 2 log(|g|) being a model complexity penalty added to the complexity of description (length of program |p|) of the data given the model. This yields a parallel to our previous results in this chapter on adding penalty terms to data-fitting terms. We relate this approach more closely to neural networks in Sections 6.4.2 and 6.5.2.

An alternative notion of complexity, and one considered earlier by Kolmogorov, is to use the length |p| of the shortest program that generates the desired response on a particular UTM, say, TM_u:
K_u(T) = min{|p| : (∃p ∈ {0, 1}*) TM_u(p) = T}.
Because an encoding of the pair of binary sequences needs to have length no longer than |p| + |u| + 2 log(|u|), complexity evaluations according to the length of the shortest program on the UTM TM_u are no more than |u| + 2 log(|u|) bits larger than the complexity evaluations according to any other TM_g and might be shorter,
|K(T) − K_u(T)| ≤ |u| + 2 log(|u|).
The usual approach to Kolmogorov complexity assumes that evaluations are made using UTMs. Although Kolmogorov complexity allows us to make a fundamental attack on the measurement of the complexity of a data set, it suffers from the fact that K and K_u are not themselves effectively computable; this can be seen as a consequence of the undecidability of the halting problem preventing us from knowing, in general, when we have found a shortest program for the data on a given machine.
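As an aside, the self-delimiting pair encoding sketched above (double each bit of the length |g|, terminate with 01, then append g and p) is easy to realize; the bit strings used below are arbitrary examples.

def encode_pair(g: str, p: str) -> str:
    length_bits = format(len(g), 'b')                 # binary representation of |g|
    prefix = ''.join(bit * 2 for bit in length_bits)  # roughly 2 log2(|g|) bits
    return prefix + '01' + g + p

def decode_pair(s: str):
    i, length_bits = 0, ''
    while s[i:i + 2] != '01':                         # doubled bits are '00' or '11', never '01'
        length_bits += s[i]
        i += 2
    i += 2
    g_len = int(length_bits, 2)
    return s[i:i + g_len], s[i + g_len:]              # (g, p)

g, p = '110101', '0011100'
coded = encode_pair(g, p)
assert decode_pair(coded) == (g, p)
print(len(coded))                                     # |p| + |g| + 2*len(bits of |g|) + 2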
6.4.2 Application to Neural Networks
We adapt the Kolmogorov complexity concept, making it effectively computable, by reference to an architecture N of neural networks. The neural network problem differs from the ones discussed in the preceding two subsections. We are not trying to fit the training data T but rather trying to infer the desired t from a given x. To adapt the Kolmogorov complexity definition we need to consider that the TM_g is given a program ⟨p, x⟩ that
encodes the pair of inputs consisting of the function input x and a program p that enables TM_g to conclude with the desired t,
K(T) = min{|p| + |g| + 2 log(|g|) : (∃p, g ∈ {0, 1}*)(∀i ≤ n) TM_g(p, x_i) = t_i}.
To adapt this definition we move from the class of Turing machines to the class of neural networks. Because there can only be countably many computable network functions, we could assume an enumeration of N. This family of functions is parameterized by w ∈ W ⊂ IR^p, with w in the role of the Gödel index g. We can confront the uncountability of W by approximating or quantizing the elements of the parameter vector w describing the network η(·, w) by supplying approximately p log₂(√n) bits of description of w (see Section 6.5). However we do this, let |w| denote the number of bits needed to encode the quantized w and take W to be countable. After all, computational evaluations of w via the training algorithms of Chapter 5 will always return vectors with finite precision that satisfy the quantization constraint and countability. A first attempt at a complexity definition is
K̂_N(T) = min{|w| : (∃w ∈ W)(∀i ≤ n) η(x_i, w) = t_i}.
Of course, if N is not complex enough (e.g., W is too small) then there will be no w capable of reproducing T. This suggests that we need to modify the definition of complexity to allow for approximating to the training data. The simplest application is to pattern classification where the target t takes on only finitely many values. For simplicity, assume that we are studying binary pattern classification. If a given binary-valued network produces y_i in place of t_i, then we can construct T by supplying the additional information required to correct y to t. Introduce the Hamming distance between {0, 1}^n-valued vectors
h(t, y) = Σ_{i=1}^n |t_i − y_i|.
Given y we can construct t if we are given the locations of the h disagreements. Letting H denote the binary entropy function (see Chapter 2), this in turn requires
b(t, y) = log₂ C(n, h(t, y)) ≈ n H(h/n) bits,
where C(n, h) is the binomial coefficient. We are thus led to the following.

Definition 6.4.2 (Effective N-Complexity)
K_N(T) = min_{y,w} {b(t, y) + |w| + 2 log₂(|w|) : (∀i ≤ n) η(x_i, w) = y_i}.
We can bound K_N(T) by noting that it requires no more than n bits to specify t. Hence
min{|w| + 2 log₂|w| : w ∈ W} ≤ K_N(T) ≤ n + min{|w| + 2 log₂|w| : w ∈ W},
and the upper bound is essentially n under any reasonable encoding of w. If the architecture N has VC capacity at least n, then T can be shattered or learned without error. In this case we can make sense of our first attempt at a complexity definition, but it is not preferred to the final one.

A choice of network w^* can now be made on the basis of a minimum complexity interpretation of Occam's Razor. Noting that y_i = η(x_i, w), we let β(w, T) = b(t, y) to define
w^* = argmin_{w∈W} [β(w, T) + |w| + 2 log₂(|w|)],
where |w| represents the bit length in some encoding of the elements of W. As above, we can take |w| = p log₂ √n, and thereby determine the size p of the network. Once again, we find an error term β, measuring the discrepancy between the target and the network outputs, that is augmented by a penalty term that enables us to balance approximation ability and network complexity.
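As a small numerical illustration of this selection rule, one can tabulate the description score for a few candidate networks. The parameter counts and disagreement counts below are invented for the example, and |w| is taken as p log₂ √n as suggested above.

import math

def binary_entropy(q):
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def description_score(n, h, bits_w):
    """b(t,y) + |w| + 2*log2(|w|), with b(t,y) approximated by n*H(h/n)."""
    b = n * binary_entropy(h / n)            # bits to point out the h disagreements
    return b + bits_w + 2 * math.log2(bits_w)

n = 1000
# Hypothetical candidates: (number of parameters p, misclassifications h on T).
for p, h in [(5, 120), (20, 40), (80, 35)]:
    bits_w = p * math.log2(math.sqrt(n))     # |w| ~ p*log2(sqrt(n)) bits
    print(p, h, round(description_score(n, h, bits_w), 1))

Here the middle-sized candidate wins: the largest network reduces the disagreement term only slightly while paying a much larger parameter-description penalty.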
6.5 Stochastic Complexity

6.5.1 Code Length Measure of Complexity
We turn now to effective procedures based upon replacing a T M by a computable probability density for possible data sets. In the approach of Rissanen in [193, 194, 195, 196], one considers probabilistically based encodings of the training set T of size n and looks for shortest or minimal length encodings. In this setting one thinks of a model as a probability measure Pθ (x, t) that governs the random process generating the data x, t. The family of such models is Θ, perhaps Θ ⊂ Rs for an s-parameter indexing of models. For convenience, in this section we will let x represent the pair x, t. Complexity is now to be measured by the overall length L(x) of a shortest description of the data. Because there can only be finitely many code words of a given length and only countably many code words of all lengths, we can only decipherably encode countably many distinct values of x. For the moment, assume that x is discrete-valued to satisfy this condition. L(x) can be thought of as the sum of the description length of the data given a specific probability model L(x|θ) plus the description length LΘ (θ) of the model, L(x) = L(x|θ) + LΘ (θ),
although there would have to be an additional term of O(log(L_Θ(θ))) to account for an encoding of the pair of strings.

The description length L(x|θ) of the data given a specific probability model is derivable from the associated probability measure P_θ(x) by considering an efficient encoding (short average code word length) of the data. If x is discrete-valued then we can use Huffman coding or arithmetic coding (see Cover and Thomas [48]) to conclude that the shortest average code words have bit lengths −log₂ P_θ(x). Huffman coding is a straightforward algorithm for assigning shortest code words to most probable data values in such a manner that the resulting (prefix) code has shortest average (expected) length. If x is real-valued then its encoding has to take into account the accuracy of the necessary discrete approximation. Basically we finely quantize the observation space with x replaced by its quantized value [x] and use L(x|θ) = −log₂ P_θ([x]).

We must similarly encode a specification of the model. If there are only countably many models in Θ then we somehow fix on a uniquely decipherable encoding of Θ; i.e., the code word lengths must satisfy the Kraft inequality (Cover and Thomas [48, p. 82])
Σ_θ 2^{−L_Θ(θ)} ≤ 1.
Such an encoding can be interpreted in terms of a prior probability distribution π on Θ through L_Θ(θ) = −log₂ π(θ) and thereby relates to our Bayesian approach of Section 6.2. More commonly, Θ ⊂ IR^s and we must approximate the real-valued weights or parameters describing the model. Arguments for the needed precision of quantization have been advanced by Rissanen based on the asymptotic behavior of a maximum likelihood estimator θ̂_n of the model parameters using i.i.d. data T_n. Rissanen suggests that we should proceed in this case by quantizing the parameters describing the model using about log₂ √n bits for each parameter, and thereby balance errors of approximation and those of estimation. This can be applied immediately to an i.i.d. training set T having n input-output pairs through
L_Θ(θ̂_n) = s log₂ √n,
L(T) = Σ_{i=1}^n −log₂ P_{θ̂_n}([x_i]) + L_Θ(θ̂_n) + O(log L_Θ(θ̂_n)).
Again, this set of code words can be reinterpreted in terms of a prior probability on Θ that can be used in the Bayes formulation.
Hence, we have the basis for a stochastic measure of complexity through
R(T) = min_{θ∈Θ} [ Σ_{i=1}^n −log₂ P_θ([x_i]) + L_Θ(θ) + 2 log(L_Θ(θ)) ].
The minimizing value θ∗ is then the estimated “best fit” probability model. Adding in 2 log(LΘ (θ)) to account for the encoding of pairs yields a result for complexity measure R that is a close parallel to that of K. In effect, by selecting the model that minimizes the square-bracketed expression, we are choosing a maximum a posteriori or Bayes model. If we ignored the terms assessing the complexity of the model itself, then our choice of model would be maximum likelihood.
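A toy illustration of selecting such a minimizing θ* follows. The data, the small grid of fully specified Bernoulli models, and the uniform model prior are all invented for the example, and the O(log L_Θ) pairing term is omitted for simplicity.

import math

data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
models = [0.5, 0.7, 0.8, 0.9]          # candidate values of P(t = 1); a quantized grid
prior = [0.25] * len(models)           # uniform prior => L_Theta(theta) = 2 bits each

def data_bits(theta, xs):
    """-sum_i log2 P_theta(x_i), the data description length under one model."""
    return sum(-math.log2(theta if x == 1 else 1.0 - theta) for x in xs)

scores = [(data_bits(th, data) - math.log2(pi), th) for th, pi in zip(models, prior)]
for bits, th in scores:
    print(f"theta={th}: {bits:.1f} bits")
print("selected model:", min(scores)[1])   # the model with the shortest two-part code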
6.5.2 Application to Neural Networks
Further details on these encodings and their use for real-valued data and parameters can be found in Barron and Cover [20], and the cited papers by Rissanen. We do not pursue this in our neural network setting, although those who view neural networks as providing the probabilities of responses given inputs have access to the preceding as a means of selecting networks. The parameter θ ∈ Θ translates to w ∈ W. The term P_θ([x]) transforms into the joint conditional probability P_θ(t|{x_i}) of observing t = {t_i} when we have inputs {x_i}. Specifically, we determine the probabilities of responses by η(t|x, w); if t is {0, 1}-valued then we understand the usual η(x, w) to be the probability of t = 1. From the assumption that T is i.i.d. we have the conditional probability
P_θ(t|{x_i}) → Π_{i=1}^n η(t_i|x_i, w).
This approach, however, provides no way to address the distribution of the inputs {x_i}. Instead we use Barron [16] to tie together the following three themes:

• regularization as discussed in Section 6.3.3 and now placed in the usual context of minimizing training set error;

• the minimum description length approach of Section 6.5.1 divorced from assumptions about true underlying probability models;

• the value of the Bayesian approach based on direct use of empirical error as discussed critically in Section 6.2.4 that is now understood in terms of practical usefulness and not in terms of epistemic truth (faithful modeling of beliefs and prior knowledge).

While use of the empirical error could not be justified from an epistemic Bayesian perspective in Section 6.2, it can be justified instrumentally, as we shall see below.
As in the previous discussion, we envision an infinite (countable) family M = {M} of models in which the model M corresponds to a particular network architecture described by s_M parameters. A particular network in M can be denoted by the vector parameter w_M, with w denoting an arbitrary parameter vector in W. To fix ideas, the regularization or complexity penalty term C_n(w_M), for a sample size n, can be selected as a sum
C_n(w_M) = c_M + c_{θ,n}(w_M),
where c_M is a penalty for the complexity of the model or family M and c_{θ,n}(w_M) is an additional penalty for the particular choice of vector parameter permitted by that model. To relate the complexity penalty to the Bayesian picture, we can think of it as a logarithm of the prior density π(w) and factor this density into a product of a mass function π_M(M) on the countable class of models and a density π_M(w) over the parameter vectors that make up the model M. If we assumed logarithms taken to base 2, then this interpretation implies (similar to the Kraft Inequality)
Σ_M 2^{−c_M} = 1,   Σ_{w_M ∈ M} 2^{−c_θ(w_M)} = 1.
Interpretation of the complexity penalty in terms of a log-prior (e.g., see the related term R_M in Eqs. 6.2.9) is difficult to sustain in detail. We suggest that this approach can be justified only by convergence results of the kind presented below in Theorem 6.5.1 that show it leading to good results, and it cannot be justified by the fundamental epistemic position of Bayesianism. Dropping a commitment to a probabilistic interpretation for the complexity penalty, we follow Barron and accept normalization by other than unity (and change the logarithmic base) to require
Σ_{w∈W} e^{−C_n(w)} ≤ s;
the parameter s will appear in the bound given below. In place of the usual empirical error, we use the per-sample form
E_{T_n}(w) = (1/n) Σ_{i=1}^n (η(x_i, w) − t_i)²,
to which we add the complexity penalty to form the desired quantity to be minimized by choice of network w_n,
D_n(w) = E_{T_n}(w) + (λ/n) C_n(w).
The algorithms of the sort discussed in Chapter 5 can be used to identify the parameter vector wn ∈ W that comes close to minimizing Dn (w). For
the theoretical analyses to be presented, we assume that the true minimum of D_n is at w_n. We make the unrealistic assumption that the Bayes estimator E(t|x) corresponds precisely to a specific network η(x, w^*). This is unlikely to be the case. Results on this question that do not assume the Bayes estimator to be a neural network, but place a different restriction on networks (bounded fan-in at each node), can be found in Lee et al. [142].

The theoretical regret r incurred by choosing w as compared with the best choice of w^* is measured by
r(w) = E(η(x, w) − t)² − E(η(x, w^*) − t)² = E[η(x, w) − η(x, w^*)]².
We use this measure r of regret to define an index of resolvability
R*_n = min_{w∈W} [ r(w) + (λ/n) C_n(w) ].
In the limit of large sample size n we expect that R*_n converges to zero.

Theorem 6.5.1 (Complexity Regularization [16], Sec. 5.2) Assume that the target t and all network responses lie in a finite interval of length b (e.g., see Section 7.1). Let w_n achieve a minimum of the empirical regularized expression D_n(w), and define the threshold
τ(n, b, λ, R*_n, δ) = (1 + 6b²/(3λ − 5b²)) R*_n + (2 + 6b²/(3λ − 5b²)) (λ/n) ln(1/δ).
Restrict λ > 5b²/3 to ensure that τ > 0. For any 0 < δ < 1 and s as defined above in terms of the penalty function C_n, we conclude that
P(r(w_n) ≥ τ) ≤ (s + 1)δ.

Assuming that, as expected, R*_n converges to zero with increasing n, then given any small b, λ, s, δ we can choose a large enough sample size n to make τ as small as desired. Thus the regularized estimate w_n is good in that with probability at least 1 − (s + 1)δ, which can be made as high as desired, and with τ that can be made as small as desired, by use of large enough n, the resulting actual regret r(w_n) is not much bigger than the index of resolvability R*_n, and this index is the theoretically optimum value of weighted penalty plus regret. Restated, the discrepancy r(w_n) between the performance of the network specified by the regularized parameter w_n and the performance of the optimal network specified by w^* will converge in probability to zero, thereby defending the use of a regularized estimate w_n.
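To see how the pieces of the threshold interact, one can tabulate τ for a few sample sizes. The values of b, λ, δ, and the assumed decay of R*_n below are purely illustrative.

import math

def tau(n, b, lam, R_star, delta):
    """Threshold from the complexity-regularization bound above (sketch)."""
    assert lam > 5 * b**2 / 3, "need lambda > 5b^2/3 so that tau > 0"
    c = 6 * b**2 / (3 * lam - 5 * b**2)
    return (1 + c) * R_star + (2 + c) * (lam / n) * math.log(1 / delta)

# Illustrative values only: responses and targets in an interval of length b = 2,
# penalty weight lambda, resolvability R*_n shrinking roughly like 1/n.
for n in (1_000, 10_000, 100_000):
    print(n, round(tau(n, b=2.0, lam=10.0, R_star=50.0 / n, delta=0.05), 4))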
6.6 Overfitting Control: Validation, Cross-Validation, and Generalization Error Bounds

It is common in practice to approach a problem with a neural network architecture of greater complexity (e.g., more nodes or inputs) than is warranted by the size of T, as treated in our earlier discussions of generalization and training performance. Such an architecture enables its user to overfit the training data (E_T(ŵ_n) can be made too small) and thereby produce a design that has poor generalization performance. To guard against this outcome, the progress of the training algorithm is monitored through frequent estimates of true generalization performance, e_g(w_i), where w_i is the estimator produced by the training algorithm at epoch i. When the validation (see Section 7.5) or cross-validation generalization error estimates stop decreasing and begin to increase, it is deemed that overtraining is imminent and the algorithm is terminated, even though the training error continues to decrease. In effect, exploration of the parameter space is restricted to reduce the initial flexibility provided by a too large architecture. There is something unappealing about this ad hoc process, but it has its adherents. A sketch of such an early-stopping rule is given below.

Dodier [62] makes the interesting observation that at the early stopping time for steepest descent training the training error gradient is orthogonal to the validation error gradient, as a consequence of the rate of change of the validation error at the stopping time being 0. The remainder of the analysis is restricted to linear networks. In this case Dodier concludes that the most probable stopping point does not depend on the size of the independent validation set used to estimate generalization error. However, the dispersion about this stopping point does depend on the size of the validation set and thereby affects the expected generalization error at the stopping point. Amari et al. [9] conclude "that asymptotically the gain of the generalization error is small even if we could find the optimal stopping time."

Cross-validation, as discussed in Section 7.7.1, is a computationally intensive process of estimating generalization error, and frequently in neural network practice what passes for cross-validation is actually a single reserved test set. Currently cross-validation does not have an adequate statistical defense for application to neural networks in the presence of at best moderate training sample sizes. Of course, in place of cross-validation, we can use other methods such as the jackknife and bootstrap to estimate generalization error; these will also be described critically in Section 7.7.2. Liu in [150, Theorem 1] presents an asymptotic approximation for what he calls a jackknife estimator (a hold-one-out estimator without bias correction) that eases its computation.
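A minimal sketch of such an early-stopping rule follows. The helper names train_epoch and error are placeholders for whatever training step and error estimate are in use, and patience-based stopping is only one common variant of the rule described above.

import copy

def train_with_early_stopping(net, train_set, val_set, train_epoch, error,
                              patience=5, max_epochs=500):
    """Stop once the validation error has failed to improve for `patience` epochs."""
    best_err, best_net, since_best = float('inf'), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_epoch(net, train_set)          # one pass of gradient-based training
        val_err = error(net, val_set)        # e.g., quadratic or misclassification error
        if val_err < best_err:
            best_err, best_net, since_best = val_err, copy.deepcopy(net), 0
        else:
            since_best += 1
            if since_best >= patience:       # validation error has begun to rise
                break
    return best_net, best_err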
A related approach trains to minimize upper bounds to generalization error e_g that derive from Vapnik-Chervonenkis theory and its extensions (see Section 7.8). In binary pattern classification applications we can use Theorem 7.8.3 to produce an upper bound to e_g(w) that depends not only on the empirical error (the number of misclassifications on the training set by the network described by w) but on aspects of the network encapsulated in the fat-shattering dimension; this fat-shattering dimension can be estimated through Theorem 7.8.4. One then selects the network that minimizes the upper bound to e_g. In so doing one is acting much as if one were minimizing a sum of empirical error (actually a term that includes not only the number of misclassifications but also the number of instances in which the network was insufficiently decisive) plus a regularization term, but with a better grounded, and more complex, choice of penalty term.
6.7 Growing and Pruning Architectures

6.7.1 Growing a Network
Early work on growing networks is available in Barron and Barron [19] and the references therein. There are methods of backfitting (Hastie and Tibshirani [98], Yuan and Fine [258]) that sequentially enlarge a network by adding nodes, depending on the performance achieved to that point. The well-known cascade correlation algorithm (CCA) of Fahlman and Lebiere [69] and its version by Littmann and Ritter in [149] provides a different method for growing a network. The basic idea of the CCA of Fahlman and Lebiere has several components, presented here for a network with a single output:

1. to initiate the architecture, select some node function σ_f (e.g., sigmoidal or linear) and connect x to σ_f through weight vector w_f to generate an output of σ_f(w_f · x);

2. train to minimize the error on the training set; this is a straightforward least-squares problem if σ_f is linear;

3. add a new node σ_1 that has as inputs all of the network inputs x connected through w_1 and has output that is a new input to σ_f; this new node creates a new hidden layer just below σ_f;

4. train w_1 so that you maximize the covariance between the output of σ_1 and the errors made by the prior network;

5. retrain the output weights w_f, augmented to connect σ_1 as well as x to σ_f, to minimize training set error;

6. evaluate the new training error;

7. given that k − 1 nodes have been added, if the performance is as yet unsatisfactory then add a new node σ_k, again in a new hidden
layer just below σ_f, with weight vector w_k that accepts as inputs x together with the k − 1 outputs of the prior nodes;

8. leave unchanged all the prior node input weight vectors w_1, . . . , w_{k−1}, and train w_k to maximize the covariance between the output of σ_k and the errors made by the prior network;

9. retrain only the output weights w_f, augmented to connect σ_1, . . . , σ_k as well as x to σ_f, to minimize training set error;

10. evaluate the new training error; and

11. halt if either the error is adequate or some other limit (number of nodes or computation effort) has been reached; otherwise add another node and repeat the above steps.

This process generates a deep feedforward network having a single node in each hidden layer, although the connection pattern is not that of a multilayer perceptron in that the input x is made available at all layers. CCA can be adapted to add several nodes at a time, a module, to a hidden layer. Littmann and Ritter revise this process to eliminate the new issue of maximizing covariance between the new node output and the previous network errors by not fixing the output node as above. Rather, they view the CCA as adding a new output ("final") node at each iteration and then training to minimize the usual training error while keeping all prior node input weights fixed.
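The covariance criterion of steps 4 and 8 can be illustrated with a deliberately simplified sketch: instead of training a candidate unit by gradient ascent on the covariance, the code below scores a pool of randomly initialized tanh candidates and keeps the best one. The data, the candidate pool, and the use of random candidates are assumptions of this example, not part of the algorithm as stated above.

import numpy as np

def candidate_score(v, e):
    """|covariance| between a candidate unit's outputs v and the current residual errors e."""
    return abs(float(np.mean((v - v.mean()) * (e - e.mean()))))

def best_candidate(X, residuals, n_candidates=50, seed=0):
    """Pick the candidate hidden unit whose output covaries most with the residuals."""
    rng = np.random.default_rng(seed)
    best_w, best_s = None, -1.0
    for _ in range(n_candidates):
        w = rng.standard_normal(X.shape[1])
        v = np.tanh(X @ w)                 # candidate node output on the training inputs
        s = candidate_score(v, residuals)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# Toy usage: inputs X and the residual errors of the current network on the training set.
X = np.random.default_rng(1).standard_normal((200, 8))
residuals = X[:, 0] * X[:, 1]              # stand-in for the errors made by the prior network
w, s = best_candidate(X, residuals)
print("best |covariance| score:", round(s, 3))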
6.7.2 Pruning a Network
There are also methods to prune a given large network to eliminate relatively unimportant nodes or connections (e.g., LeCun et al. [139], Hassibi et al. [94, 95, 96]). We attempt to identify those weight components that make little contribution to reducing the error E_T(w) in approximating to the training set. We can then consider setting these weights to zero and thereby removing them from the network. The basic idea underlying the determination of the nearly irrelevant weights is a Taylor's series of the error function to second order in the weights or parameters. If we assume that the network has been well-trained, then the first-order (gradient) term is approximately zero and can be neglected compared to the second-order (Hessian) term. Letting δw be a small increment in the weight vector w, then at a local minimum w
E_T(w + δw) − E_T(w) = δE_T ≈ (1/2) (δw)^T H δw.   (6.7.1)
We wish to select a component wq of w for elimination such that when we optimize over the remaining components we will sustain a minimum
increase in error over the value at the local minimum. Let e_q denote the unit vector in the component w_q direction. The constraint on the increment δw to reduce the qth component of the perturbed vector to 0 is
e_q^T δw + w_q = 0.
Minimizing the increase in error subject to this constraint yields the variational formulation with Lagrange multiplier λ given by
min_{δw} [ (1/2)(δw)^T H δw + λ(e_q^T δw + w_q) ].
The variational calculus solution is that the remaining weights should then be adjusted according to
δw = −λ H^{−1} e_q.
Solving for λ to achieve the constraint, letting H^{−1}_{qq} denote the (q, q) entry in the matrix H^{−1}, and using
e_q^T H^{−1} e_q = H^{−1}_{qq},
yields
δw = −(w_q / H^{−1}_{qq}) H^{−1} e_q
as the perturbation that minimizes δE_T subject to e_q^T δw + w_q = 0. The so-called Optimal Brain Surgeon (OBS) of Hassibi et al., discussed in [94, 95, 96], then selects a particular component w_q of w for elimination on the basis of its saliency, or the increase in E_T due to its elimination, given by
L_q = (1/2)(δw)^T H δw = (1/2) λ² e_q^T H^{−1} H H^{−1} e_q = w_q² / (2 H^{−1}_{qq}).
We select the weight having the smallest value of Lq for elimination. The major problem with this prescription or the related one of so-called Optimal Brain Damage (OBD) of LeCun et al. [139] is the need to determine the Hessian and its inverse. These are computationally intensive determinations requiring substantial storage. In OBD a simplification is attempted in that H is approximated by a diagonal matrix. OBS attempts a more accurate approximation including off-diagonal terms. The suggested approximation in Hassibi et al. [96] of eigenspace decomposition, on their own account, does not appear to be successful. Although these approaches are reasonable and tractable for small networks, they appear to be sensitive to the means of approximating the Hessian.
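A compact sketch of this saliency-based pruning step follows. The toy Hessian and weight vector are invented; in practice H must be estimated or approximated, which is the main difficulty noted above.

import numpy as np

def obs_prune_one(w, H):
    """Pick the weight with smallest saliency L_q = w_q^2 / (2 [H^-1]_qq),
    zero it, and adjust the remaining weights by dw = -(w_q / [H^-1]_qq) H^-1 e_q."""
    H_inv = np.linalg.inv(H)
    saliency = w**2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(saliency))
    dw = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    return w + dw, q, saliency[q]

# Toy example with an invented positive definite Hessian (values are illustrative).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4 * np.eye(4)
w = np.array([0.8, -0.05, 1.2, 0.4])
new_w, q, Lq = obs_prune_one(w, H)
print("pruned component:", q, " saliency:", round(float(Lq), 4))
print("updated weights:", np.round(new_w, 3))   # the pruned component is now exactly zero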
7 Generalization and Learning

7.1 Introduction and Network Specification

In this chapter we will provide an overview of several unrelated methods for evaluating the statistical performance of a selected neural network (or a selected pattern classifier or regression function). The first five sections lay the groundwork by defining terms of interest and making several assumptions that subsequently allow us to derive analytical results. The methods of the next three sections assume that we have a moderate sample size n. Section 7.6 presents the basic methods for assessing expected generalization error from an independent test set. These methods are the most successful but require sample sizes typically of order one thousand. Section 7.7 presents the statistical methods of cross-validation and bootstrap, methods that have been widely discussed in both the statistical pattern classification and neural network literature. However, as we shall see, they are poorly suited to typical neural network problems that possess a large number of minima of training error and that strain their computational resources. The elegant approach to uniform bounds pioneered by Vapnik and Chervonenkis is presented in Sections 7.8.1 and 7.8.2. Unfortunately, this method yields sample size estimates for reliable performance that are orders of magnitude larger than expected on the basis of experience. Sections 7.8.3 and 7.8.4 outline directions for improved results, especially the role of the fat-shattering dimension described in Section 7.8.4. In the remaining sections we turn to a large sample analysis in the hopes of learning about the qualitative behavior of the parameter estimate ŵ_n returned by
the training algorithms discussed in Chapter 5. In Sections 7.9 and 7.10 we establish results on the asymptotic convergence of these estimates with increasing n, not to a specific value but into a set of values, and determine the conditional asymptotic normality of the discrepancies between ŵ_n and its nearest minimum of generalization (expected) error. Finally, in Sections 7.11 and 7.12 we study the asymptotic behavior of both the generalization and training errors in the vicinity of their minima.

How are we to judge the quality of the neural network produced by the considerations of the preceding chapters? In this chapter we put aside otherwise important issues of the ease of implementation of the network, its computational speed, and robustness with respect to component misspecification and failure, and move to the appropriate statistical setting in which the target t is only stochastically related to the feature vector x and the features themselves are usually generated stochastically. Hence, there is a joint probability measure P which for convenience we may assume is described by a joint density function f_{t,x}(τ, ξ) or probability mass function p_{t,x} when the variables are discrete-valued. We assume that the elements (x_i, t_i) ∈ T_n are generated independently and identically distributed (i.i.d. P). Our objective is to explore the linkage between the observable empirical error whose minimization was studied in Chapter 5 and the true statistical performance, known as generalization error, of the selected networks.

We use y = η(x, w) to denote the (scalar) output y of the neural network η described by weight or parameter vector w when presented with the input (feature) vector x. The architecture η ∈ N is parameterized by the set W = {w} ⊂ IR^p of p-dimensional weight vectors. The input vector x ∈ X ⊂ IR^d is d-dimensional. The desired response to input x is the (scalar) target denoted by t ∈ T ⊂ IR.

Assumption 7.1.1 (Compact and Differentiable) The finite-dimensional feature X and parameter W spaces are both compact (closed and bounded), W is convex, and the target space T is bounded. The finitely many node functions in the layers making up the network η are twice continuously differentiable.

Restricting W to be compact is reasonable in both practice and theory. Recent work by Bartlett [21] (see Section 7.8) shows that the size of the network parameters can control the generalization ability of neural networks and that smaller parameter vectors are preferred. Convexity is postulated so that when we subsequently (e.g., Section 7.10) make a Taylor's series with remainder to expand a function of w about a point w̃, we are assured that any intermediate-value parameter w^* on the line segment joining w̃, w will also be in W. In Assumption 7.9.1 we will in effect add an assumption that W is not too small; it should be large enough to contain in its interior at least one of the minima of the generalization error. Making X compact
rules out in theory such familiar models as normally distributed feature vectors. However, in practice there is no difference between a normally distributed random vector and one whose components have been truncated to lie, say, within 10 standard deviations of their mean. In pattern classification applications the target space T would be not only bounded but a finite set. In estimation and regression problems the target variable may be known to be bounded |t| < c, perhaps because the target denotes a physical variable whose range is constrained (e.g., it is a phase angle lying in the interval [−π, π] or it is known that electric power generation cannot exceed 5000 MW). Furthermore, the remark about truncating distributions with unbounded support also applies to the target space T . The node differentiability assumptions eliminate linear threshold units (e.g., step functions) but include all the familiar smooth node functions (e.g., logistic, hyperbolic tangent) needed for gradient-based training algorithms. The chain rule of differentiation and the structure of a neural network as a composition at layer i of sums of node function responses at layer i − 1 enables us to conclude from Assumption 7.1.1 that η(x, w) is twice continuously differentiable with respect to the components of both w, x. From the assumed compactness of X and W and the continuity of both the component nodes and of the overall network function η, we can conclude that η(x, w) is also uniformly bounded; this follows from the fundamentals of real-valued continuous functions on compact subsets of IRd × IRp . Hence, the network response y is uniformly bounded over the family of networks indexed by W. Thus from Assumption 7.1.1 we see that there exists some finite b such that (∀x ∈ X , ∀w ∈ W) (η(x, w) − t)2 < b < ∞. We can also conclude that the first two partial derivatives of η with respect to the components of w are uniformly bounded. This follows from the compactness assumed for X , W and the fact that a continuous function on a compact set is uniformly continuous, and thus is bounded. We summarize the preceding arguments in the following. Lemma 7.1.1 (Network Functions) Under Assumption 7.1.1 the neural networks η ∈ N for x ∈ X are uniformly bounded and have continuous and uniformly bounded first and second partial derivatives with respect to the components of w ∈ W. Furthermore, (η − t)2 is uniformly bounded by some finite b.
7.2 Empirical Training Set Error

For purposes of statistical analysis we must distinguish a number of error terms and special parameter values, some of which are randomly selected (e.g., the parameter value ŵ_n chosen by the training algorithm). The degree to which a network η(·, w) approximates to (learns) T_n is usually measured by the quadratic empirical training error
E_{T_n}(w) = (1/n) Σ_{i=1}^n (η(x_i, w) − t_i)².
Although other approximation measures (e.g., entropy-related terms like divergence) are sometimes considered (particularly in a pattern classification setting; see Section 7.5 regarding measures of validation error), most training concerns reduction of the quadratic E_{T_n} by choice of w. From Lemma 7.1.1 we immediately conclude that the following is true.

Lemma 7.2.1 (Empirical Error Analytics) Under Assumption 7.1.1, the non-negative E_{T_n} is upper bounded by b. Furthermore the gradient column vector,
∇E_{T_n}(w) = G_n = [∂E_{T_n}(w)/∂w_i],
and the Hessian matrix of second derivatives,
H_E(w) = [H_{i,j}],   H_{i,j} = ∂²E_{T_n}(w)/(∂w_i ∂w_j),
both exist and have bounded and uniformly continuous elements.

We are interested in the behavior of the empirical training error in the vicinity of its local and global minima as w ranges over its domain of definition W. Let S_E denote the set of stationary points of E_{T_n}(w), S_E = {w̃ : ∇E_{T_n}(w̃) = 0}, and M_E denote the set of (local and global) minima of E_{T_n}(w). It is well known (Auer et al. [14]) that E_{T_n}(w) is likely to have many local minima and even multiple global minima, forcing us to take into account that |M_E| > 1, a condition given too little weight in previous asymptotic analyses. We will eventually assume that W is large enough that M_E and S_E overlap in W, but we defer this point for now.
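To fix notation, here is a direct computation of this empirical error for a one-hidden-layer network; it is a generic sketch with randomly generated data and weights, purely for illustration.

import numpy as np

def mlp_1hl(X, W1, b1, w2, b2):
    """One hidden layer of tanh nodes followed by a linear output node."""
    return np.tanh(X @ W1 + b1) @ w2 + b2

def empirical_error(X, t, params):
    """E_Tn(w) = (1/n) * sum_i (eta(x_i, w) - t_i)^2."""
    y = mlp_1hl(X, *params)
    return float(np.mean((y - t) ** 2))

rng = np.random.default_rng(0)
X, t = rng.standard_normal((100, 8)), rng.standard_normal(100)
params = (rng.standard_normal((8, 5)), np.zeros(5), rng.standard_normal(5), 0.0)
print(empirical_error(X, t, params))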
7.3 Gradient-Based Training Algorithms

We assume a training algorithm A whose input is at least a training set T_n, and possibly also a validation set, and whose output is a parameter vector
ŵ_n ∈ W,
A(T_n) = ŵ_n ∈ W ⊂ IR^p.
A selects a network or its parameters, usually through an iterative process that generates many intermediate estimates. A is typically a random algorithm (e.g., due to the random initialization of the iterations). The goal is to minimize the empirical error,
(∀w ∈ W) E_{T_n}(A(T_n)) ≤ E_{T_n}(w),
but we know from Chapter 5 that this goal is essentially unachievable, and we must content ourselves with achieving close approximations to local minima in M_E. We assume that A succeeds in approximating a local minimum in M_E by finding ŵ_n yielding a near-zero gradient of the function E_{T_n}(w),
∇E_{T_n}(w)|_{ŵ_n} = G_n,   |G_n| ≤ g_n = o(n^{-1/2}) ≈ 0.   (7.3.1)
This condition can be assured by incorporating it as one of the termination criteria for the iterative search that defines A.

Assumption 7.3.1 (Training Termination) Select a positive sequence δ_n converging to 0 and do not permit termination of the training algorithm at ŵ_n until
(a) |∇E_{T_n}(ŵ_n)| < δ_n;
(b) lim_{n→∞} √n δ_n = 0 (δ_n = o(n^{-1/2})); and
(c) H_E(ŵ_n) positive definite.

These three conditions can be assured by incorporating them in the termination criteria for the iterative search that defines A. Conditions (a) and (b) simply assert that we are doing gradient-based training in that one of our conditions for identifying the minimum ŵ_n is that the gradient at this point is small. We have refined this statement somewhat by the rate specification in (b), and shall show that this is important. Generally, we do not verify condition (c) on the Hessian, taking it for granted that the design of A leads to the vicinity of a minimum rather than to the vicinity of a maximum or saddle point. We are thinking primarily in terms of batch training. However, our analysis covers online training provided it is interrupted occasionally to verify the batch conditions of Assumption 7.3.1 that govern termination.
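The termination test in Assumption 7.3.1 can be realized directly; in the sketch below the particular rate δ_n = c·n^(-0.6) is an arbitrary choice satisfying condition (b), and the constant c is a placeholder.

import numpy as np

def should_terminate(grad, hessian, n, c=1.0):
    """Assumption 7.3.1 (sketch): stop only when the gradient norm is below
    delta_n = c * n**(-0.6), a rate that is o(n^-1/2), and the Hessian is positive definite."""
    delta_n = c * n ** -0.6
    eigvals = np.linalg.eigvalsh((hessian + hessian.T) / 2)   # symmetrize before the check
    return np.linalg.norm(grad) < delta_n and eigvals.min() > 0

# Tiny usage example with placeholder values.
H = np.eye(3)
g = np.array([1e-4, 0.0, 0.0])
print(should_terminate(g, H, n=10_000))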
7.4 Expected Error Terms

We focus on how well the network described by a parameter vector ŵ_n (e.g., determined by the methods of Chapter 5) can deal with new cases,
ones not usually contained in the training set T_n used to select ŵ_n. How close is η(x, ŵ_n) = y to the target t when (x, t) is stochastically independent of the training set T_n? Evaluation of the performance of the network at parameter vector w requires us to choose an approximation norm so as to generate an error measure. The most commonly used such norm is the quadratic (y − t)², although in the discrete case it is also common to use entropy-related functions. Other choices for measuring the size of the random error are possible (e.g., a percentile such as the median) but the expected error proves to be far more tractable and is well-entrenched as a choice. We adopt the quadratic error function. From Lemma 7.1.1 the non-negative error (η − t)² is bounded above by some finite b. As the quadratic error is a random variable, we assess performance through the expected quadratic error, which is guaranteed to be finite by the boundedness just noted.

Definition 7.4.1 (Generalization Error) The generalization error,
e_g(w) ≡ E(η(x, w) − t)² = ∫∫ (η(ξ, w) − τ)² f_{t,x}(τ, ξ) dτ dξ = E E_{T_n}(w).

The expectation is evaluated with respect to the (commonly unknown) measure P of (x, t). In the specific case of binary classification in which η ∈ {0, 1}, t ∈ {0, 1}, the expectation E(η(x, w) − t)² = P(η ≠ t), and we are dealing with the error probability criterion. Lemma 7.1.1 combined with the Dominated Convergence Theorem yields the following.

Lemma 7.4.1 (Generalization Error Analytics) Under Assumption 7.1.1, the non-negative e_g is upper bounded by b. Furthermore, its gradient ∇e_g and Hessian H_{e_g} exist and have bounded and uniformly continuous elements.

Paralleling definitions given in Section 7.2, we define the set M_{e_g} of local and global minima of e_g(w), identify the set of stationary points S_{e_g} = {w̃ : ∇e_g(w)|_{w̃} = 0}, and assume that M_{e_g} ⊂ S_{e_g} (see Assumption 7.9.1). Ideally, we would like to know the generalization error e_g(w) as a function of w so as to select the parameter vector giving rise to the smallest such error. The generalization error of a best (we know from Chapter 4 that it is not unique) network w^0 ∈ W is
e_g^0 ≡ min_{w∈W} e_g(w),   w^0 = argmin_{w∈W} e_g(w).
Unfortunately, without knowledge of P we are in no position to make such a global evaluation of w^0. Although it is common to think of e_g as nonrandom, it will be random if its argument w is randomly selected (e.g., the outcome of a numerical algorithm A starting with a random initialization). If our training algorithm A achieves its goal of a global minimum (which is highly unlikely), then the generalization error
e_g^* ≡ e_g(w^*),   w^* = argmin_{w∈W} E_{T_n}(w),
is that of a network η^* with parameter w^* that is a (nonunique) global minimum of the known empirical error E_{T_n}(w). Note that e_g^* is random because it depends on the random training set T_n and on the choice of w^*; different values that yield the same minimum of E_{T_n} need not share the same values of e_g. However, we know from the discussion in Chapter 5 that finding a global minimum of the empirical error is unrealistic. In practice we settle for the parameter vector ŵ_n returned by our random training algorithm A. We then incur a random generalization error ê_g = e_g(ŵ_n) = e_g(A(T_n)). The evaluation of ê_g, addressed in Section 7.11 for large n, proves to be challenging given that we are not willing to make substantial (e.g., low-dimensional parametric models) assumptions about the probabilistic model generating x, t.

Finally, the statistically optimum estimator corresponding to minimizing the expectation of a quadratic objective/loss function is the conditional expectation E(t|x), and it is the Bayes estimator. Its statistical performance is e_B = E(E(t|x) − t)². Evaluation of the latter requires knowledge that we do not expect to have of the probability law governing the selection of (x, t). Nonetheless, it is evident that
e_B ≤ e_g^0 ≤ e_g^*,   e_g^0 ≤ ê_g.   (7.4.1)
The discrepancy e_g^0 − e_B is a bias resulting from the limitations imposed by the architecture N because the Bayes estimator need not be in N. Section 6.5 notes Barron's index of resolvability as a measure of this kind of discrepancy. Additional results are given in Sections 4.11–4.13 on the approximation ability of a family of networks having at most a given number of nodes. The discrepancy ê_g − e_g^* is due to the fact that A generally errs in finding the global minimum of the empirical error. Thus the discrepancies w^* ≠ ŵ_n, e_g^* ≠ ê_g are due to limitations of the algorithm A and discrepancies between the empirical and generalization errors. It is not unlikely that ê_g < e_g^*, although we might expect it to typically be the reverse; w^* achieving the
global minimum of the training error E_T need not achieve a good minimum value for e_g. The discrepancy e_g^* − e_g^0 is one manifestation of the statistical fluctuation inherent in dealing with a finite training set: the empirical error E_{T_n} is only a noisy estimate of e_g that can be expected to improve as n increases. A first direction in which to assess the discrepancy between e_g^* and e_g^0 is to define W_ε = {w : e_g(w) ≥ e_g^0 + ε}. It is then evident that
e_g^* ≥ e_g^0 + ε ⇒ w^* ∈ W_ε ⇒ min_{w∈W_ε} E_{T_n}(w) < E_{T_n}(w^0).
Thus,
P(e_g^* ≥ e_g^0 + ε) ≤ P(min_{w∈W_ε} (E_{T_n}(w) − E_{T_n}(w^0)) < 0).
Letting ∆(w) = E_{T_n}(w^0) − e_g^0 − E_{T_n}(w) + e_g(w), we find
P(e_g^* ≥ e_g^0 + ε) ≤ P(max_{w∈W_ε} (∆(w) + e_g^0 − e_g(w)) > 0) ≤ P(max_{w∈W_ε} ∆(w) > ε).
The term ∆(w) = O_p(1/√n) (fluctuations of averages of i.i.d. terms about their mean) and is being compared with a term that exceeds ε. Rather than proceed further with this development, we bound the discrepancy between e_g^* and e_g^0 through a uniform bound on the discrepancies between the empirical error E_{T_n}(w) and its expectation e_g(w):
0 ≤ e_g^* − e_g^0 ≤ e_g^* − E_{T_n}(w^*) + E_{T_n}(w^0) − e_g^0 = ∆(w^*)
  ≤ |e_g^* − E_{T_n}(w^*)| + |E_{T_n}(w^0) − e_g^0| ≤ 2 sup_{w∈W} |E_{T_n}(w) − e_g(w)|.   (7.4.2)
This yields the bound
P(e_g^* − e_g^0 ≥ ε) ≤ P(sup_{w∈W} |E_{T_n} − e_g| ≥ ε/2).   (7.4.3)
Of course, the upper bound of Eq. 7.4.2 is also an upper bound to other terms of interest such as |E_{T_n}(ŵ_n) − e_g(ŵ_n)|. For a slight improvement on this bound consider
P(e_g^* − e_g^0 ≥ ε) ≤ P(|e_g^* − E_{T_n}(w^*)| + |e_g^0 − E_{T_n}(w^0)| ≥ ε).
Note that
(∀γ) P(X + Y ≥ ε) ≤ P(X ≥ ε − γ) + P(Y ≥ γ),
to conclude that
P(e_g^* − e_g^0 ≥ ε) ≤ P(|e_g^* − E_{T_n}(w^*)| ≥ ε − γ) + P(|e_g^0 − E_{T_n}(w^0)| ≥ γ).
Invoking the Hoeffding inequality of Theorem 7.6.2 (to be presented in Section 7.6.3), we see that
(∀w ∈ W) P(|e_g(w) − E_{T_n}(w)| ≥ ε) ≤ 2 e^{−2nε²/b²}.   (7.4.4)
Hence,
P(e_g^* − e_g^0 ≥ ε) ≤ P(|e_g^* − E_{T_n}(w^*)| ≥ ε − γ) + 2 e^{−2nγ²/b²},
and it does not need a large sample size n to make the last quantity quite small. The parameter γ can therefore be chosen significantly smaller than ε. To a good approximation, we can replace Eq. 7.4.3 by
P(e_g^* − e_g^0 ≥ ε) ≤ P(sup_{w∈W} |e_g(w) − E_{T_n}(w)| ≥ ε).   (7.4.5)
Observe that the event max_i X_i ≥ ε is equivalent to the event ∪_i A_i where A_i is the event X_i ≥ ε. Hence, letting A_w denote the event |e_g(w) − E_{T_n}(w)| ≥ ε, Eq. 7.4.5 becomes
P(e_g^* − e_g^0 ≥ ε) ≤ P(∪_{w∈W} A_w).   (7.4.6)
The evaluation of the bound in Eq. 7.4.5 is taken up in Section 7.8.
7.5 Data Sets

In practice there are three kinds of data sets: training, validation, and test. Training data is the data set we denoted by T_n, consisting of n i.i.d. input-output pairs that are used to determine the empirical error E_{T_n}(w), which is then minimized by choice of network parameters w through one of the various search algorithms discussed in Chapter 5. The resulting parameter estimate ŵ_n is, of course, highly dependent on the training set T_n. The second kind of data, frequently but not always available, is a validation set V_m of size m that is stochastically independent of T_n but shares the same distribution. This set is used to determine when to halt the training process, that is, when to accept an estimate ŵ_n = A(T_n, V_m) produced in the course of training iteration (see also Section 6.6). Halting is determined by calculating the validation empirical error E_{V_m}(w) made by using w on V_m. As the iteration process proceeds, the validation error
FIGURE 7.1. Training and Generalization Errors Showing Overfitting
generally decreases, but there comes a time when this error starts to generally increase even though the training set error E_{T_n} is still decreasing. It is then judged that overfitting is occurring, training is halted, and ŵ_n is chosen. As the validation error does not enter into gradient calculations, we are free to make a choice other than the usual quadratic error. For example, in a pattern classification application we might choose to validate the sequence of neural networks being returned by the iterative training algorithm by calculating their performance in terms of classification error rates. We terminate training when the overall appropriately measured error rate starts to increase.

As an illustration of this we determined the training and (quadratic) validation errors as quasi-Newton training proceeds on a network having d = 8 inputs and a 1HL of s_1 = 5 sigmoidal nodes. The training set was of size n = 500 and was generated by adding normal mean zero noise of variance .01 to the output of a pre-selected network of the same architecture. Hence, e_B = e_g^0 = .01. The validation set is atypically large (m = 10,000) to better reveal the behavior of the true generalization error. The sequence of training set (left side of Figure 7.1) and validation errors is shown in Figure 7.1. In this training run ê_g = .026, significantly above the optimal value of .01.

The final data set is that of a test (confirmation) set C_k of size k that is chosen independently of T_n, V_m and distributed identically to them. The parameter estimate ŵ_n is always stochastically independent of C_k.

Assumption 7.5.1 (Data Sets) The data sets T_n, V_m, C_k are stochastically independent and comprised of independent and identically distributed (i.i.d.) elements (x, t) distributed as the probability measure P.

From Assumption 7.1.1 it is immediate that P has compact support in IR^d × IR. Given an initial data set D we are free to partition it into the three sets just described, although what we gain by making one component larger we may lose by making the others smaller. Kearns [123, pp. 1158–1159] (see also Guyon et al. [88]) suggests that the precise trade-off between the sizes of training and validation sets is not critical and that,
“ . . . somewhere between 20 and 30 percent devoted for testing—seems to give good performance . . . ” We first discuss the most straightforward situation of the estimation of ê_g = e_g(ŵ_n) from independent test and confirmation data C_k. In this case it is irrelevant how ŵ_n is chosen, and we need only analyze the case of e_g at a fixed parameter w. Several methods are presented in Section 7.6 for this purpose. The family of approaches based on worst-case bounds estimates ê_g by estimating a uniform bound on the discrepancy between empirical and generalization errors over all parameter values and underlying data distributions f_{t,x}. Intermediate between these approaches are a few that attempt to estimate discrepancies in a manner that makes use of some distributional information. We then turn to the more complex case of the estimation of ê_g from data T_n ∪ V_m on which it depends. The methods of cross-validation and bootstrap, reviewed in Section 7.7, introduce independence through revisions of the training data but, for reasons to be noted, are likely to be of little use in a neural network setting. The family of approaches based on learning curves, treated in Sections 7.9–7.12, studies the asymptotic-in-n properties of ŵ_n.
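A simple way to carve an initial data set into the three subsets just described is sketched below; the 25 percent test fraction is one point in the 20 to 30 percent range quoted above, and the 15 percent validation fraction is an assumption of this example.

import numpy as np

def split_data(D, test_frac=0.25, val_frac=0.15, seed=0):
    """Randomly partition a data set D (rows are (x, t) pairs) into T_n, V_m, C_k."""
    idx = np.random.default_rng(seed).permutation(len(D))
    k = int(test_frac * len(D))
    m = int(val_frac * len(D))
    test, val, train = idx[:k], idx[k:k + m], idx[k + m:]
    return D[train], D[val], D[test]

D = np.random.default_rng(1).standard_normal((1000, 9))   # e.g., 8 features plus a target
T, V, C = split_data(D)
print(len(T), len(V), len(C))   # -> 600 150 250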
7.6 Use of an Independent Test Set C_k

7.6.1 Limiting Behavior
We assume that we are provided with an independent test set C_k of k i.i.d. samples distributed as density f or probability measure P and chosen independently of the training set T_n. Indeed, given a large enough training set, it can be partitioned into a test set and a training set, and advice on this partitioning is given in Kearns [123]. It is good practice to set aside such a test set. However, the total amount of data is often quite limited (optical alphanumeric character recognition and speech recognition applications being exceptions in which large amounts of data can usually be obtained at a reasonable cost), and one would like to use most of the data for training purposes to achieve a higher likelihood of selecting a good network. If we drop the multiplicative factor of 1/2 used in Chapter 5, then the quadratic empirical error is
E_i(w) = (η(x_i, w) − t_i)²,   E_{C_k}(w) = (1/k) Σ_{i=1}^k E_i(w).   (7.6.1)
This error term, a function of w, is directly observable. For large enough sample size k, any of the usual laws of large numbers (the boundedness of η, t noted earlier suffices to ensure the applicability of these laws) assure us that for fixed w, ECk (w) → E(η(x, w) − t)2 ≡ eg (w)
(7.6.2)
in all of the usual probabilistic convergence modes of mean square, in probability, and with probability one. Hence, in the long run (large k) the empirical error based on the test set will converge to the desired generalization error. Should we be fortunate enough to have a very large test set, then we can rely on this estimate of generalization error. However, such a large test set is likely to be the exceptional case; we need guidance as to rates of convergence as well as results applicable when the test set is of only moderate size.
7.6.2 Central Limit Theorem Estimates of Fluctuation
Information on rates of convergence in this case can be derived easily from the fundamental equal-components case of the Central Limit Theorem.

Theorem 7.6.1 (Central Limit Theorem (CLT)) Given random variables {Y_i}, i.i.d. with common expectation EY_i = EY and common finite variance VAR(Y_i) = σ² < ∞, the limiting distribution function of the normalized sum
S_k = (1/(σ√k)) Σ_{i=1}^k (Y_i − EY)
is normal or Gaussian,
lim_{k→∞} P(S_k ≤ s) = Φ(s) ≡ ∫_{−∞}^{s} (1/√(2π)) e^{−x²/2} dx.
Informally, we can write that the probability law of S_k approximates to that of the standard normal, L(S_k) ≈ N(0, 1). Perhaps the best-known results on the rate of convergence or the error in approximation are those of Berry and Esseen (see Feller [73, pp. 515, 525]). If EY = 0, σ² = VAR(Y), ρ = E|Y|³, then
sup_s |P(S_k ≤ s) − Φ(s)| ≤ 4.4 ρ/(σ³√k).
Because √k (E_{C_k}(w) − e_g(w))/σ is itself a properly normalized sum, we have immediately that
P((√k/σ)(E_{C_k}(w) − e_g(w)) > ε) ≈ 1 − Φ(ε),
P((√k/σ)|E_{C_k}(w) − e_g(w)| > ε) ≈ 2(1 − Φ(ε)),
where Φ, as earlier, is the standard normal distribution function for N(0, 1). We can expect the CLT to be applicable in the neural network setting provided that we respect the approximation, as indicated by the above Berry–Esseen bounds, and do not choose ε too large as a function of k. Using the asymptotic approximation, derivable by integration by parts,
1 − Φ(z) ∼ e^{−z²/2}/(z√(2π)),
yields, for ε > 1,
P((√k/σ)(E_{C_k}(w) − e_g(w)) > ε) = τ ∼ e^{−ε²/2}/(ε√(2π)) < e^{−ε²/2},
under our preceding assumptions on the parameters. Of course, if we are interested in the discrepancy E_{C_k}(w) − e_g(w), then we also need to estimate the usually unknown variance σ² to be able to use the estimate
P(|E_{C_k}(w) − e_g(w)| > ε) = τ ≈ 2(1 − Φ(√k ε/σ)).
In the binary classification case of t ∈ {0, 1}, σ is upper bounded by 1/2 (also see the next subsection). It is evident from our calculation that our final bound is not valid for ε either too large or too small.
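As an illustration of using these approximations in practice, the per-sample test errors below are simulated, and σ is estimated from the test set itself, which is the usual expedient.

import math
import numpy as np

def test_set_estimate(per_sample_errors, eps):
    """Estimate e_g by the test-set average and apply the CLT approximation
    P(|E_Ck - e_g| > eps) ~ 2*(1 - Phi(sqrt(k)*eps/sigma))."""
    e = np.asarray(per_sample_errors, dtype=float)
    k, e_hat, sigma = len(e), e.mean(), e.std(ddof=1)
    z = math.sqrt(k) * eps / sigma
    tail = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))   # 2*(1 - Phi(z))
    return e_hat, tail

# Simulated per-sample squared errors (eta(x_i, w) - t_i)^2 on a test set of size k = 1000.
rng = np.random.default_rng(0)
errors = rng.uniform(0, 0.2, size=1000)
e_hat, tail = test_set_estimate(errors, eps=0.01)
print(f"estimated e_g = {e_hat:.4f}, approx P(deviation > 0.01) = {tail:.3f}")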
7.6.3 Upper Bounds to Fluctuation Measures
The CLT approximation yields an approximate and, therefore, somewhat indefinite assessment of the probability of a discrepancy between e_g(w) and E_{C_k}(w). We now provide guaranteed upper bounds on probabilities of discrepancies between these terms. The relationship between the test set empirical error E_{C_k}(w) for a particular net η(x, w) and its true performance e_g(w) is, as before, simply given by the expectation e_g(w) = E{E_{C_k}(w)}. The fluctuation in the estimate of true error by the empirical error therefore can be specified by the variance
VAR(E_{C_k}(w)) = (1/k) VAR(E(C_1, w)).
The variance based on a test set of size k is 1/k of the variance based on a test set C_1 having only one element; this is an immediate consequence of E_{C_k}(w) being an average of i.i.d. terms. If both η and t are known to lie in the finite interval [a, b], then we can upper bound the unknown VAR(E(C_1, w)) by (1/4)(b − a)⁴. Hence,
VAR(E_{C_k}(w)) ≤ (1/(4k)) (b − a)⁴.
248
7. Generalization and Learning
This upper bound can be refined if more is known about $P$ that can enable us to approximate $VAR(E(C_1, w))$. For example, in the classification problem, if the true error probability $e_g(w)$ were known (in which case there would be no need for this estimate!) then
$$VAR(E_{C_k}(w)) = \frac{1}{k}\, e_g(w)(1 - e_g(w)) \le \frac{1}{4k}.$$
The variance is but one of many measures of fluctuation. If instead we inquire into the probability of a large deviation between the true error and the empirical error, $P(|E_{C_k}(w) - e_g(w)| > \epsilon)$, then the common Chebychev bound yields
$$P(|E_{C_k}(w) - e_g(w)| > \epsilon) \le \frac{VAR(E(C_1, w))}{k\epsilon^2}, \qquad (7.6.3)$$
and we can use any of the preceding upper bounds to the variance. Alternatively, letting $\sigma^2 = VAR(E_{C_1})$, we can explore fluctuations in the normalized error through
$$P\left(\frac{1}{\sigma}|E_{C_k}(w) - e_g(w)| > \epsilon\right) \le \frac{1}{k\epsilon^2}.$$
Of course, we can also use generalized Chebychev bounds to obtain upper bounds having a different functional dependence on the test set sample size $k$. It will be of value in the responses we give to the remaining questions to develop an upper bound that decays exponentially in $k$. Such bounds are known as Chernoff bounds, and in the case of variables confined to the range $[a, b]$ they can be sharpened slightly into Hoeffding bounds.

Theorem 7.6.2 (Hoeffding Inequality) If $\{X_i\}$ are independent and $P(a_i \le X_i - EX_i \le b_i) = 1$, then
$$P\left(\left|\sum_{i=1}^{k}(X_i - EX_i)\right| \ge \epsilon\right) \le 2\, e^{-2\epsilon^2 / \sum_{i=1}^{k}[b_i - a_i]^2}.$$
In particular, if all the mean-centered random variables have the common range $[a, b]$ then
$$P\left(\frac{1}{k}\left|\sum_{i=1}^{k}(X_i - EX_i)\right| > \epsilon\right) \le 2\, e^{-\frac{2k\epsilon^2}{(b-a)^2}}. \qquad (7.6.4)$$
Proof. See Pollard [188, p. 192]. □

The Hoeffding inequality yields a bound that decays exponentially in test set size $k$. For example, in a binary pattern classification problem we
find that we can upper bound by $\tau = .05$ the probability of a deviation of size $\epsilon = .1$ with a test set size of $k = 738$,
$$P(|E_{C_{738}}(w) - e_g(w)| > .1) \le .05.$$
The Hoeffding bound applies to the evaluation of $e_g$ when we identify $X_i = (\eta(x_i, w) - t_i)^2$ and $\eta, t \in \{0, 1\}$ to yield $a = -1$ and $b = 1$. If, contrary to Assumption 7.1.1, $\eta, t$ are not constrained to take values in a finite interval, perhaps because $\eta$ is an estimate of an unbounded $t$, then we can produce a similar exponential bound by use of

Theorem 7.6.3 (Bernstein's Inequality [238]) If $\{X_i\}$ are independent, have finite moments of all orders, variances $\{\sigma_i^2\}$, and are such that there exists a $C > 0$ satisfying
$$(\forall s > 2)(\forall i)\quad E|X_i - EX_i|^s \le \frac{1}{2}\sigma_i^2\, s!\, C^{s-2},$$
then, letting
$$\sigma^2 = \sum_{i=1}^{k}\sigma_i^2,$$
we have the upper bound
$$P\left(\left|\sum_{i=1}^{k}(X_i - EX_i)\right| \ge \epsilon\right) \le 2\, e^{-\frac{\epsilon^2}{2(\sigma^2 + C\epsilon)}}.$$
Unfortunately, use of this inequality requires us to know not only the variances but also the constant $C$.

We can derive exponential "upper bounds" from the special case of the generalized Chebychev bound given by
$$(\forall \lambda \ge 0)\quad P(Y \ge \epsilon) \le e^{-\lambda\epsilon}\, E e^{\lambda Y}.$$
Because this bound holds for all non-negative $\lambda$, it is tightest for $\lambda = \lambda^*$ satisfying the transcendental equation
$$E(Y e^{\lambda^* Y}) = \epsilon\, E e^{\lambda^* Y}. \qquad (7.6.5)$$
Of course, we usually do not have the probabilistic knowledge required to solve Eq. 7.6.5 for the optimum $\lambda^*$. Because any non-negative value of $\lambda$ serves to ensure a valid upper bound, we can simply adopt the choice $\lambda = \frac{k\epsilon}{\sigma^2}$, where $\sigma^2 = VAR(E_{C_1}) = VAR((\eta - t)^2)$. This yields, for our problem,
$$P(E_{C_k}(w) - e_g(w) > \epsilon) \le e^{-\frac{k\epsilon^2}{\sigma^2}}\, E e^{\frac{k\epsilon}{\sigma^2}(E_{C_k}(w) - e_g(w))}.$$
Letting $X_i = (\eta(x_i, w) - t_i)^2 - e_g(w)$ and recalling that $E_{C_k}(w)$ is an average of i.i.d. summands enables us to rewrite this as
$$P(E_{C_k}(w) - e_g(w) > \epsilon) \le e^{-\frac{k\epsilon^2}{\sigma^2}}\left(E e^{\frac{\epsilon}{\sigma^2}X_1}\right)^k.$$
The utility of this result depends upon what we can say about $E e^{\frac{\epsilon}{\sigma^2}X_1}$. To generate an approximation that will only depend upon the variance $\sigma^2$, we note that in our case $EX_1 = 0$, and assume a choice of parameters for which
$$E\left(\frac{\epsilon X_1}{\sigma^2}\right)^2 = \frac{\epsilon^2}{\sigma^2} \ll 1 < \frac{k\epsilon^2}{\sigma^2}.$$
Hence, we can use the approximation for a small random variable $Y$ with $EY = 0$ that
$$(E e^{Y})^k \approx \left[1 + \frac{1}{2}EY^2\right]^k \approx e^{\frac{k}{2}EY^2},$$
to obtain the estimated upper bound
$$P(E_{C_k}(w) - e_g(w) > \epsilon) \le e^{-\frac{k\epsilon^2}{\sigma^2}}\, e^{\frac{k\epsilon^2}{2\sigma^2}} = e^{-\frac{k\epsilon^2}{2\sigma^2}}.$$
The required $\sigma^2$ is just the term $VAR(E_{C_1}(w))$ discussed earlier. A double-sided version of this inequality can be pursued along identical lines to result in
$$P(|E_{C_k}(w) - e_g(w)| > \epsilon) \le 2\, e^{-\frac{k\epsilon^2}{2\sigma^2}}, \qquad (7.6.6)$$
or the alternative
$$P\left(\frac{1}{\sigma}|E_{C_k}(w) - e_g(w)| > \epsilon\right) \le 2\, e^{-\frac{k\epsilon^2}{2}}.$$
Thus we can generally upper bound the probability of a significant deviation of size $\epsilon$ of the empirical error from the true error, for a given network $\eta$, when we have a test set $C_k$ that is independent of $\eta$, by either a bound such as Eq. 7.6.3 that approaches zero as $\frac{1}{k}$ or a bound such as Eq. 7.6.4 that approaches zero exponentially in $k$. Both bounds, under reasonable hypotheses, depend on the variance or expected squared deviation for a test set of size 1, and this can often be approximated as described earlier. These results are used to determine the accuracy of error estimates made from a test set $C_k$ of size $k$, and will be used in our discussion of uniform bounds in Section 7.8.
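To make the contrast concrete, the following MATLAB sketch (illustrative only; the target deviation $\epsilon$ and confidence $\tau$ are assumed values) computes the test set sizes required by the Chebychev bound of Eq. 7.6.3, using the worst-case variance $1/4$ for binary classification, and by the Hoeffding bound of Eq. 7.6.4 with $(b - a)^2 = 4$ as in the $k = 738$ example above. Because the Chebychev requirement grows as $1/\tau$ while the Hoeffding requirement grows only as $\log(1/\tau)$, the exponential bound dominates as the desired confidence becomes more stringent.

% Test set sizes guaranteeing P(|E_Ck - e_g| > eps) <= tau in binary classification.
epsn = 0.1;  tau = 0.05;                        % assumed deviation and confidence
k_cheby = ceil(0.25/(tau*epsn^2));              % Chebychev, Eq. 7.6.3, with VAR(E_C1) <= 1/4
k_hoeff = ceil(4*log(2/tau)/(2*epsn^2));        % Hoeffding, Eq. 7.6.4, with (b - a)^2 = 4
fprintf('Chebychev requires k >= %d, Hoeffding requires k >= %d\n', k_cheby, k_hoeff);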
7.7 Creating Independence: Cross-Validation and Bootstrap

If due to limited training data we cannot create an independent confirmation/test set of significant size, then we must seek other approaches to
assess generalization error. Two approaches that create independence while still using nearly the full data for training are cross-validation and bootstrap. These statistical methods have been studied in the pattern classification and neural network literature (e.g., Devroye [60], Efron [67], Efron and Tibshirani [68], Hastie and Tibshirani [98], McLachlan [161], Stone [229, 230]). The goal of these procedures is to reduce the bias incurred in using the same data to estimate both a parameter and its performance—to deal with the situation that the expected empirical error is smaller than the generalization error. However, these two approaches have the following serious flaws when applied to neural network problems that are typically computation-limited and have both multiple minima and high-dimensional inputs. Cross-validation and bootstrap require multiple retraining efforts based on a large subset of the training data. Because training even once typically pushes the limits of available computational resources, the requirement that it be repeated many times (e.g., n times in cross-validation and as many times as needed to get a small sampling variance in bootstrap) strains credulity in a neural network setting. In some instances of other statistical applications it has been possible to recalculate estimates without having to redo the full estimation processes. However, this is unlikely to be the case in neural network design. We can find ourselves having to repeat training O(100) times, and this is so computationally burdensome that it is inappropriate in a neural network setting. Indeed, the common references in the neural network literature to the use of cross-validation, to signal the presence of “overtraining” through the onset of increasing estimated generalization error, generally turn out to be to uses of a single independent test set Ck and not true recourses to cross-validation. For example, Kearns [123] uses the results on independent test sets to partition the training data into independent training and testing subsets while calling the process crossvalidation. As was noted in Chapter 5, neural network parameter estimation must confront the presence of many minima. Typically, restarting the training algorithm yields a different solution point. Changing the training set, even if we do not change the algorithm initialization, is likely to yield convergence to a different local minimum. Hence, cross-validation and bootstrap cannot be used to estimate the generalization error at the local minimum found by our initial use of the training algorithm on the full training data. Rather, what they return is an average of generalization errors evaluated over a randomly selected group of local minima. This is especially true of uses of bootstrap where we multiply retrain with substantially different training sets. If n is large, then it is less problematic in the hold-one-out version of cross-validation described below. Finally, as we shall see, bootstrap methods make explicit use of the empirical distribution of the training data. In the neural network case we are likely to have a high-dimensional feature vector input (d >> 1) and
make recourse to this method only when we have limited training data. This combination of circumstances makes the empirical distribution a poor estimator of the true distribution.
7.7.1 Cross-Validation
Cross-validation is a technique, for estimating the generalization error of an estimator from a single data set, introduced by Stone [229, 230] and discussed at length in the statistics literature (e.g., Hastie and Tibshirani [98, Sec. 3.4], Wahba [245, Ch. 4]). In order to describe cross-validation estimates of performance, consider a training algorithm yielding an estimated network $A(T_n) = \hat{w}_n$. Let $T_\alpha \subset T_n$ with $A(T_\alpha) = \hat{w}_\alpha$. One performs cross-validation by repeatedly deleting subsets and averaging over the results. However, the number of such subsets of size $s$ grows as $\binom{n}{s} \approx (\frac{en}{s})^s$, and it quickly becomes impractical to consider more than a negligible fraction of such deletions. In practice, one considers only $k$-fold cross-validation in which the training data is partitioned into $k$ nearly equal subsets of size $s = n/k$. The most common choice of $k = n$ is referred to as hold-one-out cross-validation. In hold-one-out cross-validation we train on $T_\alpha = T_n - (x_\alpha, t_\alpha)$ and evaluate the performance $E_\alpha$ of $\eta(\cdot, \hat{w}_\alpha)$ on the deleted observation $(x_\alpha, t_\alpha)$,
$$E_\alpha = (\eta(x_\alpha, \hat{w}_\alpha) - t_\alpha)^2. \qquad (7.7.1)$$
We replace our empirical error estimator of performance $E_{T_n}(\hat{w}_n)$ by
$$E_{CV} = \frac{1}{n}\sum_{\alpha=1}^{n} E_\alpha. \qquad (7.7.2)$$
We expect a more accurate estimate of generalization error because in each term $E_\alpha$ we are estimating $t_\alpha$ based on $x_\alpha$ using a function $\eta(\cdot, \hat{w}_\alpha)$ based on data $T_\alpha$ that is truly independent of $(x_\alpha, t_\alpha)$. We then average over all possible choices of $\alpha$ to reduce the estimation variance. Of course, as noted at the outset, in modifying the training data set we have the likelihood that we will be converging to different local minima and that the resulting average taken to reduce variance will in fact be an average of terms having different expectations (generalization errors). It is possible that in the hold-one-out case the slight change in training data might not affect the selection of local minimum, provided care is taken to restart the training algorithm with constant initial conditions. In our analysis to follow of the bias of cross-validation we will make the assumption that in each iteration we converge to the same local minimum $\tilde{w}$ of the generalization error $e_g$. The major advantage of the cross-validation estimator is its reduced bias. As we have observed, the training set error tends to seriously underestimate the true generalization error—its expectation is less than that of the quantity being estimated and thus it is biased. It is immediate from Eq. 7.7.1
and the independence between $\hat{w}_\alpha$ and $(x_\alpha, t_\alpha)$ that each term in the average is exactly unbiased for the generalization error of a rule using $n - 1$ training samples, and hence the average is also unbiased for such a rule. In the generalization of cross-validation, where we partition $T_n$ into subsets of size $s$ and test on each subset while training on the remainder, we have an estimator that is unbiased for the generalization error of rules using $n - s$ training samples. Hence, the only source of bias is that we are assessing the performance of nets that are trained on slightly less than all $n$ elements of $T_n$.

To understand the reduction in bias gained by cross-validation, assume that we continue to converge to the vicinity of the same local minimum $\tilde{w}$, albeit with somewhat different random estimators $\hat{w}_\alpha$ that depend on which single data point $(x_\alpha, t_\alpha)$ we deleted from $T_n$. Hence, we calculate
$$E E_{CV} = \frac{1}{n}\sum_{\alpha=1}^{n} E E_\alpha = \frac{1}{n}\sum_{\alpha} E(\eta(x_\alpha, \hat{w}_\alpha) - t_\alpha)^2.$$
To account explicitly for the reduction in training set size, introduce the temporary notation for generalization error
$$e_g(n - 1) = E[e_g(\hat{w}_\alpha) \mid \tilde{w}],$$
reflecting that we are evaluating the expected generalization error for an estimator based upon a sample size $n - 1$ when we are in the vicinity of the local minimum $\tilde{w}$. Assuming that hold-one-out cross-validation is a small enough change to preserve the selection of local minimum, we have that $e_g(n - 1) = E E_{CV}$. Efron [67, p. 6] asserts that "for many common statistics, including most maximum likelihood estimates," we can make the expansion
$$e_g(n) = e_g(\infty) + \frac{a_1}{n} + \frac{a_2}{n^2} + \ldots. \qquad (7.7.3)$$
A defense of this expansion for our case is available from the asymptotic learning curve results of Section 7.11 that offer
$$e_g(\hat{w}_n) = e_g(\tilde{w}) + \frac{Z}{n} + o(1/n).$$
The term $e_g(\tilde{w}) = e_g(\infty)$, and $Z$ is a positive mean random variable whose mean and variance do not depend on $n$. This result, with $a_1 = EZ$, validates the first two terms in Eq. 7.7.3. Accepting this expansion, we see that the bias of the cross-validation estimate of performance is just the difference
$$e_g(n - 1) - e_g(n) = \frac{a_1}{n(n - 1)} \approx \frac{a_1}{n^2}. \qquad (7.7.4)$$
Hence, the expected cross-validation estimate of generalization error is too large by the small amount of $O(1/n^2)$. This compares favorably with the results on empirical error learning curves of Section 7.11 that establish a bias of $O(1/n)$.

To complete a defense of cross-validation we also need to consider its fluctuations about its mean as measured, for example, by its variance. Unless the variance is small, the estimator will not be reliable. It is readily calculated that
$$VAR(E_{CV}) = \frac{1}{n} VAR(E_1) + \left(1 - \frac{1}{n}\right) COV(E_1, E_2),$$
where we have used the fact that $VAR(E_\alpha)$ is independent of the choice of $\alpha$ and that $COV(E_\alpha, E_\beta)$ for $\alpha \ne \beta$ does not depend on the specific choice of distinct indices. We would like to conclude that $COV(E_1, E_2) = O(\frac{1}{n})$, and hence that $VAR(E_{CV}) = O(\frac{1}{n})$ converges to 0 as desired to justify the use of cross-validation. This conclusion can be rendered plausible in terms of the asymptotic results to be developed in Sections 7.9–7.12, although we really need to obtain the conclusion even for small and moderate sample sizes.

A general defense of cross-validation is not yet available. By this, we mean studies of its bias, convergence to the true generalization error (the statistical property of consistency), and rate of such convergence for general applications. Unfortunately, most of the defenses offered for cross-validation have been restricted in their applicability (e.g., Li's [145] defense for linear estimators, Wahba's report [245, pp. 45–65] on earlier work on generalized cross-validation for spline fitting). Plutowski, Sakata, and White [186] provide a defense of cross-validation specifically designed for neural networks in which they establish unbiasedness (their Proposition 1) in the sense that $E_{CV}$ has an expected value, taken over all training sets, that equals the expected value over training sets of the generalization error of a network trained on one fewer sample (the effect of the hold-one-out condition). Furthermore, making several assumptions, they establish consistency (their Theorem 4) in the form of almost sure (with probability one) convergence of the cross-validation estimator to the generalization error. However, no results are provided on the rate of convergence, and this is the critical quantity for judging the applicability of cross-validation to estimation of neural network generalization error. Asymptotic results are important insofar as they indicate the "sanity" of an approach, but we are definitely in an environment of limited training data and need to know how to proceed when possibly well short of limiting behavior. If we had large amounts of training data, it would be simpler and better controlled to use the independent test set idea treated in Section 7.6.
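The hold-one-out computation itself is simple to express; the following MATLAB sketch (for illustration only; the function handles train, standing in for the training algorithm $A$, and net, evaluating $\eta(x, w)$, are assumed names, not part of the text) computes $E_{CV}$ of Eq. 7.7.2 by the $n$ retrainings discussed above. Restarting train from constant initial conditions, as suggested earlier, is what gives the average a chance of referring to a single local minimum.

function Ecv = holdoneout_cv(x, t, train, net)
% Hold-one-out cross-validation estimate of generalization error (Eqs. 7.7.1-7.7.2).
% x is d-by-n, t is 1-by-n; train and net are assumed function handles.
  n = size(x, 2);
  E = zeros(1, n);
  for alpha = 1:n
    keep     = [1:alpha-1, alpha+1:n];                     % T_alpha = T_n with (x_alpha, t_alpha) deleted
    w_alpha  = train(x(:, keep), t(keep));                 % retrain on the reduced training set
    E(alpha) = (net(x(:, alpha), w_alpha) - t(alpha))^2;   % squared error on the deleted observation
  end
  Ecv = mean(E);                                           % Eq. 7.7.2
end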
7.7.2 Bootstrap

A bootstrap estimator (e.g., Efron [67], Efron and Tibshirani [68], Hall [90]) $\hat{\theta}$ for a parameter $\theta$ that is a functional of the true distribution $P$ and training set $T$, $\theta = \theta(P, T)$, is given in terms of the empirical distribution $\hat{P}_T$ estimator of $P$ based on $T$. The empirical distribution is described by a probability mass function $\hat{P}_T(x, t)$ for observing the outcome pair $(x, t)$ that is given by the relative frequency of the occurrence of $(x, t)$ in $T$,
$$\hat{P}_T(x, t) = \frac{1}{n}\,[\#(x_i, t_i) = (x, t)], \qquad \hat{\theta} = \theta(\hat{P}_T, T).$$
If we have an explicit form for $\theta$, then the plug-in principle (Efron and Tibshirani [68, Sec. 4.3]) has us calculate $\hat{\theta}$ by substituting $\hat{P}_T$ for $P$ in $\theta$, as indicated above. If, however, we do not have an explicit form for $\theta(P, T)$ (perhaps because we do not understand the action of $A$ in selecting $\hat{w}_n$) then we follow Efron [67, Sec. 5.4]. For concreteness, say that $\theta(P, T) = P(\eta \ne t \mid T)$ is the generalization error (error probability) of $\eta$. First draw an i.i.d. bootstrap sample
$$T^{(b)} = \{(X_1, t_1)^{(b)}, \ldots, (X_n, t_n)^{(b)}\},$$
of size $n = \|T\|$, distributed as the empirical distribution $\hat{P}_T$ (but not distributed as the true $P$ that is unknown to us) and evaluate $\eta^{(b)}$ trained on the bootstrap data $T^{(b)}$. Then evaluate the error rate using $\hat{P}_T(\eta^{(b)} \ne t)$, and not the error relative frequency on the randomly selected sample $T^{(b)}$. Repeat this process a large number $m$ of times and bootstrap estimate $\theta(P, T)$ by
$$\tilde{\theta} = \frac{1}{m}\sum_{b=1}^{m} \hat{P}_T(\eta^{(b)} \ne t).$$
It is immediate that
$$E(\tilde{\theta} - \theta(\hat{P}_T, T))^2 \le \frac{1}{4m}$$
and can be made as small as desired and permitted by our computational resources. Hence, we can estimate $\theta(\hat{P}_T, T)$ as closely as desired, and the only significant discrepancy is between the bootstrap estimate $\theta(\hat{P}_T, T)$ and the desired $\theta(P, T)$. Of course, we know from Glivenko-Cantelli and Kolmogorov-Smirnov theorems (e.g., Devroye et al. [61, Secs. 12.3, 12.8], Loeve [151], Vapnik [241, Sec. 3.9]) that the empirical distribution function converges strongly to the true distribution. Hence, we can expect the difference between $\theta(\hat{P}_T, T)$ and $\theta(P, T)$ to converge strongly to zero. A recent treatise on the bootstrap is Hall [90], and Paass [182] illustrates the application of the bootstrap to neural networks.

Unfortunately, it is evident that the resampling required by the bootstrap produces such different training sets $T^{(b)}$ that convergence to the vicinity of the same local minimum is highly unlikely. Nor are we likely to possess the computational resources needed to repeat training $m$ times. Finally, too close a reliance
on the empirical measure PˆT is unwise in our case of a high-dimensional random variable; the empirical measure is unlikely to be representative of the true measure P for sample sizes that are short of astronomical.
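A minimal MATLAB sketch of the resampling loop just described follows (illustrative only; train and net are the same assumed handles as in the cross-validation sketch, and m is the number of bootstrap replicates). Note that each replicate is scored against the empirical distribution of the full training set, not against its own bootstrap sample.

function theta = bootstrap_error(x, t, train, net, m)
% Bootstrap estimate of the error probability theta(P_T, T) by m retrainings.
  n   = size(x, 2);
  err = zeros(1, m);
  for b = 1:m
    idx    = randi(n, 1, n);                    % i.i.d. sample of size n from the empirical distribution
    w_b    = train(x(:, idx), t(idx));          % eta^(b) trained on the bootstrap sample T^(b)
    preds  = arrayfun(@(i) net(x(:, i), w_b), 1:n);
    err(b) = mean(preds ~= t);                  % error rate under P_T, not on T^(b)
  end
  theta = mean(err);
end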
7.8 Uniform Bounds—VC Approach

7.8.1 Introduction
When the parameter $\hat{w}_n$ is selected, as it is, dependent on both the training set $T_n$ and the validation set $V_m$, then the random variables $E_i(\hat{w}_n)$ are no longer independent; each of them contains information about the others encoded by the training algorithm $A$ in the parameter value. The preceding calculations do not apply to averages of dependent random variables. We still need to assess the linkage between the resulting empirical error $E_{T_n}(\hat{w}_n)$ and its generalization error $e_g(\hat{w}_n)$. In treating this issue we shall also come to better understand why we can be successful in minimizing empirical error $E_{T_n}(w)$ when we really wish to minimize generalization error $e_g(w)$.

The three approaches outlined in this section avoid direct confrontation of the dependencies and of the operation of $A$ by looking at results uniform over parameter values. The first approach, due to Vapnik and Chervonenkis, is presented in Section 7.8.2 and yields probability bounds, in terms of VC capacity or dimension, that are uniform over all parameter values $w \in \mathcal{W}$ and all probability measures $P$ generating $x, t$. Hence, this approach treats the "worst-case" parameter value as well as the "worst-case" generating probability measure. Unfortunately, the quantitative results are not good guides to practice. In Section 7.8.3 we sketch a second approach that reduces the effective VC dimension by clustering together those parameter values yielding tightly correlated empirical error responses. This approach, however, encounters difficulties in evaluating its terms of interest. Section 7.8.4 outlines a more successful attempt to obtain realistic results by concentrating only on parameter values that yield significant differences in performance; VC capacity is revised to fat-shattering capacity or dimension.

Rather than confront the details of $\hat{w}_n$, dependent as they are on the choice of training algorithm $A$, in all three approaches we consider the uniform (over parameter values) upper bound provided by
$$|E_{T_n}(\hat{w}_n) - e_g(\hat{w}_n)| \le \sup_{w \in \mathcal{W}} |E_{T_n}(w) - e_g(w)| \;\Rightarrow\;$$
$$P(|E_{T_n}(\hat{w}_n) - e_g(\hat{w}_n)| > \epsilon) \le P\left(\sup_{w \in \mathcal{W}} |E_{T_n}(w) - e_g(w)| > \epsilon\right).$$
In effect, we bound the discrepancy at the randomly chosen w ˆ n by the worst-case discrepancy over any parameter value. The difficulty with this
upper bound is that the supremum (maximum) is taken over an uncountable set $\mathcal{W}$, leaving us to deal with the probability of an uncountable union of events,
$$A_w = \{T_n : |E_{T_n}(w) - e_g(w)| > \epsilon\},$$
$$P\left(\sup_{w \in \mathcal{W}} |E_{T_n}(w) - e_g(w)| > \epsilon\right) = P\left(\bigcup_{w \in \mathcal{W}} A_w\right).$$
If $\mathcal{W}$ is a finite set, contrary to our usual expectations, then we can use the familiar union bound
$$P\left(\bigcup_{w \in \mathcal{W}} A_w\right) \le \sum_{w \in \mathcal{W}} P(A_w)$$
(sharper bounds are provided by the family of Bonferroni inequalities discussed in Feller [72, p. 100] and Wynn and Naiman [256], but they have not led to improved results in this area) to estimate
$$P\left(\sup_{w \in \mathcal{W}} |E_{T_n}(w) - e_g(w)| > \epsilon\right) \le \|\mathcal{W}\|\, \max_{w \in \mathcal{W}} P(|E_{T_n}(w) - e_g(w)| > \epsilon).$$
From Lemma 7.1.1 we have that $b \ge (\eta - t)^2$, thereby allowing us to use the Hoeffding bound of Theorem 7.6.2 to conclude
$$P\left(\sup_{w \in \mathcal{W}} |E_{T_n}(w) - e_g(w)| > \epsilon b\right) \le 2\|\mathcal{W}\|\, e^{-2n\epsilon^2}.$$
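As a small numerical illustration of this finite-$\mathcal{W}$ bound (a MATLAB sketch with assumed values for the grid cardinality, deviation, and confidence), the training set size needed to drive $2\|\mathcal{W}\|e^{-2n\epsilon^2}$ below $\tau$ grows only logarithmically in $\|\mathcal{W}\|$.

% Sample size making the finite-W union/Hoeffding bound fall below tau.
card_W = 1e6;  epsn = 0.1;  tau = 0.05;          % assumed |W|, deviation, confidence
n_min  = ceil(log(2*card_W/tau)/(2*epsn^2));     % from 2*|W|*exp(-2*n*eps^2) <= tau
fprintf('n >= %d suffices for a parameter grid of size %g\n', n_min, card_W);

For an uncountable $\mathcal{W}$, however, no such counting argument is directly available, which is what motivates the machinery that follows.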
A major contribution of Vapnik and Chervonenkis (Vapnik [240]) was to show that there is a condition under which the uncountable union can be reduced to a finite union and the familiar union bound used. We outline this approach and carry it out to obtain an upper bound that will depend neither on the parameter values nor the unknown probability model P . The improvements to this approach presented in Sections 7.8.3 and 7.8.4 reduce the uncountable union to a smaller finite union by looking only at sets (parameter values) that are significantly different.
7.8.2 The Method of Vapnik-Chervonenkis
Changing notation to indicate that we are not restricting ourselves to neural networks, we let $w \in \mathcal{W}$ denote the indexing of a family of real-valued functions $f(X, w)$ of a random variable $X$ taking values in some set $\mathcal{X}$, and let $\{X_i\}$ be i.i.d. $P$ as $X$. We are interested in the magnitude of the discrepancy $\Delta_n(w)$ between the empirical average and its expectation,
$$\Delta_n(w) = \frac{1}{n}\sum_{i=1}^{n} f(X_i, w) - Ef(X, w).$$
The magnitude of the discrepancy is measured by $P(\sup_{w \in \mathcal{W}} |\Delta_n(w)| > \epsilon)$. We start by transforming the real-valued function case to an upper bound
involving functions that are only $\{0, 1\}$-valued. In view of our intended application to error terms that are non-negative, we will assume that $f \ge 0$. At the outset of this chapter we indicated that we would be willing to assume that $(\eta - t)^2 \le b$. Hence, we will also assume that $f \le b$. Use the elementary facts that for a non-negative random variable $Y$ bounded by $b$,
$$EY = \int_0^b P(Y > \alpha)\, d\alpha,$$
and that using indicator functions we have
$$Y = \int_0^{Y} d\alpha = \int_0^b I_{Y > \alpha}\, d\alpha,$$
to write
$$\frac{1}{n}\sum_{i=1}^{n} f(X_i, w) - Ef(X, w) = \int_0^b \left\{\frac{1}{n}\sum_{i=1}^{n} I_{f(X_i, w) > \alpha} - P(f(X, w) > \alpha)\right\} d\alpha.$$
Observe that $P(f(X, w) > \alpha) = E I_{f(X_i, w) > \alpha}$. It is immediate from the elementary properties of the definite integral that
$$\left|\frac{1}{n}\sum_{i=1}^{n} f(X_i, w) - Ef(X, w)\right| \le b \sup_{0 \le \alpha \le b} \left|\frac{1}{n}\sum_{i=1}^{n} I_{f(X_i, w) > \alpha} - P(f(X, w) > \alpha)\right|$$
and
$$\sup_{w \in \mathcal{W}} |\Delta_n(w)| \le b \sup_{w \in \mathcal{W},\, 0 \le \alpha \le b} \left|\frac{1}{n}\sum_{i=1}^{n} I_{f(X_i, w) > \alpha} - P(f(X, w) > \alpha)\right|.$$
We have upper bounded the worst-case discrepancy between the empirical average and expectation for an indexed family of bounded, real-valued functions $f(X_i, w)$ of random variables $X_i$ by a similar worst-case expression involving an indexed (by both $\mathcal{W}$ and $0 \le \alpha \le b$) family of $\{0, 1\}$-valued indicator functions $I_{f(X, w) > \alpha}$. Thus,
$$P\left(\sup_{w \in \mathcal{W}} |\Delta_n(w)| \ge \epsilon b\right) \le P\left(\sup_{w \in \mathcal{W},\, 0 \le \alpha \le b} \left|\frac{1}{n}\sum_{i=1}^{n} I_{f(X_i, w) > \alpha} - P(f(X, w) > \alpha)\right| \ge \epsilon\right).$$
Given a finite set {X1 , . . . , Xn } ⊂ X of n random variables, there are only 2n possible different dichotomizations or assignments of binary values by indicator functions. Hence, there is some hope that although we have to deal with the supremum over an uncountable family of indicator functions, this complexity might be inessential. This was noticed by Vapnik
and Chervonenkis and formalized through their introduction of the Vapnik-Chervonenkis (VC) dimension (or capacity) $v_{\mathcal{F}}$ for a family of binary-valued functions $\mathcal{F}$ given in Definition 3.5.3. In the case of interest to us, $\mathcal{F}$ is the family $\{I_{f(X,t,w) > \alpha} : w \in \mathcal{W}, 0 \le \alpha \le b\}$ and $f(X, t, w) = (\eta(X, w) - t)^2$. Recall from Definition 3.5.1 that the growth function $m_{\mathcal{F}}(n)$ associated with a family $\mathcal{F}$ of binary-valued functions of argument $X \in \mathcal{X}$ is the maximum number of dichotomizations of a set of $n$ points of $\mathcal{X}$ by the members of $\mathcal{F}$. This growth function was evaluated in Section 2.3 for perceptrons as $D(n, d)$ for $\mathcal{X} = IR^d$. Clearly $m_{\mathcal{F}}(n) \le 2^n$. Evaluation of $m_{\mathcal{F}}$ for binary-valued networks was discussed in Section 3.5. From there we learned that if $\mathcal{F}$ has VC dimension $v_{\mathcal{F}}$ then
$$m_{\mathcal{F}}(n) \le 1.5\, \frac{n^{v_{\mathcal{F}}}}{v_{\mathcal{F}}!} \le \left(\frac{en}{v_{\mathcal{F}}}\right)^{v_{\mathcal{F}}}.$$
What is needed to use the VC method is knowledge of the VC dimension of the given $\mathcal{F}$. Sontag [224, Thm. 3] establishes that a single hidden layer network composed of $s$ sigmoidal nodes having at least one point of differentiability, at which the derivative is nonzero, with $d = 2$ (two components to the input $x$) has a capacity of at least $4s - 1$. Sontag [226] shows that, for certain parameterized families of functions (that include neural networks with tanh or logistic nodes) defined by $p$ parameters, there are open sets of $2p + 2$ samples that cannot be shattered; the open sets are introduced to capture a notion of points being in general position. Koiran and Sontag [129] provide examples where the VC dimension of multilayer networks can grow at least quadratically in $s$ provided that the number of layers increases proportionally to $s$. Evaluating the VC dimension for a neural network is difficult. Vidyasagar [244, Sec. 10.3] discusses this issue at length. Although there are a variety of results on estimating the VC dimension, we omit them. In our view, VC theory is only helpful qualitatively and asymptotically and not in the quantitative detail valuable for guiding practice when confronted with limited amounts of data.

A variety of probability bounds have been developed in terms of $v_{\mathcal{F}}$ (e.g., see Devroye et al. [61, Ch. 12]) that, remarkably, do not require knowledge of the true probability model $P$. As a consequence, these bounds, by being applicable to all models, must apply to the worst-case model and are likely to be too conservative with respect to the model governing an actual application. It is known that a bound can be written as follows.

Theorem 7.8.1 (Talagrand [234])
$$P\left(\sup_{w \in \mathcal{W},\, 0 \le \alpha \le b} \left|\frac{1}{n}\sum_{i=1}^{n} I_{f(X_i, w) > \alpha} - P(f(X, w) > \alpha)\right| > \epsilon\right) \le \frac{c}{\sqrt{n}}\left(\frac{cn\epsilon^2}{v}\right)^{v} e^{-2n\epsilon^2}.$$
This bound has the best possible exponent of $-2n\epsilon^2$, the same exponent as in the Hoeffding bound, but suffers from an unknown constant $c$ that is
likely to be very large. We can rewrite this bound slightly as follows:
$$\gamma = \frac{n\epsilon^2}{v}, \qquad P \le \frac{c}{\sqrt{v\gamma}}\left[c\gamma e^{-2\gamma}\right]^{v}.$$
The dominant factor is the one with exponent the VC dimension $v$. Of course, this upper bound is only of interest when it is less than unity, and largely this requires $c\gamma e^{-2\gamma} < 1$ or a sample size
$$n > \frac{v}{2\epsilon^2}\log c.$$
The bound of Theorem 7.8.1, expressed in terms of empirical averages of the function $f(x, w)$, is
$$P\left(\sup_{w \in \mathcal{W}} \left|\frac{1}{n}\sum_{i=1}^{n} f(X_i, w) - Ef(X, w)\right| > \epsilon b\right) \le \frac{c}{\sqrt{n}}\left(\frac{cn\epsilon^2}{v}\right)^{v} e^{-2n\epsilon^2} = \tau_n. \qquad (7.8.1)$$
This result can be re-expressed to provide an upper bound to the generally unknown $Ef(X, w)$ in terms of the known $\frac{1}{n}\sum_{i=1}^{n} f(X_i, w)$. It is immediate from Eq. 7.8.1 that with probability at least $1 - \tau_n$ and for all $w \in \mathcal{W}$
$$\left|\frac{1}{n}\sum_{i=1}^{n} f(X_i, w) - Ef(X, w)\right| \le \epsilon b,$$
or when $f \ge 0$ as in the case of generalization error
$$Ef(X, w) \le \frac{1}{n}\sum_{i=1}^{n} f(X_i, w) + \epsilon b, \quad \text{with probability} \ge 1 - \tau_n.$$
This result is frequently rewritten to express $\epsilon$ in terms of $\tau_n$ through an approximate solution to the transcendental equation relating the two terms.

Note that the upper bound $\tau_n$ depends on $n, \epsilon$ only through $n\epsilon^2$. It is easy to see that we can select $\epsilon_n \downarrow 0$ (e.g., $1/2 > \theta > 0$, $\epsilon_n = n^{\theta - 1/2}$) such that not only does $\tau_n \downarrow 0$ but more strongly $\sum_n \tau_n < \infty$. By the Borel-Cantelli Lemma (see Loeve [151]), we conclude that with probability one (almost surely or a.s.) the events
$$A_n = \left\{\sup_{w \in \mathcal{W}} \left|\frac{1}{n}\sum_{i=1}^{n} f(X_i, w) - Ef(X, w)\right| > \epsilon_n b\right\}$$
occur only finitely often (f.o.). Hence, with such a choice of $\{\epsilon_n\}$ we have convergence with probability one of $\sup_{w \in \mathcal{W}} \left|\frac{1}{n}\sum_{i=1}^{n} f(X_i, w) - Ef(X, w)\right|$ to 0. Restated, with probability one there exists a finite $N$ such that for all sample sizes $n > N$, no matter how $\hat{w}_n$ is chosen, the empirical error $E_{T_n}$ approximates $e_g(\hat{w}_n)$ to within $b\epsilon_n$.
In the important special case of pattern classification, the target $t$ takes on only finitely many values, and generalization error is error probability $e_g(w) = P(\eta(x, w) \ne t)$. The empirical error $E_{T_n}$ is the relative frequency of errors $\nu_{T_n}$,
$$E_{T_n}(w) = \nu_{T_n}(w) = \frac{1}{n}\sum_{i=1}^{n} I_{\eta(x_i, w) \ne t_i}.$$
In the $\{0, 1\}$-valued case
$$E_{T_n}(w) = \frac{1}{n}\sum_{i=1}^{n} |\eta(x_i, w) - t_i|.$$
The function $f$ appearing in Eq. 7.8.1 now corresponds to $f(x, t, w) = I_{\eta(x,w) \ne t}$, $w$ to $w$, $\mathcal{W}$ to $\mathcal{W}$, and the upper bound $b = 1$. A bound without an unknown constant, but having a worse exponent, is provided by the following.

Theorem 7.8.2 (Discrepancy between $e_g$ and $E_{T_n}$) Let $(\eta - t)^2$ be uniformly bounded by $b$ and define the family of functions $\mathcal{F} = \{I_{(\eta(x,w)-t)^2 > \alpha} : w \in \mathcal{W}, 0 \le \alpha \le b\}$:
$$P(|E_{T_n}(\hat{w}_n) - e_g(\hat{w}_n)| > \epsilon b) \le 4\, m_{\mathcal{F}}(2n)\, e^{-\frac{n}{2}\left(\epsilon - \frac{1}{2\sqrt{n}}\right)^2} \le 4\left(\frac{2en}{v}\right)^{v} e^{-\frac{n}{2}\left(\epsilon - \frac{1}{2\sqrt{n}}\right)^2}. \qquad (7.8.2)$$
Proof. See Appendix 1. □

The exponent is approximately $-n\epsilon^2/2$ rather than the correct $-2n\epsilon^2$ of Talagrand because reliance on symmetrization in the appended proof converted $\{0, 1\}$-valued random variables into $\{-1, 1\}$-valued random variables. A plot of the upper bound for $\epsilon = .1$, $v = 50$ is shown in Figure 7.2. It is evident that even with the reasonable parameter values we have chosen, this upper bound only becomes useful (less than unity) for $\gamma = 20$ or a very large sample size $n \approx 100{,}000$. A requirement of a sample size of about 100,000 is unreasonably large to ensure a discrepancy of no more than 10% with high probability. If one had this much data one would be better advised to set aside, say, a test set of 20,000 for evaluation and train on the remaining 80,000, especially as the assumed VC dimension of 50 suggests a moderate to small network. That VC theory might not be quantitatively helpful is also affirmed by the observation that a small perturbation in the node function $\sigma$ can yield a network of infinite VC dimension although this perturbation is unlikely to have practical consequences. Other bounds of this type can be found in Devroye et al. [61, Ch. 12].
FIGURE 7.2. Log upper bound for $\epsilon = .1$ and $v = 50$ (vertical axis: log upper bound; horizontal axis: sample size $n$ from $10^2$ to $10^6$ on a logarithmic scale).
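The calculation behind this figure is easily reproduced; the following MATLAB sketch (illustrative only) evaluates the logarithm of the right-hand bound of Eq. 7.8.2 for $\epsilon = .1$ and $v = 50$ over a range of sample sizes and confirms the observation above that the bound becomes useful only near $n \approx 100{,}000$.

% Log of the Theorem 7.8.2 upper bound, 4*(2*e*n/v)^v * exp(-(n/2)*(eps - 1/(2*sqrt(n)))^2),
% for eps = .1 and v = 50, over sample sizes 10^2 to 10^6.
epsn = 0.1;  v = 50;
n = logspace(2, 6, 400);
logbound = log(4) + v*log(2*exp(1)*n/v) - (n/2).*(epsn - 1./(2*sqrt(n))).^2;
n_useful = n(find(logbound < 0, 1));             % first sample size at which the bound is below unity
semilogx(n, logbound), xlabel('Sample Size n'), ylabel('Log Upper Bound');
fprintf('Bound drops below unity near n = %.0f\n', n_useful);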
7.8.3 Bounds Uniform Only over the Network Parameters
Turmon [236] and Turmon and Fine [237] transform the problem of assessing P (supw∈W |ETn (w)−eg (w)| > ) into a question about level exceedances of Gaussian random fields. In so doing they obtain a result that is uniform over parameter values (guaranteed by the supremum appearing inside P ) but dependent on the aspects of the unknown P . Looking at an individual term |ETn (w) √ − eg (w)| we see that we can apply the basic central limit theorem to n(ETn (w) − eg (w)) to find that this quantity is approximately distributed as N (0, V AR(E1 )). The multivariate central limit theorem (see Theorem 7.10.1) ensures simultaneous convergence at any finite number of the terms corresponding to different choices w1 , . . . , wk and asserts that the resulting rescaled variables are jointly normally distributed. There are technical details (e.g., see Pollard [188, Ch.7], [189]) to overcome in the transition to the case of infinitely many random variables forming a random field (process), √ but we can claim that for large n the random field with variables Z(w) = n(ETn (w) − eg (w)) transforms into a Gaussian random field. The joint normal distributions for any finite subset of {Z(w), w ∈ W} is given by a mean value function EZ(w) = 0, and a covariance function R(w, w ) = EZ(w)Z(w ) = COV ((η(x, w) − t)2 , (η(x, w ) − t)2 ), V AR(Z(w)) ≡ σ 2 (w) = EZ 2 (w) = nV AR(ETn (w)) = V AR(ET1 (w)), ρ(w, w ) =
R(w, w ) . σ(w)σ(w )
We can now write √ P (|ETn (w ˆ n ) − eg ( w ˆ n )| ≥ ) ≤ P (supw∈W |Z(w)| ≥ b = n) ≤ 2P (supw∈W Z(w) ≥ b),
(7.8.3)
with the latter condition following from the symmetric distribution of Z. Hence, we are interested √ in the probability that the Gaussian random field exceeds the large (O( n)) level b at some point in its index set W. Level exceedances of Gaussian random processes has been a subject of much study (e.g., Talagrand [234]). An analysis of these exceedances using the Poisson clumping heuristic (PCH) developed and discussed by Aldous [4], has been applied by Turmon and Fine in [236, 237]. In these latter calculations, interest focuses on the volume Γw of the set of parameter values w that yield Z(w ) closely correlated with a given Z(w). While there are uncountably many random variables Z(w), w ∈ W, one might expect that those variables that are closely correlated with each other could be treated as being approximately the same variable, and one need only account for the finite number of different neighborhoods of random variables. To carry out this idea we introduce an expression for the volume Γw of the parameter values w for which Z(w ) is tightly correlated (given in terms of the normalized correlation function ρ(w, w )) with Z(w) of the Gaussian random field Z as Γw (α) ≡ volume({w : ρ(w, w ) ≥ α}). Hence, Γw (α) is the volume of the region of high correlation (ρ ≥ α) of Z(w ) with Z(w). It is then shown by Turmon [236, Ch. 6] that, letting ¯ denote the complementary error function (tails of the standard normal), Φ and choosing an appropriate small value of γ,
¯ Φ(b) b ETn (w) − eg (w) 1 P ( sup ≥ √ )≈ ¯ (7.8.4) 2 dw. 1−γ σ(w) n Φ(bγ) W Γ ( w∈W ) w 1+γ 2
We can think of V olume(W)/Γw as a number of degrees of freedom or an effective number of sufficiently distinct parameter values contained in the uncountable collection W. The integral in Eq. 7.8.4 is then the reciprocal of the harmonic mean of the number of degrees of freedom when averaged over W. Evaluation of this estimate requires us to know enough about the underlying measure P to determine the normalized correlation function ρ. Generally, we would not have this prior knowledge and would have to estimate the correlation ρ from the errors made on the training data. Although estimation of correlation functions is a familiar undertaking in statistical practice, the difficulty here is that the arguments of ρ are not the familiar scalar time variables. Rather we are confronted with p-dimensional arguments for large p, and a significantly more difficult estimation task. In light of these practical difficulties, there is need for additional research, and we leave further discussion of this approach to the citations.
7.8.4 Fat-Shattering Dimension
In the analysis sketched in the preceding subsection, for unknown but fixed P , we find that an important statistic is the degrees of freedom determined
by the volume of parameter values closely related to a given parameter. Normalized responses $Z(w)$, $Z(w')$ that are highly correlated, and therefore dependent, can be treated as if they were the same. We are encouraged to focus on large distinctions and to ignore small ones. This attitude underlies the concept of fat-shattering dimension, a generalization of VC dimension that examines shattering by a sufficient margin $\gamma \ge 0$, with $\gamma = 0$ reducing to VC dimension as described earlier. Previously we maintained that a family $\{f\} = \mathcal{F}$ of real-valued functions dichotomizes a set $S = \{x_i\}$ through the binary values of $\mathrm{sgn}(f(x_i))$. This thresholding about zero is now generalized in the following definitions drawn from Kearns and Schapire [124].

Definition 7.8.1 (γ-shattering) A set of points $\{x_i\}$ is γ-shattered by a family $\mathcal{F}$ of real-valued functions if there exists a sequence of biases $\{b_i\}$ such that for every binary assignment $\{t_i\}$ there exists a function $f \in \mathcal{F}$ such that $(f(x_i) - b_i)t_i \ge \gamma$.

Definition 7.8.2 (Fat-Shattering Dimension) The fat-shattering dimension $fat_{\mathcal{F}}(\gamma)$ of a family $\mathcal{F}$ of real-valued functions having common domain $\mathcal{X}$ is the size of the largest (possibly infinite) set of points of $\mathcal{X}$ that can be γ-shattered by functions in $\mathcal{F}$.

What matters is that shattering is accomplished with a margin of at least $\gamma$. Of course, to make this definition productive we also need to relate it to bounds on the discrepancy between empirical and statistical averages such as the one provided in Eq. 7.8.1. It will be seen that fat-shattering dimension better characterizes generalization ability of a family of networks than does VC dimension. Indeed, it will be evident from Theorem 7.8.5, given below, that families of infinite VC dimension but finite fat-shattering dimension can still yield good generalization behavior. Bartlett, in an important and insightful paper [21], provides several results relating error probability to fat-shattering dimension, where, in our case, $f$ is a neural network $\eta$ described by parameter $w$, error probability is $e_g(w)$, and the empirical error $E_{T_n}$ is increased to include not only the cases of misclassification, $\mathrm{sgn}(\eta) \ne t$, but also those where $\eta$ does not separate by $\gamma$. A first result is the following.

Theorem 7.8.3 (Bartlett [21, Thm. 2]) Let $0 < \tau < 1/2$, $0 < \gamma < 1$, $\rho = fat_{\mathcal{N}}(\gamma/16)$. With probability at least $1 - \tau$,
$$(\forall w \in \mathcal{W})\quad e_g(w) < \frac{1}{n}\,\|\{i : |\eta(x_i, w)| < \gamma \ \text{or}\ \mathrm{sgn}(\eta(x_i, w)) \ne t_i\}\| + \epsilon(\gamma, n, \tau),$$
$$\epsilon^2(\gamma, n, \tau) = \frac{2}{n}\left(\rho \ln(34en/\rho)\log_2(578n) + \ln(4/\tau)\right).$$
A version of this theorem (Bartlett [21, Cor. 9]) that allows the choice of $\gamma$ to be made in a sample-dependent fashion, rather than fixed in advance, revises $\epsilon^2(\gamma, n, \tau)$ by replacing $\rho = fat_{\mathcal{N}}(\gamma/16)$ by $\rho = fat_{\mathcal{N}}(\gamma/32)$ and adding the term $\ln(2/\gamma)$. As can be seen, we now bound the generalization error by a "confidence" parameter added to a term larger than the usual empirical error because it also counts those instances in which the network took on a small (ambiguous) value less than $\gamma$. An advantage of this formulation becomes evident when the confidence in Theorem 7.8.5 is less than what it would have been from Theorem 7.8.1. This can occur because $fat_{\mathcal{F}}(\gamma) \le v_{\mathcal{F}}$. Another significant advantage of the fat-shattering formulation is that, unlike the case of the hard-to-estimate VC dimension, there are good estimates of $fat_{\mathcal{N}}(\gamma)$ that depend on the size of $w$ rather than the dimension $p$ of $w$. Another result for single hidden layer networks is given next; similar results are available in Bartlett [21] for multiple hidden layer networks.

Theorem 7.8.4 (Bartlett [21, Cor. 24]) Let the node function $\sigma$, $|\sigma| \le M/2$, be nondecreasing and $\mathcal{N}^A_{1,\sigma}$ be the class of single hidden layer neural networks with the weights to the output node satisfying the boundedness condition
$$|b_{2:1}| + \sum_{i=1}^{s_1} |w_{2:1,i}| \le A,$$
for $A \ge 1$. If $\gamma \le MA$, then
$$fat_{\mathcal{N}}(\gamma) \le \frac{cM^2A^2d}{\gamma^2}\log\left(\frac{MA}{\gamma}\right),$$
for some universal constant $c$.

We can now combine preceding results to conclude as follows.

Theorem 7.8.5 (Bartlett [21, Thm. 28(1)]) Let $\mathcal{X} = IR^d$, $0 < \gamma \le 1$, $0 < \tau < 1/2$, and the node function $\sigma$ be nondecreasing and bounded by unity, $|\sigma| \le 1$. For $A \ge 1$, let $\mathcal{N}^A_{1,\sigma}$ be as in Theorem 7.8.4. If the training set is of size $n$ and has $\{-1, 1\}$-valued targets, then with probability at least $1 - \tau$,
$$e_g(w) < \frac{1}{n}\,\|\{i : |\eta(x_i, w)| < \gamma \ \text{or}\ \mathrm{sgn}(\eta(x_i, w)) \ne t_i\}\| + \epsilon(\gamma, n, \tau),$$
$$\epsilon^2(\gamma, n, \tau) = \frac{c}{n}\left(\frac{A^2d}{\gamma^2}\log\left(\frac{A}{\gamma}\right)\log^2 n - \log\tau\right).$$
The conclusion that the size, (as measured by the l1 -norm) of the final layer components of the weight vector w, governs generalization behavior helps to explain both the observation that large-dimension (many parameters)
networks often succeed in practice and the value of properly selected regularization (see Section 6.3) in assuring good generalization performance. The significance of the l1 -norm A on the final layer weights is that a network η1 having a norm A and a large number of nodes (many small output layer weights) can be well-approximated by a network η2 having the same norm A but with fewer nodes, with this approximation having negligible consequences when we desire classification by a positive margin γ. Regularization, using the bound A to the l1 -norm or length of the output weights as penalty term, should produce results that generalize well. Finally, results of Alon et al. [5] establish that the notion of finite fatshattering dimension, unlike the case of finite VC dimension, provides a characterization of precisely those families F of functions or N of networks for which we can establish uniform convergence of sample averages of realvalued functions to their expectation. Alon et al. [5] provide an example where uniform convergence holds but the VC dimension is infinite, and the family of networks with bounded sum of output weights described in Theorem 7.8.5 provides another such instance.
7.9 Convergence of Parameter Estimates

7.9.1 Outline of the Argument
The material of Sections 7.9–7.13 is reproduced from Fine and Mukherjee [75]. The asymptotic (in sample size) behavior of the parameter or weight estimate returned by any member of a large family of neural network training algorithms has been oft-studied, but without properly accounting for the neglected but characteristic property of neural networks that their empirical and generalization errors possess multiple minima. We provide conditions under which the parameter estimate converges strongly into the set of minima of the generalization error. Convergence of the parameter estimate to a particular value cannot be expected under the assumptions we make. We then evaluate the asymptotic distribution of the distance between the parameter estimate and its nearest neighbor among the set of minima of the generalization error. Results on this question have appeared numerous times and generally assert asymptotic normality, the conclusion expected from familiar statistical arguments concerned with maximum likelihood estimators. These conclusions are usually reached on the basis of somewhat informal calculations. That the situation is somewhat delicate is indicated by the calculations of Mukherjee and Fine [173] that suggest elements of non-normality and by the discrepant findings of the statistical mechanics-based analyses of learning curves (e.g., Haussler et al. [99]) for finite families of functions. We impose a new condition on training to ensure convergence into a fixed finite set of minima of the generalization error. Additional assumptions are required to establish the commonly as-
serted conditional normal distribution. The preceding results are then used to provide a derivation of one family of previously derived learning curves for generalization and empirical errors that lead to bounds on rates of convergence of expectations. Our objective is to explore the behavior, as the training set size n becomes large, of the neural network parameter vector estimate w ˆ n made by ˆ n ), a training algorithm, of the observable empirical training set error ETn (w ˆ n ), all evaluated at the paand of the associated generalization error eg (w rameter vector estimate. We do so for a wide range of neural network architectures and choice of training algorithms. In outline, we incorporate Assumption 7.1.1 and its implications drawn in Lemma 7.4.1 to ensure the existence of bounded, uniformly continuous gradients and Hessian for eg . We also incorporate Assumption 7.3.1 on gradient-based training being terminated only when the norm of the gradient is sufficiently small. Assumptions 7.9.1 and 7.9.2 will be introduced to regulate the behavior of the minima of the generalization error. Assumption 7.9.3 then uses VapnikChervonenkis theory to connect the empirical and generalization errors in a manner tied to the behavior of the training algorithm. With these preliminaries in place, Section 7.9.3 establishes the basic Theorem 7.9.1 on the strong convergence of the parameter estimates returned by the training algorithm into the set of minima of the generalization error. Theorem 7.10.3 refines these results by establishing asymptotic conditional normality of the properly scaled discrepancies between the parameter estimates and the nearest minima of generalization error. In Section 7.10 we use these results to calculate bounds on learning curves for the convergence of generalization error. Similar learning curves for empirical error are derived in Section 7.11.
7.9.2 Properties of the Minima of Generalization Error
We introduce some plausible assumptions about the local minima of the generalization error eg (w) being well-determined by small values of gradients. Lemma 7.4.1 makes it meaningful to assert the following. Assumption 7.9.1 (Minima of eg ) eg has a finite set Meg of minima located in the interior of W and they are all stationary points: ˜ 1, . . . , w ˜ m } ⊂ Seg . Meg = {w The Hessian matrix Heg = ∇∇eg is positive definite at each of these interior minima. ˜ i we do require it to be positive Although He may be ill-conditioned at w definite and therefore nonsingular. In the natural parameterization that we have adopted, it is possible for the Hessian He to be singular (e.g., see Fukumizu [79]). If, for example, there is a minimum of generalization error for a network with fewer nodes than the one allowed by the dimension of W,
then there will be a manifold of (and hence uncountably many) parameter values achieving this same minimum of eg . In real applications, in which, for example, the data was not generated by a regression function that is precisely a small neural network, this phenomenon will not occur—its being ruled out by Assumption 7.9.1 is of little consequence. We motivate an additional assumption relating the size of the gradient ˜ through of eg (w) to the distance from w to the nearest interior minimum w the following considerations. The differentiability assumptions allow us to write a truncated Taylor’s series for the gradient about a local minimum at w ˜ in terms of the Hessian matrix of second-order derivatives and a zeroorder remainder ˜ − w) ˜ + o(w − w). ˜ ∇eg (w) = H(w)(w We choose w ˜ as the local minimum nearest to the selected w. Using the Theorem of the Mean, we can rewrite this conclusion. Letting gi = [∇eg (w)]i denote the ith component of the gradient and hi (w) the ith row of the Hessian matrix Heg (w), there is a vector wi on the line segment joining ˜ such that w, w ˜ i = 1, ..., p, gi = hi (wi )(w − w), ˜ denote and by the postulated parameter space convexity wi ∈ W. Let H ˜ is small enough, then by the the matrix with ith row hi (wi ). If w − w uniform continuity of the second derivatives we have that the elements ˜ also has ˜ are close to those of the positive definite H(w). of H ˜ Hence, H positive eigenvalues (it need not be positive definite), is invertible for small ˜ and we can write enough w − w, ˜ ˜ −1 ∇eg . ∇eg (w) = H(w − w), ˜ w−w ˜ =H It follows that if λmax , λmin are the positive largest and smallest eigenvalues ˜ then of H, 1 1 ∇eg (w)2 ≤ w − w ˜ 2 ≤ 2 ∇eg (w)2 . λ2max λmin Hence, when w is sufficiently close to a local minimum w, ˜ the discrepancy between the two can be related to the length of the gradient of the generalization error. With this background as motivation, we reverse matters and specify what we mean by well-determined local minima by introducing the assumption that a positive definite (p.d.) Hessian Heg (w) at w and a small enough gradient ∇eg (w) imply that w is close to its nearest neighbor ∈ Meg . (closest) minimum w(w) ˜ Assumption 7.9.2 (Proximity to Minima) Let ˜ i , w ˜ i ∈ Meg , di (w) = w − w
$$d(w) = \min_i d_i(w) = \|w - \tilde{w}(w)\|.$$
There exists $\delta > 0$, $\rho < \infty$, such that
$$H_{e_g}(w)\ \text{p.d.},\ \|\nabla e_g(w)\| < \delta \ \Rightarrow\ d(w) < \rho\,\|\nabla e_g(w)\|.$$
Assumption 7.9.2 is, of course, satisfied in the (unrealistic) well-studied case of a quadratic generalization error function $e_g(w_0) + (w - w_0)^T H(w - w_0)$ with $H$ positive definite as required for there to be a unique minimum.
7.9.3 Vapnik-Chervonenkis Theory and Uniform Bounds
We use the Vapnik-Chervonenkis (VC) theory to bound the deviation between the gradients of the empirical and generalization errors. A bound, possessing the best possible exponent of $-2n\epsilon^2$, is provided by Theorem 7.8.1. As noted in Section 7.8, we can then establish that we have convergence with probability one of $\sup_{w \in \mathcal{W}}\left|\frac{1}{n}\sum_{i=1}^{n} f(X_i, w) - Ef(X, w)\right|$ to 0. Although it is common to use VC theory to guarantee that with increasing sample size $n$ the discrepancy between the empirical and generalization errors can be made arbitrarily small with probability arbitrarily close to unity, we do not do so here because we concentrate on training algorithms that are based on finding stationary points of the empirical error, i.e., that seek to set the gradient of the empirical error to zero. This is hardly a restrictive assumption; it covers the whole class of gradient-based training algorithms, and even the so-called "second-order" training algorithms like the quasi-Newton and Levenberg-Marquardt algorithms, which approximate Hessians by means of the gradient. Because the training algorithm yields parameter estimates that are chosen to be stationary points of the empirical error (or close approximations thereof), we wish to have these parameter estimates be close to stationary points of the generalization error. In other words, we seek to apply bounds derived from VC theory to the gradients of the empirical and generalization errors. For our purposes, we shall define, for each component $w_i$ of $w$,
$$f_i(x, w) = \frac{1}{2}\frac{\partial(\eta(x, w) - t)^2}{\partial w_i} = (\eta(x, w) - t)\frac{\partial\eta(x, w)}{\partial w_i},$$
which, from Lemma 7.1.1, is uniformly bounded (in magnitude) by $B_i$, say. We make the following assumption.

Assumption 7.9.3 (Finite VC Dimension for Gradients) For each component $w_i$ of $w$ the family
$$\mathcal{D}_i = \left\{I_{(\alpha,\infty)}\left((\eta(x, w) - t)\frac{\partial\eta(x, w)}{\partial w_i}\right) : w \in \mathcal{W},\ |\alpha| \le B_i\right\}$$
of binary-valued (indicator) functions of $x, t$ has finite VC dimension $v_{\mathcal{D}_i}$.
It follows from the VC bound of Eq. 7.8.1 that for each component $w_i$ and appropriately chosen $\epsilon_n, \tau_n$ converging to zero with increasing $n$,
$$P\left(\sup_{w \in \mathcal{W}}\left|\frac{\partial E_{T_n}(w)}{\partial w_i} - \frac{\partial e_g(w)}{\partial w_i}\right| > B_i\epsilon_n\right) < \tau_n.$$
Because there are a finite number $p$ of components of $w$, we can combine the results for the individual components (a union bound) into the following.

Lemma 7.9.1 (Gradient Discrepancies) Under Assumption 7.9.3 there is a finite $B = \max_{i \le p} B_i$ and appropriately chosen $\epsilon_n, \tau_n$ converging to zero with increasing $n$, such that for any probability measure $P$,
$$P\left(\sup_{w \in \mathcal{W}} \|\nabla E_{T_n}(w) - \nabla e_g(w)\| > B\epsilon_n\right) < \tau_n.$$
Thus Assumption 7.9.3 ensures that the continuous gradients of the empirical and generalization errors are closely linked together in that with increasing sample size n the probability can be made arbitrarily close to unity (τn close to 0) that there is only an arbitrarily small (n near 0) discrepancy between them.
7.9.4 Almost Sure Convergence of Estimated Parameters into the Set of Minima of Generalization Error

Combining Assumption 7.3.1(a) with Lemma 7.9.1 and the triangle inequality for norms informs us that with probability at least $1 - \tau_n$ converging to unity,
$$\|\nabla e_g(\hat{w}_n)\| \le B\epsilon_n + \delta_n.$$
Hence, for large enough $n$ the condition of Assumption 7.9.2 that $\|\nabla e_g\| < \delta$ will be met and the positive-definiteness of the local Hessian is guaranteed by Assumption 7.3.1(c). Hence, we can conclude that with probability at least $1 - \tau_n$
$$d(\hat{w}_n) < \rho(B\epsilon_n + \delta_n).$$
From the remarks on a.s. convergence that followed Theorem 7.8.1, we see that we can choose $\epsilon_n \downarrow 0$ such that $\sum_n \tau_n < \infty$, and this implies by the Borel-Cantelli Lemma that the events $C_n = \{d(\hat{w}_n) > \rho(B\epsilon_n + \delta_n)\}$ occur only finitely often with probability one. Hence, because $\epsilon_n, \delta_n$ are both converging to 0 we have established the following.
Theorem 7.9.1 (a.s. Parameter Convergence) Under Assumptions 7.1.1, 7.3.1, 7.9.1, 7.9.2, and 7.9.3, the parameter estimate $\hat{w}_n$ returned by the training algorithm converges with probability one (and thus in probability) to its nearest neighbor minimum $\tilde{w}(\hat{w}_n) \in \mathcal{M}_{e_g}$:
$$P\left(\lim_{n\to\infty} d(\hat{w}_n) = 0\right) = P\left(\lim_{n\to\infty} \|\hat{w}_n - \tilde{w}(\hat{w}_n)\| = 0\right) = 1.$$
Furthermore, by the assumed compactness of $\mathcal{W}$, we also have convergence in mean square:
$$\lim_{n\to\infty} E\, d(\hat{w}_n)^2 = 0.$$
What Theorem 7.9.1 establishes is that the parameter estimates returned by the training algorithm A(Tn ) = w ˆ n , for increasing sample size n, converge strongly into the finite set of interior minima Meg in that the distance between w ˆ n and its nearest neighbor in Meg converges to zero. No assertion is made as to convergence to a particular element of Meg , nor should that be expected. Reinitiating batch training with a larger training set is unlikely to return you to the same parameter estimate. Without further assumptions, online training (interrupted periodically to verify the conditions of Assumption 7.3.1) with increasing numbers of samples is also not guaranteed to converge to a particular minimum.
7.10 Asymptotic Distribution of Training Algorithm Parameter Estimates

The argument presented in this subsection is motivated by the classical large sample statistical analyses of the asymptotic normality properties of the well-known maximum likelihood estimator (e.g., Lehmann [143, pp. 409–417]) but takes care to account for the uncommon circumstance of multiple stationary points. Such arguments in the context of neural networks, not accounting for the presence of multiple minima, have been advanced in a number of somewhat informal papers (e.g., Amari [7] and coworkers [8, 9], Murata [175], Ripley [192]), and some more formal papers (e.g., White [251]). We shall see that the presence of multiple minima makes the asymptotic analysis more complex than it has been assumed. Indeed, without additional assumptions we cannot reach the oft-claimed conclusion of asymptotic normality. We know from Theorem 7.9.1 that for large enough $n$, the estimated parameter vector $\hat{w}_n$ comes arbitrarily close to its nearest neighbor $\tilde{w}(\hat{w}_n) \in \mathcal{M}_{e_g}$ with arbitrarily high probability. As we retrain with different training set sizes $n$, the particular value of the nearest neighbor $\tilde{w}(\hat{w}_n)$ can be expected to change. Let $K_n$ denote the random variable taking values in
$\{1, \ldots, m\}$ that specifies the index of the nearest neighbor minimum $\tilde{w}_{K_n}$ to $\hat{w}_n$; $\tilde{w}_{K_n} = \tilde{w}(\hat{w}_n)$ in our earlier notation. Fix a large sample size $n$. Given an estimated parameter $\hat{w}_n$ with nearest neighbor $\tilde{w}_{K_n} \in \mathcal{M}_{e_g}$, we introduce a Taylor's series multivariable expansion for the gradient of the empirical error about $\tilde{w}_{K_n}$,
$$\nabla E_{T_n}(\hat{w}_n) = \nabla E_{T_n}(\tilde{w}_{K_n}) + H_n(\tilde{w}_{K_n})(\hat{w}_n - \tilde{w}_{K_n}) + (\tilde{H}_n - H_n(\tilde{w}_{K_n}))(\hat{w}_n - \tilde{w}_{K_n}), \qquad (7.10.1)$$
where Hn (w ˜ Kn ) is the Hessian matrix of mixed second partial derivatives ˜ n is the matrix of mixed of the empirical error evaluated at w ˜ Kn , and H second partial derivatives of the empirical error that, paralleling the use of the Theorem of the Mean in Section 7.9.2, has rows that are evaluated ˜ Kn . It follows from the continuity of the at points lying between w ˆ n and w second derivatives established in Lemma 7.2.1 that the last term is a zeroorder remainder o(w ˆn −w ˜ Kn ). We now proceed to examine the asymptotic in n behavior of each of the terms in this expansion, taking them from left to right. We will evaluate the terms in Eq. 7.10.1 by simultaneously ˜ 1, . . . , w ˜ m} considering all of the finitely many possible values Meg = {w for w; ˜ i.e., explore all of the m values of Kn . Furthermore, seeking a central limit √ theorem type of result, we scale Eq. 7.10.1 by multiplying each term by n. Assumption 7.3.1 postulates that in our training process we have selected √ a sequence δn shrinking to 0 more rapidly than 1/ n and that this upper ˆ n ). Hence, in the scaled Eq. 7.10.1 bounds the magnitude ∇ETn (w √ ˆ n ) = 0. lim n∇ETn (w n→∞
The evaluation of this term does not depend on the value of $K_n$. Turning to the second term, for each $\tilde{w}_k \in M_{e_g}$, consider the term
$$Y_n^{(k)} = \sqrt{n}\,\nabla E_{T_n}(\tilde{w}_k) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \nabla(\eta(x_i, \tilde{w}_k) - t_i)^2.$$
The summands are i.i.d. and have expectation $\nabla e_g(\tilde{w}_k)$. However, $e_g$ has a stationary point at $\tilde{w}_k$, making its gradient there the zero vector $\mathbf{0}$. Assumption 7.9.1 assures us that there are only finitely many local minima of $e_g$. Enumerate the minima as $\tilde{w}_1, \ldots, \tilde{w}_m$, and stack the gradients at the various minima into column vectors $Z_i$, $S_n$ of dimension $mp$,
$$Z_i = [\nabla(\eta(x_i, \tilde{w}_1) - t_i)^2, \ldots, \nabla(\eta(x_i, \tilde{w}_m) - t_i)^2], \qquad S_n = [Y_n^{(1)}, \ldots, Y_n^{(m)}] = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i.$$
The $\{Z_i\}$ vectors are i.i.d. with zero mean $\mathbf{0}$ and covariance matrix $B$, with the existence of finite moments assured by Lemma 7.1.1. Letting $N(m, B)$ denote the multivariate normal distribution with mean vector $m$ and covariance matrix $B$, we invoke the following theorem.

Theorem 7.10.1 (Multidimensional Central Limit Theorem) If $\{Z_i\}$ are i.i.d. with mean vector $EZ = \mathbf{0}$ and covariance matrix $B = EZZ^T$ and
$$S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i,$$
then the probability law $\mathcal{L}(S_n)$ (i.e., distribution or characteristic function) converges to that of the multivariate normal $N(\mathbf{0}, B)$ (convergence in distribution to the normal).

Proof. See Lehmann [143, p. 343].

Thus for sufficiently large $n$, the distribution or probability law $\mathcal{L}(S_n)$ of $S_n$ will become arbitrarily close to that of a zero-mean normal. Because any subset of a set of jointly normal random variables is also jointly normal, we see that the zero-mean asymptotic normality of $S_n$ established by Theorem 7.10.1 guarantees the simultaneous zero-mean asymptotic normality of its partitions into $\{Y_n^{(k)}\}$. From the overall covariance matrix $B$ we can determine the individual covariance matrix $C_k$ corresponding to $Y_n^{(k)}$. However, even more has been shown. We have established that the collection of random vectors $\{Y_n^{(k)}\}$ is jointly normally distributed.

Turning to $H_n(\tilde{w}_k)$, the Hessian matrix of $E_{T_n}(\tilde{w}_k)$, we see that it is an average of i.i.d. bounded (hence expectations exist) random variables:
$$H_n = \frac{1}{n} \sum_{i=1}^{n} \nabla\nabla(\eta(x_i, \tilde{w}_k) - t_i)^2.$$
Invoke the strong law of large numbers to conclude that with probability one (and hence in probability and in mean square due to boundedness of the summands) we have convergence to its expectation $H_e(\tilde{w}_k)$, the Hessian of $e_g$ at $\tilde{w}_k$:
$$P\left(\lim_{n\to\infty} H_n(\tilde{w}_k) = H_e(\tilde{w}_k)\right) = 1.$$
Because there are only finitely many points in $M_{e_g}$, we can use the union bound to conclude that we have simultaneous a.s. convergence
$$P\left(\forall \tilde{w}_k \in M_{e_g}:\ \lim_{n\to\infty} H_n(\tilde{w}_k) = H_e(\tilde{w}_k)\right) = 1.$$
Multiplying through by $\sqrt{n}$ in Eq. 7.10.1, we have established that the first term (the one on the left-hand side) is asymptotically 0. The third term
is asymptotically, with probability one, a positive definite matrix $H_e(\tilde{w}_{K_n})$ times the normalized $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$. We know from Theorem 7.9.1 that $d(\hat{w}_n) = \|\hat{w}_n - \tilde{w}_{K_n}\|$ converges strongly to 0, so we have from the continuity of the elements of the Hessian that as $n$ grows, $\tilde{H}_n - H_n(\tilde{w}_{K_n})$ converges in probability to a zero matrix. Thus the fourth (last) term is a zero-order remainder relative to the third term. Using these observations, we rewrite Eq. 7.10.1 as
$$o_p(1) = Y_n^{(K_n)} + H(\tilde{w}_{K_n})\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n}),$$
with the remainder $o_p(1)$ a sequence of random variables converging in probability to 0 with increasing $n$, and therefore having no influence on the form of the asymptotic distribution. Assumption 7.9.1 guarantees the positive-definiteness (and hence the invertibility) of $H_e(\tilde{w}_k)$, the Hessian for $e_g$ at each of its minima $\tilde{w}_k$. Introduce the shorthand
$$v_{k,n} = -H_e^{-1}(\tilde{w}_k)\, Y_n^{(k)}.$$
Having established that the collection $\{Y_n^{(k)}\}$ is asymptotically jointly normally distributed, it follows from the properties of linear transformations of normally distributed random vectors that so is the collection $\{-H_e^{-1}(\tilde{w}_k) Y_n^{(k)}\}$. Hence, $\{v_{k,n}, k = 1, \ldots, m\}$ is also asymptotically jointly normally distributed. An individual term $v_{k,n}$ is asymptotically distributed as $N(\mathbf{0}, F_k)$ where $F_k = H_e^{-1}(\tilde{w}_k) C_k H_e^{-1}(\tilde{w}_k)$. We may then rewrite this as
$$\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n}) = v_{K_n,n} + o_p(1). \qquad (7.10.2)$$
Adding a term $o_p(1)$ that is asymptotically negligible will not change the asymptotic distribution. Thus, for large enough $n$, $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$ is distributed as $v_{K_n,n}$.

It is tempting to conclude that $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n}) = v_{K_n,n}$ is also asymptotically normally distributed, and this temptation has been yielded to whenever asymptotic normality has been asserted based on assuming that you are in the vicinity of a particular minimum. Unfortunately, this conclusion cannot be supported without additional assumptions concerning the selection of $K_n$. To better understand the difficulty, consider the situation in which we have an i.i.d. collection $\mathcal{C}$ of random variables $\{X_1, \ldots, X_m\}$ with common probability law, say, $N(0,1)$. Another random variable $Y = X_K$ is defined as a selection from $\mathcal{C}$. Conditional on the choice $K = k$ we might be tempted to claim that $Y$ is also distributed as $N(0,1)$. However, if this choice was made as $Y = \min_k X_k$, then $Y$ would not be distributed as $N(0,1)$, even though $Y$ is chosen as one of the random variables in the collection $\mathcal{C}$. A sufficient condition to ensure the expected conclusion would
be to constrain the choice $K$ to be made independently of the values of the random variables in the collection $\mathcal{C}$. For example, we might choose $K = k$ with probability $\pi_k$ independently of $\mathcal{C}$. Define
$$L_n = \min_k \|v_{k,n}\|^2, \qquad U_n = \max_k \|v_{k,n}\|^2.$$
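To make the selection effect in the example above concrete, the following short Monte Carlo sketch (ours, not from the text; the number of variables and replications are arbitrary choices) compares a selection made independently of the values in $\mathcal{C}$ with the selection $Y = \min_k X_k$; only the former behaves like an $N(0,1)$ draw.

% Monte Carlo illustration of the selection effect discussed above.
% m i.i.d. N(0,1) variables; compare a value-independent selection
% with the value-dependent selection Y = min_k X_k.
m = 5; nreps = 10000;
X = randn(nreps, m);                     % each row is one realization of C
Kind = ceil(m*rand(nreps,1));            % index chosen independently of C
Yind = X(sub2ind(size(X), (1:nreps)', Kind));
Ymin = min(X, [], 2);                    % selection based on the values themselves
fprintf('independent selection: mean %.3f, std %.3f\n', mean(Yind), std(Yind));
fprintf('minimum selection:     mean %.3f, std %.3f\n', mean(Ymin), std(Ymin));
% The first pair is near (0,1); the second has a clearly negative mean,
% so Y = min_k X_k is not N(0,1) even though each X_k is.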
The next theorem draws conclusions from Eq. 7.10.2 about the rate of convergence of $\hat{w}_n$ to $\tilde{w}_{K_n}$.

Theorem 7.10.2 (Upper and Lower Bounds to Parameter Error) Under Assumptions 7.1.1, 7.3.1, 7.9.1, 7.9.2, and 7.9.3, for all $\epsilon > 0$,
$$\lim_{n\to\infty} P\left(L_n - \epsilon \le n\|\hat{w}_n - \tilde{w}_{K_n}\|^2 \le U_n + \epsilon\right) = 1.$$
Thus $\|\hat{w}_n - \tilde{w}_{K_n}\|$ decreases to zero as $O_p(1/\sqrt{n})$. Note that we have the information about joint normality required to determine the asymptotic distributions of the upper and lower bound random variables $U_n, L_n$; it suffices to observe that they are asymptotically nondegenerate random variables (e.g., with finite, positive second moments).

To proceed further with Eq. 7.10.2, we observe that in the usual neural network context, the minimum that one converges toward depends on the initialization of the training algorithm and on the training set $T_n$ used to construct the empirical error surface $E_{T_n}$. For large enough $n$, by our assumptions, there will be little discrepancy between the locations of the minima of $E_{T_n}$ and $e_g$. Each of the $m$ minima then has a basin of attraction for a given algorithm such that initiating training in the basin of attraction of $\tilde{w}_k$ should lead to convergence to the neighborhood of $\tilde{w}_k$. If the initial value $w_0$ in an iterative training algorithm $A$, like steepest descent, conjugate gradient, quasi-Newton, or Levenberg-Marquardt, is chosen according to some distribution over $\mathcal{W}$ that does not depend upon the training set $T_n$, then one expects the choice of nearest neighbor minimum $K_n$ to be nearly (not exactly so) independent of $T_n$ for large enough $n$. Because the existence of the distribution of $K_n$ would commit us to an additional assumption, we focus on conditional distributions and are motivated to make the following assumption.

Assumption 7.10.1 (Asymptotically Normal Conditionals) For each $k = 1, \ldots, m$, the conditional distribution of $v_{K_n,n}$ given $K_n = k$ is asymptotically in $n$ equal to the unconditional normal distribution of $v_{k,n}$.

We can now assert the following.

Theorem 7.10.3 (Conditional Asymptotic Normality) Under Assumptions 7.1.1, 7.3.1, 7.9.1, 7.9.2, 7.9.3, and 7.10.1, the conditional
distribution of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$, given that $K_n = k$, converges to a zero-mean multivariate normal:
$$\mathcal{L}\left(\sqrt{n}(\hat{w}_n - \tilde{w})\,\big|\,\tilde{w} = \tilde{w}_k\right) \xrightarrow{D} N(\mathbf{0}, F_k). \qquad (7.10.3)$$
In proving this theorem we used the fact that $H_e$ is a symmetric matrix, indeed a correlation matrix. Note that when we condition on $\tilde{w}_k$ we do not mean that this value holds for all $n$. Rather, for large enough $n$, whatever the value of the nearest neighbor minimum $\tilde{w}_{K_n}$, the resulting conditional distribution of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$ will be arbitrarily close to the cited zero-mean normal. If we do not condition on a value for the nearest neighbor minimum of $e_g$, then the resulting asymptotic distribution of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$ may be a mixture of zero-mean normals with a mixing distribution, the distribution of $K_n$, corresponding to the asymptotic probabilities with which the various minima are approached by the parameter estimate $\hat{w}_n$. The existence of this mixing distribution would require further assumptions as to the training algorithm and the connections between training on different sample sizes. The conclusion of Theorem 7.10.3, without the same concern for multiple minima, has also been asserted in Ripley [192, p. 32] and White [251, p. 1005].

We can simplify the conclusion of Eq. 7.10.3 by introducing further assumptions of limited applicability. Dropping the subscript $k$ for convenience, note that
$$H_e = E\nabla\nabla E_1(\tilde{w}) = 2E[\nabla\eta\nabla\eta^T + (\eta - t)\nabla\nabla\eta] = 2E[\nabla\eta\nabla\eta^T + E\{(\eta - t)\,|\,x\}\nabla\nabla\eta],$$
$$C = 4E[(\eta(x, \tilde{w}) - t)^2\,\nabla\eta\nabla\eta^T] = 4E[E\{(\eta(x, \tilde{w}) - t)^2\,|\,x\}\,\nabla\eta\nabla\eta^T].$$
If $\eta(x, \tilde{w})$ is the Bayes estimator (not a likely event in neural network applications),
$$\eta(x, \tilde{w}) = E(t\,|\,x),$$
then
$$H_e = 2E[\nabla\eta\nabla\eta^T]. \qquad (7.10.4)$$
Eq. 7.10.4 could also be derived if $(\eta - t)$ is independent of $\nabla\nabla\eta$ at $\tilde{w}$ and $E\eta = Et$. If the conditional mean square prediction error $E\{(\eta(x, \tilde{w}) - t)^2\,|\,x\}$ is independent of $\nabla\eta$, or more narrowly, if
$$(\forall i, j)\quad E\{(\eta(x, \tilde{w}) - t)^2\,|\,x\} \ \perp\ \frac{\partial\eta}{\partial w_i}\,\frac{\partial\eta}{\partial w_j},$$
then we can simplify
$$C = 2e_g(\tilde{w})H_e, \qquad (7.10.5)$$
using the simplification of Eq. 7.10.4. In this case we have that
$$H_e^{-1} C H_e^{-1} = 2e_g(\tilde{w}) H_e^{-1}$$
and
$$\mathcal{L}\left(\sqrt{n}(\hat{w}_n - \tilde{w})\,\big|\,\tilde{w} = \tilde{w}_k\right) \approx N\left(\mathbf{0},\, 2e_g(\tilde{w}_k)H_e^{-1}(\tilde{w}_k)\right). \qquad (7.10.6)$$
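The covariance $F_k = H_e^{-1}(\tilde{w}_k)C_k H_e^{-1}(\tilde{w}_k)$ appearing in Eq. 7.10.3 can be estimated by plugging sample averages evaluated at $\hat{w}_n$ into the expressions for $H_e$ and $C$ given above. The sketch below (our illustration, not part of the original text) does this in the simplest possible setting, a single linear output node $\eta(x, w) = w^T x$ fit by least squares, for which $\nabla\eta = x$ and $\nabla\nabla\eta = 0$ so that every quantity is explicit; the dimensions and noise level are arbitrary choices.

% Plug-in (sandwich) estimate of the asymptotic covariance F = He^{-1} C He^{-1}
% for a single linear node eta(x,w) = w'*x fit by least squares.
d = 3; n = 500;
wtrue = [1; -2; 0.5];
X = randn(d, n);                      % columns are inputs x_i
t = wtrue'*X + 0.3*randn(1, n);       % targets with additive noise
what = (X*X') \ (X*t');               % least-squares parameter estimate
res = what'*X - t;                    % residuals eta(x_i,what) - t_i
% Sample versions of He = 2 E[x x'] and C = 4 E[(eta-t)^2 x x'].
Hhat = 2*(X*X')/n;
Chat = 4*(X.*repmat(res.^2, d, 1))*X'/n;
Fhat = Hhat \ Chat / Hhat;            % sandwich covariance estimate
se = sqrt(diag(Fhat)/n);              % approximate standard errors of what
disp([what se]);

The square roots of the diagonal of $\hat{F}/n$ serve as rough standard errors for the fitted weights; for a genuine multilayer network the per-example gradients $\nabla\eta$ would instead come from a backpropagation routine such as the one developed in Chapter 5.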
7.11 Asymptotics of Generalization Error: Learning Curves

We have established that the training algorithm $A$, under suitable assumptions, returns a sequence of parameter estimate errors $\hat{w}_n - \tilde{w}_{K_n}$ that a.s. converges to 0, and that by Theorem 7.10.2 the magnitude of the magnified discrepancy $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$ is asymptotically bounded above and below by nondegenerate random variables. This enables us to use a Taylor's series with remainder for $e_g(\hat{w}_n)$ to determine the rate of approach of the generalization error of the network selected by the training algorithm to the generalization error $e_g(\tilde{w}_{K_n})$ at a closest minimum of generalization error. This result is known as a learning curve. Lemma 7.4.1 enables us to write
$$e_g(\hat{w}_n) = e_g(\tilde{w}_{K_n}) + \nabla e_g(\tilde{w}_{K_n})^T(\hat{w}_n - \tilde{w}_{K_n}) + \frac{1}{2}(\hat{w}_n - \tilde{w}_{K_n})^T H_e(\tilde{w}_{K_n})(\hat{w}_n - \tilde{w}_{K_n}) + o(\|\hat{w}_n - \tilde{w}_{K_n}\|^2).$$
Assumption 7.9.1 informs us that for each $k$, $\nabla e_g(\tilde{w}_k) = \mathbf{0}$ and that $H_e(\tilde{w}_k)$ is positive definite. Thus,
$$e_g(\hat{w}_n) = e_g(\tilde{w}_{K_n}) + \frac{1}{2}(\hat{w}_n - \tilde{w}_{K_n})^T H_e(\tilde{w}_{K_n})(\hat{w}_n - \tilde{w}_{K_n}) + o(\|\hat{w}_n - \tilde{w}_{K_n}\|^2).$$
The a.s. convergence guaranteed by Theorem 7.9.1 allows us to conclude that asymptotically in $n$ the zero-order remainder will become negligible compared to the quadratic form. Hence, we have the asymptotically valid expression
$$n(e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n})) = \frac{1}{2}\,\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})^T H_e(\tilde{w}_{K_n})\,\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n}).$$
An implication of this result can be drawn out if we first define $\bar{\lambda}_e$ as the maximum of the $mp$ eigenvalues taken over each of the $m$ $p \times p$ Hessian matrices $H_e$ and $\lambda_e$ as the corresponding minimum of these eigenvalues.
Note that these extremal eigenvalues are determined by $e_g$ and are not random. By Assumption 7.9.1,
$$0 < \lambda_e \le \bar{\lambda}_e.$$
If $0 < \lambda \le \bar{\lambda}$ are the minimum and maximum eigenvalues of a positive definite matrix $A$, then the quadratic form satisfies
$$\lambda\|x\|^2 \le x^T A x \le \bar{\lambda}\|x\|^2.$$
Now use Theorem 7.10.2 to derive the following.

Theorem 7.11.1 (Learning Curve Bounds) Under Assumptions 7.1.1, 7.3.1, 7.9.1, 7.9.2, and 7.9.3, for all $\epsilon > 0$,
$$\lim_{n\to\infty} P\left(\frac{1}{2}\lambda_e L_n - \epsilon \le n(e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n})) \le \frac{1}{2}\bar{\lambda}_e U_n + \epsilon\right) = 1.$$
Hence, the discrepancy $e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n})$ shrinks to 0 at a rate $O_p(1/n)$.

Under the additional Assumption 7.10.1, Eq. 7.10.3 asserts the conditional asymptotic zero-mean normality of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$. Let $\{Z_k\}$ denote the $m$ non-negative random variables
$$Z_k = \frac{1}{2} Y_k^T H(\tilde{w}_k) Y_k, \quad Y_k \sim N(\mathbf{0}, F_k), \quad F_k = H_k^{-1} C_k H_k^{-1}, \quad EZ_k = \frac{1}{2}\mathrm{Trace}(H_k F_k) = \frac{1}{2}\mathrm{Trace}(C_k H_k^{-1}),$$
to conclude with the following.

Theorem 7.11.2 (Learning Curve Distribution) Under Assumptions 7.1.1, 7.3.1, 7.9.1, 7.9.2, 7.9.3, and 7.10.1, $n(e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n}))$ conditional on $K_n = k$ is asymptotically distributed as $Z_k$:
$$\mathcal{L}\left(n[e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n})]\,\big|\,\tilde{w}_{K_n} = \tilde{w}_k\right) \approx \mathcal{L}(Z_k) = \mathcal{L}\left(\frac{1}{2}\mathrm{Trace}(C_k H_k^{-1}) + (Z_k - EZ_k)\right).$$
The oft-cited learning curve result of convergence at a rate of 1/n has been established under a number of assumptions, of which the one most worth remarking on is Assumption 7.3.1b on the termination criterion for the training algorithm. In the absence of such an assumption the terms that we neglected in the Taylor’s series expansions could become comparable to the ones included and thereby falsify our learning curve conclusions. If we proceed further, making the assumptions of little generality that led
to Eq. 7.10.6, then we find that conditional upon $\tilde{w}(\hat{w}_n) = \tilde{w}_k$, $C_k = 2e_g(\tilde{w}_k)H_k$ and $\mathrm{Trace}(H_k H_k^{-1}) = p$, and
$$n(e_g(\hat{w}_n) - e_g(\tilde{w}_k)) \sim p + (Z_k - EZ_k).$$
In this case, the bias in the achievement of $e_g(\tilde{w})$ is proportional to $p/n$, the ratio of the number of parameters to the training sample size, and we lend support to the practical guide that the sample size should be a significant multiple of the number of parameters (complexity) of the network. Of course, the results of Section 7.8.4 tell us to refocus from the dimension $p$ to the size of the resulting weight vector.
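The $p/n$ rule of thumb above is easy to check empirically in the one case where the assumptions hold exactly, a linear model with a single quadratic minimum. The following Monte Carlo sketch (ours, not from the text; dimensions, noise level, and sample sizes are arbitrary illustrative choices) estimates the excess generalization error $e_g(\hat{w}_n) - e_g(\tilde{w})$ and compares it with $\sigma^2 p/n$.

% Empirical learning curve for a linear model eta(x,w) = w'*x,
% illustrating excess generalization error decaying like c*p/n.
p = 8; sig = 0.5; wtrue = randn(p,1);
nlist = [50 100 200 400 800]; nreps = 200;
excess = zeros(size(nlist));
for j = 1:length(nlist)
  n = nlist(j); acc = 0;
  for r = 1:nreps
    X = randn(p, n);
    t = wtrue'*X + sig*randn(1, n);
    what = (X*X') \ (X*t');
    % For x ~ N(0,I), e_g(w) - e_g(wtrue) = ||w - wtrue||^2
    acc = acc + sum((what - wtrue).^2);
  end
  excess(j) = acc/nreps;
end
disp([nlist' excess' (sig^2*p./nlist)']);   % observed excess vs. sigma^2*p/n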
7.12 Asymptotics of Empirical/Training Error

We complete our discussion by examining the discrepancy between $E_{T_n}(\hat{w}_n)$ and $E_{T_n}(\tilde{w}_{K_n})$. As in Section 7.11 we introduce a second-order Taylor's series expansion with remainder, now taken about $\hat{w}_n$,
$$E_{T_n}(\tilde{w}_{K_n}) = E_{T_n}(\hat{w}_n) + \nabla E_{T_n}(\hat{w}_n)^T(\tilde{w}_{K_n} - \hat{w}_n) + \frac{1}{2}(\tilde{w}_{K_n} - \hat{w}_n)^T H_n(\hat{w}_n)(\tilde{w}_{K_n} - \hat{w}_n) + o(\|\tilde{w}_{K_n} - \hat{w}_n\|^2).$$
Scaling by $n$ yields
$$n(E_{T_n}(\tilde{w}_{K_n}) - E_{T_n}(\hat{w}_n)) = \sqrt{n}\,\nabla E_{T_n}(\hat{w}_n)^T \sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n) + \frac{1}{2}\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)^T H_n(\hat{w}_n)\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n) + o(\|\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)\|^2).$$
Assumption 7.3.1b informs us that $\sqrt{n}\,\nabla E_{T_n}(\hat{w}_n)$ converges to zero. Theorem 7.10.2 informs us that $\|\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)\|$ is bounded above and below by the nondegenerate random variables $U_n, L_n$. Thus the asymptotically dominant term on the right-hand side is $\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)^T H_n(\hat{w}_n)\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)$. Theorem 7.9.1 states that $d(\hat{w}_n) = \|\tilde{w}_{K_n} - \hat{w}_n\|$ a.s. converges to zero. Hence, invoking the continuity of the second derivatives that comprise $H_n(\hat{w}_n)$, we see that it a.s. converges to $H_n(\tilde{w}_{K_n})$. Section 7.10 established the convergence of $H_n(\tilde{w}_{K_n})$ to $H_e(\tilde{w}_{K_n})$ with probability one on the basis of the strong law of large numbers. Assembling these remarks and using the same notation as in Theorem 7.11.1 enables us to conclude the following.
Theorem 7.12.1 (Empirical Error Learning Curve Bounds) Under Assumptions 7.1.1, 7.3.1, 7.9.1, 7.9.2, and 7.9.3, for all $\epsilon > 0$,
$$\lim_{n\to\infty} P\left(\frac{1}{2}\lambda_e L_n - \epsilon \le n(E_{T_n}(\tilde{w}_{K_n}) - E_{T_n}(\hat{w}_n)) \le \frac{1}{2}\bar{\lambda}_e U_n + \epsilon\right) = 1.$$
A parallel to Theorem 7.11.2 for $e_g$ is the following.

Theorem 7.12.2 (Empirical Error Learning Curve Distribution) Under Assumptions 7.1.1, 7.3.1, 7.9.1, 7.9.2, 7.9.3, and 7.10.1, $n(E_{T_n}(\tilde{w}_{K_n}) - E_{T_n}(\hat{w}_n))$ conditional on $K_n = k$ is asymptotically distributed as $Z_k$:
$$\mathcal{L}\left(n[E_{T_n}(\tilde{w}_{K_n}) - E_{T_n}(\hat{w}_n)]\,\big|\,K_n = k\right) \approx \mathcal{L}(Z_k) = \mathcal{L}\left(\frac{1}{2}\mathrm{Trace}(C_k H_k^{-1}) + (Z_k - EZ_k)\right).$$
For large enough training set size $n$ we can expect the training algorithm $A$ to return a parameter estimate $\hat{w}_n$ that yields an empirical error near to the empirical error at the nearest neighbor minimum $\tilde{w}_{K_n}$ of the unknown generalization error.
7.13 Afterword

The methods presented in Sections 7.1–7.8 for moderate sample sizes are also given an overview in Ripley [192, Secs. 2.7, 2.8, and elsewhere]. It is clear that, with the exception of evaluations using an independent test set, estimation of generalization error, either for a single network or uniformly across an architecture, is a very difficult and largely unsolved task. Although cross-validation and bootstrap techniques are well studied in the pattern classification and statistical literature, they are largely irrelevant in the neural network setting. The existence of multiple minima and randomly initialized training algorithms conspires to make it unlikely that training on modified data samples will yield the same parameter values. In effect, the means differ with the samples, and the underlying assumptions of these methods are not instantiated. Furthermore, the computational burden of repeated retraining to implement cross-validation or bootstrapping is so excessive as to render these approaches ludicrous. There remain the uniform methods related to VC bounds. Putting aside the substantial issues of evaluating the VC capacity for a given network, it was shown by example at the close of Section 7.8.2 that VC calculations lead to unrealistically high estimates of the training sample size $n$ needed for reliable generalization. The revision of VC capacity to fat-shattering capacity, discussed in Section 7.8.4, provides more informative sample size estimates. When large sample sizes are available, it is not difficult to reserve an independent test set and evaluate performance at the parameter estimate $\hat{w}_n$ returned by a training algorithm $A$. We examined this case to learn something about the qualitative behavior of the training process and found that convergence of the training algorithm parameter estimate is only into the set $M_{e_g}$ of minima of $e_g$ and not to a specific minimum. However, this
convergence is strong under the assumptions we have introduced. The importance of the termination criterion on training is brought out, and the Vapnik-Chervonenkis theory of universal approximation bounds is used to establish the proximity of the gradients of empirical and generalization error. It is the gradient of empirical error that is crucial to the behavior of most training algorithms, but it is of interest largely because we expect a small gradient of empirical error to correspond to a small gradient of generalization error. Under our Assumption 7.9.2, a small gradient of generalization error informs us that we are in the vicinity of a minimum of $e_g$. We see that convergence of the training algorithm parameter estimate $\hat{w}_n$ is only into the set $M_{e_g}$ of minima of $e_g$ and not to a specific minimum, even one randomly chosen. However, Theorem 7.9.1 shows that this convergence is strong (almost sure and in mean square) under the assumptions we have introduced. Theorem 7.10.2 establishes that $\|\hat{w}_n - \tilde{w}_{K_n}\|$ shrinks as $O_p(1/\sqrt{n})$, and Assumption 7.10.1 then enables us to derive Theorem 7.10.3 asserting the expected claim of (conditional) asymptotic normality. Whether or not the unconditional distribution is a mixture of normals depends on the existence of a limiting distribution for $K_n$. We do not believe that we have assumed enough to guarantee such a limiting distribution. The importance of the termination criterion of Assumption 7.3.1b on training is that it is required to eliminate a term in the scaled Taylor's series expansion; if this term is not eliminated, then we cannot assert asymptotic conditional normality. These results are then used to rederive a family of learning curves that were announced earlier (e.g., Amari [7], Murata [175], Ripley [192]). Much of the work of Amari and his coworkers has focused on the special case of binary-valued targets (dichotomous classification), and that case under the unrealistic assumption that there is a true function or network $w_0$ in the family of networks that can learn any size training set without error. Haussler et al. [99] treat only the very special case of a finite family of functions or networks. There is a qualitative difference in learning behavior for the cases of finite and infinite function classes. Their attempts to extend their treatment to the infinite case are suggestions and are not carried to a conclusion. Mukherjee and Fine [173] use Markov methods to treat the asymptotic behavior of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$ in the very special case of a single node and one-dimensional input ($d = 1$). They do find that the asymptotic distribution for the parameter sequence deviates from the normal. In contrast, we see that asymptotically the multiplicative constant changes with the random changes in the minima whose neighborhood is approached. Hence, an actual sample function (history) of $\{e_g(\hat{w}_n)\}$ need not follow a curve of the form $c/n$.
7.14 Appendix 1: Proof of Theorem 7.8.2

Theorem 7.8.2 (Discrepancy between $e_g$ and $E_{T_n}$) Let $(\eta - t)^2$ be uniformly bounded by $b$ and define the family of functions $\mathcal{F} = \{I_{(\eta(x,w)-t)^2 > \alpha} : w \in \mathcal{W},\ 0 \le \alpha \le b\}$:
$$P(|E_{T_n}(\hat{w}_n) - e_g(\hat{w}_n)| > \epsilon b) \le 4\,m_{\mathcal{F}}(2n)\,e^{-\frac{n}{2}\left(\epsilon - \frac{1}{2\sqrt{n}}\right)^2} \le 4\left(\frac{2en}{v}\right)^{v} e^{-\frac{n}{2}\left(\epsilon - \frac{1}{2\sqrt{n}}\right)^2}. \qquad (7.14.1)$$
We establish the upper bound presented in Theorem 7.8.2 by developing Vapnik-Chervonenkis theory for a family of $\{0, 1\}$-valued functions. The extension to real-valued bounded functions was provided in Section 7.8. A first step in assessing $P(\sup_{w\in\mathcal{W}} |E_{T_n}(w) - e_g(w)| > \epsilon)$ is to eliminate the dependence on the unknown generalization error $e_g$ through the argument of symmetrization.
Theorem 7.14.1 (Symmetrization Inequalities, Loeve [151]) If $\{Z_i\}$ is a collection of random variables and $\{Z_i'\}$ a second collection, independent of the first collection and distributed identically to it, with $\mu_i$ the median of $Z_i$, then
$$P\left(\max_i |Z_i - \mu_i| \ge \delta\right) \le 2P\left(\max_i |Z_i - Z_i'| \ge \delta\right).$$
Observe that the difference between the mean $m$ and the median $\mu$ of a random variable $X$ can be bounded in terms of the standard deviation $\sigma$ of $X$ through $|m - \mu| \le \sigma$; this result is given an easy derivation by O'Cinneide [181]. For the random variable $E_{T_n}(w)$,
$$\sigma = \sqrt{e_g(w)(1 - e_g(w))/n} \le \frac{1}{2\sqrt{n}}.$$
Hence, there is only a small discrepancy between the median and the mean in our case, and we lose little by replacing the median with the mean in the symmetrization inequality. Introduce a hypothetical second training set $T_n'$ independent of $T_n$ but distributed identically to it. The errors on $T_n'$ are measured by $E_{T_n'}$. Theorem 7.14.1 yields
$$P\left(\sup_{w\in\mathcal{W}} |E_{T_n}(w) - e_g(w)| \ge \epsilon\right) \le 2P\left(\sup_{\mathcal{W}} |E_{T_n}(w) - E_{T_n'}(w)| \ge \epsilon - \frac{1}{2\sqrt{n}}\right). \qquad (7.14.2)$$
Thus we have replaced a bound involving unknown expectations $e_g(w)$, which can vary with each choice of net $w$, by random variables, all of which depend on a double sample that is now of size $2n$.
We now follow Pollard [188] and introduce the auxiliary random variables $\{\sigma_i, i = 1, \ldots, n\}$ that are i.i.d. with $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$ and that are independent of $T_n, T_n'$. Introduce the notation
$$Y_i(w) = |\eta(x_i, w) - t_i|,\ (x_i, t_i) \in T_n, \qquad Y_i'(w) = |\eta(x_i', w) - t_i'|,\ (x_i', t_i') \in T_n'.$$
Observe that $\{Y_i - Y_i'\}$ is distributed exactly as $\{\sigma_i(Y_i - Y_i')\}$. In effect, if $\sigma_i = 1$ then there is no change due to it, but if $\sigma_i = -1$ then it amounts to interchanging $(x_i, t_i)$ and $(x_i', t_i')$, or $Y_i$ and $Y_i'$. However, because these two terms are identically distributed and independent of all other terms, there is no change in the distributions of the quantities concerned. (Vapnik actually considers all $(2n)!$ permutations of the elements of the two data sets, rather than just those permutations that arise from interchanges between members of the similarly enumerated pairs. While this approach has something to recommend it, the subsequent calculations are simpler with Pollard's approach.) Note that
$$E_{T_n}(w) - E_{T_n'}(w) = \frac{1}{n}\sum_{i=1}^{n}(Y_i(w) - Y_i'(w))$$
is then distributed identically to
$$\frac{1}{n}\sum_{i=1}^{n}\sigma_i(Y_i(w) - Y_i'(w)).$$
The key to further developments is to condition on the pair of training sets and then to average over the possible training sets,
$$P\left(\sup_{\mathcal{W}} |E_{T_n}(w) - E_{T_n'}(w)| \ge \epsilon\right) = E\,P\left(\sup_{\mathcal{W}}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i(Y_i(w) - Y_i'(w))\right| \ge \epsilon \,\Big|\, T_n, T_n'\right).$$
The introduction of the auxiliary random variables $\{\sigma_i\}$ renders the conditional probability nondegenerate. Given the two training sets, there are only finitely many different values achievable by $E_{T_n}(w) - E_{T_n'}(w)$ as $w$ ranges over the (infinite) set $\mathcal{W}$; had we not used symmetrization to replace $e_g(w)$, then this would not have been the case. Indeed, there are only $3^n$ possible vectors $\{Y_i - Y_i'\}$ of length $n$ with components taking on only the three values $0, \pm 1$. Thus the maximum is really only over a finite subfamily $\mathcal{W}_{T_n,T_n'}$ of those nets that can classify $T_n, T_n'$ differently enough to make a change in $\{Y_i - Y_i'\}$. Hence,
$$P\left(\sup_{\mathcal{W}}\left|\sum_{i=1}^{n}\sigma_i(Y_i - Y_i')\right| \ge n\delta \,\Big|\, T_n, T_n'\right) = P\left(\sup_{\mathcal{W}_{T_n,T_n'}}\left|\sum_{i=1}^{n}\sigma_i(Y_i - Y_i')\right| \ge n\delta \,\Big|\, T_n, T_n'\right)$$
$$\le \|\mathcal{W}_{T_n,T_n'}\|\sup_{w\in\mathcal{W}_{T_n,T_n'}} P\left(\left|\sum_{i=1}^{n}\sigma_i(Y_i(w) - Y_i'(w))\right| \ge n\delta \,\Big|\, T_n, T_n'\right). \qquad (7.14.3)$$
Given the training sets $T_n, T_n'$, the random variables $\{Y_i - Y_i'\}$ are determined and take values in $\{0, \pm 1\}$. Let $0 \le n' \le n$ be the number of nonzero differences and let $\{Z_j\}$ be an enumeration of the $n'$ nonzero terms of the form $\sigma_{i_j}(Y_{i_j} - Y_{i_j}')$. The random variables $\{Z_j\}$, given the training sets, are i.i.d. with $P(Z_j = 1) = P(Z_j = -1) = 1/2$. Then
$$P\left(\left|\sum_{i=1}^{n}\sigma_i(Y_i(w) - Y_i'(w))\right| \ge n\delta \,\Big|\, T_n, T_n'\right) = P\left(\left|\sum_{j=1}^{n'} Z_j\right| \ge n\delta \,\Big|\, T_n, T_n'\right).$$
We can now use the Hoeffding bound of Theorem 7.6.2 to conclude that
$$P\left(\left|\sum_{j=1}^{n'} Z_j\right| \ge n\delta \,\Big|\, T_n, T_n'\right) \le 2e^{-\frac{n^2\delta^2}{2n'}} \le 2e^{-\frac{n\delta^2}{2}}.$$
The upper bound does not depend on the conditioning variables. To proceed further we need to upper bound the effective size $\|\mathcal{W}_{T_n,T_n'}\|$ of the family of nets. As earlier, we assume that our nets can only dichotomize; i.e., $\eta(x) \in \{0, 1\}$. Let $S = \{x : (x, t) \in T_n \cup T_n'\}$ be the set of $2n$ inputs. Recall the introduction of the growth function $m_{\mathcal{F}}$ given in Definition 3.5.1 and repeated in Definition 7.8.3. Clearly $\|\mathcal{W}_{T_n,T_n'}\| \le m_{\mathcal{N}}(2n)$. Because this upper bound does not depend on $T_n, T_n'$, we can now take expectations trivially in Eq. 7.14.2 and combine the result with Theorem 7.14.1, which introduces another multiplicative factor of 2, to obtain the desired conclusion that
$$P\left(\sup_{w\in\mathcal{W}} |E_{T_n}(w) - e_g(w)| > \epsilon\right) \le 4\,m_{\mathcal{N}}(2n)\sup_{\mathcal{W}} P\left(|E_{T_n}(w) - E_{T_n'}(w)| \ge \delta = \epsilon - \frac{1}{2\sqrt{n}}\right) \le 4\,m_{\mathcal{N}}(2n)\,e^{-\frac{n\delta^2}{2}} \approx 4\,m_{\mathcal{N}}(2n)\,e^{-\frac{n\epsilon^2}{2}}. \qquad (7.14.4)$$
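To get a feel for the sample sizes this bound demands (the point made at the close of Section 7.8.2), one can evaluate Eq. 7.14.4 numerically. The sketch below (ours, not from the text; the capacity v, accuracy, and confidence target are arbitrary illustrative choices) uses the polynomial growth-function bound $m_{\mathcal{N}}(2n) \le (2en/v)^v$ and searches for the smallest $n$ making the right-hand side fall below a target probability.

% Evaluate the VC-style bound 4*(2en/v)^v * exp(-n*eps^2/2) of Eq. 7.14.4
% and find the smallest n (to the nearest 1000) at which it drops below target.
v = 50; epsl = 0.1; target = 0.05;                        % illustrative choices
bound = @(n) log(4) + v*log(2*exp(1)*n/v) - n*epsl^2/2;   % log of the bound
n = v;
while bound(n) > log(target)
  n = n + 1000;
end
fprintf('bound falls below %.2f at roughly n = %d\n', target, n);

For these illustrative choices the required sample size comes out at roughly $10^5$, which is the sense in which such VC estimates are unrealistically conservative.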
Appendix A A Note on Use as a Text
A.1 Overview Perhaps because of the changing interests and abilities of the students I have taught at Cornell, the emphasis of my teaching of neural networks has shifted over the decade from mathematical theory, as exemplified by Chapters 4, 6, and 7, to computing and construction of networks as exemplified by Chapter 5 and parts of Chapters 2 and 3. The major part of the work I require from students is the writing of programs and the application of the programs they have written. I feel that it is important that students write their own versions of programs such as the ones provided in Chapter 5, even if they subsequently adopt more efficient programs found elsewhere (e.g., in the Neural Networks Toolbox of MATLAB or the listings given in the chapter appendices). The challenge is to have the students think through and write their own programs rather than copy the available ones—to explain in their own words a given text and thereby gain and demonstrate an understanding of the text, or, in our case, the algorithm. My lectures include more mathematical issues, especially those required to answer Question 1 on the capabilities of a network architecture to implement classes of functions and Question 2 on the complexity of such implementations. The programming and computing work done by the students helps them learn in detail our responses to Question 3 on selecting particular networks. Although Question 4 on learning and generalization ability is treated in some detail in Chapter 7, this material requires a statistical background beyond the prerequisites of the course I have taught.
I provide only a few rules, derived from Chapters 6 and 7, to guide architecture selection so as to achieve good generalization. Applications of neural networks are dealt with through remarks interspersed throughout the lectures given each week and then in a final project that requires them to succeed with a data set drawn from a real application. Grading is based on the students’ homeworks and final projects; I have found it difficult to incorporate significant material into examinations given in class.
A.2 Exercises for the Chapters

Vectors are column vectors, unless otherwise specified. In many of the problems provided here, programs that have been created are to be exercised on a training set T. In a course setting the instructor should generate and provide these training sets, with the realistic advantage that the structure of the set is unknown to the student. However, in a self-study setting you will have to generate your own training sets. A suggestion is that, having chosen d, n, say d = 5, n = 200, you generate the feature vector part of the training set through S=randn(d,n) and occasionally through S=rand(d,n). It remains to construct the final row of T containing the targets t. For the problems related to Chapters 2 and 3 the targets are binary-valued and can be generated either randomly through tb = sign(rand(1,n) - 0.5) or in a linearly separable manner by first choosing a weight vector w and threshold tau and assigning tls = sign(w*S - tau). In both cases the training sets are Tb(d,n) = [S; tb], Tls(d,n) = [S; tls]. For the programs of Chapter 5 we need real-valued targets, and there are innumerable choices; for example, the quadratic tq = S(1,:) + S(d,:).^2, Tq(d,n) = [S; tq]. These suggestions are summarized in Section A.3 for ease of reference, and a brief consolidated sketch follows the Chapter 1 exercises below. In writing training programs, it is good to keep track of the use of resources, especially running time (e.g., use the MATLAB command cputime) and floating-point operations (e.g., use the MATLAB command flops). Memory limitations can also be a significant factor in the use of Levenberg-Marquardt algorithms and to a lesser extent in quasi-Newton algorithms.

Exercises—Chapter 1

The material in this introductory chapter does not lend itself to traditional exercises like the ones presented later. However, the student can be asked to search out and abstract a current article in at least one of the rapidly changing areas of applications (Section 1.2.3) or of truly parallel hardware implementations of neural networks (Section 1.2.2). Another possibility is to select, read, and abstract a fuller history (Section 1.3) of artificial neural networks.
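For self-study convenience, the following consolidated sketch (ours; the listings in Section A.3 remain the reference) generates the three kinds of training sets described at the start of this section.

% Consolidated training-set generation for self-study use.
d = 5; n = 200;
S   = randn(d, n);                      % feature vectors as columns
tb  = sign(rand(1, n) - 0.5);           % random binary targets
w = randn(1, d); tau = 0;               % hyperplane defining separable targets
tls = sign(w*S - tau);                  % linearly separable binary targets
tq  = S(1, :) + S(d, :).^2;             % real-valued quadratic targets
Tb  = [S; tb];  Tls = [S; tls];  Tq = [S; tq];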
Exercises—Chapter 2 The purpose of Problems 1–4 is to initiate contact with MATLAB and the computer operating system and then to provide exposure to the concepts of linear separability and the sorts of bounds on sums of binomial coefficients that are needed later. It is assumed that the students have read Chapters 1 and 2 through Section 2.3. In a class setting, several data sets are provided to them to exercise use of plotting and to supply examples of linearly separable and nonseparable sets. An effort is made to reduce printouts and paper submission through electronic submission. The students are provided with hints on good MATLAB programming practices. They are asked to submit all results in ascii to make it possible to move the results between different computing platforms. Problems 5–9 treat Sections 2.3 through 2.6 on enumerating the number of sets learnable by a perceptron, interpreting this enumeration in terms of coin-tossing, programming the perceptron training algorithm, and exercising this algorithm on both linearly and nonlinearly separable data sets. Problems 10 and 11 introduce extensions to the PTA, as discussed in Sections 2.7 and 2.8, that allow the PTA to generate networks capable of learning nonlinearly separable training sets through augmenting the input variable set and that improve its behavior when it cannot learn a nonlinearly separable training set. To encourage the students to organize the material they have been exposed to, they are asked in Problem 12 to write a short essay response reviewing how we have responded in Chapter 2 to the four organizing questions of Section 1.5.2. 1.(a) Create T ls(2, 10). Final row entries of this matrix represent the classes assigned to the corresponding columns. Plot the columns less the final row and use the final row to assign ’x’ to a 1 and ’o’ to a −1. (It will help to use the f ind command.) (b) On the same plot, show the hyperplane determined by parameters ww = (w, τ ) provided by T ls, and print the result. (c) Does this hyperplane linearly separate the data sets? 2. Determine, by visual inspection of plots, which of the four data sets provided by the instructor are linearly separable. (For self-study, generate several data sets that are linearly separable of the form T ls(2, 20) and several of the form T b(2, 20).) 3.(a) Provide an example of n = 5 vectors in R3 that are in general position and divided into two classes such that you cannot separate the two classes by a single hyperplane (node), even allowing for nonzero threshold.
(b) If n = 4, can you always achieve the desired separation by a single node?

4. Appendix 1 of Chapter 2 provides three upper bounds to partial sums of binomial coefficients. For $d \le n/2$ and $n > 1$ write three MATLAB programs, pd1.m, pd2.m and pd3.m, to determine the accuracy of these upper bounds through calculation of percentage discrepancies
$$pd(n, d) = 100\,\frac{\mathrm{bound}(n, d) - \mathrm{actual}(n, d)}{\mathrm{actual}(n, d)}$$
between the actual partial sums, given below on the left-hand sides, and their upper bounds. Your program, when supplied with a value of n and run, should generate a plot of the percentage discrepancy for positive integer values of $d \le n/2$. (There is no factorial function in MATLAB but prod and cumprod will do.)
$$\sum_{k=0}^{d}\binom{n}{k} \le (d+1)\binom{n}{d}; \qquad (A.1)$$
$$\sum_{k=0}^{d}\binom{n}{k} \le \binom{n}{d}\left[1 + \frac{d}{n+1-2d}\right]; \qquad (A.2)$$
$$\sum_{k=0}^{d}\binom{n}{k} < 1.5\,\frac{n^d}{d!} \quad \text{for } n > d. \qquad (A.3)$$
5. Directly evaluate D(n, d) for d = 2 and n = 1 : 4 by sketching possible hyperplanes and counting the numbers of linear dichotomies that result. 6. If the inputs to an LTU or perceptron are themselves ternary-valued, {−1, 0, 1}d (e.g., there are now 3d possible inputs), then upper bound the total number T (d) of such functions that are implementable by the node and compare T with the total number of such functions. (Follow our treatment of Boolean functions.) 7.(a) If we generate random vectors xi ∈ IR2 with i.i.d. components that are uniformly distributed on [0,1] (U(0, 1)) and assign them categories ti that are also i.i.d. with P (ti = 1) = P (ti = −1) = 1/2, then empirically estimate the distribution of the largest number N ∗ of such vectors that you can generate that remain linearly separable; do so by successively adding vectors of dimension d until linear separability fails. Determine linear separability by inspection of a plot of the kind done in Problem 2.2. Estimate the distribution by repeating this experiment five times and counting the number of different values of N ∗ that you find. Save the results in a file nstarunif . (In a class setting, you can pool the results from the students to get a better empirical estimate of the calculated distribution.)
(b) Now generate the components of xi so that they are normally distributed N (0, 1) with mean 0 and variance of 1 and rework part (a), saving the results in a file nstarnorm. 8.(a) Write a MATLAB version of the Perceptron Training Algorithm that is a function pta(T, runlim) of a training set T and a maximum number of iterations runlim. Your program pta.m should call a weighting function φ specified as a separate MATLAB function phi.m. Avail yourself of MATLAB’s ability to execute much more rapidly in vectorized form by modifying the PTA to operate so that you batch process a training set rather than presenting it sequentially. Be sure to include cputime and flop count in your program. In the event that the data set is not linearly separable, your program should save and output the best solution encountered in its search as well as the number of errors made by it. (b) You will find, or need to create (see Section A.3), two training sets T ls(5, 200) and T b(5, 200). Run your PTA algorithm with φ(n) = 1 on each of the two training sets in turn, and save the weights and threshold parameters in a file param1. (c) Repeat with φ chosen as the reciprocal of the length of the vector currently being added, and save the resulting parameters in a file param2. 9.(a) Generate a training set T b(15, 5) with S = rand(15, 5). Determine the smallest number of components d∗ of x, taken in the order of enumeration, such that this set is linearly separable when you restrict attention to the subspace of dimension d∗ . Use your PTA to determine linear separability. Repeat this experiment five times and save the resulting d∗ as dstarunif . (In a class setting you can then pool the results to form the empirical distribution of d∗ .) (b) Rework part (a) using S = randn(15, 5). Save your results as dstarnorm. 10. (a) Run your PTA program pta.m for 1000 cycles on the data T = [S; t], S = randn(2, 100), t = x21 + 3 ∗ x22 − 4. Determine and save the number of errors your solution makes. (b) Augment the set of input variables to include all terms of the form x2i and xi xj , and rerun your PTA program on T . Determine the number of errors your solution makes. 11. (a) Revise your PTA algorithm pta.m to amaldi.m to incorporate Amaldi’s method in which we actually make the change specified in the ADD step only with a probability that depends upon a “temperature” schedule Tj . The Tj should be a separate program (like φj ) that is called by your program and named temp.m. To be specific, let Tj = 1/ log (j + 1).
Scale the probability so that for j = 1 the probability of making the change is 1. (b) Run this program for 1000 cycles on the data set T b from Problem 8(b) and determine the number of errors made by the best result. 12. Our study of neural networks is oriented around the four questions of Section 1.5.2. In about two pages, describe our responses to these questions in the case of the perceptron. Cite results where they were developed. What are appropriate applications for the perceptron? Exercises—Chapter 3 The first three problems cover the sandwich construction of Section 3.4 that enables an LTU-based network to learn any dichotomization of a finite number of input vectors. Problems 4 and 5 are on the ability of 2HL networks to recognize polyhedra and approximate to real-valued functions as discussed in Sections 3.6 and 3.7. Problems 6 and 7 provide contact with Vapnik-Chervonenkis calculations of implementation ability, introduced in Section 3.5, and that are also the basis of an important approach to generalization (see Section 7.8). The final problem asks for a short essay reviewing how we have responded in Chapter 3 to the four organizing questions of Section 1.5.2. 1. Design a feedforward net of linear threshold nodes to implement the Boolean function XOR(x1 , x2 ), given by XOR(0, 0) = XOR(1, 1) = 0,
XOR(0, 1) = XOR(1, 0) = 1.
2. Consider the following training set T consisting of 121 points. The set S of input vectors lies in IR2 with the components x = (x1 , x2 ) of a vector constrained to take on only integer values for |xi | ≤ 5. The vector is of category +1 if the product x1 x2 is even and of category −1 if otherwise. Using the sandwich construction, completely describe a network of LTU nodes that learns T without error, and prepare it in the form of a matrix W that specifies the individual node hyperplanes as column vectors of dimension d + 1 with the threshold in the d + 1st position. 3. (a)Write a MATLAB program sandwich.m to carry out the sandwich construction. Pay attention to the choice of sandwich width . Your program should accept an input matrix T that is (d + 1) × n with the category t ∈ {−1, 1} as the d+1st row of T. The output of your program should be a matrix W that specifies the individual node hyperplanes as column vectors of dimension d + 1 with the threshold in the d + 1st position. Assume that in T = [S; t] the columns of S are in general position.
(b) Run this program on T b(5, 200) from Problem 2.8b and save your results as W. 4. We wish to design a feedforward net composed of linear threshold nodes that can implement the decision or indicator function δ taking on the value +1 on the unit square {(x, y) : |x| ≤ 1, |y| ≤ 1} and on the triangular region having bounding vertices {(−1, 2), (1, 2), (0, 3)} and the value −1 elsewhere. Completely describe a net that can achieve this. 5.(a) Submit a MATLAB program f ncn1.m that constructs a piecewise constant function fncn1(x, f , c, C), with argument x ∈ IRd , that takes the value fi on a convex polyhedral region Ci , for i = 1 : m, and the value fm+1 on the complement Cm+1 of the union of the m convex regions c Cm+1 = {∪m 1 Ci } ,
fncn1(x, f , c, C) = fi for x ∈ Ci .
The m + 1 function values are specified in a vector f = (fi ). The collection of m convex polyhedral regions are specified in a matrix C and a vector c as follows. The component c1 ∈ c specifies the first c1 columns of C, and they define the convex polyhedron C1 ; each column specifies a hyperplane face through its first d elements specifying a weight vector w and its d + 1st element specifying a threshold τ . Hence, the first c1 columns of C specify the c1 faces of the convex polyhedron C1 in IRd . The second convex polyhedron C2 is specified by the next c2 columns of C, etc. This function should be implemented in terms of an LTU-based network with the exception that the final output node is a linear summation node capable of generating the desired function values. (b) Specify f , c, C for the regions described in Problem 3.4, with f taking the value π on the square, 2 on the triangle, and −3 otherwise. Verify that your program f ncn1.m works by evaluating it at (0, 0), (−.5, 2.2), (2, 1.5). 6. Using the MATLAB commands semilogy, hold, and n = 50, plot on the same graph the four functions of VC capacity v = 1 : n that are the growth function mN ; the Vapnik upper bound 1.5nv /v!; the approximation (ne/v)v ; and the lower bound of 2v . 7. Estimate the VC capacity v for a 1HL LTU network with d = 5, s1 = 2 by considering the implications of both Eqs 3.5.1 and 3.5.2 and choosing the smaller of the two estimates of capacity. 8. In about two printed pages summarize our responses in Chapter 3 to each of the four organizing questions for networks comprised solely of LTU nodes. Cite results where they were developed. What are appropriate applications for the these LTU networks?
Exercises—Chapter 4

The first problem provides initial contact with function approximation, as discussed in Section 4.2.5, and with the notion that different measures of approximation can lead to different approximants. Problems 2–5 have you write programs to calculate efficiently the responses of 1HL and 2HL networks and to do so with the networks either specified through a single column vector or more explicitly through matrices of weights connecting the successive layers and vectors of biases. Problems 6–10 are aimed at developing some experience with the capabilities of such networks by having the student do graphically based designs of simple networks to approximate to functions given either analytically or graphically. Some experience is also gained with the measures of function approximation given by the L2 norm and sup-norm. Problem 11 establishes contact with the Stone-Weierstrass approach of Section 4.6. Problem 12 concerns the condition of discriminability of node functions addressed in Section 4.7. Problems 13 and 14 are based on the approximation material of Section 4.11. Problem 15 is the wrap-up essay.

1. Following the ideas of Section 4.2.5, approximate to $\sin(x)$ over a period $[0, 2\pi]$ by a sum $\hat{f}_6(x)$ of six equal width pulse functions.
(a) Select the pulse heights to approximately minimize the maximum discrepancy
$$\sup_{x\in[0,2\pi]} |\sin(x) - \hat{f}_6(x)|$$
between the true function $\sin(x)$ and its approximation by $\hat{f}_6$.
(b) Select the pulse heights to approximately minimize the integral-squared discrepancy
$$\int_0^{2\pi} [\sin(x) - \hat{f}_6(x)]^2\,dx$$
between the true function sin(x) and its approximation by fˆ6 . (c) Are the answers to (a) and (b) different? 2. Write a MATLAB function netout1.m, [a1, d1, a2] = netout1(node, w1, b1, w2, b2, S), that calculates the response of a 1HL network with hidden layer node nonlinearity specified by the string node and a linear output node. The string variable node can be either logistic or tanh. You will have to write a program logistic.m to calculate the logistic node function, but you can just call MATLAB’s tanh.
w1, the matrix of first layer weights, is s1 × d, and has i, j entry w1:i,j . b1 specifies the s1 first layer node biases (negative of thresholds) as a column vector. w2 is 1 × s1 and has ith element w2:1,i . b2 is the scalar output bias. The matrix S is d × n and has as mth column the d-dimensional input vector xm . a1 is the s1 × n response matrix from the n input vectors (columns of S) to each of the first layer nodes having i, jth element a1:i,j . d1 is the s1 × n response matrix for the first layer node functions replaced by their derivatives; i.e., replace σ by σ . d1 will be needed later when we evaluate gradients. a2 is the 1 × n network output vector (because this is a 1HL with a scalar output). 3. Write a MATLAB program ntout1.m for a function [a1, d1, a2] = ntout1(node, w, S), where node, S have the same meanings as in Problem 4.2 and w specifies the connections by stacking w1, b1, w2, b2 together in a single-column vector as follows w = [reshape(w1, s1 ∗ d, 1); b1; w2 ; b2]. Use the MATLAB command reshape to define your function ntout1.m in terms of netout1.m. 4. Repeat Problem 4.2 for the response of a 2HL network having a linear output node [a1, d1, a2, d2, a3] = netout2(node, w1, b1, w2, b2, w3, b3, S). a3 is the 1 × n network output vector. a2 is the s2 × n matrix of responses from the second hidden layer nodes to each of the n input vectors specified by the columns of S. d2 is the s2 × n responses of the second hidden layer nodes when their node functions (e.g., σ) are replaced by their derivatives (e.g., σ ). w2 is now s2 × s1. b2 is s2 × 1. w3 is 1 × s2, and b3 is a scalar. 5. Repeat Problem 4.3 for the response of a 2HL network specified in the form [a1, d1, a2, d2, a3] = ntout2(node, s1, w, S).
The column vector w is the result of concatenating the reshaped versions of w1, b1, w2, b2, w3, and b3, taken in that order.

6. Estimate the ability of a 1HL network with a single logistic node to approximate to the $N(0, 1)$ cdf $G(x)$ for arguments x = -3:.01:3. Revise MATLAB's erf.m to obtain the correct cdf
$$G(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right).$$
Use ntout1.m and specify w and S (uniformly sample the range of the input variable x to generate S) to generate the approximations $\eta_1$. Some thought and a little trial and error should enable you to choose w to obtain a good (not necessarily optimal) approximation $\eta_1(x, w)$ to $G(x)$. Plot $G(x)$ and $\eta_1$ on the same graph. Estimate the integral squared error of approximation ($L_2$-norm), $\{\int_{-3}^{3}[G(x) - \eta_1(x, w)]^2\,dx\}^{1/2}$, by simple numerical integration.

7. Estimate the ability of a 1HL network with $s_1 = 2$ and logistic nodes to approximate to the $N(0, 1)$ density
$$g(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}},$$
over the range x = −3 : .01 : 3, by choosing w to achieve a good approximation η1 (x, w) to g(x). Plot g(x) and η1 on the same graph. Estimate the sup-norm error of approximation, sup|x|≤3 |g(x) − η1 (x, w)|, by simple numerical means. 8. Construct four single hidden layer neural networks using logistic nodes in the hidden layer and a linear output node, so that the responses of your networks are qualitatively the same as the four responses shown in Figure 4.1. Specify a network by encoding the network parameters into a single-column vector as described in Section 5.3.2 or Problem 4.3. Print plots of the four functions on the same graph using subplot. 9. Let d = 1, I = [−1, 1]; we are interested in approximating cos( π2 x). Assume an LTU node function of the form σ(z) = U (z) and a single hidden layer network composed of two such nodes (s1 = 2) with the output bias b2:1 = 0. (a) Find the best choices for the network parameters (you can use the apparent symmetry about the origin to simplify your choices to a1:1 = −a1:2 , b1:1 = −b1:2 ) to make an L2 approximation. You may use MATLAB and graphical means to help you in this calculus problem.
Prepare a single plot of the approximation and of the true function being approximated. (b) For the network you have designed, evaluate the sup-norm approximation error as well as the L2 approximation error. 10. Solve the preceding problem for the network with two nodes having the smallest sup-norm approximation error. Is it the same network? 11.(a) Using the Stone-Weierstrass Theorem, show that a single hidden layer feedforward net with complex-valued node nonlinearity σ(z) = eiz can approximate any continuous function f (x), x ∈ I d . (b) Can we restrict the weights between the inputs and the hidden layer to have components that are multiples of some value? 12. (a) Using results of Cybenko, verify that the logistic and tanh node functions are both sigmoidal and discriminatory. (b) Verify that the following function is sigmoidal and discriminatory σ(z) = (1 − e−z )U (z). (b) Show by giving the parameters that a network using two of these nodes can generate the pulse-like function a[cosh(w) − cosh(wx)] on [−1, 1]. 13. We have that d = 2,
$$\|f\|^2 = \int_0^1\!\!\int_0^1 f^2(x_1, x_2)\,dx_1\,dx_2,$$
the available node functions can implement $\gamma\sqrt{2n_1+1}\,\sqrt{2n_2+1}\,x_1^{n_1}x_2^{n_2}$ for $|\gamma| \le 10$, $n_1, n_2$ non-negative integers, and we are interested in approximating $h(x_1, x_2) = e^{-x_1-x_2}$.
(a) Verify that $h \in C(G)$ through a power series expansion for $h$.
(b) By using the bound of Theorem 4.11.1, estimate how many nodes $n$ will suffice to guarantee $\|h - f_n\| \le .1$.
(c) Find $f_2$ and sketch the corresponding neural network. (Note that $\int_0^1 x^n e^{-x}\,dx = n! - e^{-1}\sum_{k=0}^{n}\frac{n!}{(n-k)!}$.) Can you conclude that in this case you can achieve the approximation error of (b) using far fewer nodes than are estimated in (b)?

14. Let $d = 1$, $I = [-1, 1]$ and consider the pulse function
$$p_\beta(z) = [U(z) - U(z - \beta)], \qquad G = \{g : g(z) = \gamma\,p_\beta(z - \tau),\ |\gamma| \le 2,\ 0 < \beta \le .2\}.$$
Verify that for the quadratic function $f(z) = z^2$, using the notation of Section 4.11.3, $d_{\bar{a}}(f) > 0$ for any finite $\bar{a}$; hence, $f$ is not in the convex closure of $G$. Nonetheless, estimate the number of nodes that will suffice for a single hidden layer net to generate an approximation $f_n$ to $f$ satisfying
$$\int_{-1}^{1} |f(x) - f_n(x)|^2\,dx \le .04.$$
15. Summarize how Chapter 4 responds to Questions 1–3 of Section 1.5.2.

Exercises—Chapter 5

The first three problems concern calculation of the sum-squared error between training set targets and network outputs and its use in a display of the error surface discussed in Section 5.1. Problems 4–7 require programs to efficiently evaluate gradients based on the backpropagation formulation presented in Section 5.2. Gradients are at the core of all the training methods discussed in Chapter 5. Problems 8–14 create and run programs to implement steepest descent training algorithms, the approach that used to be referred to as backpropagation. Problems 15 and 16 create line search programs that identify minima of the error surface along chosen one-dimensional search directions. These programs are required by the remaining training algorithms. Problems 17–20 create and exercise the quasi-Newton training algorithms of Section 5.6. Finally, Problems 21–23 create and exercise the Levenberg-Marquardt training algorithms of Section 5.7.

1. (a) Write a MATLAB function ss1.m, [err] = ss1(node, w, T), to evaluate the sample average sum-squared empirical error
$$\mathrm{err} = E(w) = \frac{1}{n}\sum_{i=1}^{n}(\eta(x_i, w) - t_i)^2$$
when a 1HL network specified by w and having a linear output node is provided with inputs from training set T . This function should call ntout1. (b) Repeat (a) for sse1.m [err] = sse1(node, w1, b1, w2, b2, T ). This function should call netout1.
2. Consider a 1HL network η1 with d = 5, s1 = 3 and randomly select the 22 network parameters by encoding them in a column vector. Create a training set T q(5, 200) (see Section A.3) and use ss1.m of Problem 5.1 to plot the error function ET q (w) as a function of the two parameters w1:2,1 , b1:2 , as they each vary over [−4, 4], while the other 20 parameters are held constant. Use both contour(w1:2,1 , b1:2 , ET q , 15) and mesh plot commands. 3. Repeat Problem 5.2, only instead of varying the two parameters chosen there, vary the full parameter vector w along two directions p1 , p2 chosen by you. Consult the MATLAB program listing for errplot.m in Section A.3 to see how these vectors are used. (a) Set range = 5, q = 21 in errplot.m, and plot the empirical error surface as you move in the p1 , p2 plane of network parameters. (b) Modify p1 , p2 so that the variations correspond solely to changes in the output weights w2:1,1 , w2:1,2 , and repeat (a) for these new variation directions. 4. (a) Write a MATLAB function grad1.m, [gw1, gb1, gw2, gb2] = grad1(node, w1, b1, w2, b2, T ), that evaluates the gradients of the empirical error ET for the 1HL network parameters, with all function arguments as above and function values the gradients. This function should call netout1. and interpret gw1 as the gradient vector for w1, etc. Implement the backpropagation delta method of gradient evaluation, although this method is not needed for a single hidden layer network. Formulate the deltas by letting delta1, delta2 be two matrices of delta values corresponding to the first and second layers. For m is the first layer delta at node k in example, the entry delta1(k, m) = δ1:k response to excitation xm . (b) Create a training set Tnet1 (tanh, 0, 5, 10, 100), as described in Section A.3, using tanh nodes and randomly selected network parameter values. Test your program grad1.m by using it to evaluate the gradient of the sumsquared error for the same network architecture as that used to construct Tnet1 but with new randomly chosen parameters for the network. (c) Repeat (b), keeping the same Tnet1 , but now evaluating the gradient for a randomly selected 1HL network with s1 = 3. 5. (a) Write a MATLAB function grd1.m, [grd1] = grd1(node, ww, T ), that evaluates the gradient of the empirical error for the 1HL network parameters and calls ntout1; this does not require much more than using reshape properly in grad1.m. (b) Repeat the test of Problem 5.4(b) using grd1. in place of grad1.m.
6. Extend the results of Problem 5.4 to create [gw1, gb1, gw2, gb2, gw3, gb3] = grad2(node, w1, b1, w2, b2, w3, b3, T ) to calculate the gradient of empirical error for a 2HL network. The advantage of the backpropagation formulation is that this extension requires a little care but not much thought. The 2HL gradient is calculated by revising and renaming your previous program (e.g., grad1.m to grad2.m) in straightforward ways. 7. (a) Extend Problem 5.6 to create [grd2] = grd2(node, s1, ww, T ) to calculate the gradient for a 2HL network using the notational conventions provided in Problem 4.5. (b) Repeat Problem 5.4(b), modifying it by first creating a training set Tnet2 (tanh, 0, 5, 2, 10, 100) as described in Section A.3 and then using grd2.m to evaluate the gradient of the sum-squared error when the network has the same architecture as that specified in Tnet2 but with different randomly chosen parameters. 8. Create a fixed learning rate steepest descent training program stpdesc1.m for a 1HL network, [w1best, b1best, w2best, b2best, err, f l, cptime] = stpdesc1(runpar, node, w1, b1, w2, b2, T ). runpar is a vector [alpha, runlim] that establishes the fixed step size or learning rate α and the number of iterations runlim of the training algorithm. The string node is either tanh or logistic. This program will call grad1.m. The initialization of the algorithm is given in the arguments w1, b1, w2, b2 establishing the 1HL network parameters. The algorithm should return the best network found during the iteration (not necessarily the network found at the final iteration), a record err of the successive sumsquared errors made on the training set, the flop count fl, and the cputime cptime required to execute. Have your program return a semilog plot (use semilogy) of err at the conclusion of training. 9. (a) Run your program stpdesc1.m with logistic nodes on a training set T q(10, 200) and use randomly selected initial conditions. Prepare four plots of the sum-squared errors when you have trained with s1 = 5, runlim = 1000, and the four constant learning-rates α = .0001, .0005, .001, .010. Do you observe qualitatively different behaviors in the error plots? (b) Rerun stpdesc1.m with the best learning rate α found in part (a) and with four different randomly selected initial conditions.
Are there significant differences in the performances observed for different initial conditions?

10. Revise stpdesc1.m to create a fixed learning rate steepest descent training program stpdsc1.m for a 1HL with the format
[wwbest, err, fl, cptime] = stpdsc1(runpar, node, wwinit, T).
This program will call grd1.m.

11. Extend the work done in Problem 5.10 to create a fixed learning rate steepest descent training program stpdsc2.m for a 2HL network,
[wwbest, err, fl, cptime] = stpdsc2(runpar, node, s1, wwinit, T).
It will require revising and renaming your previous programs (e.g., grad1.m to grd2.m) in straightforward ways.

12. Repeat Problem 5.9 on a Tq(10, 200) training set using randomly chosen initial conditions for a 2HL with logistic nodes, s_1 = 4, s_2 = 2, runlim = 1000, and the four learning rates α = .0001, .0005, .001, .010.

13. Create a variable learning rate program for a 1HL network,
[wwbest, err, fl, cptime] = SD1(runpar, node, wwinit, T),
that uses the one-step optimal determination of learning rate α_k described in Eqs. 5.3.8 and 5.3.9 and checks to verify that α_k > 0. If it is not positive, then replace the nonpositive value with the small positive value .0001. The runpar = [ε, modify, runlim] has a small ε = .0001, say, a multiplicative correction factor modify for the learning rate modify ∗ α_k, and a limit on the number of iterations runlim.

14. Run SD1 with s_1 = 5 logistic nodes on Tq(10, 200) with randomly selected initial conditions. Using runlim = 1000 and modify of .98 and 1, prepare two plots of sum-squared error for the two values of modify, and compare your results to each other and to those achieved in Problem 5.9.

15. Write a MATLAB program bisect.m (see Section 5.3.3) that defines a function
[α] = bisect(node, steps, dir, ww, T)
that returns a value of learning rate α based on a bisection line search for a minimum of training set error, given the specification of node type, the number steps of bisection steps to be taken in the line search, the search direction dir (d), the current parameter vector ww, and the training set T.

16. Write a MATLAB program to carry out cubic line search for a 1HL (e.g., see Section 5.3.3 or Press et al. [191, pp. 384–385]) in the form
[α] = cubic1(node, dir, ww, T),
with the notation of Problem 5.15.

17. Write a MATLAB program QN1.m that defines a function
[wwbest, err, gdnorm, fl, cptime] = QN1(runpar, node, wwinit, T)
that uses cubic line search to perform quasi-Newton training for a 1HL network on a training set T, using a node function that is either logistic or tanh. The program should return a best estimated parameter vector wwbest and an error history err, a history of the norms of the successive gradients gdnorm, and information on flops used and cputime. The argument runpar = [runlim] defines the limit on the number of iterations.

18. Run QN1.m on Tq(10, 200) using a 1HL with s_1 = 5 logistic nodes, runlim=50. Examine four plots of training history err for four randomly selected initial conditions wwinit. How does the QN1.m performance (running time, flops, best error achieved) compare to the performance achieved by the two other training methods you have used on this same data set?

19. Write a MATLAB program QN2.m that defines a function
[wwbest, err, gdnorm, fl, cptime] = QN2(runpar, node, s1, wwinit, T)
that performs quasi-Newton training for a 2HL network on a training set T, using a node function that is either logistic or tanh.

20. Run QN2.m on Tq(10, 200) using a 2HL with s_1 = 4, s_2 = 2 logistic nodes, steps=10, runlim=50. Study the effects of initial conditions by examining four plots of training history err for four randomly selected initial conditions wwinit.

21. Write a MATLAB program jacobian1.m that defines a function
[J, gww] = jacobian1(node, ww, T)
returning the p × n Jacobian matrix J required by Levenberg-Marquardt,
e_j = η(x_j, ww) − t_j,  J = [J_{i,j}],  J_{i,j} = ∂e_j(ww)/∂w_i,
and the usual gradient gww = g of the summed empirical error E_{T_n}(ww).

22. Write a MATLAB program LM1.m that defines a function
[wwbest, err, gdnorm, fl, cptime] = LM1(runpar, node, wwinit, T)
that performs Levenberg-Marquardt training with cubic line search by returning a best estimated parameter vector wwbest, an error history err, a history of the norms of the successive gradients gdnorm, and information
on flops used and cputime. The argument runpar=[epsilon,runlim] defines the scaling epsilon used in adding a unit matrix to ensure positive definiteness and the limit runlim on the number of iterations.

23. Run LM1.m on Tq(10, 200) from Problem 5.9, using s_1 = 5 logistic nodes, runlim=50, and randomly selected initial conditions. Examine four plots of training history err for the four values of the scaling ε = .01, .1, 1, 10. How does the LM performance (running time, flops, best error achieved) compare to the performance achieved by the steepest descent training methods that you have used on this same data set?

Exercises—Chapter 6

The technique of regularization discussed in Section 6.3 lends itself to applications and is introduced in Problems 1–5. Problems 6–9 are based on the use of a validation set to control overfitting by terminating training, as discussed in Section 6.5.

1. Although it is clear that for a positive definite Hessian H the unique minimum of
E(w) = E(w_0) + (1/2)(w − w_0)^T H(w − w_0)
is at w_0, determine the location w_λ of the minimum as a function of the regularization parameter λ for the regularized or penalized version
E(w) = E(w_0) + (1/2)(w − w_0)^T H(w − w_0) + (1/2)λ||w||^2.

2. (a) Rewrite your MATLAB gradient calculation program grd1.m of Problem 5.5 as
[grd] = grdr1r(node, λ, wwinit, T)
to include regularization in the form of a penalty term (1/2)λ||w||^2 added to the usual training set error E_T(w). (A sketch of this small modification follows Problem 4.)
(b) Redo Problem 6.2(a) for a 1HL by including as a penalty term (1/2)λ Σ_{i=1}^{s_1} w_{2:1,i}^2 (we just penalize large output layer weights), as motivated by Theorem 7.8.4.

3. Rewrite the cubic line search program of Problem 5.16 to include regularization for a 1HL in the form
[α] = cubic1(node, λ, dir, ww, T),
with the addition of the regularization term (1/2)λ||ww||^2.

4. Rewrite the quasi-Newton program of Problem 5.20 as QN1r.m to include regularization by changing runpar to [runlim, λ] and adapting it to include a penalty term (1/2)λ||ww||^2.
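Problems 2 through 4 all amount to the same small change: the penalized criterion is E_T(ww) + (1/2)λ||ww||^2, so its gradient is the unpenalized gradient plus λ ww, and any routine that evaluates the training error simply adds (1/2)λ||ww||^2 to its value. A minimal sketch of the gradient modification of Problem 2(a), assuming grd1.m of Problem 5.5 returns the gradient packed in the same order as ww, is the following.

function [grd]=grdr1r(node,lambda,ww,T)
%gradient of the penalized error E_T(ww) + 0.5*lambda*||ww||^2
%grd1.m (Problem 5.5) supplies the unpenalized gradient
grd=grd1(node,ww,T)+lambda*ww;
%for Problem 2(b) penalize only the output weights: add lambda times
%only those entries of ww that hold w_{2:1,i}; their locations depend
%on the packing convention adopted for ww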
5. Run QN1r.m of Problem 6.4 on Tq(10, 200) using a 1HL with s_1 = 5 logistic nodes, runlim=50, and a randomly selected initial condition wwinit. Study the effects of the regularization parameter by examining four plots of training history err for the four values of regularization parameter λ: .1, 1, 10, 100. How does the performance of QN1r.m on Tq, for the best value of λ, compare with that achieved by QN1.m in Problem 5.18 and by SD1.m in Problem 5.9?

6. (a) Adapt QN1r.m to
[wwbest, err, wwbestv, errv, gdnorm, fl, cptime] = QN1v(runpar, node, ww, T),
so that you can access the sequence of parameter estimates for evaluation of validation errors and report the validation errors errv as well as the training errors err. In addition, determine the parameter values wwbestv at the minimum of the validation error. Add a third component splitpt to runpar that indicates how much of T is used for training, with the remainder used for validation. Your program should output parallel semilog subplots of the training error and validation error histories.
(b) Generate a data set T = [S; t] as follows: S = randn(10, 2000), t = S(1,:) + 0.5*S(10,:).^2. Run QN1v.m so that it uses T(:, 1:500) to train a 1HL and the data T(:, 501:1000) to calculate the validation errors Ev (see Section 7.5) for the parameter estimates. Select s_1 = 5 logistic nodes, runlim=200, and choose λ = 0.
(c) Calculate the statistically independent sum-squared test error E_t, using the remaining data T(:, 1001:2000), for both the parameter values wwbest returned by QN1v.m and for the parameter values wwbestv at which the validation error was a minimum. Which of these two choices yields the lowest test error E_t?

7. Generate a data set T = T_{net1}(logistic, 0.5, 3, 5, 1500) and use the first 250 samples for training, the next 250 for validation, and the final 1000 for testing, as in Problem 6.6.
(a) Train on this data set using QN1v.m and the correct s_1 = 3.
(b) Repeat part (a) using the excessive s_1 = 9.
(c) What do you observe about the value of validation?

8. (a) Adapt LM1.m to
[wwbest, err, wwbestv, errv, gdnorm, fl, cptime] =
LM1v(runpar, node, ww, T),
so that you can access the sequence of parameter estimates for evaluation of validation errors and report the validation errors errv and the training errors err. In addition, determine the parameter values wwbestv at the minimum of the validation error. Add a fourth component splitpt to runpar that indicates what percent of T is used for training, with the remainder used for validation. Your program should output parallel semilog subplots of the training error and validation error histories.
(b) Generate a data set T with n = 2000 as in Problem 6.6. Use LM on T(:, 1:500) to train a single hidden layer network with s_1 = 5 logistic nodes. Select runlim=200 and choose λ = 0. Simultaneously use the data T(:, 501:1000) to calculate the validation errors Ev for the parameter estimates generated earlier.
(c) Calculate the sum-squared test error E_t, using the data T(:, 1001:2000), for both the parameter values wwbest returned by LM1v.m and for the parameter values wwbestv at which the validation error was a minimum.

9. Generate a data set T = T_{net1}(logistic, 0.5, 3, 5, 1500), using the code for Tnet1.m provided in Appendix A.3. Use the first 250 samples for training, the next 250 for validation, and the final 1000 for testing, as in Problem 6.6.
(a) Train on this data set using LM1v.m, runpar = [.1, 0, 150, 0.5], and the correct s_1 = 3.
(b) Repeat part (a) using the excessive s_1 = 9.
(c) What can you observe about the value of validation?

Exercises—Chapter 7

The problem given is intended to better fix an understanding of the various forms of error terms as discussed in Section 7.4.

1. Consider a stochastic model in which we observe x ∈ IR^5 having i.i.d. components that are individually uniformly distributed on [−1, 1]. The target variable is specified by t = 1 − x_1^2 + σN, where the additive noise N is distributed normally with mean 0 and variance 1 and is independent of x. The parameter σ is used to scale the variance of the additive noise to σ^2. Hence, the training set T can be generated for simulation by
S=2*rand(5,n)-1, t=1-S(1,:).^2+sigma*randn(1,n), T=[S;t].
We assume that our network architecture is N_{1,σ} with s_1 = 2 and σ denoting logistic nodes. Hence, the parameter vector w has length 15.
(a) For n = 100, σ = 0.5, generate T_100.
(b) Surf plot the empirical (training) error E_{T_100}(w) as a function of the two output weights w_{2:1,1}, w_{2:1,2}, setting b_{2:1} = 0, w_{1:i,j} = 0 for j > 1, and randomly choosing the four remaining parameters b_{1:i}, w_{1:i,1} for i = 1, 2.
(c) Show that the Bayes estimation rule for quadratic loss is E(t|x) = 1 − x_1^2, and that the resulting Bayes risk is e_B = σ^2.
(d) Use T_100, and choose the Levenberg-Marquardt training algorithm with scaling of .1 and 50 iterations to determine the network specified by the resulting ŵ.
(e) Determine ê_g = e_g(ŵ) = E(η_1(x, ŵ) − t)^2 through simulation by evaluation on an independent test set T_5000. Of course, you do not recompute ŵ using the test set. (A sketch of such a test-set estimate follows this problem.)
(f) Estimate w* by repeating (d) using Levenberg-Marquardt for 200 iterations on each of five different initial conditions and selecting the parameter value giving the lowest empirical error E_{T_100}.
(g) Determine e*_g = e_g(w*) = E(η_1(x, w*) − t)^2 through simulation by evaluation on an independent test set T_5000.
(h) To determine the best 1HL with s_1 = 2, specified by w^0, we generate a new noiseless training set with σ = 0, d = 1, n = 1000, and choose x_1 uniformly spaced over [−1, 1]. Train using Levenberg-Marquardt to find the 1HL η_1 with s_1 = 2 logistic nodes that minimizes the training error and use the resulting weights and biases (indicated by primes) as follows for i = 1, 2:
w^0_{1:i,1} = w'_{1:i,1},  b^0_{1:i} = b'_{1:i},  w^0_{2:1,i} = w'_{2:1,i},  b^0_{2:1} = b'_{2:1},
and the remaining parameter values in w^0 are set at 0.
(i) As in (g), determine e^0_g = e_g(w^0) = E(η_1(x, w^0) − t)^2.
(j) Verify that the inequalities of Eq. 7.4.1 hold for the preceding.
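Parts (e), (g), and (i) all call for the same Monte Carlo estimate: draw an independent test set from the model above and average the squared prediction error. The sketch below is one way to do this; the function name egtest is introduced here only for illustration, and it assumes that ntout1.m of Section A.3 returns the network outputs as its last output argument.

function [eghat]=egtest(ww,sigma,n)
%Monte Carlo estimate of e_g(ww)=E(eta_1(x,ww)-t)^2 for the model above
S=2*rand(5,n)-1;                   %i.i.d. components uniform on [-1,1]
t=1-S(1,:).^2+sigma*randn(1,n);    %targets t = 1 - x_1^2 + sigma*N
[a1,d1,y]=ntout1('logistic',ww,S); %network outputs for parameter vector ww
eghat=mean((y-t).^2);

For part (e), for instance, one would call eghat=egtest(wwhat,0.5,5000) with wwhat holding the parameter vector ŵ found in (d).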
Final Project
The object of the final project is to pull together what has been learned through choice and exercise of algorithms and written justification for the choices made. This project is in lieu of a final examination. The project is to design neural networks that perform well on training sets T1 and T2 that have been supplied to you on the course computer account. If you are engaged in self-study, then you can find data sets that may be of interest at
http://lib.stat.cmu/datasets
You are to document in writing your approach, explaining the choices considered for network architecture and training methods and the choices then adopted, and provide printouts of plots of the error history as you trained these networks. Grades will be based in part on the documentation establishing relevant contact with the material presented in this course. An indication of what is expected is given by the following list of significant project issues.
• (a) Contact with the material of the course shown through citations to the literature; e.g., representation results from Ch. 4, algorithm alternatives and their implementation from Ch. 5, architecture selection issues from Ch. 6, and generalization issues and results in the early part of Ch. 7.
• (b) Selection of training and validation sets from the data set; e.g., interleaved or at random from T_n. The validation set should be no bigger than the training set. (See the Kearns reference for advice on the proportion between the two; one illustrative random split is sketched at the end of this project description.)
• (c) Experiments with a variety of architectures (1HL and 2HL, varying the numbers {s_1} of nodes) before selecting one.
• (d) Efficient use of network resources (number of parameters).
• (e) Use of validation sets and of regularization techniques to control the complexity of large networks.
• (f) Exploration of steepest descent, quasi-Newton, and Levenberg-Marquardt training algorithms before settling on one of them.
• (g) Exploration of several initial conditions.
• (h) Achievement of good sum-squared error on the unknown test set of 1000 samples.
Part of your project grade will depend upon the performance of your designs on independent data that I have reserved. You are to submit electronically the network parameters you settled on.
[There then follows a brief description of the characteristics and origins of the two data sets. The set T1 is drawn from an actual application, while the set T2 is artificially generated.]
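As an illustration of item (b), the following sketch splits the columns of a supplied data set T (inputs stacked above targets, as in the listings of Section A.3) into training and validation portions at random. The function name splitdata and the two-thirds/one-third proportion are chosen here only for illustration; consult the Kearns reference for guidance on the split.

function [Ttrain,Tval]=splitdata(T,frac)
%randomly assign a fraction frac of the columns of T to training
%and the remainder to validation; frac=2/3 is only an example
n=size(T,2);
perm=randperm(n);      %random ordering of the n examples
ntr=ceil(frac*n);
Ttrain=T(:,perm(1:ntr));
Tval=T(:,perm(ntr+1:n));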
A.3 MATLAB Program Listings

%generate the training set Tb
function T=Tb(d,n)
%input vector matrix is S
S=randn(d,n);
t=sign(rand(1,n)-0.5);
T=[S;t];
-----------------------------------
%generate the training set Tls
function [T,ww]=Tls(d,n)
%input vector matrix is S
S=randn(d,n);
ww=randn(1,d+1);
t=sign(ww(1:d)*S-ww(d+1));
T=[S;t];
-----------------------------------
%generate the training set Tq
function T=Tq(d,n)
%input vector matrix is S
S=randn(d,n);
t=S(1,:)+3*S(d,:).^2;
T=[S;t];
-----------------------------------
%generate the training set T_{net1}
function [T,ww]=Tnet1(node,sigma,s1,d,n)
%node is a string for either 'logistic' or 'tanh'
%sigma is the std. dev. of additive normal noise
%s1 is width of 1HL with d inputs
%n is number of training pairs
%input vector matrix is S
S=randn(d,n);
p=s1*(d+2)+1;
ww=randn(p,1);
[a1,d1,t]=ntout1(node,ww,S);
t=t+sigma*randn(1,n);
T=[S;t];
-----------------------------------
%generate the training set T_{net2}
function [T,ww]=Tnet2(node,sigma,s1,s2,d,n)
%node is a string for either 'logistic' or 'tanh'
%s1 is width of the first layer with d inputs,
%s2 the width of the second layer
%n is the number of training pairs
%input vector matrix is S
S=randn(d,n);
p=s1*(d+1)+s2*(s1+2)+1;
ww=randn(p,1);
[a1,d1,a2,d2,t]=ntout2(node,s1,ww,S);
t=t+sigma*randn(1,n);
T=[S;t];
-----------------------------------
function []=errplot(node,range,q,ww,pv1,pv2,T)
%error surface plot
%set increment to get q points in [-range,range]
incr=2*range/(q-1);
%starting point set by ww network specification
%two directions of variation set by pv1, pv2
%generate error matrix
err=zeros(q,q);
for i=1:q, for j=1:q,
wij=ww+(-range+(i-1)*incr)*pv1+(-range+(j-1)*incr)*pv2;
err(i,j)=ss1(node,wij,T);
end,end
surf([-range:incr:range],[-range:incr:range],err)
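errplot.m presupposes the error function ss1.m written in Problem 5.1, which is not listed above. One plausible form, assuming ntout1.m returns the network outputs as its last output argument and that the empirical error is the plain sum of squared errors, is sketched below, followed by a commented example call in the setting of Problem 5.3 (d = 5, s_1 = 3, so ww has 22 entries; the tanh node and random directions are chosen only for illustration).

%one possible form of the error function presupposed by errplot.m
function [err]=ss1(node,ww,T)
%sum-squared error on T of the 1HL network whose parameters are packed in ww
d=size(T,1)-1;
[a1,d1,y]=ntout1(node,ww,T(1:d,:));
err=sum((y-T(d+1,:)).^2);
-----------------------------------
%example call in the setting of Problem 5.3:
% T=Tq(5,200); ww=randn(22,1);
% pv1=randn(22,1); pv2=randn(22,1);
% errplot('tanh',5,21,ww,pv1,pv2,T);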
References
[1] Advances in Neural Information Processing Systems, an annual series of carefully reviewed conference proceeding volumes, with volumes 1– 7 published by Morgan Kaufmann Pub., subsequent volumes by MIT Press. [2] Akaike, H. [1974], A new look at the statistical model identification, IEEE Trans. on Automatic Control, AC-19, 716–723. [3] Albertini, A., E. Sontag, V. Maillot [1993], Uniqueness of weights for neural networks, in R. Mammone, ed., Artificial Neural Networks for Speech and Vision, Chapman and Hall, London, 113–125. [4] Aldous, D. [1989], Probability Approximations via the Poisson Clumping Heuristic, Springer-Verlag. [5] Alon, N., S. Ben-David, N. Cesa-Bianchi, D. Haussler [1993], Scalesensitive dimensions, uniform convergence, and learnability, Proceedings of the 34th IEEE Symp. on Foundations of Computer Science, IEEE Computer Society Press, Los Alamitos, CA. [6] Amaldi, E. [1994], From Finding Maximum Feasible Subsystems of Linear Systems to Feedforward Neural Network Design, Ph.D. dissertation, Dept. of Mathematics, Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland. [7] Amari, S. [1993], A universal theorem on learning curves, Neural Networks, 6, 161–166.
[8] Amari, S., N. Murata [1993], Statistical theory of learning curves under entropic loss function, Neural Computation, 5, 140–153. [9] Amari, S., N. Murata, K.-R. Muller, M. Finke, H.H. Yang [1997], Asymptotic statistical theory of overtraining and cross-validation, IEEE Trans. on Neural Networks, 8, 985–996. [10] Anderson, J., E. Rosenfeld, eds., [1988], Neurocomputing: Foundations of Research, MIT Press, Cambridge, MA. [11] Anthony, M., N. Biggs [1992], Computational Learning Theory, Cambridge University Press, Cambridge. [12] Arai, M. [1993], Bounds on the number of hidden units in binaryvalued three-layer neural networks, Neural Networks, 6, 855–860. [13] Ash, R. [1972], Real Analysis and Probability, Academic Press, New York. [14] Auer, P., M. Herbster, M. Warmuth [1996], Exponentially many local minima for single neurons, in D. Touretzky, M. Mozer, M. Hasselmo, eds., Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA, 316–322. [15] Balasubramanian, V. [1997], Statistical inference, Occam’s razor, and statistical mechanics on the space of probability distributions, Neural Computation, 9, 349–368. [16] Barron, A. [1991a], Complexity regularization with application to artificial neural networks, in G. Roussas, ed., Nonparametric Functional Estimation and Related Topics, Kluwer Academic Pub., 561–576. [17] Barron, A. [1991b], Approximation and estimation bounds for artificial neural networks, Proc. Fourth Annual Workshop on Computational Learning Theory, Morgan Kaufmann Pub., 243–249. [18] Barron, A. [1993], Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. on Information Theory, 39, 930–945. [19] Barron, A., R. Barron [1988], Statistical learning networks: a unifying view, in E. Wegman, ed., Computing Science and Statistics: Proc. of the 20th Symp. on the Interface, Amer. Stat. Assn., Alexandria:VA, 192–203. [20] Barron, A., T. Cover [1991], Minimum complexity density estimation, IEEE Trans. on Information Theory, 37, 1034–1054.
[21] Bartlett, P. [1998], The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network, IEEE Trans. on Information Theory, 44, 525–536. [22] Battiti, R. [1992], First- and second-order methods for learning: Between steepest descent and Newton’s methods, Neural Computation, 4, 141–166. [23] Baum, E.B. [1988], On the capabilities of multilayer perceptrons, Journal of Complexity, 4, 193–215. [24] Baum, E., D. Haussler [1989], What size net gives valid generalization? in D. Touretzky, ed., Advances in Neural Information Processing Systems 1, Morgan Kaufmann Pub., 81–90. [25] Baum, E. [1990], The perceptron algorithm is fast for nonmalicious distributions, Neural Computation, 2, 248–260. [26] Baumeister, J. [1987], Stable Solutions to Inverse Problems, F. Vieweg, Braunschweig. [27] Berger, J. [1985], Statistical Decision Theory and Bayesian Analysis, Second edition, Springer-Verlag, New York. [28] Bernardo, J., A. Smith [1994], Bayesian Statistical Decision Theory, Wiley, New York. [29] Bishop, C. [1995], Neural Networks for Pattern Recognition, Clarendon Press, Oxford. [30] Block, H.D. [1962], The Perceptron: A model for brain functioning. I, Reviews of Modern Physics, 34, 123–135. [31] Block, H.D., B.W. Knight, F. Rosenblatt [1962], Analysis of a fourlayer series-coupled Perceptron. II, Reviews of Modern Physics, 34, 135–142. [32] Blum, A., R. Rivest [1989], Training a 3-node neural network is NP-complete, Advances in Neural Information Processing Systems 1, Morgan Kaufmann Pub., 494–501. [33] Brent, R. [1991], Fast training algorithms for multilayer neural nets, IEEE Trans. on Neural Networks, 2, 3, 346–354. [34] Buntine, W., A. Weigend [1994], Computing second derivatives in feed-forward networks: A review, IEEE Trans. on Neural Networks, 5, 480–488.
[35] Carroll, S., B. Dickinson [1989], Construction of neural nets using the Radon transform, Proc. of IJCNN, I, IEEE Publications, Piscataway, NJ, I607–I611. [36] Caruana, R. [1995], Learning many related tasks at the same time with backpropagation, in G. Tesauro, D. Touretzky, T. Leen, eds., Advances in Neural Information Processing Systems 7, MIT Press, 657–664. [37] Caruana, R. [1997], Multitask learning, Machine Learning, 28, 41–75. [38] Churchland, P. [1986], Neurophilosophy: Toward a Unified Science of the Mind/Brain, MIT Press, Cambridge, MA. [39] Chvatal, V. [1983], Linear Programming, W.H. Freeman, New York. [40] Combining Artificial Neural Nets: Ensemble Approaches, special issue of Connection Science, 8, December 1996. [41] Constantine, K., Eastern Nazareth College, personal communication 26 September 1997. [42] Cosnard, M., P. Koiran, H. Paugam-Moisy [1994], Bounds on number of units for computing arbitrary dichotomies by multilayer perceptron, Journal of Complexity, 10, 57–63. [43] Cotter, N. [1990], The Stone-Weierstrass theorem and its application to neural networks, IEEE Trans. on Neural Networks, 1, No. 4, 290– 295. [44] Cotter, N., T. Guillerm [1992], The CMAC and a theorem of Kolmogorov, Neural Networks, 5, 221–228. [45] Courant, R., D. Hilbert [1953], Methods of Mathematical Physics, Wiley Interscience Press, New York. [46] Cover, T. [1965], Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Trans. on Electronic Computers, EC-14, 326–334. [47] Cover, T. [1968], Capacity problems for linear machines, in L. Kanal, ed., Pattern Recognition: Proceedings of the IEEE Workshop on Pattern Recognition, Thompson Book Comp., Washington, DC, 283–289. [48] Cover, T., J. Thomas [1991], Elements of Information Theory, Wiley, New York. [49] Cowan, J. [1990], Neural networks: The early days, in D. Touretzky, ed., Advances in Neural Information Processing Systems 2, Morgan Kaufmann, San Mateo, CA, 828–842.
[50] Cybenko, G. [1989], Approximations by superpositions of a sigmoidal function, Mathematics of Control, Signals & Systems, 2, 4, 303–314. Correction made in op. cit. [1992], 5, 455. [51] Dahmen, W., C. Micchelli [1987], Some remarks on ridge functions, Approximation Theory and Its Applications, 3, 139–143. [52] Darken, C., J. Moody [1992], Towards faster stochastic gradient search, in J. Moody, S.J. Hanson, R.P. Lippmann, eds., Advances in Neural Information Processing Systems 4, Morgan Kaufmann Pub., 1009–1016. [53] Das Gupta, B., H. Siegelmann, E. Sontag [1995], On the complexity of training neural networks with continuous activation functions, IEEE Trans. on Neural Networks, 6, 1490–1504. [54] Decatur, S. [1989], Application of Neural Networks to Terrain Classification, Proceedings of the IJCNN, I, IEEE Press, I283–I288. [55] Dembo, A. [1989], On the capacity of associative memories with linear threshold functions, IEEE Trans. on Information Theory, 35, 709– 720. [56] Denker, J., Y. LeCun [1991], Transforming neural-net output levels to probability distributions, in R. Lippmann, J. Moody, D. Touretzky, eds., Advances in Neural Information Processing Systems 3, Morgan Kaufmann Pub., San Mateo, CA, 853–859. [57] Dennis, J., R. Schnabel [1983], Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice Hall, Englewood Cliffs, NJ. [58] DeVore, R., R. Howard, C. Micchelli [1989], Optimal nonlinear approximation, Manuscripta Mathematica, 63, 469–478. [59] DeVore, R., V. Temlyakov [1996], Some remarks on greedy algorithms, Adv. in Computational Mathematics, 5, 173–187. [60] Devroye, L. [1988], Automatic pattern recognition: A study of the probability of error, IEEE Trans. Pattern Analysis and Machine Intelligence, 10, 530–543. [61] Devroye, L., L. Gyorfi, G. Lugosi [1996], A Probabilistic Theory of Pattern Recognition, Springer-Verlag, New York. [62] Dodier, R. [1996], Geometry of early stopping in linear networks, in D. Touretzky, M. Mozer, M. Hasselmo, eds., Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA, 365– 371.
[63] Donahue, M., L. Gurvits, C. Darken, E. Sontag [1997], Rates of convex approximation in non-Hilbert spaces, Constructive Approximation, 13, 187–220. [64] Drucker, H., C. Cortes, L. Jackel, Y. LeCun, V. Vapnik [1994], Boosting and other ensemble methods, Neural Computation, 6, 1289–1301. [65] Duda, R., P. Hart [1973], Pattern Classification and Scene Analysis, Wiley, New York. [66] Dyer, M., A. Frieze, R. Kannan [1991], A random polynomial-time algorithm for approximating the volume of convex bodies, Jour. Assoc. Comput. Mach., 38, 1–17. [67] Efron, B. [1982], The Jackknife, the Bootstrap and Other Resampling Plans, lecture notes 38, Soc. for Indust. and Applied Mathematics. [68] Efron, B., R. Tibshirani [1993], An Introduction to the Bootstrap, Chapman and Hall, London and New York. [69] Fahlman, S., C. Lebiere [1990], The cascade-correlation learning architecture, in D.S. Touretzky, ed., Advances in Neural Information Processing Systems 2, Morgan Kaufmann Pub., San Mateo, CA, 524– 532. [70] Fefferman, C. [1994], Reconstructing a neural net from its output, Rev. Mat. Iberoamericana, 10, 507–555. [71] Fefferman, C., S. Markel [1994], Recovering a feed-forward net from its output, in J. Cowan, G. Tesauro, J. Alspector, eds. Advances in Neural Information Processing Systems 6, Morgan Kaufmann Pub., 335–342. [72] Feller, W. [1957], An Introduction to Probability Theory and Its Applications, I, second edition, Wiley, New York. [73] Feller, W. [1966], An Introduction to Probability Theory and Its Applications, II, Wiley, New York. [74] Fine, T. [1996], Review of M. Hassoun, Fundamentals of Artificial Neural Networks, in IEEE Trans. on Information Theory, 42. [75] Fine, T., S. Mukherjee [1999], Parameter convergence and learning curves for neural networks, Neural Computation, 11, No. 3, 747–769. [76] Fletcher, R. [1987], Practical Methods of Optimization, J. Wiley, New York.
[77] Frean, M. [1990], The upstart algorithm: A method for constructing and training feedforward neural networks, Neural Computation, 2, 198–209. [78] Frean, M. [1992], A “thermal” perceptron learning rule, Neural Computation, 4, 946–957. [79] Fukumizu, J. [1996], A regularity condition of the information matrix of a multilayer perceptron network, Neural Networks, 5, 871–879. [80] Funahashi, K. [1989], On the approximate realization of continuous mappings by neural networks, Neural Networks, 2, 183–192. [81] Gallant, S. [1990], Perceptron-based learning algorithms, IEEE Trans. on Neural Networks, 1, No. 2, 179–191. [82] Gallant, S. [1993], Neural Network Learning and Expert Systems, MIT Press, Cambridge, MA. [83] Gibson, G. [1996], Exact classification with two-layer neural nets, Journal of Computer and System Sciences, 52, 349–356. [84] Graubard, S., ed., [1988], The Artificial Intelligence Debate: False Starts, Real Foundations, MIT Press, Cambridge, MA. [85] Grossman, T., et al. [1989], Learning by choice of internal representations, Advances in Neural Information Processing Systems 1, Morgan Kaufmann Pub., 73–80. [86] Guyon, I., P. Albrecht, Y. LeCun, J. Denker, W. Hubbard [1991], Design of a neural network character recognizer for a touch terminal, Pattern Recognition, 24, 105–119. [87] Guyon, I., V. Vapnik, B. Boser, L. Bottou, S. Solla [1992], Structural risk minimization for character recognition, in J. Moody, S. Hanson, R. Lippmann, eds., Advances in Neural Information Processing Systems 4, Morgan Kaufmann Pub., San Mateo, CA, 471–479. [88] Guyon, I., J. Makhoul, R. Schwartz, V. Vapnik [1998], What size test set gives good error rate estimates?, IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 52–64. [89] Hagan, M., M. Menhaj [1994], Training feedforward networks with the Marquardt algorithm, IEEE Trans. on Neural Networks, 5, 989– 993. [90] Hall, P. [1992], The Bootstrap and Edgeworth Expansion, SpringerVerlag, New York.
[91] Halmos, P. [1974], Finite-Dimensional Vector Spaces, SpringerVerlag, New York. [92] Harris-Warwick, R. [1993], personal communication. [93] Hashem, S. [1993], Optimal Linear Combinations of Neural Networks, Ph.D. dissertation, Purdue University, W. Lafayette, IN. [94] Hassibi, B., D. Stork [1993], Second order derivatives for network pruning: optimal brain surgeon, in S. Hanson, J. Cowan, C. Giles, eds., Advances in Neural Information Processing Systems 5, Morgan Kaufmann Pub., San Mateo, CA, 164–171. [95] Hassibi, B., D. Stork, G. Wolff [1993], Optimal brain surgeon and general network pruning, Proceedings of IEEE Int. Conf. on Neural Networks (ICNN) 93, 1, IEEE Press, San Francisco, 293–299. [96] Hassibi, B., D. Stork, G. Wolff, T. Watanabe [1994], Optimal brain surgeon: extensions and performance comparisons, in J. Cowan, G. Tesauro, J. Alspector, eds., Advances in Neural Information Processing Systems 6, Morgan Kaufmann Pub., San Mateo, CA, 263–270. [97] Hassoun, M. [1995], Fundamentals of Artificial Neural Networks, MIT Press, Cambridge MA. [98] Hastie, T., R. Tibshirani [1990], Generalized Additive Models, Monographs on Statistics and Applied Probability 43, Chapman and Hall, London and New York. [99] Haussler, D., M. Kearns, H. Seung, N. Tishby [1996], Rigorous learning curve bounds from statistical mechanics, Machine Learning, 25, 195–236. [100] Haykin, S. [1999], Neural Networks: A Comprehensive Foundation, second edition, Prentice Hall, Englewood Cliffs, NJ. [101] Hebb, D. [1949], The Organization of Behavior, Wiley, New York. [102] Hecht-Nielsen, R. [1990], Neurocomputing, Addison-Wesley Pub. Co., Reading, MA. [103] Hertz, J., A. Krogh, R. Palmer [1991], Introduction to the Theory of Neural Computation, Addison-Wesley Pub. Co., Reading, MA. [104] Hoffgen, K.-U. [1993], Computational limits on training sigmoidal neural networks, Information Processing Letters, 46, 269–274. [105] Hoffgen, K.-U., H.-U. Simon [1995], Robust trainability of single neurons, Journal of Computer and System Sciences, 50, 114–125.
[106] Hopfield, J. [1982], Neural networks and physical systems with emergent collective computational abilities, Proc. National Academy of Sciences, 79, 2554–2558. Also in Neurocomputing, op. cit. [107] Hopfield, J. [1984], Neurons with graded response have collective computational properties like those of two-state neurons, Proc. National Academy of Sciences, 81, 3088–3092. Also in Neurocomputing, op. cit. [108] Hopfield, J., D. Tank [1986], Computing with neural circuits: A model, Science, 233, 625–633. [109] Hornik, K., M. Stinchcombe, H. White [1989], Multilayer feedforward networks are universal approximators, Neural Networks, 2, 359–366. [110] Hornik, K. [1991], Approximation capabilities of multilayer feedforward networks, Neural Networks, 4, 251–257. [111] Hornik, K., M. Stinchcombe, H. White, P. Auer [1994], Degree of approximation results for feedforward networks approximating an unknown mapping and their derivatives, Neural Computation, 6, 1262– 1275. [112] IEEE Transactions on Neural Networks, IEEE Press, Piscataway, NJ. [113] Jackel, L., et al. [1994], Neural-net applications in character recognition and document analysis, in B. Yuhas, N. Ansari, eds., Neural Networks in Telecommunications, Kluwer, Norwell, MA. [114] Jacobs, R., M. Jordan [1991], Adaptive mixtures of local experts, Neural Computation, 3, 79–87. [115] Jeffreys, H. [1946], An invariant form for the prior probability in estimation problems, Proc. Royal Society, series A, 186, 453–461. [116] Jones, L. [1990], Constructive approximations for neural networks by sigmoidal functions, in Proc. IEEE, 1586–1589. [117] Jones, L. [1992], A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training, The Annals of Statistics, 20, 608–613. [118] Jones, L. [1994], Good weights and hyperbolic kernels for neural networks, projection pursuit, and pattern classification: Fourier strategies for extracting information from high-dimensional data, IEEE Trans. on Information Theory, 40, 439–454. [119] Jones, L. [1997], The computational intractability of training sigmoidal neural networks, IEEE Trans. on Information Theory, 43, 167–173.
[120] Judd, S. [1990], Neural Network Design and the Complexity of Learning, MIT Press, Cambridge, MA. [121] Karpinski, M., A. Macintrye [1995], Polynomial bounds for VC dimension of sigmoidal neural networks, Proc. 27th ACM Symposium on Theory of Computing, 200–208. [122] Katsuura, H., D. Sprecher [1994], Computational aspects of Kolmogorov’s superposition theorem, Neural Networks, 7, 455–461. [123] Kearns, M. [1997], A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split, Neural Computation, 9, 1143–1161. [124] Kearns, M., R. Schapire [1990], Efficient distribution-free learning of probabilistic concepts, Proc. 31st Symposium on the Foundations of Computer Science, IEEE Computer Society Press, Los Alamitos, CA, 382–391. [125] Kearns, M., U. Vazirani [1994], An Introduction to Computational Learning Theory, MIT Press, Cambridge, MA. [126] Kerlirzin, P., F. Vallet [1993], Robustness in multilayer perceptrons, Neural Computation, 5, 473–482. [127] Knerr, S., L. Personnaz, G. Dreyfus [1992], Handwritten digit recognition by neural networks with single-layer training, IEEE Trans. on Neural Networks, 3, 962–968. [128] Koiran, P. [1993], On the complexity of approximating mappings using feedforward networks, Neural Networks, 6, 649–653. [129] Koiran, P., E. Sontag [1997], Neural networks with quadratic VC dimension, Journal of Comput. Systems Science, 54, 190–198. [130] Kolen, J., J. Pollack [1991], Back propagation is sensitive to initial conditions, in R. Lippmann, J. Moody, D. Touretzky, eds., Advances in Neural Information Processing Systems 3, Morgan Kaufmann Pub., 860–867. [131] Kowalczyk, A. [1997], Estimates of storage capacity of multilayer perceptron with threshold logic hidden units, Neural Networks, 10, 1417–1433. [132] Kramer, A., A. Sangiovanni-Vincentelli [1989], Efficient parallel learning algorithms for neural networks, Advances in Neural Information Processing Systems 1, Morgan Kaufmann Pub., 40–48. [133] Kurkova, V. [1992], Kolmogorov’s theorem and multilayer neural networks, Neural Networks, 5, 501–506.
[134] Kurkova, V., P. Kainen [1994], Functionally equivalent feedforward neural networks, Neural Computation, 6, 543–558. [135] Kurkova, Vera. personal communication on the work of a student of hers. [136] Lawrence, S., A. Tsoi, A. Back [1996], The gamma MLP for speech phoneme recognition, in Touretzky et al., op. cit., 785–791. [137] LeCun, Y., Y. Bengio [1995], Convolutional networks for images, speech, and time series, in M. Arbib, ed., The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA, 255–258. [138] LeCun, Y., B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, L. Jackel [1990], Handwritten digit recognition with a backpropagation network, in D. Touretzky, ed., Advances in Neural Information Processing Systems 2, Morgan Kaufmann Pub., San Mateo, CA, 396–404. [139] LeCun, Y., J. Denker, S. Solla [1990], Optimal brain damage, in D. Touretzky, ed., Advances in Neural Information Processing Systems 2, Morgan Kaufmann Pub., San Mateo, CA, 598–605. [140] LeCun, Y., P. Simard, B. Pearlmutter [1993], Automatic learning rate maximization by on-line estimation of the Hessian’s eigenvectors, in S. Hanson, J. Cowan, L. Giles, eds., Advances in Neural Information Processing Systems 5, Morgan Kaufmann Pub., San Mateo, CA, 156– 163. [141] LeCun, Y., L. Bottou, Y. Bengio, P. Haffner [1998], Gradient-based learning applied to document recognition, Proc. IEEE, 86, 2278– 2324. [142] Lee, W., P. Bartlett, R. Williamson [1996], Efficient agnostic learning of neural networks with bounded fan-in, IEEE Trans on Information Theory, 42, 2118–2132. [143] Lehmann, E. [1983], Theory of Point Estimation, Wiley, New York. [144] Leshno, M., V. Lin, A. Pinkus, S. Schocken [1993], Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks, 6, 861–867. [145] Li, K.-C. [1987], Asymptotic optimality for Cp , CL , cross-validation and generalized cross-validation: Discrete index set, Annals of Statistics, bf 15, 958–975. [146] Li, M., P. Vitanyi [1993], An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, New York.
[147] Lindley, D. [1985], Making Decisions, second edition, Wiley, London. [148] Lippmann, R.P. [1987], An introduction to computing with neural networks, ASSP Magazine, 4, 4–22. [149] Littmann, E., H. Ritter [1993], Generalization abilities of cascade network architectures, in S. Hanson, J. Cowan, C.L. Giles, eds., Advances in Neural Information Processing Systems 5, Morgan Kaufmann Pub., 188–195. [150] Liu, Y. [1993], Neural network model selection using asymptotic jackknife estimator and cross-validation method, in S.J. Hanson, J. Cowan, C.L. Giles, eds., Advances in Neural Information Processing Systems 5, Morgan Kaufmann Pub., 599–606. [151] Loeve, M. [1977], Probability Theory I, Springer, New York, p. 295. [152] Lorentz, G. [1986], Approximation of Functions, Chelsea Pub. Co., New York, Ch. 11. [153] Luenberger, D. [1984], Linear and Nonlinear Programming, second edition, Addison-Wesley Pub., Reading, MA. [154] Maass, W. [1994], Neural nets with superlinear VC-dimension, Neural Computation, 6, 877–884. [155] MacKay, D. [1991], Bayesian Methods for Adaptive Models, Ph.D. dissertation, Calif. Inst. of Technology, Pasadena, CA. [156] MacKay, D. [1992], Bayesian model comparison and backprop nets, in J. Moody, S. Hanson, R. Lippmann, eds., Advances in Neural Information Processing Systems 4, Morgan Kaufmann Pub., 839–846. [157] Maron, O., A. Moore [1994], Hoeffding races: Aaccelerating model selection search for classification and function approximation, in J. Cowan, G. Tesauro, J. Alspector, eds. Advances in Neural Information Processing Systems 6, Morgan Kaufmann Pub., San Mateo, CA, 59–66. [158] May, G. [1994], Manufacturing ICs the neural way, IEEE Spectrum, September, 47–51. [159] McCulloch, W.S., W. Pitts [1943], A logical calculus of the ideas immanent in nervous activity, Bull. of Math. Biophysics 5, 116. [160] McEliece, R., E. Posner, E. Rodemich [1987], The capacity of the Hopfield associative memory, IEEE Trans. on Information Theory, IT-33, 461–482.
[161] McLachlan, G. [1992], Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York. [162] Mead, C. [1989], Analog VLSI and Neural Systems, Addison-Wesley, Reading, MA. [163] Mhaskar, H. [1993], Approximation properties of a multilayered feedforward artificial neural network, Advances in Computational Mathematics, 1, 61–80. [164] Mhaskar, H., C. Micchelli [1992], Approximation by superposition of sigmoidal and radial basis functions, Advances in Applied Mathematics, 13, 350–373. [165] Mhaskar, H., C. Micchelli [1994], How to choose an activation function, in J. Cowan, G. Tesauro, J. Alspector, eds. Advances in Neural Information Processing Systems 6, Morgan Kaufmann Pub., 319–326. [166] Minai, A., R. Williams [1993], On the derivatives of the sigmoid, Neural Networks, 6, 845–853. [167] Minai, A., R. Williams [1994], Perturbation response in feedforward networks, Neural Networks, 7, 783–796. [168] Minsky, M., S. Papert [1988], Perceptrons, expanded edition of 1969 version, MIT Press, Cambridge, MA. [169] Møller, M. [1993], A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks, 6, 525–533. [170] Moody, E. [1967], William of Ockham, in Paul Edwards, ed., The Encyclopedia of Philosophy, 8, Macmillan Pub., New York, 306–317. [171] Moody, J. [1992], The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems, in J. Moody, S. Hanson, R. Lippmann, eds., op. cit., 847–854. [172] Mukherjee, S. [1996], Neural Network Training Algorithms Based on Quadratic Error Surface Models, Ph.D. dissertation, Cornell University, Ithaca, NY. [173] Mukherjee, S., T. Fine [1996], Asymptotics of gradient-based neural network training algorithms, Neural Computation, 8, 1075–1084. [174] Muller, P., D. Insua [1998], Issues in Bayesian analysis of neural network models, Neural Computation, 10, 749–770. [175] Murata, N. [1993], Learning curves, model selection and complexity of neural networks, in S. Hanson, J. Cowan, C.L. Giles, eds., Advances in Neural Information Processing Systems 5, Morgan Kaufmann Pub., 607–614.
[176] Muroga, S. [1971], Threshold Logic and Its Applications, Wiley, New York. [177] Neal, R. [1996], Bayesian Learning for Neural Networks, SpringerVerlag, NY. [178] Neural Computation, MIT Press, Cambridge, MA. [179] Neural Networks, publication of the Int. Neural Network Soc. (INNS). [180] Nilsson, N. [1990], The Mathematical Foundations of Learning Machines, Morgan Kaufmann Pub., San Mateo, CA. [181] O’Cinneide, C. [1990], The mean is within one standard deviation of any median, American Statistician, 44, 292–293. [182] Paass, G. [1993], Assessing and improving neural network predictions by the bootstrap algorithm, in S. Hanson, J. Cowan, C.L. Giles, eds., Advances in Neural Information Processing Systems 5, Morgan Kaufmann Pub., 196–203. [183] Pagels, H.R. [1988], The Dreams of Reason, Chap. 6, “Connectionism/Neural Nets,” reprinted as a paperback by Bantam Books, New York, 1989. [184] Pearlmutter, B. [1994], Fast exact multiplication by the Hessian, Neural Computation, 6, 147–160. [185] Platt, J., T. Allen [1996], A neural network classifier for the I1000 OCR chip, in D. Touretzky, M. Mozer, M. Hasselmo, eds., Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA, 938–944. [186] Plutowski, M., S. Sakata, H. White [1994], Cross-validation estimates IMSE, in J. Cowan, G. Tesauro, J. Alspector, eds., Advances in Neural Information Processing Systems 6, Morgan Kaufmann Pub., 391– 398. [187] Poggio, T., F. Girosi [1990], Networks for approximation and learning, in Proc. IEEE, 1481–1497. [188] Pollard, D. [1984], Convergence of Stochastic Processes, SpringerVerlag, New York. [189] Pollard, D. [1990], Empirical Processes: Theory and Applications, Inst. of Math. Statistics Press, Hayward, CA. [190] Pontil, M., A. Verri [1998], Properties of support vector machines, Neural Computation, 10, 955–974.
[191] Press, W., B. Flannery, S. Teukolsky, W. Vetterling [1992], Numerical Recipes in C: The Art of Scientific Programming, second edition, Cambridge Univ. Press, Cambridge. [192] Ripley, B. [1996], Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge. [193] Rissanen, J. [1984], Universal coding, information, prediction, and estimation, IEEE Trans. on Information Theory, IT-30, 629–636. [194] Rissanen, J. [1986], Stochastic complexity and modeling, The Annals of Statistics, 14, 1080–1100. [195] Rissanen, J. [1987], Stochastic complexity, J. Royal Statistical Society, 49, 223–239. [196] Rissanen, J. [1989], Stochastic Complexity in Statistical Inquiry, World Scientific, Singapore. [197] Rockafellar, R. [1970], Convex Analysis, Princeton University Press, Princeton, NJ. [198] Rosenblatt, F. [1958], Psychological Review, 65, 386–408. [199] Rosenblatt, F. [1961], Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books, Washington, DC. [200] Rowley, H., S. Baluja, T. Kanade [1996], Human face detection in visual scenes, in D. Touretzky, M. Mozer, M. Hasselmo, eds., Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA, 875–881. [201] Roychowdhury, V., A. Orlitsky, K.-Y. Siu [1994], Lower bounds on threshold and related circuits via communication complexity, IEEE Trans. on Information Theory, 40, 467–474. [202] Roychowdhury, V., K.-Y. Siu, T. Kailath [1995], Classification of linearly non-separable patterns by linear threshold elements, preprint. [203] Royden, H. [1988], Real Analysis, Macmillan, New York. [204] Rudin, W. [1987], Real and Complex Analysis, McGraw-Hill, New York. [205] Rumelhart, D., J. McClelland, eds., [1986], Parallel Distributed Processing, MIT Press, Cambridge, MA. [206] Rumelhart, D.E., G.E. Hinton, R.J. Williams [1986], “Learning internal representations by error propagation,” in D.E. Rumelhart, J.L. McClelland, Parallel Distributed Processing, MIT Press, Cambridge MA, Ch. 8; also in Neurocomputing, op. cit.
[207] Safire, W. [1999], Ockham’s razor’s close shave, The New York Times Magazine, January 31, 1999, p. 14. [208] Sanchez-Sinencio, E., R. Newcomb, eds. [1992], IEEE Trans. on Neural Networks, Special Issue on Neural Network Hardware, 3. [209] Sauer, N. [1972], On the density of families of sets, Journal of Combinatorial Theory, A 13, 145–147. [210] Schaffer, A., M. Yannakakis [1991], Simple local search problems that are hard to solve, SIAM Journal of Computing, 20, 56–87. [211] Schlafli, L. [1950], Gesammelte Mathematische Abhandlungen I, Verlag Birkhauser, Basel, 209–212 (compilation of his work). [212] Schumaker, L. [1981], Spline Functions: Basic Theory, Wiley, New York. [213] Schwartz, G. [1978], Estimating the dimension of a model, Annals of Statistics, 6, 461–464. [214] Shibata, R. [1976], Selection of the order of an autoregressive model by Akaike’s information criterion, Biometrika, 63, 117–126. [215] Shibata, R. [1980], Asymptotically efficient selection of the order of the model for estimating parameters of a linear process, Annals of Statistics, 8, 147–164. [216] Shibata, T., T. Nakai, T. Morimoto, R. Kaihara, T. Yamashita, T. Ohmi [1996], Neuron-MOS temporal winner search hardware for fully-parallel data processing, in D. Touretzky, M. Mozer, M. Hasselmo, eds., Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, 685–691. [217] Shonkwiler, R. [1993], Separating the vertices of N-cubes by hyperplanes and its application to artificial neural networks, IEEE Trans. Neural Networks, 4, 343-347. [218] Shustorovich, A., C. Thrasher [1996], Kodak Imagelink OCR alphanumeric handprint module, in D. Touretzky, M. Mozer, M. Hasselmo, eds., Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA, 778–784. [219] Shynk, J., N. Bershad [1993], Stationary points of a single-layer perceptron for nonseparable data models, Neural Networks, 6, 189–202. [220] Simmons, G. [1963], Introduction to Topology and Modern Analysis, McGraw-Hill, New York.
[221] Siu, K.-Y., A. Dembo, T. Kailath [1994], On the perceptron learning algorithm on data with high precision, Journal of Computer and System Sciences, 48, 347–356. [222] Siu, K.-Y., V. Roychowdhury, T. Kailath [1995], Discrete Neural Computation: A Theoretical Foundation, Prentice Hall Pub., Englewood Cliffs, NJ. [223] Sommerville, D. [1958], An Introduction to the Geometry of N Dimensions, Dover Pub., New York. [224] Sontag, E. [1992a], Feedforward nets for interpolation and classification, Journal of Computer and Systems Sciences, 45, 20–48. [225] Sontag, E. [1992b], Feedback stabilization using two-hidden-layer nets, IEEE Trans. on Neural Networks, 3, 981–990. [226] Sontag, E. [1997], Shattering all sets of k points in “general position” requires (k − 1)/2 parameters, Neural Computation, 9, 337–348. [227] Sprecher, D. [1993], A universal mapping for Kolmogorov’s superposition theorem, Neural Networks, 6, 1089–1094. [228] Stevenson, M., R. Winter, B. Widrow [1990], Sensitivity of feedforward neural networks to weight errors, IEEE Trans. on Neural Networks, 1, 71–80. [229] Stone, M. [1974], Cross-validatory choice and assessment of statistical predictions, Journal Royal Statistical Society, B 36, 111–147. [230] Stone, M. [1977], Asymptotics for and against cross-validation, Biometrika, 64, 29–35. [231] Stork, D. [1989], Is backpropagation biologically plausible?, Int. Joint Conf. Neural Networks, II, op. cit., II241–II246. [232] Sussman, H. [1992], Uniqueness of the weights for minimal feedforward nets with a given input-output map, Neural Networks, 5, 589– 593. [233] Takiyama, R. [1992], Separating capabilities of three layer neural networks, IEICE Trans. Fundamentals, E75–A. [234] Talagrand, M. [1994], Sharper bounds for Gaussian and empirical processes, Ann. Probability, 22, 28–76. [235] Tikhonov, A., V. Arsenin [1977], Solutions of Ill-Posed Problems, Winston & Sons through Wiley, Washington, DC.
[236] Turmon, M. [1995], Assessing Generalization of Feedforward Neural Networks, Ph.D. dissertation, Cornell University, Ithaca, NY. [237] Turmon, M., T. Fine [1995], Sample size requirements for feedforward neural networks, in G. Tesauro, D. Touretzky, T. Leen, eds., Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA, 327–334. [238] Uspensky, J. [1937], Introduction to Mathematical Probability, McGraw-Hill Book Co., New York, 205. [239] Valiant, L. [1984], A theory of the learnable, Communications of the ACM, 27, 1134–1142. [240] Vapnik, V. [1982], Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York. [241] Vapnik, V.N. [1995], The Nature of Statistical Learning Theory, Springer-Verlag, New York. [242] Vapnik, V.N., S. Golowich, A. Smola [1997], Support vector method for function approximation, regression estimation, and signal processing, in M. Mozer, M. Jordan, T. Petsche, eds., Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, 281– 287. [243] Vapnik, V.N. [1998], Statistical Learning Theory, Wiley, New York. [244] Vidyasagar, M.M. [1997], A Theory of Learning and Generalization, Springer-Verlag, New York. [245] Wahba, G. [1990], Spline Models for Observational Data, CBMS-NSF Regional Conf. Series in Applied Math., SIAM, Philadelphia, PA. [246] Weigend, A., N. Gershenfeld, eds., [1994], Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley Pub., Reading, MA. [247] Weigend, A., D. Rumelhart, B. Huberman [1991], Generalization by weight-elimination with application to forecasting, in R. Lippmann, J. Moody, D. Touretzky, eds. Advances in Neural Information Processing Systems 3, Morgan Kaufmann Pub., 875–882. [248] Wenocur, R., R. Dudley [1981], Some special Vapnik-Chervonenkis classes, Discrete Mathematics, 33, 313–318. [249] West, J., ed. [1985], Best & Taylor’s Physiological Basis of Medical Practice, eleventh edition, Williams & Wilkins Pubs., Baltimore, MD.
[250] West, M., J. Harrison [1989], Bayesian Statistical Decision Theory, Springer-Verlag, New York. [251] White, H. [1989], Some asymptotic results for learning in single hidden-layer feedforward network models, Journal of Amer. Statistical Assn., 84, 1003–1013. [252] Wiener, N. [1933], The Fourier Integral and Certain of Its Applications, reprinted by Dover, New York. [253] Winder, R.O. [1962], Threshold Logic, Ph.D. dissertation, Dept. of Mathematics, Princeton University, Princeton, NJ. [254] Wolpert, D. [1993], On the use of evidence in neural networks, in S. Hanson, J. Cowan, C.L. Giles, eds., Advances in Neural Information Processing Systems 5, Morgan Kaufmann Pub., 539–546. [255] Wong, C.-F. [1996], Ph.D. dissertation, Cornell University, Ithaca, NY. [256] Wynn, H., D. Naiman [1992], Inclusion-exclusion-Bonferroni identities and inequalities for discrete tube-like problems via Euler characteristics, The Annals of Statistics, 20, 43–76. [257] Yuan, J.-L., T. Fine [1993], Forecasting demand for electric power, Advances in Neural Information Processing Systems 5, 739–746. [258] Yuan, J.-L. and T. L. Fine [1998], Neural network design for small training sets of high dimension, IEEE Trans. on Neural Networks, 9, 266–280. [259] Yukich, J., M. Stinchcombe, H. White [1995], Sup-norm approximation bounds for networks through probabilistic methods, IEEE Trans. on Information Theory, 41, 1021–1027. [260] Zavaliagkos, G., Y. Zhao, R. Schwartz, J. Makhoul [1993], A hybrid neural net system for state-of-the-art continuous speech recognition, in S. Hanson, J. Cowan, C. Giles, eds., Advances in Neural Information Processing Systems 5, Morgan Kaufmann Pub., San Mateo, CA, 704–711.
Index
N1,σ , 82 A , 265 N1,σ Nk,σ , 86 Ω, 71 1HL implementable functions, 82 2HL implementable functions, 85 action potential, 11, 12, 17 Akaike, H., 221 Albertini, A., et al., 87 Aldous, D., 263 algorithms line search Fibonacci search, 153 momentum smoothing, 163 backpropagation, 139 batch mode, 147 conjugate gradient, 135, 164 H-conjugate directions, 165 conjugate gradients effect of restart, 169
Fletcher-Reeves formula, 168 Polak-Ribiere formula, 168 restart, 169 summary of method, 169 convergence rates conjugate gradient, 169 quasi-Newton, 173 steepest descent, 156 descent condition, 148, 171 greedy, 120 interpolation between steepest descent and Newton’s method, 171 iterative or recursive, 144 iteratively tracking inverse Hessian, 172 learning rate or step size, 147 Levenberg-Marquardt, 135 Hessian approximation, 174 Jacobian, 174 Levenberg-Marquardt (LMA), 173 line search, 149, 151 cubic interpolation, 153 quadratic interpolation, 152
330
Index
Newton’s method, 133, 148 none are best, 135, 176 online mode, 147 optimal learning rate, 149 perceptron training, 22, 23 pocket, 37 preprocessing of inputs, 145 quasi-Newton, 135, 171 Broyden Fletcher Goldfarb Shanno (BFGS) inverse Hessian, 173 outline, 172 regularization, 215 search direction, 147 search termination, 154 simplex, 22 steepest descent, 135, 148, 156 adaptive learning rate, 161 constant learning rate, 158 learning rate schedules, 161 nonoptimality of optimal onestep learning rate, 157 superlinear convergence, 173 training A, 238 Alon, N., et al., 266 Amaldi, E., 37 Amari, S., 9, 271, 281 Amari, S., et al., 231, 271 ambiguous classification, 41 Anthony, M., et al., 42, 54, 60 anthropomorphic language, 6 applications, 7 Arai, M., 61, 63 architecture, 2 1HL of LTU capabilities, 74 CMAC, 96 feedforward neural network, 55 MLP, 56 multiple hidden layers, 85 node function selection, 110 notation, 136, 236 pyramid, 64 selection, 110 architecture selection, 204
Bayesian methods, 206 cascade correlation algorithm, 232 cross-validation error, 231 family of models, 204 jackknife estimator, 231 nearly irrelevant weights, 233 optimal brain damage, 234 optimal brain surgery (OBS), 234 overfitting, 231 overtraining, 231 pruning, 233 six approaches, 206 validation, 231 VC theory, 232 artificial neural networks, 2 Ash, R., 82, 97, 100, 103 associative memory, 8, 56 Assumption asymptotically normal conditionals, 275 compactness and differentiability, 236 data sets, 244 finite VC dimension for gradients, 269 minima of eg , 267 proximity to minima, 268 training termination, 239 asymptotic approximation to normal, 247 Auer, P., et al., 134, 238 axon, 2, 11 backfitting, 232 backpropagation, 10, 139 backward pass, 141 calculational effort, 144 delta, 140 forward pass, 141 backward recursion, 141 Balasubramanian, V., 209, 212 Banach space, 117
Barron, A., 112, 116, 221, 228, 230 Barron, A., et al., 228 Barron, R., 9 Bartlett, P., 220, 236, 264, 265 Battiti, R., 135, 151, 156, 173, 176 Baum, E., 61, 63 Baum, E., et al., 69 Baumeister, J., 215 Bayesian methods, 206 Bayesian posterior, 207 likelihood, 208, 209 model as a hypothesis, 207 posterior, 207 posterior expectation, 212 prior, 208 prior probability, 227 Bellman, R., 117 Berger, J., 207 Bernardo, J., et al., 207 Bernstein’s inequality, 249 Berry–Esseen bounds, 246 bias, 57 binomial coefficients, 25, 44 biological neurons, 2 Bishop, C., 3, 130, 135, 156, 175, 207 Block, H., et al., 9 Blum, A., et al., 176, 177 Bonferroni inequalities, 257 Boolean function, 42, 60 AND, 60 complexity of LTU-based representation, 63 OR, 60 XOR, 60 Boolean functions, 3 Boolean-valued inputs, 3 boosting, 178 bootstrap, 251, 280 estimator, 255 inappropriate for neural networks, 255 plug-in principle, 255 sample, 255
bounded variation, 91 brain, 2, 6 human, 4 Brent, R., 65 Buntine, W., et al., 175
capacity LTU, 28 Carroll, S., et al., 120 Caruana, R., 82 cascade correlation algorithm (CCA), 232 Chaitin, G., 221 Chebychev bound, 248 Chernoff bound, 248 Cholesky decomposition, 175 Church’s thesis, 222 Churchland, P., 5 Chvatal, V., 22 circuit complexity, 3 coin-tossing interpretation of D(n, d), 28 combinatorial optimizers, 10 compact set, 124 compositions of functions, 85 computational ability organisms, 5 condition number, 174 condition number of matrix, 156 conditional expectation, 6 connectionism, 2, 5 Constantine, K., 49 content addressable memory, 56 control, 8 convex hull of family of functions, 114 convexity definitions, 20 Cosnard, 72 Cotter, N., 97 Cotter, N., et al., 96 Cover, T., 24, 27, 41, 68 Cover, T., et al., 221, 227 Cowan, J., 9 credit assignment problem, 135
cross-validation, 155, 221, 251, 252, 280 bias reduction, 253 hold-one-out, 252 variance, 254 cross-validation error, 231 curse of dimensionality, 115, 117, 119, 130 Cybenko, G., 99, 103, 115 Dahmen, W., et al., 90 Darken, C., et al., 161 Das Gupta, B., et al., 176 data test, 244, 245 training, 243 validation, 243 Decatur, S., 8 decision tree, binary, 65 Definition H-conjugacy, 165 γ-shattering, 264 1HL class, 82 closure of set of functions with respect to a metric, 92 compact, 124 convex combination, 20 convex hull, 21 convex set, 21 dense in, 92 discriminatory σ, 101 effective N -complexity, 225 feedforward network architecture, 56 general position, 27 generalization error, 240 growth function mN , 66 Hausdorff space, 124 Hilbert space, 113 Kolmogorov complexity, 224 linear separability, 20 linear vector space, 121 metric, 121 norm-based, 123
multivariable spline function, 111 network architecture, 55 non-negative definite matrix, 178 norm, 122 optimal separation in nonseparable case, 37 positive definite matrix, 178 pseudometric, 122 pseudonorm, 122 regularizing operator, 217 set M of node functions, 103 shattering of sets, 67 sigmoidal function, 102 sigmoidal node of order k, 111 stabilizing functional, 218 subalgebra, 97 topological space, 123 VC capacity, 67 Dembo, A., 56 dendrite, 2, 11 Denker, J., et al., 209 Dennis, J., et al., 152 DeVore, R., et al., 112, 117–120 Devroye, L., 251 Devroye, L., et al., 66, 68, 78, 105, 107, 255, 259, 261 dichotomy, 18, 23 linearly separable, 24 homogeneous, 26 Dodier, R., 231 Donahue, M., et al., 112, 117, 120 Drucker, H. et al., 178 duality, 22 dynamics, 3 Efron, B., 251, 253, 255 Efron, B., et al., 251, 255 email address of author, vii emergentism, 2, 5, 53 empirical distribution, 255 empirical error, 245 expectation, 247 learning curve bounds, 279
learning curve distribution, 280 Taylor’s series, 279 variance, 247 enumeration enumeration of dichotomies, 104 error function, 129 mapping possibility, 130 error surface, 129 multiple minima, 134 quadratic model, 132 qualitative characteristics, 130 Taylor’s series, 132 Fahlman, S., et al., 232 fat-shattering dimension, 264 feedback stabilization, 109 Fefferman, C., 87 Feller, W., 246, 257 FFNN, 56, 136 Fine, T., 3 Fine, T. et al., 266 Fisher information matrix, 213 Fletcher, R., 135, 151, 168, 173 flops, 143 forecasting, 8 forward propagation, 138 Fourier transform, 116 Frean, M., 38 Fukumizu, K., 267 Funahashi, K., 94 function approximation neural network as Riemann sum, 94 1HL via adaptation of Stone-Weierstrass, 99 approaches, 93 bounded variation, 91 complexity, 112 continuous, 91 continuous functions, 101 continuous functions by 1HL with exponential nodes, 98
continuous functions via 1HL with nonpolynomial nodes, 103 integrable functions by 1HL, 105 LTU nets, 76 nonlinear p-width, 118 rectangular pulse, 91 simultaneous approximation to its derivatives, 108 simultaneously to derivatives, 107 step function, 90 sup-norm, 123 trigonometric, 91 function representation polynomials, 92 closure operations, 87 differentiability, 88 interesting families, 84 issues, 15 LTU-based complexity, 63 LTU-based implementation, 58 nonunicity of neural network representation, 86 regions of constancy, 88 Gallant, S., 37, 64 general position, 259 generalization error, 231 e_g(w), 240 ê_g of network found by A, 241 Bayes e_B, 241 best network e^0_g, 240 bias, 241 Chebychev bound, 248 CLT estimate from test data, 247 confidence bound in terms of empirical error, 260 generalized Chebychev bound, 249 learning curve, 277 learning curve bounds, 278
learning curve distribution, 278 limiting from large test data, 246 network of least empirical error e^*_g, 241 Turmon bound via degrees of freedom, 263 Gibson, G., 75, 85 graceful degradation, 5 gradient empirical error, 238 Taylor’s series, 268, 272 gradient vector, 89, 131 calculation backpropagation, 139 chain rule, 138 graph, 2, 3 acyclic, 55 Graubard, S., 53 greedy algorithms, 120 Grossman, T., et al., 65 Guyon, I. et al., 8, 244 Guyon, I., et al., 217 Hadamard product, 141 Hagan, M., et al., 135, 174 Hall, P., 255 Halmos, P., 18 Hamming distance, 225 hardware implementation, 6 Hashem, S., 134, 177 Hassibi, B., et al., 175, 233, 234 Hassoun, M., 3, 9, 43 Hastie, T., et al., 221, 232, 251, 252 Hausdorff space, 124 Haussler, D., et al., 266, 281 Haykin, S., 3, 9, 43 Hebb, D., 9 Hecht-Nielsen, R., 93 Hertz, J., et al., 3, 19 Hessian calculational methods, 175 Hessian matrix, 89, 131, 134, 234 computation, 150
empirical error, 238 Hilbert space, 112 Hilbert’s 13th problem, 95 history, 9 Hoeffding Inequality, 243 Hoeffding’s inequality, 248 Hoffgen, et al., 37 Hopfield networks, 9 Hopfield, J., 2, 9 Hopfield, J., et al., 10, 56 Hornik, K., 108 Hornik, K., et al., 97, 99, 103, 108 Huffman coding, 227 hype, 10 hyperbolic tangent node, 13 hyperplane, 18, 19 nonunique specification, 22 i.i.d., 28 inconsistent hyperplane, 75 index of resolvability, 230, 241 initialization of algorithms, 146 inner product, 112 input space, 82 interconnections, 2 internal representations, 65 Jackel, L. et al., 8 Jacobian, 173 Jacobs, R., et al., 178 Jones, L., 112, 120, 176 journals, 3 Judd, S., 176 Karpinski, M., et al., 105 Katsuura, H., et al., 96 Kearns, M., 155, 244, 245, 251 Kearns, M., et al., 66, 107, 264 Kerlirzin, P., et al., 6 kernel, 39 Knerr, S., et al., 7 Koiran, P., 94 Koiran, P., et al., 105, 259 Kolen, J., 146 Kolmogorov complexity, 222, 224
effective approximation, 224 not effectively computable, 224 Kolmogorov, A., 221 Kowalczyk, A., 72 Kraft inequality, 227 Kramer, A., et al., 168 Kurkova, V., 96 personal communication, 110 Kurkova, V., et al., 87 Lawrence, S. et al., 8 learning curve, 277 LeCun, Y. et al., 8 LeCun, Y., et al., 7, 85, 147, 150, 233, 234 Lee, W., et al., 112, 115, 124 Lehmann, E., 271, 273 Lemma continuity of inverses, 217 Cover, 24, 45 Cybenko, 102 empirical error analytics, 238 generalization error analytics, 240 gradient discrepancies, 270 minimum of a quadratic, 133 network functions, 237 Nilsson, 24, 45 VC upper bound for LTU nets, 70 Leshno, A., et al., 102, 106 Li, K-C., 254 Li, M., et al., 221 Lindley, D., 178 line search, 149 approximate, 151 cubic interpolation, 153 quadratic interpolation, 152 linear convergence, 169 linear discriminant function, 42 linear separability, 19 homogeneous, 20 size of largest set, 29 strict, 20 linear subspace, 100
linear threshold unit capacity, 28 linear threshold unit or gate, 17 Lippmann, R., 73 Lipschitz function, 91 Littmann, E., et al., 232 Liu, Y., 231 loading problem, 135 NP-complete, 176 Loeve, M., 255, 260, 282 logistic node, 13 Lorentz, G., 95 LTU, 17 Luenberger, D., 135, 151–153, 156, 169, 173 Maass, W., 68, 71 MacKay, D., 207, 209 Maron, O., et al., 178 MATLAB, 3 MATLAB program bisectreg1 for bisection line search with regularization, 190 CG1 conjugate gradient training for 1HL, 187 cubic1 for cubic interp. line search with regularization, 190 grad1 for backpropagation calculation of 1HL gradients, 182 grd1 single parameter vector version of grad1, 183 jacobian1 for Jacobian calculation for 1HL, 194 LM1 is Levenberg-Marquardt for 1HL, 195 netout1 for 1HL network response, 181 ntout1 single parameter vector version of netout1, 183
Perceptron Training Algorithm, 50 QN1 is quasi-Newton for a 1HL network, 192 quasi-Newton, multiple outputs, for 1HL, 197 sandwich construction, 77 SD1 steepest descent training for 1HL, 184 ss1 single parameter vector version of sse1, 184 sse1 for sum-squared error, 182 training set format, 181 matrices, 178 maximum likelihood, 228 maximum likelihood estimate, 215 May, G., 8 McCulloch, W., 9 McCulloch, W., et al., 10 McEliece, R., et al., 56 McLachlan, G., 251 Mead, C., 6 measurable function, 104 memory, 4 associative, 10 content addressable, 10 methodology for neural network design, 7 methodology for nonlinear systems design, 1 metric, 121 Mhaskar, M., 119, 120 Mhaskar, M., et al., 111, 119, 120 Minai, A., et al., 6, 14 minima empirical error, 238 generalization error, 240 Minsky, M., et al., 9, 36 model selection, 204 Monte Carlo simulation, 212 Moody, E., 221 Moody, J., 215 motivation, 4 Mukherjee, S., 134, 178, 281 Mukherjee, S., et al., 266
Muller, P., et al., 207, 212 multilayer perceptron, 3, 56 multiprocessor computing, 6 Murata, N., 271, 281 Muroga, S., 59 Møller, M., 135, 151, 167 Neal, R., 207, 212 negative-binomial distribution, 29 network representation, 54 neural network emulation, 6 operating equations, 137 feedback, 9 feedforward, 3 graph, 54 history, 1 Hopfield networks, 9, 56 notation, 57, 86, 136 recurrent, 9 neurobiology, 4 neuron, 2 neuronal modeling, 11 mathematical, 13 neuronal stimuli, 12 neuroscience, 2 Newton’s method, 133 Nilsson, N., 19, 24, 46, 60, 61 node, 2, 54 bounded and differentiable, 237 four types, 54 hidden, 57 radial basis function, 43 sigmoidal, 43 nonlinear p-width, 118 nonlinearity, 3 nonparametric methods, 1, 7 norm, 122 O’Cinneide, C., 282 Occam’s Razor, 221, 226 open ball, 123 optical character recognition, 7 optimal nonlinear processing, 6 organization of monograph, 14
overfitting, 244 Paass, G., 255 Pagels, H., 53 Parallel Distributed Processing Group, 9 parallelism massive, 5 parameter space compact and convex, 236 partial recursive function, 222 partitioning of inputs into unions of polyhedra, 71 enumeration of connected regions, 72 hardware complexity, 72 unions of convex polyhedra, 73 pattern classification, 8 reduction to binary case, 59 alphanumeric, 8 image processing architecture, 85 speech recognition, 8 Pearlmutter, B., 175 penalty, 215, 226, 232 perceptron, 9, 17 augmentation of inputs, 39 organizing questions, 19 Perceptron Training Algorithm, 31, 50 alternatives, 35, 37 nonseparable case, 36 probabilistic, 38 thermal, 38 Pitts, W., 9 Platt, J., et al., 6 Plutowski, M., et al., 254 Poggio, T., et al., 215 Poisson clumping heuristic, 263 Pollard, D., 78, 248, 262, 283 Pontil, M., et al., 41 preprocessing nonlinear, 146 prerequisites, 2
Press, W. et al., 135, 152, 173, 176 probabilistic coding complexity, 226 probabilistic convergence modes, 246 pruning networks, 233 PTA, see algorithms, 22 radial basis function, 43 regression model, 211 regularization, 130, 174, 215 inverse problems, 215 choice of Lagrange multiplier, 220 choice of penalty term, 220 inverse problems, 216 Lagrange multiplier formulation, 219 regularizing operator, 217 smoothing functional, 219, 221 stability, 217 stabilizing functional, 218 weight decay, 220 ridge function, 43 see Dahmen et al., 90 Ripley, B., 207, 271, 276, 280, 281 Rissanen, J., 221, 226–228 robust, 7 robustness of neural network, 5 Rockafellar, W., 20, 21 Rosenblatt, F., 9, 17 Rowley, H. et al., 8 Roychowdhury, V., et al., 3, 37, 54 Royden, H., 2, 82 Rudin, W., 2, 82, 100, 103 Rumelhart, D., et al., 9, 135, 136 Safire, W., 221 Sauer, N., 78 Schaffer, A., et al., 56 Schlafli, 27 Schwartz, G., 221 Schwartz, J., 53 sensory signal processing, 6
Shibata, R., 221 Shibata, T., et al., 6 Shonkwiler, R., 74 Shustorovich, A. et al., 8 Shynk, R., et al., 36 sigmoidal node, 13, 90 Simmons, G., 82, 100 simplex algorithm, 31 single hidden layer network see 1HL, 82 Siu, K., et al., 31, 54, 60, 64 Sobolev space, 118 Solomonoff, R., 221 Sommerville, D., 18 Sontag, E., 68, 85, 109, 259 spline function, 111 Sprecher, D., 95 standardization of inputs, 145 stationary points empirical error, 238 generalization error, 240 Stevenson, M., et al., 6 stochastic complexity, 226 Stone, M., 251 Stork, D., 142 structural risk minimization, 217 sup-norm, 122 support vector machine, 23, 40 Sussman, H., 87 synapse, 2 Takiyama, R., 19 Talagrand, M., 259, 263 target space, 237 Theorem a.s. parameter convergence, 271 H-conjugacy, 166 Bartlett on evaluating fat-shattering dimension, 265 Bartlett on generalization determined by l1 norm of output weights, 265 Bartlett on role of fat-shattering dimension, 264
Bernstein’s Inequality, 249 Boolean function implementation, 60 bounds on constant learning rate, 158 central limit, 246 class of regularizing operators, 218 complexity of iterative approximation, 115, 124 Complexity regularization of Barron, 230 conditional asymptotic normality, 275 conjugacy and descent, 168 constraints on Lagrange multiplier, 219 Cybenko approximating continuous functions, 101 approximating integrable functions, 105 discrepancy between e_g and E_{T_n}, 261 efficiency of conjugate gradients, 166 empirical error learning curve bounds, 279 empirical error learning curve distribution, 280 establishing conjugacy, 167 proof, 179 function approximation via nonpolynomial node functions, 103 Gibson, 75 Hahn-Banach, 21, 100 Heine-Borel, 124 Hoeffding, 248, 257, 259 Hornik simultaneous approximation to derivatives, 108 Koiran and Sontag, 105 Kolmogorov/Sprecher, 95 learning curve bounds, 278
learning curve distribution, 278 Leshno et al. 1HL approximation of integrable functions, 106 continuous functions, 103 linearly separable Boolean functions, 43 LTU network VC capacity upper bound, 71 Lusin, 103 Maass, 71 Mercer, 39 multidimensional central limit, 273 necessity for general position, 27 non-negative definite matrices, 178 number of linearly separable dichotomies, 27, 48 of the alternative, 21 partitioning into finite unions of polyhedra, 74 perceptron training convergence, 32 positive definite matrices, 178 proof of discrepancy between e_g and E_{T_n}, 282 Riesz representation, 100 sandwich construction, 61 separating hyperplane, 21 Shonkwiler, 74 Sontag 1HL approximation to inverses, 109 2HL approximation to inverses, 110 Stone-Weierstrass, 97 symmetrization inequalities, 282 Talagrand, 265 Talagrand on uniform bound via VC dimension, 259 Taylor’s series, 89 training is NP-hard, 177
upper and lower bounds to parameter error, 275 upper bound to L, 25 VC upper bound of Vapnik, Sauer, 78 VC upper bound of Vapnik/Sauer, 67 threshold augmentation, 18 embedding, 18 Tikhonov, A., et al., 215, 217–219 topology, 123 training batch, 147, 239 online, 147, 239 training objective, 130 training set, 18, 27, 81, 129, 236 Turing machine effectively computable function, 223 countably many, 222 Gödel numbering, 222 halting problem, 223 universal (UTM), 223 Turing machine (TM), 222 Turmon, M., 262 Turmon, M., et al., 262, 263 uniform bound, 256 uniform bounds via indicator functions, 258 Valiant, L., 66 validation, 231, 243 validation error, 243 Vapnik, V., 39, 40, 42, 66, 68, 78, 107, 215, 217, 220, 255, 257 Vapnik-Chervonenkis bounds in practice, 280 capacity, 66 capacity examples, 68 capacity upper bound, 67, 259 dimension, 66 fat-shattering dimension, 264
VC abbreviation, 66 vector space linear, 121 Vidyasagar, M., 66, 107, 177, 259 Wahba, G., 252, 254 web site of author, vii Weigend, A. et al., 8 Weigend, A., et al., 220 weight decay, 220 weights, 2 weights as a vector, 129, 147 weights convergence, 271, 275 asymptotic distribution, 276 Wenocur, R., et al., 68 West, J., 11 West, J., et al., 207 White, H., 271, 276 Widrow, B., 9 Winder, R., 3, 43, 59 Wolpert, D., 207 Wong, C.-F., 146 Wynn, H. et al., 257 Yuan, J., et al., 232 Yuan, J.-L., 8 Yukich, J., et al., 108, 112, 117 Zavaliagkos, G. et al., 8