Springer Undergraduate Texts in Mathematics and Technology
For other titles published in this series, go to http://www.springer.com/series/7438
Series Editors: Jonathan M. Borwein, Helge Holden
Editorial Board: Lisa Goldberg, Armin Iske, Palle E.T. Jorgensen, Stephen M. Robinson
Wilhelm Forst
Dieter Hoffmann
Optimization— Theory and Practice
Wilhelm Forst, Universität Ulm, Fak. Mathematik und Wirtschaftswissenschaften, Inst. Numerische Mathematik, Helmholtzstr. 18, 89069 Ulm, Germany
Dieter Hoffmann, Universität Konstanz, FB Mathematik und Statistik, Fach 198, 78457 Konstanz, Germany

Series Editors
Jonathan M. Borwein, Computer Assisted Research Mathematics and its Applications (CARMA), School of Mathematical & Physical Sciences, University of Newcastle, Callaghan NSW 2308, Australia
Helge Holden, Department of Mathematical Sciences, Norwegian University of Science and Technology, Alfred Getz vei 1, NO-7491 Trondheim, Norway
MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information, please contact: The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA, 01760-2098 USA, Web: www.mathworks.com.
Maple is a registered trademark of Maplesoft. For Maple™ product information, please contact: Maplesoft, 615 Kumpf Drive, Waterloo, ON, Canada, N2V 1K8, Email: info@maplesoft.com.

ISSN 1867-5506, e-ISSN 1867-5514
ISBN 978-0-387-78976-7, e-ISBN 978-0-387-78977-4
DOI 10.1007/978-0-387-78977-4
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010930977
Mathematics Subject Classification (2010): Primary 90-01; Secondary 68W30, 90C30, 90C46, 90C51

© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To my children Martin, Christina and Christopher
Wilhelm Forst

To my grandsons Léon, Étienne, Gabriel, Nicolas and Luca, who may some day want to read this book
Dieter Hoffmann
Contents

Preface

1 Introduction
  1.1 Examples of Optimization Problems
        Overdetermined Systems of Linear Equations
        Nonlinear Least Squares Problems
        Chebyshev Approximation
        Facility Location Problems
        Standard Form of Optimization Problems
        Penalty Methods
  1.2 Historical Overview
        Optimization Problems in Antiquity
        First Heroic Era: Development of Calculus
        Second Heroic Era: Discovery of Simplex Algorithm
        Leonid Khachiyan’s Breakthrough
  Exercises

2 Optimality Conditions
  2.0 Introduction
  2.1 Convex Sets, Inequalities
  2.2 Local First-Order Optimality Conditions
        Karush–Kuhn–Tucker Conditions
        Convex Functions
        Constraint Qualifications
        Convex Optimization Problems
  2.3 Local Second-Order Optimality Conditions
  2.4 Duality
        Lagrange Dual Problem
        Geometric Interpretation
        Saddlepoints and Duality
        Perturbation and Sensitivity Analysis
        Economic Interpretation of Duality
        Strong Duality
  Exercises

3 Unconstrained Optimization Problems
  3.0 Logistic Regression
  3.1 Elementary Search and Localization Methods
        The Nelder and Mead Polytope Method
        The Shor Ellipsoid Method
  3.2 Descent Methods with Line Search
        Coordinatewise Descent Methods
        Gradient Method
        Kantorovich’s Inequality
        Requirements on the Step Size Selection
  3.3 Trust Region Methods
        Levenberg–Marquardt Trajectory
        Powell’s Dogleg Trajectory
        Least Squares Problems
  3.4 Conjugate Gradient Method
        Generation of A-Conjugate Directions
        Rate of Convergence
        The CG-Method in the Nonquadratic Case
  3.5 Quasi-Newton Methods
        Least Change Principle
        More General Quasi-Newton Methods
        Quasi-Newton Methods in the Nonquadratic Case
  Exercises

4 Linearly Constrained Optimization Problems
  4.1 Linear Optimization
        The Revised Simplex Method
        Numerical Realization of the Method
        Calculation of a Feasible Basis
        The Active Set Method
  4.2 Quadratic Optimization
        The Barankin–Dorfman Existence Theorem
        The Active Set Method
        Minimization Subject to Linear Equality Constraints
        The Goldfarb–Idnani Method
  4.3 Projection Methods
        Zoutendijk’s Method
        Projected Gradient Method
        Reduced Gradient Method
        Preview of SQP Methods
        Wilson’s Lagrange–Newton Method
  Exercises

5 Nonlinearly Constrained Optimization Problems
  5.1 Penalty Methods
        Classic Penalty Methods (Exterior Penalty Methods)
        Barrier Methods (Interior Penalty Methods)
  5.2 SQP Methods
        Lagrange–Newton Method
        Fletcher’s S$\ell_1$QP Method
  Exercises

6 Interior-Point Methods for Linear Optimization
  6.1 Linear Optimization II
        The Duality Theorem
        The Interior-Point Condition
  6.2 The Central Path
        Convergence of the Central Path
  6.3 Newton’s Method for the Primal–Dual System
  6.4 A Primal–Dual Framework
  6.5 Neighborhoods of the Central Path
        A Closer Look at These Neighborhoods
  6.6 A Short-Step Path-Following Algorithm
  6.7 The Mizuno–Todd–Ye Predictor-Corrector Method
  6.8 The Long-Step Path-Following Algorithm
  6.9 The Mehrotra Predictor-Corrector Method
  6.10 A Short Look at Interior-Point Methods for Quadratic Optimization
  Exercises

7 Semidefinite Optimization
  7.1 Background and Motivation
        Basics and Notations
        Primal Problem and Dual Problem
  7.2 Selected Special Cases
        Linear Optimization and Duality Complications
        Second-Order Cone Programming
  7.3 The S-Procedure
        Minimal Enclosing Ellipsoid of Ellipsoids
  7.4 The Function log ◦ det
  7.5 Path-Following Methods
        Primal–Dual System
        Barrier Functions
        Central Path
  7.6 Applying Newton’s Method
  7.7 How to Solve SDO Problems?
  7.8 Icing on the Cake: Pattern Separation via Ellipsoids
  Exercises

8 Global Optimization
  8.1 Introduction
  8.2 Branch and Bound Methods
  8.3 Cutting Plane Methods
        Cutting Plane Algorithm by Kelley
        Concavity Cuts
  Exercises

Appendices
  A A Second Look at the Constraint Qualifications
        The Linearized Problem
        Correlation to the Constraint Qualifications
  B The Fritz John Condition
  C Optimization Software Tools for Teaching and Learning
        Matlab® Optimization Toolbox
        SeDuMi: An Introduction by Examples
        Maple® Optimization Tools

Bibliography
Index of Symbols
Subject Index
Preface
This self-contained book on optimization is designed to serve a variety of purposes. It will prove useful both as a textbook for undergraduate and first-year graduate-level courses as well as a reference book for mathematicians, engineers and applied scientists interested in a careful exposition of this fascinating and useful branch of mathematics. Students, mathematicians and practitioners alike can profit from an approach which treats central topics of optimization in a concise, comprehensive and modern way. The book is fresh in conception and lucid in style and will appeal to anyone who has a genuine interest in optimization. The mutually beneficial interaction of theory and practice is presented in a stimulating manner. As Johann Wolfgang von Goethe states: “All theory, dear friend, is gray, but the golden tree of life springs ever green.”

Optimization is not only important in its own right but nowadays forms an integral part of a great number of applied sciences such as operations research, management science, economics and finance, and all branches of math-oriented engineering. Constrained optimization models are used in numerous areas of application and are probably the most widely used mathematical models in operations research and management science.

This book is the outgrowth of many years of teaching optimization in the mathematics departments of the Universities of Konstanz and Ulm (Germany) and W.F.’s teaching experiences with a first English version of parts of this book during his stay as a guest professor at the Universidad Nacional de Trujillo (Peru).

As the title suggests, one major aim of our book is to give a modern and well-balanced treatment of the subject by not only focusing on theory but also including algorithms and instructive examples from different areas of application. We put particular emphasis on presenting the material in a systematic and mathematically convincing way.
We introduce theory and methods at an elementary level but with an accurate and concise formulation of the theoretical tools needed. We phrase ideas and theorems clearly and succinctly, and complement them with detailed proofs and explanations of the ideas behind the proofs. We are convinced that many readers will find our book easy to read and value its accessibility. Since it is intended as an introduction, the mathematical prerequisites have been kept to a minimum, with only some knowledge of multidimensional calculus, linear algebra and basic numerical methods needed to fully understand the concepts presented in the book. For example, the reader should know a little about the convergence rate of iteration methods and have a basic understanding of the condition number for matrices, which is a measure of the sensitivity of the error in the solution to a system of linear equations relative to changes in the inputs. From the wide range of material that could be discussed in optimization lectures, we have selected aspects that every student interested in optimization should know about, as well as some more advanced issues. It is, however, not expected that the whole book will be ‘covered’ in one term. In practice, we have found that a good half is manageable. This book provides a systematic, thorough and insightful discussion of optimization of continuous nonlinear problems. It contains in one volume of reasonable size a very clear presentation of key ideas and basic techniques. The way they are introduced, discussed and illustrated is designed to emphasize the connections between the various concepts. In addition, we have included several features which we believe will greatly help the readers to get the most out of this book: First of all, every method considered is motivated and explained. Abstract statements are made into working knowledge with the help of detailed explanations on how to proceed in practice. 
Benjamin Franklin already knew: “A good example is the best sermon!” Therefore, the text offers a rich collection of detailed analytical and numerical examples which bring to life central concepts from diverse areas of science and applications. These selected examples have intentionally often been kept simple so that they can still be verified by hand easily. Counterexamples are presented whenever we want to show that certain assumptions cannot simply be dropped. Additionally, the reader will find many elaborate two-colored graphics which help to facilitate understanding via geometric insight. Often the idea of a result has a simple geometric background — or, as the saying goes: “A picture is worth a thousand words!” Another feature of the book is that it presents the concepts and examples in a way that invites the readers to think for themselves and to critically assess
the observations and methods presented. In short, the book is written for the active reader.

Furthermore, we have also included more than one hundred additional exercises, which are partly supplemented by hints or Matlab®/Maple® code fragments. Here, the student has ample opportunity to practice concepts, statements and procedures, passing from routine problems to demanding extensions of the main text. In these exercises and particularly in appendix C, we will describe relevant features of the Matlab® Optimization Toolbox and demonstrate its use with selected examples.

Today each student uses a notebook computer the way we old guys used slide rules and log tables. Using a computer as a tool in teaching and learning allows one to concentrate on the essential ideas. Nowadays the various opportunities of current computer technology should be part of any lecture on optimization. Nevertheless, we would like to stress that neither knowledge of nor access to Matlab® or Maple® is really needed to fully understand this text. On the other hand, readers who follow the proposed way will benefit greatly from our approach.

Since many of the results presented here are nowadays classical (the novelty lies in the arrangement and streamlined presentation), we have not attempted to document the origin of every item. Some of the sources which served as valuable stimuli for our lectures years ago have since fallen into oblivion. We are, however, aware of the fact that it took many brilliant men numerous years to develop what we teach in one term. In section 1.2, we have therefore compiled a historical survey to give our readers an idea of how optimization has developed through the centuries to an important branch of mathematics. Interior-point methods, for example, show that this process is still going on with ever new and exciting results!

We are certain that our selection of topics and their presentation will appeal to students, researchers and practitioners.
The emphasis is on insight, not on giving the latest refinements of every method. We introduce the reader to optimization in a gradual and ‘digestible’ manner. Much of the material is ‘classical’ but streamlined proofs are given here in many instances. The systematic organization, structure and clarity of our book together with its insightful illustrations and examples make it an ideal introductory textbook. Written in a concise and straightforward style, this book opens the door to one of the most fascinating and useful branches of mathematics.

The book is divided into eight chapters. Here is a sketch of the content:

In Chapter 1, Examples of Optimization Problems are given. It is, however, not only intended as a general motivating introduction, but also as a demonstration of the workings and uses of mathematical tools like Matlab® and Maple® and thus serves to encourage our readers not to neglect practice over theory. In particular the graphics features of these tools are important and very helpful in many instances. In addition, the first chapter includes (as already mentioned) a detailed Historical Overview.

Chapter 2 on Optimality Conditions studies the key properties of constrained problems and necessary and sufficient optimality conditions for them. As an introduction, we give a summary of the ‘classic’ results for unconstrained and equality constrained problems. In order to formulate optimality conditions for problems with inequality constraints, some simple aids from Convex Analysis are introduced. The Lagrange multiplier rule is generalized by the Karush–Kuhn–Tucker conditions. This core chapter presents the essential tools via a geometrical point of view using cones.

Chapter 3 introduces basics on Unconstrained Optimization Problems. The corresponding methods seek a local minimum (or maximum) in the absence of restrictions. Optimality criteria are studied and above all algorithmic methods for a wide variety of problems. Even though most optimization problems in ‘real life’ have restrictions to be satisfied, the study of unconstrained problems is useful for two reasons: First, they occur directly in some applications and are thus important in their own right. Second, unconstrained problems often originate as a result of transformations of constrained optimization problems. Some methods, for example, solve a general problem by converting it into a sequence of unconstrained problems.

Chapter 4 presents Linearly Constrained Optimization Problems. Here the problems have general objective functions but linear constraints. Section 4.1 gives a short introduction to linear optimization which covers the simplest (yet still highly important) kinds of constrained optimization problems, where the objective function and all constraints are linear.
We consider two methods as examples of how these types of problems can be solved: the revised simplex algorithm and the active set method. Quadratic problems, which we treat in section 4.2, are linearly constrained with a quadratic objective function. Quadratic optimization is an important field in its own right, since it forms the basis of several algorithms for nonlinearly constrained problems. This section contains, for example, the Barankin–Dorfman existence theorem as well as a lucid description of the Goldfarb–Idnani method. In contrast to the active set method this is a dual method which has the advantage that it does not need a primally feasible starting point. In section 4.3, we give an outline of selected projection methods. Besides classical gradient-based methods we present a sequential quadratic feasible point method. In the next chapter, Nonlinearly Constrained Optimization Problems are treated. In section 5.1, a constrained optimization problem is replaced by an unconstrained one. There are two different approaches to this: In exterior penalty methods a term is added to the objective function which ‘penalizes’ a violation of constraints. In interior penalty methods a barrier term prevents
points from leaving the interior of the feasible region. In section 5.2, sequential quadratic programming methods are introduced. The strategy here is to convert a usually nonlinear problem into a sequence of quadratic optimization problems which are easier to solve.

Chapter 6 presents Interior-Point Methods for Linear Optimization. The development of the last 30 years has been greatly influenced by the aftermath of a “scientific earthquake” triggered in 1979 by the findings of Khachiyan (1952–2005) and in 1984 by those of Karmarkar. Efficient interior-point methods have in the meantime been applied to large classes of nonlinear optimization problems and are still topics of current research.

Chapter 7 treats basics of Semidefinite Optimization. This type of optimization differs from linear optimization in that it deals with problems over the cone $S^n_+$ of symmetric positive semidefinite matrices instead of nonnegative vectors. It is a branch of convex optimization and covers many practically useful problems. The wide range of uses has quickly made semidefinite optimization very popular, besides, of course, the fact that such problems can be solved efficiently via polynomially convergent interior-point methods, which had originally only been developed for linear optimization.

The last chapter on Global Optimization deals with the computation and characterization of global optimizers of (in general) nonlinear functions. It is an important task since many real-world questions lead to global rather than local problems. Although this chapter is relatively short, we are convinced that it will suffice to give our readers a useful and informative introduction to this fascinating topic with all the necessary mathematical precision.

The book concludes with three Appendices:

• A Second Look at the Constraint Qualifications
The Guignard constraint qualification in section 2.2 seems to somewhat come out of the blue. The correlation between the regularity condition and the corresponding ‘linearized’ problem discussed here makes the matter more transparent.

• The Fritz John Condition
This is, in a certain sense, a weaker predecessor of the Karush–Kuhn–Tucker conditions. No regularity condition is required, but in consequence an additional multiplier $\lambda_0 \ge 0$ must be attached to the objective function. In addition, an arbitrary number of constraints is possible. The striking picture on the title page also falls into this area, the Minimum Volume Enclosing Ellipsoid of Cologne Cathedral (in an abstract 3D model).

• Optimization Software for Teaching and Learning
This part of the appendix gives a short overview of the software for the main areas of application in our book. We do not speak about professional software in modeling and solving large optimization problems. In our courses, we use Matlab® and Maple® and optimization tools like SeDuMi.
Each chapter starts with a summary, is divided into several sections and ends with numerous exercises which reinforce the material discussed in the chapter. Exercises are carefully chosen both in content and in difficulty to aid understanding and to promote mastery of the subject.

Acknowledgments

We would like to acknowledge many people who, knowingly or unknowingly, have helped us: Vita Rutka and Michael Lehn each held the tutorial accompanying the lecture for some terms and provided a number of interesting problems and instructive exercises which we have gladly included in our book. D.H. would like to thank Rainer Janssen for stimulating discussions and an oftentimes helpful look at the working conditions. We are especially grateful to Andreas Borchert for his valuable advice concerning our electronic communication between Konstanz and Ulm. Largely based on [Wri] and [Klerk], Julia Vogt and Claudia Lindauer wrote very good master’s theses under our supervision which formed the basis of chapters 6 and 7.

The book has also benefited from Markus Sigg’s careful readings. His comments and suggestions helped to revise a first preliminary version of the text. He also provided valuable assistance to D.H. with some computer problems. It has been our very good fortune to have had Julia Neumann who translated essential parts of the text and thereby contributed to opening the book to a broader audience. She displayed great skill, courage and intuition in turning (sometimes clumsy) German into good English.

The editorial staff of Springer, especially Ann Kostant and Elizabeth Loew, deserve thanks for their belief in this special project and the careful attention to our manuscript during the publication process. We would also like to thank the copyeditor who did an excellent job revising our text and providing helpful comments.
Lastly, we would like to say, particularly to experts: We would appreciate personal feedback and welcome any comments, recommendations, spottings of errors and suggestions for improvement. Commendations will, of course, also be accepted!

For additions, updates and Matlab® and Maple® sources, please refer to the publisher’s website for the book.
Ulm and Konstanz, October 2009
Wilhelm Forst, Dieter Hoffmann
Chapter 1
1 Introduction
1.1 Examples of Optimization Problems
      Overdetermined Systems of Linear Equations
      Nonlinear Least Squares Problems
      Chebyshev Approximation
      Facility Location Problems
      Standard Form of Optimization Problems
      Penalty Methods
1.2 Historical Overview
      Optimization Problems in Antiquity
      First Heroic Era: Development of Calculus
      Second Heroic Era: Discovery of Simplex Algorithm
      Leonid Khachiyan’s Breakthrough
Exercises

The general theory has naturally developed out of the study of special problems. It is therefore useful to get a first impression by looking at the ‘classic’ problems. We will have a first look at some elementary examples to get an idea of the kind of problems which will be stated more precisely and treated in more depth later on. Consequently, we will often not go into too much detail in this introductory chapter. Section 1.2 gives a historical survey of this relatively young discipline and points out that mathematics has at all times gained significant stimulations from the study of optimization problems.
1.1 Examples of Optimization Problems

Overdetermined Systems of Linear Equations

Let $m, n \in \mathbb{N}$ with $m > n$, $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. We consider the overdetermined linear system

(1)  $Ax = b \quad (x \in \mathbb{R}^n)$.
Generally it has no solution. However, if A is quadratic (m = n) and regular, it is well known that the equation Ax = b has a unique solution x = A−1 b . The overdetermined case often appears in practical applications, for instance, if we have more observations or measurements than given independent parameters. In this case, it is useful to minimize the ‘residual vector’ r := r(x) := b − Ax in some sense to be stated more precisely. In most cases we are satisfied with the minimization of Ax − b with respect to a given norm on Rm , and often state the problem in the shortened form Ax − b −→ min x
or in like manner. So one considers, for instance, the following alternative problems for (1):

(i) \[ \|Ax - b\|_2 \longrightarrow \min_{x} \]

(ii) \[ \|Ax - b\|_\infty = \max_{1 \le \mu \le m} \Bigl| \sum_{\nu=1}^{n} a_{\mu\nu} x_\nu - b_\mu \Bigr| \longrightarrow \min_{x} \]

or, more generally than (i),

(iii) for $1 \le p < \infty$: \[ \|Ax - b\|_p = \Bigl( \sum_{\mu=1}^{m} \Bigl| \sum_{\nu=1}^{n} a_{\mu\nu} x_\nu - b_\mu \Bigr|^p \Bigr)^{1/p} \longrightarrow \min_{x} \]

Ad (i): Linear Gauß Approximation Problem

We consider the functions $f_\mu$ defined by $f_\mu(x) := \sum_{\nu=1}^{n} a_{\mu\nu} x_\nu - b_\mu$ for $x \in \mathbb{R}^n$ and $\mu = 1, \ldots, m$. Then our problem is equivalent to the linear least squares problem

\[ F(x) := \frac{1}{2} \sum_{\mu=1}^{m} f_\mu(x)^2 = \frac{1}{2} \|Ax - b\|_2^2 \longrightarrow \min_{x}, \]

whose objective function is differentiable. Necessary and sufficient optimality conditions are given by the Gauss normal equations (cf. exercise 2)

\[ A^T A x = A^T b. \]

Ad (ii): Linear Chebyshev Approximation Problem
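This problem (initially with a nondifferentiable objective function) is a special minimax problem and therefore equivalent to the 'linear program'

\[ \eta \longrightarrow \min_{x \in \mathbb{R}^n,\ \eta \in \mathbb{R}} \]

with respect to the constraints

\[ \Bigl| \sum_{\nu=1}^{n} a_{\mu\nu} x_\nu - b_\mu \Bigr| \le \eta \qquad (\mu = 1, \ldots, m) \]

or alternatively

\[ -\eta \le \sum_{\nu=1}^{n} a_{\mu\nu} x_\nu - b_\mu \le \eta \qquad (\mu = 1, \ldots, m). \]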
Example 1 Consider the following overdetermined system of linear equations $Ax = b$:

> restart: with(linalg): x := vector(2):
  A := matrix(5,2,[1,1,1,-1,1,2,2,1,3,1]);
  b := vector([3,1,7,8,6]);

\[ A := \begin{bmatrix} 1 & 1 \\ 1 & -1 \\ 1 & 2 \\ 2 & 1 \\ 3 & 1 \end{bmatrix}, \qquad b := [3, 1, 7, 8, 6] \]

The Gauß approximation problem has the following normal equations:

> B := evalm(transpose(A)&*A): c := evalm(transpose(A)&*b):
  geneqns(B,x,c);

\[ \{ 16 x_1 + 7 x_2 = 45,\ 7 x_1 + 8 x_2 = 30 \} \]

These can be solved, for example, with the Maple® command linsolve:

> x := linsolve(B,c);

\[ x := \Bigl[ \frac{150}{79}, \frac{165}{79} \Bigr] \]

The command leastsqrs makes it even simpler:

> leastsqrs(A,b), evalf(%,4);

\[ \Bigl[ \frac{150}{79}, \frac{165}{79} \Bigr], \qquad [1.899, 2.089] \]

We get the following approximation error:

> evalm(A&*x-b); norm(%,2); evalf(%,4);

\[ \Bigl[ \frac{78}{79}, \frac{-94}{79}, \frac{-73}{79}, \frac{-167}{79}, \frac{141}{79} \Bigr], \qquad \frac{1}{79} \sqrt{68019}, \qquad 3.302 \]
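As a cross-check of the Maple session above, the following Python sketch forms the normal equations of Example 1 in exact rational arithmetic and solves them by Cramer's rule. It is only an illustration and not part of the original text; the helper names `normal_equations` and `solve_2x2` are our own, and only the standard library is assumed.

```python
from fractions import Fraction as Fr

# Data of Example 1
A = [[1, 1], [1, -1], [1, 2], [2, 1], [3, 1]]
b = [3, 1, 7, 8, 6]

def normal_equations(A, b):
    """Return the pair (A^T A, A^T b) with exact rational entries."""
    n = len(A[0])
    AtA = [[sum(Fr(row[i]) * row[j] for row in A) for j in range(n)]
           for i in range(n)]
    Atb = [sum(Fr(row[i]) * bi for row, bi in zip(A, b)) for i in range(n)]
    return AtA, Atb

def solve_2x2(M, c):
    """Solve the 2x2 system M x = c by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    x1 = (c[0] * M[1][1] - M[0][1] * c[1]) / det
    x2 = (M[0][0] * c[1] - c[0] * M[1][0]) / det
    return [x1, x2]

AtA, Atb = normal_equations(A, b)      # [[16, 7], [7, 8]] and [45, 30]
x_ls = solve_2x2(AtA, Atb)             # least squares solution
residual = [sum(a * xi for a, xi in zip(row, x_ls)) - bi
            for row, bi in zip(A, b)]  # A x - b
res_norm_sq = sum(r * r for r in residual)

print("x =", x_ls)
print("squared residual norm =", res_norm_sq)
```

Cramer's rule is used here only because the system is 2x2; for larger problems one would not form $A^T A$ explicitly but use an orthogonal factorization.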
It is also possible to solve this problem with the help of Matlab®. To do that, we use the function lsqlin from the Optimization Toolbox and demonstrate some of the options which this command offers. In interactive mode, the following four commands have to be entered successively:

  A = [1, 1, 1, 2, 3; 1,-1, 2, 1, 1]';
  b = [3, 1, 7, 8, 6]';
  options = optimset('Display','iter','LargeScale','on');
  [x,resnorm,residual,exitflag,output,lambda] = ...
      lsqlin(A,b,[],[],[],[],[],[],[],options)

The last command does not fit into one line. We therefore continue it on the next line after typing "...". Distinctly simpler but also less informative is the Matlab® command x = A \ b. Since we have already discussed in detail how to solve this example with Maple®, we will abstain from giving the corresponding Matlab® output.

If we apply the maximum norm instead of the euclidean norm, the linear Chebyshev approximation problem is equivalent to a linear optimization problem whose objective function and 'restrictions' can be calculated with Maple® in the following way:

> ObjFunc := eta; x := vector(2):
  convert(evalm(A&*x-b),list);
  Restr := convert(map(z->(-eta<=z,z<=eta),%),set);
\[ ObjFunc := \eta \]
\[ [x_1 + x_2 - 3,\ x_1 - x_2 - 1,\ x_1 + 2 x_2 - 7,\ 2 x_1 + x_2 - 8,\ 3 x_1 + x_2 - 6] \]
\[ Restr := \{ -\eta \le x_1 + x_2 - 3 \le \eta,\ -\eta \le x_1 - x_2 - 1 \le \eta,\ -\eta \le x_1 + 2 x_2 - 7 \le \eta,\ -\eta \le 2 x_1 + x_2 - 8 \le \eta,\ -\eta \le 3 x_1 + x_2 - 6 \le \eta \} \]

We apply the simplex algorithm which is available under Maple®:

> with(simplex): minimize(ObjFunc,Restr); assign(%):

\[ \Bigl\{ x_2 = \frac{21}{8},\ \eta = \frac{15}{8},\ x_1 = \frac{7}{4} \Bigr\} \]

We verify that $\eta$ is the approximation error:

> evalm(A&*x-b); 'eta' = norm(%);   # maximum norm!

\[ \Bigl[ \frac{11}{8}, \frac{-15}{8}, 0, \frac{-15}{8}, \frac{15}{8} \Bigr], \qquad \eta = \frac{15}{8} \]
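The equivalence between the maximum norm problem and the linear program can be checked on the numbers of Example 1. The following Python sketch is an illustration only; the function names `max_residual` and `lp_rows` are our own. It evaluates the maximum norm residual at the Chebyshev solution, tests feasibility of $(x_1, x_2, \eta)$ for the two-sided inequalities written as rows $a^T z \le \beta$, and compares with the least squares solution.

```python
from fractions import Fraction as Fr

A = [[1, 1], [1, -1], [1, 2], [2, 1], [3, 1]]
b = [3, 1, 7, 8, 6]

def max_residual(x):
    """Maximum norm of A x - b."""
    return max(abs(sum(a * xi for a, xi in zip(row, x)) - bi)
               for row, bi in zip(A, b))

def lp_rows(A, b):
    """Rows (a, beta) of the inequalities a . (x1, x2, eta) <= beta."""
    upper = [(row + [-1], bi) for row, bi in zip(A, b)]    #  a^T x - eta <=  b
    lower = [([-a for a in row] + [-1], -bi)
             for row, bi in zip(A, b)]                     # -a^T x - eta <= -b
    return upper + lower

x_cheb = [Fr(7, 4), Fr(21, 8)]       # Chebyshev solution of Example 1
x_ls = [Fr(150, 79), Fr(165, 79)]    # least squares solution of Example 1

eta = max_residual(x_cheb)
z = x_cheb + [eta]
feasible = all(sum(c * zi for c, zi in zip(a, z)) <= beta
               for a, beta in lp_rows(A, b))

print("eta =", eta, " feasible:", feasible)
print("max residual of least squares solution =", max_residual(x_ls))
```

The least squares solution has a larger maximum norm residual than the Chebyshev solution, as one would expect from the two different optimality criteria.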
The linear optimization problem we have just looked at can be solved with the Matlab® function linprog, for example, by writing the following sequence of commands in a Matlab® script called Cheb_Linprog.m:

  f = [0; 0; 1]
  A = [ 1, 1, 1, 2, 3,-1,-1,-1,-2,-3;
        1,-1, 2, 1, 1,-1, 1,-2,-1,-1;
       -1,-1,-1,-1,-1,-1,-1,-1,-1,-1]'
  b = [3 1 7 8 6 -3 -1 -7 -8 -6]'
  disp('Medium Scale, Simplex:');
  options = ...
      optimset('Display','iter','LargeScale','off','Simplex','on');
  [x,fval,exitflag,output,lambda] = linprog(f,A,b,[],[],[],[],[],options)

This sequence of commands can then be executed in the Command Window with the command Cheb_Linprog.

As already mentioned, the linear Chebyshev approximation problem is a special minimax problem. These kinds of problems can be solved with the Matlab® function fminimax in the following way:

  x0 = [0; 0];   % Starting guess
  disp('Minimize absolute values:');
  options = optimset('GradObj','on','Display','iter', ...
      'MinAbsMax',5);   % Minimize absolute values
  [x,fval,maxfval,exitflag,output] = ...
      fminimax(@ObjFunc,x0,[],[],[],[],[],[],[],options)

fminimax requires an m-file for the objective function and its gradient:

  function [F,G] = ObjFunc(x)
  A = [1 1; 1 -1; 1 2; 2 1; 3 1];
  b = [3 1 7 8 6]';
  F = A*x-b;
  if nargout > 1
    G = A';
  end
Nonlinear Least Squares Problems

More generally, we consider optimization problems of the form

\[ F(x) := \frac{1}{2} \sum_{\mu=1}^{m} f_\mu(x)^2 \longrightarrow \min_{x}. \]

To simplify matters, we assume that the given functions $f_\mu : \mathbb{R}^n \to \mathbb{R}$ ($\mu = 1, \ldots, m$) are twice differentiable on $\mathbb{R}^n$. Problems of this kind appear, for example, when dealing with curve or data fitting. In addition, systems of nonlinear equations can also be reduced to this form and can then be solved by methods of nonlinear optimization.

Example 2 We wish to solve the following simple homogeneous system of nonlinear equations:

\[ x_1^3 - x_2 - 1 = 0, \qquad x_1^2 - x_2 = 0 \]

> restart: Digits := 5: f_1 := x_1^3-x_2-1; f_2 := x_1^2-x_2;

\[ f_1 := x_1^3 - x_2 - 1, \qquad f_2 := x_1^2 - x_2 \]

We calculate its solution numerically using the Maple® command fsolve:

> fsolve({f_1,f_2},{x_1,x_2});

\[ \{ x_1 = 1.4656,\ x_2 = 2.1479 \} \]

The following figure shows that the nonlinear system has a unique solution:

> with(plots): implicitplot({f_1,f_2},x_1=-1..3,x_2=-2..4);

[Figure: the curves f_1 = 0 and f_2 = 0 in the (x_1, x_2)-plane for -1 ≤ x_1 ≤ 3, -2 ≤ x_2 ≤ 4]
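The numerical solution of Example 2 can be reproduced with a few Newton steps. The Python sketch below is an illustration only: Newton's method is our choice here and not necessarily what fsolve does internally, and the starting point (2, 2) as well as the names `f`, `jacobian` and `newton` are assumptions.

```python
def f(x1, x2):
    """Left-hand sides of the nonlinear system of Example 2."""
    return (x1**3 - x2 - 1.0, x1**2 - x2)

def jacobian(x1, x2):
    """Jacobian of f, stored row by row."""
    return ((3.0 * x1**2, -1.0), (2.0 * x1, -1.0))

def newton(x1, x2, tol=1e-12, max_iter=50):
    """Newton iteration for f(x) = 0 with a 2x2 solve by Cramer's rule."""
    for _ in range(max_iter):
        f1, f2 = f(x1, x2)
        (a, b), (c, d) = jacobian(x1, x2)
        det = a * d - b * c
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        # Solve J * (d1, d2) = -(f1, f2)
        d1 = (-f1 * d + b * f2) / det
        d2 = (-a * f2 + c * f1) / det
        x1, x2 = x1 + d1, x2 + d2
        if abs(d1) + abs(d2) < tol:
            break
    return x1, x2

root = newton(2.0, 2.0)   # starting point chosen for illustration
print(root)
```

Since the Jacobian is singular for $x_1 = 0$ and $x_1 = 2/3$, a starting point away from these values is advisable.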
Next, we determine the local extreme points of the objective function corresponding to the underlying least squares problem:

> F := 1/2*(f_1^2+f_2^2);
  D1F := diff(F,x_1); D2F := diff(F,x_2);
  solve({D1F,D2F},{x_1,x_2}):
  B := proc(z) local u;
    # This Boolean function selects the real stationary points
    subs(z,[x_1,x_2]); u := map(evalc@Im,%);
    is(u[1]=0 and u[2]=0)
  end proc:
  sol := allvalues([%%]): rsol := select(B,sol):
  'rsol' = evalf(rsol);

\[ F := \tfrac{1}{2} (x_1^3 - x_2 - 1)^2 + \tfrac{1}{2} (x_1^2 - x_2)^2 \]
\[ D1F := 3 (x_1^3 - x_2 - 1)\, x_1^2 + 2 (x_1^2 - x_2)\, x_1, \qquad D2F := -x_1^3 + 2 x_2 + 1 - x_1^2 \]
\[ rsol = \{x_2 = -0.50000,\ x_1 = 0.\},\ \{x_2 = -0.12963,\ x_1 = 0.66667\},\ \{x_1 = 1.4655,\ x_2 = 2.1477\} \]

Obviously F has three (real) stationary points¹, of which rsol[3] is the solution to the system of nonlinear equations:

> subs(rsol[3],F): 'F' = simplify(%);

\[ F = 0 \]

For the other two solutions we get:

> subs(rsol[1],F), subs(rsol[2],F);

\[ \frac{1}{4}, \qquad \frac{961}{2916} \]

We evaluate the Hessian to decide of which kind the stationary points are:

> with(linalg): H := hessian(F,[x_1,x_2]):
  seq(subs(rsol[k],eval(H)),k=1..2),subs(evalf(rsol[3]),eval(H));

\[ \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}, \qquad \begin{bmatrix} \frac{65}{27} & \frac{-8}{3} \\ \frac{-8}{3} & 2 \end{bmatrix}, \qquad \begin{bmatrix} 50.101 & -9.3741 \\ -9.3741 & 2 \end{bmatrix} \]

> definite(%[1],positive_def),definite(%[2],positive_semidef),
  definite(%[2],negative_semidef),definite(%[3],positive_def);

¹ It is well known that these zeros of the first derivative are candidates for local extreme points.
\[ true,\ false,\ false,\ true \]

rsol[1] and rsol[3] give local minima, and rsol[2] is a saddle point because the Hessian is indefinite. We see that not every local minimum of F is a solution to the given nonlinear system. By means of the contour lines of F we finally visualize the geometric details:
> Points := pointplot([seq(subs(evalf(rsol[k]),[x_1,x_2]),k=1..3)],
    symbol=circle,symbolsize=17,color=black):
  Levels := contourplot(F,x_1=-1.5..2,x_2=-1.5..5,axes=box,
    levels=[0.1*k $ k=1..10],grid=[100,100]):
  display(Points, Levels);

[Figure: contour lines of F with the three stationary points marked by circles]
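The classification of the stationary points of Example 2 can also be verified by hand. The following Python sketch is an illustration only; the function names are our own and the third point is entered as a rounded decimal. It evaluates the gradient and the Hessian of $F = \frac{1}{2}(f_1^2 + f_2^2)$, exactly for the first two points, and classifies each point by the signs of the determinant and of the first diagonal entry of the 2x2 Hessian.

```python
from fractions import Fraction as Fr

def grad(x1, x2):
    """Gradient (D1F, D2F) of F = (f1^2 + f2^2)/2."""
    f1 = x1**3 - x2 - 1
    f2 = x1**2 - x2
    return (3 * f1 * x1**2 + 2 * f2 * x1, -f1 - f2)

def hessian(x1, x2):
    """Hessian of F, stored row by row."""
    f1 = x1**3 - x2 - 1
    f2 = x1**2 - x2
    h11 = 9 * x1**4 + 6 * x1 * f1 + 4 * x1**2 + 2 * f2
    h12 = -3 * x1**2 - 2 * x1
    return ((h11, h12), (h12, 2))

def classify(H):
    """Second-order test for a symmetric 2x2 matrix."""
    (a, b), (_, d) = H
    det = a * d - b * b
    if det > 0:
        return "local minimum" if a > 0 else "local maximum"
    if det < 0:
        return "saddle point"
    return "undecided"

p1 = (Fr(0), Fr(-1, 2))
p2 = (Fr(2, 3), Fr(-7, 54))
p3 = (1.4655712, 2.1478990)   # rounded solution of the nonlinear system

for p in (p1, p2, p3):
    print(classify(hessian(*p)))
```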
Chebyshev Approximation

We explain the mathematical problem by means of the following simple example:

\[ \max_{0 \le t \le 1} | e^t - a - b\,t | \longrightarrow \min_{a,\,b} \]

Consequently, we are looking for the linear polynomial function which is the best approximation of the exponential function in the interval [0, 1] with respect to the maximum norm. This problem is equivalent to

\[ \eta \longrightarrow \min_{a,\,b,\,\eta} \]

subject to the constraints

\[ -\eta \le e^t - a - b\,t \le \eta \qquad \text{for all } t \in [0, 1]. \]

There occur infinitely many constraints. The objective function and the constraint functions depend only on a finite number of variables. This type of problem is called a semi-infinite (linear) optimization problem. It will not be discussed any further in later chapters.

Maple® provides the command numapprox[minimax] to solve it, which we will apply as a 'blackbox':

> restart: with(numapprox): p := minimax(exp,0..1,1,1,'eta'); 'eta' = eta;

\[ p := x \mapsto 0.8941114810 + 1.718281828\,x, \qquad \eta = 0.105978313 \]

> plot(exp-p,0..1,color=blue,thickness=2,tickmarks=[3,3],
    title="Error Function");
[Figure: "Error Function", graph of exp − p on [0, 1] with values between −0.1 and 0.1]
The error function alternately attains its maximal deviation $\eta$ at 0, at a point $\tau \in (0, 1)$ and at 1. Therefore, we get the following equations for $a$, $b$, $\tau$ and $\eta$:

> eta := 'eta': f := exp: p := t -> a+b*t:
  eq1 := f(0)-p(0) = eta; eq2 := f(tau)-p(tau) = -eta;
  eq3 := f(1)-p(1) = eta; eq4 := D(f-p)(tau) = 0;

\[ eq1 := 1 - a = \eta, \quad eq2 := e^\tau - a - b\,\tau = -\eta, \quad eq3 := e - a - b = \eta, \quad eq4 := e^\tau - b = 0 \]

Maple® returns the following rounded values of the exact solution:

> solve({eq1,eq2,eq3,eq4},{a,b,tau,eta}): evalf(%);

\[ \{ \eta = .1059334158,\ b = 1.718281828,\ a = 0.8940665842,\ \tau = 0.5413248543 \} \]

Comparing the results, we see that numapprox[minimax] computes the coefficient $a$ and the maximal deviation $\eta$ only with low accuracy.
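The four equations can be solved in closed form: subtracting eq3 from eq1 gives $b = e - 1$, eq4 gives $\tau = \ln b$, and adding eq1 and eq2 (with $e^\tau = b$) gives $2a = 1 + b(1 - \tau)$. The following Python sketch, an illustration only using the standard library, evaluates these formulas and checks the maximal deviation on a grid; the grid size of 1001 points is an arbitrary choice.

```python
import math

b = math.e - 1.0                      # from eq1 - eq3
tau = math.log(b)                     # from eq4: e^tau = b
a = 0.5 * (1.0 + b * (1.0 - tau))     # from eq1 + eq2 with e^tau = b
eta = 1.0 - a                         # from eq1

# Maximal deviation |e^t - a - b t| on an equidistant grid in [0, 1]
grid = [k / 1000 for k in range(1001)]
err = max(abs(math.exp(t) - a - b * t) for t in grid)

print("a =", a, " b =", b, " tau =", tau, " eta =", eta)
print("maximal deviation on the grid =", err)
```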
Facility Location Problems

This kind of problem occurs in various applications. To illustrate this, we consider two planar location problems. Here, it is very important that distance measuring and the optimality criterion are suitable for the given problem.

Example 3 Optimal Location of an Electricity Transformer

To simplify matters, assume that the electricity supply (for instance, in a thinly populated region) is provided by connecting each customer directly with the transformer. We are looking for the location of the transformer which minimizes the total net length. To measure the distance, we use the euclidean norm, and so the objective function has the form

\[ d(x, y) := \sum_{\mu=1}^{m} \sqrt{(x - x_\mu)^2 + (y - y_\mu)^2} \longrightarrow \min_{x,\,y}, \]

where the coordinates $(x, y)$ describe the sought location of the transformer and $(x_\mu, y_\mu)$ the given locations of the customers.

> restart: Digits := 5: with(plots): x := vector(2):

Locations of the customers:

> Points := [[0,0],[5,-1],[4,6],[1,3]];

\[ Points := [[0, 0], [5, -1], [4, 6], [1, 3]] \]

Objective function:

> d  := add(sqrt((x[1]-z[1])^2+(x[2]-z[2])^2),z=Points):
  P1 := op(map(z->z[1],Points)):   # abscissas of the points
  P2 := op(map(z->z[2],Points)):   # ordinates of the points
  r1 := x[1]=min(P1)..max(P1); r2 := x[2]=min(P2)..max(P2);   # defines graphic window

\[ r1 := x_1 = 0..5, \qquad r2 := x_2 = -1..6 \]

> p1 := plot3d(d,r1,r2,style=patchcontour,axes=box):
  display(p1,orientation=[-50,25]); display(p1,orientation=[-90,0]);
[Figure: two views of the surface plot of d over the graphic window, with orientations [-50,25] and [-90,0]]
> p2 := pointplot(Points,symbol=circle,symbolsize=18,axes=frame):
  p3 := contourplot(d,r1,r2,levels=30,grid=[70,70],color=gray):
  display(p2,p3,color=black);

In contrast to the foregoing figure, we now get a real 2D graphic. Clicking on the surmised minimum causes Maple® to return a rough approximation of the optimal location. We forego its reproduction in this book. The exact minimum can be calculated by analytical means:

> with(linalg): g := grad(d,x):
  #solve(convert(g,set),{x[1],x[2]});   # symbolic calculation fails!
  Solution := fsolve(convert(g,set),{x[1],x[2]});

\[ Solution := \{ x_1 = 1.6000,\ x_2 = 2.4000 \} \]

> subs(Solution,eval(g));   # insert computed results

\[ [0.,\ 0.00005] \]

As decimal numbers are well known to us, we detect the exact solution without the help of solve:

> Solution := {x[1]=8/5,x[2]=12/5}:
  subs(Solution,eval(g)): simplify(%);   # symbolic calculation!

\[ [0,\ 0] \]

By pure chance we have guessed the exact solution!

> H := hessian(d,x): subs(Solution,eval(H)): simplify(%);
  evalf(%); definite(%,positive_def);
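\[ \begin{bmatrix} \frac{75}{676} \sqrt{13} + \frac{25}{51} \sqrt{2} & -\frac{25}{338} \sqrt{13} + \frac{25}{51} \sqrt{2} \\[2pt] -\frac{25}{338} \sqrt{13} + \frac{25}{51} \sqrt{2} & \frac{25}{507} \sqrt{13} + \frac{25}{51} \sqrt{2} \end{bmatrix}, \qquad \begin{bmatrix} 1.0933 & 0.42656 \\ 0.42656 & 0.87103 \end{bmatrix}, \qquad true \]

So the Hessian is positive definite in [8/5, 12/5]. As d is strictly convex, there lies the unique minimum.

> U := [8/5,12/5]: subs({x[1]=U[1],x[2]=U[2]},d): simplify(%); evalf(%);

\[ 2 \sqrt{13} + 4 \sqrt{2}, \qquad 12.868 \]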
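The stationarity of $d$ at $U = (8/5, 12/5)$ has a simple geometric reason: the gradient of $d$ is the sum of the unit vectors pointing from the customers to the location, and at the intersection of the segments joining $(0,0)$ with $(4,6)$ and $(5,-1)$ with $(1,3)$ these unit vectors cancel in pairs. The Python sketch below is an illustration only, using floating point arithmetic and function names of our own; it checks this numerically together with the minimal value.

```python
import math

points = [(0, 0), (5, -1), (4, 6), (1, 3)]

def total_distance(x, y):
    """Objective d(x, y): sum of euclidean distances to all customers."""
    return sum(math.hypot(x - px, y - py) for px, py in points)

def gradient(x, y):
    """Gradient of d: sum of unit vectors from the customers to (x, y)."""
    gx = sum((x - px) / math.hypot(x - px, y - py) for px, py in points)
    gy = sum((y - py) / math.hypot(x - px, y - py) for px, py in points)
    return gx, gy

U = (8 / 5, 12 / 5)
g = gradient(*U)
d_min = total_distance(*U)

print("gradient at U:", g)
print("d(U) =", d_min)
```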
> p4 := plot([seq([z,U],z=Points)],color=blue,thickness=2):
  p5 := plot([U],style=point,symbol=circle,symbolsize=24,color=blue):
  display(p2,p3,p4,p5,axes=frame);

[Figure: contour lines of d with the customer locations, the optimal location U and the connecting line segments]

We see that the optimal location of the electricity transformer is the intersection point of the two diagonals.

Example 4 Optimal Location of a Rescue Helicopter

Now, we are looking for the optimal location of a rescue helicopter. As in the foregoing case, the distance is measured in a straight line. The helicopter should reach its destination in minimal time. Therefore, we now have to minimize the maximum distance and get the objective function

\[ d_{\max}(x, y) := \max_{1 \le \mu \le m} \sqrt{(x - x_\mu)^2 + (y - y_\mu)^2} \longrightarrow \min_{x,\,y}. \]

Coordinates of the destinations:
> restart: with(plots): x := vector(2):
  Points := [[0,0],[5,-1],[4,6],[1,3]];
  r1 := x[1]=-1.5..7.5: r2 := x[2] = -2..7:   # drawing window

\[ Points := [[0, 0], [5, -1], [4, 6], [1, 3]] \]

Objective function:

> d_max := max(seq(sqrt((x[1]-z[1])^2+(x[2]-z[2])^2),z=Points)):
  p1 := plot3d(d_max,r1,r2,style=patchcontour,axes=box,grid=[70,70]):
  display(p1,orientation=[-40,20]); display(p1,orientation=[-90,0]);

[Figure: two views of the surface plot of d_max over the drawing window, with orientations [-40,20] and [-90,0]]
This minimax problem can be solved more easily if we transform it (with $r := d_{\max}$) to the following equivalent optimization problem:

\[ \frac{1}{2} r^2 \longrightarrow \min_{x,\,y,\,r} \]

subject to the constraints

\[ (x - x_\mu)^2 + (y - y_\mu)^2 \le r^2 \qquad (1 \le \mu \le m). \]

Substituting $\varrho := \frac{1}{2} (r^2 - x^2 - y^2)$, we get a simple quadratic optimization problem:

\[ \frac{1}{2} (x^2 + y^2) + \varrho \longrightarrow \min_{x,\,y,\,\varrho} \]

subject to the (linear) constraints

\[ x_\mu x + y_\mu y + \varrho \ge \frac{1}{2} (x_\mu^2 + y_\mu^2) \qquad (1 \le \mu \le m). \]
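The step from the quadratic to the linear constraints can be spelled out as follows; here $\varrho$ denotes the substituted variable $\frac{1}{2}(r^2 - x^2 - y^2)$ (the symbol is our choice for this sketch).

```latex
\begin{align*}
(x - x_\mu)^2 + (y - y_\mu)^2 \le r^2
  &\iff x^2 + y^2 - 2\,(x_\mu x + y_\mu y) + x_\mu^2 + y_\mu^2 \le r^2 \\
  &\iff x_\mu x + y_\mu y + \tfrac{1}{2}\,(r^2 - x^2 - y^2) \ge \tfrac{1}{2}\,(x_\mu^2 + y_\mu^2) \\
  &\iff x_\mu x + y_\mu y + \varrho \ge \tfrac{1}{2}\,(x_\mu^2 + y_\mu^2),
\end{align*}
\qquad
\tfrac{1}{2}\, r^2 = \tfrac{1}{2}\,(x^2 + y^2) + \varrho .
```

The quadratic terms in $x$ and $y$ are thus moved from the constraints into the objective function, which is why the transformed problem has linear constraints.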
Later on, we will get to know solution algorithms for these types of problems which provide the following coordinates of the optimal helicopter location:
> M := [52/17, 39/17]:   # coordinates of the optimal location
  r := evalf(subs({x[1]=M[1],x[2]=M[2]},d_max));

\[ r := 3.8235 \]
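The stated location can be checked in exact arithmetic. The Python sketch below is an illustration only and does not solve the optimization problem; it verifies properties of the given point $M$. It computes the squared distances from $M$ to the four destinations, lists the destinations lying on the circle of radius $r$, tests whether they form an acute triangle, and evaluates the linear constraints of the transformed problem. The variable name `rho` for the substituted variable is our choice.

```python
from fractions import Fraction as Fr

points = [(0, 0), (5, -1), (4, 6), (1, 3)]
M = (Fr(52, 17), Fr(39, 17))     # location stated in the text

sq_dist = [(M[0] - px)**2 + (M[1] - py)**2 for px, py in points]
r_sq = max(sq_dist)              # squared radius of the enclosing circle
active = [p for p, s in zip(points, sq_dist) if s == r_sq]

def is_acute(tri):
    """True if all three angles of the triangle are acute."""
    for i in range(3):
        p, q, s = tri[i], tri[(i + 1) % 3], tri[(i + 2) % 3]
        dot = (q[0] - p[0]) * (s[0] - p[0]) + (q[1] - p[1]) * (s[1] - p[1])
        if dot <= 0:
            return False
    return True

# Linear constraints of the transformed problem
rho = (r_sq - M[0]**2 - M[1]**2) / 2
linear_ok = all(px * M[0] + py * M[1] + rho >= Fr(px * px + py * py, 2)
                for px, py in points)

print("r^2 =", r_sq, " active destinations:", active)
print("acute triangle:", is_acute(active), " linear constraints hold:", linear_ok)
```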
> p2 := pointplot(Points,symbol=circle,symbolsize=18,color=black):
  p3 := contourplot(d_max,r1,r2,levels=30,grid=[70,70],color=gray):
  p4 := plot([seq([z,M],z=Points)],color=blue):
  with(plottools): p5 := disk(M,0.1,color=black):
  p6 := circle(evalf(M),r,color=blue,thickness=3):
  display(p2,p3,p4,p5,p6,axes=frame,scaling=constrained);

[Figure: contour lines of d_max with the destinations, the optimal location M and the blue circumscribed circle of radius r]
In addition, one sees, for example with the help of the Karush–Kuhn–Tucker conditions from chapter 2 (cf. exercise 14), that the location M is the optimum place if three points lie on the boundary of the blue (minimum) circumscribed circle. They are the vertices of an acute triangle. Naturally, it is possible that only two points lie on the circle. In this case, however, they must lie diametrically opposite to each other.

Standard Form of Optimization Problems

The previous examples also served to demonstrate that optimization problems from quite different areas of application have common roots. Therefore, it is useful to classify them according to mathematical criteria. In this book, we consider optimization problems which are based on the following standard form:

\[ f(x) \longrightarrow \min_{x \in \mathbb{R}^n} \]

subject to the constraints

\[ g_i(x) \le 0 \quad \text{for } i \in I := \{1, \ldots, m\}, \qquad h_j(x) = 0 \quad \text{for } j \in E := \{1, \ldots, p\}. \]
Here 'I' stands for 'Inequality', 'E' for 'Equality'. As mentioned earlier, the function to be minimized is called the objective function. We write the constraints (or restrictions) in homogeneous form, and distinguish between inequality and equality constraints. The maximization of an objective function $f$ is equivalent to the minimization of $-f$, and inequality restrictions of the type $g_i(x) \ge 0$ are equivalent to $-g_i(x) \le 0$. Generally, we assume that the objective function as well as the restriction functions are defined on a nonempty subset of $\mathbb{R}^n$ and are at least continuous there.

Unconstrained optimization problems play an important role. Our discussion of them in chapter 3 will take up considerable space because they come up in various applications and are consequently important as problems in their own right. There exists quite a range of methods to solve them. Numerical methods for unconstrained optimization problems usually only compute a stationary point, that is, a point at which the first-order necessary conditions are fulfilled. Only with the help of additional investigations like the second-order optimality condition is it possible to decide whether the stationary point is a local extreme point or a saddle point. Global minima can only be determined in special cases.

Minimax problems of the form

\[ \max_{1 \le \mu \le m} f_\mu(x) \longrightarrow \min_{x \in \mathbb{R}^n} \]

are a special kind of unconstrained optimization problem with a nondifferentiable objective function even if every $f_\mu$ is differentiable. These are, similar to the Chebyshev approximation, equivalent to a restricted problem:

\[ \eta \longrightarrow \min_{x \in \mathbb{R}^n,\ \eta \in \mathbb{R}} \]

subject to the constraints

\[ f_\mu(x) \le \eta \qquad (\mu = 1, \ldots, m). \]
Optimization problems with equality constraints often occur as subproblems and can be reduced via elimination to unconstrained optimization problems as we will see later on.

With restricted problems, we are especially interested in the linear constraints $g_i(x) = a_i^T x - b_i$ for $i \in I$ and $h_j(x) = \hat{a}_j^T x - \hat{b}_j$ for $j \in E$. In chapter 4 we will discuss in detail linear optimization (here, we have $f(x) = c^T x$) and quadratic optimization, that is, $f(x) = \frac{1}{2} x^T C x + c^T x$ with a symmetric positive semidefinite matrix $C \in \mathbb{R}^{n \times n}$. The above occur, for example, as subproblems in SQP² methods.

² The abbreviation stands for "Sequential Quadratic Programming".
Penalty Methods

We will conclude this introductory section with the presentation of the classic penalty method from the early days of numerical optimization. At that time, problems with nonlinear constraints could only be solved by reducing them to unconstrained problems by adding the constraints as 'penalty terms' to the objective function. Here, we will only consider the simple case of equality constraints and quadratic penalty terms. The optimization problem

\[ f(x) \longrightarrow \min_{x \in \mathbb{R}^n} \]

subject to the constraints

\[ h_j(x) = 0 \qquad \text{for } j = 1, \ldots, p \]

is transformed into the optimization problem

\[ F(x) := f(x) + \frac{\lambda}{2} \sum_{j=1}^{p} h_j(x)^2 \longrightarrow \min_{x \in \mathbb{R}^n} \]

with a positive penalty parameter $\lambda$. Enlarging $\lambda$ leads to a greater 'punishment' of nonfulfilled constraints. Thereby we hope that the 'unconstrained minimizer' $x^*(\lambda)$ will fulfill the constraints with growing accuracy and at the same time $f(x^*(\lambda))$ will better approximate the minimum we are looking for.

Example 5 Consider the problem

\[ f(x_1, x_2) := x_1 x_2^2 \longrightarrow \min_{x \in \mathbb{R}^2} \]

subject to the constraint

\[ h(x_1, x_2) := x_1^2 + x_2^2 - 2 = 0, \]

which has two global minima in $(x_1, x_2)^T$ with $x_1 = -\sqrt{2/3}$, $x_2 = \pm 2/\sqrt{3}$ and a local minimum in $x_1 = \sqrt{2}$, $x_2 = 0$. At the moment, we content ourselves with the following 'graphical inspection':
> restart: with(plots): f := x[1]*x[2]^2: h := x[1]^2+x[2]^2-2:
  F := subs({x[1]=r*cos(t),x[2]=r*sin(t)},[x[1],x[2],f]):
  plot(subs(r=sqrt(2),F[3]),t=0..2*Pi,color=blue,thickness=2);
  p1 := plot3d(F,t=0..2*Pi,r=0.1..sqrt(2),style=wireframe,grid=[25,25],
    color=blue):
  r := sqrt(2):
  p2 := spacecurve(F,t=0..2*Pi,color=blue,thickness=2):
  p3 := spacecurve([F[1],F[2],0],t=0..2*Pi,color=black,thickness=2):
  display(p1,p2,p3,orientation=[-144,61],axes=box);

[Figure: left, "Graph of the objective function" along the constraint circle for 0 ≤ t ≤ 2π; right, wireframe surface over the (x_1, x_2)-plane with the constraint circle and the space curve]
The figure on the right shows the constraint circle in the $x_1 x_2$-plane, along with the surface $x_3 = f(x_1, x_2)$. The intersection of the cylinder $h(x_1, x_2) = 0$ and the surface $x_3 = f(x_1, x_2)$ gives the space curve shown by the thick blue line. Later on, we will be able to analyze and prove this in more detail with the help of the optimality conditions presented in chapter 2. Let us look at the penalty function:

> F := f+lambda/2*h^2;

\[ F := x_1 x_2^2 + \tfrac{1}{2} \lambda\, (x_1^2 + x_2^2 - 2)^2 \]

We get the necessary optimality conditions:

> DF1 := diff(F,x[1]) = 0; DF2 := factor(diff(F,x[2])) = 0;

\[ DF1 := x_2^2 + 2 \lambda\, (x_1^2 + x_2^2 - 2)\, x_1 = 0 \]
\[ DF2 := 2 x_2\, (\lambda x_2^2 + x_1 + \lambda x_1^2 - 2 \lambda) = 0 \]

If $x_2 = 0$, the first equation is simplified to:

> subs(x[2]=0,DF1); solve(%,x[1]): Sol := map(z->[z,0],[%]);

\[ 2 \lambda\, (x_1^2 - 2)\, x_1 = 0, \qquad Sol := [[0, 0], [\sqrt{2}, 0], [-\sqrt{2}, 0]] \]

In these three points, F has the following Hessians:

> with(linalg): H := hessian(F,[x[1],x[2]]):
  seq(subs({x[1]=z[1],x[2]=z[2]},eval(H)),z=Sol);
\[ \begin{bmatrix} -4\lambda & 0 \\ 0 & -4\lambda \end{bmatrix}, \qquad \begin{bmatrix} 8\lambda & 0 \\ 0 & 2\sqrt{2} \end{bmatrix}, \qquad \begin{bmatrix} 8\lambda & 0 \\ 0 & -2\sqrt{2} \end{bmatrix} \]

Obviously, the penalty function has a local minimum only in $(\sqrt{2}, 0)^T$. For large $\lambda$, the corresponding Hessian has the condition number $2\sqrt{2}\,\lambda$ and, therefore, is ill-conditioned. Now consider the remaining case in which the second factor of DF2 is equal to zero:

> DF2a := collect(op([1,3],DF2),lambda) = 0;
  expand(DF1-2*x[1]*DF2a): Eq[1] := x[2]^2 = solve(%,x[2]^2);

\[ DF2a := \lambda\, (x_1^2 + x_2^2 - 2) + x_1 = 0, \qquad Eq_1 := x_2^2 = 2 x_1^2 \]

Substituting $Eq_1$ into DF2a leads to the following quadratic equation:

> Eq[2] := subs(Eq[1],DF2a); Eq[3] := x[1]^2 = solve(Eq[2],x[1]^2);

\[ Eq_2 := \lambda\, (3 x_1^2 - 2) + x_1 = 0, \qquad Eq_3 := x_1^2 = \frac{1}{3}\, \frac{2\lambda - x_1}{\lambda} \]

Before looking at the solutions in more detail, we will prepare the Hessian for analyzing whether it is positive definite or not:

> diff(F,x[1]$2): subs(Eq[1],%): DF11 := factor(%);
  diff(F,x[2]$2): subs(Eq[1],%): DF22 := expand(%-2*lhs(Eq[2]));

\[ DF11 := 2 \lambda\, (5 x_1^2 - 2), \qquad DF22 := 8 \lambda x_1^2 \]

> unprotect(Trace): Trace := expand(subs(Eq[3],DF11+DF22));

\[ Trace := 8 \lambda - 6 x_1 \]

> DF11*DF22-diff(F,x[1],x[2])^2: expand(%): subs(Eq[1],%): factor(%);
  Determinant := 8*x[1]^2*expand(subs(Eq[3],op(3,%)));

\[ 8 x_1^2\, (6 \lambda^2 x_1^2 - 4 \lambda^2 - 1 - 4 \lambda x_1), \qquad Determinant := 8 x_1^2\, (-6 \lambda x_1 - 1) \]

> Qsol := solve(Eq[2],x[1]);

\[ Qsol := \frac{1}{6}\, \frac{-1 + \sqrt{1 + 24 \lambda^2}}{\lambda},\quad \frac{1}{6}\, \frac{-1 - \sqrt{1 + 24 \lambda^2}}{\lambda} \]

> subs(x[1]=Qsol[1],Determinant): 'Determinant' = factor(%);

\[ Determinant = -\frac{2}{9}\, \frac{(-1 + \sqrt{1 + 24 \lambda^2})^2\, \sqrt{1 + 24 \lambda^2}}{\lambda^2} \]
> Qsol[2]^2-2/3: 'x[1]^2'-2/3 = simplify(%);

\[ x_1^2 - \frac{2}{3} = \frac{1 + \sqrt{1 + 24 \lambda^2}}{18 \lambda^2} \]

Looking at the determinant, it is obvious that the first of the two solutions Qsol is not possible. For $x_1 = Qsol_2$, we have $x_1 < -\sqrt{2/3}$. Therefore, we get two minimizers, and because of $x_1^2 + x_2^2 = 3 x_1^2 > 2$ these points do not lie within the circle. For $\lambda \to \infty$, however, these minimizers converge to the global minimizers of the restricted optimization problem. For large $\lambda$ their numerical calculation requires very good initial values because we will see at once that the Hessians of the penalty function are ill-conditioned in the (approximating) minimizers:

> subs(x[1]=Qsol[2],mu^2-Trace*mu+Determinant): Mu := solve(%,mu):
  mu[max]/mu[min] = asympt(eval(Mu[1]/Mu[2]),lambda,1);

\[ \frac{\mu_{\max}}{\mu_{\min}} = \sqrt{6}\, \lambda + O\Bigl( \frac{1}{\lambda} \Bigr) \]

In the following two graphics this fact becomes clear very quickly because the elliptic level curves around the local minimizers become more elongated for increasing penalty parameters:

[Figure: level curves of the penalty function, left panel "penalty parameter = 3", right panel "penalty parameter = 6"]
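The closed-form expressions derived above make it easy to watch the penalty method at work. The following Python sketch is an illustration only, with function names of our own. It evaluates the minimizer belonging to the second solution of the quadratic equation (with $x_2 = \sqrt{2}\,|x_1|$), checks that the gradient of the penalty function vanishes there, and computes the ratio of the Hessian eigenvalues from the trace and determinant formulas.

```python
import math

def penalty_minimizer(lam):
    """Minimizer with x1 = (-1 - sqrt(1 + 24 lam^2)) / (6 lam), x2 > 0."""
    x1 = (-1.0 - math.sqrt(1.0 + 24.0 * lam**2)) / (6.0 * lam)
    x2 = math.sqrt(2.0) * abs(x1)          # from x2^2 = 2 x1^2
    return x1, x2

def grad_F(x1, x2, lam):
    """Gradient of F = x1 x2^2 + lam/2 (x1^2 + x2^2 - 2)^2."""
    h = x1**2 + x2**2 - 2.0
    return (x2**2 + 2.0 * lam * h * x1, 2.0 * x1 * x2 + 2.0 * lam * h * x2)

def eigen_ratio(lam):
    """mu_max / mu_min of the Hessian at the minimizer."""
    x1, _ = penalty_minimizer(lam)
    trace = 8.0 * lam - 6.0 * x1
    det = 8.0 * x1**2 * (-6.0 * lam * x1 - 1.0)
    disc = math.sqrt(trace**2 - 4.0 * det)
    return (trace + disc) / (trace - disc)

for lam in (1, 3, 6, 9, 100, 1000):
    x1, x2 = penalty_minimizer(lam)
    print(lam, x1, x2, x1**2 + x2**2 - 2.0, eigen_ratio(lam))
```

In the printed table the constraint violation $x_1^2 + x_2^2 - 2$ shrinks as $\lambda$ grows while the eigenvalue ratio grows roughly like $\sqrt{6}\,\lambda$, which is the trade-off described in the text.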
In comparison, we give a Matlab® script Penalty.m which calculates a minimizer of the penalty function for the parameters λ = 1, 3, 6 and 9 by means of the function fminsearch from the Matlab® Optimization Toolbox. Furthermore it plots the feasible region, the level curves of the objective function and the penalty functions and marks the exact and approximating minimizers.

  % Script Penalty.m: Test of Penalty Method
  clc; clear all; close all;
  N = 200; X = linspace(-2,2,N); Y = X; T = linspace(0,2*pi,N);
  [X,Y] = ndgrid(X,Y);   % build the grid first ...
  Z = X.*Y.^2;           % ... so that Z is an N-by-N matrix
  xx = sqrt(2).*cos(T); yy = sqrt(2).*sin(T);
  x1 = -sqrt(2/3); y1 = sqrt(4/3);
  f = @(x) x(1)*x(2)^2; h = @(x) x(1)^2+x(2)^2-2;
  x = [2, 1];   % starting value of fminsearch
  for lambda = [1, 3, 6, 9]
    Phi = Z+lambda/2*(X.^2+Y.^2-2).^2;
    phi = @(x) f(x)+lambda/2*h(x)^2;
    [x,fval] = fminsearch(phi,x)
    % In the second iteration (lambda=3) fminsearch uses the result of
    % the first iteration (i.e. lambda=1) as starting value, and so on.
    figure, clf
    subplot(2,1,1); contour(X,Y,Z,40); colorbar; hold on
    P = plot(xx,yy,'k'); set(P,'linewidth',2);
    PP = plot([sqrt(2),x1,x1],[0,y1,-y1],'ko');
    set(PP,'markersize',6,'markerfacecolor','k');
    axis([-2 2 -2 2]), axis equal, axis([-2 2 -2 2]), ylabel('y')
    title(['\lambda = ', num2str(lambda)]);
    subplot(2,1,2); contour(X,Y,Phi,160); colorbar; hold on
    P = plot(xx,yy,'k'); set(P,'linewidth',2);
    PP = plot([sqrt(2),x1,x1],[0,y1,-y1],'ko');
    set(PP,'markersize',6,'markerfacecolor','k');
    % Plot the minimum of phi
    PP = plot(x(1),x(2),'r+'); set(PP,'markersize',6,'linewidth',3);
    axis([-2 2 -2 2]), axis equal, axis([-2 2 -2 2])
    xlabel('x'), ylabel('y')
    waitforbuttonpress   % to wait for a click on a figure
  end
Penalty-based methods as presented above are at best suitable for finding appropriate starting approximations for more efficient methods of optimization. Later on, however, we will see that a modified approach, which leads to the so-called 'exact penalty functions', is of considerable practical interest.
1.2 Historical Overview

In this book we almost exclusively deal with continuous finite-dimensional optimization problems. This historical overview, however, does not only consider the finite-dimensional, but also the infinite-dimensional case (that is, Calculus of Variations and Optimal Control), because their developments have for large parts run in parallel and the theories are almost inextricably interwoven with each other.
Optimization Problems in Antiquity

The history of optimization begins, like so many other stories, with the 'ancient' Greeks. In his Aeneid Virgil recorded the legendary story of the Phoenician queen Dido who, after her flight from Tyre, allegedly landed on the coast of Numidia in North Africa in 814 BC. With seeming modesty she asked the ruler there for as much land as could be enclosed with a bull's hide. This bull's hide, however, she cut into many thin strips so that she was able to enclose the whole area around the harbor bay of what was later to become the city of Carthage. Hence "Dido's Problem" is to find the closed curve of fixed length which encloses the maximum area. In the language of modern mathematics it is the classical isoperimetric problem. Dido probably knew its solution, although not in today's strict sense, since the necessary terms and concepts were only devised over 2000 years later. This was probably also the reason why only simple extremal problems of geometrical nature were examined in antiquity. Zenodorus (200–140 BC), for example, dealt with a variant of the isoperimetric problem and proved that of all polygons with n vertices and equal perimeter the regular polygon has the greatest area.

The following optimization problem came from the famous mathematician Heron of Alexandria (second half of the first century AD). He postulated that beams of light always take the shortest path. It was only in 1657 that Pierre de Fermat (1601–1665) formulated the correct version of this law of nature, stating that beams of light cross inhomogeneous media in minimum time. Heron's idea, however, was remarkable insofar as he was the first to formulate an extremal principle for a phenomenon of nature.
First Heroic Era: Development of Calculus

For a long time each extremal problem was solved individually using specific methods. In the 17th century people realized the necessity of developing more general methods in the form of a calculus. As the name suggests, this was the motivation for the creation of "Calculus". The necessary condition $f'(x) = 0$ for extrema was devised by Fermat who discovered it in 1629 (allegedly only for polynomials). Isaac Newton (1643–1727) also knew Fermat's method, in a more general form. He dealt with determining maxima and minima in his work Methods of Series and Fluxions which had been completed in 1671 but was published only in 1736. In 1684 Gottfried Wilhelm Leibniz (1646–1716) published a paper in the Acta Eruditorum in which he used the necessary condition $f'(x) = 0$ as well as the second-order derivative to distinguish between maxima and minima.

Very surprisingly the obvious generalization to functions of several variables took a long time. It was only in 1755 that the necessary optimality condition $\nabla f(x) = 0$ was published by Leonhard Euler (1707–1783) in his book Institutiones Calculi Differentialis. The Lagrange multiplier rule for problems with equality constraints appeared only in 1797 in Lagrange's Théorie des fonctions analytiques. Instead, the main focus was on problems in which an integral functional, like

\[ J(y) = \int_a^b f(x, y, y')\, dx, \]

is to be maximized or minimized in a suitable function space. The isoperimetric problem is a problem of this kind. Further extremal problems, mainly motivated by physics, were studied by Newton, Leibniz, Jacob Bernoulli (1654–1705) and John Bernoulli (1667–1748) as well as Euler. In his book Methodus Inveniendi Lineas Curvas Maximi Minimive Proprietate Gaudentes, sive Solutio Problematis Isoperimetrici Latissimo Sensu Accepti, published in 1744, however, Euler developed first formulations of a general theory and for that used the Euler method named after him. From the necessary optimality conditions of the discretized problems he deduced the famous "Euler equation"

\[ \frac{\partial f}{\partial y} = \frac{d}{dx} \frac{\partial f}{\partial y'} \]

via passage to the limit for an extremal function $y$. By doing this, he confirmed known results and was able to solve numerous more general problems. Joseph-Louis Lagrange (1736–1813) greatly simplified Euler's method. Instead of the transition to the approximating open polygon, he embedded the extremal function $y(x)$ in a family of functions $y(x, \varepsilon) := y(x) + \varepsilon \eta(x)$ and required the vanishing of the "first variation"

\[ \delta J := \frac{d}{d\varepsilon} J(y + \varepsilon \eta) \Big|_{\varepsilon = 0}. \]

Following this expression, Euler coined the term "Calculus of Variations" for this new and important branch of Analysis. In Euler's works there are already first formulations for variational problems with differential equation constraints. For these kinds of problems Lagrange
devised the multiplier method named after him, of which the final version appeared in his M´echanique analytique in 1788. It took over 100 years until Lagrange’s suggestion was ‘properly’ understood and until there was a mathematically ‘clean’ proof. Calculus of Variations provided important impulse for the emergence of new branches of mathematics such as Functional Analysis and Convex Analysis. These in turn provided methods with which it was possible to also solve extremal problems from economics and engineering. First papers on the mathematical modeling of economic problems were published at the end of the thirties by the later winners of the Nobel Prize Leonid V. Kantorovich (1912–1986) and Tjalling C. Koopmans (1910–1985). In 1951 Harold W. Kuhn (born 1925) and Albert W. Tucker (1905–1995) extended the classical Euler–Lagrange multiplier concept to finite-dimensional extremal problems with inequality constraints. Later it became known that William Karush (1917–1997) and Fritz John (1910–1994) had already achieved similar results in 1939 [Kar] and 1948 [John]. From the mid-fifties onwards there evolved the Theory of Optimal Control from the classical Calculus of Variations and other sources like dynamic programming, control and feedback control systems. Pioneering discoveries by Lev S. Pontryagin (1908–1988) and his students V. G. Boltyanskii, R. V. Gamkrelidze and E. F. Mishchenko led to the maximum principle, and thus the theory had its breakthrough as an autonomous branch of mathematics. The development of more powerful computers contributed to the fact that it was now possible to solve problems from applications which had so far been impossible to solve numerically because of their complex structure. The most important and most widely used numerical methods of optimization were devised after the Second World War. Before that only some individual methods were known, about which sometimes only vague information is provided in the literature. 
The oldest method is probably the least squares fit which Carl Friedrich Gauss (1777–1855) allegedly already knew in 1794 [Gau] and which he used for astronomical and geodetic problems. A particular success for him was the rediscovery of the planetoid Ceres in late 1801. After the discovery during January 1801 by the Italian astronomer Giuseppe Piazzi it had been observed only for a short period of time. From very few data Gauss calculated its orbit using linear regression. Franz Xaver von Zach found Ceres on the night of December 7, just where Gauss predicted it would be. In 1805 Adrien-Marie Legendre (1752–1833) presented a printed version of the least squares method for the first time. Gauss, who published his results only in 1809, claimed priority for it.
An important predecessor of numerical optimization methods and prototype of a descent method with step size choice is the gradient method. It was proposed by Augustin-Louis Cauchy (1789–1857) [Cau] in 1847 to solve systems of nonlinear equations by reducing them to least squares problems of the form

$$F(x) = \frac{1}{2} \sum_{i=1}^{m} f_i(x)^2 \longrightarrow \min_{x \in \mathbb{R}^n}$$

as described in section 1.1. Here the iterations are carried out according to

$$x^{(k+1)} := x^{(k)} + \lambda_k d_k ,$$

starting from an approximation $x^{(k)}$ and taking the direction of steepest descent $d_k := -\nabla F(x^{(k)})$; the step size $\lambda_k$ is the smallest positive value $\lambda$ which gives a local minimum of $F$ on the half line $\{x^{(k)} + \lambda d_k \mid \lambda \ge 0\}$. However, it seems that the gradient method has not gained general acceptance, probably because the rate of convergence slows down immensely after a few iterations.

In a paper from 1944 Levenberg [Lev] regarded the Gauß–Newton method as the standard method for solving nonlinear regression problems. This method is obtained by linearization

$$f(x) \approx f(x^{(k)}) + J_k (x - x^{(k)})$$

in a neighborhood $U(x^{(k)}, \Delta_k)$, where $J_k := f'(x^{(k)})$ is the Jacobian. The solution $d_k$ to the linear regression problem

$$\| f(x^{(k)}) + J_k d \|_2 \longrightarrow \min_{d \in \mathbb{R}^n}$$

gives a new approximation $x^{(k+1)} := x^{(k)} + d_k$. Strictly speaking this only makes sense if $\|d_k\| \le \Delta_k$ holds. In the case of $\|d_k\| > \Delta_k$ Levenberg proposed considering the restricted linear regression problem

$$\| f(x^{(k)}) + J_k d \|_2 \longrightarrow \min_{d \in \mathbb{R}^n,\ \|d\|_2 = \Delta_k}$$

instead. Then there exist exactly one Lagrange multiplier $\lambda_k > 0$ and one correction vector $d_k$ with

$$(J_k^T J_k + \lambda_k I)\, d_k = -J_k^T f(x^{(k)}) \quad \text{and} \quad \|d_k\|_2 = \Delta_k .$$

One can see that for $\lambda_k \approx 0$ a damped Gauß–Newton correction and for large $\lambda_k$ a damped gradient correction is carried out. Hence, this new method ‘interpolates’ between the Gauß–Newton method and the gradient method
by combining the usually very good local convergence properties of the Gauß–Newton method with the descent properties of the gradient method. This flexibility is especially advantageous if the iteration runs through areas where the Hessian of the objective function (or its approximation) is not (yet) sufficiently positive definite. Levenberg unfortunately does not make any suggestions for determining $\lambda_k$ in a numerically efficient way and controlling the size of $\Delta_k$. Marquardt [Mar] proposed first “ad hoc ideas” in a paper from 1963. Working independently from Levenberg, he had pursued a similar approach. This was only properly understood when the findings of Levenberg and Marquardt were studied in the context of “trust region methods” which were originally proposed by Goldfeld, Quandt and Trotter [GQT] in 1966. In contrast to descent methods with choice of step size one chooses the step size (or an upper bound for it) first, and then the descent direction.

When the first computers appeared in the middle of the 20th century, nonlinear regression problems were of central importance. Levenberg’s approach, however, remained unnoticed for a long time. Rather, at first the interest centered around methods with a gradient-like descent direction. Besides Cauchy’s classical gradient method these were iterative methods which minimized with alternating coordinates and which worked cyclically like the Gauss–Seidel method or with relaxation control like the Gauss–Southwell method [Sou]. In the latter case the minimization was done with respect to the coordinate with the steepest descent.

A Quasi-Newton method proposed by Davidon (1959) turned out to be revolutionary. It determined the descent directions $d_k = -H_k g_k$ from the gradient $g_k$ using symmetric positive definite matrices $H_k$. Davidon’s paper on this topic remained unpublished at first. In 1963 a revised version of his algorithm was published by Fletcher and Powell.
It is usually called the Davidon–Fletcher–Powell method (DFP method). Davidon’s original work was published only 30 years later in [Dav]. Davidon himself called his method the “variable metric method” pointing out that the one-dimensional minimization was done in the direction of the minimum of an approximation of the objective function. For a quadratic objective function

$$f(x) = b^T x + \frac{1}{2}\, x^T A x$$

with a symmetric positive definite Hessian $A$ the inner product $\langle u, v \rangle_A := v^T A u$ gives a norm $\|u\|_A := \sqrt{\langle u, u \rangle_A}$ and thereby a metric. If one chooses $d_k$ as the steepest descent direction with respect to this norm, that means, as the solution to

$$\min_{d \neq 0} \frac{d^T g_k}{\|d\|_A} ,$$
it holds that $d_k = -\lambda A^{-1} g_k$ with a suitable $\lambda > 0$. If $A^{-1}$ is known, the quadratic function can be minimized in one step. In the general case one works with changing positive definite matrices which are used as a tool to determine the new descent direction. Hence, the metric changes in each step, which explains the name of this method. In addition it is based on the idea of iteratively determining a good approximation of the inverse of the Hessian. We thereby get an interesting connection with the conjugate gradient method by Hestenes and Stiefel (1952) since in the quadratic case the following holds for the descent directions:

$$d_i^T A d_j = 0 \quad \text{for } i \neq j .$$

This ensures that the optimality of the former search directions remains and that the method (at least in this special case) terminates at the minimal point after a finite number of iterations.

Besides these kinds of minimization methods there were various other methods which were commonly used. These were mostly solely based on the evaluation of functions and most of them did not have a sound theoretical basis. They were plain search methods. They were easy to implement, worked sufficiently reliably and fulfilled the expectations of their users in many applications. However, with higher-dimensional problems their rate of convergence was very slow, with quickly increasing costs. Therefore, due to their lack of efficiency most of these methods have gone out of use. However, very popular even today is the Nelder and Mead polytope method from 1965 which will be presented in chapter 3.
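The finite termination of the conjugate gradient method in the quadratic case can be checked on a small example. The following is only a minimal sketch in Python: the function names, the matrix $A$ and the vector $b$ are assumptions chosen for illustration, and the update factor is written in the Fletcher–Reeves form, which agrees with the Hestenes–Stiefel formula for quadratic functions with exact line search.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, x0):
    """Minimize f(x) = b^T x + 0.5 x^T A x for symmetric positive definite A.

    Returns the final iterate and the list of search directions used.
    """
    x = list(x0)
    g = [ax + bi for ax, bi in zip(matvec(A, x), b)]   # gradient g = A x + b
    d = [-gi for gi in g]                              # first direction: steepest descent
    directions = []
    for _ in range(len(b)):                            # at most n steps are needed
        if dot(g, g) < 1e-24:
            break
        Ad = matvec(A, d)
        step = -dot(g, d) / dot(d, Ad)                 # exact line search along d
        x = [xi + step * di for xi, di in zip(x, d)]
        g_new = [gi + step * adi for gi, adi in zip(g, Ad)]
        beta = dot(g_new, g_new) / dot(g, g)           # Fletcher-Reeves factor
        directions.append(d)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x, directions

# Illustrative data (an assumption, not taken from the text)
A = [[4.0, 1.0], [1.0, 3.0]]
b = [-1.0, -2.0]
x_min, dirs = conjugate_gradient(A, b, [0.0, 0.0])
conjugacy = dot(dirs[0], matvec(A, dirs[1]))           # d_0^T A d_1
print(x_min, conjugacy)
```

For this two-dimensional example the iteration stops after two steps at the solution of $Ax = -b$, and the two search directions are conjugate with respect to $A$ up to rounding errors.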
Second Heroic Era: Discovery of Simplex Algorithm

The simplex method can be regarded as the first efficient method for solving a restricted optimization problem, here in the special case of a linear objective function and linear (inequality) constraints. This algorithm, developed by George Dantzig (1914–2005) in 1947, greatly influenced the development of mathematical optimization. In problems of this kind the extrema are attained on the boundary of the feasible region, and in at least one ‘vertex’. The simplex method utilizes this fact by iteratively switching from one vertex to the next and thereby (if we are minimizing) reducing the objective function until an optimal vertex is reached. The representation of the simplex algorithm in tableau form, which had been dominant for a long time, as well as the fact that the problem was called ‘linear program’ indicate that it has its roots in the time of manual computation and mechanical calculating machines. More adequate forms of representation like the active set method only gradually gained acceptance. This also holds for first solution strategies for quadratic optimization problems which were devised
by Beale and Wolfe in 1959. These can be regarded as modifications of the simplex algorithm. A survey by Colville (1968) gives a very good overview of the algorithms available in the sixties to solve nonlinearly constrained optimization problems. Surprisingly, the reduced gradient method by Abadie and Carpentier (1965, 1969) was among the best methods. First formulations of these kinds of projected gradient methods (at first for linear, later on also for nonlinear constraints) were devised by Rosen (1960/61) and Wolfe (1962). The idea to linearize the nonlinear constraints does not only appear in the reduced gradient method. The SLP method (SLP = sequential linear programming) by Griffith and Stewart (1961) solves nonlinear optimization problems by using a sequence of linear approximations, mostly using first-order Taylor series expansion. A disadvantage of this approach is that the solution to each LP-problem is attained in a vertex of the feasible region of the (linearized) constraints which is rather improbable for the nonlinear initial problem. Wilson (1963) was the first to propose the use of quadratic subproblems, with a quadratic objective function and linear constraints. There one uses a second-order Taylor series expansion of the Lagrange function. One works with exact Hessians of the Lagrange function and uses the Lagrange multipliers of the preceding iteration step as approximations of the current ones. If $d_k$ is a solution of the quadratic subproblem, then $x^{(k+1)} := x^{(k)} + d_k$ gives the new approximative value; note that at first there is no ‘line search’. Similar to the classical Newton method one can prove local quadratic convergence for this method; in the literature it is called the Lagrange–Newton method. Beginning with Han (1976), Powell and Fletcher, Wilson’s approach was taken up and developed further by numerous authors.
The aim was to obtain a method with global convergence properties and, in analogy with Quasi-Newton methods, a simplified approximation of the Hessians. Furthermore, the (mostly nonfeasible) approximations $x^{(k)}$ were to reduce the objective function as well as their distance to the feasible region. To measure and control the descent the so-called ‘exact penalty functions’ or ‘merit functions’ were introduced. This led to a modified step size control where the solutions of the quadratic subproblems were used as the descent directions of the penalty functions. The methods obtained in this way were called SQP methods.

Leonid Khachiyan’s Breakthrough

The development of the last 30 years has been greatly influenced by the aftermath of a ‘scientific earthquake’ which was triggered in 1979 by the findings of the Russian mathematician Khachiyan (1952–2005) and in 1984 by those of the Indian-born mathematician Karmarkar. The New York Times, which
profiled Khachiyan’s achievement in a November 1979 article entitled “Soviet Mathematician Is Obscure No More,” called him “the mystery author of a new mathematical theorem that has rocked the world of computer analysis.” At first it only affected linear optimization and the up to that time unchallenged dominance of the simplex method. This method was seriously questioned for the first time ever when in 1972 Klee and Minty found examples in which the simplex algorithm ran through all vertices of the feasible region. This confirmed that the ‘worst case complexity’ depended exponentially on the dimension of the problem. Afterwards people began searching for LP-algorithms with polynomial complexity. Based on Shor’s ellipsoid method, it was Khachiyan who found the first algorithm of this kind. When we speak of the ‘ellipsoid method’ today, we usually refer to the ‘Russian algorithm’ by Khachiyan. In many applications, however, it turned out to be less efficient than the simplex method. In 1984 Karmarkar achieved the breakthrough when he announced a polynomial algorithm which he claimed to be fifty times faster than the simplex algorithm. This announcement was a bit of an exaggeration but it stimulated very fruitful research activities. Gill, Murray, Saunders and Wright proved the equivalence between Karmarkar’s method and the classical logarithmic barrier methods, in particular when applied to linear optimization. Logarithmic barrier methods are methods which, unlike the example of an exterior penalty method from section 1.1, solve restricted problems by transforming them with the help of a penalty or barrier term into a parameterized family of unconstrained optimization problems the minimizers of which lie in the interior of the feasible region. First approaches to this method date back to Frisch (1954). In the sixties Fiacco and McCormick developed from that the so-called interior-point methods.
Their book [Fi/Co] contains a detailed description of classical barrier methods and is regarded as the standard reference work. A disadvantage was that the Hessians of the barrier functions were ill-conditioned in the approximative minimizers. This is usually seen as the reason for large rounding errors. Probably this flaw was the reason why people lost interest in these methods. Now, due to the reawakened interest, the special problem structure was studied again and it was shown that the rounding errors are less problematic if the implementation is thorough enough. Efficient interior-point methods have in the meantime been applied to larger classes of nonlinear optimization problems and are still topics of current research.
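The idea of a logarithmic barrier method can be illustrated with a one-dimensional toy problem. The following sketch is an assumption made for illustration and is not taken from the text: we minimize $x^2$ subject to $x \ge 1$ and replace the constraint by the barrier term $-\mu \log(x - 1)$. The unconstrained minimizer of $x^2 - \mu \log(x - 1)$ lies in the interior of the feasible region and is given in closed form by $x(\mu) = \frac{1}{2}\bigl(1 + \sqrt{1 + 2\mu}\bigr)$; the code locates it by bisection on the derivative and lets $\mu$ tend to $0$.

```python
import math

def barrier_minimizer(mu, lo=1.0, hi=2.0, iterations=60):
    """Minimize x**2 - mu*log(x - 1) on the interval (1, 2).

    The barrier function is strictly convex there, so its derivative
    2x - mu/(x - 1) is increasing and bisection finds its zero.
    The bracket (1, 2) is valid as long as 0 < mu < 4.
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        slope = 2.0 * mid - mu / (mid - 1.0)
        if slope > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Hypothetical sequence of barrier parameters tending to zero
mus = [1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
path = [barrier_minimizer(mu) for mu in mus]
for mu, x in zip(mus, path):
    exact = 0.5 * (1.0 + math.sqrt(1.0 + 2.0 * mu))
    print(f"mu = {mu:8.1e}   x(mu) = {x:.9f}   closed form = {exact:.9f}")
```

All computed minimizers are strictly feasible and approach the constrained minimizer $x^* = 1$ from the interior as $\mu$ decreases.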
Exercises

1. In Gauss’s Footsteps (cf. [Hai])

A newly discovered planetoid orbiting the sun was seen at 10 different positions before it disappeared from view. The Cartesian coordinates $(x_j, y_j)$, $j = 1, \ldots, 10$, of these positions represented in a fitted coordinate system in the orbit plane are given in the following chart:

$x_j$: −1.024940, −0.949898, −0.866114, −0.773392, −0.671372, −0.559524, −0.437067, −0.302909, −0.155493, −0.007464
$y_j$: −0.389269, −0.322894, −0.265256, −0.216557, −0.177152, −0.147582, −0.128618, −0.121353, −0.127348, −0.148885

Our aim now is to determine the orbit of this object on the basis of these observations in order to be able to predict where it will be visible again. We assume the orbit to be defined by an ellipse of the form

$$x^2 = a y^2 + b\, x y + c x + d y + e .$$

This leads to an overdetermined system of linear equations with the unknown coefficients $a, b, c, d$ and $e$, which is to be solved by means of the method of least squares. Do the same for the parabolic approach

$$x^2 = d y + e .$$

Which of the two trajectories is more likely?

2. Gauss Normal Equations

Let $m, n \in \mathbb{N}$ with $m > n$, $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. Consider the mapping $\varphi$ given by

$$\varphi(x) := \|Ax - b\|_2 \quad \text{for } x \in \mathbb{R}^n$$

and show: A vector $u \in \mathbb{R}^n$ gives the minimum of $\varphi$ if and only if $A^T A u = A^T b$.

3. Overdetermined System of Linear Equations (cf. Example 1, page 3)

a) Formulate the linear $\ell_1$-approximation problem $\|Ax - b\|_1 \to \min_x$ as an equivalent linear optimization problem similar to the Chebyshev approximation problem.

b) Calculate a solution to the above linear optimization problem with Maple® or in like manner.

c) Prove that the convex hull of the points $(3, 2)$, $(2, 1)$, $(\frac{3}{2}, \frac{3}{2})$, $(1, 3)$ yields solutions to the $\ell_1$-approximation problem. Are there any other solutions?
4. a) The minimum of the function $f$ defined by $f(x) := x_1^2 + x_2^2 - x_1 x_2 - 3 x_1$ for $x \in \mathbb{R}^2$ can instantly be calculated through differentiation. Another way is to use the method of alternate directions: The function $f$ is firstly minimized with respect to $x_1$, then with respect to $x_2$, then again with respect to $x_1$ and so on. Use $x^{(0)} := (0, 0)^T$ as the starting value and show that the sequence $x^{(k)}$ obtained with the method of alternate directions converges to the minimal point $x^* = (2, 1)^T$.

b) Let

$$f(x) = \max\bigl\{ |x_1 + 2 x_2 - 7| ,\ |2 x_1 + x_2 - 5| \bigr\} .$$

• Visualize the function $f$ by plotting its contour lines.
• Minimize $f$ similar to the previous exercise, starting with $x^{(0)} := (0, 0)^T$. Why don’t you get the minimizer $x^* = (1, 3)^T$ with $f(x^*) = 0$?

5. a) Let $m \in \mathbb{N}$. Assume that the points $P_1, \ldots, P_m$ and $S_1, \ldots, S_m$ in $\mathbb{R}^2$ are given, and we are looking for a transformation consisting of a rotation and a translation that maps the scatterplot of $(P_1, \ldots, P_m)$ as closely as possible to the scatterplot of $(S_1, \ldots, S_m)$. That means: We are looking for $a, b \in \mathbb{R}$ and $\varphi \in [0, 2\pi)$ such that the value

$$d(a, b, \varphi) := \sum_{j=1}^{m} \| f_{a,b,\varphi}(P_j) - S_j \|_2^2$$

with $f_{a,b,\varphi}(x, y) := (a, b) + (x \cos\varphi - y \sin\varphi ,\ x \sin\varphi + y \cos\varphi)$ is minimized. Solve the problem analytically.

b) Calculate the values of $a, b, \varphi$ for the special case $m = 4$ with

$S_j$: (20, 20)  (40, 40)  (120, 40)  (20, 60)
$P_j$: (20, 15)  (33, 40)  (125, 40)  (20, 65)

and

$S_j$: (20, 20)  (40, 40)  (120, 40)  (20, 60)
$P_j$: (16, 16)  (42, 30)  (115, −23)  (42, 58) .

6. Little Luca has gotten 2 dollars from his grandfather. He takes the money and runs to the kiosk around the corner to buy sweets. His favorites are licorice twists (8 cents each) and jelly babies (5 cents each). Luca wants to buy as many sweets as possible. From experience, however, he knows that he cannot eat more than 20 jelly babies. How many pieces of each type of candy should he buy in order to get as many pieces as possible for his 2 dollars, in view of the fact that not more than 20 jelly babies should be bought (as he does not want to buy more than he can consume immediately)?

a) Formulate the above situation as an optimization problem. Define an objective function $f$ and maximize it subject to the constraints $g(s_1, s_2) \le 0$. ($g$ can be a vector function; then the inequality is understood componentwise!) $s_1$ and $s_2$ denote the amount of licorice twists and jelly babies.

b) Visualize the feasible region $\mathcal{F}$, that is, the set of all pairs $(s_1, s_2)$ such that $g(s_1, s_2) \le 0$ holds. Visualize $f : \mathcal{F} \to \mathbb{N}$ and determine its maximum.

c) Is the optimal point unique?

7. The floor plan below shows the arrangement of the offices of a company with a staff of thirty people including the head of the company and two secretaries: On the plan the number in each office denotes the number of staff members working in this room. The company has bought a copier and now the “sixty-four-dollar question” is: Where should the copier go? The best spot is defined as the one that keeps traffic in the hallways to a minimum. The following facts are known about the “copying customs” of the staff members:

• Each staff member (except the head and the secretaries) uses the copier equally often. All the staff members do their own copying (that is, if there are four people in one office, each of them takes his own pile to the copier).
• Both secretaries copy 5 times as much as the other staff members; the head almost never uses the copier.
• There is nobody who walks from the server room or conference room to the copier.

Since the copier cannot be placed in any of the offices, it has to be put somewhere in the hallway, where the following restrictions have to be considered:

• The copier may not be placed directly at the main entrance.
• The copier should not be placed in the middle of the hallway. Consequently, one cannot put it between two opposing doors.
• The copier should not be placed directly in front of the glass wall of the conference room, so as not to disturb (customer) meetings.

Altogether we thus obtain four “taboo zones”. Their exact position and layout can be seen in the floor plan:
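[Figure: floor plan of the company. It marks the doors, the taboo zones, the glass wall of the conference room, the server room, the main entrance, the office of the head, and the offices S1 (Secretary 1) and S2 (Secretary 2); each office is labeled with the number of staff members working in it, and a scale bar indicates 2 m.]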
Hint: In practice it is almost impossible to solve this problem without any simplifications. In this case it holds that:

• It does not matter on which side of the hallway the copier is placed. Therefore the hallways may be treated as one-dimensional objects.
• The distance ‘desk–copier’ is measured from the door of the respective office and not from the actual desk in the office.
• You might need additional simplifications.

Implement the above problem and determine the optimal location for the copier. What would be the optimal solution if there were no “taboo zones”?

8. Everybody knows that in an advertisement a product can always do everything. Mathematical software is no exception to that. We want to investigate such promises in practice. The problem comes from the area of electric power supply: A power plant has to provide power to meet the (estimated) power demand given in the following chart.

Expected Demand of Power

12 pm – 6 am     15 GW
6 am – 9 am      30 GW
9 am – 3 pm      25 GW
3 pm – 6 pm      40 GW
6 pm – 12 pm     27 GW

There are three types of generators available, 10 type 1, 10 type 2 and 12 type 3 generators. Each generator type has a minimal and maximal capacity; the production has to be somewhere in between (or else the generator has to be shut off). The running of a generator with minimal capacity costs a certain amount of money (dollars per hour). With each unit above minimal capacity there arise additional costs (dollars per hour) (cf. chart). Costs also arise every time a generator is switched on.

Technical information and costs for the different generator types

           $m_i$       $M_i$       $e_i$    $c_i$    $f_i$
Type 1     850 MW      4000 MW     2000     4        4000
Type 2     1250 MW     1750 MW     5200     2.6      2000
Type 3     1500 MW     4000 MW     6000     6        1000

$m_i$, $M_i$: minimal and maximal capacity
$e_i$: costs per hour (minimal capacity)
$c_i$: costs per hour and per megawatt above minimal capacity
$f_i$: costs for switching on the generator

In addition to meeting the (estimated) power demands given in the chart an immediate increase by 15 % must always be possible. This must be achieved without switching on any additional generators or exceeding the maximal capacity. Let $n_{ij}$ be the number of generators of type $i$ which are in use in the $j$-th part of the day, $i = 1, 2, 3$ and $j = 1, 2, \ldots, 5$, and $s_{ij}$ the number of generators that are switched on at the beginning of the $j$-th part of the day. The total power supply of type $i$ generators in the $j$-th part of the day is denoted by $x_{ij}$. The costs can be described by the following function $K$:

$$K = \sum_{i=1}^{3} \sum_{j=1}^{5} c_i z_j (x_{ij} - m_i n_{ij}) + \sum_{i=1}^{3} \sum_{j=1}^{5} e_i z_j n_{ij} + \sum_{i=1}^{3} \sum_{j=1}^{5} f_i s_{ij}$$

$x_{ij}$ are nonnegative real numbers, $n_{ij}$ and $s_{ij}$ are nonnegative integers. $z_j$ denotes the number of hours in the $j$-th part of the day (which can be obtained from the above chart).

a) Which simplifications are “hidden” in the cost function?

b) Formulate the constraints!

c) Determine $n_{ij}$, $s_{ij}$ and $x_{ij}$ such that the total costs are as low as possible! You might not find the global minimum but only a useful suggestion. You can try a software of your choice (for example the Matlab® Optimization Toolbox or Maple®). Your solution has to meet all the constraints.
9. Portfolio optimization (cf. [Bar1], p. 1 ff)

We have a sum of money to split between three investment possibilities which offer rates of return $r_1$, $r_2$ and $r_3$. If $x_1$, $x_2$ and $x_3$ represent the portions of total investment, we expect an overall return $R = r_1 x_1 + r_2 x_2 + r_3 x_3$. If the management charge associated with the $j$-th possibility is $c_j x_j$, then the total cost of our investment is $c_1 x_1 + c_2 x_2 + c_3 x_3$. For an aimed return $R$ we want to pay the least charges to achieve this. Then we have to solve the problem:

$$\begin{aligned} c_1 x_1 + c_2 x_2 + c_3 x_3 &\longrightarrow \min \\ x_1 + x_2 + x_3 &= 1 \\ r_1 x_1 + r_2 x_2 + r_3 x_3 &= R \\ x_1, x_2, x_3 &\ge 0 \end{aligned}$$

Consider the corresponding quadratic penalty problem $\Phi_\lambda(x) := f(x) + \lambda P(x)$ for $x = (x_1, x_2, x_3)^T \in \mathbb{R}^3$, $h_1(x) := x_1 + x_2 + x_3 - 1$, $h_2(x) := r_1 x_1 + r_2 x_2 + r_3 x_3 - R$ and

$$P(x) := h_1(x)^2 + h_2(x)^2 + \sum_{j=1}^{3} \psi(x_j)^2$$

with

$$\psi(u) := \max(-u, 0) = \begin{cases} 0 , & \text{if } u \ge 0 \\ -u , & \text{if } u < 0 \end{cases}$$

where negative investments $x_j$ are penalized by a high management charge $\lambda \psi(x_j)^2$. Use the values $R = 1.25$ together with

$$c_1 = 10,\ c_2 = 9,\ c_3 = 14 \quad \text{and} \quad r_1 = 1.2,\ r_2 = 1.1,\ r_3 = 1.4 .$$

Calculate the minimum of $\Phi_\lambda$ for the parameter values $\lambda = 10^3, 10^6, 10^9$, for instance by means of the function fminsearch from the Matlab® Optimization Toolbox (cf. the m-file Penalty.m at the end of section 1.1). How does the solution change in the cases where $c_3 = 12$ and $c_3 = 11$?
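As a starting point for experiments, the penalty function $\Phi_\lambda$ of exercise 9 can be written down directly. The following Python sketch is only an illustration and not the m-file Penalty.m mentioned above; the function name `penalty_objective` and its argument names are assumptions. It evaluates $\Phi_\lambda$ at a feasible point (obtained by setting $x_2 = 0$ and solving the two equality constraints) and at a point that violates the return constraint; the minimization itself is left to the reader.

```python
def penalty_objective(x, lam, c=(10.0, 9.0, 14.0), r=(1.2, 1.1, 1.4), target=1.25):
    """Quadratic penalty function Phi_lambda(x) = f(x) + lambda * P(x)."""
    cost = sum(cj * xj for cj, xj in zip(c, x))            # f(x): total management charge
    h1 = sum(x) - 1.0                                      # portions must add up to 1
    h2 = sum(rj * xj for rj, xj in zip(r, x)) - target     # aimed return R
    psi_sq = sum(max(-xj, 0.0) ** 2 for xj in x)           # penalizes negative investments
    return cost + lam * (h1 ** 2 + h2 ** 2 + psi_sq)

feasible = (0.75, 0.0, 0.25)      # satisfies both equality constraints and x >= 0
infeasible = (1.0, 0.0, 0.0)      # return is 1.2 instead of 1.25

for lam in (1e3, 1e6, 1e9):
    print(lam, penalty_objective(feasible, lam), penalty_objective(infeasible, lam))
```

At a feasible point the penalty term vanishes (up to rounding), so $\Phi_\lambda$ coincides with the cost $f$; at the infeasible point the value grows with $\lambda$.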
Chapter 2

Optimality Conditions

2.0 Introduction
2.1 Convex Sets, Inequalities
2.2 Local First-Order Optimality Conditions
    Karush–Kuhn–Tucker Conditions
    Convex Functions
    Constraint Qualifications
    Convex Optimization Problems
2.3 Local Second-Order Optimality Conditions
2.4 Duality
    Lagrange Dual Problem
    Geometric Interpretation
    Saddlepoints and Duality
    Perturbation and Sensitivity Analysis
    Economic Interpretation of Duality
    Strong Duality
Exercises
2.0 Introduction

W. Forst and D. Hoffmann, Optimization—Theory and Practice, Springer Undergraduate Texts in Mathematics and Technology, DOI 10.1007/978-0-387-78977-4_2, © Springer Science+Business Media, LLC 2010

In this chapter we will focus on necessary and sufficient optimality conditions for constrained problems. As an introduction let us remind ourselves of the optimality conditions for unconstrained and equality constrained problems, which are commonly dealt with in basic Mathematics lectures. We consider a real-valued function $f : D \to \mathbb{R}$ with domain $D \subset \mathbb{R}^n$ and define, as usual, for a point $x_0 \in D$:

1) $f$ has a local minimum in $x_0$ $:\Longleftrightarrow$ $\exists\, U \in \mathcal{U}_{x_0}\ \forall\, x \in U \cap D:\ f(x) \ge f(x_0)$
2) $f$ has a strict local minimum in $x_0$ $:\Longleftrightarrow$ $\exists\, U \in \mathcal{U}_{x_0}\ \forall\, x \in U \cap D \setminus \{x_0\}:\ f(x) > f(x_0)$
3) $f$ has a global minimum in $x_0$ $:\Longleftrightarrow$ $\forall\, x \in D:\ f(x) \ge f(x_0)$
4) $f$ has a strict global minimum in $x_0$ $:\Longleftrightarrow$ $\forall\, x \in D \setminus \{x_0\}:\ f(x) > f(x_0)$

Here, $\mathcal{U}_{x_0}$ denotes the neighborhood system of $x_0$. We often say “$x_0$ is a local minimizer of $f$” or “$x_0$ is a local minimum point of $f$” instead of “$f$ has a local minimum in $x_0$” and so on. The minimizer is a point $x_0 \in D$, the minimum is the corresponding value $f(x_0)$.

Necessary Condition

Suppose that the function $f$ has a local minimum in $x_0 \in \mathring{D}$, that is, in an interior point of $D$. Then:

a) If $f$ is differentiable in $x_0$, then $\nabla f(x_0) = 0$ holds.
b) If $f$ is twice continuously differentiable in a neighborhood of $x_0$, then the Hessian $H_f(x_0) = \nabla^2 f(x_0) = \left( \frac{\partial^2 f}{\partial x_\nu \partial x_\mu}(x_0) \right)$ is positive semidefinite.

We will use the notation $f'(x_0)$ (to denote the derivative of $f$ at $x_0$; as we know, this is a linear map from $\mathbb{R}^n$ to $\mathbb{R}$, read as a row vector) as well as the corresponding transposed vector $\nabla f(x_0)$ (gradient, column vector).

Points $x \in \mathring{D}$ with $\nabla f(x) = 0$ are called stationary points. At a stationary point there can be a local minimum, a local maximum or a saddlepoint. To determine that there is a local minimum at a stationary point, we use the following:

Sufficient Condition

Suppose that the function $f$ is twice continuously differentiable in a neighborhood of $x_0 \in D$; also suppose that the necessary optimality condition $\nabla f(x_0) = 0$ holds and that the Hessian $\nabla^2 f(x_0)$ is positive definite. Then $f$ has a strict local minimum in $x_0$.

The proof of this proposition is based on the Taylor theorem and we regard it as known from Calculus. Let us recall that a symmetric $(n, n)$-matrix $A$ is positive definite if and only if all principal subdeterminants

$$\det \begin{pmatrix} a_{11} & \ldots & a_{1k} \\ \vdots & & \vdots \\ a_{k1} & \ldots & a_{kk} \end{pmatrix} \quad (k = 1, \ldots, n)$$
are positive (cf. exercise 3). Now let $f$ be a real-valued function with domain $D \subset \mathbb{R}^n$ which we want to minimize subject to the equality constraints

$$h_j(x) = 0 \quad (j = 1, \ldots, p)$$

for $p < n$; here, let $h_1, \ldots, h_p$ also be defined on $D$. We are looking for local minimizers of $f$, that is, points $x_0 \in D$ which belong to the feasible region

$$\mathcal{F} := \{ x \in D \mid h_j(x) = 0 \ (j = 1, \ldots, p) \}$$

and to which a neighborhood $U$ exists with $f(x) \ge f(x_0)$ for all $x \in U \cap \mathcal{F}$. Intuitively, it seems reasonable to solve the constraints for $p$ of the $n$ variables, and to eliminate these by inserting them into the objective function. For the reduced objective function we thereby get an unconstrained problem for which under suitable assumptions the above necessary optimality condition holds. After these preliminary remarks, we are now able to formulate the following necessary optimality condition:

Lagrange Multiplier Rule

Let $D \subset \mathbb{R}^n$ be open and $f, h_1, \ldots, h_p$ continuously differentiable in $D$. Suppose that $f$ has a local minimum in $x_0 \in \mathcal{F}$ subject to the constraints $h_j(x) = 0$ $(j = 1, \ldots, p)$. Let also the Jacobian $\left( \frac{\partial h_j}{\partial x_k}(x_0) \right)_{p,n}$ have rank $p$. Then there exist real numbers $\mu_1, \ldots, \mu_p$, the so-called Lagrange multipliers, with

$$\nabla f(x_0) + \sum_{j=1}^{p} \mu_j \nabla h_j(x_0) = 0 . \qquad (1)$$

Corresponding to our preliminary remarks, a main tool in a proof would be the Implicit Function Theorem. We assume that interested readers are familiar with a proof from multidimensional analysis. In addition, the results will be generalized in theorem 2.2.5. Therefore we do not give a proof here, but instead illustrate the matter with the following simple problem, which was already introduced in chapter 1 (as example 5):

Example 1

With $f(x) := x_1 x_2^2$ and $h(x) := h_1(x) := x_1^2 + x_2^2 - 2$ for $x = (x_1, x_2)^T \in D := \mathbb{R}^2$ we consider the problem:
$$f(x) \longrightarrow \min \quad \text{subject to the constraint } h(x) = 0 .$$

We hence have $n = 2$ and $p = 1$. Before we start, however, note that this problem can of course be solved very easily straight away: One inserts $x_2^2$ from the constraint $x_1^2 + x_2^2 - 2 = 0$ into $f(x)$ and thus gets a one-dimensional problem.

Points $x$ meeting the constraint are different from $0$ and thus also meet the rank condition. With $\mu := \mu_1$ the equation $\nabla f(x) + \mu \nabla h(x) = 0$ translates into

$$x_2^2 + \mu\, 2 x_1 = 0 \quad \text{and} \quad 2 x_1 x_2 + \mu\, 2 x_2 = 0 .$$

Multiplication of the first equation by $x_2$ and the second by $x_1$ gives

$$x_2^3 + 2 \mu x_1 x_2 = 0 \quad \text{and} \quad 2 x_1^2 x_2 + 2 \mu x_1 x_2 = 0$$

and thus

$$x_2^3 = 2 x_1^2 x_2 .$$

For $x_2 = 0$ the constraint yields $x_1 = \pm\sqrt{2}$. Of these two evidently only $x_1 = \sqrt{2}$ remains as a potential minimizer. If $x_2 \neq 0$, we have $x_2^2 = 2 x_1^2$ and hence with the constraint $3 x_1^2 = 2$, thus $x_1 = \pm\sqrt{2/3}$ and then $x_2 = \pm 2/\sqrt{3}$. In this case the distribution of the zeros and signs of $f$ gives that only $x = (-\sqrt{2/3}, \pm 2/\sqrt{3})^T$ remain as potential minimizers. Since $f$ is continuous on the compact set $\{x \in \mathbb{R}^2 \mid h(x) = 0\}$, we know that there exists a global minimizer. Altogether, we get: $f$ attains its global minimum at $(-\sqrt{2/3}, \pm 2/\sqrt{3})^T$, the point $(\sqrt{2}, 0)^T$ yields a local minimum. The following picture illustrates the gradient condition very well:
[Figure: the curve $h = 0$ in the $(x_1, x_2)$-plane together with the regions $f < 0$ and $f > 0$.]
The aim of our further investigations will be to generalize the Lagrange Multiplier Rule to minimization problems with inequality constraints:
$$(P) \qquad f(x) \longrightarrow \min$$

subject to the constraints

$$g_i(x) \le 0 \ \text{ for } i \in I := \{1, \ldots, m\} , \qquad h_j(x) = 0 \ \text{ for } j \in E := \{1, \ldots, p\} .$$

With $m, p \in \mathbb{N}_0$ (hence, $E = \emptyset$ or $I = \emptyset$ are allowed), the functions $f, g_1, \ldots, g_m, h_1, \ldots, h_p$ are supposed to be continuously differentiable on an open subset $D$ in $\mathbb{R}^n$ and $p \le n$. The set

$$\mathcal{F} := \{ x \in D \mid g_i(x) \le 0 \text{ for } i \in I ,\ h_j(x) = 0 \text{ for } j \in E \}$$

is called, in analogy to the above, the feasible region or set of feasible points of $(P)$. In most cases we state the problem in the slightly shortened form

$$(P) \qquad \begin{cases} f(x) \longrightarrow \min \\ g_i(x) \le 0 \ \text{ for } i \in I \\ h_j(x) = 0 \ \text{ for } j \in E . \end{cases}$$

The optimal value $v(P)$ to problem $(P)$ is defined as

$$v(P) := \inf \{ f(x) : x \in \mathcal{F} \} .$$

We allow $v(P)$ to attain the extended values $+\infty$ and $-\infty$. We follow the standard convention that the infimum of the empty set is $\infty$. If there are feasible points $x_k$ with $f(x_k) \to -\infty$ $(k \to \infty)$, then $v(P) = -\infty$ and we say problem $(P)$, or the function $f$ on $\mathcal{F}$, is unbounded from below. We say $x_0$ is a minimal point or a minimizer if $x_0$ is feasible and $f(x_0) = v(P)$.

In order to formulate optimality conditions for $(P)$, we will need some simple tools from Convex Analysis. These will be provided in the following section.

2.1 Convex Sets, Inequalities

In the following consider the space $\mathbb{R}^n$ for $n \in \mathbb{N}$ with the euclidean norm and let $C$ be a nonempty subset of $\mathbb{R}^n$. The standard inner product or scalar product on $\mathbb{R}^n$ is given by $\langle x, y \rangle := x^T y = \sum_{\nu=1}^{n} x_\nu y_\nu$ for $x, y \in \mathbb{R}^n$. The euclidean norm of a vector $x \in \mathbb{R}^n$ is defined by $\|x\| := \|x\|_2 := \sqrt{\langle x, x \rangle}$.

Definition

a) $C$ is called convex $:\Longleftrightarrow$ $\forall\, x_1, x_2 \in C\ \forall\, \lambda \in (0, 1):\ (1 - \lambda) x_1 + \lambda x_2 \in C$
b) $C$ is called a cone (with apex 0) $:\Longleftrightarrow$ $\forall\, x \in C\ \forall\, \lambda > 0:\ \lambda x \in C$
Remark
$C$ is a convex cone if and only if: $\forall\, x_1, x_2 \in C\ \ \forall\, \lambda_1, \lambda_2 > 0\quad \lambda_1 x_1 + \lambda_2 x_2 \in C$

Proposition 2.1.1 (Separating Hyperplane Theorem)
Let $C$ be closed and convex, and $b \in \mathbb{R}^n \setminus C$. Then there exist $p \in \mathbb{R}^n \setminus \{0\}$ and $\alpha \in \mathbb{R}$ such that $\langle p, x\rangle \ge \alpha > \langle p, b\rangle$ for all $x \in C$, that is, the hyperplane defined by $H := \{x \in \mathbb{R}^n \mid \langle p, x\rangle = \alpha\}$ strictly separates $C$ and $b$. If furthermore $C$ is a cone, we can choose $\alpha = 0$.

The following two pictures show that neither of the two assumptions, that $C$ is convex and that $C$ is closed, can be dropped. The set $C$ on the left is convex but not closed; on the right it is closed but not convex.

[Figure: two sets C, each with a point b outside of C]
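Proof: Since $C$ is closed,

$$\delta := \delta(b, C) = \inf\{\|x - b\| : x \in C\}$$

is positive, and there exists a sequence $(x_k)$ in $C$ such that $\|x_k - b\| \to \delta$. Wlog let $x_k \to q$ for a $q \in \mathbb{R}^n$ (otherwise use a suitable subsequence). Then $q$ is in $C$ with $\|p\| = \delta > 0$ for $p := q - b$. For $x \in C$ and $0 < \tau < 1$ it holds that

$$\|p\|^2 = \delta^2 \le \|(1-\tau)q + \tau x - b\|^2 = \|q - b + \tau(x - q)\|^2 = \|p\|^2 + 2\tau\langle x - q, p\rangle + \tau^2\|x - q\|^2.$$

From this we obtain $0 \le 2\langle x - q, p\rangle + \tau\|x - q\|^2$ and, after passage to the limit $\tau \to 0$, $0 \le \langle x - q, p\rangle$. With $\alpha := \delta^2 + \langle b, p\rangle = \langle q, p\rangle$ the first assertion $\langle p, x\rangle \ge \alpha > \langle p, b\rangle$ follows.

If $C$ is a cone, then for all $\lambda > 0$ and $x \in C$ the vectors $\frac{1}{\lambda}x$ and $\lambda x$ are also in $C$. Therefore $\langle p, x\rangle = \lambda\,\langle p, \tfrac{1}{\lambda}x\rangle \ge \lambda\alpha$ holds and consequently, letting $\lambda \to 0$, $\langle p, x\rangle \ge 0$. Moreover, $\lambda\,\langle p, x\rangle = \langle p, \lambda x\rangle \ge \alpha$ shows, again for $\lambda \to 0$, that $0 \ge \alpha$; hence, $\langle p, b\rangle < \alpha \le 0$.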
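The construction used in the proof can be traced numerically. The following minimal Python sketch assumes the closed convex cone $C = \mathbb{R}^n_+$, chosen here only because its nearest-point map is componentwise clipping; the function names are illustrative and do not come from the text.

```python
def inner(u, v):
    """Standard inner product <u, v>."""
    return sum(ui * vi for ui, vi in zip(u, v))


def project_onto_orthant(b):
    """Nearest point q of C = R^n_+ to b (assumption: C is the nonnegative orthant)."""
    return [max(bi, 0.0) for bi in b]


def separating_hyperplane(b):
    """Return (p, alpha) as in the proof: p = q - b, alpha = delta^2 + <b, p>."""
    q = project_onto_orthant(b)
    p = [qi - bi for qi, bi in zip(q, b)]
    delta_sq = inner(p, p)
    if delta_sq == 0.0:
        raise ValueError("b lies in C, there is nothing to separate")
    alpha = delta_sq + inner(b, p)
    return p, alpha


b = [1.0, -2.0, -0.5]                      # a point outside of C
p, alpha = separating_hyperplane(b)
samples = [[0.0, 0.0, 0.0], [1.0, 0.0, 3.0], [2.5, 4.0, 0.1]]   # some points of C
separated = all(inner(p, x) >= alpha > inner(p, b) for x in samples)
```

Since this $C$ is a cone, the computed $\alpha$ is 0, in line with the addendum of the proposition.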
Definition

$$C^* := \{y \in \mathbb{R}^n \mid \forall\, x \in C\ \ \langle y, x\rangle \ge 0\}$$

is called the dual cone of $C$.

[Figure: a cone C and its dual cone C*]

Remark
$C^*$ is a closed, convex cone. We omit a proof. The statement is an immediate consequence of the definition of the dual cone.

As an important application let us now consider the following situation: Let $A = (a_1, \dots, a_n) \in \mathbb{R}^{m \times n}$ be an $(m, n)$-matrix with columns $a_1, \dots, a_n \in \mathbb{R}^m$.

Definition
$\operatorname{cone}(A) := \operatorname{cone}(a_1, \dots, a_n) := A\,\mathbb{R}^n_+ = \{Aw \mid w \in \mathbb{R}^n_+\}$ is called the (positive) conic hull of $a_1, \dots, a_n$.

Lemma 2.1.2
1) $\operatorname{cone}(A)$ is a closed, convex cone.
2) $\operatorname{cone}(A)^* = \{y \in \mathbb{R}^m \mid A^T y \ge 0\}$

Proof: 1) It is obvious that $C_n := \operatorname{cone}(a_1, \dots, a_n)$ is a convex cone. We will prove that it is closed by means of induction over $n$: For $n = 1$ the cone $C_1 = \{\xi_1 a_1 \mid \xi_1 \ge 0\}$ is, in the nontrivial case, a closed half line. For the induction step from $n$ to $n + 1$ we assume that every conic hull generated by not more than $n$ vectors is closed.
Firstly, consider the case that $-a_j \in \operatorname{cone}(a_1, \dots, a_{j-1}, a_{j+1}, \dots, a_{n+1})$ for all $j = 1, \dots, n + 1$. It follows that $C_{n+1} = \operatorname{span}\{a_1, \dots, a_{n+1}\}$ and therefore obviously that $C_{n+1}$ is closed: The inclusion from left to right is trivial, and the other one follows, with $\xi_1, \dots, \xi_{n+1} \in \mathbb{R}$, from

$$\sum_{j=1}^{n+1} \xi_j a_j = \sum_{j=1}^{n+1} |\xi_j|\,\operatorname{sign}(\xi_j)\,a_j .$$

Otherwise, assume wlog $-a_{n+1} \notin \operatorname{cone}(a_1, \dots, a_n) = C_n$; because of the induction hypothesis, $C_n$ is closed and therefore $\delta := \delta(-a_{n+1}, C_n)$ is positive. Every $x \in C_{n+1}$ can be written in the form $x = \sum_{j=1}^{n+1} \xi_j a_j$ with $\xi_1, \dots, \xi_{n+1} \in \mathbb{R}_+$. Then

$$\xi_{n+1} \le \frac{\|x\|}{\delta}$$

holds because in the nontrivial case $\xi_{n+1} > 0$ this follows directly from

$$\|x\| = \xi_{n+1}\,\Big\| -a_{n+1} - \underbrace{\sum_{j=1}^{n} \frac{\xi_j}{\xi_{n+1}}\,a_j}_{\in\, C_n} \Big\| \ge \xi_{n+1}\,\delta .$$

Let $(x^{(k)})$ be a sequence in $C_{n+1}$ and $x \in \mathbb{R}^m$ with $x^{(k)} \to x$ for $k \to \infty$. We want to show $x \in C_{n+1}$: For $k \in \mathbb{N}$ there exist $\xi_1^{(k)}, \dots, \xi_{n+1}^{(k)} \in \mathbb{R}_+$ such that

$$x^{(k)} = \sum_{j=1}^{n+1} \xi_j^{(k)} a_j .$$

As $(x^{(k)})$ is a convergent sequence, there exists an $M > 0$ such that $\|x^{(k)}\| \le M$ for all $k \in \mathbb{N}$, and we get

$$0 \le \xi_{n+1}^{(k)} \le \frac{M}{\delta} .$$

Wlog let the sequence $(\xi_{n+1}^{(k)})$ be convergent (otherwise, consider a suitable subsequence), and set $\xi_{n+1} := \lim_{k\to\infty} \xi_{n+1}^{(k)}$. So we have

$$C_n \ni x^{(k)} - \xi_{n+1}^{(k)} a_{n+1} \longrightarrow x - \xi_{n+1} a_{n+1} .$$

By induction, $C_n$ is closed, thus $x - \xi_{n+1} a_{n+1}$ is an element of $C_n$ and consequently $x$ is in $C_{n+1}$.
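2) The definitions of $\operatorname{cone}(A)$ and of the dual cone give immediately:

$$\begin{aligned}
\operatorname{cone}(A)^* &= \{y \in \mathbb{R}^m \mid \forall\, v \in \operatorname{cone}(A)\ \ \langle v, y\rangle \ge 0\} \\
&= \{y \in \mathbb{R}^m \mid \forall\, w \in \mathbb{R}^n_+\ \ \langle Aw, y\rangle \ge 0\} \\
&= \{y \in \mathbb{R}^m \mid \forall\, w \in \mathbb{R}^n_+\ \ \langle w, A^T y\rangle \ge 0\} \\
&= \{y \in \mathbb{R}^m \mid A^T y \ge 0\}
\end{aligned}$$

A crucial tool for the following considerations is the

Theorem of the Alternative (Farkas (1902))
For $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ the following are strong alternatives:
1) $\exists\, x \in \mathbb{R}^n_+\quad Ax = b$
2) $\exists\, y \in \mathbb{R}^m\quad A^T y \ge 0 \ \wedge\ b^T y < 0$

Proof: 1) $\Longrightarrow$ $\neg$ 2): For $x \in \mathbb{R}^n_+$ with $Ax = b$ and $y \in \mathbb{R}^m$ with $A^T y \ge 0$ we have $b^T y = x^T A^T y \ge 0$.
$\neg$ 1) $\Longrightarrow$ 2): $C := \operatorname{cone}(A)$ is a closed convex cone which does not contain the vector $b$: Following the addendum in the Separating Hyperplane Theorem there exists a $y \in \mathbb{R}^m$ with $\langle y, x\rangle \ge 0 > \langle y, b\rangle$ for all $x \in C$, in particular $a_\nu^T y = \langle y, a_\nu\rangle \ge 0$, that is, $A^T y \ge 0$.

If we illustrate the assertion, the theorem can be memorized easily: 1) means nothing but $b \in \operatorname{cone}(A)$. With the open 'half space'

$$H_b := \{y \in \mathbb{R}^m \mid \langle y, b\rangle < 0\}$$

the condition 2) states that $\operatorname{cone}(A)^*$ and $H_b$ have a common point. In the two-dimensional case, for example, we can illustrate the theorem with the following picture, which shows case 1):

[Figure: the vectors a_1, a_2 and b, the cone cone(A), its dual cone cone(A)* and the half space H_b]

If you rotate the vector $b$ out of $\operatorname{cone}(A)$, you get case 2).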
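For a small case the two alternatives can be decided by hand or by a few lines of code. The following Python sketch makes a strong simplifying assumption that is not part of the theorem: $A$ is an invertible $2 \times 2$ matrix. Then $x = A^{-1}b$ is the only candidate for 1), and if some $x_i < 0$, the $i$-th row $y$ of $A^{-1}$ satisfies $A^T y = e_i \ge 0$ and $b^T y = x_i < 0$. The function name `farkas_2x2` is hypothetical.

```python
from fractions import Fraction


def farkas_2x2(A, b):
    """Decide the Farkas alternative for an invertible 2x2 matrix A (list of rows).

    Returns ("1", x) with x >= 0 and A x = b,
    or      ("2", y) with A^T y >= 0 and b^T y < 0.
    """
    (a11, a12), (a21, a22) = [[Fraction(v) for v in row] for row in A]
    b1, b2 = Fraction(b[0]), Fraction(b[1])
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("this sketch assumes an invertible matrix A")
    # Rows of the inverse matrix A^{-1}
    inv = [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]
    x = [inv[0][0] * b1 + inv[0][1] * b2, inv[1][0] * b1 + inv[1][1] * b2]
    if min(x) >= 0:
        return "1", x
    i = 0 if x[0] < 0 else 1
    # y := i-th row of A^{-1}; then A^T y = e_i >= 0 and b^T y = x_i < 0
    return "2", inv[i]


A = [[1, 1], [0, 1]]                  # columns a_1 = (1, 0)^T, a_2 = (1, 1)^T
inside = farkas_2x2(A, [2, 1])        # b in cone(A): alternative 1)
outside = farkas_2x2(A, [0, 1])       # b rotated out of cone(A): alternative 2)
```

For general $A$ one would typically solve a linear program instead; the sketch is only meant to make the geometric picture concrete.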
2.2 Local First-Order Optimality Conditions

We want to take up the minimization problem $(P)$ from page 39 again and use the notation introduced there. For $x_0 \in \mathcal{F}$, the index set

$$A(x_0) := \{i \in I \mid g_i(x_0) = 0\}$$

describes the inequality restrictions which are active at $x_0$. The active constraints have a special significance: They restrict feasible corrections around a feasible point. If a constraint is inactive ($g_i(x_0) < 0$) at the feasible point $x_0$, it is possible to move from $x_0$ a bit in any direction without violating this constraint.

Definition
Let $d \in \mathbb{R}^n$ and $x_0 \in \mathcal{F}$. Then $d$ is called a feasible direction of $\mathcal{F}$ at $x_0$ $:\Longleftrightarrow\ \exists\, \delta > 0\ \ \forall\, \tau \in [0, \delta]\quad x_0 + \tau d \in \mathcal{F}$.

A 'small' movement from $x_0$ along such a direction gives feasible points. The set of all feasible directions of $\mathcal{F}$ at $x_0$ is a cone, denoted by $C_{fd}(x_0)$.

Let $d$ be a feasible direction of $\mathcal{F}$ at $x_0$. If we choose a $\delta$ according to the definition, then we have

$$\underbrace{g_i(x_0 + \tau d)}_{\le 0} = \underbrace{g_i(x_0)}_{= 0} + \tau\, g_i'(x_0)d + o(\tau)$$

for $i \in A(x_0)$ and $0 < \tau \le \delta$. Dividing by $\tau$ and passing to the limit as $\tau \to 0$ gives $g_i'(x_0)d \le 0$. In the same way we get $h_j'(x_0)d = 0$ for all $j \in E$.

Definition
For any $x_0 \in \mathcal{F}$

$$C_\ell(P, x_0) := \{d \in \mathbb{R}^n \mid \forall\, i \in A(x_0)\ \ g_i'(x_0)d \le 0,\ \ \forall\, j \in E\ \ h_j'(x_0)d = 0\}$$

is called the linearizing cone of $(P)$ at $x_0$. Hence, $C_\ell(x_0) := C_\ell(P, x_0)$ contains at least all feasible directions of $\mathcal{F}$ at $x_0$:

$$C_{fd}(x_0) \subset C_\ell(x_0)$$

The linearizing cone depends not only on the set of feasible points $\mathcal{F}$ but also on the representation of $\mathcal{F}$ (compare Example 4). We therefore write more precisely $C_\ell(P, x_0)$.
Definition
For any $x_0 \in D$

$$C_{dd}(x_0) := \{d \in \mathbb{R}^n \mid f'(x_0)d < 0\}$$

is called the cone of descent directions of $f$ at $x_0$. Note that 0 is not in $C_{dd}(x_0)$; also, for all $d \in C_{dd}(x_0)$

$$f(x_0 + \tau d) = f(x_0) + \tau \underbrace{f'(x_0)d}_{< 0} + o(\tau)$$

holds and therefore, $f(x_0 + \tau d) < f(x_0)$ for sufficiently small $\tau > 0$. Thus, $d \in C_{dd}(x_0)$ guarantees that the objective function $f$ can be reduced along this direction. Hence, for a local minimizer $x_0$ of $(P)$ it necessarily holds that $C_{dd}(x_0) \cap C_{fd}(x_0) = \emptyset$.
We will illustrate the above definitions with the following

Example 1
Let

$$\mathcal{F} := \{x = (x_1, x_2)^T \in \mathbb{R}^2 \mid x_1^2 + x_2^2 - 1 \le 0,\ -x_1 \le 0,\ -x_2 \le 0\},$$

and $f$ be defined by $f(x) := x_1 + x_2$. Hence, $\mathcal{F}$ is the part of the unit disk which lies in the first quadrant. The objective function $f$ evidently attains a (strict, global) minimum at $(0, 0)^T$. In both of the following pictures $\mathcal{F}$ is colored in dark blue.

[Figure: two panels, x0 := (0, 0)^T (left) and x0 := (1, 0)^T (right), each showing the feasible region, the gradient ∇f(x0) and a direction d]

a) Let $x_0 := (0, 0)^T$. $g_1(x) := x_1^2 + x_2^2 - 1$, $g_2(x) := -x_1$ and $g_3(x) := -x_2$ give $A(x_0) = \{2, 3\}$. A vector $d := (d_1, d_2)^T \in \mathbb{R}^2$ is a feasible direction of $\mathcal{F}$ at $x_0$ if and only if $d_1 \ge 0$ and $d_2 \ge 0$ hold. Hence, the set $C_{fd}(x_0)$ of feasible directions is a convex cone, namely, the first quadrant, and it is represented in the left picture by the gray angular domain. $g_2'(x_0) = (-1, 0)$ and $g_3'(x_0) = (0, -1)$ produce

$$C_\ell(x_0) = \{d \in \mathbb{R}^2 \mid -d_1 \le 0,\ -d_2 \le 0\}.$$

Hence, in this example, the linearizing cone and the cone of feasible directions are the same. Moreover, the cone of descent directions $C_{dd}(x_0)$, colored in light blue in the picture, is, because of $f'(x_0)d = (1, 1)d = d_1 + d_2$, an open half space and disjoint to $C_\ell(x_0)$.
b) If $x_0 := (1, 0)^T$, we have $A(x_0) = \{1, 3\}$, and $d := (d_1, d_2)^T \in \mathbb{R}^2$ is a feasible direction of $\mathcal{F}$ at $x_0$ if and only if $d = (0, 0)^T$ or $d_1 < 0$ and $d_2 \ge 0$ hold. The set of feasible directions is again a convex cone. In the right picture it is depicted by the shifted gray angular domain. Because of $g_1'(x_0) = (2, 0)$ and $g_3'(x_0) = (0, -1)$, we get

$$C_\ell(x_0) = \{d \in \mathbb{R}^2 \mid d_1 \le 0,\ d_2 \ge 0\}.$$

As we can see, in this case the linearizing cone properly contains the cone of feasible directions. In the picture the cone of descent directions has also been moved to $x_0$. We can see that it contains feasible directions of $\mathcal{F}$ at $x_0$. Consequently, $f$ does not have a local minimum at $x_0$.

Proposition 2.2.1
For $x_0 \in \mathcal{F}$ it holds that $C_\ell(x_0) \cap C_{dd}(x_0) = \emptyset$ if and only if there exist $\lambda \in \mathbb{R}^m_+$ and $\mu \in \mathbb{R}^p$ such that

$$\nabla f(x_0) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x_0) + \sum_{j=1}^{p} \mu_j \nabla h_j(x_0) = 0 \tag{2}$$

and

$$\lambda_i\, g_i(x_0) = 0 \ \text{ for all } i \in I. \tag{3}$$

Together, these conditions ($x_0 \in \mathcal{F}$, $\lambda \ge 0$, (2) and (3)) are called Karush–Kuhn–Tucker conditions, or KKT conditions. (3) is called the complementary slackness condition or complementarity condition. This condition of course means $\lambda_i = 0$ or (in the nonexclusive sense) $g_i(x_0) = 0$ for all $i \in I$. A corresponding pair $(\lambda, \mu)$ or the scalars $\lambda_1, \dots, \lambda_m, \mu_1, \dots, \mu_p$ are called Lagrange multipliers. The function $L$ defined by

$$L(x, \lambda, \mu) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \mu_j h_j(x) = f(x) + \lambda^T g(x) + \mu^T h(x)$$

for $x \in D$, $\lambda \in \mathbb{R}^m_+$ and $\mu \in \mathbb{R}^p$ is called the Lagrange function or Lagrangian of $(P)$. Here we have combined the $m$ functions $g_i$ to a vector-valued function $g$ and respectively the $p$ functions $h_j$ to a vector-valued function $h$.

Points $x_0 \in \mathcal{F}$ fulfilling (2) and (3) with a suitable $\lambda \in \mathbb{R}^m_+$ and $\mu \in \mathbb{R}^p$ play an important role. They are called Karush–Kuhn–Tucker points, or KKT points.
Owing to the complementarity condition (3), the multipliers $\lambda_i$ corresponding to inactive restrictions at $x_0$ must be zero. So we can omit the terms for $i \in I \setminus A(x_0)$ from (2) and rewrite this condition as

$$\nabla f(x_0) + \sum_{i \in A(x_0)} \lambda_i \nabla g_i(x_0) + \sum_{j=1}^{p} \mu_j \nabla h_j(x_0) = 0 . \tag{2'}$$
Proof: By definition of $C_\ell(x_0)$ and $C_{dd}(x_0)$ it holds that:

$$d \in C_\ell(x_0) \cap C_{dd}(x_0) \iff \begin{cases} f'(x_0)d < 0 \\ \forall\, i \in A(x_0)\ \ g_i'(x_0)d \le 0 \\ \forall\, j \in E\ \ h_j'(x_0)d = 0 \end{cases} \iff \begin{cases} f'(x_0)d < 0 \\ \forall\, i \in A(x_0)\ \ -g_i'(x_0)d \ge 0 \\ \forall\, j \in E\ \ -h_j'(x_0)d \ge 0 \\ \forall\, j \in E\ \ h_j'(x_0)d \ge 0 \end{cases}$$

With that the Theorem of the Alternative from section 2.1 directly provides the following equivalence: $C_\ell(x_0) \cap C_{dd}(x_0) = \emptyset$ if and only if there exist $\lambda_i \ge 0$ for $i \in A(x_0)$ and $\mu_j' \ge 0$, $\mu_j'' \ge 0$ for $j \in E$ such that

$$\nabla f(x_0) = \sum_{i \in A(x_0)} \lambda_i \big({-\nabla g_i(x_0)}\big) + \sum_{j=1}^{p} \mu_j' \big({-\nabla h_j(x_0)}\big) + \sum_{j=1}^{p} \mu_j'' \nabla h_j(x_0).$$

If we now set $\lambda_i := 0$ for $i \in I \setminus A(x_0)$ and $\mu_j := \mu_j' - \mu_j''$ for $j \in E$, the above is equivalent to: There exist $\lambda_i \ge 0$ for $i \in I$ and $\mu_j \in \mathbb{R}$ for $j \in E$ with

$$\nabla f(x_0) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x_0) + \sum_{j=1}^{p} \mu_j \nabla h_j(x_0) = 0$$

and

$$\lambda_i\, g_i(x_0) = 0 \ \text{ for all } i \in I .$$
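The proposition can be checked on Example 1, where two inequality constraints are active at each of the two points considered. The following Python sketch solves the gradient condition for the two multipliers of the active constraints (the multiplier of the inactive one is zero by complementarity) and inspects their signs. The helper names `solve_2x2` and `kkt_multipliers` are illustrative assumptions, and the approach only covers the case of exactly two active constraints in $\mathbb{R}^2$ with linearly independent gradients.

```python
from fractions import Fraction


def solve_2x2(c1, c2, rhs):
    """Solve u*c1 + v*c2 = rhs for (u, v) by Cramer's rule; all arguments are 2-vectors."""
    det = Fraction(c1[0] * c2[1] - c1[1] * c2[0])
    if det == 0:
        raise ValueError("the active gradients are linearly dependent")
    u = (rhs[0] * c2[1] - rhs[1] * c2[0]) / det
    v = (c1[0] * rhs[1] - c1[1] * rhs[0]) / det
    return u, v


def kkt_multipliers(grad_f, grad_a, grad_b):
    """Multipliers (lam_a, lam_b) with grad_f + lam_a*grad_a + lam_b*grad_b = 0."""
    rhs = (-grad_f[0], -grad_f[1])
    return solve_2x2(grad_a, grad_b, rhs)


grad_f = (1, 1)                                   # f(x) = x1 + x2

# x0 = (0, 0): active constraints g2(x) = -x1 and g3(x) = -x2
lam_origin = kkt_multipliers(grad_f, (-1, 0), (0, -1))

# x0 = (1, 0): active constraints g1(x) = x1^2 + x2^2 - 1 and g3(x) = -x2
lam_corner = kkt_multipliers(grad_f, (2, 0), (0, -1))

origin_is_kkt = all(lam >= 0 for lam in lam_origin)
corner_is_kkt = all(lam >= 0 for lam in lam_corner)
```

At the origin both multipliers equal 1, so the KKT conditions hold and $C_\ell(x_0) \cap C_{dd}(x_0) = \emptyset$; at $(1, 0)^T$ the multiplier of $g_1$ is $-1/2$, which matches the observation that the two cones intersect there.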
So now the question arises whether not just $C_{fd}(x_0) \cap C_{dd}(x_0) = \emptyset$, but even $C_\ell(x_0) \cap C_{dd}(x_0) = \emptyset$ is true for any local minimizer $x_0 \in \mathcal{F}$. The following simple example gives a negative answer to this question:

Example 2 (Kuhn–Tucker (1951))
For $n = 2$ and $x = (x_1, x_2)^T \in \mathbb{R}^2 =: D$ let $f(x) := -x_1$, $g_1(x) := x_2 + (x_1 - 1)^3$, $g_2(x) := -x_1$ and $g_3(x) := -x_2$. For $x_0 := (1, 0)^T$, $m = 3$ and $p = 0$ we have: $\nabla f(x_0) = (-1, 0)^T$, $\nabla g_1(x_0) = (0, 1)^T$, $\nabla g_2(x_0) = (-1, 0)^T$ and $\nabla g_3(x_0) = (0, -1)^T$.

Since $A(x_0) = \{1, 3\}$, we get $C_\ell(x_0) = \{(d_1, d_2)^T \in \mathbb{R}^2 \mid d_2 = 0\}$, as well as $C_{dd}(x_0) = \{(d_1, d_2)^T \in \mathbb{R}^2 \mid d_1 > 0\}$; evidently, $C_\ell(x_0) \cap C_{dd}(x_0)$ is nonempty. However, the function $f$ has a minimum at $x_0$ subject to the given constraints.

[Figure: the feasible region in the (x1, x2)-plane with the point x0 = (1, 0)^T marked]

Lemma 2.2.2
For $x_0 \in \mathcal{F}$ it holds that: $C_\ell(x_0) \cap C_{dd}(x_0) = \emptyset \iff \nabla f(x_0) \in C_\ell(x_0)^*$

Proof: $C_\ell(x_0) \cap C_{dd}(x_0) = \emptyset \iff \forall\, d \in C_\ell(x_0)\ \ \langle \nabla f(x_0), d\rangle = f'(x_0)d \ge 0 \iff \nabla f(x_0) \in C_\ell(x_0)^*$
The cone Cfd (x0 ) of all feasible directions is too small to ensure general optimality conditions. Difficulties may occur due to the fact that the boundary of F is curved. Therefore, we have to consider a set which is less intuitive but bigger and with more suitable properties. To attain this goal, it is useful to state the concept of being tangent to a set more precisely:
Definition
A sequence $(x_k)$ converges in direction $d$ to $x_0$ $:\Longleftrightarrow\ x_k = x_0 + \alpha_k(d + r_k)$ with $\alpha_k \downarrow 0$ and $r_k \to 0$.

We will use the following notation: $x_k \xrightarrow{\,d\,} x_0$

$x_k \xrightarrow{\,d\,} x_0$ simply means: There exists a sequence of positive numbers $(\alpha_k)$ such that $\alpha_k \downarrow 0$ and

$$\frac{1}{\alpha_k}(x_k - x_0) \longrightarrow d \ \text{ for } k \to \infty .$$

Definition
Let $M$ be a nonempty subset of $\mathbb{R}^n$ and $x_0 \in M$. Then

$$C_t(M, x_0) := \{d \in \mathbb{R}^n \mid \exists\, (x_k) \in M^{\mathbb{N}}\ \ x_k \xrightarrow{\,d\,} x_0\}$$

is called the tangent cone of $M$ at $x_0$. The vectors of $C_t(M, x_0)$ are called tangents or tangent directions of $M$ at $x_0$. Of main interest is the special case $C_t(x_0) := C_t(\mathcal{F}, x_0)$.

Example 3
a) The following two figures illustrate the cone of tangents for

$$\mathcal{F} := \{x = (x_1, x_2)^T \in \mathbb{R}^2 \mid x_1 \ge 0,\ x_1^2 \ge x_2 \ge x_1^2 (x_1 - 1)\}$$

and the points $x_0 \in \{(0, 0)^T, (2, 4)^T, (1, 0)^T\}$. For convenience the origin is translated to $x_0$. The reader is invited to verify this:

[Figure: two panels showing the tangent cones, left for x0 = (0, 0)^T and x0 = (2, 4)^T, right for x0 = (1, 0)^T]

b) $\mathcal{F} := \{x \in \mathbb{R}^n \mid \|x\|_2 = 1\}$: $C_t(x_0) = \{d \in \mathbb{R}^n \mid \langle d, x_0\rangle = 0\}$

c) $\mathcal{F} := \{x \in \mathbb{R}^n \mid \|x\|_2 \le 1\}$: Then $C_t(x_0) = \mathbb{R}^n$ if $\|x_0\|_2 < 1$ holds, and $C_t(x_0) = \{d \in \mathbb{R}^n \mid \langle d, x_0\rangle \le 0\}$ if $\|x_0\|_2 = 1$.

These assertions have to be proven in exercise 10.

Lemma 2.2.3
1) $C_t(x_0)$ is a closed cone, $0 \in C_t(x_0)$.
2) $\overline{C_{fd}(x_0)} \subset C_t(x_0) \subset C_\ell(x_0)$

Proof: The proof of 1) is to be done in exercise 9.
2) First inclusion: As the tangent cone $C_t(x_0)$ is closed, it is sufficient to show the inclusion $C_{fd}(x_0) \subset C_t(x_0)$. For $d \in C_{fd}(x_0)$ and 'large' integers $k$ it holds that $x_0 + \frac{1}{k}d \in \mathcal{F}$. With $\alpha_k := \frac{1}{k}$ and $r_k := 0$ this shows $d \in C_t(x_0)$.
Second inclusion: Let $d \in C_t(x_0)$ and $(x_k) \in \mathcal{F}^{\mathbb{N}}$ be a sequence with $x_k = x_0 + \alpha_k(d + r_k)$, $\alpha_k \downarrow 0$ and $r_k \to 0$. For $i \in A(x_0)$

$$\underbrace{g_i(x_k)}_{\le 0} = \underbrace{g_i(x_0)}_{= 0} + \alpha_k\, g_i'(x_0)(d + r_k) + o(\alpha_k)$$

produces the inequality $g_i'(x_0)d \le 0$. In the same way we get $h_j'(x_0)d = 0$ for $j \in E$.
Now the question arises whether $C_t(x_0) = C_\ell(x_0)$ always holds. The following example gives a negative answer:

Example 4
a) Consider $\mathcal{F} := \{x \in \mathbb{R}^2 \mid -x_1^3 + x_2 \le 0,\ -x_2 \le 0\}$ and $x_0 := (0, 0)^T$. In this case $A(x_0) = \{1, 2\}$. This gives

$$C_\ell(x_0) = \{d \in \mathbb{R}^2 \mid d_2 = 0\} \quad\text{and}\quad C_t(x_0) = \{d \in \mathbb{R}^2 \mid d_1 \ge 0,\ d_2 = 0\}.$$

The last statement has to be shown in exercise 10.

b) Now let $\mathcal{F} := \{x \in \mathbb{R}^2 \mid -x_1^3 + x_2 \le 0,\ -x_1 \le 0,\ -x_2 \le 0\}$ and $x_0 := (0, 0)^T$. Then $A(x_0) = \{1, 2, 3\}$ and therefore

$$C_\ell(x_0) = \{d \in \mathbb{R}^2 \mid d_1 \ge 0,\ d_2 = 0\} = C_t(x_0).$$

Hence, the linearizing cone is dependent on the representation of the set of feasible points $\mathcal{F}$, which is the same in both cases!

[Figure: the feasible region in the (x1, x2)-plane with the point x0 at the origin]
Lemma 2.2.4
For a local minimizer $x_0$ of $(P)$ it holds that $\nabla f(x_0) \in C_t(x_0)^*$, hence $C_{dd}(x_0) \cap C_t(x_0) = \emptyset$.

Geometrically this condition states that for a local minimizer $x_0$ of $(P)$ the angle between the gradient and any tangent direction, especially any feasible direction, does not exceed $90^\circ$.

Proof: Let $d \in C_t(x_0)$. Then there exists a sequence $(x_k) \in \mathcal{F}^{\mathbb{N}}$ such that $x_k = x_0 + \alpha_k(d + r_k)$, $\alpha_k \downarrow 0$ and $r_k \to 0$. For all sufficiently large $k$,

$$0 \le f(x_k) - f(x_0) = \alpha_k\, f'(x_0)(d + r_k) + o(\alpha_k)$$

gives the result $f'(x_0)d \ge 0$.

The principal result in this section is the following:

Theorem 2.2.5 (Karush–Kuhn–Tucker)
Suppose that $x_0$ is a local minimizer of $(P)$, and the constraint qualification$^1$ $C_\ell(x_0)^* = C_t(x_0)^*$ is fulfilled. Then there exist vectors $\lambda \in \mathbb{R}^m_+$ and $\mu \in \mathbb{R}^p$ such that

$$\nabla f(x_0) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x_0) + \sum_{j=1}^{p} \mu_j \nabla h_j(x_0) = 0 \quad\text{and}\quad \lambda_i\, g_i(x_0) = 0 \ \text{ for } i = 1, \dots, m .$$

$^1$ Guignard (1969)

Proof: If $x_0$ is a local minimizer of $(P)$, it follows from lemma 2.2.4 with the help of the presupposed constraint qualification that $\nabla f(x_0) \in C_t(x_0)^* = C_\ell(x_0)^*$; lemma 2.2.2 yields $C_\ell(x_0) \cap C_{dd}(x_0) = \emptyset$ and the latter together with proposition 2.2.1 gives the result.

In the presence of the presupposed constraint qualification $C_t(x_0)^* = C_\ell(x_0)^*$ the condition $\nabla f(x_0) \in C_t(x_0)^*$ of lemma 2.2.4 transforms to $\nabla f(x_0) \in C_\ell(x_0)^*$. This claim can be confirmed with the aid of a simple linear optimization problem:

Example 5 (Kleinmichel (1975))
For $x = (x_1, x_2)^T \in \mathbb{R}^2$ we consider the problem

$$\begin{cases} f(x) := x_1 + x_2 \longrightarrow \min \\ -x_1^3 + x_2 \le 1 \\ x_1 \le 1,\ \ -x_2 \le 0 \end{cases}$$
and ask whether the feasible points $x_0 := (-1, 0)^T$ and $\tilde{x}_0 := (0, 1)^T$ are local minimizers.

[Figure: the feasible region in the (x1, x2)-plane with the points (−1, 0)^T and (0, 1)^T marked]

(The examination of the picture shows immediately that this is not the case for $\tilde{x}_0$, and that the objective function $f$ attains a (strict, global) minimum at $x_0$. But we try to forget this for a while.) We have $A(x_0) = \{1, 3\}$. In order to show that $\nabla f(x_0) \in C_\ell(x_0)^*$, hence, $f'(x_0)d \ge 0$ for all $d \in C_\ell(x_0)$, we compute $\min_{d \in C_\ell(x_0)} f'(x_0)d$. So we have the following linear problem:

$$\begin{cases} d_1 + d_2 \longrightarrow \min \\ -3\,d_1 + d_2 \le 0 \\ -d_2 \le 0 \end{cases}$$

Evidently it has the minimal value 0; lemma 2.2.2 gives that $C_\ell(x_0) \cap C_{dd}(x_0)$ is empty. Following proposition 2.2.1 there exist $\lambda_1, \lambda_3 \ge 0$ for $x_0$ satisfying

$$\begin{pmatrix} 1 \\ 1 \end{pmatrix} + \lambda_1 \begin{pmatrix} -3 \\ 1 \end{pmatrix} + \lambda_3 \begin{pmatrix} 0 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

The above yields $\lambda_1 = \frac{1}{3}$, $\lambda_3 = \frac{4}{3}$.

For $\tilde{x}_0$ we have $A(\tilde{x}_0) = \{1\}$. In the same way as the above this leads to the subproblem

$$\begin{cases} d_1 + d_2 \longrightarrow \min \\ d_2 \le 0 \end{cases}$$

whose objective function is unbounded; therefore $C_\ell(\tilde{x}_0) \cap C_{dd}(\tilde{x}_0) \ne \emptyset$. So $\tilde{x}_0$ is not a local minimizer, but the point $x_0$ remains as a candidate.
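The two linear subproblems of Example 5 can also be probed by brute force. The Python sketch below samples unit directions $d$, keeps those lying in the linearizing cone and reports the smallest slope $f'(x_0)d$ found. This is a coarse numerical check under the assumption that a fine sampling of the unit circle suffices, not a proof; the function name `min_slope_on_cone` and the tolerance are illustrative choices.

```python
import math


def min_slope_on_cone(grad_f, active_grads, samples=3600):
    """Smallest value of f'(x0) d over sampled unit directions d in the linearizing cone.

    The cone is {d | g . d <= 0 for every active gradient g}; a small tolerance
    keeps boundary directions despite rounding errors.
    """
    best = None
    for k in range(samples):
        angle = 2.0 * math.pi * k / samples
        d = (math.cos(angle), math.sin(angle))
        if all(g[0] * d[0] + g[1] * d[1] <= 1e-12 for g in active_grads):
            slope = grad_f[0] * d[0] + grad_f[1] * d[1]
            best = slope if best is None else min(best, slope)
    return best


grad_f = (1.0, 1.0)

# x0 = (-1, 0): active gradients of g1 and g3
slope_x0 = min_slope_on_cone(grad_f, [(-3.0, 1.0), (0.0, -1.0)])

# x0~ = (0, 1): only g1 is active, with gradient (0, 1)
slope_x0_tilde = min_slope_on_cone(grad_f, [(0.0, 1.0)])
```

A positive smallest slope on unit directions means the linear problem over the whole cone has the minimal value 0 (attained at $d = 0$); a negative slope means the objective is unbounded from below, because every positive multiple of such a $d$ stays in the cone.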
Convex Functions

Convexity plays a central role in optimization. We already had some simple results from Convex Analysis in section 2.1. Convex optimization problems (the functions $f$ and $g_i$ are supposed to be convex and the functions $h_j$ affinely linear) are by far easier to solve than general nonlinear problems. These assumptions ensure that the problems are well-behaved. They have two significant properties:
• A local minimizer is always a global one.
• The KKT conditions are sufficient for optimality.
A special feature of strictly convex functions is that they have at most one minimal point. But convex functions also play an important role in problems that are not convex. Therefore a simple and short treatment of convex functions is given here:
Definition
Let $D \subset \mathbb{R}^n$ be nonempty and convex. A real-valued function $f$ defined on at least $D$ is called convex on $D$ if and only if

$$f\big((1 - \tau)x + \tau y\big) \le (1 - \tau) f(x) + \tau f(y)$$

holds for all $x, y \in D$ and $\tau \in (0, 1)$. $f$ is called strictly convex on $D$ if and only if

$$f\big((1 - \tau)x + \tau y\big) < (1 - \tau) f(x) + \tau f(y)$$

for all $x, y \in D$ with $x \ne y$ and $\tau \in (0, 1)$. The addition "on $D$" will be omitted if $D$ is the domain of definition. We say $f$ is concave (on $D$) iff $-f$ is convex, and strictly concave (on $D$) iff $-f$ is strictly convex. For a concave function the line segment joining two points on the graph is never above the graph.

Let $D \subset \mathbb{R}^n$ be nonempty and convex and $f : D \longrightarrow \mathbb{R}$ a convex function.

Properties
1) If $f$ attains a local minimum at a point $x^* \in D$, then $f(x^*)$ is the global minimum.
2) $f$ is continuous in the interior $D^\circ$ of $D$.
3) The function $\varphi$ defined by $\varphi(\tau) := \dfrac{f(x + \tau h) - f(x)}{\tau}$ for $x \in D^\circ$, $h \in \mathbb{R}^n$ and sufficiently small, positive $\tau$ is isotone, that is, order-preserving.
4) For $D$ open and a differentiable $f$ it holds that $f(y) - f(x) \ge f'(x)(y - x)$ for all $x, y \in D$.

With the function $f$ defined by $f(x) := 0$ for $x \in [0, 1)$ and $f(1) := 1$ we can see that assertion 2) cannot be extended to the whole of $D$.

Proof: 1) If there existed an $x \in D$ such that $f(x) < f(x^*)$, then we would have

$$f\big((1 - \tau)x^* + \tau x\big) \le (1 - \tau) f(x^*) + \tau f(x) < f(x^*)$$

for $0 < \tau \le 1$ and consequently a contradiction to the fact that $f$ attains a local minimum at $x^*$.

2) For $x_0 \in D^\circ$ consider the function $\psi$ defined by $\psi(h) := f(x_0 + h) - f(x_0)$ for $h \in \mathbb{R}^n$ with a sufficiently small norm $\|h\|_\infty$: It is clear that the function $\psi$ is convex. Let $\varrho > 0$ such that for $K := \{h \in \mathbb{R}^n \mid \|h\|_\infty \le \varrho\}$ it holds that $x_0 + K \subset D$. Evidently, there exist $m \in \mathbb{N}$ and $a_1, \dots, a_m \in \mathbb{R}^n$ with $K = \operatorname{conv}(a_1, \dots, a_m)$ (convex hull). Every $h \in K$ may be represented as $h = \sum_{\mu=1}^{m} \gamma_\mu a_\mu$ with $\gamma_\mu \ge 0$ satisfying $\sum_{\mu=1}^{m} \gamma_\mu = 1$. With

$$\alpha := \begin{cases} \max\{|\psi(a_\mu)| \mid \mu = 1, \dots, m\}, & \text{if positive} \\ 1, & \text{otherwise} \end{cases}$$

we have $\psi(h) \le \sum_{\mu=1}^{m} \gamma_\mu \psi(a_\mu) \le \alpha$. Now let $\varepsilon \in (0, \alpha]$. Then firstly for all $h \in \mathbb{R}^n$ with $\|h\|_\infty \le \varepsilon\varrho/\alpha$ we have

$$\psi(h) = \psi\Big(\big(1 - \tfrac{\varepsilon}{\alpha}\big)\,0 + \tfrac{\varepsilon}{\alpha}\,\tfrac{\alpha}{\varepsilon}\,h\Big) \le \tfrac{\varepsilon}{\alpha}\,\psi\big(\tfrac{\alpha}{\varepsilon}\,h\big) \le \varepsilon$$

and therefore with

$$0 = \psi(0) = \psi\big(\tfrac{1}{2}h - \tfrac{1}{2}h\big) \le \tfrac{1}{2}\psi(h) + \tfrac{1}{2}\psi(-h)$$

$\psi(h) \ge -\psi(-h) \ge -\varepsilon$, hence, all together $|\psi(h)| \le \varepsilon$.

3) Since $f$ is convex, we have

$$f(x + \tau_0 h) = f\Big(\big(1 - \tfrac{\tau_0}{\tau_1}\big)x + \tfrac{\tau_0}{\tau_1}(x + \tau_1 h)\Big) \le \big(1 - \tfrac{\tau_0}{\tau_1}\big) f(x) + \tfrac{\tau_0}{\tau_1} f(x + \tau_1 h)$$

for $0 < \tau_0 < \tau_1$. Transformation leads to

$$\frac{f(x + \tau_0 h) - f(x)}{\tau_0} \le \frac{f(x + \tau_1 h) - f(x)}{\tau_1} .$$

4) This follows directly from 3) (with $h = y - x$):

$$f'(x)h = \lim_{\tau \to 0+} \frac{f(x + \tau h) - f(x)}{\tau} \le \frac{f(x + h) - f(x)}{1}$$
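Properties 3) and 4) are easy to observe numerically. The following Python sketch uses the convex function $f(x) = e^x$ on $\mathbb{R}$ as an assumed example (it does not appear in the text) and checks that the difference quotient $\varphi$ grows with $\tau$ and that the gradient inequality holds on a few sample points.

```python
import math


def f(x):
    return math.exp(x)          # a convex function on R (illustrative choice)


def df(x):
    return math.exp(x)          # its derivative


def difference_quotient(x, h, tau):
    """phi(tau) = (f(x + tau*h) - f(x)) / tau from property 3)."""
    return (f(x + tau * h) - f(x)) / tau


x, h = 0.5, -2.0
taus = [0.01, 0.1, 0.5, 1.0, 2.0]
quotients = [difference_quotient(x, h, t) for t in taus]
is_isotone = all(a <= b for a, b in zip(quotients, quotients[1:]))

points = [-3.0, -1.0, 0.0, 0.5, 2.0]
# Property 4): f(y) - f(x0) >= f'(x0) (y - x0); the tolerance absorbs rounding errors
gradient_inequality = all(
    f(y) - f(x0) >= df(x0) * (y - x0) - 1e-12 for x0 in points for y in points
)
```

Both flags come out true for this example; a failing check on some other function would indicate that it is not convex on the sampled range.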
Constraint Qualifications

The condition $C_\ell(x_0)^* = C_t(x_0)^*$ is very abstract, extremely general, but not easily verifiable. Therefore, for practical problems, we will try to find regularity assumptions called constraint qualifications (CQ) which are more specific, easily verifiable, but also somewhat restrictive.
For the moment we will consider the case that we only have inequality constraints. Hence, $E = \emptyset$ and $I = \{1, \dots, m\}$ with an $m \in \mathbb{N}_0$. Linear constraints pose fewer problems than nonlinear constraints. Therefore, we will assume the partition $I = I_1 \uplus I_2$. If and only if $i \in I_2$ let $g_i(x) = a_i^T x - b_i$ with a suitable vector $a_i$ and a suitable number $b_i$, that is, $g_i$ is 'linear', more precisely affinely linear. Corresponding to this partition, we will also split up the set of active constraints $A(x_0)$ for $x_0 \in \mathcal{F}$ into

$$A_j(x_0) := I_j \cap A(x_0) \ \text{ for } j = 1, 2 .$$

We will now focus on the following Constraint Qualifications:

(GCQ) Guignard Constraint Qualification: $C_\ell(x_0)^* = C_t(x_0)^*$

(ACQ) Abadie Constraint Qualification: $C_\ell(x_0) = C_t(x_0)$

(MFCQ) Mangasarian–Fromovitz Constraint Qualification:
$$\exists\, d \in \mathbb{R}^n \quad \begin{cases} g_i'(x_0)d < 0 & \text{for } i \in A_1(x_0) \\ g_i'(x_0)d \le 0 & \text{for } i \in A_2(x_0) \end{cases}$$

(SCQ) Slater Constraint Qualification: The functions $g_i$ are convex for all $i \in I$ and $\exists\, \tilde{x} \in \mathcal{F}\quad g_i(\tilde{x}) < 0$ for $i \in I_1$.

The conditions $g_i'(x_0)d < 0$ and $g_i'(x_0)d \le 0$ each define half spaces. (MFCQ) means nothing else but that the intersection of all of these half spaces is nonempty.

We will prove (SCQ) $\Longrightarrow$ (MFCQ) $\Longrightarrow$ (ACQ). The constraint qualification (GCQ) introduced in theorem 2.2.5 is a trivial consequence of (ACQ).

Proof: (SCQ) $\Longrightarrow$ (MFCQ): From the properties of convex and affinely linear functions and the definition of $A(x_0)$ we get:

$$g_i'(x_0)(\tilde{x} - x_0) \le g_i(\tilde{x}) - g_i(x_0) = g_i(\tilde{x}) < 0 \ \text{ for } i \in A_1(x_0)$$
$$g_i'(x_0)(\tilde{x} - x_0) = g_i(\tilde{x}) - g_i(x_0) = g_i(\tilde{x}) \le 0 \ \text{ for } i \in A_2(x_0).$$

(MFCQ) $\Longrightarrow$ (ACQ): Lemma 2.2.3 gives that $C_t(x_0) \subset C_\ell(x_0)$ and $0 \in C_t(x_0)$ always hold. Therefore it remains to prove that $C_\ell(x_0) \setminus \{0\} \subset C_t(x_0)$. So let $d_0 \in C_\ell(x_0) \setminus \{0\}$. Take $d$ as stated in (MFCQ). Then for a sufficiently small $\lambda > 0$ we have $d_0 + \lambda d \ne 0$. Since $d_0$ is in $C_\ell(x_0)$, it follows that $g_i'(x_0)(d_0 + \lambda d) < 0$ for $i \in A_1(x_0)$ and $g_i'(x_0)(d_0 + \lambda d) \le 0$ for $i \in A_2(x_0)$. For the moment take a fixed $\lambda$. Setting

$$u := \frac{d_0 + \lambda d}{\|d_0 + \lambda d\|_2}$$

produces

$$g_i(x_0 + tu) = \underbrace{g_i(x_0)}_{= 0} + t\,\underbrace{g_i'(x_0)u}_{< 0} + o(t) \ \text{ for } i \in A_1(x_0) \text{ and}$$
$$g_i(x_0 + tu) = \underbrace{g_i(x_0)}_{= 0} + t\,\underbrace{g_i'(x_0)u}_{\le 0} \ \text{ for } i \in A_2(x_0).$$
Thus, we have $g_i(x_0 + tu) \le 0$ for $i \in A(x_0)$ and $t > 0$ sufficiently small. For the indices $i \in I \setminus A(x_0)$ this is obviously true. Hence, there exists a $t_0 > 0$ such that $x_0 + tu \in \mathcal{F}$ for $0 \le t \le t_0$. For the sequence $(x_k)$ defined by $x_k := x_0 + \frac{t_0}{k}u$ it holds that $x_k \xrightarrow{\,u\,} x_0$. Therefore, $u \in C_t(x_0)$ and consequently $d_0 + \lambda d \in C_t(x_0)$. Passing to the limit as $\lambda \to 0$ yields $d_0 \in \overline{C_t(x_0)}$. Lemma 2.2.3 or respectively exercise 9 gives that $C_t(x_0)$ is closed. Hence, $d_0 \in C_t(x_0)$.

Now we will consider the general case, where there may also occur equality constraints. In this context one often finds the following linear independence constraint qualification in the literature:

(LICQ) The vectors $\big(\nabla g_i(x_0) \mid i \in A(x_0)\big)$ and $\big(\nabla h_j(x_0) \mid j \in E\big)$ are linearly independent.

(LICQ) greatly reduces the number of active inequality constraints. Instead of (LICQ) we will now consider the following weaker constraint qualification which is a variant of (MFCQ), and is often cited as the Arrow–Hurwitz–Uzawa constraint qualification:

(AHUCQ) There exists a $d \in \mathbb{R}^n$ such that
$$\begin{cases} g_i'(x_0)d < 0 & \text{for } i \in A(x_0), \\ h_j'(x_0)d = 0 & \text{for } j \in E, \end{cases}$$
and the vectors $\big(\nabla h_j(x_0) \mid j \in E\big)$ are linearly independent.

We will show: (LICQ) $\Longrightarrow$ (AHUCQ) $\Longrightarrow$ (ACQ)

Proof: (LICQ) $\Longrightarrow$ (AHUCQ): (AHUCQ) follows, for example, directly from the solvability of the system of linear equations

$$g_i'(x_0)d = -1 \ \text{ for } i \in A(x_0), \qquad h_j'(x_0)d = 0 \ \text{ for } j \in E.$$

(AHUCQ) $\Longrightarrow$ (ACQ): Lemma 2.2.3 gives that again we only have to show $d_0 \in C_t(x_0)$ for all $d_0 \in C_\ell(x_0) \setminus \{0\}$. Take $d$ as stated in (AHUCQ). Then we have $d_0 + \lambda d =: w \ne 0$ for a sufficiently small $\lambda > 0$ and thus $g_i'(x_0)w < 0$ for $i \in A(x_0)$ and $h_j'(x_0)w = 0$ for $j \in E$. Denote

$$A := \big(\nabla h_1(x_0), \dots, \nabla h_p(x_0)\big) \in \mathbb{R}^{n \times p}.$$

For that $A^T A$ is regular because $\operatorname{rank}(A) = p$. Now consider the following system of equations dependent on $u \in \mathbb{R}^p$ and $t \in \mathbb{R}$:

$$\varphi_j(u, t) := h_j(x_0 + Au + tw) = 0 \qquad (j = 1, \dots, p)$$

For the corresponding vector-valued function $\varphi$ we have $\varphi(0, 0) = 0$, and because of

$$\frac{\partial \varphi_j}{\partial u_i}(u, t) = h_j'(x_0 + Au + tw)\,\nabla h_i(x_0),$$

so that the Jacobian with respect to $u$ at $(0, 0)$ is the regular matrix $A^T A$, we are able to solve $\varphi(u, t) = 0$ locally for $u$, that is, there exist a null-neighborhood $U_0 \subset \mathbb{R}$ and a continuously differentiable function $u : U_0 \longrightarrow \mathbb{R}^p$ satisfying

$$u(0) = 0, \qquad h_j\big(\underbrace{x_0 + Au(t) + tw}_{=:\ x(t)}\big) = 0 \ \text{ for } t \in U_0 \qquad (j = 1, \dots, p).$$

Differentiation with respect to $t$ at $t = 0$ leads to

$$h_j'(x_0)\big(Au'(0) + w\big) = 0 \qquad (j = 1, \dots, p)$$

and consequently, considering that $h_j'(x_0)w = 0$ and $A^T A$ is regular, to $u'(0) = 0$. Then for $i \in A(x_0)$ it holds that

$$g_i(x(t)) = g_i(x_0) + t\, g_i'(x_0)x'(0) + o(t) = t\, g_i'(x_0)\big(Au'(0) + w\big) + o(t).$$

With $u'(0) = 0$ we obtain

$$g_i(x(t)) = t\,\Big(g_i'(x_0)w + \frac{o(t)}{t}\Big)$$

and the latter is negative for $t > 0$ sufficiently small. Hence, there exists a $t_1 > 0$ with $x(t) \in \mathcal{F}$ for $0 \le t \le t_1$. From

$$x\big(\tfrac{t_1}{k}\big) = x_0 + \frac{t_1}{k}\Big(w + A\,\underbrace{\frac{u(t_1/k)}{t_1/k}}_{\longrightarrow\, 0 \ (k \to \infty)}\Big)$$

for $k \in \mathbb{N}$ we get $x\big(\tfrac{t_1}{k}\big) \xrightarrow{\,w\,} x_0$; this yields $w = d_0 + \lambda d \in C_t(x_0)$ and also, by passing to the limit as $\lambda \to 0$, $d_0 \in \overline{C_t(x_0)} = C_t(x_0)$.
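For the two-dimensional examples of this section, (LICQ) can be tested with a single determinant. The Python sketch below assumes the special situation of no equality constraints and exactly two active inequality constraints in $\mathbb{R}^2$; the function name `linearly_independent_2d` is illustrative.

```python
def linearly_independent_2d(g1, g2):
    """Two vectors in R^2 are linearly independent iff their determinant is nonzero."""
    return g1[0] * g2[1] - g1[1] * g2[0] != 0


# Example 1 b): x0 = (1, 0), active gradients of g1 and g3
licq_example_1b = linearly_independent_2d((2, 0), (0, -1))

# Example 2 (Kuhn-Tucker): x0 = (1, 0), active gradients of g1 and g3
licq_example_2 = linearly_independent_2d((0, 1), (0, -1))

# Example 4 a): x0 = (0, 0), active gradients of g1 and g2
licq_example_4a = linearly_independent_2d((0, 1), (0, -1))
```

(LICQ) holds in Example 1 b) but fails in Examples 2 and 4 a), where the two active gradients point in opposite directions. In these two cases no $d$ can satisfy both strict inequalities of (AHUCQ) either, which fits the observations made there: $C_t(x_0) \ne C_\ell(x_0)$ in Example 4 a), and a minimizer with $C_\ell(x_0) \cap C_{dd}(x_0) \ne \emptyset$ in Example 2. A failing (LICQ) alone would not imply this.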
Convex Optimization Problems

Firstly suppose that $C \subset \mathbb{R}^n$ is nonempty and the functions $f, g_i : C \longrightarrow \mathbb{R}$ are arbitrary for $i \in I$. We consider the general optimization problem

$$(P)\qquad \begin{cases} f(x) \longrightarrow \min \\ g_i(x) \le 0 \ \text{ for } i \in I := \{1, \dots, m\}. \end{cases}$$

In the following section the Lagrangian $L$ to $(P)$ defined by

$$L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) = f(x) + \langle \lambda, g(x)\rangle \ \text{ for } x \in C \text{ and } \lambda \in \mathbb{R}^m_+$$

will play an important role. As usual we have combined the $m$ functions $g_i$ to a vector-valued function $g$.

Definition
A pair $(x^*, \lambda^*) \in C \times \mathbb{R}^m_+$ is called a saddlepoint of $L$ if and only if

$$L(x^*, \lambda) \le L(x^*, \lambda^*) \le L(x, \lambda^*)$$

holds for all $x \in C$ and $\lambda \in \mathbb{R}^m_+$, that is, $x^*$ minimizes $L(\,\cdot\,, \lambda^*)$ and $\lambda^*$ maximizes $L(x^*, \,\cdot\,)$.
Lemma 2.2.6
If $(x^*, \lambda^*)$ is a saddlepoint of $L$, then it holds that:
• $x^*$ is a global minimizer of $(P)$.
• $L(x^*, \lambda^*) = f(x^*)$
• $\lambda_i^*\, g_i(x^*) = 0$ for all $i \in I$.

Proof: Let $x \in C$ and $\lambda \in \mathbb{R}^m_+$. From

$$0 \ge L(x^*, \lambda) - L(x^*, \lambda^*) = \langle \lambda - \lambda^*, g(x^*)\rangle \tag{4}$$

we obtain for $\lambda := 0$

$$\langle \lambda^*, g(x^*)\rangle \ge 0 . \tag{5}$$

With $\lambda := \lambda^* + e_i$ we get, also from (4), $g_i(x^*) \le 0$ for all $i \in I$, that is,

$$g(x^*) \le 0 . \tag{6}$$

Because of (6), it holds that $\langle \lambda^*, g(x^*)\rangle \le 0$. Together with (5) this produces $\langle \lambda^*, g(x^*)\rangle = 0$ and hence, $\lambda_i^*\, g_i(x^*) = 0$ for all $i \in I$. For $x \in \mathcal{F}$ it follows that

$$f(x^*) = L(x^*, \lambda^*) \le L(x, \lambda^*) = f(x) + \underbrace{\langle \lambda^*, g(x)\rangle}_{\le 0} \le f(x) .$$

Therefore $x^*$ is a global minimizer of $(P)$.

We assume now that $C$ is open and convex and the functions $f, g_i : C \longrightarrow \mathbb{R}$ are continuously differentiable and convex for $i \in I$. In this case we write more precisely $(CP)$ instead of $(P)$.

Theorem 2.2.7
If the Slater constraint qualification holds and $x^*$ is a minimizer of $(CP)$, then there exists a vector $\lambda^* \in \mathbb{R}^m_+$ such that $(x^*, \lambda^*)$ is a saddlepoint of $L$.

Proof: Taking into account our observations from page 55, theorem 2.2.5 gives that there exists a $\lambda^* \in \mathbb{R}^m_+$ such that

$$0 = L_x(x^*, \lambda^*) \quad\text{and}\quad \langle \lambda^*, g(x^*)\rangle = 0 .$$

With that we get for $x \in C$ $^1$

$$L(x, \lambda^*) - L(x^*, \lambda^*) \ge L_x(x^*, \lambda^*)(x - x^*) = 0$$

and

$$L(x^*, \lambda^*) - L(x^*, \lambda) = -\langle \underbrace{\lambda}_{\ge 0}, \underbrace{g(x^*)}_{\le 0}\rangle \ge 0 .$$

Hence, $(x^*, \lambda^*)$ is a saddlepoint of $L$.
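The saddlepoint inequalities can be observed on a small convex problem for which the Slater constraint qualification holds. The Python sketch below assumes the one-dimensional problem $f(x) = x^2 \to \min$ subject to $g(x) = 1 - x \le 0$ (an illustrative choice, not taken from the text) with minimizer $x^* = 1$ and multiplier $\lambda^* = 2$, and tests both inequalities on sample grids.

```python
def f(x):
    return x * x                 # convex objective (illustrative choice)


def g(x):
    return 1.0 - x               # convex constraint g(x) <= 0, i.e. x >= 1


def lagrangian(x, lam):
    return f(x) + lam * g(x)


# Minimizer and multiplier: 2*x_star - lam_star = 0 and lam_star * g(x_star) = 0
x_star, lam_star = 1.0, 2.0

xs = [k / 10.0 for k in range(-30, 31)]      # sample points x in C = R
lams = [k / 10.0 for k in range(0, 51)]      # sample multipliers lambda >= 0

left_ok = all(lagrangian(x_star, lam) <= lagrangian(x_star, lam_star) for lam in lams)
right_ok = all(lagrangian(x_star, lam_star) <= lagrangian(x, lam_star) for x in xs)
```

Here $L(x, \lambda^*) = (x - 1)^2 + 1$ and $L(x^*, \lambda) = 1$ for every $\lambda$, so both inequalities hold on the grids; the point $\tilde{x} = 2$ with $g(\tilde{x}) = -1 < 0$ shows that the Slater condition is satisfied.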
$^1$ By the convexity of $f$ and $g_i$ the function $L(\,\cdot\,, \lambda^*)$ is convex.

The following example shows that the Slater constraint qualification is essential in this theorem:

Example 6
With $n = 1$ and $m = 1$ we regard the convex problem

$$(P)\qquad \begin{cases} f(x) := -x \longrightarrow \min \\ g(x) := x^2 \le 0 . \end{cases}$$

The only feasible point is $x^* = 0$ with value $f(0) = 0$. So 0 minimizes $f(x)$ subject to $g(x) \le 0$. $L(x, \lambda) := -x + \lambda x^2$ for $\lambda \ge 0$, $x \in \mathbb{R}$. There is no $\lambda^* \in [0, \infty)$ such that $(x^*, \lambda^*)$ is a saddlepoint of $L$: For $\lambda^* = 0$ the function $L(\,\cdot\,, 0) = -x$ is unbounded from below, and for $\lambda^* > 0$ we have $L\big(\tfrac{1}{2\lambda^*}, \lambda^*\big) = -\tfrac{1}{4\lambda^*} < 0 = L(0, \lambda^*)$.

The following important observation shows that neither constraint qualifications nor second-order optimality conditions, which we will deal with in the next section, are needed for a sufficient condition for general convex optimization problems: Suppose that $f, g_i, h_j : \mathbb{R}^n \longrightarrow \mathbb{R}$ are continuously differentiable functions with $f$ and $g_i$ convex and $h_j$ (affinely) linear ($i \in I$, $j \in E$), and consider the following convex optimization problem$^2$

$$(CP)\qquad \begin{cases} f(x) \longrightarrow \min \\ g_i(x) \le 0 \ \text{ for } i \in I \\ h_j(x) = 0 \ \text{ for } j \in E . \end{cases}$$
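We will show that for this special kind of problem every KKT point already gives a (global) minimum:

Theorem 2.2.8
Suppose $x_0 \in \mathcal{F}$ and there exist vectors $\lambda \in \mathbb{R}^m_+$ and $\mu \in \mathbb{R}^p$ such that

$$\nabla f(x_0) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x_0) + \sum_{j=1}^{p} \mu_j \nabla h_j(x_0) = 0 \quad\text{and}\quad \lambda_i\, g_i(x_0) = 0 \ \text{ for } i = 1, \dots, m,$$

then $(CP)$ attains its global minimum at $x_0$.

The proof of this theorem is surprisingly simple: Taking into account 4) on page 53, we get for $x \in \mathcal{F}$:

$$\begin{aligned}
f(x) - f(x_0) &\ \overset{f \text{ convex}}{\ge}\ f'(x_0)(x - x_0) \\
&\ =\ -\sum_{i=1}^{m} \lambda_i\, g_i'(x_0)(x - x_0) - \sum_{j=1}^{p} \mu_j \underbrace{h_j'(x_0)(x - x_0)}_{=\, h_j(x) - h_j(x_0)\, =\, 0} \\
&\ \overset{g_i \text{ convex}}{\ge}\ -\sum_{i=1}^{m} \lambda_i \big(g_i(x) - g_i(x_0)\big) = -\sum_{i=1}^{m} \lambda_i\, g_i(x) \ \ge\ 0
\end{aligned}$$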
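The statement of the theorem can be illustrated on a small convex problem. The Python sketch below assumes the problem $f(x) = x_1^2 + x_2^2 \to \min$ subject to $g(x) = 1 - x_1 - x_2 \le 0$ (chosen for illustration, not taken from the text), verifies the KKT conditions at $x_0 = (\tfrac12, \tfrac12)^T$ with $\lambda = 1$ and compares $f(x_0)$ with the values of $f$ at feasible grid points.

```python
def f(x):
    return x[0] ** 2 + x[1] ** 2          # convex objective (illustrative choice)


def g(x):
    return 1.0 - x[0] - x[1]              # affinely linear constraint g(x) <= 0


x0, lam = (0.5, 0.5), 1.0
grad_f = (2.0 * x0[0], 2.0 * x0[1])
grad_g = (-1.0, -1.0)

# Gradient condition and complementarity of the KKT conditions at x0
stationarity = tuple(df + lam * dg for df, dg in zip(grad_f, grad_g))
complementarity = lam * g(x0)

# Compare f(x0) with f on all feasible points of a coarse grid
grid = [(i / 4.0, j / 4.0) for i in range(-8, 9) for j in range(-8, 9)]
feasible = [x for x in grid if g(x) <= 0.0]
is_global_on_grid = all(f(x) >= f(x0) for x in feasible)
```

The grid comparison is of course only a sanity check; the theorem guarantees $f(x) \ge f(x_0)$ for every feasible $x$.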
$^2$ Since the functions $h_j$ are assumed to be (affinely) linear, exercise 6 gives that this problem can be written in the form from page 58 by substituting the two inequalities $h_j(x) \le 0$ and $-h_j(x) \le 0$ for every equation $h_j(x) = 0$.

The following example shows that even if we have convex problems the KKT conditions are not necessary for minimal points:

Example 7
With $n = 2$, $m = 2$ and $x = (x_1, x_2)^T \in D := \mathbb{R}^2$ we consider:

$$(P)\qquad \begin{cases} f(x) := x_1 \longrightarrow \min \\ g_1(x) := x_1^2 + (x_2 - 1)^2 - 1 \le 0 \\ g_2(x) := x_1^2 + (x_2 + 1)^2 - 1 \le 0 \end{cases}$$

Obviously, only the point $x_0 := (0, 0)^T$ is feasible. Hence, $x_0$ is the (global) minimal point. Since $\nabla f(x_0) = (1, 0)^T$, $\nabla g_1(x_0) = (0, -2)^T$ and $\nabla g_2(x_0) = (0, 2)^T$, the gradient condition of the KKT conditions is not met. $f$ is linear, the functions $g_\nu$ are convex. Evidently, however, the Slater condition is not fulfilled.

[Figure: the circles g1(x) = 0 and g2(x) = 0 touching at x0, with the gradients ∇f(x0), ∇g1(x0) and ∇g2(x0)]

Of course, one could also argue from proposition 2.2.1: The cones

$$C_{dd}(x_0) = \{d \in \mathbb{R}^2 \mid f'(x_0)d < 0\} = \{d \in \mathbb{R}^2 \mid d_1 < 0\}$$

and

$$C_\ell(x_0) = \{d \in \mathbb{R}^2 \mid \forall\, i \in A(x_0)\ \ g_i'(x_0)d \le 0\} = \{d \in \mathbb{R}^2 \mid d_2 = 0\}$$

are clearly not disjoint.

2.3 Local Second-Order Optimality Conditions

To get a finer characterization, it is natural to examine the effects of second-order terms near a given point too. The following second-order results take the 'curvature' of the feasible region in a neighborhood of a 'candidate' for a minimizer into account. The necessary second-order condition $s^T H s \ge 0$ and the sufficient second-order condition $s^T H s > 0$ for the Hessian $H$ of the Lagrangian with respect to $x$ regard only certain subsets of vectors $s$.

Suppose that the functions $f$, $g_i$ and $h_j$ are twice continuously differentiable.
62
Optimality Conditions
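Theorem 2.3.1 (Necessary second-order condition)
Suppose $x_0 \in \mathcal{F}$ and there exist $\lambda \in \mathbb{R}^m_+$ and $\mu \in \mathbb{R}^p$ such that

$$\nabla f(x_0) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x_0) + \sum_{j=1}^{p} \mu_j \nabla h_j(x_0) = 0 \quad\text{and}\quad \lambda_i\, g_i(x_0) = 0 \ \text{ for all } i \in I.$$

If $(P)$ has a local minimum at $x_0$, then

$$s^T \Big( \nabla^2 f(x_0) + \sum_{i=1}^{m} \lambda_i \nabla^2 g_i(x_0) + \sum_{j=1}^{p} \mu_j \nabla^2 h_j(x_0) \Big) s \ge 0$$

holds for all $s \in C_{t+}(x_0)$, where

$$\mathcal{F}_+ := \mathcal{F}_+(x_0) := \{x \in \mathcal{F} \mid g_i(x) = 0 \text{ for all } i \in A_+(x_0)\} \quad\text{with}\quad A_+(x_0) := \{i \in A(x_0) \mid \lambda_i > 0\}$$

and

$$C_{t+}(x_0) := C_t(\mathcal{F}_+, x_0) = \{d \in \mathbb{R}^n \mid \exists\, (x_k) \in \mathcal{F}_+^{\mathbb{N}} \ \ x_k \xrightarrow{\,d\,} x_0\}.$$

With the help of the Lagrangian $L$ the gradient condition and the second-order condition can be written more clearly as $\nabla_x L(x_0, \lambda, \mu) = 0$, respectively

$$s^T \nabla^2_{xx} L(x_0, \lambda, \mu)\, s \ge 0.$$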
Proof: It holds that $\lambda_i g_i(x) = 0$ for all $x \in \mathcal{F}_+$ because we have $\lambda_i = 0$ for $i \in I \setminus A_+(x_0)$ and $g_i(x) = 0$ for $i \in A_+(x_0)$, respectively. With the function $\varphi$ defined by

$$\varphi(x) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \mu_j h_j(x) = L(x, \lambda, \mu)$$

for $x \in D$ this leads to the following relation: $\varphi(x) = f(x)$ for $x \in \mathcal{F}_+$. $x_0$ gives a local minimum of $f$ on $\mathcal{F}$, therefore one of $\varphi$ on $\mathcal{F}_+$. Now let $s \in C_{t+}(x_0)$. Then by definition of the tangent cone there exists a sequence $(x^{(k)})$ in $\mathcal{F}_+$ such that $x^{(k)} = x_0 + \alpha_k (s + r_k)$, $\alpha_k \downarrow 0$ and $r_k \to 0$. By assumption $\nabla \varphi(x_0) = 0$. With the Taylor theorem we get

$$\varphi(x_0) \le \varphi\big(x^{(k)}\big) = \varphi(x_0) + \alpha_k \underbrace{\varphi'(x_0)(s + r_k)}_{=\,0} + \tfrac{1}{2}\, \alpha_k^2\, (s + r_k)^T\, \nabla^2 \varphi\big(x_0 + \tau_k (x^{(k)} - x_0)\big)\, (s + r_k)$$

for all sufficiently large $k$ and a suitable $\tau_k \in (0, 1)$. Dividing by $\alpha_k^2 / 2$ and passing to the limit as $k \to \infty$ gives the result $s^T \nabla^2 \varphi(x_0)\, s \ge 0$.

In the following example we will see that $x_0 := (0, 0, 0)^T$ is a stationary point. With the help of theorem 2.3.1 we want to show that the necessary condition for a minimum is not met.

Example 8

$$(P)\qquad \begin{cases} f(x) := x_3 - \tfrac{1}{2} x_1^2 \to \min \\ g_1(x) := -x_1^2 - x_2 - x_3 \le 0 \\ g_2(x) := -x_1^2 + x_2 - x_3 \le 0 \\ g_3(x) := -x_3 \le 0 \end{cases}$$

For the point $x_0 := (0, 0, 0)^T$ we have $f'(x_0) = (0, 0, 1)$, $A(x_0) = \{1, 2, 3\}$ and $g_1'(x_0) = (0, -1, -1)$, $g_2'(x_0) = (0, 1, -1)$, $g_3'(x_0) = (0, 0, -1)$. We start with the gradient condition:

$$\nabla_x L(x_0, \lambda) = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} + \lambda_1 \begin{pmatrix} 0 \\ -1 \\ -1 \end{pmatrix} + \lambda_2 \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix} + \lambda_3 \begin{pmatrix} 0 \\ 0 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$

$$\iff \begin{cases} -\lambda_1 + \lambda_2 = 0 \\ -\lambda_1 - \lambda_2 - \lambda_3 = -1 \end{cases} \iff \lambda_2 = \lambda_1, \ \ \lambda_3 = 1 - 2\lambda_1$$

For $\lambda_1 := 1/2$ we obtain $\lambda = (1/2, 1/2, 0)^T \in \mathbb{R}^3_+$ and $\lambda_i g_i(x_0) = 0$ for $i \in I$. Hence, we get $A_+(x_0) = \{1, 2\}$,

$$\mathcal{F}_+ = \{x \in \mathbb{R}^3 \mid g_1(x) = g_2(x) = 0,\ g_3(x) \le 0\} = \{(0, 0, 0)^T\}$$

and therefore $C_{t+}(x_0) = \{(0, 0, 0)^T\}$. In this way no decision can be made!

Setting $\lambda_1 := 0$ we obtain respectively $\lambda = e_3$, $A_+(x_0) = \{3\}$,

$$\mathcal{F}_+ = \{x \in \mathcal{F} \mid x_3 = 0\}, \qquad C_{t+}(x_0) = \{\alpha e_1 \mid \alpha \in \mathbb{R}\}$$

and

$$H := \nabla^2 f(x_0) + \nabla^2 g_3(x_0) = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
$H$ is negative definite on $C_{t+}(x_0)$. Consequently there is no local minimum of $(P)$ at $x_0 = 0$.

In order to expand the second-order necessary condition to a sufficient condition, we will now have to make stronger assumptions. Before we do that, let us recall that there will remain a 'gap' between these two conditions. This fact is well-known (even for real-valued functions of one variable) and is usually demonstrated by the functions $f_2$, $f_3$ and $f_4$ defined by $f_k(x) := x^k$ for $x \in \mathbb{R}$, $k = 2, 3, 4$, at the point $x_0 = 0$.
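Returning to Example 8, the conclusion can be made tangible with a few function evaluations. The sketch below (function names are illustrative, not from the text) moves from $x_0$ along the direction $e_1 \in C_{t+}(x_0)$: the points $(t, 0, 0)^T$ stay feasible while $f$ drops below $f(x_0) = 0$, and the quadratic form $s^T H s$ with $H = \operatorname{diag}(-1, 0, 0)$ is negative for $s = e_1$.

```python
def f(x):
    """Objective of Example 8."""
    return x[2] - 0.5 * x[0] ** 2

def constraints(x):
    """Values (g1(x), g2(x), g3(x)) of Example 8."""
    x1, x2, x3 = x
    return (-x1 ** 2 - x2 - x3, -x1 ** 2 + x2 - x3, -x3)

def is_feasible(x):
    return all(gi <= 0.0 for gi in constraints(x))

# Hessian of the Lagrangian at x0 for lambda = e3, as computed in the example.
H = ((-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0))

def quad_form(matrix, s):
    """Evaluate s^T * matrix * s for a 3x3 matrix."""
    return sum(s[i] * matrix[i][j] * s[j] for i in range(3) for j in range(3))

def values_along_e1(ts=(1e-1, 1e-2, 1e-3)):
    """Feasibility flag and objective value at the points x0 + t * e1."""
    return [(t, is_feasible((t, 0.0, 0.0)), f((t, 0.0, 0.0))) for t in ts]

if __name__ == "__main__":
    print("s^T H s for s = e1:", quad_form(H, (1.0, 0.0, 0.0)))
    for t, feasible, value in values_along_e1():
        print(f"t = {t:g}: feasible = {feasible}, f = {value:g}")
```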
The following Remark can be proven in the same way as 2) in lemma 2.2.3:

$$C_{t+}(x_0) \subset C_{\ell+}(x_0) := \left\{ s \in \mathbb{R}^n \ \middle|\ \begin{array}{ll} g_i'(x_0)s = 0 & \text{for } i \in A_+(x_0) \\ g_i'(x_0)s \le 0 & \text{for } i \in A(x_0) \setminus A_+(x_0) \\ h_j'(x_0)s = 0 & \text{for } j \in E \end{array} \right\}$$

Theorem 2.3.2 (Sufficient second-order condition)
Suppose $x_0 \in \mathcal{F}$ and there exist vectors $\lambda \in \mathbb{R}^m_+$ and $\mu \in \mathbb{R}^p$ such that

$$\nabla_x L(x_0, \lambda, \mu) = 0 \quad\text{and}\quad \lambda^T g(x_0) = 0.$$

Furthermore, suppose that

$$s^T \nabla^2_{xx} L(x_0, \lambda, \mu)\, s > 0$$

for all $s \in C_{\ell+}(x_0) \setminus \{0\}$. Then $(P)$ attains a strict local minimum at $x_0$.

Proof (indirect): If $f$ does not have a strict local minimum at $x_0$, then there exists a sequence $(x^{(k)})$ in $\mathcal{F} \setminus \{x_0\}$ with $x^{(k)} \to x_0$ and $f(x^{(k)}) \le f(x_0)$. For

$$s_k := \frac{x^{(k)} - x_0}{\|x^{(k)} - x_0\|_2}$$

it holds that $\|s_k\|_2 = 1$. Hence, there exists a convergent subsequence. Wlog suppose $s_k \to s$ for an $s \in \mathbb{R}^n$. With $\alpha_k := \|x^{(k)} - x_0\|_2$ we have $x^{(k)} = x_0 + \alpha_k s_k$ and wlog $\alpha_k \downarrow 0$. From

$$f(x_0) \ge f(x^{(k)}) = f(x_0) + \alpha_k f'(x_0) s_k + o(\alpha_k)$$

it follows that $f'(x_0)s \le 0$.

For $i \in A(x_0)$ and $j \in E$ we get in the same way:

$$\underbrace{g_i(x^{(k)})}_{\le\, 0} = \underbrace{g_i(x_0)}_{=\,0} + \alpha_k g_i'(x_0) s_k + o(\alpha_k) \implies g_i'(x_0)s \le 0$$

$$\underbrace{h_j(x^{(k)})}_{=\,0} = \underbrace{h_j(x_0)}_{=\,0} + \alpha_k h_j'(x_0) s_k + o(\alpha_k) \implies h_j'(x_0)s = 0$$

With the assumption $\nabla_x L(x_0, \lambda, \mu) = 0$ it follows that

$$\underbrace{f'(x_0)s}_{\le\, 0} + \underbrace{\sum_{i=1}^{m} \lambda_i g_i'(x_0)s}_{=\, \sum_{i \in A_+(x_0)} \lambda_i g_i'(x_0)s \,\le\, 0} + \underbrace{\sum_{j=1}^{p} \mu_j h_j'(x_0)s}_{=\,0} = 0$$

and from that $g_i'(x_0)s = 0$ for all $i \in A_+(x_0)$. Since $\|s\|_2 = 1$, we get $s \in C_{\ell+}(x_0) \setminus \{0\}$. For the function $\varphi$ defined by

$$\varphi(x) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \mu_j h_j(x) = L(x, \lambda, \mu)$$

it holds by assumption that $\nabla \varphi(x_0) = 0$. Moreover,

$$\varphi(x^{(k)}) = \underbrace{f(x^{(k)})}_{\le\, f(x_0)} + \sum_{i=1}^{m} \underbrace{\lambda_i g_i(x^{(k)})}_{\le\, 0} + \sum_{j=1}^{p} \underbrace{\mu_j h_j(x^{(k)})}_{=\,0} \le f(x_0) = \varphi(x_0).$$

The Taylor theorem yields

$$\varphi(x^{(k)}) = \varphi(x_0) + \alpha_k \underbrace{\varphi'(x_0) s_k}_{=\,0} + \tfrac{1}{2}\, \alpha_k^2\, s_k^T\, \nabla^2 \varphi\big(x_0 + \tau_k (x^{(k)} - x_0)\big)\, s_k$$

with a suitable $\tau_k \in (0, 1)$. From this we deduce, as usual, $s^T \nabla^2 \varphi(x_0) s \le 0$. With $s \in C_{\ell+}(x_0) \setminus \{0\}$ we get a contradiction to our assumption.

The following example gives a simple illustration of the necessary and sufficient second-order conditions of theorems 2.3.1 and 2.3.2:

Example 9 (Fiacco and McCormick (1968))

$$(P)\qquad \begin{cases} f(x) := (x_1 - 1)^2 + x_2^2 \to \min \\ g_1(x) := x_1 - \varrho\, x_2^2 \le 0 \end{cases}$$

We are looking for $\varrho > 0$ such that $x_0 := (0, 0)^T$ is a local minimizer of the problem: With $\nabla f(x_0) = (-2, 0)^T$, $\nabla g_1(x_0) = (1, 0)^T$ the condition $\nabla_x L(x_0, \lambda, \mu) = 0$ firstly yields $\lambda_1 = 2$. In this case (MFCQ) is fulfilled with $A_1(x_0) = A(x_0) = \{1\} = A_+(x_0)$. We have

$$C_{\ell+}(x_0) = \{d \in \mathbb{R}^2 \mid d_1 = 0\} = C_{t+}(x_0).$$
The matrix (with $\varrho$ denoting the parameter in the constraint $g_1$)

$$\nabla^2 f(x_0) + 2\, \nabla^2 g_1(x_0) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} + 2 \begin{pmatrix} 0 & 0 \\ 0 & -2\varrho \end{pmatrix} = 2 \begin{pmatrix} 1 & 0 \\ 0 & 1 - 2\varrho \end{pmatrix}$$

is negative definite on $C_{t+}(x_0)$ for $\varrho > 1/2$. Thus the second-order necessary condition of theorem 2.3.1 is violated and so there is no local minimum at $x_0$. For $\varrho < 1/2$ the Hessian is positive definite on $C_{\ell+}(x_0)$. Hence, the sufficient conditions of theorem 2.3.2 are fulfilled and thus there is a strict local minimum at $x_0$. When $\varrho = 1/2$, this result is not determined by the second-order conditions; but we can confirm it in the following simple way: $f(x) = (x_1 - 1)^2 + x_2^2 = x_1^2 + 1 + (x_2^2 - 2x_1)$. Because of $x_2^2 - 2x_1 \ge 0$ this yields $f(x) \ge 1$ and $f(x) = 1$ only for $x_1 = 0$ and $x_2^2 - 2x_1 = 0$. Hence, there is a strict local minimum at $x_0$.

[Figure: two panels illustrating the example for $\varrho = 1$ and $\varrho = 1/4$.]
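The three cases of Example 9 can be reproduced with a short computation. The sketch below writes `rho` for the positive parameter in the constraint $g_1$ and evaluates $f$ along the boundary curve $x_1 = \varrho\, t^2$, $x_2 = t$, where $g_1 = 0$. There $f = 1 + (1 - 2\varrho)\, t^2 + \varrho^2 t^4$, so the sign of $1 - 2\varrho$ decides whether $f$ falls below $f(x_0) = 1$ near $x_0$. The function names are illustrative assumptions.

```python
def f(x1, x2):
    """Objective of Example 9."""
    return (x1 - 1.0) ** 2 + x2 ** 2

def f_on_boundary(rho, t):
    """Objective at the feasible boundary point (rho * t^2, t), where g1 = 0."""
    return f(rho * t * t, t)

def reduced_curvature(rho):
    """The (2,2) entry 2 * (1 - 2 * rho) of the matrix, acting on {d : d1 = 0}."""
    return 2.0 * (1.0 - 2.0 * rho)

def classify(rho):
    """Verdict of theorems 2.3.1 and 2.3.2 at x0 = (0, 0)."""
    c = reduced_curvature(rho)
    if c > 0.0:
        return "strict local minimum (sufficient condition holds)"
    if c < 0.0:
        return "no local minimum (necessary condition violated)"
    return "undecided by the second-order conditions"

if __name__ == "__main__":
    for rho in (1.0, 0.5, 0.25):
        print(rho, classify(rho), f_on_boundary(rho, 0.1) - f(0.0, 0.0))
```

For `rho = 0.5` the classification is undecided, yet the boundary values stay above 1, in line with the direct argument given in the example.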
2.4 Duality

Duality plays a crucial role in the theory of optimization and in the development of corresponding computational algorithms. It gives insight from a theoretical point of view but is also significant for computational purposes and economic interpretations, for example shadow prices. We shall concentrate on some of the more basic results and limit ourselves to a particular duality, Lagrange duality, which is the most popular and useful one for many purposes. Given an arbitrary optimization problem, called primal problem, we consider a problem that is closely related to it, called the Lagrange dual problem. Several properties of this dual problem are demonstrated in this section. They help to provide strategies for solving the primal and the dual problem. The Lagrange dual problem of large classes of important nonconvex optimization problems can be formulated as an easier problem than the original one.

Lagrange Dual Problem

With $n \in \mathbb{N}$, $m, p \in \mathbb{N}_0$, $\emptyset \ne C \subset \mathbb{R}^n$, functions $f : C \to \mathbb{R}$, $g = (g_1, \dots, g_m)^T : C \to \mathbb{R}^m$, $h = (h_1, \dots, h_p)^T : C \to \mathbb{R}^p$ and the feasible region

$$\mathcal{F} := \{x \in C \mid g(x) \le 0,\ h(x) = 0\}$$

we regard the primal problem in standard form:

$$(P)\qquad \begin{cases} f(x) \to \min \\ x \in \mathcal{F} \end{cases}$$

There is a certain flexibility in defining a given problem: Some of the constraints $g_i(x) \le 0$ or $h_j(x) = 0$ can be included in the definition of the set $C$. Substituting the two inequalities $h_j(x) \le 0$ and $-h_j(x) \le 0$ for every equation $h_j(x) = 0$ we can assume wlog $p = 0$. Then we have

$$\mathcal{F} = \{x \in C \mid g(x) \le 0\}.$$

The Lagrangian function $L$ is a weighted sum of the objective function and the constraint functions, defined by

$$L(x, \lambda) := f(x) + \lambda^T g(x) = f(x) + \langle \lambda, g(x) \rangle = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x)$$

for $x \in C$ and $\lambda = (\lambda_1, \dots, \lambda_m)^T \in \mathbb{R}^m_+$. The vector $\lambda$ is called the dual variable or multiplier associated with the problem. For $i = 1, \dots, m$ we refer to $\lambda_i$ as the dual variable or multiplier associated with the inequality constraint $g_i(x) \le 0$. The Lagrange dual function, or dual function, $\varphi$ is defined by

$$\varphi(\lambda) := \inf_{x \in C} L(x, \lambda)$$

on the effective domain of $\varphi$

$$\mathcal{F}_D := \Big\{ \lambda \in \mathbb{R}^m_+ \ \Big|\ \inf_{x \in C} L(x, \lambda) > -\infty \Big\}.$$

The Lagrange dual problem, or dual problem, then is defined by

$$(D)\qquad \begin{cases} \varphi(\lambda) \to \max \\ \lambda \in \mathcal{F}_D. \end{cases}$$

In the general case, the dual problem may not have a solution, even if the primal problem has one; conversely, the primal problem may not have a solution, even if the dual problem has one:
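Example 10 For both examples let $C := \mathbb{R}$, $m := 1$ and $p := 0$:

a) $$(P)\qquad \begin{cases} f(x) := x + 2010 \to \min \\ g(x) := \tfrac{1}{2} x^2 \le 0 \end{cases}$$

1. $x^* := 0$ is the only feasible point. Thus $\inf\{f(x) \mid x \in \mathcal{F}\} = f(0) = 2010$.
2. $L(x, \lambda) := f(x) + \lambda g(x) = x + 2010 + \tfrac{\lambda}{2} x^2$ $(\lambda \ge 0,\ x \in \mathbb{R})$. $\mathcal{F}_D = \mathbb{R}_{++}$ (for $\lambda > 0$: parabola opening upwards; for $\lambda = 0$: unbounded from below): $\varphi(\lambda) = 2010 - \frac{1}{2\lambda}$

b) $$(P)\qquad \begin{cases} f(x) := \exp(-x) \to \min \\ g(x) := -x \le 0 \end{cases}$$

1. We have $\inf\{f(x) \mid x \in \mathcal{F}\} = \inf\{\exp(-x) \mid x \ge 0\} = 0$, but there exists no $x \in \mathcal{F} = \mathbb{R}_+$ with $f(x) = 0$.
2. $L(x, \lambda) := f(x) + \lambda g(x) = \exp(-x) - \lambda x$ $(\lambda \ge 0)$ shows $\mathcal{F}_D = \{0\}$ with $\varphi(0) = 0$. So we have $\sup\{\varphi(\lambda) \mid \lambda \in \mathcal{F}_D\} = 0 = \varphi(0)$.

The dual objective function $\varphi$, as the pointwise infimum of a family of affinely linear functions, is always a concave function, even if the initial problem is not convex. Hence the dual problem can always be written ($\varphi \to -\varphi$) as a convex minimum problem:

Remark The set $\mathcal{F}_D$ is convex, and $\varphi$ is a concave function on $\mathcal{F}_D$.

Proof: Let $x \in C$, $\alpha \in [0, 1]$ and $\lambda, \mu \in \mathcal{F}_D$:

$$\begin{aligned}
L(x, \alpha\lambda + (1 - \alpha)\mu) &= f(x) + \langle \alpha\lambda + (1 - \alpha)\mu,\ g(x) \rangle \\
&= \alpha \big( f(x) + \langle \lambda, g(x) \rangle \big) + (1 - \alpha) \big( f(x) + \langle \mu, g(x) \rangle \big) \\
&= \alpha L(x, \lambda) + (1 - \alpha) L(x, \mu) \ \ge\ \alpha \varphi(\lambda) + (1 - \alpha) \varphi(\mu)
\end{aligned}$$

This inequality has two implications: $\alpha\lambda + (1 - \alpha)\mu \in \mathcal{F}_D$, and further, $\varphi(\alpha\lambda + (1 - \alpha)\mu) \ge \alpha \varphi(\lambda) + (1 - \alpha) \varphi(\mu)$.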
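For Example 10 a) the dual function is available in closed form, so its properties can be checked directly. The following sketch (illustrative names, standard library only) confirms that the minimizer $x = -1/\lambda$ of $L(\cdot, \lambda)$ reproduces $\varphi(\lambda) = 2010 - 1/(2\lambda)$, tests the concavity inequality of the remark at a midpoint, and shows that $\varphi$ approaches the value 2010 without attaining it.

```python
def lagrangian(x, lam):
    """L(x, lam) = x + 2010 + (lam / 2) * x^2 from Example 10 a)."""
    return x + 2010.0 + 0.5 * lam * x * x

def dual(lam):
    """phi(lam) for Example 10 a); -inf outside the effective domain lam > 0."""
    if lam <= 0.0:
        return float("-inf")
    return 2010.0 - 1.0 / (2.0 * lam)

def midpoint_concavity_holds(a, b, tol=1e-12):
    """Check phi((a + b) / 2) >= (phi(a) + phi(b)) / 2 up to rounding."""
    return dual(0.5 * (a + b)) >= 0.5 * (dual(a) + dual(b)) - tol

if __name__ == "__main__":
    for lam in (0.5, 2.0, 100.0):
        x_min = -1.0 / lam  # stationary point of the parabola x -> L(x, lam)
        print(lam, lagrangian(x_min, lam), dual(lam))
    print("concave at midpoint of 0.1 and 5:", midpoint_concavity_holds(0.1, 5.0))
```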
As we shall see below, the dual function yields lower bounds on the optimal value

$$p^* := v(P) := \inf(P) := \inf\{f(x) : x \in \mathcal{F}\}$$

of the primal problem $(P)$. The optimal value of the dual problem $(D)$ is defined by

$$d^* := v(D) := \sup(D) := \sup\{\varphi(\lambda) : \lambda \in \mathcal{F}_D\}.$$

We allow $v(P)$ and $v(D)$ to attain the extended values $+\infty$ and $-\infty$ and follow the standard convention that the infimum of the empty set is $\infty$ and the supremum of the empty set is $-\infty$. If there are feasible points $x_k$ with $f(x_k) \to -\infty$ $(k \to \infty)$, then $v(P) = -\infty$ and we say problem $(P)$ (or the function $f$ on $\mathcal{F}$) is unbounded from below. If there are feasible points $\lambda_k$ with $\varphi(\lambda_k) \to \infty$ $(k \to \infty)$, then $v(D) = \infty$ and we say problem $(D)$ (or the function $\varphi$ on $\mathcal{F}_D$) is unbounded from above. The problems $(P)$ and $(D)$ always have optimal values, possibly $\infty$ or $-\infty$. The question is whether or not they have optimizers, that is, whether there exist feasible points achieving these values. If there exists a feasible point achieving $\inf(P)$, we sometimes write $\min(P)$ instead of $\inf(P)$, accordingly $\max(D)$ instead of $\sup(D)$ if there is a feasible point achieving $\sup(D)$. In example 10, a) we had $\min(P) = \sup(D)$, in example 10, b) we got $\inf(P) = \max(D)$.

What is the relationship between $d^*$ and $p^*$? The following theorem gives a first answer:

Weak Duality Theorem
If $x$ is feasible to the primal problem $(P)$ and $\lambda$ is feasible to the dual problem $(D)$, then we have $\varphi(\lambda) \le f(x)$. In particular $d^* \le p^*$.

Proof: Let $x \in \mathcal{F}$ and $\lambda \in \mathcal{F}_D$:

$$\varphi(\lambda) \le L(x, \lambda) = f(x) + \underbrace{\lambda^T}_{\ge\, 0}\, \underbrace{g(x)}_{\le\, 0} \le f(x)$$

This implies immediately $d^* \le p^*$.

Although very easy to show, the weak duality result has useful implications: For instance, it implies that the primal problem has no feasible points if the optimal value of $(D)$ is $\infty$. Conversely, if the primal problem is unbounded from below, the dual problem has no feasible points. Any feasible point $\lambda$ to the dual problem provides a lower bound $\varphi(\lambda)$ on the optimal value $p^*$ of problem $(P)$, and any feasible point $x$ to the primal problem $(P)$ provides an upper bound $f(x)$ on the optimal value $d^*$ of problem $(D)$. One aim is to generate good bounds. This can help to get termination criteria for algorithms: If one has a feasible point $x$ to $(P)$ and a feasible point $\lambda$ to $(D)$, whose values are close together, then these values must be close to the optima in both problems.

Corollary
If $f(x^*) = \varphi(\lambda^*)$ for some $x^* \in \mathcal{F}$ and $\lambda^* \in \mathcal{F}_D$, then $x^*$ is a minimizer to the primal problem $(P)$ and $\lambda^*$ is a maximizer to the dual problem $(D)$.

Proof:

$$\varphi(\lambda^*) \le \sup\{\varphi(\lambda) \mid \lambda \in \mathcal{F}_D\} \le \inf\{f(x) \mid x \in \mathcal{F}\} \le f(x^*) = \varphi(\lambda^*)$$

Hence, equality holds everywhere, in particular $f(x^*) = \inf\{f(x) \mid x \in \mathcal{F}\}$ and $\varphi(\lambda^*) = \sup\{\varphi(\lambda) \mid \lambda \in \mathcal{F}_D\}$.
The difference $p^* - d^*$ is called the duality gap. If this duality gap is zero, that is, $p^* = d^*$, then we say that strong duality holds. We will see later on: If the functions $f$ and $g$ are convex (on the convex set $C$) and a certain constraint qualification holds, then one has strong duality. In nonconvex cases, however, a duality gap $p^* - d^* > 0$ has to be expected. The following examples illustrate the necessity of making more demands on $f$, $g$ and $C$ to get a close relation between the problems $(P)$ and $(D)$:

Example 11 With $n := 1$, $m := 1$:

a) $d^* = -\infty$, $p^* = \infty$
$C := \mathbb{R}_+$, $f(x) := -x$, $g(x) := \pi$ $(x \in C)$:
$L(x, \lambda) = -x + \lambda\pi$ $(x \in C,\ \lambda \in \mathbb{R}_+)$
$\mathcal{F} = \emptyset$, $p^* = \infty$; $\inf_{x \in C} L(x, \lambda) = -\infty$, $\mathcal{F}_D = \emptyset$, $d^* = -\infty$

b) $d^* = 0$, $p^* = \infty$
$C := \mathbb{R}_{++}$, $f(x) := x$, $g(x) := x$ $(x \in C)$:
$L(x, \lambda) = x + \lambda x = (1 + \lambda)x$
$\mathcal{F} = \emptyset$, $p^* = \infty$; $\mathcal{F}_D = \mathbb{R}_+$, $\varphi(\lambda) = 0$, $d^* = 0$

c) $-\infty = d^* < p^* = 0$
$C := \mathbb{R}$, $f(x) := x^3$, $g(x) := -x$ $(x \in \mathbb{R})$:
$\mathcal{F} = \mathbb{R}_+$, $p^* = \min(P) = 0$
$L(x, \lambda) = x^3 - \lambda x$ $(x \in \mathbb{R},\ \lambda \ge 0)$
$\mathcal{F}_D = \emptyset$, $d^* = -\infty$

d) $d^* = \max(D) < \min(P) = p^*$
$C := [0, 1]$, $f(x) := -x^2$, $g(x) := 2x - 1$ $(x \in C)$:
$\mathcal{F} = [0, 1/2]$, $p^* = \min(P) = f(1/2) = -1/4$
$L(x, \lambda) = -x^2 + \lambda(2x - 1)$ $(x \in [0, 1],\ \lambda \ge 0)$
For $\lambda \in \mathcal{F}_D = \mathbb{R}_+$ we get

$$\varphi(\lambda) = \min\{L(0, \lambda),\ L(1, \lambda)\} = \min\{-\lambda,\ \lambda - 1\} = \begin{cases} -\lambda, & \lambda \ge 1/2 \\ \lambda - 1, & \lambda < 1/2 \end{cases}$$

and hence, $d^* = \max(D) = \varphi(1/2) = -1/2$.
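The duality gap of Example 11 d) can be reproduced by brute force. The sketch below discretizes $C = [0, 1]$ and a range of multipliers (the grid sizes and the multiplier range $[0, 2]$ are arbitrary assumptions), minimizes the Lagrangian over the grid to approximate $\varphi$, and compares the resulting dual value with the primal value. The function names are illustrative.

```python
def f(x):
    """Objective of Example 11 d)."""
    return -x * x

def g(x):
    """Constraint function of Example 11 d); feasible means g(x) <= 0."""
    return 2.0 * x - 1.0

def dual(lam, n=1000):
    """Grid approximation of phi(lam) = min over x in [0, 1] of f(x) + lam * g(x)."""
    return min(f(i / n) + lam * g(i / n) for i in range(n + 1))

def primal_value(n=1000):
    """Grid approximation of p* = min of f over the feasible points of [0, 1]."""
    return min(f(i / n) for i in range(n + 1) if g(i / n) <= 0.0)

def dual_value(steps=200, lam_max=2.0):
    """Grid approximation of d* over multipliers in [0, lam_max]."""
    return max(dual(lam_max * k / steps) for k in range(steps + 1))

if __name__ == "__main__":
    p_star, d_star = primal_value(), dual_value()
    print("p* ~", p_star, " d* ~", d_star, " gap ~", p_star - d_star)
```

Because $L(\cdot, \lambda)$ is concave on $[0, 1]$, its minimum sits at an endpoint, which is why the grid minimum agrees with $\min\{-\lambda, \lambda - 1\}$.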
With $m, n \in \mathbb{N}$, a real $(m, n)$-matrix $A$, vectors $b \in \mathbb{R}^m$ and $c \in \mathbb{R}^n$ we consider a linear problem in standard form, that is,

$$(P)\qquad \begin{cases} c^T x \to \min \\ Ax = b,\ x \ge 0. \end{cases}$$

The Lagrange dual problem of this linear problem is given by

$$(D)\qquad \begin{cases} b^T \mu \to \max \\ A^T \mu \le c. \end{cases}$$

Proof: With $f(x) := c^T x$, $h(x) := b - Ax$ $(x \in \mathbb{R}^n_+ =: C)$ we have

$$L(x, \mu) = \langle c, x \rangle + \langle \mu, b - Ax \rangle = \langle \mu, b \rangle + \langle x, c - A^T \mu \rangle,$$

$$\inf_{x \in C} \big( \langle \mu, b \rangle + \langle x, c - A^T \mu \rangle \big) = \begin{cases} b^T \mu, & \text{if } A^T \mu \le c \\ -\infty, & \text{else} \end{cases} \qquad (\mu \in \mathbb{R}^m).$$

It is easy to verify that the Lagrange dual problem of $(D)$, transformed into standard form, is again the primal problem (cf. exercise 18).

Geometric Interpretation

We give a geometric interpretation of the dual problem that helps to find and understand examples which illustrate the various possible relations that can occur between the primal and the dual problem. This visualization can give insight into theoretical results. For the sake of simplicity, we consider only the case $m = 1$, that is, only one inequality constraint: We look at the image of $C$ under the map $(g, f)$, that is,

$$B := \{(g(x), f(x)) \mid x \in C\}.$$

In the primal problem we have to find a pair $(v, w) \in B$ with minimal ordinate $w$ in the $(v, w)$-plane, that is, the point $(v, w)$ in $B$ which minimizes $w$ subject to $v \le 0$. It is the point $(v^*, w^*)$, the image under $(g, f)$ of the minimizer $x^*$ to problem $(P)$, in the following figure, which illustrates a typical case for $n = 2$:
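[Figure: the region $B$ in the $(v, w)$-plane ($v$: values of $g$, $w$: values of $f$) with the point $(v^*, w^*)$, the point $S$ where a line of slope $-\lambda$ touches $B$, the line of slope $-\lambda^*$, and the intercepts $\varphi(\lambda)$ and $\varphi(\lambda^*)$ on the $w$-axis.]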
To get $\varphi(\lambda)$ for a fixed $\lambda \ge 0$, we have to minimize $L(x, \lambda) = f(x) + \lambda g(x)$ over $x \in C$, that is, $w + \lambda v$ over $(v, w) \in B$. For any constant $c \in \mathbb{R}$, the equation $w + \lambda v = c$ describes a straight line with slope $-\lambda$ and intercept $c$ on the $w$-axis. Hence we have to find the lowest line with slope $-\lambda$ which intersects the region $B$ (move the line $w + \lambda v = c$ parallel to itself as far down as possible while it touches $B$). This leads to the line tangent to $B$ at the point $S$ in the figure. (The region $B$ has to lie above the line and to touch it.) Then the intercept on the $w$-axis gives $\varphi(\lambda)$. The geometric description of the dual problem $(D)$ is now clear: Find the value $\lambda^*$ which defines the slope of a tangent to $B$ intersecting the ordinate at the highest possible point.

Example 12 Let $n := 2$, $m := 1$, $C := \mathbb{R}^2_+$ and $x = (x_1, x_2)^T \in C$:

$$(P)\qquad \begin{cases} f(x) := x_1^2 + x_2^2 \to \min \\ g(x) := 6 - x_1 - x_2 \le 0 \end{cases}$$

$g(x) \le 0$ implies $6 \le x_1 + x_2$. The equality $6 = x_1 + x_2$ gives $f(x) = x_1^2 + (6 - x_1)^2 = 2\big[(x_1 - 3)^2 + 9\big]$. The minimum is attained at $x^* = (3, 3)$ with $f(x^*) = 18$:

$$\min(P) = 18$$

$$L(x, \lambda) = x_1^2 + x_2^2 + \lambda(-x_1 - x_2 + 6) = (x_1 - \lambda/2)^2 + (x_2 - \lambda/2)^2 + 6\lambda - \lambda^2/2 \qquad (\lambda \ge 0,\ x \in C)$$

So we get the minimum for $x_1 = x_2 = \lambda/2$ with value $6\lambda - \lambda^2/2$. $\varphi(\lambda) = 6\lambda - \lambda^2/2$ describes a parabola, therefore we get the maximum at $\lambda = 6$ with value $\varphi(\lambda) = 18$:

$$\max(D) = 18$$

To get the region $B := \{(g(x), f(x)) : x \in C\}$, we proceed as follows: For $x \in C$ we have $v := g(x) \le 6$. The equation $-x_1 - x_2 + 6 = v$ gives $x_2 = -x_1 + 6 - v$ and further

$$f(x) = x_1^2 + x_2^2 = x_1^2 + \big(x_1 + (v - 6)\big)^2 = 2x_1^2 + 2(v - 6)x_1 + (v - 6)^2 = 2\big(x_1 + (v - 6)/2\big)^2 + (v - 6)^2/2 \ \ge\ (v - 6)^2/2$$

with equality for $x_1 = -(v - 6)/2$. Moreover,

$$f(x) = 2\underbrace{x_1 (x_1 + v - 6)}_{\le\, 0} + (v - 6)^2 \ \le\ (v - 6)^2$$

with equality for $x_1 = 0$. So we have

$$B = \{(v, w) \mid v \le 6,\ (v - 6)^2/2 \le w \le (v - 6)^2\}.$$

The attentive reader will have noticed that this example corresponds to the foregoing figure.

Example 13 We look once more at example 11, d):

$$B := \{(g(x), f(x)) \mid x \in C\} = \{(2x - 1,\ -x^2) \mid 0 \le x \le 1\}$$

$v := g(x) = 2x - 1 \in [-1, 1]$ gives $x = (1 + v)/2$, hence, $w := f(x) = -(1 + v)^2/4$.

[Figure "Duality Gap": the curve $B$ in the $(v, w)$-plane ($v$: values of $g$, $w$: values of $f$) with the point $(v^*, w^*)$, the line of slope $-1/2$ and its intercept $\varphi(\lambda^*)$ on the $w$-axis.]
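The closed-form dual function of Example 12 can be compared with a direct minimization of the Lagrangian. In the sketch below the set $C = \mathbb{R}^2_+$ is replaced by a grid on the box $[0, 6]^2$; this truncation is an assumption that is harmless for $0 \le \lambda \le 12$, because the minimizer $x_1 = x_2 = \lambda/2$ then lies inside the box. The function names are illustrative.

```python
def lagrangian(x1, x2, lam):
    """L(x, lam) = x1^2 + x2^2 + lam * (6 - x1 - x2) from Example 12."""
    return x1 * x1 + x2 * x2 + lam * (6.0 - x1 - x2)

def dual_closed_form(lam):
    """phi(lam) = 6 * lam - lam^2 / 2, as derived in the example."""
    return 6.0 * lam - 0.5 * lam * lam

def dual_grid(lam, n=120, hi=6.0):
    """Minimize L(., lam) over an (n+1) x (n+1) grid on [0, hi]^2 (truncated C)."""
    step = hi / n
    return min(
        lagrangian(i * step, j * step, lam)
        for i in range(n + 1)
        for j in range(n + 1)
    )

def best_multiplier(candidates):
    """Multiplier with the largest closed-form dual value among the candidates."""
    return max(candidates, key=dual_closed_form)

if __name__ == "__main__":
    for lam in (0.0, 2.0, 6.0, 10.0):
        print(lam, dual_grid(lam), dual_closed_form(lam))
    lam_star = best_multiplier([k / 10.0 for k in range(0, 121)])
    print("maximizer ~", lam_star, "with value", dual_closed_form(lam_star))
```

Every dual value stays below $\min(P) = 18$, and the bound is attained at $\lambda = 6$, so there is no duality gap in this example.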
Saddlepoints and Duality

For the following characterization of strong duality neither convexity nor differentiability is needed:

Theorem 2.4.1
Let $x^*$ be a point in $C$ and $\lambda^* \in \mathbb{R}^m_+$. Then the following statements are equivalent:
a) $(x^*, \lambda^*)$ is a saddlepoint of the Lagrange function $L$.
b) $x^*$ is a minimizer to problem $(P)$ and $\lambda^*$ is a maximizer to problem $(D)$ with $f(x^*) = L(x^*, \lambda^*) = \varphi(\lambda^*)$.
In other words: A saddlepoint of the Lagrangian $L$ exists if and only if the problems $(P)$ and $(D)$ have the same value and admit optimizers, that is, $\min(P) = \max(D)$.

Proof: First, we show that a) implies b):

$$L(x^*, \lambda^*) = \inf_{x \in C} L(x, \lambda^*) \le \sup_{\lambda \in \mathbb{R}^m_+} \inf_{x \in C} L(x, \lambda) \le \inf_{x \in C} \sup_{\lambda \in \mathbb{R}^m_+} L(x, \lambda) \le \sup_{\lambda \in \mathbb{R}^m_+} L(x^*, \lambda) = L(x^*, \lambda^*)$$

Consequently,

$$\infty > \varphi(\lambda^*) = \inf_{x \in C} L(x, \lambda^*) = \sup_{\lambda \in \mathbb{R}^m_+} L(x^*, \lambda) = L(x^*, \lambda^*).$$

By lemma 2.2.6 we know already: $x^*$ is a minimizer of $(P)$ with $f(x^*) = L(x^*, \lambda^*)$. b) now follows by the corollary to the weak duality theorem.

Conversely, suppose now that b) holds true:

$$\varphi(\lambda^*) = \inf\{L(x, \lambda^*) \mid x \in C\} \le L(x^*, \lambda^*) = f(x^*) + \langle \lambda^*, g(x^*) \rangle \le f(x^*) \tag{7}$$
We have $\varphi(\lambda^*) = f(x^*)$, by assumption. Therefore, equality holds everywhere in (7), in particular $\langle \lambda^*, g(x^*) \rangle = 0$. This leads to $L(x^*, \lambda^*) = f(x^*) \le L(x, \lambda^*)$ for $x \in C$ and $L(x^*, \lambda) = f(x^*) + \langle \lambda, g(x^*) \rangle \le f(x^*) = L(x^*, \lambda^*)$ for $\lambda \in \mathbb{R}^m_+$.

Perturbation and Sensitivity Analysis

In this subsection, we discuss how changes in parameters affect the solution of the primal problem. This is called sensitivity analysis. How sensitive are the minimizer and its value to 'small' perturbations in the data of the problem? If parameters change, sensitivity analysis often helps to avoid having to solve a problem again.

For $u \in \mathbb{R}^m$ we consider the 'perturbed' optimization problem

$$(P_u)\qquad \begin{cases} f(x) \to \min \\ x \in \mathcal{F}_u \end{cases}$$

with the feasible region

$$\mathcal{F}_u := \{x \in C \mid g(x) \le u\}.$$

The vector $u$ is called the 'perturbation vector'. Obviously we have $(P_0) = (P)$. If a variable $u_i$ is positive, this means that we 'relax' the $i$-th constraint $g_i(x) \le 0$ to $g_i(x) \le u_i$; if $u_i$ is negative we tighten this constraint.

We define the perturbation or sensitivity function $p : \mathbb{R}^m \to \mathbb{R} \cup \{-\infty, \infty\}$ associated with the problem $(P)$ by

$$p(u) := \inf\{f(x) \mid x \in \mathcal{F}_u\} = \inf\{f(x) \mid x \in C,\ g(x) \le u\}$$

for $u \in \mathbb{R}^m$ (with $\inf \emptyset := \infty$). Obviously we have $p(0) = p^*$. The function $p$ gives the minimal value of the problem $(P_u)$ as a function of 'perturbations' of the right-hand side of the constraint $g(x) \le 0$.

Its effective domain is given by the set

$$\operatorname{dom}(p) := \{u \in \mathbb{R}^m \mid p(u) < \infty\} = \{u \in \mathbb{R}^m \mid \exists\, x \in C \ \ g(x) \le u\}.$$

Obviously the function $p$ is antitone, that is, order-reversing: If the vector $u$ increases, the feasible region $\mathcal{F}_u$ increases and so $p$ decreases (in the weak sense).

Remark If the original problem $(P)$ is convex, then the effective domain $\operatorname{dom}(p)$ is convex and the perturbation function $p$ is convex on it. Since $-\infty$ is possible as a value for $p$ on $\operatorname{dom}(p)$, convexity here means the convexity of the epigraph³

$$\operatorname{epi}(p) := \{(u, z) \in \mathbb{R}^m \times \mathbb{R} \mid u \in \operatorname{dom}(p),\ p(u) \le z\}.$$

³ The prefix 'epi' means 'above'. A real-valued function $p$ is convex if and only if the set $\operatorname{epi}(p)$ is convex (cf. exercise 8).
Proof: The convexity of $\operatorname{dom}(p)$ and $p$ is given immediately by the convexity of the set $C$ and the convexity of the function $g$: Let $u, v \in \operatorname{dom}(p)$ and $\varrho \in (0, 1)$. For $\alpha, \beta \in \mathbb{R}$ with $p(u) < \alpha$ and $p(v) < \beta$ there exist vectors $x, y \in C$ with $g(x) \le u$, $g(y) \le v$ and $f(x) < \alpha$, $f(y) < \beta$. The vector $\tilde{x} := \varrho x + (1 - \varrho)y$ belongs to $C$ with

$$g(\tilde{x}) \le \varrho\, g(x) + (1 - \varrho)\, g(y) \le \varrho u + (1 - \varrho) v =: \tilde{u}$$

and

$$f(\tilde{x}) \le \varrho f(x) + (1 - \varrho) f(y) < \varrho \alpha + (1 - \varrho) \beta.$$

This shows $p(\tilde{u}) \le f(\tilde{x}) < \varrho \alpha + (1 - \varrho) \beta$, hence, $p(\tilde{u}) \le \varrho\, p(u) + (1 - \varrho)\, p(v)$.

Remark We assume that strong duality holds and that the dual optimal value is attained. Let $\lambda^*$ be a maximizer to the dual problem $(D)$. Then we have

$$p(u) \ge p(0) - \langle \lambda^*, u \rangle$$

for all $u \in \mathbb{R}^m$.

Proof: For a given $u \in \mathbb{R}^m$ and any feasible point $x$ to the problem $(P_u)$, that is, $x \in \mathcal{F}_u$, we have

$$p(0) = p^* = d^* = \varphi(\lambda^*) \le f(x) + \langle \lambda^*, g(x) \rangle \le f(x) + \langle \lambda^*, u \rangle.$$

From this follows $p(0) \le p(u) + \langle \lambda^*, u \rangle$.

This inequality gives a lower bound on the optimal value of the perturbed problem $(P_u)$. The hyperplane given by $z = p(0) - \langle \lambda^*, u \rangle$ 'supports' the epigraph of the function $p$ at the point $(0, p(0))$. For a problem with only one inequality constraint the inequality shows that the affinely linear function $u \mapsto p^* - \lambda^* u$ $(u \in \mathbb{R})$ lies below the graph of $p$ and is tangent to it at the point $(0, p^*)$.

[Figure: the graph of $p$ over $u$ together with the supporting line $u \mapsto p^* - \lambda^* u$ through the point $(0, p^*)$, $p^* = p(0)$.]

We get the following rough sensitivity results:
If $\lambda_i^*$ is 'small', relaxing the $i$-th constraint causes a small decrease of the optimal value $p(u)$. Conversely, if $\lambda_i^*$ is 'large', tightening the $i$-th constraint causes a large increase of the optimal value $p(u)$.

Under the assumptions of the foregoing remark we have:

Remark If the function $p$ is differentiable⁴ at the point $u = 0$, then the maximizer $\lambda^*$ of the dual problem $(D)$ is related to the gradient of $p$ at $u = 0$:

$$\nabla p(0) = -\lambda^*$$

Here the Lagrange multipliers $\lambda_i^*$ are exactly the local sensitivities of the function $p$ with respect to perturbations of the constraints.

Proof: The differentiability at the point $u = 0$ gives:

$$p(u) = p(0) + \langle \nabla p(0), u \rangle + r(u)\, \|u\| \quad\text{with } r(u) \to 0 \text{ for } \mathbb{R}^m \ni u \to 0.$$

Together with $p(u) \ge p(0) - \langle \lambda^*, u \rangle$ we obtain $-\langle \nabla p(0) + \lambda^*, u \rangle \le r(u)\, \|u\|$. We set $u := -t\,[\nabla p(0) + \lambda^*]$ for $t > 0$ and get

$$t\, \|\nabla p(0) + \lambda^*\|^2 \le t\, \|\nabla p(0) + \lambda^*\|\ r\big({-t}\,[\nabla p(0) + \lambda^*]\big).$$

This shows: $\|\nabla p(0) + \lambda^*\| \le r\big({-t}\,[\nabla p(0) + \lambda^*]\big)$. Passage to the limit $t \to 0$ yields $\nabla p(0) + \lambda^* = 0$.

⁴ Subgradients generalize the concept of gradient and are helpful if the function $p$ is not differentiable at the point $u = 0$. We do not pursue this aspect and its relation to the concept of stability.

For the rest of this section we consider only the special case of a convex optimization problem, where the functions $f$ and $g$ are convex and continuously differentiable and the set $C$ is convex.

Economic Interpretation of Duality

The equation

$$\nabla p(0) = -\lambda^* \quad\text{or}\quad -\frac{\partial p}{\partial u_i}(0) = \lambda_i^* \ \text{ for } i = 1, \dots, m$$

leads to the following interpretation of dual variables in economics: The components $\lambda_i^*$ of the Lagrange multiplier $\lambda^*$ are often called shadow prices or attribute costs. They represent the 'marginal' rate of change of the optimal value $p^* = v(P) = \inf(P)$ of the primal problem $(P)$ with respect to changes in the constraints. They describe the incremental change in the value $p^*$ per unit increase in the right-hand side of the constraint. If, for example, the variable $x \in \mathbb{R}^n$ determines how an enterprise 'operates', the objective function $f$ describes the cost for some production process, and the constraint $g_i(x) \le 0$ gives a bound on a special resource, for example labor, material or space, then $p(u)$ shows us how much the costs (and with it the profit) change when the resource changes. $\lambda_i^*$ determines approximately how much lower the costs of the enterprise would be for a 'small' increase in availability of the $i$-th resource. Under these circumstances $\lambda_i^*$ has the dimension of dollars (or euros) per unit of capacity of the $i$-th resource and can therefore be regarded as a value per unit resource. So we get the maximum price we should pay for an additional unit of $u_i$.

Strong Duality

Below we will see: If the Slater constraint qualification holds and the original problem is convex, then we have strong duality, that is, $p^* = d^*$. We see once more: The class of convex programs is a class of 'well-behaved' optimization problems. Convex optimization is relatively 'easy'. We need a slightly different separation theorem (compared to proposition 2.1.1). We quote it without proof (for a proof see, for example: [Fra], p. 49 f):
Separation Theorem
Given two disjoint nonempty convex sets $V$ and $W$ in $\mathbb{R}^k$, there exist a real $\alpha$ and a vector $p \in \mathbb{R}^k \setminus \{0\}$ with

$$\langle p, v \rangle \ge \alpha \ \text{ for all } v \in V \quad\text{and}\quad \langle p, w \rangle \le \alpha \ \text{ for all } w \in W.$$

In other words: The hyperplane $\{x \in \mathbb{R}^k \mid \langle p, x \rangle = \alpha\}$ separates $V$ and $W$. The example

$$V := \{x = (x_1, x_2)^T \in \mathbb{R}^2 \mid x_1 \le 0\} \quad\text{and}\quad W := \{x = (x_1, x_2)^T \in \mathbb{R}^2 \mid x_1 > 0,\ x_1 x_2 \ge 1\}$$

(with separating 'line' $x_1 = 0$) shows that the sets cannot be 'strictly' separated.

Strong Duality Theorem
Suppose that the Slater constraint qualification

$$\exists\, \tilde{x} \in \mathcal{F} \quad g_i(\tilde{x}) < 0 \ \text{ for all } i \in I_1$$

holds for the convex problem $(P)$. Then we have strong duality, and the value of the dual problem $(D)$ is attained if $p^* > -\infty$.

In order to simplify the proof, we verify the theorem under the slightly stronger condition

$$\exists\, \tilde{x} \in \mathcal{F} \quad g_i(\tilde{x}) < 0 \ \text{ for all } i \in I.$$

For an extension of the proof to the (refined) Slater constraint qualification see for example [Rock], p. 277.

Proof: There exists a feasible point, hence we have $p^* < \infty$. If $p^* = -\infty$, then we get $d^* = -\infty$ by the weak duality theorem. Hence, we can suppose that $p^*$ is finite. The two sets

$$V := \{(v, w) \in \mathbb{R}^m \times \mathbb{R} \mid \exists\, x \in C \ \ g(x) \le v \text{ and } f(x) \le w\}$$

$$W := \{(0, w) \in \mathbb{R}^m \times \mathbb{R} \mid w < p^*\}$$

are nonempty and convex. By the definition of $p^*$ they are disjoint: Let $(v, w)$ be in $W \cap V$: $(v, w) \in W$ shows $v = 0$ and $w < p^*$. For $(v, w) \in V$ there exists an $x \in C$ with $g(x) \le v = 0$ and $f(x) \le w < p^*$, which is a contradiction to the definition of $p^*$. The quoted separation theorem gives the existence of a pair $(\lambda, \mu) \in \mathbb{R}^m \times \mathbb{R} \setminus \{(0, 0)\}$ and an $\alpha \in \mathbb{R}$ such that:

$$\langle \lambda, v \rangle + \mu w \ge \alpha \ \text{ for all } (v, w) \in V \tag{8}$$

and

$$\langle \lambda, v \rangle + \mu w \le \alpha \ \text{ for all } (v, w) \in W \tag{9}$$

From (8) we get $\lambda \ge 0$ and $\mu \ge 0$. (9) means that $\mu w \le \alpha$ for all $w < p^*$, hence $\mu p^* \le \alpha$. (8) and the definition of $V$ give for any $x \in C$:

$$\langle \lambda, g(x) \rangle + \mu f(x) \ge \alpha \ge \mu p^* \tag{10}$$

For $\mu = 0$ we get from (10) that $\langle \lambda, g(x) \rangle \ge 0$ for any $x \in C$, in particular $\langle \lambda, g(\tilde{x}) \rangle \ge 0$ for a point $\tilde{x} \in C$ with $g_i(\tilde{x}) < 0$ for all $i \in I$. This shows $\lambda = 0$, arriving at a contradiction to $(\lambda, \mu) \ne (0, 0)$. So we have $\mu > 0$: We divide the inequality (10) by $\mu$ and obtain $L(x, \lambda/\mu) \ge p^*$ for any $x \in C$. From this follows $\varphi(\lambda/\mu) \ge p^*$. By the weak duality theorem we have $\varphi(\lambda/\mu) \le d^* \le p^*$. This shows strong duality and that the dual value is attained.

Strong duality can be obtained for some special nonconvex problems too: It holds for any optimization problem with quadratic objective function and one quadratic inequality constraint, provided Slater's constraint qualification holds. See for example [Bo/Va], Appendix B.
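Example 12 is a convex problem that satisfies the Slater condition (for instance $g(4, 4) = -2 < 0$), so it can serve to illustrate strong duality together with the sensitivity relation $\nabla p(0) = -\lambda^*$. The sketch below assumes the closed form $p(u) = (6 - u)^2/2$ for $u \le 6$ and $p(u) = 0$ for $u > 6$, which follows from the lower boundary of the region $B$ computed in that example, and the dual maximizer $\lambda^* = 6$. The function names are illustrative.

```python
def p(u):
    """Perturbation function of Example 12:
    inf of x1^2 + x2^2 over x >= 0 with 6 - x1 - x2 <= u."""
    r = 6.0 - u
    return 0.5 * r * r if r > 0.0 else 0.0

LAM_STAR = 6.0  # dual maximizer found in Example 12

def slope_at_zero(h=1e-6):
    """Central difference approximation of p'(0)."""
    return (p(h) - p(-h)) / (2.0 * h)

def lower_bound_holds(u, tol=1e-12):
    """Check the global bound p(u) >= p(0) - lam* * u."""
    return p(u) >= p(0.0) - LAM_STAR * u - tol

if __name__ == "__main__":
    print("p(0) =", p(0.0), "(equals min(P) = max(D) = 18)")
    print("p'(0) ~", slope_at_zero(), "(expected -lam* = -6)")
    print("bound holds on [-5, 10]:",
          all(lower_bound_holds(k / 10.0) for k in range(-50, 101)))
```

In the economic reading, relaxing the constraint by one small unit lowers the optimal cost by about $\lambda^* = 6$ units, which is the shadow price of the resource.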
Exercises
1. Orthogonal Distance Line Fitting
Consider the following approximation problem arising from quality control in manufacturing using coordinate measurement techniques [Ga/Hr]. Let

$$M := \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$$

be a set of $m \in \mathbb{N}$ given points in $\mathbb{R}^2$. The task is to find a line

$$L(c, n_1, n_2) := \{(x, y) \in \mathbb{R}^2 \mid c + n_1 x + n_2 y = 0\}$$

in Hessian normal form with $n_1^2 + n_2^2 = 1$ which best approximates the point set $M$ such that the sum of squares of the distances of the points from the straight line becomes minimal. If we calculate $r_j := c + n_1 x_j + n_2 y_j$ for a point $(x_j, y_j)$, then $|r_j|$ is its distance to $L$.
a) Formulate the above problem as a constrained optimization problem.
b) Show the existence of a solution and determine the optimal parameters $c$, $n_1$ and $n_2$ by means of the Lagrange multiplier rule. Explicate when and in which sense these parameters are uniquely defined.
c) Find a (minimal) example which consists of three points and has infinitely many optimizers.
d) Solve the optimization problem with Matlab® or Maple® and test your program with the following data (cf. [Ga/Hr]):

x_j | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0
y_j | 0.2 | 1.0 | 2.6 | 3.6 | 4.9 | 5.3 | 6.5 | 7.8 | 8.0 | 9.0
2. a) Solve the optimization problem

$$f(x_1, x_2) := 2x_1 + 3x_2 \to \max, \qquad \sqrt{x_1} + \sqrt{x_2} = 5$$

using Lagrange multipliers (cf. [Br/Ti]).
b) Visualize the contour lines of $f$ as well as the set of feasible points, and mark the solution. Explain the result!

3. Let $n \in \mathbb{N}$ and $A = (a_{\nu,\mu})$ be a real symmetric $(n, n)$-matrix with the submatrices

$$A_k := \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1k} \\ a_{21} & a_{22} & \dots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k2} & \dots & a_{kk} \end{pmatrix} \qquad \text{for } k \in \{1, \dots, n\}.$$

Then the following statements are equivalent:
a) $A$ is positive definite.
b) $\exists\, \delta > 0 \ \ \forall\, x \in \mathbb{R}^n \quad x^T A x \ge \delta \|x\|^2$
c) $\forall\, k \in \{1, \dots, n\} \quad \det A_k > 0$

4. Consider a function $f : \mathbb{R}^n \to \mathbb{R}$.
a) If $f$ is differentiable, then the following holds: $f$ convex $\iff \forall\, x, y \in \mathbb{R}^n \quad f(y) - f(x) \ge f'(x)(y - x)$
b) If $f$ is twice continuously differentiable, then: $f$ convex $\iff \forall\, x \in \mathbb{R}^n \quad \nabla^2 f(x)$ positive semidefinite
c) What do the corresponding characterizations of strictly convex functions look like?

5. In the "colloquial speech" of mathematicians one can sometimes hear the following statement: "Strictly convex functions always have exactly one minimizer." However, is it really right to use this term so carelessly? Consider two typical representatives $f_i : \mathbb{R}^2 \to \mathbb{R}$, $i \in \{1, 2\}$:

$$f_1(x, y) = x^2 + y^2, \qquad f_2(x, y) = x^2 - y^2$$

Visualize these functions and plot their contour lines. Which function is convex? Show this analytically as well. Is the above statement correct? Let $D_j \subset \mathbb{R}^2$ for $j \in \{1, 2, 3, 4, 5\}$ be a region in $\mathbb{R}^2$ with

$$D_1 := \{(x, y) \in \mathbb{R}^2 : x^2 + y^2 \le 0.04\}$$
$$D_2 := \{(x, y) \in \mathbb{R}^2 : (x - 0.55)^2 + (y - 0.7)^2 \le 0.04\}$$
$$D_3 := \{(x, y) \in \mathbb{R}^2 : (x - 0.55)^2 + y^2 \le 0.04\}.$$

The outer boundary of the regions $D_4$ and $D_5$ is defined by

$$\begin{aligned} x &= 0.5\,(0.5 + 0.2 \cos(6\vartheta)) \cos\vartheta + x_c \\ y &= 0.5\,(0.5 + 0.2 \cos(6\vartheta)) \sin\vartheta + y_c \end{aligned} \qquad \vartheta \in [0, 2\pi),$$

where $(x_c, y_c) = (0, 0)$ for $D_4$ and $(x_c, y_c) = (0, -0.7)$ for $D_5$. If we now restrict the above functions $f_i$ to $D_j$ ($i \in \{1, 2\}$, $j \in \{1, 2, 3, 4, 5\}$), does the statement about the uniqueness of the minimizers still hold? Find all the minimal points, where possible! Where do they lie? Which role does the convexity of the region and the function play?

6. Show that a function $f : \mathbb{R}^n \to \mathbb{R}$ is affinely linear if and only if it is convex as well as concave.
Optimality Conditions
7. Let X be a real vector space. For $m \in \mathbb{N}$ and $x_1, \dots, x_m \in X$ let
      $$\mathrm{conv}(x_1, \dots, x_m) := \Bigl\{ \sum_{i=1}^m \lambda_i x_i \ \Big|\ \lambda_1, \dots, \lambda_m > 0,\ \sum_{i=1}^m \lambda_i = 1 \Bigr\}.$$
   Verify that the following assertions hold for a nonempty subset $A \subset X$:
   a) A convex $\iff \forall\, m \in \mathbb{N}\ \forall\, a_1, \dots, a_m \in A: \ \mathrm{conv}(a_1, \dots, a_m) \subset A$
   b) Let A be convex and $f : A \to \mathbb{R}$ a convex function. For $x_1, x_2, \dots, x_m \in A$ and $x \in \mathrm{conv}(x_1, \dots, x_m)$ in a representation as given above, it then holds that
      $$f\Bigl(\sum_{i=1}^m \lambda_i x_i\Bigr) \le \sum_{i=1}^m \lambda_i f(x_i).$$
   c) The intersection of an arbitrary number of convex sets is convex. Consequently there exists the smallest convex superset conv(A) of A, called the convex hull of A.
   d) It holds that
      $$\mathrm{conv}(A) = \bigcup_{\substack{m \in \mathbb{N} \\ a_1, \dots, a_m \in A}} \mathrm{conv}(a_1, \dots, a_m).$$
   e) Carathéodory's lemma: For $X = \mathbb{R}^n$ it holds that
      $$\mathrm{conv}(A) = \bigcup_{\substack{m \le n+1 \\ a_1, \dots, a_m \in A}} \mathrm{conv}(a_1, \dots, a_m).$$
   f) In which way does this lemma have to be modified for $X = \mathbb{C}^n$?
   g) For $X \in \{\mathbb{R}^n, \mathbb{C}^n\}$ and A compact the convex hull conv(A) is also compact.
8. For a nonempty subset $D \subset \mathbb{R}^n$ and a function $f : D \to \mathbb{R}$ let
      $\mathrm{epi}(f) := \{(x, y) \in D \times \mathbb{R} : f(x) \le y\}$
   be the epigraph of f. Show that for a convex set D we have
      f convex $\iff$ epi(f) convex.
9. Prove part 1) of lemma 2.2.3 and additionally show the following assertions for $\mathcal{F}$ convex and $x_0 \in \mathcal{F}$:
   a) $C_{fd}(x_0) = \{\mu(x - x_0) \mid \mu > 0,\ x \in \mathcal{F}\}$
   b) $C_t(x_0) = \overline{C_{fd}(x_0)}$
   c) $C_t(x_0)$ is convex.
10. Prove for the tangent cones of the following sets
      $\mathcal{F}_1 := \{x \in \mathbb{R}^n \mid \|x\|_2 = 1\}$, $\quad \mathcal{F}_2 := \{x \in \mathbb{R}^n \mid \|x\|_2 \le 1\}$, $\quad \mathcal{F}_3 := \{x \in \mathbb{R}^2 \mid -x_1^3 + x_2 \le 0,\ -x_2 \le 0\}$:
   a) For $x_0 \in \mathcal{F}_1$ it holds that $C_t(x_0) = \{d \in \mathbb{R}^n \mid \langle d, x_0 \rangle = 0\}$.
   b) For $x_0 \in \mathcal{F}_2$ we have
      $$C_t(x_0) = \begin{cases} \mathbb{R}^n, & \|x_0\|_2 < 1, \\ \{d \in \mathbb{R}^n \mid \langle d, x_0 \rangle \le 0\}, & \|x_0\|_2 = 1. \end{cases}$$
   c) For $x_0 := (0, 0)^T \in \mathcal{F}_3$ it holds that $C_t(x_0) = \{d \in \mathbb{R}^2 \mid d_1 \ge 0,\ d_2 = 0\}$.
11. With $f(x) := x_1^2 + x_2^2$ for $x \in \mathbb{R}^2$ consider
      $$(P) \quad \begin{cases} f(x) \longrightarrow \min \\ -x_2 \le 0 \\ x_1^3 - x_2 \le 0 \\ x_1^3 (x_2 - x_1^3) \le 0 \end{cases}$$
   and determine the linearizing cone, the tangent cone and the respective dual cones at the (strict global) minimal point $x_0 := (0, 0)^T$.
12. Let $x_0$ be a feasible point of the optimization problem (P). According to page 56 it holds that
      (LICQ) $\Longrightarrow$ (AHUCQ) $\Longrightarrow$ (ACQ).
   Show by means of the following examples (with n = m = 2 and p = 0) that these two implications do not hold in the other direction:
   a) $f(x) := x_1^2 + (x_2 + 1)^2$, $g_1(x) := -x_1^3 - x_2$, $g_2(x) := -x_2$, $x_0 := (0, 0)^T$
   b) $f(x) := x_1^2 + (x_2 + 1)^2$, $g_1(x) := x_2 - x_1^2$, $g_2(x) := -x_2$, $x_0 := (0, 0)^T$
13. Let the following optimization problem be given:
      $f(x) \longrightarrow \min$, $x \in \mathbb{R}^2$
      $g_1(x_1, x_2) := 3(x_1 - 1)^3 - 2x_2 + 2 \le 0$
      $g_2(x_1, x_2) := (x_1 - 1)^3 + 2x_2 - 2 \le 0$
      $g_3(x_1, x_2) := -x_1 \le 0$
      $g_4(x_1, x_2) := -x_2 \le 0$
   a) Plot the feasible region.
   b) Solve the optimization problem for the following objective functions:
      (i) $f(x_1, x_2) := (x_1 - 1)^2 + (x_2 - \tfrac{3}{2})^2$
      (ii) $f(x_1, x_2) := (x_1 - 1)^2 + (x_2 - 4)^2$
           Regard the objective function on the 'upper boundary' of $\mathcal{F}$.
      (iii) $f(x_1, x_2) := (x_1 - \tfrac{5}{4})^2 + (x_2 - \tfrac{5}{4})^2$
      Do the KKT conditions hold at the optimal point?
      Hint: In addition illustrate these problems graphically.
14. Optimal Location of a Rescue Helicopter (see example 4 of chapter 1)
   a) Formulate the minimax problem
      $$d_{\max}(x, y) := \max_{1 \le j \le m} \sqrt{(x - x_j)^2 + (y - y_j)^2}$$
      as a quadratic optimization problem
      f(x, y, ) → min,  g_j(x, y, ) ≤ 0  (j = 1, ..., m)
      (with f quadratic, g_j linear). You can find some hints on page 13.
   b) Visualize the function $d_{\max}$ by plotting its contour lines for the points (0, 0), (5, −1), (4, 6), (1, 3).
   c) Give the corresponding Lagrangian. Solve the problem by means of the Karush–Kuhn–Tucker conditions.
15. Determine a triangle with minimal area containing two disjoint disks with radius 1. Without loss of generality let (0, 0), $(x_1, 0)$ and $(x_2, x_3)$ with $x_1, x_3 \ge 0$ be the vertices of the triangle; $(x_4, x_5)$ and $(x_6, x_7)$ denote the centers of the disks.
   a) Formulate this problem as a minimization problem in terms of seven variables and nine constraints (see [Pow 1]).
   b) $x^* = \bigl(4 + 2\sqrt{2},\ 2 + \sqrt{2},\ 2 + \sqrt{2},\ 1 + \sqrt{2},\ 1,\ 3 + \sqrt{2},\ 1\bigr)^T$ is a solution of this problem; calculate the corresponding Lagrange multipliers $\lambda^*$, such that the Karush–Kuhn–Tucker conditions are fulfilled.
   c) Check the sufficient second-order optimality conditions for $(x^*, \lambda^*)$.
16. Find the point $x \in \mathbb{R}^2$ that lies closest to the point p := (2, 3) under the constraints $g_1(x) := x_1 + x_2 \le 0$ and $g_2(x) := x_1^2 - 4 \le 0$.
   a) Illustrate the problem graphically.
   b) Verify that the problem is convex and fulfills (SCQ).
   c) Determine the KKT points by differentiating between three cases: no constraint is active, exactly the first one is active, exactly the second one is active.
   d) Now conclude with theorem 2.2.8.
The problem can of course be solved elementarily. We, however, want to practice the theory with simple examples.
17. In a small power network the power r runs through two different channels. Let $x_i$ be the power running through channel i for i = 1, 2. The total loss is given by the function $f : \mathbb{R}^2 \to \mathbb{R}$ with
      $$f(x_1, x_2) := x_1 + \tfrac{1}{2}\, x_1^2 + x_2^2 .$$
   Determine the current flow such that the total loss stays minimal. The constraints are given by $x_1 + x_2 = r$, $x_1 \ge 0$, $x_2 \ge 0$.
18. Verify in the linear case that the Lagrange dual problem of (D) (cf. p. 71), transformed into standard form, is again the primal problem.
19. Consider the optimization problem (cf. [Erik]):
      $$\begin{cases} f(x) := \sum_{i=1}^n x_i \log\bigl(\tfrac{x_i}{p_i}\bigr) \longrightarrow \min, \quad x \in \mathbb{R}^n \\ A^T x = b, \quad x \ge 0 \end{cases}$$
   where $A \in \mathbb{R}^{n \times m}$, $b \in \mathbb{R}^m$ and $p_1, p_2, \dots, p_n \in \mathbb{R}_{++}$ are given. Let further 0 ln 0 be defined as 0. Prove:
   a) The dual problem is given by
      $$\varphi(\lambda) := b^T \lambda - \sum_{i=1}^n p_i \exp(e_i^T A \lambda - 1) \longrightarrow \max, \quad \lambda \in \mathbb{R}^m .$$
   b) $\nabla \varphi(\lambda) = b - A^T x$ with $x_i = p_i \exp(e_i^T A \lambda - 1)$.
   c) $\nabla^2 \varphi(\lambda) = -A^T X A$, where X = Diag(x) with x from b).
20. Support Vector Machines (cf. [Cr/Sh])
   Support vector machines have been extensively used in machine learning and data mining applications such as classification and regression, text categorization as well as medical applications, for example breast cancer diagnosis.
   Let two classes of patterns be given, i.e., samples of observable characteristics which are represented by points $x_i$ in $\mathbb{R}^n$. The patterns are given in the form $(x_i, y_i)$, i = 1, ..., m, with $y_i \in \{1, -1\}$. $y_i = 1$ means that $x_i$ belongs to class 1; otherwise $x_i$ belongs to class 2. In the simplest case we are looking for a separating hyperplane described by $\langle w, x \rangle + \beta = 0$ with $\langle w, x_i \rangle + \beta \ge 1$ if $y_i = 1$ and $\langle w, x_i \rangle + \beta \le -1$ if $y_i = -1$. These conditions can be written as $y_i (\langle w, x_i \rangle + \beta) \ge 1$ (i = 1, ..., m). We aim to maximize the 'margin' (distance) $2 / \sqrt{\langle w, w \rangle}$ between the two hyperplanes $\langle w, x \rangle + \beta = 1$ and $\langle w, x \rangle + \beta = -1$. This gives a linearly constrained convex quadratic minimization problem
      $$\begin{cases} \tfrac{1}{2} \langle w, w \rangle \longrightarrow \min \\ y_i (\langle w, x_i \rangle + \beta) \ge 1 \quad (i = 1, \dots, m). \end{cases} \qquad (11)$$
   [Figure: two panels, "Separable Case" and "Non-Separable Case"]
   In the case that the two classes are not linearly separable (by a hyperplane), we introduce nonnegative penalties $\xi_i$ for the 'misclassification' of $x_i$ and minimize both $\langle w, w \rangle$ and $\sum_{i=1}^m \xi_i$. We solve this optimization problem in the following way with soft margins
      $$(P) \quad \begin{cases} \tfrac{1}{2} \langle w, w \rangle + C \sum_{i=1}^m \xi_i \longrightarrow \min \\ y_i (\langle w, x_i \rangle + \beta) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (i = 1, \dots, m). \end{cases} \qquad (12)$$
   Here, C is a weight parameter of the penalty term.
   a) Introducing the dual variables $\lambda \in \mathbb{R}^m_+$, derive the Lagrange dual problem to (P):
      $$(D) \quad \begin{cases} -\tfrac{1}{2} \sum_{i,j=1}^m y_i y_j \langle x_i, x_j \rangle \lambda_i \lambda_j + \sum_{i=1}^m \lambda_i \longrightarrow \max \\ \sum_{i=1}^m y_i \lambda_i = 0, \quad 0 \le \lambda_i \le C \quad (i = 1, \dots, m) \end{cases} \qquad (13)$$
      Compute the coefficients $w \in \mathbb{R}^n$ and $\beta \in \mathbb{R}$ of the separating hyperplane by means of the dual solution $\lambda$ and show
      $$w = \sum_{j=1}^m y_j \lambda_j x_j, \qquad \beta = y_j - \langle w, x_j \rangle \ \text{ if } 0 < \lambda_j < C.$$
      Vectors $x_j$ with $\lambda_j > 0$ are called support vectors.
   b) Calculate a support vector 'machine' for breast cancer diagnosis using the file wisconsin-breast-cancer.data from the Breast Cancer Wisconsin Data Set, cf. http://archive.ics.uci.edu/ml/ . The file wisconsin-breast-cancer.names gives information on the data set: It contains 699 instances consisting of 11 attributes. The first attribute gives the sample code number. Attributes 2 through 10 describe the medical status and give a 9-dimensional vector $x_i$. The last attribute is the class attribute ("2" for benign, "4" for malignant). Sixteen samples have a missing attribute, denoted by "?". Remove these samples from the data set. Now split the data into two portions: The first 120 instances are used as training data. Take software of your choice to solve the quadratic problem (P), using the penalty parameter C = 1000. The remaining instances are used to evaluate the 'performance' of the classifier or decision function given by $f(x) := \mathrm{sgn}\bigl(\langle w, x \rangle + \beta\bigr)$.
3 Unconstrained Optimization Problems
3.0 Logistic Regression
3.1 Elementary Search and Localization Methods
    The Nelder and Mead Polytope Method
    The Shor Ellipsoid Method
3.2 Descent Methods with Line Search
    Coordinatewise Descent Methods
    Gradient Method
    Kantorovich's Inequality
    Requirements on the Step Size Selection
3.3 Trust Region Methods
    Least Squares Problems
3.4 Conjugate Gradient Method
    Generation of A-Conjugate Directions
    Rate of Convergence
    The CG-Method in the Nonquadratic Case
3.5 Quasi-Newton Methods
    Least Change Principle
    More General Quasi-Newton Methods
    Quasi-Newton Methods in the Nonquadratic Case
Exercises

W. Forst and D. Hoffmann, Optimization—Theory and Practice, Springer Undergraduate Texts in Mathematics and Technology, DOI 10.1007/978-0-387-78977-4_3, © Springer Science+Business Media, LLC 2010

Unconstrained optimization methods seek a local minimum (or a local maximum) in the absence of restrictions, that is,
      $f(x) \longrightarrow \min \quad (x \in D)$
for a real-valued function $f : D \to \mathbb{R}$ defined on a nonempty subset D of $\mathbb{R}^n$ for a given $n \in \mathbb{N}$. Unconstrained optimization involves the theoretical study of optimality criteria and above all algorithmic methods for a wide variety of problems. In section 2.0 we have repeated, as essential basics, the well-known (first- and second-order) optimality conditions for smooth real-valued functions. Often constraints complicate a given task but in some cases they simplify it. Even though most optimization problems in 'real life' have restrictions to be satisfied, the study
of unconstrained problems is useful for two reasons: Firstly, they occur directly in some applications, so they are important in their own right. Secondly, unconstrained problems often originate as a result of transformations of constrained optimization problems. Some methods, for example, solve a general problem by converting it into a sequence of unconstrained problems. In section 3.1 elementary search and localization methods like the Nelder–Mead polytope method and Shor’s ellipsoid method are treated. The first method is widely used in applications, as the effort is small. The results, however, are in general rather poor. Shor’s ellipsoid method has attracted great attention, mainly because of its applications to linear optimization in the context of interior-point methods. Often methods proceed by finding a suitable direction and then minimizing along this direction (“line search”). This is treated in section 3.2. Trust region methods, which we are going to cover in section 3.3, start with a given step size or an upper bound for it and then determine a suitable search direction. In section 3.4 the concept of conjugate directions is introduced. If the objective function is quadratic, the resulting method terminates after a finite number of steps. The extension to any differentiable function f : Rn −→ R goes back to Fletcher/Reeves (1964). Quasi-Newton methods in section 3.5 are based on the “least change secant update principle” by C. G. Broyden and are thus well motivated. Due to the multitude of methods presented, this chapter might at times read a bit like a ‘cookbook’. To enliven it, we will illustrate the most important methods with insightful examples, whose results will be given in tabular form. In order to achieve comparability, we will choose one framework example which picks up on our considerations from exercise 20 in chapter 2 (classification, support vector machines). 
To facilitate understanding, we content ourselves, at first, with the discussion of an example with only twelve given points for the different methods. At the end of the chapter we will consider a more serious problem, breast cancer diagnosis, and compile important results of the different methods in a table for comparison.
3.0 Logistic Regression

As the framework example we consider a specific method for binary classification, called logistic regression: Besides support vector machines, this method arises in many applications, for example in medicine, natural language processing and supervised learning (see [Bo/Va]). The aim is, given a 'training set', to get a function that is a 'good' classifier. In our discussion we follow the notation of [LWK], in which the interested reader may also find supplementary considerations. We consider, in a general form, the logistic regression function given by
      $$f(w, \beta) := \sum_{\mu=1}^m \log\Bigl(1 + \exp\bigl(-y_\mu (\langle w, x_\mu \rangle + \beta)\bigr)\Bigr).$$
Given m training instances $x_1, \dots, x_m \in \mathbb{R}^n$ and 'labels' $y_1, \dots, y_m \in \{-1, 1\}$, one 'estimates' $(w, \beta) \in \mathbb{R}^n \times \mathbb{R}$ by minimizing $f(w, \beta)$.
A professional will recognize this immediately as a log-likelihood model since the probability is the product of the individual probabilities
      $$P(y_\mu \mid x_\mu; w, \beta) := \frac{1}{1 + \exp\bigl(-y_\mu (\langle w, x_\mu \rangle + \beta)\bigr)}$$
and therefore the logarithm of the probability is nothing else but the sum of the logarithms of the individual probabilities. This knowledge, however, is not necessary to understand the examples given here. These very brief remarks only serve to illustrate that the above is an important type of example. By taking the logarithm of the original function we obtain a convex problem (see below).

To simplify the calculation as well as the implementation, we transform
      $$\begin{pmatrix} x_\mu \\ 1 \end{pmatrix} \longrightarrow x_\mu \quad \text{and} \quad \begin{pmatrix} w \\ \beta \end{pmatrix} \longrightarrow w$$
and, in doing so, get (after relabeling the variables and inserting a regularization term $\lambda \cdot \tfrac{1}{2} \langle w, w \rangle$) the simple form of a regularized logistic regression function
      $$f(w) := \lambda \cdot \tfrac{1}{2} \langle w, w \rangle + \sum_{\mu=1}^m \log\bigl(1 + \exp(-y_\mu \langle w, x_\mu \rangle)\bigr),$$
where $\lambda \ge 0$ is a suitable parameter. The resulting optimization methods are iterative processes, generating a sequence $(w^{(k)})$ which is hoped to converge to a minimizer. For the gradient and the Hessian of f we get
      $$\nabla f(w) = \lambda w - X^T \bigl(y \mathbin{.*} (1 - h)\bigr) \quad \text{and} \quad \nabla^2 f(w) = \lambda I + X^T \mathrm{Diag}\bigl(h \mathbin{.*} (1 - h)\bigr) X$$
with the notations
      $$X := \begin{pmatrix} x_1^T \\ \vdots \\ x_m^T \end{pmatrix}, \quad y := \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} \quad \text{and} \quad h := 1 \mathbin{./} \bigl(1 + \exp(-y \mathbin{.*} (X w))\bigr),$$
where, following the Matlab® convention, the operations .∗ and ./ denote the coordinatewise multiplication and division, respectively. From that the convexity of f can be directly deduced. In the following we will each time at first consider m = 12 and the data:

      μ      1    2    3    4    5    6    7    8    9   10   11   12
      x_μ   1.0  2.0  2.5  3.0  3.5  4.0  4.0  4.5  5.0  6.0  6.5  7.0
            3.0  4.0  3.5  1.5  4.5  1.5  3.5  2.5  1.0  2.0  3.0  1.5
      y_μ    1    1    1    1    1   −1   −1   −1   −1   −1    1   −1
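The formulas of this section can be turned into a few lines of code. The following is a minimal sketch in plain Python (not the Matlab code used for the book's experiments); the names `POINTS`, `LABELS` and `logistic_objective` are illustrative choices, and the constant 1 appended to each data point reflects the transformation described above, so that the third component of w plays the role of β.

```python
import math

# The twelve training points x_mu and labels y_mu from the data listing of this section.
POINTS = [(1.0, 3.0), (2.0, 4.0), (2.5, 3.5), (3.0, 1.5), (3.5, 4.5), (4.0, 1.5),
          (4.0, 3.5), (4.5, 2.5), (5.0, 1.0), (6.0, 2.0), (6.5, 3.0), (7.0, 1.5)]
LABELS = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, -1]


def logistic_objective(w, lam=1.0):
    """Return (f(w), grad f(w)) for the regularized logistic regression function.

    w has three components; the last one is the offset beta (assumption:
    each data point is extended by a constant 1, as in the text).
    """
    value = 0.5 * lam * sum(wi * wi for wi in w)
    grad = [lam * wi for wi in w]
    for (x1, x2), y_mu in zip(POINTS, LABELS):
        x = (x1, x2, 1.0)
        z = y_mu * sum(wi * xi for wi, xi in zip(w, x))
        value += math.log1p(math.exp(-z))      # log(1 + exp(-y <w, x>))
        h = 1.0 / (1.0 + math.exp(-z))         # component of the vector h
        for i in range(3):
            grad[i] -= y_mu * (1.0 - h) * x[i]  # contribution to -X^T (y .* (1 - h))
    return value, grad


value0, grad0 = logistic_objective([0.0, 0.0, 0.0])
```

At w = (0, 0, 0) every summand equals log 2, so the value is 12 log 2 ≈ 8.31776617, and the gradient is (6, −3.75, 0).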
3.1 Elementary Search and Localization Methods
The Nelder and Mead Polytope Method (1965)

The Nelder and Mead method is a direct search method which works "moderately well". It is based on the evaluation of functions at the vertices of a polytope¹ which is modified iteratively by replacing 'old' vertices with better ones. The method is widely used in applications. Therefore one should be familiar with it. However, we will not need any of the theoretical considerations we have discussed so far!

General assertions about the convergence of this method are not known. Even for convex problems of two variables the method does not necessarily converge to a stationary point, and if it converges, then often very slowly. Since we do not need any derivatives, the method can also be applied to problems with a nondifferentiable objective function, or to problems where the computation of the derivatives is laborious. To sum up the method, we can say: Small effort but rather poor results.

A polytope is the convex hull of a finite number of vectors $x_1, \dots, x_m$ in $\mathbb{R}^n$. If these vectors are affinely independent, we call it a simplex with the vertices $x_1, \dots, x_m$. However, we will use the term "polytope" in this section as a synonym for "simplex", always assuming the affine independence of the 'generating' points.

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a given function, and that we have an n-dimensional polytope with the n + 1 vertices $x_1, x_2, \dots, x_{n+1}$. Let these be arranged in such a way that
      $f_1 \le f_2 \le \dots \le f_{n+1}$
holds for the function values $f_j := f(x_j)$. In each iteration step the currently worst vertex $x_{n+1}$ is replaced by a new vertex. For that, denote by
      $$x_c := \frac{1}{n} \sum_{j=1}^n x_j$$
the barycenter (centroid) of the best n vertices.

[Figure: polytope with the vertices x1, x2, x3, the barycenter xc and the direction xc − x3]

¹ We will not use the common term "simplex" in the name of this method, in order to avoid confusion with the Simplex Algorithm.
The polytope will be modified using the operations
      Reflection, Expansion, Contraction, Shrinking.
At the start of each iteration we try a reflection of the polytope and for that we compute
      $x_r := x_c + \alpha\,(x_c - x_{n+1})$, $\quad f_r := f(x_r)$
with a fixed constant $\alpha > 0$ (often $\alpha = 1$). $\alpha$ is called the reflection coefficient, and $x_r$ is called the reflected point.

[Figure: polytope x1, x2, x3 with barycenter xc, reflected point xr, extrapolated point xe and the contracted points xk of the cases 3) a) and 3) b)]

Then we consider the following three cases:
1) $f_1 \le f_r < f_n$: In this case we replace $x_{n+1}$ with $x_r$.
2) $f_r < f_1$: The 'direction of the reflection' seems 'worthy' of further exploration. Therefore we expand the polytope and, in order to do that, we compute
      $x_e := x_c + \beta\,(x_r - x_c)$, $\quad f_e := f(x_e)$
   with a fixed constant $\beta > 1$ (often $\beta = 2$). $\beta$ is called the expansion coefficient, and $x_e$ is called the extrapolated point.
   If $f_e < f_r$: Replace $x_{n+1}$ with $x_e$. Else: Replace $x_{n+1}$ with $x_r$.
3) $f_r \ge f_n$: The polytope seems to be 'too big'. Therefore we try a (partial) contraction of the polytope:
   a) If $f_r \ge f_{n+1}$, we compute
      $x_k := x_c + \gamma\,(x_{n+1} - x_c)$, $\quad f_k := f(x_k)$
      with a fixed constant $\gamma$, $0 < \gamma < 1$ (often $\gamma = \tfrac{1}{2}$). $\gamma$ is called the contraction coefficient, and $x_k$ is called the contracted point. If $f_k < f_{n+1}$, we replace $x_{n+1}$ with $x_k$; otherwise we shrink the whole polytope:
      $\tilde{x}_j := \tfrac{1}{2}\,(x_j + x_1) \quad (j = 1, \dots, n + 1)$

      [Figure: shrinking of the polytope x1, x2, x3 towards the best vertex x1]

   b) If $f_r < f_{n+1}$, we compute $x_k := x_c + \gamma\,(x_r - x_c)$. If $f_k \le f_r$, we replace $x_{n+1}$ with $x_k$; otherwise we shrink the polytope as well.

In each iteration step there is a rearrangement and we choose the notation such that $f_1 \le f_2 \le \dots \le f_{n+1}$ holds again.

Termination Condition:
      $$\frac{1}{n+1} \sum_{j=1}^{n+1} |f_j - f(x_c)|^2 < \varepsilon^2$$
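The reflection, expansion, contraction and shrinking steps described above can be sketched compactly in code. The following plain-Python function is an illustrative reading of that description, not the implementation behind Matlab's fminsearch; the function name `nelder_mead`, the iteration cap `max_iter` and the test function are assumptions made for this sketch.

```python
def nelder_mead(f, vertices, alpha=1.0, beta=2.0, gamma=0.5, eps=1e-10, max_iter=2000):
    """Polytope method on n + 1 affinely independent start vertices in R^n."""
    pts = [list(v) for v in vertices]
    n = len(pts) - 1
    for _ in range(max_iter):
        pts.sort(key=f)                               # f_1 <= ... <= f_{n+1}
        fv = [f(p) for p in pts]
        xc = [sum(p[i] for p in pts[:n]) / n for i in range(n)]  # barycenter of best n
        fc = f(xc)
        if sum((fj - fc) ** 2 for fj in fv) / (n + 1) < eps ** 2:
            break                                     # termination condition
        worst = pts[n]
        xr = [c + alpha * (c - w) for c, w in zip(xc, worst)]    # reflection
        fr = f(xr)
        if fv[0] <= fr < fv[n - 1]:                   # case 1
            pts[n] = xr
        elif fr < fv[0]:                              # case 2: expansion
            xe = [c + beta * (r - c) for c, r in zip(xc, xr)]
            pts[n] = xe if f(xe) < fr else xr
        else:                                         # case 3: contraction
            if fr >= fv[n]:                           # 3 a)
                xk = [c + gamma * (w - c) for c, w in zip(xc, worst)]
                accept = f(xk) < fv[n]
            else:                                     # 3 b)
                xk = [c + gamma * (r - c) for c, r in zip(xc, xr)]
                accept = f(xk) <= fr
            if accept:
                pts[n] = xk
            else:                                     # shrink towards the best vertex
                best = pts[0]
                pts = [[0.5 * (a + b) for a, b in zip(p, best)] for p in pts]
    pts.sort(key=f)
    return pts[0]


def shifted_bowl(x):
    # Hypothetical test function with minimizer (1, -2).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


x_min = nelder_mead(shifted_bowl, [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

For this simple test function the returned vertex lies close to (1, −2). As the text stresses, no such behaviour is guaranteed in general.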
In the termination condition, ε > 0 is a given tolerance; the condition states that the function is nearly constant at the given n + 1 points.

A variant of this method with information on its convergence can be found in [Kel].

Example 1
In order to apply the Nelder and Mead method to our framework example (cf. section 3.0), we utilize the function fminsearch from the Matlab® Optimization Toolbox. The option optimset('Display', 'iter') gives additional information on the operation executed in each step (reflection, expansion, contraction). For the starting point $w^{(0)} = (0, 0, 0)^T$, the regularization parameter $\lambda = 1$ and the tolerance of $10^{-7}$ we need 168 iteration steps, of which we will obviously only list a few (rounded to eight decimal places):

      k      w^(k)                                        f(w^(k))
      0       0            0            0                 8.31776617
      15     −0.11716667   0.06591667   0.03558333        7.57510570
      30     −0.31787421   0.17843116   0.09626648        7.17602741
      45     −0.46867136   0.56801005   0.01999832        5.90496742
      60     −0.48864968   0.77331672  −0.05325103        5.68849106
      75     −0.49057101   0.77096765  −0.05059686        5.68818353
      90     −0.50062724   0.76632985   0.00262412        5.68256597
      110    −0.50501507   0.75304960   0.07187413        5.67970021
      135    −0.50485529   0.75325130   0.07065917        5.67969938
      168    −0.50485551   0.75324905   0.07066912        5.67969938

Our numerical tests were run on an Apple iMac G5 with Matlab® 7.0. The given distribution of points and the obtained 'separating' line $w_1 x_1 + w_2 x_2 + w_3 = 0$ for $w := w^{(168)}$ look like this:

[Figure: the twelve data points and the separating line]

The Shor Ellipsoid Method²

Shor's ellipsoid method is a localization method which has attracted great attention mainly because of its applications to linear optimization. However, we will not discuss the corresponding algorithm published by Khachiyan in 1979. Starting with an ellipsoid $E^{(0)}$ containing 'the' minimal point we are looking for, we will consider a segment of it containing the point and then the corresponding enclosing ellipsoid of minimal volume in each iteration step.

² Cf. [Shor], ch. 8.5, pp. 86–91.
Since this method uses ellipsoids, we will give some preliminary remarks about them: Let $n \in \mathbb{N}$ be fixed. For r > 0 and $x_0 \in \mathbb{R}^n$
      $$B^r_{x_0} := \{x \in \mathbb{R}^n : \|x - x_0\|_2 \le r\} = \{x \in \mathbb{R}^n : (x - x_0)^T (x - x_0) \le r^2\}$$
is the ball with center $x_0$ and radius r > 0. For a symmetric positive definite matrix $P \in \mathbb{R}^{n \times n}$
      $$E := E(P, x_0) := \{x \in \mathbb{R}^n \mid (x - x_0)^T P^{-1} (x - x_0) \le 1\}$$
describes an ellipsoid with center $x_0$. The special case $P := r^2 I$ gives $E = B^r_{x_0}$. For a, b > 0 and $\xi_0, \eta_0 \in \mathbb{R}$ we get, for example, with
      $$P := \begin{pmatrix} a^2 & 0 \\ 0 & b^2 \end{pmatrix}, \quad \text{hence} \quad P^{-1} = \begin{pmatrix} a^{-2} & 0 \\ 0 & b^{-2} \end{pmatrix},$$
the ellipse
      $$\Bigl(\frac{\xi - \xi_0}{a}\Bigr)^2 + \Bigl(\frac{\eta - \eta_0}{b}\Bigr)^2 \le 1$$
in the form
      $$(\xi - \xi_0,\ \eta - \eta_0)\, P^{-1} \begin{pmatrix} \xi - \xi_0 \\ \eta - \eta_0 \end{pmatrix} \le 1.$$
E is the image of the unit ball
      $$UB := B^1_0 = \{u \in \mathbb{R}^n : \|u\|_2 \le 1\}$$
under the mapping h defined by $h(u) := P^{1/2} u + x_0$ for $u \in \mathbb{R}^n$, since for $u \in UB$ and $x := h(u) = P^{1/2} u + x_0$ it holds that
      $$(x - x_0)^T P^{-1} (x - x_0) = (P^{1/2} u)^T P^{-1} P^{1/2} u = \langle u, P^{1/2} P^{-1} P^{1/2} u \rangle = \langle u, u \rangle \le 1$$
and back. We obtain the volume $\mathrm{vol}(E(P, x_0))$ of the above ellipsoid with the help of the Transformation Theorem:
      $$\mathrm{vol}(E(P, x_0)) = \int_E dx = \int_{UB} \det P^{1/2} \, du = \omega_n \sqrt{\det P},$$
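where $\omega_n$ denotes the volume of the n-dimensional unit ball (see, for example, [Co/Jo], p. 459).

In addition we will need the following two tools:

Remark 3.1.1
For $\alpha \in \mathbb{R}$ and $u \in \mathbb{R}^n \setminus \{0\}$ it holds that
      $$\det(I - \alpha\, u u^T) = 1 - \alpha\, u^T u.$$
Proof: $(I - \alpha\, u u^T) u = (1 - \alpha\, u^T u)\, u$ and $(I - \alpha\, u u^T) v = v$ hold for $v \in \mathbb{R}^n$ with $v \perp u$. With that the matrix $I - \alpha\, u u^T$ has the eigenvalue 1 of multiplicity (n − 1) and the simple eigenvalue $1 - \alpha\, u^T u$.

Remark 3.1.2
For $n \ge 2$ it holds that
      $$\Bigl(\frac{n}{n+1}\Bigr)^{n+1} \Bigl(\frac{n}{n-1}\Bigr)^{n-1} < \exp\Bigl(-\frac{1}{n}\Bigr).$$
Proof: The assertion is equivalent to
      $$\Bigl(\frac{n+1}{n}\Bigr)^{n+1} \Bigl(\frac{n-1}{n}\Bigr)^{n-1} > \exp\Bigl(\frac{1}{n}\Bigr)$$
and hence to
      $$(n + 1) \ln(1 + 1/n) + (n - 1) \ln(1 - 1/n) > 1/n.$$
Using the power series expansion
      $$\ln(1 + x) = \sum_{k=1}^\infty (-1)^{k+1}\, \frac{1}{k}\, x^k \quad \text{for } -1 < x \le 1,$$
we get for the left-hand side
      $$\begin{aligned} \sum_{k=1}^\infty (-1)^{k+1}\, \frac{n+1}{k\, n^k} - \sum_{k=1}^\infty \frac{n-1}{k\, n^k} &= -\sum_{\kappa=1}^\infty \frac{2n}{2\kappa\, n^{2\kappa}} + \sum_{\kappa=1}^\infty \frac{2}{(2\kappa - 1)\, n^{2\kappa-1}} \\ &= -\sum_{\kappa=1}^\infty \frac{1}{\kappa\, n^{2\kappa-1}} + \sum_{\kappa=1}^\infty \frac{2}{(2\kappa - 1)\, n^{2\kappa-1}} \\ &= \sum_{\kappa=1}^\infty \frac{1}{(2\kappa - 1)\,\kappa}\, \frac{1}{n^{2\kappa-1}} > \frac{1}{n}. \end{aligned}$$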
Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a given differentiable convex function with a minimal point $x^* \in \mathbb{R}^n$, hence, $f(x) \ge f(x^*)$ for all $x \in \mathbb{R}^n$. As is generally known,
      $$f(x) \ge f(x_0) + f'(x_0)(x - x_0)$$
holds for all $x, x_0 \in \mathbb{R}^n$ (see chapter 2). From $f'(x_0)(x - x_0) > 0$ it follows that $f(x) > f(x_0)$. Therefore, each time $x^*$ lies in the half space $\{x \in \mathbb{R}^n \mid f'(x_0)(x - x_0) \le 0\}$. The following method is based on this observation:

Locate $x^*$ in an ellipsoid $E^{(0)}$. If we have for $k \in \mathbb{N}_0$ an ellipsoid $E^{(k)}$ containing $x^*$, with center $x^{(k)}$, then
      $$x^* \in E^{(k)} \cap \{x \in \mathbb{R}^n \mid g_k^T (x - x^{(k)}) \le 0\} =: S_k$$
holds for $g_k := \nabla f(x^{(k)})$, as stated above. (Other classes of so-called cutting-plane methods will be considered in chapter 8.)

Now choose $E^{(k+1)}$ as the hull ellipsoid, that is, the ellipsoid of minimal volume containing $S_k$.³

[Figure: ellipsoid E^(k) with center x^(k), the gradient ∇f(x^(k)) and the hull ellipsoid E^(k+1)]

Each of these ellipsoids can be written in the form
      $$E^{(k)} = \{x \in \mathbb{R}^n \mid (x - x^{(k)})^T A_k^{-1} (x - x^{(k)}) \le 1\}$$
with a symmetric positive definite matrix $A_k$. With
      $$\tilde{g}_k := \frac{g_k}{\sqrt{g_k^T A_k g_k}} \quad \text{and} \quad b_k := A_k \tilde{g}_k$$
for a nonzero gradient $\nabla f(x^{(k)})$ we note the following updating formulae for the ellipsoid $E^{(k+1)}$ (without proof⁴):
      $$x^{(k+1)} := x^{(k)} - \frac{1}{n+1}\, b_k, \qquad A_{k+1} := \frac{n^2}{n^2 - 1} \Bigl(A_k - \frac{2}{n+1}\, b_k b_k^T\Bigr)$$

³ The unique ellipsoid of minimal volume containing a given convex body is called a Löwner–John ellipsoid.
⁴ cf. [John]
For the determinant we get from the above (considering remark 3.1.1):
      $$\det A_{k+1} = \Bigl(\frac{n^2}{n^2 - 1}\Bigr)^n \det A_k \Bigl(1 - \frac{2}{n+1}\, b_k^T A_k^{-1} b_k\Bigr)$$
By definition of $b_k$ it follows that $b_k^T A_k^{-1} b_k = (A_k^{-1/2} b_k)^T A_k^{-1/2} b_k = 1$ and hence with remark 3.1.2
      $$\frac{\det A_{k+1}}{\det A_k} = \Bigl(\frac{n^2}{n^2 - 1}\Bigr)^n \frac{n-1}{n+1} = \Bigl(\frac{n}{n+1}\Bigr)^{n+1} \Bigl(\frac{n}{n-1}\Bigr)^{n-1} < \exp\Bigl(-\frac{1}{n}\Bigr).$$

Properties of the Ellipsoid Method:
• If n = 1, it is identical to the bisection method.
• If $E^{(k)}$ and $g_k$ are given, we obtain $E^{(k+1)}$ with very simple updating formulae. However, the direct execution of this rank 1-update can pose numerical problems, for example, we might lose the positive definiteness. We will get to know later on how to execute this in a numerically stable way.
• $\mathrm{vol}(E^{(k+1)}) \le \exp\bigl(-\tfrac{1}{2n}\bigr)\, \mathrm{vol}(E^{(k)})$; the factor is 0.7788 for n = 2 and 0.8465 for n = 3.
• Note: The ellipsoid method is not a descent method.
• The ellipsoid method can be generalized to nondifferentiable convex objective functions and can also be applied to problems with convex constraints.

With the help of the estimate
      $$f(x^*) \ge f(x^{(k)}) + f'(x^{(k)})(x^* - x^{(k)}) \ge f(x^{(k)}) + \inf_{x \in E^{(k)}} f'(x^{(k)})(x - x^{(k)}) = f(x^{(k)}) - \sqrt{g_k^T A_k g_k}$$
we obtain the following stop-criteria:
• $\sqrt{g_k^T A_k g_k} < \varepsilon \bigl(|f(x^{(k)})| + 1\bigr)$
• Or: $U_k - L_k < \varepsilon\, (|U_k| + 1)$, where
      $$U_k := \min_{j \le k} f(x^{(j)}) \quad \text{and} \quad L_k := \max_{j \le k} \Bigl(f(x^{(j)}) - \sqrt{g_j^T A_j g_j}\Bigr).$$
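To make the updating formulae and the first stop criterion concrete, here is a minimal sketch in plain Python. It uses the direct rank-1 update, which the text warns can be numerically delicate, so it is meant only as an illustration for small, well-conditioned problems with n ≥ 2. The function name `ellipsoid_method`, the iteration cap and the quadratic test function are assumptions of this sketch.

```python
import math


def ellipsoid_method(f, grad, x, A, eps=1e-9, max_iter=500):
    """Ellipsoid iteration with center x and matrix A (list of lists), n >= 2.

    Returns the best center found and its function value.
    """
    n = len(x)
    x = list(x)
    best_x, best_f = list(x), f(x)
    for _ in range(max_iter):
        g = grad(x)
        Ag = [sum(A[i][j] * g[j] for j in range(n)) for i in range(n)]
        gAg = sum(g[i] * Ag[i] for i in range(n))
        if gAg <= 0.0:                          # zero gradient or lost definiteness
            break
        s = math.sqrt(gAg)
        if s < eps * (abs(f(x)) + 1.0):         # stop criterion from the text
            break
        b = [v / s for v in Ag]                 # b_k = A_k g_k / sqrt(g_k^T A_k g_k)
        x = [x[i] - b[i] / (n + 1) for i in range(n)]
        c = n * n / (n * n - 1.0)
        A = [[c * (A[i][j] - 2.0 / (n + 1) * b[i] * b[j]) for j in range(n)]
             for i in range(n)]
        fx = f(x)
        if fx < best_f:                         # not a descent method: keep the best
            best_x, best_f = list(x), fx
    return best_x, best_f


def f_quad(x):
    # Hypothetical convex test function with minimizer (1, -0.5).
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2


def grad_quad(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)]


# Start with the ball of radius 10 around the origin, i.e. A_0 = 100 I.
best_x, best_f = ellipsoid_method(f_quad, grad_quad, [0.0, 0.0],
                                  [[100.0, 0.0], [0.0, 100.0]])
```

Because the centers need not decrease the function value from step to step, the sketch records the best center seen so far, in the spirit of the quantity $U_k$ above.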
3.2 Descent Methods with Line Search
In the following we will look at minimization methods which, beginning with a starting point $x^{(0)}$, iteratively construct a sequence of approximations $(x^{(k)})$ such that $f(x^{(k)}) > f(x^{(k+1)})$ for $k \in \mathbb{N}_0$. For differentiable objective functions they are all based on the idea of choosing a descent direction $d_k$, that is, $f'(x^{(k)}) d_k < 0$, at the current point $x^{(k)}$ and then determining a corresponding step size $\lambda_k > 0$ such that
      $$f(x^{(k)} + \lambda_k d_k) = \min_{\lambda \ge 0} f(x^{(k)} + \lambda d_k) \qquad (1)$$
or weaker (in a sense to be stated more precisely)
      $$f(x^{(k)} + \lambda_k d_k) \approx \min_{\lambda \ge 0} f(x^{(k)} + \lambda d_k).$$
$x^{(k+1)} := x^{(k)} + \lambda_k d_k$ yields a better approximation. In the case of (1) it holds that
      $$0 = \frac{d}{d\lambda} f(x^{(k)} + \lambda d_k)\Big|_{\lambda = \lambda_k} = f'(x^{(k+1)})\, d_k.$$
This one-dimensional minimization is called 'line search'. It has turned out that putting a lot of effort into the computation of the exact solution of (1) is not worthwhile. The exact search is often too laborious and costly. Therefore, a more tolerant search strategy for efficient step sizes is useful. The exact step size is only used for theoretical considerations and in the special case of quadratic optimization problems.

Before addressing specific aspects of the so-called 'inexact line search', we will discuss the choice of suitable descent directions. The following example shows that, even in the simplest special case, the step sizes have to be chosen with care:

Example 2
$f(x) := x^2$ for $x \in \mathbb{R}$, $x^{(0)} := 1$, $d_k := -1$ and $\lambda_k := 2^{-k-2}$ for $k \in \mathbb{N}_0$. We get
      $$x^{(k+1)} := x^{(k)} + \lambda_k d_k = x^{(k)} - \lambda_k = x^{(0)} - \sum_{\kappa=0}^k \lambda_\kappa$$
and then
      $$x^{(k+1)} = 1 - \frac{1}{4} \sum_{\kappa=0}^k \Bigl(\frac{1}{2}\Bigr)^\kappa = \frac{1}{2} + \Bigl(\frac{1}{2}\Bigr)^{k+2}.$$
In this way it follows that $0 < x^{(k+1)} < x^{(k)}$, hence, $f(x^{(k+1)}) < f(x^{(k)})$, with $x^{(k)} \to 1/2$, $f(x^{(k)}) \to 1/4$. However, the minimum lies at $x^* = 0$ and has the value 0.

[Figure: the iterates x^(k) on the graph of f]

Here the step sizes have been chosen too small!

The choice $d_k := (-1)^{k+1}$ with $\lambda_k := 1 + 3/2^{k+2}$ gives
      $$x^{(k)} = \frac{1}{2}\, (-1)^k \Bigl(1 + \frac{1}{2^k}\Bigr).$$

[Figure: the iterates x^(k) on the graph of f]

Here the step sizes have been chosen too large!
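Both choices of example 2 are easy to replay numerically. The short Python sketch below (the helper name `iterate` is an illustrative choice) generates the two sequences and shows that the function values decrease strictly in both cases while the iterates stay away from the minimizer 0.

```python
def iterate(x0, direction, step, num_steps):
    """Return [x^(0), ..., x^(num_steps)] for x^(k+1) = x^(k) + step(k) * direction(k)."""
    xs = [x0]
    for k in range(num_steps):
        xs.append(xs[-1] + step(k) * direction(k))
    return xs


# First choice: d_k = -1, lambda_k = 2^(-k-2)  (step sizes too small).
too_small = iterate(1.0, lambda k: -1.0, lambda k: 2.0 ** (-k - 2), 30)

# Second choice: d_k = (-1)^(k+1), lambda_k = 1 + 3 / 2^(k+2)  (step sizes too large).
too_large = iterate(1.0, lambda k: (-1.0) ** (k + 1),
                    lambda k: 1.0 + 3.0 / 2.0 ** (k + 2), 30)

f_small = [x * x for x in too_small]
f_large = [x * x for x in too_large]
```

The first sequence tends to 1/2 from above, the second one jumps between values close to +1/2 and −1/2; in both cases the function values tend to 1/4 instead of 0.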
Coordinatewise Descent Methods

Cyclic Control: We cyclically go through the n coordinate axes $x_1, \dots, x_n$. In the k-th iteration of a cycle we fix all except the k-th variable and minimize the objective function. Then we repeat the cycle (Gauss–Seidel method).

Relaxation Control: In this variant of the alternating variable method we choose the number j of the coordinate axis such that $\bigl|\tfrac{\partial f}{\partial x_j}(x^{(k)})\bigr|$ becomes maximal (Southwell).
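For a convex quadratic function $f(x) = \tfrac{1}{2} x^T A x + b^T x$ the one-dimensional minimization in each coordinate can be done in closed form, which makes cyclic control easy to illustrate. The following Python sketch assumes a symmetric positive definite matrix A given as a list of lists; the function name and the small test problem are hypothetical.

```python
def cyclic_coordinate_descent(A, b, x, num_cycles):
    """Cyclic coordinatewise minimization of f(x) = 1/2 x^T A x + b^T x.

    Minimizing over x_i alone means solving  sum_j A[i][j] x_j + b[i] = 0  for x_i.
    """
    n = len(x)
    x = list(x)
    for _ in range(num_cycles):
        for i in range(n):                      # go through the coordinate axes
            rest = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = -(b[i] + rest) / A[i][i]
    return x


# Hypothetical test problem: the minimizer solves A x = -b, i.e. x = (0.2, 0.6).
A_test = [[2.0, 1.0], [1.0, 3.0]]
b_test = [-1.0, -2.0]
x_cd = cyclic_coordinate_descent(A_test, b_test, [0.0, 0.0], 50)
```

For this quadratic case the cycle coincides with the Gauss–Seidel iteration for the linear system $Ax = -b$, which explains the name used above.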
Gradient Method

The gradient method, also called steepest descent method, was already proposed by Cauchy in 1847 (see section 1.2). The choice of the negative gradient as the descent direction is based on the observation that
      $$\frac{d}{d\lambda} f(x^{(k)} + \lambda d)\Big|_{\lambda = 0} = f'(x^{(k)})\, d$$
holds and that the minimization problem
      $$\min_{\|d\|_2 = 1} f'(x^{(k)})\, d$$
for a nonzero gradient $\nabla f(x^{(k)})$ has the solution
      $$d_k := -\frac{1}{\|\nabla f(x^{(k)})\|_2}\, \nabla f(x^{(k)}),$$
for example because of the Cauchy–Schwarz inequality. This locally optimal choice of d will turn out not to be globally very advantageous.

The way the gradient method works can very easily be examined in the special case of a convex quadratic objective function f defined by
      $$f(x) := \tfrac{1}{2}\, x^T A x + b^T x$$
with a symmetric positive definite matrix A and a vector $b \in \mathbb{R}^n$. For the gradient it holds that $\nabla f(x) = Ax + b =: g(x)$. In this case the step size can easily be determined by means of 'exact line search':
      $$\frac{d}{d\lambda} f(x^{(k)} + \lambda(-g_k)) = \bigl(A(x^{(k)} + \lambda(-g_k)) + b\bigr)^T (-g_k) = -(g_k - \lambda A g_k)^T g_k = 0 \iff \lambda = \frac{g_k^T g_k}{g_k^T A g_k} > 0$$
Here, $g_k := g(x^{(k)})$ denotes the gradient of f at the point $x^{(k)}$. As an improved approximation we obtain
      $$x^{(k+1)} = x^{(k)} - \frac{g_k^T g_k}{g_k^T A g_k}\, g_k.$$
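Example 3
We want to minimize the convex quadratic objective function f given by
      $$f(x_1, x_2) := \tfrac{1}{2}\bigl(x_1^2 + 9\, x_2^2\bigr) = \tfrac{1}{2}\, x^T A x \quad \text{with} \quad A := \begin{pmatrix} 1 & 0 \\ 0 & 9 \end{pmatrix}.$$
For the gradient it holds that $\nabla f(x) = Ax$. Choosing $x^{(0)} := (9, 1)^T$ as the starting vector and using exact 'line search', we prove for the sequence $(x^{(k)})$:
      $$x^{(k)} = 0.8^k \begin{pmatrix} 9 \\ (-1)^k \end{pmatrix}$$
This is clear for k = 0. For the induction step from k to k + 1 we obtain
      $$g_k = 0.8^k \cdot 9 \begin{pmatrix} 1 \\ (-1)^k \end{pmatrix}, \quad A g_k = 0.8^k \cdot 9 \begin{pmatrix} 1 \\ 9\,(-1)^k \end{pmatrix}$$
as well as
      $$\lambda_k = \frac{g_k^T g_k}{g_k^T A g_k} = \frac{2}{10}.$$
Together with
      $$x^{(k)} - \frac{1}{5}\, g_k = 0.8^k \begin{pmatrix} 9 - \tfrac{9}{5} \\ (-1)^k \bigl(1 - \tfrac{9}{5}\bigr) \end{pmatrix} = 0.8^{k+1} \begin{pmatrix} 9 \\ (-1)^{k+1} \end{pmatrix} = x^{(k+1)}$$
this gives the result.

[Figure: iterates of the gradient method for example 3]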
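The closed form of example 3 can be checked with a few lines of code. The sketch below implements the gradient method with exact line search for $f(x) = \tfrac{1}{2} x^T A x + b^T x$ in plain Python; the function name `gradient_method_quadratic` is an illustrative choice.

```python
def gradient_method_quadratic(A, b, x, num_steps):
    """Steepest descent with exact step size lambda_k = g^T g / (g^T A g).

    Returns the list of iterates x^(0), ..., x^(num_steps).
    """
    n = len(x)
    x = list(x)
    iterates = [list(x)]
    for _ in range(num_steps):
        g = [sum(A[i][j] * x[j] for j in range(n)) + b[i] for i in range(n)]
        gg = sum(v * v for v in g)
        if gg == 0.0:                           # gradient vanishes: minimizer reached
            break
        Ag = [sum(A[i][j] * g[j] for j in range(n)) for i in range(n)]
        lam = gg / sum(g[i] * Ag[i] for i in range(n))
        x = [x[i] - lam * g[i] for i in range(n)]
        iterates.append(list(x))
    return iterates


# Data of example 3: A = diag(1, 9), b = 0, x^(0) = (9, 1).
path = gradient_method_quadratic([[1.0, 0.0], [0.0, 9.0]], [0.0, 0.0], [9.0, 1.0], 10)
```

The iterates reproduce $x^{(k)} = 0.8^k (9, (-1)^k)^T$: the second component changes its sign in every step, which is the zigzag behaviour typical of the method.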
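Kantorovich's Inequality

Let A be a symmetric positive definite matrix with the eigenvalues $0 < \alpha_1 \le \dots \le \alpha_n$. Then for any $x \in \mathbb{R}^n \setminus \{0\}$ it holds that
      $$\frac{x^T A x}{x^T x} \cdot \frac{x^T A^{-1} x}{x^T x} \le \frac{(\alpha_1 + \alpha_n)^2}{4\, \alpha_1 \alpha_n}.$$
Proof: With the corresponding orthonormal eigenvectors $z_1, \dots, z_n$ we presume the representation $x = \sum_{\nu=1}^n \xi_\nu z_\nu$ and obtain
      $$\frac{x^T A x}{x^T x} \cdot \frac{x^T A^{-1} x}{x^T x} = \frac{\sum_{\nu=1}^n \alpha_\nu \xi_\nu^2}{\sum_{\mu=1}^n \xi_\mu^2} \cdot \frac{\sum_{\nu=1}^n \frac{1}{\alpha_\nu}\, \xi_\nu^2}{\sum_{\mu=1}^n \xi_\mu^2} = \sum_{\nu=1}^n u_\nu \alpha_\nu \cdot \sum_{\nu=1}^n \frac{u_\nu}{\alpha_\nu}$$
with $u_\nu := \xi_\nu^2 / \sum_{\mu=1}^n \xi_\mu^2$. Here $u_\nu \ge 0$ holds with $\sum_{\nu=1}^n u_\nu = 1$. For the weighted sum $\bar{\alpha} := \sum_{\nu=1}^n u_\nu \alpha_\nu$ we have $\alpha_1 \le \bar{\alpha} \le \alpha_n$, and from the convexity of the function $\mathbb{R}_{++} \ni t \mapsto 1/t$ the inequality
      $$\frac{1}{t} \le \frac{1}{\alpha_1} + \frac{1}{\alpha_n} - \frac{t}{\alpha_1 \alpha_n} \quad \text{for } \alpha_1 \le t \le \alpha_n$$
follows. Hence,
      $$\sum_{\nu=1}^n \frac{u_\nu}{\alpha_\nu} \le \sum_{\nu=1}^n u_\nu \Bigl(\frac{1}{\alpha_1} + \frac{1}{\alpha_n} - \frac{\alpha_\nu}{\alpha_1 \alpha_n}\Bigr) = \frac{1}{\alpha_1} + \frac{1}{\alpha_n} - \frac{\bar{\alpha}}{\alpha_1 \alpha_n}.$$
This gives
      $$\sum_{\nu=1}^n u_\nu \alpha_\nu \cdot \sum_{\nu=1}^n \frac{u_\nu}{\alpha_\nu} \le \bar{\alpha} \Bigl(\frac{1}{\alpha_1} + \frac{1}{\alpha_n} - \frac{\bar{\alpha}}{\alpha_1 \alpha_n}\Bigr) \le \max_{\alpha_1 \le t \le \alpha_n} t \Bigl(\frac{1}{\alpha_1} + \frac{1}{\alpha_n} - \frac{t}{\alpha_1 \alpha_n}\Bigr) = \frac{(\alpha_1 + \alpha_n)^2}{4\, \alpha_1 \alpha_n}.$$

We revert to the above convex quadratic objective function f(x) =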
1 T x Ax + bT x 2
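with the minimal point $x^*$, that is, $\nabla f(x^*) = A x^* + b = 0$, and introduce the error function
$$E(x) := \tfrac{1}{2}\, (x - x^*)^T A (x - x^*) .$$
Since $E(x) = f(x) + \tfrac{1}{2}\, x^{*T} A x^* = f(x) - f(x^*)$, the functions $E$ and $f$ differ only by a constant. Using the notation introduced above,
$$x^{(k+1)} := x^{(k)} - \lambda_k g_k , \qquad \lambda_k := \frac{g_k^T g_k}{g_k^T A g_k} ,$$
we obtain:

Lemma 3.2.1
$$E\bigl(x^{(k+1)}\bigr) = \Bigl[\,1 - \frac{(g_k^T g_k)^2}{(g_k^T A g_k)(g_k^T A^{-1} g_k)}\Bigr] E\bigl(x^{(k)}\bigr) \le \Bigl(\frac{c(A) - 1}{c(A) + 1}\Bigr)^{2} E\bigl(x^{(k)}\bigr) .$$
Hence, with regard to the norm $\|\cdot\|_A$ defined by $\|x\|_A := \sqrt{\langle x, Ax \rangle}$ (cf. exercise 4) it holds that
$$\bigl\|x^{(k+1)} - x^*\bigr\|_A \le \frac{c(A) - 1}{c(A) + 1} \cdot \bigl\|x^{(k)} - x^*\bigr\|_A .$$
Proof: Here, for the condition number $c(A) := \|A\|_2 \cdot \|A^{-1}\|_2$ of $A$ it holds that $c(A) = \alpha_n / \alpha_1$. Setting $y_k := x^{(k)} - x^*$, we obtain $y_{k+1} = y_k - \lambda_k g_k$ and $A y_k = A(x^{(k)} - x^*) = g_k$; with that the first part follows:
$$\frac{E\bigl(x^{(k)}\bigr) - E\bigl(x^{(k+1)}\bigr)}{E\bigl(x^{(k)}\bigr)} = \frac{\lambda_k\, g_k^T A y_k - \tfrac{1}{2} \lambda_k^2\, g_k^T A g_k}{\tfrac{1}{2}\, y_k^T A y_k} = \frac{(g_k^T g_k)^2}{(g_k^T A g_k)(g_k^T A^{-1} g_k)} .$$
The second part follows with the Kantorovich inequality:
$$1 - \frac{(g_k^T g_k)^2}{(g_k^T A g_k)(g_k^T A^{-1} g_k)} \le 1 - \frac{4\,\alpha_1 \alpha_n}{(\alpha_1 + \alpha_n)^2} = \Bigl(\frac{\alpha_n - \alpha_1}{\alpha_n + \alpha_1}\Bigr)^{2} .$$
□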
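The contraction factor of lemma 3.2.1 can be checked numerically. The following is a minimal sketch, not the book's code: the helper names `mat_vec`, `dot`, `steepest_descent` and `error_E` are our own, and the test problem $A = \mathrm{diag}(1, 9)$, $b = 0$, $x^{(0)} = (9, 1)^T$ is assumed to be the one of example 3. For this problem every step should reduce $E$ by exactly the factor $\bigl((c(A)-1)/(c(A)+1)\bigr)^2 = 0.64$.

```python
def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def steepest_descent(A, b, x0, steps):
    """Gradient method with exact line search for f(x) = 1/2 x^T A x + b^T x."""
    x = list(x0)
    iterates = [x]
    for _ in range(steps):
        g = [ax + bi for ax, bi in zip(mat_vec(A, x), b)]  # g_k = A x^(k) + b
        gg = dot(g, g)
        if gg == 0.0:  # stationary point reached
            break
        lam = gg / dot(g, mat_vec(A, g))  # lambda_k = g^T g / g^T A g
        x = [xi - lam * gi for xi, gi in zip(x, g)]
        iterates.append(x)
    return iterates


def error_E(A, x, x_star):
    """Error function E(x) = 1/2 (x - x*)^T A (x - x*)."""
    y = [xi - si for xi, si in zip(x, x_star)]
    return 0.5 * dot(y, mat_vec(A, y))


A = [[1.0, 0.0], [0.0, 9.0]]
x_star = [0.0, 0.0]
its = steepest_descent(A, [0.0, 0.0], [9.0, 1.0], steps=5)
ratios = [error_E(A, its[k + 1], x_star) / error_E(A, its[k], x_star)
          for k in range(5)]
bound = ((9.0 - 1.0) / (9.0 + 1.0)) ** 2  # ((c(A) - 1) / (c(A) + 1))^2
print(ratios, bound)
```

In this example the bound of the lemma is attained in every step, which illustrates that the estimate cannot be improved in general.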
In conclusion we can note the following significant disadvantages of the gradient method:
• Slow convergence in the case of strong eccentricity (ill condition)
• For quadratic objective functions the method is not necessarily finite (cf. example 3 on page 100 f).

Example 4 In order to also apply the steepest descent method to our framework example (see section 3.0), we utilize the function fminunc, with the option optimset(’HessUpdate’, ’steepdesc’), from the Matlab® Optimization Toolbox. For the starting point $w^{(0)} = (0, 0, 0)^T$ and the regularization parameter $\lambda = 1$ we need 66 iteration steps to reach the tolerance of $10^{-7}$. Again, we will only list a few of them (rounded to eight decimal places):

  k    w_1^(k)        w_2^(k)       w_3^(k)       f(w^(k))      ‖∇f(w^(k))‖_∞
  0     0              0             0             8.31776617    6.00000000
  2    −0.22881147     0.30776833    0.03037985    6.45331017    2.54283988
  4    −0.54704228     0.81975751    0.08597370    5.69424031    0.29568040
  6    −0.50630235     0.75340273    0.07669618    5.67972415    0.00699260
 14    −0.50508694     0.75290067    0.07254714    5.67970141    0.00194409
 24    −0.50491650     0.75315790    0.07116240    5.67969952    0.00051080
 34    −0.50487153     0.75322512    0.07079865    5.67969939    0.00013416
 44    −0.50485972     0.75324278    0.07070312    5.67969938    0.00003524
 54    −0.50485662     0.75324742    0.07067802    5.67969938    0.00000925
 66    −0.50485573     0.75324874    0.07067088    5.67969938    0.00000186
Requirements on the Step Size Selection
When choosing a suitable step size $\lambda_k$, we try to find a happy medium between the maximal demand of exact line search (compare (1)) and the minimum demand $f(x^{(k)} + \lambda_k d_k) < f(x^{(k)})$. A $\lambda_k$ fulfilling (1) is sometimes called ray minimizing. As a weakening, the smallest local minimal point or even the smallest stationary point on the ray $\{x^{(k)} + \lambda d_k \mid \lambda \ge 0\}$ is often considered. More tolerant conditions are based on, for example, Goldstein (1965), Wolfe (1963), Armijo (1966) and Powell (1976).

For the following discussions suppose that in general
$$-g_k^T d_k \ge \gamma\, \|d_k\|_2 \cdot \|g_k\|_2 \qquad (2)$$
holds for $k \in \mathbb{N}_0$ with a fixed $0 < \gamma \le 1$. Geometrically speaking this means: The descent direction $d_k$ must not be ‘almost orthogonal’ to the gradient $g_k := \nabla f(x^{(k)})$ (the angle between $-g_k$ and $d_k$ stays below a right angle, uniformly in $k$).

Goldstein Conditions: $\lambda_k$ as a positive solution of two inequalities: Suppose that
$$f(x^{(k)} + \lambda_k d_k) \le f(x^{(k)}) + \alpha\, \lambda_k\, g_k^T d_k \qquad (3)$$
holds with an $\alpha \in (0, 1)$; note that $g_k^T d_k < 0$. Hence, the step size $\lambda_k$ should only guarantee a sufficient descent (sufficiently large decrease of the value of the objective function), that is, $x^{(k+1)}$ should not be too far from $x^{(k)}$, hence, not be too close to the ‘right edge’ of the descent region. Furthermore suppose that
$$f(x^{(k)} + \lambda_k d_k) \ge f(x^{(k)}) + \beta\, \lambda_k\, g_k^T d_k \qquad (4)$$
holds with a $\beta \in (\alpha, 1)$. This should guarantee a minimum size of the step size, that is, $x^{(k+1)}$ should not be close to the ‘left edge’. (Often we choose $\alpha = 1/4$ and $\beta = 3/4$.)

In this kind of problem $k \in \mathbb{N}_0$, $x^{(k)}$ and $d_k$ are fixed. Therefore, we consider the real-valued function $\varphi$ of one real-valued variable defined by
$$\varphi(\lambda) := \varphi_k(\lambda) := f\bigl(x^{(k)} + \lambda d_k\bigr) \quad \text{for } \lambda \in \mathbb{R}_+ ,$$
taking into account the observations
$$\varphi(0) = f\bigl(x^{(k)}\bigr), \qquad \varphi'(\lambda) = f'\bigl(x^{(k)} + \lambda d_k\bigr)\, d_k , \qquad \text{in particular } \varphi'(0) = f'\bigl(x^{(k)}\bigr)\, d_k = g_k^T d_k < 0 .$$
With that, conditions (3) and (4) can be written more clearly:
$$\varphi(\lambda_k) \le \varphi(0) + \alpha\, \lambda_k\, \varphi'(0) \quad \text{and} \quad \varphi(\lambda_k) \ge \varphi(0) + \beta\, \lambda_k\, \varphi'(0) .$$
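The idea to demand (3) and (4) suggests itself: If we consider the difference quotient defined by
$$q(\lambda) := \frac{\varphi(\lambda) - \varphi(0)}{\lambda - 0}$$
for $\lambda > 0$, we have $q(\lambda) \longrightarrow a := \varphi'(0) < 0$ $(\lambda \longrightarrow 0)$. The condition $\beta a \le q(\lambda_k) \le \alpha a$ precisely means
$$\varphi(\lambda_k) \le \varphi(0) + \alpha\, \lambda_k\, \varphi'(0) \quad \text{and} \quad \varphi(\lambda_k) \ge \varphi(0) + \beta\, \lambda_k\, \varphi'(0) .$$

[Figure: graph of $\varphi$ over $\lambda$, starting at $f(x^{(k)}) = \varphi(0)$, together with the two lines $\varphi(0) + \alpha \lambda \varphi'(0)$ and $\varphi(0) + \beta \lambda \varphi'(0)$]

Disadvantage of (4): In the plotted example the ‘first’ minimizing $\lambda$ lies outside the interval in which conditions (3) and (4) are met.
$$f'(x^{(k+1)})\, d_k \approx \frac{f(x^{(k+1)}) - f(x^{(k)})}{\lambda_k} \overset{!}{\ge} \beta\, f'(x^{(k)})\, d_k$$
suggests a modification of (4): With a $\beta \in (\alpha, 1)$ suppose that
$$f'(x^{(k+1)})\, d_k \ge \beta\, g_k^T d_k \quad \text{respectively} \quad \varphi'(\lambda_k) \ge \beta\, \varphi'(0) \qquad (5)$$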
holds. (3), (5) are called Wolfe–Powell conditions. Condition (5) demands that the derivative grows sufficiently; the step size therefore does not become too small, that is, $x^{(k+1)}$ does not lie too close to $x^{(k)}$. According to [Bonn], these conditions are the “most intelligent in the current state of the art.” Now the lower estimate will always include the minimizing $\lambda$.

With a sharpening of (5) we can force $\lambda_k$ to lie closer to a local minimal point of $\varphi$:
$$\bigl|f'(x^{(k+1)})\, d_k\bigr| \le -\beta\, g_k^T d_k \qquad (6)$$
For example, for $\beta = 0.9$ we obtain an ‘inexact line search’; for $\beta = 0.1$, however, we conduct an almost ‘exact line search’.

We will now examine in which case the Wolfe–Powell conditions always have a solution:

Lemma 3.2.2 Suppose that $f \in C^2(\mathbb{R}^n)$ and $0 < \alpha \le \beta < 1$, $0 < \gamma \le 1$. For $x \in \mathbb{R}^n$ with $g := \nabla f(x) \ne 0$ and $d \in \mathbb{R}^n \setminus \{0\}$ let also $-g^T d \ge \gamma\, \|g\|_2 \|d\|_2$ and $\inf\{f(x + \lambda d) \mid \lambda \ge 0\} > -\infty$. Then it holds that:

1) There exists a $\lambda > 0$ such that
$$f(x + \lambda d) \le f(x) + \alpha\, \lambda\, g^T d \qquad (7)$$
and
$$f'(x + \lambda d)\, d \ge \beta\, g^T d . \qquad (8)$$

2) Let $\bar\lambda$ be the smallest of all numbers $\lambda > 0$ fulfilling (7) and (8). Then we have $f'(x + t d)\, d < f'(x + \bar\lambda d)\, d$ for all $0 \le t < \bar\lambda$, and $t \longmapsto f(x + t d)$ is strictly antitone (order-reversing) on the interval $[0, \bar\lambda]$. Setting $L := \max_{0 \le t \le \bar\lambda} \|\nabla^2 f(x + t d)\|_2$, it holds for all positive $\lambda$ fulfilling (7) and (8) that
$$f(x + \lambda d) \le f(x) - \frac{\alpha\, (1 - \beta)\, \gamma^2}{L}\, \|g\|_2^2 .$$
Here, the corresponding matrix norm is denoted by the same symbol $\|\cdot\|_2$.

Proof: 1) The function $\varphi : \mathbb{R} \longrightarrow \mathbb{R}$ given by $\varphi(t) := f(x + t d)$ is twice continuously differentiable and, because of $\varphi'(t) = f'(x + t d)\, d$, it holds that
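$$\varphi'(0) = g^T d \le -\gamma\, \|g\|_2 \|d\|_2 < 0 .$$
First of all, we will show that there exists a $\lambda > 0$ which fulfills (8). The latter is equivalent to $\varphi'(\lambda) \ge \beta\, \varphi'(0)$. If $\varphi'(\lambda) < \beta\, \varphi'(0)$ $(< 0)$ for all $\lambda > 0$, then
$$\varphi(t) = \varphi(0) + \int_0^t \varphi'(\lambda)\, d\lambda \le \varphi(0) + t\, \beta\, \varphi'(0) \longrightarrow -\infty \quad (t \to \infty) ,$$
which would be a contradiction to the assumption that $\varphi$ is bounded from below on $\mathbb{R}_+$. Because of
$$g^T d = \varphi'(0) < \beta\, \varphi'(0) \le \varphi'(\lambda) ,$$
the continuity of $\varphi'$ results in a minimal $\bar\lambda > 0$ such that $\varphi'(\bar\lambda) = \beta\, \varphi'(0) < 0$, hence, $\varphi'(t) < \beta\, \varphi'(0)$ for $0 \le t < \bar\lambda$. $\bar\lambda$ then also fulfills (7):
$$\varphi(\bar\lambda) = \varphi(0) + \varphi'(\tau)\, \bar\lambda \quad \text{with a suitable } \tau \in (0, \bar\lambda)$$
$$\le \varphi(0) + \beta\, \bar\lambda\, \varphi'(0) \le \varphi(0) + \alpha\, \bar\lambda\, \varphi'(0) ,$$
where the last step uses $\varphi'(0) < 0$ and $\alpha \le \beta$.

2) The first two of the assertions that still have to be proven are a direct result of the above observations. Only the following estimate remains to be proven: From $\varphi''(t) = d^T \nabla^2 f(x + t d)\, d$ follows $|\varphi''(t)| \le \|\nabla^2 f(x + t d)\|_2\, \|d\|_2^2 \le L\, \|d\|_2^2$ for $0 \le t \le \bar\lambda$. Since $\varphi'$ is not constant, $L$ has to be positive. Then it holds that
$$(\beta - 1)\, \varphi'(0) = \varphi'(\bar\lambda) - \varphi'(0) = \int_0^{\bar\lambda} \varphi''(t)\, dt \le \bar\lambda\, L\, \|d\|_2^2 .$$
This yields a lower bound for $\bar\lambda$:
$$\bar\lambda \ge -\frac{(1 - \beta)\, \varphi'(0)}{L\, \|d\|_2^2} = -\frac{(1 - \beta)\, g^T d}{L\, \|d\|_2^2} .$$
If $\lambda$ meets conditions (7) and (8), it is obvious that $\lambda \ge \bar\lambda$; consequently, using $\varphi'(0)^2 = (-g^T d)^2 \ge \gamma^2\, \|g\|_2^2\, \|d\|_2^2$, we get with (7) and $\varphi'(0) < 0$
$$\varphi(\lambda) - \varphi(0) \le \alpha\, \lambda\, \varphi'(0) \le \alpha\, \bar\lambda\, \varphi'(0) \le -\frac{\alpha\, (1 - \beta)\, \varphi'(0)^2}{L\, \|d\|_2^2} \le -\frac{\alpha\, (1 - \beta)\, \gamma^2}{L}\, \|g\|_2^2 .$$
□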
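Lemma 3.2.2 guarantees that a step size fulfilling (7) and (8) exists; it does not prescribe how to find one. The following sketch uses a simple bracketing strategy (doubling, then bisection) that is our own choice and not taken from the text. The function name `wolfe_powell_step`, its parameters and the test function $\varphi(t) = (t - 2)^2$ are assumptions made for illustration only.

```python
import math


def wolfe_powell_step(phi, dphi, alpha=0.25, beta=0.75, lam=1.0, max_iter=100):
    """Search a step size lam > 0 with
         phi(lam)  <= phi(0) + alpha * lam * phi'(0)   (condition (7))
         phi'(lam) >= beta * phi'(0)                   (condition (8))
    by doubling and bisection (an illustrative strategy, not the book's)."""
    phi0, dphi0 = phi(0.0), dphi(0.0)
    if dphi0 >= 0.0:
        raise ValueError("phi'(0) must be negative (descent direction)")
    lo, hi = 0.0, math.inf
    for _ in range(max_iter):
        if phi(lam) > phi0 + alpha * lam * dphi0:
            hi = lam      # (7) violated: the step is too long
        elif dphi(lam) < beta * dphi0:
            lo = lam      # (8) violated: the step is too short
        else:
            return lam    # both conditions hold
        lam = 2.0 * lo if hi == math.inf else 0.5 * (lo + hi)
    raise RuntimeError("no admissible step size found")


# Hypothetical test function, bounded from below with phi'(0) = -4 < 0.
phi = lambda t: (t - 2.0) ** 2
dphi = lambda t: 2.0 * (t - 2.0)
lam_long = wolfe_powell_step(phi, dphi, lam=10.0)   # start with a step that is too long
lam_short = wolfe_powell_step(phi, dphi, lam=0.01)  # start with a step that is too short
print(lam_long, lam_short)
```

Both calls return a step size inside the admissible interval, although they need not return the same one: the Wolfe–Powell conditions describe a set of acceptable step sizes, not a single value.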
These observations are the basis for the following

Algorithm
Let $0 < \alpha \le \beta < 1$, $0 < \gamma \le 1$ and a starting vector $x^{(0)} \in \mathbb{R}^n$ be given.
Iteration step for $k \in \mathbb{N}_0$: If $g_k = 0$: STOP; $x^{(k)}$ is a stationary point of $f$. Else:
a) Choose a descent direction $d_k \in \mathbb{R}^n$ such that $-g_k^T d_k \ge \gamma\, \|g_k\|_2 \|d_k\|_2$.
b) Calculate a step size $\lambda_k > 0$ such that
$$f(x^{(k+1)}) \le f(x^{(k)}) + \alpha\, \lambda_k\, g_k^T d_k \qquad (9)$$
$$g_{k+1}^T d_k \ge \beta\, g_k^T d_k \qquad (10)$$
holds for $x^{(k+1)} := x^{(k)} + \lambda_k d_k$.

Proposition 3.2.3 Suppose that $f \in C^2(\mathbb{R}^n)$, $x^{(0)} \in \mathbb{R}^n$ and that the level set $N := \{x \in \mathbb{R}^n \mid f(x) \le f(x^{(0)})\}$ is compact. Then the above algorithm can be carried out. It will either terminate after a finite number of steps or generate a sequence $(x^{(k)})$ such that:
1) $f(x^{(k+1)}) < f(x^{(k)})$ for $k \in \mathbb{N}_0$.
2) $(x^{(k)})$ has at least one accumulation point $x^* \in N$.
3) Each of these accumulation points $x^*$ is a stationary point of $f$, that is, $\nabla f(x^*) = 0$.

Proof: If there exists a $k \in \mathbb{N}_0$ such that $g_k = 0$, then $x^{(k)}$ is a stationary point. Otherwise the set $\{x^{(k)} \mid k \in \mathbb{N}_0\}$ is infinite. 1) holds because of (9), and 2) results from the fact that all $x^{(k)}$ are in the compact set $N$. In order to prove 3), we set $L := \max_{x \in N} \|\nabla^2 f(x)\|_2$; lemma 3.2.2 gives
$$f(x^{(k+1)}) \le f(x^{(k)}) - \frac{\alpha\, (1 - \beta)\, \gamma^2}{L}\, \|g_k\|_2^2 \quad \text{for } k \in \mathbb{N}_0 . \qquad (11)$$
By construction the sequence $(f(x^{(k)}))$ is strictly antitone. Because of
$$f(x^{(k)}) \ge \min_{x \in N} f(x) > -\infty$$
it is bounded from below and hence convergent. In particular, $f(x^{(k)}) - f(x^{(k+1)}) \longrightarrow 0$ holds for $k \to \infty$. Consequently, (11) shows that $(g_k)$ is a null sequence. The continuity of $\nabla f$ yields $\nabla f(x^*) = 0$ for each accumulation point $x^*$ of $(x^{(k)})$. □

For remarks on the limited utility of this proposition see, for example, [Ja/St], p. 144.

Addendum:
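The assertion of proposition 3.2.3 also holds if we replace conditions (9) and (10) in the algorithm with the following rule, which is easy to implement:

Armijo Step Size Rule: With a fixed $\sigma > 0$ independent of $k$, first choose $\lambda_0 \ge \sigma\, \|g_k\|_2 / \|d_k\|_2$. Then calculate the smallest $j \in \mathbb{N}_0$ such that
$$f(x^{(k)} + \lambda_j d_k) \le f(x^{(k)}) + \alpha\, \lambda_j\, g_k^T d_k$$
holds for $\lambda_j := \lambda_0 / 2^j$ and set $\lambda_k := \lambda_j$.

Proof: Consider once more $\varphi(\lambda) := f(x^{(k)} + \lambda d_k)$ for $\lambda \in \mathbb{R}$ and $k \in \mathbb{N}_0$.

1) This iterative halving of the step size terminates after a finite number of steps since, because of
$$\frac{\varphi(\lambda) - \varphi(0)}{\lambda} \longrightarrow \varphi'(0) < 0 \quad \text{for } \lambda \longrightarrow 0 ,$$
$\varphi(\lambda) - \varphi(0) \le \lambda\, \alpha\, \varphi'(0)$ holds for a sufficiently small $\lambda > 0$, that is, $f(x^{(k)} + \lambda d_k) \le f(x^{(k)}) + \alpha\, \lambda\, g_k^T d_k$.

2) We will prove: With $L := \max_{x \in N} \|\nabla^2 f(x)\|_2$ and
$$c := \min\Bigl\{\frac{\alpha\, (1 - \beta)\, \gamma^2}{2L} ,\; \alpha\, \gamma\, \sigma\Bigr\} > 0$$
it holds that $f(x^{(k)} + \lambda_j d_k) \le f(x^{(k)}) - c\, \|g_k\|_2^2$. We distinguish two cases:

If $j = 0$, $\varphi(\lambda_0) \le \varphi(0) + \alpha\, \lambda_0\, g_k^T d_k$ holds at first. Together with the assumptions $\lambda_0 \ge \sigma\, \|g_k\|_2 / \|d_k\|_2$ and $g_k^T d_k \le -\gamma\, \|g_k\|_2 \|d_k\|_2$ we get
$$\varphi(\lambda_0) \le \varphi(0) + \alpha\, \sigma\, \frac{\|g_k\|_2}{\|d_k\|_2} \bigl(-\gamma\, \|g_k\|_2 \|d_k\|_2\bigr) = \varphi(0) - \alpha\, \gamma\, \sigma\, \|g_k\|_2^2 .$$

If $j > 0$, then the step size rule gives
$$\varphi(\lambda_{j-1}) > \varphi(0) + \alpha\, \lambda_{j-1}\, \varphi'(0) ,$$
since $j - 1$ does not yet meet the condition. It follows that $2 \lambda_j = \lambda_{j-1} \ge \bar\lambda$ with $\bar\lambda$ from lemma 3.2.2 (cf. proof of 1)). Hence, with the lower bound for $\bar\lambda$ (p. 107), it holds that
$$\lambda_j \ge \frac{\bar\lambda}{2} \ge -\frac{(1 - \beta)\, \varphi'(0)}{2L\, \|d_k\|_2^2}$$
and with that
$$\varphi(\lambda_j) \le \varphi(0) + \alpha\, \lambda_j\, \varphi'(0) \le \varphi(0) - \frac{\alpha\, (1 - \beta)\, \varphi'(0)^2}{2L\, \|d_k\|_2^2} \le \varphi(0) - \frac{\alpha\, (1 - \beta)\, \gamma^2}{2L}\, \|g_k\|_2^2 .$$
□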
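The Armijo rule translates almost literally into code. The sketch below assumes the smallest admissible initial value $\lambda_0 = \sigma \|g_k\|_2 / \|d_k\|_2$; the function name `armijo_step`, the cap `max_halvings` and the test function $f(x) = x_1^4 + x_1^2 + x_2^2$ with $d = -g$ are our own illustrative choices.

```python
import math


def armijo_step(f, x, g, d, alpha=0.25, sigma=1.0, max_halvings=60):
    """Armijo rule: start with lam_0 = sigma * ||g|| / ||d|| and halve until
    f(x + lam d) <= f(x) + alpha * lam * g^T d holds."""
    slope = sum(gi * di for gi, di in zip(g, d))  # g^T d, must be negative
    if slope >= 0.0:
        raise ValueError("d is not a descent direction")
    norm = lambda v: math.sqrt(sum(vi * vi for vi in v))
    lam = sigma * norm(g) / norm(d)
    fx = f(x)
    for _ in range(max_halvings):
        trial = [xi + lam * di for xi, di in zip(x, d)]
        if f(trial) <= fx + alpha * lam * slope:
            return lam
        lam *= 0.5  # lam_j = lam_0 / 2^j
    raise RuntimeError("step size became too small")


f = lambda x: x[0] ** 4 + x[0] ** 2 + x[1] ** 2
x0 = [1.0, 1.0]
g0 = [4.0 * x0[0] ** 3 + 2.0 * x0[0], 2.0 * x0[1]]  # gradient (6, 2)
d0 = [-g0[0], -g0[1]]                               # steepest descent direction
lam0 = armijo_step(f, x0, g0, d0)
x1 = [x0[0] + lam0 * d0[0], x0[1] + lam0 * d0[1]]
print(lam0, f(x1))
```

In this run the values $1$, $1/2$ and $1/4$ are rejected and $\lambda = 1/8$ is accepted, so only a handful of function evaluations are needed and no derivative of $f$ is evaluated at the trial points.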
3.3 Trust Region Methods

In contrast to line search methods, where we are looking for an appropriate step size to a given descent direction, trust region methods have a given step size (or an upper bound for it) and we have to determine an appropriate search direction. For this method we use a ‘local model’ $\varphi_k$ of a given twice continuously differentiable function $f$ on the ball $B_k := \{x \in \mathbb{R}^n \mid \|x - x^{(k)}\|_2 \le \Delta_k\}$ with center $x^{(k)}$ and radius $\Delta_k$ for a fixed $k \in \mathbb{N}_0$ with $f_k := f(x^{(k)})$, $g_k := \nabla f(x^{(k)}) \ne 0$ and a symmetric matrix $H_k$, for example, $H_k$ as the Hessian $\nabla^2 f(x^{(k)})$,
$$f(x) \approx \varphi_k(x) := f_k + g_k^T \bigl(x - x^{(k)}\bigr) + \tfrac{1}{2} \bigl(x - x^{(k)}\bigr)^T H_k \bigl(x - x^{(k)}\bigr) .$$
We calculate the search direction $d_k$ as the global minimizer of the following problem:
$$f_k + g_k^T d + \tfrac{1}{2}\, d^T H_k d \longrightarrow \min , \qquad \|d\|_2 \le \Delta_k \ \text{ respectively } \ \tfrac{1}{2} \bigl(d^T d - \Delta_k^2\bigr) \le 0 . \qquad (12)$$
Since the function value and the value of the first derivative of $f$ and $\varphi_k$ correspond in $x^{(k)}$, we can expect a solution of (12) to give a good approximation of a minimum of $f$ on the region $B_k$ for sufficiently small radii $\Delta_k$. The idea behind this method is to determine the radius $\Delta_k$ such that we can ‘trust’ the model $\varphi_k$ in the neighborhood $B_k$. Then we speak of a ‘trust region’.

In order to do this, we calculate the quotient between the descent of the function $f$ and the descent of the modeling function $\varphi_k$ for a solution $d_k$ of (12). The closer the quotient is to 1, the more ‘trustworthy’ the model seems to be. If the correspondence is bad, we reduce the radius $\Delta_k$.

With the help of the Karush–Kuhn–Tucker conditions it follows that
$$\exists\, \lambda_k \in \mathbb{R}_+ : \quad (H_k + \lambda_k I)\, d_k + g_k = 0 , \quad \|d_k\|_2 \le \Delta_k , \quad \lambda_k \bigl(\Delta_k - \|d_k\|_2\bigr) = 0 . \qquad (13)$$
A simple transformation where we insert the KKT conditions and take into account that $\Delta_k = \|d_k\|_2$, hence, $\Delta_k^2 = \langle d_k, d_k \rangle$, for $\lambda_k \ne 0$, yields for all $d \in \mathbb{R}^n$ with $\|d\|_2 \le \Delta_k$
$$\varphi_k(x^{(k)} + d_k) \le \varphi_k(x^{(k)} + d) = f_k + \langle g_k, d \rangle + \tfrac{1}{2} \langle d, H_k d \rangle$$
$$= \varphi_k(x^{(k)} + d_k) + \tfrac{1}{2} (d - d_k)^T (H_k + \lambda_k I)(d - d_k) + \tfrac{1}{2} \lambda_k \bigl(\Delta_k^2 - d^T d\bigr) .$$
For $d \in \mathbb{R}^n$ with $\|d\|_2 = \Delta_k$ we hence get $(d - d_k)^T (H_k + \lambda_k I)(d - d_k) \ge 0$. This gives that the matrix $H_k + \lambda_k I$ is positive semidefinite.

Proof: If $\|d_k\|_2 < \Delta_k$, this is clear right away, since $\lambda_k = 0$, and $H_k$ is positive semidefinite, because of the necessary optimality condition for unconstrained problems (cf. page 36). If $\|d_k\|_2 = \Delta_k$, we proceed as follows: For $\lambda := -2\, \langle d_k, x \rangle / \|x\|_2^2$ we have
$$\|d_k + \lambda x\|_2^2 = \|d_k\|_2^2 + 2 \lambda\, \langle d_k, x \rangle + \lambda^2 \|x\|_2^2 = \|d_k\|_2^2 = \Delta_k^2 .$$
Hence, for any $x \in \mathbb{R}^n$ with $\langle d_k, x \rangle \ne 0$ there exists a $\lambda \ne 0$ such that $\|d_k + \lambda x\|_2 = \Delta_k$.

[Figure: the vectors $x$, $d$ and $d_k$]
For $d := d_k + \lambda x$
$$0 \le (d - d_k)^T (H_k + \lambda_k I)(d - d_k) = \lambda^2\, x^T (H_k + \lambda_k I)\, x$$
holds and consequently $x^T (H_k + \lambda_k I)\, x \ge 0$. However, if $\langle d_k, x \rangle = 0$, then $\langle d_k, x + t d_k \rangle = t\, \|d_k\|_2^2 > 0$ holds for any positive $t$, therefore firstly $\langle x + t d_k, (H_k + \lambda_k I)(x + t d_k) \rangle \ge 0$ and after passing to the limit $\langle x, (H_k + \lambda_k I)\, x \rangle \ge 0$. □

In the following we will assume that the matrix $H_k + \lambda_k I$ is even positive definite. Then, because of (13), it holds that:
a) $-g_k^T d_k = d_k^T (H_k + \lambda_k I)\, d_k > 0$, that means $d_k$ is a descent direction.
b) $d_k$ yields a strict global minimum of (12).

The vector $d_k$ is called the Levenberg–Marquardt direction to the parameter value $\lambda_k$. For $\lambda_k = 0$, $d_k = -H_k^{-1} g_k$ corresponds to a (quasi-)Newton direction (cf. section 3.5) and for a large $\lambda_k$ to the direction of steepest descent, because of $d_k \approx -\frac{1}{\lambda_k}\, g_k$.

Remark $-g_k^T d_k > 0$ also holds if the matrix $H_k + \lambda_k I$ is only positive semidefinite; otherwise we would have a stationary point, because of
$$-g_k^T d_k = d_k^T (H_k + \lambda_k I)\, d_k = 0 \iff (H_k + \lambda_k I)\, d_k = 0 \iff g_k = 0 .$$
This would be a contradiction to the assumption (cf. page 110).

The equation $(H_k + \lambda I)\, d(\lambda) + g_k = 0$ defines the so-called Levenberg–Marquardt trajectory $d(\lambda)$. We are looking for a parameter value $\lambda \ge 0$ such that $\|d(\lambda)\|_2 = \Delta_k$. Then with $\lambda_k = \lambda$ and $d_k = d(\lambda)$ condition (13) is met. Suppose that the matrix $H_k$ has the eigenvalues $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_n$ with the corresponding orthonormal eigenvectors $v_1, v_2, \dots, v_n$. Starting from the representation $g_k = \sum_{\nu=1}^{n} \beta_\nu v_\nu$, we obtain the Levenberg–Marquardt trajectory $d(\lambda)$ for $\lambda \notin \{-\alpha_1, \dots, -\alpha_n\}$:
$$d(\lambda) = -\sum_{\nu=1}^{n} \frac{\beta_\nu}{\alpha_\nu + \lambda}\, v_\nu , \qquad \text{consequently} \qquad \|d(\lambda)\|_2^2 = \sum_{\nu=1}^{n} \frac{\beta_\nu^2}{(\alpha_\nu + \lambda)^2} .$$
Hence, $\|d(\cdot)\|_2$ is strictly antitone in $(-\alpha_n, \infty)$ with
$$\|d(\lambda)\|_2 \longrightarrow 0 \ \text{ for } \lambda \longrightarrow \infty \qquad \text{and, if } \beta_n \ne 0 , \qquad \|d(\lambda)\|_2 \longrightarrow \infty \ \text{ for } \lambda \downarrow -\alpha_n .$$
If $\beta_n \ne 0$, there exists therefore exactly one $\lambda \in (-\alpha_n, \infty)$ such that $\|d(\lambda)\|_2 = \Delta_k$.

Closer Case Differentiation:
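We consider $J := \{\nu \in \{1, \dots, n\} \mid \alpha_\nu = \alpha_n\}$. In general there are three cases:

i) $\alpha_n > 0$: Then $d(0)$ is defined, since 0 is not an eigenvalue of $H_k$. If $\|d(0)\|_2 \le \Delta_k$, then $\lambda_k := 0$ and $d_k := d(0)$ solve conditions (13). In the remaining case $\|d(0)\|_2 > \Delta_k$ there exists exactly one $\lambda_k > 0$ such that $\|d(\lambda_k)\|_2 = \Delta_k$.

ii) $\alpha_n \le 0$ and suppose that there exists a $\nu_0 \in J$ with $\beta_{\nu_0} \ne 0$:
$$\|d(\lambda)\|_2^2 \ge \frac{\beta_{\nu_0}^2}{(\alpha_n + \lambda)^2} \uparrow \infty \quad \text{for } \lambda \downarrow -\alpha_n \ge 0 .$$
Then there exists exactly one $\lambda_k$ in the interval $(-\alpha_n, \infty)$ such that $\|d(\lambda_k)\|_2 = \Delta_k$.

iii) $\alpha_n \le 0$ and suppose $\beta_\nu = 0$ for all $\nu \in J$:
$$d(\lambda) = -\sum_{\nu \notin J} \frac{\beta_\nu}{\alpha_\nu + \lambda}\, v_\nu \longrightarrow -\sum_{\nu \notin J} \frac{\beta_\nu}{\alpha_\nu - \alpha_n}\, v_\nu =: \bar d \quad \text{for } \lambda \downarrow -\alpha_n .$$
If $\|\bar d\|_2 \le \Delta_k$ we set $\lambda_k := -\alpha_n$. Every vector $d$ of the form $\bar d + \sum_{\nu \in J} \gamma_\nu v_\nu$ ($\gamma_\nu \in \mathbb{R}$ for $\nu \in J$) with $\|d\|_2 = \Delta_k$ solves (13). If $\Delta_k < \|\bar d\|_2$, then there exists exactly one $\lambda_k \in (-\alpha_n, \infty)$ such that $\|d(\lambda_k)\|_2 = \Delta_k$.

Solution of the Equation $\|d(\lambda)\|_2 = \Delta_k$:
By definition of $d(\lambda)$ it holds that
$$d(\lambda) = -(H_k + \lambda I)^{-1} g_k \quad \text{respectively} \quad (H_k + \lambda I)\, d(\lambda) = -g_k .$$
Differentiation gives
$$d(\lambda) + (H_k + \lambda I)\, d'(\lambda) = 0$$
and
$$\frac{d}{d\lambda} \langle d(\lambda), d(\lambda) \rangle = 2\, \langle d(\lambda), d'(\lambda) \rangle = 2\, \bigl\langle d(\lambda), -(H_k + \lambda I)^{-1} d(\lambda) \bigr\rangle .$$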
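$N := \|d(\cdot)\|_2$ has a pole in $-\alpha_n$. Therefore, we consider
$$\psi(\lambda) := \frac{1}{N(\lambda)} - \frac{1}{\Delta_k}$$
and solve the equation $\psi(\lambda) = 0$ using Newton’s method
$$\lambda_k^{(r+1)} = \lambda_k^{(r)} - \frac{\psi\bigl(\lambda_k^{(r)}\bigr)}{\psi'\bigl(\lambda_k^{(r)}\bigr)} .$$
With $\psi'(\lambda) = -N'(\lambda) / N(\lambda)^2$ we obtain
$$\lambda_k^{(r+1)} = \lambda_k^{(r)} + \Bigl(\frac{1}{N\bigl(\lambda_k^{(r)}\bigr)} - \frac{1}{\Delta_k}\Bigr) \frac{N\bigl(\lambda_k^{(r)}\bigr)^2}{N'\bigl(\lambda_k^{(r)}\bigr)} = \lambda_k^{(r)} + \Bigl(1 - \frac{N\bigl(\lambda_k^{(r)}\bigr)}{\Delta_k}\Bigr) \frac{N\bigl(\lambda_k^{(r)}\bigr)}{N'\bigl(\lambda_k^{(r)}\bigr)}$$
$$= \lambda_k^{(r)} + \Bigl(1 - \frac{\bigl\|d\bigl(\lambda_k^{(r)}\bigr)\bigr\|_2}{\Delta_k}\Bigr) \frac{\bigl\|d\bigl(\lambda_k^{(r)}\bigr)\bigr\|_2^2}{\bigl\langle d\bigl(\lambda_k^{(r)}\bigr), d'\bigl(\lambda_k^{(r)}\bigr) \bigr\rangle} .$$
$d\bigl(\lambda_k^{(r)}\bigr)$ and $\bigl\langle d\bigl(\lambda_k^{(r)}\bigr), d'\bigl(\lambda_k^{(r)}\bigr) \bigr\rangle$ can be calculated as follows: If $H_k + \lambda_k^{(r)} I$ is positive definite, the Cholesky decomposition
$$H_k + \lambda_k^{(r)} I = L L^T \quad (L \text{ lower triangular matrix})$$
gives $d\bigl(\lambda_k^{(r)}\bigr)$ as the solution of $L L^T u = -g_k$. Furthermore it holds that
$$\bigl\langle d\bigl(\lambda_k^{(r)}\bigr), d'\bigl(\lambda_k^{(r)}\bigr) \bigr\rangle = \bigl\langle d\bigl(\lambda_k^{(r)}\bigr), -\bigl(H_k + \lambda_k^{(r)} I\bigr)^{-1} d\bigl(\lambda_k^{(r)}\bigr) \bigr\rangle = -\bigl\langle d\bigl(\lambda_k^{(r)}\bigr), (L L^T)^{-1} d\bigl(\lambda_k^{(r)}\bigr) \bigr\rangle$$
$$= -\bigl\langle d\bigl(\lambda_k^{(r)}\bigr), (L^T)^{-1} L^{-1} d\bigl(\lambda_k^{(r)}\bigr) \bigr\rangle = -\langle w, w \rangle \quad \text{with } w := L^{-1} d\bigl(\lambda_k^{(r)}\bigr) .$$

Remark The equation $\|d(\lambda)\|_2 = \Delta_k$ or its variants have only to be solved with low accuracy. We demand, for example,
$$\|d(\lambda)\|_2 \in \bigl[\,0.75\, \Delta_k ,\; 1.5\, \Delta_k\,\bigr] .$$
Then normally a few steps of the (scalar) Newton method suffice.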
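Notation:
$$r_k := \frac{f(x^{(k)}) - f(x^{(k)} + d_k)}{f(x^{(k)}) - \varphi_k(x^{(k)} + d_k)}$$
The numerator gives the reduction of the function $f$, the denominator that of the modeling function $\varphi_k$. The above considerations lead to the following

Algorithm
0) Let $x^{(k)}$ and $\Delta_k$ be given; calculate corresponding $g_k$ and $H_k$.
1) Solve approximately: $\varphi_k(x^{(k)} + d) \longrightarrow \min$ subject to the constraint $\|d\|_2 \le \Delta_k$.
2) Calculate $r_k$.
3) If $r_k < 1/4$: $\Delta_{k+1} := \frac{1}{2} \Delta_k$ and $x^{(k+1)} := x^{(k)}$;
   else, that is, $r_k \ge 1/4$: If $r_k > 3/4$ and $\|d_k\|_2 = \Delta_k$: $\Delta_{k+1} := 2\, \Delta_k$; else: $\Delta_{k+1} := \Delta_k$; $x^{(k+1)} := x^{(k)} + d_k$.
In 0) we assume that it has been set how to calculate $H_k$ each time.

Example 5
$x = (x_1, x_2)^T \in \mathbb{R}^2$, $f(x) := x_1^4 + x_1^2 + x_2^2$,
$$\nabla f(x) = \begin{pmatrix} 4 x_1^3 + 2 x_1 \\ 2 x_2 \end{pmatrix} , \qquad \nabla^2 f(x) = \begin{pmatrix} 12 x_1^2 + 2 & 0 \\ 0 & 2 \end{pmatrix} ,$$
$$x^{(0)} := \begin{pmatrix} 1 \\ 1 \end{pmatrix} , \quad H_0 := \nabla^2 f(x^{(0)}) = \begin{pmatrix} 14 & 0 \\ 0 & 2 \end{pmatrix} , \quad \Delta_0 := \tfrac{1}{2} , \quad \lambda_0^{(0)} := 0 ,$$
$$\varphi_0(x^{(0)} + d) := f_0 + g_0^T d + \tfrac{1}{2}\, d^T H_0 d ,$$
$$g_0 = \begin{pmatrix} 6 \\ 2 \end{pmatrix} , \quad d(\lambda) = -(H_0 + \lambda I)^{-1} g_0 , \quad L = \begin{pmatrix} \sqrt{14} & 0 \\ 0 & \sqrt{2} \end{pmatrix} ,$$
$$d(\lambda_0^{(0)}) = -H_0^{-1} g_0 = \begin{pmatrix} -\tfrac{3}{7} \\ -1 \end{pmatrix} , \quad \|d(\lambda_0^{(0)})\|_2 = 1.0880 ,$$
$$L w = d(\lambda_0^{(0)}) , \quad w = \begin{pmatrix} -\tfrac{3}{7 \sqrt{14}} \\ -\tfrac{1}{\sqrt{2}} \end{pmatrix} , \quad \langle w, w \rangle = \frac{352}{14 \cdot 49} = 0.5131 ,$$
$$\lambda_0^{(1)} = 0 + \Bigl(1 - \frac{1.0880}{0.5}\Bigr) \frac{1.0880^2}{-0.5131} = 2.713 ,$$
$$d(\lambda_0^{(1)}) = \begin{pmatrix} -0.3590 \\ -0.4244 \end{pmatrix} , \quad \|d(\lambda_0^{(1)})\|_2 = 0.556 \in [\,0.375 ,\, 0.75\,] .$$
Hence, the value is already within the demanded interval $[\,0.75\, \Delta_0 ,\, 1.5\, \Delta_0\,]$. Therefore, we actually do not need the computation in the next three lines. They only serve to show the quality of the next step.
$$w = \begin{pmatrix} -0.0878 \\ -0.1955 \end{pmatrix} , \quad \langle w, w \rangle = 0.0459 ,$$
$$\lambda_0^{(2)} = 2.713 + \Bigl(1 - \frac{0.556}{0.5}\Bigr) \frac{0.556^2}{-0.0459} = 3.467 ,$$
$$d(\lambda_0^{(2)}) = \begin{pmatrix} -0.3435 \\ -0.3658 \end{pmatrix} , \quad \|d(\lambda_0^{(2)})\|_2 = 0.5018 .$$
With $d_0 := d(\lambda_0^{(1)})$ we get
$$x^{(1)} = x^{(0)} + d_0 = \begin{pmatrix} 0.6410 \\ 0.5756 \end{pmatrix} ,$$
$$f(x^{(0)}) = 3 , \quad f(x^{(0)} + d_0) = 0.9110 , \quad \varphi_0(x^{(0)} + d_0) = 1.0795 ,$$
$$r_0 = \frac{2.089}{1.9205} , \quad \Delta_1 := 1 .$$

[Figure: modeling function $\varphi_0$]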
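The Newton iteration for $\psi(\lambda) = 0$ can be replayed on the data of example 5. The sketch below is restricted to a diagonal matrix $H_k$ (an assumption that happens to hold in example 5), so that neither a Cholesky decomposition nor a linear solver is needed; the function name `lm_newton_step` is hypothetical.

```python
import math


def lm_newton_step(h_diag, g, delta, lam):
    """One Newton step for psi(lam) = 1/||d(lam)|| - 1/delta, where
    H = diag(h_diag) is assumed diagonal and H + lam*I positive definite.
    Returns the new parameter value and the vector d(lam) at the old one."""
    # d(lam) = -(H + lam I)^{-1} g, componentwise for a diagonal H
    d = [-gi / (hi + lam) for hi, gi in zip(h_diag, g)]
    norm_sq = sum(di * di for di in d)
    # <d, d'> = -<d, (H + lam I)^{-1} d> = -<w, w>
    d_dprime = -sum(di * di / (hi + lam) for hi, di in zip(h_diag, d))
    norm = math.sqrt(norm_sq)
    lam_new = lam + (1.0 - norm / delta) * norm_sq / d_dprime
    return lam_new, d


h0, g0, delta0 = [14.0, 2.0], [6.0, 2.0], 0.5   # data of example 5
lam1, d_at_0 = lm_newton_step(h0, g0, delta0, 0.0)
lam2, d_at_1 = lm_newton_step(h0, g0, delta0, lam1)
print(lam1, d_at_1, lam2)
```

The first step reproduces $\lambda_0^{(1)} \approx 2.713$ and $d(\lambda_0^{(1)}) \approx (-0.3590, -0.4244)^T$. The second value differs slightly from the $3.467$ of the example in the third decimal place, because the example works with rounded intermediate results.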
To conclude these discussions, we want to introduce Powell’s dogleg trajectory (see footnote 6). We will see that it can be regarded as an approximation of the Levenberg–Marquardt trajectory, replacing it with a path consisting of two line segments. Solving the nonlinear equation $(H_k + \lambda I)\, d(\lambda) = -g_k$ just in order to get a new direction $d_k$ for the current value $\Delta_k$ seems a disproportionate effort. The dogleg method fixes a new direction more directly.

Footnote 6: In golf, “dogleg” is a reference for the direction of a golf hole. While many holes are designed in a straight line from the tee-off point to the green, some of the holes may bend somewhat to the right or left. This is called a “dogleg”, referencing the partial bend at the knee of a dog’s leg. On rare occasions, a hole’s direction can bend twice. This is called a “double dogleg”.

Here again we are dealing with the problem
$$\varphi_k(x^{(k)} + d) = f_k + g_k^T d + \tfrac{1}{2}\, d^T H_k d \longrightarrow \min , \qquad \|d\|_2 \le \Delta_k .$$
We assume that the matrix $H_k$ is positive definite. Because of
$$\frac{d}{d\lambda}\, \varphi_k\bigl(x^{(k)} - \lambda g_k\bigr) = 0 \iff \lambda = \frac{\|g_k\|_2^2}{\langle g_k, H_k g_k \rangle} ,$$
we obtain the minimum point
$$x^S := x^{(k)} + d^S \quad \text{with} \quad d^S := -\frac{\|g_k\|_2^2}{\langle g_k, H_k g_k \rangle}\, g_k$$
in the direction of steepest descent and as the unconstrained minimum point
$$x^N := x^{(k)} + d^N \quad \text{with} \quad d^N := -H_k^{-1} g_k .$$
We will prove: For the steepest descent direction $d^S$ and the quasi-Newton direction $d^N$ it holds that $\|d^S\|_2 \le \|d^N\|_2$. This follows from the estimate
$$\|d^S\|_2 = \frac{\|g_k\|_2^3}{\langle g_k, H_k g_k \rangle} \le \frac{\|g_k\|_2^3}{\langle g_k, H_k g_k \rangle} \cdot \frac{\|g_k\|_2\, \|H_k^{-1} g_k\|_2}{\langle g_k, H_k^{-1} g_k \rangle} = \frac{\|g_k\|_2^2\, \|g_k\|_2^2}{\langle g_k, H_k g_k \rangle\, \langle g_k, H_k^{-1} g_k \rangle} \cdot \|d^N\|_2 ,$$
in which the last fraction is $\le 1$. The latter is, for example, a result of the proof of Kantorovich’s inequality (cf. page 101), if we take into account the inequality
$$\Bigl(\sum_{\nu=1}^{n} u_\nu \alpha_\nu\Bigr) \Bigl(\sum_{\nu=1}^{n} \frac{u_\nu}{\alpha_\nu}\Bigr) \ge 1$$
for the weighted arithmetic and harmonic mean (cf. e.g. [HLP], pp. 12–18).

Following Powell we choose $x^{(k+1)}$ such that:
1) $\|d^N\|_2 \le \Delta_k$: $x^{(k+1)} := x^N$
2) $\|d^S\|_2 \ge \Delta_k$: $x^{(k+1)} := x^{(k)} - \Delta_k\, \dfrac{g_k}{\|g_k\|_2}$
3) $\|d^S\|_2 < \Delta_k < \|d^N\|_2$: There exists exactly one $\lambda \in (0, 1)$ such that $\|d^S + \lambda (d^N - d^S)\|_2 = \Delta_k$. Set $x^{(k+1)} := (1 - \lambda)\, x^S + \lambda\, x^N$.

[Figure: dogleg path from $x^{(k)}$ via $x^S$ towards $x^N$, showing the direction $-g_k$, the Newton direction $d^N = -H_k^{-1} g_k$, the trust region radius $\Delta_k$ and the new iterate $x^{(k+1)}$]
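Powell's three cases can be sketched as a short function. In this illustration the Newton direction $d^N = -H_k^{-1} g_k$ is supplied by the caller (we assume it has been computed elsewhere, for example via a Cholesky decomposition); the name `dogleg_step` and its signature are our own. In case 3 the parameter $\lambda$ is the positive root of the quadratic equation $\|d^S + \lambda (d^N - d^S)\|_2^2 = \Delta_k^2$.

```python
import math


def dogleg_step(g, H, d_newton, delta):
    """Dogleg step d with x^(k+1) = x^(k) + d, for a positive definite H
    (list of rows), gradient g != 0 and a precomputed d_newton = -H^{-1} g."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    norm = lambda v: math.sqrt(dot(v, v))
    if norm(d_newton) <= delta:                       # case 1: Newton step fits
        return list(d_newton)
    Hg = [dot(row, g) for row in H]
    t = dot(g, g) / dot(g, Hg)
    d_s = [-t * gi for gi in g]                       # steepest descent minimizer d^S
    if norm(d_s) >= delta:                            # case 2: scaled gradient step
        return [-delta * gi / norm(g) for gi in g]
    # case 3: solve ||d_s + lam * p||^2 = delta^2 with p = d_newton - d_s
    p = [dn - ds for dn, ds in zip(d_newton, d_s)]
    a, b, c = dot(p, p), dot(d_s, p), dot(d_s, d_s) - delta ** 2
    lam = (-b + math.sqrt(b * b - a * c)) / a         # root in (0, 1), since c < 0
    return [ds + lam * pi for ds, pi in zip(d_s, p)]


# Data of example 5: here ||d^S|| < 0.5 < ||d^N||, so case 3 applies.
g0, H0 = [6.0, 2.0], [[14.0, 0.0], [0.0, 2.0]]
dN = [-3.0 / 7.0, -1.0]
d_mid = dogleg_step(g0, H0, dN, 0.5)
d_big = dogleg_step(g0, H0, dN, 2.0)    # case 1
d_small = dogleg_step(g0, H0, dN, 0.1)  # case 2
print(d_mid, d_big, d_small)
```

Compared with the Newton iteration on the Levenberg–Marquardt trajectory, only one linear system (for $d^N$) and one scalar quadratic equation have to be solved per radius $\Delta_k$.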
Example 6 In order to apply the trust region method to the framework example, we utilize the function fminunc, with the option optimset(’GradObj’, ’on’), from the Matlab® Optimization Toolbox. For the starting point $w^{(0)} = (0, 0, 0)^T$ and the regularization parameter $\lambda = 1$, six iteration steps suffice to reach the tolerance of $10^{-7}$. The approximate solution of the quadratic subproblem (cf. step 1 of the trust region algorithm on p. 115) is restricted to a two-dimensional subspace which is determined with the aid of a (preconditioned) conjugate gradient process (cf. section 3.4).

  k    w_1^(k)        w_2^(k)       w_3^(k)       f(w^(k))      ‖∇f(w^(k))‖_∞
  0     0              0             0             8.31776617    6.00000000
  1    −0.40705741     0.62664555    0             5.75041913    0.79192755
  2    −0.49570677     0.74745305    0.04755343    5.68031561    0.05564070
  3    −0.50487306     0.75569907    0.06543416    5.67972329    0.01257655
  4    −0.50447655     0.75299924    0.06972656    5.67970043    0.00221882
  5    −0.50485565     0.75335803    0.07043384    5.67969942    0.00056029
  6    −0.50483824     0.75323767    0.07062618    5.67969938    0.00010095
Least Squares Problems

In applications there often occur problems in which the function to be minimized has the following special form
$$F(x) = \frac{1}{2} \sum_{\mu=1}^{m} f_\mu(x)^2$$
with an $m \in \mathbb{N}$. For that suppose that twice continuously differentiable functions $f_\mu : \mathbb{R}^n \longrightarrow \mathbb{R}$ are given. With $f(x) := (f_1(x), \dots, f_m(x))^T$ for $x \in \mathbb{R}^n$ we then get
$$F(x) = \tfrac{1}{2}\, \langle f(x), f(x) \rangle = \tfrac{1}{2}\, \|f(x)\|_2^2 .$$
Parameter estimates via the least squares method as well as systems of nonlinear equations are examples of this kind of problem. In the following we will discuss methods which make use of the special form of such objective functions. With the help of the Jacobian of $f$
$$J(x) := \Bigl(\frac{\partial f_\mu}{\partial x_\nu}(x)\Bigr) \in \mathbb{R}^{m \times n}$$
we obtain the gradient
$$g(x) := \nabla F(x) = J(x)^T f(x) ,$$
and the Hessian is
$$H(x) := \nabla^2 F(x) = J(x)^T J(x) + R(x) \qquad (14)$$
with
$$R(x) := \sum_{\mu=1}^{m} f_\mu(x)\, H_\mu(x) \quad \text{and} \quad H_\mu := \nabla^2 f_\mu .$$
From (14) we can deduce that $J(x)^T J(x)$, in which only first-order derivatives occur, is an essential part of the Hessian in the vicinity of a minimizer $x^*$ if, for example, $F(x^*) = 0$, or if all $f_\mu$ are affinely linear. In the latter case $H_\mu = 0$ holds.

Starting from the current approximation $x^{(k)}$ of the minimum point $x^*$ with $f_k := f(x^{(k)})$, $J_k := J(x^{(k)})$ and $R_k := R(x^{(k)})$, $x^{(k+1)} = x^{(k)} + d_k$ can be calculated as follows:

a) Newton–Raphson Method
$$\bigl(J_k^T J_k + R_k\bigr)\, d_k = -J_k^T f_k \qquad \text{(cf. (14) with } H_k := H(x^{(k)}))$$
Advantage: The algorithm is, with standard assumptions, (locally) quadratically convergent.
Disadvantage: Second-order partial derivatives are needed explicitly.

b) Gauß–Newton Method
$$J_k^T J_k\, d_k = -J_k^T f_k$$
In this case there occur only first-order partial derivatives of $f$. The method is at the most quadratically convergent. Slow convergence can occur in particular if $J(x^*)^T J(x^*)$ is not positive definite.

c) Levenberg–Marquardt Method
$$\bigl(J_k^T J_k + \lambda_k I\bigr)\, d_k = -J_k^T f_k , \quad \lambda_k \ge 0$$
Marquardt (1963) provided first ‘ad hoc’ formulations for the choice of $\lambda_k$. A more specific control of $\lambda_k$ is possible if we regard the Levenberg–Marquardt method as a trust region method.

There are many modifications of this method of which we only want to mention the method introduced by Dennis, Gay and Welsch in 1981 (cf. [DGW]). It takes account of the term $R_k$ in the Hessian ignored by the Gauß–Newton method and uses quasi-Newton updates to approximate it. We do not go into details because this is beyond the scope of our presentation.
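To make the Gauß–Newton and Levenberg–Marquardt steps concrete, here is a minimal sketch for a problem with a single unknown ($n = 1$), so that $J_k^T J_k$ is a scalar and no linear solver is required. The model $f_\mu(x) = e^{x t_\mu} - y_\mu$, the data and the function name `gauss_newton_1d` are hypothetical and chosen only for illustration; the damping parameter `lam` is kept fixed, which is a simplification of the trust region control described above.

```python
import math


def gauss_newton_1d(ts, ys, x, lam=0.0, tol=1e-12, max_iter=50):
    """Minimize F(x) = 1/2 * sum_mu (exp(x * t_mu) - y_mu)^2 in one unknown x.
    lam = 0 gives the Gauss-Newton step, lam > 0 a Levenberg-Marquardt step
    with a fixed (not adaptively controlled) parameter."""
    for _ in range(max_iter):
        f = [math.exp(x * t) - y for t, y in zip(ts, ys)]   # residuals f_mu(x)
        J = [t * math.exp(x * t) for t in ts]               # Jacobian (m x 1)
        JTf = sum(j * r for j, r in zip(J, f))              # J^T f
        JTJ = sum(j * j for j in J)                         # J^T J (a scalar here)
        step = -JTf / (JTJ + lam)                           # (J^T J + lam) d = -J^T f
        x += step
        if abs(step) < tol:
            break
    return x


ts = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * t) for t in ts]   # synthetic data with exact parameter 0.5
x_gn = gauss_newton_1d(ts, ys, x=0.0)
print(x_gn)
```

Since the data are generated without noise, $F(x^*) = 0$ holds at $x^* = 0.5$, which is the situation in which $J^T J$ carries the essential part of the Hessian and the Gauß–Newton method converges fast.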
3.4 Conjugate Gradient Method

The conjugate gradient (CG) method was introduced in 1952 by Hestenes and Stiefel to solve large systems of linear equations with a sparse symmetric and positive definite coefficient matrix. For that they made use of the equivalence
$$A x = -b \iff f(x) = \min_{z \in \mathbb{R}^n} f(z)$$
to the minimization problem for the convex quadratic objective function $f$ defined by $f(z) := \tfrac{1}{2}\, \langle z, A z \rangle + \langle b, z \rangle$, whose well-known gradient is $\nabla f(z) = A z + b$. This iterative method has proved to be particularly efficient since the main effort per iteration step lies in the calculation of one matrix-vector product and is therefore very small. Originally the CG-method was not designed as a minimization method, at most to minimize quadratic functions. Only after Fletcher and Reeves (1964) suggested it was the method used to minimize ‘any kind’ of function.

For the remainder of this section let $n \in \mathbb{N}$, $A$ a real-valued symmetric and positive definite $(n, n)$-matrix, $b \in \mathbb{R}^n$ and, corresponding to that, the quadratic objective function $f$ be defined by
$$f(x) := \tfrac{1}{2}\, \langle x, A x \rangle + \langle b, x \rangle .$$

Definition For $m \in \mathbb{N}$ the vectors $d_1, \dots, d_m \in \mathbb{R}^n \setminus \{0\}$ are called A-conjugate if and only if $\langle d_\nu, A d_\mu \rangle = 0$ for all $1 \le \nu < \mu \le m$.

Remark Such $d_1, \dots, d_m$ are linearly independent. In particular $m \le n$ holds.
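Using the definiteness of $A$, this can be proved immediately by applying the definition.

Proposition 3.4.1 To A-conjugate vectors $d_0, \dots, d_{n-1}$ and any starting vector $x^{(0)} \in \mathbb{R}^n$ let $x^{(1)}, \dots, x^{(n)}$ be recursively defined by
$$x^{(k+1)} := x^{(k)} + \lambda_k d_k \quad \text{with} \quad \lambda_k := -\frac{\langle g_k, d_k \rangle}{\langle d_k, A d_k \rangle} .$$
Then it holds that
$$f\bigl(x^{(k+1)}\bigr) = \min_{\lambda \in \mathbb{R}} f\bigl(x^{(k)} + \lambda d_k\bigr) , \quad \langle g_{k+1}, d_k \rangle = 0 \quad \text{and} \quad f\bigl(x^{(n)}\bigr) = \min_{x \in \mathbb{R}^n} f(x) .$$

Proof: The simple derivation of the relations $\langle g_{k+1}, d_k \rangle = 0$ and $f(x^{(k+1)}) = \min\{f(x^{(k)} + \lambda d_k) \mid \lambda \in \mathbb{R}\}$ is familiar to us from earlier discussions of descent methods. However, we will give a short explanation:
$$0 = \frac{d}{d\lambda}\, f\bigl(x^{(k)} + \lambda d_k\bigr)\Big|_{\lambda = \lambda_k} = f'\bigl(x^{(k+1)}\bigr)\, d_k = \langle g_{k+1}, d_k \rangle$$
and
$$\langle g_{k+1}, d_k \rangle = \bigl\langle A x^{(k)} + \lambda_k A d_k + b ,\, d_k \bigr\rangle = \langle g_k, d_k \rangle + \lambda_k\, \langle A d_k, d_k \rangle .$$
The vectors $d_0, \dots, d_{n-1}$ are linearly independent since we have assumed them to be A-conjugate. Hence, they form a basis of $\mathbb{R}^n$. From
$$x^{(n)} = x^{(n-1)} + \lambda_{n-1} d_{n-1} = \cdots = x^{(j+1)} + \sum_{\nu=j+1}^{n-1} \lambda_\nu d_\nu$$
for $0 \le j \le n - 1$ follows (see footnote 7)
$$g_n = g_{j+1} + \sum_{\nu=j+1}^{n-1} \lambda_\nu A d_\nu$$
and from that
$$\langle g_n, d_j \rangle = \langle g_{j+1}, d_j \rangle + \sum_{\nu=j+1}^{n-1} \lambda_\nu\, \langle A d_\nu, d_j \rangle = 0 ,$$
since both $\langle g_{j+1}, d_j \rangle = 0$ and $\langle A d_\nu, d_j \rangle = 0$ for $\nu > j$; hence, $g_n = 0$. This gives $f(x^{(n)}) = \min_{x \in \mathbb{R}^n} f(x)$. □

Footnote 7: It is not necessary to remind ourselves of $\nabla f(x) = A x + b$, is it?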
In order to obtain an algorithm from proposition 3.4.1, we need A-conjugate vectors; one talks of A-conjugate directions. First of all, we will give the

Remark There exists an orthonormal basis of eigenvectors of $A$. Its vectors are evidently A-conjugate.

This observation is more of a theoretical nature. The following method, in which the conjugate directions $d_k$ are calculated at the same time as the $x^{(k)}$, is more elegant.

Generation of A-Conjugate Directions

Hestenes–Stiefel CG-Algorithm
0) Let $g_0 := \nabla f(x^{(0)}) = A x^{(0)} + b$ and $d_0 := -g_0$ to a given $x^{(0)} \in \mathbb{R}^n$. (Starting step: steepest descent)
For $k = 0, 1, \dots, n - 1$: If $g_k = 0$: STOP; $x^{(k)}$ is the minimal point of $f$. Else:
1) $x^{(k+1)} := x^{(k)} + \lambda_k d_k$, where $\lambda_k := -\dfrac{\langle g_k, d_k \rangle}{\langle A d_k, d_k \rangle}$ (exact line search). Hence, we have $f(x^{(k+1)}) = \min_{\lambda \in \mathbb{R}} f(x^{(k)} + \lambda d_k)$.
2) $g_{k+1} := \nabla f(x^{(k+1)})$; $d_{k+1} := -g_{k+1} + \gamma_{k+1} d_k$ with
$$\gamma_{k+1} := \frac{\langle g_{k+1}, g_{k+1} \rangle}{\langle g_k, g_k \rangle} = \frac{\|g_{k+1}\|_2^2}{\|g_k\|_2^2} .$$
In order to calculate $d_{k+1}$, we only use the preceding vector $d_k$, besides $g_k$ and $g_{k+1}$.

Before making some more basic comments about this algorithm, especially that it is well-defined, we will give an example:

Example 7 (cf. page 100 f)
$$f(x) := \tfrac{1}{2} \bigl(x_1^2 + 9 x_2^2\bigr) , \quad A := \begin{pmatrix} 1 & 0 \\ 0 & 9 \end{pmatrix} , \quad b := 0 , \quad \nabla f(x) = \begin{pmatrix} x_1 \\ 9 x_2 \end{pmatrix} ,$$
$$x^{(0)} := \begin{pmatrix} 9 \\ 1 \end{pmatrix} , \quad g_0 = 9 \begin{pmatrix} 1 \\ 1 \end{pmatrix} , \quad d_0 = -9 \begin{pmatrix} 1 \\ 1 \end{pmatrix} , \quad \lambda_0 = \frac{2}{10} = \frac{1}{5} ,$$
$$x^{(1)} = \frac{4}{5} \begin{pmatrix} 9 \\ -1 \end{pmatrix} , \quad g_1 = \frac{36}{5} \begin{pmatrix} 1 \\ -1 \end{pmatrix} ,$$
$$d_1 = -g_1 + \frac{\langle g_1, g_1 \rangle}{\langle g_0, g_0 \rangle}\, d_0 = -\frac{36}{5} \begin{pmatrix} 1 \\ -1 \end{pmatrix} - \frac{\bigl(\frac{36}{5}\bigr)^2 \cdot 2}{81 \cdot 2} \cdot 9 \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{36}{25} \begin{pmatrix} -9 \\ 1 \end{pmatrix} ,$$
$$\lambda_1 = \frac{50}{90} = \frac{5}{9} , \qquad x^{(2)} = x^{(1)} + \lambda_1 d_1 = 0 .$$

Important properties of this method are contained in the following

Proposition 3.4.2 There exists a smallest $m \in \{0, \dots, n\}$ such that $d_m = 0$. With that we have $g_m = 0$ and for all $\ell \in \{0, \dots, m\}$, $0 \le \nu < k \le \ell$ and $0 \le r \le \ell$:
a) $\langle d_\nu, g_k \rangle = 0$
b) $\langle g_\nu, g_k \rangle = 0$
c) $\langle d_\nu, A d_k \rangle = 0$
d) $\langle g_r, d_r \rangle = -\langle g_r, g_r \rangle$.
$x^{(m)}$ then gives the minimum of $f$.

If $g_m = 0$ for an $m \in \{1, \dots, n\}$, then 2) shows $d_m = 0$. The other way round it follows from $d_m = 0$ via
$$\|d_m\|_2^2 = \|g_m\|_2^2 - 2\, \gamma_m\, \langle g_m, d_{m-1} \rangle + \gamma_m^2\, \|d_{m-1}\|_2^2 ,$$
where $\langle g_m, d_{m-1} \rangle = 0$ and $\gamma_m^2\, \|d_{m-1}\|_2^2 \ge 0$, that $\|d_m\|_2 \ge \|g_m\|_2$, therefore $g_m = 0$. The CG-method is in particular well-defined, since (because $A$ is positive definite) $\gamma_{r+1}$ and $\lambda_r$ are defined for $0 \le r \le m - 1$. Here, following d), $\lambda_r$ is positive. In addition, by c) the vectors $d_0, \dots, d_{m-1}$ are A-conjugate, hence linearly independent. This proves $m \le n$.

The name ‘conjugate gradient method’ is unfortunately chosen, since not the gradients but the directions $d_k$ are A-conjugate.

Proof: Denote by $A(\ell)$ the validity of the assertions a) to d) for all $0 \le \nu < k \le \ell$ and $0 \le r \le \ell$. Hence, we have to prove $A(\ell)$ for $\ell \in \{0, \dots, m\}$:
$A(0)$: Assertions a) to c) are empty and consequently true. d) holds by definition of $d_0$.
For $0 \le \ell < m$ we will show $A(\ell) \Longrightarrow A(\ell + 1)$. (In particular $g_\ell \ne 0$ and $d_\ell \ne 0$ hold, since $\ell < m$.)
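a) From $g_{\ell+1} = \nabla f(x^{(\ell+1)}) = A x^{(\ell+1)} + b = g_\ell + \lambda_\ell A d_\ell$ it follows for $\nu < \ell + 1$ that
$$\langle d_\nu, g_{\ell+1} \rangle = \langle d_\nu, g_\ell \rangle + \lambda_\ell\, \langle d_\nu, A d_\ell \rangle = \begin{cases} 0 & \text{for } \nu = \ell \ \text{(by definition of } \lambda_\ell) \\ 0 & \text{for } \nu < \ell \ \text{(both terms vanish by } A(\ell)) . \end{cases}$$

b) We have to prove $\langle g_\nu, g_{\ell+1} \rangle = 0$ for $\nu < \ell + 1$:
$$\langle g_\nu, g_{\ell+1} \rangle = \langle -d_\nu + \gamma_\nu d_{\nu-1} ,\, g_{\ell+1} \rangle \quad (\text{with } d_{-1} := 0 \text{ and } \gamma_0 := 0)$$
$$= -\langle d_\nu, g_{\ell+1} \rangle + \gamma_\nu\, \langle d_{\nu-1}, g_{\ell+1} \rangle = 0 ,$$
since both inner products vanish by a).

d) $\langle g_{\ell+1}, d_{\ell+1} \rangle = \langle g_{\ell+1}, -g_{\ell+1} + \gamma_{\ell+1} d_\ell \rangle = -\|g_{\ell+1}\|_2^2$ by a).

c) We have to prove $\langle d_\nu, A d_{\ell+1} \rangle = 0$ for $\nu < \ell + 1$: Since $g_{\nu+1} - g_\nu = \lambda_\nu A d_\nu$ it holds that
$$\langle d_\nu, A d_{\ell+1} \rangle = \langle d_\nu, A(-g_{\ell+1} + \gamma_{\ell+1} d_\ell) \rangle = -\langle A d_\nu, g_{\ell+1} \rangle + \gamma_{\ell+1}\, \langle d_\nu, A d_\ell \rangle$$
$$= \begin{cases} \dfrac{1}{\lambda_\nu}\, \langle g_\nu - g_{\nu+1}, g_{\ell+1} \rangle + \gamma_{\ell+1}\, \langle d_\nu, A d_\ell \rangle = 0 & \text{for } \nu < \ell \ \text{(first term by b), second by } A(\ell)) \\[2mm] -\dfrac{1}{\lambda_\ell}\, \|g_{\ell+1}\|_2^2 + \gamma_{\ell+1}\, \langle d_\ell, A d_\ell \rangle = 0 & \text{for } \nu = \ell . \end{cases}$$
The latter follows with
$$\gamma_{\ell+1} = \frac{\|g_{\ell+1}\|_2^2}{\|g_\ell\|_2^2} \quad \text{and} \quad \lambda_\ell = -\frac{\langle g_\ell, d_\ell \rangle}{\langle d_\ell, A d_\ell \rangle} = \frac{\|g_\ell\|_2^2}{\langle d_\ell, A d_\ell \rangle} \ \text{ (by d))} .$$
□

Hence, in an exact calculation at the latest $x^{(n)}$ will give the minimum of $f$. In practice, however, $g_n$ is different from zero most of the time, because of the propagation of rounding errors. For the numerical aspects of this method see, for example, [No/Wr], p. 112 ff.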
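The Hestenes–Stiefel algorithm fits into a few lines. The following sketch uses plain Python lists instead of a linear algebra library; the function name `conjugate_gradient`, the stopping tolerance `tol` and the gradient update $g_{k+1} = g_k + \lambda_k A d_k$ (which saves a second matrix-vector product) are our own choices. It is applied to the data of example 7.

```python
def conjugate_gradient(A, b, x0, tol=1e-12):
    """CG method for f(x) = 1/2 <x, Ax> + <b, x>, A symmetric positive definite
    (list of rows). Returns the list of iterates x^(0), x^(1), ..."""
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    mat_vec = lambda M, v: [dot(row, v) for row in M]
    x = list(x0)
    g = [ax + bi for ax, bi in zip(mat_vec(A, x), b)]  # g_0 = A x^(0) + b
    d = [-gi for gi in g]                              # d_0 = -g_0
    iterates = [x]
    for _ in range(len(x0)):
        gg = dot(g, g)
        if gg <= tol * tol:                            # g_k = 0 up to tolerance
            break
        Ad = mat_vec(A, d)
        lam = -dot(g, d) / dot(Ad, d)                  # exact line search
        x = [xi + lam * di for xi, di in zip(x, d)]
        g = [gi + lam * adi for gi, adi in zip(g, Ad)] # g_{k+1} = g_k + lam_k A d_k
        gamma = dot(g, g) / gg                         # gamma_{k+1}
        d = [-gi + gamma * di for gi, di in zip(g, d)] # d_{k+1}
        iterates.append(x)
    return iterates


# Data of example 7: A = diag(1, 9), b = 0, x^(0) = (9, 1)^T.
cg_iterates = conjugate_gradient([[1.0, 0.0], [0.0, 9.0]], [0.0, 0.0], [9.0, 1.0])
print(cg_iterates)
```

In contrast to the gradient method, which never terminates on this problem, the minimum $x^{(2)} = 0$ is reached after two steps, up to rounding errors.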
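Corollary For $\ell < m$ it holds that:
a) $S_\ell := \operatorname{span}\{g_0, \dots, g_\ell\} = \operatorname{span}\{d_0, \dots, d_\ell\} = \operatorname{span}\{g_0, A g_0, \dots, A^\ell g_0\}$
b) $f\bigl(x^{(\ell+1)}\bigr) = \min_{u \in S_\ell} f\bigl(x^{(\ell)} + u\bigr) = \min_{u \in S_\ell} f\bigl(x^{(0)} + u\bigr)$
The spaces $S_\ell$ are called “Krylov spaces”.

Proof: a) α) $\operatorname{span}\{g_0, \dots, g_\ell\} = \operatorname{span}\{d_0, \dots, d_\ell\}$ can be deduced inductively from $d_{\ell+1} = -g_{\ell+1} + \gamma_{\ell+1} d_\ell$ for $\ell + 1 < m$ with $d_0 = -g_0$.
β) The second equation is (with α)) trivial for $\ell = 0$. $g_{\ell+1} = g_\ell + \lambda_\ell A d_\ell$ for $\ell + 1 < m$ with α) gives the induction assertion $\operatorname{span}\{d_0, \dots, d_{\ell+1}\} \subset \operatorname{span}\{g_0, A g_0, \dots, A^{\ell+1} g_0\}$. For the other inclusion we only have to show $A^{\ell+1} g_0 \in \operatorname{span}\{d_0, \dots, d_{\ell+1}\}$ in the induction step for $\ell + 1 < m$: From the induction hypothesis we have $A^\ell g_0 = \sum_{\nu=0}^{\ell} \mu_\nu d_\nu$ for suitable $\mu_0, \dots, \mu_\ell \in \mathbb{R}$, hence, $A^{\ell+1} g_0 = \sum_{\nu=0}^{\ell} \mu_\nu A d_\nu$. The already familiar relation $\lambda_\nu A d_\nu = g_{\nu+1} - g_\nu$ for $\nu = 0, \dots, \ell$ gives the result (with α) and taking into account $\lambda_\nu \ne 0$).

b) By a) any $u \in S_\ell$ is of the form $\sum_{\nu=0}^{\ell} \tau_\nu d_\nu$ with suitable $\tau_0, \dots, \tau_\ell \in \mathbb{R}$. For $j \in \{0, \dots, \ell\}$ it holds that
$$\frac{d}{d\tau_j}\, f\Bigl(x^{(\ell)} + \sum_{\nu=0}^{\ell} \tau_\nu d_\nu\Bigr) = f'\Bigl(x^{(\ell)} + \sum_{\nu=0}^{\ell} \tau_\nu d_\nu\Bigr)\, d_j = \Bigl(g_\ell + \sum_{\nu=0}^{\ell} \tau_\nu A d_\nu\Bigr)^T d_j = \langle g_\ell, d_j \rangle + \tau_j\, \langle d_j, A d_j \rangle ,$$
where $\langle g_\ell, d_j \rangle = 0$ for $j < \ell$ and $\langle d_j, A d_j \rangle > 0$. Hence, we have
$$\tau_j = \begin{cases} 0 , & \text{if } j < \ell \\ \lambda_\ell , & \text{if } j = \ell \end{cases}$$
for the minimal point on $x^{(\ell)} + S_\ell$. From that follows
$$x^{(\ell)} + \sum_{\nu=0}^{\ell} \tau_\nu d_\nu = x^{(\ell)} + \lambda_\ell d_\ell = x^{(\ell+1)} .$$
The second partial assertion directly results from
$$x^{(\ell)} = x^{(0)} + \sum_{\nu=0}^{\ell-1} \lambda_\nu d_\nu .$$
□

Rate of Convergence
Let the real-valued symmetric and positive definite matrix $A$ have the eigenvalues $0 < \alpha_1 \le \cdots \le \alpha_n$ with corresponding orthonormal eigenvectors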
v1 , . . . , vn . Let x∗ be the minimal point of f ; hence, Ax∗ + b = 0 . ‘Taylor series expansion’ around x∗ yields 2 f (x) = f (x∗ ) + 1 x − x∗ , A(x − x∗ ) = f (x∗ ) + 1 x − x∗ A . 2 2
Chapter 3
Consequently, E(x) := f (x) − f (x∗ ) = 1 x − x∗ , A(x − x∗ ) . 2 n Starting from the representation h := x(0) − x∗ = ν=1 ξν vν with ξν ∈ R , we obtain n (0) 1 E x = αν ξν2 . 2 ν=1
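For $k \in \{1, \dots, m\}$ with suitable $\gamma_\kappa \in \mathbb{R}$ the vectors $u \in x^{(0)} + S_{k-1}$ can be written in the form
$$u = x^{(0)} + \sum_{\kappa=0}^{k-1} \gamma_\kappa A^\kappa g_0 = x^{(0)} + p(A) g_0$$
with the polynomial $p \in \Pi_{k-1}$ defined by $p(\tau) := \sum_{\kappa=0}^{k-1} \gamma_\kappa \tau^\kappa$. Via $g_0 := \nabla f\big(x^{(0)}\big) = A x^{(0)} + b = A x^{(0)} - A x^* = A h$ and
$$u - x^* = x^{(0)} + p(A) g_0 - x^* = h + p(A) A h = (I + p(A)A) h ,$$
we obtain
$$E(u) = \tfrac12 \langle u - x^*, A(u - x^*)\rangle = \tfrac12 \big\langle (I + p(A)A)h,\, A(I + p(A)A)h \big\rangle = \tfrac12 \big\langle h,\, A(I + p(A)A)^2 h \big\rangle$$
$$= \tfrac12 \sum_{\nu=1}^{n} \alpha_\nu \big(1 + \alpha_\nu p(\alpha_\nu)\big)^2 \xi_\nu^2 \;\le\; \max_{1 \le \nu \le n} \big(1 + \alpha_\nu p(\alpha_\nu)\big)^2 \, E\big(x^{(0)}\big) .$$
With $\widetilde{\Pi}_k := \{ q \in \Pi_k \mid q(0) = 1 \}$ it follows that
$$E\big(x^{(k)}\big) \le \min_{q \in \widetilde{\Pi}_k} \max_{1 \le \nu \le n} |q(\alpha_\nu)|^2 \, E\big(x^{(0)}\big) \le \min_{q \in \widetilde{\Pi}_k} \max_{\alpha_1 \le \alpha \le \alpha_n} |q(\alpha)|^2 \, E\big(x^{(0)}\big) .$$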
This leads to an extremal problem whose solution can be stated with the help of the Chebyshev polynomials:
For $t \in [-1, 1]$ and $k \in \mathbb{N}_0$ we consider $T_k(t) := \cos(k \arccos t)$. Evidently
$$\max\{ |T_k(t)| : t \in [-1, 1] \} = 1 . \tag{15}$$
The relation
$$\cos((k+1)\vartheta) + \cos((k-1)\vartheta) = 2 \cos\vartheta \cos k\vartheta$$
for $\vartheta := \arccos t$ and $k \in \mathbb{N}$, which follows directly from the addition theorem of the cosine, shows the recursion:
$$T_0(t) = 1 , \quad T_1(t) = t \quad\text{and}\quad T_{k+1}(t) + T_{k-1}(t) = 2\, t\, T_k(t) .$$
In this way we can see inductively for $k \in \mathbb{N}$: $T_k$ is the restriction of a polynomial$^8$ of degree $k$ with highest coefficient $2^{k-1}$ on $[-1, 1]$. The first Chebyshev polynomials are given by
$$T_0(t) = 1, \quad T_1(t) = t, \quad T_2(t) = 2t^2 - 1, \quad T_3(t) = 4t^3 - 3t, \quad T_4(t) = 8t^4 - 8t^2 + 1 .$$
A short Maple® program yields a plot of $T_1, \dots, T_4$:

> restart: with(plots): with(orthopoly):
  p := plot([seq(T(n,t),n=1..4)],t=-1..1,color=[blue$2,black$2],
            labels=[` `,` `],linestyle=[1,2]):
  display(p);

[Figure: graphs of $T_1, \dots, T_4$ on $[-1, 1]$]

For real-valued $t$ with $|t| \ge 1$ the addition theorem of cosh yields respectively: $T_k(t) = \cosh(k \operatorname{arcosh} t)$, with that
$$T_k(t) = \frac12 \Big[ \big(t + \sqrt{t^2 - 1}\big)^k + \big(t - \sqrt{t^2 - 1}\big)^k \Big]$$
and thus
$$T_k(t) \ge \frac12 \big(t + \sqrt{t^2 - 1}\big)^k \quad\text{for } t \ge 1 . \tag{16}$$

$^8$ In the notation we do not make a distinction between the polynomials and their restrictions to $[-1, 1]$.
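The recursion, the trigonometric definition and the bound (16) can be cross-checked numerically. The sketch below is only an illustration: the function names `chebyshev` and `closed_form` and the sample points are assumptions, not part of the text.

```python
import math

def chebyshev(k, t):
    """Evaluate T_k(t) by the recursion T_{k+1}(t) = 2 t T_k(t) - T_{k-1}(t)."""
    t_prev, t_cur = 1.0, t            # T_0 and T_1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_cur = t_cur, 2.0 * t * t_cur - t_prev
    return t_cur

def closed_form(k, t):
    """Closed form of T_k(t) for |t| >= 1."""
    r = math.sqrt(t * t - 1.0)
    return 0.5 * ((t + r) ** k + (t - r) ** k)

# On [-1, 1] the recursion agrees with cos(k arccos t).
grid = [i / 10.0 - 1.0 for i in range(21)]
max_dev = max(abs(chebyshev(k, t) - math.cos(k * math.acos(t)))
              for k in range(6) for t in grid)

# For t >= 1 compare the recursion, the closed form and the lower bound (16).
t = 1.5
rows = [(chebyshev(k, t), closed_form(k, t),
         0.5 * (t + math.sqrt(t * t - 1.0)) ** k) for k in range(1, 6)]
```

Each entry of `rows` holds the recursive value, the closed-form value and the lower bound from (16) for one $k$.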
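The function $\tau$ defined by
$$\alpha \longmapsto (\alpha_n + \alpha_1 - 2\alpha)/(\alpha_n - \alpha_1) =: \tau(\alpha)$$
is a one-to-one mapping of the interval $[\alpha_1, \alpha_n]$ to $[-1, 1]$. Hence, by (15) it holds for the polynomial $q^* \in \widetilde{\Pi}_k$ defined by
$$q^*(\alpha) := \frac{T_k(\tau(\alpha))}{T_k(\tau(0))} = \frac{T_k\big((\alpha_n + \alpha_1 - 2\alpha)/(\alpha_n - \alpha_1)\big)}{T_k\big((\alpha_n + \alpha_1)/(\alpha_n - \alpha_1)\big)}$$
that
$$\max\{ |q^*(\alpha)| : \alpha \in [\alpha_1, \alpha_n] \} = T_k(\tau(0))^{-1} = T_k\Big(\frac{\kappa + 1}{\kappa - 1}\Big)^{-1}$$
with the condition number $\kappa := c(A) = \alpha_n/\alpha_1$ of $A$ (cf. exercise 16).
$$T_k\Big(\frac{\kappa+1}{\kappa-1}\Big) \overset{(16)}{\ge} \frac12 \Big( \frac{\kappa+1}{\kappa-1} + \sqrt{\frac{4\kappa}{(\kappa-1)^2}} \Big)^k = \frac12 \Big( \frac{(\sqrt{\kappa}+1)^2}{\kappa-1} \Big)^k = \frac12 \Big( \frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1} \Big)^k$$
gives the following estimate
$$\max\{ |q^*(\alpha)| : \alpha \in [\alpha_1, \alpha_n] \} = T_k\Big(\frac{\kappa+1}{\kappa-1}\Big)^{-1} \le 2 \Big( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \Big)^k .$$
With the above considerations we thus obtain
$$E\big(x^{(k)}\big) \le 4 \Big( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \Big)^{2k} E\big(x^{(0)}\big) .$$
Compared to the gradient method (cf. lemma 3.2.1) this is a much better estimate, because here we have replaced the condition number $\kappa$ of $A$ with $\sqrt{\kappa}$. Hence, a better condition number of $A$ ensures a quicker convergence of the CG-method. We will use this fact in the technique of preconditioning to speed up the CG-method.

Preconditioning
We carry out a substitution $x = M \tilde{x}$ with an invertible matrix $M$ in the objective function $f$, and get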
$$\tilde{f}(\tilde{x}) := f(x) = f(M\tilde{x}) = \tfrac12\, \tilde{x}^T M^T A M \tilde{x} + b^T M \tilde{x} = \tfrac12\, \tilde{x}^T \tilde{A} \tilde{x} + \tilde{b}^T \tilde{x}$$
with $\tilde{A} := M^T A M$ and $\tilde{b} := M^T b$. The gradient $g := \nabla f$ transforms as follows:
$$\tilde{g}(\tilde{x}) := \nabla \tilde{f}(\tilde{x}) = \tilde{A}\tilde{x} + \tilde{b} = M^T (A M \tilde{x} + b) = M^T g(x) .$$
We now want to choose $M$ such that $c(M^T A M) \ll c(A)$ holds. If we know the Cholesky decomposition $A = L L^T$, the choice $M := L^{-T}$ gives
$$\tilde{A} = M^T A M = L^{-1} A L^{-T} = I .$$
If the matrix $A$ is sparse, $L$ usually has mostly nonzero entries (problem of "fill in"). Therefore, this suggestion for the choice of $M$ is only useful for an approximate $L$ which is sparse (incomplete Cholesky decomposition$^9$):
$$A = L L^T + E \quad (E \text{ stands for "error"}), \qquad M := L^{-T}: \quad M^T A M = I + \underbrace{L^{-1} E L^{-T}}_{\text{"perturbation"}} .$$
The occurring inverse will of course not actually be calculated. We will only solve the corresponding systems of linear equations.

$^9$ Also see [Me/Vo].

The CG-Method in the Nonquadratic Case
The CG-method can be rephrased in such a way that it can also be applied to any differentiable function $f : \mathbb{R}^n \longrightarrow \mathbb{R}$. A first version goes back to Fletcher/Reeves (1964). A variant was devised by Polak/Ribière (1971). They proposed to take for $\gamma_{k+1}$
$$\frac{\langle g_{k+1}, g_{k+1} - g_k\rangle}{\langle g_k, g_k\rangle} \quad\text{instead of}\quad \frac{\langle g_{k+1}, g_{k+1}\rangle}{\langle g_k, g_k\rangle} .$$
This modification is said to be more robust and efficient, even though it will theoretically give the same result in special cases. In the literature on the topic it is also reported in many instances that this variant has much better convergence properties.

We consider $g := \nabla f$ and are looking for a stationary point, hence, a zero of the gradient $g$:
Fletcher and Reeves Algorithm with Restart:
0) To a given $x^{(0)} \in \mathbb{R}^n$ let $g_0 := g\big(x^{(0)}\big)$ and $d_0 := -g_0$.
For $k = 0, 1, 2, \dots$: If $g_k = 0$: $x^{(k)}$ is a stationary point: STOP; else:
1) Calculate, using exact or inexact line search, $\lambda_k > 0$ such that
$$f\big(x^{(k)} + \lambda_k d_k\big) \approx \min\{ f\big(x^{(k)} + \lambda d_k\big) \mid \lambda \ge 0 \}$$
and set $x^{(k+1)} := x^{(k)} + \lambda_k d_k$ and $g_{k+1} := g\big(x^{(k+1)}\big)$.
2) $d_{k+1} := -g_{k+1} + \gamma_{k+1} d_k$, where
$$\gamma_{k+1} := \begin{cases} 0 , & \text{if } k+1 \equiv 0 \bmod n \quad \text{(Restart)} \\[1ex] \dfrac{\langle g_{k+1}, g_{k+1} - g_k\rangle}{\langle g_k, g_k\rangle} , & \text{else.} \end{cases}$$

When inexact line search is used, it is assumed that a specific method has been chosen. Here, in contrast to the special case, $\lambda_k$ cannot be noted down directly in the iteration step; its calculation requires a minimization in a given direction. In this more general case $d_{k+1}$ is also always a descent direction when we use exact line search, because
$$\langle g_{k+1}, d_{k+1}\rangle = -\|g_{k+1}\|_2^2 + \gamma_{k+1} \langle g_{k+1}, d_k\rangle = -\|g_{k+1}\|_2^2 < 0 .$$
This property may be lost in inexact line search. We often have to reach a compromise between great effort (exact line search) and possibly unfavorable properties (inexact line search). In practice we would of course replace the condition $g_k = 0$, for example, by $\|g_k\|_2 \le \varepsilon$ or $\|g_k\|_2 \le \varepsilon \|g_0\|_2$ for a given $\varepsilon > 0$. For generalizations, additions and further numerical aspects see, for example, [No/Wr], pp. 122 ff.
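The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the inexact line search is a simple Armijo backtracking, the fallback to $-g_k$ when $d_k$ fails to be a descent direction is an added safeguard, and the test function `f` with its gradient is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def armijo(f, x, d, slope, alpha=1e-4, shrink=0.5):
    """Backtracking line search: shrink lam until f decreases sufficiently."""
    fx, lam = f(x), 1.0
    for _ in range(60):
        trial = [xi + lam * di for xi, di in zip(x, d)]
        if f(trial) <= fx + alpha * lam * slope:
            break
        lam *= shrink
    return lam

def nonlinear_cg(f, grad, x0, eps=1e-8, max_iter=1000):
    n = len(x0)
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]                        # d_0 = -g_0
    for k in range(max_iter):
        if dot(g, g) ** 0.5 <= eps:              # practical stand-in for g_k = 0
            return x, k
        slope = dot(g, d)
        if slope >= 0.0:                         # safeguard (assumption): use -g_k
            d = [-gi for gi in g]
            slope = dot(g, d)
        lam = armijo(f, x, d, slope)
        x = [xi + lam * di for xi, di in zip(x, d)]
        g_new = grad(x)
        if (k + 1) % n == 0:
            gamma = 0.0                          # restart
        else:
            diff = [a - b for a, b in zip(g_new, g)]
            gamma = dot(g_new, diff) / dot(g, g)
        d = [-gn + gamma * di for gn, di in zip(g_new, d)]
        g = g_new
    return x, max_iter

# Hypothetical convex test function with minimizer (1, -0.5).
def f(x):
    return (x[0] - 1.0) ** 2 + 0.25 * (x[0] - 1.0) ** 4 + 2.0 * (x[1] + 0.5) ** 2

def grad(x):
    return [2.0 * (x[0] - 1.0) + (x[0] - 1.0) ** 3, 4.0 * (x[1] + 0.5)]

x_min, iterations = nonlinear_cg(f, grad, [-1.0, 1.0])
```

The stopping test uses $\|g_k\|_2 \le \varepsilon$, as suggested in the text for practical computations.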
3.5 Quasi-Newton Methods
Another important class of methods for constructing descent directions are Quasi-Newton methods which we are going to look at in this section. Suppose that $f : \mathbb{R}^n \longrightarrow \mathbb{R}$ is twice continuously differentiable. As usual we consider the minimization problem
$$f(x) \longrightarrow \min_{x \in \mathbb{R}^n}$$
by determining stationary points, that is, by solving the system of nonlinear equations $\nabla f(x) = 0$. In order to do that, we can, for example, use the following approximation methods starting from an $x^{(0)} \in \mathbb{R}^n$ and with the abbreviations $f_k := f\big(x^{(k)}\big)$, $g_k := \nabla f\big(x^{(k)}\big)$ for $k \in \mathbb{N}_0$:
1) Newton's method: $x^{(k+1)} := x^{(k)} + d_k$ with $d_k := -\nabla^2 f\big(x^{(k)}\big)^{-1} g_k$, assuming that the Hessian $\nabla^2 f\big(x^{(k)}\big)$ is invertible.
2) Damped Newton method: $x^{(k+1)} := x^{(k)} + \lambda_k d_k$ for a $\lambda_k \in [0, 1]$ such that $f\big(x^{(k+1)}\big) = \min\limits_{0 \le \lambda \le 1} f\big(x^{(k)} + \lambda d_k\big)$.
3) Modification of the descent direction: $d_k := -B_k^{-1} g_k$ with a suitable symmetric positive definite matrix $B_k \approx \nabla^2 f\big(x^{(k)}\big)$.$^{10}$

Example 8
A simple Matlab® program for Newton's method applied to our framework example, with the starting point $w^{(0)} = (0, 0, 0)^T$, the regularization parameter $\lambda = 1$ and the tolerance of $10^{-7}$, gives after only five iterations:

k    w^(k)_1        w^(k)_2       w^(k)_3       f(w^(k))      ||∇f(w^(k))||_∞
0     0             0             0             8.31776617    6.00000000
1    −0.41218838    0.61903615    0.04359951    5.74848374    0.79309457
2    −0.49782537    0.74273470    0.06860137    5.68008777    0.04949655
3    −0.50481114    0.75317925    0.07065585    5.67969939    0.00031030
4    −0.50485551    0.75324907    0.07066909    5.67969938    0.00000002
5    −0.50485551    0.75324907    0.07066909    5.67969938    0.00000000
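Since the data of the framework example is not reproduced here, the following sketch applies Newton's method (variant 1 above) to a small hypothetical function $f(x) = x_1^4 + x_1^2 + x_1 x_2 + x_2^2$ instead. The function, the starting point and the helper `solve2` are assumptions chosen so that the Hessian is positive definite everywhere; the tolerance $10^{-7}$ on $\|\nabla f\|_\infty$ mirrors the example.

```python
def grad(x):
    # Gradient of f(x) = x1^4 + x1^2 + x1*x2 + x2^2
    return [4.0 * x[0] ** 3 + 2.0 * x[0] + x[1], x[0] + 2.0 * x[1]]

def hessian(x):
    return [[12.0 * x[0] ** 2 + 2.0, 1.0], [1.0, 2.0]]

def solve2(A, r):
    """Solve the 2x2 system A d = r by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(r[0] * A[1][1] - A[0][1] * r[1]) / det,
            (A[0][0] * r[1] - A[1][0] * r[0]) / det]

def newton(x0, tol=1e-7, max_iter=50):
    x = list(x0)
    history = []                                 # (k, x^(k), ||grad||_inf)
    for k in range(max_iter):
        g = grad(x)
        gnorm = max(abs(v) for v in g)
        history.append((k, list(x), gnorm))
        if gnorm <= tol:
            break
        d = solve2(hessian(x), [-g[0], -g[1]])   # Newton direction d_k
        x = [x[0] + d[0], x[1] + d[1]]
    return x, history

x_star, history = newton([1.0, 1.0])
```

As in the table above, `history` records the iterates together with the maximum norm of the gradient, so the fast final decrease of the gradient norm can be inspected directly.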
The idea underlying many quasi-Newton methods is: If the Hessian is cumbersome or too expensive or time-consuming to compute, it — or its inverse — is approximated by a suitable (easy to compute) matrix.
We consider for $x \in \mathbb{R}^n$ with
$$\Phi_{k+1}(x) := f_{k+1} + g_{k+1}^T \big(x - x^{(k+1)}\big) + \tfrac12 \big(x - x^{(k+1)}\big)^T B_{k+1} \big(x - x^{(k+1)}\big)$$
the corresponding quadratic approximation $\Phi_{k+1}$ of $f$ in $x^{(k+1)}$. For that it holds that $\nabla \Phi_{k+1}(x) = B_{k+1}\big(x - x^{(k+1)}\big) + g_{k+1}$. We demand from $B_{k+1}$ that $\nabla \Phi_{k+1}$ has to match the gradient of $f$ at $x^{(k)}$ and $x^{(k+1)}$:
$$\nabla \Phi_{k+1}\big(x^{(k+1)}\big) = g_{k+1}$$
$$\nabla \Phi_{k+1}\big(x^{(k)}\big) = B_{k+1}\big(x^{(k)} - x^{(k+1)}\big) + g_{k+1} \overset{!}{=} g_k$$
The relation $B_{k+1} p_k = q_k$ results from the above with $p_k := x^{(k+1)} - x^{(k)}$ and $q_k := g_{k+1} - g_k$. It is called the secant relation or Quasi-Newton condition. We are looking for suitable updating formulae for $B_k$ which meet the Quasi-Newton condition. The first method of this kind was devised by Davidon (1959) and later developed further by Fletcher and Powell (1963) (cf. [Dav], [Fl/Po]). Our presentation follows the discussions of J. Greenstadt and D. Goldfarb (1970) which themselves are based on the "least change secant update" principle by C. G. Broyden.

$^{10}$ Approximate matrices to $\nabla^2 f\big(x^{(k)}\big)$ will be denoted by $B_k$, those to $\nabla^2 f\big(x^{(k)}\big)^{-1}$ by $H_k$. This is surely a bit confusing but seems to be standard in the literature.

Least Change Principle
Since we do not want the discussion with its numerous formulae to become confused by too many indices, we will temporarily introduce, for a fixed $k \in \mathbb{N}_0$, a different notation:
$$x^{(k)} \longrightarrow x , \qquad x^{(k+1)} \longrightarrow x'$$
$$p_k \longrightarrow p , \qquad q_k \longrightarrow q = \nabla f(x') - \nabla f(x)$$
$$B_k \longrightarrow B , \qquad B_{k+1} \longrightarrow B'$$
The Quasi-Newton condition then reads $B' p = q$. Denote by $M_n$ the set of all real-valued $(n, n)$-matrices.

Definition
For a fixed symmetric and positive definite matrix $W \in M_n$ let
$$\|A\|_W := \big\| W^{1/2} A\, W^{1/2} \big\|_F ,$$
where $\|\cdot\|_F$ is the well-known Frobenius norm with
$$\|A\|_F^2 := \sum_{i,j=1}^{n} a_{i,j}^2 = \operatorname{trace}(A^T A)$$
for any matrix $A \in M_n$.
Remark
The norm $\|\cdot\|_W$ is strictly convex in the following weak sense: For different $A_1, A_2 \in M_n$ with $\|A_1\|_W = \|A_2\|_W = 1$ it holds that $\big\| \tfrac12 (A_1 + A_2) \big\|_W < 1$.
Proof: This follows immediately from the parallelogram identity in the Hilbert space $(M_n, \|\cdot\|_F)$.
This remark shows that the following approximation problem will have at most one (global) minimizer:

Proposition 3.5.1
Suppose $W \in M_n$ symmetric and positive definite, $p, q \in \mathbb{R}^n$ with $p \neq 0$, $c := W^{-1} p$ and $B \in M_n$ symmetric. Then we obtain the following "rank 2-update of $B$", that is, a matrix of (at most) rank two is added to $B$,
$$B' = B + \frac{(q - Bp)c^T + c(q - Bp)^T}{\langle c, p\rangle} - \frac{\langle q - Bp, p\rangle}{\langle c, p\rangle^2}\, c c^T$$
as the unique solution of the convex optimization problem
$$\min\{ \|A - B\|_W : A \in M_n \text{ with } A^T = A \text{ and } Ap = q \} .$$
This is called the "principle of least change".
Proof: The existence of a minimizer $A$ of this problem follows by a routine compactness argument.
a) For the moment we only consider the case $W = I$: For $A \in M_n$ we have to minimize the term $\tfrac12 \|A - B\|_F^2$ subject to the constraints $Ap = q$ and $A = A^T$. With the corresponding Lagrangian
$$L(A, \ell, \sigma) := \frac12 \sum_{i,j=1}^{n} (a_{ij} - b_{ij})^2 + \sum_{i=1}^{n} \ell_i \Big( \sum_{j=1}^{n} a_{ij} p_j - q_i \Big) + \sum_{i,j=1}^{n} \sigma_{ij} (a_{ij} - a_{ji})$$
($\ell \in \mathbb{R}^n$ and $A, \sigma \in M_n$) it holds for the minimizer $A$ and $1 \le i, j \le n$ that
$$\frac{\partial L}{\partial a_{ij}} = a_{ij} - b_{ij} + \ell_i p_j + \sigma_{ij} - \sigma_{ji} = 0 \tag{17}$$
$$\sum_{j=1}^{n} a_{ij} p_j = q_i \tag{18}$$
$$a_{ij} = a_{ji} . \tag{19}$$
Exchanging the indices in (17) yields $a_{ji} - b_{ji} + \ell_j p_i + \sigma_{ji} - \sigma_{ij} = 0$;
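addition to (17) and consideration of (19) leads to the relation $2 a_{ij} - 2 b_{ij} + \ell_i p_j + p_i \ell_j = 0$, respectively
$$A = B - \tfrac12 \big( \ell p^T + p\, \ell^T \big) . \tag{20}$$
It follows from (20) that
$$q = Ap = Bp - \tfrac12 \big( \ell p^T p + p\, \ell^T p \big) = Bp - \tfrac12 \big( \langle p, p\rangle\, \ell + \langle \ell, p\rangle\, p \big) ; \tag{21}$$
setting $w := q - Bp$ we obtain from that:
$$\langle p, w\rangle = -\langle p, \ell\rangle \langle p, p\rangle \quad\text{or}\quad \langle p, \ell\rangle = -\frac{\langle p, w\rangle}{\langle p, p\rangle} .$$
With (21) we get
$$\langle p, p\rangle\, \ell = -2w - \langle \ell, p\rangle\, p \quad\text{or}\quad \ell = -\frac{2}{\langle p, p\rangle}\, w + \frac{\langle p, w\rangle}{\langle p, p\rangle^2}\, p .$$
Hence,
$$A = B - \tfrac12 \big( \ell p^T + p\, \ell^T \big) = B - \frac12 \Big[ \Big( -\frac{2}{\langle p, p\rangle} w + \frac{\langle p, w\rangle}{\langle p, p\rangle^2} p \Big) p^T + p \Big( -\frac{2}{\langle p, p\rangle} w + \frac{\langle p, w\rangle}{\langle p, p\rangle^2} p \Big)^T \Big] = B + \frac{w p^T + p w^T}{\langle p, p\rangle} - \frac{\langle p, w\rangle}{\langle p, p\rangle^2}\, p p^T .$$
This gives the result in this case (taking into account $p = c$).
b) For arbitrary $W$ we set $M := W^{1/2}$ and obtain the desired result by considering
$$M(A - B)M = \underbrace{MAM}_{=: \tilde{A}} - \underbrace{MBM}_{=: \tilde{B}}$$
and
$$Ap = q \iff (MAM) \underbrace{M^{-1} p}_{=: \tilde{p}} = \underbrace{Mq}_{=: \tilde{q}} \iff \tilde{A}\tilde{p} = \tilde{q} .$$
The optimization problem is equivalent to
$$\min\{ \|\tilde{A} - \tilde{B}\|_F : \tilde{A} \in M_n \text{ with } \tilde{A}^T = \tilde{A} \text{ and } \tilde{A}\tilde{p} = \tilde{q} \} .$$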
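Following a) we obtain with $\tilde{w} := \tilde{q} - \tilde{B}\tilde{p}$ the solution
$$\tilde{A} = \tilde{B} + \frac{\tilde{w}\tilde{p}^T + \tilde{p}\tilde{w}^T}{\langle \tilde{p}, \tilde{p}\rangle} - \frac{\langle \tilde{p}, \tilde{w}\rangle}{\langle \tilde{p}, \tilde{p}\rangle^2}\, \tilde{p}\tilde{p}^T .$$
Because of
$$\tilde{p} = Mc , \quad \tilde{w} = Mw ,$$
$$\langle \tilde{p}, \tilde{p}\rangle = \langle M^{-1}p, M^{-1}p\rangle = \langle p, M^{-2}p\rangle = \langle p, W^{-1}p\rangle = \langle p, c\rangle ,$$
$$\langle \tilde{p}, \tilde{w}\rangle = \langle M^{-1}p, Mw\rangle = \langle M M^{-1}p, w\rangle = \langle p, w\rangle$$
we have, as claimed,
$$A = B + \frac{w c^T + c w^T}{\langle c, p\rangle} - \frac{\langle p, w\rangle}{\langle c, p\rangle^2}\, c c^T .$$

Special Cases
1) The case $W = I$, which was first discussed in the proof, yields (with $c = p$) the Powell-Symmetric-Broyden update
$$B'_{PSB} = B + \frac{(q - Bp)p^T + p(q - Bp)^T}{\langle p, p\rangle} - \frac{\langle q - Bp, p\rangle}{\langle p, p\rangle^2}\, p p^T$$
of the so-called PSB method.
2) If $\langle p, q\rangle > 0$, then there exists a symmetric positive definite matrix $W$ with $Wq = p$: We can obtain such a matrix, for example, by setting
$$W := I - \frac{q q^T}{\langle q, q\rangle} + \frac{p p^T}{\langle p, q\rangle} :$$
Obviously $Wq = p$, and for $x \in \mathbb{R}^n$ it follows with the Cauchy–Schwarz inequality (specifically with the characterization of the equality) that
$$\langle x, Wx\rangle = \langle x, x\rangle - \underbrace{\frac{\langle q, x\rangle^2}{\langle q, q\rangle}}_{\le \langle x, x\rangle} + \frac{\langle p, x\rangle^2}{\langle p, q\rangle} \ge 0$$
as well as
$$\langle x, Wx\rangle = 0 \iff x = \lambda q \text{ with a suitable } \lambda \in \mathbb{R} \text{ and } \langle p, x\rangle = 0 .$$
Hence $x = \lambda q$ follows from $\langle x, Wx\rangle = 0$ and further $0 = \langle p, x\rangle = \lambda \langle p, q\rangle$ and from that $\lambda = 0$, hence, $x = 0$. With $c = q$ we obtain the updating formula of Davidon–Fletcher–Powell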
$$B'_{DFP} = B + \frac{(q - Bp)q^T + q(q - Bp)^T}{\langle p, q\rangle} - \frac{\langle q - Bp, p\rangle}{\langle p, q\rangle^2}\, q q^T = \Big( I - \frac{q p^T}{\langle p, q\rangle} \Big) B \underbrace{\Big( I - \frac{p q^T}{\langle p, q\rangle} \Big)}_{=: Q} + \frac{q q^T}{\langle p, q\rangle}$$
of the so-called DFP method, also often referred to as the variable metric method. Since $Q^2 = Q$ the matrix $Q$ is a projection with $Qp = 0$. When updating a symmetric positive definite $B := B_k$ in a quasi-Newton method, it is desirable that $B' := B_{k+1}$ is symmetric positive definite too:
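Proposition 3.5.2
If $B$ is symmetric positive definite, then, with the assumption $\langle p, q\rangle > 0$, $B'_{DFP}$ is also symmetric positive definite. For $H'_{DFP} := (B'_{DFP})^{-1}$ and $H := B^{-1}$ it holds that
$$H'_{DFP} = H + \frac{p p^T}{\langle p, q\rangle} - \frac{H q q^T H}{\langle q, Hq\rangle} , \tag{22}$$
and we get the original DFP update (cf. [Fl/Po]).
Proof: For $x \in \mathbb{R}^n$ we have
$$\langle x, B'_{DFP}\, x\rangle = \langle x, Q^T B Q\, x\rangle + \Big\langle x, \frac{q q^T}{\langle p, q\rangle}\, x \Big\rangle = \underbrace{\langle Qx, B(Qx)\rangle}_{\ge 0} + \underbrace{\frac{\langle q, x\rangle^2}{\langle p, q\rangle}}_{\ge 0} \ge 0 .$$
From $\langle x, B'_{DFP}\, x\rangle = 0$ it thus follows $\langle Qx, B(Qx)\rangle = 0$ and $\langle q, x\rangle = 0$, hence, $Qx = 0$ and $\langle q, x\rangle = 0$, and finally $x = 0$. If we denote the right-hand side of (22) by $\tilde{H}$, then $B'_{DFP} \tilde{H} = I$ follows after some transformations from
$$B'_{DFP} \tilde{H} = \Big( Q^T B Q + \frac{q q^T}{\langle p, q\rangle} \Big) \Big( H + \frac{p p^T}{\langle p, q\rangle} - \frac{H q q^T H}{\langle q, Hq\rangle} \Big) ,$$
hence, $H'_{DFP} = \tilde{H}$.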
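The rank-2 update of proposition 3.5.1 and the inverse formula (22) can be checked on small matrices. The following sketch is an illustration only: the diagonal matrix `B`, its inverse `H` and the vectors `p`, `q` are arbitrary assumptions with $\langle p, q\rangle > 0$, and all helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def matmul(A, B):
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def rank2_update(B, p, q, c):
    """B' = B + (w c^T + c w^T)/<c,p> - <w,p>/<c,p>^2 c c^T with w = q - B p."""
    w = [qi - bpi for qi, bpi in zip(q, matvec(B, p))]
    cp, wp = dot(c, p), dot(w, p)
    n = len(p)
    return [[B[i][j] + (w[i] * c[j] + c[i] * w[j]) / cp - wp * c[i] * c[j] / cp ** 2
             for j in range(n)] for i in range(n)]

def dfp_inverse_update(H, p, q):
    """Formula (22): H' = H + p p^T/<p,q> - H q q^T H/<q,H q> (H symmetric)."""
    Hq = matvec(H, q)
    pq, qHq = dot(p, q), dot(q, Hq)
    n = len(p)
    return [[H[i][j] + p[i] * p[j] / pq - Hq[i] * Hq[j] / qHq
             for j in range(n)] for i in range(n)]

B = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]
H = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.25]]   # H = B^{-1}
p = [1.0, 2.0, -1.0]
q = [2.0, 1.0, 0.5]                                         # <p, q> = 3.5 > 0

B_psb = rank2_update(B, p, q, p)      # c = p: PSB update
B_dfp = rank2_update(B, p, q, q)      # c = q: DFP update
H_dfp = dfp_inverse_update(H, p, q)

def secant_error(B_new):
    return max(abs(a - b) for a, b in zip(matvec(B_new, p), q))

product = matmul(B_dfp, H_dfp)
identity_error = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                     for i in range(3) for j in range(3))
```

Both updates satisfy the Quasi-Newton condition $B'p = q$ up to rounding, and `identity_error` measures how far $B'_{DFP} H'_{DFP}$ is from the identity.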
Remark
Suppose that in the damped Newton method $x^{(k+1)} := x^{(k)} - \lambda_k \nabla^2 f\big(x^{(k)}\big)^{-1} g_k$ we do not approximate $\nabla^2 f\big(x^{(k)}\big)$ with $B_k$, but $\nabla^2 f\big(x^{(k)}\big)^{-1}$ with $H_k$. Then we get the following new kind of Quasi-Newton method
$$x^{(k+1)} := x^{(k)} - \lambda_k H_k g_k .$$
The Quasi-Newton condition transforms into $H_{k+1} q_k = p_k$. If we exchange $q$ and $p$ as well as $B$ and $H$, it follows with proposition 3.5.1:

Proposition 3.5.1'
Suppose that $W \in M_n$ is symmetric and positive definite, $p, q \in \mathbb{R}^n$ with $q \neq 0$, $d := W^{-1} q$ and $H \in M_n$ symmetric. Then the uniquely determined solution of
$$\min\{ \|G - H\|_W : G \in M_n \text{ with } G^T = G \text{ and } Gq = p \}$$
is given by
$$H' = H + \frac{(p - Hq)d^T + d(p - Hq)^T}{\langle d, q\rangle} - \frac{\langle p - Hq, q\rangle}{\langle d, q\rangle^2}\, d d^T .$$

Special Cases
1) $W = I$: In this case we have $d = q$, and with
$$H'_G = H + \frac{(p - Hq)q^T + q(p - Hq)^T}{\langle q, q\rangle} - \frac{\langle p - Hq, q\rangle}{\langle q, q\rangle^2}\, q q^T$$
we obtain the Greenstadt update.
2) If $\langle p, q\rangle > 0$, there exists a symmetric and positive definite matrix $W$ such that $Wp = q$. Consequently, $d = p$, and we obtain the Broyden–Fletcher–Goldfarb–Shanno updating formula (BFGS formula)
$$H'_{BFGS} = H + \frac{(p - Hq)p^T + p(p - Hq)^T}{\langle p, q\rangle} - \frac{\langle p - Hq, q\rangle}{\langle p, q\rangle^2}\, p p^T = \Big( I - \frac{p q^T}{\langle p, q\rangle} \Big) H \Big( I - \frac{q p^T}{\langle p, q\rangle} \Big) + \frac{p p^T}{\langle p, q\rangle} .$$

Remark
This updating formula is the one most commonly used. Due to empirical evidence the BFGS method seems to be the best for general purposes, since good convergence properties remain valid even with inexact line search. Besides that, it has effective self-correcting properties.

Proposition 3.5.2'
If $\langle p, q\rangle > 0$ and $H$ is symmetric positive definite, then $H'_{BFGS}$ is also symmetric positive definite. For $B'_{BFGS} := (H'_{BFGS})^{-1}$ and $B := H^{-1}$ the formula
$$B'_{BFGS} = B + \frac{q q^T}{\langle p, q\rangle} - \frac{B p p^T B}{\langle p, Bp\rangle}$$
holds which is the dual of the DFP formula (22).
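The BFGS formula and its dual can be verified in the same spirit. The sketch below is a small numerical check under assumptions: the matrices `H` and `B = H^{-1}` and the vectors `p`, `q` are arbitrary choices with $\langle p, q\rangle > 0$, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def matmul(A, B):
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def bfgs_inverse_update(H, p, q):
    """H' = (I - p q^T/s) H (I - q p^T/s) + p p^T/s with s = <p,q>, expanded."""
    Hq = matvec(H, q)
    s = dot(p, q)
    factor = (1.0 + dot(q, Hq) / s) / s
    n = len(p)
    return [[H[i][j] - (p[i] * Hq[j] + Hq[i] * p[j]) / s + factor * p[i] * p[j]
             for j in range(n)] for i in range(n)]

def bfgs_direct_update(B, p, q):
    """B' = B + q q^T/<p,q> - B p p^T B/<p,B p> (B symmetric)."""
    Bp = matvec(B, p)
    s, pBp = dot(p, q), dot(p, Bp)
    n = len(p)
    return [[B[i][j] + q[i] * q[j] / s - Bp[i] * Bp[j] / pBp
             for j in range(n)] for i in range(n)]

H = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.25]]
B = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]    # B = H^{-1}
p = [1.0, 2.0, -1.0]
q = [2.0, 1.0, 0.5]                                         # <p, q> = 3.5 > 0

H_bfgs = bfgs_inverse_update(H, p, q)
B_bfgs = bfgs_direct_update(B, p, q)

secant_error = max(abs(a - b) for a, b in zip(matvec(H_bfgs, q), p))
product = matmul(B_bfgs, H_bfgs)
identity_error = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                     for i in range(3) for j in range(3))
```

`secant_error` tests $H'_{BFGS}\, q = p$, and `identity_error` tests that the two updates are inverse to each other, as stated in proposition 3.5.2'.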
More General Quasi-Newton Methods Many of the properties of the DFP and BFGS formulae extend to more general classes which can dominate these in special cases, especially for certain nonquadratic problems with inexact line search.
a) Broyden's class (1970)
This updating formula, which is also called the SSVM technique (self-scaling variable metric), is a convex combination of those of the DFP and BFGS methods; hence, for $\vartheta \in [0, 1]$ we have
$$H'_\vartheta := \vartheta H'_{BFGS} + (1 - \vartheta) H'_{DFP} = H + \Big( 1 + \vartheta\, \frac{\langle q, Hq\rangle}{\langle p, q\rangle} \Big) \frac{p p^T}{\langle p, q\rangle} - (1 - \vartheta)\, \frac{H q q^T H}{\langle q, Hq\rangle} - \frac{\vartheta}{\langle p, q\rangle} \big( p q^T H + H q p^T \big) .$$
If $H$ is positive definite, $\vartheta \in [0, 1]$ and $\langle p, q\rangle > 0$, then the matrix $H'_\vartheta$ is positive definite. Clearly $\vartheta = 0$ yields the DFP and $\vartheta = 1$ the BFGS update. A Broyden method is a quasi-Newton method in which a Broyden update is used in each step, possibly with varying parameters $\vartheta$.
b) If we replace the matrix $H$ by $\gamma H$ with a positive $\gamma$ as an additional scaling factor, we obtain for $\vartheta \in [0, 1]$ the Oren–Luenberger class
$$H \longrightarrow \gamma H \longrightarrow H'_{\gamma,\vartheta} := \vartheta H'_{BFGS}(\gamma H) + (1 - \vartheta) H'_{DFP}(\gamma H) .$$
An Oren–Luenberger method with exact line search terminates after at most n iterations for quadratic functions:
Proposition 3.5.3
Suppose that $f$ is given by $f(x) = \tfrac12 \langle x, Ax\rangle + \langle b, x\rangle$ with $b \in \mathbb{R}^n$ and a symmetric positive definite matrix $A \in M_n$. In addition let $x^{(0)} \in \mathbb{R}^n$ and a symmetric positive definite matrix $H_0 \in M_n$ be given. Then each method of the Oren–Luenberger class yields, starting from $x^{(0)}$, $H_0$, with exact line search
$$x^{(k+1)} = x^{(k)} - \lambda_k H_k g_k , \qquad \min_{\lambda \ge 0} f\big(x^{(k)} - \lambda H_k g_k\big) = f\big(x^{(k+1)}\big) ,$$
sequences $x^{(k)}$, $H_k$, $p_k := x^{(k+1)} - x^{(k)}$ and $q_k := g_{k+1} - g_k$ such that:
a) There exists a smallest $m \in \{0, \dots, n\}$ such that $g_m = 0$. Then $x^{(m)} = -A^{-1} b$ is the minimal point of $f$.
b) The following assertions hold:
1) $\langle p_i, q_k\rangle = \langle p_i, A p_k\rangle = 0$ for $0 \le i < k \le m-1$,
   $\langle p_i, q_i\rangle > 0$ for $0 \le i \le m-1$,
   $H_i$ is symmetric positive definite for $0 \le i \le m$.
2) $\langle p_i, g_k\rangle = 0$ for $0 \le i < k \le m$.
3) $H_k q_i = \gamma_{i,k}\, p_i$ for $0 \le i < k \le m$, where
$$\gamma_{i,k} := \begin{cases} \gamma_{i+1} \cdot \gamma_{i+2} \cdots \gamma_{k-1} & \text{for } i < k-1 \\ 1 & \text{for } i = k-1 . \end{cases}$$
c) If $m = n$, it holds in addition: $H_n = P D P^{-1} A^{-1}$ with
$$D := \operatorname{Diag}\big( \gamma_{0,n}, \gamma_{1,n}, \dots, \gamma_{n-1,n} \big) , \qquad P := \big( p_0, p_1, \dots, p_{n-1} \big) .$$
Hence, for Broyden's class (that is, $\gamma_k = 1$ for all $k$) $H_n = A^{-1}$ holds.
Proof: See [St/Bu], theorem 5.11.10, p. 320 ff.
This result is somewhat idealized because an exact line search cannot be guaranteed in practice for arbitrary functions $f$.

Quasi-Newton Methods in the Nonquadratic Case
We content ourselves with presenting the framework of general DFP and BFGS methods:

DFP method:
0) Starting with $x^{(0)} \in \mathbb{R}^n$ and $H_0 = I$ for $k \in \mathbb{N}_0$ let:
1) $d_k := -H_k g_k$, where $H_k \approx \nabla^2 f\big(x^{(k)}\big)^{-1}$.
   If $g_k \neq 0$: $d_k$ is a descent direction of $f$ in $x^{(k)}$, since for $\varphi(t) := f\big(x^{(k)} + t d_k\big)$ it holds that $\varphi'(0) = g_k^T d_k = -g_k^T H_k g_k < 0$.
2) Calculate a $\lambda_k > 0$ with $f\big(x^{(k)} + \lambda_k d_k\big) = \min\limits_{\lambda \ge 0} f\big(x^{(k)} + \lambda d_k\big)$ or via suitable inexact line search.
   $x^{(k+1)} := x^{(k)} + \lambda_k d_k$
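3) Update with a given $\gamma \in (0, 1)$:
$$H_{k+1} := \begin{cases} H_k + \dfrac{p_k p_k^T}{\langle p_k, q_k\rangle} - \dfrac{H_k q_k q_k^T H_k}{\langle q_k, H_k q_k\rangle} , & \text{if } \langle p_k, q_k\rangle \ge \gamma\, \|p_k\|_2 \|q_k\|_2 \\[2ex] H_0 , & \text{else (restart).} \end{cases}$$

BFGS method:
The steps are as in the DFP method, with the direction $d_k := -B_k^{-1} g_k$, where $B_k \approx \nabla^2 f\big(x^{(k)}\big)$.
Updating formula:
$$B_{k+1} := \begin{cases} B_k + \dfrac{q_k q_k^T}{\langle p_k, q_k\rangle} - \dfrac{B_k p_k p_k^T B_k}{\langle p_k, B_k p_k\rangle} , & \text{if } \langle p_k, q_k\rangle \ge \gamma\, \|p_k\|_2 \|q_k\|_2 \\[2ex] B_0 , & \text{else (restart).} \end{cases}$$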
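The framework above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: it works with the inverse approximation $H_k$ and the BFGS formula of special case 2 (so that no linear system has to be solved), uses Armijo backtracking as the inexact line search, and is run on a hypothetical convex test function.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def armijo(f, x, d, slope, alpha=1e-4, shrink=0.5):
    """Backtracking line search (one possible inexact line search)."""
    fx, lam = f(x), 1.0
    for _ in range(60):
        trial = [xi + lam * di for xi, di in zip(x, d)]
        if f(trial) <= fx + alpha * lam * slope:
            break
        lam *= shrink
    return lam

def bfgs_inverse_update(H, p, q):
    Hq = matvec(H, q)
    s = dot(p, q)
    factor = (1.0 + dot(q, Hq) / s) / s
    n = len(p)
    return [[H[i][j] - (p[i] * Hq[j] + Hq[i] * p[j]) / s + factor * p[i] * p[j]
             for j in range(n)] for i in range(n)]

def bfgs_method(f, grad, x0, eps=1e-8, gamma=1e-3, max_iter=500):
    n = len(x0)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    x, g, H = list(x0), grad(x0), identity       # H_0 = I
    for k in range(max_iter):
        if dot(g, g) ** 0.5 <= eps:
            return x, k
        d = [-v for v in matvec(H, g)]           # d_k = -H_k g_k
        lam = armijo(f, x, d, dot(g, d))
        x_new = [xi + lam * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        p = [a - b for a, b in zip(x_new, x)]
        q = [a - b for a, b in zip(g_new, g)]
        pq = dot(p, q)
        if pq > 0.0 and pq >= gamma * (dot(p, p) * dot(q, q)) ** 0.5:
            H = bfgs_inverse_update(H, p, q)
        else:
            H = identity                         # restart with H_0
        x, g = x_new, g_new
    return x, max_iter

# Hypothetical convex test function with minimizer (0, 0).
def f(x):
    return 0.5 * x[0] ** 2 + 4.5 * x[1] ** 2 + 0.25 * x[0] ** 4

def grad(x):
    return [x[0] + x[0] ** 3, 9.0 * x[1]]

x_min, iterations = bfgs_method(f, grad, [2.0, 1.0])
```

The restart test $\langle p_k, q_k\rangle \ge \gamma \|p_k\|_2 \|q_k\|_2$ keeps $H_k$ positive definite, so every $d_k$ is a descent direction and the backtracking loop terminates.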
The literature on quasi-Newton methods is quite extensive. We did not intend to cover the whole range in detail. This is beyond the scope of our presentation and it might be confusing for a student. Convergence properties of these methods are somewhat difficult to prove. We refer the interested reader for example to [No/Wr], section 6.4 or [De/Sc] where a comprehensive treatment of quasi-Newton methods can be found.
Example 9
In order to finally apply the DFP and BFGS methods to our framework example, we again utilize the function fminunc, here with the options optimset('HessUpdate', 'dfp') and optimset('HessUpdate', 'bfgs'). For the starting point $w^{(0)} = (0, 0, 0)^T$ and the regularization parameter $\lambda = 1$ we need 62 (DFP) and 15 (BFGS) iteration steps to reach the tolerance of $10^{-7}$. As in the preceding examples, we only list a few of them:

DFP
k     w^(k)_1        w^(k)_2       w^(k)_3       f(w^(k))      ||∇f(w^(k))||_∞
0      0             0             0             8.31776617    6.00000000
2     −0.32703997    0.28060590    0.01405140    6.67865969    5.81620726
4     −0.48379776    0.71909320    0.07610771    5.68355186    0.19258622
6     −0.50560601    0.75191947    0.07912688    5.67974239    0.01290898
14    −0.50582098    0.75201407    0.07398917    5.67972332    0.03566679
23    −0.50463179    0.75347752    0.06704473    5.67971060    0.01623329
33    −0.50452876    0.75367743    0.06994114    5.67970303    0.01552518
43    −0.50502368    0.75304119    0.07161968    5.67970002    0.00302611
52    −0.50482358    0.75328767    0.07044751    5.67969941    0.00022112
62    −0.50485551    0.75324907    0.07066907    5.67969938    0.00000005

BFGS
k     w^(k)_1        w^(k)_2       w^(k)_3       f(w^(k))      ||∇f(w^(k))||_∞
0      0             0             0             8.31776617    6.00000000
1     −0.30805489    0.19253431    0             7.20420404    7.47605903
2     −0.38153345    0.43209082    0.03570334    6.10456800    3.59263698
3     −0.46878531    0.67521531    0.07011404    5.69897036    0.57610147
4     −0.49706979    0.73581744    0.07772590    5.68053511    0.09150746
6     −0.50578467    0.75171902    0.07917689    5.67974092    0.01002935
8     −0.50627420    0.75246799    0.07785948    5.67973141    0.01278409
10    −0.50590101    0.75350315    0.07263553    5.67970764    0.02001700
12    −0.50489391    0.75333394    0.07043112    5.67969945    0.00175692
15    −0.50485548    0.75324905    0.07066907    5.67969938    0.00000059

Lastly, we have compiled, for our framework example, some important data from the different methods for comparison:

                     λ = 0             λ = 1             λ = 2
                     iter    sec       iter    sec       iter    sec
Nelder–Mead          205     0.20      168     0.17      158     0.16
Steepest Descent     524     1.26       66     0.40       52     0.38
Trust Region          10     0.47        6     0.44        5     0.44
Newton                 6     0.014       5     0.013       5     0.013
DFP                  633     0.67       62     0.32       38     0.31
BFGS                  20     0.30       15     0.29       14     0.29
c(H*)                522.63             48.26             28.16

The table shows that the regularization parameter $\lambda$ may have great influence on the number of iterations (iter), the time needed and the condition of the corresponding Hessian $H^*$ in the respective minimizer. One should, however, heed the fact that these are the results of one particular problem and that no conclusions can be drawn from them to other problems. Moreover, it is quite possible that individual methods cannot be used, because necessary prerequisites are not met (for example, smoothness or convexity). Also, it may be dangerous to only look at the number of iteration steps as the costs per iteration can differ greatly!

Example 10
As already mentioned at the beginning of this chapter, with breast cancer diagnosis we now want to turn to a more serious 'real-life' problem for which we will again assemble important results from the different methods in tabular form:

                     λ = 0               λ = 1               λ = 2
                     iter      sec       iter      sec       iter      sec
Nelder–Mead           6 623     2.42      5 892     2.17      5 779     2.25
Steepest Descent     29 857    85.10      2 097     6.58      1 553     5.12
Trust Region             38     0.61          7     0.44          7     0.45
Newton                   11     0.03          8     0.02          7     0.02
DFP                  10 855     8.66      2 180     1.83      1 606     1.78
BFGS                    104     0.39         79     0.36         76     0.36
c(H*)                ≈ 1.6 e4             371.45              273.58
We have split the data (wisconsin-breast-cancer.data) into two portions: The first 120 instances are used as training data. The remaining 563 instances are used as test data to evaluate the 'performance' of the classifier or decision function. The exact figures are displayed in the following table:

        λ = 0           λ = 1           λ = 2
        M      B        M      B        M      B
P       362    3        366    5        362    5
N       17     181      13     179      17     179
Here ‘P’ stands for a positive test, ‘N’ for a negative test (from a medical point of view!). A malignant tumor is denoted by ‘M’ and a benign tumor by ‘B’. Accuracy, sensitivity and specificity are measures of the performance of a binary classification method, in this case of a medical test. Sensitivity measures the probability of a positive test among patients with disease. The higher the sensitivity, the fewer real cases of breast cancer (in our example) go undetected. The specificity measures the probability of a negative test among patients without disease (healthy people who are identified as not having breast cancer). Accuracy measures the proportion of all test persons correctly identified.
               λ = 0                       λ = 1                       λ = 2
Accuracy       (362+181)/563 = 0.9645      (366+179)/563 = 0.9680      (362+179)/563 = 0.9609
Sensitivity    362/379 = 0.9551            366/379 = 0.9657            362/379 = 0.9551
Specificity    181/184 = 0.9837            179/184 = 0.9728            179/184 = 0.9728
A comparison with the results for the support vector machine method shows that their accuracy values are almost the same: 95.74 % (SVM) versus 96.80 % (logistic regression with λ = 1 ).
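The three performance measures follow directly from the four counts of each table. The sketch below recomputes them from the figures listed above; the function name `metrics` and the dictionary layout are assumptions for illustration.

```python
def metrics(tp, fn, fp, tn):
    """tp: P and M, fn: N and M, fp: P and B, tn: N and B."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)      # positive tests among malignant cases
    specificity = tn / (tn + fp)      # negative tests among benign cases
    return accuracy, sensitivity, specificity

# Counts (tp, fn, fp, tn) for the regularization parameters 0, 1, 2.
tables = {0: (362, 17, 3, 181), 1: (366, 13, 5, 179), 2: (362, 17, 5, 179)}
results = {lam: tuple(round(v, 4) for v in metrics(*counts))
           for lam, counts in tables.items()}
```

The rounded entries of `results` agree with the accuracy, sensitivity and specificity values in the table.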
Exercises to Chapter 3
Exercises

1. The Rosenbrock function $f : \mathbb{R}^2 \longrightarrow \mathbb{R}$, $x \mapsto 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$, also compare http://en.wikipedia.org/wiki/Rosenbrock_function, is frequently utilized to test optimization methods.
a) The absolute minimum of $f$, at the point $(1, 1)^T$, can be seen without any calculations. Show that it is the only extremal point.
b) Graph the function $f$ on $[-1.5, 1.5] \times [-0.5, 2]$ as a 3D plot or as a level curve plot to understand why $f$ is also referred to as the banana function and the like.
c) Implement the Nelder–Mead method. Visualize the level curves of the given function together with the polytope for each iteration. Finally visualize the trajectory of the centers of gravity of the polytopes!
d) Test the program with the starting polytope given by the vertices $(-1, 1)^T$, $(0, 1)^T$, $(-0.5, 2)^T$ and the parameters $(\alpha, \beta, \gamma) := (1, 2, 0.5)$ and $\varepsilon := 10^{-4}$ using the Rosenbrock function. How many iterations are needed? What is the distance between the calculated solution and the exact minimizer $(1, 1)^T$?
e) Find $(\alpha, \beta, \gamma) \in [0.9, 1.1] \times [1.9, 2.1] \times [0.4, 0.6]$ such that with $\varepsilon := 10^{-4}$ the algorithm terminates after as few iterations as possible. What is the distance between the solution and the minimizer $(1, 1)^T$ in this case?
f) If the distance in e) was greater than in d), reduce $\varepsilon$ until the algorithm gives a result, with the $(\alpha, \beta, \gamma)$ found in e), which is not farther away from $(1, 1)^T$ and at the same time needs fewer iterations than the solution in d).

2. Implement Shor's ellipsoid method and use it to search for the minimizers of the functions $f_1, f_2 : \mathbb{R}^2 \longrightarrow \mathbb{R}$ defined by $f_1(x) := 3x_1^2 + x_2^2 + 3x_1$ and $f_2(x) := x_1^2 + \cos(\pi(x_1 + x_2)) + \tfrac12 (x_2 - 1)^2$ as well as the Rosenbrock function. Let the starting ellipsoid $E^{(0)}$ be given by
$$x^{(0)} := (0, -1)^T \quad\text{and}\quad A_0 := \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix} .$$
For $k = 1, 2, \dots$ visualize the level curves of each function together with the corresponding ellipsoids $E^{(k)}$ of the $k$-th iteration. What is your observation when you are trying to find the minimum of the concave function $f$ given by $f(x_1, x_2) := -x_1^2 - x_2^2$?

3. Modified Armijo step size rule
Let a differentiable function $\varphi : \mathbb{R}_+ \longrightarrow \mathbb{R}$ with $\varphi'(0) < 0$ be given, in addition $0 < \alpha < 1$, an initial step size $\lambda_0$ and a step size factor $\varrho > 1$. Determine the maximal step size $\lambda$ in $S := \{ \lambda_0 \varrho^k \mid k \in \mathbb{Z} \}$ for which $\varphi(\lambda) \le \varphi(0) + \alpha \lambda \varphi'(0)$ holds.
a) Implement this Armijo step size rule with Matlab® or Maple® and test it for different $\varphi$, $\alpha$ and $\varrho$. Observe the distance from the exact minimizer of $\varphi$. Study the examples
$$\varphi(t) := 1 - \frac{1}{1 - t + 2t^2} , \qquad \varphi(t) := 2 - 4t + \exp(t) , \qquad \varphi(t) := 1 - t \exp(-t^2)$$
with $\alpha \in \{0.1, 0.9\}$, $\varrho \in \{1.5, 2\}$ and $\lambda_0 := 0.1$.
b) Implement the gradient method with this step size rule. (In each iteration step the function $\varphi$ is defined by $\varphi(t) := f(x^{(k)} - t g_k)$, where $f$ is the objective function to be minimized.) Compare the results of this inexact line search with those of the exact line search. Test it with the function $f$ given by
$$f(x_1, x_2) := \tfrac12 x_1^2 + \tfrac92 x_2^2 ,$$
and the weakened banana-shaped valley function of Rosenbrock defined by $f(x_1, x_2) := 10\,(x_2 - x_1^2)^2 + (1 - x_1)^2$. The number of iterations needed to obtain a euclidean distance from the minimizer of less than $\varepsilon = 10^{-k}$ ($k = 1, \dots, 6$) can serve as a criterion for comparison.

4. Let $f : \mathbb{R}^n \longrightarrow \mathbb{R}$ be a continuously differentiable function and $H \in M_n$ a symmetric positive definite matrix. Show:
a) $\langle x, y\rangle_H := \langle x, Hy\rangle$ defines an inner product on $\mathbb{R}^n$. The thereby induced norm is denoted by $\|\cdot\|_H$.
b) The direction of steepest descent of $f$ in a point $x$ with $\nabla f(x) \neq 0$ with respect to the norm $\|\cdot\|_H$, that is, the solution to the optimization problem
$$\min \nabla f(x)^T d \quad\text{such that}\quad \|d\|_H = 1 ,$$
is given by
$$d = -\frac{H^{-1} \nabla f(x)}{\|H^{-1} \nabla f(x)\|_H} .$$
c) Let $(x_k) \in (\mathbb{R}^n)^{\mathbb{N}}$ and $(d_k) \in (\mathbb{R}^n)^{\mathbb{N}}$ with $d_k := -H^{-1} \nabla f(x_k) \neq 0$. Then the sequence $(d_k)$ meets the angle condition, that is, there exists a constant $c > 0$ such that for all $k \in \mathbb{N}$
$$-\frac{\langle \nabla f(x_k), d_k\rangle}{\|\nabla f(x_k)\|\, \|d_k\|} \ge c .$$
5. Golden Section Search
The one-dimensional 'line search' is the basis of many multidimensional optimization methods. We will have a closer look at the following simple variant for the minimization of a continuous function $f : \mathbb{R} \longrightarrow \mathbb{R}$.
Algorithm:
Initialize $t_1^{(0)}, t_2^{(0)}$ such that $t^* \in \big[ t_1^{(0)}, t_2^{(0)} \big]$ holds for an optimizer $t^*$. Set $j = 0$. Calculate
$$t_3^{(j)} := \alpha\, t_1^{(j)} + (1 - \alpha)\, t_2^{(j)} , \qquad t_4^{(j)} := \alpha\, t_2^{(j)} + (1 - \alpha)\, t_1^{(j)} ,$$
where $\alpha := \tfrac12 (\sqrt{5} - 1)$.
• If $f\big(t_3^{(j)}\big) > f\big(t_4^{(j)}\big)$, set $t_1^{(j+1)} = t_3^{(j)}$ and $t_2^{(j+1)} = t_2^{(j)}$.
• Else, set $t_2^{(j+1)} = t_4^{(j)}$ and $t_1^{(j+1)} = t_1^{(j)}$.
Set $j = j + 1$ and repeat the above calculations until convergence is reached, that is, $t_2^{(j)} - t_1^{(j)} \le \varepsilon$, where $\varepsilon$ is a given tolerance.
a) Show that this algorithm always converges (where to is a different question), hence $t_2^{(j)} - t_1^{(j)} \longrightarrow 0$ as $j \longrightarrow \infty$. How many iterations are necessary to go below a given tolerance $\varepsilon$?
b) Implement this algorithm and determine the minimum of the functions $f_1, f_2, f_3 : \mathbb{R} \longrightarrow \mathbb{R}$ with
$$f_1(t) := |t| , \qquad f_2(t) := t^2 + 5\,|\sin(t)| , \qquad f_3(t) := t^2 + 5\,|\sin(3t)|$$
and the starting interval $\big[ t_1^{(0)}, t_2^{(0)} \big] = [-5, 5]$ numerically. What can be observed?
c) Show that for convex functions the algorithm converges to a minimizer.

6. Let $f : \mathbb{R}^2 \longrightarrow \mathbb{R}$ with $f(x) = \tfrac12 \langle x, Ax\rangle + \langle b, x\rangle$ and $A$ symmetric, positive definite. Implement the gradient method to minimize $f$ (cf. p. 100 f).
a) Try to select the matrix $A$, the vector $b$ and the starting vector $x_0$ such that the trajectory (the line connecting the approximations $x_0, x_1, x_2, \dots$) forms a zigzag line similar to the following picture:
[Figure: zigzag trajectory of the gradient method]
b) Try to choose the matrix such that the angle between consecutive directions becomes as small as possible ('wild zigzagging'). Make a conjecture as to what the smallest possible angle in this zigzag line is. Prove this!
c) Which conditions does the matrix $A$ have to meet to get a 'spiral' trajectory (see below)?
[Figure: spiral trajectory of the gradient method]

7. Let $f : \mathbb{R}^2 \longrightarrow \mathbb{R}$. Implement the gradient method to solve
$$f(x) \longrightarrow \min , \quad x \in \mathbb{R}^2 .$$
Use the Golden Section Search algorithm (GSS) from exercise 5 to determine the step size. Visualize the level curves of $f$ together with the approximations of the solutions $x_k$, $k = 0, 1, 2, \dots$.
a) Do the first 20 iteration steps with the gradient method. Choose
$$f(x) := x_1^4 + 20\, x_2^2 \quad\text{and}\quad x_0 := (1, 0.5)^T$$
as the test function and the starting point. Let $\varepsilon$ be the termination tolerance for the one-dimensional minimization.
b) Deactivate the visualization part in your program (since with smaller problems the visualization is often the most costly operation). Then repeat a) and complete the following table:

ε                            10^−1   10^−2   10^−4   10^−6   10^−8   10^−10   10^−12
||x − x*||_2
total runtime
percentage needed for GSS

What conclusions can you draw? What strategy would you recommend for the selection of $\varepsilon$?

8. We are going to use the gradient method to minimize the function $f : \mathbb{R}^n \longrightarrow \mathbb{R}$ defined by $f(x) := \tfrac12 \langle x, Ax\rangle$ with a positive definite matrix $A$. When choosing the step size, however, assume that we allow there to be a certain tolerance $\delta \bar{\lambda}_k$, that is, assume
$$x^{(k+1)} = x^{(k)} - \lambda_k g_k$$
with $|\lambda_k - \bar{\lambda}_k| \le \delta \bar{\lambda}_k$, where $\bar{\lambda}_k$ minimizes the term $f(x^{(k)} - \lambda g_k)$ with reference to $\lambda \in \mathbb{R}$.
a) Express the speed of convergence of this algorithm with the help of the condition number $c(A)$ and the tolerance $\delta$. b) What is the largest value $\delta$ for which the algorithm still converges? Explain this geometrically!
9. The (stationary) temperature distribution $T$ in a long bar can be expressed by means of the differential equation
$$-T''(x) = f(x), \quad x \in (0, 1). \qquad (23)$$
$f(x)$ is the intensity of the heat source. The temperature at both ends is given by $T(0) = T(1) = 0$.
For twice continuously differentiable functions $T : [0, 1] \to \mathbb{R}$ the solution to (23) is also the minimizer of the following variational problem
$$\tfrac12\, a(T, T) - \ell(T) \longrightarrow \min, \quad T(0) = T(1) = 0, \qquad (24)$$
where $a$ and $\ell$ are given by $a(T, P) = \int_0^1 T'(x)\,P'(x)\,dx$ and $\ell(P) = \int_0^1 f(x)\,P(x)\,dx$ (cf. [St/Bu], p. 540 ff).
If we discretize (24), we obtain a ‘classic’ optimization problem. Let $x_i = (i-1)h$, $i = 1, 2, \ldots, N$, be equidistant points in the interval $[0, 1]$ with $h = 1/(N-1)$ and $x_{i+\frac12} = x_i + \frac h2$, $i = 1, 2, \ldots, N-1$. Let furthermore $g_i := g(x_i)$ and $g_{i+\frac12} := g(x_{i+\frac12})$ for a function $g : [0, 1] \to \mathbb{R}$. We will use the following approximations
$$\frac{dg}{dx}\big(x_{i+\frac12}\big) \approx \frac1h\,(g_{i+1} - g_i), \qquad g_{i+\frac12} \approx \frac12\,(g_{i+1} + g_i), \qquad \int_{x_i}^{x_{i+1}} g(x)\,dx \approx h\, g_{i+\frac12} .$$
a) Formulate the approximations of (24) as an unconstrained optimization problem! b) Show that this problem has a unique solution! c) Find an approximate solution to (24) via (i) cyclic control (relaxation control), (ii) the gradient method. Use the Armijo step size rule. Visualize the temperature distribution in each iteration step. What is your observation?
d) Replace the Armijo step size rule by something else. What happens, for example, if you set $\lambda_k = 0.1$ ($0.01$), $\lambda_k = 0.2$ or $\lambda_k = 1/k$? Hint: In the special case $f(x) \equiv c$ in $(0, 1)$ the solution to (23) with the constraints $T(0) = T(1) = 0$ can be determined analytically. It is
$$T(x) = \tfrac12\, c\, x\,(1 - x).$$
The vector $\big(T(x_1), \ldots, T(x_{N-1})\big)^T$ also yields the minimum of the approximating variational problem at the same time. This fact can be used to test the program.
10. Calculate the Levenberg–Marquardt trajectory to a) $f(x) := 2x_1^2 - 2x_1x_2 + x_2^2 + 2x_1 - 2x_2$ at the point $x^{(0)} := (0, 0)^T$, b) $f(x) := x_1 x_2$ at the point $x^{(0)} := (1, 0.5)^T$, c) $f(x) := x_1 x_2$ at the point $x^{(0)} := (1, 1)^T$. Treat the examples according to the case differentiations on page 113.
11. Do one iteration step of the trust region method for the example $f(x) := x_1^4 + x_1^2 + x_2^2$. Choose $x^{(0)} := (1, 1)^T$, $H_0 := \nabla^2 f(x^{(0)})$ and $\Delta_0 := \frac34$ and determine $d_0$ with the help of Powell’s dogleg trajectory.
12. We again consider the weakened Rosenbrock function $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x) := 10\,(x_2 - x_1^2)^2 + (1 - x_1)^2$. a) Plot the level curves of $f$ and the level curves of the quadratic approximation of $f$ at the point $x_0 := (0, -1)^T$. b) Calculate the solutions to the quadratic minimization problem
$$\varphi(x_0 + d) := f(x_0) + \nabla f(x_0)^T d + \tfrac12\, d^T \nabla^2 f(x_0)\, d \longrightarrow \min, \quad \|d\| \le \Delta_0$$
for the trust regions with the radii $\Delta_0 := 0.25,\ 0.75,\ 1.25$. c) Repeat a) and b) for $x_0 := (0, 0.5)^T$.
13. Consider the quadratic function $f : \mathbb{R}^3 \to \mathbb{R}$ defined by $f(x) := x_1^2 - x_1x_2 + x_2^2 - x_2x_3 + x_3^2$. First write $f$ in the form
$$f(x) = \tfrac12 \langle x, Ax\rangle + \langle b, x\rangle$$
with a symmetric positive definite matrix $A \in M_3$ and $b \in \mathbb{R}^3$. Calculate the minimum of this function by means of the conjugate gradient method. Take $x^{(0)} := (0, 1, 2)^T$ as a starting vector and apply exact line search.
14. Calculate the minimum of the quadratic function $f$ in the preceding exercise by means of the DFP method. Start at $x^{(0)} := (0, 1, 2)^T$ with $H_0 := I$ and apply exact line search. Give the inverse of the Hessian of $f$.
15. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x) := -12x_2 + 4x_1^2 + 4x_2^2 - 4x_1x_2$. Write $f$ as $f(x) = \frac12 \langle x, Ax\rangle + \langle b, x\rangle$ with a positive definite symmetric matrix $A \in M_2$ and $b \in \mathbb{R}^2$. To $d_1 := (1, 0)^T$ find all the vectors $d_2 \in \mathbb{R}^2$ such that the pair $(d_1, d_2)$ is $A$-conjugate. Minimize $f$ starting from $x^{(0)} := (-\frac12, 1)^T$ in the direction of $d_1$ and from the thus obtained point in the direction of $d_2 := (1, 2)^T$. Sketch the situation (level curves; $x^{(0)}, x^{(1)}, x^{(2)}$).
16. Let $n \in \mathbb{N}$ and $T_n$ the $n$-th Chebyshev polynomial, cf. p. 126 ff. Verify:
a) $T_n(t) = 2^{n-1} \prod_{\nu=0}^{n-1} \Big(t - \cos\Big(\frac{2\nu+1}{2n}\,\pi\Big)\Big)$
b) Among all polynomials of degree $n$ with leading coefficient $2^{n-1}$, $T_n$ has the smallest maximum norm on the interval $[-1, 1]$, which is 1.
17. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be given by the function in exercise 12. Starting from $x^{(0)} := (0, -1)^T$ carry out some steps of the Fletcher and Reeves algorithm using ‘inexact line search’ by hand and sketch the matter.
18. Solve the least squares problem
$$F(x) := \frac12 \sum_{i=1}^{5} f_i(x)^2 \longrightarrow \min$$
with $f_i(x) := x_1 e^{x_2 t_i} - y_i$ and

t_i    1    2    4    5    8
y_i    3    4    6   11   20

Use $x^{(0)} := (3, 0.5)^T$ as a starting point (cf. [GNS], p. 743 f (available in appendix D at the book web site http://www.siam.org/books/ot108) or p. 409 f of the first edition).
a) We apply the routine fminunc of the Matlab® Optimization Toolbox which realizes several minimization methods: steepest descent, BFGS, DFP, trust region method. By default we have GradObj = off. In this case the gradient will be calculated numerically by finite differences. First implement the objective function (without gradient) as described in help fminunc or doc fminunc beginning with the headline
function F = myfun(x). Now test this implementation. Further apply fminunc to minimize $F$. Use optimset('Display','iter','TolX',1e-6) to get more information about the iterative process and to raise the precision to $10^{-6}$. Next implement the gradient (cf. help fminunc). Activate the gradient for the solution algorithm by means of optimset('GradObj','on') and repeat the computations. For which of the methods is the gradient absolutely necessary? By means of optimset('HessUpdate','steepdesc'), for instance, you can activate the gradient method. Compare the numerical results of the different methods.
b) Carry out similar experiments with lsqnonlin and lsqcurvefit and try to activate the Levenberg–Marquardt method as well as the Gauss–Newton method.
c) Visualize the results!
19. Practical Realization of the BFGS Update. Let us assume $B = LL^T$ with a lower triangular matrix $L \in M_n$ and $p, q \in \mathbb{R}^n$ with $\langle p, q\rangle > 0$. Try to calculate, in an efficient way, the Cholesky decomposition $B' = L'(L')^T$ of the matrix $B'$ defined by
$$B' := B + \frac{q q^T}{\langle p, q\rangle} - \frac{B p p^T B}{\langle p, Bp\rangle} .$$
Show that this needs at most $O(n^2)$ operations. Hints:
a) Define $L^* := L + u v^T$ with $u := q/\sqrt{\langle p, q\rangle} - Lv$ and $v := L^T p / \|L^T p\|_2$. Then it holds that $L^* (L^*)^T = B'$. b) Choose Givens rotations $G_{n-1,n}, \ldots, G_{1,2}$ such that $G_{1,2} \cdots G_{n-1,n}\, v = e_1$; then for $Q := G_{n-1,n}^T \cdots G_{1,2}^T$ it holds that $\tilde L := L^* Q$ has lower Hessenberg form. c) Eliminate the upper diagonal of $\tilde L$ using appropriate Givens rotations.
4 Linearly Constrained Optimization Problems
Chapter 4
4.1 Linear Optimization
    The Revised Simplex Method
    Numerical Realization of the Method
    Calculation of a Feasible Basis
    The Active Set Method
4.2 Quadratic Optimization
    The Active Set Method
    Minimization Subject to Linear Equality Constraints
    The Goldfarb–Idnani Method
4.3 Projection Methods
    Zoutendijk’s Method
    Projected Gradient Method
    Reduced Gradient Method
    Preview of SQP Methods
Exercises

The problems we consider in this chapter have general objective functions but the constraints are linear. Section 4.1 gives a short introduction to linear optimization (LO), also referred to as linear programming, which is the historically entrenched term. LO is the simplest type of constrained optimization: the objective function and all constraints are linear. The classical, and still very usable, algorithm for solving linear programs is the Simplex Method.
Quadratic problems, which we treat in section 4.2, are linearly constrained optimization problems with a quadratic objective function. Quadratic optimization is often considered to be an essential field in its own right. More importantly, however, it forms the basis of several algorithms for general nonlinearly constrained problems.
In section 4.3 we give a concise outline of projection methods, in particular the feasible direction methods of Zoutendijk, Rosen and Wolfe. They are extensions of the steepest descent method and are closely related to the simplex algorithm and the active set method. Then we will discuss some basic ideas of SQP methods (treated more generally in section 5.2), which have proven to be very efficient for wide classes of problems.
W. Forst and D. Hoffmann, Optimization—Theory and Practice, Springer Undergraduate Texts in Mathematics and Technology, © Springer Science+Business Media, LLC 2010, DOI 10.1007/978-0-387-78977-4_4
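We write, as usual, for $k \in \mathbb{N}$ and vectors $z, y \in \mathbb{R}^k$
$$z \le y \ :\Longleftrightarrow\ y \ge z \ :\Longleftrightarrow\ \forall\, \kappa \in \{1, \ldots, k\}\quad y_\kappa \ge z_\kappa$$
and
$$z < y \ :\Longleftrightarrow\ y > z \ :\Longleftrightarrow\ \forall\, \kappa \in \{1, \ldots, k\}\quad y_\kappa > z_\kappa ,$$
in particular
$$0 \le y \ :\Longleftrightarrow\ y \ge 0 \ :\Longleftrightarrow\ \forall\, \kappa \in \{1, \ldots, k\}\quad y_\kappa \ge 0$$
and
$$0 < y \ :\Longleftrightarrow\ y > 0 \ :\Longleftrightarrow\ \forall\, \kappa \in \{1, \ldots, k\}\quad y_\kappa > 0 .$$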
For j1 , . . . , jk ∈ R and J := (j1 , . . . , jk ) we denote by S(J) the set of the components of J, that is, S(J) := {j1 , . . . , jk } . In contrast to S(J) , the order of the components of J is important.
4.1 Linear Optimization
The development of the simplex algorithm by George Dantzig (1947) marks the beginning of the age of modern optimization. This method makes it possible to analyze planning problems for large industrial and manufacturing systems in a systematic and efficient manner and to determine optimal solutions. Dantzig’s considerations appeared simultaneously with the development of the first digital computers, and the simplex method became one of the earliest important applications of this new and revolutionary technology. Nowadays there exists sophisticated software for this kind of problem. The ‘confidence’ in it is occasionally so strong that the simplex method is even used if the problem is nonlinear. From the variety of topics and variants we will treat only the revised simplex algorithm, and that relatively compactly. We will not discuss the close and interesting relations to geometric questions either.
The Revised Simplex Method
Let $m, n \in \mathbb{N}$ with $m \le n$, $A$ a real $(m, n)$-matrix with the columns $a_1, \ldots, a_n$ and $\operatorname{rank}(A) = m$. Furthermore let $b \in \mathbb{R}^m$ and $c \in \mathbb{R}^n$. Every primal problem
$$(P)\qquad c^T x \longrightarrow \min, \quad Ax = b,\ x \ge 0$$
has a corresponding dual problem (cf. section 2.4, p. 71)
$$(D)\qquad b^T u \longrightarrow \max, \quad A^T u \le c .$$
For vectors $x$ and $u$ taken from the respective feasible regions $F_P := \{x \in \mathbb{R}^n \mid Ax = b,\ x \ge 0\}$ and $F_D := \{u \in \mathbb{R}^m \mid A^T u \le c\}$ it holds that
$$c^T x \ge (u^T A)\,x = u^T (Ax) = u^T b = b^T u .$$
Hence, we get for the respective minimum $\min(P)$ of $(P)$ and maximum $\max(D)$ of $(D)$ the following weak duality:
$$\max(D) \le \min(P) \qquad (1)$$
Later on we will see that even $\max(D) = \min(P)$ holds if both sets $F_P$ and $F_D$ are nonempty, that is, there exist feasible points for both problems.
Let $N := \{1, \ldots, n\}$, $j_1, \ldots, j_m \in N$ and $J := (j_1, \ldots, j_m)$.
Definition. $J$ is called a basis of $(P)$ iff the matrix $A_J := (a_{j_1}, \ldots, a_{j_m})$ is invertible. The corresponding variables $x_{j_1}, \ldots, x_{j_m}$ are then called basic variables, the remaining variables are referred to as nonbasic variables.
Let $J$ be a basis of $(P)$ and $K = (k_1, \ldots, k_{n-m}) \in N^{n-m}$ with $S(K) \cup S(J) = N$. Corresponding to $K$ we define the matrix $A_K := (a_{k_1}, \ldots, a_{k_{n-m}})$. We split each vector $x \in \mathbb{R}^n$ into subvectors $x_J := (x_{j_1}, \ldots, x_{j_m})$ and $x_K := (x_{k_1}, \ldots, x_{k_{n-m}})$, where $x_J$ and $x_K$ refer to the basic and nonbasic variables, respectively. Then obviously $Ax = A_J x_J + A_K x_K$ holds. Using this splitting for the equation $Ax = b$, the substitution $x_J = A_J^{-1}(b - A_K x_K)$ with $x_K \in \mathbb{R}^{n-m}$ gives a parametrization of the solution set $\{x \in \mathbb{R}^n \mid Ax = b\}$. Corresponding to $J$ there exists a unique basic point $\bar x = x(J)$ to the linear system $Ax = b$ with $\bar x_K = 0$ and $A_J \bar x_J = b$.
Definition. A basis $J$ of $(P)$ is called feasible iff $x(J) \ge 0$ holds, that is, all components of the corresponding basic point $\bar x = x(J)$ are nonnegative. Then $\bar x$ is called the feasible basic point of $(P)$. If furthermore $\bar x_J > 0$ holds, $\bar x$ is called nondegenerate.
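The definitions of basic point and weak duality can be checked on a small instance. The following is a minimal sketch in Python, not part of the original text: the helper names `solve` and `basic_point` are hypothetical, the data are those of Example 1 of this section (after slack variables have been introduced), and the dual point `u_hand` is simply a feasible point chosen by hand. Exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction as Fr

def solve(M, rhs):
    """Solve the square system M z = rhs exactly (Gauss-Jordan, rational arithmetic)."""
    n = len(M)
    T = [[Fr(v) for v in row] + [Fr(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        # pick a row with nonzero pivot; StopIteration here means M is singular
        piv = next(r for r in range(col, n) if T[r][col] != 0)
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(n):
            if r != col and T[r][col] != 0:
                f = T[r][col]
                T[r] = [a - f * p for a, p in zip(T[r], T[col])]
    return [T[r][n] for r in range(n)]

def basic_point(A, b, J):
    """Basic point x(J): x_K = 0 and A_J x_J = b (J holds 1-based column indices)."""
    AJ = [[row[j - 1] for j in J] for row in A]
    xJ = solve(AJ, b)
    x = [Fr(0)] * len(A[0])
    for j, v in zip(J, xJ):
        x[j - 1] = v
    return x

# Data of Example 1 in standard form (P)
A = [[2, 1, 1, 1, 0, 0],
     [1, 2, 3, 0, 1, 0],
     [2, 2, 1, 0, 0, 1]]
b = [2, 5, 6]
c = [-3, -1, -3, 0, 0, 0]

x = basic_point(A, b, (4, 5, 6))          # feasible basic point of the slack basis
u_hand = [-2, -1, 0]                      # assumption: a dual point picked by hand
dual_feasible = all(sum(A[i][j] * u_hand[i] for i in range(3)) <= c[j] for j in range(6))
primal_value = sum(ci * xi for ci, xi in zip(c, x))
dual_value = sum(bi * ui for bi, ui in zip(b, u_hand))
print(x, dual_feasible, dual_value, primal_value)
```

For any feasible pair the printed dual value cannot exceed the primal value, which is exactly inequality (1).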
The objective function can be expressed by the nonbasic variables; with the basic point $\bar x = x(J)$, substitution gives
$$c^T x = c_J^T x_J + c_K^T x_K = c_J^T A_J^{-1}(b - A_K x_K) + c_K^T x_K = c_J^T \bar x_J + \bar c^T x_K \quad\text{with}\quad \bar c^T := c_K^T - c_J^T A_J^{-1} A_K .$$
If $J$ is feasible and $\bar c \ge 0$, then we have obviously found the minimum.
Algorithm
Let $J = (j_1, \ldots, j_m)$ be a feasible basis.
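① Choose $K = (k_1, \ldots, k_{n-m}) \in N^{n-m}$ with $S(K) \cup S(J) = N$. Compute $\bar b := \bar x_J$ with $A_J \bar x_J = b$.
② $\bar c^T := c_K^T - c_J^T A_J^{-1} A_K$. If $\bar c \ge 0$: STOP. We have found the minimum (see above!).
③ Otherwise there exists an index $s = k_\sigma \in S(K)$ with $\bar c_\sigma < 0$. The index $s$ enters the basis.
④ Compute the solution $\bar a_s = (\bar a_{1,s}, \ldots, \bar a_{m,s})^T$ of $A_J \bar a_s = a_s$. If $\bar a_s \le 0$: STOP. The objective function is unbounded (see below!). Otherwise determine a $\rho \in \{1, \ldots, m\}$ with
$$\frac{\bar b_\rho}{\bar a_{\rho,s}} = \min_{\bar a_{\mu,s} > 0} \frac{\bar b_\mu}{\bar a_{\mu,s}} ; \qquad r := j_\rho .$$
The index $r$ leaves the basis: $J' := (j_1, \ldots, j_{\rho-1}, s, j_{\rho+1}, \ldots, j_m)$. Update $J$: $J := J'$; go to ①.
Remark. For $\sigma \in \{1, \ldots, n-m\}$ with $k_\sigma = s$ we consider $d \in \mathbb{R}^n$ with $d_J := -\bar a_s$ and $d_K := e_\sigma \in \mathbb{R}^{n-m}$. Then $Ad = A_J d_J + A_K d_K = -a_s + a_{k_\sigma} = 0$ holds, and for $\tau \in \mathbb{R}_+$ we obtain
$$c^T(\bar x + \tau d) = c^T \bar x + \tau\,\big(c_J^T d_J + c_K^T d_K\big) \quad\text{with}\quad c_J^T d_J + c_K^T d_K = \bar c_\sigma < 0 :$$
$c_K^T = \bar c^T + c_J^T A_J^{-1} A_K$ holds by definition of $\bar c$; hence, $c_K^T d_K = \bar c^T e_\sigma + c_J^T A_J^{-1} a_s = \bar c_\sigma + c_J^T \bar a_s$ and therefore $c_J^T d_J + c_K^T d_K = \bar c_\sigma < 0$.
Hence, the term $c^T(\bar x + \tau d)$ is strongly antitone (strictly decreasing) in $\tau$.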
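From $A\bar x = b$ and $Ad = 0$ we get $A(\bar x + \tau d) = b$. We thus obtain:
$$\bar x + \tau d \in F_P \iff \bar x_J + \tau d_J = \bar b - \tau\, \bar a_s \ge 0 .$$
If $\bar a_s \le 0$, then all $\bar x + \tau d$ belong to $F_P$, and because of $c^T(\bar x + \tau d) \to -\infty$ $(\tau \to \infty)$ the objective function is unbounded. Otherwise:
$$\bar b - \tau\, \bar a_s \ge 0 \iff \tau \le \min_{\bar a_{\mu,s} > 0} \frac{\bar b_\mu}{\bar a_{\mu,s}} .$$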
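Steps ① to ④ can be sketched compactly in code. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the entering index is taken as the first nonbasic index with negative reduced cost, $K$ is kept in ascending order, ties in the ratio test are broken by position, and no safeguard against cycling in degenerate problems is included. The linear systems are solved exactly with rational arithmetic; the data are those of Example 1 of this section.

```python
from fractions import Fraction as Fr

def solve(M, rhs):
    """Solve the square system M z = rhs exactly (Gauss-Jordan, rational arithmetic)."""
    n = len(M)
    T = [[Fr(v) for v in row] + [Fr(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if T[r][col] != 0)
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(n):
            if r != col and T[r][col] != 0:
                f = T[r][col]
                T[r] = [a - f * p for a, p in zip(T[r], T[col])]
    return [T[r][n] for r in range(n)]

def revised_simplex(A, b, c, J):
    """Revised simplex for c^T x -> min, Ax = b, x >= 0, started at a feasible basis J
    (1-based indices). Returns (x, u, J) with x primal optimal and u solving A_J^T u = c_J."""
    m, n = len(A), len(A[0])
    J = list(J)
    while True:
        K = [k for k in range(1, n + 1) if k not in J]            # step 1
        AJ = [[A[i][j - 1] for j in J] for i in range(m)]
        AJt = [list(col) for col in zip(*AJ)]
        bbar = solve(AJ, b)                                       # A_J bbar = b
        u = solve(AJt, [c[j - 1] for j in J])                     # A_J^T u = c_J
        cbar = [c[k - 1] - sum(u[i] * A[i][k - 1] for i in range(m)) for k in K]  # step 2
        if all(v >= 0 for v in cbar):
            x = [Fr(0)] * n
            for j, v in zip(J, bbar):
                x[j - 1] = v
            return x, u, J
        sigma = next(i for i, v in enumerate(cbar) if v < 0)      # step 3
        s = K[sigma]
        abar = solve(AJ, [A[i][s - 1] for i in range(m)])         # step 4: A_J abar = a_s
        rows = [i for i in range(m) if abar[i] > 0]
        if not rows:
            raise ValueError("objective function is unbounded from below")
        rho = min(rows, key=lambda i: bbar[i] / abar[i])          # ratio test
        J[rho] = s                                                # s enters, j_rho leaves

# Example 1 in standard form, started at the slack basis (4, 5, 6)
A = [[2, 1, 1, 1, 0, 0],
     [1, 2, 3, 0, 1, 0],
     [2, 2, 1, 0, 0, 1]]
b = [2, 5, 6]
c = [-3, -1, -3, 0, 0, 0]
x_opt, u_opt, J_opt = revised_simplex(A, b, c, (4, 5, 6))
value = sum(ci * xi for ci, xi in zip(c, x_opt))
print(x_opt, u_opt, J_opt, value)
```

Note that the sketch re-solves the three linear systems from the original data in every pass instead of updating an explicit inverse.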
Advantage of This Method
It saves much computing time (in comparison with the ‘normal’ simplex method) in the case of $m \ll n$. Additionally the data can be kept in mass storage, if necessary.
We will now discuss how a solution to $(P)$ can give a solution to $(D)$: Let $J$ be a feasible basis of $(P)$ with $c_K^T \ge c_J^T A_J^{-1} A_K$; therefore (cf. p. 154) $\bar x = x(J)$ is a minimizer. Let $u \in \mathbb{R}^m$ be the solution to $A_J^T u = c_J$; then $A_K^T u = A_K^T A_J^{-T} c_J \le c_K$ yields $A^T u \le c$, that is, $u \in F_D$. Furthermore
$$b^T u = u^T b = c_J^T A_J^{-1} b = c_J^T \bar x_J = \min(P) ;$$
we conclude from (1) that $u = A_J^{-T} c_J$ gives a solution to $(D)$.
Numerical Realization of the Method
Let $J$ be a feasible basis; suppose that the feasible basis $J'$ results from $J$ via an exchange step of the kind described above. It holds that
$$A_J = A_{J'}\, T, \quad\text{and hence}\quad A_{J'}^{-1} = T A_J^{-1}, \qquad (2)$$
where $T$ is the identity matrix with its $\rho$-th column replaced by $(v_1, \ldots, v_m)^T$:
$$T = \begin{pmatrix} 1 & & v_1 & & \\ & \ddots & \vdots & & \\ & & v_\rho & & \\ & & \vdots & \ddots & \\ & & v_m & & 1 \end{pmatrix} \quad\text{with}\quad v_\mu = -\frac{\bar a_{\mu,s}}{\bar a_{\rho,s}} \ (\mu \ne \rho), \qquad v_\rho = \frac{1}{\bar a_{\rho,s}} .$$
Regular matrices which deviate from the identity matrix in only one column are called Frobenius matrices.
Proof of (2): With $w_\mu := a_{j_\mu}$ for $\mu = 1, \ldots, m$ we obtain $A_J = (w_1, \ldots, w_m)$ and $A_{J'} = (w_1, \ldots, w_{\rho-1}, a_s, w_{\rho+1}, \ldots, w_m)$. $T e_\mu = e_\mu$ for $\mu = 1, \ldots, m$ with $\mu \ne \rho$ shows $A_J e_\mu = A_{J'} T e_\mu$. It remains to prove that $A_J e_\rho = w_\rho = A_{J'} T e_\rho$:
$$A_{J'} T e_\rho = \frac{1}{\bar a_{\rho,s}}\, A_{J'}\big(-\bar a_s + (1 + \bar a_{\rho,s})\, e_\rho\big) = \frac{1}{\bar a_{\rho,s}}\,\big(-A_{J'} \bar a_s + (1 + \bar a_{\rho,s})\, a_s\big) = \frac{1}{\bar a_{\rho,s}}\,\big((A_J - A_{J'})\, \bar a_s + \bar a_{\rho,s}\, a_s\big) = \frac{1}{\bar a_{\rho,s}}\,\big(\bar a_{\rho,s}(w_\rho - a_s) + \bar a_{\rho,s}\, a_s\big) = w_\rho .$$
Starting with $J_1 := J$, we carry out $k$ exchange steps of this kind with Frobenius matrices $T_1, \ldots, T_k$ and get $A_{J_1} = A_{J_2} T_1$ with $T_1 := T$ and $J_2 := J'$ and hence
$$A_{J_1} = A_{J_{k+1}}\, T_k \cdots T_1 \quad\text{or}\quad T_k \cdots T_1\, A_{J_1}^{-1} = A_{J_{k+1}}^{-1} .$$
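Relation (2) can be verified numerically for one exchange step. The sketch below is illustrative only (the helper names are hypothetical): it builds the Frobenius matrix for the first exchange step of Example 1, where $A_J = I_3$, $s = 1$, $\bar a_s = a_1 = (2, 1, 2)^T$ and the first basis position is replaced (0-based position 0 in the code), and checks that $T A_J^{-1}$ is indeed the inverse of $A_{J'}$.

```python
from fractions import Fraction as Fr

def frobenius_matrix(abar, rho):
    """Identity matrix whose column rho (0-based) is replaced by v, where
    v_mu = -abar_mu / abar_rho for mu != rho and v_rho = 1 / abar_rho."""
    m = len(abar)
    T = [[Fr(int(i == j)) for j in range(m)] for i in range(m)]
    for mu in range(m):
        T[mu][rho] = Fr(1) / abar[rho] if mu == rho else -Fr(abar[mu]) / abar[rho]
    return T

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
inv_old = identity                      # A_J = I_3, so its inverse is I_3 as well
abar = [2, 1, 2]                        # solution of A_J abar = a_1
T = frobenius_matrix(abar, rho=0)
inv_new = matmul(T, inv_old)            # relation (2): inverse of A_J' equals T * inverse of A_J

A_J_new = [[2, 0, 0],                   # columns a_1, a_5, a_6 of Example 1
           [1, 1, 0],
           [2, 0, 1]]
check = matmul(A_J_new, inv_new)
print(T, check)
```

Only the one nontrivial column of each $T_\kappa$ has to be stored, which is what makes the product form attractive in terms of memory.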
One should not multiply the matrices $T_\kappa$ explicitly, but only save the corresponding relevant columns. The procedure described above is disadvantageous as ill-conditioned intermediate matrices can lead to serious loss of accuracy (cf. the example in [St/Bu], p. 240 f). It is much better to revert to the original data: We do not need $A_J^{-1}$ explicitly. Merely the following systems of linear equations need to be solved:
$$A_J \bar b = b, \qquad A_J^T u = c_J, \qquad A_J \bar a_s = a_s .$$
With this we obtain $\bar c^T := c_K^T - u^T A_K$.
We can solve these linear equations, for example, by using the following QR-factorization of $A_J$: $Q A_J = R$ with an orthogonal matrix $Q$ and an upper triangular matrix $R$:
$$R\, \bar b = Q\, b ; \qquad R^T v = c_J,\ \ u = Q^T v ; \qquad R\, \bar a_s = Q\, a_s .$$
Although the costs of the computation are slightly higher, this method is numerically more stable. We can utilize the obtained results to compute the QR-factorization of $A_{J'}$. If we modify the exchange step to $J' = (j_1, \ldots, j_{\rho-1}, j_{\rho+1}, \ldots, j_m, s)$, then $Q A_{J'}$ has upper Hessenberg form because of
$$Q A_{J'} = \begin{pmatrix} * & \cdots & * & * & \cdots & * & * \\ & \ddots & \vdots & \vdots & & \vdots & \vdots \\ & & * & * & \cdots & * & * \\ & & & * & \cdots & * & * \\ & & & * & \ddots & \vdots & \vdots \\ & & & & \ddots & * & * \\ & & & & & * & * \end{pmatrix}$$
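and can be transformed to upper triangular form with $O(m^2)$ operations.
Example 1
$$-3x_1 - x_2 - 3x_3 \longrightarrow \min$$
$$2x_1 + x_2 + x_3 \le 2, \qquad x_1 + 2x_2 + 3x_3 \le 5, \qquad 2x_1 + 2x_2 + x_3 \le 6, \qquad x_\nu \ge 0 \ \ (1 \le \nu \le 3)$$
This problem is not in standard form $(P)$. By introducing so-called slack variables $x_4, x_5, x_6 \ge 0$, we obtain:
$$-3x_1 - x_2 - 3x_3 \longrightarrow \min$$
$$2x_1 + x_2 + x_3 + x_4 = 2, \qquad x_1 + 2x_2 + 3x_3 + x_5 = 5, \qquad 2x_1 + 2x_2 + x_3 + x_6 = 6, \qquad x_\nu \ge 0 \ \ (1 \le \nu \le 6)$$
$$A := \begin{pmatrix} 2 & 1 & 1 & 1 & 0 & 0 \\ 1 & 2 & 3 & 0 & 1 & 0 \\ 2 & 2 & 1 & 0 & 0 & 1 \end{pmatrix}, \qquad b := \begin{pmatrix} 2 \\ 5 \\ 6 \end{pmatrix}, \qquad c := (-3, -1, -3, 0, 0, 0)^T$$
Now $(4, 5, 6)$ is a feasible basis.
1) $J_1 := (4, 5, 6)$, $K_1 := (1, 2, 3)$, $A_{J_1} = I_3$, hence $Q_1 = R_1 = I_3$, $c_{J_1} = (0, 0, 0)^T$, $\bar b = \bar x_{J_1} = b$, $u = 0$, $c^T \bar x = c_{J_1}^T \bar b = 0$, $\bar c^T = c_{K_1}^T - u^T A_{K_1} = (-3, -1, -3)$; choose $\sigma = 1$, hence $s = k_1 = 1$, $\bar a_1 = a_1 = (2, 1, 2)^T$; $\min\{\frac22, \frac51, \frac62\} = 1$ yields $\rho = 1$ and $r = j_1 = 4$.
2) $J_2 = (5, 6, 1)$, $K_2 := (2, 3, 4)$,
$$A_{J_2} = \begin{pmatrix} 0 & 0 & 2 \\ 1 & 0 & 1 \\ 0 & 1 & 2 \end{pmatrix}, \qquad Q_2 := \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \qquad Q_2 A_{J_2} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 2 \\ 0 & 0 & 2 \end{pmatrix} =: R_2 ,$$
$$R_2 \bar b = Q_2 b = \begin{pmatrix} 5 \\ 6 \\ 2 \end{pmatrix}, \qquad \bar b = \begin{pmatrix} 4 \\ 4 \\ 1 \end{pmatrix}, \qquad c_{J_2}^T \bar b = -3 ,$$
$$R_2^T v = c_{J_2} = \begin{pmatrix} 0 \\ 0 \\ -3 \end{pmatrix}, \qquad v = \begin{pmatrix} 0 \\ 0 \\ -3/2 \end{pmatrix}, \qquad u = Q_2^T v = \begin{pmatrix} -3/2 \\ 0 \\ 0 \end{pmatrix},$$
$$\bar c^T = c_{K_2}^T - u^T A_{K_2} = (-1, -3, 0) - (-3/2, 0, 0) \begin{pmatrix} 1 & 1 & 1 \\ 2 & 3 & 0 \\ 2 & 1 & 0 \end{pmatrix} = (1/2, -3/2, 3/2),$$
hence $\sigma = 2$ and $s = k_2 = 3$.
$$R_2 \bar a_3 = Q_2 a_3 = \begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix}, \qquad \bar a_3 = \begin{pmatrix} 5/2 \\ 0 \\ 1/2 \end{pmatrix}$$
$\min\big\{\frac{4}{5/2}, \frac{1}{1/2}\big\} = \frac85$ yields $\rho = 1$ and $r = j_1 = 5$.
3) $J_3 = (6, 1, 3)$, $K_3 := (2, 4, 5)$,
$$A_{J_3} = \begin{pmatrix} 0 & 2 & 1 \\ 0 & 1 & 3 \\ 1 & 2 & 1 \end{pmatrix}, \qquad Q_2 A_{J_3} = \begin{pmatrix} 0 & 1 & 3 \\ 1 & 2 & 1 \\ 0 & 2 & 1 \end{pmatrix},$$
$$\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & 1 & 3 \\ 1 & 2 & 1 \\ 0 & 2 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 2 & 1 \\ 0 & 1 & 3 \end{pmatrix},$$
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{2}{\sqrt5} & \frac{1}{\sqrt5} \\ 0 & -\frac{1}{\sqrt5} & \frac{2}{\sqrt5} \end{pmatrix} \begin{pmatrix} 1 & 2 & 1 \\ 0 & 2 & 1 \\ 0 & 1 & 3 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 1 \\ 0 & \sqrt5 & \sqrt5 \\ 0 & 0 & \sqrt5 \end{pmatrix} =: R_3, \qquad Q_3 = \begin{pmatrix} 0 & 0 & 1 \\ \frac{2}{\sqrt5} & \frac{1}{\sqrt5} & 0 \\ -\frac{1}{\sqrt5} & \frac{2}{\sqrt5} & 0 \end{pmatrix},$$
$$R_3 \bar b = Q_3 b = \begin{pmatrix} 6 \\ 9/\sqrt5 \\ 8/\sqrt5 \end{pmatrix}, \qquad \bar b = \begin{pmatrix} 4 \\ 1/5 \\ 8/5 \end{pmatrix}, \qquad c_{J_3}^T \bar b = -27/5 ,$$
$$R_3^T v = c_{J_3} = \begin{pmatrix} 0 \\ -3 \\ -3 \end{pmatrix}, \qquad v = \begin{pmatrix} 0 \\ -3/\sqrt5 \\ 0 \end{pmatrix}, \qquad u = Q_3^T v = \begin{pmatrix} -6/5 \\ -3/5 \\ 0 \end{pmatrix},$$
$$\bar c^T = c_{K_3}^T - u^T A_{K_3} = (-1, 0, 0) - (-6/5, -3/5, 0) \begin{pmatrix} 1 & 1 & 0 \\ 2 & 0 & 1 \\ 2 & 0 & 0 \end{pmatrix} = (7/5, 6/5, 3/5) \ge 0 .$$
Now the algorithm stops; $\bar x = (1/5, 0, 8/5, 0, 0, 4)^T$ gives a solution to $(P)$ and $u = (-6/5, -3/5, 0)^T$ a solution to $(D)$; furthermore $\min(P) = \max(D) = -27/5$ holds.
Calculation of a Feasible Basis
Assume that no feasible basis of the original problem
$$(P)\qquad c^T x \longrightarrow \min, \quad Ax = b,\ x \ge 0$$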
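is known. Wlog let $b \ge 0$; otherwise we multiply the $\mu$-th row of $A$ and $b_\mu$ by $-1$ for $\mu \in \{1, \ldots, m\}$ with $b_\mu < 0$.
Phase I: Calculation of a feasible basis
In phase I of the simplex algorithm we apply this method to an auxiliary problem $(\hat P)$ with a known initial feasible basis. A basis corresponding to a solution to $(\hat P)$ yields a feasible basis of $(P)$.
$$(\hat P)\qquad \Phi(x, w) := e^T w \longrightarrow \min, \quad Ax + w = b,\ \ x \ge 0,\ w \ge 0, \qquad w = \begin{pmatrix} x_{n+1} \\ \vdots \\ x_{n+m} \end{pmatrix}, \quad e := \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$$
$\binom{x}{w} := \binom{0}{b}$ is a feasible basic point to $(\hat P)$; hence $(\hat P)$ has a minimizer $\binom{\hat x}{\hat w}$ with $\min(\hat P) \ge 0$. Furthermore:
$$F_P \ne \emptyset \iff \min(\hat P) = 0$$
Let $\Phi(\hat x, \hat w) = 0$. Then $\hat x$ is a feasible basic point to $(P)$. If none of the artificial variables $w_\mu$ remains in the basis to the solution $\binom{\hat x}{\hat w}$, then we have found a feasible basis to $(P)$. Otherwise we continue to iterate with this basis in Phase II. Artificial basic variables $x_{j_\rho}$ with $\bar a_{\rho,s} \ne 0$ can be used as pivot elements. As $\bar b_\rho = 0$ holds for artificial basic variables $x_{j_\rho}$, we then get again a feasible basic point. Thus only those artificial variables $x_{j_\mu}$ with $\bar a_{\mu,s} = 0$ remain in the basis.
Example 2
$$-x_1 - x_2 \longrightarrow \min$$
$$x_1 + 2x_3 + x_4 = 1, \qquad x_2 - x_3 + x_5 = 1, \qquad x_1 + x_2 + x_3 = 2, \qquad x \ge 0$$
Now we can proceed in a simpler way than in the general case (‘worst case’):
Phase I:
$$w \longrightarrow \min \qquad (w = x_6)$$
$$x_1 + 2x_3 + x_4 = 1, \qquad x_2 - x_3 + x_5 = 1, \qquad x_1 + x_2 + x_3 + x_6 = 2, \qquad x_\nu \ge 0 \ \ (1 \le \nu \le 6)$$
$\hat c := (0, 0, 0, 0, 0, 1)^T$
1) $J := (4, 5, 6)$, $K := (1, 2, 3)$
$$A_J = I_3, \qquad \bar b = b = \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix}, \qquad u = \hat c_J = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$
$$\bar c^T = (0, 0, 0) - (0, 0, 1) \begin{pmatrix} 1 & 0 & 2 \\ 0 & 1 & -1 \\ 1 & 1 & 1 \end{pmatrix} = (-1, -1, -1), \qquad s = k_1 = 1$$
$$\bar a_1 = a_1 = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}, \qquad r = j_1 = 4$$
2) $J := (1, 5, 6)$, $K := (2, 3, 4)$
$$A_J = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}, \qquad \bar b = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \qquad u = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}$$
$$\bar c^T = (0, 0, 0) - (-1, 0, 1) \begin{pmatrix} 0 & 2 & 1 \\ 1 & -1 & 0 \\ 1 & 1 & 0 \end{pmatrix} = (-1, 1, 1), \qquad s = k_1 = 2$$
$$\bar a_2 = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}, \qquad r = j_2 = 5$$
3) $J := (1, 2, 6)$, $K := (3, 4, 5)$
$$A_J = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}, \qquad \bar b = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \qquad u = \begin{pmatrix} -1 \\ -1 \\ 1 \end{pmatrix}$$
$$\bar c^T = (0, 0, 0) - (-1, -1, 1) \begin{pmatrix} 2 & 1 & 0 \\ -1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} = (0, 1, 1) \ge 0$$
Phase II:
$J := (1, 2, 6)$, $K := (3, 4, 5)$
$$c := (-1, -1, 0, 0, 0, 0)^T, \qquad \bar b = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \qquad u = \begin{pmatrix} -1 \\ -1 \\ 0 \end{pmatrix}$$
$$\bar c^T = (0, 0, 0) - (-1, -1, 0) \begin{pmatrix} 2 & 1 & 0 \\ -1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} = (1, 1, 1) \ge 0$$
$\min(P) = c_J^T \bar b = -2$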
The Active Set Method
The idea of the active set method (also called the working set algorithm), which is widely used in the field of linearly constrained optimization problems, consists of estimating the active set (referring to the Karush–Kuhn–Tucker conditions) at each iteration (cf. [Fle], p. 160 ff).
Problem Description
Starting from natural numbers $n, m$, vectors $c, a_\mu \in \mathbb{R}^n$, $b \in \mathbb{R}^m$ and a splitting of $\{1, \ldots, m\}$ into disjoint subsets $E$ (Equality) and $I$ (Inequality), we consider for $x \in \mathbb{R}^n$:
$$f(x) := c^T x \longrightarrow \min, \qquad a_\mu^T x = b_\mu \ \text{for } \mu \in E \quad\text{and}\quad a_\mu^T x \ge b_\mu \ \text{for } \mu \in I .$$
Assume that a feasible basis $J = (j_1, \ldots, j_n)$ with $E \subset S(J)$ and a vector $x \in \mathbb{R}^n$ with
$$a_\mu^T x = b_\mu \ \text{for } \mu \in S(J) \quad\text{and}\quad a_\mu^T x > b_\mu \ \text{for } \mu \in I \setminus S(J)$$
exist. Hence, degeneracy shall be excluded for the moment. Then $S(J)$ contains exactly those constraints which are active at $x$. Each iteration step consists of a transition from one basic point to another which is geometrically illustrated by a move from one ‘vertex’ to a ‘neighboring vertex’ along a common ‘edge’.
$A_J := (a_{j_1}, \ldots, a_{j_n})$, $A_J^{-T} =: (d_1, \ldots, d_n)$ (will not be computed explicitly)
$$I_n = A_J^{-1} A_J = \begin{pmatrix} d_1^T \\ \vdots \\ d_n^T \end{pmatrix} (a_{j_1}, \ldots, a_{j_n}) = \big(d_\sigma^T a_{j_\rho}\big)_{\sigma,\rho} \quad\text{shows}\quad \delta_{\rho,\sigma} = d_\sigma^T a_{j_\rho} = a_{j_\rho}^T d_\sigma .$$
$$u_J := A_J^{-1} c = \begin{pmatrix} d_1^T c \\ \vdots \\ d_n^T c \end{pmatrix} = \begin{pmatrix} u_{j_1} \\ \vdots \\ u_{j_n} \end{pmatrix}, \quad\text{hence}\quad c - A_J u_J = 0 .$$
If $u_\mu \ge 0$ for all $\mu \in S(J) \cap I$, then $x$ gives a minimum (according to the Karush–Kuhn–Tucker conditions). Otherwise there exists a $j_\sigma \in S(J) \cap I$ with $u_{j_\sigma} < 0$. As $f'(x)\, d_\sigma = c^T d_\sigma = u_{j_\sigma} < 0$, the vector $d_\sigma$ is a descent direction. Let $u_{j_\sigma} = \min\{u_\mu \mid \mu \in S(J) \cap I\}$ and $s := j_\sigma$. Consider $x' := x + \alpha\, d_\sigma$ for $\alpha > 0$. For $\mu = j_\rho \in S(J) \cap E$ we have $\rho \ne \sigma$ and therefore $a_\mu^T d_\sigma = 0$. For $\mu \in S(J) \cap I$ the inequality $a_\mu^T d_\sigma \ge 0$ holds. For $\mu \in I \setminus S(J)$ it needs to hold that $a_\mu^T x' = a_\mu^T x + \alpha\, a_\mu^T d_\sigma \ge b_\mu$. Therefore $\alpha := \min M > 0$ for
$$M := \Big\{ \frac{b_\mu - a_\mu^T x}{a_\mu^T d_\sigma} \ :\ \mu \in I \setminus S(J) \ \wedge\ a_\mu^T d_\sigma < 0 \Big\}$$
is the best possible step size; choose an index $r \in I \setminus S(J)$ with $a_r^T d_\sigma < 0$ and
$$\alpha = \frac{b_r - a_r^T x}{a_r^T d_\sigma}$$
and define $J' := (j_1, \ldots, j_{\sigma-1}, r, j_{\sigma+1}, \ldots, j_n)$. Then for the objective function it holds that
$$f(x') = c^T x' = c^T x + \alpha\, c^T d_\sigma = c^T x + \alpha\, u_{j_\sigma} < c^T x .$$
If $M = \emptyset$, that is, $\alpha = \min \emptyset := \infty$, then the objective function is unbounded from below.
Example 3
$$2x_1 + 4x_2 + 3x_3 \longrightarrow \min$$
$$-x_1 + x_2 + x_3 \ge 2, \qquad 2x_1 + x_2 \ge 1, \qquad x \ge 0$$
Hence, we have $n = 3$, $m = 5$, $E = \emptyset$, $I = \{1, \ldots, 5\}$, $b_1 = 2$, $b_2 = 1$, $b_3 = b_4 = b_5 = 0$, $c = (2, 4, 3)^T$ and
$$a_1 = \begin{pmatrix} -1 \\ 1 \\ 1 \end{pmatrix}, \quad a_2 = \begin{pmatrix} 2 \\ 1 \\ 0 \end{pmatrix}, \quad a_3 = e_1, \quad a_4 = e_2, \quad a_5 = e_3 .$$
For $J := (1, 2, 4)$, $K := (3, 5)$ we obtain
$$A_J = \begin{pmatrix} -1 & 2 & 0 \\ 1 & 1 & 1 \\ 1 & 0 & 0 \end{pmatrix} \quad\text{and}\quad b_J = \begin{pmatrix} 2 \\ 1 \\ 0 \end{pmatrix}.$$
$a_\mu^T x = b_\mu$ for $\mu \in S(J)$ means $A_J^T x = b_J$, since this is equivalent to $a_{j_\nu}^T x = (A_J e_\nu)^T x = e_\nu^T A_J^T x = e_\nu^T b_J = b_{j_\nu}$ for $\nu = 1, \ldots, n$. The two linear systems $A_J^T x = b_J$ and $A_J u_J = c$ can be solved by elimination; this gives $x_1 = 1/2$, $x_2 = 0$, $x_3 = 5/2$ and $u_1 = 3$, $u_2 = 5/2$, $u_4 = -3/2$.
$$x = \begin{pmatrix} 1/2 \\ 0 \\ 5/2 \end{pmatrix} \quad\text{and}\quad u_J = \begin{pmatrix} 3 \\ 5/2 \\ -3/2 \end{pmatrix} \quad\text{yield } \sigma = 3, \text{ hence } s = j_3 = 4 .$$
We need $d_3 = A_J^{-T} e_3$ and therefore solve $A_J^T d_3 = e_3$: $d_3 = (-1/2, 1, -3/2)^T$.
$a_3^T d_3 = e_1^T d_3 = -1/2$ and $a_5^T d_3 = e_3^T d_3 = -3/2$ yield
$$\alpha := \min\Big\{ \frac{b_3 - a_3^T x}{a_3^T d_3},\ \frac{b_5 - a_5^T x}{a_5^T d_3} \Big\} = \min\Big\{ \frac{0 - 1/2}{-1/2},\ \frac{0 - 5/2}{-3/2} \Big\} = 1 .$$
This gives $r = 3$. Hence, we have
$$x' := x + \alpha\, d_3 = x + d_3 = \begin{pmatrix} 1/2 \\ 0 \\ 5/2 \end{pmatrix} + \begin{pmatrix} -1/2 \\ 1 \\ -3/2 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}.$$
$$J' := (1, 2, 3), \quad K' := (4, 5), \quad A_{J'} = \begin{pmatrix} -1 & 2 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}.$$
Solving the linear system $A_{J'} u_{J'} = c$ yields $u_1 = 3$, $u_2 = 1$, $u_3 = 3$, that is, $u_{J'} = (3, 1, 3)^T \ge 0$.
Hence, $x'$ is a minimal point with value $c^T x' = 7$.
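The iteration of the active set method for linear problems can be retraced in code. The following Python sketch is an illustration only and makes several assumptions that are not in the text: it covers the case $E = \emptyset$, presumes nondegeneracy, breaks ties in the step size by the smallest index, and uses hypothetical helper names. It is run on the data of Example 3.

```python
from fractions import Fraction as Fr

def solve(M, rhs):
    """Solve the square system M z = rhs exactly (Gauss-Jordan, rational arithmetic)."""
    n = len(M)
    T = [[Fr(v) for v in row] + [Fr(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if T[r][col] != 0)
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(n):
            if r != col and T[r][col] != 0:
                f = T[r][col]
                T[r] = [a - f * p for a, p in zip(T[r], T[col])]
    return [T[r][n] for r in range(n)]

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def active_set_lp(a, b, c, J):
    """Minimize c^T x subject to a[mu]^T x >= b[mu] for all mu (case E empty),
    starting from a feasible basis J (1-based indices of active constraints)."""
    n = len(c)
    J = list(J)
    x = solve([a[j - 1] for j in J], [b[j - 1] for j in J])   # A_J^T x = b_J
    while True:
        AJt = [a[j - 1] for j in J]                           # rows of A_J^T
        AJ = [list(col) for col in zip(*AJt)]
        u = solve(AJ, c)                                      # A_J u_J = c
        if all(v >= 0 for v in u):
            return x, u, J                                    # KKT conditions hold
        sigma = min(range(n), key=lambda i: u[i])             # most negative multiplier
        d = solve(AJt, [int(i == sigma) for i in range(n)])   # A_J^T d_sigma = e_sigma
        steps = []
        for mu in range(1, len(a) + 1):
            if mu not in J and dot(a[mu - 1], d) < 0:
                steps.append(((b[mu - 1] - dot(a[mu - 1], x)) / dot(a[mu - 1], d), mu))
        if not steps:
            raise ValueError("objective function is unbounded from below")
        alpha, r = min(steps)                                 # best possible step size
        x = [xi + alpha * di for xi, di in zip(x, d)]
        J[sigma] = r                                          # r replaces j_sigma

# Example 3: constraints a_1, a_2 and x >= 0 written as a_3 = e_1, a_4 = e_2, a_5 = e_3
a = [[-1, 1, 1], [2, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
b = [2, 1, 0, 0, 0]
c = [2, 4, 3]
x_min, u_min, J_min = active_set_lp(a, b, c, (1, 2, 4))
print(x_min, u_min, J_min, dot(c, x_min))
```

The direction $d_\sigma$ is obtained by solving $A_J^T d_\sigma = e_\sigma$, so the inverse $A_J^{-T}$ is never formed, in line with the remark in the problem description.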
Calculation of a Feasible Basis
Let wlog $E = \emptyset$. We will describe how to find an initial feasible basic point to the following problem:
$$c^T x \longrightarrow \min, \qquad a_\mu^T x \ge b_\mu \ \ (\mu \in I) \qquad\text{with } |I| = m \ge n .$$
Phase I: Let $x^{(k)} \in \mathbb{R}^n$ and $V_k := V(x^{(k)}) := \{\mu \in I \mid a_\mu^T x^{(k)} < b_\mu\}$. Hence, $V_k$ contains the indices of the constraints violated at $x^{(k)}$. We carry out one iteration step of the active set method for the following problem
$$(P_k)\qquad F_k(x) := \sum_{\mu \in V_k} \big(b_\mu - a_\mu^T x\big) \longrightarrow \min, \qquad a_\mu^T x \le b_\mu \ (\mu \in V_k), \qquad a_\mu^T x \ge b_\mu \ (\mu \notin V_k).$$
Let a basis $J = (j_1, \ldots, j_n)$ with $S(J) \subset I \setminus V_k$ and a corresponding basic point $x^{(k)}$ be given. The problem $(P_k)$ will be updated after each iteration step. This will be repeated iteratively, until the current $x^{(k)}$ is feasible. If $u \ge 0$ occurs before that, then there exists no feasible point.
Example 4
$$2x_1 + 5x_2 + 6x_3 \longrightarrow \min$$
$$2x_1 + x_2 + 2x_3 \ge 3, \qquad x_1 + 2x_2 + 2x_3 \ge 1, \qquad x_1 + 3x_2 + x_3 \ge 3, \qquad x_1, x_2, x_3 \ge 0$$
Phase I:
1) $x^{(0)} := 0$, hence $V_0 = \{1, 2, 3\}$.
$$\sum_{\mu \in V_0} \big(b_\mu - a_\mu^T x\big) = 7 - (4x_1 + 6x_2 + 5x_3) \longrightarrow \min, \qquad a_\mu^T x \le b_\mu \ (\mu \in V_0), \qquad a_\mu^T x \ge b_\mu \ (\mu \notin V_0), \text{ i.e., } x \ge 0$$
$J = (4, 5, 6)$, $A_J = I_3$, $c := (-4, -6, -5)^T$. $c = A_J u_J = u_J$ yields $s = j_2 = 5$. With $d_2 = e_2 \in \mathbb{R}^3$ we get $\alpha = \min\{3/1,\ 1/2,\ 3/3\} = 1/2$, $r = 2$.
2) $x^{(1)} := x' = x^{(0)} + \alpha\, d_\sigma = \frac12 d_2 = (0, 1/2, 0)^T$, hence $V_1 = \{1, 3\}$.
$$\sum_{\mu \in V_1} \big(b_\mu - a_\mu^T x\big) = 6 - (3x_1 + 4x_2 + 3x_3) \longrightarrow \min, \qquad a_\mu^T x \le b_\mu \ (\mu \in V_1), \qquad a_\mu^T x \ge b_\mu \ (\mu \notin V_1)$$
$$J = (2, 4, 6), \quad A_J = \begin{pmatrix} 1 & 1 & 0 \\ 2 & 0 & 0 \\ 2 & 0 & 1 \end{pmatrix}, \quad A_J^{-T} = \begin{pmatrix} 0 & 1 & 0 \\ 1/2 & -1/2 & -1 \\ 0 & 0 & 1 \end{pmatrix}, \quad c := \begin{pmatrix} -3 \\ -4 \\ -3 \end{pmatrix}$$
$A_J u_J = c$ has the solution $u_J = (-2, -1, 1)^T$. Hence $\sigma = 1$ and $s = j_\sigma = 2$, $d_\sigma = (0, 1/2, 0)^T$,
$$\alpha = \min\Big\{ \frac{3 - 1/2}{1/2},\ \frac{3 - 3/2}{3/2} \Big\} = 1, \qquad r = 3 .$$
3) $x^{(2)} := x^{(1)} + \alpha\, d_\sigma = (0, 1/2, 0)^T + (0, 1/2, 0)^T = (0, 1, 0)^T$, $V_2 = \{1\}$, $J = (3, 4, 6)$
$$\sum_{\mu \in V_2} \big(b_\mu - a_\mu^T x\big) = b_1 - a_1^T x = 3 - (2x_1 + x_2 + 2x_3) \longrightarrow \min, \qquad a_1^T x \le b_1, \qquad a_\mu^T x \ge b_\mu \ (2 \le \mu \le 6)$$
$$A_J = \begin{pmatrix} 1 & 1 & 0 \\ 3 & 0 & 0 \\ 1 & 0 & 1 \end{pmatrix}, \quad A_J^{-T} = \begin{pmatrix} 0 & 1 & 0 \\ 1/3 & -1/3 & -1/3 \\ 0 & 0 & 1 \end{pmatrix}, \quad c := \begin{pmatrix} -2 \\ -1 \\ -2 \end{pmatrix}$$
$A_J u_J = c$ has the solution $u_J = \frac13 (-1, -5, -5)^T$. Hence $\sigma = 2$ and $s = j_\sigma = 4$, $d_\sigma = (1, -1/3, 0)^T$.
$$\alpha = \min\Big\{ \frac{2}{5/3},\ \frac{1}{1/3} \Big\} = \frac65, \qquad r = 1$$
$x^{(3)} := x^{(2)} + \alpha\, d_\sigma = (0, 1, 0)^T + \frac65 (1, -1/3, 0)^T = \frac15 (6, 3, 0)^T$ is a feasible point, as none of the constraints is violated, $J = (1, 3, 6)$.
Phase II:
$J = (1, 3, 6)$, $c := (2, 5, 6)^T$,
$$A_J = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 3 & 0 \\ 2 & 1 & 1 \end{pmatrix}; \quad A_J u_J = c \ \text{has the solution}\ u_J = \frac15 \begin{pmatrix} 1 \\ 8 \\ 20 \end{pmatrix} \ge 0 .$$
Hence $x^{(3)} = \frac15 (6, 3, 0)^T$ is a minimizer with value $27/5$.

4.2 Quadratic Optimization
Quadratic problems are linearly constrained optimization problems with a quadratic objective function. Quadratic optimization is often considered to be an essential field in its own right. More importantly, however, it forms the basis of several algorithms for general nonlinearly constrained problems.
Let the following optimization problem be given:
$$(QP)\qquad f(x) := \tfrac12\, x^T C x + c^T x \longrightarrow \min, \qquad A^T x \le b$$
Suppose: $C \in \mathbb{R}^{n \times n}$ symmetric positive semidefinite, $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, $A \in \mathbb{R}^{n \times m}$ with $A = (a_1, \ldots, a_m)$. We consider the set of feasible points $F := \{x \in \mathbb{R}^n \mid A^T x \le b\}$.
Theorem 4.2.1 (Barankin–Dorfman) [2] Assume $F$ to be nonempty and let $f$ be bounded from below on $F$. Then the function $f|_F$ attains its minimum.
Proof [3]: For a positive $\rho$ we consider the set $F_\rho := \{x \in F \mid \|x\|_2 \le \rho\}$. Obviously this is a compact set. For sufficiently large $\rho$ it is nonempty.
$$\varphi(\rho) := \min\{f(x) \mid x \in F_\rho\} \ \downarrow\ \gamma := \inf\{f(x) \mid x \in F\} \in \mathbb{R} \qquad (\rho \to \infty)$$
There exists an $x_\rho \in F_\rho$ with $f(x_\rho) = \varphi(\rho)$ and $\|x_\rho\|_2 \le \|y\|_2$ for all $y \in F_\rho$ with $f(y) = \varphi(\rho)$. $x_\rho$ is a minimizer with minimal norm.
1) We will prove: There exists a $\rho_0 > 0$ with $\|x_\rho\|_2 < \rho$ for all $\rho > \rho_0$. Otherwise there exists an isotone, that is, monotone increasing, sequence $(\rho_k)$ of positive numbers such that $\|x_{\rho_k}\|_2 = \rho_k \to \infty$ and $f(x_{\rho_k}) \downarrow \gamma$.
(i) For $k \in \mathbb{N}$ let $y_k := x_{\rho_k}$ and $v_k := \frac{1}{\rho_k}\, y_k$. Hence $\|v_k\|_2 = 1$ holds. Wlog let $v_k \to v$ $(k \to \infty)$ for a $v \in \mathbb{R}^n$. Obviously we have $\|v\|_2 = 1$. $y_k \in F_{\rho_k} \subset F$ implies $a_\mu^T y_k \le b_\mu$ for all $\mu \in I := \{1, \ldots, m\}$, hence $a_\mu^T v \le 0$ for all $\mu \in I$, i.e., $A^T v \le 0$. We consider
$$I_0 := \{\mu \in I \mid a_\mu^T v = 0\} .$$
For $\mu \in I \setminus I_0$ we have $a_\mu^T v < 0$, and hence $a_\mu^T v_k < a_\mu^T v/2$ and $\rho_k\, a_\mu^T v/2 \le b_\mu$ for sufficiently large $k$, thus $a_\mu^T y_k = \rho_k\, a_\mu^T v_k < \rho_k\, a_\mu^T v/2 \le b_\mu$.
(ii) There exists a $\bar\gamma \in \mathbb{R}$ with $\gamma \le f(y_k) \le \bar\gamma$ for all $k \in \mathbb{N}$; hence $\gamma \le \rho_k^2\, \frac12 v_k^T C v_k + \rho_k\, c^T v_k \le \bar\gamma$, thus, by $\rho_k\, \frac12 v_k^T C v_k + c^T v_k \to 0$, $v^T C v = 0$. $y_k + \lambda v \in F$ for $\lambda \ge 0$ shows $\gamma \le f(y_k + \lambda v) = f(y_k) + \lambda\,(C y_k + c)^T v$ and so $(C y_k + c)^T v \ge 0$ by $\lambda \to \infty$.
(iii) $a_\mu^T (y_k - \lambda v) = a_\mu^T y_k \le b_\mu$ for $\mu \in I_0$ and $\lambda \ge 0$. For $\mu \in I \setminus I_0$ and sufficiently large $k$ we have shown $a_\mu^T y_k < b_\mu$. Hence, there exists a $\delta > 0$ with $a_\mu^T (y_k - \lambda v) \le b_\mu$ for $0 \le \lambda \le \delta$.
$$\psi_k(\lambda) := \tfrac12 \|y_k - \lambda v\|_2^2 = \tfrac12 \lambda^2 - \lambda\,(y_k^T v) + \tfrac12 \|y_k\|_2^2$$
$\psi_k'(0) = -y_k^T v = -\rho_k\, v_k^T v$ is negative for sufficiently large $k$. Thus there exists a $\lambda \in (0, \delta)$ with $y_k - \lambda v \in F$ and $\|y_k - \lambda v\|_2 < \|y_k\|_2 = \rho_k$. Because of
$$f(y_k - \lambda v) = f(y_k) - \lambda\, \underbrace{(C y_k + c)^T v}_{\ge\, 0} \le f(y_k)$$
we obtain a contradiction to the definition of $y_k = x_{\rho_k}$.
2) We will show: $f(x_{\sigma_1}) = f(x_{\sigma_2})$ for $\rho_0 < \sigma_1 < \sigma_2$. From that it obviously follows that $f(x_\rho) = \gamma$ for all $\rho > \rho_0$. By construction $f(x_{\sigma_1}) \ge f(x_{\sigma_2})$ holds. Let $f(x_{\sigma_1}) > f(x_{\sigma_2})$; then $\|x_{\sigma_1}\|_2 < \sigma_1 < \|x_{\sigma_2}\|_2 < \sigma_2$ holds: The two outer inequalities hold by 1). For the inequality in the middle we consider: If $\|x_{\sigma_2}\|_2 \le \sigma_1$ holds, then $x_{\sigma_2}$ is an element of $F_{\sigma_1}$ with $f(x_{\sigma_2}) < f(x_{\sigma_1})$. This is a contradiction to the definition of $x_{\sigma_1}$.
With $\rho := \|x_{\sigma_2}\|_2$: $\rho_0 < \rho < \sigma_2$ implies $f(x_\rho) \ge f(x_{\sigma_2})$, and $x_{\sigma_2} \in F_\rho$ implies $f(x_\rho) \le f(x_{\sigma_2})$; together $f(x_\rho) = f(x_{\sigma_2})$. This is a contradiction to the definition of $x_{\sigma_2}$ (minimizer with minimal norm).
[2] Confer [Ba/Do]. Our proof is based on [Bl/Oe 1]. An alternative proof can be found in the book [Co/We].
[3] Without using the Karush–Kuhn–Tucker conditions and furthermore for a matrix $C$ that is merely symmetric (not necessarily positive semidefinite).

Occasionally, we have already tacitly used the following formulation of the KKT conditions for the quadratic case:
Proposition 4.2.2 (Karush–Kuhn–Tucker) A point $x \in F$ is a minimizer to problem $(QP)$ iff there exist vectors $y, u \in \mathbb{R}^m_+$ with
$$C x + c + A u = 0, \qquad A^T x + y = b \qquad\text{and}\qquad u^T y = 0 .$$
The constraint $A^T x \le b$ is equivalent to $A^T x + y = b$ with a slack vector $y = y(x) \in \mathbb{R}^m_+$.

Proof: Here $\bar x$, $\bar y$, $\bar u$ denote the point and the vectors of proposition 4.2.2. Assuming the three equations, we conclude as follows (compare the proof of theorem 2.2.8): The objective function $f$ is differentiable and convex, hence by the considerations on page 53
$$f(x) - f(\bar x) \ge f'(\bar x)(x - \bar x) = (C \bar x + c)^T (x - \bar x) = (A \bar u)^T (\bar x - x) = \bar u^T A^T (\bar x - x) = \bar u^T (y - \bar y) = \bar u^T y \ge 0$$
for all $x \in F$.

For the other direction we firstly remark: The present problem (QP) with $E := \emptyset$ and $g_\mu(x) := a_\mu^T x - b_\mu$ for $\mu \in I := \{1, \ldots, m\}$ has the form (P), which we have investigated in section 2.2. For a minimizer $\bar x$ to (P) it follows from theorem 2.2.5 that
$$\nabla f(\bar x) + \sum_{\mu=1}^{m} \bar u_\mu \nabla g_\mu(\bar x) = 0 \qquad\text{and}\qquad \bar u_\mu\, g_\mu(\bar x) = 0 \ \text{ for } \mu \in I$$
with a vector $\bar u \in \mathbb{R}^m_+$, as (MFCQ) or the Slater condition (in the case of linear constraints and $E = \emptyset$) are trivially fulfilled here. With $\nabla g_\mu(\bar x) = a_\mu$ the first equation gives $C \bar x + c + A \bar u = 0$. With $\bar y := b - A^T \bar x \ge 0$ the second equation shows $\bar u^T \bar y = 0$.
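The three conditions of proposition 4.2.2 are easy to check numerically. The following MATLAB lines are a minimal sketch, not part of the original text: they use the data and the final point and multipliers of the text's Example 5, and the variable names and the tolerance `tol` are illustrative assumptions.

```matlab
% Data of Example 5 (as given in the text); columns of A are a_1,...,a_4
C = [2 -1; -1 2];  c = [-3; 0];
A = [1 -1 0 1; 1 0 -1 0];
b = [2; 0; 0; 3/2];

% Candidate minimizer and multipliers (taken from the end of Example 5)
x = [3/2; 1/2];
u = [1/2; 0; 0; 0];

y = b - A'*x;                                 % slack vector y = b - A'*x

tol = 1e-12;                                  % illustrative tolerance (assumption)
stationarity    = norm(C*x + c + A*u, inf);   % C*x + c + A*u = 0
feasible        = all(y >= -tol) && all(u >= -tol);
complementarity = abs(u'*y);                  % u'*y = 0

is_kkt = feasible && stationarity <= tol && complementarity <= tol;
fprintf('stationarity %.1e, complementarity %.1e, KKT point: %d\n', ...
        stationarity, complementarity, is_kkt);
```

Since (QP) is convex, a point passing this check is a minimizer; the check itself says nothing about how to find such a point.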
Now we will get to know a method which solves (QP) in a finite number of steps by constructing a solution to the Karush–Kuhn–Tucker conditions.

The Active Set Method

The solution strategy of this method (analogously to the active set method in the linear case of section 4.1) is based on determining the optimum on varying submanifolds. In each iteration step we only use the constraints active at the current iteration point. All other constraints are ignored for the moment. These considerations date back to Zangwill (1967) (manifold suboptimization). An essential part of the following discussion also holds for the more general case of convex functions. Therefore we consider the following problem
$$(CP)\qquad \begin{cases} f(x) \to \min \\ A^T x \le b \end{cases}$$
with a convex function $f\colon \mathbb{R}^n \to \mathbb{R}$, $b \in \mathbb{R}^m$, $A \in \mathbb{R}^{n\times m}$, $I := \{1, \ldots, m\}$. Let $\bar x \in F := \{x \in \mathbb{R}^n \mid A^T x \le b\}$ and $\mathcal{A}(\bar x) := \{\mu \in I \mid a_\mu^T \bar x = b_\mu\}$. Then we obtain, using the same argument as on the preceding page:

$\bar x$ is a minimizer to (CP)
$\iff$ There exist $\bar u_\mu \ge 0$ for $\mu \in \mathcal{A}(\bar x)$ with $\nabla f(\bar x) + \sum_{\mu \in \mathcal{A}(\bar x)} \bar u_\mu a_\mu = 0$
$\iff$ $\bar x$ is a minimizer to
$$\begin{cases} f(x) \to \min \\ a_\mu^T x \le b_\mu \ \text{ for all } \mu \in \mathcal{A}(\bar x). \end{cases}$$
In general $\mathcal{A}(\bar x)$ is unknown. We are going to describe an algorithm which determines $\mathcal{A}(\bar x)$ iteratively and is based on the following idea: We solve a sequence of optimization problems
$$(P_k)\qquad \begin{cases} f(x) \to \min \\ a_\mu^T x = b_\mu \ \text{ for all } \mu \in J_k , \end{cases}$$
moving $J_k$ to another index set successively, such that the minimal values of the objective function give an antitone sequence with $J_k = \mathcal{A}(\bar x)$ after a finite number of steps.

Let $J \subset I$ and $y$ be a minimizer to
$$(P_J)\qquad \begin{cases} f(x) \to \min \\ a_\mu^T x = b_\mu \ \text{ for all } \mu \in J . \end{cases}$$
Then, with suitable $u_\mu \in \mathbb{R}$ for $\mu \in J$, the Karush–Kuhn–Tucker conditions for $(P_J)$ hold:
$$\nabla f(y) + \sum_{\mu \in J} u_\mu a_\mu = 0 \qquad (3)$$

Definition
Let $y \in F$ be a minimizer to $(P_J)$. $y$ is called nondegenerate iff the vectors $\{a_\mu \mid \mu \in \mathcal{A}(y)\}$ are linearly independent.

Remark
If $y$ is nondegenerate, then the Lagrange multipliers $u_\mu$ are uniquely determined.
Lemma 4.2.3
Let $y \in F$ be a minimizer to $(P_J)$ and nondegenerate. In (3) let $u_s < 0$ for an index $s \in J$. If $\tilde y$ is a minimizer to $(P_{\tilde J})$ for $\tilde J := \mathcal{A}(y) \setminus \{s\}$, then $f(\tilde y) < f(y)$ and $a_s^T \tilde y < b_s$ hold.

Proof: Because of (3) and $u_s < 0$ the point $y$ is a minimizer to
$$\begin{cases} f(x) \to \min \\ a_\mu^T x = b_\mu \ \text{ for all } \mu \in \tilde J \\ a_s^T x \ge b_s . \end{cases} \qquad (4)$$
For $y$ the Karush–Kuhn–Tucker conditions to
$$\begin{cases} f(x) \to \min \\ a_\mu^T x = b_\mu \ \text{ for all } \mu \in \tilde J \\ a_s^T x \le b_s \end{cases} \qquad (5)$$
cannot be met as $y$ is nondegenerate. Let $z$ be a minimizer to (5). Then it follows:

a) $f(\tilde y) < f(y)$: Otherwise we get $f(y) \le f(\tilde y)$. As $(P_{\tilde J})$ has fewer constraints than (5), $f(\tilde y) \le f(z)$ holds. Altogether, this gives $f(y) \le f(z)$. Hence, $y$ is a minimizer to (5), contrary to the above considerations.

b) $a_s^T \tilde y < b_s$: Otherwise $a_s^T \tilde y \ge b_s$ holds. Therefore $\tilde y$ is feasible to (4) and thus also a minimizer to it. Hence, it follows that $f(y) = f(\tilde y)$, contrary to a).

Algorithm
0) Let $x^{(0)} \in F$ and $J_0 := \mathcal{A}(x^{(0)})$. If the points $x^{(0)}, \ldots, x^{(k)} \in F$ and the corresponding index sets $J_0, \ldots, J_k$ are determined for a $k \in \mathbb{N}_0$, then we obtain $x^{(k+1)}$ and $J_{k+1}$ as follows:

1) Determine a minimizer $y^{(k)}$ to
$$(P_k)\qquad \begin{cases} f(x) \to \min \\ a_\mu^T x = b_\mu \ \text{ for all } \mu \in J_k . \end{cases}$$

2) Case 1: $y^{(k)} \in F$
Solve $\sum_{\mu \in J_k} u_\mu a_\mu = -\nabla f(y^{(k)})$.
If $u_\mu \ge 0$ for all $\mu \in J_k$, then the point $y^{(k)}$ is a minimizer to (CP) and the algorithm stops.
Otherwise choose an index $s \in J_k$ with $u_s < 0$, for example $s \in J_k$ minimal with $u_s = \min\{u_\mu \mid \mu \in J_k\}$. Set $J_{k+1} := \mathcal{A}(y^{(k)}) \setminus \{s\}$ and $x^{(k+1)} := y^{(k)}$ (deactivation step).

3) Case 2: $y^{(k)} \notin F$
For $d := y^{(k)} - x^{(k)}$ we determine a maximal $\alpha > 0$ with $a_\mu^T (x^{(k)} + \alpha d) \le b_\mu$ for all $\mu \in I$ (activation step)
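and set $x^{(k+1)} := x^{(k)} + \alpha d$ and $J_{k+1} := \mathcal{A}(x^{(k+1)})$.

Example 5
$$f(x) := x_1^2 - x_1 x_2 + x_2^2 - 3 x_1 \to \min \quad\text{for } x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \in \mathbb{R}^2$$
subject to
$$x_1 + x_2 \le 2, \qquad -x_1 \le 0, \qquad -x_2 \le 0, \qquad x_1 \le 3/2 .$$
Here
$$\nabla f(x) = \begin{pmatrix} 2 x_1 - x_2 - 3 \\ -x_1 + 2 x_2 \end{pmatrix}, \qquad A = \begin{pmatrix} 1 & -1 & 0 & 1 \\ 1 & 0 & -1 & 0 \end{pmatrix}, \qquad b = (2, 0, 0, 3/2)^T .$$

0. 0) $x^{(0)} := 0 \in F$, $J_0 = \mathcal{A}(x^{(0)}) = \{2, 3\}$; obviously this returns:
   1) $y^{(0)} = 0$
   2) $y^{(0)} \in F$: $u_2 \begin{pmatrix} -1 \\ 0 \end{pmatrix} + u_3 \begin{pmatrix} 0 \\ -1 \end{pmatrix} = \begin{pmatrix} 3 \\ 0 \end{pmatrix} \implies u_2 = -3,\ u_3 = 0$. For $s = 2$ we get $J_1 := \mathcal{A}(y^{(0)}) \setminus \{2\} = \{3\}$ and $x^{(1)} := y^{(0)} = 0$.

1. 1) $x_1^2 - x_1 x_2 + x_2^2 - 3 x_1 \to \min$ subject to $x_2 = 0$ leads to $x_1^2 - 3 x_1 \to \min$ and thus $y^{(1)} = \begin{pmatrix} 3/2 \\ 0 \end{pmatrix}$, hence $\mathcal{A}(y^{(1)}) = \{3, 4\}$.
   2) $y^{(1)} \in F$: $u_3 \begin{pmatrix} 0 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 3/2 \end{pmatrix}$ gives $u_3 = -3/2 < 0$. For $s = 3$ we get $J_2 := \mathcal{A}(y^{(1)}) \setminus \{3\} = \{4\}$ and $x^{(2)} := y^{(1)} = \begin{pmatrix} 3/2 \\ 0 \end{pmatrix}$.

2. 1) $x_1^2 - x_1 x_2 + x_2^2 - 3 x_1 \to \min$ subject to $x_1 = 3/2$ returns $y^{(2)} = \begin{pmatrix} 3/2 \\ 3/4 \end{pmatrix} \notin F$.
   3) (Case 2) $d := y^{(2)} - x^{(2)} = \begin{pmatrix} 0 \\ 3/4 \end{pmatrix}$; $a_1^T d = 3/4$, $a_2^T d = 0$, $a_3^T d = -3/4$, $a_4^T d = 0$ give $\alpha = \dfrac{2 - 3/2}{3/4} = 2/3$ and hence $x^{(3)} := x^{(2)} + \tfrac23\, d = \begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix}$, $J_3 := \mathcal{A}(x^{(3)}) = \{1, 4\}$.

3. 1) $x_1^2 - x_1 x_2 + x_2^2 - 3 x_1 \to \min$ subject to $x_1 + x_2 = 2$, $x_1 = 3/2$ $\implies y^{(3)} = \begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix}$
   2) $y^{(3)} \in F$: $u_1 \begin{pmatrix} 1 \\ 1 \end{pmatrix} + u_4 \begin{pmatrix} 1 \\ 0 \end{pmatrix} = -\begin{pmatrix} -1/2 \\ -1/2 \end{pmatrix} = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}$, that is, $u_1 + u_4 = 1/2$, $u_1 = 1/2$, is solved by $u_1 = 1/2$, $u_4 = 0$. Hence, $x^{(4)} := y^{(3)} = \begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix}$ gives the optimum.

Proposition 4.2.4 (see footnote 4)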
Let $m \ge n$, $\operatorname{rank}(A) = n$ and all the minimizers $y^{(k)}$ to the problems $(P_k)$ which are feasible to (CP) be nondegenerate. Then the algorithm reaches an optimum of (CP) after a finite number of steps.

Footnote 4: Confer [Bl/Oe 2], chapter 13.

Proof: Let $k \in \mathbb{N}_0$.

a) If $y^{(k)} \in F$ with $u_\mu \ge 0$ for all $\mu \in J_k$, then $y^{(k)}$ meets the Karush–Kuhn–Tucker conditions for (CP). Hence, $y^{(k)}$ yields an optimum to (CP).

b) $x^{(k)}$ is a feasible point of $(P_k)$. Hence, $f(y^{(k)}) \le f(x^{(k)})$ holds. Furthermore we get $x^{(k+1)} = (1 - \alpha)\,x^{(k)} + \alpha\, y^{(k)} \in F_{P_k}$ with an $\alpha \in (0, 1]$. As $f$ is convex, it follows that
$$f(y^{(k)}) \le f(x^{(k+1)}) \le (1 - \alpha)\, f(x^{(k)}) + \alpha\, f(y^{(k)}) \le f(x^{(k)}) .$$
If $y^{(k)} \in F$ with $\nabla f(y^{(k)}) + \sum_{\mu \in J_k} u_\mu a_\mu = 0$ and $u_s < 0$ for an index $s \in J_k$, then we first get $x^{(k+1)} := y^{(k)}$ and $J_{k+1} := \mathcal{A}(y^{(k)}) \setminus \{s\}$. By assumption $y^{(k)}$ is nondegenerate. Lemma 4.2.3 then gives that
$$f(y^{(k+1)}) < f(y^{(k)}) = f(x^{(k+1)}), \qquad a_s^T y^{(k+1)} < b_s .$$
From that results, with a $\beta \in (0, 1]$,
$$f(x^{(k+2)}) = f\big((1 - \beta)\, x^{(k+1)} + \beta\, y^{(k+1)}\big) \le (1 - \beta)\, f(x^{(k+1)}) + \beta\, f(y^{(k+1)}) < f(x^{(k+1)}) .$$

c) Case 1 can only occur a finite number of times, since, if case 1 occurs for problem $(P_k)$, we obtain $f(y^{(k)}) = f(x^{(k+1)}) > f(x^{(k+2)}) \ge f(y^{(k+1)})$ by b) and thus $f(x^{(k+2)}) \ge f(x^{(k+\ell)}) \ge f(y^{(k+\ell)})$ for all $\ell \ge 2$. Therefore $\min(P_{k+\ell}) < \min(P_k)$ holds for all $\ell \in \mathbb{N}$. As $I$ has only a finite number of subsets, case 1 can only appear a finite number of times.

d) Case 2 can at most occur $m$ times in a row: Let $y^{(k)} \notin F$. Then $x^{(k+1)} = (1 - \alpha)\, x^{(k)} + \alpha\, y^{(k)} \in F_{P_k}$ with $\alpha \in (0, 1)$ holds. There exists at least one $r \in I \setminus J_k$ with $a_r^T y^{(k)} > b_r$ and $a_r^T x^{(k+1)} = b_r$. From that follows $J_{k+1} := \mathcal{A}(x^{(k+1)}) \supset J_k \cup \{r\}$, hence, $J_{k+1} \ne J_k$, where $J_k = \emptyset$ is possible. If case 2 occurs $\ell$ times in a row from the $k$-th to the $(k+\ell)$-th iteration, then $J_{k+\ell}$ contains at least $\ell$ elements. Therefore there exists an $\ell \le m$ such that $\{a_\mu \mid \mu \in J_{k+\ell}\}$ is a spanning set of $\mathbb{R}^n$. From this follows $F_{P_{k+\ell}} = \{x^{(k+\ell)}\}$ and $y^{(k+\ell)} = x^{(k+\ell)} \in F$, that is, after at most $m$ repetitions of case 2, we return to case 1.

The finiteness of the maximum number of steps does not mean that the algorithm is altogether finite, because it is possible that the subproblems $(P_k)$ can only be solved iteratively.

From an applicational point of view single exchange is more advantageous than multiple exchange:

Algorithm (single exchange)

0) Let $x^{(0)} \in F$, $J_0 := \mathcal{A}(x^{(0)})$ and $\{a_\mu \mid \mu \in J_0\}$ be linearly independent. If the points $x^{(0)}, \ldots, x^{(k)} \in F$ and the corresponding index sets $J_0, \ldots, J_k$ are already determined for a $k \in \mathbb{N}_0$, we obtain $x^{(k+1)}$ and $J_{k+1}$ as follows:

1) Determine a minimizer $y^{(k)}$ to
$$(P_k)\qquad \begin{cases} f(x) \to \min \\ a_\mu^T x = b_\mu \ \text{ for all } \mu \in J_k . \end{cases}$$

2) Case 1: $y^{(k)} \in F$
Solve $\sum_{\mu \in J_k} u_\mu a_\mu = -\nabla f(y^{(k)})$.
If $u_\mu \ge 0$ for all $\mu \in J_k$, then the point $y^{(k)}$ is a minimizer to (CP) and the algorithm stops. Otherwise choose an index $s \in J_k$ with $u_s < 0$, for example $s \in J_k$ minimal with $u_s = \min\{u_\mu \mid \mu \in J_k\}$. Set $J_{k+1} := J_k \setminus \{s\}$ and $x^{(k+1)} := y^{(k)}$ (deactivation step).
3) Case 2: $y^{(k)} \notin F$ (activation step)
For $d := y^{(k)} - x^{(k)}$ we determine an index $r \in I \setminus J_k$ with $a_r^T d > 0$ and
$$\alpha := \min\left\{ \frac{b_\mu - a_\mu^T x^{(k)}}{a_\mu^T d} \;\middle|\; \mu \notin J_k,\ a_\mu^T d > 0 \right\} = \frac{b_r - a_r^T x^{(k)}}{a_r^T d} \qquad (0 \le \alpha < 1)$$
and set $x^{(k+1)} := x^{(k)} + \alpha d$ and $J_{k+1} := J_k \cup \{r\}$.
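The single exchange algorithm can be sketched in a few lines of MATLAB for the quadratic case. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes $C$ positive definite and linearly independent active columns, so that each subproblem $(P_k)$ can be solved through its KKT system; the logical mask `act`, the tolerance `tol` and the iteration cap are hypothetical choices. The data are those of the text's Example 5.

```matlab
% Data of Example 5; columns of A are a_1,...,a_4
C = [2 -1; -1 2];  c = [-3; 0];
A = [1 -1 0 1; 1 0 -1 0];  b = [2; 0; 0; 3/2];
n = size(C,1);
tol = 1e-10;                          % illustrative tolerance (assumption)

x   = [0; 0];                         % feasible starting point x^(0)
act = abs(A'*x - b) <= tol;           % logical mask of J_0 = A(x^(0))
converged = false;

for k = 1:50                          % iteration cap (assumption)
    AJ = A(:,act);  nJ = nnz(act);
    % Step 1: minimizer y of (P_k) and multipliers u from the KKT system
    sol = [C, AJ; AJ', zeros(nJ)] \ [-c; b(act)];
    y = sol(1:n);  u = sol(n+1:end);

    if all(A'*y <= b + tol)           % Case 1: y is feasible
        x = y;
        if all(u >= -tol)             % KKT conditions of (CP) hold: stop
            converged = true;
            break;
        end
        idx = find(act);
        [~, s] = min(u);              % first index with the smallest multiplier
        act(idx(s)) = false;          % deactivation step
    else                              % Case 2: y is infeasible
        d  = y - x;
        Ad = A'*d;
        cand = find(Ad > tol & ~act); % indices outside J_k with a_mu'*d > 0
        [alpha, i] = min((b(cand) - A(:,cand)'*x) ./ Ad(cand));
        x = x + alpha*d;
        act(cand(i)) = true;          % activation step
    end
end
J = find(act)';
fprintf('x = (%g, %g), J = {%s}\n', x(1), x(2), num2str(J));
```

Keeping the active set as a logical mask avoids special cases when $J_k$ becomes empty; the multipliers in `u` are then ordered by increasing constraint index.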
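Remark
Let $\{a_\mu \mid \mu \in J_0\}$ be linearly independent. Then the vectors $\{a_\mu \mid \mu \in J_k\}$ are also linearly independent for $k \ge 1$. Hence, $\ell \le n$ holds (for the $\ell$ according to d) in the proof of proposition 4.2.4).
Proof: by exercise 17.

Example 6 (cf. Example 5)

0. (unaltered)
   0) $x^{(0)} := 0 \in F$, $J_0 = \mathcal{A}(x^{(0)}) = \{2, 3\}$; obviously this returns:
   1) $y^{(0)} = 0$
   2) $y^{(0)} \in F$: $u_2 \begin{pmatrix} -1 \\ 0 \end{pmatrix} + u_3 \begin{pmatrix} 0 \\ -1 \end{pmatrix} = \begin{pmatrix} 3 \\ 0 \end{pmatrix} \implies u_2 = -3,\ u_3 = 0$. For $s = 2$ we get $J_1 := J_0 \setminus \{2\} = \{3\}$ and $x^{(1)} := y^{(0)} = 0$.

1. 1) $x_1^2 - x_1 x_2 + x_2^2 - 3 x_1 \to \min$ subject to $x_2 = 0$ again yields $y^{(1)} = \begin{pmatrix} 3/2 \\ 0 \end{pmatrix}$.
   2) $y^{(1)} \in F$: $u_3 \begin{pmatrix} 0 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 3/2 \end{pmatrix}$ gives $u_3 = -3/2 < 0$. For $s = 3$ we get $J_2 := J_1 \setminus \{3\} = \emptyset$ and $x^{(2)} := y^{(1)} = \begin{pmatrix} 3/2 \\ 0 \end{pmatrix}$.

2. 1) $x_1^2 - x_1 x_2 + x_2^2 - 3 x_1 \to \min$ (unconstrained minimum) is solved by $y^{(2)} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$.
   3) $y^{(2)} \notin F$; $d := y^{(2)} - x^{(2)} = \begin{pmatrix} 1/2 \\ 1 \end{pmatrix}$;
   $a_1^T d = 3/2$, $a_2^T d = -1/2$, $a_3^T d = -1$, $a_4^T d = 1/2$;
   $\alpha := \min\left\{ \dfrac{2 - 3/2}{3/2},\ \dfrac{3/2 - 3/2}{1/2} \right\} = 0$, $r = 4$;
   $x^{(3)} := x^{(2)} + \alpha d = x^{(2)} = \begin{pmatrix} 3/2 \\ 0 \end{pmatrix}$, $J_3 := J_2 \cup \{4\} = \{4\}$

3. 1) $x_1^2 - x_1 x_2 + x_2^2 - 3 x_1 \to \min$ subject to $x_1 = 3/2$ $\implies y^{(3)} = \begin{pmatrix} 3/2 \\ 3/4 \end{pmatrix}$
   3) $y^{(3)} \notin F$; $d := y^{(3)} - x^{(3)} = \begin{pmatrix} 0 \\ 3/4 \end{pmatrix}$;
   $a_1^T d = 3/4$, $a_2^T d = 0$, $a_3^T d = -3/4$, $a_4^T d = 0$;
   $\alpha := \min\left\{ \dfrac{2 - 3/2}{3/4} \right\} = 2/3$, $r = 1$;
   $x^{(4)} := \begin{pmatrix} 3/2 \\ 0 \end{pmatrix} + \tfrac23 \begin{pmatrix} 0 \\ 3/4 \end{pmatrix} = \begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix}$, $J_4 := J_3 \cup \{1\} = \{1, 4\}$

4. 1) $x_1^2 - x_1 x_2 + x_2^2 - 3 x_1 \to \min$ subject to $x_1 + x_2 = 2$, $x_1 = 3/2$ $\implies y^{(4)} = \begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix}$
   2) $y^{(4)} \in F$; $u_1 \begin{pmatrix} 1 \\ 1 \end{pmatrix} + u_4 \begin{pmatrix} 1 \\ 0 \end{pmatrix} = -\begin{pmatrix} -1/2 \\ -1/2 \end{pmatrix} = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}$ is solved by $u_1 = 1/2$, $u_4 = 0$. Hence, $x^{(5)} := y^{(4)} = \begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix}$ gives the optimum!

When executing the two algorithms just discussed, we had to solve a sequence of optimization problems subject to linear equality constraints. Therefore we will briefly treat this type of problem but only consider the quadratic case:

Minimization Subject to Linear Equality Constraints

We consider the following problem:
$$(QPE)\qquad \begin{cases} f(x) := \tfrac12 x^T C x + c^T x \to \min \\ A^T x = b \end{cases}$$
Let $A = (a_1, \ldots, a_m) \in \mathbb{R}^{n\times m}$, $m \le n$, $\operatorname{rank}(A) = m$, $C \in \mathbb{R}^{n\times n}$ symmetric positive semidefinite, $c \in \mathbb{R}^n$ and $b \in \mathbb{R}^m$.

Elimination of Variables (by means of QR-factorization): The matrix $A$ can be written in the factorized form $A = Q^T \tilde R$ with an orthogonal $(n, n)$-matrix $Q$ and an upper triangular matrix $\tilde R$. Because of the assumption $\operatorname{rank}(A) = m$, the diagonal elements of $\tilde R$ can be chosen to be positive. With that the factorization is unique, and $\tilde R$ has the form
$$\tilde R = \begin{pmatrix} R \\ 0 \end{pmatrix}$$
with an upper triangular matrix $R \in \mathbb{R}^{m\times m}$ and $0 \in \mathbb{R}^{(n-m)\times m}$. According to this, let $Q$ be split up in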
$$Q = \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix}$$
with $Q_1 \in \mathbb{R}^{m\times n}$ and $Q_2 \in \mathbb{R}^{(n-m)\times n}$. $Q A = \tilde R$ then decomposes into $Q_1 A = R$ and $Q_2 A = 0$. Every $x \in \mathbb{R}^n$ can be represented uniquely as $x = Q^T \tilde u$ with a vector $\tilde u \in \mathbb{R}^n$. Decomposing $\tilde u$ in the form $\tilde u = \begin{pmatrix} u \\ v \end{pmatrix}$ with $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^{n-m}$, we get $x = Q^T \tilde u = Q_1^T u + Q_2^T v$. Thus
$$A^T x = b \iff A^T Q_1^T u + A^T Q_2^T v = b \iff R^T u = b \iff u = R^{-T} b$$
and, further,
$$x = Q_1^T R^{-T} b + Q_2^T v = x_0 + Q_2^T v \qquad\text{with } x_0 := Q_1^T R^{-T} b = \big(R^{-1} Q_1\big)^T b .$$
For $Z := Q_2^T \in \mathbb{R}^{n\times(n-m)}$ the substitution $x = x_0 + Z v$ with $v \in \mathbb{R}^{n-m}$ leads to the minimization of the reduced function $\varphi$ defined by $\varphi(v) := f(x_0 + Z v)$. $\varphi'(v) = f'(x_0 + Z v)\, Z$ yields the reduced gradient
$$\nabla\varphi(v) = Z^T \nabla f(x_0 + Z v) = Z^T \big( C (x_0 + Z v) + c \big) = Z^T \nabla f(x_0) + Z^T C Z\, v$$
and the reduced Hessian
$$\nabla^2 \varphi(v) = Z^T C Z .$$
From that we obtain the 'Taylor series expansion'
$$\varphi(v) = f(x_0) + f'(x_0)\, Z v + \tfrac12 v^T Z^T C Z\, v .$$
Now the methods for the computation of 'unconstrained minima' can be applied to $\varphi$.

Example 7
$$f(x) := x_1^2 + x_2^2 + x_3^2 \to \min, \qquad x_1 + 2 x_2 - x_3 = 4, \qquad x_1 - x_2 + x_3 = -2$$
Here we have the special case: $m = 2$, $n = 3$,
$$C = 2 I_3, \quad c = 0, \quad \nabla f(x) = 2x, \quad b = \begin{pmatrix} 4 \\ -2 \end{pmatrix} \quad\text{and}\quad A = \begin{pmatrix} 1 & 1 \\ 2 & -1 \\ -1 & 1 \end{pmatrix} .$$
After a simple calculation (see footnote 5) we obtain
$$\begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} A = \begin{pmatrix} 1/\sqrt6 & 2/\sqrt6 & -1/\sqrt6 \\ 4/\sqrt{21} & -1/\sqrt{21} & 2/\sqrt{21} \\ 1/\sqrt{14} & -2/\sqrt{14} & -3/\sqrt{14} \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 2 & -1 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} \sqrt6 & -\sqrt6/3 \\ 0 & \sqrt{21}/3 \\ 0 & 0 \end{pmatrix} .$$
Hence, it follows that
$$R^{-1} = \begin{pmatrix} 1/\sqrt6 & 1/\sqrt{21} \\ 0 & 3/\sqrt{21} \end{pmatrix} \quad\text{and}\quad R^{-T} b = \begin{pmatrix} 4/\sqrt6 \\ -2/\sqrt{21} \end{pmatrix},$$
$$x_0 := Q_1^T R^{-T} b = \frac27 \begin{pmatrix} 1 \\ 5 \\ -3 \end{pmatrix}, \qquad x = x_0 + Z v = x_0 + Q_2^T v ,$$
$$0 \overset{!}{=} \nabla\varphi(v) = Q_2 \nabla f(x_0) + Q_2\, 2 I_3\, Q_2^T v = 2\, Q_2 x_0 + 2 v \iff v = 0 ,$$
as in this case $Q_2 x_0 = 0$. Then $\bar x := x_0$ gives the optimum.

Footnote 5: A widely used computation method for the QR-decomposition uses Householder matrices $I - 2 a a^T$ with $a^T a = 1$.

Calculation of the Lagrange multipliers:
With $g := \nabla f(\bar x)$ we get:
$$0 \overset{!}{=} \nabla f(\bar x) + \sum_{\mu=1}^{m} u_\mu a_\mu = g + A u \iff A u = -g \iff \tilde R u = Q A u = -Q g \implies R u = -Q_1 g \iff u = -R^{-1} Q_1 g$$
for $u = (u_1, \ldots, u_m)^T \in \mathbb{R}^m$. Then for our Example 7 it follows that
$$g := \nabla f(\bar x) = 2 \bar x = \frac47 \begin{pmatrix} 1 \\ 5 \\ -3 \end{pmatrix}, \qquad Q_1 g = \frac47 \begin{pmatrix} 14/\sqrt6 \\ -7/\sqrt{21} \end{pmatrix},$$
$$u = -\frac47 \begin{pmatrix} 1/\sqrt6 & 1/\sqrt{21} \\ 0 & 3/\sqrt{21} \end{pmatrix} \begin{pmatrix} 14/\sqrt6 \\ -7/\sqrt{21} \end{pmatrix} = \frac47 \begin{pmatrix} -2 \\ 1 \end{pmatrix} .$$

The elimination method described above is often called the zero space method (null space method) in the literature. It is advantageous if $m \approx n$ holds. In the case of $m \ll n$ it is more useful to apply the so-called range space method. We start from the Karush–Kuhn–Tucker conditions as stated in proposition 4.2.2: A point $x \in F$ is a minimizer to problem (QP) if and only if there exist vectors $y, u \in \mathbb{R}^m_+$ with
$$C x + c + A u = 0, \qquad A^T x + y = b \qquad\text{and}\qquad u^T y = 0 .$$
As we consider the equality constraint case $A^T x = b$, we have $y = 0$, and therefore the system reduces to
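$$\begin{pmatrix} C & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} x \\ u \end{pmatrix} = \begin{pmatrix} -c \\ b \end{pmatrix} .$$
For the coefficient matrix of this linear system we will show:

1) If $C$ is positive definite on $\operatorname{kernel}(A^T)$, then $\begin{pmatrix} C & A \\ A^T & 0 \end{pmatrix}$ is invertible.

2) If $C$ is positive definite, then it holds that
$$\begin{pmatrix} C & A \\ A^T & 0 \end{pmatrix}^{-1} = \begin{pmatrix} U & X \\ X^T & W \end{pmatrix} \qquad\text{with}\qquad \begin{cases} W = -\big(A^T C^{-1} A\big)^{-1} \\ X = -C^{-1} A W \\ U = C^{-1} - X A^T C^{-1} . \end{cases}$$

Proof: 1) From
$$\begin{pmatrix} C & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
for $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ it follows that $C x + A y = 0$ and $x \in \operatorname{kernel}(A^T)$.
$0 = \langle x, C x \rangle + \langle x, A y \rangle = \langle x, C x \rangle + \langle A^T x, y \rangle = \langle x, C x \rangle$ shows $x = 0$. Hence, because of $\operatorname{rank}(A) = m$, $y = 0$ follows.

2) $\begin{pmatrix} U & X \\ X^T & W \end{pmatrix} \begin{pmatrix} C & A \\ A^T & 0 \end{pmatrix} = \begin{pmatrix} I_n & 0 \\ 0 & I_m \end{pmatrix} = I_{n+m}$ is equivalent to
$$U C + X A^T = I_n , \qquad U A = 0 , \qquad X^T C + W A^T = 0 , \qquad X^T A = I_m .$$
The third equation implies $X^T + W A^T C^{-1} = 0$ $(\ast)$ and thus $X^T A + W A^T C^{-1} A = 0$, hence, from the fourth equation, $W A^T C^{-1} A = -I_m$ and $W = -\big(A^T C^{-1} A\big)^{-1}$. From $(\ast)$ follows $X = -C^{-1} A W^T = -C^{-1} A W$. The first equation shows $U = C^{-1} - X A^T C^{-1}$.

For $C$ positive definite we obtain the solution to the Karush–Kuhn–Tucker conditions
$$\begin{pmatrix} C & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} x \\ u \end{pmatrix} = \begin{pmatrix} -c \\ b \end{pmatrix}$$
from
$$\begin{pmatrix} x \\ u \end{pmatrix} = \begin{pmatrix} U & X \\ X^T & W \end{pmatrix} \begin{pmatrix} -c \\ b \end{pmatrix} = \begin{pmatrix} -U c + X b \\ -X^T c + W b \end{pmatrix},$$
that is, $u = -X^T c + W b$ and $x = -C^{-1} (A u + c)$.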
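Both approaches to (QPE) can be compared on the data of Example 7. The following MATLAB lines are a sketch under the text's assumptions ($\operatorname{rank}(A) = m$, $C$ positive definite); the variable names are illustrative. MATLAB's `qr` returns $A = Q_f R_f$, so `Qf` plays the role of $Q^T$ in the notation above, and its sign conventions may differ from the hand calculation without affecting $x$ and $u$.

```matlab
% Data of Example 7
C = 2*eye(3);  c = zeros(3,1);
A = [1 1; 2 -1; -1 1];  b = [4; -2];
[n, m] = size(A);

% --- Null space (elimination) method via a full QR factorization ---
[Qf, Rf] = qr(A);                 % A = Qf*Rf, Qf corresponds to Q' in the text
Q1t = Qf(:,1:m);                  % Q1'
Z   = Qf(:,m+1:n);                % Z = Q2', columns span kernel(A')
R   = Rf(1:m,1:m);
x0  = Q1t * (R' \ b);             % particular solution x0 = Q1' R^{-T} b
v   = (Z'*C*Z) \ (-Z'*(C*x0 + c));        % reduced system Z'CZ v = -Z' grad f(x0)
x_null = x0 + Z*v;
u_null = -(R \ (Q1t'*(C*x_null + c)));    % u = -R^{-1} Q1 g

% --- Range space method (C positive definite) ---
S = A'*(C\A);                     % A' C^{-1} A, so that W = -inv(S)
u_range = -(S \ (A'*(C\c) + b));  % u = -X'c + W b = W (A' C^{-1} c + b)
x_range = -(C \ (A*u_range + c)); % x = -C^{-1} (A u + c)

disp([x_null, x_range]);  disp([u_null, u_range]);
```

The null space variant solves a system of order $n - m$, the range space variant one of order $m$, which reflects the text's remark on when each is preferable.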
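Continuation of Example 7
$$C^{-1} = \tfrac12 I_3, \quad c = 0, \quad b = \begin{pmatrix} 4 \\ -2 \end{pmatrix}, \quad A = \begin{pmatrix} 1 & 1 \\ 2 & -1 \\ -1 & 1 \end{pmatrix}$$
$$W = -\big(A^T C^{-1} A\big)^{-1} = -2\, (A^T A)^{-1} = -2 \begin{pmatrix} 6 & -2 \\ -2 & 3 \end{pmatrix}^{-1} = -\frac17 \begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix}$$
$$u = W b = -\frac17 \begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix} \begin{pmatrix} 4 \\ -2 \end{pmatrix} = -\frac17 \begin{pmatrix} 8 \\ -4 \end{pmatrix} = \frac47 \begin{pmatrix} -2 \\ 1 \end{pmatrix}$$
$$x = -\tfrac12 A u = -\frac12 \cdot \frac47 \begin{pmatrix} 1 & 1 \\ 2 & -1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} -2 \\ 1 \end{pmatrix} = \frac27 \begin{pmatrix} 1 \\ 5 \\ -3 \end{pmatrix}$$

These formulae for the solution of the Karush–Kuhn–Tucker conditions do not have the right form to react flexibly to changes of $A$ within the active set method. This works better in the following way:
$$Q A = \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} A = \begin{pmatrix} R \\ 0 \end{pmatrix} \quad\text{yields}\quad A = Q^T \begin{pmatrix} R \\ 0 \end{pmatrix} = \big( Q_1^T \;\; Q_2^T \big) \begin{pmatrix} R \\ 0 \end{pmatrix} = Q_1^T R , \quad\text{hence } A^T = R^T Q_1 .$$
$C x = -(A u + c) = -\big( Q_1^T R\, u + c \big)$ then gives with $v := R u$:
$$\begin{aligned}
v = R u &= R \big( -X^T c + W b \big) \\
&= R\, W \big( A^T C^{-1} c + b \big) && \text{(definition of $X$ and symmetry of $W$, $C$)} \\
&= R\, W \big( R^T Q_1 C^{-1} c + b \big) \\
&= -R \big( R^T Q_1 C^{-1} Q_1^T R \big)^{-1} \big( R^T Q_1 C^{-1} c + b \big) && \text{(definition of $W$)} \\
&= -\big( Q_1 C^{-1} Q_1^T \big)^{-1} \big( Q_1 C^{-1} c + R^{-T} b \big) .
\end{aligned}$$
The Cholesky decomposition $C = L D L^T$ ($L$ 'unit' lower triangular matrix, $D$ diagonal matrix) and $S := Q_1 L^{-T}$ then yield
$$Q_1 C^{-1} Q_1^T = Q_1 L^{-T} D^{-1} L^{-1} Q_1^T = S D^{-1} S^T .$$

The Goldfarb–Idnani Method (see footnote 6)

One of the main advantages of the following dual method is that it does not need a primal feasible starting vector. It is one of the best general-purpose methods for quadratic optimization. The method iteratively generates 'dually optimal', but possibly infeasible approximations with isotone values of the objective function.

Footnote 6: Confer [Go/Id].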
We consider the problem
$$(QP)\qquad \begin{cases} f(x) := \tfrac12 x^T C x + c^T x \to \min \\ A^T x \le b . \end{cases}$$
Let $A = (a_1, \ldots, a_m) \in \mathbb{R}^{n\times m}$, $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, $C \in \mathbb{R}^{n\times n}$ symmetric positive definite. For $J \subset I := \{1, \ldots, m\}$ we again consider the subproblem
$$(P_J)\qquad \begin{cases} f(y) \to \min \\ A_J^T y \le b_J . \end{cases}$$

Definition
For $x \in \mathbb{R}^n$, $(x, J)$ is called an S-pair to (QP) iff
$$\begin{cases} \operatorname{rank}(A_J) = |J| , \\ A_J^T x = b_J , \\ x \text{ is a minimizer to } (P_J) . \end{cases}$$
Here "S" stands for "solution".

The unconstrained problem $(P_\emptyset)$ with the minimizer $x = -C^{-1} c$ is a trivial example: $(x, \emptyset)$ is an S-pair to (QP). If $(x, J)$ is an S-pair to (QP) with $x$ feasible to (QP), then $x$ is a minimizer to (QP).
Definition
Let $x \in \mathbb{R}^n$, $J \subset I$, $p \in I \setminus J$, $\hat J := J \cup \{p\}$. Then $(x, J, p)$ is called a V-triple iff
$$\begin{cases} A_J^T x = b_J , \quad a_p^T x > b_p , \\ \{a_\mu \mid \mu \in \hat J\} \text{ linearly independent and} \\ \nabla f(x) + \sum_{\mu \in J} u_\mu a_\mu + \lambda\, a_p = 0 \\ \text{with suitable } u_\mu \ge 0 \text{ and } \lambda \ge 0 \end{cases}$$
hold. Here "V" stands for "violation".

Algorithm (model algorithm)

0) Let an S-pair $(x, J)$ be given. (Footnote 7: Choose, for example, $(-C^{-1} c, \emptyset)$.)

1) If $A^T x \le b$ holds, then $x$ is a minimizer to (QP): STOP

2) Otherwise choose a $p \in I$ with $a_p^T x > b_p$ (violated constraint): $\hat J := J \cup \{p\}$
If $(P_{\hat J})$ does not have any feasible points, then (QP) does not have any feasible points either: STOP

3) Otherwise determine a new S-pair $(\bar x, \bar J)$ with $p \in \bar J \subset \hat J$ and $f(\bar x) > f(x)$. Set $x := \bar x$, $J := \bar J$ and go to 1).

The realization of the above algorithm naturally raises two questions. Starting from a current S-pair $(x, J)$ and a constraint violated by $x$, how can we decide efficiently whether the new problem has feasible vectors, and how, if necessary, can we produce a new S-pair $(\bar x, \bar J)$ with the properties mentioned above? If we can ensure the realization, then the algorithm stops after a finite number of steps since there exist only a finite number of S-pairs, and after each step the value of the objective function increases strictly, hence, 'cycles' are excluded.
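Implementation of step 3):

Let $(x, J)$ be an S-pair and $p \in I$ with $a_p^T x > b_p$. According to this assumption ($x$ is a minimizer to $(P_J)$) there exists a vector $u \ge 0$ with $\nabla f(x) + A_J u = 0$.

We define $\alpha := a_p^T x - b_p$. In the following, when determining a new S-pair, we will consider the two cases

(I) $\{a_\mu \mid \mu \in \hat J\}$ linearly independent and
(II) $a_p$ linearly dependent on $\{a_\mu \mid \mu \in J\}$

separately.

For $\alpha \ge 0$ we consider the problem
$$(P_\alpha)\qquad \begin{cases} f(y) \to \min \\ A_J^T y = b_J \\ a_p^T y = a_p^T x - \alpha . \end{cases}$$

Remark
The problem $(P_0)$ has the minimizer $x$ with Lagrange multiplier $\begin{pmatrix} u \\ 0 \end{pmatrix} \ge 0$, since $\nabla f(x) + A_{\hat J} \begin{pmatrix} u \\ 0 \end{pmatrix} = 0$ holds.

Case I: $\{a_\mu \mid \mu \in \hat J\}$ linearly independent
Then $(x, J, p)$ is a V-triple. Let $x(\alpha)$ denote the minimizer to $(P_\alpha)$ and $\begin{pmatrix} u(\alpha) \\ \lambda(\alpha) \end{pmatrix}$ the corresponding Lagrange multiplier.

The Karush–Kuhn–Tucker conditions (cf. page 167) of $(P_\alpha)$ are:
$$C\, x(\alpha) + c + A_{\hat J} \begin{pmatrix} u(\alpha) \\ \lambda(\alpha) \end{pmatrix} = 0 , \qquad A_{\hat J}^T\, x(\alpha) = \tilde b_{\hat J} - \alpha\, e_{|J|+1} \quad\text{with } \tilde b_{\hat J} := \begin{pmatrix} b_J \\ a_p^T x \end{pmatrix}$$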
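This system of linear equations has a unique solution (cf. page 178), which is affinely linear with respect to $\alpha$. Using the solution $\begin{pmatrix} \hat x \\ \hat u \\ \hat\lambda \end{pmatrix}$ for the right-hand side $\begin{pmatrix} 0 \\ -e_{|J|+1} \end{pmatrix}$, we obviously obtain, with a suitable $\lambda \ge 0$ (which according to the above remark is equal to zero in the beginning),
$$\begin{pmatrix} x(\alpha) \\ u(\alpha) \\ \lambda(\alpha) \end{pmatrix} = \begin{pmatrix} x + \alpha \hat x \\ u + \alpha \hat u \\ \lambda + \alpha \hat\lambda \end{pmatrix}$$
and
$$C \hat x + A_{\hat J} \begin{pmatrix} \hat u \\ \hat\lambda \end{pmatrix} = 0 , \quad A_{\hat J}^T \hat x = -e_{|J|+1} , \qquad C x + c + A_{\hat J} \begin{pmatrix} u \\ \lambda \end{pmatrix} = 0 , \quad A_{\hat J}^T x = \tilde b_{\hat J} ,$$
$$f(x(\alpha)) = f(x) + \alpha\, f'(x)\, \hat x + \tfrac12 \alpha^2\, \hat x^T C \hat x , \qquad f'(x) = (C x + c)^T .$$

Remark
It holds that $f'(x)\, \hat x = \lambda \ (\ge 0)$ and $\hat x^T C \hat x = \hat\lambda > 0$.

Proof: $\hat x$ is unequal to zero (because of $A_{\hat J}^T \hat x = -e_{|J|+1}$), hence, we get
$$0 < \hat x^T C \hat x = -\hat x^T A_{\hat J} \begin{pmatrix} \hat u \\ \hat\lambda \end{pmatrix} = e_{|J|+1}^T \begin{pmatrix} \hat u \\ \hat\lambda \end{pmatrix} = \hat\lambda ,$$
$$0 \le \lambda = (u^T, \lambda)\, e_{|J|+1} = -\begin{pmatrix} u \\ \lambda \end{pmatrix}^T A_{\hat J}^T \hat x = f'(x)\, \hat x .$$

Conclusion:
The function $f \circ x$, that is, $(f \circ x)(\alpha) = f(x(\alpha)) = f(x) + \alpha \lambda + \dfrac{\alpha^2}{2} \hat\lambda$ for $\alpha \ge 0$, is strongly isotone.

To guarantee $u(\alpha) = u + \alpha \hat u \ge 0$, we define
$$\hat\alpha := \min\left\{ -\frac{u_\mu}{\hat u_\mu} \;\middle|\; \mu \in J,\ \hat u_\mu < 0 \right\},$$
that is, $\hat\alpha = \min \emptyset = \infty$ if $\hat u \ge 0$. For $0 \le \alpha \le \hat\alpha$ we have $u(\alpha) \ge 0$ and $\lambda(\alpha) \ge 0$. Hence, $x(\alpha)$ is a minimizer to the problem
$$(P_\alpha^{\le})\qquad \begin{cases} f(y) \to \min \\ A_J^T y \le b_J \\ a_p^T y \le a_p^T x - \alpha . \end{cases}$$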
Case Ia: $0 < \alpha \le \hat\alpha$
$\big(x(\alpha), J \cup \{p\}\big)$ is an S-pair with $f(x(\alpha)) > f(x)$. Therefore, we set $x := x(\alpha)$, $J := J \cup \{p\}$.

Case Ib: $0 \le \hat\alpha < \alpha$
There exists a $\mu \in J$ with $u_\mu(\hat\alpha) = 0$ and $f(x(\hat\alpha)) \ge f(x)$, with '=' only for $\hat\alpha = 0$.

Replace:
$x$ by $x(\hat\alpha)$
$u$ by the vector arising from $u(\hat\alpha)$ when the component $u_\mu$ is removed
$\lambda$ by $\lambda(\hat\alpha)$
$J$ by $J \setminus \{\mu\}$
$\alpha$ by $a_p^T x(\hat\alpha) - b_p = \alpha - \hat\alpha$

Then $(x, J, p)$ is a V-triple, that is, we are in the starting situation of case I, just with one constraint less. If we repeat the described procedure, after a finite number of steps (at the latest, if $J = \emptyset$) case Ia will occur.

Case II: $a_p$ depends linearly on $\{a_\mu \mid \mu \in J\}$, that is, $a_p = A_J v$ with a suitable $v \in \mathbb{R}^{|J|}$.

Case IIa: $v \le 0$
Then, respecting $A_J^T x = b_J$ and $a_p^T x > b_p$, the inequality system $A_J^T y \le b_J$ and $a_p^T y \le b_p$ is unsolvable, because otherwise we would get $A_J^T (y - x) \le 0$, $a_p^T (y - x) < 0$ and therefore a contradiction
$$0 > a_p^T (y - x) = v^T A_J^T (y - x) \ge 0 .$$

Case IIb: $v \not\le 0$
As $(x, J)$ is an S-pair, there exist $u_\nu \ge 0$ for $\nu \in J$ with $\nabla f(x) + \sum_{\nu \in J} u_\nu a_\nu = 0$. Let $\mu \in J$ with $v_\mu > 0$ and
$$\frac{u_\mu}{v_\mu} = \min\left\{ \frac{u_\nu}{v_\nu} \;\middle|\; \nu \in J,\ v_\nu > 0 \right\} .$$
From $a_p = A_J v = \sum_{\nu \ne \mu} v_\nu a_\nu + v_\mu a_\mu$ we get $a_\mu = \dfrac{1}{v_\mu}\, a_p - \sum_{\nu \ne \mu} \dfrac{v_\nu}{v_\mu}\, a_\nu$, hence
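$$0 = \nabla f(x) + \sum_{\nu \ne \mu} u_\nu a_\nu + u_\mu a_\mu = \nabla f(x) + \sum_{\nu \ne \mu} \left( u_\nu - \frac{u_\mu}{v_\mu}\, v_\nu \right) a_\nu + \frac{u_\mu}{v_\mu}\, a_p .$$
All coefficients of this representation are, according to the choice of $\mu$, nonnegative.

Replace $J$ by $J \setminus \{\mu\}$. Now $(x, J, p)$ is a V-triple, and we restart with step 3). The $p$-th constraint is still violated, but now we can continue with case I.

Example 8
$$f(x) := x_1^2 - x_1 x_2 + x_2^2 - 3 x_1 \to \min, \qquad x_1 + x_2 \le 2, \qquad x_1 \ge 0,\ x_2 \ge 0$$
Here we have in particular
$$C = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}, \quad c = \begin{pmatrix} -3 \\ 0 \end{pmatrix}, \quad \nabla f(x) = \begin{pmatrix} 2 x_1 - x_2 - 3 \\ -x_1 + 2 x_2 \end{pmatrix}, \quad A = \begin{pmatrix} 1 & -1 & 0 \\ 1 & 0 & -1 \end{pmatrix}, \quad b = (2, 0, 0)^T .$$

0) $J := \emptyset$ and $x := -C^{-1} c = (2, 1)^T$ give a starting S-pair.

1) $x_1 + x_2 = 3 > 2$

2) Then we have $p = 1$, $\alpha = a_1^T x - b_1 = 1$, $\hat J = \{1\}$.

3) For $\alpha \ge 0$ we solve the problem
$$(P_\alpha)\qquad \begin{cases} f(y) \to \min \\ y_1 + y_2 = 3 - \alpha . \end{cases}$$
Now we have case I, since $\{a_\mu \mid \mu \in \hat J\} = \{a_1\}$ is linearly independent. $C \hat x + A_{\hat J} \begin{pmatrix} \hat u \\ \hat\lambda \end{pmatrix} = 0$ and $A_{\hat J}^T \hat x = -e_{|J|+1} = -e_1 = -1$ result in this special case in
$$\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} \hat x_1 \\ \hat x_2 \end{pmatrix} + \hat\lambda \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0 , \qquad (1, 1) \begin{pmatrix} \hat x_1 \\ \hat x_2 \end{pmatrix} = -1 ,$$
hence are solved by $\hat x_1 = \hat x_2 = -1/2$, $\hat\lambda = 1/2$. We obtain
$$x(\alpha) = \begin{pmatrix} 2 \\ 1 \end{pmatrix} + \alpha \begin{pmatrix} -1/2 \\ -1/2 \end{pmatrix} = \begin{pmatrix} 2 - \alpha/2 \\ 1 - \alpha/2 \end{pmatrix}$$
and $\hat\alpha = \infty > \alpha = 1$; therefore we have case Ia. $x(1) = \begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix}$ is feasible and so gives the solution. Because of $f(x(\alpha)) = f(x) + \dfrac{\alpha^2}{4} = -3 + \dfrac{\alpha^2}{4}$, we obtain $f(x(1)) = -\dfrac{11}{4}$.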
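The single step of Example 8 can be reproduced numerically. The lines below are a minimal sketch of one case I step starting from the unconstrained minimizer with $J = \emptyset$, not a full implementation of the method; the names `x_hat`, `lambda_hat` and `x_new` are illustrative.

```matlab
% Data of Example 8
C = [2 -1; -1 2];  c = [-3; 0];
A = [1 -1 0; 1 0 -1];  b = [2; 0; 0];
f = @(x) 0.5*x'*C*x + c'*x;

x = C \ (-c);                    % starting S-pair (x, J) with J empty
[alpha, p] = max(A'*x - b);      % most violated constraint and its violation
ap = A(:,p);

% Direction (x_hat, lambda_hat): C*x_hat + ap*lambda_hat = 0, ap'*x_hat = -1
sol = [C, ap; ap', 0] \ [0; 0; -1];
x_hat = sol(1:2);  lambda_hat = sol(3);

% J is empty, so there are no multipliers u to keep nonnegative:
% alpha_hat = inf and case Ia applies with the full step alpha
x_new = x + alpha*x_hat;
fprintf('p = %d, alpha = %g, x_new = (%g, %g), f(x_new) = %g\n', ...
        p, alpha, x_new(1), x_new(2), f(x_new));
```

The value $f(x_{\text{new}}) = f(x) + \tfrac{\alpha^2}{2}\hat\lambda$ can be checked against the formula of the Conclusion, with $\lambda = 0$ at the start.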
For practice purposes, we have provided a Matlab program for our readers. We refrained from the consideration of a number of technical details for the benefit of transparency. It is surprising how easily the Goldfarb–Idnani method can then be implemented:

function [x,u,J] = goldfarb(C,c,A,b)
% Experimental version of the Goldfarb-Idnani method
% for quadratic optimization
%
% Problem: min 1/2*x'*C*x + c'*x   s.t.  A'*x <= b

n = size(C,1);
x = C \ (-c);                  % nabla(f) = 0  <==>  C*x = -c
u = [];                        % 0:0 - matrix
lambda = 0; J = [];
[alpha_, p] = max(A'*x - b);   % largest constraint violation

E = 10^(-13);
while (alpha_ > E*norm(A(:,p)))
  ap = A(:,p); AJ = A(:,J); J_ = [J, p];

  if (rank(AJ) < rank([AJ, ap]))                    % Case 1
    AJ_ = A(:,J_); Z = zeros(length(J_));
    M = [C, AJ_; AJ_', Z];
    rhs = [zeros(n,1); zeros(length(J),1); -1];
    xul = M \ rhs;
    x_ = xul(1:n);
    if (length(J) > 0), u_ = xul(n+1:end-1); else, u_ = []; end;
    lambda_ = xul(end);
    I = find(u_ < 0);
    alpha_hat = Inf; i = [];                        % min over the empty set
    if (length(I) > 0)
      [alpha_hat, i] = min(-u(I) ./ u_(I));
      i = I(i);                                     % position within J
    end;
    if (alpha_ <= alpha_hat)                        % Case 1a
      x = x + alpha_*x_; u = u + alpha_*u_;
      lambda = lambda + alpha_*lambda_;
      u = [u; lambda]; lambda = 0; J = J_;
      [alpha_, p] = max(A'*x - b);
    else                                            % Case 1b
      x = x + alpha_hat*x_; u = u + alpha_hat*u_;
      lambda = lambda + alpha_hat*lambda_;
      I = setdiff(1:length(J), i);
      u = u(I); J = J(I);                           % keep u and J aligned
      alpha_ = alpha_ - alpha_hat;
    end;

  else                                              % Case 2
    v = AJ \ ap; I = find(v > 0);
    if (length(I) == 0)                             % Case 2a
      error('There exists no feasible solution.')
    else                                            % Case 2b
      [lambda, i] = min(u(I) ./ v(I));
      i = I(i);                                     % position within J
      I = setdiff(1:length(J), i);
      u = u(I) - lambda*v(I); J = J(I);
    end;
  end; % end if
end; % end while
4.3 Projection Methods

We will now consider linearly constrained optimization problems with more general objective functions than linear or quadratic ones. Let $x^{(0)}$ be a feasible, but not yet optimal point. Starting there, we try to move along the negative gradient of the objective function to get another feasible point with a smaller function value (in the case of minimization). If $x^{(0)}$ is an interior point of the feasible set, this is always possible. However, if $x^{(0)}$ is a boundary point, we may leave the feasible region by doing so. The earliest proposals for feasible direction methods go back to Zoutendijk and Rosen (1960).

Let $n \in \mathbb{N}$, $m, p \in \mathbb{N}_0$, $D$ an open subset of $\mathbb{R}^n$, $f\colon D \to \mathbb{R}$ a continuously differentiable function and $a_\mu \in \mathbb{R}^n$, $b_\mu \in \mathbb{R}$ for $\mu \in I \cup E$ with $I := \{1, \ldots, m\}$, $E := \{m+1, \ldots, m+p\}$. Then we consider the problem
$$(PL)\qquad \begin{cases} f(x) \to \min \\ a_i^T x \le b_i \ \text{ for all } i \in I \\ a_j^T x = b_j \ \text{ for all } j \in E \end{cases} \qquad (6)$$
with the feasible region $F := \{x \in D \mid a_i^T x \le b_i \ (i \in I),\ a_j^T x = b_j \ (j \in E)\}$. If we solely minimize with respect to linear equality constraints, that is, we have
$$(PLE)\qquad \begin{cases} f(x) \to \min \\ a_j^T x = b_j \ \text{ for all } j \in E , \end{cases} \qquad (7)$$
then, if we utilize the elimination process for the special case (QPE) on p. 175 ff (under the assumption: $a_{m+1}, \ldots, a_{m+p}$ are linearly independent), we can replace this problem by an unconstrained optimization problem.

For the general case we remember the already well-known concepts: For $x^{(0)} \in D$, a $d \in \mathbb{R}^n$ with $f'(x^{(0)})\, d < 0$ is called a descent direction at $x^{(0)}$. Then, for all $\tau > 0$ sufficiently small, it holds that $f(x^{(0)} + \tau d) < f(x^{(0)})$ (cf. p. 45). A vector $d \ne 0$ is called a feasible direction at $x^{(0)} \in F$ if $x^{(0)} + \tau d \in F$ holds for all $\tau > 0$ sufficiently small. Using the index set $\mathcal{A}(x^{(0)}) := \{i \in I \mid a_i^T x^{(0)} = b_i\}$ of the active inequality constraints at $x^{(0)}$, this obviously means:

$d$ is a feasible direction at $x^{(0)} \in F$ if and only if $a_i^T d \le 0$ for all $i \in \mathcal{A}(x^{(0)})$ and $a_j^T d = 0$ for all $j \in E$ hold. A descent direction at $x^{(0)} \in F$ which is at the same time a feasible direction is called a feasible descent direction at $x^{(0)}$.

We go back to the active set method (cf. the special case on p. 170 ff):

Algorithm (model algorithm)

0) Let $x^{(0)} \in F$, $J_0 := \mathcal{A}(x^{(0)}) \cup E$ and $(a_\mu \mid \mu \in J_0)$ be linearly independent. If for a $k \in \mathbb{N}_0$ the points $x^{(0)}, \ldots, x^{(k)} \in F$ and the corresponding index sets $J_0, \ldots, J_k$ with $(a_\mu \mid \mu \in J_k)$ linearly independent are already determined, then we obtain $x^{(k+1)}$ and $J_{k+1}$ as follows:

1) Determine a feasible descent direction $d_k$ at $x^{(k)}$ if one exists. Otherwise go to 4).

2) Choose an $r \notin J_k$ with $a_r^T d_k > 0$ and
$$\alpha_k := \begin{cases} \min\left\{ \dfrac{b_i - a_i^T x^{(k)}}{a_i^T d_k} \;\middle|\; i \notin J_k,\ a_i^T d_k > 0 \right\} = \dfrac{b_r - a_r^T x^{(k)}}{a_r^T d_k} \\[2ex] \infty , \ \text{ if } a_i^T d_k \le 0 \text{ for all } i \notin J_k . \end{cases}$$
Determine $\lambda_k \in [0, \alpha_k]$ with $f(x^{(k)} + \lambda_k d_k) = \min\{ f(x^{(k)} + \lambda d_k) \mid 0 \le \lambda \le \alpha_k \}$. That is, $\lambda_k = \infty$, if $\alpha_k = \infty$ and $f$ is unbounded from below with respect to the direction $d_k$. Otherwise: $x^{(k+1)} := x^{(k)} + \lambda_k d_k$

3) If $\lambda_k = \alpha_k < \infty$: $J_{k+1} := J_k \cup \{r\}$
Set $k := k + 1$ and go to 1).
4) Calculate the Lagrange multipliers: $\nabla f(x^{(k)}) + \sum_{i \in J_k} u_i a_i = 0$

5) If $u_i \ge 0$ for all $i \in J_k \cap I$ holds, then $x^{(k)}$ is a KKT point: STOP
Otherwise there exists an $s \in J_k \cap I$ with $u_s < 0$. $J_{k+1} := J_k \setminus \{s\}$, $x^{(k+1)} := x^{(k)}$; set $k := k + 1$ and go to 1).

A feasible starting vector can be obtained, if necessary, via phase I of the simplex algorithm. We will of course try to determine a descent direction $d$ with $f'(x^{(0)})\, d$ minimal. However, this is only possible if $d$ is bounded in some way. Therefore we introduce an additional normalization constraint.

In order to determine locally optimal descent directions, we consider the following method for $x^{(0)} \in F$ and $g_0 := \nabla f(x^{(0)})$:

Zoutendijk's Method (1960)
$$\begin{cases} g_0^T d \to \min \\ a_i^T d \le 0 \ \text{ for } i \in \mathcal{A}(x^{(0)}) \\ a_j^T d = 0 \ \text{ for } j \in E \\ -1 \le d_\nu \le 1 \ (1 \le \nu \le n) \quad \text{(unit ball with respect to the maximum norm)} \end{cases}$$
We can utilize a suitable version of the simplex algorithm to solve this problem. If the minimum of $g_0^T d$ subject to these constraints is nonnegative, it follows that $C_\ell(x^{(0)}) \cap C_{dd}(x^{(0)}) = \emptyset$, that is, $x^{(0)}$ is a KKT point (cf. proposition 2.2.1).

Variants of this method are:
$$\begin{cases} g_0^T d \to \min \\ a_i^T d \le 0 \ \text{ for all } i \in \mathcal{A}(x^{(0)}) \\ a_j^T d = 0 \ \text{ for all } j \in E \\ d^T d \le 1 \quad \text{(unit ball with respect to the euclidean norm)} \end{cases} \qquad (8)$$
$$\begin{cases} \tfrac12 (g_0 + d)^T (g_0 + d) \to \min \\ a_i^T d \le 0 \ \text{ for all } i \in \mathcal{A}(x^{(0)}) \\ a_j^T d = 0 \ \text{ for all } j \in E . \end{cases} \qquad (9)$$

Remark
a) Let $d$ be a minimizer to (9). Then it holds that: If $d = 0$, then $d$ also yields a solution to (8). If $d \ne 0$, then $d / \|d\|_2$ gives a solution to (8).
b) If $d$ is a minimizer to (8), then there exists an $\alpha \ge 0$ such that $\alpha d$ yields a solution to (9).

Before we give the proof, observe: (8) and (9) are convex optimization problems fulfilling (SCQ). Hence, the Karush–Kuhn–Tucker conditions are necessary and sufficient for optimality. These conditions to (8) are:
$$\begin{cases} g_0 + \alpha d + A_1 u + A_2 v = 0 \\ \alpha \ge 0 , \ u \ge 0 \\ A_1^T d \le 0 , \ A_2^T d = 0 , \ d^T d \le 1 \\ \alpha\, (d^T d - 1) = 0 , \ u^T A_1^T d = 0 \end{cases} \qquad (10)$$
with $A_1 := \big(a_i \mid i \in \mathcal{A}(x^{(0)})\big)$, $A_2 := (a_j \mid j \in E)$ and $A_J := (A_1, A_2)$ with $J := \mathcal{A}(x^{(0)}) \cup E$.

The Karush–Kuhn–Tucker conditions to (9) are
$$\begin{cases} g_0 + d + A_1 u + A_2 v = 0 \\ u \ge 0 , \ A_1^T d \le 0 , \ A_2^T d = 0 \\ u^T A_1^T d = 0 . \end{cases} \qquad (11)$$

Proof of the Remark: a) Suppose that $(d, u, v)$ meets (11). If $d = 0$, then $\bar d := 0$, $\alpha := 0$, $\bar u := u$, $\bar v := v$ solve the equations in (10). If $d \ne 0$, then (10) is solved by $\bar d := d / \|d\|_2$, $\alpha := \|d\|_2$, $\bar u := u$, $\bar v := v$.
b) If $(d, \alpha, u, v)$ solves (10), then $(\alpha d, u, v)$ solves (11).

Projected Gradient Method

Rosen (1960) was the first to propose a projection of the gradient onto the boundary and to move along the projected gradient. Our discussion in this subsection will focus on the following equality constrained quadratic problem, which is a modification of problem (9):
$$\begin{cases} \tfrac12 (g_0 + d)^T (g_0 + d) \to \min \\ A_J^T d = 0 \end{cases} \qquad (12)$$
We assume that the column vectors of $A_J$ are linearly independent. Denote by $L$ the subspace $\{d \in \mathbb{R}^n \mid A_J^T d = 0\}$, hence, $L = \operatorname{kernel}(A_J^T)$.
The Karush–Kuhn–Tucker conditions to (12) are
$$0 = g_0 + d + A_J u , \qquad A_J^T d = 0 .$$
$\langle A_J u , d \rangle = \langle u , A_J^T d \rangle = \langle u , 0 \rangle = 0$ for any $d \in L$ shows $A_J u \in L^\perp$ (the orthogonal complement of $L$). $-g_0 = d + A_J u$ with $d \in L$ and $A_J u \in L^\perp$ yields $-A_J^T g_0 = A_J^T d + A_J^T A_J u = A_J^T A_J u$ and thus $u = -(A_J^T A_J)^{-1} A_J^T g_0$; with $P_L := I_n - A_J (A_J^T A_J)^{-1} A_J^T$, the orthogonal projector onto the subspace $L$, we finally get $d = -P_L g_0$.

If $d = -P_L g_0 \ne 0$, then the vector $d$ is a descent direction, because of
$$\langle g_0 , d \rangle = -\langle g_0 , P_L g_0 \rangle = -\langle g_0 , P_L^2 g_0 \rangle = -\langle P_L g_0 , P_L g_0 \rangle = -\langle d , d \rangle < 0 .$$
During the execution of the algorithm it is necessary to compute the projection matrices. Since the changes affect only one constraint at a time, it is important to have a simple way of obtaining the projection matrix from the preceding ones (cf. exercise 14).
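The projector and the resulting direction are straightforward to compute for small problems. The following MATLAB lines are a sketch that forms $P_L$ explicitly from the normal equations, using the data of the text's Example 9 ($J = \{4, 5, 6\}$, $g_0 = (2, 4, 2, -3)^T$); the variable names are illustrative, and for larger problems a QR-based update would be the more careful choice.

```matlab
% Columns of AJ are a_4, a_5, a_6 from Example 9 (active bound x_4 >= 0
% written as -x_4 <= 0, followed by the two equality constraints)
AJ = [ 0 2 1;
       0 1 1;
       0 1 2;
      -1 4 1];
g0 = [2; 4; 2; -3];                     % gradient at x^(0) = (2,2,1,0)'

PL = eye(4) - AJ*((AJ'*AJ) \ AJ');      % orthogonal projector onto kernel(AJ')
d  = -PL*g0;                            % projected negative gradient
u  = -((AJ'*AJ) \ (AJ'*g0));            % multipliers from -g0 = d + AJ*u

fprintf('d = (%g, %g, %g, %g), g0''*d = %g\n', d(1), d(2), d(3), d(4), g0'*d);
```

The identity $\langle g_0, d\rangle = -\langle d, d\rangle$ and the feasibility condition $A_J^T d = 0$ provide quick consistency checks on the computed direction.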
Example 9 (cf. [Lu/Ye], p. 370 f)
$$\begin{cases} x_1^2 + x_2^2 + x_3^2 + x_4^2 - 2x_1 - 3x_4 \longrightarrow \min \\ 2x_1 + x_2 + x_3 + 4x_4 = 7 \\ x_1 + x_2 + 2x_3 + x_4 = 6 \\ x_i \ge 0 \ \text{ for } 1 \le i \le 4 \end{cases}$$
$x^{(0)} := (2, 2, 1, 0)^T$ is feasible with $g_0 = (2, 4, 2, -3)^T$.
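It holds that $\mathcal{A}(x^{(0)}) = \{4\}$, $J = \{4, 5, 6\}$ and
$$A_J^T = \begin{pmatrix} 0 & 0 & 0 & -1 \\ 2 & 1 & 1 & 4 \\ 1 & 1 & 2 & 1 \end{pmatrix} .$$
As on p. 175 f we get the QR-factorization of $A_J$ with
$$Q := \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} := \begin{pmatrix} 0 & 0 & 0 & -1 \\ 2/\sqrt6 & 1/\sqrt6 & 1/\sqrt6 & 0 \\ -4/\sqrt{66} & 1/\sqrt{66} & 7/\sqrt{66} & 0 \\ 1/\sqrt{11} & -3/\sqrt{11} & 1/\sqrt{11} & 0 \end{pmatrix}, \qquad Q A_J = \begin{pmatrix} 1 & -4 & -1 \\ 0 & \sqrt6 & 5/\sqrt6 \\ 0 & 0 & 11/\sqrt{66} \\ 0 & 0 & 0 \end{pmatrix} =: \begin{pmatrix} R \\ 0 \end{pmatrix} ,$$
where $Q_1$ consists of the first three rows of $Q$ and $Q_2$ of the last row. Hence $Q_1 A_J = R$, $Q_2 A_J = 0$ and $A_J = Q_1^T R$.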
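From that we obtain $A_J^T A_J = R^T Q_1 Q_1^T R = R^T R$, $A_J (A_J^T A_J)^{-1} A_J^T = Q_1^T R\,(R^T R)^{-1} R^T Q_1 = Q_1^T Q_1$ and, with $I = Q^T Q = Q_1^T Q_1 + Q_2^T Q_2$,
$$P_L = I - Q_1^T Q_1 = Q_2^T Q_2 = \frac{1}{11}\begin{pmatrix} 1 & -3 & 1 & 0 \\ -3 & 9 & -3 & 0 \\ 1 & -3 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad d = -P_L g_0 = \frac{8}{11}\begin{pmatrix} 1 \\ -3 \\ 1 \\ 0 \end{pmatrix} .$$
What happens if $d = -P_L g_0 = 0$?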
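Due to the KKT conditions, it holds that $0 = g_0 + \sum_{i \in J} u_i a_i$. If $u_i \ge 0$ holds for all $i \in \mathcal{A}(x^{(0)})$, $x^{(0)}$ is a KKT point of (6) and hence a minimizer to the original problem. Otherwise there exists an $s \in \mathcal{A}(x^{(0)})$ with $u_s < 0$. With $\tilde J := J \setminus \{s\}$, $\tilde A_1 := \big(a_i \mid i \in \mathcal{A}(x^{(0)}) \setminus \{s\}\big)$, $A_{\tilde J} := (\tilde A_1, A_2)$ we consider $\tilde L := \{d \in \mathbb{R}^n \mid A_{\tilde J}^T d = 0\}$, $\tilde d := -P_{\tilde L}\, g_0$, $-g_0 = A_J u$ and $-g_0 = A_{\tilde J}\tilde u + \tilde d$ and show $\tilde d \neq 0$:

If $\tilde d = 0$, we would have $A_J u = A_{\tilde J}\tilde u$. Because of $u_s \neq 0$ this would represent $a_s$ as a linear combination of the remaining columns, in contradiction to the assumption.

Unfortunately, the projected gradient method may fail to converge to a KKT point or a minimizer because of the so-called jamming or zigzagging (cf. [Fle], section 11.3). The following example shows this phenomenon. We do not discuss strategies to avoid jamming.

Example 10 (Wolfe, 1972)
$$\begin{cases} f(x) := \tfrac43\,\big(x_1^2 - x_1 x_2 + x_2^2\big)^{3/4} + x_3 \longrightarrow \min \\ x_1, x_2, x_3 \ge 0 \end{cases}$$
The objective function $f : \mathbb{R}^3 \to \mathbb{R}$ is convex and continuously differentiable (cf. exercise 19). The unique minimizer, with value 0, is $(0, 0, 0)^T$.

We start from $x^{(0)} := (a, 0, c)^T$ where $0 < a \le \frac{1}{2\sqrt2}$ and $c > \big(1 + \frac{1}{\sqrt2}\big)\sqrt a$ and are going to prove:
$$x^{(k)} = \begin{cases} (\alpha_k, 0, \gamma_k)^T, & \text{for } k \text{ even} \\ (0, \alpha_k, \gamma_k)^T, & \text{for } k \text{ odd} \end{cases} \qquad \text{with } \alpha_k := \frac{a}{2^k}, \quad \gamma_k := c - \frac12 \sqrt a \sum_{j=0}^{k-1} \Big(\frac{1}{\sqrt2}\Big)^j .$$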
The directions $d_k := -\nabla f(x^{(k)})$ are feasible descent directions.
$$\alpha_k \to 0, \qquad \gamma_k \to c - \frac{\sqrt a}{2} \cdot \frac{1}{1 - \frac{1}{\sqrt2}} = c - \sqrt a \Big(1 + \frac{1}{\sqrt2}\Big) =: \gamma^* > 0$$
$x^{(k)} \to x^* = (0, 0, \gamma^*)^T$ with $f(x^*) > 0$. So we get a sequence of points converging to a point which has nothing to do with the solution to the problem!
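[Figure: the iterates $x^{(0)}, x^{(1)}, \ldots, x^{(6)}$, their limit $x^*$ and the minimizer]

$$\nabla f(x) = \left( \frac{2x_1 - x_2}{\sqrt[4]{x_1^2 - x_1 x_2 + x_2^2}},\ \frac{2x_2 - x_1}{\sqrt[4]{x_1^2 - x_1 x_2 + x_2^2}},\ 1 \right)^T \quad \text{for } (x_1, x_2) \neq (0, 0)$$

$k = 0$: $x^{(0)} = (a, 0, c)^T = (\alpha_0, 0, \gamma_0)^T$, $g_0 = \nabla f(x^{(0)}) = \big(2\sqrt a, -\sqrt a, 1\big)^T$, $\mathcal{A}(x^{(0)}) = \{2\}$; $d_0 = -g_0$ is feasible, as $a_2^T d_0 = -\sqrt a \le 0$.
$$x^{(0)} + \lambda d_0 = \big(a - 2\sqrt a\,\lambda,\ \lambda\sqrt a,\ c - \lambda\big)^T \in F \iff 0 \le \lambda \le \sqrt a/2 \ (< c)$$
So the maximal step size (denoted by $\alpha_0$ in 2) on p. 187) is $\sqrt a/2$. With
$$\varphi(\lambda) := f\big(x^{(0)} + \lambda d_0\big) = \tfrac43\, a^{3/4}\big(7\lambda^2 - 5\sqrt a\,\lambda + a\big)^{3/4} + c - \lambda$$
we are looking for a minimizer $\lambda_0$ of $\varphi$ on $[0, \sqrt a/2]$:
$$\varphi'(\lambda) = a^{3/4}\, \frac{14\lambda - 5\sqrt a}{\sqrt[4]{7\lambda^2 - 5\sqrt a\,\lambda + a}} - 1, \qquad \varphi''(\lambda) = a^{3/4}\, \frac{(14\lambda - 5\sqrt a)^2 + 6a}{4\,\big(7\lambda^2 - 5\sqrt a\,\lambda + a\big)^{5/4}} > 0$$
This shows that $\varphi'$ is strongly isotone, $\varphi'(\lambda) \le \varphi'(\sqrt a/2) = 2\sqrt2\, a - 1 \le 0$ and further that $\varphi$ is antitone. Hence, $\lambda_0 = \sqrt a/2$ yields the minimum. With $x^{(1)} := x^{(0)} + \lambda_0 d_0 = \big(0,\ a/2,\ c - \sqrt a/2\big)^T$ instead of $x^{(0)}$ we get the next step in like manner.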
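The jamming in example 10 can also be observed numerically. The following Python sketch is only an illustration under stated assumptions: the values $a = 0.25$ and $c = 2$ are chosen here (they satisfy $0 < a \le 1/(2\sqrt2)$ and $c > (1 + 1/\sqrt2)\sqrt a$), the function names are made up, and the line search exploits the convexity of $\varphi$: if the slope at the largest feasible step is still nonpositive, that step is the minimizer, otherwise the root of $\varphi'$ is found by bisection.

```python
import math


def f(x):
    q = x[0] ** 2 - x[0] * x[1] + x[1] ** 2
    return 4.0 / 3.0 * q ** 0.75 + x[2]


def grad(x):
    # only valid for (x1, x2) != (0, 0)
    r = (x[0] ** 2 - x[0] * x[1] + x[1] ** 2) ** 0.25
    return [(2 * x[0] - x[1]) / r, (2 * x[1] - x[0]) / r, 1.0]


def slope(x, d):
    """Directional derivative of f at x in direction d."""
    return sum(gi * di for gi, di in zip(grad(x), d))


def step(x):
    """One step along d = -grad f(x) with exact line search on the feasible segment."""
    d = [-gi for gi in grad(x)]
    # ratio test: largest step that keeps x + lam * d >= 0, and the index that hits 0
    lam_max, hit = min((x[i] / -d[i], i) for i in range(3) if d[i] < 0)
    x_end = [xi + lam_max * di for xi, di in zip(x, d)]
    x_end[hit] = 0.0                      # remove rounding noise on the new bound
    if slope(x_end, d) <= 0.0:            # phi is convex and still decreasing
        return x_end
    lo, hi = 0.0, lam_max                 # otherwise: bisection on phi'
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if slope([xi + mid * di for xi, di in zip(x, d)], d) > 0.0:
            hi = mid
        else:
            lo = mid
    return [xi + lo * di for xi, di in zip(x, d)]


a, c = 0.25, 2.0                          # assumed sample values
x = [a, 0.0, c]
for k in range(40):
    x = step(x)

gamma_star = c - math.sqrt(a) * (1.0 + 1.0 / math.sqrt(2.0))
print(x, gamma_star, f(x))
```

After 40 steps the first two components are numerically zero, while the third one stays close to $\gamma^* > 0$, far away from the minimizer $(0, 0, 0)^T$.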
Reduced Gradient Method

The reduced gradient method was developed by Wolfe (1962) to solve nonlinear optimization problems with linear constraints. The method is closely related to the simplex method for linear problems because the variables are split into basic and nonbasic ones. Like gradient projection methods, it can be regarded as a steepest descent method generating feasible descent directions. The dimension of the problem is reduced, at each step, by representing all variables in terms of an independent subset of variables.
We consider the problem
$$(P) \quad \begin{cases} f(x) \longrightarrow \min \\ Ax = b,\ x \ge 0 \end{cases}$$
where $m, n \in \mathbb{N}$ with $m \le n$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $f : \mathbb{R}^n \to \mathbb{R}$ continuously differentiable. So we have linear equality constraints for nonnegative variables. We make the following assumptions:

1) $F := \{x \in \mathbb{R}^n \mid Ax = b,\ x \ge 0\} \neq \emptyset$

2) Problem $(P)$ is nondegenerate in the following sense: For each $x \in F$ the set of restrictions active at $x$, including the equality constraints, is linearly independent.

The basic idea of the reduced gradient method is to consider, at each step, the problem only in terms of independent variables: Let $N := \{1, \ldots, n\}$, $j_1, \ldots, j_m \in N$ and $J := (j_1, \ldots, j_m)$ be a feasible basis to $A$ (cf. p. 153, simplex method). Hence, $A_J$ is nonsingular. With $K = (k_1, \ldots, k_{n-m}) \in N^{n-m}$ such that $S(K) \uplus S(J) = N$ we have
$$A_J x_J + A_K x_K = Ax = b \iff x_J = A_J^{-1} b - A_J^{-1} A_K\, x_K \qquad \text{with } \bar b := A_J^{-1} b, \ \ \bar A_K := A_J^{-1} A_K .$$
$x_J$ is the vector of basic or dependent variables, $x_K$ the vector of nonbasic or independent variables. By assigning values to the nonbasic variables, we get, via $x_J := \bar b - \bar A_K x_K$, a unique solution to $Ax = b$.

With the reduced function $\hat f$ given by
$$\hat f(x_K) := f(x_J, x_K) = f\big(\bar b - \bar A_K x_K,\ x_K\big)$$
we consider the following transformation of $(P)$:
$$(\hat P) \quad \begin{cases} \hat f(x_K) \longrightarrow \min \\ \bar b - \bar A_K x_K \ge 0 \\ x_K \ge 0 \end{cases}$$
The reduced gradient of $f$ is the gradient of $\hat f$: From $x_j = \bar b_j - \sum_{\ell \in K} \bar a_{j\ell}\, x_\ell$ $(j \in J)$ we get by the chain rule
$$\frac{\partial \hat f}{\partial x_k} = \frac{\partial f}{\partial x_k} + \sum_{j \in J} \frac{\partial f}{\partial x_j} \cdot (-\bar a_{jk}) \quad (k \in K), \qquad \text{that is,} \qquad \nabla \hat f(x_K) = \nabla_{x_K} f(x) - \bar A_K^T\, \nabla_{x_J} f(x) .$$
Proposition 4.3.1 To each $x \in F$ there exists a decomposition $S(K) \uplus S(J) = N$ with $x_J > 0$ and $A_J$ regular. (Commonly, we do not have $x_K = 0$.)

Proof: Let $x \in F$, wlog $x_1 = \cdots = x_q = 0$ and $x_j > 0$ for $j = q+1, \ldots, n$ with a suitable $q \in \{0, \ldots, n\}$. Nondegeneracy means
$$\operatorname{rank} \begin{pmatrix} a_{11} & \cdots & a_{1q} & a_{1,q+1} & \cdots & a_{1n} \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{m1} & \cdots & a_{mq} & a_{m,q+1} & \cdots & a_{mn} \\ 1 & & & & & \\ & \ddots & & & 0 & \\ & & 1 & & & \end{pmatrix} = m + q \ (\le n) ,$$
where the first $m$ rows are those of $A$ and the last $q$ rows are $(I_q, 0)$. This is equivalent to
$$\operatorname{rank} \begin{pmatrix} a_{1,q+1} & \cdots & a_{1,n} \\ \vdots & & \vdots \\ a_{m,q+1} & \cdots & a_{m,n} \end{pmatrix} = m$$
and further to the existence of $J \subset \{q+1, \ldots, n\}$ such that $A_J$ is regular.
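If $x^{(0)}$ is feasible to $(P)$ and $d \in \operatorname{kernel}(A)$ with $d_i \ge 0$ if $x_i^{(0)} = 0$ (for $i \in N$), then $x^{(0)} + \varepsilon d \in F$ for all $\varepsilon > 0$ sufficiently small. In other words: $d \in \mathbb{R}^n$ is a feasible direction if and only if $Ad = 0$ and $d_i \ge 0$ if $x_i^{(0)} = 0$ (for $i \in N$).

Proof: $Ad = 0$ gives $A(x^{(0)} + \varepsilon d) = b$. If $x_i^{(0)} > 0$, then we obtain $x_i^{(0)} + \varepsilon d_i > 0$ for all $\varepsilon > 0$ sufficiently small. If $x_i^{(0)} = 0$, then we have $d_i \ge 0$ and thereby $x_i^{(0)} + \varepsilon d_i \ge 0$ for all $\varepsilon > 0$.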
Let now $x^{(0)} \in F$, $J \subset N$ with $x_J^{(0)} > 0$ and $A_J$ regular be given. We determine a feasible descent direction $d$ to $(P)$ at $x^{(0)}$ in the following way: For $k \in K$, we define
$$d_k := \begin{cases} -\dfrac{\partial \hat f}{\partial x_k}\big(x_K^{(0)}\big), & \text{if } \dfrac{\partial \hat f}{\partial x_k}\big(x_K^{(0)}\big) < 0 \ \text{ or } \ x_k^{(0)} > 0 \\[2ex] 0, & \text{if } \dfrac{\partial \hat f}{\partial x_k}\big(x_K^{(0)}\big) \ge 0 \ \text{ and } \ x_k^{(0)} = 0 \end{cases}$$
and set $d_J := -\bar A_K d_K$.

We are usually searching in the direction of the negative gradient. If we were able to make a ‘small’ move here from $x_K^{(0)}$ in the direction of the negative reduced gradient without violating the constraints, we would get a decrease of the value of $\hat f$. Thus, we have to choose $d_K$ in such a way that $\big\langle \nabla \hat f\big(x_K^{(0)}\big), d_K \big\rangle < 0$ and that $d_i \ge 0$ if $x_i^{(0)} = 0$ (for $i \in K$).
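[Figure: in the $(x_1, x_2)$-plane, the reduced gradient $\nabla \hat f\big(x_K^{(0)}\big)$, its negative $-\nabla \hat f\big(x_K^{(0)}\big)$ and the direction $d_K$]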
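Proposition 4.3.2

1) $d_K$ is a feasible direction to $(\hat P)$ at $x_K^{(0)}$ and $d$ is a feasible direction to $(P)$ at $x^{(0)}$.

2) If $d_K \neq 0$, then $d_K$ is a descent direction to $(\hat P)$ at $x_K^{(0)}$ and $d$ is a descent direction to $(P)$ at $x^{(0)}$.

3) $d_K = 0$ if and only if $x_K^{(0)}$ is a KKT point to $(\hat P)$ and further if and only if $x^{(0)}$ is a KKT point to $(P)$.

Proof:

1) We have $Ad = A_J d_J + A_K d_K = 0$ by definition of $d_J$. Since $\bar b - \bar A_K x_K^{(0)} = x_J^{(0)}$ is positive, it will remain positive for ‘small’ changes of $x_K^{(0)}$. Thus, we obtain $x_J^{(0)} + \varepsilon d_J > 0$ and (with the definition of $d_K$) $x_K^{(0)} + \varepsilon d_K \ge 0$ for sufficiently small $\varepsilon > 0$. Altogether we have for such $\varepsilon$: $x^{(0)} + \varepsilon d \in F$ and $x_K^{(0)} + \varepsilon d_K$ is feasible to $(\hat P)$.

2) $$\begin{aligned} \big\langle d, \nabla f(x^{(0)}) \big\rangle &= \big\langle d_J, \nabla_{x_J} f(x^{(0)}) \big\rangle + \big\langle d_K, \nabla_{x_K} f(x^{(0)}) \big\rangle \\ &= \big\langle d_K,\ -\bar A_K^T \nabla_{x_J} f(x^{(0)}) + \nabla_{x_K} f(x^{(0)}) \big\rangle \\ &= \big\langle d_K, \nabla \hat f\big(x_K^{(0)}\big) \big\rangle = -\langle d_K, d_K \rangle < 0 \end{aligned}$$

3) By the definition of $d_K$ we have
$$d_K = 0 \iff \forall\, k \in K: \ \begin{cases} x_k^{(0)} > 0 \implies \dfrac{\partial \hat f}{\partial x_k}\big(x_K^{(0)}\big) = 0 \\[1.5ex] x_k^{(0)} = 0 \implies \dfrac{\partial \hat f}{\partial x_k}\big(x_K^{(0)}\big) \ge 0 . \end{cases}$$
$x_K^{(0)}$ is a KKT point to $(\hat P)$ iff there exist vectors $u, v \ge 0$ such that:
$$x_K^{(0)} \ge 0, \quad \bar b - \bar A_K x_K^{(0)} \ge 0, \quad \nabla \hat f\big(x_K^{(0)}\big) + \bar A_K^T u - v = 0, \quad \big\langle u, \bar A_K x_K^{(0)} - \bar b \big\rangle = 0 \ \text{ and } \ \big\langle v, x_K^{(0)} \big\rangle = 0 .$$
Since $\bar b - \bar A_K x_K^{(0)} = x_J^{(0)} > 0$ and $u \ge 0$, it follows: $\big\langle u, \bar A_K x_K^{(0)} - \bar b \big\rangle = 0$ iff $u = 0$. In this case we have $v = \nabla \hat f\big(x_K^{(0)}\big)$. Thus, these KKT conditions reduce to
$$\nabla \hat f\big(x_K^{(0)}\big) \ge 0 \quad \text{and} \quad \big\langle \nabla \hat f\big(x_K^{(0)}\big), x_K^{(0)} \big\rangle = 0 .$$
Hence: $d_K = 0$ iff $x_K^{(0)}$ is a KKT point to $(\hat P)$.

$x^{(0)}$ is a KKT point to $(P)$ iff there exist vectors $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n_+$ such that $\nabla f(x^{(0)}) + A^T u - v = 0$ and $\langle v, x^{(0)} \rangle = 0$. This is equivalent to $\big\langle v_K, x_K^{(0)} \big\rangle + \big\langle v_J, x_J^{(0)} \big\rangle = 0$ and
$$\nabla_{x_K} f(x^{(0)}) + A_K^T u - v_K = 0, \qquad \nabla_{x_J} f(x^{(0)}) + A_J^T u - v_J = 0 .$$
Since $x_J^{(0)} > 0$ and $v_J \ge 0$, it follows: $\big\langle v_J, x_J^{(0)} \big\rangle = 0$ iff $v_J = 0$. Thus, these KKT conditions reduce to $\big\langle v_K, x_K^{(0)} \big\rangle = 0$ and
$$\nabla_{x_K} f(x^{(0)}) + A_K^T u - v_K = 0, \qquad \nabla_{x_J} f(x^{(0)}) + A_J^T u = 0 .$$
The last equation is uniquely solved by $u = -A_J^{-T} \nabla_{x_J} f(x^{(0)})$. With that the remaining KKT conditions are $\big\langle v_K, x_K^{(0)} \big\rangle = 0$ and
$$\nabla \hat f\big(x_K^{(0)}\big) = \nabla_{x_K} f(x^{(0)}) - A_K^T A_J^{-T} \nabla_{x_J} f(x^{(0)}) = v_K .$$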
Hence: $x^{(0)}$ is a KKT point of $(P)$ iff $x_K^{(0)}$ is a KKT point of $(\hat P)$.

Now we can describe one step of the reduced gradient method as follows:

1) Determine $d_K$ (cf. p. 194).

2) If $d_K = 0$, then STOP: The current point $x^{(0)}$ is a KKT point to $(P)$.

3) $\alpha_1 := \max\{\alpha \ge 0 \mid x_J^{(0)} + \alpha\, d_J \ge 0\} > 0$ (since $x_J^{(0)} > 0$)
   $\alpha_2 := \max\{\alpha \ge 0 \mid x_K^{(0)} + \alpha\, d_K \ge 0\} > 0$ ($x_k^{(0)} = 0 \implies d_k \ge 0$)
   $\bar\alpha := \min\{\alpha_1, \alpha_2\}$
   Search $\alpha_0 \in [0, \bar\alpha]$ with $f(x^{(0)} + \alpha_0 d) = \min\{f(x^{(0)} + \alpha\, d) \mid 0 \le \alpha \le \bar\alpha\}$
   $x^{(1)} := x^{(0)} + \alpha_0 d$

4) If $\alpha_0 < \alpha_1$: $J$ remains unchanged.
   If $\alpha_0 = \alpha_1$: There exists an $s \in J$ with $x_s^{(0)} + \alpha_0 d_s = 0$. The variable $x_s$ leaves the basis.⁸ One of the positive nonbasic variables $x_r$ becomes a basic variable. This is possible because of the assumed nondegeneracy.

⁸ If there is more than one zero-component, each of them will be removed from the basis $J$ via an exchange step.
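Steps 1) to 3) can be sketched in a few lines of Python for the data of example 9 with $x^{(0)} = (2, 2, 1, 0)^T$ and the basis $J = (1, 2)$. This is only an illustration, not an implementation of the full method: the helper `solve` and all variable names are made up here, indices are zero-based, and the exact line search uses the fact that for this particular quadratic objective $f(x^{(0)} + \alpha d) = f(x^{(0)}) + \alpha \langle g, d\rangle + \alpha^2 \langle d, d\rangle$.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gauss-Jordan elimination in exact arithmetic."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


A = [[2, 1, 1, 4], [1, 1, 2, 1]]
x0 = [Fraction(v) for v in (2, 2, 1, 0)]
J, K = [0, 1], [2, 3]                       # zero-based basic / nonbasic indices


def grad_f(x):
    return [2 * x[0] - 2, 2 * x[1], 2 * x[2], 2 * x[3] - 3]


g = grad_f(x0)
A_J = [[row[j] for j in J] for row in A]
# column k of Abar_K = A_J^{-1} A_K
Abar = {k: solve(A_J, [row[k] for row in A]) for k in K}
# reduced gradient: grad_{x_K} f - Abar_K^T grad_{x_J} f
reduced = {k: g[k] - sum(Abar[k][i] * g[j] for i, j in enumerate(J)) for k in K}

# step 1: direction d_K, then d_J := -Abar_K d_K
d = [Fraction(0)] * len(x0)
for k in K:
    if reduced[k] < 0 or x0[k] > 0:
        d[k] = -reduced[k]
for i, j in enumerate(J):
    d[j] = -sum(Abar[k][i] * d[k] for k in K)

# step 3: largest feasible step (None stands for "unbounded"), d != 0 here
ratios = [x0[i] / -d[i] for i in range(len(x0)) if d[i] < 0]
alpha_bar = min(ratios) if ratios else None
alpha_free = -sum(gi * di for gi, di in zip(g, d)) / (2 * sum(di * di for di in d))
alpha0 = alpha_free if alpha_bar is None else min(alpha_free, alpha_bar)
x1 = [xi + alpha0 * di for xi, di in zip(x0, d)]
print(d, alpha_bar, alpha0)
```

With these data the sketch should give $d = (5, -22, 8, 1)^T$, $\bar\alpha = \frac{1}{11}$ and $\alpha_0 = \frac{65}{1148}$, so that the basis remains unchanged.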
Let us illustrate one step of the reduced gradient method with example 9 (cf. p. 190):

Example 11 We have
$$A = \begin{pmatrix} 2 & 1 & 1 & 4 \\ 1 & 1 & 2 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} 7 \\ 6 \end{pmatrix} \quad \text{and} \quad f(x) = x_1^2 + x_2^2 + x_3^2 + x_4^2 - 2x_1 - 3x_4 .$$
$x^{(0)} := (2, 2, 1, 0)^T$ is feasible with $\nabla f(x^{(0)}) = (2, 4, 2, -3)^T$. To $J := (1, 2)$ (for example) we get: $K = (3, 4)$,
$$A_J = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \quad A_K = \begin{pmatrix} 1 & 4 \\ 2 & 1 \end{pmatrix}, \quad A_J^{-1} = \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}, \quad \bar A_K = A_J^{-1} A_K = \begin{pmatrix} -1 & 3 \\ 3 & -2 \end{pmatrix}$$
$$\nabla \hat f\big(x_K^{(0)}\big) = \nabla_{x_K} f(x^{(0)}) - \bar A_K^T \nabla_{x_J} f(x^{(0)}) = \begin{pmatrix} 2 \\ -3 \end{pmatrix} - \begin{pmatrix} -1 & 3 \\ 3 & -2 \end{pmatrix} \begin{pmatrix} 2 \\ 4 \end{pmatrix} = \begin{pmatrix} -8 \\ -1 \end{pmatrix}$$
$$d_K = \begin{pmatrix} 8 \\ 1 \end{pmatrix}, \ \ \alpha_2 = \infty; \qquad d_J = -\bar A_K d_K = \begin{pmatrix} 5 \\ -22 \end{pmatrix}, \ \ \alpha_1 = \frac{1}{11}$$
Minimization of $f(x^{(0)} + \alpha\, d) = 5 - 65\,\alpha + 574\,\alpha^2$ yields $\alpha_0 = 65/1148$. Because of $\alpha_0 < \alpha_1$ the basis $J = (1, 2)$ remains unchanged.

Remark An overall disadvantage of the previous methods is that the convergence rate is at best as good as that of the gradient method. The so-called SQP methods, which are based on the same principles as Newton’s method, have proven more useful. In order to avoid redundancy in our discussion, which is also important for chapter 5, we will for the moment consider general nonlinearly constrained optimization problems in the following subsection. In the case of linearly constrained problems there will be some simplifications which will lead to feasible point methods.

Preview of SQP Methods

The abbreviation “SQP” stands for Sequential Quadratic Programming or Successive Quadratic Programming. SQP methods date back to Wilson (1963) and belong to the most efficient methods for solving nonlinear optimization problems. We consider the following optimization problem
$$(P) \quad \begin{cases} f(x) \longrightarrow \min \\ c_i(x) \le 0, \ i \in \mathcal{I} := \{1, \ldots, m\} \\ c_i(x) = 0, \ i \in \mathcal{E} := \{m+1, \ldots, m+p\} . \end{cases} \tag{13}$$
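For that let $m, p \in \mathbb{N}_0$ (hence, $\mathcal{E} = \emptyset$ or $\mathcal{I} = \emptyset$ are permitted), and suppose that the functions $f, c_1, \ldots, c_{m+p}$ are twice continuously differentiable in an open subset $D$ of $\mathbb{R}^n$. The vectors $v$ from $\mathbb{R}^{m+p}$ are divided into the two components $v_{\mathcal{I}}$ and $v_{\mathcal{E}}$. Corresponding to that, suppose that the function $c$ with the components $c_1, \ldots, c_{m+p}$ is written as $\binom{c_{\mathcal{I}}}{c_{\mathcal{E}}}$. With the notation introduced in chapter 2, it of course holds that $g = c_{\mathcal{I}}$ and $h = c_{\mathcal{E}}$. As usual
$$L(x, \lambda) := f(x) + \lambda^T c(x) = f(x) + \langle \lambda, c(x) \rangle \qquad (\text{for } x \in D \text{ and } \lambda \in \mathbb{R}^{m+p})$$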
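denotes the corresponding Lagrangian. Let the following optimality conditions (cf. theorem 2.3.2) be fulfilled in an $x^* \in D$:

1) There exists a vector $\lambda^* \in \mathbb{R}^{m+p}$ with
$$\nabla_x L(x^*, \lambda^*) = \nabla f(x^*) + \sum_{i=1}^{m+p} \lambda_i^* \nabla c_i(x^*) = 0$$
$$\lambda_i^* \ge 0, \ c_i(x^*) \le 0 \ \text{ and } \ \lambda_i^* c_i(x^*) = 0 \ \text{ for all } i \in \mathcal{I}$$
$$c_i(x^*) = 0 \ \text{ for all } i \in \mathcal{E} .$$

2) For all $s \in C_+(x^*)$, $s \neq 0$, it holds that $s^T \nabla^2_{xx} L(x^*, \lambda^*)\, s > 0$, where
$$C_+(x^*) := \left\{ s \in \mathbb{R}^n \ \middle|\ \begin{array}{l} c_i'(x^*)\, s \le 0 \ \text{ for all } i \in \mathcal{A}(x^*) \setminus \mathcal{A}_+(x^*) \\ c_i'(x^*)\, s = 0 \ \text{ for all } i \in \mathcal{A}_+(x^*) \cup \mathcal{E} \end{array} \right\}$$
with $\mathcal{A}_+(x^*) := \{ i \in \mathcal{A}(x^*) \mid \lambda_i^* > 0 \}$.

Then $x^*$ yields a strict local minimum. The complementarity condition $\lambda_i^* c_i(x^*) = 0$ means $\lambda_i^* = 0$ or (in the nonexclusive sense) $c_i(x^*) = 0$. If exactly one of the $\lambda_i^*$ and $c_i(x^*)$ is zero for all $i \in \mathcal{I}$, then strict complementarity is said to hold.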
In addition we suppose:

3) $\lambda_i^* > 0$ for all $i \in \mathcal{A}(x^*)$ (strict complementarity), and the vectors $\{\nabla c_i(x^*) \mid i \in \mathcal{A}(x^*) \cup \mathcal{E}\}$ are linearly independent.

In this case we have $\mathcal{A}(x^*) = \mathcal{A}_+(x^*)$ and thus
$$C_+(x^*) = \{ s \in \mathbb{R}^n \mid c_i'(x^*)\, s = 0 \ \text{ for all } i \in \mathcal{A}(x^*) \cup \mathcal{E} \} .$$
If we disregard the inequality constraints $\lambda_i^* \ge 0$, $c_i(x^*) \le 0$ for all $i \in \mathcal{I}$ for a moment, the Karush–Kuhn–Tucker conditions lead to the following system of nonlinear equations:
$$\Phi(x, \lambda) := \begin{pmatrix} \nabla_x L(x, \lambda) \\ \lambda_1 c_1(x) \\ \vdots \\ \lambda_m c_m(x) \\ c_{m+1}(x) \\ \vdots \\ c_{m+p}(x) \end{pmatrix} = 0$$
This is a nonlinear system of $n + m + p$ equations with $n + m + p$ variables.
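We examine its Jacobian, respecting $\nabla_x L(x, \lambda) = \nabla f(x) + \sum_{i=1}^{m+p} \lambda_i \nabla c_i(x)$:
$$\Phi'(x, \lambda) = \begin{pmatrix} \nabla^2_{xx} L(x, \lambda) & \nabla c_1(x) & \cdots & \nabla c_m(x) & \nabla c_{m+1}(x) & \cdots & \nabla c_{m+p}(x) \\ \lambda_1 c_1'(x) & c_1(x) & & 0 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots & & \vdots \\ \lambda_m c_m'(x) & 0 & & c_m(x) & 0 & \cdots & 0 \\ c_{m+1}'(x) & 0 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ c_{m+p}'(x) & 0 & \cdots & 0 & 0 & \cdots & 0 \end{pmatrix}$$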
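Under the assumptions made above, we can show that $\Phi'(x^*, \lambda^*)$ is invertible:

Proof: Let $s \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^{m+p}$ with $\Phi'(x^*, \lambda^*) \binom{s}{\lambda} = 0$, that is:
$$\nabla^2_{xx} L(x^*, \lambda^*)\, s + \sum_{i=1}^{m+p} \lambda_i \nabla c_i(x^*) = 0 \quad \text{and}$$
$$\lambda_i^* c_i'(x^*)\, s + \lambda_i c_i(x^*) = 0 \ \text{ for all } i \in \mathcal{I}, \qquad c_i'(x^*)\, s = 0 \ \text{ for all } i \in \mathcal{E} .$$
If $s = 0$, then we get $\sum_{i=1}^{m+p} \lambda_i \nabla c_i(x^*) = 0$ and $\lambda_i c_i(x^*) = 0$ for all $i \in \mathcal{I}$. Hence, because of $\lambda_i = 0$ for all $i \in \mathcal{I} \setminus \mathcal{A}(x^*)$ and the linear independence of $\{\nabla c_i(x^*) \mid i \in \mathcal{A}(x^*) \cup \mathcal{E}\}$, $\lambda_1 = \cdots = \lambda_{m+p} = 0$ holds.

In the remaining case, $s \neq 0$, we have $\lambda_i^* > 0$ and $c_i(x^*) = 0$ for all $i \in \mathcal{A}(x^*)$ and thus $c_i'(x^*)\, s = 0$ for such $i$. Together with $c_i'(x^*)\, s = 0$ for all $i \in \mathcal{E}$ we obtain $s \in C_+(x^*)$. By multiplying the first equation (from the left) by $s^T$ we get:
$$s^T \nabla^2_{xx} L(x^*, \lambda^*)\, s + \sum_{i=1}^{m+p} \lambda_i\, c_i'(x^*)\, s = 0$$
$c_i'(x^*)\, s = 0$ for $i \in \mathcal{A}(x^*) \cup \mathcal{E}$ and $\lambda_i = 0$ for $i \in \mathcal{I} \setminus \mathcal{A}(x^*)$ show $\sum_{i=1}^{m+p} \lambda_i\, c_i'(x^*)\, s = 0$, thus $s^T \nabla^2_{xx} L(x^*, \lambda^*)\, s = 0$ in contradiction to 2).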
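Hence, we can apply Newton’s method to compute $(x^*, \lambda^*)$ iteratively. In general we regard for a given iteration point $(x^{(k)}, \lambda^{(k)})$:
$$\Phi(x, \lambda) \approx \Phi\big(x^{(k)}, \lambda^{(k)}\big) + \Phi'\big(x^{(k)}, \lambda^{(k)}\big) \begin{pmatrix} x - x^{(k)} \\ \lambda - \lambda^{(k)} \end{pmatrix} = 0$$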
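In the foregoing special case this equation means:
$$\begin{aligned} &\nabla_x L\big(x^{(k)}, \lambda^{(k)}\big) + \nabla^2_{xx} L\big(x^{(k)}, \lambda^{(k)}\big)\big(x - x^{(k)}\big) + \sum_{i=1}^{m+p} \big(\lambda_i - \lambda_i^{(k)}\big) \nabla c_i\big(x^{(k)}\big) = 0 \\ &\lambda_i^{(k)} c_i\big(x^{(k)}\big) + \lambda_i^{(k)} c_i'\big(x^{(k)}\big)\big(x - x^{(k)}\big) + \big(\lambda_i - \lambda_i^{(k)}\big)\, c_i\big(x^{(k)}\big) = 0 \ \text{ for all } i \in \mathcal{I} \\ &c_i\big(x^{(k)}\big) + c_i'\big(x^{(k)}\big)\big(x - x^{(k)}\big) = 0 \ \text{ for all } i \in \mathcal{E} . \end{aligned}$$
These conditions are equivalent to:
$$\begin{aligned} &\nabla f\big(x^{(k)}\big) + \nabla^2_{xx} L\big(x^{(k)}, \lambda^{(k)}\big)\big(x - x^{(k)}\big) + \sum_{i=1}^{m+p} \lambda_i \nabla c_i\big(x^{(k)}\big) = 0 \\ &\lambda_i \Big[ c_i\big(x^{(k)}\big) + c_i'\big(x^{(k)}\big)\big(x - x^{(k)}\big) \Big] - \big(\lambda_i - \lambda_i^{(k)}\big)\, c_i'\big(x^{(k)}\big)\big(x - x^{(k)}\big) = 0 \ \ (i \in \mathcal{I}) \\ &c_i\big(x^{(k)}\big) + c_i'\big(x^{(k)}\big)\big(x - x^{(k)}\big) = 0 \ \text{ for all } i \in \mathcal{E} . \end{aligned}$$

Wilson’s Lagrange–Newton Method (1963)

The following method consists of the successive minimization of second-order expansions of the Lagrangian, subject to first-order expansions of the constraints.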
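With the notations
$$d := x - x^{(k)}, \qquad g_k := \nabla f\big(x^{(k)}\big), \qquad A_k^T := c'\big(x^{(k)}\big)$$
$$B_k := \nabla^2 f\big(x^{(k)}\big) + \sum_{i=1}^{m+p} \lambda_i^{(k)} \nabla^2 c_i\big(x^{(k)}\big) = \nabla^2_{xx} L\big(x^{(k)}, \lambda^{(k)}\big)$$
$$c_i^{(k)} := c_i\big(x^{(k)}\big), \qquad \ell_i^{(k)}(d) := c_i^{(k)} + c_i'\big(x^{(k)}\big)\, d$$
the above equations, after adding the conditions $\lambda_i \ge 0$ and $\ell_i^{(k)}(d) \le 0$ and neglecting the quadratic terms $\big(\lambda_i - \lambda_i^{(k)}\big)\, c_i'\big(x^{(k)}\big)\big(x - x^{(k)}\big)$ for all $i \in \mathcal{I}$, can be written as follows:
$$\begin{aligned} &0 = g_k + B_k d + A_k \lambda \\ &\lambda_i \ge 0, \ \ell_i^{(k)}(d) \le 0, \ \lambda_i\, \ell_i^{(k)}(d) = 0 \ \text{ for all } i \in \mathcal{I} \\ &\ell_i^{(k)}(d) = 0 \ \text{ for all } i \in \mathcal{E} . \end{aligned}$$
These can be interpreted as the Karush–Kuhn–Tucker conditions of the following quadratic problem:
$$(QP)_k \quad \begin{cases} \varphi_k(d) := \tfrac12\, d^T B_k d + g_k^T d \longrightarrow \min \\ \ell_i^{(k)}(d) \le 0 \ \text{ for all } i \in \mathcal{I} \\ \ell_i^{(k)}(d) = 0 \ \text{ for all } i \in \mathcal{E} . \end{cases}$$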
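If there are only equality constraints ($\mathcal{I} = \emptyset$), the KKT conditions of $(QP)_k$ form the linear system $B_k d + A_k \lambda = -g_k$, $A_k^T d = -c(x^{(k)})$, which is exactly one Newton step for $\Phi$. The following Python sketch illustrates this on a small test problem chosen here purely for illustration (it does not appear in the text): $f(x) = x_1 + x_2$ subject to $c(x) = x_1^2 + x_2^2 - 2 = 0$, with solution $x^* = (-1, -1)^T$ and $\lambda^* = 1/2$. The starting values and the helper `solve` are assumptions of this sketch as well.

```python
def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    y = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y


def lagrange_newton_step(x, lam):
    """Solve the KKT system of (QP)_k for the test problem and update (x, lambda)."""
    g = [1.0, 1.0]                          # gradient of f(x) = x1 + x2
    a = [2.0 * x[0], 2.0 * x[1]]            # gradient of c(x) = x1^2 + x2^2 - 2
    c = x[0] ** 2 + x[1] ** 2 - 2.0
    b = 2.0 * lam                           # B_k = hess f + lam * hess c = 2 * lam * I
    kkt = [[b, 0.0, a[0]],
           [0.0, b, a[1]],
           [a[0], a[1], 0.0]]
    d1, d2, lam_new = solve(kkt, [-g[0], -g[1], -c])
    return [x[0] + d1, x[1] + d2], lam_new


x, lam = [-1.2, -0.8], 0.6                  # assumed starting point near the solution
for k in range(10):
    x, lam = lagrange_newton_step(x, lam)
print(x, lam)
```

Started sufficiently close to $(x^*, \lambda^*)$, the iterates approach $(-1, -1)^T$ and $\lambda^* = 0.5$ very quickly, in line with the local quadratic convergence of Newton’s method; with inequality constraints a genuine QP solver would be needed instead of the linear solve.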
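For the rest of this section we are only interested in linear constraints, that is, with suitable $a_i \in \mathbb{R}^n$ and $b_i \in \mathbb{R}$ it holds that $c_i(x) = a_i^T x - b_i$ for $x \in \mathbb{R}^n$ and $i = 1, \ldots, m+p$, and consider the problem $(PL)$ (p. 186). Then evidently $B_k = \nabla^2 f\big(x^{(k)}\big)$ and $A_k = A := (a_1, \ldots, a_{m+p})$. If additionally $x^{(k)} \in F$, then problem $(QP)_k$ can be simplified as follows:
$$(QPL)_k \quad \begin{cases} \varphi_k(d) := \tfrac12\, d^T \nabla^2 f\big(x^{(k)}\big)\, d + g_k^T d \longrightarrow \min \\ a_i^T x^{(k)} - b_i + a_i^T d \le 0 \ \text{ for all } i \in \mathcal{I} \\ a_i^T d = 0 \ \text{ for all } i \in \mathcal{E} . \end{cases}$$
The feasible region of $(QPL)_k$ is nonempty, since $d = 0$ is feasible. If $B_k$ is in addition positive definite on $\operatorname{kernel}(A_{\mathcal{E}}^T)$, then the objective function $\varphi_k$ is bounded from below on the feasible region. The Barankin–Dorfman theorem (cf. theorem 4.2.1) then gives the existence of a minimizer $d$ to $(QPL)_k$. Together with the corresponding Lagrange multiplier $\lambda$ we get:
$$\begin{aligned} &0 = g_k + B_k d + A \lambda \\ &\lambda_i \ge 0, \ a_i^T x^{(k)} - b_i + a_i^T d \le 0, \ \lambda_i \big( a_i^T x^{(k)} - b_i + a_i^T d \big) = 0 \ \text{ for all } i \in \mathcal{I} \\ &a_i^T d = 0 \ \text{ for all } i \in \mathcal{E} . \end{aligned}$$
$d$ is then a feasible direction to $F$ in $x^{(k)}$. If $d = 0$, then $x^{(k)}$ is a KKT point to $(PL)$. Otherwise $d$ is a descent direction of $f$ in $x^{(k)}$: By multiplying $g_k = -B_k d - A\lambda$ (from the left) by $d^T$ we get
$$\langle g_k, d \rangle = -\langle d, B_k d \rangle - \langle d, A\lambda \rangle = -\langle d, B_k d \rangle - \sum_{i=1}^{m+p} \lambda_i \langle a_i, d \rangle .$$
$\langle d, B_k d \rangle$ is positive because of our assumptions concerning $B_k$. It furthermore holds that $\langle a_i, d \rangle = 0$ for all $i \in \mathcal{E}$ and $\lambda_i \langle a_i, d \rangle = \lambda_i \big( b_i - a_i^T x^{(k)} \big) \ge 0$ for all $i \in \mathcal{I}$. Hence $\langle g_k, d \rangle < 0$.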
Now a rudimentary algorithm can be specified as follows:

Algorithm (Lagrange–Newton SQP Method)

0) Let $x^{(0)} \in F$ be given. Set $k := 0$.
1) Determine a solution $d_k$ and a corresponding Lagrange multiplier $\lambda^{(k)}$ of $(QPL)_k$.
2) If $d_k = 0$: STOP. $x^{(k)}$ is a KKT point to $(PL)$.
   Otherwise: $x^{(k+1)} := x^{(k)} + d_k$. Set $k := k + 1$ and go to 1).

This method obviously yields a sequence $\big(x^{(k)}\big)$ of feasible points. Similar to Newton’s method, local quadratic convergence can be shown (cf. [Fle], theorem 12.3.1). One can try to extend the local convergence result to a global one:

1) Line search instead of fixed step size 1
The solution to $(QPL)_k$ produces a feasible descent direction $d_k$ which is used to determine $\alpha_k \in [0, 1]$ with $f\big(x^{(k)} + \alpha_k d_k\big) = \min\{ f\big(x^{(k)} + \alpha\, d_k\big) \mid 0 \le \alpha \le 1 \}$. With that, we define $x^{(k+1)} := x^{(k)} + \alpha_k d_k$. If $B_k := \nabla^2 f\big(x^{(k)}\big)$, we get the damped Newton method.

2) Quasi-Newton approximations to $\nabla^2 f\big(x^{(k)}\big)$
Quite a number of SQP methods use BFGS-like update formulae (cf. proposition 3.5.2’). With $q_k := \nabla_x L\big(x^{(k+1)}, \lambda^{(k+1)}\big) - \nabla_x L\big(x^{(k)}, \lambda^{(k)}\big)$ and $p_k := x^{(k+1)} - x^{(k)}$ one sets
$$B_{k+1} := B_k + \frac{q_k q_k^T}{\langle p_k, q_k \rangle} - \frac{B_k p_k p_k^T B_k}{\langle p_k, B_k p_k \rangle} .$$
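The update formula can be checked on a tiny example. The sketch below is only an illustration in Python with exact fractions; the function name `bfgs_update` and the sample vectors $p = (1, 2)^T$, $q = (3, 1)^T$ are chosen here and are not taken from the text. It assumes a symmetric $B_k$ and $\langle p_k, q_k \rangle \neq 0$, and it verifies the secant condition $B_{k+1} p_k = q_k$, which follows directly from the formula.

```python
from fractions import Fraction


def bfgs_update(B, p, q):
    """Return B + q q^T / <p, q> - (B p)(B p)^T / <p, B p> for symmetric B."""
    n = len(p)
    Bp = [sum(B[i][j] * p[j] for j in range(n)) for i in range(n)]
    pq = sum(pi * qi for pi, qi in zip(p, q))       # <p, q>, must be nonzero
    pBp = sum(pi * bi for pi, bi in zip(p, Bp))     # <p, B p>
    return [[B[i][j] + q[i] * q[j] / pq - Bp[i] * Bp[j] / pBp
             for j in range(n)] for i in range(n)]


B0 = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]   # B_0 = I
p = [Fraction(1), Fraction(2)]
q = [Fraction(3), Fraction(1)]
B1 = bfgs_update(B0, p, q)
B1p = [sum(B1[i][j] * p[j] for j in range(2)) for i in range(2)]
print(B1, B1p)
```

In SQP methods the update is usually safeguarded (for instance skipped or damped) when $\langle p_k, q_k \rangle$ is not sufficiently positive; such safeguards are not part of this sketch.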
A very special choice is $B_k := I$. It leads to the projected gradient method from the beginning of this section.

Example 12 (cf. [MOT], p. 5–49 f)
$$\begin{cases} f(x) := -x_1 x_2 x_3 \longrightarrow \min \\ -x_1 - 2x_2 - 2x_3 \le 0 \\ x_1 + 2x_2 + 2x_3 \le 72 \end{cases}$$
This problem has the minimizer $x^* = (24, 12, 12)^T$ with the corresponding Lagrange multiplier $\lambda^* = (0, 144)^T$ and the active set $\mathcal{A}(x^*) = \{2\}$.
To solve it, we utilize the function fmincon from the Matlab® Optimization Toolbox. For the starting point $x^{(0)} := (10, 10, 10)^T$ we get:

     k    x1(k)     x2(k)     x3(k)     f(x(k))     αk       A(x(k))
     0    10.0000   10.0000   10.0000   −1000.00             ∅
     1    33.4444    6.8889    6.8889   −1587.17    0.5      ∅
     2    29.6582   10.5854   10.5854   −3323.25    1.0      {2}
     3    29.3893   10.0580   11.2474   −3324.69    0.0625   {2}
     4    29.2576   10.3573   11.0139   −3337.54    1.0      {2}
     5    28.1660   11.2276   10.6894   −3380.38    1.0      {2}
     6    26.1073   12.0935   10.8528   −3426.55    1.0      {2}
     7    24.2705   12.4212   11.4436   −3449.87    1.0      {2}
     8    23.9256   12.2145   11.8227   −3455.06    1.0      {2}
     9    23.9829   12.0109   11.9976   −3456.00    1.0      {2}
    10    23.9990   12.0001   12.0004   −3456.00    1.0      {2}
    11    24.0000   12.0000   12.0000   −3456.00    1.0      {2}

The following two m-files illustrate the possible procedures:

    f = @(x) -x(1)*x(2)*x(3);
    A = [-1 -2 -2; 1 2 2];
    b = [0; 72];
    x0 = [10,10,10];
    Aeq = []; beq = []; lb = []; ub = [];
    options = ...
        optimset('Display','iter','LargeScale','off','OutputFcn',@outfun);
    [x,fval] = fmincon(f,x0,A,b,Aeq,beq,lb,ub,[],options)
    iteration

The output function outfun.m allows the logging of the course of the iteration and is activated by OutputFcn .

    % cf. [MOT], p. 2-75
    function stop = outfun(x,optimValues,state,varargin)
    stop = false;   % never interrupt the iteration
    persistent history
    switch state
        case 'init'
            history = [];
        case 'iter'
            % Concatenate current point with history.
            % x must be a row vector.
            history = [history; x];
        case 'done'
            % Make the logged iterates available as variable 'iteration'.
            assignin('base','iteration',history);
        otherwise
    end
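The data of example 12 can be checked independently of Matlab. The following short Python sketch (an illustration only, with variable names chosen here) verifies the KKT conditions at $x^* = (24, 12, 12)^T$ with $\lambda^* = (0, 144)^T$ and evaluates the objective function there.

```python
def f(x):
    return -x[0] * x[1] * x[2]


def grad_f(x):
    return [-x[1] * x[2], -x[0] * x[2], -x[0] * x[1]]


a1, a2 = [-1, -2, -2], [1, 2, 2]     # constraint normals: a_i^T x <= b_i
b = [0, 72]
x_star, lam = [24, 12, 12], [0, 144]

# stationarity: grad f(x*) + lam_1 a_1 + lam_2 a_2 = 0
residual = [g + lam[0] * u + lam[1] * v
            for g, u, v in zip(grad_f(x_star), a1, a2)]
# constraint values a_i^T x* - b_i (<= 0 means feasible, = 0 means active)
values = [sum(ai * xi for ai, xi in zip(a, x_star)) - bi
          for a, bi in zip((a1, a2), b)]
print(residual, values, f(x_star))
```

The residual vanishes, only the second constraint is active, and the optimal value is $f(x^*) = -24 \cdot 12 \cdot 12 = -3456$.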
Exercises

1. Let the following linear program be given:
$$\begin{cases} c^T x \longrightarrow \min \\ Ax = b, \ \ell \le x \le u \end{cases}$$
In the following we will use the notations from section 4.1 (cf. p. 152 ff).

a) Show that this problem can be solved via a modification of the revised simplex method:

① Determine a feasible basic point $x$ to a basis $J$ with $\ell_j \le x_j \le u_j$ for $j \in J$ and $x_j = \ell_j$ or $x_j = u_j$ for $j \in K$.

② Set $\bar c^T := c_K^T - y^T A_K$ with $y := A_J^{-T} c_J$. If for all $s = k_\sigma \in K$ we have $\bar c_\sigma \ge 0$ if $x_s = \ell_s$ and $\bar c_\sigma \le 0$ if $x_s = u_s$: STOP.

③ Otherwise there exists an index $s = k_\sigma \in K$ such that $x_s = \ell_s$ and $\bar c_\sigma < 0$ (Case A) or $x_s = u_s$ and $\bar c_\sigma > 0$ (Case B).

④ Define $d \in \mathbb{R}^n$ with $d_J := -\bar a_s$, $d_K := e_\sigma$ and $\bar a_s := A_J^{-1} a_s$.

A: Determine the largest possible $\tau \ge 0$ such that
$$\tau \le \frac{x_{j_i} - u_{j_i}}{\bar a_{i,s}} \ \text{ if } \bar a_{i,s} < 0, \qquad \tau \le \frac{x_{j_i} - \ell_{j_i}}{\bar a_{i,s}} \ \text{ if } \bar a_{i,s} > 0 .$$
Assume that the maximum is attained for $i = \varrho$. If $\tau \le u_s - \ell_s$: $x_r$ with $r = j_\varrho$ leaves the basis, $x_s$ enters. If $\tau > u_s - \ell_s$: Set $\tau := u_s - \ell_s$; $J$ remains unchanged.

B: Determine the smallest possible $\tau \le 0$ such that
$$\tau \ge \frac{x_{j_i} - \ell_{j_i}}{\bar a_{i,s}} \ \text{ if } \bar a_{i,s} < 0, \qquad \tau \ge \frac{x_{j_i} - u_{j_i}}{\bar a_{i,s}} \ \text{ if } \bar a_{i,s} > 0 .$$
Assume that the minimum is reached for $i = \varrho$. If $\tau \ge \ell_s - u_s$: $x_r$ with $r = j_\varrho$ leaves the basis, $x_s$ enters. If $\tau < \ell_s - u_s$: Set $\tau := \ell_s - u_s$; $J$ remains unchanged.

⑤ $x := x + \tau d$; update of $J$ as described above; goto ②.
b) Derive the dual optimization problem from the primal problem given above and show how to obtain a solution to the dual problem with the help of the algorithm described in a).

c) Solve the following linear optimization problem
$$\begin{cases} 2x_1 + x_2 + 3x_3 - 2x_4 + 10x_5 \longrightarrow \min \\ x_1 + x_3 - x_4 + 2x_5 = 5 \\ x_2 + 2x_3 + 2x_4 + x_5 = 9 \\ 0 \le x \le (7, 10, 1, 5, 3)^T . \end{cases}$$
Firstly, verify that $J := (1, 2)$, $K := (3, 4, 5)$ and $x_K := (0, 0, 0)^T$ or $x_K := (1, 0, 0)^T$ yield a feasible basic solution.
2. Let $m \in \mathbb{N}$, $A$ an invertible $(m, m)$-matrix, $u, v \in \mathbb{R}^m$ and $\alpha := 1 + v^T A^{-1} u$. Show that $A + u v^T$ is invertible if and only if $\alpha \neq 0$, and that in this case the Sherman–Morrison–Woodbury formula
$$(A + u v^T)^{-1} = A^{-1} - \frac{1}{\alpha}\, A^{-1} u\, v^T A^{-1}$$
holds. Utilize this formula to determine, for $v_1, \ldots, v_m \in \mathbb{R}$ and $\varrho \in \{1, \ldots, m\}$, the inverse of the matrix
$$\begin{pmatrix} 1 & & v_1 & & \\ & \ddots & \vdots & & \\ & & v_\varrho & & \\ & & \vdots & \ddots & \\ & & v_m & & 1 \end{pmatrix}$$
from section 4.1, that is, of the identity matrix whose $\varrho$-th column is replaced by $(v_1, \ldots, v_m)^T$.

3. Diet problems which optimize the composition of feed in stock farming or the ration of components in industrial processes are a good example of the application of linear optimization. Assume that the chef of the cafeteria of a well-known Californian university has only two types of food (to simplify the calculation), concentrated feed Chaπ and fresh vegetables from Salinas Valley. Important ingredients per unit, costs per unit and the students’ daily demand are given in the following table:

                        Carbohydrates   Proteins   Vitamins   Costs
    Chaπ                20              15         5          10 dollars
    Fresh Vegetables    20              3          10         7 dollars
    Daily Demand        60              15         20
The aim is to minimize the total costs while satisfying the daily demands of the students.

a) Visualize the problem and determine the solution with the help of the graph.

b) Formulate the corresponding problem in standard form by introducing three slack variables $x_3, x_4, x_5 \ge 0$ and start the (revised) simplex method with the basis $J := (1, 3, 4)$.

c) This simple example also illustrates the fact that the simplex method not only yields the optimal solution but also offers some insight as to how the solution and the optimal value respond to ‘small’ changes in the data of the problem (sensitivity analysis); see next exercise. Preparatorily: Determine $A_J^{-1}$ for $J := (1, 2, 5)$ and the optimal solution to the dual problem.

4. We continue from the last exercise. In part c)
$$A_J^{-1} = \frac{1}{48} \begin{pmatrix} -0.6 & 4 & 0 \\ 3 & -4 & 0 \\ 27 & -20 & -48 \end{pmatrix} \quad \text{to} \quad A_J = \begin{pmatrix} 20 & 20 & 0 \\ 15 & 3 & 0 \\ 5 & 10 & -1 \end{pmatrix}$$
for $J := (1, 2, 5)$ and the optimizer $u = \frac{1}{48}(15, 12, 0)^T = (0.3125, 0.25, 0)^T$ of the dual problem had been calculated preparatorily. The minimal value of the objective function (costs) was then 22.5 dollars. $x_J = \bar b = (0.5, 2.5, 7.5)^T$ showed $x_1 = 0.5$ and $x_2 = 2.5$.

With the help of the inverse $A_J^{-1}$ or also $\tilde A_J^{-1}$ (cf. exercise 5) and the result of the last step of the simplex algorithm we can gain insight into the effects of ‘small’ changes in the data of the problem on the optimizer and the optimal value. We will illustrate this phenomenon with an example:

a) The chef of the cafeteria receives an offer of a new concentrated feed Chaπplus that costs 12 dollars per unit and contains 30 carbohydrates, 10 proteins and 10 vitamins (in suitable units of measurement). Would the use of this new feed be profitable?

b) How low would the price $p$ (per unit) of the new feed have to be for the use of it to be profitable? Note that although the optimal value changes continuously with $p$, the optimal feeding plan may depend discontinuously on $p$.
c) The shadow prices (optimizers of the dual problem) also give evidence as to how sensitively the optimal solution reacts to changes in the composition of the new feed. Show that an enrichment of the carbohydrates from 30 to 31 units is more profitable than an enrichment of the proteins by one unit from 10 to 11, while the concentration of vitamins is initially irrelevant.

d) Describe, in a general way, the influence of ‘small’ changes of $b$ on a basic solution $x_J$ and thus on the objective function: If $J$ is the basis of the optimal solution, then $u$ returns the sensitivity of the optimal costs with reference to ‘small’ changes of $b$.

e) It is a bit more complex to determine to what degree changes in the data influence the basic matrix. Calculate how much the previous price of 10 dollars for one unit of Chaπ may change without having to change the meal.

Remark: To do that, one can, for example, utilize the Sherman–Morrison–Woodbury formula for the inverse of the basic matrix to
$$\tilde A_J(\varepsilon) := \begin{pmatrix} 20 & 20 & 0 & 0 \\ 15 & 3 & 0 & 0 \\ 5 & 10 & -1 & 0 \\ -(10 + \varepsilon) & -7 & 0 & 1 \end{pmatrix}$$
for the modified price of $10 + \varepsilon$.

5. For manual calculations it is sometimes useful to set
$$\tilde A := \begin{pmatrix} A & 0 \\ -c^T & 1 \end{pmatrix} \in \mathbb{R}^{(m+1) \times (n+1)}, \qquad \tilde b := \begin{pmatrix} b \\ 0 \end{pmatrix} \in \mathbb{R}^{m+1}, \qquad \tilde x := \begin{pmatrix} x \\ z \end{pmatrix} \in \mathbb{R}^{n+1}$$
and with that rewrite the primal problem as follows:
$$\begin{cases} z \longrightarrow \min \\ \tilde A \tilde x = \tilde b, \ x \ge 0 \end{cases}$$
To a basis $J$ let
$$\tilde A_J := \begin{pmatrix} A_J & 0 \\ -c_J^T & 1 \end{pmatrix}$$
be formed accordingly. Verify that
$$\tilde A_J^{-1} = \begin{pmatrix} A_J^{-1} & 0 \\ u^T & 1 \end{pmatrix}$$
with $u := A_J^{-T} c_J$. The components of $u$ are called shadow prices.

6. Implement the active set method as a Matlab® function beginning with the headline

    function [x,u,J,iter] = LP(c,A,b,J0)

to solve the linear optimization problem
$$\begin{cases} c^T x \longrightarrow \min \\ A^T x \ge b \end{cases}$$
as described in section 4.1. Choose the matrices A, b, c and a feasible basis J0 as input parameters. Output parameters are the optimal solution x, the corresponding Lagrange multipliers u, the optimal feasible basis J and the number of iterations iter. Modify this program for the case that J0 is a nonfeasible basis, and write it as a two-phase method. Test these two versions with the examples 3 and 4 of chapter 4.
7. Consider the problem
$$\begin{cases} f(x) := \tfrac12\, x^T C x \longrightarrow \min \\ Ax = b , \end{cases}$$
where
$$C := \begin{pmatrix} 0 & -13 & -6 & -3 \\ -13 & 23 & -9 & 3 \\ -6 & -9 & -12 & 1 \\ -3 & 3 & 1 & -1 \end{pmatrix}, \qquad A := \begin{pmatrix} 2 & 1 & 2 & 1 \\ 1 & 1 & 3 & -1 \end{pmatrix}, \qquad b := (3, 2)^T$$
and $x_0 := (1, 1, 0, 0)^T$. Determine a matrix $Z \in \mathbb{R}^{4 \times 2}$ the columns of which are a basis of the nullspace of $A$. Minimize the reduced objective function $\varphi$ given by $\varphi(v) := f(x_0 + Zv)$ and derive from its solution the solution to the original problem (cf. p. 175 f).

8. Solve the quadratic optimization problem
$$\begin{cases} \tfrac12 x_1^2 + \tfrac12 x_2^2 + 2x_1 + x_2 \longrightarrow \min \\ A^T x \le b \end{cases}$$
by means of the active set method, where
$$A := \begin{pmatrix} -1 & 0 & 1 & -1 & 1 & 0 \\ -1 & 1 & 1 & 1 & 0 & -1 \end{pmatrix} \quad \text{and} \quad b := (0, 2, 5, 2, 5, 1)^T .$$
Choose the feasible point $x^{(0)} := (5, 0)^T$ as a starting vector. In faded and hardly legible documents the following table to this exercise was found:
    k   x(k)           Jk        f(x(k))   y(k)          u                       α
    0   (5, 0)         {3, 5}    22.5      (5, 0) ∈ F    (0, 0, −1, 0, −6, 0)
    1   (5, 0)         {3}       22.5      (2, 3) ∉ F                            2/3
    2   (3, 2)         {2, 3}    14.5
    3   (3, 2)         {2}       14.5
    4   (0, 2)         {2, 4}     4.0
    5   (0, 2)         {4}        4.0
    6   (−1, 1)        {1, 4}     0.0
    7   (−1, 1)        {1}        0.0
    8   (−0.5, 0.5)    {1}       −0.25

Confirm the values and complete the table.

9. Show: For a symmetric, positive definite matrix $C \in \mathbb{R}^{n \times n}$ the quadratic optimization problem
$$\begin{cases} \tfrac12\, x^T C x + c^T x \longrightarrow \min \\ A^T x \le b \end{cases}$$
is equivalent to the constrained least squares problem
$$\begin{cases} \| L x - d \|_2 \longrightarrow \min \\ A^T x \le b \end{cases}$$
with suitable $L \in \mathbb{R}^{n \times n}$ and $d \in \mathbb{R}^n$.

10. Show that for a nonempty, convex and closed set $K \subset \mathbb{R}^n$:

a) To every $x \in \mathbb{R}^n$ there exists exactly one $z =: P_K(x) \in K$ such that $\|x - z\|_2 \le \|x - y\|_2$ for all $y \in K$. $z$ is hence the element in $K$ with minimal distance to $x$ and is called the orthogonal projection of $x$ onto $K$.

b) $z = P_K(x)$ is characterized by the fact that $\langle y - z, x - z \rangle \le 0$ for all $y \in K$.

11. Let the following optimization problem be given:
$$(P) \quad \begin{cases} f(x) \longrightarrow \min \\ A^T x \le b \end{cases}$$
with $f : \mathbb{R}^n \to \mathbb{R}$ continuously differentiable, $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^m$. For $x \in F := \{ y \in \mathbb{R}^n \mid A^T y \le b \}$ the set $J := \mathcal{A}(x)$ denotes the active set at $x$ and $P_{H(x)}$ the projection on
$$H(x) := \{ d \in \mathbb{R}^n \mid A_J^T d \le 0 \}$$
in the sense of the preceding problem. Consider the following algorithm: Choose $x_0 \in F$, $k := 0$ and carry out the following loop: Set $d_k := P_{H(x_k)}\big( -\nabla f(x_k) \big)$.

• If $d_k = 0$: STOP.

• Otherwise: Determine $\bar\lambda_k := \max\{ \lambda > 0 \mid x_k + \lambda d_k \in F \}$ and $x_{k+1} := x_k + \lambda_k d_k$ with $f(x_{k+1}) = \min\{ f(x_k + \lambda d_k) \mid 0 \le \lambda \le \bar\lambda_k \}$.

Set $k := k + 1$ and repeat the above calculation.

Show:

a) $d_k \neq 0$ is a feasible descent direction.

b) $x$ is a KKT point of $(P)$ if and only if $P_{H(x)}\big( -\nabla f(x) \big) = 0$ holds.

Hint to b): Utilize part b) of the preceding exercise and then apply Farkas’ Theorem of the Alternative.
12. Solve the following problem
$$\begin{cases} x_1^2 + 2x_2^2 \longrightarrow \min \\ -x_1 + 4x_2 \le 0 \\ -x_1 - 4x_2 \le 0 \end{cases}$$
by using the projection method from the preceding exercise. Choose $x^{(0)} := (4, 1)^T$ as the starting point.

Hint: It holds that $x^{(k)} = \frac{1}{3^k} \big( 4, (-1)^k \big)^T$.
13. Implement the active set method for quadratic optimization in Matlab® starting with the headline

    function [x,u,J,iter] = QP(C,c,A,b,x0)

to solve the convex problem
$$\begin{cases} \tfrac12\, x^T C x + c^T x \longrightarrow \min \\ A^T x \le b \end{cases}$$
as described in section 4.2. Choose the matrices C, c, A, b and a feasible point x0 as input parameters. Output parameters are the optimizer x, the corresponding Lagrange multipliers u, the optimal feasible basis J and the number of iterations iter. Use the program to solve the problem in exercise 8. Choose the three points $(5, 0)^T$, $(2, -1)^T$ and $(0, 1)^T$ as starting vectors x0.

14. An important subproblem of the implementation of optimization algorithms is the update of the QR-factorization of a matrix $A$ if an extra column has to be attached to it or if one column of it is deleted. Come up with your own Matlab® procedures QRins and QRdel, respectively, for that and carry out the necessary rotations with the help of the Matlab® command givens . Compare your results with those of the standard functions qrinsert and qrdelete . Use this to solve the following problem:

a) Calculate the QR-factorization of the matrix A =
1 42 0 42 1 0 1 12 1 12 4 1
#T .
b) Based on Q and R we want to determine the QR-factorization of the ! which results from the addition of the following two columns matrix A T T u = 00 2 13 1 1 2 , v = 1 0 30 4 12 4 .
15. Use Rosen's projection method and, with the starting value $x^{(0)} := (0, 0)^T$, solve the following optimization problem
$$\begin{aligned} \tfrac12 x_1^2 + \tfrac12 x_2^2 - 4x_1 - 6x_2 &\longrightarrow \min \\ x_1 + x_2 &\le 5 \\ 3x_1 + x_2 &\le 9 \\ x_1, x_2 &\ge 0 \end{aligned}$$
and visualize it.

16. Use the reduced gradient method to solve the following optimization problem with the starting values $x^{(0)} := (0, 0, 2)^T$ and $J_0 := \{3\}$:
$$\begin{aligned} x_1^2 - x_1x_2 + x_2^2 - 3x_1 &\longrightarrow \min \\ x_1 + x_2 + x_3 &= 2 \\ x_1, x_2, x_3 &\ge 0 \end{aligned}$$

17. In the active set method (cf. p. 173 f) assume that for the initial working set $J_0$ the vectors $\{a_j \mid j \in J_0\}$ are linearly independent. Show that a vector $a_r$ which is added to this set of vectors is not linearly dependent on the other vectors, and hence prove inductively that the vectors $\{a_j \mid j \in J_k\}$ are linearly independent for $k = 1, 2, \ldots$.

18. Solve exercise 15 a) with the active set method, b) with the help of the Goldfarb–Idnani method. For that discuss all of the possible solution paths.
c) Finally we want to determine the QR-factorization of the matrix A ! that results from the deletion of the second column of A .
19. Let $\alpha$ be a real number with $1/2 < \alpha < 1$ and $C \in \mathbb{R}^{n\times n}$ a symmetric positive definite matrix. Consider the function $f\colon \mathbb{R}^n \to \mathbb{R}$ given by $f(x) := \langle x, Cx\rangle^{\alpha}$.
a) How does $f'(0)$ need to be defined in order for $f$ to become continuously differentiable?
b) Prove that $f$ is a convex function.

20. Solve the linearly constrained optimization problem (cf. [Col], p. 21 ff)
$$c^T x + x^T C x + d^T x^3 \longrightarrow \min, \qquad A^T x \le b, \quad x \ge 0$$
where the vector $x^3$ arises from $x$ raising each entry to the third power. Take the following test data:
$$C := \begin{pmatrix} 30 & -20 & -10 & 32 & -10 \\ -20 & 39 & -6 & -31 & 32 \\ -10 & -6 & 10 & -6 & -10 \\ 32 & -31 & -6 & 39 & -20 \\ -10 & 32 & -10 & -20 & 30 \end{pmatrix}, \quad c := \begin{pmatrix} -15 \\ -27 \\ -36 \\ -18 \\ -12 \end{pmatrix}, \quad d := \begin{pmatrix} 4 \\ 8 \\ 10 \\ 6 \\ 2 \end{pmatrix},$$
$$A := \begin{pmatrix} 16 & 0 & 3.5 & 0 & 0 & -2 & -1 & -1 & 1 & 1 \\ -2 & 2 & 0 & 2 & 9 & 0 & -1 & -2 & 2 & 1 \\ 0 & 0 & -2 & 0 & 2 & 4 & -1 & -3 & 3 & 1 \\ -1 & -4 & 0 & 4 & -1 & 0 & -1 & -4 & 2 & 1 \\ 0 & -2 & 0 & 1 & 2.8 & 0 & -1 & -5 & 1 & 1 \end{pmatrix},$$
$$b := (40,\, 2,\, 0.25,\, 4,\, 4,\, 1,\, 40,\, 60,\, -5,\, -1)^T.$$
To solve this, utilize (analogously to example 12) the function fmincon of the Matlab Optimization Toolbox and show the course of the iteration using the feasible starting point $x^{(0)} := (0, 0, 0, 0, 1)^T$. Determine, for example by means of the symbolic power of Maple, the exact minimizer of the optimization problem stated above.
5 Nonlinearly Constrained Optimization Problems
We again assume $f, g_1, \ldots, g_m, h_1, \ldots, h_p$ to be continuously differentiable real-valued functions on $\mathbb{R}^n$ with $n \in \mathbb{N}$ and $m, p \in \mathbb{N}_0$, and consider the problem
$$(P)\qquad \begin{cases} f(x) \longrightarrow \min \\ g_i(x) \le 0 \ \text{ for } i \in I := \{1, \ldots, m\} \\ h_j(x) = 0 \ \text{ for } j \in E := \{1, \ldots, p\} \end{cases}$$
or short
$$(P)\qquad f(x) \longrightarrow \min, \quad x \in F$$
with the set of feasible points
$$F := \{ x \in \mathbb{R}^n \mid g_i(x) \le 0 \text{ for } i \in I,\ h_j(x) = 0 \text{ for } j \in E \}.$$
The most difficult case is the general one in which both the objective function $f$ and the constraint functions $g_i$ and $h_j$ are permitted to be nonlinear. We speak of nonlinear optimization problems.

In section 5.1 a constrained optimization problem is replaced by an unconstrained one. There are two different approaches to this: In exterior penalty methods a term is added to the objective function which 'penalizes' a violation of constraints. In interior penalty methods a barrier term prevents points from leaving the interior of the feasible region. As a side product we get the 'starting point' for the development of so-called primal–dual interior-point methods which we will discuss in chapter 6 (linear case) and chapter 7 (semidefinite case).

W. Forst and D. Hoffmann, Optimization—Theory and Practice, Springer Undergraduate Texts in Mathematics and Technology, DOI 10.1007/978-0-387-78977-4_5, © Springer Science+Business Media, LLC 2010
Chapter 5
5.1 Penalty Methods
    Classic Penalty Methods (Exterior Penalty Methods)
    Barrier Methods (Interior Penalty Methods)
5.2 SQP Methods
    Lagrange–Newton Method
    Fletcher's S$\ell_1$QP Method
Exercises
By a sequential quadratic programming method — which we have already treated briefly in section 4.3 — a usually nonlinear optimization problem is converted into a sequence of quadratic optimization problems which are easier to solve. Many authors consider them to be the most important class of methods for solving nonlinear optimization problems without any particular structure. Newton’s method is crucial for a lot of important algorithms.
5.1 Penalty Methods

Classic Penalty Methods (Exterior Penalty Methods)

We already got to know the special case of quadratic penalty functions in section 1.1 (cf. p. 16 ff). This approach is based on an idea by Courant (1943) who transformed a constrained minimization problem into an unconstrained one by 'inserting' the constraints as penalty terms into the objective function. At first, however, this idea was not followed up.

Suppose that the feasible region $F$ is given in the form of $F = C \cap S$ with closed sets $C$ and $S$, where $C$ is given by the 'simple' (for example, convex or linear) constraints and $S$ by the 'difficult' ones. The problem
$$(P)\qquad f(x) \longrightarrow \min, \quad x \in C \cap S$$
can then be transformed into the following equivalent problem
$$f(x) + \delta_S(x) \longrightarrow \min, \quad x \in C$$
with the help of the indicator function
$$\delta_S(x) := \begin{cases} 0, & x \in S \\ \infty, & \text{else.} \end{cases}$$
In this section we will assume that problem $(P)$ is globally solvable. The minimization methods which we are familiar with require the objective function to be sufficiently smooth (and real-valued). Therefore, we will try to approximate the indicator function $\delta_S$ by (real-valued) functions which should be as smooth as possible. For example, if $\pi\colon \mathbb{R}^n \to \mathbb{R}$ is a sufficiently smooth function with $\pi(x) = 0$ for all $x \in S$ and $\pi(x) > 0$ for all $x \notin S$, it holds that
$$\frac{1}{r}\,\pi(x) \longrightarrow \delta_S(x) \quad \text{for } 0 < r \longrightarrow 0.$$
Example
$$S := F = \{ x \in \mathbb{R}^n \mid g_i(x) \le 0 \text{ for } i \in I,\ h_j(x) = 0 \text{ for } j \in E \}$$
$$\pi(x) := \sum_{i=1}^{m} g_i^+(x)^{\alpha} + \sum_{j=1}^{p} |h_j(x)|^{\alpha}$$
Here let $\alpha \ge 1$ (for example, $\alpha = 2$), $a^+ := \max\{0, a\}$ for a real number $a$ and then obviously $g^+(x) := (g(x))^+$ for a real-valued map $g$ and $x$ from the domain of definition of $g$. This is indicated in the following figures for $n = 1$, $m = 2$, $p = 0$, $g_1(x) := -1 - x$, $g_2(x) := x - 1$; $\alpha = 2$ (left); $\alpha = 1$ (right).
[Figure: graphs of the penalty term $\pi$ for $\alpha = 2$ (left) and $\alpha = 1$ (right)]
Hence, with the penalty function $\Phi_r$ defined by
$$\Phi_r(x) := \Phi(x, r) := f(x) + \frac{1}{r}\,\pi(x)$$
for $r > 0$ and $x \in \mathbb{R}^n$ we get the following subsidiary problems
$$(P_r)\qquad \Phi_r(x) \longrightarrow \min, \quad x \in C.$$
Here $r$ is sometimes called the penalty parameter. The violation of the constraints is 'penalized' by increasing 'costs' when $r$ is decreasing. Hence, the penalty function is the sum of the original objective function and a penalty term $\pi(x)/r$.
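As a small illustration of the penalty term from the example above, the following Python sketch evaluates $\pi$ for $n = 1$ with $g_1(x) = -1 - x$ and $g_2(x) = x - 1$ and builds $\Phi_r$ from it. The function names `plus`, `penalty_term` and `penalized` are chosen here for illustration only, and the objective passed to `penalized` is an arbitrary assumption.

```python
def plus(a):
    """Positive part a^+ = max{0, a}."""
    return max(0.0, a)


def penalty_term(x, alpha=2.0):
    """pi(x) = g1^+(x)^alpha + g2^+(x)^alpha for g1(x) = -1 - x, g2(x) = x - 1."""
    g1 = -1.0 - x
    g2 = x - 1.0
    return plus(g1) ** alpha + plus(g2) ** alpha


def penalized(f, x, r, alpha=2.0):
    """Phi_r(x) = f(x) + pi(x) / r for a penalty parameter r > 0."""
    return f(x) + penalty_term(x, alpha) / r


if __name__ == "__main__":
    # pi vanishes on the feasible interval [-1, 1] and grows outside of it.
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, penalty_term(x, 2.0), penalty_term(x, 1.0))
```

For $\alpha = 2$ the term is differentiable at the boundary points $\pm 1$, while for $\alpha = 1$ it has kinks there, which matches the two panels of the figure.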
This approach is based on the intuitive expectation that the minimizer $x(r)$ to $(P_r)$ will converge to the minimizer $\bar x$ to $(P)$ for $r \longrightarrow 0$.

Example 1 (cf. [Spe], p. 394 ff)
$$f(x) := x_1^2 + 4x_1x_2 + 5x_2^2 - 10x_1 - 20x_2 \longrightarrow \min, \qquad h_1(x) := x_1 + x_2 - 2 = 0$$
This problem can of course be solved very easily with basic methods: Substituting $x_2 = 2 - x_1$ in $f$ gives $f(x) = 2x_1^2 - 2x_1 - 20$, hence, via $x_1^2 - x_1 - 10 \longrightarrow \min$ for $x_1 \in \mathbb{R}$, yields $\bar x_1 = 1/2$, $\bar x_2 = 3/2$ and $f(\bar x) = -20.5$. By using the KKT condition $\nabla f(\bar x) + \bar\mu_1 \nabla h_1(\bar x) = 0$ and $h_1(\bar x) = 0$, we also get $\bar x_1 = 1/2$, $\bar x_2 = 3/2$ and $\bar\mu_1 = 3$ very easily. However, here we want to illustrate the penalty method:
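$$\Phi_r(x) := f(x) + \frac{1}{r}\, h_1(x)^2$$
$$\nabla\Phi_r(x) = \nabla f(x) + \frac{2}{r}\, h_1(x)\,\nabla h_1(x) = \begin{pmatrix} 2x_1 + 4x_2 - 10 + \frac{2}{r}(x_1 + x_2 - 2) \\ 4x_1 + 10x_2 - 20 + \frac{2}{r}(x_1 + x_2 - 2) \end{pmatrix}$$
gives
$$\begin{pmatrix} 2 + \frac{2}{r} & 4 + \frac{2}{r} \\ 4 + \frac{2}{r} & 10 + \frac{2}{r} \end{pmatrix} \begin{pmatrix} x_1(r) \\ x_2(r) \end{pmatrix} = \begin{pmatrix} 10 + \frac{4}{r} \\ 20 + \frac{4}{r} \end{pmatrix}$$
for the zero $x(r)$ of $\nabla\Phi_r$, hence,
$$x_1(r) = \frac{20 + \frac{4}{r}}{4 + \frac{8}{r}} = \frac{5r + 1}{r + 2}, \qquad x_2(r) = \frac{\frac{12}{r}}{4 + \frac{8}{r}} = \frac{3}{r + 2}.$$
Thus, we have $x(r) \longrightarrow \bigl(\tfrac12, \tfrac32\bigr)^T = \bar x$ for $r \longrightarrow 0$, and
$$x_1(r) + x_2(r) = \frac{5r + 4}{r + 2} = 2 + \frac{3r}{r + 2} > 2.$$
Furthermore, it holds that $\frac{2}{r}\, h_1(x(r)) = \frac{2}{r} \cdot \frac{3r}{r + 2} \longrightarrow 3 = \bar\mu_1$. The matrix
$$\nabla^2 f(\bar x) + \bar\mu_1 \nabla^2 h_1(\bar x) = \begin{pmatrix} 2 & 4 \\ 4 & 10 \end{pmatrix}$$
is positive definite (note that $\nabla^2 h_1(\bar x) = 0$). The matrix
$$A_r := \nabla^2\Phi_r(x) = \begin{pmatrix} 2 + \frac{2}{r} & 4 + \frac{2}{r} \\ 4 + \frac{2}{r} & 10 + \frac{2}{r} \end{pmatrix}$$
has the condition number
$$\kappa = \frac{17r^2 + 10r + 2 + 2(3r + 1)\sqrt{8r^2 + 4r + 1}}{r(r + 2)} \sim \frac{2}{r},$$
that is, the Hessian is ill-conditioned for small $r$.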
A short reminder: The condition number is — in this case — the quotient of the greatest and the smallest eigenvalue. The eigenvalues can be obtained from λ2 − trace (Ar ) λ + det(Ar ) = 0 .
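The behaviour described in Example 1 can be checked numerically. The following Python sketch solves the 2x2 system for the zero of $\nabla\Phi_r$ by Cramer's rule and computes the multiplier estimate and the condition number from trace and determinant. The function names are illustrative assumptions, not part of the text.

```python
import math


def hessian_entries(r):
    """Entries a11, a12, a22 of the Hessian A_r of Phi_r in Example 1."""
    return 2.0 + 2.0 / r, 4.0 + 2.0 / r, 10.0 + 2.0 / r


def x_of_r(r):
    """Zero x(r) of grad Phi_r, obtained from the 2x2 system by Cramer's rule."""
    a11, a12, a22 = hessian_entries(r)
    b1, b2 = 10.0 + 4.0 / r, 20.0 + 4.0 / r
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


def multiplier_estimate(r):
    """(2/r) * h1(x(r)), which tends to mu_1 = 3 for r -> 0."""
    x1, x2 = x_of_r(r)
    return 2.0 / r * (x1 + x2 - 2.0)


def condition_number(r):
    """Quotient of the largest and smallest eigenvalue of A_r."""
    a11, a12, a22 = hessian_entries(r)
    trace = a11 + a22
    det = a11 * a22 - a12 * a12
    root = math.sqrt(trace * trace - 4.0 * det)
    # lambda_max = (trace + root)/2 and lambda_min = det / lambda_max
    return (trace + root) ** 2 / (4.0 * det)


if __name__ == "__main__":
    for r in (1.0, 1e-1, 1e-2, 1e-4):
        print(r, x_of_r(r), multiplier_estimate(r), condition_number(r))
```

The smallest eigenvalue is computed as `det / lambda_max` rather than as `(trace - root) / 2` in order to avoid cancellation for small $r$.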
We derive a simple convergence property of the general penalty function approach.

Proposition 5.1.1
If there exist $(r_k) \in (0, \infty)^{\mathbb{N}}$ with $r_k \longrightarrow 0$ and $(x^{(k)}) \in C^{\mathbb{N}}$ with $\Phi_{r_k}(x^{(k)}) = \min\{\Phi_{r_k}(x) \mid x \in C\}$ for each $k \in \mathbb{N}$, then each accumulation point $x^*$ of the sequence $(x^{(k)})$ minimizes $(P)$.

Proof: Suppose that $\bar x$ is a (global) minimizer to $(P)$ and $x^*$ an accumulation point of the sequence $(x^{(k)})$. Since $C$ is closed, according to the usual assumptions, $x^*$ belongs to $C$. Let wlog $x^{(k)} \longrightarrow x^*$. Taking into account $\pi(x) = 0$ for $x \in S$,
$$\Phi_{r_k}(x^{(k)}) \le \Phi_{r_k}(\bar x) = f(\bar x)$$
holds by definition of $x^{(k)}$. From that it follows that
$$\limsup_{k\to\infty} \Phi_{r_k}(x^{(k)}) \le f(\bar x).$$
In addition $x^* \in C \cap S$; otherwise the sequence $\bigl(\Phi_{r_k}(x^{(k)})\bigr)$ would be unbounded. Finally,
$$\liminf_{k\to\infty} \Phi_{r_k}(x^{(k)}) \ge f(x^*) \ge f(\bar x)$$
holds since
$$\Phi_{r_k}(x^{(k)}) = f(x^{(k)}) + \frac{1}{r_k}\,\pi(x^{(k)}) \ge f(x^{(k)}) \longrightarrow f(x^*).$$
Altogether, we get $\lim_{k\to\infty} \Phi_{r_k}(x^{(k)}) = f(x^*) = f(\bar x)$; hence, $x^*$ minimizes problem $(P)$.

Properties of Penalty Methods
1) Penalty methods substitute unconstrained optimization problems for constrained ones. The effective use of numerical methods for unconstrained problems requires the penalty function $\Phi_r$ to be sufficiently differentiable.
2) Unlimited reduction of the penalty parameter $r$ increases ill-conditioning which is often inherent in the minimization process of the penalty function. Therefore there should exist an $\bar r > 0$ such that a local minimizer $\bar x$ to $(P)$ minimizes locally each unconstrained problem $(P_r)$ for $r < \bar r$ too. In this case the penalty function is called exact.
In general these two properties are not compatible. Now we are going to construct an exact penalty function to the optimization problem
$$(CP)\qquad f(x) \longrightarrow \min, \qquad g_i(x) \le 0 \quad (i = 1, \ldots, m).$$
We have chosen the name $(CP)$ because in applications of the following considerations the functions $f$ and $g_i$ will be convex in most cases.

Let the Lagrangian $L$ to $(CP)$ as always be defined by
$$L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x)$$
for $x \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}^m_+$. We will use
$$\Phi_r(x) := f(x) + \frac{1}{r} \sum_{i=1}^{m} g_i^+(x)$$
as the penalty function for the moment.

Proposition 5.1.2 (Exactness of the Penalty Function)
Let $(\bar x, \bar\lambda) \in \mathbb{R}^n \times \mathbb{R}^m_+$ be a saddlepoint of $L$, that is,
$$L(\bar x, \lambda) \le L(\bar x, \bar\lambda) \le L(x, \bar\lambda) \quad \text{for all } x \in \mathbb{R}^n \text{ and } \lambda \in \mathbb{R}^m_+.$$
For $r \in (0, \infty)$ it then holds that:
a) $\min_{x \in \mathbb{R}^n} \Phi_r(x) = \Phi_r(\bar x)$ if $r\,\|\bar\lambda\|_\infty \le 1$.
b) If $r\,\|\bar\lambda\|_\infty < 1$ and $\Phi_r(x^*) = \min_{x \in \mathbb{R}^n} \Phi_r(x)$, then $x^*$ also minimizes $(CP)$.
$\Phi_r$ is then called the 'exact penalty function'.

Proof: Following theorem 2.2.6, $\bar x$ yields a minimizer to $(CP)$ and $L(\bar x, \bar\lambda) = f(\bar x)$ holds.
a) It holds that $g_i^+(\bar x) = 0$ and $g_i(x) \le g_i^+(x)$ for all $x \in \mathbb{R}^n$ and $i \in I$. With $r\,\|\bar\lambda\|_\infty \le 1$ it follows that
$$\Phi_r(\bar x) = f(\bar x) = L(\bar x, \bar\lambda) \le L(x, \bar\lambda) = f(x) + \sum_{i=1}^{m} \bar\lambda_i g_i(x) \le f(x) + \sum_{i=1}^{m} \bar\lambda_i g_i^+(x) \le f(x) + \frac{1}{r} \sum_{i=1}^{m} g_i^+(x) = \Phi_r(x).$$
b) Let $r\,\|\bar\lambda\|_\infty < 1$ and $\Phi_r(x^*) = \min_{x \in \mathbb{R}^n} \Phi_r(x)$. If $x^*$ is not a minimizer to $(CP)$, it follows with a) that:
$$f(x^*) \le \Phi_r(x^*) = \min_{x \in \mathbb{R}^n} \Phi_r(x) = \Phi_r(\bar x) = f(\bar x).$$
Consequently, $x^*$ is not feasible, that is, there exists an $i \in I$ with $g_i(x^*) > 0$. The same chain of inequalities as in the proof of a) then yields the following for $x = x^*$:
$$\Phi_r(\bar x) \le f(x^*) + \sum_{i=1}^{m} \bar\lambda_i g_i^+(x^*) < f(x^*) + \frac{1}{r} \sum_{i=1}^{m} g_i^+(x^*) = \Phi_r(x^*),$$
hence, a contradiction.
Remark
Since this $\Phi_r$ is not necessarily differentiable, it is not possible to apply methods for solving unrestricted nonlinear problems directly.

Example 2
$$f(x) := x^2 - 10x \longrightarrow \min, \qquad x \le 1$$
Here we are able to obtain the minimizer $\bar x = 1$ with $f(\bar x) = -9$ immediately (parabola). Typical graphs of $\Phi(\,\cdot\,, r)$, for different values of the penalty parameter $r$, are illustrated in the following figure:

[Figure: graphs of $\Phi(\,\cdot\,, r)$ for several values of $r$; panels $\alpha = 2$ (left) and $\alpha = 1$ (right)]

1) Let $g(x) := x - 1$ for $x \in \mathbb{R}$. For $r > 0$ consider at first
$$\Phi_r(x) := f(x) + \frac{1}{r}\, g^+(x)^2 = (x^2 - 10x) + \frac{1}{r} \bigl(\max(0, x - 1)\bigr)^2.$$
$$\Phi_r'(x) = 2x - 10 + \frac{2}{r} \max(0, x - 1) = \begin{cases} 2x - 10, & x \le 1 \\ 2\bigl(\tfrac{1}{r} + 1\bigr)x - 2\bigl(\tfrac{1}{r} + 5\bigr), & x > 1 \end{cases}$$
$$x(r) = \frac{\frac{1}{r} + 5}{\frac{1}{r} + 1} = 1 + \frac{4r}{1 + r} \longrightarrow 1 = \bar x$$
$$\lambda(r) = \frac{2}{r}\, g(x(r)) = \frac{8}{1 + r} \longrightarrow 8 = \bar\lambda$$
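The closed forms of part 1) are easy to tabulate. The Python sketch below evaluates $x(r)$ and $\lambda(r)$ for the quadratic penalty and, for comparison, the minimizer of the non-smooth variant with exponent $\alpha = 1$. The formula used in `x_exact` is derived here from the piecewise derivative $2x - 10 + 1/r$ for $x > 1$ and is an addition for illustration, not a formula taken from the text.

```python
def x_quadratic(r):
    """Minimizer of x^2 - 10x + max(0, x - 1)^2 / r (stationary point for x > 1)."""
    return (1.0 / r + 5.0) / (1.0 / r + 1.0)


def lam_quadratic(r):
    """Multiplier estimate (2/r) * g(x(r)) with g(x) = x - 1."""
    return 2.0 / r * (x_quadratic(r) - 1.0)


def x_exact(r):
    """Minimizer of x^2 - 10x + max(0, x - 1) / r.

    For x > 1 the derivative 2x - 10 + 1/r vanishes at 5 - 1/(2r); if this
    point does not exceed 1, the minimizer sits at the kink x = 1.
    """
    return max(1.0, 5.0 - 1.0 / (2.0 * r))


if __name__ == "__main__":
    for r in (1.0, 0.25, 0.125, 0.01):
        print(r, x_quadratic(r), lam_quadratic(r), x_exact(r))
```

The quadratic penalty reaches $\bar x = 1$ only in the limit $r \to 0$, whereas the non-smooth penalty returns exactly $1$ as soon as $r \le 1/8$.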
2) For $L(x, \lambda) := f(x) + \lambda g(x)$ ($x \in \mathbb{R}$, $\lambda \in \mathbb{R}_+$) the pair $(\bar x, \bar\lambda) := (1, 8)$ is a saddlepoint of $L$ since
$$L(x, \bar\lambda) = x^2 - 10x + 8(x - 1) = x^2 - 2x - 8 = (x - 1)^2 - 9 \ge -9 = L(\bar x, \bar\lambda) = L(\bar x, \lambda).$$
Hence, we can apply the observations from proposition 5.1.2 to
$$\Phi_r(x) := x^2 - 10x + \frac{1}{r} \max(0, x - 1) = \begin{cases} x^2 - 10x, & x \le 1 \\ x^2 - 10x + \frac{1}{r}(x - 1), & x > 1. \end{cases}$$
According to 5.1.2.a, we know
$$\Phi_r(1) = \min_{x \in \mathbb{R}} \Phi_r(x) \quad \text{if } r \le 1/8,$$
and by 5.1.2.b that the solutions to $\min_{x \in \mathbb{R}} \Phi_r(x)$ for $r < 1/8$ give the solution to $(CP)$.

Barrier Methods (Interior Penalty Methods)

As with the exterior penalty methods, we consider the following minimization problem
$$(P)\qquad f(x) \longrightarrow \min, \quad x \in C \cap S.$$
For that let $C$ and $S$ be closed subsets of $\mathbb{R}^n$; in addition let $C$ be convex. When using barrier methods, we will consider the following kinds of substitute problems for $r > 0$:
$$\Phi_r(x) := \Phi(x, r) := f(x) + r\,B(x) \longrightarrow \min, \quad x \in C \cap \overset{\circ}{S}$$
Here, suppose that the barrier function $B\colon \overset{\circ}{S} \to \mathbb{R}$ is continuous and $B(x) \longrightarrow \infty$ holds for $x \longrightarrow z$ for all $z \in \partial S$. Since the solutions to the substitute problems should always lie in $\overset{\circ}{S}$, we demand
$$\inf_{x \in C \cap \overset{\circ}{S}} f(x) = \min_{x \in C \cap S} f(x).$$
For example, $\overline{C \cap \overset{\circ}{S}} = C \cap S$ is sufficient for that.

Example
Let $S = \{x \in \mathbb{R}^n \mid g_i(x) \le 0 \ (i = 1, \ldots, m)\}$ with continuous and convex functions $g_1, \ldots, g_m$ and $g_i(x_0) < 0$ ($i = 1, \ldots, m$) for an $x_0 \in C$. Important barrier functions are (locally around $x_0$) the inverse barrier function
$$B(x) = -\sum_{i=1}^{m} \frac{1}{g_i(x)} \qquad \text{(Carroll (1961))}$$
and the logarithmic barrier function
$$B(x) = -\sum_{i=1}^{m} \log(-g_i(x)) \qquad \text{(Frisch (1955))}.$$
With the same functions $g_1$ and $g_2$ as in the example on page 215 we get:

[Figure: inverse barrier function (left) and logarithmic barrier function (right) on the interval $(-1, 1)$]
The barrier method algorithm corresponds closely to the algorithms of exterior penalty methods: To a given starting point $x^{(0)} \in \overset{\circ}{S}$ and a null sequence $(r_k)$ with $r_k \downarrow 0$, we solve the following problems for $k \in \mathbb{N}$:
$$(P_k)\qquad \Phi_{r_k}(x) \longrightarrow \min, \quad x \in C \cap \overset{\circ}{S}$$
with the starting value $x^{(k-1)}$. Suppose that we obtain the solution $x^{(k)} := x(r_k)$. Subject to suitable assumptions, we can also show that the accumulation points of the sequence $(x^{(k)})$ yield solutions to $(P)$.

Example 3
$$f(x) = x \longrightarrow \min, \qquad -x + 1 \le 0$$
Obviously this elementary problem has the solution $\bar x = 1$.
$$\Phi_r(x) = x + \frac{r}{x - 1}$$
With $x(r) = 1 + \sqrt{r}$ we obtain an interior point of the feasible region from
$$\Phi_r'(x) = 1 - \frac{r}{(x - 1)^2} = 0.$$
For that it also holds that $\Phi_r''(x(r)) = \dfrac{2r}{(x(r) - 1)^3} = \dfrac{2}{\sqrt{r}}$.
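The following example from [Fi/Co], p. 43 ff illustrates more clearly how the barrier method works.

Example 4
$$f(x) := x_1 + x_2 \longrightarrow \min, \qquad g_1(x) := x_1^2 - x_2 \le 0, \qquad g_2(x) := -x_1 \le 0$$
[Figure: feasible region of Example 4 with the central path]
$$\Phi_r(x) := x_1 + x_2 - r \log(-x_1^2 + x_2) - r \log(x_1)$$
$$\nabla\Phi_r(x) = \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \frac{r}{-x_1^2 + x_2} \begin{pmatrix} 2x_1 \\ -1 \end{pmatrix} + \frac{r}{x_1} \begin{pmatrix} -1 \\ 0 \end{pmatrix}$$
shows:
$$\nabla\Phi_r(x) = 0 \iff \begin{cases} 1 + \dfrac{2r x_1}{-x_1^2 + x_2} - \dfrac{r}{x_1} = 0 & (1) \\[2mm] 1 - \dfrac{r}{-x_1^2 + x_2} = 0 & (2) \end{cases}$$
From (2) we obtain $-x_1^2 + x_2 = r$ or $x_2 = r + x_1^2$. Substitution into (1) gives the equation $1 + 2x_1 - \frac{r}{x_1} = 0$; from there we obtain $x_1(r) := \frac14\bigl(-1 + \sqrt{1 + 8r}\bigr)$. The second solution to the quadratic equation is omitted since it is negative. The trajectory defined by
$$x(r) := \begin{pmatrix} x_1(r) \\ r + x_1(r)^2 \end{pmatrix} \quad (r > 0)$$
is often called the 'central path'. In addition, we note that
$$\lambda_1(r) := \frac{r}{-g_1(x(r))} = 1, \qquad \lambda_2(r) := \frac{r}{-g_2(x(r))} = \frac12\bigl(1 + \sqrt{1 + 8r}\bigr)$$
are approximations of the Lagrange multipliers since
$$\lim_{r \to 0} x(r) = \begin{pmatrix} 0 \\ 0 \end{pmatrix} =: \bar x, \qquad \lim_{r \to 0} \lambda(r) = \begin{pmatrix} 1 \\ 1 \end{pmatrix} =: \bar\lambda$$
yield a KKT pair $(\bar x, \bar\lambda)$, and hence, according to theorem 2.2.8, a solution to our optimization problem.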
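The central path of Example 4 can be followed numerically. The Python sketch below evaluates $x(r)$, the multiplier estimates $\lambda_i(r) = r / (-g_i(x(r)))$ and the gradient of $\Phi_r$ as a consistency check. The function names are illustrative assumptions.

```python
import math


def central_path(r):
    """Point x(r) on the central path of Example 4."""
    x1 = (-1.0 + math.sqrt(1.0 + 8.0 * r)) / 4.0
    return x1, r + x1 * x1


def multipliers(r):
    """Estimates lambda_i(r) = r / (-g_i(x(r))) for g1 = x1^2 - x2, g2 = -x1."""
    x1, x2 = central_path(r)
    return r / (x2 - x1 * x1), r / x1


def grad_phi(x1, x2, r):
    """Gradient of Phi_r(x) = x1 + x2 - r log(x2 - x1^2) - r log(x1)."""
    s = x2 - x1 * x1
    return 1.0 + 2.0 * r * x1 / s - r / x1, 1.0 - r / s


if __name__ == "__main__":
    for r in (1.0, 1e-1, 1e-2, 1e-4):
        x1, x2 = central_path(r)
        print(r, (x1, x2), multipliers(r), grad_phi(x1, x2, r))
```

For decreasing $r$ the printed points approach $(0, 0)$ and the multiplier estimates approach $(1, 1)$, while the products $\lambda_i(r)\cdot(-g_i(x(r)))$ stay equal to $r$.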
In the more general case we will now consider the problem
$$(P)\qquad f(x) \longrightarrow \min, \qquad g_i(x) \le 0 \ \text{ for } i \in I$$
and the corresponding logarithmic penalty function
$$\Phi_r(x) := f(x) - r \sum_{i=1}^{m} \log(-g_i(x)).$$
We are looking for an optimal solution $x(r)$ to a given penalty parameter $r > 0$. For $x = x(r)$ the gradient of $\Phi_r$ is equal to 0, that is, it holds that
$$\nabla_x \Phi_r(x(r)) = \nabla f(x(r)) + \sum_{i=1}^{m} \frac{r}{-g_i(x(r))}\, \nabla g_i(x(r)) = 0. \qquad (3)$$
If we define an estimation of the Lagrange multipliers by
$$\lambda_i(r) := \frac{r}{-g_i(x(r))} \quad \text{for } i \in I,$$
it is possible to write (3) as
$$\nabla f(x(r)) + \sum_{i=1}^{m} \lambda_i(r)\, \nabla g_i(x(r)) = 0.$$
This condition is identical with the gradient condition $\nabla_x L(x, \lambda) = 0$ for the Lagrangian $L$ of problem $(P)$ defined by
$$L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x).$$
Let us also look at the other KKT conditions:
$$g_i(x) \le 0 \ \text{ for } i \in I \qquad (4)$$
$$\lambda_i \ge 0 \ \text{ for } i \in I \qquad (5)$$
$$\lambda_i g_i(x) = 0 \ \text{ for } i \in I. \qquad (6)$$
For $x = x(r)$ and $\lambda = \lambda(r)$ conditions (4) and (5) are evidently satisfied since $g_i(x(r)) < 0$ and $\lambda_i(r) > 0$ hold for all $i \in I$. The complementary slackness condition (6), however, is only met 'approximately' because of $\lambda_i(r)\,\bigl(-g_i(x(r))\bigr) = r$ for all $i \in I$ and $r$ positive. If
$$\lim_{r \to 0} \bigl(x(r), \lambda(r)\bigr) = (\bar x, \bar\lambda)$$
holds, $(\bar x, \bar\lambda)$ is a KKT pair of $(P)$. If we insert a slack variable into the constraints of $(P)$, we get the equivalent problem:
$$(P)\qquad \begin{cases} f(x) \longrightarrow \min \\ g_i(x) + s_i = 0 \ \text{ for } i \in I \\ s_i \ge 0 \ \text{ for } i \in I. \end{cases}$$
We obtain the following 'perturbed' KKT conditions of $(P)$:
$$\begin{aligned} \nabla f(x) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x) &= 0 \\ g(x) + s &= 0 \\ \lambda_i s_i &= r \ \text{ for } i \in I \\ \lambda, s &\ge 0. \end{aligned}$$
This is the starting point for the development of so-called primal–dual interior-point methods which we will discuss later on.
5.2 SQP Methods

As already mentioned in section 4.3, 'SQP' stands for Sequential Quadratic Programming. SQP, however, does not denote a single algorithm but a general method. For the convenience of our readers, we repeat the essentials: Usually a nonlinear optimization problem is converted into a sequence of quadratic optimization problems which are easier to solve. SQP methods go back to R. B. Wilson (1963; PhD Thesis, Harvard University). Many authors consider them to be the most important class of methods for solving nonlinear optimization problems without any particular structure.

Consider the problem:
$$(P)\qquad \begin{cases} f(x) \longrightarrow \min \\ c_i(x) \le 0 \ \text{ for } i \in I := \{1, \ldots, m\} \\ c_i(x) = 0 \ \text{ for } i \in E := \{m + 1, \ldots, m + p\}. \end{cases}$$
For that let $m, p \in \mathbb{N}_0$ (hence, $E = \emptyset$ or $I = \emptyset$ are permitted), and suppose that the functions $f, c_1, \ldots, c_{m+p}$ are twice continuously differentiable in an open subset $D$ of $\mathbb{R}^n$. Hence, the set of feasible points is
$$F := \{ x \in D \mid c_i(x) \le 0 \text{ for } i \in I,\ c_i(x) = 0 \text{ for } i \in E \} = \{ x \in D \mid c_I(x) \le 0,\ c_E(x) = 0 \}.$$
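For the Lagrangian $L$ defined by $L(x, \lambda) := f(x) + \lambda^T c(x)$ for $x \in D$ and $\lambda \in \mathbb{R}^{m+p}$ it holds that
$$\begin{aligned} \nabla_x L(x, \lambda) &= \nabla f(x) + \sum_{i=1}^{m+p} \lambda_i \nabla c_i(x) \\ \nabla_\lambda L(x, \lambda) &= c(x) \\ \nabla^2_{xx} L(x, \lambda) &= \nabla^2 f(x) + \sum_{i=1}^{m+p} \lambda_i \nabla^2 c_i(x) \\ \nabla^2_{x\lambda} L(x, \lambda) &= c'(x)^T. \end{aligned}$$
$(x^*, \lambda^*)$ is a KKT point of $(P)$ if and only if
$$\nabla_x L(x^*, \lambda^*) = 0,$$
$$\lambda_i^* c_i(x^*) = 0, \ \lambda_i^* \ge 0 \text{ and } c_i(x^*) \le 0 \text{ for } i \in I \ \text{ and }\ c_i(x^*) = 0 \text{ for } i \in E$$
hold. The last two lines can be shortened to
$$\langle \lambda_I^*, c_I(x^*) \rangle = 0, \quad \lambda_I^* \ge 0, \quad c_I(x^*) \le 0 \quad \text{and} \quad c_E(x^*) = 0.$$
To determine $(x^*, \lambda^*)$ iteratively, we compute, starting from an 'approximation' $(x^{(k)}, \lambda^{(k)})$, a solution to the system
$$\begin{aligned} 0 &= g_k + B_k (x - x^{(k)}) + A_k \lambda \\ \lambda_i \ge 0, \quad \ell_i^{(k)}(x - x^{(k)}) &\le 0, \quad \lambda_i\, \ell_i^{(k)}(x - x^{(k)}) = 0 \ \text{ for } i \in I \\ \ell_i^{(k)}(x - x^{(k)}) &= 0 \ \text{ for } i \in E \end{aligned}$$
with
$$g_k := \nabla f(x^{(k)}), \quad A_k^T := c'(x^{(k)}), \quad B_k := \nabla^2 f(x^{(k)}) + \sum_{i=1}^{m+p} \lambda_i^{(k)} \nabla^2 c_i(x^{(k)}),$$
$$c_i^{(k)} := c_i(x^{(k)}); \qquad \ell_i^{(k)}(d) := c_i^{(k)} + c_i'(x^{(k)})\, d \ \text{ for } d \in \mathbb{R}^n \text{ and } i \in I \cup E,$$
taking into account the 'linearizations' of $\nabla_x L(x, \lambda)$ and $c_i(x)$ by
$$\begin{aligned} &\nabla_x L(x^{(k)}, \lambda^{(k)}) + \nabla^2_{xx} L(x^{(k)}, \lambda^{(k)})(x - x^{(k)}) + \nabla^2_{x\lambda} L(x^{(k)}, \lambda^{(k)})(\lambda - \lambda^{(k)}) \\ &\quad = g_k + A_k \lambda^{(k)} + B_k (x - x^{(k)}) + A_k (\lambda - \lambda^{(k)}) = g_k + B_k (x - x^{(k)}) + A_k \lambda \end{aligned}$$
and
$$c_i^{(k)} + c_i'(x^{(k)})(x - x^{(k)}) = \ell_i^{(k)}(x - x^{(k)}).$$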
If we set $d := x - x^{(k)}$, these are exactly the KKT conditions of the following quadratic program:
$$(QP)_k\qquad \begin{cases} \varphi_k(d) := \tfrac12\, d^T B_k d + g_k^T d \longrightarrow \min \\ \ell_i^{(k)}(d) \le 0 \ \text{ for } i \in I \\ \ell_i^{(k)}(d) = 0 \ \text{ for } i \in E. \end{cases}$$
If $B_k$ is positive semidefinite, $(QP)_k$ is a convex quadratic problem. Then the KKT conditions are also sufficient for a global minimizer.

These observations lead to the following algorithm (basic version):
0) Given: $x^{(0)}$, $\lambda^{(0)}$; $k := 0$
1) Determine the solution $d$ and the corresponding Lagrange multipliers $\lambda$ to $(QP)_k$.
2) If $d = 0$: STOP
   Otherwise: $x^{(k+1)} := x^{(k)} + d$, $\lambda^{(k+1)} := \lambda$
   With $k := k + 1$ go to 1).
We will of course stop if $\|d\|_2 < \varepsilon$ for a given $\varepsilon > 0$.

Similarly to Newton's method we can prove locally quadratic convergence. One speaks of the Lagrange–Newton Method.

Example 5
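$$f(x) := -x_1 - x_2 \longrightarrow \min, \qquad c_1(x) := x_1^2 - x_2 \le 0, \qquad c_2(x) := x_1^2 + x_2^2 - 1 \le 0$$
[Figure: feasible region of Example 5 with the solution $x^*$]

The figure shows the solution $x^* = \bigl(1/\sqrt{2},\, 1/\sqrt{2}\bigr)^T$ of the minimization problem. The corresponding Lagrange multiplier is $\lambda^* = \bigl(0,\, 1/\sqrt{2}\bigr)^T$.

For $x^{(0)} := (1/2,\, 1)^T$ and $\lambda^{(0)} := (0, 0)^T$ it follows that
$$B_0 = 0, \quad c_1^{(0)} = -3/4, \quad c_2^{(0)} = 1/4, \quad g_0 = \begin{pmatrix} -1 \\ -1 \end{pmatrix}.$$
Hence, in this case $(QP)_0$ is
$$\begin{cases} \varphi_0(d) = -d_1 - d_2 \longrightarrow \min \\ -3/4 + d_1 - d_2 \le 0 \\ 1/4 + d_1 + 2d_2 \le 0. \end{cases}$$
The solution $d = \begin{pmatrix} 5/12 \\ -1/3 \end{pmatrix}$, $\lambda^{(1)} = \begin{pmatrix} 1/3 \\ 2/3 \end{pmatrix}$ leads to
$$x^{(1)} = x^{(0)} + d = \begin{pmatrix} 11/12 \\ 2/3 \end{pmatrix}, \qquad B_1 = \begin{pmatrix} 2 & 0 \\ 0 & 4/3 \end{pmatrix}.$$
Hence, $(QP)_1$ now is
$$\begin{cases} \varphi_1(d) = d_1^2 + \tfrac23 d_2^2 - d_1 - d_2 \longrightarrow \min \\ 25/144 + \tfrac{11}{6} d_1 - d_2 \le 0 \\ 41/144 + \tfrac{11}{6} d_1 + \tfrac43 d_2 \le 0 \quad \text{(active!).} \end{cases}$$
The solution $d = \begin{pmatrix} -0.1695 \\ 0.0196 \end{pmatrix}$, $\lambda^{(2)} = \begin{pmatrix} 0 \\ 0.7304 \end{pmatrix}$ yields $x^{(2)} = \begin{pmatrix} 0.7471 \\ 0.6863 \end{pmatrix}$.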
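The iterates of Example 5 can be reproduced with a few lines of Python. The sketch below implements the basic Lagrange–Newton iteration for this specific problem. The quadratic subproblems are solved by brute-force enumeration of working sets, which is only an assumption made here to keep the example short and self-contained (it is not the active set method of section 4.2); the helper names `solve`, `solve_qp` and `sqp_example5` are hypothetical.

```python
from itertools import combinations


def solve(M, rhs):
    """Gauss-Jordan elimination with partial pivoting; None if (nearly) singular."""
    n = len(rhs)
    a = [list(row) + [b] for row, b in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(a[i][col]))
        if abs(a[piv][col]) < 1e-12:
            return None
        a[col], a[piv] = a[piv], a[col]
        for i in range(n):
            if i != col:
                fac = a[i][col] / a[col][col]
                a[i] = [u - fac * v for u, v in zip(a[i], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def solve_qp(B, g, c, A):
    """min 1/2 d^T B d + g^T d  s.t.  c[i] + A[i].d <= 0, by trying all working sets.

    Returns (d, lam) for the first working set whose KKT point is feasible
    with nonnegative multipliers (sufficient for optimality if B is psd).
    """
    n, m = len(g), len(c)
    for size in range(min(n, m) + 1):
        for W in combinations(range(m), size):
            K = [B[i] + [A[j][i] for j in W] for i in range(n)]
            K += [A[j] + [0.0] * size for j in W]
            sol = solve(K, [-gi for gi in g] + [-c[j] for j in W])
            if sol is None:
                continue
            d, mult = sol[:n], sol[n:]
            lin = [c[i] + sum(A[i][k] * d[k] for k in range(n)) for i in range(m)]
            if all(v <= 1e-10 for v in lin) and all(mu >= -1e-10 for mu in mult):
                lam = [0.0] * m
                for j, mu in zip(W, mult):
                    lam[j] = mu
                return d, lam
    raise ValueError("no KKT point found (QP infeasible or unbounded)")


def sqp_example5(x, lam, iterations):
    """Basic SQP iteration x := x + d, lam := QP multipliers, for Example 5."""
    path = [(list(x), list(lam))]
    for _ in range(iterations):
        x1, x2 = x
        g = [-1.0, -1.0]                                   # gradient of f
        c = [x1 * x1 - x2, x1 * x1 + x2 * x2 - 1.0]        # constraint values
        A = [[2.0 * x1, -1.0], [2.0 * x1, 2.0 * x2]]       # constraint gradients
        B = [[2.0 * (lam[0] + lam[1]), 0.0], [0.0, 2.0 * lam[1]]]  # Hessian of L
        d, lam = solve_qp(B, g, c, A)
        x = [x1 + d[0], x2 + d[1]]
        path.append((x, lam))
    return path


if __name__ == "__main__":
    for x, lam in sqp_example5([0.5, 1.0], [0.0, 0.0], 6):
        print(x, lam)
```

Enumerating working sets scales exponentially with the number of constraints and is therefore only reasonable for tiny problems like this one.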
Possible problems are:

a) The quadratic objective function is unbounded from below:

Example 6
$$f(x) := x_1 + x_2 \longrightarrow \min, \qquad c_1(x) := x_1^2 - x_2 \le 0$$
[Figure: feasible region of Example 6]

In the accompanying figure we observe that the objective function $f$ attains its minimum at $x^* = (-1/2,\, 1/4)^T$. $x^{(0)} := (0, 0)^T$, $\lambda^{(0)} := 0$ yield
$$\varphi_0(d) = d_1 + d_2 \longrightarrow \min, \qquad -d_2 \le 0.$$

b) The constraints of $(QP)_0$ are inconsistent:

Example 7
$$c_1(x) := -x_1 \le 0, \qquad c_2(x) := -x_2 \le 0, \qquad c_3(x) := x_2 + (x_1 - 1)^3 \le 0$$
[Figure: feasible region of Example 7]

The point $(1, 0)^T$ is feasible. For any $f$ and $\lambda^{(0)}$ the point $x^{(0)} := (-1, 0)^T$ leads to the obviously inconsistent system
$$\begin{cases} \ell_1^{(0)}(d) = 1 - d_1 \le 0 \\ \ell_2^{(0)}(d) = -d_2 \le 0 \\ \ell_3^{(0)}(d) = -8 + 12 d_1 + d_2 \le 0. \end{cases}$$
The objectives of the following considerations are to obtain global convergence and avoid the second-order derivatives. This will lead to Quasi-Newton methods.
If the point $x^{(k)}$ is still far away from $x^*$, we have to keep in mind that on the one hand $f$ should be made as small as possible and on the other hand the constraints should be met as exactly as possible. In practice finding feasible points is more important than the minimization of $f$. We will try to strike a happy medium between these two objectives. We will see that the solution $d_k$ to $(QP)_k$ is a descent direction of the following exact¹ penalty function:
$$\Phi_r(x) := f(x) + \sum_{i=1}^{m} \frac{1}{r_i}\, c_i^+(x) + \sum_{i=m+1}^{m+p} \frac{1}{r_i}\, |c_i(x)|$$
Here, $r_i > 0$ for $i \in I \cup E$ are of course given.

¹ We will not prove exactness here.

S.-P. Han's suggestion (1976): Step size control via $\Phi_r(x)$

Han Algorithm
0) Given: $x^{(0)}$ and a positive definite $B_0 \in \mathbb{R}^{n \times n}$; $k := 0$
1) Determine the solution $d_k$ and the corresponding Lagrange multiplier $\lambda^{(k)}$ to $(QP)_k$.
2) Determine an $\alpha_k \in [0, 1]$ with $\Phi_r(x^{(k)} + \alpha_k d_k) = \min\{\Phi_r(x^{(k)} + \alpha d_k) \mid 0 \le \alpha \le 1\}$ and set $p_k := \alpha_k d_k$, $x^{(k+1)} := x^{(k)} + p_k$.
3) Update $B_k$ to $B_{k+1}$ such that the positive definiteness remains. With $k := k + 1$ go to 1).

To 2): $d_k$ is a descent direction of the penalty function $\Phi_r$.
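More precisely: Let $(d_k, \lambda^{(k)})$ be a KKT pair of $(QP)_k$ with $d_k \ne 0$ and $r_i |\lambda_i^{(k)}| \le 1$ for $i \in E \cup I$. Then for the directional derivative of $\Phi_r$ in $x^{(k)}$ in the direction of $d_k$ it holds that
$$\Phi_r'(x^{(k)}; d_k) := \lim_{t \to 0+} \frac{\Phi_r(x^{(k)} + t d_k) - \Phi_r(x^{(k)})}{t} < 0.$$

Proof: At first we will consider the individual terms $c_i^+$ for $i \in I$ and $|c_i(\,\cdot\,)|$ for $i \in E$ and divide the set $I$ into
$$I^+ := \{i \in I \mid c_i^{(k)} > 0\}, \quad I^- := \{i \in I \mid c_i^{(k)} < 0\}, \quad I^0 := \{i \in I \mid c_i^{(k)} = 0\}$$
and do the same for $E$. For $i \in E$, $t > 0$ sufficiently small and then passage to the limit $t \longrightarrow 0$, we have
$$\frac{|c_i(x^{(k)} + t d_k)| - |c_i^{(k)}|}{t} = \begin{cases} \dfrac{c_i(x^{(k)} + t d_k) - c_i^{(k)}}{t} \longrightarrow c_i'(x^{(k)})\, d_k, & i \in E^+ \\[2mm] -\dfrac{c_i(x^{(k)} + t d_k) - c_i^{(k)}}{t} \longrightarrow -c_i'(x^{(k)})\, d_k, & i \in E^- \\[2mm] \dfrac{|c_i(x^{(k)} + t d_k)|}{t} \longrightarrow |c_i'(x^{(k)})\, d_k|, & i \in E^0. \end{cases}$$
For $i \in I$ we get analogously
$$\frac{c_i^+(x^{(k)} + t d_k) - c_i^{(k)+}}{t} = \begin{cases} \dfrac{c_i(x^{(k)} + t d_k) - c_i^{(k)}}{t} \longrightarrow c_i'(x^{(k)})\, d_k, & i \in I^+ \\[2mm] 0, & i \in I^- \\[2mm] \dfrac{c_i^+(x^{(k)} + t d_k)}{t} \longrightarrow \bigl(c_i'(x^{(k)})\, d_k\bigr)^+, & i \in I^0. \end{cases}$$
$(d_k, \lambda^{(k)})$ is a KKT pair of $(QP)_k$ which means the following:
$$0 = g_k + B_k d_k + A_k \lambda^{(k)}$$
and for $i \in I$
$$\lambda_i^{(k)} \ge 0, \quad \ell_i^{(k)}(d_k) = c_i^{(k)} + c_i'(x^{(k)})\, d_k \le 0, \quad \lambda_i^{(k)} \bigl(c_i^{(k)} + c_i'(x^{(k)})\, d_k\bigr) = 0$$
as well as $c_i^{(k)} + c_i'(x^{(k)})\, d_k = 0$ for $i \in E$. These three lines together with the definition of $A_k$ firstly yield
$$g_k^T d_k = -d_k^T B_k d_k - \lambda^{(k)T} A_k^T d_k = -d_k^T B_k d_k - \sum_{i \in I \cup E} \lambda_i^{(k)} c_i'(x^{(k)})\, d_k$$
and
$$-\lambda_i^{(k)} c_i'(x^{(k)})\, d_k = \lambda_i^{(k)} c_i^{(k)} \quad \text{for } i \in I \cup E.$$
For $i \in I^0$ we have $c_i'(x^{(k)})\, d_k \le 0$, hence, $\bigl(c_i'(x^{(k)})\, d_k\bigr)^+ = 0$. For $i \in E^0$ we have $c_i'(x^{(k)})\, d_k = 0$.

With that we get the following for the directional derivative of $\Phi_r$:
$$\begin{aligned} \Phi_r'(x^{(k)}; d_k) &= g_k^T d_k + \sum_{i \in E^+} \frac{1}{r_i}\, c_i'(x^{(k)})\, d_k - \sum_{i \in E^-} \frac{1}{r_i}\, c_i'(x^{(k)})\, d_k + \sum_{i \in I^+} \frac{1}{r_i}\, c_i'(x^{(k)})\, d_k \\ &= \underbrace{-d_k^T B_k d_k}_{<\,0} + \sum_{i \in E^+} \underbrace{\Bigl(\lambda_i^{(k)} - \frac{1}{r_i}\Bigr) c_i^{(k)}}_{\le\, 0} + \sum_{i \in E^-} \underbrace{\Bigl(\lambda_i^{(k)} + \frac{1}{r_i}\Bigr) c_i^{(k)}}_{\le\, 0} \\ &\quad + \sum_{i \in I^+} \Bigl(\lambda_i^{(k)} c_i^{(k)} + \frac{1}{r_i} \underbrace{c_i'(x^{(k)})\, d_k}_{\le\, -c_i^{(k)}}\Bigr) + \sum_{i \in I^-} \underbrace{\lambda_i^{(k)} c_i^{(k)}}_{\le\, 0} \;<\; 0 \end{aligned}$$

Addition: If $d_k = 0$, it holds that $g_k + A_k \lambda^{(k)} = 0$,
$$c_i^{(k)} \le 0, \quad \lambda_i^{(k)} \ge 0, \quad \lambda_i^{(k)} c_i^{(k)} = 0 \ \text{ for } i \in I \quad \text{and} \quad c_i^{(k)} = 0 \ \text{ for } i \in E.$$
Hence, $(x^{(k)}, \lambda^{(k)})$ is a KKT pair of problem $(P)$.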
Instead of 2) we will often choose the more 'tolerant' search strategies (inexact line search instead of exact line search) which have already been discussed in section 3.2, for example, the following Armijo step size rule: Let $\alpha \in (0, 1)$. Starting from $\lambda := 1$, we will halve $\lambda$ until
$$\Phi_r(x^{(k)} + \lambda d_k) \le \Phi_r(x^{(k)}) + \alpha\, \lambda\, \Phi_r'(x^{(k)}; d_k)$$
holds.
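The halving rule can be sketched in a few lines of Python. The example below applies it to the non-smooth penalty function of Example 2 with $r = 0.1$; the starting point $x = 0$, the direction $d = 5$ and the value $\alpha = 0.1$ are arbitrary assumptions made for illustration, and `armijo_step` is a hypothetical helper name.

```python
def armijo_step(phi, slope, alpha=0.1, max_halvings=50):
    """Halve lam, starting from 1, until phi(lam) <= phi(0) + alpha * lam * slope.

    phi(lam) stands for Phi_r(x + lam * d) and slope for the directional
    derivative Phi_r'(x; d), which has to be negative.
    """
    if slope >= 0.0:
        raise ValueError("d is not a descent direction")
    lam, phi0 = 1.0, phi(0.0)
    for _ in range(max_halvings):
        if phi(lam) <= phi0 + alpha * lam * slope:
            return lam
        lam *= 0.5
    raise RuntimeError("no acceptable step size found")


def penalty(x, r=0.1):
    """Phi_r(x) = x^2 - 10x + max(0, x - 1) / r from Example 2."""
    return x * x - 10.0 * x + max(0.0, x - 1.0) / r


x, d = 0.0, 5.0
slope = (2.0 * x - 10.0) * d          # directional derivative at x = 0 < 1
step = armijo_step(lambda lam: penalty(x + lam * d), slope)
print(step, x + step * d)
```

Here the full step $\lambda = 1$ is rejected because it overshoots far into the infeasible region, and the first halving is accepted.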
To 3): With $q_k := \nabla_x L(x^{(k+1)}, \lambda^{(k)}) - \nabla_x L(x^{(k)}, \lambda^{(k)})$ and the assumption $\langle p_k, q_k \rangle > 0$ we consider the BFGS updating formula
$$B_{k+1} := B_k + \frac{q_k q_k^T}{\langle p_k, q_k \rangle} - \frac{B_k p_k p_k^T B_k}{\langle p_k, B_k p_k \rangle},$$
which was discovered independently by Broyden, Fletcher, Goldfarb and Shanno, using different methods (compare section 3.5). Continuing from there, Powell² proposed a number of modifications to make the algorithm more efficient:

² Confer [Pow 2].

To 1): The linear constraints in $(QP)_k$ can be inconsistent even if the nonlinear constraints in $(P)$ are consistent. In this case we introduce a new variable $\xi$ and replace the constraints $\ell_i^{(k)}(d) \le 0$ for $i \in I$ and $\ell_i^{(k)}(d) = 0$ for $i \in E$ by:
$$\left. \begin{aligned} c_i^{(k)} \xi_i + c_i'(x^{(k)})\, d &\le 0 \quad (i \in I) \quad \text{with } \xi_i := \begin{cases} 1, & \text{if } c_i^{(k)} \le 0 \\ \xi, & \text{if } c_i^{(k)} > 0 \end{cases} \\ c_i^{(k)} \xi + c_i'(x^{(k)})\, d &= 0 \quad (i \in E) \\ 0 \le \xi &\le 1. \end{aligned} \right\} \qquad (7)$$
For that we will at first choose $\xi$ as the maximum and then determine $d$ as the minimizer of the quadratic objective function restricted by (7). Hence, we only change the inequality constraints which are not met at the starting point. The factor $\xi$ ensures that at least the pair $(\xi, d) = (0, 0)$ is feasible. If we thus get $\xi = 0$ and $d = 0$, the restrictions in $(P)$ are considered to be inconsistent and the algorithm is terminated.

Example 8
(functions $c_i$ from example 7)
$$c_1(x) := -x_1 \le 0, \qquad c_2(x) := -x_2 \le 0, \qquad c_3(x) := x_2 + (x_1 - 1)^3 \le 0$$
$$x^{(0)} := \begin{pmatrix} -1 \\ 0 \end{pmatrix}, \qquad c_1^{(0)} = 1, \quad c_2^{(0)} = 0, \quad c_3^{(0)} = -8$$
$$\left. \begin{aligned} \xi - d_1 &\le 0 \\ -d_2 &\le 0 \\ -8 + 12 d_1 + d_2 &\le 0 \\ 0 \le \xi &\le 1 \end{aligned} \right\} \implies \xi = \frac23;\quad d = \begin{pmatrix} 2/3 \\ 0 \end{pmatrix} \text{ is feasible.}$$

To 2): If we choose suitable penalty parameters $r_i$, it holds by theorem 14.3.1 in [Fle], p. 380, that $x^*$ yields a solution to $(P)$ if and only if $x^*$ is a minimal point of $\Phi_r$. Powell suggests the following modulation of the penalty parameters:
$$k = 0: \quad \frac{1}{r_i} := |\lambda_i^{(0)}|$$
$$k \ge 1: \quad \frac{1}{r_i} := \max\Bigl\{ |\lambda_i^{(k)}|,\ \frac12 \Bigl( \frac{1}{r_i} + |\lambda_i^{(k)}| \Bigr) \Bigr\}$$
Problem: We are left with only one estimation
$$\Phi_{r^{(k)}}(x^{(k+1)}) < \Phi_{r^{(k)}}(x^{(k)});$$
however, it might happen that $\Phi_{r^{(k+1)}}(x^{(k+1)}) > \Phi_{r^{(k)}}(x^{(k)})$ holds. Then the chain of inequalities does not fit anymore. So how can we obtain the convergence of $x^{(k)} \longrightarrow x^*$? Furthermore:
• If $r^{(k)}$ is too small, the step size $\alpha_k$ can become extremely small since the Hessian becomes ill-conditioned. (In the ideal case we should have $\alpha_k \approx 1$.)
• If $r^{(k)}$ is too big, $d_k$ might possibly not be a descent direction of $\Phi_{r^{(k)}}$ anymore.

We will of course hope for locally superlinear or quadratic convergence of the SQP method if we have chosen $B_k$ suitably. However, due to the so-called Maratos Effect (1978) this need not be the case:

Suppose that $r_i |\lambda_i^{(k)}| \le 1$ ($i = 1, \ldots, m + p$) for all $k \ge 0$. The following might happen:
• If all iteration points $x^{(k)}$ lie close to $x^*$ and if we choose $x^{(k+1)} = x^{(k)} + d_k$, then $(x^{(k)})$ is superlinearly convergent.
• However, if we choose $\tilde x^{(k+1)} = \tilde x^{(k)} + \alpha_k d_k$ with
$$\Phi_r(\tilde x^{(k+1)}) = \min_{0 \le \alpha \le 1} \Phi_r(\tilde x^{(k)} + \alpha d_k),$$
it can happen that $(\tilde x^{(k)})$ is not superlinearly convergent, since the line search of the penalty function prevents the optimal choice of $\alpha_k = 1$ because $\Phi_r(\tilde x^{(k)} + d_k) > \Phi_r(\tilde x^{(k)})$.
Consider the following standard example by Powell (cf. [Pow3]):

Example 9 (cf. exercise 8)
$$f(x) := 2\,(x_1^2 + x_2^2 - 1) - x_1, \qquad h(x) := c_1(x) := x_1^2 + x_2^2 - 1 = 0$$
The feasible region $F$ is exactly the unit circle in $\mathbb{R}^2$ in this case. For $x = (x_1, x_2)^T \in F$ we have $f(x) = -x_1$. With that $x^* := (1, 0)^T$ gives the minimum with value $f(x^*) = -1$. We obtain the Lagrange multiplier $\lambda^* = -3/2$.

With the positive definite matrices
$$B_k := \nabla^2_{xx} L(x^*, \lambda^*) = \begin{pmatrix} 4 & 0 \\ 0 & 4 \end{pmatrix} + \lambda^* \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
the quadratic subproblems are
$$\varphi_k(d) = g_k^T d + \tfrac12\, d^T B_k d = f'(x^{(k)})\, d + \tfrac12 \langle d, d \rangle \longrightarrow \min, \qquad \ell_1^{(k)}(d) = h(x^{(k)}) + h'(x^{(k)})\, d = 0.$$
We will drop the index $k$ in the following considerations; hence, we have to consider the following subproblem
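$$f'(x)\, d + \tfrac12 \langle d, d \rangle \longrightarrow \min, \qquad h(x) + \langle \nabla h(x), d \rangle = 0.$$
We obtain $d$ via the corresponding KKT conditions
$$\nabla f(x) + d + \mu\, \nabla h(x) = 0, \qquad h(x) + \langle \nabla h(x), d \rangle = 0.$$
Due to the special (quadratic) structure, we get the following for $h$ and $f$:
$$h(x + d) = h(x) + h'(x)\, d + \langle d, d \rangle = \langle d, d \rangle$$
$$\begin{aligned} f(x + d) &= f(x) + f'(x)\, d + 2 \langle d, d \rangle = f(x) - \langle d + \mu \nabla h(x), d \rangle + 2 \langle d, d \rangle \\ &= f(x) - \mu \langle \nabla h(x), d \rangle + \langle d, d \rangle = f(x) + \mu\, h(x) + \langle d, d \rangle \end{aligned}$$
If $x$ is now any feasible point,
$$f(x + d) = f(x) + \langle d, d \rangle > f(x), \qquad h(x + d) = \langle d, d \rangle > 0 = h(x)$$
hold for each $d \in \mathbb{R}^2 \setminus \{0\}$. For the penalty function $\Phi_r$ defined by
$$\Phi_r(x) := f(x) + \frac{1}{r}\, |h(x)| \qquad \bigl(x \in \mathbb{R}^2,\ r \in (0, \infty)\bigr)$$
it hence holds that
$$\Phi_r(x + d) = f(x) + \langle d, d \rangle + \frac{1}{r} \langle d, d \rangle = f(x) + \Bigl(1 + \frac{1}{r}\Bigr) \langle d, d \rangle > f(x) = \Phi_r(x).$$
Consequently, the full step size 1 will never be accepted, regardless of how close a feasible $x$ is to $x^*$. This phenomenon can be explained by the discrepancy between the merit function $\Phi_r$ and the local quadratic model used to compute $d$.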
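The effect can be observed numerically. The following Python sketch computes the SQP step $d$ of Example 9 from the KKT conditions above (with $B = I$) for feasible points $x = (\cos t, \sin t)^T$ and compares the penalty function before and after the full step. The parametrization by the angle $t$, the value $r = 1/3$ and the function names are assumptions made for illustration.

```python
import math


def f(x):
    return 2.0 * (x[0] ** 2 + x[1] ** 2 - 1.0) - x[0]


def h(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0


def phi(x, r):
    """Penalty function Phi_r(x) = f(x) + |h(x)| / r."""
    return f(x) + abs(h(x)) / r


def sqp_step(x):
    """Solve grad f + d + mu * grad h = 0, h + <grad h, d> = 0 for (d, mu)."""
    gf = (4.0 * x[0] - 1.0, 4.0 * x[1])
    gh = (2.0 * x[0], 2.0 * x[1])
    # Insert d = -gf - mu * gh into the linearized constraint and solve for mu.
    mu = (h(x) - (gh[0] * gf[0] + gh[1] * gf[1])) / (gh[0] ** 2 + gh[1] ** 2)
    return (-gf[0] - mu * gh[0], -gf[1] - mu * gh[1]), mu


def full_step_increase(t, r=1.0 / 3.0):
    """Phi_r(x + d) - Phi_r(x) for the feasible point x = (cos t, sin t)."""
    x = (math.cos(t), math.sin(t))
    d, _ = sqp_step(x)
    return phi((x[0] + d[0], x[1] + d[1]), r) - phi(x, r)


if __name__ == "__main__":
    for t in (0.5, 0.1, 0.01):
        print(t, full_step_increase(t), (1.0 + 3.0) * math.sin(t) ** 2)
```

Both printed columns agree: the increase equals $(1 + 1/r)\langle d, d\rangle$ with $\langle d, d \rangle = \sin^2 t$, so it is positive however close $x$ is to $x^* = (1, 0)^T$.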
[Figure: feasible region (black) and contours of the penalty function $\Phi_{1/3}(x) = f(x) + 3\,|h(x)|$ (blue)]

An Alternative Approach: Fletcher's S$\ell_1$QP Method (cf. [Fle], p. 312 ff)

Let $r_i |\lambda_i^*| < 1$ ($1 \le i \le m + p$). Fletcher establishes a connection between the trust-region method and the SQP method in the following way. Instead of solving $(QP)_k$, he solves a trust-region problem in which an approximation of the exact $\ell_1$-penalty function occurs:
$$\varphi_k(d) + \sum_{i \in I} \frac{1}{r_i}\, \ell_i^{(k)}(d)^+ + \sum_{i \in E} \frac{1}{r_i}\, |\ell_i^{(k)}(d)| \longrightarrow \min, \qquad \|d\|_\infty \le \Delta_k$$
This problem is equivalent to a quadratic optimization problem with linear constraints:
$$(QP)_k\qquad \begin{cases} \varphi_k(d) + \displaystyle\sum_{i=1}^{m+p} \frac{1}{r_i}\, \eta_i \longrightarrow \min \\ -\Delta_k \le d_\nu \le \Delta_k \quad (1 \le \nu \le n) \\ 0 \le \eta_i, \quad \ell_i^{(k)}(d) \le \eta_i \ \text{ for } i \in I \\ -\eta_i \le \ell_i^{(k)}(d) \le \eta_i \ \text{ for } i \in E. \end{cases}$$
The feasible region of these problems is evidently always nonempty.

Consider
$$\Psi_r(d) := f(x^{(k)}) + \varphi_k(d) + \sum_{i \in I} \frac{1}{r_i}\, \ell_i^{(k)}(d)^+ + \sum_{i \in E} \frac{1}{r_i}\, |\ell_i^{(k)}(d)|$$
and
$$\rho_k := \frac{\Phi_r(x^{(k)} + d_k) - \Phi_r(x^{(k)})}{\Psi_r(d_k) - \Psi_r(0)}.$$
The numerator gives the actual reduction of the penalty function $\Phi_r$, the denominator that of the modeling penalty function $\Psi_r$.
Model Algorithm
1) Let $x^{(k)}$, $\lambda^{(k)}$ and $\Delta_k$ be given. Compute $f\bigl(x^{(k)}\bigr)$, $g_k$, $B_k$, $c^{(k)}$ and $A_k$.
2) The solution to $(QP)_k$ gives the KKT pair $\bigl(d_k, \lambda^{(k+1)}\bigr)$.
3) Compute $\varrho_k$.
4) If $\varrho_k < \tfrac14$: $\Delta_{k+1} := \Delta_k/2$ and $x^{(k+1)} := x^{(k)}$.
   Otherwise: $x^{(k+1)} := x^{(k)} + d_k$.
      If $\varrho_k > \tfrac34$ and $\|d_k\|_\infty = \Delta_k$: $\Delta_{k+1} := 2\,\Delta_k$.
      Otherwise: $\Delta_{k+1} := \Delta_k$.
   With $k := k+1$ go to 1).

We will return to a test problem by Powell which we have already discussed in exercise 15 to chapter 2.

Example 10
We are looking for a triangle with the least possible area, containing two disjoint disks with radius 1. If $(0, 0)$, $(x_1, 0)$ and $(x_2, x_3)$ denote the vertices of the triangle and $(x_4, x_5)$ and $(x_6, x_7)$ the centers of the disks, we have to
solve the following problem:
\[
f(x) := \tfrac12\,x_1 x_3 \longrightarrow \min \quad \text{(area of the triangle).}
\]
wlog let $x_1 \ge 0$ and $x_3 \ge 0$. Then we have the 'trivial' inequalities
\[
(x_4 - x_6)^2 + (x_5 - x_7)^2 \ge 4 \ \text{(disjoint disks)}, \quad x_5 \ge 1 \ \text{ and } \ x_7 \ge 1 .
\]
For the other four inequalities, we utilize the point normal form or Hessian form of a linear equation:
\[
\langle x - a, n\rangle = 0
\]
[Figure: straight line through the points $a$ and $x$ with direction $x - a$ and normal vector $n$.]

If the normal $n$ to the straight line is scaled to $\|n\|_2 = 1$, then $d := \langle p - a, n\rangle$ returns the distance of a point $p$ to the straight line.

Here we have at first $a := (0, 0)$ and $x := (x_2, x_3)$. Then $n = (x_3, -x_2)/\sqrt{x_2^2 + x_3^2}$. For $p = (x, y)$ it thus follows that
\[
1 \le d = \langle p, n\rangle .
\]
Insertion of the two points $(x_4, x_5)$ and $(x_6, x_7)$ gives
\[
\frac{x_3 x_4 - x_2 x_5}{\sqrt{x_3^2 + x_2^2}} \ge 1 \quad\text{and}\quad \frac{x_3 x_6 - x_2 x_7}{\sqrt{x_3^2 + x_2^2}} \ge 1 .
\]
Via the distance to the line through $(x_2, x_3)$ and $(x_1, 0)$ with $x - a = (x_2 - x_1, x_3)$ and $n = (-x_3, x_2 - x_1)/\sqrt{(x_1 - x_2)^2 + x_3^2}$, we obtain accordingly
\[
\frac{-x_3 x_4 + (x_2 - x_1)\,x_5 + x_1 x_3}{\sqrt{x_3^2 + (x_2 - x_1)^2}} \ge 1 \quad\text{and}\quad \frac{-x_3 x_6 + (x_2 - x_1)\,x_7 + x_1 x_3}{\sqrt{x_3^2 + (x_2 - x_1)^2}} \ge 1 .
\]
The problem has, for example, the following minimal point:
\[
x^* = \bigl(4 + 2\sqrt2,\ 2 + \sqrt2,\ 2 + \sqrt2,\ 1 + \sqrt2,\ 1,\ 3 + \sqrt2,\ 1\bigr)^T
\]
[Figure: the optimal triangle with the two disks; horizontal axis from 0 to 6, vertical axis from 0 to 3.]

with $\mathcal{A}(x^*) = \{3, 4, 5, 6, 9\}$ and
\[
\lambda^* = \Bigl(0,\ 0,\ \tfrac12 + \tfrac{1}{2\sqrt2},\ 2 + \sqrt2,\ 2 + \sqrt2,\ 2 + 2\sqrt2,\ 0,\ 0,\ 2 + 2\sqrt2\Bigr)^T .
\]
We obtain the following for the Hessian of the Lagrange function: ⎛ 1+√2 1−√2 1−√2 ⎞ 1 0 0 − 12 4√ 4√ 4 2 ⎜ 1− 2 3− 2 ⎟ 1 1 1 0√ − 12 ⎟ ⎜ 4√ − 2 2 2 2 ⎜ 1− 2 ⎟ 1 1+ 2 ⎜ 4 0 − 21√ − 21 − 12 ⎟ 2 2√ ⎜ ⎟ 1 H∗ = ⎜ − 21 − 2+2 2 0 √ 2+2 2 0√ ⎟ ⎜ 0 ⎟ 2 ⎜ 1 1 2+ 2 2+ 2 ⎟ − 0 − 0 ⎜ 0 ⎟ 2 2 2 2 √ √ ⎜ ⎟ 1 1 2+ 2 2+ 2 ⎝ − 12 ⎠ 0 − 0 2 2 2 2 √ √ 1 1 1 2+ 2 2+ 2 −2 −2 0 0 − 2 2 2 R
A Maple worksheet gives a good result for the Powell starting points, which are far away from a solution (with $B_0$ as the identity matrix), in only a few iteration steps. The attentive reader will of course have noticed that this is not the solution obtained above but the following alternative solution:
\[
\bigl(2(1 + \sqrt2),\ 0,\ 2(1 + \sqrt2),\ 1,\ 1 + \sqrt2,\ 1 + \sqrt2,\ 1\bigr)^T
\]
[Figure: triangle and disks for the iterates $k = 0, 1, \ldots, 5$; both axes from $-2$ to $6$.]
Exercises

1. Let $\Phi_r(x) := \Phi(x, r) := f(x) + \tfrac1r\,\pi(x)$ be an outer penalty function for the problem
\[
f(x) \longrightarrow \min, \quad x \in S,
\]
and let furthermore a continuously decreasing sequence $(r_k)$ with $r_k \downarrow 0$ be given. We additionally assume that for each $k \in \mathbb{N}$ there exists a global minimizer $x_k$ to $\Phi(\,\cdot\,, r_k)$, that is,
\[
\Phi(x_k, r_k) = \min \{ \Phi(x, r_k) \mid x \in S \} .
\]
Show:
a) $\Phi(x_k, r_k) \le \Phi(x_{k+1}, r_{k+1})$
b) $\pi(x_k) \ge \pi(x_{k+1})$
c) $f(x_k) \le f(x_{k+1})$
d) $f(x) \ge \Phi(x_k, r_k) \ge f(x_k)$ for all $x \in S$, $k \in \mathbb{N}$.

2. Give a graphical illustration of the optimization problem
\[
\begin{aligned}
f(x) &:= -x_1 x_2 \longrightarrow \min \\
g_1(x) &:= x_1 + x_2^2 - 1 \le 0 \\
g_2(x) &:= -x_1 - x_2 \le 0
\end{aligned}
\]
and solve it by means of the quadratic penalty method. Compute its minimizer $x(r)$ and give estimates of the Lagrange multipliers.

3. Solve the following optimization problem by means of the barrier method:
\[
\begin{aligned}
f(x) &:= -x_1 - 2\,x_2 \longrightarrow \min \\
g_1(x) &:= x_1 + x_2^2 - 1 \le 0 \\
g_2(x) &:= -x_2 \le 0
\end{aligned}
\]
Give estimates for the Lagrange multipliers and draw the central path.

4. Consider the following optimization problem:
\[
\begin{aligned}
f(x) &:= \sum_{i=1}^n x_i^2 \longrightarrow \min \qquad (x \in \mathbb{R}^n) \\
h(x) &:= \sum_{i=1}^n x_i - 1 = 0
\end{aligned}
\tag{8}
\]
Determine the optimizer of (8) by means of the quadratic penalty function given by $\Phi_r(x) := f(x) + \tfrac{1}{2r}\,h(x)^2$ with the penalty parameter $r > 0$.
a) Prove that the corresponding optimization problems of the penalty method have a unique minimizer $x(r)$ for every $r > 0$.
b) Formulate a system of linear equations $B_r\,x(r) = b$ ($B_r \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$) determining $x(r)$ and solve it.
c) Calculate the condition number of the matrix $B_r$. Can you identify a potential 'weak point' of the penalty method by that?
d) Determine the minimizer $x^*$ of problem (8).
e) Visualize for $n = 2$
   • the contours of $f$ and the feasible region,
   • the contours of the objective function of the penalty method for the values 1/5, 1/10 and 1/15 of the penalty parameter $r$.
   Plot $x$ as a function of $r$ and the point $x^*$. What do you observe?

5. Definite Quadratic Forms (cf. [Deb])
Let $Q \in \mathbb{R}^{n \times n}$ be symmetric and $A \in \mathbb{R}^{m \times n}$ with $m \le n$. Prove the equivalence: $Q$ is positive definite on $\operatorname{kernel}(A)$ iff there exists a number $\alpha \in \mathbb{R}$ such that $Q + \alpha\,A^T A$ is positive definite.
Hint: Show that the function $\varphi\colon K \longrightarrow \mathbb{R}$ defined by $\varphi(x) := \langle x, Q\,x\rangle / \langle Ax, Ax\rangle$ is bounded from below on $K := \{x \in \mathbb{R}^n \mid \langle x, x\rangle = 1 \text{ and } Ax \ne 0\}$.

6. Let the following optimization problem be given:
\[
\begin{aligned}
f(x) &\longrightarrow \min \qquad (x \in \mathbb{R}^n) \\
h_j(x) &= 0 \qquad (j = 1, \ldots, p)
\end{aligned}
\tag{9}
\]
To that we define with $L(x, \mu) := f(x) + \mu^T h(x)$ the corresponding Lagrange function as well as with $L_A(x, \mu; r) := L(x, \mu) + \tfrac{1}{2r}\,h(x)^T h(x)$ the augmented Lagrange function.
a) Let $(x^*, \mu^*)$ satisfy the sufficient second-order optimality conditions to (9) (cf. theorem 2.3.2). Then there exists a number $r^* > 0$ such that for all $r \le r^*$ the point $x^*$ is a strict local (unconstrained) minimizer of $L_A(\,\cdot\,, \mu^*; r)$.
b) If $h(\tilde x) = 0$ and $\tilde x$ is a stationary point of $L_A(\,\cdot\,, \tilde\mu; r)$ for some $\tilde\mu \in \mathbb{R}^p$, then $\tilde x$ is a KKT point to (9).

7. The preceding exercise suggests the following algorithm: Let the starting points $x^{(0)}$, $\mu^{(0)}$, $r_0$ be given. Set $k = 0$.
• If $\nabla_x L\bigl(x^{(k)}, \mu^{(k)}\bigr) = 0$ and $h\bigl(x^{(k)}\bigr) = 0$: STOP.
• Calculate a minimizer $x^{(k+1)}$ of $L_A\bigl(\,\cdot\,, \mu^{(k)}; r_k\bigr) \longrightarrow \min$.
• $\mu^{(k+1)} := \mu^{(k)} + h\bigl(x^{(k+1)}\bigr)/r_k$
  Choose $r_{k+1}$ such that $r_k \le r_{k+1}$.
• Set $k = k + 1$ and repeat the above calculation.
Use the algorithm to solve the following problem:
\[
\begin{aligned}
f(x) &:= x_1 x_2^2 \longrightarrow \min \\
h(x) &:= x_1^2 + x_2^2 - 2 = 0
\end{aligned}
\]
(cf. chapter 1, example 5). Choose $x^{(0)} := (-2, 2)^T$ and $\mu^{(0)} := 0$ as starting points and $r_k := 0.25$ for all $k \in \mathbb{N}_0$. Utilize, for example, the function fminunc from the Matlab® Optimization Toolbox.

8. Maratos Effect
We pick up on our discussion from example 9 and take a closer look at the phenomenon.
a) Show for $x \in \mathcal{F}$ that $d = (x_2^2, -x_1 x_2)^T$ and
\[
\Phi_r(x + t\,d) = \Phi_r(x) + \Bigl(\bigl(2 + \tfrac1r\bigr)\,t^2 - t\Bigr)\,\langle d, d\rangle
\]
hold. Determine the $\Phi_r$-optimal step size for Han's algorithm.
b) Generalize the results of a) to the case $x \in \mathbb{R}^2$ with $\|x\| \ge 1$ and also verify the relation
\[
\bigl|\,\|x + t\,d\|^2 - 1\,\bigr| = (1 - t)\,\bigl(\|x\|^2 - 1\bigr) + t^2\,\langle d, d\rangle .
\]
c) Do numerical experiments with the full step and damped Newton method. Choose, for example, $x^{(0)} := (\varepsilon + \cos\vartheta, \sin\vartheta)^T$ with $\varepsilon \in \{0, 0.01\}$, $\vartheta \in \{0.05, \tfrac{\pi}{2}\}$ as starting points and observe for $r = 1/2$ the behavior of the step sizes and the quotients
\[
\frac{\bigl\|x^{(k+1)} - x^*\bigr\|}{\bigl\|x^{(k)} - x^*\bigr\|^2} \quad\text{and}\quad \frac{\bigl\|x^{(k+1)} - x^*\bigr\|}{\bigl\|x^{(k)} - x^*\bigr\|} .
\]

9. Carry out two steps of the S$\ell_1$QP method with example 5 (cf. p. 226). Choose the values $r_1 := r_2 := 1$ for the penalty parameters, and $x^{(0)} := (0.5, 1)^T$ as the starting point.

10. Solve Powell's test example (cf. chapter 2, exercise 15) by means of the function fmincon from the Matlab® Optimization Toolbox. Apply the SQP method using the starting point $x^{(0)} := (3, 0, 2, -1.5, 1.5, 5, 0)^T$. Visualize the result of each iteration step. Define the objective function as an inline function and the constraints by an m-file.
Hints:
a) func = @(x) 1/2*x(1)*x(3);
b) function [c,ceq] = con(x)
     ceq = [];       % no equality constraints
     c(1) = ... ;    % c has 9 component functions!
c) options = optimset('LargeScale','off','OutputFcn',@outfun);
   x = fmincon(func,x0,[],[],[],[],[],[],@con,options);
   iteration
d) The actual triangle and circles can be visualized by the commands:
     N = size(iteration,1);
     psi = linspace(0,2*pi,200);
     for k = 1:N
       clf, hold on, axis equal
       xx = iteration(k,:);
       patch([xx(1),xx(2),0],[0,xx(3),0],'y');
       plot(xx(4)+cos(psi),xx(5)+sin(psi),'-');
       plot(xx(6)+cos(psi),xx(7)+sin(psi),'-');
       hold off, waitforbuttonpress
     end
6 Interior-Point Methods for Linear Optimization

6.1 Linear Optimization II
    The Duality Theorem
    The Interior-Point Condition
6.2 The Central Path
    Convergence of the Central Path
6.3 Newton's Method for the Primal–Dual System
6.4 A Primal–Dual Framework
6.5 Neighborhoods of the Central Path
    A Closer Look at These Neighborhoods
6.6 A Short-Step Path-Following Algorithm
6.7 The Mizuno–Todd–Ye Predictor-Corrector Method
6.8 The Long-Step Path-Following Algorithm
6.9 The Mehrotra Predictor-Corrector Method
6.10 A Short Look at Interior-Point Methods for Quadratic Optimization
Exercises

W. Forst and D. Hoffmann, Optimization—Theory and Practice, Springer Undergraduate Texts in Mathematics and Technology, © Springer Science+Business Media, LLC 2010, DOI 10.1007/978-0-387-78977-4_6

The development of the last 30 years has been greatly influenced by the aftermath of a “scientific earthquake” which was triggered in 1979 by the findings of the Russian mathematician Khachiyan (1952–2005) and in 1984 by those of the Indian-born mathematician Karmarkar. The New York Times, which profiled Khachiyan’s achievement in a November 1979 article entitled “Soviet Mathematician Is Obscure No More,” called him “the mystery author of a new mathematical theorem that has rocked the world of computer analysis.” At first it only affected linear optimization and the up to that time unchallenged dominance of the simplex method. This method was seriously questioned for the first time ever when in 1972 Klee and Minty found examples in which the simplex algorithm ran through all vertices of the feasible region. This confirmed that the ‘worst case complexity’ depended exponentially on the dimension of the problem. Afterwards people began searching for LP algorithms with polynomial complexity.
Based on Shor’s ellipsoid method, it was Khachiyan who found the first algorithm of this kind. When we speak of the ‘ellipsoid method’ today, we usually refer to the ‘Russian algorithm’ by Khachiyan. In many applications, however, it turned out to be less efficient than the simplex method. In 1984 Karmarkar achieved the breakthrough when he announced a polynomial algorithm which he claimed to be fifty times faster than the simplex method. This announcement was a bit of an exaggeration, but it stimulated very fruitful research activities. Gill, Murray, Saunders and Wright proved the equivalence between Karmarkar’s method and the classical logarithmic barrier methods, in particular when applied to linear optimization. Logarithmic barrier methods, unlike the exterior penalty methods from section 5.1, solve constrained problems by using a barrier term to transform them into a parameterized family of unconstrained optimization problems whose minimizers lie in the interior of the feasible region. First approaches to this method date back to Frisch (1954). In the sixties Fiacco and McCormick developed the so-called interior-point methods from these ideas. Their book [Fi/Co] contains a detailed description of classical barrier methods and is regarded as the standard reference work. A disadvantage was that the Hessians of the barrier functions were ill-conditioned at the approximate minimizers. This is usually seen as the cause of large rounding errors, and this flaw was probably the reason why people lost interest in these methods. Now, due to the reawakened interest, the special problem structure was studied again and it was shown that the rounding errors are less problematic if the implementation is thorough enough. Efficient interior-point methods have in the meantime been applied to larger classes of nonlinear optimization problems and are still topics of current research.
6.1 Linear Optimization II

We consider once more a linear problem in standard form, the primal problem
\[
(P) \quad
\begin{cases}
\langle c, x\rangle \longrightarrow \min \\
Ax = b, \quad x \ge 0 ,
\end{cases}
\]
where $A$ is a real $(m, n)$-matrix with $m, n \in \mathbb{N}$ and $m \le n$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$ and $x \in \mathbb{R}^n$. Here, $A$, $b$ and $c$ are given, and the vector $x$ is an unknown variable. In the following we assume that the matrix $A$ has full rank, that is, $\operatorname{rank}(A) = m$.
Associated with any linear problem is another linear problem, called the dual problem, which consists of the same data objects arranged in a different way. The dual problem to $(P)$ is
\[
(D) \quad
\begin{cases}
\langle b, y\rangle \longrightarrow \max \\
A^T y \le c ,
\end{cases}
\]
where $y \in \mathbb{R}^m$. In section 2.4 we obtained the dual problem with the help of the Lagrange function. After introducing slack variables $s \in \mathbb{R}^n_+$, $(D)$ can be written equivalently as
\[
(D_e) \quad
\begin{cases}
\langle b, y\rangle \longrightarrow \max \\
A^T y + s = c \\
s \ge 0 ,
\end{cases}
\]
where the index $e$ stands for extended problem. We know (compare chapter 2, exercise 18) that the primal and dual problem are symmetric, i.e., the dual problem of $(D)$ is again the primal problem. We show this once more: The Lagrange function $\tilde L$ to problem $(D)$ is given by
\[
\tilde L(y, x) = -\langle b, y\rangle + \langle x, A^T y - c\rangle = -\langle c, x\rangle + \langle Ax - b, y\rangle ,
\]
where in this case $x \in \mathbb{R}^n_+$ is the dual variable:
\[
\inf_{y \in \mathbb{R}^m} \tilde L(y, x) = \inf_{y \in \mathbb{R}^m} \bigl\{ -\langle c, x\rangle + \langle Ax - b, y\rangle \bigr\} =
\begin{cases}
-\langle c, x\rangle , & \text{if } Ax = b \\
-\infty , & \text{else.}
\end{cases}
\]
So we get the problem
\[
\begin{cases}
-\langle c, x\rangle \longrightarrow \max \\
Ax = b, \quad x \ge 0 ,
\end{cases}
\]
which is equivalent to problem $(P)$.

The Duality Theorem

Definition
The set of feasible points of $(P)$ is defined by
\[
\mathcal{F}_P := \{x \in \mathbb{R}^n \mid Ax = b,\ x \ge 0\} ,
\]
and the set of feasible points of $(D)$ is given by
\[
\mathcal{F}_D := \{y \in \mathbb{R}^m \mid A^T y \le c\}
\]
or alternatively
\[
\mathcal{F}_{D_e} := \{(y, s) \in \mathbb{R}^m \times \mathbb{R}^n \mid A^T y + s = c,\ s \ge 0\} .
\]
Analogously we define the set of strictly feasible points of $(P)$ and $(D)$ by
\[
\mathcal{F}_P^0 := \{x \in \mathbb{R}^n \mid Ax = b,\ x > 0\} , \qquad
\mathcal{F}_D^0 := \{y \in \mathbb{R}^m \mid A^T y < c\}
\]
and
\[
\mathcal{F}_{D_e}^0 := \{(y, s) \in \mathbb{R}^m \times \mathbb{R}^n \mid A^T y + s = c,\ s > 0\} ,
\]
respectively. Any vector $x \in \mathcal{F}_P$ is called primally feasible, a vector $y \in \mathcal{F}_D$ or a pair $(y, s) \in \mathcal{F}_{D_e}$ is called dually feasible. The two problems $(P)$ and $(D)$ together are referred to as a primal–dual system. A vector $y \in \mathcal{F}_D$ determines an $s \ge 0$ such that $A^T y + s = c$, and so we get a pair $(y, s) \in \mathcal{F}_{D_e}$ and vice versa. This holds analogously for the set of strictly feasible points.
For vectors $x \in \mathcal{F}_P$ and $y \in \mathcal{F}_D$ it holds that
\[
\langle c, x\rangle \ge \langle A^T y, x\rangle = \langle y, Ax\rangle = \langle b, y\rangle .
\]
Hence we have the following weak duality:
\[
\langle c, x\rangle \ge \langle b, y\rangle .
\]
Corollary
If $x^* \in \mathcal{F}_P$ and $y^* \in \mathcal{F}_D$ with $\langle c, x^*\rangle = \langle b, y^*\rangle$, then $x^*$ and $y^*$ are optimizers for their respective problems.

Proof: By weak duality we get $\langle c, x\rangle \ge \langle b, y^*\rangle = \langle c, x^*\rangle$ for any vector $x \in \mathcal{F}_P$ and $\langle b, y\rangle \le \langle c, x^*\rangle = \langle b, y^*\rangle$ for any vector $y \in \mathcal{F}_D$, respectively.

Hence, any $y$ that is feasible to $(D)$ provides a lower bound $\langle b, y\rangle$ for the values of $\langle c, x\rangle$ whenever $x$ is feasible to $(P)$. Conversely, any $x$ that is feasible to $(P)$ provides an upper bound $\langle c, x\rangle$ for the values of $\langle b, y\rangle$ whenever $y$ is feasible to $(D)$. The difference between the optimal value
\[
p^* := v(P) := \inf(P) := \inf \{\langle c, x\rangle \mid x \in \mathcal{F}_P\}
\]
of the primal problem and the optimal value
\[
d^* := v(D) := \sup(D) := \sup \{\langle b, y\rangle \mid y \in \mathcal{F}_D\}
\]
of the dual problem is called the duality gap. If the duality gap vanishes, i.e. $p^* = d^*$, we say that strong duality holds. For $s := c - A^T y$ the value $\langle s, x\rangle$ just measures the difference between the objective values $\langle c, x\rangle$ of the primal problem and $\langle b, y\rangle$ of the dual problem. The following theorem gives information about an optimality condition for problems $(P)$ and $(D)$:

Theorem 6.1.1 (Optimality condition)
The following statements are equivalent:
a) The primal problem $(P)$ has a minimizer.
b) The dual problem $(D_e)$ has a maximizer.
c) The following optimality conditions have a solution:
\[
\begin{aligned}
Ax &= b, \\
A^T y + s &= c, \\
x_i s_i &= 0 \quad (i = 1, \ldots, n), \\
x, s &\ge 0
\end{aligned}
\tag{1}
\]
If a) to c) hold, then a minimizer $x^*$ to $(P)$ and a maximizer $(y^*, s^*)$ to $(D)$ yield a solution $(x^*, y^*, s^*)$ of c) and vice versa.
Proof: a) $\iff$ c): Problem $(P)$ is in particular a convex problem with linear constraints, so we know that a vector $x^* \in \mathbb{R}^n$ is a minimizer to $(P)$ if and only if there exist multipliers $y^* \in \mathbb{R}^m$ and $s^* \in \mathbb{R}^n_+$ such that the triple $(x^*, y^*, s^*)$ satisfies the KKT conditions of the primal problem $(P)$. The Lagrange function $L$ to problem $(P)$ is given by $L(x, y, s) = \langle c, x\rangle + \langle y, b - Ax\rangle + \langle s, -x\rangle$. It is obvious that the KKT conditions of $(P)$ are exactly the conditions (1):
\[
\begin{aligned}
\nabla_x L(x^*, y^*, s^*) = c - s^* - A^T y^* &= 0 \\
x^*, s^* &\ge 0 \\
Ax^* &= b \\
\langle x^*, s^*\rangle &= 0
\end{aligned}
\]
b) $\iff$ c): We show this analogously: A vector $y^* \in \mathbb{R}^m$ is a maximizer to the dual problem $(D)$ if and only if there exists a multiplier $x^* \in \mathbb{R}^n_+$ such that $(x^*, y^*)$ satisfies the KKT conditions of the dual problem. The Lagrange function $\tilde L$ to problem $(D)$ is given by $\tilde L(y, x) = -\langle b, y\rangle + \langle x, A^T y - c\rangle$. Hence we see that the KKT conditions of $(D)$ are exactly the conditions (1), where $s^* := c - A^T y^*$:
\[
\begin{aligned}
\nabla_y \tilde L(y^*, x^*) = Ax^* - b &= 0 \\
x^* &\ge 0 \\
A^T y^* &\le c \\
\langle x^*, c - A^T y^*\rangle &= 0
\end{aligned}
\]
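We are now able to state the Duality Theorem for Linear Programming:

Theorem 6.1.2 (Duality Theorem)
For the primal problem $(P)$ and the dual problem $(D)$ exactly one of the following four cases holds:
a) Both problems have optimizers $x^*$ and $y^*$ respectively and the optimal values are the same, i.e., $\langle c, x^*\rangle = \langle b, y^*\rangle$ (normal case).
b) The primal problem has no feasible point, the dual problem has a feasible point and $d^* = \infty$.
c) The dual problem has no feasible point, the primal problem has a feasible point and $p^* = -\infty$.
d) Neither the primal nor the dual problem has feasible points.

Proof: We want to show this by using the Theorem of the Alternative (Farkas). To use it, we set
\[
\tilde A := \begin{pmatrix} A & 0 & 0 & 0 & 0 \\ 0 & A^T & -A^T & I & 0 \\ c^T & -b^T & b^T & 0 & 1 \end{pmatrix}, \quad
\tilde x := \begin{pmatrix} x \\ y_1 \\ y_2 \\ s \\ \omega \end{pmatrix}, \quad
\tilde b := \begin{pmatrix} b \\ c \\ 0 \end{pmatrix}.
\]
The normal case is given if and only if there exists a nonnegative solution $\tilde x$ to
\[
\tilde A\,\tilde x = \tilde b . \tag{2}
\]
If we set $y := y_1 - y_2$, we have $x \in \mathcal{F}_P$, $(y, s) \in \mathcal{F}_{D_e}$ and $\langle c, x\rangle - \langle b, y\rangle \le 0$, thus by weak duality $\langle c, x\rangle = \langle b, y\rangle$. Then, with the corollary from page 244, we have that $y$ is a maximizer to the dual problem and $x$ a minimizer to the primal problem. If (2) has no nonnegative solution, then we know by the Theorem of the Alternative (Farkas) that there exists a vector $\tilde y := (-\bar y, \bar x, \varrho)^T \in \mathbb{R}^{m+n+1}$ such that $\tilde A^T \tilde y \ge 0$ and $\tilde b^T \tilde y < 0$. This just means
\[
-A^T \bar y + \varrho\,c \ge 0, \quad A \bar x - \varrho\,b = 0, \quad \bar x \ge 0, \quad \varrho \ge 0 \quad\text{and}\quad \langle c, \bar x\rangle < \langle b, \bar y\rangle .
\]
Suppose that $\varrho > 0$. Then we obtain $A^T\bigl(\tfrac{1}{\varrho}\,\bar y\bigr) \le c$ and $A\bigl(\tfrac{1}{\varrho}\,\bar x\bigr) = b$ in the relations above. So we know that $\tfrac{1}{\varrho}\,\bar x \in \mathcal{F}_P$ and $\tfrac{1}{\varrho}\,\bar y \in \mathcal{F}_D$. Weak duality yields $\langle c, \bar x\rangle \ge \langle b, \bar y\rangle$. This, however, is a contradiction to $\langle c, \bar x\rangle < \langle b, \bar y\rangle$. Hence we can conclude that $\varrho = 0$. By this we get $\langle c, \bar x\rangle < \langle b, \bar y\rangle$, $A \bar x = 0$, $\bar x \ge 0$ and $A^T \bar y \le 0$. Now we consider two cases:

1. $\langle c, \bar x\rangle < 0$: Then $\mathcal{F}_D = \emptyset$: If we assume that there exists a vector $y \in \mathcal{F}_D$, then we have $A^T y \le c$. Since $A\bar x = 0$ and $\bar x \ge 0$, this yields $0 = \langle \bar x, A^T y\rangle \le \langle c, \bar x\rangle < 0$, and therefore we have a contradiction. If $\mathcal{F}_P = \emptyset$ as well, then we have case d). Else, for $\mathcal{F}_P \ne \emptyset$, there exists an $x \in \mathcal{F}_P$, in particular $Ax = b$. Hence we get $A(x + \lambda \bar x) = b$ for $\lambda \in \mathbb{R}_+$ and $\lim_{\lambda \to \infty} \langle c, x + \lambda \bar x\rangle = -\infty$. Thus we have case c).

2. $\langle c, \bar x\rangle \ge 0$: As $\langle c, \bar x\rangle < \langle b, \bar y\rangle$, this yields $\langle b, \bar y\rangle > 0$. Then we have $\mathcal{F}_P = \emptyset$: If we assume that there exists an $x \in \mathcal{F}_P$, we have that $Ax = b$. By this we get $0 \ge \langle A^T \bar y, x\rangle = \langle \bar y, Ax\rangle = \langle \bar y, b\rangle > 0$ and we have a contradiction. If $\mathcal{F}_D = \emptyset$ as well, then we have case d). Otherwise, there exists a $y \in \mathcal{F}_D$, that is, $A^T y \le c$. Hence, we get $A^T(y + \lambda \bar y) = A^T y + \lambda\,A^T \bar y \le c$ for $\lambda \in \mathbb{R}_+$ (because $A^T \bar y \le 0$) and $\lim_{\lambda \to \infty} \langle b, y + \lambda \bar y\rangle = \infty$. This means that we have case b).

As the four cases are mutually exclusive, exactly one of them occurs.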
Remark
In particular, by theorem 6.1.2 we know that if both problems have feasible points $x \in \mathcal{F}_P$ and $y \in \mathcal{F}_D$, then both problems have optimizers $x^*$ and $y^*$ with $\langle c, x^*\rangle = \langle b, y^*\rangle$.

Example 1
In the following we will give an example for each of the four cases of theorem 6.1.2:
a) Normal case: For $c := (3, 4)^T$, $A := (1 \ \ 2)$ and $b := 1$ we have $x = (0, \tfrac12)^T \in \mathcal{F}_P$ and $y = 2 \in \mathcal{F}_D$ with $\langle c, x\rangle = 2 = \langle b, y\rangle$.
b) For $c := (3, 4)^T$, $A := (-1 \ \ {-2})$ and $b := 1$, the primal problem has no feasible point, and the dual has the optimal value $d^* = \infty$.
c) For $c := (-1, -1, -1)^T$, $A := \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}$ and $b := (1, 1)^T$, the dual problem has no feasible point and $p^* = -\infty$.
d) We consider $A$ and $c$ as in c), set $b := (1, -1)^T$ and get: Both problems have no feasible points.

From theorem 6.1.1 we can derive the complementarity condition:
Lemma 6.1.3 (Complementary slackness optimality condition)
If $x$ is feasible to $(P)$ and $(y, s)$ is feasible to $(D_e)$, then $x$ is a minimizer for $(P)$ and $(y, s)$ is a maximizer for $(D_e)$ if and only if $x_i s_i = 0$ for $i = 1, \ldots, n$.

Proof: If $x$ is a minimizer of $(P)$ and $(y, s)$ is a maximizer of $(D_e)$, we get $x_i s_i = 0$ by (1). If $x$ is feasible to $(P)$, $(y, s)$ is feasible to $(D_e)$ and $x_i s_i = 0$, then, by theorem 6.1.1, $x$ is a minimizer of $(P)$ and $(y, s)$ is a maximizer of $(D_e)$.

To derive primal–dual interior-point methods later on, we rearrange the optimality condition (1) in the following way:
\[
F_0(x, y, s) := \begin{pmatrix} A^T y + s - c \\ Ax - b \\ Xs \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad x \ge 0, \ s \ge 0 . \tag{3}
\]
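As a small numerical illustration of Lemma 6.1.3 and of the residual in (3), the following Python sketch evaluates the three blocks of $F_0$ for the data of Example 1 a), where the third block consists of the products $x_i s_i$. The helper names `slack`, `residual_F0` and `duality_gap` are hypothetical and were chosen for this sketch only. Matrices are stored as plain nested lists.

```python
def slack(A, c, y):
    """Dual slack s = c - A^T y."""
    m, n = len(A), len(A[0])
    return [c[j] - sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]


def residual_F0(A, b, c, x, y, s):
    """Return the three blocks of F_0: A^T y + s - c, A x - b, and the products x_i * s_i."""
    m, n = len(A), len(A[0])
    dual_res = [sum(A[i][j] * y[i] for i in range(m)) + s[j] - c[j] for j in range(n)]
    primal_res = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
    products = [x[j] * s[j] for j in range(n)]
    return dual_res, primal_res, products


def duality_gap(b, c, x, y):
    """<c, x> - <b, y>; for feasible points this equals <x, s>."""
    return sum(ci * xi for ci, xi in zip(c, x)) - sum(bi * yi for bi, yi in zip(b, y))


# Data of Example 1 a).
A = [[1.0, 2.0]]
b = [1.0]
c = [3.0, 4.0]

# Optimal pair from the example: all products x_i * s_i vanish.
x_opt, y_opt = [0.0, 0.5], [2.0]
s_opt = slack(A, c, y_opt)

# A feasible but non-optimal pair: the products do not all vanish.
x_feas, y_feas = [1.0, 0.0], [0.0]
s_feas = slack(A, c, y_feas)
```

For the non-optimal pair the sum of the products $x_i s_i$ coincides with the duality gap, which is the identity $\langle s, x\rangle = \langle c, x\rangle - \langle b, y\rangle$ for feasible points.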
An uppercase letter denotes the diagonal matrix corresponding to the respective vector, for example
\[
X := \operatorname{Diag}(x) := \begin{pmatrix} x_1 & & & \\ & x_2 & & \\ & & \ddots & \\ & & & x_n \end{pmatrix} .
\]
Hence we have $X e_\nu = x_\nu e_\nu$ for $\nu = 1, \ldots, n$ and $Xe = x$ where $e := (1, \ldots, 1)^T$. In the following the norm $\|\cdot\|$ denotes the euclidean norm $\|\cdot\|_2$. We note that $XSe = Xs = (x_1 s_1, \ldots, x_n s_n)^T =: xs$ (Hadamard product).

The Interior-Point Condition

Definition
We say that the primal–dual system satisfies the interior-point condition (IPC) iff both problems $(P)$ and $(D)$ have strictly feasible points, that is, $\mathcal{F}_P^0 \ne \emptyset$ and $\mathcal{F}_D^0 \ne \emptyset$.

We are now going to introduce the logarithmic barrier function for the primal problem $(P)$. This is the function $\Phi_\mu$ defined by
\[
\Phi_\mu(x) := \langle c, x\rangle - \mu \sum_{i=1}^n \log(x_i) ,
\]
where $\mu > 0$ and $x$ runs through all primally feasible vectors that are positive. The domain of $\Phi_\mu$ is the set $\mathcal{F}_P^0$. $\mu$ is called the barrier parameter. The gradient of $\Phi_\mu$ is given by
\[
\nabla \Phi_\mu(x) = c - \mu\,X^{-1} e
\]
and the Hessian by
\[
\nabla^2 \Phi_\mu(x) = \mu\,X^{-2} .
\]
Obviously, the Hessian is positive definite for any $x \in \mathcal{F}_P^0$. This means that $\Phi_\mu$ is strictly convex on $\mathcal{F}_P^0$. Analogously, the logarithmic barrier function for the dual problem $(D)$ is given by
\[
\tilde\Phi_\mu(y) := \langle b, y\rangle + \mu \sum_{i=1}^n \log\bigl(c_i - \langle a_i, y\rangle\bigr) ,
\]
where $c - A^T y > 0$, and $a_1, \ldots, a_n$ denote the columns of the matrix $A$. The domain of $\tilde\Phi_\mu$ is the set $\mathcal{F}_D^0$. The gradient of $\tilde\Phi_\mu$ is given by
\[
\nabla \tilde\Phi_\mu(y) = b - \mu \sum_{i=1}^n \frac{a_i}{c_i - \langle a_i, y\rangle}
\]
and the Hessian by
\[
\nabla^2 \tilde\Phi_\mu(y) = -\mu \sum_{i=1}^n \frac{a_i a_i^T}{\bigl(c_i - \langle a_i, y\rangle\bigr)^2} .
\]
The Hessian is negative definite for any $y \in \mathcal{F}_D^0$: The assumption $\operatorname{rank}(A) = m$ gives $\langle a_i, z\rangle \ne 0$ for at least one $i$ for any $z \ne 0$. Thus, for any $z \ne 0$ we have
\[
\bigl\langle z, \nabla^2 \tilde\Phi_\mu(y)\,z\bigr\rangle = -\mu \sum_{i=1}^n \frac{\langle a_i, z\rangle^2}{\bigl(c_i - \langle a_i, y\rangle\bigr)^2} < 0 .
\]
This means that $\tilde\Phi_\mu$ is strictly concave on $\mathcal{F}_D^0$.

Theorem 6.1.4
Let $\mu > 0$. Then the following statements are equivalent:
a) Both $\mathcal{F}_P^0$ and $\mathcal{F}_D^0$ are not empty.
b) There exists a (unique) minimizer to $\Phi_\mu$ on $\mathcal{F}_P^0$.
c) There exists a (unique) maximizer to $\tilde\Phi_\mu$ on $\mathcal{F}_D^0$.
d) The system
\[
F_\mu(x, y, s) := \begin{pmatrix} A^T y + s - c \\ Ax - b \\ Xs - \mu e \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad x \ge 0, \ s \ge 0 \tag{4}
\]
has a (unique) solution.
If a) to d) hold, then the minimizer $x(\mu)$ to $\Phi_\mu$ and the maximizer $y(\mu)$ to $\tilde\Phi_\mu$ yield the solution $\bigl(x(\mu), y(\mu), s(\mu)\bigr)$ of (4), where $s(\mu) := c - A^T y(\mu)$.
Proof: b) $\iff$ d): The definition of $\Phi_\mu$ can be extended to the open set $\mathbb{R}^n_{++}$ and $\Phi_\mu$ is differentiable there. On $\mathbb{R}^n_{++}$ we consider the problem
\[
(P_\mu) \quad
\begin{cases}
\Phi_\mu(x) \longrightarrow \min \\
Ax = b .
\end{cases}
\]
$(P_\mu)$ is a convex problem with linear constraints. So we know that a vector $x$ is a (unique) minimizer to $(P_\mu)$ iff there exists a (unique) multiplier $y \in \mathbb{R}^m$ such that $(x, y)$ satisfies the KKT conditions of problem $(P_\mu)$. The Lagrange function $L$ to this problem is given by $L(x, y) = \Phi_\mu(x) + \langle y, b - Ax\rangle$. The KKT conditions to $(P_\mu)$ are
\[
\begin{aligned}
\nabla_x L(x, y) = c - \mu\,X^{-1} e - A^T y &= 0 \\
Ax &= b .
\end{aligned}
\]
If we set $s := \mu\,X^{-1} e$, which is equivalent to $Xs = \mu e$, we get nothing else but (4). We have therefore shown that system (4) represents the optimality conditions of problem $(P_\mu)$.

c) $\iff$ d): $\tilde\Phi_\mu$ has a (unique) maximizer $\tilde y$ on $\mathcal{F}_D^0$ if and only if
\[
0 = \nabla \tilde\Phi_\mu(\tilde y) = b - \mu \sum_{i=1}^n \frac{a_i}{c_i - \langle a_i, \tilde y\rangle} . \tag{5}
\]
For such a maximizer $\tilde y$ we define $\tilde x, \tilde s \in \mathbb{R}^n$ to get a solution $(\tilde x, \tilde y, \tilde s)$ to system (4):
\[
\tilde x_i := \frac{\mu}{c_i - \langle a_i, \tilde y\rangle} > 0 , \qquad \tilde s_i := c_i - \langle a_i, \tilde y\rangle > 0 .
\]
By this construction we have $A^T \tilde y + \tilde s - c = 0$ and in addition
\[
A \tilde x = \sum_{i=1}^n \tilde x_i\,a_i = \mu \sum_{i=1}^n \frac{a_i}{c_i - \langle a_i, \tilde y\rangle} \overset{(5)}{=} b .
\]
Obviously, we have $\tilde x_i \tilde s_i = \mu$ for $i = 1, \ldots, n$. Hence we have obtained a solution to (4). For a solution $(x, y, s)$ to (4) we have $A^T y + s - c = 0$. This means $s_i = c_i - \langle a_i, y\rangle$, thus $x_i = \mu\,s_i^{-1} = \mu\,\bigl(c_i - \langle a_i, y\rangle\bigr)^{-1}$ for $i = 1, \ldots, n$. Besides,
\[
0 = b - Ax = b - \sum_{i=1}^n x_i\,a_i = b - \mu \sum_{i=1}^n \frac{a_i}{c_i - \langle a_i, y\rangle} = \nabla \tilde\Phi_\mu(y)
\]
shows that $y$ is the (unique) maximizer to $\tilde\Phi_\mu$ on $\mathcal{F}_D^0$. Therefore we have shown that system (4) represents the optimality conditions for the problem
\[
(D_\mu) \quad
\begin{cases}
\tilde\Phi_\mu(y) \longrightarrow \max \\
A^T y + s = c, \quad s > 0 .
\end{cases}
\]
d) $\Longrightarrow$ a): If system (4) has a solution $(x, y, s)$, then we have $x \in \mathcal{F}_P^0$ and $y \in \mathcal{F}_D^0$.

a) $\Longrightarrow$ b): Assuming a), there exist vectors $x^0 \in \mathcal{F}_P^0$, $y^0 \in \mathcal{F}_D^0$ and hence $(y^0, s^0) \in \mathcal{F}_{D_e}^0$ for a suitable $s^0 > 0$. We define the level set $L_K$ of $\Phi_\mu$ by
\[
L_K := \{x \in \mathcal{F}_P^0 \mid \Phi_\mu(x) \le K\}
\]
with $K := \Phi_\mu(x^0)$. As we have $x^0 \in \mathcal{F}_P^0$, the set $L_K$ is not empty. Since $\Phi_\mu$ is continuous on its domain, it suffices to show that $L_K$ is compact, because then $\Phi_\mu$ has a minimizer and, since $\Phi_\mu$ is strictly convex, this minimizer is unique. Obviously, $L_K$ is closed in $\mathcal{F}_P^0$. Hence it remains to show that $L_K$ is bounded: Let $x \in L_K$. We have $\langle c, x\rangle - \langle b, y^0\rangle = \langle x, s^0\rangle$ and we thus get
\[
\Phi_\mu(x) = \langle b, y^0\rangle + \langle x, s^0\rangle - \mu \sum_{i=1}^n \log(x_i) .
\]
With $s := s^0/\mu$ the inequality $\Phi_\mu(x) \le K$ yields the boundedness of
\[
h(x) := \langle x, s\rangle - \sum_{i=1}^n \log(x_i) = \sum_{i=1}^n \bigl(x_i s_i - \log(x_i)\bigr)
\]
on $L_K$.

[Figure: Graph of the function $x - \log(x)$.]

If $L_K$ was unbounded, there would exist a sequence $\bigl(x^{(\kappa)}\bigr)$ in $L_K$ with $\bigl\|x^{(\kappa)}\bigr\| \to \infty$ for $\kappa \to \infty$. By choosing a subsequence, we can obtain that for a $k \in \{1, \ldots, n\}$ wlog exactly the sequences $\bigl(x_1^{(\kappa)}\bigr)$ to $\bigl(x_k^{(\kappa)}\bigr)$ will definitely diverge to $\infty$ or converge to 0, while the other sequences have a positive limit. In $\sum_{i=1}^n \bigl(x_i s_i - \log(x_i)\bigr)$ the first $k$ summands will definitely diverge to $\infty$, while the others will converge and thus be bounded. Altogether, with $h\bigl(x^{(\kappa)}\bigr) \longrightarrow \infty$ for $\kappa \to \infty$, we obtain a contradiction.
Remark
As we have $x_i s_i = \mu > 0$ for $i = 1, \ldots, n$, we know that every solution to (4) satisfies $x > 0$ and $s > 0$. Thus, the second equation in (4) means nothing else but that $x$ is primally feasible and the first that $(y, s)$ is dually feasible. The third equation is called the centering condition.

The interior-point condition is not always fulfilled. We can easily see this with an example:

Example 2
For $A := (1 \ \ 0 \ \ 1)$, $b := 0$ and $c := (1, 0, 0)^T$ there exists no vector $x \in \mathcal{F}_P^0$ and no vector $y \in \mathcal{F}_D^0$: $Ax = b$ with $x \ge 0$ yields $x_1 = x_3 = 0$, and by $A^T y = (y, 0, y)^T$ we see that there exists no $y \in \mathbb{R}$ with $A^T y < c$, since the second component would require $0 < 0$. Hence the interior-point condition is not fulfilled.

In the following we will make the additional assumption: Both problems $(P)$ and $(D)$ have strictly feasible points, that is, $\mathcal{F}_P^0 \ne \emptyset$ and $\mathcal{F}_D^0 \ne \emptyset$.
6.2 The Central Path

We want to solve system (3) by means of Newton's method
\[
\begin{pmatrix} x^{(k+1)} \\ y^{(k+1)} \\ s^{(k+1)} \end{pmatrix} =
\begin{pmatrix} x^{(k)} \\ y^{(k)} \\ s^{(k)} \end{pmatrix} -
DF_0\bigl(x^{(k)}, y^{(k)}, s^{(k)}\bigr)^{-1} \cdot F_0\bigl(x^{(k)}, y^{(k)}, s^{(k)}\bigr) ,
\]
where $DF_0$ is the Jacobian of $F_0$ given by
\[
DF_0(x, y, s) = \begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} .
\]
The nonlinear system (3) consists of $m + 2n$ equations with $m + 2n$ variables. Any solution to (3) satisfies $x_i s_i = 0$ $(i = 1, \ldots, n)$ and therefore lies on the boundary of the primal–dual feasible set
\[
\mathcal{F} := \{(x, y, s) \in \mathbb{R}^n_+ \times \mathbb{R}^m \times \mathbb{R}^n_+ \mid Ax = b,\ A^T y + s = c\} .
\]
We define the set of strictly primal–dual feasible points as
\[
\mathcal{F}^0 := \{(x, y, s) \in \mathbb{R}^n_{++} \times \mathbb{R}^m \times \mathbb{R}^n_{++} \mid Ax = b,\ A^T y + s = c\} .
\]
Theorem 6.2.1
The Jacobian
\[
DF_0(x, y, s) = \begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix}
\]
is nonsingular if $x > 0$ and $s > 0$.

Proof: For $(u, v, w) \in \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^n$ with
\[
\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = DF_0(x, y, s) \begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} A^T v + w \\ Au \\ Su + Xw \end{pmatrix}
\]
we have
\[
\langle u, w\rangle = \langle u, -A^T v\rangle = -\langle Au, v\rangle = 0 .
\]
From $Su + Xw = 0$ we get $u = -S^{-1} X w$ and, with the equality above, we obtain
\[
0 = \langle u, w\rangle = \langle w, u\rangle = -\langle w, S^{-1} X w\rangle .
\]
Since $x > 0$ and $s > 0$, the matrix $S^{-1} X$ is positive definite and we can conclude that $w = 0$. From $0 = Su + Xw = Su$ we obtain $u = 0$ and from $0 = A^T v + w = A^T v$ we get $v = 0$ since $\operatorname{rank}(A) = m$.

As mentioned above, the solution to system (3) lies on the boundary of the set $\mathcal{F}$ and thus it is not guaranteed that the Jacobian is invertible. We therefore consider the 'relaxed nonlinear system' (4). As we have $x_i s_i = \mu > 0$ for $i = 1, \ldots, n$, we know that every solution to (4) satisfies $x > 0$ and $s > 0$. Now it is our aim to find a solution to system (4) by applying Newton's method. Theorem 6.1.4 gives that the interior-point condition is independent of the barrier parameter $\mu$. Hence we can conclude that if a minimizer to $\Phi_\mu$ or a maximizer to $\tilde\Phi_\mu$ exists for some positive $\mu$, then it exists for all positive $\mu$. These solutions to system (4) are denoted as
\[
\mathfrak{x}(\mu) := \bigl(x(\mu), y(\mu), s(\mu)\bigr) .
\]
Definition
The set
\[
\mathcal{C} := \{\mathfrak{x}(\mu) \mid \mu > 0\}
\]
of solutions to (4) is called the (primal–dual) central path.

The central path $\mathcal{C}$ is an 'arc' of strictly feasible points that is parameterized by a scalar $\mu > 0$, and each point $\mathfrak{x}(\mu)$ of $\mathcal{C}$ solves system (4). These conditions differ from the KKT conditions only in the term $\mu$. Instead of the complementarity condition $x_i s_i = 0$ we demand that the products $x_i s_i$ have the same value for all indices $i$. By theorem 6.1.4 we know that the central path is well-defined. We also know that the equations (4) approximate (3) more and more closely as $\mu$ goes to zero. This lets us hope that $\mathfrak{x}(\mu)$ converges to an optimizer of $(P)$ and $(D)$ if $\mathfrak{x}(\mu)$ converges to any point for $\mu \downarrow 0$. In this case, the central path guides us to a minimizer or maximizer along a well-defined route. For $\mu > 0$ we have, using $x_i(\mu)\,s_i(\mu) = \mu$ for each $i$,
\[
n \mu = \sum_{i=1}^n x_i(\mu)\,s_i(\mu) = \langle x(\mu), s(\mu)\rangle = \langle c, x(\mu)\rangle - \langle b, y(\mu)\rangle .
\]
This gives an idea of how far away $x(\mu)$ and $y(\mu)$ are from being optimal. For $\mu \downarrow 0$ we see that $\langle c, x(\mu)\rangle$ and $\langle b, y(\mu)\rangle$ approach each other. With the normalized duality gap $\mu = \bigl(\langle c, x(\mu)\rangle - \langle b, y(\mu)\rangle\bigr)/n$ we can see how close $\langle c, x(\mu)\rangle$ and $\langle b, y(\mu)\rangle$ are.

Example 3
We consider $A := (1 \ \ 1 \ \ 1)$, $b := 1$ and $c := (-1, -3, -4)^T$. Thus we get the primal problem
\[
(P) \quad
\begin{cases}
-x_1 - 3 x_2 - 4 x_3 \longrightarrow \min \\
x_1 + x_2 + x_3 = 1, \quad x \ge 0
\end{cases}
\]
and the dual problem
\[
(D_e) \quad
\begin{cases}
y \longrightarrow \max \\
y + s_1 = -1 \\
y + s_2 = -3 \\
y + s_3 = -4 \\
s \ge 0 .
\end{cases}
\]
Hence system (4) becomes
\[
\begin{aligned}
x_1 + x_2 + x_3 &= 1 \\
y + s_1 + 1 &= 0 \\
y + s_2 + 3 &= 0 \\
y + s_3 + 4 &= 0 \\
x_i s_i &= \mu
\end{aligned}
\]
where $x_i > 0$ and $s_i > 0$ for $i = 1, 2, 3$. This gives $s_1 = -1 - y$, $s_2 = -3 - y$, $s_3 = -4 - y$. $s > 0$ shows that $y < -4$. From $x_1 s_1 = x_2 s_2 = x_3 s_3 = \mu$ we obtain
\[
\frac{1}{\mu} = \frac{x_1 + x_2 + x_3}{\mu} = \frac{1}{s_1} + \frac{1}{s_2} + \frac{1}{s_3} = -\Bigl(\frac{1}{1 + y} + \frac{1}{3 + y} + \frac{1}{4 + y}\Bigr) .
\]
Multiplication with the common denominator gives a cubic equation from which $y(\mu)$, with $y(\mu) < -4$, can be obtained relatively easily. Then we can calculate $s_i(\mu)$ by $s_1(\mu) = -1 - y(\mu)$, $s_2(\mu) = -3 - y(\mu)$, $s_3(\mu) = -4 - y(\mu)$, and with that evaluate $x_i(\mu)$ by $x_i(\mu) = \mu/s_i(\mu)$. The figure below shows the central path. The triangular region is the feasible set of the primal problem. We see that the curve lies inside the feasible set and approaches the minimizer $(0, 0, 1)^T$ of problem $(P)$.

[Figure: Trajectory of the central path for $(P)$ inside the triangle with vertices $(1, 0, 0)$, $(0, 1, 0)$ and $(0, 0, 1)$; axes $x_1$, $x_2$, $x_3$.]
1
x3 (0, 1, 0) (1, 0, 0) 0
0 0 0.5
0.5
x2
1
x1
Modified Objective Function (0, 0, 1)
1
•
x3 0.5
(0, 1, 0) (1, 0, 0)
0
0 0
x2
0.5
0.5 1
x1
If we change the objective function minimally by setting c := (−1, −3, −3)T, then every point on the line connecting (0, 1, 0)T and (0, 0, 1)T is a minimizer. For μ → 0 the central path tends to the barycenter (0, 1/2, 1/2)T of this line. Here the starting point of the central path (i. e., μ −→ ∞) is the barycenter (1/3, 1/3, 1/3)T of the feasible region.
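The computation described in Example 3 can be sketched in a few lines. The following is only an illustration under stated assumptions: it handles problems with the single constraint $x_1 + \dots + x_n = 1$ (so $A = (1 \dots 1)$, $b = 1$), it replaces the cubic equation by a bisection on $y$, and the function name `central_path_point` is hypothetical.

```python
def central_path_point(c, mu, iterations=200):
    """Point (x, y, s) on the central path of  min <c, x>  s.t.  sum(x) = 1, x >= 0.

    Assumption: one constraint row A = (1, ..., 1) and b = 1, as in Example 3.
    """
    n = len(c)
    c_min = min(c)

    def residual(y):
        # sum_i x_i - 1  with  s_i = c_i - y  and  x_i = mu / s_i
        return sum(mu / (ci - y) for ci in c) - 1.0

    # With t := c_min - y > 0 we have 1/t <= sum_i 1/s_i <= n/t,
    # so the solution satisfies mu <= t <= n*mu. The residual increases with y.
    lo, hi = c_min - n * mu, c_min - mu  # residual(lo) <= 0 <= residual(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    y = 0.5 * (lo + hi)
    s = [ci - y for ci in c]
    x = [mu / si for si in s]
    return x, y, s
```

Calling it with `c = [-1, -3, -4]` and a small `mu` gives a point close to $(0, 0, 1)^T$; with `c = [-1, -3, -3]` the points approach $(0, 1/2, 1/2)^T$, and for very large `mu` they approach $(1/3, 1/3, 1/3)^T$, in line with the discussion above.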
Convergence of the Central Path

The central path is a ‘curve’ in the interior of the feasible region. It begins ‘somewhere in its middle’ (more precisely in its ‘analytic center’) and ends ‘somewhere in the middle of the optimal solutions’ as $\mu \to 0$. We would now like to examine this in more detail. We therefore consider the sets of the respective optimizers
\[ F_P^{\mathrm{opt}} := \{ x \in F_P \mid \langle c, x\rangle = p^* \} \quad\text{and}\quad F_D^{\mathrm{opt}} := \{ y \in F_D \mid \langle b, y\rangle = d^* \} \]
and obtain as the main result:

Theorem 6.2.2 For $\mu > 0$ let $x(\mu)$ denote the unique minimizer of $\Phi_\mu$ on $F_P^0$ and $y(\mu)$ the unique maximizer of $\tilde{\Phi}_\mu$ on $F_D^0$. Then there exist the limits
\[ x^* := \lim_{\mu \to 0+} x(\mu) \quad\text{and}\quad y^* := \lim_{\mu \to 0+} y(\mu) . \]
For them it holds that $x^* \in F_P^{\mathrm{opt}}$ and $y^* \in F_D^{\mathrm{opt}}$.

Before doing the proof, we will give some general remarks. We had
\[ \Phi_\mu(x) := \langle c, x\rangle - \mu \sum_{i=1}^{n} \log(x_i) \]
for $\mu > 0$ and $x \in F_P^0$, hence $\Phi_\mu(x) = f(x) + \mu\, h(x)$ with $f(x) := \langle c, x\rangle$ and
\[ h(x) := -\sum_{i=1}^{n} \log(x_i) = -\log\det(X) . \]
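Lemma 6.2.3 For $0 < \mu < \lambda$ the following inequalities hold: $h(x(\lambda)) \le h(x(\mu))$ and $f(x(\mu)) \le f(x(\lambda))$. The function $h \circ x$ is hence antitone, the function $f \circ x$ isotone.

Proof: $\Phi_\mu(x(\mu)) \le \Phi_\mu(x(\lambda))$ and $\Phi_\lambda(x(\lambda)) \le \Phi_\lambda(x(\mu))$ mean
\[ f(x(\mu)) + \mu\, h(x(\mu)) \le f(x(\lambda)) + \mu\, h(x(\lambda)) \qquad (6) \]
\[ f(x(\lambda)) + \lambda\, h(x(\lambda)) \le f(x(\mu)) + \lambda\, h(x(\mu)) . \]
Summation of the two inequalities gives $0 \ge (\lambda - \mu)\big( h(x(\lambda)) - h(x(\mu)) \big)$, hence $h(x(\lambda)) \le h(x(\mu))$. Together with (6) this yields $f(x(\mu)) \le f(x(\lambda))$.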
Lemma 6.2.4 For $\alpha \in \mathbb{R}$ the level sets
\[ \{ x \in F_P \mid \langle c, x\rangle \le \alpha \} \quad\text{and}\quad \{ y \in F_D \mid \langle b, y\rangle \ge \alpha \} \]
are bounded.

Proof: For $(P)$: Otherwise there exists a sequence $(x^{(k)})$ in $F_P$ such that $\langle c, x^{(k)}\rangle \le \alpha$ for $k \in \mathbb{N}$ and $0 < \|x^{(k)}\| \to \infty$ $(k \to \infty)$. We consider the sequence of the corresponding normalized vectors $p_k := x^{(k)} / \|x^{(k)}\|$. For them we have wlog (otherwise choose a suitable subsequence!) $p_k \to p$ for a vector $p \in \mathbb{R}^n$ with $\|p\| = 1$. We gain $p \ge 0$, $Ap = 0$ and $\langle c, p\rangle \le 0$. With a $y^{(0)} \in F_D^0$ the contradiction follows:
\[ 0 < \langle p, c - A^T y^{(0)}\rangle = \langle c, p\rangle - \langle Ap, y^{(0)}\rangle \le 0 \]

For $(D)$: Otherwise there exists a sequence $(y^{(k)})$ in $F_D$ such that $\langle b, y^{(k)}\rangle \ge \alpha$ for $k \in \mathbb{N}$ and $0 < \|y^{(k)}\| \to \infty$ $(k \to \infty)$. For the sequence of the corresponding normalized vectors $q_k := y^{(k)} / \|y^{(k)}\|$ we have wlog $q_k \to q$ for a $q \in \mathbb{R}^m$ with $\|q\| = 1$. Here we get $A^T q \le 0$, $\langle b, q\rangle \ge 0$ and $A^T q \ne 0$, as $q \ne 0$ and $\operatorname{rank}(A) = m$. With an $x^{(0)} \in F_P^0$ we obtain the contradiction
\[ 0 > \langle x^{(0)}, A^T q\rangle = \langle A x^{(0)}, q\rangle = \langle b, q\rangle \ge 0 . \]

Corollary The sets of the optimal points $F_P^{\mathrm{opt}}$ and $F_D^{\mathrm{opt}}$ are nonempty, convex and compact.

Proof: By assumption (IPC) and the duality theorem the sets are nonempty. The convexity and closedness are immediately apparent. With that the compactness follows by the above lemma.

We will now turn to the proof of theorem 6.2.2 which will be divided into three parts:

(I) If $(\mu_k)$ is a null sequence in $\mathbb{R}_{++}$, the sequence $\big(x(\mu_k), y(\mu_k), s(\mu_k)\big)$ is bounded and each of its accumulation points $(\tilde{x}, \tilde{y}, \tilde{s})$ gives a minimizer $\tilde{x}$ of $(P)$ and a maximizer $\tilde{y}$ of $(D)$.
Proof of (I): With an upper bound $\bar{\mu} > 0$ for the sequence $(\mu_k)$ we have, by lemma 6.2.3, $\langle c, x(\mu_k)\rangle \le \langle c, x(\bar{\mu})\rangle$ for $k \in \mathbb{N}$. In the same way it follows that $\langle b, y(\mu_k)\rangle \ge \langle b, y(\bar{\mu})\rangle$ for $k \in \mathbb{N}$. With lemma 6.2.4 we thus know that the sequences $(x(\mu_k))$ and $(y(\mu_k))$ are bounded and therefore the sequence $(s(\mu_k))$ as well. From the central path conditions we thus get the following for each accumulation point $(\tilde{x}, \tilde{y}, \tilde{s})$ by taking the limit of a corresponding subsequence:
\[ A\tilde{x} = b, \quad \tilde{x} \ge 0, \qquad A^T \tilde{y} + \tilde{s} = c, \quad \tilde{s} \ge 0, \qquad \langle \tilde{x}, \tilde{s}\rangle = 0 \]
With theorem 6.1.1 this gives the desired result.
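(II) With $B := \{ i \in \{1, \dots, n\} \mid \exists\, x \in F_P^{\mathrm{opt}} :\ x_i > 0 \}$ it holds that:

a) The barrier problem
\[ (B) \quad \begin{cases} \psi(x) := -\sum_{j \in B} \log x_j \longrightarrow \min \\ x \in F_P^{\mathrm{opt}}, \quad x_B > 0 \end{cases} \]
has a unique minimizer $x^*$.

b) $\lim_{\mu \to 0+} x(\mu) = x^*$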
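Proof of (II): a) If $B = \emptyset$, then $F_P^{\mathrm{opt}} = \{0\}$ holds, and the assertions are trivially true.¹ Let therefore $B \ne \emptyset$: For each $j \in B$ there exists an $x^{(j)} \in F_P^{\mathrm{opt}}$ with $x^{(j)}_j > 0$. Then
\[ \bar{x} := \frac{1}{|B|} \sum_{j \in B} x^{(j)} \]
is in $F_P^{\mathrm{opt}}$ with $\bar{x}_B > 0$. The existence of a minimizer $x^*$ of $(B)$ follows once again from the assertion that level sets of $(B)$ are compact. (Compare our considerations on p. 250 f with $c := 0$, $\mu := 1$ and the index set $B$ instead of $\{1, \dots, n\}$.) The uniqueness follows directly from the strict convexity of the objective function $\psi$.

¹ By convention we have $\sum_{j \in \emptyset} \log x_j = 0$.

b) Let now $(\mu_k)$ be a null sequence in $\mathbb{R}_{++}$ and $\tilde{x}$ an accumulation point of the corresponding sequence $(x(\mu_k))$. By (I) it then holds that $\tilde{x} \in F_P^{\mathrm{opt}}$. wlog $x(\mu_k) \to \tilde{x}$. We will show that $\tilde{x}$ minimizes problem $(B)$. $\tilde{x} = x^*$ then follows from the already proven uniqueness of the minimizer which completes the proof.

i) $\tilde{x}$ is feasible for $(B)$: The KKT conditions to $(P_\mu)$ (cf. p. 250) yield the following relation for $\mu > 0$:
\[ c - \mu X(\mu)^{-1} e - A^T y(\mu) = 0 \]
For $x \in F_P^{\mathrm{opt}}$ it then holds that
\[ 0 = \langle y(\mu), A(x - x(\mu))\rangle = \langle A^T y(\mu), x - x(\mu)\rangle = \langle c - \mu X(\mu)^{-1} e,\ x - x(\mu)\rangle = \langle c, x\rangle - \langle c, x(\mu)\rangle + \mu n - \mu \sum_{j \in B} \frac{x_j}{x(\mu)_j} . \]
Since $\langle c, x\rangle \le \langle c, x(\mu)\rangle$, we then have
\[ \sum_{j \in B} \frac{x_j}{x(\mu)_j} \le n \quad\text{and thus}\quad x(\mu)_j \ge \frac{1}{n}\, x_j \quad\text{for } j \in B . \]
Setting $x := x^* \in F_P^{\mathrm{opt}}$ and $\mu := \mu_k$ gives the estimate $\tilde{x}_j \ge \frac{1}{n}\, x^*_j > 0$ for $j \in B$ as $k \to \infty$.

ii) We will finally show that $\tilde{x}$ minimizes problem $(B)$: For $\mu > 0$ and $x \in F_P^{\mathrm{opt}}$ it holds that (cf. i))
\[ \frac{1}{\mu} \langle c, x - x(\mu)\rangle = \langle X(\mu)^{-1} e,\ x - x(\mu)\rangle . \]
If we choose $x := x^*$ and $x := \tilde{x}$, we get the following via subtraction and with the fact that $\langle c, x^*\rangle = \langle c, \tilde{x}\rangle = p^*$:
\[ 0 = \frac{1}{\mu} \langle c, x^* - \tilde{x}\rangle = \langle X(\mu)^{-1} e,\ x^* - \tilde{x}\rangle = \sum_{j \in B} \frac{x^*_j - \tilde{x}_j}{x(\mu)_j} . \]
With $\mu := \mu_k$ and $k \to \infty$ we obtain
\[ \sum_{j \in B} \frac{x^*_j}{\tilde{x}_j} = |B| =: \ell . \]
The concavity of the logarithmic function then yields
\[ \psi(\tilde{x}) - \psi(x^*) = \sum_{j \in B} \log\!\left( \frac{x^*_j}{\tilde{x}_j} \right) = \ell \sum_{j \in B} \frac{1}{\ell} \log\!\left( \frac{x^*_j}{\tilde{x}_j} \right) \le \ell\, \log\!\left( \frac{1}{\ell} \sum_{j \in B} \frac{x^*_j}{\tilde{x}_j} \right) = 0 . \]

Analogously to (II) we get:

(III) With $N := \{ i \in \{1, \dots, n\} \mid \exists\, y \in F_D^{\mathrm{opt}} :\ c_i - \langle a_i, y\rangle > 0 \}$ we have:

a) Problem
\[ (\tilde{B}) \quad \begin{cases} \tilde{\psi}(y) := \sum_{j \in N} \log\big( c_j - \langle a_j, y\rangle \big) \longrightarrow \max \\ y \in F_D^{\mathrm{opt}}, \quad (c - A^T y)_N > 0 \end{cases} \]
has a unique maximizer $y^*$.

b) $\lim_{\mu \to 0+} y(\mu) = y^*$

Remark The solution $x^*$ to $(B)$ is apparently the unique maximizer of
\[ \max \Big\{ \prod_{i \in B} x_i \ \Big|\ x \in F_P^{\mathrm{opt}} \Big\} . \]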
In the same way the solution $y^*$ to $(\tilde{B})$ is the unique maximizer of
\[ \max \Big\{ \prod_{i \in N} \big( c_i - \langle a_i, y\rangle \big) \ \Big|\ y \in F_D^{\mathrm{opt}} \Big\} . \]
Following Sonnevend (cf. [Son]), $x^*$ is called the analytic center of $F_P^{\mathrm{opt}}$ and $y^*$ the analytic center of $F_D^{\mathrm{opt}}$. From (II) and (III) we derive an interesting

Corollary: For the index sets $B$ and $N$ it holds that $B \cap N = \emptyset$ as well as $B \cup N = \{1, \dots, n\}$; we have in particular $x^* + s^* > 0$ for $s^* := c - A^T y^*$.

Proof: The optimality conditions for the primal problem $(P)$ and the dual problem $(D)$ yield
\[ A x^* = b, \quad x^* \ge 0, \qquad A^T y^* + s^* = c, \quad s^* \ge 0, \qquad \langle x^*, s^*\rangle = 0 . \qquad (7) \]
The complementarity condition $\langle x^*, s^*\rangle = 0$ gives $B \cap N = \emptyset$. For $\mu > 0$ the points of the central path meet the following condition:
\[ A x(\mu) = b, \quad x(\mu) \ge 0, \qquad A^T y(\mu) + s(\mu) = c, \quad s(\mu) \ge 0, \qquad x(\mu) \cdot s(\mu) = \mu\, e \]
We have seen that
\[ \lim_{\mu \to 0+} x(\mu) = x^*, \qquad \lim_{\mu \to 0+} y(\mu) = y^* . \]
With that we also have $\lim_{\mu \to 0+} s(\mu) = s^*$. The relation
\[ \langle x(\mu) - x^*,\ s(\mu) - s^*\rangle = -\langle x(\mu) - x^*,\ A^T (y(\mu) - y^*)\rangle = -\langle A (x(\mu) - x^*),\ y(\mu) - y^*\rangle = 0 \]
yields
\[ \mu \cdot n = \langle x(\mu), s(\mu)\rangle = \langle x^*, s(\mu)\rangle + \langle x(\mu), s^*\rangle = \sum_{i \in B} x^*_i\, s(\mu)_i + \sum_{i \in N} x(\mu)_i\, s^*_i \]
and after dividing by $\mu = x(\mu)_i\, s(\mu)_i$:
\[ n = \sum_{i \in B} \frac{x^*_i}{x(\mu)_i} + \sum_{i \in N} \frac{s^*_i}{s(\mu)_i} \qquad (8) \]
For $\mu \to 0+$ it follows that $n = |B| + |N|$ and thus with $B \cap N = \emptyset$ the assertion.

From the considerations to theorem 6.2.2 we have thus gained the following proposition, albeit under the assumption that the Interior-Point Condition holds:

Goldman–Tucker Theorem There exist optimizers $x^*$ to $(P)$ and $y^*$ to $(D)$ with $x^* + s^* > 0$ for $s^* := c - A^T y^*$.

In this context also see exercise 5.
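As a small worked illustration (our own computation, based on the data of Example 3 and its modified objective), strict complementarity can be read off directly. In both cases the dual problem maximizes $y$ subject to $y \le c_i$, so $y^* = \min_i c_i$.

```latex
% Example 3: c = (-1, -3, -4)^T, A = (1 1 1), b = 1
y^* = -4, \qquad s^* = c - A^T y^* = (3, 1, 0)^T, \qquad x^* = (0, 0, 1)^T,
\qquad x^* + s^* = (3, 1, 1)^T > 0, \qquad B = \{3\},\ N = \{1, 2\}.

% Modified objective: c = (-1, -3, -3)^T
y^* = -3, \qquad s^* = (2, 0, 0)^T, \qquad x^* = (0, \tfrac12, \tfrac12)^T,
\qquad x^* + s^* = (2, \tfrac12, \tfrac12)^T > 0, \qquad B = \{2, 3\},\ N = \{1\}.

% A vertex optimizer of the modified problem is not strictly complementary:
x = (0, 1, 0)^T \ \Longrightarrow\ x + s^* = (2, 1, 0)^T \not> 0 .
```

In the modified problem the strictly complementary pair is the one reached by the central path, while the optimal vertices $(0, 1, 0)^T$ and $(0, 0, 1)^T$ each have a zero component in $x + s^*$.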
6.3 Newton’s Method for the Primal–Dual System

Our aim is to solve (4) for a fixed $\mu > 0$ by Newton’s method.² A full step of the Newton iteration reads as follows:³
\[ (x^+, y^+, s^+) := (x, y, s) - (\Delta x, \Delta y, \Delta s) \]
We will occasionally use the shortened notation
\[ \mathbf{x} := (x, y, s), \quad \mathbf{x}^+ := (x^+, y^+, s^+) \quad\text{and}\quad \Delta\mathbf{x} := (\Delta x, \Delta y, \Delta s) . \]
The above relation then reads
\[ \mathbf{x}^+ := \mathbf{x} - \Delta\mathbf{x} . \]
The Newton search direction $\Delta\mathbf{x}$ is obtained by solving the system $DF_\mu(\mathbf{x})\, \Delta\mathbf{x} = F_\mu(\mathbf{x})$, where we never calculate the inverse of the matrix $DF_\mu(\mathbf{x})$ explicitly. For $\mu > 0$ we define the dual residual $r_d$, the primal residual $r_p$ and the complementarity residual $r_c$ at $\mathbf{x}$ as:
\[ r_d := A^T y + s - c, \qquad r_p := Ax - b, \qquad r_c := Xs - \mu e \]

² In the following considerations we will drop the index $k$, that is, we write $(x, y, s)$ instead of $(x^{(k)}, y^{(k)}, s^{(k)})$ and $(x^+, y^+, s^+)$ instead of $(x^{(k+1)}, y^{(k+1)}, s^{(k+1)})$.
³ We prefer this notation (with “−”) which is slightly different from the widely used standard notation.
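The Newton direction $\Delta\mathbf{x}$ at $\mathbf{x}$ is given as the solution to
\[ DF_\mu(\mathbf{x})\, \Delta\mathbf{x} = \begin{pmatrix} r_d \\ r_p \\ r_c \end{pmatrix} = F_\mu(\mathbf{x}) \]
which we can restate as
\[ \begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} r_d \\ r_p \\ r_c \end{pmatrix} . \qquad (9) \]
By theorem 6.2.1 the Jacobian is nonsingular here as we have $\mu > 0$ and therefore $x > 0$ and $s > 0$. We therefore know that the Newton step is well-defined. The system (9) is of size $2n + m$. It is possible to symmetrize $DF_\mu(x, y, s)$ by multiplying the last block row by $S^{-1}$:
\[ \begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ I & 0 & S^{-1} X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} r_d \\ r_p \\ S^{-1} r_c \end{pmatrix} \]
In the following theorem we show an alternative way of solving system (9):

Theorem 6.3.1 The solution to system (9) is given by the following equations: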
\[ \begin{aligned} \Delta y &= (A X S^{-1} A^T)^{-1} \big( r_p - A S^{-1} (r_c - X r_d) \big) \\ \Delta s &= r_d - A^T \Delta y \\ \Delta x &= S^{-1} (r_c - X \Delta s) \end{aligned} \]

Proof: (9) means
\[ A^T \Delta y + \Delta s = r_d, \qquad A \Delta x = r_p, \qquad S \Delta x + X \Delta s = r_c . \]
Hence $\Delta s = r_d - A^T \Delta y$, $\Delta x = S^{-1} (r_c - X \Delta s) = x - \mu S^{-1} e - X S^{-1} \Delta s$. If we substitute $\Delta x$ in the second equation above, we get $A S^{-1} (r_c - X \Delta s) = r_p$. Now we substitute $\Delta s$ and obtain $A S^{-1} r_c - A S^{-1} X r_d + A S^{-1} X A^T \Delta y = r_p$, hence $A S^{-1} X A^T \Delta y = r_p + A S^{-1} (-r_c + X r_d)$ and so
\[ \Delta y = (A X S^{-1} A^T)^{-1} \big( r_p + A S^{-1} (-r_c + X r_d) \big) . \]

With the positive definite matrix⁴ $M := A X S^{-1} A^T$ and the right-hand side $\mathrm{rhs} := r_p + A S^{-1} (-r_c + X r_d)$ the solution to
\[ M\, \Delta y = \mathrm{rhs} \]
firstly yields $\Delta y$. Then we have $\Delta s = r_d - A^T \Delta y$ and $\Delta x = S^{-1} (r_c - X \Delta s)$.

⁴ For the positive definite matrix $M := A X S^{-1} A^T$ a Cholesky decomposition may be used. In the case of large sparse problems a preconditioned CG algorithm is suggested.
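The three formulas of theorem 6.3.1 translate almost literally into code. The sketch below is a minimal illustration, not a general solver: it assumes a single equality constraint $\langle a, x\rangle = b$ (so $m = 1$ and $M$ is a positive scalar that can simply be divided by), and the name `newton_direction` is hypothetical.

```python
def newton_direction(a, b, c, x, y, s, mu):
    """Solve system (9) for one equality constraint <a, x> = b (m = 1)."""
    n = len(x)
    r_d = [a[i] * y + s[i] - c[i] for i in range(n)]   # dual residual A^T y + s - c
    r_p = sum(a[i] * x[i] for i in range(n)) - b        # primal residual A x - b
    r_c = [x[i] * s[i] - mu for i in range(n)]          # complementarity residual X s - mu e

    # M = A X S^{-1} A^T is a positive scalar because m = 1.
    M = sum(a[i] * a[i] * x[i] / s[i] for i in range(n))
    rhs = r_p + sum(a[i] * (-r_c[i] + x[i] * r_d[i]) / s[i] for i in range(n))

    dy = rhs / M
    ds = [r_d[i] - a[i] * dy for i in range(n)]
    dx = [(r_c[i] - x[i] * ds[i]) / s[i] for i in range(n)]
    return dx, dy, ds
```

For $m > 1$ the division by `M` would be replaced by the solution of a linear system with the matrix $M$, for instance via a Cholesky decomposition.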
In the short-step method (compare section 6.6) we first deal with full Newton steps and small updates of the parameter μ. Later on we derive a long-step method with damped Newton steps, that is, we introduce a step size α ∈ (0, 1] and choose α such that (x+ , y + , s+ ) := (x, y, s) − α (Δx, Δy, Δs) satisfies (x+ , s+ ) > 0 . For α = 1 we have the full Newton step.
6.4 A Primal–Dual Framework

For each $(x, y, s) \in F^0$ we define the duality measure $\tau$ by
\[ \tau := \tau_{(xs)} := \tau(xs) := \frac{1}{n} \sum_{i=1}^{n} x_i s_i = \frac{1}{n} \langle x, s\rangle \]
which gives the average value of the products $x_i s_i$. If we are on the central path, we obviously have $\mu = \tau$, but $\tau$ is also defined off the central path. With the results we have gained we can now define a general framework for primal–dual algorithms. But first we introduce a centering parameter $\sigma \in [0, 1]$. We now consider the system
\[ \begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ XSe - \sigma\tau e \end{pmatrix} . \qquad (10) \]
The direction $(\Delta x, \Delta y, \Delta s)$ gives a Newton step toward a point $(x_{\sigma\tau}, y_{\sigma\tau}, s_{\sigma\tau}) \in C$ with $\sigma\tau = \mu$.
General Interior-Point Algorithm

Let a starting vector $(x^{(0)}, y^{(0)}, s^{(0)}) \in F^0$ and an accuracy requirement $\varepsilon > 0$ be given.
Initialize $(x, y, s) := (x^{(0)}, y^{(0)}, s^{(0)})$ and $\tau := \frac{1}{n} \langle x, s\rangle$.
while $\tau > \varepsilon$
    Solve
    \[ \begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ XSe - \sigma\tau e \end{pmatrix} \]
    where $\sigma \in [0, 1]$.
    Set $(x, y, s) := (x, y, s) - \alpha\, (\Delta x, \Delta y, \Delta s)$ where $\alpha \in (0, 1]$ denotes a suitable step size which we choose such that $(x, s) > 0$.
    $\tau := \frac{1}{n} \langle x, s\rangle$

Depending on the choice of $\sigma$ and $\alpha$, we get different interior-point algorithms. We will discuss some of them later on. In principle, the term $\sigma\tau$ on the right-hand side of (10) plays the same role as the parameter $\mu$ in system (4). For $\sigma = 0$ we aim directly at a point for which the optimality conditions (1) are fulfilled. This direction is also called the affine-scaling direction. At the other extreme, for $\sigma = 1$, the equations (10) define a Newton direction which yields a step toward the point $(x_\tau, y_\tau, s_\tau) \in C$ at which all the products $x_i s_i$ are identical to $\tau$. Therefore this direction is called the centering direction. Steps in this direction usually make only little progress in reducing $\tau$ but move closer to the central path.
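Remark To get a suitable step size, we choose an $\eta \in (0, 1)$ such that $\eta \approx 1$, for example $\eta = 0.99$, and set
\[ \alpha := \min\Big\{ 1,\ \eta \min_{\Delta x_\nu > 0} \frac{x_\nu}{\Delta x_\nu},\ \eta \min_{\Delta s_\nu > 0} \frac{s_\nu}{\Delta s_\nu} \Big\} . \]
So we have $x - \alpha\, \Delta x > 0$ and $s - \alpha\, \Delta s > 0$. For the implementation it is useful to consider that
\[ \min\Big\{ 1,\ \eta \min_{\Delta x_\nu > 0} \frac{x_\nu}{\Delta x_\nu},\ \eta \min_{\Delta s_\nu > 0} \frac{s_\nu}{\Delta s_\nu} \Big\} = \frac{\eta}{\max\Big\{ \eta,\ \max_\nu \dfrac{\Delta x_\nu}{x_\nu},\ \max_\nu \dfrac{\Delta s_\nu}{s_\nu} \Big\}} . \]
In Matlab® the denominator of the right-hand side can be written in the form max([η; Δx./x; Δs./s]).

The next lemma shows that we can reduce $\tau$, dependent on $\sigma$ and $\alpha$. We will use the notation
\[ \big( x(\alpha), y(\alpha), s(\alpha) \big) := (x, y, s) - \alpha\, (\Delta x, \Delta y, \Delta s), \qquad \tau(\alpha) := \frac{1}{n} \langle x(\alpha), s(\alpha)\rangle . \]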
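The general framework together with the step-size rule of the remark can be sketched as follows. This is an illustration under assumptions that are not part of the text: a single constraint row `a` (so that system (10) reduces to one scalar division), a fixed centering parameter `sigma = 0.5`, and the hypothetical function names `step_size`, `direction` and `interior_point`.

```python
def step_size(x, s, dx, ds, eta=0.99):
    """alpha = eta / max(eta, max dx_i/x_i, max ds_i/s_i), as in the remark."""
    ratios = [dx[i] / x[i] for i in range(len(x))]
    ratios += [ds[i] / s[i] for i in range(len(s))]
    return eta / max([eta] + ratios)


def direction(a, x, s, sigma, tau):
    """Solve system (10) for one constraint row a (m = 1)."""
    n = len(x)
    r_c = [x[i] * s[i] - sigma * tau for i in range(n)]
    M = sum(a[i] * a[i] * x[i] / s[i] for i in range(n))
    dy = -sum(a[i] * r_c[i] / s[i] for i in range(n)) / M
    ds = [-a[i] * dy for i in range(n)]
    dx = [(r_c[i] - x[i] * ds[i]) / s[i] for i in range(n)]
    return dx, dy, ds


def interior_point(a, x, y, s, sigma=0.5, eps=1e-8, eta=0.99, max_iter=500):
    """General interior-point loop; returns the last iterate and all tau values."""
    n = len(x)
    tau = sum(x[i] * s[i] for i in range(n)) / n
    history = [tau]
    while tau > eps and len(history) <= max_iter:
        dx, dy, ds = direction(a, x, s, sigma, tau)
        alpha = step_size(x, s, dx, ds, eta)
        x = [x[i] - alpha * dx[i] for i in range(n)]
        y = y - alpha * dy
        s = [s[i] - alpha * ds[i] for i in range(n)]
        tau = sum(x[i] * s[i] for i in range(n)) / n
        history.append(tau)
    return x, y, s, history
```

A possible call for Example 3 is `interior_point([1, 1, 1], [0.3, 0.45, 0.25], -10.0, [9.0, 7.0, 6.0])`; the starting triple is strictly feasible because $s = c - A^T y$ and $x_1 + x_2 + x_3 = 1$.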
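Lemma 6.4.1 The solution $(\Delta x, \Delta y, \Delta s)$ to system (10) has the following properties:
a) $\langle \Delta x, \Delta s\rangle = 0$
b) $\tau(\alpha) = \big( 1 - \alpha (1 - \sigma) \big)\, \tau$

Proof: a): We have
\[ A \Delta x = 0 \quad\text{and}\quad A^T \Delta y + \Delta s = 0 . \]
Now we multiply the second equation with $\Delta x$ and obtain $0 = \langle \Delta x, A^T \Delta y\rangle + \langle \Delta x, \Delta s\rangle = \langle A \Delta x, \Delta y\rangle + \langle \Delta x, \Delta s\rangle$, hence $\langle \Delta x, \Delta s\rangle = 0$.

b): The third row of (10) gives $S \Delta x + X \Delta s = XSe - \sigma\tau e$. Summation of the $n$ components yields $\langle s, \Delta x\rangle + \langle x, \Delta s\rangle = \langle x, s\rangle - \sigma \langle x, s\rangle = (1 - \sigma) \langle x, s\rangle$. With a) we therefore obtain
\[ n\, \tau(\alpha) = \langle x(\alpha), s(\alpha)\rangle = \langle x, s\rangle - \alpha \big( \langle s, \Delta x\rangle + \langle x, \Delta s\rangle \big) = \big( 1 - \alpha (1 - \sigma) \big) \langle x, s\rangle = n \big( 1 - \alpha (1 - \sigma) \big)\, \tau . \]

The next theorem states that the algorithm has polynomial complexity.

Theorem 6.4.2 Suppose that $(x^{(k)}, y^{(k)}, s^{(k)})$ is a sequence generated by the algorithm, $\varepsilon \in (0, 1)$ and that the parameters produced by the algorithm fulfill
\[ \tau_{k+1} \le \Big( 1 - \frac{\delta}{n^\omega} \Big)\, \tau_k , \qquad k = 0, 1, 2, \dots \qquad (11) \]
for some positive constants $\delta$ and $\omega$. If the initial vector $(x^{(0)}, y^{(0)}, s^{(0)})$ satisfies
\[ \tau_0 \le \frac{1}{\varepsilon^\nu} \qquad (12) \]
for a positive constant $\nu$, then there exists an index $K \in \mathbb{N}$ with $K = O(n^\omega |\ln(\varepsilon)|)$ and $\tau_k \le \varepsilon$ for all $k \ge K$.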
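Proof: We take logarithms on both sides in (11) and obtain
\[ \ln \tau_{k+1} \le \ln\Big( 1 - \frac{\delta}{n^\omega} \Big) + \ln \tau_k . \]
If we repeat this procedure and use (12) in the second inequality, we get
\[ \ln \tau_k \le k \ln\Big( 1 - \frac{\delta}{n^\omega} \Big) + \ln \tau_0 \le k \ln\Big( 1 - \frac{\delta}{n^\omega} \Big) - \nu \ln \varepsilon . \]
We know that $t \ge \ln(1 + t)$ for $t > -1$. This implies for $t := -\delta / n^\omega$ that
\[ \ln \tau_k \le k \Big( -\frac{\delta}{n^\omega} \Big) - \nu \ln \varepsilon . \]
Hence, if we have
\[ k \Big( -\frac{\delta}{n^\omega} \Big) - \nu \ln \varepsilon \le \ln \varepsilon , \]
the convergence criterion $\tau_k \le \varepsilon$ is satisfied. This inequality holds for all $k$ that satisfy
\[ k \ge (1 + \nu)\, \frac{n^\omega}{\delta}\, |\ln(\varepsilon)| . \]
With $K := \Big\lceil (1 + \nu)\, \dfrac{n^\omega}{\delta}\, |\ln(\varepsilon)| \Big\rceil$ the proof is complete.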
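To get a feeling for the size of $K$, the bound can be compared with the number of steps the worst case of (11) really needs. The sketch below assumes equality in (11) and the largest admissible start $\tau_0 = \varepsilon^{-\nu}$; both function names are hypothetical.

```python
import math


def iteration_bound(n, eps, delta, omega, nu):
    """K = ceil((1 + nu) * n**omega / delta * |ln eps|) from the proof above."""
    return math.ceil((1.0 + nu) * n ** omega / delta * abs(math.log(eps)))


def iterations_needed(n, eps, delta, omega, nu):
    """Steps until tau <= eps for tau_{k+1} = (1 - delta / n**omega) * tau_k."""
    tau = eps ** (-nu)                 # largest start value allowed by (12)
    factor = 1.0 - delta / n ** omega  # reduction factor from (11), taken with equality
    k = 0
    while tau > eps:
        tau *= factor
        k += 1
    return k
```

For instance, with $n = 100$, $\varepsilon = 10^{-6}$, $\delta = 2/5$, $\omega = 1/2$ and $\nu = 1$ the bound is $K = 691$, somewhat above the number of steps the recursion itself takes, since the proof uses the estimate $\ln(1 + t) \le t$.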
6.5 Neighborhoods of the Central Path

Path-following methods move along the central path in the direction of decreasing $\mu$ to the set of minimizers of $(P)$ and maximizers of $(D)$, respectively. These methods do not necessarily stay exactly on the central path but within a loose, well-defined neighborhood. Most of the literature does not examine the nature of these neighborhoods. We, however, will take a close look at them, especially at their geometric interpretation. We want to measure the deviation of each $\mathbf{x} = (x, y, s) \in F^0$ from the central path: So far, we know that
\[ \min_\mu \| F_\mu(\mathbf{x}) \| = \min_\mu \| XSe - \mu e \| \]
as $\mathbf{x} \in F^0$. The orthogonal projection from $XSe$ onto $\{ \mu e \mid \mu \in \mathbb{R} \}$ is given by $\frac{1}{n} \langle x, s\rangle\, e$. We thus get, with the duality measure $\tau$,
\[ \min_\mu \| F_\mu(\mathbf{x}) \| = \min_\mu \| XSe - \mu e \| = \Big\| XSe - \frac{1}{n} \langle x, s\rangle\, e \Big\| = \| XSe - \tau(xs)\, e \| . \]
Obviously, a triple $(x, y, s) \in F^0$ lies on the central path if and only if $\| XSe - \tau(xs)\, e \| = 0$. Our aim is to control the approximation of the central path in such a way that the deviation converges to zero as $\langle x, s\rangle$ goes to zero. This leads to the following definition of the neighborhood $\mathcal{N}_2$ of the central path for $\beta \ge 0$:
\[ \mathcal{N}_2(\beta) := \{ (x, y, s) \in F^0 \mid \| XSe - \tau(xs)\, e \| \le \beta\, \tau(xs) \} \]
It is clear that $C = \mathcal{N}_2(0) \subset \mathcal{N}_2(\beta_1) \subset \mathcal{N}_2(\beta_2) \subset F^0$ for $0 \le \beta_1 \le \beta_2 \le 1$. The following picture shows the $x$-projection of $\mathcal{N}_2(\beta)$ corresponding to example 3 for the values $\beta = 1/2$ and $\beta = 1/4$:

[Figure: x-projection of the neighborhood for β = 1/2 and β = 1/4 in the triangle with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1); axes x1, x2, x3.]
Another neighborhood which we will need later on is $\mathcal{N}_{-\infty}$ defined as
\[ \mathcal{N}_{-\infty}(\gamma) := \{ (x, y, s) \in F^0 \mid x_\nu s_\nu \ge \gamma\, \tau(xs) \ \text{for } \nu = 1, \dots, n \} \]
for some $\gamma \in [0, 1]$. It is clear that $C = \mathcal{N}_{-\infty}(1) \subset \mathcal{N}_{-\infty}(\gamma_2) \subset \mathcal{N}_{-\infty}(\gamma_1) \subset \mathcal{N}_{-\infty}(0) = F^0$ for $0 \le \gamma_1 \le \gamma_2 \le 1$. We will illustrate this neighborhood with a picture for the same example and the values $\gamma = 0.3$ and $\gamma = 0.6$ as well.

[Figure: x-projection of the neighborhood for γ = 0.3 and γ = 0.6 in the triangle with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1); axes x1, x2, x3.]

Every one of the path-following algorithms which we will consider in the following restricts the iterates to one of the two neighborhoods of the central path $C$ with $\beta \in [0, 1]$ and $\gamma \in [0, 1]$. The requirement in the $\mathcal{N}_{-\infty}$ neighborhood is not very strong:
If we choose $\gamma$ close to zero, we have almost the whole feasible set. By looking at $\mathcal{N}_2(\beta)$, we see that we cannot claim the same here for $\beta \in [0, 1]$. In certain examples it does not matter how we choose $\beta$: there are always points in the strictly feasible region which do not lie in the neighborhood $\mathcal{N}_2(\beta)$:

Example 4 Let
\[ A := \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \qquad b := (7, 2)^T \quad\text{and}\quad c := (1, 1)^T . \]
Hence, we consider the primal problem
\[ (P) \quad \begin{cases} x_1 + x_2 \longrightarrow \min \\ x_1 = 7 \\ 2 x_2 = 2 \\ x \ge 0 \end{cases} \]
and the dual problem
\[ (D_e) \quad \begin{cases} 7 y_1 + 2 y_2 \longrightarrow \max \\ y_1 + s_1 = 1 \\ 2 y_2 + s_2 = 1 \\ s \ge 0 . \end{cases} \]
Here we have $F_P = F_P^0 = \{ (7, 1)^T \}$. Hence we see that $x = (7, 1)^T \in F_P^0$ and $(y, s) = \big( (0, 0)^T, (1, 1)^T \big) \in F_{D_e}^0$ with $(x, y, s) \in \mathcal{N}_{-\infty}(\gamma)$ for a suitable $\gamma$ but $(x, y, s) \notin \mathcal{N}_2(\beta)$ for any $\beta \in [0, 1]$: As we have $\tau = \frac{1}{2} \langle x, s\rangle = 4$ we get with $\gamma := 1/4$ that $x_\nu s_\nu \ge \gamma\, \tau = 1$ for $\nu = 1, 2$. On the other hand, we have $\| XSe - \tau e \| = \sqrt{18} > \beta\, \tau$ for all $\beta \in [0, 1]$.
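The two proximity conditions are easy to test numerically. The following sketch checks only the conditions on the products $x_\nu s_\nu$ and assumes that the triple is already strictly feasible; the function names are hypothetical.

```python
import math


def duality_measure(x, s):
    """tau = <x, s> / n."""
    return sum(xi * si for xi, si in zip(x, s)) / len(x)


def in_N2(x, s, beta):
    """Proximity condition ||XSe - tau e|| <= beta * tau (feasibility is not checked)."""
    tau = duality_measure(x, s)
    deviation = math.sqrt(sum((xi * si - tau) ** 2 for xi, si in zip(x, s)))
    return deviation <= beta * tau


def in_N_minus_inf(x, s, gamma):
    """Proximity condition x_nu * s_nu >= gamma * tau for all nu (feasibility is not checked)."""
    tau = duality_measure(x, s)
    return all(xi * si >= gamma * tau for xi, si in zip(x, s))
```

For the data of Example 4, `in_N_minus_inf([7, 1], [1, 1], 0.25)` holds while `in_N2([7, 1], [1, 1], 1.0)` fails, since $\sqrt{18} \approx 4.24 > 4$.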
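Since the condition $\| XSe - \tau(xs)\, e \| \le \beta\, \tau(xs)$ in the definition of $\mathcal{N}_2(\beta)$ does not depend on $(x, y, s)$, but only on $xs$, the vector of the products $x_\nu s_\nu$, we firstly consider the simpler set
\[ N_2(\beta) := \{ \omega \in \mathbb{R}^n_+ \mid \| \omega - \tau(\omega)\, e \| \le \beta\, \tau(\omega) \} \]
with the mean value
\[ \tau := \tau_\omega := \tau(\omega) := \frac{1}{n} \sum_{\nu=1}^{n} \omega_\nu = \frac{1}{n} \langle \omega, e\rangle = \frac{1}{n}\, e^T \omega \quad\text{for } \omega \in \mathbb{R}^n_+ . \]
Here we ignore the fact that the constraints $Ax = b$ and $A^T y + s = c$ are met. (In this reconstruction the sets of triples are written $\mathcal{N}$ and the sets of product vectors $N$.) Likewise it is helpful to consider the simpler set
\[ N_{-\infty}(\gamma) := \{ \omega \in \mathbb{R}^n_+ \mid \omega_\nu \ge \gamma\, \tau_\omega \ \text{for } \nu = 1, \dots, n \} \]
instead of $\mathcal{N}_{-\infty}(\gamma)$. Here the fact that the constraints are met is also ignored.

A Closer Look at These Neighborhoods

With
\[ n^2 \tau^2 = (e^T \omega)^2 = \langle \omega, e e^T \omega\rangle \]
we get for $\omega \in N_2(\beta)$
\[ \| \omega - \tau e \|^2 = \langle \omega, \omega\rangle - 2\tau \langle \omega, e\rangle + \tau^2 \langle e, e\rangle = \langle \omega, \omega\rangle - n \tau^2 . \]
With that the inequality $\| \omega - \tau e \| \le \beta\, \tau$ means
\[ \langle \omega, \omega\rangle - (n + \beta^2)\, \tau^2 \le 0 . \]
With the matrix
\[ A := A_\beta := I - \frac{1 + \beta^2 / n}{n}\, e e^T \]
this can be written in the form $\langle \omega, A \omega\rangle \le 0$. With $Ae = -(\beta^2 / n)\, e$ we see that $e$ is an eigenvector of $A$ with eigenvalue $-\beta^2 / n$. For any vector $u \in \mathbb{R}^n$ with $u \perp e$ we obtain $Au = u$, hence all vectors that are orthogonal to $e$ are eigenvectors of $A$ with eigenvalue $1$. To $h_n := (1/\sqrt{n})\, e$ we choose $h_1, \dots, h_{n-1}$ such that $h_1, \dots, h_n$ is an orthonormal basis. We can therefore express any $\omega$ by
\[ \omega = \sum_{\nu=1}^{n} \alpha_\nu h_\nu \]
with $\alpha_\nu := \langle \omega, h_\nu\rangle$ and obtain with that
\[ \tau(\omega) = \frac{1}{\sqrt{n}}\, \alpha_n \ge 0 \]
and
\[ 0 \ge \langle \omega, A \omega\rangle = \alpha_1^2 + \dots + \alpha_{n-1}^2 - \frac{\beta^2}{n}\, \alpha_n^2 , \]
hence
\[ \alpha_1^2 + \dots + \alpha_{n-1}^2 \le \frac{\beta^2}{n}\, \alpha_n^2 . \]
With
\[ C_\beta := \Big\{ \sum_{\nu=1}^{n} \alpha_\nu h_\nu \ \Big|\ \frac{\beta}{\sqrt{n}}\, \alpha_n \ge \sqrt{\alpha_1^2 + \dots + \alpha_{n-1}^2}\ ;\ \alpha_1, \dots, \alpha_n \in \mathbb{R} \Big\} \]
we thus get $N_2(\beta) = \mathbb{R}^n_+ \cap C_\beta$. We will see that $C_\beta$ is a Lorentz cone with apex $0$ and symmetry axis $\{ \lambda e \mid \lambda \in \mathbb{R}_+ \}$, the bisecting line of the positive orthant, as well as a ‘circular’ cross section, orthogonal to the symmetry axis. For $\alpha_n = 1$ this cross section has radius $\beta / \sqrt{n}$ and lies in the hyperplane $H := \{ x \in \mathbb{R}^n \mid \langle e, x\rangle = \sum_{\nu=1}^{n} x_\nu = \sqrt{n} \}$. If we now consider a circle $K_r$ in $H$ with center $M = h_n =: \tilde{e}$ and radius $r \ge 0$, we have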
\[ K_r = \{ \tilde{e} + y \mid y \in \mathbb{R}^n,\ \langle e, y\rangle = 0,\ \| y \| \le r \} . \]
We would like to find out when exactly $K_r \subset \mathbb{R}^n_+$ holds. Preparatorily, we prove the following

Lemma 6.5.1 For $y \in \mathbb{R}^n$ with $\langle e, y\rangle = 0$ and $\| y \| \le 1$ it holds that $|y_k|^2 \le 1 - \frac{1}{n}$ for $k = 1, \dots, n$.

Proof: With the Cauchy–Schwarz inequality we firstly obtain
\[ y_k^2 = (-y_k)^2 = \Big( \sum_{\nu \ne k} 1 \cdot y_\nu \Big)^2 \le (n - 1) \sum_{\nu \ne k} y_\nu^2 \le (n - 1)(1 - y_k^2) . \]
From this it follows that $n\, y_k^2 \le n - 1$, and thus the desired result.

Now we are able to answer the question from above:
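Lemma 6.5.2 $K_r \subset \mathbb{R}^n_+$ holds if and only if the radius $r$ fulfills the inequality $r \le \frac{1}{\sqrt{n-1}}$.

[Figure: two sketches of $K_r$ with center $\tilde{e}$ and a boundary point $\tilde{x}$, one in the plane (axes x1, x2) and one in space (axes x1, x2, x3).]

Proof: For $n = 2$ this can be directly deduced from the left picture. To prove the statement for arbitrary $n \ge 2$, we start from the assumption $K_r \subset \mathbb{R}^n_+$: For
\[ \tilde{x} := \frac{\sqrt{n}}{n - 1} \sum_{\nu=2}^{n} e_\nu = \frac{\sqrt{n}}{n - 1}\, (e - e_1) \]
it holds that $\langle e, \tilde{x} - \tilde{e}\rangle = 0$ and $\| \tilde{x} - \tilde{e} \| = \frac{1}{\sqrt{n-1}}$ since
\[ \tilde{x} - \tilde{e} = \Big( \frac{\sqrt{n}}{n - 1} - \frac{\sqrt{n}}{n} \Big)\, e - \frac{\sqrt{n}}{n - 1}\, e_1 = \frac{\sqrt{n}}{n (n - 1)}\, (e - n e_1) \]
and
\[ \| \tilde{x} - \tilde{e} \|^2 = \frac{n}{n^2 (n - 1)^2} \big( (n - 1)^2 + (n - 1) \big) = \frac{1}{n - 1} . \]
Set
\[ y := r\, \frac{\tilde{x} - \tilde{e}}{\| \tilde{x} - \tilde{e} \|} = \frac{r}{\sqrt{n (n - 1)}}\, (e - n e_1) . \]
For that $\langle e, y\rangle = 0$ and $\| y \| = r$ hold, hence $x := \tilde{e} + y \in K_r \subset \mathbb{R}^n_+$. In particular, we have
\[ x_1 = \frac{1}{\sqrt{n}} - (n - 1)\, \frac{r}{\sqrt{n (n - 1)}} \ge 0 , \]
from which we obtain $r \le \frac{1}{\sqrt{n-1}}$.

Let now $r \le \frac{1}{\sqrt{n-1}}$ and $x = \tilde{e} + y$ with $\langle e, y\rangle = 0$ and $\| y \| \le r$, hence $x \in K_r$. With lemma 6.5.1 we have $\big| \frac{1}{r}\, y_k \big|^2 \le 1 - \frac{1}{n} = \frac{n - 1}{n}$ for $k = 1, \dots, n$.
\[ |y_k| \le r \sqrt{\frac{n - 1}{n}} \le \frac{1}{\sqrt{n}} \quad\text{for } k = 1, \dots, n \]
then yields $x = \tilde{e} + y \in \mathbb{R}^n_+$.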
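Corollary From the above we obtain the following condition for $\beta$:
\[ \beta = \sqrt{n}\, r \le \sqrt{\frac{n}{n - 1}} \]
Therefore $N_2(\beta) = C_\beta$ for $\beta \in [0, 1]$. In addition $N_2(\beta) \setminus \{0\} \subset \mathbb{R}^n_{++}$.

For the polyhedral cone
\[ N_{-\infty}(\gamma) = \{ \omega \in \mathbb{R}^n_+ \mid \omega_\nu \ge \gamma\, \tau_\omega \ \text{for } \nu = 1, \dots, n \} \]
with $\gamma \in [0, 1]$ we can show analogously
\[ N_{-\infty}(\gamma) = \{ \omega \in \mathbb{R}^n_+ \mid (I - \gamma\, h_n h_n^T)\, \omega \ge 0 \} = \Big\{ \sum_{\nu=1}^{n} \alpha_\nu h_\nu \ \Big|\ \alpha_n \ge 0,\ \sum_{\nu=1}^{n-1} \alpha_\nu h_\nu + (1 - \gamma)\, \alpha_n h_n \ge 0 \Big\} . \]
If $\beta \in [0, 1]$, we can furthermore get the following with lemma 6.5.1:
\[ N_2(\beta) \subset N_{-\infty}(\gamma) \quad\text{for } \gamma \le 1 - \beta \sqrt{(n - 1)/n} . \]
We will leave the proof as exercise 6 to the interested reader.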
We will now look at an algorithm which is a special case of the general algorithm, the short-step algorithm. In this algorithm we choose $\alpha = 1$ and $\sigma = 1 - \frac{2}{5\sqrt{n}}$.
6.6 A Short-Step Path-Following Algorithm

In this section we consider a short-step path-following algorithm. This algorithm works with the 2-norm neighborhood. The duality measure $\tau$ is steadily reduced to zero. Each search direction is a Newton direction toward a point on the central path, a point for which the normalized duality gap $\mu$ is less than or equal to the current duality measure $\tau$. The short-step algorithm we discuss generates strictly feasible iterates $(x^{(k)}, y^{(k)}, s^{(k)})$ whose search directions are obtained from system (10). As the target value we use $\mu = \sigma\tau$, where $\sigma$ is the centering parameter and $\tau$ the duality measure we introduced above. In the algorithm, we choose $\alpha = 1$ as the step length and for $\sigma$ we choose a constant value $\sigma := 1 - 2/(5\sqrt{n})$. We start at a point $(x^{(0)}, y^{(0)}, s^{(0)}) \in \mathcal{N}_2(\frac{1}{2})$ with $\tau_0 := \frac{1}{n} \langle x^{(0)}, s^{(0)}\rangle$. Then we take a Newton step, reduce $\tau$ and continue. The name short-step algorithm refers to algorithms which reduce $\tau$ in every step by a constant factor like $1 - 1/\sqrt{n}$ which only depends on the dimension of the problem.
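Short-step algorithm

Let a starting vector $(x^{(0)}, y^{(0)}, s^{(0)}) \in \mathcal{N}_2(\frac{1}{2})$ and an accuracy requirement $\varepsilon > 0$ be given.
Initialize $(x, y, s) := (x^{(0)}, y^{(0)}, s^{(0)})$ and $\tau := \frac{1}{n} \langle x, s\rangle$.
while $\tau > \varepsilon$
    Determine the Newton direction $(\Delta x, \Delta y, \Delta s)$ according to (10) and set
    $x := x - \Delta x$, $\quad y := y - \Delta y$, $\quad s := s - \Delta s$,
    $\tau := \Big( 1 - \dfrac{2}{5\sqrt{n}} \Big)\, \tau$.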
If we apply the algorithm to example 3 (cf. page 254) with the starting points $x = (0.3, 0.45, 0.25)^T$, $s = (9, 7, 6)^T$ and $y = -10$, we get the following visualization (after 57 iteration steps):

[Figure: iterates of the short-step algorithm in the triangle with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1); axes x1, x2, x3.]

Lemma 6.6.1 Let $(x^{(0)}, y^{(0)}, s^{(0)}) \in \mathcal{N}_2(\frac{1}{2})$ and
\[ \tau_0 \le \frac{1}{\varepsilon^\nu} \]
for some positive constant $\nu$ and a given $\varepsilon \in (0, 1)$. Then there exists an index $K \in \mathbb{N}$ with $K = O(\sqrt{n}\, |\ln(\varepsilon)|)$ such that $\tau_k \le \varepsilon$ for all $k \ge K$.

Proof: We set $\delta := \frac{2}{5}$ and $\omega := \frac{1}{2}$ and obtain the result with theorem 6.4.2 as we have, by our choice of $\sigma$ and $\alpha$, that $\tau_{k+1} = \sigma\, \tau_k = \Big( 1 - \dfrac{2}{5\sqrt{n}} \Big)\, \tau_k$.
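The short-step algorithm applied to example 3 can be sketched as follows. The sketch assumes a single constraint row `a`, so that system (10) is solved by one scalar division, and the function name `short_step` is hypothetical. The tolerance `eps = 1e-6` is also an assumption, since the text does not state the value of $\varepsilon$ used for the figure; with this choice the loop runs 57 times from the starting point given above, which matches the quoted iteration count.

```python
import math


def short_step(a, x, y, s, eps=1e-6):
    """Short-step path following for one constraint row a (m = 1)."""
    n = len(x)
    sigma = 1.0 - 2.0 / (5.0 * math.sqrt(n))
    tau = sum(x[i] * s[i] for i in range(n)) / n
    steps = 0
    while tau > eps:
        # Right-hand side XSe - sigma*tau*e of system (10)
        r_c = [x[i] * s[i] - sigma * tau for i in range(n)]
        M = sum(a[i] * a[i] * x[i] / s[i] for i in range(n))
        dy = -sum(a[i] * r_c[i] / s[i] for i in range(n)) / M
        ds = [-a[i] * dy for i in range(n)]
        dx = [(r_c[i] - x[i] * ds[i]) / s[i] for i in range(n)]
        # Full Newton step (alpha = 1)
        x = [x[i] - dx[i] for i in range(n)]
        y = y - dy
        s = [s[i] - ds[i] for i in range(n)]
        tau = sigma * tau
        steps += 1
    return x, y, s, tau, steps
```

A call such as `short_step([1, 1, 1], [0.3, 0.45, 0.25], -10.0, [9.0, 7.0, 6.0])` returns an iterate close to the minimizer $(0, 0, 1)^T$.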
We want to show that all iterates x(k) , y (k) , s(k) of path-following algorithms stay in the corresponding neighborhoods, i. e., for (x, y, s) ∈ N2 (β) it also follows that (x(α), y(α), s(α)) ∈ N2 (β) for all α ∈ [0, 1] (cf. p. 265). In order to be able to state this, we will first provide some technical results:
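Lemma 6.6.2 Let $u$ and $v$ be two vectors in $\mathbb{R}^n$ with $\langle u, v\rangle \ge 0$. Then
\[ \| U V e \| \le \sqrt{2^{-3}}\ \| u + v \|^2 . \qquad (13) \]

Proof: We partition the index set $I := \{1, \dots, n\}$ into $P := \{ i \in I \mid u_i v_i \ge 0 \}$ and $M := \{ i \in I \mid u_i v_i < 0 \}$. For $T \subset I$ we define the vector $z_T := \sum_{i \in T} u_i v_i\, e_i$. From $0 \le \langle u, v\rangle = \| z_P \|_1 - \| z_M \|_1$ we get
\[ \begin{aligned} \| U V e \|^2 &= \sum_{i=1}^{n} (u_i v_i)^2 = \| z_P \|^2 + \| z_M \|^2 \\ &\le \| z_P \|_1^2 + \| z_M \|_1^2 \le 2\, \| z_P \|_1^2 = 2 \Big( \sum_{i \in P} u_i v_i \Big)^2 \\ &\le 2 \Big( \frac{1}{4} \sum_{i \in P} (u_i + v_i)^2 \Big)^2 \le 2^{-3} \Big( \sum_{i=1}^{n} (u_i + v_i)^2 \Big)^2 = 2^{-3}\, \| u + v \|^4 . \end{aligned} \]

Remark For $n \ge 2$, $u := (1, -1, 0, \dots, 0) \in \mathbb{R}^n$ and $v := (1, 1, 0, \dots, 0) \in \mathbb{R}^n$ we get $\langle u, v\rangle = 0$ and
\[ \| U V e \| = \sqrt{2^{-3}}\ \| u + v \|^2 , \]
therefore the inequality (13) is sharp for $n \ge 2$.

Lemma 6.6.3 For $\beta \in [0, 1]$, $\omega \in N_2(\beta)$ and $\tau := \tau_\omega$ we have
\[ \min_\nu \omega_\nu \ge (1 - \beta)\, \tau . \]

Proof: For any index $\nu \in \{1, \dots, n\}$ we get $\tau - \omega_\nu \le |\omega_\nu - \tau| \le \| \omega - \tau e \| \le \beta\, \tau$ and therefore $(1 - \beta)\, \tau \le \omega_\nu$.

Corollary For $\beta \in [0, 1]$ we have $\mathcal{N}_2(\beta) \subset \mathcal{N}_{-\infty}(1 - \beta)$.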
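Lemma 6.6.4 For $(x, y, s) \in \mathcal{N}_2(\beta)$ we have
\[ \| \Delta X \Delta S e \| \le \frac{\beta^2 + n (1 - \sigma)^2}{\sqrt{2^3}\, (1 - \beta)}\ \tau . \]

Proof: We define the positive definite diagonal matrix $D := X^{1/2} S^{-1/2}$. If we multiply the third equation of (10) by $(XS)^{-1/2}$, we get
\[ D^{-1} \Delta x + D \Delta s = (XS)^{-1/2} (XSe - \sigma\tau e) . \]
Now we set $u := D^{-1} \Delta x$ and $v := D \Delta s$ and thus, with the help of lemma 6.4.1, we get $\langle u, v\rangle = 0$. Then we can apply lemma 6.6.2 and obtain
\[ \begin{aligned} \| \Delta X \Delta S e \| &= \| (D^{-1} \Delta X)(D \Delta S)\, e \| \le \sqrt{2^{-3}}\ \| D^{-1} \Delta x + D \Delta s \|^2 = \sqrt{2^{-3}}\ \| (XS)^{-1/2} (XSe - \sigma\tau e) \|^2 \\ &= \sqrt{2^{-3}} \sum_{\nu=1}^{n} \frac{(x_\nu s_\nu - \sigma\tau)^2}{x_\nu s_\nu} \le \sqrt{2^{-3}}\ \frac{\sum_{\nu=1}^{n} (x_\nu s_\nu - \sigma\tau)^2}{\min_\nu x_\nu s_\nu} = \sqrt{2^{-3}}\ \frac{\| XSe - \sigma\tau e \|^2}{\min_\nu x_\nu s_\nu} . \end{aligned} \]
By assumption $(x, y, s) \in \mathcal{N}_2(\beta)$ and so we have
\[ \begin{aligned} \| XSe - \sigma\tau e \|^2 &= \| (XSe - \tau e) + (1 - \sigma)\tau e \|^2 \\ &= \| XSe - \tau e \|^2 + 2 (1 - \sigma)\tau\, \langle e, XSe - \tau e\rangle + (1 - \sigma)^2 \tau^2 \langle e, e\rangle \\ &= \| XSe - \tau e \|^2 + (1 - \sigma)^2 \tau^2 n \le \beta^2 \tau^2 + (1 - \sigma)^2 \tau^2 n . \end{aligned} \]
With lemma 6.6.3 we obtain the desired result.

Lemma 6.6.5 For $(x, y, s) \in \mathcal{N}_2(\beta)$ and $\alpha \in [0, 1]$ we have
\[ \| X(\alpha) S(\alpha) e - \tau(\alpha) e \| \le |1 - \alpha|\ \| XSe - \tau e \| + \alpha^2\, \| \Delta X \Delta S e \| \qquad (14) \]
\[ \le |1 - \alpha|\, \beta\, \tau + \alpha^2\, \frac{\beta^2 + n (1 - \sigma)^2}{\sqrt{8}\, (1 - \beta)}\ \tau . \qquad (15) \]

Proof: From lemma 6.4.1 we get
\[ \begin{aligned} X(\alpha) S(\alpha) e - \tau(\alpha) e &= (x - \alpha \Delta x)(s - \alpha \Delta s) - \big( 1 - \alpha (1 - \sigma) \big) \tau e \\ &= xs - \alpha (s \Delta x + x \Delta s) + \alpha^2 \Delta x \Delta s - \big( 1 - \alpha (1 - \sigma) \big) \tau e \\ &\overset{(10)}{=} xs - \alpha (xs - \sigma\tau e) + \alpha^2 \Delta x \Delta s - (1 - \alpha + \alpha\sigma)\, \tau e \\ &= (1 - \alpha)(XSe - \tau e) + \alpha^2 \Delta X \Delta S e . \end{aligned} \]
This yields (14) and with lemma 6.6.4 we obtain (15).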
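In the next theorem we will show that all new iterates stay in the neighborhood, even for a full step with $\alpha = 1$. We will also supply a relation between $\beta$ and $\sigma$.

Theorem 6.6.6 Let the parameters $\beta \in (0, 1)$ and $\sigma \in (0, 1)$ be chosen to satisfy
\[ \frac{\beta^2 + n (1 - \sigma)^2}{\sqrt{8}\, (1 - \beta)} \le \sigma \beta . \qquad (16) \]
If $(x, y, s) \in \mathcal{N}_2(\beta)$, we have $\big( x(\alpha), y(\alpha), s(\alpha) \big) \in \mathcal{N}_2(\beta)$ for all $\alpha \in [0, 1]$.

Proof: By substituting (16) into (15) we get
\[ \| X(\alpha) S(\alpha) e - \tau(\alpha) e \| \le (1 - \alpha)\, \beta\, \tau + \alpha^2 \sigma \beta\, \tau \le (1 - \alpha + \sigma\alpha)\, \beta\, \tau = \beta\, \tau(\alpha) \quad\text{(by lemma 6.4.1.b)} \qquad (17) \]
for $\alpha \in [0, 1]$. It remains to show that $x(\alpha) \in F_P^0$ and $\big( y(\alpha), s(\alpha) \big) \in F_{D_e}^0$. By (10) it holds that:
\[ A x(\alpha) = A (x - \alpha \Delta x) = Ax - \alpha A \Delta x = b \]
and
\[ A^T y(\alpha) + s(\alpha) = A^T (y - \alpha \Delta y) + s - \alpha \Delta s = A^T y - \alpha (A^T \Delta y + \Delta s) + s = c . \]
We still need $x(\alpha) > 0$ and $s(\alpha) > 0$. We know that $(x, s) > 0$. With (17), lemma 6.6.3 and lemma 6.4.1 we get
\[ x_\nu(\alpha)\, s_\nu(\alpha) \ge (1 - \beta)\, \tau(\alpha) = (1 - \beta) \big( 1 - \alpha (1 - \sigma) \big)\, \tau > 0 \]
for $\nu \in \{1, \dots, n\}$. It follows that we can neither have $x_\nu(\alpha) = 0$ nor $s_\nu(\alpha) = 0$ and therefore $x(\alpha) > 0$ and $s(\alpha) > 0$ by continuity, hence $\big( x(\alpha), y(\alpha), s(\alpha) \big) \in F^0$.

We can easily verify that the parameters we chose in the short-step method satisfy (16):
\[ \frac{\big( \frac{1}{2} \big)^2 + n \big( \frac{2}{5\sqrt{n}} \big)^2}{\sqrt{8}\, \big( 1 - \frac{1}{2} \big)} = \frac{41}{100 \sqrt{2}} \le \frac{3}{10} \le \frac{1}{2} \Big( 1 - \frac{2}{5\sqrt{n}} \Big) \]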
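We illustrate inequality (16) with a little picture for the graphs of the functions $f$ and $g$ defined by
\[ f(d) := \frac{\beta^2 + n (1 - \sigma)^2}{\sqrt{8}\, (1 - \beta)} \quad\text{and}\quad g(d) := \sigma \beta \quad\text{for } d \in [0, 1] \]
with $\sigma := 1 - \frac{d}{\sqrt{n}}$ and $\beta := \frac{1}{2}$ for $n = 1$:

[Figure: graphs of f and g over d ∈ [0, 1], with d = 0.4 marked on the axis.]

We see that $f(d) \le g(d)$ for all $d \in [0, 0.4]$.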
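The comparison of $f$ and $g$ can also be checked numerically. In the sketch below (function names `f` and `g` follow the text, the grid of test points is our own choice) we use that $\sigma = 1 - d/\sqrt{n}$ gives $n (1 - \sigma)^2 = d^2$, so $f$ does not depend on $n$, while $g$ grows with $n$; the case $n = 1$ is therefore the most restrictive one.

```python
import math


def f(d, beta=0.5):
    """Left-hand side of (16); with sigma = 1 - d/sqrt(n) we have n*(1 - sigma)**2 = d**2."""
    return (beta ** 2 + d ** 2) / (math.sqrt(8.0) * (1.0 - beta))


def g(d, n, beta=0.5):
    """Right-hand side sigma * beta of (16) with sigma = 1 - d/sqrt(n)."""
    return (1.0 - d / math.sqrt(n)) * beta


# f(d) <= g(d) on a grid of [0, 0.4] for n = 1
grid_ok = all(f(k / 1000.0) <= g(k / 1000.0, 1) for k in range(401))
```

At $d = 0.4$ we get $f(0.4) = 0.41/\sqrt{2} \approx 0.2899$ and $g(0.4) = 0.3$ for $n = 1$; slightly beyond, at $d = 0.41$, the inequality already fails for $n = 1$.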
6.7 The Mizuno–Todd–Ye Predictor-Corrector Method
The Mizuno–Todd–Ye predictor-corrector method is one of the most remarkable interior-point methods for linear and, more generally, for quadratic optimization. It is based on a simple idea that is also used in the numerical solution of differential equations.
In the short-step method we discussed above we chose a value slightly less than 1 for $\sigma$. By this choice we were able to ensure that the iterates did not approach the boundary of the feasible set too fast but at the same time $\tau$ could only slowly converge to zero. Therefore the algorithm made only slow progress toward the solution. Better results are obtained with the predictor-corrector method developed by Mizuno, Todd and Ye in 1993 (cf. [MTY]). It uses two nested 2-norm neighborhoods. In this method even-index iterates (i. e., $(x^{(k)}, y^{(k)}, s^{(k)})$ with $k$ even) are confined to the inner neighborhood, whereas the odd-index iterates are allowed to stay in the outer neighborhood but not beyond. Every second step of this method is a predictor step which starts in the inner neighborhood and moves to the boundary of a slightly larger neighborhood to compute a rough approximation. Between these predictor steps, the algorithm takes corrector steps. The corrector steps enable us to bring the iterates toward the central path and to get back into the inner neighborhood. Then we can start with the next predictor step. This algorithm is also a special case of the general algorithm. In the predictor step we choose $\sigma = 0$ to reduce $\tau$. According to lemma 6.4.1, these steps reduce the value of $\tau$ by a factor of $(1 - \alpha)$ where $\alpha$ is the step length. In the corrector steps we choose $\sigma = 1$ to improve centrality; they leave $\tau$ unchanged. By moving back into the inner neighborhood, the corrector step allows the algorithm to make better progress in the next predictor step.
To describe the algorithm, suppose that two constants $\beta, \bar{\beta} \in [0, 1]$ with $\beta < \bar{\beta}$ are given. We assume the inner neighborhood to be $\mathcal{N}_2(\beta)$ and the outer to be $\mathcal{N}_2(\bar{\beta})$.
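Mizuno–Todd–Ye predictor-corrector algorithm

Let a starting vector $(x^{(0)}, y^{(0)}, s^{(0)}) \in \mathcal{N}_2(\beta)$ and $\varepsilon > 0$ be given.
Initialize $(x, y, s) := (x^{(0)}, y^{(0)}, s^{(0)})$ and $\tau := \frac{1}{n} \langle x^{(0)}, s^{(0)}\rangle$.
while $\tau > \varepsilon$
    Predictor step
    Solve (10) with $\sigma = 0$, i. e., solve:
    \[ \begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ XSe \end{pmatrix} \qquad (18) \]
    Choose $\alpha$ as the largest value in $(0, 1]$ such that $\big( x(\alpha), y(\alpha), s(\alpha) \big) \in \mathcal{N}_2(\bar{\beta})$ and set $(x, y, s) := \big( x(\alpha), y(\alpha), s(\alpha) \big)$, $\tau := (1 - \alpha)\, \tau$.
    Corrector step
    Solve (10) with $\sigma = 1$, i. e., solve
    \[ \begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ XSe - \tau e \end{pmatrix} \qquad (19) \]
    and set $(x, y, s) := (x, y, s) - (\Delta x, \Delta y, \Delta s)$.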
We consider example 3 (cf. page 254) again to illustrate the algorithm. The picture shows the first iterates of the algorithm for the starting points $x = (0.32, 0.27, 0.41)^T$, $s = (11, 9, 8)^T$ and $y = -12$ with the neighborhoods $\mathcal{N}_2(1/4)$ and $\mathcal{N}_2(1/2)$:

[Figure: first iterates of the predictor-corrector algorithm and the two neighborhoods in the triangle with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1); axes x1, x2, x3.]
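With $\beta := 1/4$ and $\bar\beta := 1/2$ we will have a closer look at the predictor step:
Lemma 6.7.1
Suppose that $(x, y, s) \in N_2(1/4)$ and let $(\Delta x, \Delta y, \Delta s)$ be calculated from (10) with σ = 0 (predictor step). Then $(x(\alpha), y(\alpha), s(\alpha)) \in N_2(1/2)$ holds for all $\alpha \in [0, \underline{\alpha}]$ with
$$\underline{\alpha} := \min\left\{ \frac{1}{2},\ \left( \frac{\tau}{8\,\|\Delta X \Delta S e\|} \right)^{1/2} \right\}. \qquad (20)$$
Hence, the predictor step has at least length $\underline{\alpha}$ and the new value of τ is at most $(1 - \underline{\alpha})\tau$.
Proof: With (14) we get
$$\begin{aligned}
\|X(\alpha)S(\alpha)e - \tau(\alpha)e\| &\le |1 - \alpha|\,\|XSe - \tau e\| + \alpha^2\,\|\Delta X \Delta S e\| \\
&\le |1 - \alpha|\,\|XSe - \tau e\| + \frac{\tau}{8\,\|\Delta X \Delta S e\|}\,\|\Delta X \Delta S e\| && \text{by (20)} \\
&\le \tfrac{1}{4}(1 - \alpha)\tau + \tfrac{1}{8}\tau && \text{as } (x, y, s) \in N_2(1/4) \\
&\le \tfrac{1}{4}(1 - \alpha)\tau + \tfrac{1}{4}(1 - \alpha)\tau && \text{since } \alpha \in [0, 1/2] \\
&= \tfrac{1}{2}\,\tau(\alpha) && \text{by lemma 6.4.1.b with } \sigma = 0
\end{aligned} \qquad (21)$$
for $\alpha \in [0, \underline{\alpha}]$. So far we have shown that $(x(\alpha), y(\alpha), s(\alpha))$ satisfies the proximity condition for $N_2(1/2)$. With (21), lemma 6.6.3 and lemma 6.4.1.b we get
$$x_\nu(\alpha)\,s_\nu(\alpha) \ge \tfrac{1}{2}\,\tau(\alpha) = \tfrac{1}{2}(1 - \alpha)\tau > 0$$
for ν ∈ {1, ..., n}. Using the same arguments as in the proof of theorem 6.6.6 we conclude $(x(\alpha), y(\alpha), s(\alpha)) \in F^0$.
From lemma 6.6.4 we get with β = 1/4 and σ = 0
$$\frac{\tau}{8\,\|\Delta X \Delta S e\|} \ge \frac{1}{8} \cdot \frac{\sqrt{8} \cdot 3/4}{1/16 + n} = \frac{3\sqrt{2}}{1 + 16n} \ge \frac{3\sqrt{2}}{17n} \ge \frac{0.49^2}{n}.$$
This shows
$$\underline{\alpha} \ge \min\left\{ \frac{1}{2},\ \frac{0.49}{\sqrt{n}} \right\} = \frac{0.49}{\sqrt{n}}.$$
The following lemma gives a description of the corrector steps for the case $\beta = \bar\beta^2$, hence especially for β = 1/4 and $\bar\beta = 1/2$. A corrector step taken from any point of $N_2(\bar\beta)$ leads into the inner neighborhood $N_2(\beta)$ without changing the value of τ: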
Lemma 6.7.2
Suppose that $(x, y, s) \in N_2(\bar\beta)$ for $\bar\beta \in \left[0, \frac{\sqrt{8} - 1}{\sqrt{8}}\right]$ ⁵ and let $(\Delta x, \Delta y, \Delta s)$ be calculated from (10) with σ = 1 (corrector step). Then $(x(1), y(1), s(1))$ belongs to $N_2(\bar\beta^2)$ with τ(1) = τ.
Proof: τ(1) = τ follows immediately from 6.4.1.b (with σ = 1). Then we substitute α = 1 and σ = 1 into (15) and obtain
$$\|X(1)S(1)e - \tau(1)e\| \le \frac{\bar\beta^2}{\sqrt{8}\,(1 - \bar\beta)}\,\tau \le \bar\beta^2\,\tau = \bar\beta^2\,\tau(1). \qquad (22)$$
So we know that $(x(1), y(1), s(1))$ satisfies the proximity condition for $N_2(\bar\beta^2)$. As in the proof of theorem 6.6.6, we get that $(x(1), y(1), s(1))$ belongs to $F^0$.
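The second inequality in (22) is where the upper bound on the radius of the outer neighborhood enters. Spelled out (writing $\bar\beta$ for that radius and assuming $0 < \bar\beta < 1$; for $\bar\beta = 0$ there is nothing to show):

```latex
\frac{\bar\beta^{2}}{\sqrt{8}\,(1-\bar\beta)} \le \bar\beta^{2}
\iff \sqrt{8}\,(1-\bar\beta) \ge 1
\iff \bar\beta \le 1 - \frac{1}{\sqrt{8}} = \frac{\sqrt{8}-1}{\sqrt{8}} .
```

For $\bar\beta = 1/2$ the left-hand side equals $1/(2\sqrt{8}) \approx 0.177$, which is indeed smaller than $\bar\beta^2 = 1/4$.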
We know that the value of τ does not change in a corrector step, but because the predictor step reduces τ significantly, we can state the same kind of complexity result as for the short-step algorithm:
Lemma 6.7.3
Suppose that $(x^{(0)}, y^{(0)}, s^{(0)}) \in N_2(\bar\beta^2)$. If
$$\tau_0 \le \frac{1}{\varepsilon^{\nu}}$$
holds for some positive constant ν, then there exists an index $K \in \mathbb{N}$ with $K = O(\sqrt{n}\,|\ln(\varepsilon)|)$ such that $\tau_k \le \varepsilon$ for all k ≥ K.
Proof: We have shown above that any corrector step leaves the value of τ unchanged. So we know that $\tau_{k+2} = \tau_{k+1}$ for k = 0, 2, 4, .... In the proof of lemma 6.6.4 we saw (with σ = 0 for a predictor step)
$$\|\Delta X \Delta S e\| \le 2^{-3/2}\,\bigl\|(XS)^{-1/2}(XSe)\bigr\|^2 = 2^{-3/2}\, n\tau.$$
With (20) we thus get for n > 1
$$\underline{\alpha} \ge \min\left\{ \frac{1}{2},\ \frac{1}{\sqrt[4]{8}\,\sqrt{n}} \right\} = \frac{1}{\sqrt[4]{8}\,\sqrt{n}}$$
and therefore (compare lemma 6.7.1) we have
$$\tau_{k+2} = \tau_{k+1} \le \left( 1 - \frac{1}{\sqrt[4]{8}\,\sqrt{n}} \right) \tau_k \quad \text{for } k = 0, 2, 4, \ldots.$$
For n = 1 we have $\underline{\alpha} \ge 1/2$ and so
$$\tau_{k+2} = \tau_{k+1} \le \frac{1}{2}\,\tau_k \quad \text{for } k = 0, 2, 4, \ldots.$$
If we set $\delta := 1/\sqrt[4]{8} = 0.5946\ldots$ and ω := 1/2 in the case of n > 1 and δ := 1/2, ω arbitrary for n = 1, almost all the prerequisites for theorem 6.4.2 are given. The odd iterates are missing in the inequality above. This fact, however, does not affect the given proof.

5 $(\sqrt{8} - 1)/\sqrt{8} = 0.646\ldots$
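The step length bound used in this proof follows from (20) by inserting the estimate for $\|\Delta X \Delta S e\|$. Written out as a short check (with $\underline{\alpha}$ denoting the lower bound on the step length from (20)):

```latex
\frac{\tau}{8\,\|\Delta X \Delta S e\|}
\;\ge\; \frac{\tau}{8 \cdot 2^{-3/2}\, n\,\tau}
\;=\; \frac{1}{\sqrt{8}\, n},
\qquad\text{hence}\qquad
\underline{\alpha} \;\ge\; \min\Bigl\{ \tfrac{1}{2},\ \frac{1}{\sqrt[4]{8}\,\sqrt{n}} \Bigr\}.
```

For $n \ge 2$ we have $\sqrt[4]{8}\,\sqrt{n} \ge 1.68 \cdot 1.41 > 2$, so the minimum is attained by the second term; for n = 1 the second term is 0.5946... and the minimum is 1/2.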
6.8 The Long-Step Path-Following Algorithm
Contrary to the short-step algorithm the long-step algorithm does not use the 2-norm neighborhoods $N_2(\beta)$ but the neighborhoods $N_{-\infty}(\gamma)$ defined by
$$N_{-\infty}(\gamma) := \bigl\{ (x, y, s) \in F^0 \mid x_\nu s_\nu \ge \gamma\,\tau \ \text{ for all } \nu = 1, \ldots, n \bigr\}$$
for some γ ∈ [0, 1]. This neighborhood is much more expansive than the $N_2(\beta)$ neighborhood: the $N_2(\beta)$ neighborhood contains only a small area of the strictly feasible set, whereas with the $N_{-\infty}(\gamma)$ neighborhood we get almost the entire strictly feasible set if γ is small. For γ close to zero this choice of neighborhood allows the long-step method to make fast progress. We obtain the search direction by solving system (10) and choose the step length α as large as possible such that we still stay in the neighborhood $N_{-\infty}(\gamma)$. Another difference from the short-step method is the choice of the centering parameter σ. In the short-step algorithm we choose σ slightly smaller than 1, whereas in the long-step algorithm we are a little more aggressive. We take a smaller value for σ which lies between a lower bound $\sigma_{\min}$ and an upper bound $\sigma_{\max}$, both independent of the dimension of the problem, to ensure a larger decrease in τ. The lower bound $\sigma_{\min}$ ensures that each search direction begins by moving away from the boundary of $N_{-\infty}(\gamma)$ and closer to the central path. In the short-step method we have a constant value for the centering parameter and α is constantly 1; in the long-step method we choose a new σ in every iteration and have to perform a line search to choose α as large as possible such that we still stay in the neighborhood $N_{-\infty}(\gamma)$.
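The difference in size between the two kinds of neighborhood can be seen on a small example. The sketch below checks only the proximity conditions and the positivity of x and s; membership in F⁰ (the equality constraints) is assumed and not tested. The function names and the sample vectors are hypothetical.

```python
import math

def duality_measure(x, s):
    """tau = <x, s> / n."""
    return sum(xi * si for xi, si in zip(x, s)) / len(x)

def in_n2(x, s, beta):
    """Proximity condition of N_2(beta): ||XSe - tau e||_2 <= beta * tau."""
    tau = duality_measure(x, s)
    dist = math.sqrt(sum((xi * si - tau) ** 2 for xi, si in zip(x, s)))
    return min(x) > 0 and min(s) > 0 and dist <= beta * tau

def in_n_minus_inf(x, s, gamma):
    """Proximity condition of N_-inf(gamma): x_i s_i >= gamma * tau for all i."""
    tau = duality_measure(x, s)
    return min(x) > 0 and min(s) > 0 and all(
        xi * si >= gamma * tau for xi, si in zip(x, s))

# The products x_i s_i are (1, 2, 3), so tau = 2.
x = [1.0, 1.0, 1.0]
s = [1.0, 2.0, 3.0]
print(in_n2(x, s, 0.5))           # False: ||(-1, 0, 1)|| = 1.41... > 0.5 * 2
print(in_n_minus_inf(x, s, 0.4))  # True: min x_i s_i = 1 >= 0.4 * 2
```

The pair fails even the generous 2-norm radius 1/2, yet it lies comfortably inside the wide neighborhood with γ = 0.4.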
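Long-step algorithm
Let γ ∈ (0, 1), $\sigma_{\min}$ and $\sigma_{\max}$ with $0 < \sigma_{\min} \le \sigma_{\max} < 1$, a starting vector $(x^{(0)}, y^{(0)}, s^{(0)}) \in N_{-\infty}(\gamma)$ and an accuracy requirement ε > 0 be given.
Initialize $(x, y, s) := (x^{(0)}, y^{(0)}, s^{(0)})$ and $\tau := \frac{1}{n}\langle x, s\rangle$.
while τ > ε
  Choose $\sigma \in [\sigma_{\min}, \sigma_{\max}]$.
  Determine the Newton step $(\Delta x, \Delta y, \Delta s)$ at (x, y, s) according to (10).
  Choose α as the largest value in (0, 1] such that $(x(\alpha), y(\alpha), s(\alpha)) \in N_{-\infty}(\gamma)$.
  Set $(x, y, s) := (x(\alpha), y(\alpha), s(\alpha))$ and $\tau := \bigl(1 - \alpha(1 - \sigma)\bigr)\tau$.

Lemma 6.8.1
If $(x, y, s) \in N_{-\infty}(\gamma)$, then
$$\|\Delta X \Delta S e\| \le 2^{-3/2} \left( 1 + \frac{1}{\gamma} \right) n\tau.$$
Proof: As in the proof of lemma 6.6.4 we obtain the following by considering $\langle x, s\rangle = n\tau$, $\langle e, e\rangle = n$, $x_\nu s_\nu \ge \gamma\tau$ for ν = 1, ..., n and σ ∈ (0, 1):
$$\begin{aligned}
\|\Delta X \Delta S e\| &\le 2^{-3/2}\,\bigl\|(XS)^{-1/2}(XSe - \sigma\tau e)\bigr\|^2 \\
&= 2^{-3/2}\,\bigl\|(XS)^{1/2}e - \sigma\tau\,(XS)^{-1/2}e\bigr\|^2 \\
&\le 2^{-3/2} \left( \langle x, s\rangle - 2\sigma\tau n + \sigma^2\tau^2\,\frac{n}{\gamma\tau} \right) \\
&= 2^{-3/2}\, n\tau \left( 1 - 2\sigma + \frac{\sigma^2}{\gamma} \right) \\
&\le 2^{-3/2}\, n\tau \left( 1 + \frac{1}{\gamma} \right)
\end{aligned}$$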
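Theorem 6.8.2
If $(x, y, s) \in N_{-\infty}(\gamma)$, then we have $(x(\alpha), y(\alpha), s(\alpha)) \in N_{-\infty}(\gamma)$ for all $\alpha \in [0, \underline{\alpha}]$ with
$$\underline{\alpha} := \sqrt{8}\,\gamma\,\sigma\,\frac{1 - \gamma}{n\,(1 + \gamma)}.$$
Proof: With lemma 6.8.1 it follows for ν = 1, ..., n:
$$|\Delta x_\nu\,\Delta s_\nu| \le \|\Delta X \Delta s\| \le 2^{-3/2}\, n\tau \left( 1 + \frac{1}{\gamma} \right).$$
By using $S\,\Delta x + X\,\Delta s = XSe - \sigma\tau e$ (the third row of (10)) once more, we get with $x_\nu s_\nu \ge \gamma\tau$ that (all products of vectors taken componentwise)
$$\begin{aligned}
x(\alpha)s(\alpha) &= xs - \alpha\,(s\Delta x + x\Delta s) + \alpha^2\,\Delta x\Delta s \\
&\ge xs\,(1 - \alpha) + \alpha\sigma\tau e - \alpha^2\,|\Delta x\Delta s| \\
&\ge \gamma\tau(1 - \alpha)e + \alpha\sigma\tau e - \alpha^2\, 2^{-3/2}\, n\tau \left( 1 + \frac{1}{\gamma} \right) e.
\end{aligned}$$
By lemma 6.4.1 we know that $\tau(\alpha) = \bigl(1 - \alpha(1 - \sigma)\bigr)\tau$. So we get
$$x(\alpha)s(\alpha) \ge \gamma\,\tau(\alpha)\,e \quad\text{if}\quad \gamma\tau(1 - \alpha) + \alpha\sigma\tau - \alpha^2\, 2^{-3/2}\, n\tau \left( 1 + \frac{1}{\gamma} \right) \ge \gamma\,(1 - \alpha + \alpha\sigma)\,\tau,$$
and that is the case iff $\alpha \le \sqrt{8}\,\gamma\,\sigma\,\frac{1}{n}\,\frac{1 - \gamma}{1 + \gamma} = \underline{\alpha}$. Hence, for $\alpha \in [0, \underline{\alpha}]$ the triple $(x(\alpha), y(\alpha), s(\alpha))$ satisfies the proximity condition. It remains to show that $(x(\alpha), y(\alpha), s(\alpha)) \in F^0$ holds for these α: Analogously to the proof of theorem 6.6.6 we get $Ax(\alpha) = b$ and $A^T y(\alpha) + s(\alpha) = c$. As γ ∈ (0, 1), we know that γ(1 − γ) ≤ 1/4 holds and so
$$\underline{\alpha} = \frac{\sqrt{8}}{n}\,\sigma\,\gamma\,\frac{1 - \gamma}{1 + \gamma} \le \frac{\sqrt{8}}{n} \cdot \frac{1}{4} \cdot \frac{\sigma}{1 + \gamma} < \frac{1}{\sqrt{2}\,n} < 1.$$
We have shown above that $x(\alpha)s(\alpha) \ge \gamma\,\bigl(1 - \alpha(1 - \sigma)\bigr)\tau e > 0$ holds for all $\alpha \in [0, \underline{\alpha}]$, and we know that x(0) > 0 and s(0) > 0. As in the proof of theorem 6.6.6 we conclude $(x(\alpha), y(\alpha), s(\alpha)) \in F^0$.

With this theorem we have obtained a lower bound $\underline{\alpha}$ for the step length α in the long-step algorithm.
Lemma 6.8.3
If the parameters γ, $\underline{\sigma} := \sigma_{\min}$ and $\overline{\sigma} := \sigma_{\max}$ are given, there exists a constant δ > 0 independent of n such that
$$\tau_{k+1} \le \left( 1 - \frac{\delta}{n} \right) \tau_k \quad \text{for } k = 1, 2, 3, \ldots.$$
Proof: By lemma 6.4.1 and as $\alpha \ge \underline{\alpha}$, we get
$$\tau_{k+1} = \bigl(1 - \alpha(1 - \sigma)\bigr)\,\tau_k \le \left( 1 - \frac{\sqrt{8}}{n}\,\gamma\,\frac{1 - \gamma}{1 + \gamma}\,\sigma(1 - \sigma) \right) \tau_k.$$
The quadratic function σ ↦ σ(1 − σ) is strictly concave and therefore attains its minimum over the compact interval $[\underline{\sigma}, \overline{\sigma}] \subset (0, 1)$ at one of the endpoints. So we obtain
$$\sigma(1 - \sigma) \ge \min\bigl\{ \underline{\sigma}(1 - \underline{\sigma}),\ \overline{\sigma}(1 - \overline{\sigma}) \bigr\} =: \widetilde{\sigma}$$
for all $\sigma \in [\underline{\sigma}, \overline{\sigma}]$. Now,
$$\delta := \sqrt{8}\,\gamma\,\frac{1 - \gamma}{1 + \gamma}\,\widetilde{\sigma} > 0$$
completes the proof.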
Theorem 6.8.4
For ε, γ ∈ (0, 1) suppose that the starting point $(x^{(0)}, y^{(0)}, s^{(0)}) \in N_{-\infty}(\gamma)$ satisfies $\tau_0 \le 1/\varepsilon^{\nu}$ for some positive constant ν. Then there exists an index K with $K = O(n\,|\ln \varepsilon|)$ such that $\tau_k \le \varepsilon$ for all k ≥ K.
Proof: By theorem 6.4.2 and lemma 6.8.3 we get the result with ω := 1 .
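The constants behind this complexity result are easy to evaluate. The following sketch computes the step length bound of theorem 6.8.2 and the constant δ of lemma 6.8.3, and checks numerically that the inequality used in the proof of theorem 6.8.2 turns into an equality exactly at that bound. The function names and the parameter values in the example are hypothetical.

```python
import math

def alpha_lower_bound(gamma, sigma, n):
    """Step length bound sqrt(8)*gamma*sigma*(1-gamma) / (n*(1+gamma))."""
    return math.sqrt(8.0) * gamma * sigma * (1.0 - gamma) / (n * (1.0 + gamma))

def proximity_margin(alpha, gamma, sigma, n):
    """Left side minus right side of the inequality in the proof, divided by tau."""
    lhs = (gamma * (1.0 - alpha) + alpha * sigma
           - alpha ** 2 * 2.0 ** (-1.5) * n * (1.0 + 1.0 / gamma))
    rhs = gamma * (1.0 - alpha + alpha * sigma)
    return lhs - rhs

def reduction_constant(gamma, sigma_min, sigma_max):
    """delta from lemma 6.8.3, so that tau_{k+1} <= (1 - delta/n) * tau_k."""
    worst = min(sigma_min * (1.0 - sigma_min), sigma_max * (1.0 - sigma_max))
    return math.sqrt(8.0) * gamma * (1.0 - gamma) / (1.0 + gamma) * worst

gamma, sigma, n = 0.1, 0.5, 10
a = alpha_lower_bound(gamma, sigma, n)
print(a)
print(proximity_margin(0.5 * a, gamma, sigma, n) > 0)  # inside the bound
print(proximity_margin(2.0 * a, gamma, sigma, n) < 0)  # beyond the bound
print(reduction_constant(gamma, 0.2, 0.8))
```

The margin is a concave quadratic in α with roots 0 and the bound, which is why the proximity condition is guaranteed on the whole interval between them.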
6.9 The Mehrotra Predictor-Corrector Method
In the algorithms we have discussed so far we always needed a strictly feasible starting point. For most problems such a starting point is difficult to find. For some problems a strictly feasible starting point does not even exist (compare example 2). Infeasible algorithms just require that $x^{(0)}, s^{(0)} > 0$ and neglect the feasibility. In this section we will look at an infeasible interior-point algorithm, the Mehrotra predictor-corrector method. This algorithm generates a sequence of iterates $(x^{(k)}, y^{(k)}, s^{(k)})$ permitted to be infeasible with $x^{(k)}, s^{(k)} > 0$. It departs from the general algorithm and also differs from the Mizuno–Todd–Ye algorithm: unlike that method, Mehrotra's algorithm does not require an extra matrix factorization for the corrector step. Mehrotra obtained a breakthrough via a skillful synthesis of known ideas and path-following techniques. Since 1992, a lot of the interior-point software for linear optimization has been based on Mehrotra's algorithm. The method is among the most effective for solving wide classes of linear problems. The improvement is obtained by including the effect of the second-order term. Unfortunately no rigorous convergence theory is known for Mehrotra's algorithm. There is a gap between the practical behavior and the theoretical results, in favor of the practical behavior. Obviously, this gap is unsatisfactory. Mehrotra's approach uses three directions at each iteration, the predictor, the corrector and a centering direction, where the centering and the corrector
step are accomplished together. As the method considers possibly infeasible points, we not only have to check that τ is small enough but also that the residuals Ax − b and $A^T y + s - c$ are small. One big advantage of this algorithm is the adaptive choice of σ. We can choose a new σ in every step instead of choosing a fixed value for it in the beginning and keeping it. In the predictor step we solve
$$\begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x^N \\ \Delta y^N \\ \Delta s^N \end{pmatrix} = \begin{pmatrix} A^T y + s - c \\ Ax - b \\ Xs \end{pmatrix} =: \begin{pmatrix} r_d \\ r_p \\ r_c \end{pmatrix}, \qquad (23)$$
where $(\Delta x^N, \Delta y^N, \Delta s^N)$ denotes the affine-scaling or Newton direction. Then we calculate the largest step lengths $\alpha_p^N, \alpha_d^N \in (0, 1]$ such that
$$x(\alpha_p^N) := x - \alpha_p^N\,\Delta x^N \ge 0, \qquad s(\alpha_d^N) := s - \alpha_d^N\,\Delta s^N \ge 0$$
by setting
$$\alpha_p^N := \min\left\{ 1,\ \min_{\Delta x_i^N > 0} \frac{x_i}{\Delta x_i^N} \right\}, \qquad \alpha_d^N := \min\left\{ 1,\ \min_{\Delta s_i^N > 0} \frac{s_i}{\Delta s_i^N} \right\}.$$
Thereby we get a damped Newton step for the optimality conditions (1). $\alpha_p^N$ and $\alpha_d^N$ give the maximum step lengths that can be taken without violating the nonnegativity condition. Now we compute $\tau^N := \frac{1}{n}\bigl\langle x(\alpha_p^N), s(\alpha_d^N) \bigr\rangle$ and with
$$\sigma := \left( \frac{\tau^N}{\tau} \right)^3$$
obtain a heuristic choice of the centering parameter σ at each iteration. This choice has been shown to work well in practice. If $\tau^N$ is a lot smaller than τ, then we have a small σ and the duality gap is reduced significantly by the Newton step. The next step is to approximate the central path in a corrector step. We solve the equation
$$\begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} A^T y + s - c \\ Ax - b \\ Xs - \sigma\tau e \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ \Delta X^N \Delta s^N \end{pmatrix}. \qquad (24)$$
We see that we use the same coefficient matrix in the corrector and the predictor step; only the right-hand sides differ slightly. Hence only one matrix factorization is necessary at each iteration. In this step we set, with η ∈ (0, 1) such that η ≈ 1 (for example η = 0.99),
$$\alpha_p := \min\left\{ 1,\ \eta \min_{\Delta x_i > 0} \frac{x_i}{\Delta x_i} \right\} \quad\text{and}\quad \alpha_d := \min\left\{ 1,\ \eta \min_{\Delta s_i > 0} \frac{s_i}{\Delta s_i} \right\}.$$
$\min_{\Delta x_i > 0} x_i/\Delta x_i$ and $\min_{\Delta s_i > 0} s_i/\Delta s_i$ give the maximum step lengths that can be taken without violating the nonnegativity conditions x ≥ 0 and s ≥ 0. Here we choose values slightly less than this maximum. In the context of general nonlinear equations, the idea of reusing the first derivative to accelerate the convergence of Newton's method is well-known; in [Ca/Ma], for example, it is cited as the Chebyshev method. We can understand the corrector step as a modified Chebyshev method: We want to find the zeros of $F_{\sigma\tau}$ given by
$$F_{\sigma\tau}(x, y, s) = \begin{pmatrix} A^T y + s - c \\ Ax - b \\ Xs - \sigma\tau e \end{pmatrix}.$$
It holds that
$$\begin{aligned}
F_{\sigma\tau}(x - \Delta x,\, y - \Delta y,\, s - \Delta s) &= \begin{pmatrix} A^T(y - \Delta y) + (s - \Delta s) - c \\ A(x - \Delta x) - b \\ (X - \Delta X)(s - \Delta s) - \sigma\tau e \end{pmatrix} \\
&= F_{\sigma\tau}(x, y, s) - \begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ \Delta X \Delta s \end{pmatrix} \\
&= F_{\sigma\tau}(x, y, s) - DF_{\sigma\tau}(x, y, s) \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} + \tfrac{1}{2}\, D^2 F_{\sigma\tau}(x, y, s) \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix}^2.
\end{aligned}$$
Hence we solve
$$\begin{pmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = F_{\sigma\tau}(x, y, s) + \begin{pmatrix} 0 \\ 0 \\ \Delta X^N \Delta s^N \end{pmatrix}.$$
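Mehrotra predictor-corrector algorithm
Let $(x^{(0)}, y^{(0)}, s^{(0)})$ with $x^{(0)}, s^{(0)} > 0$ and an accuracy requirement ε > 0 be given.
Initialize $(x, y, s) := (x^{(0)}, y^{(0)}, s^{(0)})$ and $\tau := \frac{1}{n}\langle x, s\rangle$.
while $\max\bigl\{ \tau,\ \|Ax - b\|,\ \|A^T y + s - c\| \bigr\} > \varepsilon$
  Predictor step
  Solve (23) for $(\Delta x^N, \Delta y^N, \Delta s^N)$.
  Calculate the maximal possible step lengths $\alpha_p^N$, $\alpha_d^N$ and $\tau^N$.
  Set $\sigma := (\tau^N/\tau)^3$.
  Corrector step
  Solve (24) for $(\Delta x, \Delta y, \Delta s)$.
  Calculate the primal and dual step lengths $\alpha_p$ and $\alpha_d$.
  Set $(x, y, s) := (x - \alpha_p\,\Delta x,\ y - \alpha_d\,\Delta y,\ s - \alpha_d\,\Delta s)$ and $\tau := \frac{1}{n}\langle x, s\rangle$.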
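The two scalar ingredients of the predictor step, the step length rule and the cubic centering heuristic, can be sketched in isolation. This is a sketch under the sign convention of the text (the update subtracts the direction, so only positive components of a direction limit the step). The function names `max_step` and `centering_parameter` are hypothetical, and the affine-scaling directions are simply passed in as given vectors.

```python
def max_step(v, dv, eta=1.0):
    """min{1, eta * min over dv_i > 0 of v_i / dv_i} for the update v - alpha * dv."""
    ratios = [vi / di for vi, di in zip(v, dv) if di > 0]
    if not ratios:
        return 1.0  # no component decreases, so the full step is admissible
    return min(1.0, eta * min(ratios))

def centering_parameter(x, s, dx_aff, ds_aff):
    """Heuristic sigma = (tau_N / tau)^3 from the affine-scaling direction."""
    n = len(x)
    alpha_p = max_step(x, dx_aff)
    alpha_d = max_step(s, ds_aff)
    tau = sum(xi * si for xi, si in zip(x, s)) / n
    tau_aff = sum((xi - alpha_p * dxi) * (si - alpha_d * dsi)
                  for xi, dxi, si, dsi in zip(x, dx_aff, s, ds_aff)) / n
    return (tau_aff / tau) ** 3

print(max_step([1.0, 2.0], [2.0, 1.0]))    # 0.5, limited by the first component
print(max_step([1.0, 2.0], [-1.0, -3.0]))  # 1.0, nothing limits the step
print(centering_parameter([1.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.5, 0.5]))  # 0.015625
```

In the last call the affine-scaling step reduces the duality measure from 1 to 0.25, so the heuristic asks for very little centering; a poor predictor step (ratio close to 1) would instead give σ close to 1.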
In the following picture we can nicely see that the first of three iterates does not lie in the feasible set. Here we used the starting points $x = (0.3, 0.8, 0.7)^T$, $s = (11, 9, 8)^T$ and $y = -10$.
[Figure: Iterates of the Mehrotra predictor-corrector algorithm]
Starting Points
Problems (P) and (D) in canonical form can be embedded in a natural way into a slightly larger self-dual homogeneous problem which fulfills the interior-point condition (cf. [RTV], chapter 2). This approach has the following advantages:
• It is easy to find a starting point for the augmented problem. Any of the feasible interior points can be used to solve it.
• The solution of the self-dual problem gives us information about whether the original problems have solutions or not.
• In the affirmative case we are able to extract them from the solution of the augmented problem.
We, however, will not pursue this approach any further since our main interest lies with Mehrotra's predictor-corrector method. Its main advantage over the methods discussed so far is that it does not necessarily need feasible starting points $(x^{(0)}, y^{(0)}, s^{(0)})$ with $x^{(0)}, s^{(0)} \in \mathbb{R}^n_{++}$. The following heuristic developed by Mehrotra has proved useful (cf. [Meh], p. 589, [No/Wr], p. 410): We start from the least squares problems
$$\|x\| \to \min_{x \in \mathbb{R}^n} \quad\text{such that}\quad Ax = b$$
and
$$\|s\| \to \min_{y \in \mathbb{R}^m,\ s \in \mathbb{R}^n} \quad\text{such that}\quad A^T y + s = c.$$
The optimizers to these problems are given by
$$\tilde{x} := A^T \bigl(AA^T\bigr)^{-1} b \quad\text{and}\quad \tilde{y} := \bigl(AA^T\bigr)^{-1} A c, \quad \tilde{s} := c - A^T \tilde{y}.$$
In these cases $\tilde{x}, \tilde{s} \in \mathbb{R}^n_{++}$ generally does not hold. By using
$$\delta_x := \max\bigl( -1.5\,\min(\tilde{x}),\ 0 \bigr) + 1/n, \qquad \delta_s := \max\bigl( -1.5\,\min(\tilde{s}),\ 0 \bigr) + 1/n,$$
we adjust these vectors and define the starting point as follows:
$$x^0 := \tilde{x} + \delta_x\, e, \qquad y^0 := \tilde{y}, \qquad s^0 := \tilde{s} + \delta_s\, e.$$
For practice purposes, we provide a Matlab program for our readers. For the sake of transparency we have refrained from considering a number of technical details. It is surprising how easily Mehrotra's method can then be implemented:
function [x,y,s,f,iter] = PrimalDualLP(A,b,c)
% Experimental version of Mehrotra's primal-dual
% interior-point method for linear programming
%
% primal problem: min c'x   s.t. Ax = b, x >= 0
% dual problem:   max b'y   s.t. A'y + s = c, s >= 0

Maxiter = 100;
Tol = 1.e-8;
UpperB = 1.e10*max([norm(A),norm(b),norm(c)]);
[m,n] = size(A);

% starting point
e = ones(n,1);
M = A*A'; R = chol(M);
x = A'*(R \ (R' \ b));
y = R \ (R' \ (A*c));
s = c - A'*y;
delta_x = max(-1.5*min(x),0)+1/n; x = x+delta_x*e;
delta_s = max(-1.5*min(s),0)+1/n; s = s+delta_s*e;

for iter = 0 : Maxiter

  f = c'*x;
  % residuals
  rp = A*x-b;        % primal residual
  rd = A'*y+s-c;     % dual residual
  rc = x.*s;         % complementarity
  tau = x'*s/n;      % duality measure

  residue = norm(rc,1)/(1+abs(b'*y));
  primalR = norm(rp)/(1+norm(x));
  dualR = norm(rd)/(1+norm(y));
  STR1 = 'iter %2i: f = %14.5e, residue = %10.2e';
  STR2 = 'primalR = %10.2e, dualR = %10.2e\n';
  fprintf([STR1 STR2], iter, f, residue, primalR, dualR);
  if (norm(x)+norm(s) >= UpperB)
    error('Problem possibly infeasible!'); end
  if (max([residue; primalR; dualR]) <= Tol)
    break; end

  % coefficient matrix of the linear systems
  M = A*diag(x./s)*A'; R = chol(M);   % M = R'R

  % predictor step with maximal Newton step size
  rhs = rp - A*((rc-x.*rd) ./ s);
  dy = R \ (R' \ rhs);                % dy = M \ rhs
  ds = rd-A'*dy;
  dx = (rc-x.*ds) ./ s;
  alpha_p = 1/max([1; dx./x]); alpha_d = 1/max([1; ds./s]);
  tau_N = ((x-alpha_p*dx)'*(s-alpha_d*ds))/n;

  % corrector step: correct towards central path
  sigma = (tau_N/tau)^3;              % Mehrotra
  rc = rc - sigma*tau + dx.*ds;
  rhs = rp - A*((rc-x.*rd) ./ s);
  dy = R \ (R' \ rhs);                % dy = M \ rhs
  ds = rd-A'*dy;
  dx = (rc-x.*ds) ./ s;
  eta = max(0.99,1-tau);
  alpha_p = eta/max([eta; dx./x]);
  alpha_d = eta/max([eta; ds./s]);
  x = x-alpha_p*dx;
  y = y-alpha_d*dy;
  s = s-alpha_d*ds;

end % for-loop

6.10 A Short Look at Interior-Point Methods for Quadratic Optimization
So far we have dealt with linear problems, now we will have a look at quadratic problems. We consider the problem
$$(QP) \qquad \tfrac{1}{2}\langle x, Cx\rangle + \langle c, x\rangle \to \min, \qquad Ax = b,\ x \ge 0,$$
where C is a symmetric positive definite (n, n)-matrix, A an (m, n)-matrix with m ≤ n, $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, and $x \in \mathbb{R}^n$ is the unknown variable. In the following we again assume that the matrix A has full rank, that is, rank(A) = m.
Theorem 6.10.1 (Optimality condition)
A given vector $x^* \in \mathbb{R}^n$ is a minimizer to problem (QP) iff there exist vectors $y^* \in \mathbb{R}^m$ and $s^* \in \mathbb{R}^n_+$ that solve
$$\begin{aligned}
A^T y^* + s^* - Cx^* - c &= 0 \\
Ax^* - b &= 0 \\
x_i^* s_i^* &= 0 \quad (i = 1, \ldots, n) \\
x^*, s^* &\ge 0.
\end{aligned} \qquad (25)$$
Proof: If $x^*$ is a minimizer to problem (QP), there exist multipliers $y^* \in \mathbb{R}^m$ and $s^* \in \mathbb{R}^n_+$ such that the triple $(x^*, y^*, s^*)$ satisfies the KKT conditions. The Lagrange function L to problem (QP) is given by
$$L(x, y, s) = \tfrac{1}{2}\langle x, Cx\rangle + \langle c, x\rangle + \langle y, b - Ax\rangle + \langle s, -x\rangle$$
for $x \in \mathbb{R}^n$, $s \in \mathbb{R}^n_+$ and $y \in \mathbb{R}^m$. Hence the KKT conditions are:
$$\begin{aligned}
\nabla_x L(x^*, y^*, s^*) = Cx^* + c - A^T y^* - s^* &= 0 \\
Ax^* &= b \\
x_i^* s_i^* &= 0 \quad (i = 1, \ldots, n) \\
x^*, s^* &\ge 0,
\end{aligned}$$
and so we get (25). On the other hand, if we have (25), then (QP) attains its global minimum at $x^*$ by theorem 2.2.8. We repeat the surprisingly simple argument: The objective function f defined by $f(x) := \tfrac{1}{2}\langle x, Cx\rangle + \langle c, x\rangle$ is differentiable and convex. For any feasible x we thus get
$$\begin{aligned}
f(x) - f(x^*) &\ge \langle \nabla f(x^*), x - x^*\rangle = \langle Cx^* + c,\ x - x^*\rangle = \langle A^T y^* + s^*,\ x - x^*\rangle \\
&= \langle y^*, Ax - Ax^*\rangle + \langle s^*, x\rangle = \langle s^*, x\rangle \ge 0.
\end{aligned}$$
Hence $x^*$ is a (global) minimizer to (QP).
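For a given pair (y, s), the function L(· , y, s) is convex and differentiable. Therefore a necessary and sufficient condition for a minimizer $x_0$ to L(· , y, s) is that the gradient vanishes, that is, $0 = \nabla_x L(x_0, y, s) = Cx_0 + c - A^T y - s$. Hence,
$$\begin{aligned}
\varphi(y, s) := \inf_{x \in \mathbb{R}^n} L(x, y, s) = L(x_0, y, s) &= \tfrac{1}{2}\langle x_0, Cx_0\rangle + \langle c, x_0\rangle + \langle y, b\rangle - \langle x_0, A^T y + s\rangle \\
&= \tfrac{1}{2}\langle x_0, Cx_0\rangle + \langle c, x_0\rangle + \langle y, b\rangle - \langle x_0, Cx_0 + c\rangle \\
&= \langle y, b\rangle - \tfrac{1}{2}\langle x_0, Cx_0\rangle \\
&= \langle y, b\rangle - \tfrac{1}{2}\bigl\langle C^{-1}(A^T y + s - c),\ A^T y + s - c \bigr\rangle.
\end{aligned}$$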
Thus, the corresponding dual problem can be written as follows:
$$(QD_e) \qquad \langle y, b\rangle - \tfrac{1}{2}\bigl\langle C^{-1}(A^T y + s - c),\ A^T y + s - c \bigr\rangle \to \max, \qquad y \in \mathbb{R}^m,\ s \in \mathbb{R}^n_+$$
As in section 6.1, we rearrange system (25) in the following way:
$$H_0(x, y, s) := \begin{pmatrix} A^T y + s - Cx - c \\ Ax - b \\ Xs \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad x, s \ge 0 \qquad (26)$$
Now we consider the logarithmic barrier function $\Psi_\mu$ for problem (QP) defined by
$$\Psi_\mu(x) := \tfrac{1}{2}\langle x, Cx\rangle + \langle c, x\rangle - \mu \sum_{i=1}^{n} \log(x_i) \qquad \bigl(x \in F_P^0\bigr)$$
for μ > 0. We will show in the same way as in the proof of theorem 6.1.4 that $\Psi_\mu$ has a (unique) minimizer iff the system
$$H_\mu(x, y, s) := \begin{pmatrix} A^T y + s - Cx - c \\ Ax - b \\ Xs - \mu e \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad x, s \ge 0 \qquad (27)$$
has a (unique) solution:
Theorem 6.10.2
There exists a minimizer to $\Psi_\mu$ on $F_P^0$ iff system (27) has a solution. The minimizer to $\Psi_\mu$ and the solution of (27), if they exist, are unique.
Proof: The definition of $\Psi_\mu$ can be extended to the open set $\mathbb{R}^n_{++}$ and $\Psi_\mu$ is differentiable there. On $\mathbb{R}^n_{++}$ we consider the problem
$$(QP_\mu) \qquad \Psi_\mu(x) \to \min, \qquad Ax = b.$$
$(QP_\mu)$ is a convex problem with linear constraints. So we know that a vector $x \in \mathbb{R}^n_{++}$ is a (unique) minimizer to problem $(QP_\mu)$ iff there exists a (unique) multiplier $y \in \mathbb{R}^m$ such that the pair (x, y) satisfies the KKT conditions to problem $(QP_\mu)$. The Lagrange function L to this problem is given by $L(x, y) := \Psi_\mu(x) + \langle y, b - Ax\rangle$. The KKT conditions to $(QP_\mu)$ are
$$\nabla_x L(x, y) = Cx + c - \mu X^{-1} e - A^T y = 0, \qquad Ax = b.$$
If we set $s := \mu X^{-1} e$, which is equivalent to Xs = μe, we get (27). So we have shown that system (27) represents the optimality conditions for problem $(QP_\mu)$.
The central path is the set of solutions $\bigl\{ (x(\mu), y(\mu), s(\mu)) \mid \mu > 0 \bigr\}$ of (27).
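System (27) can be checked numerically on a tiny quadratic problem whose central path is known in closed form. The sketch below uses hypothetical data that are not taken from the text: C = I, c = 0, A = (1 1), b = 1, that is, minimizing ½‖x‖² subject to x1 + x2 = 1, x ≥ 0. By symmetry the central path is x(μ) = (1/2, 1/2), s(μ) = (2μ, 2μ), y(μ) = 1/2 − 2μ. The function name `h_mu` is likewise an assumption.

```python
def h_mu(C, A, b, c, x, y, s, mu):
    """Residual of system (27): (A^T y + s - C x - c, A x - b, X s - mu e)."""
    n, m = len(x), len(b)
    r1 = [sum(A[i][j] * y[i] for i in range(m)) + s[j]
          - sum(C[j][k] * x[k] for k in range(n)) - c[j] for j in range(n)]
    r2 = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
    r3 = [x[j] * s[j] - mu for j in range(n)]
    return r1 + r2 + r3

# Hypothetical data: min 1/2 ||x||^2  s.t.  x1 + x2 = 1, x >= 0
C = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
b = [1.0]
c = [0.0, 0.0]

for mu in (1.0, 0.1, 0.001):
    x = [0.5, 0.5]
    s = [2.0 * mu, 2.0 * mu]
    y = [0.5 - 2.0 * mu]
    residual = h_mu(C, A, b, c, x, y, s, mu)
    print(mu, max(abs(r) for r in residual))
```

As μ tends to 0 the path converges to x = (1/2, 1/2), y = 1/2, s = 0, which satisfies the optimality conditions (25) of this small problem.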
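Theorem 6.10.3
The Jacobian
$$DH_\mu(x, y, s) = \begin{pmatrix} -C & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix}$$
is nonsingular if x > 0 and s > 0.
Proof: For $(u, v, w) \in \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^n$ with
$$\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = DH_\mu(x, y, s) \begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} -Cu + A^T v + w \\ Au \\ Su + Xw \end{pmatrix}$$
we have $\langle u, w\rangle = \langle u, Cu - A^T v\rangle = \langle u, Cu\rangle$ because Au = 0. On the other hand we have $\langle u, w\rangle = \langle u, -X^{-1}Su\rangle$. As C and $X^{-1}S$ are positive definite, u ≠ 0 would give the contradiction $0 < \langle u, Cu\rangle = \langle u, -X^{-1}Su\rangle < 0$; hence we can conclude that u = 0. From 0 = Su + Xw = Xw and x > 0 we get w = 0, and from $0 = -Cu + A^T v + w = A^T v$ we get v = 0 since rank(A) = m.

Hence we obtain the Newton direction at a point (x, y, s), aiming at the point $(x_{\sigma\tau}, y_{\sigma\tau}, s_{\sigma\tau})$ on the central path with μ = στ, by solving
$$\begin{pmatrix} -C & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} r_q \\ r_p \\ r_c \end{pmatrix} \qquad (28)$$
where $r_q := A^T y + s - Cx - c$, $r_p := Ax - b$ and $r_c := Xs - \sigma\tau e$. The next iterate is given by
$$(x^+, y^+, s^+) := (x, y, s) - \alpha\,(\Delta x, \Delta y, \Delta s)$$
with a suitable α ∈ (0, 1] such that $x^+, s^+ > 0$. We denote the primal–dual feasible set by
$$F_q := \bigl\{ (x, y, s) \in \mathbb{R}^n_+ \times \mathbb{R}^m \times \mathbb{R}^n_+ \mid Ax = b,\ A^T y - Cx + s = c \bigr\}$$
and the strictly primal–dual feasible set by
$$F_q^0 := \bigl\{ (x, y, s) \in \mathbb{R}^n_{++} \times \mathbb{R}^m \times \mathbb{R}^n_{++} \mid Ax = b,\ A^T y - Cx + s = c \bigr\}.$$
Now we can give a general framework for solving quadratic problems by using interior-point methods: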
General interior-point algorithm for quadratic problems
Let $(x^{(0)}, y^{(0)}, s^{(0)}) \in F_q^0$ and an accuracy requirement ε > 0 be given.
Initialize $(x, y, s) := (x^{(0)}, y^{(0)}, s^{(0)})$ and $\tau := \frac{1}{n}\langle x, s\rangle$.
while τ > ε
  Solve (28) for σ ∈ [0, 1].
  Set $(x, y, s) := (x, y, s) - \alpha\,(\Delta x, \Delta y, \Delta s)$ where α ∈ (0, 1] denotes a suitable step size such that (x, s) > 0.
  $\tau := \frac{1}{n}\langle x, s\rangle$
Exercises
1. Consider the primal problem
$$(P) \qquad \langle c, x\rangle \to \min, \qquad Ax = b,\ x \ge 0$$
and its dual problem
$$(D_e) \qquad \langle b, y\rangle \to \max, \qquad A^T y + s = c,\ s \ge 0,$$
where
$$A := \begin{pmatrix} -1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \end{pmatrix}, \qquad b := (2, 4)^T \quad\text{and}\quad c := (-1, -2, 0, 0)^T.$$
a) Calculate the solution $\mathbf{x}^* := (x^*, y^*, s^*)$ to the optimality conditions for (P) and (D) (cf. (1) in theorem 6.1.1) and determine $F_P^{opt}$ and $F_D^{opt}$. The nonnegativity constraints x ≥ 0, s ≥ 0 are essential in the optimality conditions. Otherwise we get, besides $\mathbf{x}^*$, solutions which are erroneous with respect to the linear optimization problems.
b) Show by means of the implicit function theorem that the central path system (cf. (4), theorem 6.1.4)
$$\begin{aligned}
A^T y + s &= c \\
Ax &= b \\
Xs &= \mu e \\
x, s &\ge 0
\end{aligned}$$
has a unique solution $\mathbf{x}(\mu) := (x(\mu), y(\mu), s(\mu))$ for $\mu \in \mathbb{R}$ in a suitable neighborhood of 0. Prove that $\mathbf{x}^* = \lim_{\mu \to 0+} \mathbf{x}(\mu)$.
Hint: When using Maple, determine the functional matrix with the help of linalg[blockmatrix].
c) Calculate the central path numerically and plot the projection $(x_1(\mu), x_2(\mu))$.
d) Now choose $c := (-1, -1, 0, 0)^T$ and repeat a) – c). Can you use the same approach as in b) again? If necessary, calculate $\lim_{\mu \to 0+} \mathbf{x}(\mu)$ in a different way (cf. the remark on p. 259 f).
2. Solve the linear optimization problem
$$(P) \qquad \langle c, x\rangle \to \min, \qquad Ax = b,\ x \ge 0$$
with the projected gradient method (cf. p. 189 ff). Use the data from the preceding exercise for A, b, c and choose $x^{(0)} := (1, 1, 2, 2)^T$ as the starting vector. Visualize this.
3. Primal Affine-Scaling Method
By modifying the projected gradient method, we can derive a simple interior-point method. Starting from $x^{(0)} \in F_P^0$, we will describe one step of the iteration. By means of the affine-scaling transformation $x = X\tilde{x}$ with $X := \mathrm{Diag}(x^{(0)})$ we get the following equivalent problem in $\tilde{x}$-space:
$$\langle \tilde{c}, \tilde{x}\rangle \to \min, \qquad \tilde{A}\tilde{x} = b,\ \tilde{x} \ge 0,$$
where $\tilde{c} := Xc$ and $\tilde{A} := AX$. Our current approximation in $\tilde{x}$-space is apparently at the 'central point' $\tilde{x}^{(0)} = X^{-1}x^{(0)} = e := (1, \ldots, 1)^T$. If we apply the projected gradient method to the transformed problem, we get $\tilde{d} = -P\tilde{c}$ with
$$P = I - \tilde{A}^T \bigl(\tilde{A}\tilde{A}^T\bigr)^{-1} \tilde{A} = I - XA^T \bigl(AX^2A^T\bigr)^{-1} AX.$$
Transformation back to x-space yields the primal affine-scaling direction $d = X\tilde{d} = -XPXc$.
a) Show that $x^{(0)}$ is a KKT point if d = 0 and thus a minimizer of (P). Otherwise d is a descent direction. For d ≥ 0 the objective function is unbounded from below. Determine the maximal $\alpha_{\max} > 0$ such that $x^{(0)} + \alpha_{\max}\, d \ge 0$, and set the step size $\alpha := \eta\,\alpha_{\max}$ with η ∈ (0, 1) (e.g., η = 0.99). In the update $x^{(1)} := x^{(0)} + \alpha\, d$ we thus get a strictly positive vector from $F_P^0$ again, with which we can continue the iteration.
b) Implement the method sketched above in Matlab and test the program with the data from exercise 1 and the starting vector $x^{(0)}$ from exercise 2.
4. Primal Newton Barrier Method
Assume that the interior-point condition holds, that is, $F_P^0$ and $F_D^0$ are nonempty. We apply the logarithmic barrier method to the primal problem (P):
$$\Phi_\mu(x) := \langle c, x\rangle - \mu \sum_{i=1}^{n} \log(x_i) \to \min, \qquad Ax = b,\ x > 0,$$
where μ > 0 is the barrier parameter.
a) Operate in the Lagrange–Newton framework (cf. p. 200 ff) and derive the quadratic approximation to $\Phi_\mu$ at x:
$$\tfrac{1}{2}\bigl\langle d, \mu X^{-2} d \bigr\rangle + \bigl\langle c - \mu X^{-1} e,\ d \bigr\rangle \to \min, \qquad Ad = 0,$$
where X denotes the diagonal matrix Diag(x). Show that the Newton direction
$$d = -\frac{1}{\mu}\, XPXc + XPe \quad\text{with}\quad P = I - XA^T \bigl(AX^2A^T\bigr)^{-1} AX$$
is a minimizer of this problem. The expression for d is the sum of a multiple of the primal affine-scaling direction and XPe, which is called the centering direction. In part c) of this exercise we will see why.
b) Suppose $(y, s) \in F_D^0$. Show that d can be written as
$$d = XP \left( e - \frac{1}{\mu}\, Xs \right).$$
If we additionally have $\bigl\| e - \frac{1}{\mu} Xs \bigr\| < 1$, then $\|X^{-1} d\| < 1$ holds and the Newton iterate satisfies x + d > 0.
c) The problem
$$-\sum_{i=1}^{n} \log(x_i) \to \min, \qquad Ax = b,\ x > 0$$
finds the analytic center of $F_P$ (cf. the remark on p. 259 f). Verify again that the Newton direction d of this problem yields the centering direction XPe.
5. Goldman–Tucker Theorem (1956)
Let $F_P$ and $F_D$ be nonempty. Show that the following holds for
$$B := \bigl\{ i \in \{1, \ldots, n\} \mid \exists\, x \in F_P^{opt} : x_i > 0 \bigr\}:$$
a) For every k ∉ B there exists a $y \in F_D^{opt}$ such that $c_k - \langle a_k, y\rangle > 0$.
b) There exist $x^* \in F_P^{opt}$ and $y^* \in F_D^{opt}$ with $x^* + s^* > 0$ for $s^* := c - A^T y^*$.
Hint to a): For k ∉ B consider the linear optimization problem
$$-\langle e_k, x\rangle \to \min, \qquad Ax = b,\ \langle c, x\rangle \le p^* := v(P),\ x \ge 0$$
and apply the Duality Theorem to it.
Hint to b): Firstly show with the help of a)
$$\forall\, k \in \{1, \ldots, n\}\ \ \exists\, x^{(k)} \in F_P^{opt}\ \ \exists\, y^{(k)} \in F_D^{opt} : \quad x_k^{(k)} + s_k^{(k)} > 0.$$
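6. Show that:
a) $N_2(\beta_1) \subset N_2(\beta_2)$ for $0 \le \beta_1 < \beta_2$, and $N_{-\infty}(\gamma_2) \subset N_{-\infty}(\gamma_1)$ when $0 \le \gamma_1 < \gamma_2 \le 1$.
b) $N_2(\beta) \subset N_{-\infty}(\gamma)$ for $\gamma \le 1 - \beta$.
c)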
N−∞ (γ) = ω ∈ Rn+ | I − γ hn hTn ω ≥ 0
n n−1 = ν=1 αν hν αn ≥ 0, ν=1 αν hν + (1 − γ)αn hn ≥ 0 , (γ) for γ ≤ 1 − β (n − 1)/n. N2 (β) ⊂ N−∞ , N2 (β) = Rn+ for β ≥ n(n − 1) .
7. Implement the short-step or the long-step path-following algorithm as well as the Mizuno–Todd–Ye predictor-corrector method. Test them with the linear optimization problem
$$(P) \qquad \langle c, x\rangle \to \min, \qquad Ax = b,\ x \ge 0,$$
where
$$A := \begin{pmatrix} 1 & 1 & 1 & 0 \\ 2 & 1 & 0 & 1 \end{pmatrix}, \qquad b := (3, 2)^T \quad\text{and}\quad c := (-1, -3, 0, 0)^T.$$
Use $x^{(0)} := \bigl(\tfrac{1}{2}, \tfrac{1}{2}, 2, \tfrac{1}{2}\bigr)^T$, $y^{(0)} := (-2, -2)^T$ and $s^{(0)} := (5, 1, 2, 2)^T$ as feasible starting points.
8. Starting Points for Mehrotra's algorithm (cf. p. 287 f)
Show that the solutions of the least squares problems
$$\|x\| \to \min_{x \in \mathbb{R}^n} \quad\text{such that}\quad Ax = b$$
and
$$\|s\| \to \min_{y \in \mathbb{R}^m,\ s \in \mathbb{R}^n} \quad\text{such that}\quad A^T y + s = c$$
are given by
$$x := A^T \bigl(AA^T\bigr)^{-1} b \quad\text{and}\quad y := \bigl(AA^T\bigr)^{-1} A c, \quad s := c - A^T y.$$
9. Let the following optimization problem be given:
$$(QP) \qquad f(x) := \tfrac{1}{2}\, x^T C x + c^T x \to \min, \qquad A^T x \le b$$
Suppose: $C \in \mathbb{R}^{n \times n}$ symmetric positive semidefinite, $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, $A \in \mathbb{R}^{n \times m}$, C positive definite on kernel($A^T$).
a) Show that a point $x \in F := \{ v \in \mathbb{R}^n \mid A^T v \le b \}$ is a minimizer to (QP) iff there exist vectors $y, s \in \mathbb{R}^m_+$ with
$$\begin{aligned}
Cx + c + Ay &= 0 \\
A^T x + s &= b \\
y_i s_i &= 0 \quad (i = 1, \ldots, m).
\end{aligned}$$
b) Derive the central path system for μ > 0:
$$F_\mu(x, y, s) := \begin{pmatrix} Cx + c + Ay \\ A^T x + s - b \\ Ys - \mu e \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad y \ge 0,\ s \ge 0$$
c) Show that the Jacobian
$$DF_\mu(x, y, s) = \begin{pmatrix} C & A & 0 \\ A^T & 0 & I \\ 0 & S & Y \end{pmatrix}$$
is nonsingular if y > 0 and s > 0.
d) The Newton direction $(\Delta x, \Delta y, \Delta s)$ at (x, y, s) is given as the solution to
$$\begin{pmatrix} C & A & 0 \\ A^T & 0 & I \\ 0 & S & Y \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} r_d \\ r_p \\ r_c \end{pmatrix}, \qquad (29)$$
where the dual residual $r_d$, the primal residual $r_p$ and the complementarity residual $r_c$ are defined as
$$r_d := Cx + c + Ay, \qquad r_p := A^T x + s - b, \qquad r_c := Ys - \mu e.$$
Show that the solution to system (29) is given by the following equations:
$$\begin{aligned}
\Delta x &= \bigl( C + AYS^{-1}A^T \bigr)^{-1} \bigl( r_d - AS^{-1}(r_c - Y r_p) \bigr) \\
\Delta s &= r_p - A^T \Delta x \\
\Delta y &= S^{-1}(r_c - Y \Delta s).
\end{aligned}$$
e) Translate Mehrotra's method to the quadratic optimization problem (QP) and implement it in Matlab. Use exercise 15 in chapter 4 to test your program.
7 Semidefinite Optimization
7.1 Background and Motivation
    Basics and Notations
    Primal Problem and Dual Problem
7.2 Selected Special Cases
    Linear Optimization and Duality Complications
    Second-Order Cone Programming
7.3 The S-Procedure
    Minimal Enclosing Ellipsoid of Ellipsoids
7.4 The Function log ◦ det
7.5 Path-Following Methods
    Primal–Dual System
    Barrier Functions
    Central Path
7.6 Applying Newton's Method
7.7 How to Solve SDO Problems?
7.8 Icing on the Cake: Pattern Separation via Ellipsoids
Exercises

W. Forst and D. Hoffmann, Optimization—Theory and Practice, Springer Undergraduate Texts in Mathematics and Technology, DOI 10.1007/978-0-387-78977-4_7, © Springer Science+Business Media, LLC 2010

Semidefinite optimization (SDO) differs from linear optimization (LO) in that it deals with optimization problems over the cone of symmetric positive semidefinite matrices $S^n_+$ instead of nonnegative vectors. In many cases the objective function is linear and SDO can be interpreted as an extension of LO. It is a branch of convex optimization and contains, besides LO, linearly constrained QP, quadratically constrained QP and, for example, second-order cone programming as special cases.

Problems with arguments from $S^n_+$ are of particular importance because many practically useful problems which are neither linear nor convex quadratic can be written as SDO problems. SDO, therefore, covers a lot of important applications in very different areas, for example, system and control theory, eigenvalue optimization for affine matrix functions, combinatorial optimization, graph theory, approximation theory as well as pattern recognition and separation by ellipsoids (cluster analysis). This wide range of uses has quickly made SDO very popular, as has the fact that SDO problems can be solved efficiently via polynomially convergent interior-point methods, which had originally only been developed for LO.

Since the 1990s semidefinite problems (in optimization) have been intensively researched. First approaches and concepts, however, had already existed decades earlier, for example, in a paper by Bellman/Fan (1963). In 1994 Nesterov/Nemirovski provided a very general framework which could later on be simplified mainly due to the work and effort of Alizadeh. The abundance of literature in this field can be overwhelming, which often makes it difficult to separate the wheat from the chaff and find really new ideas.

Many concepts from linear optimization can be transferred to the more general case of semidefinite optimization. Note, however, that this cannot be done 'automatically' since the duality theory is weaker, and strict complementarity in the sense of the Goldman–Tucker theorem cannot generally be assumed. Therefore many considerations have to be done with great care.

The basic idea of many interior-point methods is to approximately follow the central path in the interior of the feasible set of an optimization problem, which leads to an optimal point of the problem, mostly via variants of Newton's method. The technical difficulties that arise there and their extensive 'apparatus' will not be addressed here in detail. We content ourselves with a short illustration of the problems in comparison with chapter 6. We limit our attention in this chapter to some basic facts and only provide a first introduction to a very interesting field with a focus on the parallels to linear optimization.
Chapter 7
7.1 Background and Motivation

Basics and Notations

An obvious difference between linear optimization and semidefinite optimization¹ is the feasible set. While in linear optimization the feasible set is a subset of $\mathbb{R}^n$, the feasible set of a semidefinite problem is a subset of the set of the real symmetric $n \times n$ matrices. Therefore we have to look at real symmetric, and especially semidefinite, matrices: Note that a square real symmetric matrix $A$ is positive definite if and only if $x^T A x > 0$ for all $x \in \mathbb{R}^n \setminus \{0\}$, and positive semidefinite if and only if $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$.
¹ Semidefinite optimization is also referred to as semidefinite programming (SDP), which is a historically entrenched term.
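We define:
\[
\begin{aligned}
S^n &:= \{A \in \mathbb{R}^{n\times n} \mid A \text{ symmetric}\}\\
S^n_+ &:= \{A \in S^n \mid A \text{ positive semidefinite}\}\\
S^n_{++} &:= \{A \in S^n \mid A \text{ positive definite}\}\\
A \succeq B &:\iff A - B \in S^n_+\\
A \succ B &:\iff A - B \in S^n_{++}
\end{aligned}
\]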
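The relation $\succeq$ is often referred to as the Löwner partial order.

Definition
• The standard inner product $\langle\cdot,\cdot\rangle_{S^n} : S^n \times S^n \to \mathbb{R}$ is given by
\[
\langle A, B\rangle_{S^n} := \operatorname{tr}(A^T B) = \operatorname{tr}(AB) = \sum_{i=1}^n \sum_{j=1}^n a_{ij} b_{ij}.
\]
• This inner product yields the Frobenius norm $\|\cdot\|_F$ defined by
\[
\|A\|_F := \sqrt{\operatorname{tr}(A^T A)} = \sqrt{\sum_{i=1}^n \sum_{j=1}^n a_{ij}^2}.
\]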
$\langle A, B\rangle := \operatorname{tr}(A^T B)$ can also be defined for nonsymmetric matrices. Recall that the trace of a matrix is the sum of its eigenvalues. The eigenvalues of $A^2$ are the squares of the eigenvalues of $A$ counted according to their multiplicity. Therefore it holds for any $A \in S^n$:
\[
\|A\|_F^2 = \operatorname{tr}(A^2) = \sum_{i=1}^n \lambda_i(A)^2,
\]
where here and below $\lambda_i(A)$ denotes the $i$-th eigenvalue of $A$ in ascending order, $\lambda_1(A) \le \cdots \le \lambda_n(A)$; this is the ordering that the inequalities of lemma 7.1.4 require.

$S^n$ is a vector space with dimension $\bar n := n(n+1)/2$. For example, we can identify $S^n$ and $\mathbb{R}^{\bar n}$ using the map svec defined by
\[
\operatorname{svec}(A) := \bigl(a_{11}, \sqrt{2}\,a_{12}, \ldots, \sqrt{2}\,a_{1n}, a_{22}, \sqrt{2}\,a_{23}, \ldots, a_{nn}\bigr)^T .
\]
The factor $\sqrt{2}$ in front of all nondiagonal elements in the definition of svec is motivated by the fact that it makes the standard inner products of $S^n$ and $\mathbb{R}^{\bar n}$ compatible, i.e., for all $A, B \in S^n$ it holds that $\langle A, B\rangle_{S^n} = \langle \operatorname{svec}(A), \operatorname{svec}(B)\rangle_{\mathbb{R}^{\bar n}}$.
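The compatibility of the two inner products under svec can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `svec` and `frobenius_inner` and the two test matrices are illustrative choices, not taken from the text.

```python
import math

def svec(A):
    """Stack the upper triangle of a symmetric matrix row by row,
    scaling every off-diagonal entry by sqrt(2)."""
    n = len(A)
    out = []
    for i in range(n):
        for j in range(i, n):
            out.append(A[i][j] if i == j else math.sqrt(2) * A[i][j])
    return out

def frobenius_inner(A, B):
    """Standard inner product <A, B> = sum_ij a_ij * b_ij."""
    n = len(A)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n))

# Two arbitrary symmetric 3x3 matrices (illustrative data).
A = [[2.0, -1.0, 0.5], [-1.0, 3.0, 4.0], [0.5, 4.0, 1.0]]
B = [[1.0, 2.0, 0.0], [2.0, -1.0, 3.0], [0.0, 3.0, 5.0]]

lhs = frobenius_inner(A, B)
rhs = sum(a * b for a, b in zip(svec(A), svec(B)))
print(lhs, rhs)  # the two values agree up to rounding
```

Without the factor $\sqrt{2}$ each off-diagonal product $a_{ij} b_{ij}$ would be counted once instead of twice, and the two inner products would differ.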
The operator svec is invertible. Positive semidefiniteness in $S^n$ is equivalent to the property that all eigenvalues are nonnegative. Correspondingly, a symmetric matrix is positive definite if and only if all eigenvalues are positive.

Remark
$S^n_+$ is a closed convex cone. Its interior is $S^n_{++}$.

Proof: It is obvious that $S^n_+$ is a closed convex cone. It remains to show that its interior is $S^n_{++}$ (cf. exercise 4).

Theorem 7.1.1
$S^n_+$ is self-dual. In other words: A symmetric matrix $A$ is positive semidefinite if and only if $\langle A, B\rangle_{S^n} \ge 0$ for all $B \succeq 0$.
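We provide two lemmas to prove self-duality:

Lemma 7.1.2
For any $A \in S^n_+$ the positive semidefinite root, denoted by $A^{1/2}$, exists and is unique (cf. exercise 2).

Lemma 7.1.3
For $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times m}$ it holds:
\[
\operatorname{tr}(AB) = \operatorname{tr}(BA)
\]
Proof: $(AB)_{jj} = \sum_{i=1}^n a_{ji} b_{ij}$ implies $\operatorname{tr}(AB) = \sum_{j=1}^m \sum_{i=1}^n a_{ji} b_{ij}$ and accordingly $\operatorname{tr}(BA) = \sum_{i=1}^n \sum_{j=1}^m b_{ij} a_{ji}$.

Proof of 7.1.1: i) Let $A, B \in S^n_+$. Due to lemma 7.1.2 the roots $A^{1/2}$ and $B^{1/2}$ exist and are positive semidefinite, too. Thus, using lemma 7.1.3 in the third step,
\[
\langle A, B\rangle_{S^n} = \operatorname{tr}(AB) = \operatorname{tr}\bigl(A^{1/2} A^{1/2} B^{1/2} B^{1/2}\bigr) = \operatorname{tr}\bigl(B^{1/2} A^{1/2} A^{1/2} B^{1/2}\bigr)
= \operatorname{tr}\bigl((A^{1/2} B^{1/2})^T A^{1/2} B^{1/2}\bigr) = \bigl\|A^{1/2} B^{1/2}\bigr\|_F^2 \ge 0 .
\]
ii) Conversely, let $\langle A, B\rangle_{S^n} \ge 0$ for all $B \succeq 0$, and let $x \in \mathbb{R}^n$. The matrix $xx^T \in \mathbb{R}^{n\times n}$ is positive semidefinite because it holds for all $v \in \mathbb{R}^n$ that $v^T x x^T v = (x^T v)^T x^T v = (x^T v)^2 \ge 0$. By lemma 7.1.3 we get
\[
x^T A x = \operatorname{tr}(x^T A x) = \operatorname{tr}(A x x^T) = \langle A, xx^T\rangle_{S^n} \ge 0,
\]
i.e., $A$ is positive semidefinite.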
[Figure: Boundary of the cone $S^2_+$]

To create the figure, we used the following condition: A matrix $X = \begin{pmatrix} x & y \\ y & z \end{pmatrix}$ belongs to $S^2_+$ iff $xz \ge y^2$ and $x, z \ge 0$.
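The criterion for $S^2_+$ can be made concrete with a small sketch: whenever the criterion fails, one can write down a vector $v$ with $v^T X v < 0$ explicitly. The function names and the choice of witness vectors below are assumptions made for illustration; exact rational arithmetic avoids rounding issues on the boundary of the cone.

```python
from fractions import Fraction
from itertools import product

def in_cone(x, y, z):
    """Criterion from the text: [[x, y], [y, z]] is in S^2_+ iff
    x*z >= y^2 and x, z >= 0."""
    return x >= 0 and z >= 0 and x * z >= y * y

def quad_form(x, y, z, v):
    """v^T X v for X = [[x, y], [y, z]]."""
    v1, v2 = v
    return x * v1 * v1 + 2 * y * v1 * v2 + z * v2 * v2

def negative_witness(x, y, z):
    """Return v with v^T X v < 0 if the criterion fails, else None."""
    if in_cone(x, y, z):
        return None
    if x < 0:
        return (1, 0)            # v^T X v = x < 0
    if z < 0:
        return (0, 1)            # v^T X v = z < 0
    # Here x, z >= 0 and x*z < y^2, hence y != 0.
    if x > 0:
        return (-y, x)           # v^T X v = x * (x*z - y^2) < 0
    return (Fraction(-(z + 1)) / (2 * y), 1)  # x == 0: v^T X v = -1

print(negative_witness(1, 2, 1))   # (-2, 1), with quadratic form -3
print(negative_witness(2, 1, 1))   # None, the matrix lies in the cone
```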
To keep it simple, we denote the standard inner product on $S^n$ in the following by $\langle\cdot,\cdot\rangle$ and mostly do not differentiate between the standard inner product for symmetric matrices and the standard inner product for vectors in our notation. The following technical lemma will be instrumental in proofs:

Lemma 7.1.4
For $A, B \in S^n_+$ it holds:
\[
\lambda_1(A)\lambda_n(B) \le \lambda_1(A) \operatorname{tr}(B) \le \langle A, B\rangle \le \lambda_n(A) \operatorname{tr}(B) \le n\,\lambda_n(A)\lambda_n(B)
\]
Proof: Let $Q$ be an orthogonal matrix which diagonalizes $A$, i.e., $A = Q D Q^T$, where $D$ is a diagonal matrix formed by the eigenvalues of $A$. It holds:
\[
\langle A, B\rangle = \operatorname{tr}(AB) = \operatorname{tr}(Q D Q^T B) = \operatorname{tr}(D\, Q^T B Q) \ge \lambda_1(A) \operatorname{tr}(Q^T B Q) = \lambda_1(A) \operatorname{tr}(B) \ge \lambda_1(A)\lambda_n(B)
\]
In the same manner we get the remaining inequalities.

The following remark will be useful for applications of semidefinite optimization.
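Remark
A block diagonal matrix is symmetric positive (semi-) definite iff all its blocks are symmetric positive (semi-) definite, i.e., for $X_\kappa \in \mathbb{R}^{n_\kappa \times n_\kappa}$, $1 \le \kappa \le k$,
\[
X := \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_k \end{pmatrix} \succeq 0 \ (\succ 0) \iff X_1, \ldots, X_k \succeq 0 \ (\succ 0).
\]
Proof: Let $\alpha := n_1 + \cdots + n_k$ and $v =: (v_1^T, \ldots, v_k^T)^T \in \mathbb{R}^\alpha$ with $v_\kappa \in \mathbb{R}^{n_\kappa}$ for $1 \le \kappa \le k$.
(i) Let $\kappa_0 \in \{1, \ldots, k\}$ and $w \in \mathbb{R}^{n_{\kappa_0}}$. With $v_{\kappa_0} := w$ and $v_\kappa := 0$ for $\kappa \ne \kappa_0$ it holds that $\langle w, X_{\kappa_0} w\rangle = \langle v, Xv\rangle$. Thus if $X$ is positive semidefinite, then $\langle w, X_{\kappa_0} w\rangle \ge 0$. For $X \succ 0$, we get $\langle w, X_{\kappa_0} w\rangle = 0$ iff $v = 0 \in \mathbb{R}^\alpha$, especially $w = 0 \in \mathbb{R}^{n_{\kappa_0}}$.
(ii) By $\langle v, Xv\rangle = \sum_{\kappa=1}^k \langle v_\kappa, X_\kappa v_\kappa\rangle$ we get $X \succeq 0$, if $X_1, \ldots, X_k \succeq 0$. Since $\langle v_\kappa, X_\kappa v_\kappa\rangle \ge 0$, it holds that $\langle v, Xv\rangle = \sum_{\kappa=1}^k \langle v_\kappa, X_\kappa v_\kappa\rangle = 0$ if and only if all summands are zero. If in addition all $X_\kappa$ are positive definite, $v_\kappa = 0$ must hold for $1 \le \kappa \le k$, i.e., $v = 0$ and therefore $X \succ 0$.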
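Primal Problem and Dual Problem

Let $b \in \mathbb{R}^m$, $A^{(j)} \in S^n$ $(1 \le j \le m)$ and $C \in S^n$. We consider the problem
\[
(SDP)\quad \left\{\begin{array}{l} \langle C, X\rangle \to \min \\ \langle A^{(j)}, X\rangle = b_j \quad (1 \le j \le m) \\ X \succeq 0. \end{array}\right.
\]
The constraint $X \succeq 0$ is convex, but nonlinear. Let the linear operator $\mathcal{A} : S^n \to \mathbb{R}^m$ be defined by
\[
\mathcal{A}(X) := \begin{pmatrix} \langle A^{(1)}, X\rangle \\ \vdots \\ \langle A^{(m)}, X\rangle \end{pmatrix} = \sum_{j=1}^m \langle A^{(j)}, X\rangle\, e_j, \qquad (1)
\]
where $e_j \in \mathbb{R}^m$ is the $j$-th unit vector. Then we get the final form of our semidefinite problem:
\[
(SDP)\quad \left\{\begin{array}{l} \langle C, X\rangle \to \min \\ \mathcal{A}(X) = b \\ X \succeq 0 \end{array}\right.
\]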
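We denote the set of feasible points of $(SDP)$ by $F_P := \{X \in S^n : \mathcal{A}(X) = b \text{ and } X \succeq 0\}$.

Remark
There is no loss of generality by assuming symmetry of the matrix $C$: If $C$ is not symmetric, we can replace $C$ by $\frac{1}{2}(C + C^T)$, since $\langle C, X\rangle = \langle C^T, X\rangle$. The same holds for the matrices $A^{(j)}$, $1 \le j \le m$.

The unique adjoint operator $\mathcal{A}^* : \mathbb{R}^m \to S^n$ to $\mathcal{A}$ is given by the property $\langle y, \mathcal{A}(X)\rangle = \langle \mathcal{A}^*(y), X\rangle$ for all $X \in S^n$ and $y \in \mathbb{R}^m$.

Remark
The adjoint operator $\mathcal{A}^*$ has the form
\[
\mathcal{A}^*(y) = \sum_{j=1}^m y_j A^{(j)} .
\]
Proof: By (1) it holds:
\[
\langle \mathcal{A}^*(y), X\rangle = \langle y, \mathcal{A}(X)\rangle = \Bigl\langle y, \sum_{j=1}^m \langle A^{(j)}, X\rangle\, e_j \Bigr\rangle = \sum_{j=1}^m \langle A^{(j)}, X\rangle \langle y, e_j\rangle = \sum_{j=1}^m \langle A^{(j)}, X\rangle\, y_j = \Bigl\langle \sum_{j=1}^m y_j A^{(j)}, X \Bigr\rangle
\]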
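The defining property of the adjoint, $\langle y, \mathcal{A}(X)\rangle = \langle \mathcal{A}^*(y), X\rangle$, can be illustrated with a small sketch. The helper names `op_A` and `op_A_adj` and the sample data are hypothetical and only serve the illustration; integer entries keep the arithmetic exact.

```python
def inner(A, B):
    """Standard inner product <A, B> = sum_ij a_ij * b_ij."""
    n = len(A)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n))

def op_A(mats, X):
    """A(X) = (<A^(1), X>, ..., <A^(m), X>)."""
    return [inner(Aj, X) for Aj in mats]

def op_A_adj(mats, y):
    """A*(y) = sum_j y_j A^(j)."""
    n = len(mats[0])
    return [[sum(yj * Aj[r][c] for yj, Aj in zip(y, mats)) for c in range(n)]
            for r in range(n)]

# Illustrative data: two symmetric 2x2 matrices, one X, one y.
mats = [[[1, 0], [0, 0]], [[0, 1], [1, 0]]]
X = [[2, -1], [-1, 3]]
y = [4, -5]

lhs = sum(yj * aj for yj, aj in zip(y, op_A(mats, X)))  # <y, A(X)>
rhs = inner(op_A_adj(mats, y), X)                       # <A*(y), X>
print(lhs, rhs)  # 18 18
```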
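For convenience we write $\mathcal{A}X$ for $\mathcal{A}(X)$ and $\mathcal{A}^* y$ for $\mathcal{A}^*(y)$.

Let us now consider the (Lagrange) dual problem belonging to $(SDP)$: The Lagrange function $L$ of $(SDP)$ is given by $L(X, y) := \langle C, X\rangle + \langle y, b - \mathcal{A}X\rangle$, where $y \in \mathbb{R}^m$ is a multiplier. Let $\varphi$ be the dual function, that is,
\[
\varphi(y) := \inf_{X \succeq 0} L(X, y) = \inf_{X \succeq 0} \bigl( \langle y, b\rangle + \langle C - \mathcal{A}^* y, X\rangle \bigr).
\]
If $C - \mathcal{A}^* y \succeq 0$, then we get by theorem 7.1.1 $\langle C - \mathcal{A}^* y, X\rangle \ge 0$ for any $X \succeq 0$. Thus $\varphi(y) = \langle b, y\rangle$. Otherwise, also by theorem 7.1.1, there exists an $X \succeq 0$ with $\langle C - \mathcal{A}^* y, X\rangle < 0$. For any $k \in \mathbb{N}$ it holds that $kX \succeq 0$ and so
\[
\langle C - \mathcal{A}^* y, kX\rangle = k \langle C - \mathcal{A}^* y, X\rangle \to -\infty \quad \text{for } k \to \infty.
\]
Thus $\varphi(y) = -\infty$. Altogether, we have
\[
\varphi(y) = \begin{cases} \langle y, b\rangle, & \text{if } C - \mathcal{A}^* y \succeq 0 \\ -\infty, & \text{otherwise.} \end{cases}
\]
For the effective domain of $\varphi$ we get $F_D = \{y \in \mathbb{R}^m \mid C - \mathcal{A}^* y \succeq 0\}$ and with that the Lagrange dual problem
\[
(DSDP)\quad \left\{\begin{array}{l} \langle b, y\rangle \to \max \\ C - \mathcal{A}^* y \succeq 0. \end{array}\right. \qquad (2)
\]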
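The condition $C - \mathcal{A}^* y \succeq 0$ can be rewritten as $\mathcal{A}^* y + S = C$ for some $S \in S^n_+$ called the slack variable. If $b \in R(\mathcal{A})$, then we have:

Lemma 7.1.5
The dual problem $(DSDP)$ itself is a semidefinite problem.

Proof: Let $X \in S^n$ be fixed such that $\mathcal{A}X = b$. With that the following holds for $y \in \mathbb{R}^m$ and $S := C - \mathcal{A}^* y$:
\[
\langle b, y\rangle = \langle \mathcal{A}X, y\rangle = \langle X, \mathcal{A}^* y\rangle = \langle X, C - S\rangle
\]
The dual problem can then be written in the form
\[
\left\{\begin{array}{l} \langle X, C - S\rangle \to \max \\ S \succeq 0,\ C - S \in R(\mathcal{A}^*) \end{array}\right. \quad\text{or}\quad \left\{\begin{array}{l} \langle X, S\rangle \to \min \\ C - S \in R(\mathcal{A}^*),\ S \succeq 0. \end{array}\right.
\]
Let now, in the nontrivial case $R(\mathcal{A}^*) \ne S^n$, $F^{(1)}, \ldots, F^{(k)}$ be a basis of $N(\mathcal{A})$. Then we have $C - S \in R(\mathcal{A}^*) = N(\mathcal{A})^\perp$ if and only if $\langle C - S, F^{(\kappa)}\rangle = 0$, or equivalently $\langle F^{(\kappa)}, S\rangle = \langle C, F^{(\kappa)}\rangle =: f_\kappa$, holds for all $\kappa = 1, \ldots, k$. Thus we get the dual problem in the desired form
\[
\left\{\begin{array}{l} \langle X, S\rangle \to \min \\ \langle F^{(\kappa)}, S\rangle = f_\kappa \quad \text{for } \kappa = 1, \ldots, k \\ S \succeq 0. \end{array}\right.
\]
If the matrices $A^{(1)}, \ldots, A^{(m)}$ are linearly independent, then the assumption $b \in R(\mathcal{A})$ is given, as $R(\mathcal{A}) = \mathbb{R}^m$.

Proof: For $y \in \mathbb{R}^m \setminus \{0\}$ it holds that $\mathcal{A}^* y = \sum_{j=1}^m y_j A^{(j)} \ne 0$, that is, $N(\mathcal{A}^*) = \{0\}$, hence $R(\mathcal{A}) = N(\mathcal{A}^*)^\perp = \mathbb{R}^m$.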
Demanding the linear independence of the matrices $A^{(1)}, \ldots, A^{(m)}$, which we will do later on anyway, is not an essential restriction: If these matrices are linearly dependent, we can, in the nontrivial case, choose wlog $A^{(1)}, \ldots, A^{(\ell)}$ as a basis of the span of $A^{(1)}, \ldots, A^{(m)}$ for a suitable $\ell \in \{1, \ldots, m-1\}$. If $\langle b, y\rangle = 0$ holds for all $y \in N(\mathcal{A}^*)$, then the two problems can be rewritten equivalently as
\[
(SDP)\quad \left\{\begin{array}{l} \langle C, X\rangle \to \min \\ \langle A^{(j)}, X\rangle = b_j \quad (1 \le j \le \ell) \\ X \succeq 0 \end{array}\right.
\]
and
\[
(DSDP)\quad \left\{\begin{array}{l} \langle b, y\rangle \to \max \\ C - \sum_{j=1}^{\ell} y_j A^{(j)} \succeq 0 \end{array}\right. \quad (y \in \mathbb{R}^\ell).
\]
If, however, $\langle b, y\rangle \ne 0$ for some $y \in N(\mathcal{A}^*)$, hence wlog $\langle b, y\rangle > 0$, then the dual problem is unbounded if it has feasible points.
Remark
The dual problem of $(DSDP)$ formulated in equality form with a slack variable is the primal problem $(SDP)$.

Proof: We proceed in the same way as above when we constructed $(DSDP)$. If we write $(DSDP)$ as a minimization problem, we get the Lagrange function $L$ of $(DSDP)$ by $L(y, X, S) := -\langle b, y\rangle + \langle X, \mathcal{A}^* y + S - C\rangle$, where $X$ is a multiplier. Wlog we can assume that $X$ is symmetric (compare the first remark on page 305). For the dual function denoted by $\varphi$ we get
\[
\varphi(X) := \inf_{S \succeq 0,\ y \in \mathbb{R}^m} L(y, X, S) = -\langle C, X\rangle + \inf_{y \in \mathbb{R}^m} \langle \mathcal{A}X - b, y\rangle + \inf_{S \succeq 0} \langle X, S\rangle .
\]
For $\mathcal{A}X - b \ne 0$ we have $\inf_{y \in \mathbb{R}^m} \langle \mathcal{A}X - b, y\rangle = -\infty$. Otherwise this infimum is zero. If $X$ is not positive semidefinite, then we have $\inf_{S \succeq 0} \langle X, S\rangle = -\infty$. We obtain this in the same way as on page 305 f. If $X \in S^n_+$ holds, then this term is zero. This yields
\[
\varphi(X) = \begin{cases} -\langle C, X\rangle, & \mathcal{A}X = b \text{ and } X \succeq 0 \\ -\infty, & \text{otherwise.} \end{cases}
\]
Hence, the effective domain of $\varphi$ is the set $\{X \in S^n_+ : \mathcal{A}X = b\}$. To sum up, the dual problem to $(DSDP)$ is
\[
\left\{\begin{array}{l} -\langle C, X\rangle \to \max \\ \mathcal{A}X = b,\ X \succeq 0. \end{array}\right.
\]
If we exchange max and min, we get exactly the primal problem.
Weak duality can be confirmed easily:
\[
\langle C, X\rangle - \langle b, y\rangle = \langle C, X\rangle - \langle \mathcal{A}X, y\rangle = \langle C - \mathcal{A}^* y, X\rangle \ge 0, \qquad (3)
\]
where $X \in F_P$ and $y \in F_D$. $\langle C - \mathcal{A}^* y, X\rangle = \langle S, X\rangle$ is called the duality gap for feasible $X$ and $y$. We denote
\[
p^* := \inf\{\langle C, X\rangle \mid X \in F_P\}, \qquad d^* := \sup\{\langle b, y\rangle \mid y \in F_D\}.
\]
If the duality gap $p^* - d^*$ vanishes, that is, $p^* = d^*$, we say that strong duality holds. A matrix $X \in F_P$ with $\langle C, X\rangle = p^*$ is called a minimizer for $(SDP)$. Correspondingly, a vector $y \in F_D$ with $\langle b, y\rangle = d^*$ is called a maximizer for $(DSDP)$.

Lemma 7.1.6
If there are feasible $X^*$ and $y^*$ with $\langle C, X^*\rangle = \langle b, y^*\rangle$, i.e., the duality gap for $X^*$ and $y^*$ is zero, then $X^*$ and $y^*$ are optimizers for the primal and dual problem, respectively.
Proof: By weak duality we get for each $X \in F_P$ that $\langle C, X\rangle \ge \langle b, y^*\rangle = \langle C, X^*\rangle$. For any $y \in F_D$ it holds that $\langle b, y\rangle \le \langle C, X^*\rangle = \langle b, y^*\rangle$, respectively.

Later on we will see that the other implication, in contrast to linear optimization, is not always true. In addition, duality results are weaker for semidefinite optimization than for linear optimization.

Definition
A matrix $X$ is called strictly (primal) feasible iff $X \in F_P$ and $X \succ 0$. A vector $y$ is called strictly (dual) feasible iff $y \in F_D$ and $C - \mathcal{A}^* y \succ 0$. The condition to have strictly feasible points for both the primal and the dual problem is called the Slater condition.

The following Duality Theorem will be given without proof. A proof requires an extension of the Theorem of the Alternative (Farkas), which can be proved by separation theorems and then applied to cones which, unlike in linear optimization, are not polyhedral:
Theorem 7.1.7 (Duality Theorem)
Assume that both the primal and the dual semidefinite problem have feasible points.
a) If the dual problem has a strictly feasible point $y$, then a minimizer to the primal problem exists and $\min\{\langle C, X\rangle : X \in F_P\} = \sup\{\langle b, y\rangle : y \in F_D\}$.
b) If the primal problem has a strictly feasible point $X$, then a maximizer to the dual problem exists and $\inf\{\langle C, X\rangle : X \in F_P\} = \max\{\langle b, y\rangle : y \in F_D\}$.
c) If both problems have strictly feasible points, then both have optimizers, whose values coincide, that is, $\min\{\langle C, X\rangle : X \in F_P\} = \max\{\langle b, y\rangle : y \in F_D\}$.

Instead of pursuing these theoretical aspects any further, we will now first look at some special cases, examples and supplementary considerations.
7.2 Selected Special Cases
Semidefinite optimization is not only important due to its variety of practical applications², it also includes many other branches of optimization, e.g., linear optimization and quadratic optimization with both linear and quadratic constraints. Second-order cone problems, which contain quasi-convex nonlinear problems, can also be interpreted as semidefinite problems. Semidefinite optimization itself belongs to cone optimization, which is part of convex optimization. In the following, several connections will be pointed out. Below $\operatorname{Diag}(x)$ denotes the diagonal matrix whose diagonal elements are the entries of a vector $x \in \mathbb{R}^n$ and $\operatorname{diag}(X)$ denotes the column vector consisting of the diagonal elements of a matrix $X \in \mathbb{R}^{n\times n}$.

² Confer [Be/Ne].

Linear Optimization and Duality Complications

In the following considerations we use the fact that $\langle \operatorname{Diag}(a), \operatorname{Diag}(b)\rangle_{S^n} = \langle a, b\rangle_{\mathbb{R}^n}$
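holds for all $a, b \in \mathbb{R}^n$. Linear optimization is a special case of semidefinite optimization. If we have a linear problem
\[
(LP)\quad \left\{\begin{array}{l} \langle c, x\rangle \to \min \\ Ax = b,\ x \ge 0 \end{array}\right.
\]
with $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$ and $c \in \mathbb{R}^n$, we can transfer it to a problem in $(SDP)$ form: If we set $C := \operatorname{Diag}(c)$, $X := \operatorname{Diag}(x)$ and $A^{(j)} := \operatorname{Diag}\bigl((a^{(j)})^T\bigr)$, where $a^{(j)}$ is the $j$-th row of $A$ for $1 \le j \le m$, we get $\langle c, x\rangle = \langle C, X\rangle$ and $Ax = \mathcal{A}X$. Finally, $x \ge 0$ means $X \succeq 0$.

On the other hand, an $(SDP)$ can be formulated as an $(LP)$, to a certain degree. To this end the vec-operator, which identifies $\mathbb{R}^{n\times n}$ with $\mathbb{R}^{n^2}$, can be used:
\[
\operatorname{vec} : \mathbb{R}^{n\times n} \to \mathbb{R}^{n^2}, \qquad \operatorname{vec}(A) := \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix},
\]
where $a_1, \ldots, a_n$ are the columns of $A$. By this definition we get for $1 \le k \le m$
\[
\langle \operatorname{vec}(C), \operatorname{vec}(X)\rangle = \sum_{j=1}^n \sum_{i=1}^n c_{ij} x_{ij} = \langle C, X\rangle, \qquad \langle A^{(k)}, X\rangle = \langle \operatorname{vec}(A^{(k)}), \operatorname{vec}(X)\rangle .
\]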
In general, however, the eponymous constraint $X \succeq 0$ cannot simply be translated. If the matrices $C$ and $A^{(k)}$ are diagonal matrices, the objective function and the equality constraint of $(SDP)$ correspond to the ones in $(LP)$: With $c := \operatorname{diag}(C)$, $x := \operatorname{diag}(X)$, $a^{(j)} := \operatorname{diag}(A^{(j)})$ for $1 \le j \le m$ we have
\[
\langle C, X\rangle = \langle c, x\rangle, \qquad \langle A^{(j)}, X\rangle = \langle a^{(j)}, x\rangle .
\]
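The embedding of a linear problem into the semidefinite form can be traced on a toy instance. The sketch below uses made-up data (`c`, `x`, `rows_of_A` are illustrative, not from the text) and only checks that the objective and the equality constraints agree under $C = \operatorname{Diag}(c)$, $X = \operatorname{Diag}(x)$, $A^{(j)} = \operatorname{Diag}(a^{(j)})$.

```python
def diag(v):
    """Diag(v): diagonal matrix with the entries of v on its diagonal."""
    n = len(v)
    return [[v[i] if i == j else 0 for j in range(n)] for i in range(n)]

def inner(A, B):
    """Standard inner product <A, B> = sum_ij a_ij * b_ij."""
    n = len(A)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n))

# Toy LP data (assumed for illustration only).
c = [3, 1, 2]
x = [1, 0, 4]
rows_of_A = [[1, 1, 1], [2, 0, -1]]

C, X = diag(c), diag(x)
A_mats = [diag(a) for a in rows_of_A]

objective_lp = sum(ci * xi for ci, xi in zip(c, x))    # <c, x>
objective_sdp = inner(C, X)                            # <C, X>
Ax = [sum(aj * xj for aj, xj in zip(a, x)) for a in rows_of_A]
AX = [inner(Aj, X) for Aj in A_mats]                   # operator applied to X

print(objective_lp, objective_sdp)  # 11 11
print(Ax, AX)                       # [5, -2] [5, -2]
```

For a diagonal $X$ the condition $X \succeq 0$ reduces to $x \ge 0$, since a diagonal matrix is a block diagonal matrix with $1 \times 1$ blocks.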
Notwithstanding these similarities, one must be careful about promptly transferring the results gained for linear optimization to semidefinite optimization, since there are, in particular, differences in duality properties to be aware of. Such differences will be shown in the following three examples. The crucial point there is the existence of strictly feasible points.
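Example 1 (cf. [Todd 2]): $F_P = \emptyset$, max(D) exists

We consider the following problem in standard form
\[
\Bigl\langle \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, X \Bigr\rangle \to \min, \qquad \Bigl\langle \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, X \Bigr\rangle = 0, \quad \Bigl\langle \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, X \Bigr\rangle = 2, \quad X \succeq 0 .
\]
In other words the constraints are $x_{11} = 0$ and $x_{21} = x_{12} = 1$ for $X \in S^2$, but such a matrix is not positive semidefinite. So the feasible set is empty and the infimum is infinite. The corresponding dual problem is
\[
2 y_2 \to \max, \qquad y_1 \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} + y_2 \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \preceq \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} .
\]
We get
\[
S = \begin{pmatrix} -y_1 & -y_2 \\ -y_2 & 0 \end{pmatrix} \succeq 0 .
\]
Thus it must hold: $y_2 = 0$ and $y_1 \le 0$. Therefore the supremum is zero and a maximizer is $y = (0, 0)^T$.

Example 2 (cf. [Hel]): max(D) exists, min(P) does not exist, though inf(P) is finite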
We consider the problem
\[
x_{11} \to \min, \qquad \begin{pmatrix} x_{11} & 1 \\ 1 & x_{22} \end{pmatrix} \succeq 0
\]
and formulate it in ‘standard’ form: We have $x_{11} = \langle C, X\rangle$ for
\[
C := \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.
\]
With
\[
A^{(1)} := \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\]
it holds that $\langle A^{(1)}, X\rangle = x_{21} + x_{12} = 2 x_{12}$. Finally, if we set $b = 2$, we obtain the standard form of the primal problem. With
\[
C - y_1 A^{(1)} = \begin{pmatrix} 1 & -y_1 \\ -y_1 & 0 \end{pmatrix}
\]
we get the dual problem
\[
2 y_1 \to \max, \qquad \begin{pmatrix} 1 & -y_1 \\ -y_1 & 0 \end{pmatrix} \succeq 0 .
\]
The dual problem’s optimum is zero, and for a maximizer $y$ it holds that $y_1 = 0$. The primal problem has a strictly feasible point (for example $x_{11} = x_{22} = 2$). Because of the semidefiniteness constraint the principal minors must be nonnegative: $x_{11} \ge 0$ and $x_{11} x_{22} - 1 \ge 0$. (Here, it can be seen that the constraint $X \succeq 0$ is not just a simple extension of the linear constraint $x \ge 0$.) Since $x_{11} \ge 1/x_{22}$, the infimum is zero, but it is not attained. This example shows that, unlike in the case of linear optimization, a maximizer of the dual problem might exist while the infimum of the primal problem is not attained if no strictly dual feasible point exists.

Example 3 (cf. [Hel]): max(D) and min(P) exist with $p^* \ne d^*$

As in linear optimization a zero duality gap at feasible points $X$ and $y$ implies that they are optimizers (see lemma 7.1.6), but the following example illustrates that, in contrast to linear optimization, optimality no longer implies a zero duality gap in general. The condition of strictly feasible points cannot be omitted. The considered primal problem is
\[
x_{12} \to \min, \qquad \begin{pmatrix} 0 & x_{12} & 0 \\ x_{12} & x_{22} & 0 \\ 0 & 0 & 1 + x_{12} \end{pmatrix} \succeq 0 .
\]
The standard form of the objective function is given by
\[
C := \begin{pmatrix} 0 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.
\]
The constraints for the entries of the matrix $X$ are ensured by the definition of the following matrices $A^{(1)}, \ldots, A^{(4)}$:
\[
A^{(1)} := \begin{pmatrix} 0 & -1/2 & 0 \\ -1/2 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \text{ and } b_1 := 1 \text{ yields } x_{33} = 1 + x_{12},
\]
\[
A^{(2)} := \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \text{ and } b_2 := 0 \text{ gives } x_{11} = 0,
\]
\[
A^{(3)} := \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix} \text{ and } b_3 := 0 \text{ shows } x_{13} = 0,
\]
\[
A^{(4)} := \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \text{ and } b_4 := 0 \text{ yields } x_{23} = 0.
\]
As the dual problem we get
\[
y_1 \to \max, \qquad S = \begin{pmatrix} -y_2 & (1 + y_1)/2 & -y_3 \\ (1 + y_1)/2 & 0 & -y_4 \\ -y_3 & -y_4 & -y_1 \end{pmatrix} \succeq 0
\]
with $S := C - y_1 A^{(1)} - y_2 A^{(2)} - y_3 A^{(3)} - y_4 A^{(4)}$. A necessary condition for a matrix $X$ to be primally feasible is $x_{12} = 0$. Thus a strictly primal feasible point does not exist. The infimum is attained iff $x_{12} = 0$ and $x_{22} \ge 0$. If a vector $y$ is dually feasible, $y_1 = -1$ must hold. By Sarrus’ rule we get $\det(S) = y_2 y_4^2$. Thus there exists no strictly feasible $y$: Otherwise, $\det(S) > 0$ would hold. Since $y_2 < 0$, we would get $y_4^2 < 0$. The dual’s supremum is $-1$ and a maximizer is found with $y_2 = y_3 = -1$ and $y_4 = 0$, for example. The duality gap $\langle C, X\rangle - \langle b, y\rangle = x_{12} - y_1$ is always $1$ although optimizers do exist.

Second-Order Cone Programming

Definition
The Second-Order Cone (Ice Cream Cone or Lorentz Cone), see the figure on page 270, is defined³ by
\[
L^n := \{(u, t) \in \mathbb{R}^{n-1} \times \mathbb{R} : \|u\|_2 \le t\}. \qquad (4)
\]
To show that the second-order cone can be embedded in the cone of semidefinite matrices, we use the Schur Complement:

Definition
If
\[
M = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix},
\]
where $A$ is symmetric positive definite and $C$ is symmetric, then the matrix $C - B^T A^{-1} B$ is called the Schur Complement of $A$ in $M$.

³ SeDuMi’s definition differs slightly from ours: $L^n := \{(t, u) \in \mathbb{R} \times \mathbb{R}^{n-1} : t \ge \|u\|_2\}$
Theorem 7.2.1
In the situation above the following statements are equivalent:
a) $M$ is symmetric positive (semi-) definite.
b) $C - B^T A^{-1} B$ is symmetric positive (semi-) definite.

Proof: Since $A \succ 0$, the matrix $A$ is invertible. Setting $D := -A^{-1} B$ gives
\[
\begin{pmatrix} I & 0 \\ D^T & I \end{pmatrix} \begin{pmatrix} A & B \\ B^T & C \end{pmatrix} \begin{pmatrix} I & D \\ 0 & I \end{pmatrix} = \begin{pmatrix} A & 0 \\ 0 & C - B^T A^{-1} B \end{pmatrix} =: N .
\]
The matrix $\begin{pmatrix} I & D \\ 0 & I \end{pmatrix}$ is regular, thus $M$ is (symmetric) positive (semi-) definite iff $N$ is (symmetric) positive (semi-) definite. $N$ is (symmetric) positive (semi-) definite iff its diagonal blocks are (symmetric) positive (semi-) definite by the remark on page 304. This is the case iff the Schur Complement $C - B^T A^{-1} B$ is positive (semi-) definite.

With $(u, t) \in \mathbb{R}^{n-1} \times \mathbb{R}_+$ we look at
\[
E_{t,u} := \begin{pmatrix} t I_{n-1} & u \\ u^T & t \end{pmatrix}.
\]
Due to $t \ge 0$, $(u, t) \in L^n$ is equivalent to $t^2 - u^T u \ge 0$. For $t > 0$ this is equivalent to $t - u^T (t I_{n-1})^{-1} u \ge 0$, and we get by theorem 7.2.1:
\[
(u, t) \in L^n \iff E_{t,u} \succeq 0 \qquad (5)
\]
If $t = 0$, it holds that $u = 0$ and $E_{0,0} \succeq 0$. Therefore, the second-order cone can be embedded in the cone of symmetric positive semidefinite matrices.

For $c \in \mathbb{R}^n$, $A^{(j)} \in \mathbb{R}^{n_j \times n}$, $a^{(j)} \in \mathbb{R}^{n_j}$, $b^{(j)} \in \mathbb{R}^n$ and $d^{(j)} \in \mathbb{R}$ $(1 \le j \le m)$ a second-order cone problem has the form
\[
(SOCP)\quad \left\{\begin{array}{l} \langle c, x\rangle \to \min \\ \|A^{(j)} x + a^{(j)}\|_2 \le \langle b^{(j)}, x\rangle + d^{(j)} \quad (1 \le j \le m). \end{array}\right.
\]
The constraint is called the second-order cone constraint. With (5) the second-order cone constraint is equivalent to
\[
\begin{pmatrix} \bigl(\langle b^{(j)}, x\rangle + d^{(j)}\bigr) I_{n_j} & A^{(j)} x + a^{(j)} \\ \bigl(A^{(j)} x + a^{(j)}\bigr)^T & \langle b^{(j)}, x\rangle + d^{(j)} \end{pmatrix} \succeq 0 \quad \text{for } 1 \le j \le m,
\]
that is, we can reformulate it as a positive semidefiniteness constraint.
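Equivalence (5) can be explored with a short sketch that works directly with the quadratic form of $E_{t,u}$. The function names and the explicit witness vector are assumptions made for illustration: for $\|u\|_2 > t$ and $u \ne 0$ the vector $(v, s) = (-u, \|u\|_2)$ gives the value $2\|u\|_2^2\,(t - \|u\|_2) < 0$, which certifies that $E_{t,u}$ is not positive semidefinite.

```python
import math

def quad_form_E(t, u, v, s):
    """(v, s)^T E_{t,u} (v, s) for E_{t,u} = [[t*I, u], [u^T, t]]."""
    vv = sum(vi * vi for vi in v)
    uv = sum(ui * vi for ui, vi in zip(u, v))
    return t * vv + 2 * s * uv + t * s * s

def in_second_order_cone(u, t):
    """Membership test ||u||_2 <= t from definition (4)."""
    return math.sqrt(sum(ui * ui for ui in u)) <= t

def violation_witness(u, t):
    """Return (v, s) with negative quadratic form if ||u||_2 > t, else None."""
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    if norm_u <= t:
        return None
    if norm_u == 0.0:                       # u = 0 and t < 0
        return ([0.0] * len(u), 1.0)        # quadratic form equals t < 0
    return ([-ui for ui in u], norm_u)      # value 2*||u||^2*(t - ||u||) < 0

u = [3.0, 4.0]                              # ||u||_2 = 5
v, s = violation_witness(u, 4.0)            # t = 4 violates the cone condition
print(quad_form_E(4.0, u, v, s))            # -50.0
print(violation_witness(u, 5.0))            # None: (u, 5) lies in the cone
```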
Let $a_i^{(j)} \in \mathbb{R}^{n_j}$ be the $i$-th column of $A^{(j)}$ for $1 \le i \le n$. With
\[
A_i^{(j)} := \begin{pmatrix} -b_i^{(j)} I_{n_j} & -a_i^{(j)} \\ -\bigl(a_i^{(j)}\bigr)^T & -b_i^{(j)} \end{pmatrix} \quad\text{and}\quad C^{(j)} := \begin{pmatrix} d^{(j)} I_{n_j} & a^{(j)} \\ \bigl(a^{(j)}\bigr)^T & d^{(j)} \end{pmatrix}
\]
we get the equivalent problem
\[
\left\{\begin{array}{l} \langle c, x\rangle \to \min \\ C^{(j)} - \sum_{i=1}^n x_i A_i^{(j)} \succeq 0 \quad \text{for } 1 \le j \le m. \end{array}\right.
\]
The $m$ constraints can be combined in one condition $C - \sum_{i=1}^n x_i A_i \succeq 0$, where $C$ and $A_i$ are the block matrices formed by $C^{(j)}$ and $A_i^{(j)}$ for $1 \le i \le n$ and $1 \le j \le m$, correspondingly (compare the remark on page 304). This yields a dual problem in standard form, which is itself a semidefinite problem by lemma 7.1.5:
\[
\left\{\begin{array}{l} -\langle c, x\rangle \to \max \\ C - \sum_{i=1}^n x_i A_i \succeq 0 \end{array}\right.
\]
Thus $(SOCP)$ is a special case of $(SDP)$.⁴

Lemma 7.2.2
a) Second-order cone optimization contains linear optimization.
b) Second-order cone optimization includes quadratic optimization with both quadratic and linear inequality constraints.
Proof: a) Setting $A^{(j)} := 0$ and $a^{(j)} := 0$ for $1 \le j \le m$ yields a general linear problem (in dual form).
b) A quadratic problem with quadratic inequality constraints is
\[
(QPI)\quad \left\{\begin{array}{l} \langle x, Cx\rangle + 2\langle c, x\rangle \to \min \\ \langle x, A_j x\rangle + 2\langle c_j, x\rangle + d_j \le 0 \quad (1 \le j \le m) \end{array}\right. \qquad (6)
\]
where $C, A_j \in S^n_{++}$, $c, c_j \in \mathbb{R}^n$ and $d_j \in \mathbb{R}$ for $1 \le j \le m$. Since $C$ and $A_j$ are symmetric positive definite, there exist $C^{1/2}$, $C^{-1/2}$, $A_j^{1/2}$ and $A_j^{-1/2}$. We can write problem $(QPI)$ as
\[
\left\{\begin{array}{l} \bigl\|C^{1/2} x + C^{-1/2} c\bigr\|^2 - \langle c, C^{-1} c\rangle \to \min \\ \bigl\|A_j^{1/2} x + A_j^{-1/2} c_j\bigr\|^2 - \langle c_j, A_j^{-1} c_j\rangle + d_j \le 0 \quad (1 \le j \le m). \end{array}\right. \qquad (7)
\]
We linearize the objective function by minimizing a nonnegative value $t$ such that $t^2 \ge \bigl\|C^{1/2} x + C^{-1/2} c\bigr\|^2$. Up to a square and a constant, problem (7) equals the second-order cone problem
\[
\left\{\begin{array}{l} t \to \min_{(t,x)} \\ \bigl\|C^{1/2} x + C^{-1/2} c\bigr\| \le t \\ \bigl\|A_j^{1/2} x + A_j^{-1/2} c_j\bigr\| \le \bigl(\langle c_j, A_j^{-1} c_j\rangle - d_j\bigr)^{1/2} \quad (1 \le j \le m). \end{array}\right. \qquad (8)
\]
If $q^*$ is the optimal value of the subsidiary problem, $p^* := (q^*)^2 - \langle c, C^{-1} c\rangle$ is the optimal value of (6).

Other important special cases are semidefinite relaxation of quadratic optimization with equality constraints, quasi-convex nonlinear optimization and max-cut problems. We will, however, not go into any more detail here and refer the interested reader to the special literature on this topic.

⁴ Solving an $(SOCP)$ via $(SDP)$, however, is not expedient: Interior-point methods which solve $(SOCP)$ directly have a much better worst-case complexity than $(SDP)$ interior-point methods applied to the semidefinite formulation of $(SOCP)$; compare the literature in [LVBL].
7.3 The S-Procedure
Minimal Enclosing Ellipsoid of Ellipsoids

Given a finite set of ellipsoids $E_1, \ldots, E_m$, we consider the problem of finding the ellipsoid $E$ of minimal volume which contains $E_1, \ldots, E_m$. This topic is important in statistics and cluster theory, for example. It can be formulated as a semidefinite program with a nonlinear objective function. Let $A \in S^n_{++}$ and $c \in \mathbb{R}^n$. An ellipsoid $E$ with center $c$ (and full dimension) is defined by⁵
\[
E := E(A, c) := \{x \in \mathbb{R}^n \mid \langle x - c, A(x - c)\rangle \le 1\}
\]
or alternatively
\[
E = \{x \in \mathbb{R}^n \mid \langle x, Ax\rangle - 2\langle c, Ax\rangle + \langle c, Ac\rangle - 1 \le 0\}.
\]
We consider the condition $E_j \subset E$ for $1 \le j \le m$. To this end, we use the so-called S-Procedure:

⁵ The definition here differs minimally from the one in section 3.1: the matrix $A$ is used instead of $A^{-1}$.
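Lemma 7.3.1 (S-Procedure)
Let $A_1, A_2 \in S^n_{++}$, $c_1, c_2 \in \mathbb{R}^n$ and $d_1, d_2 \in \mathbb{R}$. If there exists an $\tilde x \in \mathbb{R}^n$ with $\langle \tilde x, A_1 \tilde x\rangle + 2\langle c_1, \tilde x\rangle + d_1 < 0$, then the following statements are equivalent:
a) The following implication holds:
\[
\langle x, A_1 x\rangle + 2\langle c_1, x\rangle + d_1 \le 0 \implies \langle x, A_2 x\rangle + 2\langle c_2, x\rangle + d_2 \le 0
\]
b) There exists a $\lambda \ge 0$ such that
\[
\begin{pmatrix} A_2 & c_2 \\ c_2^T & d_2 \end{pmatrix} \preceq \lambda \begin{pmatrix} A_1 & c_1 \\ c_1^T & d_1 \end{pmatrix}.
\]
Proof: The implication from b) to a) is ‘trivial’: For $x \in \mathbb{R}^n$ and $j = 1, 2$ we have
\[
(x^T, 1) \begin{pmatrix} A_j & c_j \\ c_j^T & d_j \end{pmatrix} \begin{pmatrix} x \\ 1 \end{pmatrix} = \langle x, A_j x\rangle + 2\langle c_j, x\rangle + d_j .
\]
Note that the existence of $\tilde x$ is not necessary for this implication.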
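The identity used for the direction b) to a) can be checked on a small instance. The sketch below is only an illustration with made-up integer data; the helper names `quad`, `block` and `homogeneous_form` are not from the text.

```python
def quad(A, c, d, x):
    """q(x) = <x, A x> + 2 <c, x> + d."""
    n = len(x)
    xAx = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return xAx + 2 * sum(ci * xi for ci, xi in zip(c, x)) + d

def block(A, c, d):
    """Assemble the (n+1)x(n+1) matrix [[A, c], [c^T, d]]."""
    M = [list(row) + [c[i]] for i, row in enumerate(A)]
    M.append(list(c) + [d])
    return M

def homogeneous_form(M, x):
    """(x^T, 1) M (x; 1)."""
    z = list(x) + [1]
    n = len(z)
    return sum(z[i] * M[i][j] * z[j] for i in range(n) for j in range(n))

# Illustrative data (assumed).
A = [[2, 1], [1, 3]]
c = [1, -2]
d = -4
x = [3, -1]

print(quad(A, c, d, x), homogeneous_form(block(A, c, d), x))  # 21 21
```

With this identity, a matrix inequality as in b) evaluated at the vector $(x^T, 1)^T$ gives $q_2(x) \le \lambda\, q_1(x)$ for the two quadratic functions, so $q_1(x) \le 0$ and $\lambda \ge 0$ force $q_2(x) \le 0$.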
We postpone the proof of the other direction for the moment and first look at the important application of the S-Procedure to the ellipsoid problem. The existence of a point $\tilde x$ in the lemma corresponds to the requirement that the interior of the corresponding ellipsoid is nonempty. With $A_1 := A_j$, $c_1 := -A_j c_j$, $d_1 := \langle c_j, A_j c_j\rangle - 1$, $A_2 := A$, $c_2 := -Ac$, $d_2 := \langle c, Ac\rangle - 1$ and $\operatorname{int}(E_j) \ne \emptyset$, item a) in the lemma means $E_j \subset E$. Then it follows directly that $E_j \subset E$ holds if and only if there exists a $\lambda_j \ge 0$ such that
\[
\begin{pmatrix} A & -Ac \\ (-Ac)^T & \langle c, Ac\rangle - 1 \end{pmatrix} - \lambda_j \begin{pmatrix} A_j & -A_j c_j \\ (-A_j c_j)^T & \langle c_j, A_j c_j\rangle - 1 \end{pmatrix} \preceq 0 . \qquad (9)
\]
In this section we use an alternative formulation of the theorem about the Schur complement: Let $E, G$ be symmetric matrices and $G \succ 0$. Then the matrix $M$ given by
\[
M := \begin{pmatrix} E & F \\ F^T & G \end{pmatrix}
\]
is positive semidefinite if and only if $E - F G^{-1} F^T$ is positive semidefinite.

We use this to ‘linearize’ (9):

Lemma 7.3.2
With $b := -Ac$ and $b_j := -A_j c_j$ for $1 \le j \le m$ condition (9) is equivalent to
\[
\begin{pmatrix} A & b & 0 \\ b^T & -1 & b^T \\ 0 & b & -A \end{pmatrix} - \lambda_j \begin{pmatrix} A_j & b_j & 0 \\ b_j^T & \langle c_j, A_j c_j\rangle - 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} \preceq 0 . \qquad (10)
\]
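Proof: For clarification we denote the $(n, n)$-matrix of zeros by $0_{n\times n}$ and the $n$-dimensional vector of zeros by $0_n$ here. We formulate (10) as
\[
M := \begin{pmatrix} \lambda_j A_j - A & \lambda_j b_j - b & 0_{n\times n} \\ \lambda_j b_j^T - b^T & \lambda_j \langle c_j, A_j c_j\rangle - \lambda_j + 1 & -b^T \\ 0_{n\times n} & -b & A \end{pmatrix} \succeq 0 . \qquad (11)
\]
With
\[
E := \begin{pmatrix} \lambda_j A_j - A & \lambda_j b_j - b \\ \lambda_j b_j^T - b^T & \lambda_j \langle c_j, A_j c_j\rangle - \lambda_j + 1 \end{pmatrix} \quad\text{and}\quad F := \begin{pmatrix} 0_{n\times n} \\ -b^T \end{pmatrix}
\]
we write the matrix $M$ in block form
\[
M = \begin{pmatrix} E & F \\ F^T & A \end{pmatrix}.
\]
It holds that
\[
F A^{-1} F^T = \begin{pmatrix} 0_{n\times n} & 0_n \\ 0_n^T & \langle b, A^{-1} b\rangle \end{pmatrix}.
\]
Thus, we have
\[
E - F A^{-1} F^T = \lambda_j \begin{pmatrix} A_j & b_j \\ b_j^T & \langle c_j, A_j c_j\rangle - 1 \end{pmatrix} - \begin{pmatrix} A & b \\ b^T & \langle b, A^{-1} b\rangle - 1 \end{pmatrix}.
\]
By definition of $b$ it holds that $\langle c, Ac\rangle - 1 = \langle b, A^{-1} b\rangle - 1$. If we use the alternative formulation of the Schur complement, we see that condition (11) is equivalent to (9).

From section 3.1 we know:
\[
\operatorname{vol}(E) = \omega_n \sqrt{\det A^{-1}},
\]
where $\omega_n$ is the volume of the $n$-dimensional unit ball. Thus the volume $\operatorname{vol}(E)$ of the ellipsoid $E$ is proportional to $\sqrt{\det A^{-1}}$, and minimizing $\operatorname{vol}(E)$ means minimizing $\det A^{-1}$ for $A \in S^n_{++}$. Since the logarithm log is strictly isotone, we consider the objective function given by $\log \det A^{-1} = -\log \det A$, which is a convex function (see section 7.4).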
The semidefinite optimization problem (with nonlinear objective function) used for finding the minimal enclosing ellipsoid of ellipsoids is given by
\[
\left\{\begin{array}{l} -\log \det A \to \min_{(\lambda, c, A)} \\[4pt] \begin{pmatrix} A & b & 0 \\ b^T & -1 & b^T \\ 0 & b & -A \end{pmatrix} - \lambda_j \begin{pmatrix} A_j & b_j & 0 \\ b_j^T & \langle c_j, A_j c_j\rangle - 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} \preceq 0 \quad \text{for } 1 \le j \le m \\[4pt] A \succ 0, \quad \lambda = (\lambda_1, \ldots, \lambda_m)^T \in \mathbb{R}^m_+ . \end{array}\right. \qquad (12)
\]
In the same way as in the considerations in section 7.2 this problem can be transferred to a semidefinite problem with constraints in standard form. Semidefinite programs with a nonlinear objective function and linear constraints can also be solved by primal–dual interior-point methods. We limit ourselves to linear semidefinite programs. For further study we refer to the work of Yamashita, Yabe and Harada, who consider general nonlinear semidefinite optimization (cf. [YYH]), and the work of Toh and of Vandenberghe, Boyd and Wu, who concentrate on the special case of determinant maximization problems with linear matrix inequality constraints (cf. [Toh] and [VBW]).

In the following we will give some auxiliary considerations as a preparation for the proof of the nontrivial direction of the central lemma 7.3.1 (S-Procedure), which is also an important tool in other branches of mathematics (e.g., control theory). The proof, however, is only intended for readers with a special interest in mathematics. In the literature, one can find a number of proofs which employ unsuitable means and are therefore difficult to understand. The following clear and elementary proof was given by Markus Sigg on the basis of [Ro/Wo].
Lemma 7.3.3
\[
\sum_{x \in \{-1,1\}^n} x x^T = 2^n I
\]
Proof: On the left the diagonal entries of all $2^n$ summands are $x_i^2 = 1$. For the off-diagonal entries the number of entries $x_i x_j = 1$ obtained by the products $1 \cdot 1$ and $(-1) \cdot (-1)$ equals the number of entries $x_i x_j = -1$ obtained by the products $(-1) \cdot 1$ and $1 \cdot (-1)$.

Lemma 7.3.4
For any $A \in \mathbb{R}^{n\times n}$ it holds that
\[
2^n \operatorname{tr} A = \sum_{x \in \{-1,1\}^n} \langle x, Ax\rangle .
\]
Proof: Lemma 7.3.3 combined with the linearity of the trace and 7.1.3 yields
\[
2^n \operatorname{tr} A = 2^n \operatorname{tr}(I A) = \sum_{x \in \{-1,1\}^n} \operatorname{tr}(x x^T A) = \sum_{x \in \{-1,1\}^n} \langle x, Ax\rangle .
\]
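The identity of lemma 7.3.4 is easy to verify by brute force for small $n$. The following sketch enumerates all sign vectors with `itertools.product`; the test matrix is an arbitrary (deliberately nonsymmetric) illustration.

```python
from itertools import product

def trace(A):
    """Sum of the diagonal entries of A."""
    return sum(A[i][i] for i in range(len(A)))

def quad_form(A, x):
    """<x, A x>."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def sign_vector_sum(A):
    """Sum of <x, A x> over all 2^n vectors x in {-1, 1}^n."""
    n = len(A)
    return sum(quad_form(A, x) for x in product((-1, 1), repeat=n))

A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]   # illustrative 3x3 matrix
print(2 ** len(A) * trace(A), sign_vector_sum(A))  # 128 128
```

The off-diagonal contributions cancel in pairs, exactly as in the proof of lemma 7.3.3, so only the diagonal entries survive.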
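Lemma 7.3.5
Let $P, Q \in S^n$ with $\operatorname{tr} Q \le 0 < \operatorname{tr} P$. Then there exists a vector $y \in \mathbb{R}^n$ with $\langle y, Qy\rangle \le 0 < \langle y, Py\rangle$.

Proof: Let $U \in \mathbb{R}^{n\times n}$ be an orthogonal matrix diagonalizing $Q$, i.e., $Q = U^T D U$ where $D := \operatorname{Diag}\bigl((\lambda_1, \ldots, \lambda_n)^T\bigr)$ with the eigenvalues $\lambda_1, \ldots, \lambda_n$ of $Q$. By lemma 7.3.4 we get
\[
0 < 2^n \operatorname{tr} P = 2^n \operatorname{tr}(U P U^T) = \sum_{x \in \{-1,1\}^n} \langle x, U P U^T x\rangle = \sum_{x \in \{-1,1\}^n} \langle U^T x, P (U^T x)\rangle .
\]
Hence there exists an $x \in \{-1, 1\}^n$ with $\langle y, Py\rangle > 0$ for $y := U^T x$ and
\[
\langle y, Qy\rangle = \langle U^T x, U^T D U U^T x\rangle = \langle x, Dx\rangle = \operatorname{tr} D = \operatorname{tr} Q \le 0 .
\]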
, Ax
< 0. a) There exists an x
∈ Rn with x b) For any ε > 0 there exists a λ ≥ 0 with B λA + εI . Then B λA for a suitable λ ≥ 0 . B λk A + k1 I for any k ∈ N. Proof: By b) there exists a λk 1≥ 0 with In particular we have x
, (B − k I) x
≤ λk x
, A x
. As the left-hand side converges for k → ∞, there exists a convergent subsequence (λkν ) of (λk ) by a). With the corresponding limit λ ≥ 0 we have for any x ∈ Rn x, B x = lim x, B − 1 I x ≤ lim λkν x, Ax = λ x, Ax . kν ν→∞ ν→∞
Chapter 7
Now we prove lemma 7.3.1 in the homogeneous case, i. e.: Proposition 7.3.7 Let A, B ∈ S n . If there exists an x
∈ Rn with x
, Ax
< 0, then the following statements are equivalent: a) For any x ∈ Rn with x, Ax ≤ 0 it holds that x, B x ≤ 0 . b) There exists a λ ≥ 0 such that B λA . Proof: The implication from b) to a) is ‘trivial’ (cf. p. 317). For the other implication — a) =⇒ b) — we consider the optimization problem ⎧ ⎪ ⎨ μ −→ max B λA − μI (λ, μ ∈ R) (D) ⎪ ⎩ λ ≥ 0.
7.3
The S-Procedure
321
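If we prove $d^* = \sup\{\mu \in \mathbb{R} \mid \exists \lambda \in \mathbb{R}_+ : B \preceq \lambda A - \mu I\} \ge 0$, then we obtain b) by lemma 7.3.6. For $\lambda, \mu \in \mathbb{R}$ we define
\[
y := \begin{pmatrix} \lambda \\ \mu \end{pmatrix}, \quad b := \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad C := \begin{pmatrix} 0 & 0 \\ 0 & -B \end{pmatrix}
\]
and
\[
\mathcal{A}^* y := y_1 A_1 + y_2 A_2 := \lambda \begin{pmatrix} -1 & 0 \\ 0 & -A \end{pmatrix} + \mu \begin{pmatrix} 0 & 0 \\ 0 & I \end{pmatrix}
\]
with $C, A_1, A_2 \in S^{1+n}$. Problem $(D)$ is a (dual) semidefinite problem in standard form
\[
(D)\quad \left\{\begin{array}{l} \langle b, y\rangle \to \max \\ C - \mathcal{A}^* y \succeq 0 \end{array}\right. \quad (y \in \mathbb{R}^2).
\]
The corresponding primal problem is given by
\[
(P)\quad \left\{\begin{array}{l} \langle C, X\rangle \to \min \\ \mathcal{A}X = b,\ X \succeq 0 \end{array}\right. \quad (X \in S^{1+n}).
\]
Any $X \in S^{1+n}$ can be written as $X = \begin{pmatrix} \xi & x^T \\ x & X_1 \end{pmatrix}$ with $\xi \in \mathbb{R}$, $x \in \mathbb{R}^n$ and $X_1 \in S^n$. With that we get
\[
(P)\quad \left\{\begin{array}{l} -\langle B, X_1\rangle \to \min \\ \xi + \langle A, X_1\rangle = 0 \\ \operatorname{tr} X_1 = 1 \\ X \succeq 0 \end{array}\right. \quad (\xi \in \mathbb{R},\ x \in \mathbb{R}^n,\ X_1 \in S^n).
\]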
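The dual problem $(D)$ has a strictly feasible point: For $\mu < -\|A - B\|_2$ and $\lambda := 1$ we have $B \prec \lambda A - \mu I$ since it holds for all $x \in \mathbb{R}^n \setminus \{0\}$ that
\[
\langle x, (\lambda A - B) x\rangle = \langle x, (A - B) x\rangle \ge -\|A - B\|_2 \|x\|_2^2 > \mu \|x\|_2^2 = \langle x, \mu I x\rangle .
\]
The primal problem $(P)$ has a feasible point, for example:
\[
x := 0, \qquad X_1 := \frac{1}{\|\tilde x\|_2^2}\, \tilde x \tilde x^T \succeq 0, \qquad \xi := -\langle A, X_1\rangle = -\frac{1}{\|\tilde x\|_2^2} \langle \tilde x, A \tilde x\rangle \ge 0 .
\]
Hence, by theorem 7.1.7 (Duality Theorem), part a), problem $(P)$ has a minimizer
\[
X^* = \begin{pmatrix} \xi^* & (x^*)^T \\ x^* & X_1^* \end{pmatrix} \quad\text{with}\quad \langle C, X^*\rangle = d^* .
\]
Thus we have $-\langle B, X_1^*\rangle = d^*$. $X^* \succeq 0$ yields $\xi^* \ge 0$ and $X_1^* \succeq 0$. We define $W := (X_1^*)^{1/2}$, $Q := W A W$ and $P := W B W$ and get
\[
\operatorname{tr} Q = \operatorname{tr}(W A W) = \operatorname{tr}(A W^2) = \langle A, X_1^*\rangle = -\xi^* \le 0, \qquad \operatorname{tr} P = \operatorname{tr}(W B W) = \langle B, X_1^*\rangle = -d^* .
\]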
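Assume $d^* < 0$, i.e., $\operatorname{tr} P > 0$. By lemma 7.3.5 there would exist a vector $y \in \mathbb{R}^n$ with
\[ \langle Wy, A(Wy)\rangle = \langle y, Qy\rangle \le 0 < \langle y, Py\rangle = \langle Wy, B(Wy)\rangle , \]
which is a contradiction to a) with $x := Wy$. Hence $d^* \ge 0$ holds which completes the proof.

We define $E := \{x \in \mathbb{R}^{n+1} : x_{n+1} = 1\}$ and reformulate lemma 7.3.1:

Let $A', B' \in S^n_{++}$, $a, b \in \mathbb{R}^n$ and $\alpha, \beta \in \mathbb{R}$. If there exists an $\tilde{x} \in E$ with $\langle \tilde{x}, A\tilde{x}\rangle < 0$ for
\[ A := \begin{pmatrix} A' & a \\ a^T & \alpha \end{pmatrix} \quad\text{and}\quad B := \begin{pmatrix} B' & b \\ b^T & \beta \end{pmatrix}, \]
then the following statements are equivalent:
a) For any $x \in E$ with $\langle x, Ax\rangle \le 0$ it holds that $\langle x, Bx\rangle \le 0$.
b) There exists a $\lambda \ge 0$ with $B \preceq \lambda A$.

We define $\widetilde{E} := \{x \in \mathbb{R}^{n+1} : x_{n+1} \ne 0\}$.

Proof of 7.3.1: Let a) hold. For all $x \in \widetilde{E}$ with $\langle x, Ax\rangle \le 0$ we have $\langle x, Bx\rangle \le 0$ by homogeneity. Any $x \in \mathbb{R}^{n+1} \setminus \widetilde{E}$ can be written as $x = \big((x')^T, 0\big)^T$ with $x' \in \mathbb{R}^n$. From $\langle x, Ax\rangle \le 0$ we get $\langle x', A'x'\rangle = \langle x, Ax\rangle \le 0$. As $A' \succ 0$, it follows that $x' = 0$ which yields $x = 0$ and so $\langle x, Bx\rangle = 0$. Hence, $\langle x, Bx\rangle \le 0$ holds for all $x \in \mathbb{R}^{n+1}$ with $\langle x, Ax\rangle \le 0$. Now, we can apply the S-Lemma in the homogeneous version (7.3.7) which directly yields lemma 7.3.1. Note that we did not use the requirement that $B'$ is positive definite.

Remark
A matrix $A \in S^{n+1}$ is positive semidefinite if and only if $\langle x, Ax\rangle \ge 0$ holds for all $x \in E$.

Proof: If $\langle x, Ax\rangle \ge 0$ holds for all $x \in E$, then we get $\langle x, Ax\rangle \ge 0$ for all $x \in \widetilde{E}$ by homogeneity. We write $x \in \mathbb{R}^{n+1} \setminus \widetilde{E}$ as $x = \big((x')^T, 0\big)^T$ with an $x' \in \mathbb{R}^n$. For $k \in \mathbb{N}$ and $x_k := \big((x')^T, 1/k\big)^T \in \widetilde{E}$ it follows that $0 \le \lim_{k\to\infty} \langle x_k, Ax_k\rangle = \langle x, Ax\rangle$. The other implication is trivial.

This remark shows that it is no restriction to require the existence of a suitable $\tilde{x} \in E$ in the inhomogeneous S-Lemma.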
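The equivalence in the homogeneous S-Lemma can be illustrated numerically on a tiny instance. The following is only a sketch under stated assumptions: the diagonal matrices `A` and `B`, the multiplier `lam = 1.5` and all helper names are chosen here for illustration and do not come from the text. Statement b) is checked exactly through the eigenvalues of $\lambda A - B$, while statement a) is only sampled, which of course proves nothing.

```python
import math
import random

def eig_sym2(m):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def quad(m, x):
    """Quadratic form <x, m x> for a 2x2 matrix m and a vector x."""
    return (m[0][0] * x[0] * x[0] + (m[0][1] + m[1][0]) * x[0] * x[1]
            + m[1][1] * x[1] * x[1])

def certifies(a, b, lam, tol=1e-12):
    """True if lam >= 0 and lam*A - B is positive semidefinite."""
    diff = [[lam * a[i][j] - b[i][j] for j in range(2)] for i in range(2)]
    return lam >= 0 and eig_sym2(diff)[0] >= -tol

# Illustrative data (an assumption): <x, A x> < 0 holds for x = (1, 0).
A = [[-1.0, 0.0], [0.0, 1.0]]
B = [[-2.0, 0.0], [0.0, 1.0]]

# Statement b): lam = 1.5 works, since 1.5*A - B = diag(0.5, 0.5).
lam = 1.5
has_certificate = certifies(A, B, lam)

# Statement a): sampled check of  <x, A x> <= 0  =>  <x, B x> <= 0.
random.seed(0)
samples = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(10000)]
implication_holds = all(quad(B, x) <= 1e-12 for x in samples if quad(A, x) <= 0)
```

For this instance every $\lambda \in [1, 2]$ is a valid multiplier, whereas $\lambda = 0.5$ or $\lambda = 3$ is not.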
7.4 The Function log ◦ det

In this section we consider the function $\varphi : S^n_{++} \longrightarrow \mathbb{R}$ defined by
\[ \varphi(X) := -\log\det(X) \quad\text{for } X \in S^n_{++} . \]
Lemma 7.4.1
a) For $X \in S^n_{++}$ and $H \in S^n$, where $\|H\|$ is sufficiently small, it holds that
\[ \varphi(X + H) = \varphi(X) + \langle -X^{-1}, H\rangle + o(\|H\|) . \]
Thus, $\varphi$ is differentiable, and the derivative at a point $X \in S^n_{++}$ can be identified with $-X^{-1}$ in the following sense:
\[ \varphi'(X)H = \langle -X^{-1}, H\rangle \quad\text{for all } H \in S^n . \]
b) $\varphi$ is strictly convex.

Here, $\|\cdot\|$ denotes the operator norm belonging to an arbitrarily chosen norm on $\mathbb{R}^n$.

The following proof is in fact based on a simple idea. Carrying it out rigorously, however, requires great care, which tends to be ‘overlooked’ in a number of versions found in the literature.

Proof: a) Let $X \in S^n_{++}$ and $H \in S^n$ with $\|H\|$ sufficiently small such that $X + H \in S^n_{++}$. It holds:
\[ \varphi(X + H) - \varphi(X) = -\log\det(X + H) + \log\det(X) = -\log\det(X + H) - \log\det(X^{-1}) \]
\[ = -\log\det\big(X^{-1}(X + H)\big) = -\log\det(I + X^{-1}H) . \]
$\lambda$ is an eigenvalue of $X^{-1}H$ iff $\lambda + 1$ is an eigenvalue of $I + X^{-1}H$ (counted according to their multiplicity). We abbreviate $\lambda_i(X^{-1}H)$ to $\lambda_i$ and get
\[ -\log\det(I + X^{-1}H) = -\log\prod_{i=1}^n (1 + \lambda_i) = -\sum_{i=1}^n \log(1 + \lambda_i) . \]
By the differentiability of the logarithm at $x = 1$ we have $\log(1 + \lambda) = \log(1) + \lambda + o(\lambda) = \lambda + o(\lambda)$, and with that we get
\[ -\sum_{i=1}^n \log(1 + \lambda_i) = -\sum_{i=1}^n \lambda_i + \sum_{i=1}^n o(\lambda_i) = -\operatorname{tr}(X^{-1}H) + \sum_{i=1}^n o(\lambda_i) . \]
It remains to show that $\sum_{i=1}^n o(\lambda_i) = o(\|H\|)$: Note that $|\lambda_i| \le \|X^{-1}H\| \le \|X^{-1}\|\,\|H\|$ holds for any eigenvalue $\lambda_i$. We write $o(\lambda_i) = |\lambda_i|\, r_i(\lambda_i)$, where $r_i : \mathbb{R} \longrightarrow \mathbb{R}$ is a function with $r_i(\lambda) \longrightarrow 0$ for $\lambda \longrightarrow 0$. We get
\[ \Big|\sum_{i=1}^n |\lambda_i|\, r_i(\lambda_i)\Big| \le \max_i |\lambda_i| \sum_{i=1}^n |r_i(\lambda_i)| \le \|X^{-1}\|\,\|H\| \sum_{i=1}^n |r_i(\lambda_i)| . \]
Due to $|\lambda_i| \le \|X^{-1}\|\,\|H\| \to 0$ for $i = 1, \ldots, n$, we have $\sum_{i=1}^n |r_i(\lambda_i)| \to 0$ if $\|H\| \longrightarrow 0$. Altogether we get
\[ \sum_{i=1}^n o(\lambda_i) = o(\|H\|) . \]
b) For $X, Y \in S^n_{++}$ it holds that
\[ \varphi(Y) - \varphi(X) = -\log\det(Y) + \log\det(X) = -\log\det(X^{-1}Y) = -\log\prod_{i=1}^n \lambda_i = -\sum_{i=1}^n \log(\lambda_i) = -n \cdot \frac{1}{n}\sum_{i=1}^n \log(\lambda_i) , \tag{13} \]
where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $X^{-1}Y$ (counted according to their multiplicity). Since the logarithm function is concave, we have
\[ \frac{1}{n}\sum_{i=1}^n \log\lambda_i \le \log\Big(\frac{1}{n}\sum_{i=1}^n \lambda_i\Big) . \tag{14} \]
The well-known fact $\log(x) \le x - 1$ for any $x \in \mathbb{R}_{++}$ yields
\[ \log\Big(\frac{1}{n}\sum_{i=1}^n \lambda_i\Big) \le \frac{1}{n}\sum_{i=1}^n \lambda_i - 1 . \tag{15} \]
From (13), (14) and (15) we get
\[ \varphi(Y) - \varphi(X) = -n \cdot \frac{1}{n}\sum_{i=1}^n \log(\lambda_i) \ge -\sum_{i=1}^n \lambda_i + n = \operatorname{tr}(-X^{-1}Y + I) = \operatorname{tr}\big(-X^{-1}(Y - X)\big) = \varphi'(X)(Y - X) . \]
Thus, $\varphi$ is convex. It remains to show that $\varphi$ is strictly convex: Let $X, Y \in S^n_{++}$ with $\varphi(Y) - \varphi(X) = \varphi'(X)(Y - X)$. Then equality holds in all the inequalities above. Due to (14), we get $\lambda_1 = \cdots = \lambda_n$ as the logarithm is strictly concave. $\log(x) = x - 1$ holds iff $x = 1$. Hence, (15) yields $\frac{1}{n}\sum_{i=1}^n \lambda_i = 1$ and thus $\sum_{i=1}^n \lambda_i = n$. Finally, $\lambda_1 = \cdots = \lambda_n = 1$. Since all eigenvalues of $X^{-1}Y$ are one (and $X^{-1}Y$ is diagonalizable), $X^{-1}Y = I$ holds, i.e., $X = Y$. Therefore, the function $\varphi$ is strictly convex.
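Both parts of lemma 7.4.1 can be checked numerically for $n = 2$. The sketch below is only an illustration: the matrices `X`, `H`, `Y` and all function names are assumptions made here. It evaluates the remainder $\varphi(X + tH) - \varphi(X) - \langle -X^{-1}, tH\rangle$, which should be $o(t)$, and the gradient inequality $\varphi(Y) - \varphi(X) \ge \langle -X^{-1}, Y - X\rangle$ from the convexity proof.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of an invertible 2x2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def phi(m):
    """phi(X) = -log det(X) for a positive definite 2x2 matrix."""
    return -math.log(det2(m))

def inner(a, b):
    """Trace inner product <A, B> of two 2x2 matrices."""
    return sum(a[i][j] * b[i][j] for i in range(2) for j in range(2))

# Illustrative data (assumptions): X, Y positive definite, H symmetric.
X = [[2.0, 0.5], [0.5, 1.0]]
H = [[0.3, -0.2], [-0.2, 0.1]]
Y = [[1.0, 0.0], [0.0, 3.0]]
X_inv = inv2(X)

def remainder(t):
    """phi(X + tH) - phi(X) - <-X^{-1}, tH>; expected to be o(t)."""
    shifted = [[X[i][j] + t * H[i][j] for j in range(2)] for i in range(2)]
    return phi(shifted) - phi(X) + t * inner(X_inv, H)

# remainder(t)/t should shrink roughly in proportion to t.
ratios = [abs(remainder(t)) / t for t in (1e-1, 1e-2, 1e-3)]

# Gradient inequality phi(Y) - phi(X) >= <-X^{-1}, Y - X>.
Y_minus_X = [[Y[i][j] - X[i][j] for j in range(2)] for i in range(2)]
gap = phi(Y) - phi(X) + inner(X_inv, Y_minus_X)
```

A nonnegative `gap` is exactly what convexity predicts; it vanishes only for $Y = X$.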
7.5 Path-Following Methods

We assume that the matrices $A^{(1)}, \ldots, A^{(m)}$ are linearly independent.

Lemma 7.5.1
The adjoint operator $\mathcal{A}^*$ is injective. In particular a vector $y$ which is a preimage of a given matrix is unique.

Proof: The matrices $A^{(j)}$ are linearly independent and we have $\mathcal{A}^* y = \sum_{j=1}^m y_j A^{(j)}$ for all $y \in \mathbb{R}^m$. Thus the operator $\mathcal{A}^*$ is injective.

Primal–Dual System

Similar to the situation of linear programming we can formulate a necessary and sufficient condition for optimality under our assumptions:

Lemma 7.5.2
$X$ and $y$ are optimizers for (SDP) and (DSDP) if and only if there exists a matrix $S \succeq 0$ such that
\[ \begin{aligned} \mathcal{A}X &= b, \quad X \succeq 0, \quad\text{i.e., } X \in F_P \\ \mathcal{A}^* y + S &= C, \quad S \succeq 0, \quad\text{i.e., } y \in F_D \\ \langle S, X\rangle &= 0 . \end{aligned} \]
The last equality is called the complementary slackness condition, sometimes also referred to as the centering condition.
Proof: In the proof of weak duality we have seen: $\langle C, X\rangle - \langle b, y\rangle = \langle C - \mathcal{A}^* y, X\rangle = \langle S, X\rangle$ for $X \in F_P$, $y \in F_D$ and $S := C - \mathcal{A}^* y$. If $\langle S, X\rangle = 0$, we have a zero duality gap and so by lemma 7.1.6 $X$ and $y$ are optimizers. Let $X$ and $y$ be optimizers. Then $\langle C, X\rangle = \langle b, y\rangle$ holds by the Duality Theorem, that is, $\langle S, X\rangle = 0$.

The complementary slackness condition can be reformulated:

Lemma 7.5.3
For $X, S \succeq 0$ it holds that: $\langle S, X\rangle = 0 \iff XS = 0$.

Proof: i) From $XS = 0$ it follows that $0 = \operatorname{tr}(XS) = \langle X, S\rangle = \langle S, X\rangle$.
ii) Let $\langle X, S\rangle = \langle S, X\rangle = 0$. The proof of theorem 7.1.1 has shown $\langle X, S\rangle = \|X^{1/2}S^{1/2}\|_F^2$, thus $X^{1/2}S^{1/2} = 0$ and so $XS = X^{1/2}\big(X^{1/2}S^{1/2}\big)S^{1/2} = 0$.

We change our system while ‘relaxing’ the third condition. With $\mu > 0$ we can formulate a (preliminary) primal–dual system for semidefinite optimization:
\[ \begin{aligned} \mathcal{A}X &= b \\ \mathcal{A}^* y + S &= C \\ XS &= \mu I \\ X \succeq 0,\ & S \succeq 0 \end{aligned} \]
$X \succeq 0$ and $S \succeq 0$ can be replaced by $X \succ 0$ and $S \succ 0$: By $XS = \mu I$ it holds that the matrices $X$ and $S$ are regular, thus positive definite. The final form of our primal–dual system is:
\[ \begin{aligned} \mathcal{A}X &= b \\ \mathcal{A}^* y + S &= C \\ XS &= \mu I \\ X \succ 0,\ & S \succ 0 \end{aligned} \]

Barrier Functions
A second way to obtain the primal–dual system and the corresponding central path is to look at barrier functions which result either from the primal or from the dual problem. First, we regard the logarithmic barrier function $\Phi_\mu$ for the primal problem (SDP) defined by
\[ \Phi_\mu(X) := \langle C, X\rangle - \mu\log\det(X), \]
where $\mu > 0$ and $X \in F_P^0 := \{X \in S^n_{++} \mid \mathcal{A}X = b\}$. Analogously, the logarithmic barrier function for the dual problem (DSDP) is given by
\[ \widetilde{\Phi}_\mu(y) := \langle b, y\rangle + \mu\log\det(C - \mathcal{A}^* y), \]
where $y \in F_D^0 := \{y \in \mathbb{R}^m \mid C - \mathcal{A}^* y \in S^n_{++}\}$. The following theorem yields the connection between the primal–dual system and the corresponding barrier problems:
Theorem 7.5.4
Let $\mu > 0$. Then the following statements are equivalent:
a) $F_P^0$ and $F_D^0$ are nonempty (Slater condition).
b) There exists a (unique) minimizer to $\Phi_\mu$ on $F_P^0$.
c) There exists a (unique) maximizer to $\widetilde{\Phi}_\mu$ on $F_D^0$.
d) The primal–dual system
\[ F_\mu(X, y, S) := \begin{pmatrix} \mathcal{A}^* y + S - C \\ \mathcal{A}X - b \\ XS - \mu I \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad X, S \in S^n_{++} \tag{16} \]
has a (unique) solution.

If a) to d) hold, then the minimizer $X(\mu)$ to $\Phi_\mu$ and the maximizer $y(\mu)$ to $\widetilde{\Phi}_\mu$ yield the solution $\big(X(\mu), y(\mu), S(\mu)\big)$ of (16), where $S(\mu) := C - \mathcal{A}^* y(\mu)$.

The proof runs parallel to that of theorem 6.1.4:

Proof: b) $\iff$ d): The definition of $\Phi_\mu$ can be extended to the open set $S^n_{++}$ and $\Phi_\mu$ is differentiable there (cf. lemma 7.4.1). On $S^n_{++}$ we consider the problem
\[ (P_\mu)\qquad \begin{cases} \Phi_\mu(X) \longrightarrow \min \\ \mathcal{A}X = b . \end{cases} \]
$(P_\mu)$ is a convex problem (cf. lemma 7.4.1) with (affinely) linear constraints. Therefore the KKT conditions are necessary and sufficient for a minimizer to $(P_\mu)$. The Lagrange function $L$ is given by $L(X, y) := \Phi_\mu(X) + \langle y, b - \mathcal{A}X\rangle$ for $X \in S^n_{++}$ and $y \in \mathbb{R}^m$. The KKT conditions to $(P_\mu)$ are
\[ \nabla_X L(X, y) = C - \mu X^{-1} - \mathcal{A}^* y = 0, \qquad \mathcal{A}X = b . \]
If we set $S := \mu X^{-1} \succ 0$, which is equivalent to $XS = \mu I$, we get nothing else but (16). Thus, an $X$ is a KKT point to $(P_\mu)$ iff $(X, y, S)$ solves (16). The triple $(X, y, S)$ is unique: $X$ is unique, as $(P_\mu)$ is strictly convex. Since $XS = \mu I$ must hold, $S$ is unique and then by lemma 7.5.1 the vector $y$ is unique.

The proof of c) $\iff$ d) can be done in like manner (cf. the proof of theorem 6.1.4) and will be left to the interested reader as an exercise.

d) $\Longrightarrow$ a): If system (16) has a solution $(X, y, S)$, then we have $X \in F_P^0$ and $y \in F_D^0$.

a) $\Longrightarrow$ b): Let $X^*$ and $(y^*, S^*)$ be strictly feasible to (SDP) and (DSDP), respectively. For $X \in F_P$ it holds that $\langle S^*, X\rangle = \langle C, X\rangle - \langle b, y^*\rangle$. So $\langle C, X\rangle - \mu\log\det(X) = \Phi_\mu(X)$ can be replaced by $\langle S^*, X\rangle - \mu\log\det(X)$ in problem $(P_\mu)$. We may add the constraint
\[ \langle S^*, X\rangle - \mu\log\det(X) \le \langle S^*, X^*\rangle - \mu\log\det(X^*) =: \alpha \]
and obtain the subsidiary problem
\[ (P_\mu^*)\qquad \begin{cases} \langle S^*, X\rangle - \mu\log\det(X) \longrightarrow \min \\ \mathcal{A}X = b,\ X \succ 0 \\ \langle S^*, X\rangle - \mu\log\det(X) \le \alpha . \end{cases} \]
We aim to show that the feasible set of $(P_\mu^*)$, a level set of the objective function denoted by $F_\mu^*$, is compact. $F_\mu^*$ is nonempty since $X^*$ is feasible. With $\sigma := \lambda_1(S^*)$ it holds that $\sigma > 0$. For $X \in F_\mu^*$ we get
\[ \sum_{i=1}^n \big(\sigma\lambda_i(X) - \mu\log\lambda_i(X)\big) = \sigma\operatorname{tr} X - \mu\log\det(X) \]
because of $\det X = \prod_{i=1}^n \lambda_i(X)$. By lemma 7.1.4 we have $0 < \sigma\operatorname{tr} X \le \langle S^*, X\rangle$. Due to $X \in F_\mu^*$, this yields
\[ \sum_{i=1}^n \big(\sigma\lambda_i(X) - \mu\log\lambda_i(X)\big) \le \alpha . \tag{17} \]
We define $f(\tau) := \sigma\tau - \mu\log\tau$ for $\tau \in \mathbb{R}_{++}$.

[Figure: graph of $f(\tau) = \sigma\tau - \mu\log\tau$ for $\sigma = 1$, $\mu = 0.2$]
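The properties of $f$ used in the remainder of the proof follow from a short computation, spelled out here as a worked step (the numerical value refers to the parameters $\sigma = 1$, $\mu = 0.2$ of the figure):

```latex
\[
  f'(\tau) = \sigma - \frac{\mu}{\tau}, \qquad
  f''(\tau) = \frac{\mu}{\tau^{2}} > 0 \quad (\tau > 0),
\]
\[
  f'(\tau) = 0 \iff \tau = \tau^{*} := \frac{\mu}{\sigma}, \qquad
  v^{*} := f(\tau^{*}) = \mu - \mu \log\frac{\mu}{\sigma}
        = \mu \Bigl( 1 - \log\frac{\mu}{\sigma} \Bigr).
\]
% For sigma = 1 and mu = 0.2:  tau* = 0.2  and  v* = 0.2 (1 - log 0.2), approximately 0.52.
```

Since $f'' > 0$ on $\mathbb{R}_{++}$, the stationary point is the unique minimizer, and $f(\tau) \to \infty$ both for $\tau \to 0$ (through $-\mu\log\tau$) and for $\tau \to \infty$ (through $\sigma\tau$).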
The function $f$ is strictly convex and has a unique minimizer $\tau^* := \mu/\sigma$. It holds that $f(\tau) \longrightarrow \infty$ if $\tau \to 0$ or $\tau \to \infty$. With $v^* := f(\tau^*)$, we have $\alpha \ge n v^*$. Since $\alpha - (n - 1)v^* \ge v^*$ holds, there exist $\tau_1, \tau_2 \in \mathbb{R}_{++}$ such that
\[ f(\tau) > \alpha - (n - 1)v^* \quad\text{for } \tau \in (0, \tau_1) \text{ and } \tau \in (\tau_2, \infty) . \]
$\sum_{i=1}^n f(\lambda_i(X)) \le \alpha$ then shows $\lambda_i(X) \in [\tau_1, \tau_2]$ for $i = 1, \ldots, n$, since each of the remaining $n - 1$ summands is at least $v^*$. It follows
\[ \|X\|_F^2 = \sum_{i=1}^n \lambda_i(X)^2 \le n\tau_2^2 . \]
So the set $F_\mu^*$ is bounded. The subset of $S^n$ defined by $\mathcal{A}X = b$ is closed. Since the function $\log\circ\det$ is continuous on $S^n_{++}$ and $\lambda_i(X) \ge \tau_1$ for $i = 1, \ldots, n$, the set defined by the condition $\langle S^*, X\rangle - \mu\log\det(X) \le \alpha$ is closed, too. Therefore, the feasible set is compact. As the objective function given by $\langle S^*, X\rangle - \mu\log\det(X)$ is continuous on $F_\mu^*$, it has a minimizer.

Central Path

From now on we assume that the Slater condition holds, that is, both the primal and the dual problem have strictly feasible points, and the matrices $A^{(1)}, \ldots, A^{(m)}$ are linearly independent. Following theorem 7.5.4 there exists then a unique solution $\big(X(\mu), y(\mu), S(\mu)\big)$ to (16) for all $\mu > 0$.

Definition
The set
\[ \big\{ \big(X(\mu), y(\mu), S(\mu)\big) : \mu > 0 \big\} \]
of solutions to the primal–dual system is called the (primal–dual) central path of the semidefinite problems (SDP) and (DSDP). The central path is well-defined, if for some $\mu > 0$ equation (16) has a solution.
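To make the definition concrete, the central path can be computed for a toy problem. The sketch below is an illustration under explicit assumptions, not an algorithm from the text: it takes $n = 2$, a single constraint $\operatorname{tr} X = 1$ (so $A^{(1)} = I$, $b = 1$) and a diagonal $C = \operatorname{diag}(c)$. For such data a diagonal triple solves (16), hence by uniqueness it is the point on the central path: $s_i = c_i - y$, $x_i = \mu / s_i$ and $\sum_i x_i = 1$. The function name `central_path_point` and the bisection are choices made here.

```python
def central_path_point(c, mu, iters=200):
    """Point on the central path for: <C, X> -> min, tr X = 1, X >= 0, C = diag(c).

    Solves  y*I + S = C,  tr X = 1,  X S = mu*I  with diagonal X, S.
    Returns (x, y, s) where X = diag(x) and S = diag(s).
    """
    n = len(c)
    c_min = min(c)
    # g(y) = sum(mu / (c_i - y)) increases from 0 to infinity on (-inf, c_min);
    # at the left end point below we have g < 1.
    lo, hi = c_min - n * mu - 1.0, c_min
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(mu / (ci - mid) for ci in c) < 1.0:
            lo = mid
        else:
            hi = mid
    y = lo
    s = [ci - y for ci in c]
    x = [mu / si for si in s]
    return x, y, s

# Follow the path for decreasing mu (illustrative values).
path = {mu: central_path_point([1.0, 2.0], mu) for mu in (1.0, 0.1, 0.001)}
```

For $c = (1, 2)$ the primal optimizer is $X = \operatorname{diag}(1, 0)$ with dual optimizer $y = 1$; the computed points approach these values as $\mu$ decreases, and $\langle X, S\rangle = n\mu$ holds along the path.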
Remark
For any $\mu > 0$ the matrices $X(\mu)$ and $S(\mu)$ lie in the interior of $S^n_+$.

Lemma 7.5.5
For $X_1, X_2 \in S^n$ with $\mathcal{A}X_1 = \mathcal{A}X_2$ and $S_1, S_2 \in C - R(\mathcal{A}^*)$ it holds that $\langle X_1 - X_2, S_1 - S_2\rangle = 0$.

Proof: $X_1 - X_2 \in N(\mathcal{A})$ and $S_1 - S_2 \in R(\mathcal{A}^*) \subset N(\mathcal{A})^\perp$.
We use the following idea to solve the semidefinite program: We reduce μ > 0 and hope that a sequence of corresponding points on the central path will lead to a pair of solutions to problems (SDP ) and (DSDP ). In order to show that a suitable subsequence converges to a pair of optimizers, we need:
Theorem 7.5.6
For any sequence $(\mu_k)$ in $\mathbb{R}_{++}$ with $\mu_k \longrightarrow 0$ for $k \to \infty$ a suitable subsequence of the corresponding points $(X_k, y_k, S_k) := \big(X(\mu_k), y(\mu_k), S(\mu_k)\big)$ of the central path converges to a pair of optimizers of (SDP) and (DSDP).

Proof: wlog we can assume that $\mu_k \downarrow 0$. Let $X_0$ be strictly feasible to (SDP) and $(y_0, S_0)$ strictly feasible to (DSDP). By lemma 7.5.5 it holds that
\[ 0 = \langle X_k - X_0, S_k - S_0\rangle = \langle X_k, S_k\rangle - \langle X_k, S_0\rangle - \langle X_0, S_k\rangle + \langle X_0, S_0\rangle . \]
With $\langle X_k, S_k\rangle = \operatorname{tr}(X_k S_k) = \operatorname{tr}(\mu_k I) = n\mu_k$ we get, using lemma 7.1.4 for the last estimate,
\[ n\mu_1 + \langle X_0, S_0\rangle \ge n\mu_k + \langle X_0, S_0\rangle = \langle X_k, S_0\rangle + \langle X_0, S_k\rangle \ge \lambda_n(X_k)\lambda_1(S_0) + \lambda_n(S_k)\lambda_1(X_0) . \]
Therefore the eigenvalues of the matrices $X_k$ and $S_k$ are uniformly bounded from above by some constant $M$. It follows that
\[ \|X_k\|_F^2 = \sum_{i=1}^n \lambda_i(X_k)^2 \le nM^2 \quad\text{holds for any } k \in \mathbb{N} . \]
Accordingly the sequence $(S_k)$ is bounded and with it the sequence $(y_k)$. Thus the sequence $(X_k, y_k, S_k)$ is bounded. Therefore a suitable subsequence $(X_{k_\ell}, y_{k_\ell}, S_{k_\ell})$ converges to a triple $(\widetilde{X}, \tilde{y}, \widetilde{S}) \in S^n_+ \times \mathbb{R}^m \times S^n_+$. By the continuity of the inner product it holds that
\[ \langle \widetilde{X}, \widetilde{S}\rangle = \lim_{\ell\to\infty} \langle X_{k_\ell}, S_{k_\ell}\rangle = \lim_{\ell\to\infty} \mu_{k_\ell} \cdot n = 0 . \]
So $\widetilde{X}, \widetilde{S}$ fulfill the complementary slackness condition. By continuity of $\mathcal{A}$ and $\mathcal{A}^*$ the matrix $\widetilde{X}$ is feasible for (SDP) and the pair $(\tilde{y}, \widetilde{S})$ is feasible for (DSDP). By lemma 7.5.2 the points $\widetilde{X}$ and $\tilde{y}$ are optimizers to (SDP) and (DSDP), respectively.

It was shown in [HKR] that, unlike in linear optimization, the central path in semidefinite optimization does not converge to the analytic center of the optimal set in general. The authors analyze the limiting behavior of the central path to explain this phenomenon.
7.6 Applying Newton’s Method

We have seen in the foregoing section that the primal–dual system is solvable and the solution is unique for any $\mu > 0$. If then the corresponding triples $(X_\mu, y_\mu, S_\mu)$ converge for $\mu \in \mathbb{R}_{++}$ and $\mu \longrightarrow 0$, $X_\mu$ and $y_\mu$ converge to a pair of optimizers of (SDP) and (DSDP). Theoretically, we just need to solve a sequence of primal–dual systems, that is, follow the central path until we reach a limit. Practically, however, we face several difficulties when solving
\[ \begin{aligned} \mathcal{A}^* y + S &= C \\ \mathcal{A}X &= b \\ XS &= \mu I \\ X \succ 0,\ & S \succ 0 . \end{aligned} \tag{18} \]
The first two block equations are linear, while the third one is nonlinear. Hence, a Newton step seems to be a natural idea for an iteration algorithm. Using path-following methods we solve system (18) approximately and then reduce $\mu$. Starting with $X \succ 0$ and $S \succ 0$, we aim to get primal and dual directions $\Delta X$ and $\Delta S$, respectively, that satisfy $X - \Delta X,\ S - \Delta S \succ 0$ as well as the linearized system
\[ \begin{aligned} \mathcal{A}^* \Delta y + \Delta S &= \mathcal{A}^* y + S - C \\ \mathcal{A}\,\Delta X &= \mathcal{A}X - b \\ X\,\Delta S + \Delta X\, S &= XS - \mu I . \end{aligned} \tag{19} \]
A crucial observation is that the above system might have no symmetric solution $\Delta X$. This is a serious problem! The second condition gives $m$ equations and the first condition yields $\bar{n} = n(n + 1)/2$ equations since $\mathcal{A}^* \Delta y$ and thus $\Delta S$ are symmetric. The third block equation contains $n^2$ equations as the product $XS$ is in general not symmetric even if $X$ and $S$ are symmetric. We, however, have only $m + 2\bar{n} = m + n(n + 1)$ unknowns and so system (19) is overdetermined while we require $\Delta X$ to be symmetric. Therefore Newton’s method cannot be applied directly. There are many ways to solve this problem, which has caused a great deal of research (footnote 6).

Another difficulty arises as in practice the equation $XS = \mu I$ will hold only approximately. In theory the three relations
\[ XS = \mu I, \qquad SX = \mu I, \qquad XS + SX = 2\mu I \]
are equivalent but linearizations of these three equations will lead to different search directions if $XS \approx \mu I$.

There are two natural ways, used by the first SDO algorithms, to handle the problem. A first possibility is to drop the symmetry condition for $\Delta X$. Then the system can be solved. If $\widetilde{\Delta X}$ is a solution to the relaxed system of equations, we take the symmetric part $\Delta X$ of $\widetilde{\Delta X}$, i.e.,
\[ \Delta X := \frac{\widetilde{\Delta X} + \widetilde{\Delta X}^T}{2} . \]

Footnote 6: A comparison of the best known search directions was done by [Todd 1].
This search direction is called HKM-direction (footnote 7). A second approach proposed by Alizadeh, Haeberly and Overton is to start with the equivalent formulation of the relaxed complementary slackness condition $XS + SX = 2\mu I$ where the left-hand side is now symmetric. After linearizing, the so-called AHO-direction is the solution to the equation (in addition to the feasibility equations, cf. footnote 8)
\[ \Delta X\, S + S\, \Delta X + X\, \Delta S + \Delta S\, X = (XS + SX) - 2\mu I . \]
In the development of their analysis, the search directions have been grouped into families, e.g., the Monteiro–Zhang-family (MZ-family) or the Monteiro–Tsuchiya-family (MT-family). The AHO- and HKM-directions belong to these families. Due to good theoretical and practical results, many investigations concentrate on these families of search directions. The ‘famous’ NT-direction (cf. Nesterov/Todd (1997)), which is one of the directions with the best theoretical results, belongs to the MZ-family, too. In the concluding section of this chapter we restrict ourselves to describing only the general principles underlying many algorithms.
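The counting argument behind the overdetermined system (19) and the two symmetrization ideas can be made tangible with a few lines of code. This is a small sketch with illustrative data; the function names and the matrices `X`, `S` are assumptions made here. It counts equations and unknowns, shows that $XS$ is not symmetric although $X$ and $S$ are, and forms both the symmetric part used for the HKM-direction and the symmetric expression $XS + SX$ used for the AHO-direction.

```python
def newton_system_counts(n, m):
    """(equations, unknowns) of the linearized system with symmetric dX, dS."""
    n_bar = n * (n + 1) // 2
    equations = n_bar + m + n * n   # first, second and third block equation
    unknowns = m + 2 * n_bar        # dy, svec(dX), svec(dS)
    return equations, unknowns

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def symmetric_part(a):
    """(A + A^T) / 2, the symmetrization used for the HKM-direction."""
    n = len(a)
    return [[0.5 * (a[i][j] + a[j][i]) for j in range(n)] for i in range(n)]

# Illustrative symmetric positive definite matrices (assumptions).
X = [[2.0, 1.0], [1.0, 1.0]]
S = [[1.0, 0.0], [0.0, 3.0]]

XS = matmul(X, S)                    # not symmetric
SX = matmul(S, X)
XS_plus_SX = [[XS[i][j] + SX[i][j] for j in range(2)] for i in range(2)]
```

For example, $n = 3$ and $m = 2$ give 17 equations but only 14 unknowns; the surplus is always $n^2 - \bar{n} = n(n - 1)/2$.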
7.7 How to Solve SDO Problems?

We rewrite the ‘relaxed’ complementary slackness condition $XS = \mu I$ in the symmetric form (cf. exercise 9) as
\[ X \circ S := \frac{1}{2}\,(XS + SX) = \mu I . \]
It is easy to see that the binary operation $\circ$ is commutative and distributive with respect to addition $+$, but not associative.
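The failure of associativity is easy to see on a concrete triple of matrices. The following sketch uses three $2 \times 2$ matrices chosen here purely for illustration; `jordan` is a hypothetical helper name for the operation $\circ$.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def jordan(a, b):
    """Symmetrized product A o B = (AB + BA) / 2."""
    ab, ba = matmul(a, b), matmul(b, a)
    n = len(a)
    return [[0.5 * (ab[i][j] + ba[i][j]) for j in range(n)] for i in range(n)]

A = [[1.0, 0.0], [0.0, 0.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
C = [[0.0, 1.0], [1.0, 0.0]]

left = jordan(jordan(A, B), C)    # (A o B) o C = diag(0.5, 0.5)
right = jordan(A, jordan(B, C))   # A o (B o C) = A, since B o C = I
```

Here $(A \circ B) \circ C = \tfrac{1}{2} I$ while $A \circ (B \circ C) = A$, so the two bracketings differ, whereas $A \circ B = B \circ A$ holds by construction.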
In the Newton approach the equations
\[ \begin{aligned} \mathcal{A}^*(y - \Delta y) + (S - \Delta S) &= C \\ \mathcal{A}(X - \Delta X) &= b \\ (X - \Delta X) \circ (S - \Delta S) &= \mu I \end{aligned} \]
lead, after linearization, to
\[ \begin{aligned} \mathcal{A}^* \Delta y + \Delta S &= \mathcal{A}^* y + S - C =: R_d \\ \mathcal{A}\,\Delta X &= \mathcal{A}X - b =: r_p \\ X \circ \Delta S + \Delta X \circ S &= X \circ S - \mu I =: R_c . \end{aligned} \tag{20} \]

Footnote 7: Also referred to as HRVW/KSH/M-direction, since [HRVW] (joint paper by Helmberg, Rendl, Vanderbei, Wolkowicz (1996)) and [KSH] (joint paper by Kojima, Shindoh, Hara (1997)) proposed it independently, and it was rediscovered by [Mon] (Monteiro 1997).
Footnote 8: For convenience we often drop the feasibility equations for search directions in the following.
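For $X \in S^n_{++}$ we define the operator $L_X : S^n \longrightarrow S^n$ by
\[ L_X(Z) := X \circ Z \qquad (Z \in S^n) . \]
With that the linearized system (20) can be written as:
\[ \begin{aligned} \mathcal{A}^* \Delta y + \Delta S &= R_d \\ \mathcal{A}\,\Delta X &= r_p \\ L_S(\Delta X) + L_X(\Delta S) &= R_c \end{aligned} \tag{21} \]
For implementations we reformulate (21) identifying $S^n$ with $\mathbb{R}^{\bar{n}}$ via $\operatorname{svec} : S^n \longrightarrow \mathbb{R}^{\bar{n}}$, where $\bar{n} := n(n + 1)/2$ (cf. page 301 f). With
\[ x := \operatorname{svec}(X),\quad s := \operatorname{svec}(S),\quad \Delta x := \operatorname{svec}(\Delta X),\quad \Delta s := \operatorname{svec}(\Delta S),\quad r_d := \operatorname{svec}(R_d),\quad r_c := \operatorname{svec}(R_c), \]
the ‘matrix representations’ $\mathbf{L}_X, \mathbf{L}_S$ of the operators $L_X, L_S$ (that is, $\mathbf{L}_X v := \operatorname{svec}(L_X(V))$ and $\mathbf{L}_S v := \operatorname{svec}(L_S(V))$ for $v \in \mathbb{R}^{\bar{n}}$ and $V \in S^n$ with $v = \operatorname{svec}(V)$) and the matrix
\[ \mathbf{A} := \begin{pmatrix} \operatorname{svec}(A_1)^T \\ \vdots \\ \operatorname{svec}(A_m)^T \end{pmatrix} \]
we get the equivalent linear system
\[ \begin{pmatrix} 0 & \mathbf{A}^T & I \\ \mathbf{A} & 0 & 0 \\ \mathbf{L}_S & 0 & \mathbf{L}_X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} r_d \\ r_p \\ r_c \end{pmatrix} . \tag{22} \]

Remark
The matrices $\mathbf{L}_S$ and $\mathbf{L}_X$ are invertible.

Proof: It suffices to show that $\mathbf{L}_S v = 0$ for $v \in \mathbb{R}^{\bar{n}}$ only has the trivial solution. The symmetric positive definite matrix $S$ can be written as $Q D Q^T$, where $Q$ is orthogonal and $D$ diagonal with positive diagonal components $d_1, \ldots, d_n$. $\mathbf{L}_S v = 0$ implies $0 = 2 L_S(V) = SV + VS$ and further
\[ 0 = Q^T(SV + VS)Q = D(Q^T V Q) + (Q^T V Q)D = DW + WD , \]
where $W := Q^T V Q$. For $i, j \in \{1, \ldots, n\}$ we have
\[ 0 = (DW + WD)_{ij} = d_i w_{ij} + w_{ij} d_j = w_{ij}\,\underbrace{(d_i + d_j)}_{>\,0} , \]
thus $w_{ij} = 0$. From $W = 0$ we conclude that $V = 0$.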
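The identification of $S^n$ with $\mathbb{R}^{\bar{n}}$ can be sketched in code. The version below is an assumption about the convention, since the definition on page 301 is not reproduced here: it stacks the lower triangle column by column and scales off-diagonal entries by $\sqrt{2}$, which is the common choice because it turns the trace inner product into the Euclidean one. The names `svec`, `smat` and `inner` are helpers introduced for this illustration.

```python
import math

def svec(x):
    """Map a symmetric n x n matrix to a vector of length n(n+1)/2.

    Lower triangle, column by column; off-diagonal entries times sqrt(2).
    (The ordering is an assumption; the text's convention may differ.)
    """
    n = len(x)
    root2 = math.sqrt(2.0)
    return [x[i][j] if i == j else root2 * x[i][j]
            for j in range(n) for i in range(j, n)]

def smat(v, n):
    """Inverse of svec: rebuild the symmetric n x n matrix."""
    root2 = math.sqrt(2.0)
    x = [[0.0] * n for _ in range(n)]
    k = 0
    for j in range(n):
        for i in range(j, n):
            value = v[k] if i == j else v[k] / root2
            x[i][j] = x[j][i] = value
            k += 1
    return x

def inner(a, b):
    """Trace inner product <A, B> of two square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))

X = [[2.0, 1.0, 0.0], [1.0, 3.0, -1.0], [0.0, -1.0, 1.0]]
S = [[1.0, 0.5, 0.2], [0.5, 2.0, 0.0], [0.2, 0.0, 4.0]]

x, s = svec(X), svec(S)
euclidean = sum(xi * si for xi, si in zip(x, s))   # equals <X, S> up to rounding
```

With this scaling, each row $\operatorname{svec}(A_j)^T$ of the matrix in (22) applied to $\operatorname{svec}(X)$ reproduces $\langle A_j, X\rangle$.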
With the additional assumption that $\mathbf{L}_S^{-1}\mathbf{L}_X$ is positive definite the unique solvability of system (22) can be obtained easily. We will therefore not repeat the details here (compare theorem 6.3.1).

If we start from $X \in F_P^0$ and $y \in F_D^0$, it is generally difficult to maintain the nonlinear third equation of (18) exactly. Path-following methods therefore require the iterates to satisfy the equation only in a suitable approximate sense. This can be stated in terms of neighborhoods of the central path.
There is a gap between the practical behavior of algorithms and the theoretical performance results, in favor of the practical behavior. The considerations necessary for that closely follow those of chapter 6. They are, however, far more laborious with often tedious calculations. We therefore abstain from going into any more details here and refer the reader to the special literature on this topic, for example, the Handbook of Semidefinite Programming [WSV] or de Klerk’s monograph [Klerk].
7.8 Icing on the Cake: Pattern Separation via Ellipsoids

A pattern is a reliable sample of observable characteristics of an object. A specific pattern of $n$ characteristics can be represented by a point in $\mathbb{R}^n$. Pattern recognition (footnote 9) is the art of classifying patterns (footnote 10). We assume that two sets of points $\{p_1, \ldots, p_{m_1}\}$ and $\{q_1, \ldots, q_{m_2}\}$ are given, which we want to separate by ellipsoids written in the form
\[ E := \{x \in \mathbb{R}^n \mid \langle x - x_c, X(x - x_c)\rangle \le 1\} \]
with center $x_c \in \mathbb{R}^n$ and a matrix $X \in S^n_{++}$.

We want the best possible separation, that is, we want to optimize the separation ratio $\rho$ in the following way:
\[ \begin{cases} \rho \longrightarrow \max \\ \langle p_i - x_c, X(p_i - x_c)\rangle \le 1 & (i = 1, \ldots, m_1) \\ \langle q_j - x_c, X(q_j - x_c)\rangle \ge \rho & (j = 1, \ldots, m_2) \end{cases} \tag{23} \]
This is a nonlinear optimization problem depending on the variables $X \in S^n_{++}$, $x_c \in \mathbb{R}^n$ and $\rho \in \mathbb{R}$.

Footnote 9: Confer [Gli] and [Bo/Va], p. 429 ff.
Footnote 10: See exercise 20, chapter 2.
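We use the following maps
\[ p_i \longmapsto \tilde{p}_i := \begin{pmatrix} p_i \\ 1 \end{pmatrix}, \qquad q_j \longmapsto \tilde{q}_j := \begin{pmatrix} q_j \\ 1 \end{pmatrix} \]
to ‘lift’ each point to the hyperplane $H := \{x \in \mathbb{R}^{n+1} \mid x_{n+1} = 1\}$. The optimal ellipsoid of problem (23) can then be recovered as the intersection of $H$ with the optimal elliptic cylinder in $\mathbb{R}^{n+1}$ (with center 0) containing the lifted points $\tilde{p}_i$ and $\tilde{q}_j$.

[Figure: Visualization of ‘Lifting’]

The ‘lifted’ problem becomes:
\[ \begin{cases} \rho \longrightarrow \max \\ \langle \tilde{p}_i, \widetilde{X}\tilde{p}_i\rangle \le 1 & (i = 1, \ldots, m_1) \\ \langle \tilde{q}_j, \widetilde{X}\tilde{q}_j\rangle \ge \rho & (j = 1, \ldots, m_2) \\ \widetilde{X} \in S^{n+1}_+,\ \rho \in \mathbb{R} \text{ with } \rho > 1 . \end{cases} \tag{24} \]
Due to
\[ \langle \tilde{p}_i, \widetilde{X}\tilde{p}_i\rangle = \langle \tilde{p}_i\tilde{p}_i^T, \widetilde{X}\rangle, \qquad \langle \tilde{q}_j, \widetilde{X}\tilde{q}_j\rangle = \langle \tilde{q}_j\tilde{q}_j^T, \widetilde{X}\rangle \]
and by introducing scalar slack variables, problem (24) can be transformed to the standard form of a primal SDO. From the optimizers
\[ \widetilde{X} = \begin{pmatrix} X & -b \\ -b^T & \gamma \end{pmatrix} \in S^{n+1}_+ \quad\text{and}\quad \rho \in \mathbb{R} \text{ with } \rho > 1 , \]
we recover $X \in S^n_+$ and $x_c \in \mathbb{R}^n$ of the original problem: We have
\[ \langle \tilde{x}, \widetilde{X}\tilde{x}\rangle = \langle x, Xx\rangle - 2\langle b, x\rangle + \gamma \tag{25} \]
for $x \in \mathbb{R}^n$ and $\tilde{x} := \begin{pmatrix} x \\ 1 \end{pmatrix}$. There exists a vector $x_c \in \mathbb{R}^n$ such that $Xx_c = b$, since for $y \in \operatorname{kernel}(X)$ it follows:
\[ \begin{pmatrix} y \\ 0 \end{pmatrix}^T \widetilde{X} \begin{pmatrix} y \\ 0 \end{pmatrix} = y^T X y = 0 \quad\text{and}\quad \widetilde{X}\begin{pmatrix} y \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ -\langle b, y\rangle \end{pmatrix} . \]
Therefore, $\widetilde{X}\begin{pmatrix} y \\ 0 \end{pmatrix} = 0$ and thus $\langle b, y\rangle = 0$. This shows the solvability of the above system of equations (cf. exercise 10). Now (25) yields
\[ \langle \tilde{x}, \widetilde{X}\tilde{x}\rangle = \langle x, Xx\rangle - 2\langle x_c, Xx\rangle + \gamma = \langle x - x_c, X(x - x_c)\rangle + \delta \]
where $\delta := \gamma - \langle x_c, Xx_c\rangle$. With
\[ \begin{pmatrix} I & 0 \\ x_c^T & 1 \end{pmatrix} \begin{pmatrix} X & -b \\ -b^T & \gamma \end{pmatrix} \begin{pmatrix} I & x_c \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} X & 0 \\ 0 & \delta \end{pmatrix} \]
we get $\delta \ge 0$ and $\det(\widetilde{X}) = \delta \cdot \det(X)$. Since
\[ \begin{aligned} \langle p_i - x_c, X(p_i - x_c)\rangle + \delta &\le 1 & (i = 1, \ldots, m_1), \\ \langle q_j - x_c, X(q_j - x_c)\rangle + \delta &\ge \rho & (j = 1, \ldots, m_2), \end{aligned} \tag{26} \]
we also have $\delta \le 1$ and $X \ne 0$. $\delta = 1$ implies $X(p_i - x_c) = 0$ $(i = 1, \ldots, m_1)$, that is, $p_1, \ldots, p_{m_1}$ lie in an affine subspace of at most dimension $n - 1$. If we exclude this, then $\delta < 1$. The above inequalities can then be rewritten as
\[ \begin{aligned} \Big\langle p_i - x_c, \tfrac{1}{1 - \delta}\, X(p_i - x_c)\Big\rangle &\le 1 & (i = 1, \ldots, m_1) \\ \Big\langle q_j - x_c, \tfrac{1}{1 - \delta}\, X(q_j - x_c)\Big\rangle &\ge \frac{\rho - \delta}{1 - \delta} & (j = 1, \ldots, m_2). \end{aligned} \]
If $\delta > 0$ this gives a contradiction to the maximality of $\rho$ since
\[ \frac{\rho - \delta}{1 - \delta} > \rho . \]
Hence $\delta = 0$, and (26) now implies that $(X, x_c, \rho)$ is a solution to (23).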
Remark
If $X \in S^n_{++}$ we thus obtain the desired separation of the two point sets by two concentric ellipsoids whose center $x_c$ is uniquely determined. If $\det(X) = 0$, then we get a ‘parallel strip’ instead of an ‘ellipsoid’.

In appendix C we will find more information on how to solve this kind of problem using SeDuMi.
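The recovery of the center $x_c$ and of $\delta$ from the blocks of the lifted matrix can be sketched as follows. This is a minimal illustration assuming $n = 2$ and an invertible block $X$; the function name `recover_ellipsoid` and the sample data are hypothetical and not taken from the text. It solves $Xx_c = b$, sets $\delta = \gamma - \langle x_c, Xx_c\rangle$ and compares both sides of identity (25) at a test point.

```python
def recover_ellipsoid(X, b, gamma):
    """Return (x_c, delta) from the blocks of [[X, -b], [-b^T, gamma]].

    Assumes a 2x2 invertible block X (an assumption made for this sketch).
    """
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    # Solve X x_c = b by Cramer's rule.
    xc = [(X[1][1] * b[0] - X[0][1] * b[1]) / det,
          (X[0][0] * b[1] - X[1][0] * b[0]) / det]
    delta = gamma - quad_form(X, xc)
    return xc, delta

def quad_form(X, v):
    """Quadratic form <v, X v> for a 2x2 matrix X."""
    return sum(v[i] * X[i][j] * v[j] for i in range(2) for j in range(2))

def lifted_form(X, b, gamma, x):
    """Right-hand side of (25): <x, X x> - 2 <b, x> + gamma."""
    return quad_form(X, x) - 2.0 * (b[0] * x[0] + b[1] * x[1]) + gamma

# Illustrative data: X = diag(2, 1), center (1, -1), hence b = X x_c = (2, -1)
# and gamma = <x_c, X x_c> = 3, so that delta = 0.
X = [[2.0, 0.0], [0.0, 1.0]]
b = [2.0, -1.0]
gamma = 3.0

xc, delta = recover_ellipsoid(X, b, gamma)
point = [2.0, 0.0]
shifted = [point[0] - xc[0], point[1] - xc[1]]
lhs = lifted_form(X, b, gamma, point)          # value of the lifted form
rhs = quad_form(X, shifted) + delta            # <x - x_c, X(x - x_c)> + delta
```

For singular $X$ the system $Xx_c = b$ is still solvable by the kernel argument above, but a least-squares or pseudoinverse solve would be needed instead of Cramer's rule.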
Exercises

1. a) Find a matrix in $S^2 \setminus S^2_+$ whose leading principal subdeterminants are nonnegative.
b) Show that the nonsymmetric matrix $M := \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$ is positive definite, that is, $x^T M x > 0$ holds for all $x \in \mathbb{R}^2 \setminus \{0\}$.
c) Find two matrices $A, B \in S^2$ whose product $AB$ is not symmetric.

2. Let $A \in S^n_+$ and $k \in \mathbb{N}$. Show that there exists a uniquely determined matrix $A^{1/k} := R \in S^n_+$ such that $R^k = A$. This matrix is called the $k$-th root of $A$. Show furthermore that $\operatorname{rank}(A) = \operatorname{rank}(R)$ and $AR = RA$.
Hint: Cf. the spectral theorem and applications [Hal], p. 156 f and p. 166.

3. With $\alpha > 0$ let $L^n_\alpha := \{(u, t) \in \mathbb{R}^{n-1} \times \mathbb{R} \mid t \ge \alpha\|u\|_2\}$. Verify:
a) $L^n_\alpha$ is a closed convex cone.
b) $(L^n_\alpha)^* = L^n_{1/\alpha}$

4. Logarithmic Barrier Function for $S^n_+$
Show:
a) $(S^n_+)^\circ = S^n_{++}$
b) The function $B : S^n_{++} \longrightarrow \mathbb{R}$ given by $B(X) := -\log\det(X)$ for $X \in S^n_{++}$ has the barrier property: For every sequence $(X_k)$ in $(S^n_+)^\circ$, converging to an $X \in \partial S^n_+$ (boundary of $S^n_+$), it holds that $B(X_k) \longrightarrow \infty$ for $k \longrightarrow \infty$.

5. a) Consider the optimization problem
\[ \begin{cases} \sum_{j=0}^k \langle c_j, x_j\rangle \longrightarrow \min \\ \sum_{j=0}^k A_j x_j = b \\ x_0 \in \mathbb{R}^{n_0}_+,\ x_j \in L^{n_j} \text{ (Lorentz cone)} \quad (j = 1, \ldots, k), \end{cases} \]
where $A_j \in \mathbb{R}^{m \times n_j}$, $c_j \in \mathbb{R}^{n_j}$ for $j = 0, \ldots, k$ and $b \in \mathbb{R}^m$. Determine the corresponding dual problem.
Hint: For $y \in \mathbb{R}^m$ minimize
\[ L(x_0, \ldots, x_k, y) := \sum_{j=0}^k \langle c_j, x_j\rangle + \Big\langle b - \sum_{j=0}^k A_j x_j,\ y\Big\rangle \]
with respect to $x_0 \in \mathbb{R}^{n_0}_+$, $x_j \in L^{n_j}$ $(j = 1, \ldots, k)$.
b) How will the dual problem change if we demand $x_0 \in \mathbb{R}^{n_0}$ instead of $x_0 \in \mathbb{R}^{n_0}_+$?
6. Minimum Circumscribed Ball
We have points $x^{(1)}, \ldots, x^{(m)} \in \mathbb{R}^n$ and wish to find the smallest (euclidean) ball with center $x$ and radius $r$ containing them, that is, we have to solve the problem
\[ \begin{cases} r \longrightarrow \min \\ \|x - x^{(j)}\| \le r \quad\text{for } j = 1, \ldots, m. \end{cases} \]
a) Formulate this problem as a second-order cone problem.
b) With SeDuMi (footnote 11), solve this problem for the points
\[ x^{(1)} := \begin{pmatrix} 0 \\ 0 \end{pmatrix},\quad x^{(2)} := \begin{pmatrix} 5 \\ -1 \end{pmatrix},\quad x^{(3)} := \begin{pmatrix} 4 \\ 6 \end{pmatrix} \quad\text{and}\quad x^{(4)} := \begin{pmatrix} 1 \\ 3 \end{pmatrix} \]
(cf. chapter 1, example 4). Visualize your solution and compare it with:

[Figure: two plots of the given points together with their minimum circumscribed ball]

c) Test your program with 100 random points (generated with randn), then with 1000 random points.
d) Look at the more general problem of determining the minimum circumscribed ball to given balls with centers $x^{(j)}$ and radii $r_j$ for $j = 1, \ldots, m$. Solve this problem for the points from b) and the respective radii 0.5, 1.5, 0.0, 1.0.

Footnote 11: Confer appendix C.

7. Shortest Distance between Two Ellipses (cf. [An/Lu], example 14.5)
Solve the shortest distance problem
\[ \begin{cases} \delta \longrightarrow \min \\ \|u - v\| \le \delta, \quad\text{with} \\ u^T \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix} u - (2, 0)\,u - 3 \le 0 \quad\text{and} \\ \frac{1}{2}\, v^T \begin{pmatrix} 5 & 3 \\ 3 & 5 \end{pmatrix} v - (22, 26)\,v + 70 \le 0 \quad\text{for } u, v \in \mathbb{R}^2 . \end{cases} \]
a) Let $z := (u_1, u_2, v_1, v_2, \delta)^T \in \mathbb{R}^5$ and formulate the above problem as a second-order cone problem:
\[ \begin{cases} \langle b, z\rangle \longrightarrow \min \\ \|A_j^T z + c_j\|_2 \le \langle b_j, z\rangle + d_j \quad (j = 1, 2, 3) \end{cases} \]
b) Solve this problem by means of SeDuMi (cf. example 4 in appendix C).
c) Determine the largest strip which separates the two ellipses. Compare your solution with our result:

[Figure: the two ellipses with the separating strip]

8. Let the following quadratic optimization problem be given:
\[ \begin{cases} \frac{1}{2}\, x^T Q x + p^T x \longrightarrow \min \\ A^T x \le b \end{cases} \]
with $Q \in S^n_{++}$, $p \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^m$.
a) Show that this problem is equivalent to a second-order cone problem of the form
\[ \begin{cases} \tau \longrightarrow \min \\ \|\widetilde{A}^T x + \tilde{c}\| \le \tau \\ A^T x \le b . \end{cases} \]
Hint: The Cholesky decomposition $Q = L L^T$ gives $\widetilde{A} = L$ and $\tilde{c} = L^{-1} p$.
b) Have another close look at exercise 8 in chapter 4 and solve this problem with SeDuMi.

9. a) Let $\mu > 0$ and $X, S \in S^n_+$ with $XS + SX = 2\mu I$. Verify: $X, S \in S^n_{++}$ and $XS = \mu I$.
b) For $X := \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$ and $S := \begin{pmatrix} \mu & \sigma \\ \sigma & -\mu \end{pmatrix}$ with $\sigma \ne 0$ it holds that $XS + SX = 2\mu I$, but $XS \ne \mu I$.
Hint to a): Firstly show that $X$ can wlog be chosen as a diagonal matrix.
10. a) Show for $A \in S^n_+$ and $x \in \mathbb{R}^n$: If $\langle x, Ax\rangle = 0$, then $Ax = 0$.
b) Show for $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$: The system of equations $Ax = b$ has a solution if and only if $A^T y = 0$ always implies $b^T y = 0$.

11. a) Solve the following semidefinite optimization problem:
\[ \begin{cases} b^T y \longrightarrow \max \\ C - \mathcal{A}^* y \in S^n_+ , \end{cases} \]
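where $b = \begin{pmatrix} 11 \\ 9 \end{pmatrix}$, $y \in \mathbb{R}^2$ and
\[ A_1 = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 3 & 7 \\ 1 & 7 & 5 \end{pmatrix},\quad A_2 = \begin{pmatrix} 0 & 2 & 8 \\ 2 & 6 & 0 \\ 8 & 0 & 4 \end{pmatrix},\quad C = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 9 & 0 \\ 3 & 0 & 7 \end{pmatrix} . \]
b) Find scalars $y_1, y_2$ such that the maximum eigenvalue of $C + y_1 A_1 + y_2 A_2$ is minimized. Formulate this problem as an SDO problem. Visualize this problem and compare it with:

[Figure: plot over the square $[-3, 3] \times [-3, 3]$]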
8 Global Optimization
8.1 Introduction
8.2 Branch and Bound Methods
8.3 Cutting Plane Methods
    Cutting Plane Algorithm by Kelley
    Concavity Cuts
Exercises

Global optimization is concerned with the computation and characterization of global optimizers of (in general) nonlinear functions. It is an important task since many real-world questions lead to global rather than local problems. Global optimization has a variety of applications including, for example, chemical process design, chip layout, planning of just-in-time manufacturing, and pooling and blending problems.

In this chapter we will have a look at the special questions and approaches in global optimization as well as some of the particular difficulties that may arise. Our aim is by no means a comprehensive treatment of the topic which would go beyond the scope of our book, but rather a first introduction to this interesting branch of optimization suitable for our intended target group. Within the framework of our book this chapter is a good complement to the range of topics covered so far. Readers interested in pursuing this topic further will find a variety of state-of-the-art approaches and suggestions in the Handbook of Global Optimization [Ho/Pa]. There also exists an extensive amount of further reading in which more advanced topics are covered.

W. Forst and D. Hoffmann, Optimization—Theory and Practice, Springer Undergraduate Texts in Mathematics and Technology, DOI 10.1007/978-0-387-78977-4 8, © Springer Science+Business Media, LLC 2010

8.1 Introduction

So far we have, apart from the convex case, primarily dealt with the computation of local minima. The computation of a global minimum is generally a much more complex problem. We take the simple special case of concave minimization to illustrate that there can exist a great number of local minima from which the global minima have to be extracted, often laboriously.
Example 1 (Pardalos, Rosen (1987))

a) Let $0 < c < 1$. At first we consider the simple problem:
\[ \begin{cases} -\big(cx + \tfrac{1}{2}x^2\big) \longrightarrow \min \\ -1 \le x \le 1 \end{cases} \]

[Figure: graph of the objective function on $[-1, 1]$ for $c := 0.25$]

It is obvious (parabola opening downwards!) that: For $x = -c$ we get the maximum $\tfrac{1}{2}c^2$. The boundary point $x = -1$ gives a local minimum $c - \tfrac{1}{2}$, the boundary point $x = 1$ the global minimum $-(c + \tfrac{1}{2})$.

With $g_1(x) := -1 - x$ and $\bar{g}_1(x) := x - 1$ we write the constraints $-1 \le x \le 1$ in the usual form $g_1(x) \le 0$, $\bar{g}_1(x) \le 0$ and illustrate the situation again with the help of the Karush–Kuhn–Tucker conditions (for a feasible point $x$). From
\[ \begin{aligned} -(c + x) - \lambda_1 + \bar{\lambda}_1 &= 0 \\ \lambda_1(1 + x) = 0,\quad \bar{\lambda}_1(x - 1) &= 0 \\ \lambda_1, \bar{\lambda}_1 &\ge 0 \end{aligned} \]
we get:

$x = 1$: $\lambda_1 = 0$, $\bar{\lambda}_1 = 1 + c$ (global minimizer)
$x = -1$: $\bar{\lambda}_1 = 0$, $\lambda_1 = 1 - c$ (local minimizer)
$-1 < x < 1$: $\lambda_1 = \bar{\lambda}_1 = 0$, $x = -c$ (global maximizer)
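b) We now use the functions from a) as building blocks for the following problem in $n$ variables. To given numbers $0 < c_i < 1$ $(i = 1, \ldots, n)$ and for $x = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$ we consider
\[ \begin{cases} f(x) := -\sum_{i=1}^n \big(c_i x_i + \tfrac{1}{2}x_i^2\big) \longrightarrow \min \\ -1 \le x_i \le 1 \quad (i = 1, \ldots, n). \end{cases} \]
With $c := (c_1, \ldots, c_n)^T$ we get $f(x) = -\langle c, x\rangle - \tfrac{1}{2}\langle x, x\rangle$ and further with $g_i(x) := -1 - x_i = -1 - \langle e_i, x\rangle$ and $\bar{g}_i(x) := x_i - 1 = \langle e_i, x\rangle - 1$ at first:
\[ \nabla f(x) = -(c + x), \qquad \nabla g_i(x) = -e_i, \qquad \nabla \bar{g}_i(x) = e_i \]
Hence, in this case the Karush–Kuhn–Tucker conditions (for feasible points $x$) are
\[ \begin{aligned} -(c + x) + \sum_{i=1}^n (-\lambda_i + \bar{\lambda}_i)\, e_i &= 0 \\ \lambda_i(x_i + 1) = 0 &= \bar{\lambda}_i(x_i - 1) \\ \lambda_i, \bar{\lambda}_i &\ge 0 \end{aligned} \qquad (i = 1, \ldots, n). \]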
[Figure: graph of $f$ over $[-1, 1]^2$ for $n = 2$, with axes $x_1$ and $x_2$]

In this case we thus have $3^n$ Karush–Kuhn–Tucker points; they yield
- $2^n$ local minimizers (at the vertices); among them one global minimizer at $x = (1, \ldots, 1)^T$
- 1 global maximizer at $x = -c$
- $3^n - 2^n - 1$ saddlepoints.
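The count of Karush–Kuhn–Tucker points can be reproduced by brute force for small $n$. The sketch below is an illustration with values $c = (0.25, 0.5, 0.75)$ chosen here; the helper names are hypothetical. It relies on the observation from part a) that each coordinate of a KKT point is $-1$, $-c_i$ or $1$, and classifies a point by how many coordinates sit on the boundary.

```python
from itertools import product

def f(x, c):
    """Objective f(x) = -sum(c_i x_i + x_i^2 / 2)."""
    return -sum(ci * xi + 0.5 * xi * xi for ci, xi in zip(c, x))

def kkt_points(c):
    """All 3^n KKT points: coordinate i is -1, -c_i or 1."""
    return list(product(*[(-1.0, -ci, 1.0) for ci in c]))

def classify(x):
    """Type of a KKT point, judged by which coordinates are at the bounds."""
    at_bound = [abs(xi) == 1.0 for xi in x]
    if all(at_bound):
        return "local minimizer"      # vertex of the cube
    if not any(at_bound):
        return "global maximizer"     # x = -c
    return "saddle point"

c = [0.25, 0.5, 0.75]                 # illustrative values with 0 < c_i < 1
points = kkt_points(c)
counts = {}
for p in points:
    kind = classify(p)
    counts[kind] = counts.get(kind, 0) + 1

best = min(points, key=lambda p: f(p, c))   # global minimizer among KKT points
```

Already for $n = 3$ only one of the 27 KKT points is the global minimizer, and the number of local minimizers doubles with every additional variable.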
We see: Global optimization only via KKT conditions may be very inefficient.

For the following considerations we will need a weak form of the theorem of Krein–Milman which gives an extremely important characterization of compact convex sets, and for that the term extreme points. A point $z$ of a convex set $C$ is called an extreme point to $C$ iff there are no distinct points $x, y \in C$ such that $z = \alpha x + (1 - \alpha)y$ for a suitable $\alpha \in (0, 1)$. In other words: A point $z$ in $C$ is an extreme point iff it is not an interior point of any nontrivial line segment lying in $C$.
\[ \operatorname{ext}(C) := \{z \in C \mid z \text{ extreme point}\} \]
For example, in the plane a triangle has three extreme points, and a closed disk has all its boundary points as extreme points.

Theorem (Krein–Milman)
Let $K$ be a compact convex set in $\mathbb{R}^n$. Then $K$ is the convex hull of its extreme points.

A proof can be found, for example, in [Lan].
Theorem 8.1.1
Let $K \subset \mathbb{R}^n$ be convex, compact and $f : K \longrightarrow \mathbb{R}$ concave. If $f$ attains a global minimum on $K$, then this minimum will also be attained at an extreme point of $K$.

Proof: Assume that the point $\bar{x} \in K$ gives a global minimum of $f$. Since $K = \operatorname{conv}(\operatorname{ext}(K))$, there exists a representation $\bar{x} = \sum_{j=1}^m \lambda_j x_j$ with $m \in \mathbb{N}$, $\lambda_j \ge 0$, $\sum_{j=1}^m \lambda_j = 1$ and $x_j \in \operatorname{ext}(K)$. wlog let $f(x_1) = \min\{f(x_j) \mid 1 \le j \le m\}$. From
\[ f(\bar{x}) \ge \sum_{j=1}^m \lambda_j f(x_j) \ge f(x_1) \]
then follows $f(\bar{x}) = f(x_1)$.
Remark
Let $M \subset \mathbb{R}^n$ be convex and $f : M \to \mathbb{R}$ concave. Assume furthermore that $\bar x$ yields a local minimum of $f$. Then it holds that:
1) If the function $f$ is even strictly concave, then $\bar x \in \operatorname{ext}(M)$.
2) If $\bar x$ yields a strict local minimum, then $\bar x \in \operatorname{ext}(M)$.

Proof: 1): By assumption we can find an $\varepsilon > 0$ such that $f(\bar x) \le f(x)$ for all $x \in M$ with $\| x - \bar x \|_2 < \varepsilon$. From $\bar x \notin \operatorname{ext}(M)$ we deduce that there exist $x_1, x_2 \in M$ with $x_1 \ne x_2$ and $\alpha \in (0, 1)$ with $\bar x = \alpha x_1 + (1 - \alpha) x_2$. Then $\bar x$ can also be written as a convex combination of two distinct points $v, w$ on the line segment connecting $x_1$ and $x_2$, for which $\| v - \bar x \|_2 < \varepsilon$ and $\| w - \bar x \|_2 < \varepsilon$ hold: $\bar x = \beta v + (1 - \beta) w$ for some $\beta \in (0, 1)$ thus leads to a contradiction:

\[
f(\bar x) > \beta f(v) + (1 - \beta) f(w) \ge \beta f(\bar x) + (1 - \beta) f(\bar x) = f(\bar x)
\]

2): Follows 'analogously'; above we first use only the concavity of $f$ for our estimate, and only afterwards do we utilize the fact that $\bar x$ yields a strict local minimum.

We begin with the definition of the convex envelope of a given function, which is one of the basic tools used in the theory and algorithms of general global optimization.
Definition
Let $M \subset \mathbb{R}^n$ be a nonempty, convex set, and $f : M \to \mathbb{R}$ a function bounded from below by a convex function. A function $f_M := F : M \to \mathbb{R}$ is called the convex envelope¹ of $f$ on $M$ if and only if

    $F$ is convex.
    $F \le f$, that is, $F(x) \le f(x)$ for all $x \in M$.
    For every convex function $G$ with $G \le f$ it holds that $G \le F$.

¹ Also known as the convex hull in the literature.

The convex envelope of a given function is the best convex 'underestimation' of this function over its domain. In most cases it is not even mentioned that, strictly speaking, the existence of the convex envelope has to be proven. It of course follows from the obvious fact that the supremum of any family of convex functions (taken pointwise) is again convex. The uniqueness on the other hand follows immediately from the defining characteristics.
Geometrically, $f_M$ is the function whose epigraph is the convex hull of the epigraph of the function $f$. It is the pointwise supremum over all convex underestimators of $f$ over $M$.

[Figure: Convex Envelope, showing $f$ and $f_M$.]
Remark
For nonempty convex sets $A, B$ with $A \subset B \subset M$ it holds that $f_A(x) \ge f_B(x)$ for all $x \in A$.

Proof: The restriction of $f_B$ to $A$ is a convex function with $f_B(x) \le f(x)$ for all $x \in A$; by the definition of $f_A$ this gives $f_B(x) \le f_A(x)$ for all $x \in A$.

Every optimization problem whose set of feasible points is convex is related to a convex problem with the same optimal value:

Theorem 8.1.2 (Kleibohm (1967))
Let $M \subset \mathbb{R}^n$ be nonempty and convex, and $f : M \to \mathbb{R}$. Further assume that there exists an $\bar x \in M$ with $f(\bar x) \le f(x)$ for all $x \in M$. With the convex envelope $f_M$ of $f$ it holds that

\[
f(\bar x) = \min_{x \in M} f(x) = \min_{x \in M} f_M(x) = f_M(\bar x).
\]

Proof: By definition we have $f_M \le f$, in particular $f_M(\bar x) \le f(\bar x)$. Therefore $\inf_{x \in M} f_M(x) \le \inf_{x \in M} f(x) = f(\bar x)$. For the constant convex function $G$ defined by $G(x) := f(\bar x)$ it holds that $G \le f$. It follows that $f(\bar x) = G(x) \le f_M(x)$ for all $x \in M$ and thus $f(\bar x) \le \inf_{x \in M} f_M(x) \le f_M(\bar x)$. Hence, altogether, the desired result.

This theorem might suggest that we should attempt to solve a general optimization problem by solving the corresponding convex problem where the new objective function is the convex envelope. The difficulty, however, is that in general finding the convex envelope of a function is at least as difficult as computing its global minimum. In addition, even though the theorem states that every global minimum of $f$ is also a global minimum of $f_M$, simple one-dimensional examples (like $f(x) := \sin(x)$ for $x \in \mathbb{R}$) show that the converse is seriously wrong.

We are now going to discuss two important special cases where convex envelopes can be explicitly described. If we have the special case that $M$ is a polytope² and the function $f$ concave, the convex envelope can be evaluated by solving a linear program:
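Theorem 8.1.3 (Falk/Hoffman (1976), cf. [Fa/Ho])
Let the points $v_0, \dots, v_k$ be the vertices of the polytope $P$, that is, $P = \operatorname{conv}\{v_0, \dots, v_k\}$. Then the convex envelope $f_P$ of a concave function $f : P \to \mathbb{R}$ is given by

\[
f_P(x) = \min \Big\{ \sum_{\kappa=0}^{k} \lambda_\kappa f(v_\kappa) \;\Big|\; \lambda \in \Lambda(x) \Big\},
\]

where

\[
\Lambda(x) := \Big\{ \lambda = (\lambda_0, \dots, \lambda_k) \in [0, 1]^{k+1} \;\Big|\; \sum_{\kappa=0}^{k} \lambda_\kappa = 1,\ \sum_{\kappa=0}^{k} \lambda_\kappa v_\kappa = x \Big\}.
\]

Proof: We denote the right-hand side by $G$, that is,

\[
G(x) := \min \Big\{ \sum_{\kappa=0}^{k} \lambda_\kappa f(v_\kappa) \;\Big|\; \lambda \in \Lambda(x) \Big\} \qquad (x \in P).
\]

For $x, y \in P$ we choose $\lambda \in \Lambda(x)$ and $\mu \in \Lambda(y)$ with $G(x) = \sum_{\kappa=0}^{k} \lambda_\kappa f(v_\kappa)$ and $G(y) = \sum_{\kappa=0}^{k} \mu_\kappa f(v_\kappa)$. First, we prove that $G$ is convex: For $\alpha \in [0, 1]$ we have $\alpha \lambda + (1 - \alpha) \mu \in \Lambda\big(\alpha x + (1 - \alpha) y\big)$ and therefore

\[
G\big(\alpha x + (1 - \alpha) y\big) \le \sum_{\kappa=0}^{k} \big[ \alpha \lambda_\kappa + (1 - \alpha) \mu_\kappa \big] f(v_\kappa) = \alpha G(x) + (1 - \alpha) G(y).
\]

The concavity of $f$ implies

\[
G(x) = \sum_{\kappa=0}^{k} \lambda_\kappa f(v_\kappa) \le f\Big( \sum_{\kappa=0}^{k} \lambda_\kappa v_\kappa \Big) = f(x),
\]

hence, with the convexity of $G$, we have $G \le f_P$. Now

\[
f_P(x) = f_P\Big( \sum_{\kappa=0}^{k} \lambda_\kappa v_\kappa \Big) \le \sum_{\kappa=0}^{k} \lambda_\kappa f_P(v_\kappa) \le \sum_{\kappa=0}^{k} \lambda_\kappa f(v_\kappa) = G(x)
\]

implies $f_P = G$.

² That is, the convex hull of a finite number of points in $\mathbb{R}^n$. Thereby polytopes are compact convex sets.

Theorem 8.1.4 (Falk/Hoffman (1976), cf. [Fa/Ho])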
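Let $S$ be a simplex³ with vertices $v_0, \dots, v_n$ and let $f : S \to \mathbb{R}$ be a concave function. Then the convex envelope $f_S$ of $f$ over $S$ is an affinely linear function, that is,

\[
f_S(x) = \langle a, x \rangle + \alpha \qquad (x \in S)
\]

with a suitable pair $(a, \alpha) \in \mathbb{R}^n \times \mathbb{R}$ which is uniquely determined by the system of linear equations

\[
f(v_\nu) = \langle a, v_\nu \rangle + \alpha \qquad (\nu = 0, \dots, n).
\]

Proof: The matrix

\[
A := \begin{pmatrix} v_0^T & 1 \\ \vdots & \vdots \\ v_n^T & 1 \end{pmatrix}
\]

has the (maximal) rank $n + 1$. Therefore the above system of linear equations has a uniquely determined solution $(a, \alpha) \in \mathbb{R}^n \times \mathbb{R}$. Via $\ell(x) := \langle a, x \rangle + \alpha$ for $x \in \mathbb{R}^n$ we define an affinely linear function, which is in particular convex. For $x \in S$ with a $\lambda \in \Lambda(x)$ it holds that

\[
\ell(x) = \sum_{\nu=0}^{n} \lambda_\nu \ell(v_\nu) = \sum_{\nu=0}^{n} \lambda_\nu f(v_\nu) \le f\Big( \sum_{\nu=0}^{n} \lambda_\nu v_\nu \Big) = f(x)
\]

and hence $\ell \le f_S$. It furthermore holds that

\[
f_S(x) \le \sum_{\nu=0}^{n} \lambda_\nu f_S(v_\nu) \le \sum_{\nu=0}^{n} \lambda_\nu f(v_\nu) = \ell(x) \quad \text{(see above)},
\]

hence, altogether, $f_S = \ell$.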
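As a small illustration of Theorem 8.1.4 (the function and the simplex are chosen here only as an example and do not come from the text), take $n = 2$, $f(x) = -x_1^2 - x_2^2$ and the simplex $S$ with vertices $v_0 = (0, 0)^T$, $v_1 = (1, 0)^T$, $v_2 = (0, 1)^T$. The linear system $f(v_\nu) = \langle a, v_\nu \rangle + \alpha$ then reads

\[
\begin{aligned}
0 &= \alpha, \\
-1 &= a_1 + \alpha, \\
-1 &= a_2 + \alpha,
\end{aligned}
\qquad \text{hence} \qquad f_S(x) = -x_1 - x_2 .
\]

Indeed, $f(x) - f_S(x) = x_1 (1 - x_1) + x_2 (1 - x_2) \ge 0$ for all $x \in S$, with equality at the three vertices.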
8.2 Branch and Bound Methods
³ If $n + 1$ points $v_0, \dots, v_n$ in $\mathbb{R}^n$ are affinely independent, which means that the vectors $v_1 - v_0, \dots, v_n - v_0$ are linearly independent, then the generated set $S := \operatorname{conv}\{v_0, \dots, v_n\}$ is called a simplex with the vertices $v_0, \dots, v_n$.
Branch and bound is a general search method for finding solutions to general global optimization problems. A branch and bound procedure or successive partitioning requires two tools:

Branching refers to a successive partitioning of the set of feasible points, that is, the feasible region is divided into 'disjoint' subregions of the original, which together cover the whole feasible region. This is called branching, since the procedure may be repeated recursively.

Bounding refers to the determination of lower and upper bounds for the optimal value within a feasible subregion.

The core of the approach is the simple observation (for a minimization problem) that a subregion $R$ may be removed from consideration if the lower bound for it is greater than the upper bound for any other subregion. This step is called pruning. Branch and bound techniques differ in the way they define rules for partitioning and the methods used for deriving bounds.
The following considerations are based on the works of Falk–Soland (cf. [Fa/So]) and their generalization by Kalantari–Rosen (cf. [Ka/Ro]). As an introduction we consider a two-dimensional concave optimization problem with linear constraints:

\[
(KP) \qquad f(x) \longrightarrow \min, \qquad A^T x \le b
\]

In this case let $m \in \mathbb{N}$, $A \in \mathbb{R}^{2 \times m}$, $b \in \mathbb{R}^m$ and the objective function $f$ be concave and separable, that is, assume that $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$ holds, with suitable functions $f_1$ and $f_2$. At the beginning let $M = [a_1, b_1] \times [a_2, b_2]$ be an axis-parallel rectangle that contains the feasible region $\mathcal{F} := \{ x \in \mathbb{R}^2 \mid A^T x \le b \}$. The construction of the convex envelope is trivial for the concave functions $f_j$: The graph of the convex envelope is the line segment passing through the points $\big(a_j, f_j(a_j)\big)$ and $\big(b_j, f_j(b_j)\big)$. Hence we obtain the convex envelope $f_M$ via linear interpolation of $f$ at the vertices of $M$ in the following way:

\[
f_M(x) = \frac{f_1(b_1) - f_1(a_1)}{b_1 - a_1} (x_1 - a_1) + f_1(a_1) + \frac{f_2(b_2) - f_2(a_2)}{b_2 - a_2} (x_2 - a_2) + f_2(a_2)
\]
Instead of $(KP)$ we now solve, for a rectangle or, more generally, a polyhedral set $M$, the following linear program:

\[
(LP) \qquad f_M(x) \longrightarrow \min, \qquad x \in \mathcal{F} \cap M
\]

Assume that it has the minimal point $\omega(M) \in \mathcal{F} \cap M$ with value $\beta(M) := f_M(\omega(M))$. Then $\alpha(M) := f(\omega(M))$ yields an upper bound for the global minimum of $(KP)$.

a) If $\omega(M)$ is a vertex of $M$, we have

\[
f(x) \ge f_M(x) \ge \beta(M) \overset{(8.1.1)}{=} \alpha(M) = f(\omega(M)) \quad \text{for all } x \in \mathcal{F} \cap M.
\]

$\omega(M)$ either yields a new candidate for the global minimum or there is no point in $\mathcal{F} \cap M$ which could be a global minimizer.
b) Assume now that $\omega(M)$ is not a vertex of $M$. If $\beta(M)$ is strictly greater than the current minimal value, then $\mathcal{F} \cap M$ does not contain any candidate for a global minimum, as $\beta(M) = f_M(\omega(M)) \le f_M(x) \le f(x)$ for $x \in \mathcal{F} \cap M$. Otherwise nothing can be said about the point $\omega(M)$. We therefore divide the rectangle $M$ into smaller rectangles and solve the corresponding $(LP)$. We continue this procedure until all rectangles constructed in this way allow a decision: either that there is no global minimizer in the respective rectangle or that a vertex yields an optimal value.

Example 2 (Kalantari, Rosen (1987))

\[
(KP) \qquad
\begin{cases}
f(x) = -x_1^2 - 4 x_2^2 \longrightarrow \min \\
\phantom{-3}x_1 + \phantom{5}x_2 \le 10 & (1) \\
\phantom{-3}x_1 + 5 x_2 \le 22 & (2) \\
-3 x_1 + 2 x_2 \le 2 & (3) \\
-x_1 - 4 x_2 \le -4 & (4) \\
\phantom{-3}x_1 - 2 x_2 \le 4 & (5)
\end{cases}
\]

[Figure: Feasible Region with Contour Lines, showing the feasible region $\mathcal{F}$ and the constraints (1) to (5).]
We start with the rectangle $M_0 := [0, 8] \times [0, 4]$; it is obviously the smallest axis-parallel rectangle that contains the feasible region $\mathcal{F}$ of $(KP)$. We obtain the convex envelope $f_{M_0}$ of $f$ on $M_0$ via $f_{M_0}(x) = -8 x_1 - 16 x_2$.

We solve the following linear program:

\[
(LP) \qquad f_{M_0}(x) \longrightarrow \min \quad \text{subject to the constraints (1) to (5).}
\]

[Figure: Objective Function and Convex Envelope.]
It has the minimizer $\omega(M_0) = (7, 3)^T$ with the value $\beta(M_0) := f_{M_0}(\omega(M_0)) = -104$. The value $\alpha(M_0) := f(\omega(M_0)) = -85$ gives an upper bound for the global minimum of $(KP)$. However, one cannot decide yet whether $\omega(M_0)$ gives a solution to $(KP)$. Therefore, we bisect $M_0$ and get two rectangles $M_{1,1} := [0, 8] \times [0, 2]$ and $M_{1,2} := [0, 8] \times [2, 4]$. Via

\[
f_{M_{1,1}}(x) = -8 x_1 - 8 x_2, \qquad f_{M_{1,2}}(x) = -8 x_1 + (32 - 24 x_2)
\]

we get the new convex envelopes $f_{M_{1,1}}$ and $f_{M_{1,2}}$ and, with their help, we solve the following linear programs for $j = 1, 2$:

\[
(LP) \qquad f_{M_{1,j}}(x) \longrightarrow \min \quad \text{subject to the constraints (1) to (5) and } x \in M_{1,j}.
\]

[Figure: Starting Rectangle $M_0$ (left) and Bisection of $M_0$ into $M_{1,1}$ and $M_{1,2}$ (right).]

Hence, we get the minimal points $\omega(M_{1,1}) = (8, 2)^T$, $\omega(M_{1,2}) = (7, 3)^T$ and to them $\alpha(M_{1,1}) = -80 = \beta(M_{1,1})$, $\alpha(M_{1,2}) = -85$, $\beta(M_{1,2}) = -96$.
Thus, $M_{1,1}$ cannot contain a global minimal point (its lower bound $-80$ exceeds the upper bound $-85$ already found), and we cannot answer this question for $M_{1,2}$. Therefore, we divide the rectangle $M_{1,2}$ further into $M_{2,1} = [0, 4] \times [2, 4]$ and $M_{2,2} = [4, 8] \times [2, 4]$. The corresponding envelopes are

\[
f_{M_{2,1}}(x) = -4 x_1 + (32 - 24 x_2) \quad \text{and} \quad f_{M_{2,2}}(x) = (32 - 12 x_1) + (32 - 24 x_2).
\]

The linear programs

\[
(LP) \qquad f_{M_{2,j}}(x) \longrightarrow \min \quad \text{subject to the constraints (1) to (5) and } x \in M_{2,j} \quad (j = 1, 2)
\]

have the minimizers $\omega(M_{2,1}) = (2, 4)^T$ and $\omega(M_{2,2}) = (7, 3)^T$ with $\alpha(M_{2,1}) = -68$, $\beta(M_{2,1}) = -72$ and $\alpha(M_{2,2}) = -85$, $\beta(M_{2,2}) = -92$. The rectangle $M_{2,1}$ hence does not contain any minimal points, since $\beta(M_{2,1}) = -72 > -85$. We divide $M_{2,2}$ further into the four rectangles

\[
M_{3,1} = [4, 7] \times [2, 3], \quad M_{3,2} = [7, 8] \times [2, 3], \quad M_{3,3} = [4, 7] \times [3, 4], \quad M_{3,4} = [7, 8] \times [3, 4].
\]

[Figure: Bisection of $M_{1,2}$ into $M_{2,1}$ and $M_{2,2}$ (left) and Division of $M_{2,2}$ into $M_{3,1}, \dots, M_{3,4}$ (right).]
We get the following convex envelopes:

\[
\begin{aligned}
f_{M_{3,1}}(x) &= (28 - 11 x_1) + (24 - 20 x_2), & f_{M_{3,2}}(x) &= (56 - 15 x_1) + (24 - 20 x_2), \\
f_{M_{3,3}}(x) &= (28 - 11 x_1) + (48 - 28 x_2), & f_{M_{3,4}}(x) &= (56 - 15 x_1) + (48 - 28 x_2).
\end{aligned}
\]

The linear programs

\[
(LP) \qquad f_{M_{3,j}}(x) \longrightarrow \min \quad \text{subject to the constraints (1) to (5) and } x \in M_{3,j} \quad (j = 1, 2, 3, 4)
\]

all have the minimizer $\omega(M_{3,j}) = (7, 3)^T$ with $\beta(M_{3,j}) = \alpha(M_{3,j}) = -85$. Therefore $(7, 3)^T$ gives the global minimum of $(KP)$.
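The bounds $\beta(M)$ and $\alpha(M)$ of Example 2 can be recomputed with a few lines of code. The following is a minimal sketch, not a general implementation: it assumes the two-dimensional separable setting of the example, and it replaces a proper LP solver by enumerating the intersection points of pairs of constraint lines (valid here because $\mathcal{F} \cap M$ is a bounded polygon). The function names `envelope` and `bounds` are hypothetical.

```python
from itertools import combinations

# Constraints (1)-(5) of Example 2 as (a1, a2, b), meaning a1*x1 + a2*x2 <= b.
CONSTRAINTS = [(1.0, 1.0, 10.0), (1.0, 5.0, 22.0), (-3.0, 2.0, 2.0),
               (-1.0, -4.0, -4.0), (1.0, -2.0, 4.0)]

def f1(t):
    return -t * t          # first separable part of f

def f2(t):
    return -4.0 * t * t    # second separable part of f

def f(x):
    return f1(x[0]) + f2(x[1])

def envelope(rect):
    """Coefficients (s1, s2, c0) of f_M(x) = s1*x1 + s2*x2 + c0 on rect = ((a1, b1), (a2, b2))."""
    (a1, b1), (a2, b2) = rect
    s1 = (f1(b1) - f1(a1)) / (b1 - a1)
    s2 = (f2(b2) - f2(a2)) / (b2 - a2)
    return s1, s2, f1(a1) - s1 * a1 + f2(a2) - s2 * a2

def bounds(rect, tol=1e-9):
    """Minimize f_M over the feasible part of rect; return (omega, beta, alpha) or None if empty."""
    (a1, b1), (a2, b2) = rect
    rows = CONSTRAINTS + [(1.0, 0.0, b1), (-1.0, 0.0, -a1), (0.0, 1.0, b2), (0.0, -1.0, -a2)]
    s1, s2, c0 = envelope(rect)
    best = None
    for (p1, p2, pb), (q1, q2, qb) in combinations(rows, 2):
        det = p1 * q2 - p2 * q1
        if abs(det) < 1e-12:
            continue  # parallel constraint lines have no intersection point
        x = ((pb * q2 - p2 * qb) / det, (p1 * qb - pb * q1) / det)
        if all(r1 * x[0] + r2 * x[1] <= rb + tol for r1, r2, rb in rows):
            val = s1 * x[0] + s2 * x[1] + c0
            if best is None or val < best[1]:
                best = (x, val)
    if best is None:
        return None
    omega, beta = best
    return omega, beta, f(omega)

for name, rect in [("M0", ((0, 8), (0, 4))), ("M1,1", ((0, 8), (0, 2))),
                   ("M1,2", ((0, 8), (2, 4))), ("M2,2", ((4, 8), (2, 4)))]:
    print(name, bounds(rect))
```

A rectangle can be pruned as soon as its $\beta$ exceeds the smallest $\alpha$ found so far; the partitioning rule itself (bisection or division through $\omega(M)$) is left out of this sketch.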
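More generally, let now a concave quadratic minimization problem be given:

\[
(QP) \qquad f(x) := \langle p, x \rangle - \tfrac12 \langle x, C x \rangle \longrightarrow \min, \qquad x \in \mathcal{F} := \{ v \in \mathbb{R}^n \mid A^T v \le b \}
\]

We hereby assume that $p \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, $A \in \mathbb{R}^{n \times m}$, $C \in S^n_{++}$ and $\mathcal{F}$ compact. Assume furthermore that $u^{(1)}, \dots, u^{(n)}$ are $C$-conjugate directions, that is, it holds that

\[
\langle u^{(\nu)}, C u^{(\mu)} \rangle =
\begin{cases}
0, & \nu \ne \mu \\
\lambda_\nu > 0, & \nu = \mu
\end{cases}
\]

for $\nu, \mu = 1, \dots, n$.

The following considerations could be written even more clearly and concisely with the new scalar product $\langle \cdot, \cdot \rangle_C$ defined by $\langle v, w \rangle_C := \langle v, C w \rangle$ for $v, w \in \mathbb{R}^n$ and the normalization of the vectors $u^{(\nu)}$ (to $\lambda_\nu = 1$). We will, however, not go into the details here.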
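For $\nu = 1, \dots, n$ a maximizer of the linear program

\[
(\overline{LP}_\nu) \qquad \langle u^{(\nu)}, C x \rangle \longrightarrow \max, \qquad x \in \mathcal{F}
\]

is denoted by $\bar x^{(\nu)}$ and a minimizer of the linear program

\[
(\underline{LP}_\nu) \qquad \langle u^{(\nu)}, C x \rangle \longrightarrow \min, \qquad x \in \mathcal{F}
\]

by $\underline{x}^{(\nu)}$. Now we also define

\[
\bar\beta_\nu := \frac{\langle u^{(\nu)}, C \bar x^{(\nu)} \rangle}{\lambda_\nu} \quad \text{and} \quad \underline{\beta}_\nu := \frac{\langle u^{(\nu)}, C \underline{x}^{(\nu)} \rangle}{\lambda_\nu},
\]

hence

\[
P_\nu := \big\{ x \in \mathbb{R}^n \mid \lambda_\nu \underline{\beta}_\nu \le \langle u^{(\nu)}, C x \rangle \le \lambda_\nu \bar\beta_\nu \big\}
\]

gives a parallel strip with the corresponding parallelepiped $P := \bigcap_{\nu=1}^{n} P_\nu$.

Theorem 8.2.1
It holds that:
a) $\mathcal{F} \subset P$
b) $f\big( \sum_{\nu=1}^{n} \xi_\nu u^{(\nu)} \big) = \sum_{\nu=1}^{n} f\big( \xi_\nu u^{(\nu)} \big)$ for $\xi_1, \dots, \xi_n \in \mathbb{R}$.
c) With

\[
U := \big( u^{(1)}, \dots, u^{(n)} \big), \quad h := \tfrac12 \big( \underline{\beta}_1 + \bar\beta_1, \dots, \underline{\beta}_n + \bar\beta_n \big)^T, \quad \alpha := \tfrac12 \sum_{\nu=1}^{n} \lambda_\nu \underline{\beta}_\nu \bar\beta_\nu \quad \text{and} \quad a := p - C U h
\]

the convex envelope $f_P$ of $f$ on $P$ is given by $f_P(x) := \langle a, x \rangle + \alpha$.

Proof: a) For $x \in \mathcal{F}$ and $\nu = 1, \dots, n$ we have by the definition of $\underline{x}^{(\nu)}$ and $\bar x^{(\nu)}$

\[
\lambda_\nu \underline{\beta}_\nu = \langle u^{(\nu)}, C \underline{x}^{(\nu)} \rangle \le \langle u^{(\nu)}, C x \rangle \le \langle u^{(\nu)}, C \bar x^{(\nu)} \rangle = \lambda_\nu \bar\beta_\nu .
\]

This shows $x \in P_\nu$; hence, $\mathcal{F} \subset \bigcap_{\nu=1}^{n} P_\nu = P$.

b)

\[
\begin{aligned}
f\Big( \sum_{\nu=1}^{n} \xi_\nu u^{(\nu)} \Big)
&= \Big\langle p, \sum_{\nu=1}^{n} \xi_\nu u^{(\nu)} \Big\rangle - \frac12 \Big\langle \sum_{\nu=1}^{n} \xi_\nu u^{(\nu)}, C \sum_{\mu=1}^{n} \xi_\mu u^{(\mu)} \Big\rangle \\
&= \sum_{\nu=1}^{n} \big\langle p, \xi_\nu u^{(\nu)} \big\rangle - \frac12 \sum_{\nu=1}^{n} \big\langle \xi_\nu u^{(\nu)}, C \xi_\nu u^{(\nu)} \big\rangle
= \sum_{\nu=1}^{n} f\big( \xi_\nu u^{(\nu)} \big)
\end{aligned}
\]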
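c) We consider the function $F$ defined by $F(x) := \langle a, x \rangle + \alpha$ for $x \in \mathbb{R}^n$ with the above noted $a$ and $\alpha$:

(i) $F$ is affinely linear, hence convex.

(ii) We show: $F(w) = f(w)$ for all vertices $w$ of $P$: A vector $w \in \mathbb{R}^n$ is a vertex of $P$ iff there exists a representation $w = \sum_{\nu=1}^{n} \omega_\nu u^{(\nu)}$ with $\omega_\nu \in \{ \underline{\beta}_\nu, \bar\beta_\nu \}$.

\[
\begin{aligned}
\langle C U h, u^{(\nu)} \rangle &= \langle U h, C u^{(\nu)} \rangle = \Big\langle \sum_{\mu=1}^{n} \tfrac12 \big( \underline{\beta}_\mu + \bar\beta_\mu \big) u^{(\mu)}, C u^{(\nu)} \Big\rangle \\
&= \sum_{\mu=1}^{n} \tfrac12 \big( \underline{\beta}_\mu + \bar\beta_\mu \big) \langle u^{(\mu)}, C u^{(\nu)} \rangle = \tfrac12 \big( \underline{\beta}_\nu + \bar\beta_\nu \big) \lambda_\nu
\end{aligned}
\]

\[
\begin{aligned}
F(w) &= \langle a, w \rangle + \alpha = \langle p, w \rangle - \langle C U h, w \rangle + \tfrac12 \sum_{\nu=1}^{n} \lambda_\nu \underline{\beta}_\nu \bar\beta_\nu \\
&= \langle p, w \rangle - \Big\langle C U h, \sum_{\nu=1}^{n} \omega_\nu u^{(\nu)} \Big\rangle + \tfrac12 \sum_{\nu=1}^{n} \lambda_\nu \underline{\beta}_\nu \bar\beta_\nu \\
&= \langle p, w \rangle - \tfrac12 \sum_{\nu=1}^{n} \lambda_\nu \underbrace{\big[ \omega_\nu \big( \underline{\beta}_\nu + \bar\beta_\nu \big) - \underline{\beta}_\nu \bar\beta_\nu \big]}_{= \, \omega_\nu^2} \\
&= \sum_{\nu=1}^{n} f\big( \omega_\nu u^{(\nu)} \big) \overset{b)}{=} f(w)
\end{aligned}
\]

With $N := 2^n$ let $w^{(1)}, \dots, w^{(N)}$ be the extreme points of $P$. For $x \in P$ there exist, by the theorem of Krein–Milman, $\alpha_\nu \in \mathbb{R}_+$ with $\sum_{\nu=1}^{N} \alpha_\nu = 1$ and $x = \sum_{\nu=1}^{N} \alpha_\nu w^{(\nu)}$. Then it holds that

\[
F(x) = \sum_{\nu=1}^{N} \alpha_\nu F(w^{(\nu)}) \overset{(ii)}{=} \sum_{\nu=1}^{N} \alpha_\nu f(w^{(\nu)}) \le f\Big( \sum_{\nu=1}^{N} \alpha_\nu w^{(\nu)} \Big) = f(x).
\]

(iii) Assume that $G : P \to \mathbb{R}$ is a convex function with $G \le f$. With the above-mentioned representation for an $x \in P$ we get

\[
G(x) \le \sum_{\nu=1}^{N} \alpha_\nu G(w^{(\nu)}) \le \sum_{\nu=1}^{N} \alpha_\nu f(w^{(\nu)}) = F(x) \quad \text{(see above)}.
\]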
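As a plausibility check, Theorem 8.2.1 can be applied to Example 2, where $p = 0$ and $C = \operatorname{diag}(2, 8)$. The choice of the unit vectors as $C$-conjugate directions is an assumption made here for illustration; it gives $\lambda_1 = 2$ and $\lambda_2 = 8$. Since $x_1$ ranges over $[0, 8]$ and $x_2$ over $[0, 4]$ on the feasible region, we get

\[
\underline{\beta}_1 = \underline{\beta}_2 = 0, \quad \bar\beta_1 = \frac{2 \cdot 8}{2} = 8, \quad \bar\beta_2 = \frac{8 \cdot 4}{8} = 4, \quad h = \tfrac12 (8, 4)^T = (4, 2)^T,
\]

\[
a = p - C U h = -\begin{pmatrix} 2 & 0 \\ 0 & 8 \end{pmatrix} \begin{pmatrix} 4 \\ 2 \end{pmatrix} = \begin{pmatrix} -8 \\ -16 \end{pmatrix}, \qquad \alpha = \tfrac12 \big( 2 \cdot 0 \cdot 8 + 8 \cdot 0 \cdot 4 \big) = 0 .
\]

Hence $P = [0, 8] \times [0, 4] = M_0$ and $f_P(x) = -8 x_1 - 16 x_2$, which agrees with the envelope $f_{M_0}$ obtained in Example 2 by linear interpolation.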
8.3 Cutting Plane Methods

Cutting Plane Algorithm by Kelley

The cutting plane algorithm by Kelley was developed to solve convex optimization problems of the form

\[
f(x) \longrightarrow \min, \qquad g_i(x) \le 0 \quad (i = 1, \dots, m)
\]

where $x \in \mathbb{R}^n$ and $f$ as well as $g_1, \dots, g_m$ are differentiable convex functions. This problem is evidently equivalent to

\[
x_{n+1} \longrightarrow \min, \qquad f(x) - x_{n+1} \le 0, \quad g_i(x) \le 0 \quad (i = 1, \dots, m).
\]

It therefore suffices to consider problems of the form

\[
(P) \qquad \langle c, x \rangle \longrightarrow \min, \qquad g_i(x) \le 0 \quad (i = 1, \dots, m)
\]

where $c, x \in \mathbb{R}^n$ and $g_1, \dots, g_m$ are convex and differentiable. Let the feasible region $\mathcal{F} := \{ x \in \mathbb{R}^n \mid g_i(x) \le 0 \ (i = 1, \dots, m) \}$ be nonempty and suppose that a polyhedral set $P_0$ with $\mathcal{F} \subset P_0$ is known. A polyhedral set or polyhedron is the intersection of a finite family of closed half-spaces in $\mathbb{R}^n$, that is, of sets $\{ x \in \mathbb{R}^n \mid \langle a, x \rangle \le \beta \}$ for some $a \in \mathbb{R}^n$ and $\beta \in \mathbb{R}$. Using matrix notation, we can define a polyhedron to be the set

\[
\{ x \in \mathbb{R}^n \mid A^T x \le b \},
\]

where $A$ is an $(n, m)$-matrix and $b \in \mathbb{R}^m$. Clearly a polyhedron is convex and closed. Without proof we note: A set in $\mathbb{R}^n$ is a polytope iff it is a bounded polyhedron.⁴

Kelley's cutting plane algorithm now yields an antitone sequence of outer approximations $P_0 \supset P_1 \supset \cdots \supset \mathcal{F}$ in the following way:

1) Solve

\[
(LP) \qquad \langle c, x \rangle \longrightarrow \min, \qquad x \in P_0 .
\]

If $x^{(0)} \in P_0$ is a minimizer of $(LP)$ with $x^{(0)} \in \mathcal{F}$, then we have found a solution to the convex optimization problem $(P)$; otherwise go to 2).

2) Since $x^{(0)} \notin \mathcal{F}$, there exists an index $i_0$ such that $g_{i_0}(x^{(0)}) > 0$, for example, let $g_{i_0}(x^{(0)}) = \max_{1 \le i \le m} g_i(x^{(0)})$. Set

\[
P_1 := P_0 \cap \big\{ x \in \mathbb{R}^n \mid g_{i_0}(x^{(0)}) + g_{i_0}'(x^{(0)})(x - x^{(0)}) \le 0 \big\},
\]

go to 1) and continue with $P_1$ instead of $P_0$.

It holds that: $x^{(0)} \in P_0$, $x^{(0)} \notin P_1$, $P_0 \supset P_1$ as well as $P_1 \supset \mathcal{F}$; since for $x \in \mathcal{F}$ we have $g_{i_0}(x^{(0)}) + g_{i_0}'(x^{(0)})(x - x^{(0)}) \le g_{i_0}(x) \le 0$. The linear constraint $\ell(x) := g_{i_0}(x^{(0)}) + g_{i_0}'(x^{(0)})(x - x^{(0)}) \le 0$ reduces the given set $P_0$ in such a way that no feasible points are excluded. The hyperplane $\{ x \in \mathbb{R}^n \mid \ell(x) = 0 \}$ is called a cutting plane or simply a cut. Thus, a cut 'cuts off' the current optimizer to the LP 'relaxation', but no feasible point to $(P)$.

Other cutting plane approaches differ in the generation of cuts and the way the set of feasible points is updated when a cut is chosen.
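Steps 1) and 2) can be sketched in code for a small test problem. Everything specific below is an assumption made for illustration: the problem (minimize $x_1 + 2 x_2$ over the unit disk, with $P_0 = [-2, 2]^2$), the names `lp_min_2d` and `kelley`, and the use of vertex enumeration in place of a real LP solver, which only works in two dimensions.

```python
import math
from itertools import combinations

def lp_min_2d(c, rows, tol=1e-9):
    """Minimize <c, x> over {x : a1*x1 + a2*x2 <= b for (a1, a2, b) in rows} by vertex enumeration."""
    best_x, best_val = None, math.inf
    for (p1, p2, pb), (q1, q2, qb) in combinations(rows, 2):
        det = p1 * q2 - p2 * q1
        if abs(det) < 1e-12:
            continue  # parallel lines
        x = ((pb * q2 - p2 * qb) / det, (p1 * qb - pb * q1) / det)
        if all(a1 * x[0] + a2 * x[1] <= b + tol for a1, a2, b in rows):
            val = c[0] * x[0] + c[1] * x[1]
            if val < best_val:
                best_x, best_val = x, val
    return best_x

def kelley(c, g, grad_g, box, eps=1e-6, max_iter=80):
    """Kelley's method for one convex constraint g(x) <= 0, starting from P0 = [lo, hi]^2."""
    lo, hi = box
    rows = [(1.0, 0.0, hi), (-1.0, 0.0, -lo), (0.0, 1.0, hi), (0.0, -1.0, -lo)]
    x = None
    for k in range(max_iter):
        x = lp_min_2d(c, rows)          # step 1): solve the LP relaxation
        gx = g(x)
        if gx <= eps:                   # x is (almost) feasible: stop
            return x, k
        d = grad_g(x)
        # step 2): add the cut g(x_k) + <grad g(x_k), x - x_k> <= 0
        rows.append((d[0], d[1], d[0] * x[0] + d[1] * x[1] - gx))
    return x, max_iter

g = lambda x: x[0] ** 2 + x[1] ** 2 - 1.0
grad_g = lambda x: (2.0 * x[0], 2.0 * x[1])
x_opt, iterations = kelley((1.0, 2.0), g, grad_g, (-2.0, 2.0))
print(x_opt, iterations)
```

For this test problem the exact solution is $x^* = -(1, 2)^T / \sqrt{5}$ with optimal value $-\sqrt{5}$. Each LP value is a lower bound for the optimal value, so the iterates approach $x^*$ from outside the feasible region.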
⁴ Some authors use just the opposite notation.

Concavity Cuts

Cutting plane methods of the kind described above are less suitable for concave objective functions. Therefore, we will give some preliminary remarks which will enable us to describe another class of cutting plane methods. They are based on the iterative reduction of $P_0 = P$.
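Given a polyhedron $P \subset \mathbb{R}^n$ and a concave function $f : \mathbb{R}^n \to \mathbb{R}$, we want to solve the problem:

\[
f(x) \longrightarrow \min, \qquad x \in P
\]

For that let $x^{(0)}$ be a vertex of $P$ with $f(x^{(0)}) =: \gamma$. Suppose that this vertex is nondegenerate in the sense that there are $n$ 'edges' leading off from it in linearly independent directions $u^{(1)}, \dots, u^{(n)}$. Furthermore assume that $f(z^{(\nu)}) \ge \gamma$ for $z^{(\nu)} := x^{(0)} + \vartheta_\nu u^{(\nu)}$ with $\vartheta_\nu > 0$ maximal for $\nu = 1, \dots, n$. By the concavity of $f$, $x^{(0)}$ gives a local minimum of $f$ in $P$. There exists exactly one affinely linear mapping $\ell : \mathbb{R}^n \to \mathbb{R}$ with $\ell(z^{(\nu)}) = 1$ for $\nu = 1, \dots, n$ and $\ell(x^{(0)}) = 0$ (compare the proof of theorem 8.1.4). It is defined by $\ell(x) := \langle a, x - x^{(0)} \rangle$, where $a := Q^{-T} e$ with $e := (1, \dots, 1)^T$ and $Q := \big( z^{(1)} - x^{(0)}, \dots, z^{(n)} - x^{(0)} \big)$; since

\[
\ell(z^{(\nu)}) = \big\langle Q^{-T} e, z^{(\nu)} - x^{(0)} \big\rangle = \big\langle e, Q^{-1} \big( z^{(\nu)} - x^{(0)} \big) \big\rangle = \langle e, e_\nu \rangle = 1 .
\]

Suppose that $w \in P$ is a maximizer to

\[
(LP) \qquad \ell(x) \longrightarrow \max, \qquad x \in P
\]

with $\mu := \ell(w)$. Then it holds that:
1) $\ell(x) > 1$ for $x \in P$ with $f(x) < \gamma$.
2) If $\mu \le 1$, then $x^{(0)}$ yields the global minimum of $f$ on $P$.

Proof: 1): It holds that (cf. exercise 5) $P \subset K := x^{(0)} + \operatorname{cone}\{ u^{(1)}, \dots, u^{(n)} \}$. In addition

\[
S := \operatorname{conv}\{ x^{(0)}, z^{(1)}, \dots, z^{(n)} \} = K \cap \{ x \in \mathbb{R}^n \mid \ell(x) \le 1 \}
\]

can be easily verified as follows: To $w \in S$ there exist $\lambda_\nu \ge 0$ with $\sum_{\nu=0}^{n} \lambda_\nu = 1$ and

\[
w = \lambda_0 x^{(0)} + \sum_{\nu=1}^{n} \lambda_\nu z^{(\nu)} = \lambda_0 x^{(0)} + \sum_{\nu=1}^{n} \lambda_\nu \big( x^{(0)} + \vartheta_\nu u^{(\nu)} \big) = x^{(0)} + \sum_{\nu=1}^{n} \lambda_\nu \vartheta_\nu u^{(\nu)} .
\]

This shows $w \in K$ and

\[
\ell(w) = \lambda_0 \underbrace{\ell(x^{(0)})}_{= 0} + \sum_{\nu=1}^{n} \lambda_\nu \underbrace{\ell(z^{(\nu)})}_{= 1} \le 1 .
\]

Conversely, if $w$ is an element of the right-hand side, we get $w = x^{(0)} + \sum_{\nu=1}^{n} \lambda_\nu u^{(\nu)}$ with suitable $\lambda_\nu \ge 0$ and $\ell(w) \le 1$, consequently,

\[
\begin{aligned}
1 \ge \ell(w) &= \Big\langle a, \sum_{\nu=1}^{n} \lambda_\nu u^{(\nu)} \Big\rangle = \sum_{\nu=1}^{n} \lambda_\nu \langle a, u^{(\nu)} \rangle = \sum_{\nu=1}^{n} \lambda_\nu \big\langle Q^{-T} e, u^{(\nu)} \big\rangle \\
&= \sum_{\nu=1}^{n} \lambda_\nu \big\langle e, Q^{-1} u^{(\nu)} \big\rangle = \sum_{\nu=1}^{n} \frac{\lambda_\nu}{\vartheta_\nu} \big\langle e, Q^{-1} \big( z^{(\nu)} - x^{(0)} \big) \big\rangle = \sum_{\nu=1}^{n} \frac{\lambda_\nu}{\vartheta_\nu} \quad \text{(see above)}.
\end{aligned}
\]

This yields

\[
w = x^{(0)} + \sum_{\nu=1}^{n} \frac{\lambda_\nu}{\vartheta_\nu} \big( z^{(\nu)} - x^{(0)} \big) = \Big( 1 - \sum_{\nu=1}^{n} \frac{\lambda_\nu}{\vartheta_\nu} \Big) x^{(0)} + \sum_{\nu=1}^{n} \frac{\lambda_\nu}{\vartheta_\nu} z^{(\nu)} \in S .
\]

Then we get $f(x) \ge \min \{ f(x^{(0)}), f(z^{(1)}), \dots, f(z^{(n)}) \} = \gamma$ for $x \in S$, and for $x \in P$ with $f(x) < \gamma$ we have $x \notin S$, hence $\ell(x) > 1$.

2): If $\mu \le 1$, then $\ell(x) \le \ell(w) = \mu \le 1$ for $x \in P$ shows $P \subset S$. For $x \in P$ we thus get $f(x) \ge \gamma = f(x^{(0)})$.

The above considerations give that

\[
P^*(\gamma) := \{ x \in P \mid f(x) < \gamma \} \subset P \cap \{ x \in \mathbb{R}^n \mid \ell(x) \ge 1 \} .
\]

The affinely linear inequality $\ell(x) \ge 1$ is called a $\gamma$-cut of $(f, P)$. We may restrict our further search to the subset $P^*(\gamma)$ of $P$. To preserve the polyhedral structure, we consider the larger set $P \cap \{ x \in \mathbb{R}^n \mid \ell(x) \ge 1 \}$ instead.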
Example 3 (Horst/Tuy, p. 200 f)

\[
\begin{aligned}
f(x) = -(x_1 - 1.2)^2 - (x_2 - 0.6)^2 &\longrightarrow \min \\
-2 x_1 + x_2 &\le 1 \\
x_1 + x_2 &\le 4 \\
x_1 - 2 x_2 &\le 2 \\
0 \le x_1 \le 3, \quad 0 \le x_2 &\le 2
\end{aligned}
\]

[Figure: graph of $f$ over the $(x_1, x_2)$-plane.]

Initialization: $x^{(0)} := (0, 0)^T$, $\gamma = f(x^{(0)}) = -1.8$, $P_0 := P$

First Iteration Cycle:

$k = 0$: $u^{(1)} := (1, 0)^T$, $u^{(2)} := (0, 1)^T$
$z^{(1)} = (2.4, 0)^T$, $z^{(2)} = (0, 1.2)^T$; $f(z^{(1)}) = f(z^{(2)}) = -1.8$

We then get $\ell_0(x) = \frac{5}{12} x_1 + \frac{5}{6} x_2$ with $\ell_0(z^{(1)}) = \ell_0(z^{(2)}) = 1$. The linear program

\[
(LP) \qquad \ell_0(x) \longrightarrow \max, \qquad x \in P_0
\]

is solved by $w_0 = (2, 2)^T$ with $\ell_0(w_0) = \frac52 > 1$ and $f(w_0) = -2.6 < \gamma$.

In this simple example it is still possible to easily verify this by hand. In more complex cases, however, we prefer to work with the help of Maple® or Matlab®.

Replace $P_0$ by $P_1 := P_0 \cap \{ x \in \mathbb{R}^n \mid \ell_0(x) \ge 1 \}$.

[Figure: the cut $\ell_0(x) = 1$ through $z^{(1)} = (\frac{12}{5}, 0)^T$ and $z^{(2)} = (0, \frac65)^T$.]

Start with $w_0 = (2, 2)^T$ and calculate a vertex $x^{(1)}$ of $P_1$ which yields a local minimum of $f$ in $P_1$ with $f(x^{(1)}) \le f(w_0)$. Since $f(w_0) < \gamma$, we have $\ell_0(x^{(1)}) > 1$; hence, $x^{(1)}$ is also a vertex of $P$. In this case we obtain $x^{(1)} = (3, 1)^T$ with $f(x^{(1)}) = -3.4$.

Set: $x^{(0)} := (3, 1)^T$, $\gamma := -3.4$, $P_0 := P_1$.
Second Iteration Cycle:

$k = 0$: $u^{(1)} = (0, -1)^T$, $u^{(2)} = (-1, 1)^T$
$z^{(1)} = (3, 0.2)^T$, $z^{(2)} = (1.6, 2.4)^T$; $f(z^{(1)}) = f(z^{(2)}) = -3.4$

\[
\ell_0(x) = \tfrac{50}{7} - \tfrac{55}{28} x_1 - \tfrac54 x_2 ; \qquad \ell_0(z^{(1)}) = \ell_0(z^{(2)}) = 1
\]

The linear program

\[
(LP) \qquad \ell_0(x) \longrightarrow \max, \qquad x \in P_0
\]

is solved by $w_0 = (0.08, 1.16)^T$ with $\ell_0(w_0) = 5.536$ and $f(w_0) = -1.568 > \gamma$. Replace $P_0$ by the set $P_1 := P_0 \cap \{ x \in \mathbb{R}^n \mid \ell_0(x) \ge 1 \}$.
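The cut of the second iteration cycle can be recomputed numerically. The sketch below is restricted to $n = 2$ and makes two assumptions that are not part of the text: the step lengths $\vartheta_\nu$ are found by bracketing and bisection (which presumes that $f$ eventually drops below $\gamma$ along each direction), and the names `max_step` and `gamma_cut` are hypothetical.

```python
def f(x):
    """Objective of Example 3."""
    return -(x[0] - 1.2) ** 2 - (x[1] - 0.6) ** 2

def max_step(x0, u, gamma, iters=80):
    """Largest theta > 0 with f(x0 + theta*u) >= gamma, by bracketing and bisection."""
    point = lambda t: (x0[0] + t * u[0], x0[1] + t * u[1])
    hi = 1.0
    while f(point(hi)) >= gamma:   # expand until f falls below the level gamma
        hi *= 2.0
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(point(mid)) >= gamma:
            lo = mid
        else:
            hi = mid
    return lo

def gamma_cut(x0, directions, gamma):
    """Return (a, z) such that l(x) = <a, x - x0> satisfies l(z_1) = l(z_2) = 1."""
    z = []
    for u in directions:
        t = max_step(x0, u, gamma)
        z.append((x0[0] + t * u[0], x0[1] + t * u[1]))
    # The rows of Q^T are d_nu = z_nu - x0; solve Q^T a = e by Cramer's rule.
    (d11, d12), (d21, d22) = [(p[0] - x0[0], p[1] - x0[1]) for p in z]
    det = d11 * d22 - d12 * d21
    a = ((d22 - d12) / det, (d11 - d21) / det)
    return a, z

x0 = (3.0, 1.0)
gamma = f(x0)
a, z = gamma_cut(x0, [(0.0, -1.0), (-1.0, 1.0)], gamma)
ell = lambda x: a[0] * (x[0] - x0[0]) + a[1] * (x[1] - x0[1])
print(a, z, ell((0.08, 1.16)))
```

The computed coefficients should agree with $a = (-\frac{55}{28}, -\frac54)^T$, and the value at $w_0 = (0.08, 1.16)^T$ with the $5.536$ stated above.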
Start with $w_0 = (0.08, 1.16)^T$ and calculate a vertex $x^{(1)}$ of $P_1$ which yields a local minimum of $f$ in $P_1$. This gives: $x^{(1)} = (1/2, 2)^T$, $f(x^{(1)}) = -2.45 > \gamma$.

$k = 1$: $u^{(1)} = (1, 0)^T$, $u^{(2)} = (-1, -2)^T$
$z^{(1)} = (2.4, 2)^T$, $z^{(2)} = (-0.5253, -0.0506)^T$; $f(z^{(1)}) = f(z^{(2)}) = \gamma$
$\ell_1(x) = 1.238 + 0.5263 x_1 - 0.7508 x_2$; $\ell_1(z^{(1)}) = \ell_1(z^{(2)}) = 1$

The linear program

\[
(LP) \qquad \ell_1(x) \longrightarrow \max, \qquad x \in P_1
\]

is solved by $w_1 = (2.8552, 0.4276)^T$ with $\ell_1(w_1) = 2.4202$ and $f(w_1) = -2.7693 > \gamma$.
$x^{(2)} := w_1$
Replace $P_1$ by $P_2 := P_1 \cap \{ x \in \mathbb{R}^n \mid \ell_1(x) \ge 1 \}$.

In the next two iteration steps we get:

$k = 2$: $z^{(1)} = (1.6, 2.4)^T$, $z^{(2)} = (0.1578, -0.9211)^T$; $f(z^{(1)}) = f(z^{(2)}) = \gamma$
$\ell_2(x) = 1.2641 - 0.4735 x_1 + 0.2056 x_2$; $\ell_2(z^{(1)}) = \ell_2(z^{(2)}) = 1$
$w_2 = (0.7347, 0.8326)^T$, $\ell_2(w_2) = 1.087$, $f(w_2) = -0.2706 > \gamma$
$x^{(3)} := w_2$
Replace $P_2$ by $P_3 := P_2 \cap \{ x \in \mathbb{R}^n \mid \ell_2(x) \ge 1 \}$.

$k = 3$: $z^{(1)} = (2.8492, -0.2246)^T$, $z^{(2)} = (2.4, 2)^T$; $f(z^{(1)}) = f(z^{(2)}) = \gamma$
$\ell_3(x) = -0.4749 + 0.5260 x_1 + 0.1062 x_2$; $\ell_3(z^{(1)}) = \ell_3(z^{(2)}) = 1$
$w_3 = (1.0002, 1.0187)^T$, $\mu := \ell_3(w_3) = 0.1594 < 1$

Thus, $x^{(0)} = (3, 1)^T$ yields the global minimum.
Summarizing the results we conclude with the formulation of a corresponding cutting plane algorithm:

Algorithm

0) Calculate a vertex $x^{(0)}$ which yields a local minimum of $f$ in $P$ such that the function values in the neighboring vertices are greater than or equal to $f(x^{(0)})$. Set $\gamma := f(x^{(0)})$ and $P_0 := P$.

Iteration cycle: For $k = 0, 1, 2, \dots$

1) Construct a $\gamma$-cut $\ell_k$ of $(f, P_k)$ in $x^{(k)}$.

2) Solve

\[
(LP) \qquad \ell_k(x) \longrightarrow \max, \qquad x \in P_k
\]

and get a maximizer $w_k$. If $\ell_k(w_k) \le 1$: STOP; $x^{(0)}$ yields the global minimum. Otherwise:

3) $P_{k+1} := P_k \cap \{ x \in \mathbb{R}^n \mid \ell_k(x) \ge 1 \}$
Start with $w_k$ and find a vertex $x^{(k+1)}$ of $P_{k+1}$ which gives a local minimum of $f$ in $P_{k+1}$ in the above stated sense.
If $f(x^{(k+1)}) \ge \gamma$: Go to iteration $k + 1$.
If $f(x^{(k+1)}) < \gamma$: Set $\gamma := f(x^{(k+1)})$, $x^{(0)} := x^{(k+1)}$, $P_0 := P_{k+1}$ and start a new iteration cycle.

In [Ho/Tu], for example, so-called 'Φ-cuts' are considered which yield 'deeper' cuts and thus reach their goal faster. We, however, do not want to pursue this approach any further.
Exercises

1. Do some numerical experiments to test the efficiency of global optimization software. Take for example Maple® and compare the local solver Optimization[NLPSolve] with GlobalOptimization[GlobalSolve], choosing the options method=multistart (default), method=singlestart, method=branchandbound and method=reducedgradient:

a) Consider the minimization problem of example 1:

\[
f(x) := -\sum_{i=1}^{n} \left( c_i x_i + \tfrac12 x_i^2 \right) \longrightarrow \min, \qquad -1 \le x_i \le 1 \quad (i = 1, \dots, n)
\]

Choose $c_i := \frac{1}{i+1}$ for $i = 1, \dots, n$ and $n \in \{5, 10, 15, 20, 25\}$.

b) The Haverly Pooling Problem, cf. [Hav]

The following problem results from the so-called pooling problem which is one of the fundamental optimization problems encountered in the petroleum industry:

\[
\begin{aligned}
f(x) := 6 x_1 + 16 x_2 - 9 x_5 + 10 x_6 - 15 x_9 &\longrightarrow \min \\
x_1 + x_2 - x_3 - x_4 &= 0 \\
x_3 - x_5 + x_7 &= 0 \\
x_4 + x_8 - x_9 &= 0 \\
-x_6 + x_7 + x_8 &= 0 \\
-2.5 x_5 + 2 x_7 + x_3 x_{10} &\le 0 \\
2 x_8 - 1.5 x_9 + x_4 x_{10} &\le 0 \\
3 x_1 + x_2 - (x_3 + x_4) x_{10} &= 0 \\
lb \le x &\le ub,
\end{aligned}
\]

where $lb = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1)^T$ and $ub = (300, 300, 100, 200, 100, 300, 100, 200, 200, 3)^T$.

Hint: The best value we know is $f(x^*) = -400$ attained for $x^* = (0, 100, 0, 100, 0, 100, 0, 100, 200, 1)^T$.

2. a) Let $M \subset \mathbb{R}^n$ be a nonempty convex set and $f : M \to \mathbb{R}$ a function with the convex envelope $F = f_M$. Then for any affinely linear function $\ell : M \to \mathbb{R}$ the function $f + \ell$ has the convex envelope $F + \ell$.
Hint: $f_M + g_M \le (f + g)_M$ for arbitrary functions $f, g : M \to \mathbb{R}$.

b) Let the functions $f_j : M := [-1, 1] \to \mathbb{R}$ be defined by $f_1(x) := 1 - x^2$, $f_2(x) := x^2$ and $f_3(x) := f_1(x) + f_2(x)$. Determine the convex envelopes to $f_j$ on $M$ $(j = 1, 2, 3)$.
3. Let $f : R \to \mathbb{R}$ be defined by $f(x) := x_1 x_2$ on the rectangle $R := [a_1, b_1] \times [a_2, b_2]$. Show that the convex envelope of $f$ over $R$ is given by $F(x) := \max\{ F_1(x), F_2(x) \}$, where

\[
\begin{aligned}
F_1(x) &:= a_1 a_2 + a_2 (x_1 - a_1) + a_1 (x_2 - a_2) \\
F_2(x) &:= b_1 b_2 + b_2 (x_1 - b_1) + b_1 (x_2 - b_2).
\end{aligned}
\]

Hint: Use the Taylor expansion and the interpolation properties of $F_1$ and $F_2$.

4. Zwart's counterexample, cf. [Fa/Ho]

\[
2 - (x_1 - 1)^2 - x_2^2 - (x_3 - 1)^2 \longrightarrow \min, \qquad A^T x \le b,
\]

where

\[
A = \begin{pmatrix}
1 & -1 & 12 & 12 & -6 & -1 & 0 & 0 \\
1 & 1 & 5 & 12 & 1 & 0 & -1 & 0 \\
-1 & -1 & 12 & 7 & 1 & 0 & 0 & -1
\end{pmatrix}
\quad \text{and} \quad b = (1, -1, 34.8, 29.1, -4.1, 0, 0, 0)^T .
\]

Solve this problem by means of the branch and bound method (cf. theorem 8.2.1).

5. Let $P := \{ x \in \mathbb{R}^n \mid A^T x \le b \}$ be a polytope with $A \in \mathbb{R}^{n \times m}$, $b \in \mathbb{R}^m$ and $m \ge n$.
a) $x^{(0)} \in P$ is a vertex of $P$ iff the matrix $A_J$ with $J := \mathcal{A}(x^{(0)})$ has rank $n$.
b) Let $x^{(0)}$ be a nondegenerate vertex of $P$, that is, $|\mathcal{A}(x^{(0)})| = n$. Then there exist exactly $n$ neighbors $y^{(1)}, \dots, y^{(n)}$ of $x^{(0)}$.
c) Let $x^{(0)}$ be a nondegenerate vertex of $P$ and $y^{(1)}, \dots, y^{(n)} \in P$ its neighbors. Show for $u^{(\nu)} := y^{(\nu)} - x^{(0)}$ $(\nu = 1, \dots, n)$:

\[
\begin{aligned}
P &\subset \{ x \in \mathbb{R}^n \mid A_J^T x \le b_J \} = x^{(0)} + \{ u \in \mathbb{R}^n \mid A_J^T u \le 0 \} \\
\{ u \in \mathbb{R}^n \mid A_J^T u \le 0 \} &= \operatorname{cone}\{ u^{(1)}, \dots, u^{(n)} \}
\end{aligned}
\]

6. Solve the quadratic optimization problem

\[
f(x) \longrightarrow \min, \qquad A^T x \le b
\]

from exercise 8 in chapter 4 by means of the Kelley cutting plane method. Derive upper bounds $U_k$ and lower bounds $L_k$ for $p^* := \inf\{ f(x) \mid A^T x \le b \}$ and terminate the calculation when $U_k - L_k < \varepsilon$.

Hint: The quadratic optimization problem is equivalent to

\[
t \longrightarrow \min, \qquad f(x) \le t, \quad A^T x \le b .
\]

Start the iteration with $P_0 := \big\{ \binom{x}{t} \in \mathbb{R}^3 \mid A^T x \le b,\ t \ge \alpha \big\}$ for a suitable $\alpha \in \mathbb{R}$. You can take $L_k := t^{(k)}$, $U_k := f(x^{(k)})$ where $\binom{x^{(k)}}{t^{(k)}}$ is the solution to the linear program

\[
t \longrightarrow \min, \qquad \binom{x}{t} \in P_k .
\]

7. Generalize the implementation of Shor's ellipsoid method (cf. chapter 3, exercise 2) to convex constrained problems

\[
f(x) \longrightarrow \min, \qquad g_i(x) \le 0 \quad (i = 1, \dots, m).
\]

Test the program with the quadratic problem from exercise 8 in chapter 4.
Hint: Construct the ellipsoids $E^{(k)}$ in such a way that they contain the minimizer $x^*$.
Case 1: $x^{(k)} \in \mathcal{F}$; update $E^{(k)}$ based on $\nabla f(x^{(k)})$ like in chapter 3.
Case 2: $x^{(k)} \notin \mathcal{F}$, say $g_j(x^{(k)}) > 0$; $g_j'(x^{(k)})(x - x^{(k)}) > 0$ implies $g_j(x) > 0$, that is, $x \notin \mathcal{F}$. Update $E^{(k)}$ based on $\nabla g_j(x^{(k)})$ and eliminate the halfspace of infeasible points.

8. Solve the linearly constrained quadratic problem

\[
\begin{aligned}
-(x_1 - 2.5)^2 - 4 (x_2 - 1)^2 &\longrightarrow \min \\
x_1 + x_2 &\le 10 \\
x_1 + 5 x_2 &\le 22 \\
-3 x_1 + 2 x_2 &\le 2 \\
x_1 - 2 x_2 &\le 4 \\
x_1 \ge 0, \quad x_2 &\ge 0
\end{aligned}
\]

by means of concavity cuts. Start the iteration at $x^{(0)} := (0, 0)^T$.
Appendices
A
B C
A Second Look at the Constraint Qualifications The Linearized Problem Correlation to the Constraint Qualifications The Fritz John Condition Optimization Software for Teaching and Learning R Matlab Optimization Toolbox SeDuMi: An Introduction by Examples R Maple Optimization Tools
A  A Second Look at the Constraint Qualifications

The Guignard constraint qualification $C_\ell(x_0)^* = C_t(x_0)^*$ in section 2.2 seems to come somewhat out of the blue. The correlation between this regularity condition and the corresponding 'linearized' problem, which we will discuss now, makes the matter more transparent.
The Linearized Problem

We again consider the general problem:

\[
(P) \qquad
\begin{cases}
f(x) \longrightarrow \min \\
g_i(x) \le 0 & \text{for } i \in I := \{1, \dots, m\} \\
h_j(x) = 0 & \text{for } j \in E := \{1, \dots, p\}.
\end{cases}
\]

Here let $n \in \mathbb{N}$, $m, p \in \mathbb{N}_0$ (hence, $E = \emptyset$ or $I = \emptyset$ permitted), the real-valued functions $f, g_1, \dots, g_m, h_1, \dots, h_p$ be defined on an open subset $D$ of $\mathbb{R}^n$, and $p \le n$. The set

\[
\mathcal{F} := \{ x \in D \mid g_i(x) \le 0 \text{ for } i \in I,\ h_j(x) = 0 \text{ for } j \in E \}
\]

was called the feasible region or set of feasible points of $(P)$. If we combine, as usual, the $m$ functions $g_i$ to a vector-valued function $g$ and the $p$ functions $h_j$ to a vector-valued function $h$, respectively, then we are able to state problem $(P)$ in the shortened form

\[
(P) \qquad
\begin{cases}
f(x) \longrightarrow \min \\
g(x) \le 0 \\
h(x) = 0
\end{cases}
\]

with the feasible region $\mathcal{F} = \{ x \in D \mid g(x) \le 0,\ h(x) = 0 \}$.

If we assume the functions $f$, $g$ and $h$ to be differentiable at a point $x_0$ in $D$, we can 'linearize' them and with

\[
\begin{aligned}
\tilde f(x) &:= f(x_0) + f'(x_0)(x - x_0) \\
\tilde g(x) &:= g(x_0) + g'(x_0)(x - x_0) \\
\tilde h(x) &:= h(x_0) + h'(x_0)(x - x_0)
\end{aligned}
\]

draw on the (generally simpler) 'linearized problem'

\[
\big( P(x_0) \big) \qquad
\begin{cases}
\tilde f(x) \longrightarrow \min \\
\tilde g(x) \le 0 \\
\tilde h(x) = 0
\end{cases}
\]

with the feasible region $\mathcal{F}(x_0) := \{ x \in \mathbb{R}^n \mid \tilde g(x) \le 0,\ \tilde h(x) = 0 \}$ for comparison and apply the known optimality conditions to it.
If a local minimizer $x_0$ of $(P)$ also solves $\big( P(x_0) \big)$, then the KKT conditions for $\big( P(x_0) \big)$ are met and hence also for $(P)$ since the gradients that occur are the same in both problems. In this case we get a necessary condition. A simple example, however, shows that a minimizer $x_0$ of $(P)$ does not necessarily yield a (local) solution to $\big( P(x_0) \big)$:

Example 1
To illustrate this fact, let us look at the following problem (with $n = 2$):

\[
(P) \qquad
\begin{cases}
f(x) := x_1 \longrightarrow \min & \big( x = (x_1, x_2)^T \in \mathbb{R}^2 \big) \\
g_1(x) := -x_1^3 + x_2 \le 0 \\
g_2(x) := -x_2 \le 0
\end{cases}
\]

Here we have $p = 0$, $m = 2$, and $D$ can be chosen as $\mathbb{R}^2$. We thus obtain the following set of feasible points:

\[
\mathcal{F} = \{ x \in \mathbb{R}^2 \mid x_2 \ge 0,\ x_2 \le x_1^3 \}
\]

Since $x_1 \ge 0$ holds for $x \in \mathcal{F}$, $x_0 := (0, 0)^T$ yields the minimum of $(P)$. As the linearized problem we get

\[
\big( P(x_0) \big) \qquad
\begin{cases}
\tilde f(x) = f(x) = x_1 \longrightarrow \min \\
\tilde g_1(x) = x_2 \le 0 \\
\tilde g_2(x) = g_2(x) = -x_2 \le 0
\end{cases}
\]

with the set of feasible points $\mathcal{F}(x_0) = \{ x \in \mathbb{R}^2 \mid x_2 = 0 \}$.

The function $\tilde f$, however, is not even bounded from below on $\mathcal{F}(x_0)$.
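Both observations can be made explicit with a short calculation (added here as an illustration, using the notation $\tilde f$ and $\mathcal{F}(x_0)$ for the linearized objective and its feasible region). Along the points $x(t) := (-t, 0)^T \in \mathcal{F}(x_0)$ we have

\[
\tilde f\big( x(t) \big) = -t \longrightarrow -\infty \quad (t \to \infty),
\]

and $x_0$ is not a KKT point of $(P)$, since for all $\lambda_1, \lambda_2 \ge 0$

\[
\nabla f(x_0) + \lambda_1 \nabla g_1(x_0) + \lambda_2 \nabla g_2(x_0) = \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \lambda_1 \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \lambda_2 \begin{pmatrix} 0 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ \lambda_1 - \lambda_2 \end{pmatrix} \ne 0 .
\]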
The question of in which cases a local minimizer $x_0$ to $(P)$ also solves $\big( P(x_0) \big)$ locally leads to the regularity conditions:

We can rewrite problem $\big( P(x_0) \big)$ with the set of indices of the active inequalities $\mathcal{A}(x_0) := \{ i \in I : g_i(x_0) = 0 \}$ as

\[
\big( P_1(x_0) \big) \qquad
\begin{cases}
f'(x_0)(x - x_0) \longrightarrow \min \\
g_i'(x_0)(x - x_0) \le 0 & \text{for } i \in \mathcal{A}(x_0) \\
h'(x_0)(x - x_0) = 0
\end{cases}
\]

since $h(x_0) = 0$ and $g_i(x_0) = 0$ for $i \in \mathcal{A}(x_0)$. In this case we were obviously able to drop the additive constant $f(x_0)$ in the objective function.

Let us furthermore denote the set of feasible points of $\big( P_1(x_0) \big)$ by

\[
\mathcal{F}_1(x_0) := \{ x \in \mathbb{R}^n : h'(x_0)(x - x_0) = 0,\ g_i'(x_0)(x - x_0) \le 0 \text{ for } i \in \mathcal{A}(x_0) \}.
\]

Lemma A.1
Let $x_0 \in \mathcal{F}$ and $f, g, h$ differentiable in $x_0$. Then $x_0$ is a local minimizer of $\big( P(x_0) \big)$ if and only if $x_0$ is a local minimizer of problem $\big( P_1(x_0) \big)$.

Proof: If $x_0$ is a local minimizer of $\big( P_1(x_0) \big)$, then $f'(x_0)(x - x_0) \ge 0$ holds for $x \in \mathcal{F}_1(x_0)$ in a suitable neighborhood of $x_0$. Since $x_0 \in \mathcal{F}(x_0) \subset \mathcal{F}_1(x_0)$, $x_0$ gives locally a solution to $\big( P(x_0) \big)$.
If conversely $x_0$ is a local minimizer of $(P_\ell(x_0))$, then $\tilde f(x) \ge \tilde f(x_0) = f(x_0)$ holds locally, hence $f'(x_0)(x - x_0) \ge 0$ for $x \in \mathcal{F}_\ell(x_0)$ near $x_0$. For $x \in \mathcal{F}_1(x_0)$ we consider the vector $u(t) := x_0 + t(x - x_0)$ with $t > 0$. It holds that
$$h'(x_0)(u(t) - x_0) = t\,h'(x_0)(x - x_0) = 0$$
and for $i \in A(x_0)$
$$g_i'(x_0)(u(t) - x_0) = t\,g_i'(x_0)(x - x_0) \le 0.$$
For $i \in I \setminus A(x_0)$ we have $g_i(x_0) < 0$ and thus for t sufficiently small
$$g_i(x_0) + g_i'(x_0)(u(t) - x_0) = g_i(x_0) + t\,g_i'(x_0)(x - x_0) < 0.$$
Hence, for such t the vector u(t) is in $\mathcal{F}_\ell(x_0)$, consequently
$$f(x_0) + f'(x_0)(u(t) - x_0) = \tilde f(u(t)) \ge \tilde f(x_0) = f(x_0)$$
and thus
$$t\,f'(x_0)(x - x_0) = f'(x_0)(u(t) - x_0) \ge 0,$$
hence $f'(x_0)(x - x_0) \ge 0$. Since $x \in \mathcal{F}_1(x_0)$ was chosen arbitrarily, $x_0$ yields a local solution to $(P_1(x_0))$.

With the transformation $d := x - x_0$, we obtain the following problem which is equivalent to $(P_1(x_0))$:
$$(P_2(x_0))\quad \begin{cases} f'(x_0)d \longrightarrow \min \\ g_i'(x_0)d \le 0 \ \text{ for } i \in A(x_0) \\ h'(x_0)d = 0 \end{cases}$$
From the lemma we have just proven we can now easily deduce

Lemma A.2 Let $x_0 \in \mathcal{F}$ and f, g, h differentiable in $x_0$. Then $x_0$ is a local minimizer of $(P_\ell(x_0))$ if and only if $0\ (\in \mathbb{R}^n)$ minimizes $(P_2(x_0))$ locally.

The set of feasible points of $(P_2(x_0))$,
$$\mathcal{F}_2(x_0) := \{d \in \mathbb{R}^n : h'(x_0)d = 0 \ \text{ and } \ g_i'(x_0)d \le 0 \ \text{ for } i \in A(x_0)\},$$
is nothing else but the linearizing cone $C_\ell(x_0)$ to $\mathcal{F}$ in $x_0$. We see that this cone occurs here in a very natural way. Let us summarize the results we have obtained so far:
Proposition Let $x_0 \in \mathcal{F}$ and f, g, h differentiable in $x_0$. Then the following assertions are equivalent:
(a) $x_0$ gives a local solution to $(P_\ell(x_0))$.
(b) $0\ (\in \mathbb{R}^n)$ gives a local solution to $(P_2(x_0))$.
(c) $\nabla f(x_0) \in C_\ell(x_0)^*$
(d) $C_\ell(x_0) \cap C_{dd}(x_0) = \emptyset$
(e) $x_0$ is a KKT point.
Here $C_{dd}(x_0) = \{d \in \mathbb{R}^n \mid f'(x_0)d < 0\}$ denotes the cone of descent directions of f at $x_0$.

Proof: Lemma A.2 gives the equivalence of (a) and (b). (b) means $\nabla f(x_0)^T d \ge 0$ for all $d \in \mathcal{F}_2(x_0) = C_\ell(x_0)$, hence $\nabla f(x_0) \in C_\ell(x_0)^*$, which gives (c). The equivalence of (c) and (d) can be deduced directly from the definition of $C_{dd}(x_0)$. We had formulated this finding as lemma 2.2.2. Proposition 2.2.1 yields precisely the equivalence of (d) and (e).

Recall that for convex optimization problems (with continuously differentiable functions f, g, h) every KKT point is a global minimizer (proposition 2.2.8).

Correlation to the Constraint Qualifications
For $x_0 \in \mathcal{F}$ we had defined the cone of tangents or tangent cone of $\mathcal{F}$ in $x_0$ by
$$C_t(x_0) := \{d \in \mathbb{R}^n \mid \exists\, (x_k) \in \mathcal{F}^{\mathbb{N}} \ \ x_k \xrightarrow{\,d\,} x_0\}.$$
Here $x_k \xrightarrow{\,d\,} x_0$ meant: There exists a sequence $(\alpha_k)$ of positive numbers with $\alpha_k \downarrow 0$ and
$$\frac{1}{\alpha_k}(x_k - x_0) \longrightarrow d \ \text{ for } k \longrightarrow \infty.$$
Following lemma 2.2.3, it holds that $C_t(x_0) \subset C_\ell(x_0)$, hence $C_\ell(x_0)^* \subset C_t(x_0)^*$. Lemma 2.2.4 gives that for a minimizer $x_0$ of $(P)$ it always holds that $\nabla f(x_0) \in C_t(x_0)^*$.
If a local minimizer $x_0$ of $(P)$ also yields a local solution to $(P_\ell(x_0))$, the preceding proposition gives $\nabla f(x_0) \in C_\ell(x_0)^*$.
This leads to the demand $C_t(x_0)^* \subset C_\ell(x_0)^*$, hence $C_t(x_0)^* = C_\ell(x_0)^*$, a condition which is named after Guignard. We now want to take up example 1 (example 4 from section 2.2) once more:

Example 2 With $A(x_0) = \{1, 2\}$ we get
$$C_\ell(P, x_0) = \{d \in \mathbb{R}^2 \mid \forall\, i \in A(x_0)\ \ g_i'(x_0)d \le 0\} = \{d \in \mathbb{R}^2 \mid d_2 = 0\}.$$
In section 2.2 we had obtained for the cone of tangents (cf. page 50)
$$C_t(x_0) = \{d \in \mathbb{R}^2 \mid d_1 \ge 0,\ d_2 = 0\},$$
thus it holds that $C_t(x_0) \ne C_\ell(P, x_0)$. If we write the feasible region in a different way and consider
$$(P_0)\quad \begin{cases} f(x) := x_1 \longrightarrow \min, \qquad x = (x_1, x_2)^T \in \mathbb{R}^2 \\ g_1(x) := -x_1^3 + x_2 \le 0 \\ g_2(x) := -x_2 \le 0 \\ g_3(x) := -x_1 \le 0, \end{cases}$$
we obtain as linearization at the point $x_0 := (0, 0)^T$
$$\begin{cases} \tilde f(x) = f(x) = x_1 \longrightarrow \min \\ \tilde g_1(x) = x_2 \le 0 \\ \tilde g_2(x) = g_2(x) = -x_2 \le 0 \\ \tilde g_3(x) = g_3(x) = -x_1 \le 0 \end{cases}$$
with the set of feasible points $\mathcal{F}_\ell(x_0) = \{x \in \mathbb{R}^2 \mid x_1 \ge 0,\ x_2 = 0\}$.
The additional condition $g_3'(x_0)d \le 0$ with $g_3'(x_0) = (-1, 0)$ gives $d_1 \ge 0$. It therefore holds that
$$C_\ell(P_0, x_0) = \{d \in \mathbb{R}^2 \mid d_1 \ge 0,\ d_2 = 0\} = C_t(x_0),$$
in particular $C_\ell(P_0, x_0)^* = C_t(x_0)^*$.
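As a worked check of Example 2, the dual cones can be computed by hand. The lines below are a sketch that assumes the convention $C^* = \{v \mid \langle v, d \rangle \ge 0 \text{ for all } d \in C\}$, which is the one used in the proof of the preceding proposition.

```latex
\begin{align*}
C_\ell(P,x_0) = \mathbb{R}\times\{0\}
  &\;\Longrightarrow\; C_\ell(P,x_0)^* = \{0\}\times\mathbb{R}, \\
C_t(x_0) = [0,\infty)\times\{0\}
  &\;\Longrightarrow\; C_t(x_0)^* = [0,\infty)\times\mathbb{R}, \\
\nabla f(x_0) = (1,0)^T
  &\;\Longrightarrow\; \nabla f(x_0) \in C_t(x_0)^* \setminus C_\ell(P,x_0)^* .
\end{align*}
```

Under this convention the Guignard condition fails for the first description of the feasible region, whereas for the description with the additional constraint $g_3$ both dual cones equal $[0,\infty)\times\mathbb{R}$ and contain $\nabla f(x_0)$.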
B The Fritz John Condition

In this appendix we want to have a look at Fritz John's important paper from 1948 [John] in which he derives a necessary condition for a minimizer of a general optimization problem subject to an arbitrary number of constraints. John then applies this condition in particular to the problem of finding enclosing balls or ellipsoids of minimal volume.
Theorem B.1 (John, 1948) Let $D \subset \mathbb{R}^n$ be open and Y a compact Hausdorff space¹. Consider the optimization problem:
$$(J)\quad f(x) \longrightarrow \min \ \text{ subject to the constraint } \ g(x, y) \le 0 \ \text{ for all } y \in Y.$$
Let the objective function $f : D \longrightarrow \mathbb{R}$ be continuously differentiable and the function $g : D \times Y \longrightarrow \mathbb{R}$, which gives the constraints, continuous as well as continuously differentiable in x, hence in the components $x_1, \dots, x_n$. Let $x_0 \in \mathcal{F} := \{x \in D \mid g(x, y) \le 0 \text{ for all } y \in Y\}$ yield a local solution to (J), that is, there exists a neighborhood U of $x_0$ such that $f(x_0) \le f(x)$ for all $x \in U \cap \mathcal{F}$. Then there exist an $s \in \mathbb{N}_0$ with $0 \le s \le n$, scalars $\lambda_0, \dots, \lambda_s$, which are not all zero, and $y_1, \dots, y_s \in Y$ such that
(i) $g(x_0, y_\sigma) = 0$ for $1 \le \sigma \le s$,
(ii) $\lambda_0 \ge 0,\ \lambda_1 > 0, \dots, \lambda_s > 0$ and
(iii) $\lambda_0 \nabla f(x_0) + \sum_{\sigma=1}^{s} \lambda_\sigma \nabla_x g(x_0, y_\sigma) = 0$.

Each $y \in Y$ gives a constraint, which means that an arbitrary number of constraints is permitted. If we compare assertions (i)–(iii) to the Karush–Kuhn–Tucker conditions (cf. 2.2.5), we firstly see that $\lambda_0$ occurs here as an additional Lagrangian multiplier. If $\lambda_0 = 0$, assertion (iii) is not very helpful since the objective function f is irrelevant in this case. Otherwise, if $\lambda_0 > 0$, (iii) yields precisely the KKT condition for the Lagrangian multipliers $\tilde\lambda_\sigma := \lambda_\sigma / \lambda_0$ of $g(\,\cdot\,, y_\sigma)$ for $\sigma = 1, \dots, s$. An advantage, compared to theorem 2.2.5 for example, is the fact that no constraint qualifications are necessary. However, we have to strengthen the smoothness conditions. In special types of problems (for example the problem of finding the minimum volume enclosing ellipsoid) it is possible to deduce that $\lambda_0$ is positive. In these cases John's theorem is more powerful than theorem 2.2.5.
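The case $\lambda_0 = 0$ already occurs in Example 1 of Appendix A. The following worked lines are a sketch under the assumption that we take the finite index set $Y = \{1, 2\}$ and $g(x, y) := g_y(x)$ with $g_1(x) = -x_1^3 + x_2$, $g_2(x) = -x_2$, $f(x) = x_1$ and $x_0 = (0,0)^T$, so that both constraints are active.

```latex
\begin{align*}
\nabla f(x_0) = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
\nabla g_1(x_0) &= \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad
\nabla g_2(x_0) = \begin{pmatrix} 0 \\ -1 \end{pmatrix}, \\
\lambda_0 \begin{pmatrix} 1 \\ 0 \end{pmatrix}
 + \lambda_1 \begin{pmatrix} 0 \\ 1 \end{pmatrix}
 + \lambda_2 \begin{pmatrix} 0 \\ -1 \end{pmatrix}
 &= \begin{pmatrix} \lambda_0 \\ \lambda_1 - \lambda_2 \end{pmatrix}
 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
 \iff \lambda_0 = 0, \ \lambda_1 = \lambda_2 .
\end{align*}
```

With $s = 2$ and, for instance, $\lambda_1 = \lambda_2 = 1$, conditions (i)–(iii) hold, but only with $\lambda_0 = 0$. This is consistent with the fact that $x_0$ is a minimizer of Example 1 without being a KKT point.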
In preparation for the proof of the theorem we first prove the following two lemmata.

Lemma B.2 Let $A \in \mathbb{R}^{s \times n}$. Then exactly one of the following two assertions holds:
1) There exists a $z \in \mathbb{R}^n$ such that $Az < 0$.
2) There exists a $\lambda \in \mathbb{R}^s_+ \setminus \{0\}$ such that $\lambda^T A = 0$.

¹ If you are not familiar with the term "compact Hausdorff space", think of a compact subset of an $\mathbb{R}^k$, for example.
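The two alternatives of Lemma B.2 can be illustrated on small matrices. The Python sketch below only verifies given certificates; the function names `certifies_1` and `certifies_2` are hypothetical, and the second matrix is assumed to consist of the gradients of f, g1 and g2 from Example 1 of Appendix A at x0 = (0, 0).

```python
def matvec(A, z):
    """Compute A z for a matrix given as a list of rows."""
    return [sum(a * zi for a, zi in zip(row, z)) for row in A]

def vecmat(lam, A):
    """Compute lam^T A for a matrix given as a list of rows."""
    return [sum(l * row[j] for l, row in zip(lam, A)) for j in range(len(A[0]))]

def certifies_1(A, z):
    """True if z satisfies A z < 0 componentwise (alternative 1)."""
    return all(v < 0 for v in matvec(A, z))

def certifies_2(A, lam):
    """True if lam >= 0, lam != 0 and lam^T A = 0 (alternative 2)."""
    nonneg = all(l >= 0 for l in lam)
    nonzero = any(l > 0 for l in lam)
    return nonneg and nonzero and all(v == 0 for v in vecmat(lam, A))

A1 = [[1, 0], [0, 1]]            # alternative 1) holds with z = (-1, -1)
A2 = [[1, 0], [0, 1], [0, -1]]   # alternative 2) holds with lam = (0, 1, 1)

print(certifies_1(A1, [-1, -1]))
print(certifies_2(A2, [0, 1, 1]))
```

For the second matrix no z with A z < 0 can exist, because the second and third rows would require z2 < 0 and -z2 < 0 at the same time.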
Proof: 1) and 2) cannot hold at the same time since we get a contradiction from $Az < 0$ and $\lambda^T A = 0$ for a $\lambda \in \mathbb{R}^s_+ \setminus \{0\}$ by $0 = (\lambda^T A)z = \lambda^T (Az) < 0$.
If 1) does not hold, there does not exist any $\binom{v}{z} \in (-\infty, 0) \times \mathbb{R}^n$ with $v\,e - Az \ge 0$ for $e := (1, \dots, 1)^T \in \mathbb{R}^s$ either. Hence, the inequality $(1, 0, \dots, 0)\binom{v}{z} \ge 0$ follows from $(e, -A)\binom{v}{z} \ge 0$ for all $\binom{v}{z} \in \mathbb{R} \times \mathbb{R}^n$. Following the Theorem of the Alternative (Farkas, cf. p. 43) there thus exists a $\lambda \in \mathbb{R}^s_+$ with $\lambda^T (e, -A) = (1, 0, \dots, 0)$. Consequently $\lambda \ne 0$ and $\lambda^T A = 0$, hence 2).

Lemma B.3 Let K be a nonempty, convex and compact subset of $\mathbb{R}^n$. If for all $z \in K$ there exists a $u \in K$ such that $\langle u, z \rangle = 0$, then it holds that $0 \in K$.

Proof: Let $u_0$ be an element of K with minimal norm. We have to show that $u_0 = 0$. By assumption there exists a $u_1 \in K$ to $u_0$ with $\langle u_0, u_1 \rangle = 0$. For that the relation
$$\|u_0\|^2 \le \|(1 - \lambda)u_0 + \lambda u_1\|^2 = (1 - \lambda)^2 \|u_0\|^2 + \lambda^2 \|u_1\|^2$$
follows for all $\lambda \in [0, 1]$. For $\lambda \ne 0$ we obtain from the above:
$$2\|u_0\|^2 \le \lambda \left(\|u_0\|^2 + \|u_1\|^2\right).$$
Passage to the limit as $\lambda \longrightarrow 0$ yields $\|u_0\| = 0$, hence $u_0 = 0$.

Proof of theorem B.1: The set $Y(x_0) := \{y \in Y \mid g(x_0, y) = 0\}$, as the pre-image of the closed set $\{0\}$ under the continuous function $g(x_0, \cdot\,)$, is closed and therefore compact since it is a subset of the compact set Y. We prove the following auxiliary assertion:
$$\text{There exists no } z \in \mathbb{R}^n \text{ with } \langle \nabla f(x_0), z \rangle < 0 \text{ and } \langle \nabla_x g(x_0, y), z \rangle < 0 \text{ for all } y \in Y(x_0). \qquad (1)$$
For the proof of the auxiliary assertion we assume (1) to have a solution z. The continuity of $\nabla f$ and $\nabla_x g$, and the compactness of $Y(x_0)$ give: There exist $\delta, \eta > 0$ such that the neighborhood $U := \{x \in \mathbb{R}^n \mid \|x - x_0\| \le \eta\}$ of $x_0$ lies completely in D, as well as an open set $Y_0 \supset Y(x_0)$ with $\langle \nabla f(x), z \rangle \le -\delta$ and $\langle \nabla_x g(x, y), z \rangle \le -\delta$ for all $y \in Y_0$ and all $x \in U$. Since $Y \setminus Y_0$ is compact and $g(x_0, y) < 0$ for all $y \in Y \setminus Y_0$, there exists furthermore an $\varepsilon > 0$ with $g(x_0, y) \le -\varepsilon$ for such y. Now choose a $t \in (0, 1)$ with $x_0 + tz \in U$ and $t\,|\langle \nabla_x g(x, y), z \rangle| \le \varepsilon/2$ for all $(x, y) \in U \times (Y \setminus Y_0)$. With suitable $\vartheta_0 \in (0, 1)$ and $\vartheta_y \in (0, 1)$ for $y \in Y$ it holds that
$$f(x_0 + tz) = f(x_0) + t\,\langle \nabla f(x_0 + \vartheta_0 tz), z \rangle$$
and
$$g(x_0 + tz, y) = g(x_0, y) + t\,\langle \nabla_x g(x_0 + \vartheta_y tz, y), z \rangle.$$
With $x_0 + tz$ the points $x_0 + \vartheta_0 tz$ and $x_0 + \vartheta_y tz$ for $y \in Y$ also lie in U and we therefore get with the above estimates:
$$f(x_0 + tz) \le f(x_0) - t\delta, \qquad g(x_0 + tz, y) \le g(x_0, y) - t\delta \ \text{ for } y \in Y_0$$
and
$$g(x_0 + tz, y) \le g(x_0, y) + t\,|\langle \nabla_x g(x_0 + \vartheta_y tz, y), z \rangle| \le -\frac{\varepsilon}{2} \ \text{ for } y \in Y \setminus Y_0.$$
Consequently, $x_0 + tz \in \mathcal{F} \cap U$ with $f(x_0 + tz) < f(x_0)$ is a contradiction to the local minimality of f in $x_0$, which proves the auxiliary assertion.

For the rest of the proof we distinguish two cases:

Case 1: $s := |Y(x_0)| \le n$, that is, there exist $y_1, \dots, y_s \in Y$ with $Y(x_0) = \{y_1, \dots, y_s\}$. If we set $A^T := \left(\nabla f(x_0), \nabla_x g(x_0, y_1), \dots, \nabla_x g(x_0, y_s)\right) \in \mathbb{R}^{n \times (s+1)}$ and apply lemma B.2, the desired result follows immediately from the unsolvability of (1).

Case 2: $|Y(x_0)| \ge n + 1$: Set $T := \{\nabla f(x_0)\} \cup \{\nabla_x g(x_0, y) \mid y \in Y(x_0)\}$. The continuity of $\nabla_x g(x_0, \cdot\,)$ on the compact set $Y(x_0)$ gives the compactness of T. Since (1) is unsolvable, it holds that: For all $z \in \mathbb{R}^n$ there exist $u_1, u_2 \in T$ with $\langle u_1, z \rangle \le 0 \le \langle u_2, z \rangle$ and thus a $u \in \operatorname{conv}(T)$ with $\langle u, z \rangle = 0$. By Carathéodory's lemma (cf. exercise 7, chapter 2) the convex hull $\operatorname{conv}(T)$ of T is given by
$$\left\{ \sum_{\sigma=1}^{s} \varrho_\sigma u_\sigma \ \middle|\ u_\sigma \in T,\ \varrho_\sigma > 0 \ (\sigma = 1, \dots, s),\ \sum_{\sigma=1}^{s} \varrho_\sigma = 1;\ 1 \le s \le n + 1 \right\}.$$
Since it is compact (cf. exercise 7, chapter 2), we can apply lemma B.3 and obtain $0 \in \operatorname{conv}(T)$. With the above representation we get that there exist $\varrho_\sigma \ge 0$ and $u_\sigma \in T$ for $\sigma = 1, \dots, n + 1$ with
$$\sum_{\sigma=1}^{n+1} \varrho_\sigma = 1 \ \text{ and } \ \sum_{\sigma=1}^{n+1} \varrho_\sigma u_\sigma = 0, \qquad (2)$$
where wlog $\varrho_1, \dots, \varrho_s > 0$ for an s with $1 \le s \le n + 1$. If $\nabla f(x_0)$ is one of these $u_\sigma$, we are done. If otherwise $s \le n$ holds, the assertion follows with $\lambda_0 := 0$. Hence, only the case $s = n + 1$ remains: In this case, however, the $n + 1$ vectors $u_\sigma - \nabla f(x_0)$ are linearly dependent and therefore there exists an $\alpha := (\alpha_1, \dots, \alpha_{n+1})^T \in \mathbb{R}^{n+1} \setminus \{0\}$ such that $\sum_{\sigma=1}^{n+1} \alpha_\sigma (u_\sigma - \nabla f(x_0)) = 0$ and wlog $\sum_{\sigma=1}^{n+1} \alpha_\sigma \le 0$. Setting $\alpha_0 := -\sum_{\sigma=1}^{n+1} \alpha_\sigma$ gives $\sum_{\sigma=1}^{n+1} \alpha_\sigma u_\sigma + \alpha_0 \nabla f(x_0) = 0$. For the $\varrho_\sigma$ given by (2) and an arbitrary $\tau \in \mathbb{R}$ it then holds that
$$\tau \alpha_0 \nabla f(x_0) + \sum_{\sigma=1}^{n+1} (\varrho_\sigma + \tau \alpha_\sigma) u_\sigma = 0.$$
If we now choose $\tau := \min\{-\varrho_\sigma / \alpha_\sigma \mid 1 \le \sigma \le n + 1,\ \alpha_\sigma < 0\}$ and set $\lambda_0 := \tau \alpha_0$ and $\lambda_\sigma := \varrho_\sigma + \tau \alpha_\sigma$ for all $\sigma = 1, \dots, n + 1$, we have $\lambda_0 \ge 0,\ \lambda_1 \ge 0, \dots, \lambda_{n+1} \ge 0$, where, by definition of $\tau$, for at least one $\sigma \in \{1, \dots, n + 1\}$ the relation $\lambda_\sigma = 0$ holds. Therefore, we also obtain the assertion in this case.

At this point it would be possible to add interesting discussions and powerful algorithms for the generation of minimum volume enclosing ellipsoids, and as the title page of the book shows, the authors do have a certain weakness for this topic. We, however, abstain from a further discussion of these rather more complex issues.
C Optimization Software for Teaching and Learning

Matlab® Optimization Toolbox

The Optimization Toolbox is a collection of functions that extend the numerical capabilities of Matlab®. The toolbox includes various routines for many types of optimization problems. In the following tables we give a short overview for the main areas of application treated in our book.

One-Dimensional Minimization
  fminbnd      Minimization with Bounds             Golden-Section Search or Parabolic Interpolation

Unconstrained Minimization
  fminsearch   Unconstrained Minimization           Nelder–Mead Method
  fminunc      Unconstrained Minimization           Steepest Descent, BFGS, DFP or Trust Region Method
  \            Unconstrained Linear Least Squares   Ax = b :  x = A\b
  fminimax     Minimax Optimization                 SQP Method
  lsqnonlin    Nonlinear Least Squares              Trust Region, Levenberg–Marquardt or Gauss–Newton Method
  lsqcurvefit  Nonlinear Curve Fitting              Trust Region, Levenberg–Marquardt or Gauss–Newton Method
  fsolve       Nonlinear Systems of Equations       Trust Region, Levenberg–Marquardt or Gauss–Newton Method
Constrained Minimization
  linprog      Linear Programming                   Primal-Dual Interior Point Method, Active Set Method or Simplex Algorithm
  quadprog     Quadratic Programming                Active Set Method
  lsqnonneg    Nonnegative Linear Least Squares     Active Set Method
  lsqlin       Constrained Linear Least Squares     Active Set Method
  fmincon      Constrained Nonlinear Minimization   SQP Method
  fminimax     Constrained Minimax Optimization     SQP Method
  lsqnonlin    Constrained Nonlinear Least Squares
  lsqcurvefit  Constrained Nonlinear Curve Fitting

The User Guide [MOT] gives in-depth information about the Optimization Toolbox. Matlab® includes a powerful Help facility which provides an easy online access to the documentation. We therefore content ourselves here with an overview of the methods accessible with the important option LargeScale (= on/off) for the functions of the Toolbox and the restrictions that hold for them.

  Function     Option                      Method/Remarks

  fminbnd                                  Golden-Section Search or Parabolic Interpolation

  fmincon      LargeScale = on             Subspace Trust Region Method
                                           expects GradObj = on
                                           either: bound constraints ℓ ≤ x ≤ u, ℓ < u
                                           or: linear equality constraints
               LargeScale = off            SQP Method
                                           allows inequality and/or equality constraints

  fminimax                                 $\max_{1 \le i \le m} f_i(x) \longrightarrow \min$
                                           $\max_{1 \le i \le m} |f_i(x)| \longrightarrow \min$
                                           with or without constraints
                                           uses an SQP method

  fminsearch                               Nelder–Mead Polytope Method

  fminunc      LargeScale = on (default)   Subspace Trust Region Method
                                           expects GradObj = on
               LargeScale = off            HessUpdate = bfgs (default)
                                           HessUpdate = dfp
                                           HessUpdate = steepdesc

  fsolve       LargeScale = on             Subspace Trust Region Method
               LargeScale = off (default)  NonlEqnAlgorithm = dogleg: (default)
                                           NonlEqnAlgorithm = lm: Levenberg–Marquardt
                                           NonlEqnAlgorithm = gn: Gauss–Newton

  linprog      LargeScale = on (default)   Linear Interior-Point Solver (LIPSOL) based
                                           on Mehrotra's Predictor-Corrector Method
               LargeScale = off            Simplex = off: Active Set Method
                                           Simplex = on: Simplex Algorithm

  lsqcurvefit  LargeScale = on (default)   Subspace Trust Region Method
                                           bound constraints ℓ ≤ x ≤ u where ℓ < u
               LargeScale = off            LevenbergMarquardt = on: (default)
                                           LevenbergMarquardt = off: Gauss–Newton
                                           no bound constraints

  lsqlin       LargeScale = on (default)   Subspace Trust Region Method
                                           bound constraints ℓ ≤ x ≤ u where ℓ < u
               LargeScale = off            allows linear {≤, =}-constraints
                                           based on quadprog

  lsqnonlin    LargeScale = on (default)   Subspace Trust Region Method
                                           bound constraints ℓ ≤ x ≤ u where ℓ < u
               LargeScale = off            LevenbergMarquardt = on: (default)
                                           LevenbergMarquardt = off: Gauss–Newton
                                           no bound constraints
                                           otherwise use fmincon

  lsqnonneg                                Active Set Method using svd lsqlin

  quadprog     LargeScale = on (default)   Subspace Trust Region Method
                                           either: bound constraints ℓ ≤ x ≤ u, ℓ < u
                                           or: linear equality constraints
               LargeScale = off            Active Set Method
                                           allows linear {≤, =}-constraints
SeDuMi: An Introduction by Examples

'SeDuMi', an acronym for 'Self-Dual Minimization' (cf. [YTM]), is a Matlab®-based free software package for linear optimization over self-dual cones like $\mathbb{R}^n_+$, $\mathcal{L}^n$ and $\mathcal{S}^n_+$.

How to Install SeDuMi
Download SeDuMi from http://sedumi.mcmaster.ca/ . After uncompressing the distribution in the folder of your choice, follow the SeDuMi 1.1 installation instructions in the file Install.txt. Start Matlab® and change the current folder to where you put all the files. Type install_sedumi. This will build all the binary files and add the directory to the Matlab® path. You are now ready to use SeDuMi and can start by typing help sedumi or by reading the User Guides in the /doc directory. You can find there in particular the user guide [Stu] by J. Sturm, the developer of SeDuMi.

LP Problems
It is possible to formulate an LP problem in either the primal standard form (P) or the dual standard form (D) (cf. p. 242 f). The solution of the primal problem can be obtained by x = sedumi(A,b,c). The solutions of both the primal and the dual problem can be obtained by means of the command [x, y, info] = sedumi(A,b,c).

Example 3 (cf. chapter 4, example 1)
    A = [2 1 1; 1 2 3; 2 2 1];
    m = size(A,1);
    A = [A, eye(m)];                 % append slack variables
    b = [2; 5; 6];
    c = [-3, -1, -3, zeros(1,m)]';
    x = sedumi(A,b,c)                % primal solution only
    [x,y,info] = sedumi(A,b,c)       % primal and dual solution
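The data of Example 3 are small enough to certify an optimal pair by hand via duality. The following Python sketch (Python rather than Matlab so that it runs without SeDuMi) checks a primal point x and a dual point y with exact fractions. Both candidates were computed by hand for this illustration and are not SeDuMi output; the dual is assumed in the form max ⟨b, y⟩ subject to Aᵀy ≤ c.

```python
from fractions import Fraction as Fr

# Data of Example 3 in standard form: min c^T x  s.t.  A x = b, x >= 0
A = [[2, 1, 1, 1, 0, 0],
     [1, 2, 3, 0, 1, 0],
     [2, 2, 1, 0, 0, 1]]
b = [2, 5, 6]
c = [-3, -1, -3, 0, 0, 0]

# Hand-computed candidates (an assumption of this sketch, not solver output)
x = [Fr(1, 5), Fr(0), Fr(8, 5), Fr(0), Fr(0), Fr(4)]
y = [Fr(-6, 5), Fr(-3, 5), Fr(0)]

def dot(u, v):
    """Euclidean inner product of two sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Primal feasibility: x >= 0 and A x = b
primal_feasible = all(xi >= 0 for xi in x) and all(
    dot(row, x) == bi for row, bi in zip(A, b))

# Dual feasibility: A^T y <= c, checked column by column
columns = list(zip(*A))
dual_feasible = all(dot(col, y) <= cj for col, cj in zip(columns, c))

primal_value = dot(c, x)
dual_value = dot(b, y)
print(primal_feasible, dual_feasible, primal_value, dual_value)
```

Since both points are feasible and their objective values coincide, weak duality shows that both are optimal.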
Instead of requiring $x \in \mathbb{R}^n_+$, in SeDuMi it is possible to restrict the variables to a Lorentz cone¹ $\mathcal{L}^n$ or to the cone $\mathcal{S}^n_+$ of positive semidefinite matrices. More generally, we can require $x \in K$, where K is a Cartesian product of $\mathbb{R}^n_+$, Lorentz cones and cones of positive semidefinite matrices.

Second-Order Cone Programming
Let the problem
$$(D)\quad \begin{cases} \langle b, y \rangle \longrightarrow \max \\ A^T y \le c \\ \|A_j^T y + c_j\|_2 \le \langle b_j, y \rangle + d_j \quad (j = 1, \dots, k) \end{cases}$$
be given with $y, b, b_j \in \mathbb{R}^m$; $A \in \mathbb{R}^{m \times n_0}$, $c \in \mathbb{R}^{n_0}$; $A_j \in \mathbb{R}^{m \times (n_j - 1)}$, $c_j \in \mathbb{R}^{n_j - 1}$, $d_j \in \mathbb{R}$ $(j = 1, \dots, k)$.
According to exercise 5 in chapter 7 problem (D) can be interpreted as the dual problem to
$$(P)\quad \begin{cases} c^T x_0 + \sum_{j=1}^{k} \binom{d_j}{c_j}^T x_j \longrightarrow \min \\ A x_0 - \sum_{j=1}^{k} (b_j \mid A_j)\, x_j = b \\ x_0 \in \mathbb{R}^{n_0}_+,\ x_j \in \mathcal{L}^{n_j} \quad (j = 1, \dots, k). \end{cases}$$

Example 4 Shortest Distance of Two Ellipses
(cf. exercise 7 in chapter 7)

    F1 = [1 0; 0 4]; P1 = sqrtm(F1);
    F2 = [5/2 3/2; 3/2 5/2]; P2 = sqrtm(F2);
    g1 = [-1; 0]; g2 = [-11; -13];
    c0 = [0; 0]; c1 = P1 \ g1; c2 = P2 \ g2;
    gamma1 = -3; gamma2 = 70;
    d0 = 0; d1 = sqrt(c1'*c1-gamma1); d2 = sqrt(c2'*c2-gamma2);
    A0 = [zeros(2,1) eye(2) -eye(2)]';
    A1 = [zeros(2,1) P1 zeros(2,2)]';
    A2 = [zeros(2,1) zeros(2,2) P2]';
    b0 = [1 0 0 0 0]'; b1 = zeros(5,1); b2 = b1;
    At0 = -[b0 A0]; At1 = -[b1 A1]; At2 = -[b2 A2]
    At = [At0 At1 At2]; bt = [-1 0 0 0 0]';
    ct0 = [d0; c0]; ct1 = [d1; c1]; ct2 = [d2; c2]
    ct = [ct0; ct1; ct2];
    K = []; K.q = [size(At0,2) size(At1,2) size(At2,2)];
    [xs,ys,info] = sedumi(At,bt,ct,K);
    x = ys; d = x(1)   % Distance: d = norm(u-v,2)
    u = x(2:3)
    v = x(4:5)

¹ Recall the definition of $\mathcal{L}^n$ in SeDuMi (cf. p. 313).
In the above call to SeDuMi, we see a new input argument K. This argument makes SeDuMi solve the primal problem (P) and the dual problem (D), where the cone $K = \mathcal{L}^{n_1} \times \cdots \times \mathcal{L}^{n_k}$ is described by the structure K. Without the fourth input argument K, SeDuMi would solve an LP problem. If in addition there are variables $x_0 \in \mathbb{R}^{n_0}_+$, the field K.l equals the number of nonnegative variables.
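Semidefinite Programming
In this case we deal with the primal problem
$$\begin{cases} \langle c, x \rangle + \langle C, X \rangle \longrightarrow \min \\ Ax + \mathcal{A}(X) = b \\ x \in \mathbb{R}^{n_0}_+,\ X \in \mathcal{S}^{n_1}_+ \end{cases}$$
and its dual problem:
$$\begin{cases} \langle b, y \rangle \longrightarrow \max \\ c - A^T y \ge 0 \\ C - \mathcal{A}^*(y) \in \mathcal{S}^{n_1}_+ \end{cases}$$
With $A \in \mathbb{R}^{m \times n_0}$, $c \in \mathbb{R}^{n_0}$,
$$\mathcal{A}(X) := \begin{pmatrix} \langle A_1, X \rangle \\ \vdots \\ \langle A_m, X \rangle \end{pmatrix}, \qquad \mathcal{A}^*(y) = y_1 A_1 + \cdots + y_m A_m$$
and

    ct = [c; vec(C)];
    At = [A [vec(A_1)'; ... ; vec(A_m)']];
    K.l = size(A,2); K.s = size(C,2);

the command [xs, ys, info] = sedumi(At,b,ct,K); yields the solutions (x, X) of the primal problem and y of the dual problem in the following way:

    x = xs(1:K.l); X = mat(xs(K.l+1:end), K.s); y = ys;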
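The role of vec and mat in this data layout can be made concrete in a few lines. The Python sketch below is an illustration only: it assumes the column-stacking convention (as in Matlab's X(:)) and defines its own small helpers `vec`, `mat` and `inner`; these are stand-ins written for this example, not SeDuMi's functions.

```python
def vec(M):
    """Stack the columns of a matrix (given as a list of rows) into one list."""
    rows, cols = len(M), len(M[0])
    return [M[i][j] for j in range(cols) for i in range(rows)]

def mat(v, n):
    """Inverse of vec for an n-by-n matrix."""
    return [[v[j * n + i] for j in range(n)] for i in range(n)]

def inner(C, X):
    """Matrix inner product <C, X> = sum over i, j of C_ij * X_ij."""
    return sum(c * x for row_c, row_x in zip(C, X) for c, x in zip(row_c, row_x))

C = [[1, 2], [2, 5]]
X = [[4, -1], [-1, 3]]

# <C, X> equals the ordinary inner product of the stacked vectors,
# which is why [c; vec(C)] can serve as cost vector of the combined variable.
stacked_product = sum(ci * xi for ci, xi in zip(vec(C), vec(X)))
print(inner(C, X), stacked_product)
print(mat(vec([[1, 2], [3, 4]]), 2))
```

The same identity explains why each row of the constraint matrix consists of the transposed vector vec(A_i).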
Example 5 Pattern Separation via Ellipses (cf. section 7.8)

    clc; clear all; close all;
    % Data Set 1: cf. the picture on p. 335
    %P1 = [6.1 2.8; 5.9 2.92; 5.78 2.99; 6.04 2.98; 5.91 3.07; 6.07 3.08]'
    %P2 = [5.5 2.87; 5.58 3; 5.48 3.13; 5.65 3.12; 5.7 3.22; 5.96 3.24; ...
    %      6.17 3.22; 6.3 3.07]'
    % Data Set 2: parallel strip
    %P1 = [-2 1; -2 3; 0 3; 0 5; 2 5; 2 7]'
    %P2 = [-2 0; -2 4; 0 2; 0 6; 2 4; 2 8]'
    % Data Set 3: random points (cf. [Bo/Va], p. 429 ff)
    n = 2; N = 100;
    P1 = randn(n,N); P1 = P1*diag((0.95*rand(1,N))./sqrt(sum(P1.^2)));
    P2 = randn(n,N); P2 = P2*diag((1.05+rand(1,N))./sqrt(sum(P2.^2)));
    T = [1 -1; 2 1]; P1 = T*P1; P2 = T*P2;

    % Solution via SeDuMi
    [n,m1] = size(P1); m2 = size(P2,2); m = m1+m2;
    P = [P1 P2]; At = [];
    for p = P; At = [At; vec([p; 1]*[p; 1]')']; end
    A = [eye(m1) zeros(m1,m2) zeros(m1,1);
         zeros(m2,m1) -eye(m2) -ones(m2,1)];
    At = [A At];
    b = [ones(m1,1); zeros(m2,1)];
    C = zeros(n+1,n+1); c = [zeros(m,1); -1];
    ct = [c; vec(C)];
    K = []; K.l = size(A,2); K.s = size(C,2);
    [xs,ys,info] = sedumi(At,b,ct,K);
    rho = xs(K.l)
    Z = mat(xs(K.l+1:end),K.s)
    X = Z(1:n,1:n)
    xc = - X \ Z(1:n,n+1)   % center of 'ellipsoid'

    % Displaying results (only for n = 2; 'return' leaves the script otherwise)
    if n > 2; return; end;
    F = @(x,y) X(1,1)*(x-xc(1)).^2 + 2*X(1,2)*(x-xc(1)).*(y-xc(2)) + ...
        X(2,2)*(y-xc(2)).^2;
    XX = linspace(min(P2(1,:)),max(P2(1,:)),50);
    YY = linspace(min(P2(2,:)),max(P2(2,:)),50);
    [X1,Y1] = ndgrid(XX,YY); F1 = F(X1,Y1);
    FF1 = contour(X1,Y1,F1,[1 1],'b-'); hold on;
    FF2 = contour(X1,Y1,F1,[rho rho],'k-'); hold on;
    plot(P1(1,:),P1(2,:),'b+',P2(1,:),P2(2,:),'k*');
    title('Pattern Separation via Ellipses');
[Figure: Pattern Separation via Ellipses]
Maple® Optimization Tools

Maple®, besides other computer algebra systems such as Mathematica®, is one of the most mature products and in many respects a powerful tool for scientific computing in mathematics, natural sciences, engineering and economics. Its benefits do not only come from the combination of symbolic and numerical computing but also from its visualization abilities of functions and geometric 2D/3D objects. These components blend together to a powerful working environment for research and teaching. The power of Maple® is enhanced by numerous add-on packages. Since version 9 Maple® offers the Optimization package, which consists of highly efficient algorithms for local optimization coming from the NAG library and at the same time gives full access to the power of Maple®.

In 2004 Maplesoft introduced the so-called Professional Toolboxes that can be purchased separately as add-on products to extend the scope and functionality of Maple® in key application areas. The first of these add-on products has been the Global Optimization Toolbox (GOT), the core of which is based on the LGO (Lipschitz Global Optimization) solver developed by Pintér Consulting Services. LGO runs on several other platforms, for example in its Mathematica® or GAMS® implementation. For a review of the GOT for Maple® we refer the reader to [Hen]. More details are presented in the e-book [Pin 2], which can be viewed as a hands-on introduction to the GOT. The LGO solver offers the following four optional solution strategies:
• branch-and-bound based global search
• global adaptive random search (single-start)
• random multi-start based global search
• generalized reduced gradient algorithm based local search

For technical details we refer the reader to the elaborate monograph [Pin 1]. Invoking the GlobalOptimization package, we get two options to use the GOT. GlobalSolve is its command-line usage, Interactive uses the interactive Maplet graphical interface. The interplay between Maple® and the underlying optimization tools, for example the visualization of the objective functions by means of the plot3d command with the default option "style=patch", does not seem to be adequate in this context. We suggest using the options "style=patchcontour, shading=zhue" which yield a multicolored graph and contour lines of the surface.

The Maple® help system gives more details about the GOT. Furthermore we can get additional information during the solution process by using higher information levels, and there are numerous options whose default settings can be adapted. In contrast to the Optimization package, the GOT does not rely on the objectivegradient, objectivejacobian and constraintjacobian options, since the GOT requires only computable model function values of continuous functions. The global search options do not rely on derivatives, and the local search option applies central finite difference based gradient estimates.
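To make the last remark concrete, a gradient estimate by central finite differences can be sketched as follows. This is a generic Python illustration and not the LGO implementation; the function name `central_difference_gradient`, the step size h and the Rosenbrock test function are assumptions chosen for this example.

```python
def central_difference_gradient(f, x, h=1e-6):
    """Estimate the gradient of f at x from function values only:
    df/dx_i is approximated by (f(x + h e_i) - f(x - h e_i)) / (2 h)."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

def rosenbrock(x):
    """Classical test function with exact gradient (-4, 0) at (-1, 1)."""
    return 100 * (x[1] - x[0] ** 2) ** 2 + (1 - x[0]) ** 2

estimate = central_difference_gradient(rosenbrock, [-1.0, 1.0])
print(estimate)  # close to the exact gradient (-4, 0)
```

Each estimate costs 2n function evaluations in n variables, which is the price for not requiring derivative information from the user.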
Bibliography

[Ab/Wa]
A. Abdulle, G. Wanner (2002): 200 Years of Least Squares Method. Elemente der Mathematik 57, pp. 45–60

[Alt]
W. Alt (2002): Nichtlineare Optimierung. Vieweg, Braunschweig, Wiesbaden
[An/Lu]
A. Antoniou, W. Lu (2007): Practical Optimization: Algorithms and Engineering Applications. Springer, Berlin, Heidelberg, New York
[Avr ]
M. Avriel (1976): Nonlinear Programming — Analysis and Methods. Prentice-Hall, Englewood Cliffs
[Ba/Do ]
E. W. Barankin, R. Dorfman (1958): On Quadratic Programming. Univ. of California Publ. in Statistics 2, pp. 285–318
[Bar 1 ]
M. C. Bartholomew-Biggs (2005): Nonlinear Optimization with Financial Applications. Kluwer, Boston, Dordrecht, London
[Bar 2 ]
M. C. Bartholomew-Biggs (2008): Nonlinear Optimization with Engineering Applications. Springer, Berlin, Heidelberg, New York
[Be/Ne]
A. Ben-Tal, A. Nemirovski (2001): Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM, Philadelphia
[Bha ]
M. Bhatti (2000): Practical Optimization Methods with Mathematica Applications. Springer, Berlin, Heidelberg, New York
[Bl/Oe 1]
E. Blum, W. Oettli (1972): Direct Proof of the Existence Theorem for Quadratic Programming. Oper. Research 20, pp. 165–167

[Bl/Oe 2]
E. Blum, W. Oettli (1975): Mathematische Optimierung. Springer, Berlin, Heidelberg, New York

[Bo/To]
P. T. Boggs, J. W. Tolle (1995): Sequential Quadratic Programming. Acta Numerica 4, pp. 1–51
[Bonn]
J. F. Bonnans et al. (2003): Numerical Optimization. Springer, Berlin, Heidelberg, New York
[Bo/Le]
J. Borwein, A. Lewis (2000): Convex Analysis and Nonlinear Optimization. Springer, Berlin, Heidelberg, New York
[Bo/Va ]
S. Boyd, L. Vandenberghe (2004): Convex Optimization. Cambridge University Press, Cambridge
W. Forst and D. Hoffmann, Optimization—Theory and Practice, Springer Undergraduate Texts in Mathematics and Technology, c Springer Science+Business Media, LLC 2010 DOI 10.1007/978-0-387-78977-4,
[Br/Ti]
J. Brinkhuis, V. Tikhomirov (2005): Optimization: Insights and Applications. Princeton University Press, Princeton
[BSS]
M. S. Bazaraa, H. D. Sherali, C. M. Shetty (2006): Nonlinear Programming — Theory and Algorithms. Wiley, Hoboken
[Ca/Ma]
V. Candela, A. Marquina (1990): Recurrence Relations for Rational Cubic Methods II: The Chebyshev Method. Computing 45, pp. 355–367

[Cau]
A. Cauchy (1847): Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes Rendus Acad. Sci. Paris 25, pp. 536–538
[Co/Jo ]
R. Courant, F. John (1989): Introduction to Calculus and Analysis II/1. Springer, Berlin, Heidelberg, New York
[Co/We]
L. Collatz, W. Wetterling (1971): Optimierungsaufgaben. Springer, Berlin, Heidelberg, New York
[Col]
A. R. Colville (1968): A Comparative Study on Nonlinear Programming Codes. IBM New York Scientific Center Report 320-2949
[Cr/Sh]
N. Cristianini, J. Shawe-Taylor (2000): Introduction to Support Vector Machines. Cambridge University Press, Cambridge
[Dav]
W. C. Davidon (1991): Variable Metric Method for Minimization. SIAM J. Optimization 1, pp. 1–17
[Deb]
G. Debreu (1952): Definite and Semidefinite Quadratic Forms. Econometrica 20, pp. 295–300
[De/Sc]
J. E. Dennis, R. B. Schnabel (1983): Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs
[DGW]
J. E. Dennis, D. M. Gay, R. E. Welsch (1981): An Adaptive Nonlinear Least-Squares Algorithm. ACM Trans. Math. Software 7, pp. 348–368
[Elst]
K.-H. Elster et al. (1977): Einführung in die nichtlineare Optimierung. Teubner, Leipzig
[Erik]
J. Eriksson (1980): A Note on Solution of Large Sparse Maximum Entropy Problems with Linear Equality Constraints. Math. Progr. 18, pp. 146–154
[Fa/Ho ]
J. E. Falk, K. R. Hoffman (1976): A Successive Underestimation Method for Concave Minimization Problems. Math. Oper. Research 1, pp. 251–259
[Fa/So ]
J. E. Falk, R. M. Soland (1969): An Algorithm for Separable Nonconvex Programming Problems. Management Science 15, pp. 550–569
[Fi/Co ]
A. Fiacco, G. McCormick (1990): Nonlinear Programming — Sequential Unconstrained Minimization Techniques. SIAM, Philadelphia
[Fle]
R. Fletcher (2006): Practical Methods of Optimization. Wiley, Chichester
[Fl/Pa ]
C. A. Floudas, P. M. Pardalos (1990): A Collection of Test Problems for Constrained Global Optimization Algorithms. Springer, Berlin, Heidelberg, New York
[Fl/Po ]
R. Fletcher, M. J. D. Powell (1963): A Rapidly Convergent Descent Method for Minimization. Computer J. 6, pp. 163–168
[Fo/Ho 1]
W. Forst, D. Hoffmann (2002): Funktionentheorie erkunden mit Maple®. Springer, Berlin, Heidelberg, New York

[Fo/Ho 2]
W. Forst, D. Hoffmann (2005): Gewöhnliche Differentialgleichungen: Theorie und Praxis. Springer, Berlin, Heidelberg, New York

[Fra]
J. Franklin (1980): Methods of Mathematical Economics. Springer, Berlin, Heidelberg, New York
[Ga/Hr ]
W. Gander, J. Hrebicek (eds.) (2004): Solving Problems in Scientific Computing using Maple and Matlab. Springer, Berlin, Heidelberg, New York . Chapter 6
[Gau]
C. F. Gauß (1831): Brief an Schumacher. Werke Bd. 8, p. 138
[Ge/Ka 1]
C. Geiger, C. Kanzow (1999): Numerische Verfahren zur Lösung unrestringierter Optimierungsaufgaben. Springer, Berlin, Heidelberg, New York

[Ge/Ka 2]
C. Geiger, C. Kanzow (2002): Theorie und Numerik restringierter Optimierungsaufgaben. Springer, Berlin, Heidelberg, New York

[Gli]
F. Glineur (1998): Pattern Separation via Ellipsoids and Conic Programming. Mémoire de D.E.A. (Master's Thesis), Faculté Polytechnique de Mons, Belgium

[GLS]
M. Grötschel, L. Lovász, A. Schrijver (1988): Geometric Algorithms and Combinatorial Optimization. Springer, Berlin, Heidelberg, New York
[GMW]
P. E. Gill, W. Murray, M. H. Wright (1981): Practical Optimization. Academic Press, London, New York
[GNS]
I. Griva, S. G. Nash, A. Sofer (2009): Linear and Nonlinear Programming. SIAM, Philadelphia
[Go/Id]
D. Goldfarb, A. Idnani (1983): A Numerically Stable Dual Method for Solving Strictly Convex Quadratic Programs. Math. Progr. 27, pp. 1–33
[GQT]
S. M. Goldfeld, R. E. Quandt, H. F. Trotter (1966): Maximization by Quadratic Hill-Climbing. Econometrica 34, pp. 541–551
[Gr/Te]
C. Grossmann, J. Terno (1993): Numerik der Optimierung. Teubner, Stuttgart
[Hai]
E. Hairer (2001): Introduction à l'Analyse Numérique. Lecture Notes, Genève, p. 158
[Hal]
P. R. Halmos (1993): Finite-Dimensional Vector Spaces. Springer, Berlin, Heidelberg, New York
[Hav]
C. A. Haverly (1978): Studies of the Behavior of Recursion for the Pooling Problem. ACM SIGMAP Bulletin 25, pp. 19–28
[Hel]
C. Helmberg (2002): Semidefinite Programming. European J. Operational Research 137, pp. 461–482
[Hen]
D. Henrion (2006): Global Optimization Toolbox for Maple. IEEE Control Systems Magazine, October, pp. 106–110
[HKR]
M. Halická, E. de Klerk, C. Roos (2002): On the Convergence of the Central Path in Semidefinite Optimization. SIAM J. Optimization 12, pp. 1090–1099

[HLP]
G. H. Hardy, J. E. Littlewood, G. Pólya (1967): Inequalities. Cambridge University Press, Cambridge
[Ho/Pa ]
R. Horst, P. M. Pardalos (eds.) (1995): Handbook of Global Optimization Vol. 1. Kluwer, Boston, Dordrecht, London
[Ho/Sc]
W. Hock, K. Schittkowski (1981): Test Examples for Nonlinear Programming Codes. Springer, Berlin, Heidelberg, New York
[Ho/Tu]
R. Horst, H. Tuy (1996): Global Optimization: Deterministic Approaches. Springer, Berlin, Heidelberg, New York
[HPT]
R. Horst, P. M. Pardalos, N. V. Thoai (2000): Introduction to Global Optimization. Kluwer, Boston, Dordrecht, London
[HRVW]
C. Helmberg et al. (1996): An Interior-Point Method for Semidefinite Programming. SIAM J. Optimization 6, pp. 342–361
[Ja/St]
F. Jarre, J. Stoer (2004): Optimierung. Springer, Berlin, Heidelberg, New York
[John]
F. John (1948): Extremum Problems with Inequalities as Subsidiary Conditions. In: Studies and Essays. Courant Anniversary Volume. Interscience, New York, pp. 187–204
[Ka/Ro]
B. Kalantari, J. B. Rosen (1987): An Algorithm for Global Minimization of Linearly Constrained Concave Quadratic Functions. Mathematics of Operations Research 12, pp. 544–561
[Kar]
W. Karush (1939): Minima of Functions of Several Variables with Inequalities as Side Conditions. Univ. of Chicago Master’s Thesis
[Kel]
C. T. Kelley (1999): Detection and Remediation of Stagnation in the Nelder-Mead Algorithm Using a Sufficient Decrease Condition. SIAM J. Optimization 10, pp. 43–55
[Klei]
K. Kleibohm (1967): Bemerkungen zum Problem der nichtkonvexen Programmierung. Zeitschrift für Operations Research (formerly: Unternehmensforschung), pp. 49–60
[Klerk]
E. de Klerk (2002): Aspects of Semidefinite Programming: Interior Point Algorithms and Selected Applications. Kluwer, Boston, Dordrecht, London
[Kopp]
J. Kopp (2007): Beiträge zum Hüllkugel- und Hüllellipsoidproblem. University of Konstanz Diploma Thesis
[Kos]
P. Kosmol (1991): Optimierung und Approximation. de Gruyter, Berlin, New York
[Kru]
S. O. Krumke (2004): Interior Point Methods. Lecture Notes, Kaiserslautern
[KSH]
M. Kojima, S. Shindoh, S. Hara (1997): A note on the Nesterov-Todd and the Kojima-Shindoh-Hara Search Directions in Semidefinite Programming. SIAM J. Optimization 7, pp. 86–125
[Lan]
S. Lang (1972): Linear Algebra. Addison-Wesley, Reading
[Lev]
K. Levenberg (1944): A Method for the Solution of Certain Nonlinear Problems in Least Squares. Quart. Appl. Math. 2, pp. 164–168
[Lin]
C. Lindauer (2008): Aspects of Semidefinite Programming: Applications and Primal-Dual Path-Following Methods. University of Konstanz Diploma Thesis
[Lu/Ye]
D. Luenberger, Y. Ye (2008): Linear and Nonlinear Programming. Springer, Berlin, Heidelberg, New York
[LVBL]
M. S. Lobo et al. (1998): Applications of Second Order Cone Programming. Linear Algebra and its Applications 284, pp. 193–228
[LWK]
C.-J. Lin, R. C. Weng, S. S. Keerthi (2008): Trust Region Newton Methods for Large-Scale Logistic Regression. J. Machine Learning Research 9, pp. 627–650
[Man]
O. L. Mangasarian (1969): Nonlinear Programming. McGraw-Hill, New York
[Mar]
D. W. Marquardt (1963): An Algorithm for Least-Squares Estimation of Nonlinear Parameters. SIAM J. Appl. Math. 11, pp. 431–441
[Meg]
N. Megiddo (1989): Pathways to the Optimal Set in Linear Programming. In: N. Megiddo (ed.), Progress in Mathematical Programming: Interior-Point Algorithms and Related Methods. Springer, Berlin, Heidelberg, New York, pp. 131–158
[Meh]
S. Mehrotra (1992): On the Implementation of a Primal-Dual Interior-Point Method. SIAM J. Optimization 2, pp. 575–601
[Me/Vo]
J. A. Meijerink, H. A. van der Vorst (1977): An Iterative Solution Method for Linear Systems of which the Coefficient Matrix is a Symmetric M-Matrix. Math. Computation 31, pp. 148–162
[Mon]
R. D. C. Monteiro (1997): Primal-Dual Path-Following Algorithms for Semidefinite Programming. SIAM J. Optimization 7, pp. 663–678
[MOT]
Matlab (2003): Optimization Toolbox, User’s Guide. The Mathworks Inc.
[MTY]
S. Mizuno, M. J. Todd, Y. Ye (1993): On Adaptive-Step Primal-Dual Interior-Point Algorithms for Linear Programming. Mathematics of Operations Research 18, pp. 964–981
[Nah]
P. Nahin (2004): When Least is Best. Princeton University Press, Princeton
[Ne/Ne]
Y. Nesterov, A. Nemirovski (1994): Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia
[Ne/To]
Y. Nesterov, M. J. Todd (1997): Self-Scaled Barriers and Interior Point Methods for Convex Programming. Mathematics of Operations Research 22, pp. 1–42
[No/Wr]
J. Nocedal, S. J. Wright (2006): Numerical Optimization. Springer, Berlin, Heidelberg, New York
[Pa/Ro]
P. M. Pardalos, J. B. Rosen (1987): Constrained Global Optimization. Springer, Berlin, Heidelberg, New York
[Ped]
P. Pedregal (2004): Introduction to Optimization. Springer, Berlin, Heidelberg, New York
[Pin 1]
J. D. Pintér (1996): Global Optimization in Action. Springer, Berlin, Heidelberg, New York
[Pin 2]
J. D. Pintér (2006): Global Optimization with Maple. Maplesoft, Waterloo, and Pintér Consulting Services, Halifax
[Pol]
E. Polak (1997): Optimization — Algorithms and Consistent Approximations. Springer, Berlin, Heidelberg, New York
[Pow 1]
M. J. D. Powell (1977): Variable Metric Methods for Constrained Optimization. Report DAMTP 77/NA6, University of Cambridge
[Pow 2]
M. J. D. Powell (1977): A Fast Algorithm for Nonlinearly Constrained Optimization Calculations. In: Numerical Analysis, Dundee 1977. Lecture Notes in Mathematics 630. Springer, Berlin, Heidelberg, New York, pp. 144–157
[Pow 3]
M. J. D. Powell (1986): Convergence Properties of Algorithms for Nonlinear Optimization. SIAM Review 28, pp. 487–500
[Rock]
R. T. Rockafellar (1970): Convex Analysis. Princeton University Press, Princeton
[Ro/Wo]
C. Roos, J. van der Woude (2006): Convex Optimization and System Theory. http://www.isa.ewi.tudelft.nl/~roos/courses/WI4218/
[RTV]
C. Roos, T. Terlaky, J. P. Vial (2005): Interior Point Methods for Linear Optimization. Springer, Berlin, Heidelberg, New York
[Shor]
N. Z. Shor (1985): Minimization Methods for Non-Differentiable Functions. Springer, Berlin, Heidelberg, New York
[Son]
G. Sonnevend (1986): An ‘Analytical Center’ for Polyhedrons and New Classes of Global Algorithms for Linear (Smooth, Convex) Programming. Lecture Notes in Control and Information Sciences 84. Springer, Berlin, Heidelberg, New York, pp. 866–875
[Sou]
R. V. Southwell (1940): Relaxation Methods in Engineering Science. Clarendon Press, Oxford
[Spe]
P. Spellucci (1993): Numerische Verfahren der nichtlinearen Optimierung. Birkhäuser, Basel, Boston, Berlin
[St/Bu]
J. Stoer, R. Bulirsch (1996): Introduction to Numerical Analysis. Springer, Berlin, Heidelberg, New York
[Stu]
J. F. Sturm (1999): Using SeDuMi 1.02, A Matlab Toolbox for Optimization over Symmetric Cones. Optimization Methods and Software 11–12, pp. 625–653
[Todd 1]
M. J. Todd (1999): A Study of Search Directions in Primal-Dual Interior-Point Methods for Semidefinite Programming. Optimization Methods and Software 11, pp. 1–46
[Todd 2]
M. J. Todd (2001): Semidefinite Optimization. Acta Numerica 10, pp. 515–560
[Toh]
K. C. Toh (1999): Primal-Dual Path-Following Algorithms for Determinant Maximization Problems with Linear Matrix Inequalities. Computational Optimization and Applications 14, pp. 309–330
[VBW]
L. Vandenberghe, S. Boyd, S. P. Wu (1998): Determinant Maximization with Linear Matrix Inequality Constraints. SIAM J. on Matrix Analysis and Applications 19, pp. 499–533
[Vogt]
J. Vogt (2008): Primal-Dual Path-Following Methods for Linear Programming. University of Konstanz Diploma Thesis
[Wer 1]
J. Werner (1992): Numerische Mathematik — Band 2: Eigenwertaufgaben, lineare Optimierungsaufgaben, unrestringierte Optimierungsaufgaben. Vieweg, Braunschweig, Wiesbaden
[Wer 2]
J. Werner (1998): Nichtlineare Optimierungsaufgaben. Lecture Notes, Göttingen
[Wer 3]
J. Werner (2001): Operations Research. Lecture Notes, Göttingen
[Wil]
R. B. Wilson (1963): A Simplicial Algorithm for Concave Programming. PhD Thesis, Harvard University
[Wri]
S. J. Wright (1997): Primal-Dual Interior-Point Methods. SIAM, Philadelphia
[WSV]
H. Wolkowicz, R. Saigal, L. Vandenberghe (eds.) (2000): Handbook of Semidefinite Programming: Theory, Algorithms and Applications. Kluwer, Boston, Dordrecht, London
[YTM]
Y. Ye, M. J. Todd, S. Mizuno (1994): An O(√n L)-Iteration Homogeneous and Self-Dual Linear Programming Algorithm. Mathematics of Operations Research 19, pp. 53–67
[YYH]
H. Yamashita, H. Yabe, K. Harada (2007): A Primal-Dual Interior Point Method for Nonlinear Semidefinite Programming. Technical Report, Department of Math. Information Science, Tokyo
Index of Symbols
Latin
A^{1/2}  root of matrix A . . . 302, 337
A_J . . . 153
B_a^r  ball with center a and radius r . . . 94
c(A)  condition number . . . 103
C^*  dual cone . . . 41
cone(A)  conic hull . . . 41
conv(A)  convex hull of set A . . . 82
d^* . . . 69, 244, 308
diag(X) . . . 309
Diag(x) . . . 248, 309
e  vector of ones . . . 159, 248
e_n  n-th standard unit vector . . . 304
epi(p)  epigraph . . . 75, 82
ext(C)  extreme points . . . 343
F_μ . . . 248 f, 327
I, I_n  identity matrix in M_n
inf(P) . . . 68 f
L  Lagrange function . . . 47, 67, 225
L_A  augmented Lagrange function . . . 238
L^T  transpose of matrix L
L^⊥  orthogonal complement . . . 190
lhs  left-hand side
log  natural logarithm
max(D) . . . 69
min(P) . . . 69
p^* . . . 68, 244, 308
rank(A)  rank of matrix A
r_c, r_d, r_p  residuals . . . 261, 298
R_c, R_d, r_p  residuals . . . 333
rhs  right-hand side
S(J)  set of components of J . . . 152
S^n, S^n_+, S^n_{++} . . . 301
sgn  signum
sup(D) . . . 69
svec(A) . . . 301
T_k  Chebyshev polynomial . . . 127
trace(A)  trace of matrix A
UB  unit ball . . . 94
v(D)  optimal value to (D) . . . 69, 244
v(P)  optimal value to (P) . . . 39, 68, 244
vec . . . 310
vol(E)  volume of ellipsoid E . . . 95, 318
X := Diag(x) . . . 248
(x^+, y^+, s^+) . . . 261
x_J  vector with corresponding components . . . 153
xs  entry-wise product of vectors, Hadamard product . . . 248
Bold
M_n  real (n, n)-matrices . . . 132
N  set of natural numbers = {1, 2, . . .}
N_k := {k, k + 1, . . .} for integer k
R  real field
R_+  nonnegative real numbers
R_{++}  positive real numbers
R^{n×p}  real (n, p)-matrices
U_p  neighborhood system of p . . . 36
Mathcal
A^*  adjoint operator . . . 305
A(x_0)  active constraints . . . 44
A_+(x_0) . . . 62
A(X) . . . 304
C  central path . . . 253
E  equality constraints . . . 15
E(P, x_0)  ellipsoid . . . 94, 316
F  feasible region . . . 37, 39, 224, 365
F  primal-dual feasible set . . . 252
F^0  strictly primal-dual feasible points . . . 252
F_D  set of feasible points of (D) or (DSDP) . . . 153, 243, 306
F_D  effective domain of ϕ . . . 67, 306
W. Forst and D. Hoffmann, Optimization—Theory and Practice, Springer Undergraduate Texts in Mathematics and Technology, c Springer Science+Business Media, LLC 2010 DOI 10.1007/978-0-387-78977-4,
F_D^0  set of strictly feasible points of (D) . . . 244, 326
F_{D_e}  set of feasible points of (D_e) . . . 244
F_{D_e}^0  set of strictly feasible points of (D_e) . . . 244
F_D^opt  set of optimizers of (D) . . . 256
F_P  set of feasible points of (P) . . . 153, 243, 305
F_P^0  set of strictly feasible points of (P) . . . 244, 326
F_P^opt  set of optimizers of (P) . . . 256
F_u . . . 75
F_+(x_0) . . . 62
I  inequality constraints . . . 15
L_X . . . 333
L^n  Lorentz cone . . . 313
L^n_α . . . 337
N(A)  nullspace of A
N_2(β)  neighborhood . . . 267
N_2(β)  neighborhood . . . 269, 297
N_{−∞}(γ)  neighborhood . . . 267, 281
N_{−∞}(γ)  neighborhood . . . 269, 297
R(A)  rangespace of A
Mathfrak
x, x+, Δ(x) . . . 261
x(μ) . . . 253
Greek
δ(b, C) . . . 40
δ_S  indicator function . . . 214
ϕ  dual function . . . 67
Φ_r  penalty function . . . 215, 220, 228
Φ_μ . . . 248, 295, 326
Φ_μ . . . 249, 326
λ_i(A) . . . 301
μ  normalized duality gap . . . 254
Π_k  polynomials of degree ≤ k . . . 126
σ  centering parameter . . . 263
σ_max, σ_min . . . 282
τ  duality measure . . . 263, 269
ω_n  volume of the n-dimensional unit ball . . . 95, 318
Cones
cone(A) . . . 41
C_dd(x_0)  cone of descent directions . . . 45, 369
C_fd(x_0)  cone of feasible directions . . . 44
C(x_0), C(P, x_0)  linearizing cone . . . 44, 370
C+(x_0) . . . 64
C_t(x_0)  tangent cone . . . 49, 369
C_t(M, x_0)  tangent cone . . . 49
C_t+(x_0) . . . 62
C^*  dual cone of C . . . 41
Norms and inner products
‖ ‖ := ‖ ‖_2  euclidean norm . . . 39
‖ ‖_∞  maximum norm
‖ ‖_A . . . 25, 102, 144
‖ ‖_F  Frobenius norm . . . 132, 301
‖ ‖_W . . . 132
⟨ , ⟩  inner product on R^n . . . 39
⟨ , ⟩_A . . . 25, 144
⟨ , ⟩_{S^n} . . . 301
Miscellaneous
|S|  cardinality of set S
>, ≥, <, ≤  for vectors . . . 152
end of proof
end of example
α_k ↓ 0:  α_k > 0 and α_k → 0
a^+ := max{0, a} for a ∈ R . . . 215
x_k −d→ x_0 . . . 49
⪰, ≻  Löwner partial order . . . 301
∇f  gradient of function f
∇^2 f  Hessian of function f
=⇒  implication
v.s.  vide supra (see above)
wlog  without loss of generality
Subject Index
A
Abadie . . . 55
A-conjugate . . . 120, 352
  directions . . . 122
accuracy . . . 142
activation step . . . 171, 174
active
  constraint, restriction, set . . . 44
  set method . . . 26, 160, 168, 187, 207 f, 210 f
adjoint operator . . . 305, 325
affinely independent . . . 347
affinely linear . . . 55, 81, 347, 361
affine-scaling
  direction . . . 264, 285, 295
  method . . . 295
AHO-direction . . . 332
algorithm
  cutting plane . . . 354
  infeasible interior-point . . . 284
  interior-point . . . 264
  long-step path-following . . . 281 f
  Mehrotra predictor-corrector . . . 284
  Mizuno–Todd–Ye predictor-corrector . . . 277
  short-step path-following . . . 272 f
alternate directions . . . 30
alternative, theorem of . . . 43, 210, 246
analytic center . . . 260, 296
antitone . . . 75, 106
apex . . . 39, 270
approximation
  Chebyshev . . . 2, 4, 8, 29
  Gauß . . . 2
Armijo step size rule . . . 109, 144, 230
  modified . . . 143
Arrow–Hurwitz–Uzawa . . . 56
artificial variable . . . 159
augmented Lagrange function . . . 238
B
banana function . . . 143 f
Barankin–Dorfman . . . 166
barrier
  function . . . 220, 248, 326
    inverse . . . 221
    logarithmic . . . 28, 221, 248 f, 326, 337
  method . . . 28, 220, 237, 295
    logarithmic . . . 242
  parameter . . . 248, 295
  problem . . . 258, 326 f
  property . . . 337
basic
  point . . . 153
    feasible . . . 153, 159
  variable . . . 153
basis
  feasible . . . 153, 158, 163
  of (P) . . . 153
BFGS method . . . 137, 140
BFGS updating formula . . . 137, 150, 230
binary classification . . . 88
branch and bound method . . . 347, 362
breast cancer diagnosis . . . 85 f, 88, 141
Broyden . . . 132
  method . . . 138
Broyden–Fletcher–Goldfarb–Shanno update . . . 137, 230
Broyden’s class . . . 138
C
calculus . . . 21
  of variations . . . 20, 22
Carathéodory’s lemma . . . 82
Cauchy . . . 24
center, analytic . . . 260, 296
centering
  condition . . . 252, 325
  direction . . . 264, 284, 296
  parameter . . . 263, 272, 281, 285
central path . . . 222, 252 f, 291, 294, 329
Chebyshev
  approximation . . . 8, 29
    linear . . . 2, 4
  method . . . 286
  polynomial . . . 126 ff, 149
Cholesky decomposition . . . 114, 129, 150, 179, 262
  incomplete . . . 129
classification . . . 86, 88
  binary . . . 88
classifier function . . . 86, 142
complementarity
  condition . . . 46, 247 f
  residual . . . 261
  strict . . . 198
complementary slackness
  condition . . . 46, 325
  optimality condition . . . 248
complexity, polynomial . . . 28, 242, 265
concave
  function . . . 53, 68, 81, 344 ff
  minimization . . . 341
    quadratic . . . 352
  strictly, function . . . 53, 249, 344
concavity cut . . . 355, 363
condition
  complementarity . . . 46, 247 f
  complementary slackness . . . 46, 325
  centering . . . 252, 325
  Fritz John . . . 370
  interior-point . . . 248, 253
  Karush–Kuhn–Tucker . . . 46 f, 51, 167
  KKT . . . 46 f, 51
  number . . . XIV, 103, 128, 217
  optimality . . . 37, 44, 61, 245, 290
  Slater . . . 308, 327
cone . . . 39
  convex . . . 40
  dual . . . 41
  linearizing . . . 44, 50, 83, 368
  Lorentz . . . 313, 337, 378
  of descent directions . . . 45, 369
  programming . . . 313, 378
  second-order . . . 313, 378
  self-dual . . . 302, 377
  tangent . . . 49, 82 f, 369
conic hull . . . 41
conjugate
  directions . . . 120 f, 352
  gradient method . . . 26, 120, 129, 148
constraint . . . 15
  linear . . . 15
  normalization . . . 188
  qualification . . . 51, 54–57, 365
control, optimal . . . 23
convergence
  in direction d . . . 48
  quadratic . . . 27, 202, 226, 232
  rate of . . . XIV, 125 ff
  superlinear . . . 232
convex
  cone . . . 40
  envelope . . . 344
  function . . . 52
  hull . . . 30, 54, 82
  optimization problem . . . 52, 58–61
  quadratic objective function . . . 100
  set . . . 39
  underestimation . . . 345
coordinatewise descent methods . . . 99
corrector step . . . 277 f, 284 ff
Courant . . . 214
cut . . . 355, 363
  concavity . . . 355, 363
cutting plane . . . 96, 355, 362
  algorithm, method . . . 354, 360
cyclic control . . . 99, 147
D
damped Newton
  method . . . 131, 202, 239
  step . . . 263, 285
Dantzig . . . 26
data fitting . . . 6
Davidon . . . 25
Davidon–Fletcher–Powell updating formula . . . 25, 135 f
deactivation step . . . 171, 173
decision function . . . 86, 142
definite
  negative . . . 64, 66, 249
  positive . . . 36, 81, 300
  quadratic form . . . 238
degeneracy . . . 161
descent
  coordinatewise . . . 99
  direction . . . 25, 45, 98, 108, 161, 187, 228
  method . . . 25, 98, 121, 193
  steepest . . . 25, 100, 112
DFP method . . . 25, 136, 139, 149
diet problem . . . 205
direction
  affine-scaling . . . 264, 285, 295
  AHO- . . . 332
  alternate . . . 30
  centering . . . 264, 284, 296
  conjugate . . . 120, 352
  descent . . . 25, 45, 98, 108, 161, 187, 228
  feasible . . . 44, 186 ff
  HKM- . . . 332
  Levenberg–Marquardt . . . 112
  Newton . . . 262, 264, 272 f, 285, 292
  NT- . . . 332
  steepest descent . . . 24, 112, 144
  tangent . . . 49, 51
distance between ellipses . . . 338, 378
dogleg . . . 116, 148
domain, effective . . . 67, 75, 306 f
dual
  cone . . . 41
  function . . . 67, 305
  problem . . . 66 ff, 152, 243, 304
  residual . . . 261, 298
  variable . . . 67, 77, 86, 243
duality . . . 66–79, 153, 243–247, 257, 296
  gap . . . 70, 73, 245, 285, 308, 312, 325
    normalized . . . 254, 272
  Lagrange . . . 66, 71
  measure . . . 263
  strong . . . 70, 78, 245, 308
  theorem . . . 78, 246, 309
  weak . . . 69, 153, 244, 308
dually feasible . . . 244, 252, 313
E
economic interpretation . . . 77
effective domain . . . 67, 75, 306 f
elimination of variables . . . 175
ellipsoid . . . 93, 316
  Löwner–John . . . 96
  method, Shor . . . 28, 93, 143, 363
  minimal enclosing . . . 316
elliptic cylinder . . . 335
envelope, convex . . . 344
epigraph . . . 75, 82, 345
equality
  constrained problem . . . 35
  constraints . . . 37
error function . . . 9, 102
Euler . . . 22
  –Lagrange . . . 23
exact
  line search . . . 100, 104, 122, 144
  penalty function . . . 27, 217 f, 228
exchange
  single . . . 173
  step . . . 155 f, 196
extended problem . . . 243
exterior penalty method . . . 28, 214
extreme point . . . 343
F
facility location problems . . . 10, 83 f
Falk . . . 346 f
Farkas theorem of the alternative . . . 43, 210, 246, 308
feasible
  basic point . . . 153, 159
  basis . . . 153, 158, 163
  direction . . . 44, 186 ff
    method . . . 151, 186
  dually . . . 244, 252, 313
  primally . . . 244, 252, 313
  points . . . 39, 44, 213, 224, 243, 305
    strictly . . . 244, 308, 329
  region, set . . . 28, 37, 39, 153, 252
    primal–dual . . . 252, 292
Fiacco . . . 28, 65, 242
Fletcher . . . 25, 27
  ’s S1 QP method . . . 233
Fletcher/Reeves . . . 120, 130, 149
Frisch . . . 28, 242
Fritz John condition . . . 370
Frobenius
  matrix . . . 155
  norm . . . 132, 301
function
  affinely linear . . . 55, 81, 347, 361
  antitone . . . 75, 106, 192, 256
  barrier . . . 220, 248, 326
  classifier . . . 86, 142
  concave . . . 53, 68, 81, 344 ff
  convex . . . 52
  decision . . . 86, 142
  dual . . . 67, 305
  indicator . . . 214
  isotone . . . 53, 179, 256
  Lagrange . . . 47, 74, 238, 305
  log ◦ det . . . 323
  logistic regression . . . 88
  objective . . . 10, 15
  penalty . . . 27, 215
  perturbation . . . 75
  sensitivity . . . 75, 206 f
  separable . . . 348
  strictly/strongly
    antitone . . . 106, 112 ff, 154
    concave . . . 53, 249, 283, 344
    convex . . . 52 f, 81, 249, 323
    isotone . . . 182, 192
G
γ-cut . . . 357
gap, duality . . . 70, 73, 245, 285, 308, 312, 325
Gauß . . . 23, 29
  approximation, linear . . . 2
  –Newton method . . . 24, 119, 150
  normal equations . . . 2, 29
geometric interpretation . . . 71
global
  minimum . . . 36, 60, 112, 217, 341
  optimization package . . . 382
golden section search . . . 145
gradient
  conjugate, method . . . 26, 120, 129, 148
  method . . . 24, 100, 103, 128, 144
  projected, method . . . 27, 189, 202, 294
  reduced . . . 27, 176, 192 f, 211
Goldfarb . . . 132
Goldfarb–Idnani method . . . 179
  Matlab® program . . . 185 f
Goldman–Tucker theorem . . . 261, 296
Goldstein condition . . . 104
Greenstadt . . . 132
  update . . . 137
Guignard . . . 55, 365
H
Hadamard product . . . 248
Han . . . 27
  algorithm . . . 228, 239
Haverly pooling problem . . . 361
Hessian . . . 7, 36
  form of a linear equation . . . 235
  reduced . . . 176
Hestenes–Stiefel cg algorithm . . . 26, 122
HKM-direction . . . 332
Hoffman . . . 346 f
Householder matrix . . . 176
hull
  convex . . . 30, 44, 82
  ellipsoid . . . 96
hyperplane . . . 40, 76, 78, 85
I
ice cream cone . . . 313
ill-conditioned . . . 18 f, 28, 156, 216, 232
inactive restriction . . . 44, 47
indicator function . . . 214
inequality constraints . . . 23, 38, 54, 67
inexact line search . . . 106, 130, 230
infeasible interior-point algorithm . . . 284
inner/scalar product . . . 25, 39, 144, 301, 352
interior-point
  algorithm, method . . . 28, 242, 264, 284
    for quadratic optimization . . . 289
    primal–dual . . . 224, 263
  condition . . . 248, 253
interior penalty method . . . 220
inverse barrier function . . . 221
IPC interior-point condition . . . 248
isoperimetric problem . . . 21
isotone . . . 53, 166
J
Jacobian . . . 37, 119, 199, 252, 298
jamming . . . 191
John . . . 23, 370
K
Kantorovich . . . 23
  inequality . . . 101
Karmarkar . . . 27, 241 f
Karush . . . 23
Karush–Kuhn–Tucker
  conditions . . . 14, 46, 51, 167, 371
  points . . . 47
Kelley . . . 354
  ’s cutting plane algorithm . . . 355, 362
kernel . . . 178, 189, 194, 201, 238, 297
Khachiyan . . . 27, 241 f
KKT
  conditions . . . 46 f, 60, 83, 167, 226
  points . . . 47, 60, 188, 196, 202, 225
Klee/Minty . . . 28, 241
Kleibohm . . . 345
Krein–Milman . . . 343
Krylov space . . . 124
Kuhn . . . 23
L
Lagrange . . . 22
  dual function . . . 67
  dual problem . . . 66 f, 71, 85 f, 306
  duality . . . 66, 71
  function . . . 27, 47, 67, 74
    augmented . . . 238
  multiplier rule/method . . . 22 f, 37, 47
  –Newton method . . . 27, 226
    Wilson’s . . . 200
  –Newton SQP method . . . 202
Lagrangian . . . 47, 58, 74, 225
  augmented . . . 238
  function . . . 67
least change
  principle . . . 132
  secant update . . . 132
least squares
  fit . . . 23
  method, problem . . . 29, 118, 209
  problems, nonlinear . . . 6
Levenberg . . . 24
Levenberg–Marquardt
  direction . . . 112
  method . . . 120
  trajectory . . . 112, 148
lifting . . . 335
line search . . . 98, 130, 138, 144, 202
  exact . . . 100, 104, 122
  inexact . . . 106, 230
linear
  Chebyshev approximation . . . 2, 4
  constraint . . . 15
    equality . . . 175, 186, 193
  Gauß approximation . . . 2
  independence c. q. (LICQ) . . . 56, 83
  least squares problems . . . 2
  optimization . . . 15, 152, 242
  program, problem . . . 71, 151, 242 f
  programming . . . 151
linearized problem, system . . . 331, 365
linearizing cone . . . 44, 50, 83, 368
LO linear optimization . . . 299
local
  minimizer . . . 36
  minimum . . . 35 f
  minimum point . . . 36
  sensitivity . . . 77
logarithmic barrier
  function . . . 28, 221, 248 f, 326, 337
  method . . . 28, 242, 295
logistic regression . . . 88
  function . . . 88
  regularized . . . 89
log-likelihood model . . . 89
long-step
  method . . . 281
  path-following algorithm . . . 281 f
Lorentz cone . . . 270, 313, 337, 378
Löwner
  –John ellipsoid . . . 96
  partial order . . . 301
M Mangasarian–Fromovitz . . . . . . 55 Maple® optimization tools . . . . . . 381 Maratos effect . . . . . . . . . . . . 232, 239 marginal rate of change . . . . . . . . . . 77 Marquardt . . . . . . . . . . . . . . . . . . . . 25 Matlab® optimization toolbox . . . . . . 4, 19, 149, 212, 239, 374
matrix Frobenius . . . . . . . . . . . . . . . . . . . 155 maximizer . . . . . . . . . . . . . . 69, 245, 308 maximum norm . . . . . . . . 4, 8, 53, 149 McCormick . . . . . . . . . . . . 28, 65, 242 Mehrotra predictor-corrector algorithm, method . . . . . . . . . . 284 ff Matlab® program . . . . . . . . . . . . 288 f merit function . . . . . . . . . . . . . . 27, 233 minimal enclosing ellipsoid . . . . . . . . . . . . 316 point . . . . . . . . . . . . . . . . . . 30, 39, 348 minimax problem . . . . . . . . . . . . . 3, 15 minimizer . . . . . . . . . . . . . . . 30, 39, 308 global . . . . . . . . . . . . . . . . 58, 348, 369 local . . . . . . . . . . . . . . . . . 36, 217, 366 minimum . . . . . . . . . . . . . . . . . . . . . . . 35 f circumscribed ball . . . . . . . . 14, 338 global . . . . . . . . . . . . . . . . . 34, 36, 341 local . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 point, local . . . . . . . . . . . . . . . . . . . . 36 strict global . . . . . . . . . . . . . . . . . . . 36 strict local . . . . . . . . . . . . . . . . . . . . . 36 Mizuno–Todd–Ye predictor-corrector method . . . . . . . 277 f, 284 Modified Armijo step size rule . . . . . . . . . . . . . . . . . . . . . . 143 MT-family . . . . . . . . . . . . . . . . . . . . . . 332 multiplier . . . . . . . . . . . . . . . . 37, 47, 67 method/rule, Lagrange . . . 23, 37 MZ-family . . . . . . . . . . . . . . . . . . . . . . 332
N necessary second-order condition 62 neighborhood . . . . . . . . 266 ff, 297, 334 Nelder and Mead polytope method . . . . . . . . . . . . . . . 26, 90, 143 Newton . . . . . . . . . . . . . . . . . . . . . . . . 21 direction . 262, 264, 272 f, 285, 292 step . . . . . . 239, 262 f, 272, 282, 285 damped . . . . . . . . . . . . . . . . 263, 285 ’s method . . . 131, 200, 252 f, 261, 286, 330
damped, method . . . . . . 131, 136, 202, 239 Newton–Raphson method . . . . 119 nonbasic variable . . . . . . . . . 153, 192 ff nondegenerate . . . . . . . . 153, 169, 193 nonlinear least squares problem . . . . . . . . . . . 6 optimization . . . . . . . . . . . . . 213, 224 norm . . . . . . . . . . . . . . . . 2, 25, 102, 144 euclidean . . . . . . . . . . . 4, 10, 39, 248 Frobenius . . . . . . . . . . . . . 132 f, 300 matrix . . . . . . . . . . . . . . . . . . . . . . . 106 maximum . . . . . . . . . . . . 4, 8, 53, 149 normal case . . . . . . . . . . . . . . . . . . . . . . . . . . 246 equations, Gauß . . . . . . . . . . . 2, 29 normalization constraint . . . . . . . . 188 normalized duality gap . . . . . 254, 272 NT-direction . . . . . . . . . . . . . . . . . . . 332
O objective function . . . . . . . . . . . . 10, 15 reduced . . . . . . . . . . . . . . 37, 176, 208 optimal control . . . . . . . . . . . . . . . . . . . . . 20, 23 location . . . . . . . . . . . . 10, 12, 32, 83 value . . . . . . . 39, 68 f, 206, 244, 345 optimality condition . . . . . . 37, 44, 61, 245, 290 first-order . . . . . . . . . . . . . 36 f, 44–61 second-order necessary . . . . . . . . . . . . . . . 36, 62 f sufficient . . . . . . . . . . . . . . . . 36, 64 f optimization convex . . . . . . . . . . . . . . . . . 52, 58–61 linear 15, 28, 88, 152–165, 242–252 nonlinear . . . . . . . . . . . . . . . . 213, 224 portfolio . . . . . . . . . . . . . . . . . . . . . . . 34 quadratic 13, 15, 26, 165–186, 277, . . . . . . 289–293, 315 f, 339 semidefinite . . . . . . . . . . . . . . 299–340 toolbox . . . 4, 34, 149, 203, 374–377, 381
unconstrained . . . . . . 15, 28, 87–150, 187, 217 optimizer . . . . . . . . . . . . . . . . . . . . . . . . 69 Oren–Luenberger class . . . . . . . 138 orthogonal complement . . . . . . . . . . . . . . . . . . 190 distance line fitting . . . . . . . . . . . . 80 projection . . . . . . . . . . . . . . . 209, 266 projector . . . . . . . . . . . . . . . . . . . . . 190 overdetermined systems . . . . . . . 1, 29
P parameter barrier . . . . . . . . . . . . . . . . . . . 248, 295 centering . . . . . . . 263, 272, 281, 285 path central . . . 222, 252 f, 291, 294, 329 -following methods 266 f, 272–277, . . . 281–284, 297, 325–331, 334 pattern . . . . . . . . . . . . . . . . . . . . . 85, 334 recognition . . . . . . . . . . . . . . . . . . . 334 separation via ellipses . . . . . . 334–336, 380 f penalty function . . . . . . . . . . . . . . . . . . 27, 215 exact . . . . . . . . . . . . . 27, 217 f, 228 method . . . . . . . . . . . . . . . . . . . 16, 214 exterior . . . . . . . . . . . . . . . . . 28, 214 interior . . . . . . . . . . . . . . . . . . . . . 220 parameter . . . . . . . . 16, 86, 215, 231 problem . . . . . . . . . . . . . . . . . . . . . . . 34 quadratic . . . . . . . . . . . . . . . . 34, 237 term . . . . . . . . . . . . . . . . . 16, 86, 214 f perturbation function . . . . . . . . . . . . . . . . . . . . . . . 75 vector . . . . . . . . . . . . . . . . . . . . . . . . . 75 point basic . . . . . . . . . . . . . . . . . . . . . . . . . 153 extreme . . . . . . . . . . . . . . . . . . . . . . 343 minimal, minimum . 36, 39, 52, 60 stationary . . . . . . . . . . 7, 15, 36, 108 Polak/Ribière . . . . . . . . . . . . . . . . 129 polyhedral set . . . . . . . . . . . . . . . . . . 354
polyhedron . . . . . . . . . . . . . . . . . . . . . 354 polynomial complexity . 28, 242, 265 polytope . . . . . . . . . . 90, 346, 355, 362 method, Nelder and Mead 26, 90 pooling problem . . . . . . . . . . . 341, 361 portfolio optimization . . . . . . . . . . . . 34 positive definite . . . . . . . . . . . . . . . 36, 81, 300 f semidefinite . . . . . 36, 81, 300 f, 317 Powell . . . . . . . . . . . . . . . . . 25, 27, 104 Powell’s dogleg trajectory 116, 148 Powell-symmetric-Broyden update . . . . . . . . . . . . . . . . . . . . . . . 135 preconditioning . . . . . . . . . . . . 128, 262 predictor step . . . . . . . . 277–280, 285 f predictor-corrector method Matlab® program . . . . . . . . . . . . 288 f Mehrotra . . . . . . . . . . . . . . . . . . 284 ff Mizuno–Todd–Ye . . . . . . . . . . 277 f primal affine-scaling method . . . . . . . . . 295 Newton barrier method . . . . . 295 problem. . . . . . . . . 66 f, 152, 242, 304 residual . . . . . . . . . . . . . . . . . . 261, 298 primal–dual central path . . . . . . . . . . 252 ff, 329 ff feasible set . . . . . . . . . . . . . . 252, 292 strictly . . . . . . . . . . . . . . . . . 252, 292 interior-point method . . . . . . 224, 248, 319 system . . . 244, 248, 261–263, 325 ff primally feasible . . . . . . . . . . . 244, 252 principle of least change . . . . . . . 132 f problem dual . . . . . . . . . . . 66 ff, 152, 243, 304 extended . . . . . . . . . . . . . . . . . . . . . 243 isoperimetric . . . . . . . . . . . . . . . . . . 21 linear . . . . . . . . . . 71, 242 f, 284, 310 primal . . . . . 66 f, 152, 242, 304, 379 programming cone . . . . . . . . . . . . . . . . . . . . . 313, 378 linear . . . . . . . . . . . . . . . . . . . . 151, 246 projected gradient method . . . . . . 27, 189, 202, 294
projection method . . . . . . . 151, 186–203, 210 ff orthogonal . . . . . . . . . . . . . . . 209, 266 PSB method . . . . . . . . . . . . . . . . . . . . 135
Q QP quadratic programming 165, 167 QR-factorization . . . . . 156, 175, 210 f quadratic convergence . . . . . 27, 202, 226, 232 form, definite . . . . . . . . . . . . . . . . . 238 optimization . 13, 15, 26, 165–186, . . . . . . 277, 309, 315 interior-point method . . . 289–293 penalty method/problem . . . . . . 34, 214, 237 quasi-Newton condition . . . . . . . . . . . . . . . . 132, 137 direction . . . . . . . . . . . . . . . . . . . . . 117 method . . . . . . 25, 27, 130–142, 228
R range space method . . . . . . . . . . . . . 177 rate of convergence . . . . . . XIV, 125 ff ray minimizing . . . . . . . . . . . . . . . . . 104 reduced gradient . . . . . . . . . . . . . . . . 176, 192 f method . . . . . . . 27, 192, 196 f, 211 Hessian . . . . . . . . . . . . . . . . . . . . . . . 176 objective function . . . . 37, 176, 208 regularity assumption/condition . . . . . . 54, 365, 367 regression logistic . . . . . . . . . . . . . . . . . . . . . . . . 88 regularized logistic regression function . . . . . . 89 relaxation control . . . . . . . . . . . . . . . . 99 residual complementarity . . . . . . . . . 261, 298 dual . . . . . . . . . . . . . . . . . . . . . . 261, 298 primal . . . . . . . . . . . . . . . . . . . 261, 298
vector . . . . . . . . . . . . . . . . . . . . . . . . . . 2 restriction . . . . . . . . . . . . . . . . . . . . . . . 15 active . . . . . . . . . . . . . . . . . . . . . . . . . 44 revised simplex algorithm . . 152, 204 root of a matrix . . . . . . . . . . . . 302, 337 Rosen . . . . . . . . . . . . . . . . . . . . . . 27, 186 ’s projection method . . . 211, 189 ff Rosenbrock function . . . . . . . . . . 143
S saddlepoint . . . . 36, 58 f, 74, 218, 343 Schur complement . . . . . . 313 f, 317 f SDO semidefinite optimization . 299 SDP semidefinite programming . . . . . . 300, 305 search method . . . . . . 26, 90, 110, 347 secant relation . . . . . . . . . . . . . . . . . . 132 second-order cone . . . . . . . . . . . . . . . . . . . 313 ff, 378 optimization . . . . . . . . . . . . . . . . 315 problem . . . . . . . . . . 309, 314, 338 f programming . . . . . . 299, 313, 378 optimality condition . . . . . . 15, 36, 59, 61–65 SeDuMi . . . . . . . . . . . . 313, 338 f, 377 ff self-dual cone . . . . . . . . . . . . . . 302, 377 homogeneous problem . . . . . . . . 287 self-scaling variable metric . . . . . . 138 semidefinite . . . 15, 36, 81, 111, 166, 175, 300 optimization . . . . . . . . . . . . . 299–340 problem, program . . . . . . 300, 304 f, 309, 316 programming . . . . . . . . . . . . 300, 379 sensitivity . . . . . . . . . . . . . . . . . 142, 206 f analysis . . . . . . . . . . . . . . . . . . 74 f, 206 function . . . . . . . . . . . . . . . . . . . . . . . 75 local . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 separable function . . . . . . . . . . . . . . 348 separating hyperplane . . . 40, 43, 85 f separation theorem . . . . . . . . . . . . . . 78
sequence antitone . . . . . . . . . . . . . . . . . 169, 355 isotone . . . . . . . . . . . . . . . . . . . . . . . 166 strictly antitone . . . . . . . . . . . . . . 108 sequential linear programming . . . . . . . . . . . . 27 quadratic programming . . . . . . 15, 197, 214, 224 shadow price . . . . . . . . . . . 66, 77, 206 f Sherman–Morrison–Woodbury formula . . . . . 205, 207 Shor’s ellipsoid method . . . . . . 28, 93, 143, 242, 363 short-step algorithm . . . . . . 263, 272 f, 276 f, 280 f, 297 path-following algorithm . . . . . 272 f simplex . . . . . . . . . . . . . . . . . . . 4, 90, 347 algorithm, method . . . 26, 152, 159, 188, 192, 206, 241 revised . . . . . . . . . . . . . . . . . 152, 204 single exchange . . . . . . . . . . . . . . . . . 173 slack variable . . . . 157, 224, 243, 306 f vector . . . . . . . . . . . . . . . . . . . . . . . . 168 Slater condition . . . . . . . . 61, 308, 327, 329 constraint qualification 55, 59, 78 f SLP method . . . . . . . . . . . . . . . . . . . . . 27 Sℓ1QP method, Fletcher . . . . . . 233 SOCP second-order cone programming . . . . . . . . . . . . . . . . 314 f Sonnevend . . . . . . . . . . . . . . . . . . . . 260 S-pair . . . . . . . . . . . . . . . . . . . . . . . . . . 180 S-procedure . . . . . . . . . . . . . . . . 316–322 specificity . . . . . . . . . . . . . . . . . . . . . . . 142 SQP method . . . 15, 27, 197, 224–236 Lagrange–Newton . . . . . . . . . 202 standard form of linear problems . . . . . . 71, 85, 157, 206, 242, 377 optimization problems . . . . . . . . . 14 starting point . . . . . . . . . . . . . . 287, 297 stationary point . . . . . . 7, 15, 36, 108
steepest descent . . 24 f, 100, 117, 144 step corrector . . . . . . . . . . 277–280, 284 ff damped Newton . . . . . . . . 263, 285 exchange . . . . . . . . . . . . . . . 155 f, 196 predictor . . . . . . . . . . . . 277–280, 285 f size . . . . . . . . . . . . . . . . . . . . . . . . 24, 98 selection . . . . . . . . . . . . . . . . . . . . 104 strict complementarity . . . . . . . . . 198, 300 global minimum . . . . . . . . . . . . . . . 36 local minimum . . . . . . . . 36, 64, 344 strictly antitone function . . . . . . . 106, 112 ff concave function . . . . 53 f, 249, 344 convex function . . . . . 52 f, 249, 323 dual feasible . . . . . . . . . . . . . 308, 312 feasible matrix . . . . . . . . . . . . . . . 308 feasible points . . . 244, 248, 252, 268, 308 f, 329 feasible set . . . . . . . . . . . . . . . . . . . 281 primal–dual feasible point 252, 292 primal feasible . . . . . . . . . . . . . . . . 308 strong duality . . . 70, 74, 78, 245, 308 successive quadratic programming 197 sufficient second-order condition . . . . . . 61, 64 superlinear convergence . . . . . . . . . 232 support vector . . . . . . . . . . . . . . . . . . . 86 machine . . . . . . . . . . . . . . . . . . . 85, 88 method . . . . . . . . . . . . . . . . . . . . . . . 142 systems, overdetermined . . . . . . . 1, 29
T tangent cone . . . . . . . . . . . . . . . . 49 f, 82 f, 369 direction . . . . . . . . . . . . . . . . . . . . . . . 49 toolbox, optimization . . . 4, 34, 149, . . . 203, 212, 239, 374 f, 381 f trust region method . . . . . . 25, 110–120, 148 f, 234 Tucker . . . . . . . . . . . . . . . . . . . . . . . . . 23 two-phase method . . . . . . . . . 163, 208
U unconstrained optimization . . . . . . 15, 28, 87–150, 187, 217 problem . . . . . . . . . . . . . . . 16, 35, 87 f updating formula . . . . . . . . . . . . . . . 132 BFGS . . . . . . . . . . . . . . 137, 140, 230 DFP . . . . . . . . . . . . . . . . . . . . 135 f, 140 ellipsoid . . . . . . . . . . . . . . . . . . . . . . . 96
V variable artificial . . . . . . . . . . . . . . . . . . . . . . 159 basic . . . . . . . . . . . . . . . . . . . . . . . . . 153 dual . . . . . . . . . . . . . . . 67, 77, 86, 243 metric method . . . . . . . . . . . . 25, 136 self-scaling . . . . . . . . . . . . . . . . . . 138 nonbasic . . . . . . . . . . . . . . 153 f, 192 ff slack . . . . . . . . . . 157, 224, 243, 306 f vertex . . . . . . . . . . . . . . 27, 90, 347, 362 v. s. vide supra (see above) V-triple . . . . . . . . . . . . . . . . . . . . . . . . . 180
W weak duality . . . . . . 69, 153, 244, 308 Wilson . . . . . . . . . . . . . . . . 27, 197, 224 ’s Lagrange–Newton Method 200 wlog without loss of generality . . . . Wolfe . . . . . . 27, 104, 106, 151, 191 f Wolfe–Powell condition . . . . . 106 working set . . . . . . . . . . . . . . . . 160, 211 algorithm . . . . . . . . . . . . . . . . . . . . 160
Z zero space method . . . . . . . . . . . . . . 177 zigzagging . . . . . . . . . . . . . . . . . 145, 191 Zoutendijk . . . . . . . . . . 151, 186, 188 Zwart’s counterexample . . . . . . . . 362