Advances in Electronics and Electron Physics, Volume 82


Author: Peter W. Hawkes


ADVANCES IN ELECTRONICS AND ELECTRON PHYSICS

VOLUME 82

EDITOR-IN-CHIEF

PETER W. HAWKES Centre National de la Recherche Scientifique Toulouse, France

ASSOCIATE EDITOR

BENJAMIN KAZAN Xerox Corporation Palo Alto Research Center Palo Alto, California

Advances in Electronics and Electron Physics

EDITED BY PETER W. HAWKES CEMES/Laboratoire d'Optique Electronique du Centre National de la Recherche Scientifique Toulouse, France

VOLUME 82

ACADEMIC PRESS, INC. Harcourt Brace Jovanovich, Publishers Boston San Diego New York London Sydney Tokyo Toronto

This book is printed on acid-free paper.

COPYRIGHT © 1991 BY ACADEMIC PRESS, INC. ALL RIGHTS RESERVED.

NO PART OF THIS PUBLICATION MAY BE REPRODUCED OR TRANSMITTED IN ANY FORM OR BY ANY MEANS, ELECTRONIC OR MECHANICAL, INCLUDING PHOTOCOPY, RECORDING, OR ANY INFORMATION STORAGE AND RETRIEVAL SYSTEM, WITHOUT PERMISSION IN WRITING FROM THE PUBLISHER.

ACADEMIC PRESS, INC. 1250 Sixth Avenue, San Diego, CA 92101

United Kingdom Edition published by ACADEMIC PRESS LIMITED 24-28 Oval Road, London NW1 7DX

LIBRARY OF CONGRESS CATALOG CARD NUMBER: 49-7504 ISSN 0065-2539 ISBN 0-12-014682-7 PRINTED IN THE UNITED STATES OF AMERICA

91 92 93 94   9 8 7 6 5 4 3 2 1

CONTENTS

CONTRIBUTORS
PREFACE

CAD in Electromagnetism
OSZKAR BIRO AND K. R. RICHTER
  I. Introduction
  II. Static Fields
  III. Eddy Current Fields
  IV. Waveguides and Cavities
  V. Galerkin's Method
  VI. Application of the Finite Element Method
  Acknowledgments
  References

Speech Coding
VLADIMIR CUPERMAN
  I. Introduction
  II. Signal Processing in Speech Coding
  III. Speech Coding Systems
  IV. Future Research Directions
  Acknowledgments
  References

Bandgap Narrowing and Its Effects on the Properties of Moderately and Heavily Doped Germanium and Silicon
SURESH C. JAIN, R. P. MERTENS, AND R. J. VAN OVERSTRAETEN
  I. Introduction
  II. Calculations of BGN in n-Type Silicon and n-Type Germanium
  III. Impurity Concentration Fluctuations and Band Tails
  IV. Effect of Bandgap Narrowing on Optical Properties
  V. Summary of Important Results
  Acknowledgments
  Appendix
  References

The Rectangular Patch Microstrip Radiator-Solution by Singularity Adapted Moment Method
E. LEVINE, H. MATZNER, AND S. SHTRIKMAN
  I. Introduction
  II. The Spectral Domain Presentation
  III. The Moment-Method Formulation
  IV. Two-Dimensional Solution
  V. Conclusion
  Appendix A. Far Fields in Two Polarizations
  Appendix B. Evaluation of Typical Matrix Elements
  References

Some Recent Advances in Multigrid Methods
JAN MANDEL
  I. Introduction
  II. The Fundamental Multigrid Algorithm
  III. Preconditioning by Multigrid
  IV. Methods Based on Space Decomposition
  V. Multigrid in Elasticity
  VI. Multigrid for Mixed Problems
  VII. Multigrid for High Order and Spectral Methods
  VIII. Eigenvalue Problems
  IX. Multigrid and Parallel Computing
  X. Some Other Multigrid Developments
  XI. Multigrid Software
  Appendix A. PLTMG
  Appendix B. MADPACK
  Appendix C. MUDPACK
  Appendix D. MGD1
  Appendix E. MG00
  Appendix F. AMG
  Appendix G. BOXMG
  References

INDEX

CONTRIBUTORS

Numbers in parentheses indicate the pages on which the authors' contributions begin.

OSZKAR BIRO (1), Institute for Fundamentals and Theory in Electrical Engineering, Graz University of Technology, Kopernikusgasse 24, A-8010 Graz, Austria

VLADIMIR CUPERMAN (97), Communication Sciences Laboratory, School of Engineering, Simon Fraser University, Burnaby, BC, Canada V5A 1S6

SURESH C. JAIN (197), Delft Institute of Microelectronics, TU Delft, Postbus 5053, 2600 GB Delft, The Netherlands

E. LEVINE (277), ELTA Electronics Industries Ltd., PO Box 330, Ashdod, 77102 Israel

JAN MANDEL (327), Computational Mathematics Group, University of Colorado at Denver, Denver, CO 80204

H. MATZNER (277), Department of Electronics, Weizmann Institute of Science, PO Box 26, Rehovot, 76100 Israel

R. P. MERTENS (197), IMEC, Kapeldreef 75, B-3030 Leuven, Belgium

K. R. RICHTER (1), Institute for Fundamentals and Theory in Electrical Engineering, Graz University of Technology, Kopernikusgasse 24, A-8010 Graz, Austria

S. SHTRIKMAN (277), Department of Electronics, Weizmann Institute of Science, PO Box 26, Rehovot, 76100 Israel

R. J. VAN OVERSTRAETEN (197), IMEC, Kapeldreef 75, B-3030 Leuven, Belgium


PREFACE

Computational methods, coding, semiconductors, and antenna arrays are the subjects of this latest volume of Advances in Electronics and Electron Physics, in which we try to maintain the traditional broad coverage while highlighting particular themes in occasional topical volumes.

The first of the contributions on computation is concerned with computer-aided design in electromagnetism, where the widespread availability of ever faster and better computing facilities has revolutionized the field. O. Biro leads us through the various methods that are being used or developed for studying static fields, eddy current fields, and cavities. The amount of research on these problems is enormous, as the annual COMPUMAG Conference proceedings show, and this survey of developments is very useful.

The other chapter on computation, by J. Mandel, is concerned with multigrid methods, the literature of which is growing explosively. The author concentrates on recent developments in a wide range of fields. After examining the mathematical fundamentals, the special needs of elasticity, mixed problems, and spectral studies are explored in detail. There are sections on eigenvalue problems and on parallelization; a useful survey of available software concludes this up-to-date review of an important class of methods.

Speech coding is intrinsically of great interest and also of considerable social and economic importance. The field is too vast to be covered usefully in a single review; the chapter by V. Cuperman is mainly concerned with a new class of speech coding systems that has emerged in the past few years, known as "analysis-by-synthesis." The author does, however, devote considerable space to the many other types of coding that are employed, including vector quantization. The subject has important ramifications for society, not only for telephony but also for vocal recognition and vocal synthesis for the handicapped.

Rectangular microstrip patches are specialized radiators used in printed antenna arrays. Although they are difficult to analyze, E. Levine, H. Matzner, and S. Shtrikman offer a means of solving the design problems associated with these devices, particularly in accelerating the convergence of the calculation.

Finally, we have a long chapter on doping in semiconductors, by S. C. Jain, R. P. Mertens, and R. J. van Overstraeten. The band gap of silicon and germanium is considerably narrowed by heavy doping, and the authors examine this in great detail. After critically examining the various theoretical approaches, they consider fluctuations in impurity concentration and their effect on the optical properties. This survey brings together a great deal of scattered information on these important questions.

It remains only for me to thank all the authors for taking such trouble over their contributions and to list articles promised for forthcoming chapters in this series.

FORTHCOMING ARTICLES

Neural Networks and Image Processing . . . J. B. Abbiss and M. A. Fiddy
Image Processing with Signal-Dependent Noise . . . H. H. Arsenault
Residual Vector Quantizers with Jointly Optimized Codebooks . . . C. F. Barnes and R. L. Frost
Parallel Detection . . . P. E. Batson
Ion Microscopy . . . M. T. Bernius
Magnetic Reconnection . . . A. Bratenahl and P. J. Baum
Vacuum Microelectronic Devices . . . I. Brodie and C. A. Spindt
Sampling Theory . . . J. L. Brown
ODE Methods . . . J. C. Butcher
Nanometre-Scale Electron Beam Lithography . . . Z. W. Chen
The Artificial Visual System Concept . . . J. M. Coggins
Dynamic RAM Technology in GaAs . . . J. A. Cooper
Corrected Lenses for Charged Particles . . . R. L. Dalglish
Foundations and Applications of Lattice Transforms in Image Processing . . . J. L. Davidson
The Development of Electron Microscopy in Italy . . . G. Donelli
The Study of Dynamic Phenomena in Solids Using Field Emission . . . M. Drechsler
Invariant Pattern Representations and Lie Group Theory . . . M. Ferraro
Amorphous Semiconductors . . . W. Fuhs
Median Filters . . . N. C. Gallagher and E. Coyle
Bayesian Image Analysis . . . S. and D. Geman
Magnetic Force Microscopy . . . U. Hartmann
Theory of Morphological Operators . . . H. J. A. M. Heijmans
Kalman Filtering and Navigation . . . H. J. Hotop
3-D Display . . . D. P. Huijsmans and G. J. Jense
Applications of Speech Recognition Technology . . . H. R. Kirby
Spin-Polarized SEM . . . K. Koike
Finite Topology and Image Analysis . . . V. Kovalevsky
Expert Systems for Image Processing . . . T. Matsuyama
The Intertwining of Abstract Algebra and Structured Estimation Theory . . . S. D. Morgera
Electronic Tools in Parapsychology . . . R. L. Morris
Image Formation in STEM . . . C. Mory and C. Colliex
Phase-Space Treatment of Photon Beams . . . G. Nemes
Low Voltage SEM . . . J. Pawley
Z-Contrast in Materials Science . . . S. J. Pennycook
Languages for Vector Computers . . . R. H. Perrot
Electron Scattering and Nuclear Structure . . . G. A. Peterson
Edge Detection . . . M. Petrou
Electrostatic Lenses . . . F. H. Read and I. W. Drummond
Scientific Work of Reinhold Rudenberg . . . H. G. Rudenberg
Metaplectic Methods and Image Processing . . . W. Schempp
X-ray Microscopy . . . G. Schmahl
Accelerator Mass Spectroscopy . . . J. P. F. Sellschop
Applications of Mathematical Morphology . . . J. Serra
Focus-Deflection Systems and Their Applications . . . T. Soma
Echographic Image Processing . . . J. M. Thijssen
The Suprenum Project . . . O. Trottenberg
Knowledge-Based Vision . . . J. K. Tsotsos
Electron Gun Optics . . . Y. Uchikawa
Spin-Polarized SEM . . . T. R. van Zandt and R. Browning
Cathode-ray Tube Projection TV Systems . . . L. Vriens, T. G. Spanjer and R. Raue
n-Beam Dynamical Calculations . . . K. Watanabe
Thin-film Cathodoluminescent Phosphors . . . A. M. Wittenberg
Parallel Imaging Processing Methodologies . . . S. Yalamanchili
Diode-Controlled Liquid-Crystal Display Panels . . . Z. Yaniv
Parasitic Aberrations and Machining Tolerances . . . M. I. Yavor
Group Theory in Electron Optics . . . Yu Li


ADVANCES IN ELECTRONICS AND ELECTRON PHYSICS, VOL. 82

CAD in Electromagnetism

OSZKAR BIRO AND K. R. RICHTER

Institute for Fundamentals and Theory in Electrical Engineering, Graz University of Technology, Graz, Austria

I. Introduction
II. Static Fields
  A. Differential Equations and Boundary Conditions of Static Fields
  B. Scalar Potential Descriptions of Static Fields
  C. Vector Potential Descriptions of Static Fields
III. Eddy Current Fields
  A. Differential Equations, Boundary and Interface Conditions of Eddy Current Fields
  B. Potential Descriptions of Eddy Current Fields
  C. Coupling Eddy Current and Static Magnetic Fields
IV. Waveguides and Cavities
  A. Differential Equations and Boundary Conditions of Waveguides and Cavities
  B. Potential Descriptions of Waveguides and Cavities
V. Galerkin's Method
  A. Weak Formulations
  B. General Description of Galerkin's Method
  C. Application of Galerkin's Method to Potential Formulations
VI. Application of the Finite Element Method
  A. A Summary of the Finite Element Method
  B. Analysis of an Iron Cored Choke Coil
  C. Analysis of a Plate Beneath a Coil
  D. Analysis of Transient Eddy Currents in a Conducting Brick
  E. Analysis of Anisotropic Waveguides
  F. Analysis of a Dielectric Loaded Cavity
Acknowledgments
References

I. INTRODUCTION

The rapid development of computer hardware in recent years provides vast resources for the design of electromagnetic devices. Comparable progress in CAD software has to follow in order to exploit these possibilities. The advent of CAD methods revolutionizes the design procedure by bringing the analysis of the electromagnetic field in the devices to the foreground, which provides an insight into their operation far superior to that obtainable by traditional network considerations.


The field analysis methods of two-dimensional models can be regarded as established, and successful CAD software packages are commercially available. The situation, however, is different with three-dimensional models: their use is far less widespread. The reason is no longer the considerably higher memory and CPU-time requirement, since this is not essential in an age of cheap memory and fast computers. The main problem is the scarcity of robust and reliable numerical field analysis methods. The present work attempts to make up for this shortage.

A largely unified field analysis approach is proposed for various types of problems: static fields, eddy current fields and general electromagnetic fields. The uniformity is attained by using similar potential functions to describe the field in each particular case. This allows for a great degree of generality with respect to material properties, since the continuity of the potentials is sufficient to ensure the satisfaction of the interface conditions on surfaces where the material characteristics change abruptly. The Coulomb gauge is invariably applied to ensure the uniqueness of the vector potentials, which are necessary in three-dimensional analysis. This results in great numerical stability and in the lack of any spurious solutions when the finite element method is employed. The robustness of the methods is shown by some illustrative examples. It is the feeling of the author that the analysis methods presented in this work can serve as a basis for some general purpose three-dimensional CAD software packages in the near future.

The analysis of electromagnetic field problems is based on Maxwell's equations (Maxwell, 1864). These are partial differential equations stating the relationships between the field vectors $\mathbf{E}$ (electric field intensity), $\mathbf{D}$ (electric flux density), $\mathbf{H}$ (magnetic field intensity), $\mathbf{B}$ (magnetic flux density) and $\mathbf{J}$ (electric current density) as well as the scalar $\rho$ (electric charge density):

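$$\nabla \times \mathbf{H} = \mathbf{J} + \frac{\partial \mathbf{D}}{\partial t}, \tag{1}$$

$$\nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t}, \tag{2}$$

$$\nabla \cdot \mathbf{B} = 0, \tag{3}$$

$$\nabla \cdot \mathbf{D} = \rho. \tag{4}$$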

Further relationships between the field quantities are defined by the constitutive equations

$$\mathbf{B} = \mu \mathbf{H}, \tag{5}$$

$$\mathbf{D} = \varepsilon \mathbf{E}, \tag{6}$$

$$\mathbf{J} = \sigma \mathbf{E}. \tag{7}$$
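The constitutive relations (5)-(7) all have the same form, a material property applied to a field vector. The sketch below is only an illustration under stated assumptions: the helper `apply_material`, the numerical values and the diagonal tensor are hypothetical and not taken from the chapter. It treats the property either as a scalar (isotropic medium) or as a 3x3 tensor (anisotropic medium).

```python
import math

# Sketch of the constitutive relations (5)-(7): B = mu*H, D = eps*E, J = sigma*E.
# The helper name and all numerical values are illustrative assumptions.

MU_0 = 4e-7 * math.pi  # permeability of free space, H/m


def apply_material(prop, field):
    """Return prop * field for a scalar or a 3x3 tensor material property."""
    if isinstance(prop, (int, float)):
        return [prop * f for f in field]
    return [sum(prop[i][j] * field[j] for j in range(3)) for i in range(3)]


H = [100.0, 0.0, 50.0]                    # magnetic field intensity, A/m
B_iso = apply_material(1000 * MU_0, H)    # isotropic: B is parallel to H

mu_aniso = [[2000 * MU_0, 0.0, 0.0],
            [0.0, 1000 * MU_0, 0.0],
            [0.0, 0.0, 500 * MU_0]]
B_aniso = apply_material(mu_aniso, H)     # anisotropic: B is generally not parallel to H
```

With the scalar property every component of `B_iso` is scaled by the same factor, whereas the tensor scales the x and z components differently, so `B_aniso` points in a different direction than `H`.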


The permeability $\mu$, the permittivity $\varepsilon$ and the conductivity $\sigma$ describe the properties of the medium, and in the simplest homogeneous, isotropic and linear case they are constant scalar values independent of the fields. In the general case, however, they may vary in space when the medium is inhomogeneous, they may be tensor quantities describing an anisotropic medium, and they may even be field dependent, resulting in nonlinearity.

The electromagnetic field can be completely represented in terms of two of the field vectors (e.g., $\mathbf{E}$ and $\mathbf{H}$); the remaining quantities are obtainable from the constitutive relationships in Eqs. (5) to (7). This way of representation involves six scalar functions (the three components of each field vector), a description that proves to be redundant. Indeed, the introduction of so-called potential functions allows the electromagnetic field to be represented in terms of a lower number of scalar functions. This potential representation is of further advantage because the potential functions can always be chosen to be continuous, while the field vectors are discontinuous wherever the material properties change abruptly. In particular, the continuity of the representative functions is profitable in a numerical context and is, therefore, of great importance in CAD methods.

The complete set of the Maxwell equations (1) to (4) represents the fairly general case of electromagnetic waves in nonlinear, anisotropic media. Most practical problems do not warrant a treatment of such an extensive scope, and some simplifying assumptions may be applied. Three particular sets of simplifications will be treated in the present work, which, however, cover a wide range of problems of practical importance.

The equations of static fields are arrived at by neglecting any variation in time. This case will be treated in Section II, where inhomogeneous, nonlinear media will be allowed.
Static approximations are useful in most high voltage applications, investigations of the static behaviour of electrical machines, and many transmission line problems.

When induced conductive currents are considered and displacement currents are neglected, the equations of eddy current fields are obtained. This case of the quasi-stationary limit will be discussed in Section III, also with inhomogeneous, nonlinear material properties considered. The quasi-stationary approximation is satisfactory in all power frequency applications involving metallic structures. In particular, the analysis of losses in electrical machines, some nondestructive testing problems and also the prediction of the transient behaviour of metallic components in fusion reactors require eddy current fields to be computed.

The third special case considered here is constituted by electromagnetic waves confined to more or less closed regions filled with a linear but possibly anisotropic, inhomogeneous medium. In contrast to the previous cases requiring the solution of differential equations, this problem involves the finding of particular frequencies at which nonzero fields may exist. The corresponding equations will be treated in Section IV. They are useful in analyzing waveguides, cavities and resonators at high frequencies.

The relevant potential functions will be introduced in each of the above cases, and the differential equations and boundary conditions will be written for them. Special attention will be given to the uniqueness of the potentials, since this is of extreme importance from a numerical point of view.

The numerical solution of the differential equations in electromagnetism constitutes the central problem of CAD. Since the use of computers is indispensable in solving this analysis problem, it is usually referred to as Computer Aided Analysis. The first step in the numerical solution of the partial differential equations discussed in the first four sections is to reduce them to a system of algebraic equations, of ordinary differential equations or to a matrix eigenvalue problem. Several techniques are known for the execution of this task, the most widespread ones being the method of finite differences (Mitchell and Griffiths, 1980), variational principles (Mikhlin, 1964) and Galerkin's method (Galerkin, 1915). Since, in the opinion of the author, the last provides the most flexible approach and its scope covers all topics treated in the present work, Galerkin's method is singled out, and its brief general description as well as its application to the partial differential equations in electromagnetism is presented in Section V. The most powerful method for the numerical realization of Galerkin techniques is the Method of Finite Elements. Omitting its extensive treatment, which can be found in many excellent books (e.g., Zienkiewicz, 1977), the application of a special form of nodal finite elements is presented by means of several examples of Computer Aided Analysis in Section VI.

II. STATIC FIELDS

The neglect of the time variation of the field quantities eliminates the interdependence between the electric and the magnetic field. The models so derived describe the electrostatic field, the magnetostatic field and the static current field. In Section II.A, a summary of the differential equations and typical boundary conditions is given in these cases. Section II.B is devoted to the description of static fields by means of scalar potentials, while vector potential formulations are introduced in Section II.C.

A. Differential Equations and Boundary Conditions of Static Fields

The differential equations derived from the Maxwell equations under the assumption of static conditions are given in this subsection for the electrostatic, the magnetostatic and static current fields. The typical boundary conditions are also presented and discussed. The region of interest where the fields are to be computed is invariably denoted by $\Omega$, bounded by the closed surface $\Gamma$. This bounding surface is subdivided into disjoint sections with different types of boundary conditions prescribed on them.

1. Static Electric Field

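Under static conditions, the Maxwell equations (2) and (4) along with the constitutive relationship of Eq. (6) yield the description of the static electric field:

$$\nabla \times \mathbf{E} = 0, \tag{8}$$

$$\nabla \cdot \mathbf{D} = \rho, \quad \text{in } \Omega, \tag{9}$$

$$\mathbf{D} = \varepsilon \mathbf{E}. \tag{10}$$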

The solution of Eqs. (8)-(10) is sought in $\Omega$. In writing the boundary conditions, $\mathbf{n}$ denotes the outer normal of the relevant surface with respect to the region $\Omega$. Two types of boundary conditions are of practical importance.

On the part $\Gamma_E$ of the surface $\Gamma$ bounding the domain $\Omega$, the tangential component of the electric field intensity is known and expressed as a given magnetic surface current density:

$$\mathbf{E} \times \mathbf{n} = \mathbf{K}_m, \quad \text{on } \Gamma_E. \tag{11}$$

In most cases, the surface $\Gamma_E$ is constituted by metallic electrodes and/or planes of symmetry with the electric field intensity normal to them (in both cases there are no magnetic surface currents; $\mathbf{K}_m = 0$). The boundary condition (11) may also model electric double layers (magnetic surface currents; $\mathbf{K}_m \neq 0$). If $\Gamma_E$ is made up of $n_E$ disjoint sections (as in the case when several electrodes are present), then a further $(n_E - 1)$ integrals have to be specified: either integrals of $\mathbf{D}$ over $(n_E - 1)$ surface sections (charges) or integrals of $\mathbf{E}$ along lines connecting one section of $\Gamma_E$ to the remaining $(n_E - 1)$ sections (voltages):

$$\int_{\Gamma_{Ei}} \mathbf{D} \cdot \mathbf{n}\, d\Gamma = Q_i, \quad i = 1, 2, \ldots, n_E - 1 \qquad \text{or} \qquad \int_{C_{Ei}} \mathbf{E} \cdot d\mathbf{l} = U_i, \quad i = 1, 2, \ldots, n_E - 1, \tag{12}$$

where $\Gamma_{Ei}$ denotes the $i$th section of $\Gamma_E$ and $C_{Ei}$ is the curve connecting the surface $\Gamma_{Ei}$ to the section $\Gamma_{En_E}$ (Fig. 1).

On the part $\Gamma_D$ of the bounding surface $\Gamma$ (with $\Gamma_E + \Gamma_D = \Gamma$), the normal component of the electric flux density is given as a surface charge density:

$$\mathbf{D} \cdot \mathbf{n} = -\rho_s, \quad \text{on } \Gamma_D. \tag{13}$$

FIG. 1. The scheme of an electrostatic problem.

The surface $\Gamma_D$ is often made up of surface sections with the normal component of the flux density being zero (usually symmetry planes where the surface charge density is zero, $\rho_s = 0$). It may also model known surface charge densities. A typical arrangement of the electrostatic problem is shown in Fig. 1.

The solution of Eqs. (8)-(10) with the boundary conditions (11)-(13) yields unique $\mathbf{E}$ or $\mathbf{D}$. Indeed, let us assume that two solutions exist. Consequently, their difference $\mathbf{E}^{(0)}$, $\mathbf{D}^{(0)}$ satisfies

$$\nabla \times \mathbf{E}^{(0)} = 0, \tag{14}$$

$$\nabla \cdot \mathbf{D}^{(0)} = 0, \quad \text{in } \Omega, \tag{15}$$

$$\mathbf{D}^{(0)} = \varepsilon \mathbf{E}^{(0)}, \tag{16}$$

$$\mathbf{E}^{(0)} \times \mathbf{n} = 0, \quad \text{on } \Gamma_E, \tag{17}$$

$$\int_{\Gamma_{Ei}} \mathbf{D}^{(0)} \cdot \mathbf{n}\, d\Gamma = 0, \quad i = 1, 2, \ldots, n_E - 1 \qquad \text{or} \qquad \int_{C_{Ei}} \mathbf{E}^{(0)} \cdot d\mathbf{l} = 0, \quad i = 1, 2, \ldots, n_E - 1, \tag{18}$$

$$\mathbf{D}^{(0)} \cdot \mathbf{n} = 0, \quad \text{on } \Gamma_D. \tag{19}$$

The difference field $\mathbf{E}^{(0)}$, $\mathbf{D}^{(0)}$ will be shown to be zero, a fact proving the uniqueness of the solution. The vanishing of the difference field will be demonstrated by showing that the quantity

$$W^{(0)} = \int_{\Omega} \mathbf{E}^{(0)} \cdot \mathbf{D}^{(0)}\, d\Omega \tag{20}$$

is zero. Since the permittivity $\varepsilon$ is positive in Eq. (16), this does in fact imply that the difference field is zero. In view of Eq. (14), $\mathbf{E}^{(0)}$ can be written as

$$\mathbf{E}^{(0)} = -\nabla V^{(0)}. \tag{21}$$

Now, using some vector algebra, $W^{(0)}$ can be rewritten as

$$W^{(0)} = -\int_{\Omega} \nabla V^{(0)} \cdot \mathbf{D}^{(0)}\, d\Omega = \int_{\Omega} V^{(0)}\, \nabla \cdot \mathbf{D}^{(0)}\, d\Omega - \oint_{\Gamma} V^{(0)}\, \mathbf{D}^{(0)} \cdot \mathbf{n}\, d\Gamma. \tag{22}$$

The volume integral on the right-hand side is zero in view of Eq. (15), and Eq. (19) implies that the surface integral vanishes on $\Gamma_D$. According to the boundary condition (17), $V^{(0)}$ is constant on each section $\Gamma_{Ei}$. Its value can be chosen to be zero on the surface $\Gamma_{En_E}$, so we end up with

$$W^{(0)} = -\sum_{i=1}^{n_E - 1} V_i^{(0)} \int_{\Gamma_{Ei}} \mathbf{D}^{(0)} \cdot \mathbf{n}\, d\Gamma, \tag{23}$$

where $V_i^{(0)}$ is the constant value of $V^{(0)}$ on the surface section $\Gamma_{Ei}$. Obviously

$$V_i^{(0)} = \int_{C_{Ei}} \mathbf{E}^{(0)} \cdot d\mathbf{l}, \tag{24}$$

so the conditions (18) do in fact imply that $W^{(0)}$ is zero, i.e., the solution of Eqs. (8)-(10) with the boundary conditions (11)-(13) is unique. Note that no assumption of linearity has been made for the constitutive relationship (10), i.e., the proof holds for the nonlinear case, too.

2. Static Magnetic Field

Neglecting the time derivative term in Eq. (1) and further writing Eqs. (3) and (5), the equations of the magnetostatic field are obtained as

$$\nabla \times \mathbf{H} = \mathbf{J}, \tag{25}$$

$$\nabla \cdot \mathbf{B} = 0, \quad \text{in } \Omega, \tag{26}$$

$$\mathbf{B} = \mu \mathbf{H}. \tag{27}$$

In ferromagnetic materials the permeability $\mu$ is strongly dependent upon the field vectors, making the magnetostatic problem a nonlinear one.

The boundary conditions of practical significance are again of two types. On the part $\Gamma_H$ of the bounding surface $\Gamma$, the tangential component of the magnetic field intensity is assumed to be known and given as a surface current density:

$$\mathbf{H} \times \mathbf{n} = \mathbf{K}, \quad \text{on } \Gamma_H. \tag{28}$$

The surface $\Gamma_H$ is in most cases constituted by boundaries of highly (infinitely) permeable iron parts or symmetry planes with the tangential component of $\mathbf{H}$ vanishing ($\mathbf{K} = 0$). Known surface currents are also modeled by this boundary condition. Similarly to the electrostatic case, further integrals need to be specified if $\Gamma_H$ consists of several disjoint parts. The number of necessary specifications is $(n_H - 1)$ if there are $n_H$ sections; either integrals of $\mathbf{B}$ over $(n_H - 1)$ surface sections (fluxes) or integrals of $\mathbf{H}$ along lines connecting $(n_H - 1)$ of the sections of $\Gamma_H$ to a particular section (magnetic voltages) must be given:

$$\int_{\Gamma_{Hi}} \mathbf{B} \cdot \mathbf{n}\, d\Gamma = \Psi_i, \quad i = 1, 2, \ldots, n_H - 1 \qquad \text{or} \qquad \int_{C_{Hi}} \mathbf{H} \cdot d\mathbf{l} = U_{mi}, \quad i = 1, 2, \ldots, n_H - 1, \tag{29}$$

where the meaning of $\Gamma_{Hi}$ and $C_{Hi}$ is similar to that of $\Gamma_{Ei}$ and $C_{Ei}$ in the static electric case.

On the rest $\Gamma_B$ of the bounding surface $\Gamma$, the normal component of the magnetic flux density is assumed to be given as a magnetic surface charge density:

$$\mathbf{B} \cdot \mathbf{n} = -\rho_{ms}, \quad \text{on } \Gamma_B. \tag{30}$$

This boundary condition can be used for modeling surfaces parallel to flux lines ($\rho_{ms} = 0$), or known flux distributions generated by fictitious magnetic surface charges. Figure 2 shows a schematic layout of a typical magnetostatic problem.

It can be shown in a way similar to the electrostatic case that the magnetic field vectors $\mathbf{H}$ or $\mathbf{B}$ can be obtained in a unique way as the solutions of Eqs. (25)-(27) with the boundary conditions (28)-(30).

3. Static Current Field

By taking the divergence of Eq. (25), a continuity equation stating the source-free property of the current density in the static case is obtained. Supplementing this with Eqs. (8) and (7), the description of a static current field is obtained that is totally analogous to that of the charge-free static electric field with $\mathbf{J}$ written instead of $\mathbf{D}$ and $\sigma$ instead of $\varepsilon$:

$$\nabla \times \mathbf{E} = 0, \tag{31}$$

$$\nabla \cdot \mathbf{J} = 0, \quad \text{in } \Omega, \tag{32}$$

$$\mathbf{J} = \sigma \mathbf{E}. \tag{33}$$
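The step from Eq. (25) to Eq. (32) rests on the identity that the divergence of a curl vanishes. A minimal numerical check is sketched below, assuming an arbitrary smooth test field `H` chosen only for illustration (it is not a field from the chapter); the derivatives are approximated by central differences, and all function names are hypothetical.

```python
import math

# Numerical check that div(curl H) = 0, the identity behind Eq. (32).
# H is an arbitrary smooth test field chosen for illustration only.


def H(x, y, z):
    return (math.sin(y) * z, x * x * math.cos(z), x * y + math.exp(-z))


def partial(f, comp, axis, p, h):
    """Central difference of component `comp` of the vector field f along `axis` at p."""
    hi, lo = list(p), list(p)
    hi[axis] += h
    lo[axis] -= h
    return (f(*hi)[comp] - f(*lo)[comp]) / (2.0 * h)


def curl(f, h):
    """Return a function that evaluates the central-difference curl of f."""
    def curl_f(x, y, z):
        p = (x, y, z)
        return (partial(f, 2, 1, p, h) - partial(f, 1, 2, p, h),
                partial(f, 0, 2, p, h) - partial(f, 2, 0, p, h),
                partial(f, 1, 0, p, h) - partial(f, 0, 1, p, h))
    return curl_f


def div(f, p, h):
    """Central-difference divergence of the vector field f at p."""
    return sum(partial(f, i, i, p, h) for i in range(3))


h = 1e-3
J = curl(H, h)                         # J = curl H, cf. Eq. (25)
div_J = div(J, (0.3, -0.7, 1.1), h)    # vanishes up to rounding error
```

Because central-difference operators along different axes commute, the discrete divergence of the discrete curl cancels term by term, so `div_J` differs from zero only by floating-point rounding.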

FIG. 2. The scheme of a magnetostatic problem.

In view of the analogy mentioned, the typical boundary conditions are briefly mentioned only. On r,, the boundary condition is identical with Eq. (1 1): Ex n

= K,,

r,,

(34)

I,,

(35)

on

with the necessary supplementary conditions

6,1.

J .dS = I i , i

=

1,2,..., nE - 1 or

E e d l = Ui, i = 1,2,. ..,nE - I .

The surface integrals here specify the total current leaving the relevant sections of r,. On r, (as before, r, + r, = r),the normal component of the current density is given: J . n = -J,,

on

r,.

(36)

A static current problem is schematically drawn in Fig. 3. The uniqueness of E or J as the solution of Eqs. (31)-(33) with the boundary conditions (34)-(36) can be shown as before. B. Scalar Potential Descriptions of Static Fields

The most economical way of describing static fields is by means of scalar potentials. This is directly possible in the case of electrostatic and static current fields since the nonrotational property of E allows it to be derived as

10

OSZKAR BIRO A N D K . R. RICHTER

rE: Exn=K, r,: J. n=-J,

‘El

FIG.3. The scheme of a static current problem.

the gradient of a scalar. This will be expounded in Section II.B.1 for the electrostatic case, a discussion that is directly valid for the static current field in view of the existing analogy. In the magnetostatic case it is necessary to split the field into a known rotational and an unknown nonrotational part in order to be able to introduce a scalar potential. Section II.B.2 is devoted to the treatment of this problem. 1. The Electric Scalar Potential

The differential equations and boundary conditions on the electric scalar potential will be presented for the case of the electrostatic field only. As is well known, the electric field intensity can be written as the gradient of a scalar $V$ in view of Eq. (8):

$$\mathbf{E} = -\nabla V. \tag{37}$$

This formulation ensures the satisfaction of Eq. (8) and, using the constitutive relationship (10), Eq. (9) yields a second-order differential equation for $V$:

$$-\nabla \cdot (\varepsilon \nabla V) = \rho, \quad \text{in } \Omega, \tag{38}$$

a generalized Laplace-Poisson equation. The boundary condition (11) allows the specification of $V$ on the surface $\Gamma_E$ if the additional conditions (12) refer to the voltages of the surface sections.

Indeed, on each disjunct section $\Gamma_{Ei}$ of $\Gamma_E$, a scalar function $V_{0i}$ can be defined as follows. Let us choose an arbitrary point $P_i$ on each surface section $\Gamma_{Ei}$, and specify

$$V_{0i} = 0, \quad \text{at } P_i. \tag{39}$$

At every other point $P$ on $\Gamma_{Ei}$, $V_{0i}$ is defined by the line integral of Eq. (40), taken along $C_P$, where $C_P$ is an arbitrary curve lying in $\Gamma_{Ei}$ and connecting $P$ to $P_i$. The value of the integral is independent of the choice of the curve; otherwise the boundary condition (11) would contradict the nonrotational property of E. On electrodes and planes of symmetry where $\mathbf{K} = \mathbf{0}$ (i.e., in most practical cases), $V_{0i} = 0$. The negative gradients of the functions $V_{0i}$ evidently satisfy the boundary condition (11), so $V_{0i}$ defines the electric scalar potential $V$ on $\Gamma_{Ei}$ to the extent of a constant value $V_i$:

$$V = V_{0i} + V_i, \quad \text{on } \Gamma_{Ei}. \tag{41}$$

The constant $V_{n_E}$ can be chosen to be zero, thus specifying the electric scalar potential to vanish on the surface section $\Gamma_{E n_E}$. The remaining $(n_E - 1)$ values are given by the voltages specified in Eq. (12). Indeed, the line integrals along the curves $C_{Ei}$ are the differences of the potential values at the endpoints of the curves. In summary, the boundary condition (11) is formulated for the electric scalar potential $V$ as

$$V = V_0, \quad \text{on } \Gamma_E, \tag{42}$$

where the known function $V_0$ equals the sum of the function $V_{0i}$ and of the voltage $V_i$ on the $i$th section of $\Gamma_E$. The condition (42) is a Dirichlet boundary condition. The boundary condition (13) constitutes a Neumann boundary condition for $V$:

$$\mathbf{n} \cdot \varepsilon \nabla V = \rho_s, \quad \text{on } \Gamma_D. \tag{43}$$

In summary, the electric scalar potential satisfies the differential equation (38) with the boundary conditions (42) and (43). The uniqueness of $V$ as the solution of this boundary value problem is assured if $V$ is specified at an arbitrary point in $\Omega$, since the uniquely defined electric field intensity E determines $V$ up to a constant. This specification is evident if $\Gamma_E$ is present (i.e., Dirichlet boundary conditions are given). In the case when $\Gamma_D = \Gamma$, i.e., Neumann boundary conditions only are specified, the value of $V$ should be set to zero at an arbitrary point in $\Omega$.
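The boundary value problem summarized above can be illustrated by a one-dimensional sketch. The following is not the finite element procedure of this chapter; it is a minimal finite-difference discretization of Eq. (38) on a line, assuming a piecewise-constant permittivity per cell, a Dirichlet value of the type (42) at the left end, and either a Dirichlet value or a Neumann datum of the type (43) at the right end. The function name `solve_poisson_1d` and all parameter names are hypothetical.

```python
def solve_poisson_1d(eps_cells, rho_nodes, v_left, v_right=None, rho_s=0.0, length=1.0):
    """Solve -(eps V')' = rho on [0, length] with V(0) = v_left.

    Right end: Dirichlet (V = v_right) if v_right is given, otherwise the
    Neumann condition eps * dV/dx = rho_s (outer normal pointing in +x).
    eps_cells[i] is the permittivity between node i and node i + 1.
    """
    n = len(eps_cells)
    h = length / n
    last = n - 1 if v_right is not None else n  # index of the last unknown node
    lower, diag, upper, rhs = [], [], [], []
    for i in range(1, last + 1):
        if i < n:  # interior node: flux balance over one cell width
            lower.append(-eps_cells[i - 1])
            diag.append(eps_cells[i - 1] + eps_cells[i])
            upper.append(-eps_cells[i])
            rhs.append(rho_nodes[i] * h * h)
        else:      # Neumann node: flux balance over the last half cell
            lower.append(-eps_cells[i - 1])
            diag.append(eps_cells[i - 1])
            upper.append(0.0)
            rhs.append(h * rho_s + 0.5 * rho_nodes[i] * h * h)
    rhs[0] -= lower[0] * v_left            # move the known left value to the right-hand side
    if v_right is not None:
        rhs[-1] -= upper[-1] * v_right     # same for a known right value
    m = len(diag)
    for k in range(1, m):                  # forward elimination (Thomas algorithm)
        w = lower[k] / diag[k - 1]
        diag[k] -= w * upper[k - 1]
        rhs[k] -= w * rhs[k - 1]
    sol = [0.0] * m
    sol[-1] = rhs[-1] / diag[-1]
    for k in range(m - 2, -1, -1):         # back substitution
        sol[k] = (rhs[k] - upper[k] * sol[k + 1]) / diag[k]
    v = [v_left] + sol
    if v_right is not None:
        v.append(v_right)
    return v


# Two-layer dielectric between electrodes at V = 0 and V = 1 (no charge).
v_layers = solve_poisson_1d([1.0, 1.0, 3.0, 3.0], [0.0] * 5, v_left=0.0, v_right=1.0)
# Uniform medium, Dirichlet on the left, Neumann datum rho_s = 2 on the right.
v_neumann = solve_poisson_1d([1.0] * 4, [0.0] * 5, v_left=0.0, rho_s=2.0)
```

In the two-layer case the continuity of the normal flux density fixes the interface potential at $\varepsilon_2/(\varepsilon_1 + \varepsilon_2)$ of the applied voltage, which the sketch reproduces because the material interface coincides with a node.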

2. Magnetic Scalar Potentials

The static magnetic field intensity cannot be derived as the gradient of a scalar since its curl is not zero. It is, however, possible to split it into two parts as

$$\mathbf{H} = \mathbf{H}_s + \mathbf{H}_M, \tag{44}$$

where $\mathbf{H}_s$ is constructed so that it satisfies

$$\nabla \times \mathbf{H}_s = \mathbf{J}. \tag{45}$$

The remaining part $\mathbf{H}_M$ is then nonrotational and can therefore be written as the negative gradient of a magnetic scalar potential $\Phi$:

$$\mathbf{H}_M = -\nabla \Phi. \tag{46}$$

The scalar function $\Phi$ is called the reduced magnetic scalar potential, since it describes a part of the magnetic field only. There are many possibilities for the construction of $\mathbf{H}_s$ from the known current density J. The most attractive one seems to be to choose $\mathbf{H}_s$ as the magnetic field due to J in free space. This field is calculated in a well known way by means of Biot-Savart's Law:

$$\mathbf{H}_s = \frac{1}{4\pi} \int_{\Omega} \frac{\mathbf{J} \times \mathbf{r}}{|\mathbf{r}|^3} \, d\Omega, \tag{47}$$

where $\mathbf{r}$ is the vector pointing from the source point to the field point. The numerical execution of the integrals in Eq. (47) is well established for all practically important distributions of the current density, so $\mathbf{H}_s$ can be regarded as known. The above formulation evidently satisfies Eq. (25). The differential equation for the reduced scalar potential is provided by Eq. (26) with the constitutive relationship (27) taken into account:

$$\nabla \cdot (\mu \nabla \Phi) = \nabla \cdot (\mu \mathbf{H}_s), \quad \text{in } \Omega. \tag{48}$$

Similarly to the electrostatic case, this is a generalized Laplace-Poisson equation. The boundary conditions (28)-(30) can be formulated for the magnetic scalar potential exactly as in the electrostatic case. The potential $\Phi$ satisfies the Dirichlet boundary condition

$$\Phi = \Phi_0, \quad \text{on } \Gamma_H, \tag{49}$$

where the function $\Phi_0$ can be determined by integrating the function $\mathbf{n} \times (\mathbf{K} - \mathbf{H}_s \times \mathbf{n})$ and taking the given magnetic voltages into account. It is again assumed that the conditions specifying the magnetic voltages in Eq. (29) are in effect.
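The numerical evaluation of the Biot-Savart integral for the source field $\mathbf{H}_s$ can be sketched as follows. The example is a simplification chosen for illustration: it replaces the volume current density by a filamentary circular loop carrying a total current, so that the integral becomes a sum over short tangential segments. The function name `loop_field` and its parameters are hypothetical, and the on-axis comparison uses the standard closed-form result $H_z = I R^2 / [2 (R^2 + z^2)^{3/2}]$.

```python
import math


def loop_field(current, radius, point, segments=360):
    """Magnetic field intensity H of a circular filament in the plane z = 0.

    Sums (I / 4 pi) * dl x r / |r|^3 over tangential segments, where r points
    from the source point on the loop to the field point.
    """
    px, py, pz = point
    hx = hy = hz = 0.0
    dtheta = 2.0 * math.pi / segments
    for k in range(segments):
        theta = (k + 0.5) * dtheta
        # Source point and tangential line element at angle theta.
        sx, sy, sz = radius * math.cos(theta), radius * math.sin(theta), 0.0
        dlx = -radius * math.sin(theta) * dtheta
        dly = radius * math.cos(theta) * dtheta
        dlz = 0.0
        # Vector from the source point to the field point.
        rx, ry, rz = px - sx, py - sy, pz - sz
        r3 = (rx * rx + ry * ry + rz * rz) ** 1.5
        # Accumulate the cross product dl x r divided by |r|^3.
        hx += (dly * rz - dlz * ry) / r3
        hy += (dlz * rx - dlx * rz) / r3
        hz += (dlx * ry - dly * rx) / r3
    scale = current / (4.0 * math.pi)
    return (scale * hx, scale * hy, scale * hz)


h_axis = loop_field(current=1.0, radius=1.0, point=(0.0, 0.0, 0.5))
h_exact = 1.0 / (2.0 * (1.0 + 0.25) ** 1.5)
```

For field points off the axis the same sum applies, but its accuracy then depends on the number of segments; practical coil geometries are usually assembled from such elementary contributions.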

The boundary condition (30) is a Neumann boundary condition for the reduced magnetic scalar potential $\Phi$:

$$\mathbf{n} \cdot \mu \nabla \Phi = \rho_{ms} + \mathbf{n} \cdot \mu \mathbf{H}_s, \quad \text{on } \Gamma_B. \tag{50}$$

The uniqueness of $\Phi$ can also be ensured by setting it to zero at some point. This is only necessary if no Dirichlet boundary conditions (49) are specified.

The above formulation of the static magnetic field is satisfactory if the permeability of the medium is not very high, so that the magnitude of the resultant magnetic field H does not differ much from that of $\mathbf{H}_s$. Otherwise, $|\mathbf{H}| \ll |\mathbf{H}_s|$, and the magnetic field is the difference of the two almost equal quantities $\mathbf{H}_s$ and $\nabla \Phi$. This is a severe numerical disadvantage, since serious cancellation errors may appear in the computation. Fortunately, the source current density is, in most practical cases, zero in highly permeable, ferromagnetic parts, and it is usually confined to coils made of nonferromagnetic materials. If this is the case, it is possible to derive the total magnetic field in media with high permeability as the gradient of a scalar $\psi$, the so-called total magnetic scalar potential. In general, the volume under investigation comprises both coils and ferromagnetic materials: in this case both the reduced and the total magnetic scalar potential are to be introduced (Simkin and Trowbridge, 1979). In accordance with the previous discussion, the region $\Omega$ is subdivided into two subregions $\Omega_\Phi$ and $\Omega_\psi$ (Fig. 4). The domain $\Omega_\Phi$ contains all coils, i.e., it includes all regions where currents are present, but the medium is nonferromagnetic. Then the magnetic field in $\Omega_\Phi$ is formulated in terms of the reduced magnetic scalar potential:

$$\mathbf{H} = \mathbf{H}_s - \nabla \Phi, \quad \text{in } \Omega_\Phi. \tag{51}$$

Due to the absence of ferromagnetic medium in this domain, no cancellation errors occur. The subregion $\Omega_\psi$ includes all ferromagnetic media but the current density is zero in it. Here, the magnetic field can be derived from the total magnetic scalar potential:

$$\mathbf{H} = -\nabla \psi, \quad \text{in } \Omega_\psi. \tag{52}$$

This is possible in view of the absence of the current density in $\Omega_\psi$. It must further be stipulated that there is no closed path in $\Omega_\psi$ that encircles source currents. This, in effect, excludes problems with a closed magnetic circuit surrounding coils with nonzero net current. This difficulty could be overcome by allowing the potential $\psi$ to be discontinuous along some suitable cutting surface (Simkin and Trowbridge, 1979), which is, however, not discussed here.
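The cancellation problem that motivates the total scalar potential can be made concrete with a small numerical experiment. The sketch below is an illustration only and assumes a textbook configuration that is not taken from this chapter: a linear sphere of relative permeability $\mu_r$ in a uniform applied field $H_0$, for which the interior field is $3 H_0 / (\mu_r + 2)$. The function names and the assumed relative error of the computed gradient are hypothetical.

```python
def interior_field(h0, mu_r):
    """Uniform field inside a linear permeable sphere placed in a uniform field h0."""
    return 3.0 * h0 / (mu_r + 2.0)


def reduced_formulation_error(h0, mu_r, rel_err_grad_phi):
    """Relative error of H = Hs - grad(Phi) when grad(Phi) carries a relative error.

    Hs is the applied field h0, so the exact grad(Phi) equals h0 - H.
    """
    h_exact = interior_field(h0, mu_r)
    grad_phi_exact = h0 - h_exact
    grad_phi_num = grad_phi_exact * (1.0 + rel_err_grad_phi)  # perturbed numerical gradient
    h_num = h0 - grad_phi_num                                 # difference of nearly equal numbers
    return abs(h_num - h_exact) / abs(h_exact)


# A 0.1 % error in grad(Phi) at mu_r = 1000 becomes an error of about one third in H.
err_iron = reduced_formulation_error(1.0, 1000.0, 1e-3)
# In a nonferromagnetic medium (mu_r = 1) grad(Phi) vanishes and nothing is amplified.
err_air = reduced_formulation_error(1.0, 1.0, 1e-3)
```

The amplification factor in this example is $(\mu_r - 1)/3$, which is why the reduced potential is confined to nonferromagnetic subregions.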

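FIG. 4. The scheme of a magnetostatic problem with total and reduced scalar potentials.

The differential equations for the two potentials express the solenoidal property of the magnetic flux density:

$$\nabla \cdot (\mu \nabla \Phi) = \nabla \cdot (\mu \mathbf{H}_s), \quad \text{in } \Omega_\Phi, \tag{53}$$

$$\nabla \cdot (\mu \nabla \psi) = 0, \quad \text{in } \Omega_\psi. \tag{54}$$

The boundary parts $\Gamma_H$ and $\Gamma_B$ are also subdivided into $\Gamma_{H\Phi}$, $\Gamma_{H\psi}$ and $\Gamma_{B\Phi}$, $\Gamma_{B\psi}$ in accordance with the subdivision of $\Omega$ into $\Omega_\Phi$ and $\Omega_\psi$ (Fig. 4). On $\Gamma_{H\Phi}$, the reduced scalar potential satisfies the Dirichlet boundary condition

$$\Phi = \Phi_0, \quad \text{on } \Gamma_{H\Phi}, \tag{55}$$

where $\Phi_0$ is determined in the usual way by $\mathbf{n} \times (\mathbf{K} - \mathbf{H}_s \times \mathbf{n})$ and by the given magnetic voltages. Similarly on $\Gamma_{H\psi}$, the total scalar potential is set by $\mathbf{n} \times \mathbf{K}$ and by the magnetic voltages given in Eq. (29):

$$\psi = \psi_0, \quad \text{on } \Gamma_{H\psi}. \tag{56}$$

The Neumann boundary conditions on $\Gamma_B$ are

$$\mathbf{n} \cdot \mu \nabla \Phi = \rho_{ms} + \mathbf{n} \cdot \mu \mathbf{H}_s, \quad \text{on } \Gamma_{B\Phi}, \tag{57}$$

and

$$\mathbf{n} \cdot \mu \nabla \psi = \rho_{ms}, \quad \text{on } \Gamma_{B\psi}. \tag{58}$$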

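The interface between the two subregions is denoted by $\Gamma_{\Phi\psi}$ (Fig. 4). Additional interface conditions have to be satisfied here: the tangential component of the magnetic field intensity as well as the normal component of the magnetic flux density have to be continuous. These can be written for the potentials as

$$(\mathbf{H}_s - \nabla \Phi) \times \mathbf{n}_\Phi - \nabla \psi \times \mathbf{n}_\psi = \mathbf{0}, \quad \text{on } \Gamma_{\Phi\psi}, \tag{59}$$

$$-\mathbf{n}_\Phi \cdot \mu \mathbf{H}_s + \mathbf{n}_\Phi \cdot \mu \nabla \Phi + \mathbf{n}_\psi \cdot \mu \nabla \psi = 0, \quad \text{on } \Gamma_{\Phi\psi}, \tag{60}$$

where the notations $\mathbf{n}_\Phi$ and $\mathbf{n}_\psi$ have been introduced for the outer normals associated with the subregions $\Omega_\Phi$ and $\Omega_\psi$, respectively. Obviously, $\mathbf{n}_\Phi + \mathbf{n}_\psi = \mathbf{0}$. The interface condition (59) can be reformulated in the following way. Let us define the scalar function $\Phi_s$ at a point $P$ on $\Gamma_{\Phi\psi}$ as

$$\Phi_s = \int_{C_P} \mathbf{n}_\Phi \times (\mathbf{H}_s \times \mathbf{n}_\Phi) \cdot d\mathbf{l}, \tag{61}$$

where $C_P$ is an arbitrary curve lying in $\Gamma_{\Phi\psi}$ and connecting point $P$ to a preselected point $P_0$. Now Eq. (59) has the form

$$\Phi_s + \Phi = \psi, \quad \text{on } \Gamma_{\Phi\psi}. \tag{62}$$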

In summary, the differential equations (53) and (54) have to be solved with the boundary conditions (55)-(58) and the interface conditions (60), (62) satisfied. The uniqueness of the solution is ensured by the uniqueness of the field intensity and by setting the potentials to zero at some arbitrary point. This is advantageously done at the point $P_0$ defined above, where the potentials $\Phi$ and $\psi$ are equal.

C. Vector Potential Descriptions of Static Fields

Although the derivation of static fields from scalar potentials is the most economical description, it still involves some difficulties, especially when both the reduced and the total scalar potential are used. Indeed, as seen in the previous subsection, it is not possible to use a continuous total scalar potential in regions that surround conductors with nonzero net current. The coupling of the two potentials introduces extra interface conditions that have to be catered for. Finally, the use of the scalar potentials does not easily accommodate conditions on fluxes in Eq. (29). These problems are by no means prohibitive, and the inexpensiveness of the scalar description may well outweigh them; it is still justified to investigate vector potential formulations which, although more expensive in terms of unknown scalar functions (three of these are necessary to define a vector potential), do not suffer from the above difficulties. A further motivation in introducing vector potentials for static fields is that they cannot be dispensed with in the nonstatic case, and it is easier to state their most important features in the static context. The magnetic vector potential for describing magnetostatic fields is discussed in Section II.C.1, while the current vector potential is presented in Section II.C.2 for static current fields.

1. The Magnetic Vector Potential

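In view of the Maxwell equation (26) stating the solenoidal property of the magnetic field, the flux density vector can be written as the curl of a vector potential:

$$\mathbf{B} = \nabla \times \mathbf{A}. \tag{63}$$

Applying the constitutive relationship of Eq. (27), Eq. (25) yields a differential equation for A:

$$\nabla \times \frac{1}{\mu} \nabla \times \mathbf{A} = \mathbf{J}, \quad \text{in } \Omega. \tag{64}$$

The boundary conditions (28) and (30) are formulated for the vector potential as

$$\frac{1}{\mu} \nabla \times \mathbf{A} \times \mathbf{n} = \mathbf{K}, \quad \text{on } \Gamma_H, \tag{65}$$

and

$$\nabla \times \mathbf{A} \cdot \mathbf{n} = -\rho_{ms}, \quad \text{on } \Gamma_B. \tag{66}$$

The tangential component of the vector potential can be chosen so that the boundary condition (66) is satisfied:

$$\mathbf{n} \times \mathbf{A} = \boldsymbol{\alpha}, \quad \text{on } \Gamma_B. \tag{67}$$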

This is similar to the way the value of the scalar potentials can be selected so that the tangential component of the field vector assumes a prescribed value. Naturally, the selection of $\boldsymbol{\alpha}$ is far from unique. Indeed, it is easy to check that any function $\boldsymbol{\alpha}$ satisfying

$$\nabla \cdot \boldsymbol{\alpha} = \rho_{ms}, \quad \text{on } \Gamma_B, \tag{68}$$

is appropriate. The choice of the tangential component of the vector potential prescribes the fluxes of the sections of $\Gamma_H$. Indeed, in view of Stokes' Theorem, $\boldsymbol{\alpha}$ must satisfy the equations

$$\oint_{C_{Hi}} (\boldsymbol{\alpha} \times \mathbf{n}) \cdot d\mathbf{l} = \Psi_i, \quad i = 1, 2, \ldots, n_H - 1, \tag{69}$$

where $C_{Hi}$ is the curve bounding the surface section $\Gamma_{Hi}$, which is, therefore, also a curve bounding some section of $\Gamma_B$ on which $\boldsymbol{\alpha}$ is defined (see Fig. 2). Therefore, it will be assumed that all the $(n_H - 1)$ specifications in Eq. (29) refer to the fluxes of the surface sections. No general guidelines can be given for the selection of the function $\boldsymbol{\alpha}$ besides that it must satisfy the conditions (68) and (69). However, in many cases $\rho_{ms} = 0$ in Eq. (66) and $n_H = 1$, so there are no fluxes prescribed. It is then possible to choose $\boldsymbol{\alpha} = \mathbf{0}$ in the boundary condition (67); i.e., the tangential component of the vector potential can be chosen to be zero on $\Gamma_B$.

The differential equation (64) and the boundary conditions (65) and (67) do not determine a unique vector potential. They only specify the magnetic flux density, i.e., the curl of A uniquely, so the gradient of any scalar function can be added to the vector potential, a procedure called gauge transformation. The only restriction on this scalar function is so far constituted by the boundary condition (67) specifying the tangential component of A: to retain this boundary condition, the scalar function must assume a constant value on $\Gamma_B$. In order to make the vector potential unique, it is necessary to define its divergence and, on the boundary $\Gamma$ of the region $\Omega$, either its normal or its tangential component (Biro and Preis, 1989a). Since the tangential component of A is already given on the surface $\Gamma_B$, it is natural to specify its normal component on $\Gamma_H$. The simplest way to do this is to introduce the additional boundary condition

$$\mathbf{A} \cdot \mathbf{n} = 0, \quad \text{on } \Gamma_H, \tag{70}$$

in the formulation. This does not affect the magnetic flux density, since it only means that the normal derivative of the scalar function in the gauge transformation is specified. Since, further, the prescription of the divergence of A sets the Laplacian of the gauging scalar, this scalar function and thus also the vector potential become unique. The divergence of the vector potential is advantageously set to zero by the following procedure (Biro and Preis, 1989a): Let us augment the left-hand side of the differential equation (64) by the term $-\nabla (1/\mu) \nabla \cdot \mathbf{A}$, resulting in

$$\nabla \times \frac{1}{\mu} \nabla \times \mathbf{A} - \nabla \frac{1}{\mu} \nabla \cdot \mathbf{A} = \mathbf{J}, \quad \text{in } \Omega. \tag{71}$$

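Hence, the operator on the left-hand side becomes a generalized vector Laplacian. The divergence of this equation yields

$$\nabla^2 \left( \frac{1}{\mu} \nabla \cdot \mathbf{A} \right) = 0, \quad \text{in } \Omega. \tag{72}$$

Writing the normal component of Eq. (71) on the surface $\Gamma_H$ gives

$$\mathbf{n} \cdot \nabla \times \frac{1}{\mu} \nabla \times \mathbf{A} - \frac{\partial}{\partial n} \left( \frac{1}{\mu} \nabla \cdot \mathbf{A} \right) = \mathbf{n} \cdot \mathbf{J}, \quad \text{on } \Gamma_H. \tag{73}$$

Using the vector identity

$$\nabla \cdot \left( \frac{1}{\mu} \nabla \times \mathbf{A} \times \mathbf{n} \right) = \mathbf{n} \cdot \nabla \times \frac{1}{\mu} \nabla \times \mathbf{A} \tag{74}$$

and the boundary condition (65), this results in

$$\frac{\partial}{\partial n} \left( \frac{1}{\mu} \nabla \cdot \mathbf{A} \right) = \nabla \cdot \mathbf{K} - \mathbf{n} \cdot \mathbf{J}, \quad \text{on } \Gamma_H. \tag{75}$$

The right-hand side is obviously zero: i.e., the function $(1/\mu) \nabla \cdot \mathbf{A}$ satisfies the homogeneous Neumann boundary condition

$$\frac{\partial}{\partial n} \left( \frac{1}{\mu} \nabla \cdot \mathbf{A} \right) = 0, \quad \text{on } \Gamma_H, \tag{76}$$

besides obeying the Laplace differential equation (72). It can be made zero by specifying homogeneous Dirichlet boundary conditions on it along $\Gamma_B$:

$$\frac{1}{\mu} \nabla \cdot \mathbf{A} = 0, \quad \text{on } \Gamma_B. \tag{77}$$

Indeed, the differential equation (72) and the boundary conditions (76) and (77) are well known to imply

$$\frac{1}{\mu} \nabla \cdot \mathbf{A} = 0, \quad \text{in } \Omega, \tag{78}$$

which is called the Coulomb gauge. (The proof is similar to that shown in Section II.A.1 for the uniqueness of the static electric field and hence of the electric scalar potential.) This also justifies the inclusion of the extra term in Eq. (71) since this turns out to be zero. In summary, a unique magnetic vector potential obeying the Coulomb gauge and satisfying the differential equation (64) as well as the boundary conditions (65) and (66) can be obtained by solving the differential equation (71) in $\Omega$ with the boundary conditions (65) and (70) on $\Gamma_H$ as well as (67) and (77) on $\Gamma_B$.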
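The gauge freedom discussed in this subsection can be checked numerically. The following sketch uses hypothetical test fields chosen for illustration: a vector potential $\mathbf{A} = (-y/2, x/2, 0)$ with unit flux density in the $z$ direction, and a gauging scalar $\chi = xyz + x^2$. It evaluates curl and divergence by central differences and shows that adding $\nabla \chi$ leaves the curl unchanged while altering the divergence, which is why fixing the divergence removes the ambiguity.

```python
def partial(field, p, comp, axis, h=1e-4):
    """Central-difference derivative of component `comp` of `field` along `axis`."""
    plus, minus = list(p), list(p)
    plus[axis] += h
    minus[axis] -= h
    return (field(plus)[comp] - field(minus)[comp]) / (2.0 * h)


def curl(field, p):
    return (partial(field, p, 2, 1) - partial(field, p, 1, 2),
            partial(field, p, 0, 2) - partial(field, p, 2, 0),
            partial(field, p, 1, 0) - partial(field, p, 0, 1))


def divergence(field, p):
    return sum(partial(field, p, k, k) for k in range(3))


def a_field(p):
    """Assumed test potential with curl (0, 0, 1) and zero divergence."""
    x, y, z = p
    return (-0.5 * y, 0.5 * x, 0.0)


def grad_chi(p):
    """Gradient of the assumed gauging scalar chi = x*y*z + x**2."""
    x, y, z = p
    return (y * z + 2.0 * x, x * z, x * y)


def a_gauged(p):
    """Gauge-transformed potential A + grad(chi)."""
    return tuple(a + g for a, g in zip(a_field(p), grad_chi(p)))


point = (0.3, -0.7, 1.1)
b_original = curl(a_field, point)        # flux density from A
b_gauged = curl(a_gauged, point)         # flux density from A + grad(chi)
div_original = divergence(a_field, point)
div_gauged = divergence(a_gauged, point)  # equals the Laplacian of chi
```

Since the test fields are polynomials of degree at most two, the central differences are exact up to rounding, so the two curls agree to high accuracy while the divergences differ by the Laplacian of $\chi$.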

2. The Current Vector Potential

An alternative to the scalar description of static current fields is the use of a current vector potential, i.e., the derivation of the current density as the curl of a vector potential T (Carpenter and Locke, 1976):

$$\mathbf{J} = \nabla \times \mathbf{T}. \tag{79}$$

In view of the analogy between the static electric and current fields when no charges are present, an electric vector potential can be introduced to yield the electric flux density

$$\mathbf{D} = \nabla \times \mathbf{F}. \tag{80}$$

The differential equations and boundary conditions will only be presented for the static current case. Using the constitutive relationship (33) and the Maxwell equation (31), the following differential equation must be satisfied by the current vector potential T:

$$\nabla \times \frac{1}{\sigma} \nabla \times \mathbf{T} = \mathbf{0}, \quad \text{in } \Omega. \tag{81}$$

The boundary conditions (34) and (36) specifying the tangential component of the electric field intensity and the normal component of the current density can be written as

$$\frac{1}{\sigma} \nabla \times \mathbf{T} \times \mathbf{n} = \mathbf{K}, \quad \text{on } \Gamma_E, \tag{82}$$

and

$$\nabla \times \mathbf{T} \cdot \mathbf{n} = -J_n, \quad \text{on } \Gamma_J. \tag{83}$$

The latter boundary condition can be satisfied by an appropriate prescription of the tangential component of T,

$$\mathbf{n} \times \mathbf{T} = \boldsymbol{\tau}, \quad \text{on } \Gamma_J, \tag{84}$$

so that the function $\boldsymbol{\tau}$ fulfills the condition

$$\nabla \cdot \boldsymbol{\tau} = J_n, \quad \text{on } \Gamma_J. \tag{85}$$

This selection determines the currents crossing the sections of $\Gamma_E$, so that it is assumed that all the conditions (35) refer to these quantities. In many cases, the normal component of the current density is zero on $\Gamma_J$ (as on an interface between conductors and nonconductors) and $n_E = 1$, so there are no currents prescribed; the choice $\boldsymbol{\tau} = \mathbf{0}$ is then appropriate.

As in the case of the magnetic vector potential, the differential equation (81) and the boundary conditions (82) and (84) do not define the current vector potential uniquely; only its curl, the current density, is specified. Here, too, the uniqueness of T can be ensured by applying the Coulomb gauge

$$\frac{1}{\sigma} \nabla \cdot \mathbf{T} = 0, \quad \text{in } \Omega, \tag{86}$$

as well as introducing the additional boundary condition

$$\mathbf{T} \cdot \mathbf{n} = 0, \quad \text{on } \Gamma_E. \tag{87}$$

The Coulomb gauge can be enforced again by replacing the differential equation (81) with

$$\nabla \times \frac{1}{\sigma} \nabla \times \mathbf{T} - \nabla \frac{1}{\sigma} \nabla \cdot \mathbf{T} = \mathbf{0}, \quad \text{in } \Omega, \tag{88}$$

where the additional term is justified by the Coulomb gauge (86) being satisfied. Further, it is necessary to prescribe the additional boundary condition

$$\frac{1}{\sigma} \nabla \cdot \mathbf{T} = 0, \quad \text{on } \Gamma_J. \tag{89}$$

It can be shown, similar to the case of the magnetic vector potential, that a unique current vector potential is yielded by the solution of the differential equation (88) in $\Omega$ with the boundary conditions (82) and (87) on $\Gamma_E$ as well as (84) and (89) on $\Gamma_J$.

III. EDDY CURRENT FIELDS

A time-varying magnetic field induces an electric field, which in turn drives currents in conductors. Including this effect results in a set of differential equations describing eddy current fields. In the quasi-stationary limit reached in this manner, the displacement currents represented by the second term on the right-hand side of the Maxwell equation (1) are still neglected. Therefore, the electric and magnetic fields are coupled in conductors only, while in nonconductors, the magnetic field is essentially static. The differential equations in the conductors and the corresponding boundary and interface conditions are discussed in Section III.A. In Section III.B, the application of vector potentials and scalar potentials is introduced for the description of eddy current fields. The coupling of these potential descriptions in the conductors to the different formulations of the static magnetic field in the nonconducting medium is discussed in Section III.C.
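The neglect of displacement currents in conductors can be made plausible by a rough order-of-magnitude estimate. The sketch below assumes a time-harmonic field of angular frequency $\omega$, for which the ratio of the conduction current density $\sigma E$ to the displacement current density $\omega \varepsilon E$ is $\sigma / (\omega \varepsilon)$. The material values (a copper-like conductivity, vacuum permittivity, power frequency) are illustrative assumptions, and the function name is hypothetical.

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m


def conduction_to_displacement_ratio(sigma, frequency_hz, eps_r=1.0):
    """Ratio sigma / (omega * eps) of conduction to displacement current density."""
    omega = 2.0 * math.pi * frequency_hz
    return sigma / (omega * eps_r * EPS0)


# Assumed copper-like conductor (5.8e7 S/m) at 50 Hz and at 1 MHz.
ratio_50hz = conduction_to_displacement_ratio(5.8e7, 50.0)
ratio_1mhz = conduction_to_displacement_ratio(5.8e7, 1.0e6)
```

Even at megahertz frequencies the ratio remains many orders of magnitude above unity for good conductors, which is consistent with treating the conductor in the quasi-stationary limit.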

FIG. 5. The scheme of an eddy current problem. Region labels: eddy currents; free of eddy currents. Boundary labels: $\mathbf{H} \times \mathbf{n} = \mathbf{K}$; $\mathbf{H} \times \mathbf{n} = \mathbf{0}$.

A. Differential Equations, Boundary and Interface Conditions of Eddy Current Fields

The differential equations obtained from the Maxwell equations in the quasi-stationary limit are summarized in this subsection along with the typical boundary conditions. The region of interest in this case is constituted by conductors carrying eddy currents. It will be denoted by $\Omega_c$. Its boundary is made up of a part of the surface $\Gamma$ bounding the entire studied region and of the interface $\Gamma_{nc}$ to the region $\Omega_n$ free of eddy currents where a static magnetic field is present (Fig. 5). The differential equations and boundary conditions in $\Omega_n$ have been presented in Section II.A.2. The subset of the Maxwell equations studied in eddy current carrying conductors is

$$\nabla \times \mathbf{H} = \mathbf{J}, \tag{90}$$

$$\nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t}, \tag{91}$$

$$\nabla \cdot \mathbf{B} = 0, \quad \text{in } \Omega_c, \tag{92}$$

$$\mathbf{B} = \mu \mathbf{H}, \tag{93}$$

$$\mathbf{J} = \sigma \mathbf{E}. \tag{94}$$

The boundary conditions on the surface define either the tangential component of the electric field intensity or the tangential component of the magnetic field intensity as zero:

$$\mathbf{E} \times \mathbf{n} = \mathbf{0}, \quad \text{on } \Gamma_E, \tag{95}$$

and

$$\mathbf{H} \times \mathbf{n} = \mathbf{0}, \quad \text{on } \Gamma_{Hc}. \tag{96}$$

In most cases both $\Gamma_E$ and $\Gamma_{Hc}$ are planes of symmetry. The surface $\Gamma_E$ may also model electrodes where a voltage source is connected to the conductor. The boundary condition (95) and Eq. (91) also imply that the normal component of the flux density is zero on $\Gamma_E$, while the boundary condition (96) and Eq. (90) mean that the normal component of the current density vanishes on $\Gamma_{Hc}$. On the interface $\Gamma_{nc}$ between the conductors and the surrounding nonconducting region, interface conditions have to be satisfied. These state the continuity of the tangential component of the magnetic field intensity and of the normal component of the magnetic flux density:

$$\mathbf{H} \times \mathbf{n}_c + \mathbf{H}_n \times \mathbf{n}_n = \mathbf{0}, \quad \text{on } \Gamma_{nc}, \tag{97}$$

$$\mathbf{B} \cdot \mathbf{n}_c + \mathbf{B}_n \cdot \mathbf{n}_n = 0, \quad \text{on } \Gamma_{nc}, \tag{98}$$

where $\mathbf{H}_n$ and $\mathbf{B}_n$ are the static field intensity and flux density in the nonconducting region, and $\mathbf{n}_c$, $\mathbf{n}_n$ are the outer normals associated with the respective subregions. The interface condition (97) also implies that the normal component of the current density on the interface is zero:

$$\mathbf{J} \cdot \mathbf{n}_c = 0, \quad \text{on } \Gamma_{nc}. \tag{99}$$

The magnetic fields must further satisfy the initial conditions

$$\mathbf{B}(t = 0) = \mathbf{B}_0, \quad \text{in } \Omega_c, \tag{100}$$

$$\mathbf{B}_n(t = 0) = \mathbf{B}_{n0}, \quad \text{in } \Omega_n. \tag{101}$$

It will be shown now that the field quantities E, H and $\mathbf{H}_n$ are unique solutions of the above differential equations, boundary and initial conditions if the field $\mathbf{H}_n$ satisfies the equations (25)-(27) as well as the boundary conditions (28) and (30) of the static magnetic field and the interface conditions (97) and (98). Indeed, let us assume that two different solutions exist and let us denote their difference by $\mathbf{E}^{(0)}$, $\mathbf{J}^{(0)}$, $\mathbf{H}^{(0)}$, $\mathbf{B}^{(0)}$ in $\Omega_c$ and by $\mathbf{B}_n^{(0)}$, $\mathbf{H}_n^{(0)}$ in $\Omega_n$. These fields

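satisfy the following differential equations, constitutive relationships, boundary, interface, and initial conditions:

$$\nabla \times \mathbf{H}^{(0)} = \mathbf{J}^{(0)}, \tag{102}$$

$$\nabla \times \mathbf{E}^{(0)} = -\frac{\partial \mathbf{B}^{(0)}}{\partial t}, \quad \text{in } \Omega_c, \tag{103}$$

$$\mathbf{B}^{(0)} = \mu \mathbf{H}^{(0)}, \quad \mathbf{J}^{(0)} = \sigma \mathbf{E}^{(0)}, \tag{104}$$

$$\nabla \times \mathbf{H}_n^{(0)} = \mathbf{0}, \tag{105}$$

$$\nabla \cdot \mathbf{B}_n^{(0)} = 0, \quad \text{in } \Omega_n, \tag{106}$$

$$\mathbf{B}_n^{(0)} = \mu \mathbf{H}_n^{(0)}, \tag{107}$$

$$\mathbf{E}^{(0)} \times \mathbf{n}_c = \mathbf{0}, \quad \text{on } \Gamma_E, \tag{108}$$

$$\mathbf{H}^{(0)} \times \mathbf{n}_c = \mathbf{0}, \quad \text{on } \Gamma_{Hc}, \tag{109}$$

$$\mathbf{H}_n^{(0)} \times \mathbf{n}_n = \mathbf{0}, \quad \text{on } \Gamma_{Hn}, \tag{110}$$

$$\mathbf{B}_n^{(0)} \cdot \mathbf{n}_n = 0, \quad \text{on } \Gamma_B, \tag{111}$$

$$\mathbf{H}^{(0)} \times \mathbf{n}_c + \mathbf{H}_n^{(0)} \times \mathbf{n}_n = \mathbf{0}, \quad \text{on } \Gamma_{nc}, \tag{112}$$

$$\mathbf{B}^{(0)} \cdot \mathbf{n}_c + \mathbf{B}_n^{(0)} \cdot \mathbf{n}_n = 0, \quad \text{on } \Gamma_{nc}, \tag{113}$$

$$\mathbf{B}^{(0)}(t = 0) = \mathbf{0}, \quad \text{in } \Omega_c, \tag{114}$$

$$\mathbf{B}_n^{(0)}(t = 0) = \mathbf{0}, \quad \text{in } \Omega_n. \tag{115}$$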

The difference fields will be shown to vanish by proving that the quantity $P^{(0)}$ defined in Eq. (116) is zero. Indeed, since the conductivity $\sigma$ is positive in Eq. (104), the vanishing of $P^{(0)}$ implies the inequality (117). Defining the positive, constant permeabilities $\mu_c$ and $\mu_n$ by Eqs. (118) and (119), the inequality (117) can be rewritten as Eq. (120). Since the difference fields satisfy the initial conditions (114) and (115), the positive integrals on the left-hand side are zero, proving that $\mathbf{B}^{(0)}$ and $\mathbf{B}_n^{(0)}$ vanish. This, and the integral $P^{(0)}$ in Eq. (116) being zero, implies that $\mathbf{E}^{(0)}$ is zero, too. To prove the equality (116), let us multiply Eq. (102) by $\mathbf{E}^{(0)}$ and Eq. (103) by $\mathbf{H}^{(0)}$ and let us subtract the latter from the former. Using a vector identity, Eq. (121) is obtained.

Let us now introduce the scalar potential $\psi^{(0)}$ in $\Omega_n$ as

$$\mathbf{H}_n^{(0)} = -\nabla \psi^{(0)}, \tag{122}$$

which is possible in view of Eq. (105). The boundary condition (110) then implies that

$$\psi^{(0)} = 0, \quad \text{on } \Gamma_{Hn}, \tag{123}$$

with the homogeneous counterparts of Eq. (29) possibly used. Equation (124) is then valid, where use has been made of the differential equation (106) and of the definition (122). Integrating Eq. (121) over $\Omega_c$ and Eq. (124) over $\Omega_n$, and using Gauss's Theorem, $P^{(0)}$ can be written as a sum of surface integrals, Eq. (125). The surface integrals over $\Gamma_E$, $\Gamma_{Hc}$, $\Gamma_B$ and $\Gamma_{Hn}$ are zero in view of the boundary conditions (108), (109), (111) and (123), respectively, i.e., only the integrals over the interface $\Gamma_{nc}$ remain. Now, using the interface conditions (112) and (113), the differential equation (103) and the definition (122), the following can be written ($\mathbf{n}_c = -\mathbf{n}_n$):

$$P^{(0)} = -\int_{\Gamma_{nc}} \left( \nabla \psi^{(0)} \times \mathbf{E}^{(0)} + \psi^{(0)} \nabla \times \mathbf{E}^{(0)} \right) \cdot \mathbf{n}_c \, d\Gamma = -\int_{\Gamma_{nc}} \nabla \times \left( \psi^{(0)} \mathbf{E}^{(0)} \right) \cdot \mathbf{n}_c \, d\Gamma. \tag{126}$$

If the surface $\Gamma_{nc}$ is closed (i.e., $\Omega_n$ completely surrounds $\Omega_c$) then this last integral is clearly zero since, applying Gauss's Theorem, the divergence of the integrand is zero. Otherwise Eq. (126) can be reformulated by means of Stokes's Theorem as

$$P^{(0)} = -\oint_{C_{nc}} \psi^{(0)} \mathbf{E}^{(0)} \cdot d\mathbf{l}, \tag{127}$$

where $C_{nc}$ is the curve bounding $\Gamma_{nc}$. However, any part of $C_{nc}$ either bounds $\Gamma_{Hc}$ (and $\Gamma_{Hn}$) or $\Gamma_E$ (and $\Gamma_B$) (see Fig. 5) so, in view of the boundary conditions (108) and (123), $P^{(0)}$ is in fact zero. This, however, as pointed out, proves the uniqueness of the eddy current field in the conducting region and of the static magnetic field in the nonconducting region.

B. Potential Descriptions of Eddy Current Fields

The field vectors of the eddy current field in conductors can be advantageously derived from potentials. In order to have the potentials continuous on material interfaces, i.e., on surfaces where the material properties change abruptly, it is necessary to introduce a vector and a scalar potential simultaneously; i.e., generally four continuous scalar quantities are required for the description of the field. The continuity of the potentials is advantageous from a numerical point of view. The use of a magnetic vector potential along with an electric scalar potential is discussed in Section III.B.1, and the description of the eddy current field by means of a current vector potential and a magnetic scalar potential is presented in Section III.B.2. The boundary conditions on the potentials implied by the boundary conditions on the field vectors are also stated. The interface conditions will be treated in Section III.B.3.

1. The Magnetic Vector Potential and the Electric Scalar Potential

The magnetic vector potential can be introduced for the description of eddy current fields in the same way as in the case of static magnetic fields. This enforces the satisfaction of Eq. (92). Then, Faraday's Law, which is Eq. (91), can be satisfied by introducing the electric scalar potential to define the curl-free part of the electric field intensity:

$$\mathbf{B} = \nabla \times \mathbf{A}, \tag{128}$$

$$\mathbf{E} = -\frac{\partial \mathbf{A}}{\partial t} - \nabla V. \tag{129}$$

Using the constitutive equations (93) and (94), a differential equation for these potentials is provided by the Maxwell equation (90):

$$\nabla \times \frac{1}{\mu} \nabla \times \mathbf{A} + \sigma \frac{\partial \mathbf{A}}{\partial t} + \sigma \nabla V = \mathbf{0}, \quad \text{in } \Omega_c. \tag{130}$$

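The boundary conditions (95) and (96) can be written by means of the potentials as

$$-\frac{\partial \mathbf{A}}{\partial t} \times \mathbf{n} - \nabla V \times \mathbf{n} = \mathbf{0}, \quad \text{on } \Gamma_E, \tag{131}$$

and

$$\frac{1}{\mu} \nabla \times \mathbf{A} \times \mathbf{n} = \mathbf{0}, \quad \text{on } \Gamma_{Hc}. \tag{132}$$

The tangential component of the electric field intensity can be set to zero by splitting the boundary condition (131) into the following two specifications:

$$\mathbf{n} \times \mathbf{A} = \mathbf{0}, \quad \text{on } \Gamma_E, \tag{133}$$

and

$$V = V_0, \quad \text{on } \Gamma_E. \tag{134}$$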

The condition (133) is consistent with the fact that the normal component of the flux density vanishes on $\Gamma_E$. The function $V_0$ assumes a constant value on each disjunct section of $\Gamma_E$ and the differences of these potentials equal the voltages maintained by the voltage sources, if any (Biro et al., 1988). In order to make the vector potential unique, the Coulomb gauge can further be enforced, and either the tangential or the normal component of A should be defined on the boundary (Biro and Preis, 1989a). The tangential component already being given on $\Gamma_E$, the normal component on $\Gamma_{Hc}$ is specified, similar to the static case:

$$\mathbf{A} \cdot \mathbf{n} = 0, \quad \text{on } \Gamma_{Hc}. \tag{135}$$

Further, the differential equation (130) is augmented as

$$\nabla \times \frac{1}{\mu} \nabla \times \mathbf{A} - \nabla \frac{1}{\mu} \nabla \cdot \mathbf{A} + \sigma \frac{\partial \mathbf{A}}{\partial t} + \sigma \nabla V = \mathbf{0}, \quad \text{in } \Omega_c. \tag{136}$$

This modified equation no longer implies the solenoidal property of the current density; therefore, this has to be satisfied explicitly:

$$\nabla \cdot \left( -\sigma \frac{\partial \mathbf{A}}{\partial t} - \sigma \nabla V \right) = 0, \quad \text{in } \Omega_c. \tag{137}$$

This resolves the contradiction of the differential equation (130) providing no more than three scalar equations for the four scalar unknowns represented by A and $V$; the differential equations (136) and (137) provide the four necessary scalar equations. Note that these two differential equations imply the satisfaction of Eq. (72) in $\Omega_c$; the Laplacian of $(1/\mu) \nabla \cdot \mathbf{A}$ is zero here. Its normal derivative vanishes on $\Gamma_{Hc}$, as can be shown by a reasoning similar to that applied in the static case if the normal component of the current density is enforced to be zero:

$$\mathbf{n} \cdot \left( -\sigma \frac{\partial \mathbf{A}}{\partial t} - \sigma \nabla V \right) = 0, \quad \text{on } \Gamma_{Hc}. \tag{138}$$

Therefore, it should satisfy the Dirichlet boundary condition on $\Gamma_E$:

$$\frac{1}{\mu} \nabla \cdot \mathbf{A} = 0, \quad \text{on } \Gamma_E. \tag{139}$$

The uniqueness of the vector potential is also affected by the interface conditions on the surface $\Gamma_{nc}$. These will be discussed in Section III.C. The uniqueness of the scalar potential is ensured by the electric field intensity and the vector potential being unique provided the scalar potential is fixed in at least one point. This is done in a natural way on $\Gamma_E$ if present; otherwise an arbitrary point must be chosen.

2. The Current Vector Potential and the Magnetic Scalar Potential

The current density in eddy current fields being solenoidal, it can be represented as the curl of a current vector potential as in static current fields. The magnetic scalar potential can be introduced to describe the curl-free part of the magnetic field intensity, thus ensuring the satisfaction of the Maxwell equation (90):

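$$\mathbf{J} = \nabla \times \mathbf{T}, \tag{140}$$

$$\mathbf{H} = \mathbf{T} - \nabla \psi. \tag{141}$$

A differential equation for these potentials is provided by the Maxwell equation (91), which is written with the constitutive relationships (93) and (94) used:

$$\nabla \times \frac{1}{\sigma} \nabla \times \mathbf{T} + \frac{\partial}{\partial t} (\mu \mathbf{T}) - \frac{\partial}{\partial t} (\mu \nabla \psi) = \mathbf{0}, \quad \text{in } \Omega_c. \tag{142}$$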

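The boundary conditions prescribing the tangential components of the electric and magnetic field intensities are

$$\frac{1}{\sigma} \nabla \times \mathbf{T} \times \mathbf{n} = \mathbf{0}, \quad \text{on } \Gamma_E, \tag{143}$$

and

$$\mathbf{T} \times \mathbf{n} - \nabla \psi \times \mathbf{n} = \mathbf{0}, \quad \text{on } \Gamma_{Hc}. \tag{144}$$

This latter boundary condition can be split into two by defining both the tangential component of the current vector potential and the value of the magnetic scalar potential:

$$\mathbf{n} \times \mathbf{T} = \mathbf{0}, \quad \text{on } \Gamma_{Hc}, \tag{145}$$

and

$$\psi = \psi_0, \quad \text{on } \Gamma_{Hc}. \tag{146}$$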

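The condition (145) explicitly enforces that the normal component of the current density is zero on $\Gamma_{Hc}$. The function $\psi_0$ represents the magnetic voltages possibly acting between the sections of $\Gamma_{Hc}$. The tangential component of T being set on $\Gamma_{Hc}$, its normal component is forced to vanish on $\Gamma_E$:

$$\mathbf{T} \cdot \mathbf{n} = 0, \quad \text{on } \Gamma_E. \tag{147}$$

Now, to impose the Coulomb gauge, the differential equation (142) is modified to

$$\nabla \times \frac{1}{\sigma} \nabla \times \mathbf{T} - \nabla \frac{1}{\sigma} \nabla \cdot \mathbf{T} + \frac{\partial}{\partial t} (\mu \mathbf{T}) - \frac{\partial}{\partial t} (\mu \nabla \psi) = \mathbf{0}, \quad \text{in } \Omega_c. \tag{148}$$

Maxwell's equation (92) stating the solenoidal property of the flux density yields the further differential equation

$$\nabla \cdot (\mu \mathbf{T} - \mu \nabla \psi) = 0, \quad \text{in } \Omega_c. \tag{149}$$

These two equations set the Laplacian of $(1/\sigma) \nabla \cdot \mathbf{T}$ to zero. Provided the normal component of the flux density is explicitly set to zero on $\Gamma_E$,

$$(\mu \mathbf{T} - \mu \nabla \psi) \cdot \mathbf{n} = 0, \quad \text{on } \Gamma_E, \tag{150}$$

Eq. (148) and the boundary condition (143) imply that the normal derivative of $(1/\sigma) \nabla \cdot \mathbf{T}$ vanishes on $\Gamma_E$. As before, the Coulomb gauge is enforced if the condition

$$\frac{1}{\sigma} \nabla \cdot \mathbf{T} = 0, \quad \text{on } \Gamma_{Hc}, \tag{151}$$

is prescribed.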

The implications of the interface conditions along the surface $\Gamma_{nc}$ on the uniqueness of the current vector potential will be presented in Section III.C. The uniqueness of the magnetic scalar potential is assured by the above equations, a fact that can be shown by a reasoning similar to the one applied in the case of the electric scalar potential coupled with the magnetic vector potential in Section III.B.1.

C. Coupling Eddy Current and Static Magnetic Fields

In most eddy current problems, the conductors carrying the eddy currents are at least partially surrounded by a nonconducting medium free of eddy currents where a static magnetic field is present. This static magnetic field is induced both by the eddy currents and by the given current density of the possibly present coils. The two possible ways of deriving the eddy current field from a vector and a scalar potential and the two alternatives of describing the static magnetic field by means of a magnetic vector or a scalar potential give rise to four possible formulations of eddy current problems. In Section III.C.1, the coupling of the description by a magnetic vector potential and an electric scalar potential to the formulation of the static field in terms of a vector potential is presented (A,V-A formulation). In Section III.C.2, the alternative of using the magnetic scalar potential in the nonconductors is discussed, possibly retaining the vector potential description in some part of the nonconducting region (A,V-A-$\psi$ formulation). Section III.C.3 is devoted to the formulation employing a current vector potential and a magnetic scalar potential in the eddy current region coupled to a magnetic scalar potential description in the eddy current free domain (T,$\psi$-$\psi$ formulation). Finally in Section III.C.4, the T,$\psi$ formulation is retained in the conductors, but the magnetic vector potential is used in at least some part of the nonconducting region with the scalar description possibly employed in other parts (T,$\psi$-A-$\psi$ formulation).

1. The A,V-A Formulation

In this formulation, a magnetic vector potential is used both in the eddy current region $\Omega_c$ and in the eddy current free domain $\Omega_n$ (Chari et al., 1982; Biddlecombe et al., 1982). It is essential from a numerical point of view that this vector potential is continuous on the interface $\Gamma_{nc}$ between the two regions. The continuity of the tangential component of $\mathbf{A}$ immediately enforces the continuity of the normal component of the flux density, i.e., the satisfaction of the interface condition (98). The interface condition (97) stating the continuity of the tangential component of the magnetic field intensity is still to be fulfilled. A further interface condition is necessary for the enforcement of the Coulomb gauge on the vector potential. Namely, the differential equations (71) used in $\Omega_n$ and (136) as well as (137) used in $\Omega_c$ have been seen to imply that the Laplacian of the quantity $\frac{1}{\mu}\nabla\cdot\mathbf{A}$ is zero. Further, its normal derivative vanishes on $\Gamma_H$, whereas its value is zero on $\Gamma_B$ and $\Gamma_E$. In order that these imply that $\frac{1}{\mu}\nabla\cdot\mathbf{A}$ itself is zero, it is necessary that this quantity be continuous on the interface $\Gamma_{nc}$, an interface condition that is, therefore, indispensable if the Coulomb gauge is to be satisfied. For the sake of convenience, the differential equations, boundary and interface conditions governing the potentials in the A,V-A formulation are summarized again:

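$$\nabla\times\frac{1}{\mu}\nabla\times\mathbf{A} - \nabla\frac{1}{\mu}\nabla\cdot\mathbf{A} + \sigma\frac{\partial\mathbf{A}}{\partial t} + \sigma\nabla V = \mathbf{0} \quad \text{in } \Omega_c, \tag{152}$$

$$\nabla\cdot\left(-\sigma\frac{\partial\mathbf{A}}{\partial t} - \sigma\nabla V\right) = 0 \quad \text{in } \Omega_c, \tag{153}$$

$$\nabla\times\frac{1}{\mu}\nabla\times\mathbf{A} - \nabla\frac{1}{\mu}\nabla\cdot\mathbf{A} = \mathbf{J} \quad \text{in } \Omega_n, \tag{154}$$

$$\mathbf{n}\times\mathbf{A} = \mathbf{0} \quad \text{on } \Gamma_E, \tag{155}$$

$$V = V_0 \quad \text{on } \Gamma_E, \tag{156}$$

$$\frac{1}{\mu}\nabla\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_E, \tag{157}$$

$$\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n} = \mathbf{0} \quad \text{on } \Gamma_{Hc}, \tag{158}$$

$$\mathbf{n}\cdot\left(-\sigma\frac{\partial\mathbf{A}}{\partial t} - \sigma\nabla V\right) = 0 \quad \text{on } \Gamma_{Hc}, \tag{159}$$

$$\mathbf{n}\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_{Hc}, \tag{160}$$

$$\mathbf{n}\times\mathbf{A} = \boldsymbol{\alpha} \quad \text{on } \Gamma_B, \tag{161}$$

$$\frac{1}{\mu}\nabla\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_B, \tag{162}$$

$$\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n} = \mathbf{K} \quad \text{on } \Gamma_{Hn}, \tag{163}$$

$$\mathbf{n}\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_{Hn}, \tag{164}$$

$$\mathbf{A} \ \text{continuous on } \Gamma_{nc}, \tag{165}$$

$$\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n}_c + \frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n}_n = \mathbf{0} \quad \text{on } \Gamma_{nc}, \tag{166}$$

$$\frac{1}{\mu}\nabla\cdot\mathbf{A}\,\mathbf{n}_c + \frac{1}{\mu}\nabla\cdot\mathbf{A}\,\mathbf{n}_n = \mathbf{0} \quad \text{on } \Gamma_{nc}, \tag{167}$$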

where the two sections of the surface $\Gamma_H$ bounding $\Omega_c$ and $\Omega_n$ have been denoted by $\Gamma_{Hc}$ and $\Gamma_{Hn}$, respectively (Fig. 5).

2. The A,V-A-ψ Formulation

The vector potential description of the static magnetic field requires three scalar unknown functions (the three components of $\mathbf{A}$) to be introduced in the nonconducting region $\Omega_n$. A more economical way of representing this field is the use of a magnetic scalar potential. It is, however, not always possible to use a continuous scalar potential outside the eddy current region; in fact, this would enforce zero net current crossing any surface bounded by a curve in $\Omega_n$. This would exclude any problem involving a multiply connected eddy current region, for example, any conductor with one or more holes in it. The difficulty could be overcome by allowing the magnetic scalar potential to be discontinuous along some cutting surfaces (Rodger and Eastham, 1987), a possibility that is not discussed here. An alternative is to retain the vector potential description in a part of $\Omega_n$, so that the region where the magnetic scalar potential is used surrounds a simply connected domain. Typically, the vector potential is used in the holes of the eddy current carrying conductors with the magnetic scalar potential introduced in the remaining nonconducting space (Leonard and Rodger, 1988). For the sake of simplicity, a total magnetic scalar potential will be assumed to be allowable, which is appropriate only if no coils with given current density are present in $\Omega_n$. The generalization to the case with nonzero impressed currents is evident by means of the introduction of a reduced scalar potential discussed in Section II.B.2. In the presence of ferromagnetic materials in the eddy current free region it may be numerically advantageous to simultaneously use a reduced and a total scalar potential. According to the above reasoning, the nonconducting region $\Omega_n$ is divided into two subregions: the vector potential description of Eq. (63) is used in $\Omega_{nA}$, and the total scalar potential formulation of Eq. (52) is employed in $\Omega_{n\psi}$.
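The restriction imposed by a continuous scalar potential can be checked numerically. The following is a minimal sketch, not part of the original formulation: it compares the circulation of $\mathbf{H}$ around a closed loop for the field of a straight current-carrying wire with that of a field derived from a single-valued potential. The function names, the sample potential $\psi = xy + x$, and the unit current are assumptions made purely for illustration.

```python
import numpy as np

def circulation(H, radius=1.0, n=2000):
    """Approximate the line integral of H along a circle around the z axis."""
    phi = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    pts = radius * np.stack([np.cos(phi), np.sin(phi)], axis=1)
    tangents = np.stack([-np.sin(phi), np.cos(phi)], axis=1)
    dl = 2.0 * np.pi * radius / n
    return float(np.sum(H(pts) * tangents) * dl)

def H_wire(pts, current=1.0):
    """Field of an infinitely long wire on the z axis: H = I / (2 pi r) e_phi."""
    x, y = pts[:, 0], pts[:, 1]
    r2 = x**2 + y**2
    return current / (2.0 * np.pi) * np.stack([-y / r2, x / r2], axis=1)

def H_gradient(pts):
    """H = -grad(psi) for the single-valued sample potential psi = x*y + x."""
    x, y = pts[:, 0], pts[:, 1]
    return -np.stack([y + 1.0, x], axis=1)

circ_wire = circulation(H_wire)      # equals the enclosed current (Ampere's law)
circ_grad = circulation(H_gradient)  # vanishes for any single-valued potential
print(f"loop around a current-carrying wire: {circ_wire:.6f}")
print(f"loop in a gradient field:            {circ_grad:.2e}")
```

The first circulation reproduces the enclosed current, whereas the second vanishes up to rounding. A loop in the nonconducting region that links a conductor with a hole therefore cannot be described by a continuous scalar potential alone unless the net current is zero.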
Consequently, the interface $\Gamma_{nc}$ has two sections: the surface $\Gamma_{ncA}$ divides $\Omega_c$ and $\Omega_{nA}$, while the interface between the subregions $\Omega_c$ and $\Omega_{n\psi}$ is denoted by $\Gamma_{nc\psi}$. The further interface between $\Omega_{nA}$ and $\Omega_{n\psi}$ is $\Gamma_{A\psi}$. The boundary surfaces $\Gamma_H$ and $\Gamma_B$ are also subdivided into $\Gamma_{Hc}$, $\Gamma_{HA}$, $\Gamma_{H\psi}$ and $\Gamma_{BA}$, $\Gamma_{B\psi}$, respectively (Fig. 6). The differential equations in $\Omega_c$ and the boundary conditions on $\Gamma_E$ and $\Gamma_{Hc}$ have been introduced in Section III.B.1. Similarly, the differential equation in $\Omega_{n\psi}$ and the boundary conditions on $\Gamma_{H\psi}$ and $\Gamma_{B\psi}$ have been treated in Section II.B.2, and the differential equation in $\Omega_{nA}$ as well as the boundary conditions along $\Gamma_{BA}$ and $\Gamma_{HA}$ have been presented in Section II.C.1.

FIG. 6. The scheme of an eddy current problem with a multiply connected conductor.

The magnetic vector potential in $\Omega_c$ and $\Omega_{nA}$ is chosen to be continuous on the interface $\Gamma_{ncA}$; therefore, the interface conditions valid here are the same as those on the entire interface $\Gamma_{nc}$ in the A,V-A formulation introduced in the previous subsection. The remaining interface conditions are the continuity of the tangential component of the magnetic field intensity and of the normal component of the magnetic flux density on the interfaces $\Gamma_{nc\psi}$ and $\Gamma_{A\psi}$, i.e., wherever a vector potential description is coupled to a scalar potential formulation of the magnetic field. Since these surfaces are in fact boundaries of regions with different potentials, i.e., no potential is continuous here, it is natural to treat the interface conditions as boundary conditions; the tangential component of the magnetic field intensity can be taken to be specified in terms of the scalar potential for the regions with a vector potential description, and conversely, the normal component of the flux density can be thought of as given in terms of the vector potential for the scalar potential region. Specifically, the interface condition stating the continuity of $\mathbf{H}\times\mathbf{n}$ is regarded as a boundary condition similar to (65) and (132) (it is, naturally, nonhomogeneous), and the continuity of $\mathbf{B}\cdot\mathbf{n}$ is treated as a boundary condition similar to (58). This approach implies that, these two interfaces being the boundaries of regions with a vector potential defined with the tangential component of $\mathbf{H}$ given, the normal component of the vector potential is set to zero here (cf. Eq. (70) in conjunction with Eq. (65) and Eq. (135) in conjunction with Eq. (132)). As a summary, the differential equations, boundary and interface conditions defining unique potentials in the A,V-A-ψ formulation are as follows:

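$$\nabla\times\frac{1}{\mu}\nabla\times\mathbf{A} - \nabla\frac{1}{\mu}\nabla\cdot\mathbf{A} + \sigma\frac{\partial\mathbf{A}}{\partial t} + \sigma\nabla V = \mathbf{0} \quad \text{in } \Omega_c, \tag{168}$$

$$\nabla\cdot\left(-\sigma\frac{\partial\mathbf{A}}{\partial t} - \sigma\nabla V\right) = 0 \quad \text{in } \Omega_c, \tag{169}$$

$$\nabla\times\frac{1}{\mu}\nabla\times\mathbf{A} - \nabla\frac{1}{\mu}\nabla\cdot\mathbf{A} = \mathbf{0} \quad \text{in } \Omega_{nA}, \tag{170}$$

$$\nabla\cdot(\mu\nabla\psi) = 0 \quad \text{in } \Omega_{n\psi}, \tag{171}$$

$$\mathbf{n}\times\mathbf{A} = \mathbf{0} \quad \text{on } \Gamma_E, \tag{172}$$

$$V = V_0 \quad \text{on } \Gamma_E, \tag{173}$$

$$\frac{1}{\mu}\nabla\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_E, \tag{174}$$

$$\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n} = \mathbf{0} \quad \text{on } \Gamma_{Hc}, \tag{175}$$

$$\mathbf{n}\cdot\left(-\sigma\frac{\partial\mathbf{A}}{\partial t} - \sigma\nabla V\right) = 0 \quad \text{on } \Gamma_{Hc}, \tag{176}$$

$$\mathbf{n}\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_{Hc}, \tag{177}$$

$$\mathbf{n}\times\mathbf{A} = \boldsymbol{\alpha} \quad \text{on } \Gamma_{BA}, \tag{178}$$

$$\frac{1}{\mu}\nabla\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_{BA}, \tag{179}$$

$$\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n} = \mathbf{K} \quad \text{on } \Gamma_{HA}, \tag{180}$$

$$\mathbf{n}\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_{HA}, \tag{181}$$

$$\mathbf{A} \ \text{continuous on } \Gamma_{ncA}, \tag{183}$$

$$\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n}_c + \frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n}_{nA} = \mathbf{0} \quad \text{on } \Gamma_{ncA}, \tag{184}$$

$$\frac{1}{\mu}\nabla\cdot\mathbf{A}\,\mathbf{n}_c + \frac{1}{\mu}\nabla\cdot\mathbf{A}\,\mathbf{n}_{nA} = \mathbf{0} \quad \text{on } \Gamma_{ncA}, \tag{185}$$

$$\mathbf{n}\cdot\left(-\sigma\frac{\partial\mathbf{A}}{\partial t} - \sigma\nabla V\right) = 0 \quad \text{on } \Gamma_{nc\psi}, \tag{186}$$

$$\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n}_A - \nabla\psi\times\mathbf{n}_\psi = \mathbf{0} \quad \text{on } \Gamma_{nc\psi} \text{ and } \Gamma_{A\psi}, \tag{187}$$

$$\mathbf{n}\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_{nc\psi} \text{ and } \Gamma_{A\psi}, \tag{188}$$

$$\mathbf{n}_\psi\cdot\mu\nabla\psi - \mathbf{n}_A\cdot\nabla\times\mathbf{A} = 0 \quad \text{on } \Gamma_{nc\psi} \text{ and } \Gamma_{A\psi}, \tag{189}$$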

where $\mathbf{n}_\psi$ and $\mathbf{n}_A$ stand for the outer normals associated with the subregions $\Omega_{n\psi}$ and $\Omega_c$ or $\Omega_{nA}$, respectively. The A,V-A formulation presented in the previous subsection is, naturally, a special case of the present formulation. Its relative simplicity, however, warranted its separate treatment. Another similar special case is the A,V-ψ formulation with $\Omega_{nA}$ nonexistent, i.e., where the magnetic scalar potential can be used in the entire nonconducting region. This is possible if the eddy current carrying conductors are simply connected and, so, the enforcement of zero net current by the scalar potential description used outside them is appropriate (Pillsbury, 1983). A further special formulation can be derived from the latter if the conductivity $\sigma$ is constant throughout the eddy current region and if there are no voltage sources present. In this case, the differential equation (169) is a Laplace equation for the electric scalar potential $V$ in view of the Coulomb gauge imposed on $\mathbf{A}$. Further, the lack of voltage sources renders the boundary condition (173) homogeneous, i.e., $V$ vanishes on $\Gamma_E$. Finally, in view of Eqs. (177) and (188), the boundary conditions (176) and (186) constitute homogeneous Neumann boundary conditions for the electric scalar potential. Consequently, $V$ is a harmonic function satisfying homogeneous Dirichlet or Neumann boundary conditions on the entire boundary of the region $\Omega_c$ and is, therefore, zero. Hence, in this special case, there is no need to introduce the electric scalar potential; i.e., the resulting A-ψ formulation can be used (Emson et al., 1983; Rodger and Eastham, 1983).

3. The T,ψ-ψ Formulation

The magnetic scalar potential $\psi$ is used in this formulation throughout the region $\Omega$, i.e., both in the eddy current carrying conductors and in nonconductors (Carpenter, 1977). The continuity of this potential is again essential.


This, in itself, does not ensure the continuity of any field quantity, since the magnetic field in the eddy current region $\Omega_c$ is derived both from $\mathbf{T}$ and $\psi$ according to Eq. (141). However, the tangential component of the gradient of $\psi$ being continuous, the tangential component of $\mathbf{T}$ must be zero on the interface $\Gamma_{nc}$ in order to satisfy the interface condition (97) stating the continuity of the tangential component of $\mathbf{H}$. Consequently, there is a boundary condition prescribed on the interface $\Gamma_{nc}$ that is identical to that in Eq. (145) given on $\Gamma_H$, so, in order to enforce the Coulomb gauge on the current vector potential, the divergence of $\mathbf{T}$ should be set to zero here as in Eq. (151). The interface condition (98) is a further constraint to be accounted for. Clearly, this formulation (frequently called the T-Ω method) is not capable of treating multiply connected conductors in view of the fact that the magnetic field in the entire eddy current free region is described by a magnetic scalar potential. Indeed, this feature, as well as the vanishing of the tangential component of $\mathbf{T}$ on the interface, enforces zero net current in the eddy current carrying conductors. The differential equations, boundary and interface conditions of the T,ψ-ψ formulation are summarized below with the notations of Fig. 5 used:

$$\nabla\times\frac{1}{\sigma}\nabla\times\mathbf{T} - \nabla\frac{1}{\sigma}\nabla\cdot\mathbf{T} + \frac{\partial}{\partial t}(\mu\mathbf{T}) - \frac{\partial}{\partial t}(\mu\nabla\psi) = \mathbf{0} \quad \text{in } \Omega_c, \tag{190}$$

$$\psi \ \text{continuous on } \Gamma_{nc},$$

$$\mathbf{n}\times\mathbf{T} = \mathbf{0} \quad \text{on } \Gamma_{nc},$$

$$(\mu\mathbf{T} - \mu\nabla\psi)\cdot\mathbf{n}_c - \mathbf{n}_n\cdot\mu\nabla\psi = 0 \quad \text{on } \Gamma_{nc}, \tag{202}$$

$$\frac{1}{\sigma}\nabla\cdot\mathbf{T} = 0 \quad \text{on } \Gamma_{nc}. \tag{202a}$$


4. The T,ψ-A-ψ Formulation

As pointed out in the previous subsection, the T,ψ-ψ formulation is not capable of treating multiply connected conductors due to the exclusive use of the magnetic scalar potential to describe the static magnetic field. This difficulty can be overcome by using the same approach in the nonconducting space that has been adopted in the A,V-A-ψ formulation: the magnetic vector potential is introduced in $\Omega_{nA}$ with the total scalar potential description limited to a region $\Omega_{n\psi}$ that surrounds a simply connected region (Fig. 6) (Biro and Preis, 1989b, 1989c). The differential equations in $\Omega_c$ and the boundary conditions on $\Gamma_E$ and $\Gamma_{Hc}$ as well as the interface conditions on $\Gamma_{nc\psi}$ are identical to those used in the T,ψ-ψ formulation in $\Omega_c$ and on $\Gamma_E$, $\Gamma_{Hc}$ and $\Gamma_{nc}$, respectively. Similarly, the differential equations in $\Omega_{nA}$ and $\Omega_{n\psi}$, the boundary conditions on $\Gamma_{BA}$, $\Gamma_{B\psi}$, $\Gamma_{HA}$, $\Gamma_{H\psi}$ as well as the interface conditions on $\Gamma_{A\psi}$ are the same as in the A,V-A-ψ approach. The conditions on the interface $\Gamma_{ncA}$ between the conducting region $\Omega_c$ and the magnetic vector potential region $\Omega_{nA}$ remain to be established. Since no continuous potential is present on the interface $\Gamma_{ncA}$, this surface acts as a boundary of regions with specific potentials used ($\mathbf{T}$ and $\psi$ in $\Omega_c$ and $\mathbf{A}$ in $\Omega_{nA}$), similarly to the surfaces $\Gamma_{nc\psi}$ and $\Gamma_{A\psi}$ in the A,V-A-ψ formulation. It is therefore again appropriate to regard the interface conditions as boundary conditions. As seen in Section II.C.1, in order to ensure the uniqueness of a vector potential it is necessary to define either its tangential component and its divergence (cf. Eqs. (67) and (77)) or the tangential component of its curl and its normal component on the boundary (cf. Eqs. (65) and (70)). For the current vector potential, the first option is not available now, since this would enforce zero net current in the eddy current carrying conductors, a constraint that has been avoided by the introduction of the region $\Omega_{nA}$.

It is, therefore, necessary to formulate a boundary condition on $\Gamma_{ncA}$ that specifies the tangential component of the electric field intensity. An obvious possibility is to equate this quantity with the tangential component of the negative time derivative of the magnetic vector potential (see Eq. (223)). This condition implies the continuity of the normal component of the flux density. Indeed, the time derivative of $\mathbf{B}\cdot\mathbf{n}$ is determined by $\mathbf{E}\times\mathbf{n}$ in $\Omega_c$ in view of the Maxwell equation (91) and, obviously, by $-\partial\mathbf{A}/\partial t\times\mathbf{n}$ in $\Omega_{nA}$ by force of the definition (128) of the magnetic vector potential. Since the electric field intensity is a unique quantity, the condition (223) simultaneously specifies the tangential component of the vector potential. The accompanying boundary conditions ensuring the uniqueness of the vector potentials are hence the prescription of the normal component of the current vector potential (as in Eq. (147) on $\Gamma_E$ where $\mathbf{E}\times\mathbf{n}$ is given) and of the divergence of the magnetic vector potential (as in Eq. (77) on $\Gamma_B$ where $\mathbf{n}\times\mathbf{A}$ is specified). Further, in order to enforce the Coulomb gauge on $\mathbf{T}$, it is necessary that the normal derivative of $\frac{1}{\sigma}\nabla\cdot\mathbf{T}$ vanishes on $\Gamma_{ncA}$. Similar to the reasoning in Section II.C.1 regarding the magnetic vector potential, this follows from the normal component of the differential equation (148) taken on this surface provided the continuity of the normal component of the flux density is explicitly prescribed (see Eq. (225)). The continuity of the tangential component of the magnetic field intensity is again treated as a boundary condition for the magnetic vector potential region specified in terms of the current vector potential and the magnetic scalar potential (see Eq. (224)). In summary, the following differential equations, boundary and interface conditions govern the potentials in the T,ψ-A-ψ formulation:

$$\nabla\times\frac{1}{\sigma}\nabla\times\mathbf{T} - \nabla\frac{1}{\sigma}\nabla\cdot\mathbf{T} + \frac{\partial}{\partial t}(\mu\mathbf{T}) - \frac{\partial}{\partial t}(\mu\nabla\psi) = \mathbf{0} \quad \text{in } \Omega_c, \tag{203}$$

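$$\frac{1}{\sigma}\nabla\cdot\mathbf{T} = 0 \quad \text{on } \Gamma_{Hc}, \tag{211a}$$

$$\frac{1}{\sigma}\nabla\cdot\mathbf{T} = 0 \quad \text{on } \Gamma_{nc\psi}, \tag{219a}$$

$$\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n}_A - \nabla\psi\times\mathbf{n}_\psi = \mathbf{0} \quad \text{on } \Gamma_{A\psi},$$

$$\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n}_A + \mathbf{T}\times\mathbf{n}_c - \nabla\psi\times\mathbf{n}_c = \mathbf{0} \quad \text{on } \Gamma_{ncA}, \tag{224}$$

$$\frac{1}{\mu}\nabla\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_{ncA},$$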

where $\mathbf{n}_\psi$ and $\mathbf{n}_A$ refer to the outer normals associated with the subregions $\Omega_{n\psi}$ and $\Omega_{nA}$, respectively.

IV. WAVEGUIDES AND CAVITIES

The complete set of Maxwell equations describes electromagnetic waves in the media characterized by the constitutive relationships. The propagation of these waves can be affected by different obstacles that can be modeled by homogeneous boundary conditions specifying certain field components to vanish. The problem to be treated in this section is the computation of the electromagnetic field in a bounded region with such boundary conditions given. The medium is assumed to be lossless, i.e., its conductivity is zero. An aspect of such problems that sharply distinguishes them from static or eddy current problems is the possible occurrence of nonzero fields even if both the differential equations and the boundary conditions are homogeneous. In practice, this possibility arises only if the field quantities vary sinusoidally in time, an option that is only available if the material properties are independent of the field quantities, thereby making the problem linear. Therefore, the linearity of the medium will be assumed throughout this section.

The differential equations and boundary conditions governing the problem are presented in Section IV.A. The use of potentials for the description of the electromagnetic field in waveguides and cavities is introduced in Section IV.B.

A. Differential Equations and Boundary Conditions of Waveguides and Cavities

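The differential equations obtained from Maxwell's equations in a lossless medium are

$$\nabla\times\mathbf{H} = \frac{\partial\mathbf{D}}{\partial t} \quad \text{in } \Omega, \tag{228}$$

$$\nabla\times\mathbf{E} = -\frac{\partial\mathbf{B}}{\partial t} \quad \text{in } \Omega, \tag{229}$$

$$\mathbf{B} = [\mu]\mathbf{H}, \tag{230}$$

$$\mathbf{D} = [\varepsilon]\mathbf{E}, \tag{231}$$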

where $[\mu]$ and $[\varepsilon]$ are tensors describing possibly anisotropic media. Since the conductivity is zero, no conductive currents are present. The time derivatives of the Maxwell equations (3) and (4) (the latter with $\rho = 0$) are implied by Eqs. (228) and (229), respectively. The boundary $\Gamma$ of the region $\Omega$ is again subdivided into two parts in accordance with the type of boundary conditions given. On the part $\Gamma_E$, the tangential component of the electric field intensity is known to be zero:

$$\mathbf{E}\times\mathbf{n} = \mathbf{0} \quad \text{on } \Gamma_E. \tag{232}$$

The surface $\Gamma_E$ is called an electric wall; it models an infinitely conductive metallic surface or, possibly, a symmetry plane with the tangential component of $\mathbf{E}$ vanishing. The tangential component of the magnetic field intensity is zero on the surface $\Gamma_H$:

$$\mathbf{H}\times\mathbf{n} = \mathbf{0} \quad \text{on } \Gamma_H. \tag{233}$$

This surface is called a magnetic wall; it models a surface of infinitely high permeability or a symmetry plane with the tangential component of $\mathbf{H}$ zero. Two important special cases of the electromagnetic field in a closed region are considered. The problem formulated so far is that of a three-dimensional cavity. Another important case involves a region $\Omega$ that is infinitely long. This two-dimensional model corresponds to a waveguide. The differences between the two problems will be pointed out in the course of the numerical solution of the corresponding equations. Although all of the differential equations and boundary conditions are homogeneous, i.e., no excitations are present, it is still possible under particular conditions that the electromagnetic field is nonzero. In order to investigate the implication of the absence of any excitation, let us multiply Eq. (228) by $\mathbf{E}$ and Eq. (229) by $\mathbf{H}$ and let us subtract the latter from the former. Using the same vector identity as for Eq. (121), the following form of Poynting's theorem is obtained:

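$$-\nabla\cdot(\mathbf{E}\times\mathbf{H}) = \mathbf{E}\cdot\frac{\partial\mathbf{D}}{\partial t} + \mathbf{H}\cdot\frac{\partial\mathbf{B}}{\partial t} \quad \text{in } \Omega. \tag{234}$$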

Let us integrate Eq. (234) over $\Omega$. The permittivity $[\varepsilon]$ in Eq. (231) and the permeability $[\mu]$ in Eq. (230) are independent of the fields and so also of time. Applying Gauss's theorem, this allows the following to be written:

$$\frac{\partial}{\partial t}\int_\Omega \left(\frac{1}{2}\mathbf{E}\cdot\mathbf{D} + \frac{1}{2}\mathbf{H}\cdot\mathbf{B}\right) d\Omega = -\oint_\Gamma (\mathbf{E}\times\mathbf{H})\cdot\mathbf{n}\, d\Gamma. \tag{235}$$

The surface integral on the right-hand side is zero in view of the boundary conditions (232) and (233). Therefore, Eq. (235) states that the electromagnetic energy in the region $\Omega$ is constant in time. If homogeneous initial conditions were also given for the field quantities, it would follow that they are zero. The case when the fields are constant in time is irrelevant, since no electromagnetic field can then be present. In the case, however, when the fields vary sinusoidally in time, it is possible at some frequencies that the energy is constant although electric and magnetic fields are present. The problem is to find these frequencies and the corresponding field patterns, the so-called modes. This leads to the task of finding the eigenvalues and eigenfunctions of the differential equations and boundary conditions describing the electromagnetic field. Since the field quantities are assumed to be of sinusoidal time variation, the complex notation will be used from now on with the angular frequency denoted by $\omega$.

B. Potential Descriptions of Waveguides and Cavities

When trying to find the eigenvalues and eigenfunctions of cavity resonators or waveguides numerically, reports are frequently encountered that nonphysical, so-called spurious modes are found (e.g., Davies et al., 1982; Webb, 1985; Koshiba et al., 1985). This is due to the fact that the uniqueness of the variables used is not provided by a proper formulation. Most existing formulations are based on one of the field vectors. This approach, however, similar to the static and eddy current case, does not permit the system variables to be continuous at interfaces where the material properties change abruptly. An alternative possibility is advocated in the present work. It applies continuous potentials that are made unique by the enforcement of the Coulomb gauge in a way similar to the method applied to eddy current fields. The results obtained by this technique have been found by Bardi and Biro (1989) to be free of spurious solutions. The price to be paid is the necessity of introducing four scalar unknowns: the three components of a vector potential and an additional scalar potential. The use of a magnetic vector potential and an electric scalar potential is presented in Section IV.B.1 and the alternative of employing an electric vector potential and a magnetic scalar potential is expounded in Section IV.B.2.

1. The Magnetic Vector Potential and the Electric Scalar Potential

These potentials can be introduced in a way similar to the eddy current case. With the sinusoidal time variation taken into account by means of complex notation, the field quantities are derived from the potentials as

$$\mathbf{B} = \nabla\times\mathbf{A}, \tag{236}$$

$$\mathbf{E} = -j\omega\mathbf{A} - \nabla V. \tag{237}$$

These account for the Maxwell equation (229). Equation (228) along with the constitutive relationships (230) and (231) provides the following differential equation:

$$\nabla\times[\mu]^{-1}\nabla\times\mathbf{A} - \omega^2[\varepsilon]\mathbf{A} + j\omega[\varepsilon]\nabla V = \mathbf{0} \quad \text{in } \Omega, \tag{238}$$

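where $[\mu]^{-1}$ is the inverse of the permeability tensor. The boundary conditions (232) and (233) can be formulated for the potentials as

$$\mathbf{n}\times\mathbf{A} = \mathbf{0} \quad \text{on } \Gamma_E, \tag{239}$$

$$V = 0 \quad \text{on } \Gamma_E, \tag{240}$$

$$[\mu]^{-1}\nabla\times\mathbf{A}\times\mathbf{n} = \mathbf{0} \quad \text{on } \Gamma_H. \tag{241}$$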

In order to enforce the uniqueness of the vector potential, either its tangential or its normal component must be specified on the boundary and the Coulomb gauge must be satisfied. So the additional boundary condition

$$\mathbf{n}\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_H, \tag{242}$$

is introduced and the differential equation (238) is modified as

$$\nabla\times[\mu]^{-1}\nabla\times\mathbf{A} - \nabla\frac{1}{\mu}\nabla\cdot\mathbf{A} - \omega^2[\varepsilon]\mathbf{A} + j\omega[\varepsilon]\nabla V = \mathbf{0} \quad \text{in } \Omega, \tag{243}$$

where $\mu$ is a suitably chosen constant. Experience has shown the value $\mu = \frac{1}{3}\mathrm{Tr}[\mu]$ to be adequate. The notation $\mathrm{Tr}[\mu]$ stands for the sum of the diagonal elements of the tensor $[\mu]$. Since Eq. (243) no longer implies the solenoidality of the displacement current density, this is written in an explicit way:

$$\nabla\cdot\left(-\omega^2[\varepsilon]\mathbf{A} + j\omega[\varepsilon]\nabla V\right) = 0 \quad \text{in } \Omega. \tag{244}$$

Similar to the eddy current case, the enforcement of the Coulomb gauge is complete when the further boundary conditions

$$\mathbf{n}\cdot\left(-\omega^2[\varepsilon]\mathbf{A} + j\omega[\varepsilon]\nabla V\right) = 0 \quad \text{on } \Gamma_H, \tag{245}$$

and

$$\frac{1}{\mu}\nabla\cdot\mathbf{A} = 0 \quad \text{on } \Gamma_E, \tag{246}$$

are specified. In summary, the frequencies are sought that allow the homogeneous differential equations (243) and (244) to have nonzero solutions with the homogeneous boundary conditions (239) to (242) and (245) to (246).
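The role of the added term $-\nabla\frac{1}{\mu}\nabla\cdot\mathbf{A}$ in Eq. (243) can be illustrated for a plane wave $\mathbf{A}\,e^{j\mathbf{k}\cdot\mathbf{r}}$ in a homogeneous isotropic medium, where the differential operators reduce to $3\times 3$ matrices acting on the amplitude. The sketch below is an illustration under these simplifying assumptions only (constant scalar reluctivity `nu` $= 1/\mu$, arbitrary sample wave vector `k`); it is not the discretization used in the chapter.

```python
import numpy as np

def curl_curl_symbol(k, nu=1.0):
    """Matrix of curl(nu curl .) acting on a plane wave: nu * (|k|^2 I - k k^T)."""
    k = np.asarray(k, dtype=float)
    return nu * (np.dot(k, k) * np.eye(3) - np.outer(k, k))

def grad_div_symbol(k, nu=1.0):
    """Matrix of -grad(nu div .) acting on a plane wave: nu * k k^T."""
    k = np.asarray(k, dtype=float)
    return nu * np.outer(k, k)

k = np.array([1.0, 2.0, -0.5])            # arbitrary nonzero wave vector (assumption)
cc = curl_curl_symbol(k)                  # ungauged operator
gauged = cc + grad_div_symbol(k)          # operator with the Coulomb gauge term added

rank_cc = np.linalg.matrix_rank(cc)
rank_gauged = np.linalg.matrix_rank(gauged)
null_residual = np.linalg.norm(cc @ k)    # gradient-type amplitudes are annihilated

print(rank_cc, rank_gauged, null_residual)
```

The curl-curl matrix alone is singular, with the longitudinal (gradient-type) amplitude parallel to $\mathbf{k}$ in its null space; adding the gauge term yields $\nu|\mathbf{k}|^2$ times the identity, which is regular. This is the plane-wave counterpart of the uniqueness argument made above.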

2. The Electric Vector Potential and the Magnetic Scalar Potential

An alternative way of introducing potentials is to derive the field quantities from an electric vector potential and a magnetic scalar potential as

$$\mathbf{D} = \nabla\times\mathbf{F}, \tag{247}$$

$$\mathbf{H} = j\omega\mathbf{F} - \nabla\psi, \tag{248}$$

a description that explicitly satisfies the Maxwell equation (228). Faraday's law (229) and the constitutive equations (230) and (231) yield the differential equation

$$\nabla\times[\varepsilon]^{-1}\nabla\times\mathbf{F} - \omega^2[\mu]\mathbf{F} - j\omega[\mu]\nabla\psi = \mathbf{0} \quad \text{in } \Omega. \tag{249}$$

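The boundary conditions (232) and (233) can be written as

$$[\varepsilon]^{-1}\nabla\times\mathbf{F}\times\mathbf{n} = \mathbf{0} \quad \text{on } \Gamma_E, \tag{250}$$

$$\mathbf{F}\times\mathbf{n} = \mathbf{0} \quad \text{on } \Gamma_H, \tag{251}$$

$$\psi = 0 \quad \text{on } \Gamma_H. \tag{252}$$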

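Again, the uniqueness of the vector potential with the Coulomb gauge enforced can be achieved by replacing the differential equation (249) by the two equations

$$\nabla\times[\varepsilon]^{-1}\nabla\times\mathbf{F} - \nabla\frac{1}{\varepsilon}\nabla\cdot\mathbf{F} - \omega^2[\mu]\mathbf{F} - j\omega[\mu]\nabla\psi = \mathbf{0} \quad \text{in } \Omega, \tag{253}$$

$$\nabla\cdot\left(-\omega^2[\mu]\mathbf{F} - j\omega[\mu]\nabla\psi\right) = 0 \quad \text{in } \Omega \tag{254}$$

($\varepsilon$ is chosen to be $\frac{1}{3}\mathrm{Tr}[\varepsilon]$) and by specifying the additional boundary conditions

$$\mathbf{F}\cdot\mathbf{n} = 0 \quad \text{on } \Gamma_E, \tag{255}$$

$$\mathbf{n}\cdot\left(-\omega^2[\mu]\mathbf{F} - j\omega[\mu]\nabla\psi\right) = 0 \quad \text{on } \Gamma_E, \tag{256}$$

and

$$\frac{1}{\varepsilon}\nabla\cdot\mathbf{F} = 0 \quad \text{on } \Gamma_H. \tag{257}$$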

The frequencies are to be found that permit the homogeneous differential equations (253) and (254) to have nonzero solutions with the homogeneous boundary conditions (250) to (252) and (255) to (257) satisfied.

V. GALERKIN'S METHOD

The application of Galerkin's method to three types of partial differential equation problems is presented in this section. These are the second order elliptic problems in the static case, the parabolic differential equations in the transient case as well as the eigenvalue problems arising in waveguides and cavities. Section V.A is devoted to the statement of the weak formulations. Galerkin's method is described in general terms in Section V.B. The specific boundary value problems are treated in Section V.C.

A. Weak Formulations

The differential equations and certain boundary conditions are reformulated in this subsection into a weak form that is adequate for the application of Galerkin's method.

1. Weak Form of Second Order Elliptic Differential Equations

In the first case encountered in static problems, a differential equation of the form

$$L_2 u = f \quad \text{in } \Omega, \tag{258}$$

is to be solved, where $L_2$ is a second order elliptic differential operator (or the sum of such operators), $u$ is the unknown function to be determined and $f$ is a known forcing function. Two types of boundary conditions are given on disjoint sections of the boundary $\Gamma$: the Dirichlet boundary condition is specified on the part $\Gamma_D$,

$$L_D u = g \quad \text{on } \Gamma_D, \tag{259}$$

and the Neumann boundary condition

$$L_N u = h \quad \text{on } \Gamma_N, \tag{260}$$

is given on the part $\Gamma_N$. The functions $g$ and $h$ are known functions defined on the relevant surfaces. The operator $L_D$ is the identity operator if $u$ is a scalar function, or it yields the tangential or the normal component of $u$ if the latter is a vector function. If the left-hand side of the differential equation (258) is multiplied by some function $w$ and integrated over the region $\Omega$, the operator $L_N$ of the Neumann boundary condition (260) permits the following transformation:

$$\int_\Omega (L_2 u)\,w\,d\Omega = \int_\Omega (L_1 u)(L_1 w)\,d\Omega - \int_\Gamma (L_N u)(L_D w)\,d\Gamma, \tag{261}$$

where $L_1$ is a first order differential operator. This equation is usually a form of Green's identity. The differential equation (258) and the Neumann boundary condition (260) can be rewritten in the equivalent weak formulation

$$\int_\Omega (L_2 u - f)\,w\,d\Omega + \int_{\Gamma_N} (L_N u - h)(L_D w)\,d\Gamma = 0 \quad \text{for any function } w. \tag{262}$$

Indeed, if Eqs. (258) and (260) are satisfied, then Eq. (262) is an identity; conversely, if Eq. (262) holds for any function $w$, then the expressions in brackets in the two integrals must be zero in view of the fundamental lemma of variational calculus (Mikhlin, 1964), i.e., Eqs. (258) and (260) are satisfied. Note that any relevant function on the surface $\Gamma_N$ can be represented in the form $(L_D w)$.

2. Weak Form of Transient Problems

In the second case relevant to transient eddy current problems, the differential equation to be solved involves time as a further variable, and it has the form

$$L_2 u + L_t \frac{\partial u}{\partial t} = f \quad \text{in } \Omega, \tag{263}$$

where $L_2$ is again a second order elliptic differential operator and $L_t$ has the symmetric property

$$\int_\Omega (L_t u)\,w\,d\Omega = \int_\Omega u\,(L_t w)\,d\Omega. \tag{264}$$

The form of the boundary conditions is the same as in the static case, and the operators $L_2$, $L_1$ and $L_N$ obey the Green's identity (261). In addition, the initial condition

$$u(t = 0) = u_0 \tag{265}$$

must also be specified. The equivalent weak formulation of the differential equation (263) and the Neumann boundary condition (260) can be written as

$$\int_\Omega \left(L_2 u + L_t \frac{\partial u}{\partial t} - f\right) w\,d\Omega + \int_{\Gamma_N} (L_N u - h)(L_D w)\,d\Gamma = 0 \quad \text{for any function } w. \tag{266}$$

3. Weak Form of Eigenvalue Problems

In the third case representing an eigenvalue problem and encountered in waveguide and cavity problems, the differential equation has the form

$$L_2 u + \lambda L_\lambda u = 0 \quad \text{in } \Omega, \tag{267}$$

where, as before, $L_2$ is a second order elliptic differential operator and $L_\lambda$ has the symmetric property (264):

$$\int_\Omega (L_\lambda u)\,w\,d\Omega = \int_\Omega u\,(L_\lambda w)\,d\Omega. \tag{268}$$

The values of the parameter $\lambda$ are sought so that the homogeneous differential equation (267) has a nontrivial solution satisfying the homogeneous boundary conditions

$$L_D u = 0 \quad \text{on } \Gamma_D, \tag{269}$$

and

$$L_N u = 0 \quad \text{on } \Gamma_N. \tag{270}$$

As before, the operators $L_2$, $L_D$ and $L_N$ satisfy the Green's identity (261). In the present case, the weak formulation of the differential equation (267) and the Neumann boundary condition (270) is

$$\int_\Omega (L_2 u + \lambda L_\lambda u)\,w\,d\Omega + \int_{\Gamma_N} (L_N u)(L_D w)\,d\Gamma = 0 \quad \text{for any function } w. \tag{271}$$

B. General Description of Galerkin's Method

When applying Galerkin's method to the previous cases, the unknown function $u$ is approximated by an expansion in terms of $n$ elements of an entire function set $\{f_k\}$:

$$u \approx u^{(n)} = u_D + \sum_{k=1}^{n} u_k f_k, \tag{272}$$

where the function $u_D$ satisfies the nonhomogeneous Dirichlet boundary condition (259) (if $g = 0$, as in Eq. (269) in the eigenvalue case, then $u_D = 0$), and the elements of the function set $\{f_k\}$ satisfy the homogeneous Dirichlet boundary condition

$$L_D f_k = 0 \quad \text{on } \Gamma_D, \qquad k = 1, 2, \ldots, n. \tag{274}$$

This ensures that the approximating function satisfies the Dirichlet boundary conditions. The values $u_k$ are unknown constants in the static and in the eigenvalue case and are unknown functions of time in the transient case. The application of Galerkin's method consists in writing the weak formulations (262), (266) or (271) with the function $u$ replaced by the approximating function $u^{(n)}$ and restricting the function $w$ to $n$ elements of an entire weighting function set $\{w_i\}$ that satisfy the homogeneous Dirichlet boundary condition

$$L_D w_i = 0 \quad \text{on } \Gamma_D. \tag{275}$$

This restriction is justified by the fact that the behaviour of the weighting function $w$ on the surface $\Gamma_D$ is irrelevant in the weak formulations. This Galerkin procedure results in the approximate satisfaction of the weak formulations and hence of the differential equations and the Neumann boundary condition by the approximating function. The Galerkin procedure yields $n$ equations in the above three cases, which are written in the following.


1. Galerkin's Equations of Second Order Elliptic Problems

When applied to the weak formulation (262) of the static problem, the algebraic equations

$$\int_\Omega (L_2 u^{(n)})\,w_i\,d\Omega + \int_{\Gamma_N} (L_N u^{(n)})(L_D w_i)\,d\Gamma = \int_\Omega f\,w_i\,d\Omega + \int_{\Gamma_N} h\,(L_D w_i)\,d\Gamma, \qquad i = 1, 2, \ldots, n, \tag{276}$$

are obtained. Applying the Green's identity (261) and considering the fact that the weighting functions $w_i$ satisfy the homogeneous Dirichlet boundary condition (275), these equations can be rewritten as

$$\int_\Omega (L_1 u^{(n)})(L_1 w_i)\,d\Omega = \int_\Omega f\,w_i\,d\Omega + \int_{\Gamma_N} h\,(L_D w_i)\,d\Gamma, \qquad i = 1, 2, \ldots, n. \tag{277}$$

These equations are nonlinear in general, namely whenever the medium and hence the operators $L_2$ and $L_1$ are nonlinear. In the linear case, however, the equations (277) constitute a set of linear algebraic equations of the form

$$\mathbf{A}\mathbf{u} = \mathbf{f}, \tag{278}$$

where the $n$ elements of the unknown vector $\mathbf{u}$ are the parameters $u_k$ in the expansion (272), and the $k$th element in the $i$th row of the matrix $\mathbf{A}$ and the $i$th element of the right-hand side source vector $\mathbf{f}$ are the following:

$$A_{ik} = \int_\Omega (L_1 f_k)(L_1 w_i)\,d\Omega, \tag{279}$$

$$(\mathbf{f})_i = \int_\Omega f\,w_i\,d\Omega + \int_{\Gamma_N} h\,(L_D w_i)\,d\Gamma - \int_\Omega (L_1 u_D)(L_1 w_i)\,d\Omega. \tag{280}$$
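As an illustration of Eqs. (277) to (280), the following minimal sketch assembles and solves the Galerkin system for a one-dimensional model problem. The model is an assumption chosen for brevity and does not appear in the text: $L_2 = -d^2/dx^2$ on $(0,1)$, so that $L_1 = d/dx$, with $u(0) = 0$ on $\Gamma_D$ (hence $u_D = 0$), $u'(1) = h$ on $\Gamma_N$, constant $f$, and piecewise linear "hat" functions used both as expansion and as weighting functions. The function name `galerkin_1d` and its parameters are hypothetical.

```python
import numpy as np

def galerkin_1d(n_el=8, f_const=1.0, h_neumann=0.5):
    """Galerkin solution of -u'' = f_const on (0, 1), u(0) = 0, u'(1) = h_neumann."""
    dx = 1.0 / n_el
    n = n_el                                   # unknowns at nodes 1..n_el; node 0 is on Gamma_D
    A = np.zeros((n, n))
    rhs = np.zeros(n)
    k_el = np.array([[1.0, -1.0], [-1.0, 1.0]]) / dx   # element integrals of f_k' f_i'
    f_el = f_const * dx / 2.0 * np.ones(2)              # element integrals of f * f_i
    for e in range(n_el):                      # element between nodes e and e + 1
        for a in range(2):
            i = e + a - 1                      # index of the unknown; -1 is the Dirichlet node
            if i < 0:
                continue
            rhs[i] += f_el[a]
            for b in range(2):
                k = e + b - 1
                if k >= 0:
                    A[i, k] += k_el[a, b]
    rhs[-1] += h_neumann                       # Neumann term: h * f_i evaluated at x = 1
    u = np.linalg.solve(A, rhs)
    x = np.linspace(dx, 1.0, n)
    return A, x, u

A, x, u = galerkin_1d()
u_exact = -0.5 * x**2 + 1.5 * x                # analytic solution for f = 1, h = 0.5
print(np.max(np.abs(u - u_exact)))
```

For this particular one-dimensional problem the nodal values coincide with the analytic solution up to rounding, and the assembled matrix is symmetric because the weighting functions coincide with the expansion functions.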

Evidently, if the weighting functions $w_i$ are selected to coincide with the expansion functions $f_i$, the matrix $\mathbf{A}$ is symmetric. This is a property that considerably reduces the computational effort needed for the solution of the equation system (278). Therefore, it will be assumed in the following that the expansion functions $f_i$ are used as weighting functions.

2. Galerkin's Equations of Transient Problems

The application of the Galerkin procedure to the weak formulation (266) of the transient problem results in the set of ordinary differential equations

48

OSZKAR BIRO AND K. R. RICHTER

= J,ffidn

+J

i = 1,2,..., n.

h(LDfi)dr,

(281)

rN

Using the Green’s identity (261) and the fact that the functions fi as weighting functions satisfy the Dirichlet boundary condition (274), this set can be written as

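$$\int_\Omega (L_1 u^{(n)})(L_1 f_i)\,d\Omega + \int_\Omega \Big(L_t \frac{\partial u^{(n)}}{\partial t}\Big) f_i\,d\Omega = \int_\Omega f\,f_i\,d\Omega + \int_{\Gamma_N} h\,(L_D f_i)\,d\Gamma, \qquad i = 1, 2, \ldots, n. \tag{282}$$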

Approximating the initial solution $u_0$ in the initial condition (265) as

$$u_0 \approx u_0^{(n)} = \sum_{k=1}^{n} u_{0k} f_k, \tag{283}$$

the following initial values can be assigned to the functions $u_k$ in the expansion (272):

$$u_k(t = 0) = u_{0k}, \qquad k = 1, 2, \ldots, n. \tag{284}$$

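Unless the characteristics of the medium are field dependent, the equations constitute a linear set of ordinary differential equations of the form

$$\mathbf{A}\mathbf{u} + \mathbf{B}\frac{\partial \mathbf{u}}{\partial t} = \mathbf{f},$$

with the elements of $\mathbf{u}$ being the functions $u_k$. The $k$th elements in the $i$th rows of the matrices $\mathbf{A}$ and $\mathbf{B}$, as well as the $i$th element of the forcing vector $\mathbf{f}$, are

$$A_{ik} = \int_\Omega (L_1 f_k)(L_1 f_i)\,d\Omega, \tag{285}$$

$$B_{ik} = \int_\Omega (L_t f_k)\,f_i\,d\Omega, \tag{286}$$

$$\mathrm{f}_i = \int_\Omega f\,f_i\,d\Omega + \int_{\Gamma_N} h\,(L_D f_i)\,d\Gamma - \int_\Omega (L_1 u_D)(L_1 f_i)\,d\Omega. \tag{287}$$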

The symmetry of the matrix $\mathbf{A}$ is obvious, whereas the symmetry of the matrix $\mathbf{B}$ follows from the property (264) of the operator $L_t$.
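A linear system of the form $\mathbf{A}\mathbf{u} + \mathbf{B}\,\partial\mathbf{u}/\partial t = \mathbf{f}$ still has to be integrated in time. The sketch below uses the backward Euler scheme, which is an assumption made for illustration only; the text does not prescribe a time-stepping method. The function name `backward_euler` and its parameters are hypothetical, and the forcing vector is taken to be constant in time.

```python
import numpy as np

def backward_euler(A, B, f, u0, dt, n_steps):
    """Integrate A u + B du/dt = f with the backward Euler scheme.

    Each step solves (A + B/dt) u_new = f + (B/dt) u_old. The scheme is an
    illustrative choice; f is assumed to be constant in time.
    """
    M = A + B / dt                      # symmetric if A and B are symmetric
    u = np.array(u0, dtype=float)
    for _ in range(n_steps):
        u = np.linalg.solve(M, f + B @ u / dt)
    return u
```

Since the step matrix `A + B/dt` inherits the symmetry of $\mathbf{A}$ and $\mathbf{B}$, the same symmetric solver can be reused in every step. For a constant forcing vector the iterates approach the static solution of $\mathbf{A}\mathbf{u} = \mathbf{f}$.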


3. Galerkin's Equations of Eigenvalue Problems

Finally, the application of Galerkin's method to the weak formulation (271) of the eigenvalue problem results in the equations

$$\int_\Omega (L_2 u^{(n)})\,f_i\,d\Omega + \lambda \int_\Omega (L_\lambda u^{(n)})\,f_i\,d\Omega + \int_{\Gamma_N} (L_N u^{(n)})(L_D f_i)\,d\Gamma = 0, \qquad i = 1, 2, \ldots, n. \tag{288}$$

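Again, applying Green's identity (261) and using the fact that the functions $f_i$ satisfy the homogeneous Dirichlet boundary condition (274) on $\Gamma_D$, the following can be written:

$$\int_\Omega (L_1 u^{(n)})(L_1 f_i)\,d\Omega + \lambda \int_\Omega (L_\lambda u^{(n)})\,f_i\,d\Omega = 0, \qquad i = 1, 2, \ldots, n. \tag{289}$$

These linear equations are equivalent to the generalized matrix eigenvalue problem

$$\mathbf{A}\mathbf{u} + \lambda \mathbf{B}\mathbf{u} = \mathbf{0}, \tag{290}$$

where the elements of the matrices $\mathbf{A}$ and $\mathbf{B}$ are given by

$$A_{ik} = \int_\Omega (L_1 f_k)(L_1 f_i)\,d\Omega, \tag{291}$$

$$B_{ik} = \int_\Omega (L_\lambda f_k)\,f_i\,d\Omega. \tag{292}$$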

The symmetry of the matrix $\mathbf{A}$ is again evident, and the property (268) of the operator $L_\lambda$ ensures the symmetry of the matrix $\mathbf{B}$, too.

C. Application of Galerkin's Method to Potential Formulations

The Galerkin procedure is next applied to the various specific potential formulations of static fields and eddy current fields, as well as waveguides and cavities. Section V.C.1 is devoted to the static case. Only the static magnetic field is treated, both with the total and reduced scalar potentials and with the magnetic vector potential employed. The other static formulations seem to be analogous. The employment of Galerkin techniques for two general eddy current vector potential formulations is presented in Section V.C.2. The case of waveguides and cavities is treated in Section V.C.3.


1. Galerkin’s Method in the Static Case

The application of the Galerkin procedure to the potential formulations of the static magnetic field is presented in this subsection. In Section V.C.1.a, the boundary value problem set for the total and reduced scalar potentials is tackled. Galerkin's equations for the magnetic vector potential description are derived in Section V.C.1.b.

a. Galerkin's Equations for the Total and Reduced Scalar Potentials

The differential equations, boundary and interface conditions of the static magnetic field in terms of two scalar potentials have been written in Section II.B.2. The scalar potential formulation of the static electric field (and, hence, of the static current field) can be regarded as a special case: with no total scalar potential present, the equations of the static magnetic field are completely analogous to those of the static electric field (cf. Eqs. (38) and (53) as well as Eqs. (42), (43) and (55), (57)). In the present case, the operator $L_2$ is the generalized Laplace operator in Eqs. (53) and (54), $L_D$ is the identity operator and $L_N$ is the normal derivative operator in the Neumann boundary conditions (57) and (58). The Green's identity (261) has the specific form

$$\int_\Omega [-\nabla\cdot(\mu\nabla\Phi)]\,w\,d\Omega = \int_\Omega \mu\nabla\Phi\cdot\nabla w\,d\Omega - \oint_\Gamma (\mathbf{n}\cdot\mu\nabla\Phi)\,w\,d\Gamma, \tag{293}$$

i.e., the operator $L_1$ is the $\nabla$ operator. For the application of Galerkin's method, the approximating expansions of the two potentials $\Phi$ and $\psi$ are selected so that they satisfy the Dirichlet boundary conditions (55) and (56) and the interface condition (62):

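$$\Phi \approx \Phi^{(n)} = \Phi_D + \sum_{k=1}^{n} \Phi_k f_k, \tag{294}$$

$$\psi \approx \psi^{(n)} = \psi_D + \sum_{k=1}^{n} \Phi_k f_k, \tag{295}$$

where the expansion functions $f_k$ satisfy the homogeneous Dirichlet boundary condition

$$f_k = 0 \quad \text{on } \Gamma_{H\Phi} \text{ and } \Gamma_{H\psi}, \tag{296}$$

and the functions $\Phi_D$ and $\psi_D$ are constructed so that the conditions

$$\Phi_D = \Phi_0 \quad \text{on } \Gamma_{H\Phi}, \tag{297}$$

$$\psi_D = \psi_0 \quad \text{on } \Gamma_{H\psi}, \tag{298}$$

$$\psi_D - \Phi_D = \Phi_s \quad \text{on } \Gamma_{\Phi\psi} \tag{299}$$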


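are fulfilled. Evidently, these equations do indeed imply the satisfaction of the Dirichlet boundary conditions (55) and (56) and of the interface condition (62). Note that only one set of unknowns, denoted by $\Phi_k$, has been introduced in the expansions (294) and (295). The differential equations (53) and (54), the Neumann boundary conditions (57) and (58) and the interface condition (60) can be included in the weak formulation serving as the basis of the Galerkin procedure. Using the pattern of Eq. (262), they can be summarized in the following weak form:

$$\int_{\Omega_\Phi} [-\nabla\cdot(\mu\nabla\Phi) + \nabla\cdot(\mu\mathbf{H}_s)]\,w\,d\Omega + \int_{\Omega_\psi} [-\nabla\cdot(\mu\nabla\psi)]\,w\,d\Omega + \int_{\Gamma_{B\Phi}} (\mathbf{n}\cdot\mu\nabla\Phi - \mathbf{n}\cdot\mu\mathbf{H}_s - \rho_{ms})\,w\,d\Gamma + \int_{\Gamma_{B\psi}} (\mathbf{n}\cdot\mu\nabla\psi - \rho_{ms})\,w\,d\Gamma + \int_{\Gamma_{\Phi\psi}} (\mathbf{n}_\Phi\cdot\mu\nabla\Phi - \mathbf{n}_\Phi\cdot\mu\mathbf{H}_s + \mathbf{n}_\psi\cdot\mu\nabla\psi)\,w\,d\Gamma = 0, \qquad \text{for any function } w. \tag{300}$$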

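Replacing the functions $\Phi$ and $\psi$ by the expansions (294) and (295) and again using the functions $f_i$ as weighting functions, the following equations are obtained:

$$\int_{\Omega_\Phi} [-\nabla\cdot(\mu\nabla\Phi^{(n)})]\,f_i\,d\Omega + \int_{\Gamma_{B\Phi}+\Gamma_{\Phi\psi}} (\mathbf{n}\cdot\mu\nabla\Phi^{(n)})\,f_i\,d\Gamma + \int_{\Omega_\psi} [-\nabla\cdot(\mu\nabla\psi^{(n)})]\,f_i\,d\Omega + \int_{\Gamma_{B\psi}+\Gamma_{\Phi\psi}} (\mathbf{n}\cdot\mu\nabla\psi^{(n)})\,f_i\,d\Gamma$$

$$= \int_{\Omega_\Phi} [-\nabla\cdot(\mu\mathbf{H}_s)]\,f_i\,d\Omega + \int_{\Gamma_{B\Phi}} (\rho_{ms} + \mathbf{n}\cdot\mu\mathbf{H}_s)\,f_i\,d\Gamma + \int_{\Gamma_{B\psi}} \rho_{ms}\,f_i\,d\Gamma + \int_{\Gamma_{\Phi\psi}} \mathbf{n}_\Phi\cdot\mu\mathbf{H}_s\,f_i\,d\Gamma, \qquad i = 1, 2, \ldots, n. \tag{301}$$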

Since the functions $f_i$ vanish on the surfaces $\Gamma_{H\Phi}$ and $\Gamma_{H\psi}$ according to Eq. (296), the use of the Green's identity (293) leads to the following form:

1.

c

n

S,,

n.pH,J;dT,

+

i = 1,2,..., n.


The matrix of this set of equations is evidently symmetric in the linear case, when the permeability $\mu$ is independent of the field. Even in the nonlinear case, a direct iteration method of updating the permeability values at each step, as well as the Newton-Raphson procedure (Chari, 1970), leads to symmetric systems.

b. Galerkin's Equations for the Magnetic Vector Potential

The differential equations and boundary conditions of the magnetic vector potential formulation for static magnetic fields have been shown in Section II.C.1. The equations of the static current field in terms of a current vector potential have been seen to be completely analogous. In this particular case, the operator $L_2$ in the differential equation (71) is the sum of two operators. Both of these involve a Dirichlet boundary condition: Eq. (67) for the curl-curl and Eq. (70) for the grad-div operator. Similarly, a Neumann boundary condition corresponds to each of them: Eq. (65) to the curl-curl and Eq. (77) to the grad-div operator. Indeed, the following two Green's identities can be written:

$$\int_\Omega \Big(\nabla\times\frac{1}{\mu}\nabla\times\mathbf{A}\Big)\cdot\mathbf{w}\,d\Omega = \int_\Omega \frac{1}{\mu}(\nabla\times\mathbf{A})\cdot(\nabla\times\mathbf{w})\,d\Omega - \oint_\Gamma \Big(\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n}\Big)\cdot\mathbf{w}\,d\Gamma, \tag{303}$$

$$\int_\Omega \Big(-\nabla\frac{1}{\mu}\nabla\cdot\mathbf{A}\Big)\cdot\mathbf{w}\,d\Omega = \int_\Omega \frac{1}{\mu}(\nabla\cdot\mathbf{A})(\nabla\cdot\mathbf{w})\,d\Omega - \oint_\Gamma \frac{1}{\mu}(\nabla\cdot\mathbf{A})(\mathbf{w}\cdot\mathbf{n})\,d\Gamma. \tag{304}$$

Compared with Eq. (261), these equations imply the above classification of the boundary conditions and also show that the two corresponding operators $L_1$ are the curl and the div operators, respectively. Approximating the magnetic vector potential according to Galerkin's method as

$$\mathbf{A} \approx \mathbf{A}^{(n)} = \mathbf{A}_D + \sum_{k=1}^{n} A_k \mathbf{f}_k, \tag{305}$$

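the vector functions $\mathbf{f}_k$ must satisfy the homogeneous Dirichlet conditions

$$\mathbf{n}\times\mathbf{f}_k = \mathbf{0} \quad \text{on } \Gamma_B, \tag{306}$$

$$\mathbf{f}_k\cdot\mathbf{n} = 0 \quad \text{on } \Gamma_H, \tag{307}$$

and the function $\mathbf{A}_D$ the Dirichlet boundary conditions (67) and (70),

$$\mathbf{n}\times\mathbf{A}_D = \boldsymbol{\alpha} \quad \text{on } \Gamma_B, \tag{308}$$

$$\mathbf{A}_D\cdot\mathbf{n} = 0 \quad \text{on } \Gamma_H. \tag{309}$$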


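The weak formulation of the problem takes care of the differential equation (71) and the Neumann boundary conditions (65) and (77):

$$\int_\Omega \Big(\nabla\times\frac{1}{\mu}\nabla\times\mathbf{A} - \nabla\frac{1}{\mu}\nabla\cdot\mathbf{A} - \mathbf{J}\Big)\cdot\mathbf{w}\,d\Omega + \int_{\Gamma_H} \Big(\frac{1}{\mu}\nabla\times\mathbf{A}\times\mathbf{n} - \mathbf{K}\Big)\cdot\mathbf{w}\,d\Gamma + \int_{\Gamma_B} \frac{1}{\mu}(\nabla\cdot\mathbf{A})(\mathbf{w}\cdot\mathbf{n})\,d\Gamma = 0, \qquad \text{for any function } \mathbf{w}. \tag{310}$$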

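Galerkin's equations are derived by replacing the vector potential $\mathbf{A}$ by the expansion (305) and the weighting function $\mathbf{w}$ by the expansion functions $\mathbf{f}_i$:

$$\int_\Omega \Big(\nabla\times\frac{1}{\mu}\nabla\times\mathbf{A}^{(n)} - \nabla\frac{1}{\mu}\nabla\cdot\mathbf{A}^{(n)} - \mathbf{J}\Big)\cdot\mathbf{f}_i\,d\Omega + \int_{\Gamma_H} \Big(\frac{1}{\mu}\nabla\times\mathbf{A}^{(n)}\times\mathbf{n} - \mathbf{K}\Big)\cdot\mathbf{f}_i\,d\Gamma + \int_{\Gamma_B} \frac{1}{\mu}(\nabla\cdot\mathbf{A}^{(n)})(\mathbf{f}_i\cdot\mathbf{n})\,d\Gamma = 0, \qquad i = 1, 2, \ldots, n. \tag{311}$$

These equations can be rewritten into the following symmetric form by means of the Green's identities (303) and (304), whereby the functions $\mathbf{f}_i$ satisfy the homogeneous Dirichlet boundary conditions (306) and (307):

$$\int_\Omega \Big[\frac{1}{\mu}(\nabla\times\mathbf{A}^{(n)})\cdot(\nabla\times\mathbf{f}_i) + \frac{1}{\mu}(\nabla\cdot\mathbf{A}^{(n)})(\nabla\cdot\mathbf{f}_i)\Big]\,d\Omega = \int_\Omega \mathbf{J}\cdot\mathbf{f}_i\,d\Omega + \int_{\Gamma_H} \mathbf{K}\cdot\mathbf{f}_i\,d\Gamma, \qquad i = 1, 2, \ldots, n. \tag{312}$$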

2. Galerkin's Method in the Eddy Current Case

In Section III.C, four different potential formulations of the eddy current problem have been presented: the $\mathbf{A},V$-$\mathbf{A}$, the $\mathbf{A},V$-$\mathbf{A}$-$\psi$, the $\mathbf{T},\psi$-$\psi$ and the $\mathbf{T},\psi$-$\mathbf{A}$-$\psi$ formulations. Among these, the $\mathbf{A},V$-$\mathbf{A}$ approach is a special case of the $\mathbf{A},V$-$\mathbf{A}$-$\psi$ formulation; it corresponds to the absence of any region with the variable $\psi$. Similarly, the use of the $\mathbf{T},\psi$-$\psi$ method can be regarded as a particular case of the application of the $\mathbf{T},\psi$-$\mathbf{A}$-$\psi$ method with the potential $\psi$ employed throughout the eddy current free region. Therefore, Galerkin's equations will only be derived for the two general formulations. These are the $\mathbf{A},V$-$\mathbf{A}$-$\psi$ and $\mathbf{T},\psi$-$\mathbf{A}$-$\psi$ formulations treated in Sections V.C.2.a and b, respectively.


a. Galerkin's Equations for the $\mathbf{A},V$-$\mathbf{A}$-$\psi$ Formulation

The differential equations, boundary and interface conditions of the $\mathbf{A},V$-$\mathbf{A}$-$\psi$ formulation for eddy current problems have been summarized in Eqs. (168) to (189) in Section III.C.2. In writing the corresponding Galerkin's equations, it turns out that the symmetry of the matrices involved can only be achieved by treating the electric scalar potential $V$ as the time derivative of a modified scalar potential $v$:

$$V = \frac{\partial v}{\partial t}, \tag{313}$$

with the homogeneous initial condition

$$v(t = 0) = 0. \tag{314}$$

This, in effect, provides the symmetry of the operator $L_t$ in Eqs. (168) and (169). The relevant Green's identities have been written for the magnetic vector potential $\mathbf{A}$ in the differential equations (168) and (170) in Eqs. (303) and (304), and for the magnetic scalar potential $\psi$ in the differential equation (171) in Eq. (293). For the differential equation (169) involving the electric scalar potential, the following identity is appropriate:

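$$\int_{\Omega_c} \Big[\nabla\cdot\Big(-\sigma\frac{\partial\mathbf{A}}{\partial t} - \sigma\nabla\frac{\partial v}{\partial t}\Big)\Big]\,w\,d\Omega = \int_{\Omega_c} \Big(\sigma\frac{\partial\mathbf{A}}{\partial t} + \sigma\nabla\frac{\partial v}{\partial t}\Big)\cdot\nabla w\,d\Omega - \oint_{\Gamma} \Big(\sigma\frac{\partial\mathbf{A}}{\partial t} + \sigma\nabla\frac{\partial v}{\partial t}\Big)\cdot\mathbf{n}\,w\,d\Gamma. \tag{315}$$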

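As before, the various potentials are approximated as

$$\mathbf{A} \approx \mathbf{A}^{(n)} = \mathbf{A}_D + \sum_{k=1}^{n_1} A_k \mathbf{f}_k, \tag{316}$$

$$v \approx v^{(n)} = v_D + \sum_{k=n_1+1}^{n_2} v_k f_k, \tag{317}$$

$$\psi \approx \psi^{(n)} = \psi_D + \sum_{k=n_2+1}^{n} \psi_k f_k, \tag{318}$$

where the expansion functions satisfy the homogeneous Dirichlet boundary conditions

$$\mathbf{n}\times\mathbf{f}_k = \mathbf{0} \quad \text{on } \Gamma_E \text{ and } \Gamma_{BA}, \tag{319}$$

$$\mathbf{f}_k\cdot\mathbf{n} = 0 \quad \text{on } \Gamma_{Hc},\ \Gamma_{HA},\ \Gamma_{nc\psi} \text{ and } \Gamma_{A\psi}, \tag{320}$$

$$f_k = 0 \quad \text{on } \Gamma_E \text{ and } \Gamma_{H\psi}, \tag{321}$$

and the functions $\mathbf{A}_D$, $v_D$, $\psi_D$ satisfy the nonhomogeneous Dirichlet boundary conditions of the problem:

$$\mathbf{n}\times\mathbf{A}_D = \mathbf{0} \quad \text{on } \Gamma_E, \tag{322}$$

$$\mathbf{n}\times\mathbf{A}_D = \boldsymbol{\alpha} \quad \text{on } \Gamma_{BA}, \tag{323}$$

$$\mathbf{A}_D\cdot\mathbf{n} = 0 \quad \text{on } \Gamma_{Hc},\ \Gamma_{HA},\ \Gamma_{nc\psi} \text{ and } \Gamma_{A\psi}, \tag{324}$$

$$v_D = \int_0^t V_0(\tau)\,d\tau \quad \text{on } \Gamma_E, \tag{325}$$

$$\psi_D = \psi_0 \quad \text{on } \Gamma_{H\psi}. \tag{326}$$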

Then, the approximating functions $\mathbf{A}^{(n)}$, $v^{(n)}$ and $\psi^{(n)}$ satisfy the Dirichlet boundary conditions given in Eqs. (172), (173), (177), (179), (181) and (188). The continuity of the magnetic vector potential on the surface $\Gamma_{ncA}$, as prescribed in Eq. (183), is obvious since the same expansion is used for $\mathbf{A}$ in the regions on both sides of this surface. The weak formulation involves three equations, one for each potential. Besides the differential equations (168)-(171), they account for the Neumann boundary conditions (174), (175), (176), (178), (180) and (182), as well as the interface conditions (184), (185), (186), (187) and (189):

P

. wdr = 0,

for any function w,

(327)

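$$\int_{\Omega_c} \Big[\nabla\cdot\Big(-\sigma\frac{\partial\mathbf{A}}{\partial t} - \sigma\nabla\frac{\partial v}{\partial t}\Big)\Big]\,w\,d\Omega + \int_{\Gamma_{Hc}+\Gamma_{ncA}+\Gamma_{nc\psi}} \Big(\sigma\frac{\partial\mathbf{A}}{\partial t} + \sigma\nabla\frac{\partial v}{\partial t}\Big)\cdot\mathbf{n}\,w\,d\Gamma = 0, \qquad \text{for any function } w, \tag{328}$$

$$\int_{\Omega_\psi} \nabla\cdot(\mu\nabla\psi)\,w\,d\Omega - \int_{\Gamma_{B\psi}} (\mathbf{n}\cdot\mu\nabla\psi - \rho_{ms})\,w\,d\Gamma + \int_{\Gamma_{nc\psi}+\Gamma_{A\psi}} (-\mathbf{n}_\psi\cdot\mu\nabla\psi + \mathbf{n}_A\cdot\nabla\times\mathbf{A})\,w\,d\Gamma = 0, \qquad \text{for any function } w. \tag{329}$$

Again, Galerkin's equations can be written by replacing the potentials with the expansions (316)-(318) and the functions $\mathbf{w}$ and $w$ with the functions $\mathbf{f}_i$ and $f_i$, respectively. In order to achieve symmetry, use is made of the Green's identities (293), (303), (304) and (315), as well as of the fact that the expansion functions satisfy the Dirichlet boundary conditions (319)-(321). Further, the integral coupling the magnetic vector potential and the magnetic scalar potential in Eq. (327), i.e., the second term in the fifth surface integral, is transformed by the identity

$$\int_{\Gamma_{nc\psi}+\Gamma_{A\psi}} (-\nabla\psi^{(n)}\times\mathbf{n}_\psi)\cdot\mathbf{f}_i\,d\Gamma = -\int_{\Gamma_{nc\psi}+\Gamma_{A\psi}} \psi^{(n)}\,\nabla\times\mathbf{f}_i\cdot\mathbf{n}_\psi\,d\Gamma + \oint_{C_{A\psi}} \psi^{(n)}\,\mathbf{f}_i\cdot d\mathbf{l}, \tag{330}$$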

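where $C_{A\psi}$ is the curve bounding the surface $\Gamma_{nc\psi} + \Gamma_{A\psi}$, which is the interface between the regions with $\mathbf{A}$ and $\psi$. If this surface is closed, the curve integral does not arise; otherwise any part of the curve $C_{A\psi}$ bounds either a surface with $\mathbf{n}\times\mathbf{f}_i$ zero ($\Gamma_E$ or $\Gamma_{BA}$, see Eq. (319)) or one where the approximating function $\psi^{(n)}$ satisfies the Dirichlet boundary condition (181) (on $\Gamma_{H\psi}$). Denoting this latter curve by $C_{HA\psi}$ and using $\mathbf{n}_\psi = -\mathbf{n}_A$, the identity (330) becomes

$$\int_{\Gamma_{nc\psi}+\Gamma_{A\psi}} (-\nabla\psi^{(n)}\times\mathbf{n}_\psi)\cdot\mathbf{f}_i\,d\Gamma = \int_{\Gamma_{nc\psi}+\Gamma_{A\psi}} \psi^{(n)}\,\nabla\times\mathbf{f}_i\cdot\mathbf{n}_A\,d\Gamma + \int_{C_{HA\psi}} \psi_0\,\mathbf{f}_i\cdot d\mathbf{l}. \tag{331}$$

Finally, Galerkin's equations have the following form: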

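$$\int_{\Gamma_{nc\psi}+\Gamma_{A\psi}} (\nabla\times\mathbf{A}^{(n)}\cdot\mathbf{n}_A)\,f_i\,d\Gamma - \int_{\Omega_\psi} \mu\nabla\psi^{(n)}\cdot\nabla f_i\,d\Omega = -\int_{\Gamma_{B\psi}} \rho_{ms}\,f_i\,d\Gamma, \qquad i = n_2+1, n_2+2, \ldots, n. \tag{334}$$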

This set of ordinary differential equations can be summarized in the matrix form

[‘oA AAA 0

;j[;]+FiA;];[;]=p]? A,,

BAA” BA” :

(335)

where the elements of the matrices can be taken from Eqs. (332)-(334). Both the matrix $\mathbf{A}$ and the matrix $\mathbf{B}$ are readily seen to be symmetric. However, since the diagonal elements of $\mathbf{A}_{\psi\psi}$ are negative, the matrix $\mathbf{A}$ is not positive definite even if the zero rows and columns are disregarded, whereas the matrix $\mathbf{B}$ does have this property.

b. Galerkin's Equations for the $\mathbf{T},\psi$-$\mathbf{A}$-$\psi$ Formulation

The differential equations, boundary and interface conditions for the $\mathbf{T},\psi$-$\mathbf{A}$-$\psi$ formulation have been written in Eqs. (203) to (227) in Section III.C.4. The symmetry of the operator $L_t$ turns out to be ensured only if Galerkin's method is applied to the time derivatives of the differential equations (204), (205) and (206), of the boundary conditions (208), (212), (214), (216) and (227), and of the interface conditions (219), (220), (222), (224) and (225). The Green's identities for the vector potentials $\mathbf{T}$ and $\mathbf{A}$ in the differential equations (203) and (205) are given for the magnetic vector potential $\mathbf{A}$ in Eqs. (303) and (304). For the current vector potential $\mathbf{T}$ they are completely analogous. The relevant identity for the magnetic scalar potential $\psi$ in the differential equation (206) has been written in Eq. (293). For the time derivative of the differential equation (204), the following identity will be used:

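The expansions used for the approximation of the potentials satisfy the Dirichlet boundary conditions:

$$\mathbf{T} \approx \mathbf{T}^{(n)} = \mathbf{T}_D + \sum_{k=1}^{n_1} T_k \mathbf{f}_k, \tag{337}$$

$$\psi \approx \psi^{(n)} = \psi_D + \sum_{k=n_1+1}^{n_2} \psi_k f_k, \tag{338}$$

$$\mathbf{A} \approx \mathbf{A}^{(n)} = \mathbf{A}_D + \sum_{k=n_2+1}^{n} A_k \mathbf{f}_k. \tag{339}$$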

This is achieved again by using expansion functions that obey the homogeneous Dirichlet boundary conditions n x fk = 0,

on

-

fk n = 0,

rH,,

on I-,,

r&,and r,,,,, r H A , rneA

and

rAs,

(3401 (341)

on rHc and r H $ , (342) and by constructing the functions T O , $, and AD to safisfy the Dirichlet boundary conditions of the problem: fk

= 0,

nx TD

TD

= 0,

- n = 0,

on I ', and on

rHc

n x AD = K,

on

rkA,

AD n = 0,

on

rHA

$D

=

r,,,,

on r H c and

rncA,

and and

rH#

(343) (344) (345) (346)

rAs.

(347)

Hence, the satisfaction of the Dirichlet boundary conditions (209), (210), (211), (213), (215), (218), (221) and (226) by the approximating functions $\mathbf{T}^{(n)}$, $\psi^{(n)}$ and $\mathbf{A}^{(n)}$ is ensured. The continuity of the magnetic scalar potential on the surface $\Gamma_{nc\psi}$ is provided by the use of the same expansion for $\psi$ in the regions on both sides of this surface. The weak formulation consists of three equations, one for each potential. They are written for the differential equation (203) and for the time derivatives of the differential equations (204)-(206). They take care of the Neumann boundary conditions (207), (211a) and (219a), the time derivatives of the Neumann boundary conditions (208), (212), (214), (216) and (227), the interface condition (223) and the time derivatives of the interface conditions (219), (220), (222), (224) and (225):

v x -1v

1 a

x T - V-v

+IrE(:.

. T + -a- ( p ~ )at

x T xn).wd,+jr~li(:V.T)(w.n)dT

function w,

(348)

a

-VXjQtZA[

" (L

-VxA

+V-

-V.A)

wdR

a

1 -VxA

xn+-K

a

1 -VxA

FT ~n,--xn~+V-xn,

+ jrHA[-'(/'

+6;teA[-'(p

) " (L ] ) ]. w d r )

a*at

1

.wdT

at

for any function w. (350)


Similarly to the previous formulations, Galerkin's equations can be derived by writing the expansions (337)-(339) instead of the potentials and replacing the weighting functions $\mathbf{w}$ and $w$ with the expansion functions $\mathbf{f}_i$ and $f_i$, respectively. Symmetric forms are obtained when the identities (303), (304), (293) and (336) are applied as appropriate, and the homogeneous Dirichlet boundary conditions (340)-(342) on the expansion functions are taken into account. The integrals over the surfaces $\Gamma_{ncA}$ and $\Gamma_{A\psi}$ containing the scalar potential $\psi$ in Eq. (350) can be transformed in a way similar to Eq. (331):

where, on the surface r n C A , n# stands for n, and nA for n,. The curve CHA+ is that part of the boundary of the surface r n c A + F A + , which is also a boundary of the surfaces r,, or r H $ . Galerkin's equations can then be brought to the following form using n, = -nA on the surface r n f A and nJI = -nA on r A + :

1

['-(V x T ' " ) ) - ( V x f i ) + - (1V - T ( " ' ) ( V - f i ) ] d R

Rc

CT

a -'lnnA[;

0

-(V x A'"') (V x fJ

1 + -(V c1

1

A("')(V fi) dR


The matrix form of these equations can be written as

where the elements of the matrices can be obtained from Eqs. (352)-(354). The matrices are obviously symmetric, with the diagonal elements of the submatrix $\mathbf{B}_{AA}$ negative. In contrast to the $\mathbf{A},V$-$\mathbf{A}$-$\psi$ formulation, the time derivatives of the elements of the matrix $\mathbf{B}$ are also present if, in the nonlinear case, they are time dependent.

3. Galerkin's Method for Waveguides and Cavities

The differential equations of the two potential descriptions of waveguides and cavities described in Sections IV.B.1 and 2 are completely analogous; therefore, Galerkin's equations will only be presented for the method using the magnetic vector potential $\mathbf{A}$ and the electric scalar potential $V$. Similarly to the eddy current case, the symmetry of Galerkin's equations turns out to be achievable only if the electric scalar potential $V$ is treated as the time derivative of a modified scalar potential $v$,

$$V = j\omega v, \tag{356}$$

resulting in the symmetry of the operator $L_\lambda$. Let us introduce the relative permeability and permittivity tensors $[\mu_r]$ and $[\varepsilon_r]$ and the relative permeability $\mu_r$ corresponding to the constant $\mu$ in the second term of the differential equation (243) as

$$[\mu_r] = \frac{[\mu]}{\mu_0}, \tag{357}$$

$$[\varepsilon_r] = \frac{[\varepsilon]}{\varepsilon_0}, \tag{358}$$

$$\mu_r = \frac{\mu}{\mu_0}, \tag{359}$$

where $\mu_0$ and $\varepsilon_0$ are the permeability and permittivity of free space. Further, introducing the free space wave number $k_0$ as

$$k_0 = \omega\sqrt{\mu_0\varepsilon_0}, \tag{360}$$

the differential equations (243) and (244) can be rewritten as

$$\nabla\times[\mu_r]^{-1}\nabla\times\mathbf{A} - \nabla\frac{1}{\mu_r}\nabla\cdot\mathbf{A} - k_0^2[\varepsilon_r]\mathbf{A} - k_0^2[\varepsilon_r]\nabla v = \mathbf{0}, \tag{361}$$

$$k_0^2\,\nabla\cdot[\varepsilon_r]\mathbf{A} + k_0^2\,\nabla\cdot[\varepsilon_r]\nabla v = 0, \qquad \text{in } \Omega. \tag{362}$$

Similarly, the Neumann boundary condition (245) has the form

$$\mathbf{n}\cdot(-k_0^2[\varepsilon_r]\mathbf{A} - k_0^2[\varepsilon_r]\nabla v) = 0 \quad \text{on } \Gamma_H. \tag{363}$$


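The eigenvalues to be determined are the values of the squared free space wave number, i.e., in effect, of the frequency. The proper vector identity for the differential equation (362) has the form

$$\int_\Omega (k_0^2\,\nabla\cdot[\varepsilon_r]\mathbf{A} + k_0^2\,\nabla\cdot[\varepsilon_r]\nabla v)\,w\,d\Omega = \int_\Omega (-k_0^2[\varepsilon_r]\mathbf{A} - k_0^2[\varepsilon_r]\nabla v)\cdot\nabla w\,d\Omega - \oint_\Gamma (-k_0^2[\varepsilon_r]\mathbf{A} - k_0^2[\varepsilon_r]\nabla v)\cdot\mathbf{n}\,w\,d\Gamma. \tag{364}$$

Again, the approximations of the potentials have to satisfy the Dirichlet boundary conditions, which are all homogeneous for this kind of problem:

$$\mathbf{A} \approx \mathbf{A}^{(n)} = \sum_{k=1}^{n_1} A_k \mathbf{f}_k, \tag{365}$$

$$v \approx v^{(n)} = \sum_{k=n_1+1}^{n} v_k f_k, \tag{366}$$

where the Dirichlet boundary conditions (239), (240) and (242) are satisfied by the expansion functions:

$$\mathbf{n}\times\mathbf{f}_k = \mathbf{0} \quad \text{on } \Gamma_E, \tag{367}$$

$$\mathbf{f}_k\cdot\mathbf{n} = 0 \quad \text{on } \Gamma_H, \tag{368}$$

$$f_k = 0 \quad \text{on } \Gamma_E. \tag{369}$$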

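The weak formulation of the problem involves two equations, one for $\mathbf{A}$ and one for $v$. They take care of the differential equations (361), (362) and the Neumann boundary conditions (241), (245) and (246):

$$\int_\Omega \Big(\nabla\times[\mu_r]^{-1}\nabla\times\mathbf{A} - \nabla\frac{1}{\mu_r}\nabla\cdot\mathbf{A} - k_0^2[\varepsilon_r]\mathbf{A} - k_0^2[\varepsilon_r]\nabla v\Big)\cdot\mathbf{w}\,d\Omega + \int_{\Gamma_H} ([\mu_r]^{-1}\nabla\times\mathbf{A}\times\mathbf{n})\cdot\mathbf{w}\,d\Gamma + \int_{\Gamma_E} \frac{1}{\mu_r}(\nabla\cdot\mathbf{A})(\mathbf{w}\cdot\mathbf{n})\,d\Gamma = 0, \qquad \text{for any function } \mathbf{w}, \tag{370}$$

$$\int_\Omega (k_0^2\,\nabla\cdot[\varepsilon_r]\mathbf{A} + k_0^2\,\nabla\cdot[\varepsilon_r]\nabla v)\,w\,d\Omega + \int_{\Gamma_H} (-k_0^2[\varepsilon_r]\mathbf{A} - k_0^2[\varepsilon_r]\nabla v)\cdot\mathbf{n}\,w\,d\Gamma = 0, \qquad \text{for any function } w. \tag{371}$$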

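Galerkin's equations can be obtained from here by replacing the potentials with their approximating expansions in Eqs. (365) and (366) and the weighting functions $\mathbf{w}$ and $w$ with the expansion functions $\mathbf{f}_i$ and $f_i$, respectively. Applying the identities (303), (304) and (364), as well as the fact that the expansion functions $\mathbf{f}_i$ and $f_i$ satisfy the homogeneous Dirichlet boundary conditions (367)-(369), the following symmetric forms are derived:

$$\int_\Omega \Big\{(\nabla\times\mathbf{A}^{(n)})\cdot[\mu_r]^{-1}(\nabla\times\mathbf{f}_i) + \frac{1}{\mu_r}(\nabla\cdot\mathbf{A}^{(n)})(\nabla\cdot\mathbf{f}_i)\Big\}\,d\Omega - k_0^2\int_\Omega ([\varepsilon_r]\mathbf{A}^{(n)} + [\varepsilon_r]\nabla v^{(n)})\cdot\mathbf{f}_i\,d\Omega = 0, \qquad i = 1, 2, \ldots, n_1, \tag{372}$$

$$-k_0^2\int_\Omega ([\varepsilon_r]\mathbf{A}^{(n)} + [\varepsilon_r]\nabla v^{(n)})\cdot\nabla f_i\,d\Omega = 0, \qquad i = n_1+1, n_1+2, \ldots, n. \tag{373}$$

In a matrix form, this generalized eigenvalue problem can be written as

$$\begin{bmatrix} \mathbf{A}_{AA} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{A} \\ \mathbf{v} \end{bmatrix} - k_0^2 \begin{bmatrix} \mathbf{B}_{AA} & \mathbf{B}_{Av} \\ \mathbf{B}_{vA} & \mathbf{B}_{vv} \end{bmatrix} \begin{bmatrix} \mathbf{A} \\ \mathbf{v} \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}. \tag{374}$$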
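The structure of a generalized eigenvalue problem like (374), with a block of zero rows and columns in the first matrix and a symmetric positive definite second matrix, can be explored numerically with a small dense example. The sketch below is only an illustration with hypothetical names (`zero_padded_eigen`, `A11`): it reduces the problem $\mathbf{A}\mathbf{x} = k_0^2\mathbf{B}\mathbf{x}$ to a standard symmetric one through a Cholesky factorization of $\mathbf{B}$, which is a textbook reduction and not necessarily the solver used by the authors.

```python
import numpy as np

def zero_padded_eigen(A11, B):
    """Solve A x = k0_sq * B x, where A = [[A11, 0], [0, 0]] and B is SPD.

    Reduction used here (an illustrative choice): B = L L^T, then
    C = L^{-1} A L^{-T} is symmetric and C y = k0_sq * y with x = L^{-T} y.
    """
    n, n1 = B.shape[0], A11.shape[0]
    A = np.zeros((n, n))
    A[:n1, :n1] = A11                  # remaining n - n1 rows/columns are zero

    L = np.linalg.cholesky(B)
    L_inv = np.linalg.inv(L)
    C = L_inv @ A @ L_inv.T
    C = 0.5 * (C + C.T)                # remove round-off asymmetry
    k0_sq, Y = np.linalg.eigh(C)
    X = L_inv.T @ Y                    # eigenvectors of the original problem
    return k0_sq, X
```

With a positive definite `A11`, the computed spectrum contains as many (numerically) zero eigenvalues as there are zero rows, and the remaining eigenvalues are positive.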

The matrix $\mathbf{A}_{AA}$ and the entire matrix $\mathbf{B}$ are easily seen to be symmetric and positive definite. This ensures that the eigenvalues are nonnegative. Since $(n - n_1)$ rows and columns of the matrix $\mathbf{A}$ are zero, $(n - n_1)$ zero eigenvalues are obtained. However, all other eigenvalues are positive, corresponding to physical modes.

VI. APPLICATION OF THE FINITE ELEMENT METHOD

An essential problem of Galerkin's method is the selection of the expansion functions that satisfy the Dirichlet boundary conditions and constitute an entire set for the approximation of the potentials. The most flexible way is the use of the Finite Element Method. An excellent treatment of the method is given by Zienkiewicz (1977), so only a few outstanding features will be summarized in Section VI.A. The finite element solution of a nonlinear, static magnetic problem, the three-dimensional model of a choke coil, is presented in Section VI.B. A time-harmonic eddy current problem involving a coil over an aluminum plate with a hole is solved in Section VI.C. The transient eddy current problem of a conducting brick with a hole in a homogeneous, exponentially decaying magnetic field is tackled in Section VI.D. The dominant modes and the corresponding dispersion characteristics of two ferrite loaded anisotropic waveguides are determined in Section VI.E. Finally, Section VI.F is devoted to the problem of a dielectric loaded cavity.


A . A Summary of the Finite Element Method

The Finite Element Method is based on a division of the studied region $\Omega$ into small subregions, the so-called finite elements. The potential functions are approximated by low order polynomials within each finite element. Elements of various shapes are in current use, triangles and quadrilaterals being the most popular ones in two dimensions, and tetrahedra, prisms and hexahedra in three dimensions. In the particular variation used exclusively in the present work, the elements are defined by means of nodes. Some typical two- and three-dimensional elements are shown in Fig. 7. Special interpolation polynomials, called element shape functions, are used for the approximation of the potentials within each element. One element shape function $N_k^{(e)}$ is associated with each node in an element; it assumes the value one at this node and is zero at all other nodes:

$$N_k^{(e)} = \begin{cases} 1 & \text{at the node } k, \\ 0 & \text{at other nodes.} \end{cases} \tag{375}$$

FIG. 7. Typical finite elements. (a) Three-noded triangular element. (b) Eight-noded quadrilateral element. (c) Ten-noded tetrahedral element. (d) Twenty-noded hexahedral element.

The element shape functions are conveniently defined in a local coordinate system associated with each element. For example, a two-dimensional, eight-noded, rectangular element is shown in the local as well as the global coordinate system in Fig. 8. The element shape functions can be written as

$$N_k^{(e)}(\xi, \eta) = \tfrac{1}{4}(1 + \xi\xi_k)(1 + \eta\eta_k)(\xi\xi_k + \eta\eta_k - 1) \quad \text{for corner nodes } (\xi_k = \pm 1,\ \eta_k = \pm 1),$$

$$N_k^{(e)}(\xi, \eta) = \tfrac{1}{2}(1 - \xi^2)(1 + \eta\eta_k) \quad \text{for midside nodes } (\xi_k = 0,\ \eta_k = \pm 1),$$

$$N_k^{(e)}(\xi, \eta) = \tfrac{1}{2}(1 - \eta^2)(1 + \xi\xi_k) \quad \text{for midside nodes } (\eta_k = 0,\ \xi_k = \pm 1), \tag{376}$$

where $(\xi_k, \eta_k)$ are the local coordinates of the node $k$. The shape functions represented by Eq. (376) are shown in Fig. 9. They are quadratic polynomials and are easily seen to satisfy the conditions (375). In three dimensions, three local variables $\xi$, $\eta$ and $\zeta$ are involved. A transformation between the local and global coordinates of an element with $n_n^{(e)}$ nodes can be defined with the aid of the locally defined shape functions:

$$x(\xi, \eta, \zeta) = \sum_{k=1}^{n_n^{(e)}} x_k N_k^{(e)}(\xi, \eta, \zeta), \quad y(\xi, \eta, \zeta) = \sum_{k=1}^{n_n^{(e)}} y_k N_k^{(e)}(\xi, \eta, \zeta), \quad z(\xi, \eta, \zeta) = \sum_{k=1}^{n_n^{(e)}} z_k N_k^{(e)}(\xi, \eta, \zeta), \tag{377}$$


FIG.9. Element shape functions of an eight-noded quadrilateral element.

where $(x_k, y_k, z_k)$ are the global coordinates of the $k$th node. Evidently, the local coordinates of the nodes are transformed into their global coordinates in view of the property (375) of the element shape functions. This transformation also defines the element shape functions in terms of the global coordinates $x$, $y$ and $z$. The function approximating the potential function can be written in an element as

$$u^{(e)} = \sum_{k=1}^{n_n^{(e)}} u_k N_k^{(e)}, \tag{378}$$

where $u_k$ is the value of the potential in the node $k$. Indeed, the conditions (375) ensure that the function $u^{(e)}$ assumes the value $u_k$ at the $k$th node. The division of the region $\Omega$ into finite elements defines a global set of nodes wherein the nodes of the neighbouring elements coincide. One global shape function $N_k$ can be associated with each global node by defining it to equal the relevant element shape function in each element containing this global node and to be zero in all other elements. These global shape functions are therefore continuous, not only within the elements but also on the element boundaries. They satisfy the conditions (375), i.e., they equal unity at the corresponding global node and are zero at all other nodes. Therefore, they allow an approximation of a potential function $u$ in the region $\Omega$ in terms of its nodal values $u_k$:

$$u^{(n)}(x, y, z) = \sum_{k=1}^{n_n} u_k N_k(x, y, z), \tag{379}$$

where $n_n$ is the number of global nodes. The global shape functions can serve as expansion functions in the approximation of the potentials in Galerkin's method. They can also be used in a straightforward manner for the construction of functions satisfying the Dirichlet boundary conditions.
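The properties of the eight-noded element discussed above can be checked with a few lines of code. The following sketch evaluates the shape functions of Eq. (376) and the two-dimensional counterpart of the coordinate transformation (377). The node ordering in `NODES` and the function names are assumptions made for this example only.

```python
import numpy as np

# Local coordinates (xi_k, eta_k): four corner nodes followed by four midside
# nodes. The ordering is an arbitrary choice made for this sketch.
NODES = np.array([(-1, -1), (1, -1), (1, 1), (-1, 1),
                  (0, -1), (1, 0), (0, 1), (-1, 0)], dtype=float)

def shape_functions(xi, eta):
    """Return the eight element shape functions of Eq. (376) at (xi, eta)."""
    N = np.empty(8)
    for k, (xi_k, eta_k) in enumerate(NODES):
        if xi_k != 0 and eta_k != 0:       # corner node
            N[k] = 0.25 * (1 + xi * xi_k) * (1 + eta * eta_k) \
                   * (xi * xi_k + eta * eta_k - 1)
        elif xi_k == 0:                    # midside node with xi_k = 0
            N[k] = 0.5 * (1 - xi**2) * (1 + eta * eta_k)
        else:                              # midside node with eta_k = 0
            N[k] = 0.5 * (1 - eta**2) * (1 + xi * xi_k)
    return N

def to_global(xi, eta, xy_nodes):
    """Map local (xi, eta) to global (x, y) as in Eq. (377), restricted to 2D."""
    return shape_functions(xi, eta) @ xy_nodes
```

Evaluating `shape_functions` at each of the eight nodes reproduces the conditions (375), and `to_global` maps the local node positions onto the prescribed global node coordinates.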


Consider first the case when $u$ is a scalar potential that must obey the Dirichlet boundary condition

$$u = g \quad \text{on } \Gamma_D. \tag{380}$$

Let $n_D$ denote the number of nodes on the surface $\Gamma_D$ (with the order numbers $1, 2, \ldots, n_D$), and let the number of the rest of the nodes (with order numbers $n_D + 1, n_D + 2, \ldots, n_n$) be $n = n_n - n_D$. With the values of the given function $g$ in the nodes on $\Gamma_D$ denoted by $g_k$ ($k = 1, 2, \ldots, n_D$), the function

$$u_D = \sum_{k=1}^{n_D} g_k N_k \tag{381}$$

satisfies the Dirichlet boundary condition (380) in the nodes on the surface $\Gamma_D$. Since the global shape functions associated with the nodes $n_D + 1, n_D + 2, \ldots, n_D + n$ outside $\Gamma_D$ obviously vanish on this surface, the approximation

$$u \approx u^{(n)} = u_D + \sum_{k=n_D+1}^{n_n} u_k N_k \tag{382}$$

is appropriate for Galerkin's method. Its comparison with Eq. (272) reveals that the unknowns are the values of the potential in the $n$ nodes outside the surface with Dirichlet boundary condition, and the expansion functions are the global shape functions associated with these nodes.

If a vector potential $\mathbf{u}$ is to be approximated then, using the vector shape functions

$$\mathbf{N}_{xk} = N_k\mathbf{e}_x, \quad \mathbf{N}_{yk} = N_k\mathbf{e}_y, \quad \mathbf{N}_{zk} = N_k\mathbf{e}_z, \tag{383}$$

an approximation in terms of the nodal values of the function can be written as

$$\mathbf{u}^{(n)} = \sum_{k=1}^{n_n} (u_{xk}\mathbf{N}_{xk} + u_{yk}\mathbf{N}_{yk} + u_{zk}\mathbf{N}_{zk}). \tag{384}$$

In case the vector potential is to satisfy the Dirichlet boundary conditions

$$\mathbf{u}\cdot\mathbf{n} = g \quad \text{on } \Gamma_{D1}, \tag{385}$$

$$\mathbf{n}\times\mathbf{u} = \mathbf{g} \quad \text{on } \Gamma_{D2}, \tag{386}$$

it is advantageous to introduce the local coordinate directions $\mathbf{n}$, $\mathbf{s}$ and $\mathbf{t}$ in the nodes on $\Gamma_{D1}$ and $\Gamma_{D2}$, with $\mathbf{s}$ and $\mathbf{t}$ being two orthogonal tangential unit vectors. The vector shape functions

$$\mathbf{N}_{nk} = N_k\mathbf{n}, \quad \mathbf{N}_{sk} = N_k\mathbf{s}, \quad \mathbf{N}_{tk} = N_k\mathbf{t} \tag{387}$$

can then be used in these nodes instead of those given in Eq. (383). Let the number of nodes on $\Gamma_{D1}$ be $n_{D1}$ (with order numbers $1, 2, \ldots, n_{D1}$) and let us denote this on $\Gamma_{D2}$ by $n_{D2}$ (the corresponding order numbers are $n_{D1} + 1, n_{D1} + 2, \ldots, n_{D1} + n_{D2}$). Further, the values of $g$ in the nodes on $\Gamma_{D1}$ are $g_k$ ($k = 1, 2, \ldots, n_{D1}$), and the values of $\mathbf{g}$ on $\Gamma_{D2}$ are $\mathbf{g}_k$ ($k = n_{D1} + 1, n_{D1} + 2, \ldots, n_{D1} + n_{D2}$). Now, the function

satisfies the Dirichlet boundary conditions (385), (386) in the nodes on the surfaces $\Gamma_{D1}$ and $\Gamma_{D2}$. The global vector shape functions $\mathbf{N}_{sk}$ and $\mathbf{N}_{tk}$ associated with the nodes $1, 2, \ldots, n_{D1}$, $\mathbf{N}_{nk}$ corresponding to the nodes $n_{D1} + 1, n_{D1} + 2, \ldots, n_{D1} + n_{D2}$, as well as those in Eq. (384) associated with the nodes $n_{D1} + n_{D2} + 1, n_{D1} + n_{D2} + 2, \ldots, n_n$ outside the surfaces $\Gamma_{D1}$ and $\Gamma_{D2}$, satisfy the homogeneous counterparts of the Dirichlet boundary conditions (385) and (386). Therefore the approximation

is applicable to Galerkin's method. The unknowns are the tangential components of $\mathbf{u}$ in the nodes on $\Gamma_{D1}$, the normal component of $\mathbf{u}$ on $\Gamma_{D2}$ and all three components of $\mathbf{u}$ in the nodes outside these two surfaces with Dirichlet boundary conditions. The finite elements used in the following examples are twenty-noded hexahedral elements in the three-dimensional problems and eight-noded rectangular elements in the two-dimensional waveguide problems (Zienkiewicz, 1977).

B. Analysis of an Iron Cored Choke Coil

As an example for the computer aided analysis of a static magnetic field, the problem of an iron cored choke coil is treated in this subsection. A cylindrical coil is wound around an iron core with an air gap (Fig. 10). The material of the iron core is strongly nonlinear; the B-H characteristics are shown in Fig. 11. The coil carries a current of I = 31500 AT, which results in a strong saturation of the iron. The formulation in terms of the total and reduced scalar potentials is used for the solution. The total magnetic scalar potential $\psi$ defined in Eq. (52) is employed in the iron core, and the reduced magnetic scalar potential $\Phi$ in Eq. (51) is the variable everywhere in the air region. The source field $\mathbf{H}_s$ due to the cylindrical coil is computed by the Biot-Savart Law (47), using analytical integration techniques in the axial and radial directions and numerical Gauss integration in the azimuthal direction (Urankar, 1982).

FIG. 10. The model of an iron cored choke coil.

FIG. 11. The B-H characteristics of the iron core.

Two planes of symmetry have been assumed. In the x-y plane, the normal component of the flux density is zero; i.e., the Neumann boundary conditions (57) and (58) are in effect with $\rho_{ms} = 0$. Since this is also a plane of symmetry with respect to the coil, the normal component of $\mathbf{H}_s$ is zero, too, in Eq. (57). In the x-z plane, the tangential component of the magnetic field intensity vanishes, so the Dirichlet boundary conditions (55) and (56) are given here.


Since no surface currents are present and the tangential component of $\mathbf{H}_s$ is zero on this plane, both $\Phi_0$ and $\psi_0$ can be selected to be zero. The finite element mesh of the problem region is shown in Fig. 12. Since no special techniques are undertaken to account for the infinity of the space surrounding the coil, far boundaries have been introduced, with both potentials vanishing there. The point $P_0$, where the two potentials are chosen to be equal and which is the starting point for the integration in Eq. (61), is also

( b) FIG.12. Finite element mesh of the iron cored choke coil. (a) mesh in the x-yplane. (b)mesh in the 1-2 plane. (c) detail of the mesh showing the iron core.


indicated. The finite element mesh consists of 4,125 elements with 18,628 global nodes. The number of degrees of freedom in the set of nonlinear algebraic equations is 15,598.The equations have been solved by Newton Raphson iterative techniques (Chari, 1970).This necessitates the solution of a set of linear equations in each iteration step. In view of the high number of unknowns, only the 419,474 nonzero elements of the matrix involved have been stored. The linear equations have been solved by the preconditioned conjugate gradient method (Jacobs, 1980). For the preconditioning, the incomplete shifted Cholesky algorithm has been used (Kershaw, 1978). Starting with a constant value of pr = 2,200, the number of necessary Newton Raphson iterations has been 34 to obtain a residual vector with a squared norm less than lo-* normalized with the squared norm of the right-hand side vector. The number of conjugate gradient iterations within a Newton Raphson step was typically 45-55. The computation has been carried out by a Fortran 77 code implemented on a VaxStation 3200. The overall CPU time necessary was 24,000 seconds. The distribution of the flux density vector is illustrated by means of arrows in Fig. 13. The level of the saturation of iron can be seen in Fig. 14 where the value of the relative permeability p r on the iron surface is shown by means of shades.
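The inner linear solver described above can be sketched in a few lines. The following code is a minimal preconditioned conjugate gradient iteration with the same kind of stopping test (squared residual norm relative to the squared norm of the right-hand side). Two points are assumptions made for this illustration and do not reproduce the authors' Fortran code: a simple diagonal (Jacobi) preconditioner is swapped in for the incomplete shifted Cholesky preconditioner, and the default tolerance `tol` is a hypothetical value. A dense array stands in for the sparse storage of the nonzero matrix elements.

```python
import numpy as np

def pcg(A, b, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradients for a symmetric positive definite A.

    Jacobi (diagonal) preconditioning is used here as a simple stand-in for
    an incomplete Cholesky preconditioner. Iteration stops when
    |r|^2 <= tol * |b|^2. Returns the solution and the iteration count.
    """
    x = np.zeros_like(b, dtype=float)
    m_inv = 1.0 / np.diag(A)           # inverse of the diagonal preconditioner
    r = b - A @ x
    z = m_inv * r
    p = z.copy()
    rz = r @ z
    bb = b @ b
    for it in range(max_iter):
        if r @ r <= tol * bb:
            return x, it
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p      # new search direction
        rz = rz_new
    return x, max_iter
```

In a Newton-Raphson loop, a routine of this kind would be called once per iteration step with the current Jacobian matrix and residual vector; only matrix-vector products with `A` are needed, which is what makes storing just the nonzero elements sufficient.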


FIG. 13. Distribution of the flux density in the iron core (scale arrow: 1.685 T).

FIG. 14. The value of the relative permeability on the surface of the iron core (shading levels: 16.45, 573.0, 1130, 1686, 2243).


FIG. 15. A plate beneath a coil. Aluminum: $\sigma = 3.526 \times 10^{7}$ S/m, $\mu = \mu_0$; air: $\sigma = 0$, $\mu = \mu_0$.


C. Analysis of a Plate Beneath a Coil

A three-dimensional, time-harmonic eddy current problem is constituted by the arrangement analyzed in this subsection. This is a benchmark problem in the series of the TEAM Workshops (Nakata and Fujiwara, 1988). A relatively thick, rectangular aluminum plate has an eccentric hole in it. A racetrack-shaped coil carrying a sinusoidal current is placed over the plate and induces eddy currents. Two frequencies, 50 Hz and 200 Hz, are considered. The geometrical dimensions and the material properties are given in Fig. 15. The problem involves a multiply connected conductor, so the use of a magnetic scalar potential in the entire nonconducting region is not feasible. Therefore, a magnetic vector potential $\mathbf{A}$ is introduced in the hole of the conductor, and a reduced magnetic scalar potential $\Phi$ is employed everywhere else outside the conductor to describe the static magnetic field. Two formulations are used in the eddy current carrying plate: one involving a magnetic vector potential $\mathbf{A}$ and an electric scalar potential $V$, and one in terms of a current vector potential $\mathbf{T}$ and a reduced magnetic scalar potential $\Phi$. The first option corresponds to the A,V-A-$\Phi$ approach in Section VI.C.2 and the second to the T,$\Phi$-A-$\Phi$ formulation of Section III.C.4. Owing to the presence of the coil, the reduced scalar potential $\Phi$ is necessary instead of the total scalar potential used in these discussions. However, the equations are completely analogous; the only difference is in the forcing terms. The source field due to the coil is again computed by means of the Biot-Savart law. The effect of the straight parts of the racetrack-shaped coil can be calculated analytically, and for the curved sections techniques similar to those for a cylindrical coil can be used (Urankar, 1982). The problem has no plane of symmetry.
This has the effect that the only boundary condition governing the electric scalar potential in the A,V-A-$\Phi$ approach is that written in Eq. (186), which is of Neumann type. This makes the performance of this formulation substantially inferior to that of the T,$\Phi$-A-$\Phi$ method, as will be seen in Table I.

TABLE I
PERFORMANCE OF TWO FORMULATIONS IN THE PROBLEM OF A PLATE BENEATH A COIL

Formulation        Frequency   Degrees of   Number of    CPU time
                   (Hz)        freedom      iterations   (seconds)
A,V-A-$\Phi$       50          13124        269          18500
A,V-A-$\Phi$       200         13124        853          51600
T,$\Phi$-A-$\Phi$  50          11657        94           7300
T,$\Phi$-A-$\Phi$  200         11657        108          7900
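The analytical treatment of the straight coil sections mentioned above can be illustrated with the closed-form Biot-Savart field of a finite straight current filament. The sketch below is a simplification under stated assumptions: the conductor is idealized as an infinitely thin filament (the actual coil has a finite cross section, for which Urankar, 1982, gives the relevant expressions), and the function names and the square test loop are hypothetical.

```python
import numpy as np

MU0 = 4e-7 * np.pi  # permeability of free space in H/m

def segment_field(r, r1, r2, current):
    """Flux density (T) at point r due to a straight filament carrying
    `current` (A) from r1 to r2. Closed-form Biot-Savart result; it is
    singular for field points lying on the segment itself."""
    a = np.asarray(r1, dtype=float) - np.asarray(r, dtype=float)
    b = np.asarray(r2, dtype=float) - np.asarray(r, dtype=float)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cross = np.cross(a, b)
    denom = na * nb * (na * nb + a @ b)
    return MU0 * current / (4.0 * np.pi) * cross * (na + nb) / denom

def polygon_field(r, vertices, current):
    """Superpose the fields of the straight edges of a closed polygon."""
    verts = [np.asarray(v, dtype=float) for v in vertices]
    total = np.zeros(3)
    for start, end in zip(verts, verts[1:] + verts[:1]):
        total += segment_field(r, start, end, current)
    return total

if __name__ == "__main__":
    h = 0.5  # half side length of a hypothetical square loop (m)
    loop = [(-h, -h, 0.0), (h, -h, 0.0), (h, h, 0.0), (-h, h, 0.0)]
    print(polygon_field((0.0, 0.0, 0.0), loop, current=1.0))
```

For a counterclockwise loop in the x-y plane, the field at the center points in the positive z-direction, which gives a quick sanity check on the sign convention.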

FIG. 16. Finite element mesh of the plate beneath a coil. (a) Mesh in the x-y plane. (b) Mesh in the x-z plane. (c) Detail of the mesh showing the plate and the coil.

The finite element mesh used is shown in Fig. 16. As in the previous magnetostatic problem, the infinite space surrounding the plate and the coil has been truncated. The mesh consists of 2,560 elements and 11,573 nodes. The set of linear algebraic equations with complex coefficients has again been solved by preconditioned conjugate gradient techniques (Jacobs, 1980; Kershaw, 1978). The iterative process has been terminated once the squared norm of the residual vector normalized with respect to the right-hand side vector became less than a prescribed tolerance. The number of degrees of freedom, the number of necessary iterations and the overall CPU time on a VaxStation 3200 are shown for the two formulations at the two frequencies in Table I. The superior performance of the T,$\Phi$-A-$\Phi$ method is apparent. The results yielded by the two formulations are practically identical. Measurement results are available for the z-component of the flux density along the lines 1 and 2 shown in Fig. 15. The calculated and measured results are compared in Fig. 17. The distribution of the eddy current density in the plate is illustrated in Fig. 18 with the aid of arrows.

FIG. 17. Comparison of computed and measured values of the z-component of the flux density along lines 1 and 2. (a) f = 50 Hz. (b) f = 200 Hz.

FIG. 18. Distribution of the real part of the current density. (a) f = 50 Hz. (b) f = 200 Hz.

D. Analysis of Transient Eddy Currents in a Conducting Brick

The problem analyzed in this subsection is one of transient eddy currents. This is also a benchmark problem of the TEAM Workshops (Kameari, 1988). An aluminum brick with a hole is placed in a homogeneous, exponentially decaying magnetic field. The applied field is perpendicular to the faces of the hole. The geometry and the materials are defined in Fig. 19. As in the previous time-harmonic problem, the eddy current carrying conductor is again multiply connected, so the same formulations can be used. The magnetostatic region with a magnetic vector potential $\mathbf{A}$ is the hole and, since no coil is present, a total magnetic scalar potential $\psi$ can be used in the rest of the nonconducting region. In the conducting region, both the A,V and the T,$\psi$ formulations have been tried. The homogeneous field can be modeled by boundary conditions on the magnetic scalar potential. The problem region is shown in Fig. 20. The planes x-y, x-z and y-z are all symmetry planes, so only one eighth of the entire domain has to be taken into account in the computations. In the x-y plane, the tangential component of the magnetic field intensity is zero, so the magnetic scalar potential is chosen to be zero here, a boundary condition corresponding to Eqs. (181) and (211). The boundary conditions for the magnetic vector potential are those in Eq. (175) and in Eq. (177). Finally, the current vector potential satisfies the boundary conditions (210) and (211a). In the x-z and y-z planes, the normal

FIG. 19. Conducting brick in homogeneous field. $B_0 = 0.1$ T, $\tau = 11.9$ ms; aluminum: $\sigma = 2.538 \times 10^{7}$ S/m, $\mu = \mu_0$.

FIG. 20. Problem region and boundary conditions of the conducting brick in homogeneous field.

component of the flux density and the tangential component of the electric field intensity are zero, so the relevant boundary conditions are those written in Eqs. (172), (173) (with $V_0 = 0$), (174), (179) (with a = 0), (180), (207), (208), and (209). On the far boundary parallel to the x-y plane the magnetic scalar potential assumes a nonzero value equal to the magnetic voltage necessary to maintain the impressed field, and it satisfies a homogeneous Neumann boundary condition on the two other planes of the far boundary. The finite element mesh of the problem region is shown in Fig. 21. It consists of 1,001 elements and 5,000 nodes. The time integration of the set of ordinary differential equations has been carried out using an implicit scheme (Zienkiewicz, 1977). The time step used has been 1 ms, and 20 time steps have been taken. The set of linear equations has been solved at each time step by preconditioned conjugate gradient methods. The number of degrees of freedom and the CPU time necessary for one time step are shown for both formulations in Table II. Since the electric

FIG. 21. Finite element mesh of the conducting brick in homogeneous field. (a) Mesh in the x-y plane. (b) Detail of the mesh in the x-y plane, showing the brick. (c) Mesh in the x-z plane. (d) Detail of the mesh in the x-z plane, showing the brick.

TABLE II
PERFORMANCE OF TWO FORMULATIONS IN THE TRANSIENT PROBLEM OF A CONDUCTING BRICK

Formulation        Degrees of   CPU time per time step
                   freedom      (seconds)
A,V-A-$\psi$       6772         320
T,$\psi$-A-$\psi$  6162         420

FIG. 22. Total current in the brick as a function of time (seconds). Solid line: present method; dashed line: Kameari (1988).

FIG. 23. Variation of the z-component of the flux density along the z-axis (z in m) at different time instants. Solid line: present method; dashed line: Kameari (1988).

scalar potential is constrained by Dirichlet boundary conditions on two symmetry planes, the performance of the A,V-A-$\psi$ method is even slightly superior to that of the T,$\psi$-A-$\psi$ formulation. The two formulations have again yielded practically identical results. The time variation of the total circulating current is plotted in Fig. 22, and the magnetic flux density along the z-axis at different time instants is shown in Fig. 23. For comparison, results published by Kameari (1988) are also indicated. The distribution of the current density in the brick at the moment the total current reaches its maximum is shown by arrows in Fig. 24.

E. Analysis of Anisotropic Waveguides

In applying the Galerkin formulations presented in Section V.C.3 to waveguides, it must be taken into account that the field quantities describe waves propagating in the z-direction perpendicular to the cross section of the waveguide. This means that the potentials vary in a special way in the z-direction and they can be written as

$$\mathbf{A}(x, y, z) = \mathbf{A}(x, y)\, e^{-j\beta z}, \tag{390}$$

$$V(x, y, z) = V(x, y)\, e^{-j\beta z}, \tag{391}$$

FIG. 24. Distribution of the current density at t = 11 ms.

where $\beta$ is the so-called propagation coefficient. Similar expressions are valid for the electric vector and magnetic scalar potentials. In computing the dispersion characteristics of the waveguide, the coefficient $\beta$ is used as a parameter, with the corresponding values of the angular frequency given by the solution of the eigenvalue problem. The implication for the finite element realization is that it suffices to discretize the cross section of the waveguide into two-dimensional elements, and the shape functions are assumed to have the form $N_k(x, y, z) = N_k(x, y)\, e^{-j\beta z}$.
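The idea of treating $\beta$ as a parameter and obtaining the wave number from an eigenvalue problem on the cross section can be illustrated with a deliberately simple toy model. The sketch below is not the anisotropic vector formulation of the text: it assumes a homogeneous, isotropic, one-dimensional scalar problem $-u'' + \beta^2 u = k_0^2 u$ on $0 < x < a$ with $u = 0$ on both walls, discretized by finite differences instead of finite elements. All names and the grid size are hypothetical.

```python
import numpy as np

def normalized_wavenumbers(beta_a, n_modes=2, n_points=200):
    """Return k0*a for the lowest modes of the toy scalar guide at a given
    normalized propagation coefficient beta*a. For each beta, the symmetric
    matrix (-d2/dx2 + beta^2) is built and its lowest eigenvalues k0^2
    are computed."""
    a = 1.0                                   # guide width (normalized)
    h = a / (n_points + 1)                    # grid spacing, interior nodes only
    main = np.full(n_points, 2.0 / h**2)
    off = np.full(n_points - 1, -1.0 / h**2)
    lap = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    beta = beta_a / a
    operator = lap + beta**2 * np.eye(n_points)
    k0_squared = np.linalg.eigvalsh(operator)[:n_modes]  # ascending order
    return np.sqrt(k0_squared) * a

if __name__ == "__main__":
    for beta_a in (-1.0, 0.0, 1.0):
        print(beta_a, normalized_wavenumbers(beta_a))
```

In this reciprocal toy model the results for $+\beta$ and $-\beta$ coincide and follow $k_0 a = \sqrt{(m\pi)^2 + (\beta a)^2}$. The ferrite loaded guides analyzed in this subsection are nonreciprocal, so the computed wave numbers differ between the two signs of $\beta$.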

1. A Ferrite Loaded, Rectangular Waveguide

The first waveguide problem to be analyzed is one of a rectangular cross section partially filled with ferrite (Fig. 25). The problem has an analytical solution, a fact that enables the assessment of the precision of the results (Hano, 1988). The material characteristics of the ferrite are

$$[\mu_r] = \begin{pmatrix} 0.875 & 0.0 & -j0.375 \\ 0.0 & 1.0 & 0.0 \\ j0.375 & 0.0 & 0.875 \end{pmatrix}, \tag{392}$$

$$[\varepsilon_r] = \begin{pmatrix} 10.0 & 0.0 & 0.0 \\ 0.0 & 10.0 & 0.0 \\ 0.0 & 0.0 & 10.0 \end{pmatrix}, \tag{393}$$

i.e., it is magnetically anisotropic, but electrically isotropic. Both formulations described in Sections IV.B.1 and 2 have been used. The boundaries of the waveguide are electric walls; i.e., the boundary conditions for the magnetic vector potential and the electric scalar potential

FIG. 25. A rectangular, ferrite loaded waveguide.

FIG. 26. Finite element mesh of the rectangular, ferrite loaded waveguide.

are those in Eqs. (239), (240), and (246), while the electric vector potential and the magnetic scalar potential satisfy the boundary conditions (250), (255), and (256). For Galerkin's method, these are analogous to the conditions (241), (242), and (245) for A and V on magnetic walls. The finite element mesh used is shown in Fig. 26. It consists of 40 elements with 149 global nodes. The generalized eigenvalue problem has been solved by the bisection method (Waldvogel, 1982), with the sparsity of the matrices involved fully utilized. The dispersion characteristics of the waveguide have been computed for the two modes with the lowest wave numbers. They are shown in Fig. 27. The analytical curves cannot be distinguished from the computed ones. Also, there is no difference in the curves obtained by the two different formulations. The normalized wave numbers at several values of $\beta$ are shown in Table III. The results obtained by both formulations are shown and they are seen to agree excellently. The number of degrees of freedom is also indicated. The lines of constant values of the z-component of the Poynting vector for these modes are plotted in Fig. 28. They illustrate the distribution of the power propagating in the waveguide.

2. A Ferrite Loaded, Elliptical Waveguide

A second waveguide problem with both magnetic and electric anisotropy present is one of a waveguide with an elliptical cross section (Fig. 29). The

FIG. 27. Dispersion characteristics of the rectangular, ferrite loaded waveguide (horizontal axis: $k_0 a$).

TABLE III
NORMALIZED WAVE NUMBERS AT SEVERAL VALUES OF $\beta$. RECTANGULAR, FERRITE LOADED WAVEGUIDE

Formulation:          A,V                     F,$\psi$
Degrees of freedom:   424                     536
$\beta a$             $k_{01}a$   $k_{02}a$   $k_{01}a$   $k_{02}a$
-1                    0.8925      1.483       0.8925      1.485
 0                    0.8095      1.438       0.8095      1.440
 1                    0.9700      1.492       0.9700      1.494

FIG. 28. Lines of constant z-component of the Poynting vector in the rectangular, ferrite loaded waveguide. (a) First mode. (b) Second mode.

FIG. 29. An elliptic, ferrite loaded waveguide.

material characteristics assumed are 0.875 j0.375

0.0

-j0.375), 1.0 0.0 0.0 0.875

1.0 0.0 0.0

[ErJ

r 0 - 0 0.0 0.0) = 0.0 10.0 0.0 , (394) 0.0 0.0 10.0

2.25 0.0

0.0

The boundaries are again electric walls, so the boundary conditions are the same as in the previous example. Since the boundaries are curved, the normal and tangential components of the vector potentials do not coincide with the x, y or z-components. It is necessary to use the transformations presented in Section VI.A. The finite element mesh shown in Fig. 30 consists of 48 elements and 161 nodes. The results obtained by the two formulations are once again in excellent agreement. The dispersion characteristics for the two modes with the lowest wave numbers are plotted in Fig. 31. The normalized wave numbers at some values of $\beta$ are given in Table IV, with the number of degrees of freedom also shown. The propagation of power is illustrated by Fig. 32, showing the lines of constant z-component of the Poynting vector for the three lowest modes.

FIG. 30. Finite element mesh of the elliptic, ferrite loaded waveguide.

FIG. 31. Dispersion characteristics of the elliptic, ferrite loaded waveguide (horizontal axis: $k_0 a$; curves: first mode, second mode).

TABLE IV
NORMALIZED WAVE NUMBERS AT SEVERAL VALUES OF $\beta$. ELLIPTICAL, FERRITE LOADED WAVEGUIDE

Formulation:          A,V                     F,$\psi$
Degrees of freedom:   548                     612
$\beta a$             $k_{01}a$   $k_{02}a$   $k_{01}a$   $k_{02}a$
-6                    2.3585      2.3959      2.3603      2.3962
 0                    0.8057      1.2753      0.8051      1.2757
 6                    2.2234      2.5316      2.2242      2.5388

FIG. 32. Lines of constant z-component of the Poynting vector in the elliptic, ferrite loaded waveguide. (a) First mode. (b) Second mode. (c) Third mode.

F. Analysis of a Dielectric Loaded Cavity

The problem to be analyzed in this subsection is a rectangular cavity partially filled with an isotropic material of high permittivity. The geometrical and material data are shown in Fig. 33. The cavity is bounded by perfectly conducting material, i.e., by electric walls. The lowest admissible wave numbers are to be calculated. The problem has two planes of symmetry, the x-z and y-z planes. The lowest order modes are obtained if these planes are treated as magnetic walls. The problem region so obtained is drawn in Fig. 34. Three different finite element meshes of increasingly fine discretization have been used to study the convergence of the wave numbers. The coarsest one of 64 elements with 425 nodes is shown in Fig. 35. The finer meshes have been generated by increasing the number of subdivisions in each of the three coordinate directions. The second mesh has 216 elements with 1,225 nodes and the third one 512 elements with 2,673 nodes. The finite elements have been chosen to be smaller in the vicinity of the interfaces between the regions with different permittivities, since a more pronounced variation of the field is

FIG. 33. Dielectric loaded cavity. a = 1.

expected here. The problem has been solved by both the A,V and the F,$\psi$ formulations. The value of the lowest wave number obtained by means of each mesh and both formulations is summarized in Table V, where the number of degrees of freedom and the CPU time are also indicated in each case. The convergence of the result is clearly observable. It is interesting to note that

FIG. 35. Finite element mesh of the dielectric loaded cavity.

TABLE V
THE LOWEST WAVE NUMBER OF A DIELECTRIC LOADED CAVITY. DIFFERENT FORMULATIONS AND DISCRETIZATIONS

               A,V                                    F,$\psi$
Number of      Degrees of   $k_{01}a$   CPU time      Degrees of   $k_{01}a$   CPU time
elements       freedom                  (seconds)     freedom                  (seconds)
64             928          5.388       1555          1120         5.461       2196
216            3240         5.399       21591         3612         5.430       34230
512            7808         5.403       243275        8576         5.421       286375

TABLE VI
THE FIRST SIX WAVE NUMBERS OF A DIELECTRIC LOADED CAVITY

Formulation    A,V       F,$\psi$
$k_{01}a$      5.399     5.430
$k_{02}a$      7.815     7.860
$k_{03}a$      9.917     9.986
$k_{04}a$      10.636    10.664
$k_{05}a$      11.602    11.623
$k_{06}a$      12.410    12.442

the different formulations yield upper or lower bounds for the wave number. The normalized values of the first six wave numbers are given in Table VI, computed with the second mesh. The results obtained by the two formulations agree to within a few percent, indicating a sufficient degree of accuracy.
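The agreement between the two formulations can be quantified directly from the tabulated values. The short sketch below takes the six normalized wave numbers of Table VI and, for each mode, computes the midpoint of the two results and the gap between them as a percentage of that midpoint. The function name and the choice of the midpoint as reference are assumptions made for illustration only.

```python
# Normalized wave numbers k0*a from Table VI (second mesh, 216 elements).
AV_VALUES = [5.399, 7.815, 9.917, 10.636, 11.602, 12.410]     # A,V formulation
FPSI_VALUES = [5.430, 7.860, 9.986, 10.664, 11.623, 12.442]   # F,psi formulation

def bracket(lower, upper):
    """Return (midpoint, gap in percent of the midpoint) for one mode."""
    midpoint = 0.5 * (lower + upper)
    gap_percent = 100.0 * (upper - lower) / midpoint
    return midpoint, gap_percent

if __name__ == "__main__":
    for mode, (lo, hi) in enumerate(zip(AV_VALUES, FPSI_VALUES), start=1):
        mid, gap = bracket(lo, hi)
        print(f"mode {mode}: midpoint = {mid:.3f}, gap = {gap:.2f} %")
```

When the two formulations bound the exact value from opposite sides, as the text suggests, the gap also serves as a practical error estimate for the midpoint.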

ACKNOWLEDGMENTS

The author is grateful to his colleagues Istvan Bardi, Kurt Preis, Christian Magele, Werner Renhart and Wolfgang Rucker for the many fruitful discussions and for the help in preparing the problem solutions.

REFERENCES

Bardi, I., and Biro, O. (1989). Proc. COMPUMAG, 7th, September 3-7, 1989, Tokyo, Japan, p. 91; also in IEEE Trans. on Mag. 26, 450.
Biddlecombe, C. S., Heighway, E. A., Simkin, J., and Trowbridge, C. W. (1982). IEEE Trans. on Mag. 18, 492.
Biro, O., and Preis, K. (1989a). IEEE Trans. on Mag. 25, 3145.
Biro, O., and Preis, K. (1989b). Proc. COMPUMAG, 7th, September 3-7, 1989, Tokyo, Japan, p. 1; also in IEEE Trans. on Mag. 26, 418.
Biro, O., and Preis, K. (1989c). Proc. 3DMAG, September 11-13, 1989, Okayama, Japan, p. 20; also in COMPEL 9, Supplement A, 45.
Biro, O., Preis, K., Renhart, W., and Richter, K. (1988). Proc. Beijing Int. Symp. Electromagnetic Field El. Engng, October 19-21, 1988, Beijing, China, p. 488.
Carpenter, C. J. (1977). IEE Proc. 124, 1026.
Carpenter, C. J., and Locke, D. H. (1976). Proc. COMPUMAG, 1st, Oxford, 31 March to 2 April 1976, p. 47.
Chari, M. V. K. (1970). "Finite Element Analysis of Nonlinear Magnetic Fields in Electrical Machines." Ph.D. Dissertation, McGill University, Montreal, Canada.
Chari, M. V. K., Konrad, A., Palmo, M. A., and D'Angelo, J. D. (1982). IEEE Trans. on Mag. 18, 436.
Davies, J. B., Fernandez, F. A., and Philippou, G. Y. (1982). IEEE Trans. on Microwave Theory Tech. 30, 1975.
Emson, C. R. I., Simkin, J., and Trowbridge, C. W. (1983). IEEE Trans. on Mag. 19, 2231.
Galerkin, B. G. (1915). Vestn. Inzh. Tech. 19, 897.
Hano, M. (1988). Electronics and Communications in Japan, Part 2, 71, 71.
Jacobs, D. A. H. (1980). "Preconditioned Conjugate Gradient Methods for Solving Systems of Algebraic Equations." Central Electricity Research Lab., Lab. Note RD/L/N 193/80.
Kameari, A. (1988). COMPEL 7, 65.
Kershaw, D. S. (1978). J. Comput. Phys. 26, 43.
Koshiba, M., Hayata, K., and Suzuki, M. (1985). IEEE Trans. on Microwave Theory Tech. 33, 227.
Leonard, P. J., and Rodger, D. (1988). IEEE Trans. on Mag. 24, 90.
Maxwell, J. C. (1864). Roy. Soc. Trans. 155, 459.
Mikhlin, S. G. (1964). "Variational Methods in Mathematical Physics." Pergamon, Oxford.
Mitchell, A. R., and Griffiths, D. F. (1980). "The Finite Difference Method in Partial Differential Equations." John Wiley & Sons, New York.
Nakata, T., and Fujiwara, K. (1988). Proc. Vancouver TEAM Workshop at the University of British Columbia, 18 and 19 July 1988, p. 26.
Pillsbury, R. D. (1983). IEEE Trans. on Mag. 19, 2284.
Rodger, D., and Eastham, J. F. (1983). IEEE Trans. on Mag. 19, 2443.
Rodger, D., and Eastham, J. F. (1987). IEE Proc. 134, Pt. A, 58.
Simkin, J., and Trowbridge, C. W. (1979). Int. J. Num. Meth. Engng. 14, 423.
Urankar, L. K. (1982). IEEE Trans. on Mag. 18, 1860.
Waldvogel, P. (1982). Computing 28, 329.
Webb, J. P. (1985). IEEE Trans. on Microwave Theory Tech. 33, 65.
Zienkiewicz, O. C. (1977). "The Finite Element Method." McGraw-Hill, London.


Speech Coding

VLADIMIR CUPERMAN
Communication Sciences Laboratory, School of Engineering, Simon Fraser University, Burnaby, British Columbia, Canada

I. Introduction
   A. Speech Coding: What Is the Problem?
   B. Paper Outline
   C. Performance Criteria
   D. Speech Production Model
   E. Speech Signal Characterization
II. Signal Processing in Speech Coding
   A. Scalar Quantization
   B. Linear Prediction in Speech Coding
   C. Vector Quantization
   D. Quantization of the LPC Coefficients
III. Speech Coding Systems
   A. Analysis-Synthesis Speech Coding (Vocoders)
   B. Predictive Speech Coding
   C. Frequency Domain Speech Coding
   D. Analysis-by-Synthesis Speech Coding
   E. Tree and Trellis Coding
IV. Future Research Directions
Acknowledgments
References

I. INTRODUCTION

A. Speech Coding: What Is the Problem?

The objective of a speech coding system is to reduce the bandwidth required to transmit or store the speech signal in digital form. Despite the increased bandwidth provided by microwave and optical communications systems, the need to conserve bandwidth remains important. For example, low-rate speech coding is essential for multiple-user sharing of channels with bandwidth or power limitations, such as cellular radio and satellite links. By using low-rate speech coding combined with time division multiplexing


(TDM), the new North American digital cellular system (to be introduced in 1991-1992) will increase the number of potential users of mobile communications threefold over the existing analog system. The digital cellular system will employ an aggregate rate of 16 kb/s per user, with 8 kb/s allocated for speech coding and 8 kb/s allocated for error protection and control protocols. Nevertheless, because of the exponential increase in the number of mobile communications users, a new half-rate digital cellular system having an aggregate rate of 8 kb/s is already being considered for future deployment. Low-rate digital speech is also important in a variety of other applications such as voice mail and voice response systems, integrated multimedia terminals (speech, video, graphics), and secure communication systems. The simplest way to digitize speech is sampling followed by quantization. Sampling transforms the time-continuous analog waveform into a discrete sequence of samples; for telephone-quality speech, a sampling rate of 8 kHz is required. Quantization transforms each continuous-valued sample into a binary number; when using logarithmic quantization (described later), 8 bits are required for a good reproduction of the speech signal amplitude. Pulse code modulation (PCM) combines sampling with logarithmic quantization to produce digital speech at 64 kb/s (Oliver et al., 1948; Jayant and Noll, 1984). PCM at 64 kb/s is still the most frequently used digital speech system. However, PCM is wasteful from the channel bandwidth point of view. Telephone speech has a bandwidth of less than 4 kHz and is digitized by PCM at 64 kb/s; and yet, with today's technology, there is no way to transmit 64 kb/s on 4 kHz of channel bandwidth given the noise and distortion levels encountered on the telephone network.
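The arithmetic behind the 64 kb/s figure, and the benefit of logarithmic over uniform quantization at 8 bits, can be sketched as follows. The code is an illustration under stated assumptions, not the standardized codec: it uses the continuous $\mu$-law characteristic with $\mu = 255$ (an assumed value; practical PCM codecs use a piecewise-linear approximation of this curve), and all function names are hypothetical.

```python
import numpy as np

SAMPLING_RATE_HZ = 8000        # telephone-quality sampling rate from the text
BITS_PER_SAMPLE = 8            # bits per sample with logarithmic quantization
PCM_BIT_RATE = SAMPLING_RATE_HZ * BITS_PER_SAMPLE   # bits per second

MU = 255.0                     # assumed companding parameter (mu-law)

def compress(x):
    """Logarithmic (mu-law) compression of samples in [-1, 1]."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def expand(y):
    """Inverse of compress()."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def quantize_uniform(v, bits=BITS_PER_SAMPLE):
    """Uniform mid-rise quantizer on [-1, 1) with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = np.clip(np.floor((v + 1.0) / step), 0, levels - 1)
    return (index + 0.5) * step - 1.0

def log_pcm(x):
    """Compress, quantize uniformly, and expand again."""
    return expand(quantize_uniform(compress(x)))

def snr_db(x, y):
    """Signal-to-noise ratio in dB between a signal and its reconstruction."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - y) ** 2))

if __name__ == "__main__":
    n = np.arange(SAMPLING_RATE_HZ)
    quiet = 0.01 * np.sin(2.0 * np.pi * 440.0 * n / SAMPLING_RATE_HZ)
    print("PCM bit rate (b/s):", PCM_BIT_RATE)
    print("uniform 8-bit SNR (dB):", snr_db(quiet, quantize_uniform(quiet)))
    print("log 8-bit SNR (dB):", snr_db(quiet, log_pcm(quiet)))
```

The low-level test tone is chosen because the advantage of the logarithmic characteristic shows most clearly for weak signals, where a uniform quantizer spends only a few of its levels.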
Adaptive differential PCM (ADPCM) combines PCM and data compression based on linear prediction (CCITT, 1984; Jayant and Noll, 1984). The result is digital speech at 32 kb/s with reconstructed speech quality comparable to 64 kb/s PCM. A reasonable question is: How much further can one reduce the rate, and at what penalty in speech reproduction fidelity (quality)? A speech coding system has some other important characteristics such as communications delay, computational complexity, tandeming performance, and channel error robustness, but initially the discussion will be restricted to the two-dimensional space of rate and quality. The speech quality required in commercial telephony is called toll quality. Speech coding systems that offer toll quality at 16 kb/s and close to toll quality at 8 kb/s have been demonstrated. The basic principles of these systems are discussed below. Is 8 kb/s the lowest rate at which toll quality can be achieved? Researchers currently strive toward toll quality at 4 kb/s; this is an extremely challenging task and the potential result of this work is unknown at this time. Some researchers believe that toll quality is achievable even at 2 kb/s. This raises the following question: What does information theory predict as the lower rate limit for speech coding?


The information rate of the phonetic transcription of speech is less than 50 bits/s (Flanagan, 1972). Of course, the speech signal contains information on the speaker's voice which is missing in a phonetic transcription. On the other hand, experiments on human communication capacity show that the information processing rates are lower than 50 bits/s. This suggests that the listener may discard a large part of the information transmitted by modern speech communications systems. In rate-distortion theory, sources are characterized by a rate-distortion function that provides the minimal achievable rate for a given distortion or fidelity of reproduction. The phonetic transcription approach is based on the written equivalent fidelity criterion and disregards such features as the speaker's identity, stress, and inflection. For deriving a rate-distortion function, sources are assumed to be stationary and to have a given probability density function (pdf). Speech is a nonstationary process, and its rate-distortion function for a general fidelity criterion is an open problem. Different stationary models have been considered for the speech signal and its probability distribution; however, it is doubtful that the corresponding rate-distortion functions correctly predict the ultimate limits achievable in speech coding. Moreover, rate-distortion theory allows for infinite delay, which is not acceptable in speech coding applications. Perceptually based criteria that are important in low-rate speech coding are difficult to fit into the mathematical framework of rate-distortion theory. Another important factor related to the above discussion is the evolution of digital hardware. The rate-distortion function is an information theoretic limit that assumes unbounded implementation complexity. However, when speaking about the ultimate limits in speech coding, some practical limits on the implementation complexity are assumed.
During recent years, the throughput of specialized digital signal processing (DSP) chips has increased at an exponential rate. The algorithms used these days to produce toll quality speech at 16 kb/s and close to toll quality at 8 kb/s were not conceivable 5-10 years ago; until recently, most researchers would have rejected the basic algorithms as infeasible. The future development of digital hardware may spur the development of powerful new algorithms. In short, the ultimate limits in speech coding are still an open problem.

B. Paper Outline

The main topic of this paper is a new class of speech coding systems that has emerged in the past five years. These systems are based on an approach referred to as analysis-by-synthesis (A-by-S) speech coding. Essentially, an A-by-S speech coding system uses a speech production model whose parameters are determined in closed loop by comparing the synthesized waveform to the original signal. Two new speech coding systems recently proposed for


standardization, the vector sum excited linear prediction (VSELP) and the low delay code excited linear prediction (LD-CELP), are two examples of A-by-S systems. Section I.C discusses performance criteria used to evaluate speech coding systems. Section I.D includes a brief presentation of a speech production model that has had a significant impact on the evolution of speech coding systems. Speech signal characterization from the signal processing point of view is presented in Section I.E. Section II is dedicated to the signal processing techniques used in speech coding in general, and in analysis-by-synthesis speech coding in particular. Some new signal processing techniques used in A-by-S systems are presented near the end of the section. Sections III.A and III.B briefly discuss the analysis-synthesis and the predictive speech coding systems. These systems historically precede the analysis-by-synthesis systems, with many of the techniques used in A-by-S systems having been first developed for analysis-synthesis or for predictive speech coding. The author believes their presentation is required for a basic understanding of the analysis-by-synthesis systems that are discussed in Section III.D. For completeness, Section III.C contains a summary of frequency domain speech coding, while Section III.E contains a brief discussion of tree and trellis speech coding (mostly indicating references related to these systems). Finally, Section IV briefly presents potential future research directions in analysis-by-synthesis speech coding.

C. Performance Criteria

Traditionally, speech coding systems have been evaluated in a three-dimensional space using as criteria the transmission rate, the fidelity of reproduction, and the implementation complexity. Recently, communications delay has become an important criterion for speech encoders used in the public switched telephone network (PSTN). One of the most difficult problems in speech coding is the choice of an objective fidelity criterion that will correctly reflect the subjective human perception of speech quality. The simplest and probably most used objective quality criterion is the signal-to-noise ratio (SNR). Denoting by x(n) the input sampled speech signal, by y(n) the corresponding reconstructed signal at the receiver, and by r(n) = x(n) - y(n) the reconstruction error, the SNR is defined by

$$\mathrm{SNR} = 10 \log_{10} \frac{\sigma_x^2}{\sigma_r^2},$$

where $\sigma_x^2$ and $\sigma_r^2$ are the variances of x(n) and r(n), respectively. If one wants to
avoid the implicit assumption of stationarity in this definition, then the SNR can be defined by time averaging over a segment of speech:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=0}^{N-1} x^2(n)}{\sum_{n=0}^{N-1} r^2(n)},$$

where N is the length of the segment. A better assessment of speech quality can be obtained by using the segmental signal-to-noise ratio (SEGSNR). This criterion compensates for the low weight given to the low-level signal performance in SNR evaluation. The SEGSNR is calculated by computing the SNR in dB units for each block (frame) of, say, 256 samples, eliminating silence frames, and taking the arithmetic average of these SNR values over the entire speech file. A frame is considered to be silence if its average signal power is at least 40 dB below the average power level of the entire speech file. Unfortunately, SNR and SEGSNR do not reliably predict the subjective speech quality; this particularly applies to rates equal to or lower than 16 kb/s.

An alternative approach is subjective quality evaluation by formal tests with human listeners. The mean opinion score (MOS) is obtained by averaging the scores given by a panel of untrained listeners. Each listener characterizes the speech signal by a score on a scale of 1 (for poor quality) to 5 (for excellent quality). Typically, the averaging is done over 30-60 listeners. MOS results may vary for the same system as a function of the source material and equipment by as much as 0.5 on a scale of 5. However, MOS scores are surprisingly reproducible when brought to a common reference. Differences with respect to a given reference system as small as 0.1 on a scale of 5 were found to be significant and reproducible. Toll quality is characterized by MOS scores better than 4.0. Typically, the 64 kb/s PCM achieves a MOS of 4.25, and the CCITT ADPCM standard at 32 kb/s achieves MOS scores close to 4.0.

For low-rate speech coding systems (4,800 bits/s and below), the quality is evaluated by the diagnostic rhyme test (DRT) and the diagnostic acceptability measure (DAM). The DRT is an intelligibility test designed to evaluate the apprehensibility of consonants distinguished by basic features such as voicing, nasality, and graveness (Voiers, 1977a). Each item in the test involves two rhyming words, the initial consonants of which differ by a single distinctive feature; for example, meat and beat, or veal and feel. The DAM is a quality evaluation based on the acceptability of the speech signal as perceived by a normative listener. The quality is evaluated for the perceived background sound and the perceived signal sound (Voiers, 1977b). Although, in principle, the MOS and the DAM may be related, there are few comparative data available for analyzing such a relationship. Telephone speech scores about 92-93% on the DRT and about 65 on the DAM test.
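The SNR and SEGSNR definitions given earlier in this section can be illustrated by a minimal Python sketch. The function names `snr_db` and `segsnr_db`, the choice to treat a frame as silence when its power is at least 40 dB below the file average, and the skipping of error-free frames are assumptions made here for illustration; they do not describe a particular standardized evaluation procedure.

```python
import math


def snr_db(x, y):
    """Time-averaged SNR in dB between input x and reconstruction y."""
    signal = sum(v * v for v in x)
    noise = sum((a - b) ** 2 for a, b in zip(x, y))
    return 10.0 * math.log10(signal / noise)


def segsnr_db(x, y, frame_len=256, silence_db=40.0):
    """Segmental SNR: arithmetic mean of per-frame SNRs (dB), silence excluded."""
    mean_power = sum(v * v for v in x) / len(x)
    # A frame counts as silence if its power is silence_db below the file average.
    threshold = mean_power * 10.0 ** (-silence_db / 10.0)
    frame_snrs = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        xs = x[start:start + frame_len]
        ys = y[start:start + frame_len]
        energy = sum(v * v for v in xs)
        noise = sum((a - b) ** 2 for a, b in zip(xs, ys))
        if energy / frame_len <= threshold or noise == 0.0:
            continue  # silence frame, or error-free frame (SNR undefined)
        frame_snrs.append(10.0 * math.log10(energy / noise))
    return sum(frame_snrs) / len(frame_snrs)
```

For a two-frame signal whose frames have SNRs of 40 dB and 20 dB, `segsnr_db` returns 30 dB, whereas `snr_db` returns about 37 dB because the high-level frame dominates the time average. This is the low weight given to low-level segments that the SEGSNR is meant to compensate.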

TABLE I

PERFORMANCE OF SPEECH CODING ALGORITHMS

System     Rate (kb/s)   MOS    Delay (ms)   Complexity
PCM        64            4.25   0.125        1
ADPCM      32            4.0    0.125        2
LD-CELP    16            3.95   1.5          20
VSELP      8             3.9    55           14

The emerging digital communications network, characterized by the tandeming of many different types of channels, has recently placed emphasis on another important criterion: the CODEC delay. In a complex network, the delays of many encoders add together, transforming the delay into a significant impairment of the system. In some applications the use of echo cancelers should be avoided. In other applications, the values of the total delay may become so large that the delay represents an impairment even in the presence of echo cancelers. For these reasons, the requirements of the new 16 kb/s CCITT standard specify a delay lower than 5 ms (the objective is 2 ms).

Table I shows the representative points of four speech coding systems in a four-dimensional space defined by transmission rate, speech quality, transmission delay, and implementation complexity. PCM at 64 kb/s and ADPCM at 32 kb/s are two well-known techniques subject to existing CCITT standards. LD-CELP at 16 kb/s is the candidate for CCITT standardization at 16 kb/s, while VSELP is the new standard for 8 kb/s speech coding for digital cellular. Note that the degradation in quality from 64 kb/s to 8 kb/s is surprisingly small. The main purpose of this paper is to present the techniques that have led to the new generation of speech coding systems, including LD-CELP and VSELP. The results for the last two systems in Table I are based on tests done during the standardization process and should be considered preliminary. The estimates for the computational complexity in the last column are based on the author’s evaluations and do not necessarily correspond to the number of instructions per second (IPS) necessary for real-time execution, nor to the number of floating point operations (FLOPS).

D. Speech Production Model

Many advances in speech coding are related to the introduction of a simple, mathematically tractable, but still realistic speech production model. This model includes two basic components: an excitation generator and a vocal tract model. The excitation generator models the effect of the air flowing out of the lungs through the vocal cords. The vocal tract model includes the effect of radiation at the lips and is represented by a time-varying filter. It is assumed that the parameters defining the vocal tract model are constant over time intervals of typically 10-30 ms. The simple model described above is shown in Fig. 1.

FIG. 1. A simplified model of speech production. [Block diagram: vocal tract model with output speech signal x(n).]

More detailed models include an excitation generator having two modes; for voiced sounds the excitation is quasi-periodic, while for unvoiced sounds it is random. Further, a glottal pulse model may be used in conjunction with voiced excitation, and the radiation model may be separated from the vocal tract model (Flanagan, 1972; Rabiner and Schafer, 1978; Furui, 1989). In many modern speech coding algorithms, the distinction between voiced and unvoiced excitation has been removed, while the vocal tract model may always be assumed to include the glottal pulse model and the radiation model. For these reasons, the simplicity of the model of Fig. 1 will be preserved in this paper.

The model of Fig. 1 (as well as more detailed versions of this model) has some well-known drawbacks. In vowel sounds, the vocal tract parameters vary slowly, and the assumption of constant parameters for intervals of 10-30 ms works well. However, this simplification does not work well for transient sounds such as stops. In fricative voiced sounds (such as “z” or “th”), the excitation signal is very complex and can hardly be modeled by voiced or unvoiced excitation. In most models, the vocal tract is modeled by an all-pole filter, while for some sounds, such as nasals, zeros are needed. Still, the model is adequate in most cases and has been used successfully in the development of speech coding algorithms.

E. Speech Signal Characterization

Figure 2 shows a typical segment of a speech waveform. From the viewpoint of random process theory, this waveform may be considered as a segment of a realization of a random signal. The last part of the waveform shows a strong periodicity related to the fundamental frequency of the glottal excitation (pitch). In most speech coding algorithms, the signal is processed by a fixed segment extraction technique. The speech signal is divided into segments (frames) of fixed length N starting at an arbitrary point (see Fig. 2). Of course, as one can understand by examining Fig. 2, such a technique is debatable. A fixed segment extraction may introduce discontinuities and may destroy the similarity of successive segments in those regions where the speech waveform is periodic; moreover, a “smearing” of the estimated segment statistics may result if the segment overlaps a signal transition region. In some low-rate speech coding systems (2,400 bits/s and below), a pitch synchronous extraction technique was used successfully. However, most systems use fixed length segment extraction; moreover, the developments that follow can be easily extended to a variable segment extraction environment. Consequently, for the rest of the paper a fixed length extraction technique will be assumed.

FIG. 2. Typical speech waveform. [Horizontal axis: sample number.]

1. Autocorrelation Function

The correlation between samples at distance (time-separation) k can be quantified by the (biased) estimate

$$r_{xx}(k) = \frac{1}{N} \sum_{n=n_0}^{n_0+N-|k|-1} x(n)\, x(n+|k|), \tag{1}$$

where x(n) are the input samples, $n_0$ is the time index of the first sample in the frame, and k = 0, 1, ..., N - 1. A frame index could be used for $r_{xx}$; however, to simplify the notation, such an index is used only in cases where the frame identity is important. Actually, $r_{xx}$ can be considered as an estimate of the true process autocorrelation function based on the data segment $[n_0, n_0 + N - 1]$. Consequently, the notation $\hat{r}_{xx}$ would be more appropriate; however, the notation $r_{xx}$ will be preserved to simplify further the relations in the rest of this paper. Note that estimating the true autocorrelation function by time-averaging implicitly assumes ergodicity of the underlying random process. Speech is,
of course, a nonstationary random signal. However, the so-called local stationarity model can be invoked to justify the use of random process theory for short-time description of speech signals. According to this model, frames of speech may be considered as segments of realizations of ergodic random signals. The statistics for these random signals may then be evaluated from the available data (speech frame) by using known estimation techniques. The local stationarity model may not be mathematically rigorous; nevertheless, it is conceptually simple and leads to correct results. A more rigorous model may be obtained by introducing quasi-stationarity (see Ljung, 1987; Gray, 1990). However, the corresponding mathematical framework is outside the scope of this paper. An alternative approach would be to consider the segment [no, no + N - 11 as a deterministic waveform and to apply the corresponding definition of the autocorrelation function for deterministic signals. For details on this approach see, for example, Rabiner and Schafer (1978) and Fallside and Woods (1985). The only disadvantage of this approach is that when signal quantization is discussed, the same waveform is assumed to be a stationary process. There is a physical interpretation that can be related to the local stationarity idea. The slow movement of the vocal tract allows us to consider the corresponding filter in the speech production model as fixed for short intervals on the order of 10-30ms. For a fixed shape of the vocal tract, assuming the excitation is stationary, the corresponding speech waveform may be considered a realization of a stationary process. It is difficult to justify the fixed segment extraction technique in the general framework of local stationarity. The fixed segment extraction produces frames that overlap fast transitions in the waveform (for example, segment 2 in Fig. 2), corresponding to fast movements of the vocal tract. The result is a smearing of the estimated statistics. 
Unfortunately, this is not only a theoretical problem; one of the basic problems of modern speech coding systems is the way fast transitions in signal characteristics are processed. This may be a major cause for the significant reduction of speech quality at rates below 8 kb/s.

The short-time average of the signal is assumed to be zero in the following development. Hence, the variance estimate can be obtained from

$$\sigma_x^2 = r_{xx}(0) = \frac{1}{N} \sum_{n=n_0}^{n_0+N-1} x^2(n). \tag{2}$$

The normalized autocorrelation function, $\rho_{xx}$, is defined by

$$\rho_{xx}(k) = \frac{r_{xx}(k)}{r_{xx}(0)}. \tag{3}$$

Sometimes $\rho_{xx}(k)$ are called autocorrelation coefficients. Figure 3 shows the autocorrelation coefficients as a function of the lag k for a typical voiced frame. Figure 4 shows the same for a typical unvoiced frame.
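The biased autocorrelation estimate and the autocorrelation coefficients can be computed for a single frame as in the following minimal Python sketch. The function names are hypothetical, and the frame is assumed to be passed as a list whose first element is the sample at the frame start, so that the frame offset does not appear explicitly.

```python
def autocorr(frame, max_lag):
    """Biased estimate r_xx(k) for k = 0..max_lag over one frame of N samples."""
    n = len(frame)
    # For lag k only N - k products are available, but the sum is still
    # divided by N; this is what makes the estimate biased.
    return [sum(frame[i] * frame[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]


def autocorr_coeffs(frame, max_lag):
    """Normalized autocorrelation rho_xx(k) = r_xx(k) / r_xx(0)."""
    r = autocorr(frame, max_lag)
    return [value / r[0] for value in r]
```

For a strongly periodic (voiced-like) frame, `autocorr_coeffs` shows peaks of magnitude close to one at multiples of the period, whereas for a noise-like frame the coefficients decay quickly with the lag.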

FIG. 3. Normalized autocorrelation function for a voiced sound. (a) Time waveform. (b) Normalized autocorrelation function. [Horizontal axis of (b): lag k (samples).]

FIG. 4. Normalized autocorrelation function for an unvoiced sound. [Horizontal axis: lag k (samples).]

During voiced speech, the autocorrelation coefficients may reach significant values, showing that the waveform samples are strongly correlated and a significant gain may be obtained by linear prediction (see Section II.B). The fixed segment extraction is equivalent to a rectangular window of length N applied to the speech waveform starting at sample $n_0$. A better spectral estimate may be obtained by using a smooth window rather than a rectangular one. The corresponding estimate of the autocorrelation function, $r_{wxx}$, is given by

$$r_{wxx}(k) = \frac{1}{N} \sum_{n=n_0}^{n_0+N-|k|-1} x(n)\, w(n)\, x(n+|k|)\, w(n+|k|), \tag{4}$$

where w(n) is the window function. Examples of window functions used in speech coding are the Hamming and the Hanning windows. For details about windowing in speech coding and its effects, see Markel and Gray (1976), Rabiner and Schafer (1978), and Furui (1989).

2. Power Spectral Density

For a stationary process, the power spectral density (PSD) is defined as the Fourier transform of the autocorrelation function. Considering $r_{xx}$ given by (1) as an estimate of the true autocorrelation function of an underlying stationary process, an estimate of the PSD may be obtained by taking the Fourier transform

$$P_{xx}(\omega) = \sum_{k=-(N-1)}^{N-1} r_{xx}(k)\, e^{-j\omega k}. \tag{5}$$

Denoting by $X(e^{j\omega})$ the Fourier transform of the finite sequence x(n), n = 0, 1, ..., N - 1, it is easy to show that

$$P_{xx}(\omega) = \frac{1}{N} \left| X(e^{j\omega}) \right|^2. \tag{6}$$

The last equation is similar to the one used in spectral estimation theory by the so-called periodogram approach. For stationary processes, a better spectral estimate can be obtained by dividing the sequence of length N into a number of non-overlapping sequences of smaller length. However, this is difficult to apply to short-time analysis of speech; a local stationarity interval may last only 20 ms, and dividing this into non-overlapping sequences may lead to insufficient data for each of the sequences. Figures 5 and 6 show the estimated short-time PSDs for typical voiced and unvoiced frames, respectively. The results were obtained by using a 256-point discrete Fourier transform.

FIG. 5. Power spectral density (PSD) for a voiced speech frame. [Horizontal axis: frequency (Hz), 0 to 4000.]

FIG. 6. Power spectral density (PSD) for an unvoiced speech frame. [Horizontal axis: frequency (Hz), 0 to 4000.]

Figure 7 shows the evolution in time of the smoothed PSD estimated for 320 ms of speech containing the transition from
an unvoiced sound to a sustained vowel. The strong correlation between the PSD estimates for consecutive frames is evident during the voiced sound. A different approach for estimating speech PSD is based on linear prediction and the associated autoregressive estimation techniques. This is discussed in Section II.B.

FIG. 7. Evolution in time of the smoothed PSD for a segment of 320 ms of speech.

3. Probability Density Function

Although speech is a nonstationary signal, a probability density function (PDF) may be defined by using amplitude distributions computed by time averaging over speech segments. The short-time PDF evaluated over speech segments of 20 ms may be approximated by a simple Gaussian model. The variance of the Gaussian model changes widely from segment to segment; typically, the dynamic range is in excess of 40 dB. The long-time PDF is evaluated by time averaging over a multi-talker speech data base. A first approximation to the long-time PDF may be provided by the Laplace distribution; a better fit is provided by the gamma distribution (Jayant and Noll, 1984). Gauss, Laplace, and gamma distributions are often used as probabilistic models for analysis of speech processing algorithms.

4. Spectral Flatness

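The spectral flatness measures the redundancy of a random process as expressed by the shape of its PSD. A white noise process with a flat PSD has the spectral flatness measure equal to 1. For a process with the PSD $P_{xx}(\omega)$, the spectral flatness is given by

$$\gamma_{xx}^2 = \frac{\dfrac{1}{2\pi} \int_{-\pi}^{\pi} P_{xx}(\omega)\, d\omega}{\exp\left[ \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \ln P_{xx}(\omega)\, d\omega \right]}, \tag{7}$$

where exp denotes the exponential function. An approximate relation can be obtained by sampling the PSD at uniformly and densely spaced intervals. Then, the spectral flatness can be expressed as the ratio of the arithmetic and geometric averages computed over PSD samples:

$$\gamma_{xx}^2 \approx \frac{\dfrac{1}{N} \sum_{j=1}^{N} P_j}{\left[ \prod_{j=1}^{N} P_j \right]^{1/N}}, \tag{8}$$

where $P_j$, j = 1, 2, ..., N, are the PSD samples.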

The short-time estimate of the spectral flatness for speech signals varies widely in the range 2-500; the long-time estimate is typically on the order of 8 (Flanagan et al., 1979). Large values for the spectral flatness indicate significant potential for data compression. Actually, the rate-distortion function of a Gaussian process can be expressed directly as a function of spectral flatness (Berger, 1971; Jayant and Noll, 1984).
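The sampled form of the spectral flatness can be combined with a periodogram-type PSD estimate (squared magnitude of the Fourier transform of the frame, divided by the frame length), as in the following minimal Python sketch. The function names are hypothetical, a direct evaluation of the Fourier sum is used instead of a fast transform to keep the example self-contained, and all PSD samples are assumed to be strictly positive so that the logarithm exists.

```python
import cmath
import math


def periodogram(frame, n_freq):
    """PSD estimate |X(e^{jw})|^2 / N at n_freq uniformly spaced frequencies."""
    n = len(frame)
    psd = []
    for j in range(n_freq):
        w = 2.0 * math.pi * j / n_freq
        xw = sum(v * cmath.exp(-1j * w * i) for i, v in enumerate(frame))
        psd.append(abs(xw) ** 2 / n)
    return psd


def spectral_flatness(psd):
    """Ratio of the arithmetic to the geometric mean of the PSD samples."""
    arithmetic = sum(psd) / len(psd)
    # Geometric mean computed in the log domain to avoid overflow of the product.
    geometric = math.exp(sum(math.log(p) for p in psd) / len(psd))
    return arithmetic / geometric
```

With this definition a flat PSD gives a value of 1, and any departure from flatness gives a value larger than 1, in line with the interpretation of large values as a large potential for data compression.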

II. SIGNAL PROCESSING IN SPEECH CODING

Speech is an analog waveform, i.e., it is time-continuous and amplitude-continuous. The reader is assumed to be familiar with basic waveform digitization techniques. Time discretization is done by sampling, an operation that is lossless or information-preserving, assuming the conditions of the Nyquist sampling theorem are met. Amplitude discretization is done by quantization, an information-lossy operation. Figure 8 shows a general block diagram for a speech coding system consisting of an encoder and decoder. A device containing both an encoder and decoder is called a speech codec. Generally, the encoder uses data compression techniques (discussed in following sections) and includes a quantizer. The decoder performs all the inverse functions corresponding to the data compression techniques used in the encoder, and it reconstructs an approximation of the original analog waveform. This section includes a brief discussion of the quantization and data compression techniques used in speech coding.

A. Scalar Quantization

A scalar quantizer is a many-to-one mapping of the real axis into a finite set of real numbers $y_k$, k = 1, 2, ..., L. Denoting the quantizer mapping by Q and the input signal by x (the time index is dropped in this section for simplicity), the quantizer equation is

$$Q(x) = y, \quad \text{where } y \in \{y_1, y_2, \ldots, y_L\}. \tag{9}$$

FIG. 8. Speech coding system: simplified block diagram. [Recoverable block labels: channel, decoding, signal reconstruction; output y(n).]

The real values $y_k$ are called quantizer output points. For each input x, the output point is chosen to minimize a distortion criterion $d(x, y_k)$. Actually, this is equivalent to a nearest neighbor rule with respect to the “distance” d. Note that d is not necessarily a mathematical distance according to the usual definition (the triangle inequality may not be satisfied). For this reason the term distortion will be used rather than the term distance. The complete quantizer equation now becomes

$$Q(x) = y_k, \quad \text{where } k = \operatorname{ARGMIN}_j\, d(x, y_j), \tag{10}$$

where the function $\operatorname{ARGMIN}_j$ returns the value of the argument j for which a minimum is obtained. The nearest neighbor rule divides the real axis into L non-overlapping decision intervals $[x_{j-1}, x_j]$, j = 1, 2, ..., L; with this notation the quantizer equation can be written

$$Q(x) = y_k \quad \text{iff} \quad x \in [x_{k-1}, x_k], \tag{11}$$

where iff is used for “if and only if.” It is convenient to consider the quantizer as generating two outputs: the output point $y_k$ and the index k. The decoding operation by which the output point is assigned to its index is sometimes called inverse quantization and is denoted by $Q^{-1}$.

1. Uniform Quantization

A uniform quantizer has equidistant decision levels and output points

$$x_k - x_{k-1} = \Delta, \qquad y_k - y_{k-1} = \Delta, \tag{12}$$

where $\Delta$ is the so-called step size. For bounded input, $|x| \le x_{\max}$, the extreme decision points are $x_0 = -x_{\max}$ and $x_L = x_{\max}$. The quantization error is bounded by

$$|q| \le \Delta/2.$$

For unbounded inputs, $x_0 = -\infty$ and $x_L = \infty$, while all the other decision points, $x_k$, k = 1, 2, ..., L - 1, are equidistant (see Fig. 9). In this case, the quantization error is still bounded by $\Delta/2$, except when the input is in the so-called overload region, defined by

$$|x| > x_{OL}, \quad \text{where } x_{OL} = x_{L-1} + \Delta.$$

FIG. 9. Uniform quantizer for unbounded inputs.

The analysis of quantization noise in uniform quantizers is generally done under the assumptions that the noise, q, is independent of the input signal, uniformly distributed in the interval $[-\Delta/2, \Delta/2]$, and uncorrelated (white). These assumptions, which can be credited to Bennett (1948), provide a good approximation of reality if the quantizer is not overloaded and has a large number of densely spaced levels, and if the input has a smooth probability density function. However, this model may fail in some simple cases of practical importance; for example, for a uniform quantizer with sinusoidal input, the spectrum of the quantization noise is purely discrete (Gray, 1990). A detailed analysis of quantization noise is outside the scope of this paper; the interested reader can find details in Gray (1990) and the references therein. For most modern speech encoders, the quantization noise is found experimentally by simulating the corresponding codec on a computer. Nevertheless, Bennett’s framework is useful for deriving analytical relations for the quantization noise, thereby bringing important insight into the quantization process. Assuming the conditions of the Bennett model are satisfied, it is easy to show that for a uniform quantizer with a bounded input, the quantization noise variance, $\sigma_q^2$, is given by

$$\sigma_q^2 = \frac{\Delta^2}{12} = \frac{1}{3}\, x_{\max}^2\, 2^{-2R}, \tag{13}$$

where $R = \log_2 L$. If the input signal is uniformly distributed in the interval $[-x_{\max}, x_{\max}]$, the quantizer signal-to-noise ratio is given by

$$\mathrm{SNR}_q = 10 \log_{10} \frac{\sigma_x^2}{\sigma_q^2} = 6.02\, R \quad \text{(dB)}. \tag{14}$$

A more realistic assumption is a nonuniformly distributed input signal and a quantizer designed for small overload probability. In this case, a reasonable choice is, for example, $x_{\max} = 4\sigma_x$. (For this choice of $x_{\max}$, the probability of the input falling outside the interval $[-x_{\max}, x_{\max}]$ is as small as 0.0035, even with a Laplace distribution.) In this case

$$\mathrm{SNR}_q = 6.02\, R - 7.3 \quad \text{(dB)}. \tag{15}$$
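The two SNR relations for the uniform quantizer can be checked numerically with a short simulation, sketched below in Python. The midrise quantizer layout, the function names, and the use of a Gaussian input as the nonuniformly distributed example are assumptions made for illustration; inputs in the overload region are simply mapped to the outermost output points.

```python
import math
import random


def uniform_quantize(x, x_max, bits):
    """Uniform quantizer on [-x_max, x_max] with L = 2**bits output points.

    Returns the output point and its index k (0-based).
    """
    levels = 2 ** bits
    step = 2.0 * x_max / levels
    k = int(math.floor((x + x_max) / step))
    k = min(max(k, 0), levels - 1)  # overload: clamp to the outermost interval
    return -x_max + (k + 0.5) * step, k


def measured_snr_db(samples, x_max, bits):
    """SNR in dB measured by time averaging over the given samples."""
    signal = sum(v * v for v in samples)
    noise = sum((v - uniform_quantize(v, x_max, bits)[0]) ** 2 for v in samples)
    return 10.0 * math.log10(signal / noise)


if __name__ == "__main__":
    rng = random.Random(0)
    uniform_input = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
    gaussian_input = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    for bits in (3, 4, 5):
        print(bits,
              round(measured_snr_db(uniform_input, 1.0, bits), 2),   # compare with 6.02 R
              round(measured_snr_db(gaussian_input, 4.0, bits), 2))  # compare with 6.02 R - 7.3
```

For heavier-tailed inputs such as the Laplace distribution, the overload noise neglected in the Bennett model becomes noticeable as the rate increases, so the measured SNR can fall below the value given by the second relation.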

Equations (13) and (14) show that doubling the number of quantizer output points results in a quantization noise variance four times smaller, which corresponds to a signal-to-noise ratio larger by 6.02 dB. The rule of 6 dB/bit is considered a “rule of thumb” for expected performance improvement when the rate in bits/sample increases, although this relationship is true only for a uniform quantizer under the assumptions of the Bennett model.

2. Optimal Quantization

Assume that x is a zero-mean stationary process with a given probability density function (PDF), $p_x$. The quantization error is given by

$$q = x - Q(x). \tag{16}$$

An optimal quantizer should minimize the variance of the quantization error

$$\sigma_q^2 = E\{q^2\} = E\{(x - Q(x))^2\}.$$

Note that optimality is defined here in terms of minimizing the distortion for a given number of output levels, L. It is easy to show that an optimal quantizer should satisfy the following conditions (Lloyd, 1957, 1982; Max, 1960):

$$x_k = \tfrac{1}{2}(y_k + y_{k+1}) \quad \text{for } k = 1, 2, \ldots, L-1, \tag{17}$$

$$y_k = E\{x \mid x \in [x_{k-1}, x_k]\} \quad \text{for } k = 1, 2, \ldots, L, \tag{18}$$

with $x_0 = -\infty$ and $x_L = \infty$. The first condition is a direct consequence of the nearest neighbor rule, while the second condition can be found by minimizing the variance of the quantization error through variational or other techniques. In most practical cases, it is not possible to solve the system (17, 18) analytically; analytical solutions can only be found for the cases L = 2, 3. However, a simple iterative algorithm proposed by Lloyd (1957, 1982) can be used to obtain a numerical solution. This algorithm starts with an initial set of output points and then iteratively optimizes the decision levels given the output points and the output points given the decision levels. Lloyd’s algorithm is actually a particular case of the vector quantizer codebook optimization algorithm, which will be detailed in Section II.C.
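As a worked instance of conditions (17) and (18), the case L = 2 can be solved by hand. The derivation below assumes a zero-mean Gaussian input with variance $\sigma_x^2$; this choice is an illustrative assumption, made because the result can be compared with the entry for R = 1 and a Gauss PDF in Table II.

```latex
% Conditions (17) and (18) for L = 2, zero-mean Gaussian input (assumed).
\begin{align*}
x_0 &= -\infty, \quad x_1 = 0, \quad x_2 = \infty
  && \text{(symmetry of the PDF)} \\
y_2 &= E\{x \mid x > 0\} = \sigma_x \sqrt{2/\pi} \approx 0.798\,\sigma_x,
  \qquad y_1 = -y_2
  && \text{(condition (18))} \\
x_1 &= \tfrac{1}{2}(y_1 + y_2) = 0
  && \text{(condition (17) is satisfied)} \\
\sigma_q^2 &= \sigma_x^2 - y_2^2 = \sigma_x^2 \,(1 - 2/\pi) \approx 0.363\,\sigma_x^2 \\
\mathrm{SNR} &= 10 \log_{10} \frac{1}{1 - 2/\pi} \approx 4.40\ \text{dB}
\end{align*}
```

The resulting 4.40 dB agrees with the value listed in Table II for both the uniform and the PDF optimized quantizer at 1 bit/sample, where the two designs coincide.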

TABLE II

SNR PERFORMANCE OF UNIFORM AND PDF OPTIMIZED QUANTIZERS FOR DIFFERENT INPUT PDFs

Adapted from Max (1960), reprinted by permission of IEEE, copyright © 1960 IRE (now IEEE). Also adapted from Jayant and Noll (1984), copyright © 1984 Prentice Hall Inc., Englewood Cliffs, NJ, reprinted by permission of Prentice Hall Inc., Englewood Cliffs, NJ.

                   Uniform Quantizer             PDF Optimized Quantizer
R (bits/sample)    Gauss    Laplace   Gamma      Gauss    Laplace   Gamma
1                  4.40     3.01      1.76       4.40     3.01      1.76
3                  14.27    11.44     8.78       14.62    12.64     11.52
5                  24.57    20.60     17.49      26.01    23.87     22.85
7                  35.13    30.23     26.29      37.81    35.69     34.67

The approach presented above is difficult to apply to speech signals because the use of Eq. (18) assumes a known probability distribution for computing the corresponding conditional expectation. This difficulty can be solved by optimizing the quantizer over a long training sequence of speech signals; the conditional expectation is then computed by simple averaging over all the input points “clustered” in each interval $[x_{k-1}, x_k]$ (Max, 1960). Table II shows a comparison of the optimal and uniform quantizers for three input PDFs: Gauss, Laplace, and gamma (Max, 1960; Jayant and Noll, 1984). The results in Table II show that for nonuniformly distributed input signals, particularly for Laplace and gamma distributions, the PDF optimized quantizers achieve significantly higher performance than uniform quantizers. For speech signals, PDF-optimized quantizers designed by training over a speech data base have significantly better performance than uniform quantizers. For example, at a rate of 3 bits/sample, a nonuniform quantizer typically achieves 12.1 dB SNR, as compared to 8.4 dB for a uniform quantizer (Paez and Glisson, 1972; Rabiner and Schafer, 1978).
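The training-based design can be sketched in Python as follows. The sketch alternates the two optimality conditions, with the conditional expectation replaced by the average of the training samples falling in each decision interval. The function name `lloyd_train`, the evenly spaced initial output points, and the fixed number of iterations are assumptions made for illustration; a practical design would use a speech data base and a convergence test.

```python
def lloyd_train(samples, levels, iterations=50):
    """Design a scalar quantizer from training samples by Lloyd iterations.

    Returns (outputs, thresholds): the output points y_1..y_L and the interior
    decision levels x_1..x_{L-1}.
    """
    data = sorted(samples)
    low, high = data[0], data[-1]
    # Initial output points: evenly spread over the data range (arbitrary choice).
    outputs = [low + (k + 0.5) * (high - low) / levels for k in range(levels)]
    for _ in range(iterations):
        # Decision levels are midpoints between adjacent output points.
        thresholds = [(a + b) / 2.0 for a, b in zip(outputs, outputs[1:])]
        # Output points are averages of the samples clustered in each interval.
        sums = [0.0] * levels
        counts = [0] * levels
        k = 0
        for value in data:  # data is sorted, so the interval index only grows
            while k < levels - 1 and value > thresholds[k]:
                k += 1
            sums[k] += value
            counts[k] += 1
        outputs = [s / c if c else y for s, c, y in zip(sums, counts, outputs)]
    thresholds = [(a + b) / 2.0 for a, b in zip(outputs, outputs[1:])]
    return outputs, thresholds
```

An interval that receives no training samples keeps its previous output point, which avoids a division by zero at the cost of possibly leaving that output point unused.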

3. Adaptive Quantization

The main problem encountered with uniform and PDF optimized quantizers is the dependence of their performance on the signal statistics. The PDF of the input signal may be different from the PDF for which the quantizer was originally designed. This problem is sometimes called quantizer mismatch and may lead to a significant performance degradation. In adaptive quantization, the quantizer’s parameters are adapted to the changing signal statistics. The signal statistics can be evaluated by forward or backward estimation.

In forward estimation, a buffer of N speech samples is used for statistics evaluation, and the information about quantizer adaptation is sent to the receiver as side information. The forward adaptation configuration is shown in Fig. 10a; the quantization of each input sample x(n) results in an output point $y_k(n) = y(n)$, while b(n) is the binary index sent into the channel to indicate the choice of the output point $y_k(n)$. The disadvantages of the forward estimation are the need to transmit side information, and the encoding delay that results from the buffering of the input samples (typically 15-20 ms in speech coding).

In backward estimation, the signal characteristics are evaluated from the reconstructed signal y(n), which is available at the transmitter and the receiver (see Fig. 10b). The encoding delay and the side information are eliminated at the expense of a performance degradation with respect to the forward estimation case. The performance degradation is due to estimation based on the noisy signal y(n), rather than on the clean signal x(n).

FIG. 10. Forward and backward adaptive quantization. (a) Forward. (b) Backward. Adapted from Jayant and Noll (1984), copyright © 1984 Prentice Hall Inc., Englewood Cliffs, NJ, reprinted by permission of Prentice Hall Inc., Englewood Cliffs, NJ.

Speech is a nonstationary signal; according to the local stationarity model, each speech segment (frame) may have a different PDF. Hence, the quantizer should be reoptimized for each local stationarity segment. This approach would lead to a very complex codec and is not used in practice. In some speech coding algorithms, a small number of quantizers optimized for different PDFs are used; for each speech segment, the quantizer that best matches the segment characteristics is chosen (switched adaptation). However, in most speech
coding algorithms, the task of adapting to the statistics of each local stationarity segment is left to other data compression techniques, such as linear prediction or transform coding. Nevertheless, in this case the quantizer adapts to the signal level (signal variance). The speech level may vary widely in speech coding applications depending on talker, transmission channel, and the particular sound in utterance (voiced or unvoiced, for example). Adaptive quantization for signals with fixed PDF and variable variance can be obtained by signal level normalization at the quantizer input. The level normalization is based on an estimate of the short-time signal variance. For forward and backward estimation, the short-time signal variance is evaluated using, respectively,

$$\sigma_x^2(n) = \frac{1}{N} \sum_{i=0}^{N-1} x^2(n+i), \tag{19}$$

$$\sigma_y^2(n) = \frac{1}{N} \sum_{i=1}^{N} y^2(n-i). \tag{20}$$

The level normalization is performed by replacing the signal x(n) by the normalized signal $x(n)/\sigma_x(n)$ for forward estimation and $x(n)/\sigma_y(n)$ for backward estimation (see Fig. 11). If the quantizer is uniform, this is equivalent to using the input x(n) and a variable step size given by $\Delta\sigma_x(n)$ or, respectively, $\Delta\sigma_y(n)$.
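Forward adaptation with level normalization can be sketched in Python as follows. The variance is estimated once per block of N samples, in the manner of the forward estimate, and is returned as side information. The function name, the block-wise (rather than sample-wise) update, the loading factor of 4, and the fallback value used for an all-zero block are assumptions made for illustration.

```python
import math


def forward_adaptive_quantize(x, block_len, bits, loading=4.0):
    """Forward-adaptive uniform quantization with per-block level normalization.

    Returns (indices, sigmas, y): quantizer indices, the per-block standard
    deviations (side information), and the reconstructed signal.
    """
    levels = 2 ** bits
    step = 2.0 * loading / levels  # fixed step size for the normalized signal
    indices, sigmas, y = [], [], []
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        # Block variance estimate; 1.0 is substituted for an all-zero block.
        sigma = math.sqrt(sum(v * v for v in block) / len(block)) or 1.0
        sigmas.append(sigma)
        for v in block:
            k = int(math.floor((v / sigma + loading) / step))
            k = min(max(k, 0), levels - 1)
            indices.append(k)
            # Same result as a uniform quantizer with step size step * sigma.
            y.append(sigma * (-loading + (k + 0.5) * step))
    return indices, sigmas, y
```

Because each block is normalized by its own level, scaling the input by a constant changes only the side information and leaves the sequence of indices unchanged, which is the intended insensitivity to the signal level.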

FIG. 11. Forward and backward adaptive quantization with level normalization.

Both previous variance estimates require storing N samples of the signal used for estimation. An alternative for backward estimation that requires less memory is the so-called running estimate, given by

$$\sigma_y^2(n) = \alpha\, \sigma_y^2(n-1) + (1 - \alpha)\, y^2(n-1). \tag{21}$$

It can be shown that this estimate is equivalent to an exponential weighting applied to the data in a window with an equivalent length

$$N = \frac{1+\alpha}{1-\alpha}.$$

For details, see Jayant and Noll (1984).

For uniform quantization, a simple and efficient adaptation procedure is the so-called one-word memory adaptive quantizer introduced by Jayant (1973). In Jayant’s quantizer, the step size $\Delta(n)$ is found from the previous step size $\Delta(n-1)$, using

$$\Delta(n) = \Delta(n-1)\, M(b(n-1)), \tag{22}$$

where M(b(n - 1)) is a multiplier factor depending on the magnitude of the quantizer output used for the previous sample. The multiplier factor is larger than one for the largest quantizer output magnitudes and smaller than one for the smallest quantizer output magnitudes.

4. Logarithmic Quantization

Logarithmic quantization is a low complexity alternative for achieving good performance for signals with variable level (wide dynamic range). A logarithmic quantizer consists of a compressor, a uniform quantizer, and an expander. The compressor reduces the dynamic range by a logarithmic transformation of the input signal. Then the logarithmically compressed signal is quantized by a uniform quantizer. At the receiver, the expander reconstructs the original signal dynamic range by implementing a mapping that is the inverse function of the compressor’s mapping. Details on logarithmic quantization can be found in Jayant and Noll (1984).
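The compressor, uniform quantizer, and expander chain can be sketched in Python as follows. The text does not specify a companding law; the mu-law characteristic with mu = 255 is used here purely as a familiar example, and the function names and the 8-bit default are likewise assumptions.

```python
import math

MU = 255.0  # assumed compression parameter; not taken from the text


def compress(x, x_max=1.0):
    """Logarithmic (mu-law) compressor mapping [-x_max, x_max] onto itself."""
    return math.copysign(
        x_max * math.log1p(MU * abs(x) / x_max) / math.log1p(MU), x)


def expand(c, x_max=1.0):
    """Expander: the inverse function of the compressor mapping."""
    return math.copysign(
        x_max * math.expm1(abs(c) * math.log1p(MU) / x_max) / MU, c)


def log_quantize(x, bits=8, x_max=1.0):
    """Compress, quantize uniformly, and expand a single sample."""
    levels = 2 ** bits
    step = 2.0 * x_max / levels
    c = compress(x, x_max)
    k = min(max(int(math.floor((c + x_max) / step)), 0), levels - 1)
    return expand(-x_max + (k + 0.5) * step, x_max)
```

With these settings the relative reconstruction error stays within a few percent for inputs between 0.005 and 0.5 of full scale, which illustrates the approximately level-independent performance that motivates logarithmic quantization.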

5. Scalar Quantization and Entropy Coding The output of a scalar quantizer is a sequence of discrete values y,, k = 1,2,. . . ,L. The information transmitted to the receiver consists of the sequence of indices k, which are letters in an L-ary alphabet. The simplest way to encode the sequence of indices is to assign a binary code to each index. Assuming that L is a power of two, the resulting quantizer rate in bits/sample is R , = log, L.

118

VLADIMIR CUPERMAN

If L is not a power of two, the rate may be rounded to the nearest integer larger than log, L. A more efficient approach for cases in which L is not a power of two can be obtained by encoding sequences of indices; this approach leads to a rate close to log, L.For example, if L = 3, encoding sequences of five indices requires eight bits and leads to the average rate of 1.6 (while log, 3 = 1.585); the rate that would result by rounding is 2 bitslsample. The direct mapping of the quantizer’s output indices into binary words is wasteful in terms of channel bandwidth. From the viewpoint of information theory, the quantizer can be considered as a discrete source with an L-ary alphabet. Denoting the occurrence probability of the output point y k by P k and assuming, for the moment, the input samples are independent, the entropy of the source is given by

According to Shannon's source coding theorem (Shannon, 1948; McEliece, 1977), such a source can be noiselessly encoded at a rate close to $H_Q$. In many practical cases, $R_Q$ may be significantly larger than $H_Q$. The constructive approach to achieving rates close to $H_Q$ is based on variable-length source coding.
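The arithmetic behind the $L = 3$ example, and the entropy it is compared against, can be checked with a short sketch. The function names are illustrative; the entropy function assumes independent samples, as in the text.

```python
import math


def fixed_rate(num_levels, block_len=1):
    """Bits per sample when each block of block_len indices gets one binary word."""
    words = num_levels ** block_len       # number of distinct index blocks
    bits = (words - 1).bit_length()       # smallest b such that 2**b >= words
    return bits / block_len


def entropy(probs):
    """Source entropy in bits per sample for independent samples."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


rate_single = fixed_rate(3)               # one index per codeword
rate_block = fixed_rate(3, 5)             # five indices per codeword
entropy_uniform = entropy([1 / 3] * 3)    # equals log2(3) for equiprobable outputs
```

For equiprobable outputs the entropy equals $\log_2 L$, so block encoding already approaches the bound; the gap between $R_Q$ and $H_Q$ opens up when the output probabilities are unequal.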

In variable-length coding, each quantizer output point $y_k$ is assigned a binary code of length $n_k$, resulting in an average rate, $R_{QV}$, given by

$$R_{QV} = \sum_{k=1}^{L} P_k n_k,$$

where the optimal values for $n_k$ are given by

$$n_k = -\log_2 P_k.$$

Of course, the optimal values of $n_k$ can be reached only if the $P_k$ are powers of two. However, Huffman's (1952) constructive procedure leads to average rates $R_{QV}$ very close to the entropy value (even for cases when the $P_k$ are not powers of two). A very simple example suggested by McEliece (1977) follows. Assume that the quantizer has four output points with probabilities (1/2, 1/4, 1/8, 1/8). The optimal values for $n_k$ in this case are (1, 2, 3, 3); Table III shows a variable-length code with an average rate equal to 1.75 bits/sample, which is precisely the value of the source entropy. The code shown in Table III is uniquely decodable: the sequence of quantizer outputs can be reconstructed from the encoded stream, assuming that no errors were introduced by the communications channel.

TABLE III
A VARIABLE-LENGTH SOURCE CODE

Quantizer Output    Probability    Codeword
y_0                 1/2
y_1                 1/4
y_2                 1/8
y_3                 1/8

Combining optimal or uniform quantization with variable-length coding does not lead to an optimal solution from the perspective of information theory. The problem is that the optimization of the quantizer is considered for a given number of output levels. In the variable-length coding environment, a better result can be obtained by optimizing the quantizer for a given output entropy; this approach is called entropy coding. Mathematically, this is equivalent to minimizing $\sigma_q^2 + \lambda H_Q$ rather than $\sigma_q^2$, where $\lambda$ is a Lagrange multiplier. Because $H_Q$ does not depend on $y_k$, the relation for the optimal output points remains the same as for PDF-optimized quantizers (see Eq. (18)). However, the decision thresholds $x_k$ are different from those given by Eq. (17). The optimization generally results in quantizers with an infinite number of output points. However, using a finite number of levels leads to a very small degradation in performance (Jayant and Noll, 1984). Entropy coding can achieve performance extremely close to the rate-distortion bound (Goblick and Holsinger, 1967; Gish and Pierce, 1968; Berger, 1972; Granzow and Noll, 1983). For example, the optimal entropy-coded quantizer under the high-rate assumption is a uniform quantizer, and its performance is only 0.255 bit/sample worse than the rate-distortion bound.

However, entropy coding is rarely used in speech coding because of practical problems related to implementation and transmission errors. A brief discussion of these problems follows. Modern communications are based on fixed-rate channels. The quantizer output is a fixed-rate sequence of symbols. However, entropy coding transforms the quantizer output into codewords of variable length. A buffer is needed to adapt the variable rate obtained after entropy coding to the constant channel rate. The length of this buffer can be considerable even for very simple cases; Granzow and Noll (1983) found that a minimum buffer length of 800-1000 bits was needed for achieving a synchronous rate of 1 bit/sample. Such a buffer introduces a communications delay that may not be acceptable in many speech coding applications. Moreover, the buffer will overflow with probability one, even for stationary signals (Jelinek, 1968).
To avoid overflow, a feedback system that modifies quantizer parameters may be used; however, this further increases implementation complexity. Variable-length codes are sensitive to transmission errors; a channel error may propagate and result in a number of wrongly decoded symbols. Despite these strong objections, this author does not believe that the problems of entropy speech coding are insurmountable; some low-delay forms of entropy coding may find uses in speech coding in the future. New results obtained in adaptive entropy-coded quantization, a technique that uses buffer-state feedback to control quantizer characteristics, show that at least the buffer overflow/underflow problems could be reduced with a minimal distortion penalty and at reasonable buffer length (Harrison and Modestino, 1990).
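To make the variable-length coding discussion of this subsection concrete, the following sketch builds a Huffman code for the four-output example with probabilities (1/2, 1/4, 1/8, 1/8) and decodes an encoded stream. It is a minimal illustration: the helper names are hypothetical, and the particular codewords it produces are one valid prefix code with lengths (1, 2, 3, 3), not necessarily the codewords printed in the original table.

```python
import heapq
import itertools


def huffman_code(probs):
    """Return one binary codeword (as a string) per symbol index."""
    tiebreak = itertools.count()
    # Heap entries: (probability, tiebreak, symbol indices in this subtree).
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = [""] * len(probs)
    while len(heap) > 1:
        p0, _, s0 = heapq.heappop(heap)   # two least probable subtrees
        p1, _, s1 = heapq.heappop(heap)
        for i in s0:
            codes[i] = "0" + codes[i]
        for i in s1:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (p0 + p1, next(tiebreak), s0 + s1))
    return codes


def decode(bits, codes):
    """Decode a bit string produced with a prefix code (error-free channel)."""
    inverse = {c: i for i, c in enumerate(codes)}
    symbols, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            symbols.append(inverse[word])
            word = ""
    return symbols


probs = [0.5, 0.25, 0.125, 0.125]
codes = huffman_code(probs)
average_rate = sum(p * len(c) for p, c in zip(probs, codes))
sequence = [0, 3, 1, 0, 2, 0]
stream = "".join(codes[k] for k in sequence)
```

Because these probabilities are powers of two, the average rate of the resulting code equals the source entropy. Flipping a single bit of `stream` before decoding is a simple way to observe the error propagation mentioned above.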

B. Linear Prediction in Speech Coding

Linear prediction is a data compression technique in which the value of each input sample is estimated, or "predicted," by a linear combination of a finite number of past input samples. Linear prediction was first introduced in speech processing by Atal and Schroeder (1967) and Itakura and Saito (1968). A complete presentation of different approaches to linear prediction is beyond the scope of this paper. The presentation will be kept to the minimum required for understanding the rest of the material. More complete presentations can be found in other sources (Makhoul, 1975; Markel and Gray, 1976; Rabiner and Schafer, 1978; Gibson, 1980; Jayant and Noll, 1984; Furui, 1989).

1. Linear Prediction for a Stationary Process

Initially, $x(n)$ will be assumed to be a stationary random process. This assumption simplifies the development of an insight into linear prediction that is useful in speech coding. Signal models that consider the nonstationarity of the speech signal will then be introduced. Note that the maximum likelihood approach to the linear prediction of speech (see Itakura and Saito, 1968; Markel and Gray, 1976; Furui, 1989) assumes the signal is stationary as well as Gaussian-distributed and leads to the same results as more "realistic" approaches based on a deterministic signal assumption or local stationarity. The linear prediction of the current sample $x(n)$ is defined as a linear combination of the previous signal samples:

$$\hat{x}(n) = \sum_{k=1}^{M} h_k x(n-k), \tag{25}$$

where $h_k$ are the linear prediction coefficients and $M$ is the predictor order. It is reasonable to choose the coefficients $h_k$ such that the prediction error

$$e(n) = x(n) - \hat{x}(n) \tag{26}$$

will be minimized in some sense. For a stationary process, the coefficients $h_k$ will be chosen to minimize the variance of the prediction error

$$\sigma_e^2 = E\{e^2(n)\} = E\{[x(n) - \hat{x}(n)]^2\}. \tag{27}$$

In adaptive filtering, this approach for optimizing the predictor coefficients is called the minimum mean-square error (MMSE) solution. Taking the derivative of the last relation with respect to $h_k$, the following condition for the optimality of the linear predictor is found:

$$E\{e(n)\,x(n-k)\} = 0, \qquad k = 1, 2, \ldots, M. \tag{28}$$

This relation is known as the orthogonality principle in linear prediction. In simple words, the orthogonality principle requires that for the optimal linear predictor, the prediction error should be orthogonal to the input data. By replacing $e(n)$ in (28) by the expression given by Eqs. (25) and (26), the following system of equations for the optimal linear prediction coefficients is found:

$$\sum_{j=1}^{M} h_j\, r_{xx}(|j-k|) = r_{xx}(k) \tag{29}$$

for $k = 1, 2, \ldots, M$. This is a system of $M$ linear equations with $M$ unknowns $h_j$, which is called the Wiener-Hopf system of equations, or Yule-Walker equations (mostly in spectral estimation literature). The equations can be written in the vector form

$$R_{xx}\mathbf{h} = \mathbf{r}_x, \tag{30}$$

where $R_{xx}$ is the autocorrelation matrix

$$R_{xx} = \begin{pmatrix} r_{xx}(0) & r_{xx}(1) & \cdots & r_{xx}(M-1) \\ r_{xx}(1) & r_{xx}(0) & \cdots & r_{xx}(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_{xx}(M-1) & r_{xx}(M-2) & \cdots & r_{xx}(0) \end{pmatrix} \tag{31}$$

and $\mathbf{h} = (h_1, h_2, \ldots, h_M)^T$, $\mathbf{r}_x = (r_{xx}(1), r_{xx}(2), \ldots, r_{xx}(M))^T$. Although the matrix $R_{xx}$ is typically positive-definite for nonzero speech signals, it may be ill-conditioned. To avoid this situation, in some speech coding applications a small positive quantity is added to the main diagonal of the matrix before Eq. (30) is solved. This is equivalent to adding a small amount of white noise to the input speech signal. Assuming $R_{xx}$ positive-definite and therefore nonsingular leads to the following solution for the optimal linear prediction coefficients:

$$\mathbf{h} = R_{xx}^{-1}\mathbf{r}_x. \tag{32}$$
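As a small worked instance of Eqs. (30) and (32), the sketch below builds the Toeplitz autocorrelation matrix for a hypothetical process with $r_{xx}(k) = \rho^{|k|}$ and solves for the predictor by plain Gaussian elimination. The test autocorrelation, the function names, and the relative `loading` parameter (standing in for the small quantity added to the main diagonal) are assumptions made for illustration only.

```python
def toeplitz(r, order):
    """Autocorrelation matrix with element (i, j) equal to r[|i - j|]."""
    return [[r[abs(i - j)] for j in range(order)] for i in range(order)]


def solve(a, b):
    """Solve a h = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= factor * m[col][j]
    h = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        tail = sum(m[i][j] * h[j] for j in range(i + 1, n))
        h[i] = (m[i][n] - tail) / m[i][i]
    return h


def wiener_hopf(r, order, loading=0.0):
    """Predictor coefficients from autocorrelation values r[0..order]."""
    rxx = toeplitz(r, order)
    for i in range(order):
        rxx[i][i] += loading * r[0]    # small positive term on the diagonal
    return solve(rxx, r[1:order + 1])


rho = 0.9
r = [rho ** k for k in range(4)]       # hypothetical autocorrelation sequence
h_exact = wiener_hopf(r, 3)
h_loaded = wiener_hopf(r, 3, loading=1e-4)
```

For this autocorrelation sequence the optimal third-order predictor is $(\rho, 0, 0)$: only the most recent sample carries predictive information. The diagonal loading perturbs that solution only slightly while improving the conditioning of the matrix.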


The matrix $R_{xx}$ is Toeplitz and symmetrical; hence, computationally efficient procedures may be used for the matrix inversion in Eq. (32). The best-known fast inversion procedure is the Levinson-Durbin algorithm (Levinson, 1947; Durbin, 1960; Markel and Gray, 1976; Furui, 1989). The prediction error sequence, $e(n)$, has an interesting property for the theoretical case of infinite-order prediction. If $M \to \infty$, the orthogonality principle implies that the prediction error is a white noise process. On the other hand, the linear predictor can be considered as a digital filter with the input $x(n)$, the output $e(n)$, and the system function given by

$$A(z) = 1 - \sum_{k=1}^{M} h_k z^{-k}. \tag{33}$$

Hence, the power spectral density (PSD) of the filter input, $P_{xx}(\omega)$, and output, $P_{ee}(\omega)$, are related by

$$P_{ee}(\omega) = |A(e^{j\omega})|^2 P_{xx}(\omega). \tag{34}$$

Using the subscript $\infty$ to indicate the optimal infinite-order prediction, this relation becomes

$$P_{xx}(\omega) = \frac{\sigma_{e\infty}^2}{|A_\infty(e^{j\omega})|^2}, \tag{35}$$

where $\sigma_{e\infty}^2$ is the variance of the white noise process $e(n)$ obtained by the infinite-order optimal linear prediction of $x(n)$. The previous development leads to two important conclusions. First, the optimal infinite-order linear predictor transforms a stationary signal into a white noise process. For this reason the filter $A(z)$ is sometimes called the whitening filter. Second, the optimal infinite-order predictor contains all the information regarding the signal's PSD shape. Practically, a good estimate of the signal's PSD can be obtained with a finite-order predictor. This property is used in spectral estimation by the so-called model-based (parametric) approach. Good short-time estimates of the speech spectrum can be obtained using predictors of order 10-20. The predictor filter $A(z)$ transforms the stationary random signal $x(n)$ into the white noise signal $e(n)$. Hence, the filter $1/A(z)$ will reconstruct precisely the original signal $x(n)$ from the white noise "excitation" signal $e(n)$. These two configurations are shown in Fig. 12. The filter $1/A(z)$ is sometimes called the inverse filter.

FIG.12. The whitening and the inverse filters.
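The two configurations of Fig. 12 can be sketched directly from Eqs. (25), (26), and (33). In the example below a white Gaussian excitation is passed through $1/A(z)$ and the result is passed back through $A(z)$; the second-order coefficients are hypothetical values chosen so that $1/A(z)$ is stable, and zero initial conditions are assumed.

```python
import random


def whitening_filter(x, h):
    """A(z): e(n) = x(n) - sum_k h_k x(n-k), with zero initial conditions."""
    e = []
    for n in range(len(x)):
        pred = sum(h[k - 1] * x[n - k] for k in range(1, len(h) + 1) if n >= k)
        e.append(x[n] - pred)
    return e


def inverse_filter(e, h):
    """1/A(z): x(n) = e(n) + sum_k h_k x(n-k), with zero initial conditions."""
    x = []
    for n in range(len(e)):
        pred = sum(h[k - 1] * x[n - k] for k in range(1, len(h) + 1) if n >= k)
        x.append(e[n] + pred)
    return x


random.seed(0)
h = [1.2, -0.5]                                   # hypothetical coefficients
excitation = [random.gauss(0.0, 1.0) for _ in range(2000)]
signal = inverse_filter(excitation, h)            # "colored" signal x(n)
recovered = whitening_filter(signal, h)           # back to the excitation
power_ratio = sum(v * v for v in signal) / sum(v * v for v in excitation)
```

The recovered sequence matches the excitation up to rounding error, and the signal power exceeds the excitation power, which is the redundancy that prediction removes.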


2. Autocorrelation and Covariance Methods

Now we will give up the stationarity assumption and return to the characteristics of the speech signal. A first possible approach is based on the local stationarity model of the speech signal. In this case Eq. (29) can still be used for estimating the optimal linear predictor coefficients of the stationarity segment $[n_0, n_0 + N - 1]$ if the true autocorrelation function, $r_{xx}(k)$, is replaced by the estimate given by (1). In the local stationarity approach, the autocorrelation function is estimated using a signal segment assumed to be a realization of an ergodic process; this segment is obtained by applying a rectangular window to the input signal, $x(n)$. It was shown in the previous section that linear prediction can be considered as a form of spectral estimation. Multiplying the signal by a window function in the time domain is equivalent to convolving the signal spectral estimate with the window spectrum. In spectral estimation it is well known that a rectangular window has the disadvantage of high spectral sidelobes; a smooth window, such as the Hamming window, may lead to a better spectral estimate. Hence, the autocorrelation function in Eq. (29) will be replaced by the windowed estimate, leading to the following system of equations for the optimal predictor coefficients:

$$\sum_{j=1}^{M} h_j\, r_{wxx}(|j-k|) = r_{wxx}(k) \tag{36}$$

for $k = 1, 2, \ldots, M$, where $r_{wxx}(k)$ is given by (4). This system of equations is used in the autocorrelation method for linear prediction of speech. The system can also be written in vector form:

$$R_{wxx}\mathbf{h} = \mathbf{r}_{wx}, \tag{37}$$

where $R_{wxx}$ is the autocorrelation matrix of the windowed signal, and $\mathbf{r}_{wx}$ is the corresponding autocorrelation vector. The derivation of the autocorrelation method presented above lacks rigor and so may be questioned by mathematically inclined readers. However, it has the advantage of connecting the linear prediction of speech and the linear prediction of stationary random signals by using a simple model of speech nonstationarity. A different insight into linear prediction of speech can be obtained by considering the problem of prediction error minimization for a given speech frame. In this approach, no assumption is made about the given speech segment; it may be considered as a deterministic finite discrete sequence. It is reasonable then to consider the minimization of the short-time mean squared prediction error given by

$$\varepsilon^2 = \sum_{n=n_0}^{n_0+N-1} \left[ x(n) - \sum_{k=1}^{M} h_k x(n-k) \right]^2. \tag{38}$$


In adaptive filtering, this approach for optimizing the predictor coefficients is called the least squares (LS) solution. The optimal linear prediction coefficients, $h_k$, are obtained by setting the derivatives of $\varepsilon^2$ with respect to $h_k$, $k = 1, 2, \ldots, M$, equal to zero. After some simple algebraic manipulations, the following system of $M$ equations in $M$ unknowns, $h_k$, is obtained:

$$\sum_{k=1}^{M} h_k\, \phi_{xx}(j,k) = \phi_{xx}(j,0), \tag{39}$$

$j = 1, 2, \ldots, M$, where

$$\phi_{xx}(j,k) = \frac{1}{N}\sum_{n=n_0}^{n_0+N-1} x(n-j)\,x(n-k) \tag{40}$$

for $j, k = 1, 2, \ldots, M$. This approach to finding the optimal prediction coefficients is called in the speech coding literature the covariance method; the terminology is confusing because the term covariance is associated in random process theory with the correlation of a signal with the mean removed. There is no relation between the covariance method of speech coding as defined above and the use of the same term in random process theory. As one can easily see from the last relation, the covariance method requires the samples with indices $n_0 - M, n_0 - M + 1, \ldots, n_0 - 1$, i.e., $M$ samples of the previous speech frame. In most speech coding algorithms, buffering the last samples of the previous frame does not represent a major difficulty. The use of samples from the previous frame can be avoided by minimizing the mean squared error for the segment $[n_0 + M, n_0 + N - 1]$ (see, for example, Furui, 1989).

There are several significant differences between the autocorrelation and covariance methods. In the autocorrelation method, the system matrix is Toeplitz, and this leads to a very efficient algorithm for solving the system (Levinson, 1947; Durbin, 1960). Although the computational complexity of the covariance method can be reduced by a triangular decomposition of the system matrix (Cholesky decomposition), it still remains significantly larger than in the autocorrelation method. In speech coding, it is important to ensure the stability of the inverse filter, $1/A(z)$, which is used in the receiver for reconstructing the signal. The autocorrelation method always results in a stable inverse filter. In the covariance method, a stabilization procedure is required for the inverse filter (Atal and Schroeder, 1979). However, the covariance method may achieve slightly better performance than the autocorrelation method (Markel and Gray, 1976).
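The autocorrelation method can be sketched as follows: window a frame, estimate the autocorrelation, and solve the Toeplitz system with the Levinson-Durbin recursion. The recursion is written in its standard textbook form for the sign convention of Eq. (25); the synthetic frame (two sinusoids plus noise), the frame length of 160, and the order of 10 are illustrative assumptions. The reflection coefficients produced along the way give a convenient stability check, since magnitudes below one correspond to a stable $1/A(z)$.

```python
import math
import random


def hamming(length):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def autocorrelation(frame, max_lag):
    """Autocorrelation estimate of a (windowed) frame for lags 0..max_lag."""
    n_len = len(frame)
    return [sum(frame[n] * frame[n + k] for n in range(n_len - k)) / n_len
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations; return (h, reflection, error)."""
    h, reflection, error = [], [], r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(h[j] * r[m - 1 - j] for j in range(m - 1))
        k = acc / error                       # reflection coefficient of stage m
        h = [h[j] - k * h[m - 2 - j] for j in range(m - 1)] + [k]
        reflection.append(k)
        error *= 1.0 - k * k                  # prediction error after stage m
    return h, reflection, error


random.seed(1)
frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) + random.gauss(0.0, 0.1)
         for n in range(160)]                 # synthetic stand-in for speech
windowed = [w * s for w, s in zip(hamming(160), frame)]
r = autocorrelation(windowed, 10)
h, reflection, error = levinson_durbin(r, 10)
```

The recursion needs on the order of $M^2$ operations, compared with $M^3$ for a general solver, which is the efficiency advantage noted above. The covariance method is not shown; its matrix $\phi_{xx}(j,k)$ is symmetric but not Toeplitz, so this recursion does not apply to it.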
Now consider the case in which the averaging interval for computing $\phi_{xx}(j,k)$ is extended to infinity and the signal is windowed by a window that is nonzero only in the interval $[n_0, n_0 + N - 1]$. For an infinite signal it can be assumed that

$$\phi_{xx}(j,k) = \phi_{xx}(|j-k|, 0), \tag{41}$$

and after windowing,

$$\phi_{xx}(|j-k|, 0) = r_{wxx}(|j-k|). \tag{42}$$

This result represents a different derivation of the autocorrelation method equations (36) and (37). Although $\varepsilon^2$ is minimized here over an infinite interval, the windowing operation actually restricts the minimization to a finite interval. Alternatively, an equivalent result can be obtained by minimizing $\varepsilon^2$ over the finite interval $[n_0, n_0 + N + M - 1]$ (Markel and Gray, 1976; Rabiner and Schafer, 1978). The covariance and autocorrelation methods can also be used for estimating the optimal predictor coefficients from the reconstructed signal. The corresponding equations are obtained by replacing $x$ with $y$ in (36) and (37) or (39) and (40), respectively.

a. Prediction Gain

The performance of a linear predictor can be evaluated by using the prediction gain, defined by

$$G_p = \sigma_x^2 / \sigma_e^2. \tag{43}$$

For speech signals, the prediction gain is computed by time averaging over a speech segment:

$$G_p = \frac{\sum_n x^2(n)}{\sum_n e^2(n)}. \tag{44}$$

Depending on the length of the segment, a short-time or a long-time prediction gain may be defined. The average of the short-time gains computed on a logarithmic scale (dB values) is called the segmental prediction gain.

3. Pitch Prediction (Long-Term Prediction)

The linear predictor given by Eq. (25) is sometimes called a one-step predictor. A $\tau$-step or distant-sample predictor will predict the current sample, $x(n)$, by a linear combination of the samples that are at least $\tau$ samples in the past, $x(n-\tau), x(n-\tau-1), \ldots$. For voiced speech, a good choice for $\tau$ is $\tau = k_p$, where $k_p$ is the pitch period; good results may also be obtained using a multiple of the pitch value or, generally, any lag value at which the distant-sample correlation has a significant peak. Using a predictor that is symmetrical with respect to the distant sample, $k_p$, the pitch predictor equation is given by

$$\hat{x}(n) = \sum_{k=-M}^{M} a_k\, x(n - k_p - k). \tag{45}$$

As before, the prediction error can be defined by $e(n) = x(n) - \hat{x}(n)$, and the prediction coefficients can be computed by minimizing a mean squared value of $e(n)$. Both the covariance and the autocorrelation methods can be used for finding the optimal coefficients $a_k$. In speech coding it was found that good results can be obtained by using a one-tap predictor ($M = 0$) or a three-tap predictor ($M = 1$). Using the autocorrelation approach for a three-tap predictor leads to the following system of equations:

$$\begin{pmatrix} r_{xx}(0) & r_{xx}(1) & r_{xx}(2) \\ r_{xx}(1) & r_{xx}(0) & r_{xx}(1) \\ r_{xx}(2) & r_{xx}(1) & r_{xx}(0) \end{pmatrix} \begin{pmatrix} a_{-1} \\ a_0 \\ a_1 \end{pmatrix} = \begin{pmatrix} r_{xx}(k_p - 1) \\ r_{xx}(k_p) \\ r_{xx}(k_p + 1) \end{pmatrix}. \tag{46}$$

Generally, no windowing is used in estimating the autocorrelation function for the pitch predictor. Hence, $r_{xx}(k)$ for $k = 0, 1, 2$ can be computed using Eq. (1), while

$$r_{xx}(k_p + i) = \frac{1}{N}\sum_{n=n_0}^{n_0+N-1} x(n)\,x(n - k_p - i) \tag{47}$$

for $i = -1, 0, 1$. The computations in (47) require samples from the previous speech frame. For a one-tap predictor, this system reduces to the simple solution

$$a_0 = \frac{r_{xx}(k_p)}{r_{xx}(0)}. \tag{48}$$

The autocorrelation method applied to pitch predictors does not guarantee the stability of the resulting inverse filter. For one-tap predictors, the stability condition is simply $|a_0| < 1$. For three-tap predictors, there are no simple analytical stability checks. Sufficient conditions for stability were derived by Ramachandran and Kabal (1987). A three-tap pitch predictor may provide prediction gains of about 3 dB over a one-tap predictor (Flanagan et al., 1979; Jayant and Noll, 1984). Prediction gains similar to those achieved by a three-tap predictor can be obtained by a one-tap fractional pitch predictor (Kroon and Atal, 1989, 1990a). A fractional pitch predictor uses rational values for the pitch period $k_p$, suggesting that a three-tap predictor implements a time-interpolation process that is important because the true pitch period is not an integer number of samples. The design of the pitch predictor requires the measurement of the pitch period. Pitch measurement algorithms are not presented in this paper; information about this subject can be found in Rabiner et al. (1976), Rabiner and Schafer (1978), Jayant and Noll (1984), and the references therein.
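A one-tap pitch predictor and its prediction gain can be sketched on a synthetic periodic signal. Everything specific here is an assumption made for illustration: the pitch period of 40 samples is taken as known rather than measured, the test signal is two harmonics plus a little noise, and the frame of 160 samples starts late enough that the required samples from before the frame are available.

```python
import math
import random

random.seed(2)
PITCH = 40  # assumed pitch period in samples (hypothetical test signal)
x = [math.sin(2.0 * math.pi * n / PITCH)
     + 0.5 * math.sin(4.0 * math.pi * n / PITCH)
     + random.gauss(0.0, 0.05) for n in range(400)]


def one_tap_coefficient(x, start, length, lag):
    """a0 = r(lag) / r(0), both estimated over x[start:start+length]."""
    frame = range(start, start + length)
    r0 = sum(x[n] * x[n] for n in frame)
    r_lag = sum(x[n] * x[n - lag] for n in frame)   # reaches before `start`
    return r_lag / r0


def prediction_gain_db(x, e):
    """Time-averaged prediction gain in dB over one segment."""
    return 10.0 * math.log10(sum(v * v for v in x) / sum(v * v for v in e))


start, length = 200, 160
a0 = one_tap_coefficient(x, start, length, PITCH)
frame = x[start:start + length]
residual = [x[n] - a0 * x[n - PITCH] for n in range(start, start + length)]
gain_db = prediction_gain_db(frame, residual)
```

For a nearly periodic signal the coefficient comes out close to one and the gain is large. On real speech the gain is lower, and a practical coder would also check the stability condition on `a0` before using it in the inverse filter.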


4. Adaptive Prediction: Forward vs. Backward and Block vs. Recursive

Early work in speech coding was based on fixed predictors designed by solving the Wiener-Hopf equations using long-time estimates for the autocorrelation function. An interesting discussion of the reasons leading to this situation and of early work in adaptive prediction is found in Gibson (1980). Recent work has shown that significantly improved performance can be obtained by using adaptive predictors that change their coefficients according to the time-varying speech statistics.

There are two basic dichotomies in adaptive prediction. First, the adaptation information can be transmitted to the receiver, as in forward adaptation, or can be derived simultaneously at the receiver and the transmitter from the past samples of the reconstructed signal, as in backward adaptation. Second, the adaptation can be done for each frame (block) of data (block adaptation), or for each data sample (recursive adaptation). It is important to avoid confusion between the terms backward adaptation and backward prediction. The term backward prediction is used for the prediction of a sample $x(n - M)$ based on the "future" samples $x(n - M + 1), x(n - M + 2), \ldots$. The terms forward and backward adaptation are used in the adaptive filtering literature with the same meaning as in this paper (see, for example, Honig and Messerschmitt, 1984).

Traditionally, block adaptation has been used in forward adaptive systems and recursive adaptation has been used in backward adaptive systems. However, in the last few years, backward block adaptation has received much attention as a result of the emerging low-delay speech coding systems (see Section III.D.5). Forward recursive adaptation is rarely used. Figures 13a and 13b show the general configuration of forward and backward adaptive prediction, respectively. In forward block adaptation (Fig. 13a), the optimal linear predictor is computed at the transmitter separately for each frame, quantized, and sent to the receiver.
In order to process the samples of each frame by its optimal predictor, the system requires a buffer of one frame length. The system computes the estimate of the autocorrelation function based on the samples stored in the buffer using Eqs. (1) and (4). Then, the optimal predictor coefficients are computed by solving the systems (30) or (37), and finally, the predictor is used to encode the buffered frame. The predictor coefficients in forward block adaptive linear prediction are sometimes called the linear prediction coding (LPC) coefficients. Forward adaptation has two disadvantages. First, it requires a communications delay equal to the duration of one frame. Second, the optimal prediction coefficients have to be transmitted to the receiver as side information.

FIG. 13. Forward and backward adaptation in linear prediction: (a) forward adaptation; (b) backward adaptation.

In backward adaptation (see Fig. 13b), the optimal predictor is computed using the reconstructed signal, $y(n)$, which is available at the transmitter and the receiver. A copy of the decoder is used in the transmitter to produce the reconstructed samples. The reconstructed signal, $y(n)$, is obtained from the quantized prediction error, $u(n)$, using the inverse filter equation

$$y(n) = u(n) + \sum_{k=1}^{M} h_k\, y(n-k). \tag{49}$$

Note that the same notation, $h_k$, was used for the prediction coefficients in Eq. (49) and in Eqs. (25) and (26), although the optimal prediction coefficients for the reconstructed signal, $y(n)$, are, of course, different from those computed for the input signal, $x(n)$. Denoting the autocorrelation matrix of the reconstructed signal by $R_{yy}$ and the corresponding correlation vector by $\mathbf{r}_y$, the following vector form of the Wiener-Hopf equations is obtained for this case:

$$R_{yy}\mathbf{h} = \mathbf{r}_y, \tag{50}$$

where $\mathbf{r}_y = (r_{yy}(1), r_{yy}(2), \ldots, r_{yy}(M))^T$. The element $i, j$ of the matrix $R_{yy}$ is given by

$$r_{yy}(|i-j|) = \frac{1}{N}\sum_{k=0}^{N-|i-j|-1} y(n_0 + k)\,y(n_0 + k + |i-j|),$$

where the index $n_0$ points to the first sample in a block and $N$ is the block length. Equation (50) leads to the autocorrelation method, and the derived synthesis filter is always stable. Alternatively, the covariance method can be used based on the assumption that the coefficients that minimize the total squared error on a given block, because of the slowly changing characteristics of speech, will lead to good performance on the next block where they will actually be used.

It may be tempting to derive Eq. (50) by assuming $y(n)$ and $u(n)$ are stationary random signals and by minimizing the variance of $u(n)$, $\sigma_u^2$, in a fashion similar to the minimization of $\sigma_e^2$ in Eq. (27). However, even if $y(n)$ and $u(n)$ are stationary, this approach is not correct. Indeed, $u(n)$ is obtained by a deterministic procedure (prediction followed by quantization) from $x(n)$. For any practical quantizer, output variance is a monotonic function of input variance. Hence, $\sigma_u^2$ will be minimized simultaneously with $\sigma_e^2$ by the optimal predictor for the input signal $x(n)$, which is different from the optimal predictor for $y(n)$ as given by Eq. (50). The shortest way to justify Eq. (50) is to consider it as an approximation of Eq. (30) that may be obtained by replacing the unknown $r_{xx}(k)$ by its approximation $r_{yy}(k)$, $k = 1, 2, \ldots, M$. Note that if the reconstruction error, $r(n)$, is assumed uncorrelated with the input $x(n)$, then

$$r_{yy}(k) = r_{xx}(k) + r_{rr}(k),$$

where $r_{rr}(k)$ is the autocorrelation function of the reconstruction error at lag $k$. Keeping in mind that $r_{xx}$, $r_{yy}$, and $r_{rr}$ are actually estimates based on $N$ signal samples, the assumption here is that the cross terms in the summation over $N$ samples are negligible. Then, if the reconstruction error is white and $\sigma_r^2 \ll \sigma_x^2$ ("fine" quantization), the approximation $r_{xx}(k) \approx r_{yy}(k)$ is justified. However, at low rates the previous assumptions fail, which explains the difficulties encountered in using backward adaptation in low-rate speech coding systems. The above comments are true for the derivation of practically all the adaptation procedures based on the reconstructed signal and quantized prediction error. The autocorrelation method with windowing may also be used for block backward adaptation. In this case, the prediction coefficients can be found by solving the corresponding system for the windowed signal:

$$R_{wyy}\mathbf{h} = \mathbf{r}_{wyy}, \tag{51}$$

where $R_{wyy}$ is the autocorrelation matrix of the reconstructed windowed signal, and $\mathbf{r}_{wyy}$ is the corresponding autocorrelation vector. Backward adaptation eliminates the communications delay associated with forward adaptation and does not require the transmission of the predictor coefficients to the receiver. The disadvantage of backward adaptation is that the estimation of the optimal predictor coefficients is affected by quantization noise. Moreover, block backward adaptation is based on the assumption that the optimal predictor coefficients for a given block, because of the slowly changing characteristics of speech, will lead to good performance on the next block where they will actually be used. In the regions where speech characteristics vary quickly, this assumption may not be satisfied. Very little is known about backward adaptation performance at low rates. It was believed that backward adaptation performance degrades rapidly when the rate decreases because quantization noise affects the estimation process. However, the successful use of backward adaptation in the 16 kb/s low-delay speech coders for the forthcoming CCITT standard showed that the minimum rate at which backward adaptation can still be used is not yet known. A quantitative comparison between forward and backward configurations depends on the encoder/decoder used to produce the reconstructed signal $y(n)$. This problem will be further discussed in Section III.D.5.

It is interesting to note that there is more flexibility in the choice of adaptation configurations than what results by defining the backward and forward configurations as above. For example, in block forward adaptation, the buffering delay does not have to be equal to the duration of a frame. Actually, the buffering delay could be completely eliminated; in this case, the prediction coefficients for the frame $[n_0, n_0 + N - 1]$ may be computed based on the previous frame $[n_0 - N, n_0 - 1]$. More generally, forward-backward adaptation may be defined as a configuration in which the predictor coefficients are computed from a block of the original signal, $[m_0, m_0 + N - 1]$, overlapping with the block on which the predictor will be used, $[n_0, n_0 + N - 1]$, as shown in Fig. 14. The buffering delay introduced by such an adaptation procedure is $m_0 - n_0 + N - 1$. The delay due to the buffering is completely eliminated if $n_0 = m_0 + N - 1$. The price paid for reducing the delay is a possible performance degradation on speech segments characterized by fast variations. Forward-backward adaptation may be used to obtain low-delay speech coding configurations in which the short-term predictor parameters are estimated from the original signal, rather than from the reconstructed signal, which is affected by quantization noise.

FIG. 14. Forward-backward linear prediction: time diagram (sample-number axis showing the adaptation block, the predicted block ending at $n_0 + N - 1$, and the buffering delay).

FIG. 15. Average prediction gain versus buffering delay for forward-backward prediction (horizontal axis: delay in ms, 0 to 20).

Figure 15 presents the prediction gain of a forward-backward block predictor as a function of the buffering delay for a 10th-order predictor and a block length of 160 samples. The prediction gain represents a long-time average computed over 10 s of speech. Figure 15 shows that a delay of 20 ms leads to a prediction gain increase of about 1.8 dB.

5. Recursive Adaptation

Recursive adaptation is based on the application of the gradient algorithm for solving the prediction error minimization problem. Again, the signal is assumed to be stationary. By expanding the square in Eq. (27), the variance of the prediction error can be written as

$$\sigma_e^2 = \sigma_x^2 - 2\mathbf{h}^T\mathbf{r}_x + \mathbf{h}^T R_{xx}\mathbf{h}. \tag{52}$$

The objective of the recursive adaptation procedure is to find the coefficient vector $\mathbf{h}$ that minimizes $\sigma_e^2$; for this reason, $\sigma_e^2$ is called the objective function of the gradient algorithm. To minimize the quadratic form (52), the gradient algorithm defines a sequence of coefficient vectors, $\mathbf{h}_j$ (where $j$ is an iteration index), that converges to the optimum solution. The gradient of the error, $\nabla E\{e^2(n)\}$, is a vector that points in the direction of the maximum increase of the error. The basic idea of the gradient algorithm is to adapt the coefficient vector "slowly" in the direction of the negative gradient in the hope that the distortion will decrease and reach the minimum value. Denoting by $\mu$ the adaptation step size, the algorithm may be written

$$\mathbf{h}_{j+1} = \mathbf{h}_j - \frac{\mu}{2}\nabla E\{e^2(n)\}. \tag{53}$$
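The iteration of Eq. (53) can be sketched for a case in which the statistics are known exactly. Expanding Eq. (27) with the predictor of Eq. (25) gives the gradient $2(R_{xx}\mathbf{h} - \mathbf{r}_x)$, which the sketch uses directly. The second-order example with $r_{xx}(k) = 0.9^{|k|}$ and the step size of 0.5 are hypothetical choices, with the step assumed small enough for the iteration to converge.

```python
def gradient_descent(rxx, rx, mu, iterations):
    """Iterate h <- h - (mu / 2) * grad, with grad = 2 (Rxx h - rx)."""
    order = len(rx)
    h = [0.0] * order
    for _ in range(iterations):
        grad = [2.0 * (sum(rxx[i][j] * h[j] for j in range(order)) - rx[i])
                for i in range(order)]
        h = [hi - 0.5 * mu * gi for hi, gi in zip(h, grad)]
    return h


rho = 0.9                              # hypothetical correlation coefficient
rxx = [[1.0, rho], [rho, 1.0]]         # autocorrelation matrix for M = 2
rx = [rho, rho * rho]                  # (r_xx(1), r_xx(2))
h = gradient_descent(rxx, rx, mu=0.5, iterations=500)
```

The iterates approach the Wiener-Hopf solution, here $(0.9, 0)$. Convergence is slow along the direction associated with the smallest eigenvalue of $R_{xx}$, which is why several hundred iterations are used even for this two-coefficient example.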


It can be shown that this algorithm converges to the global minimum given by the unique solution of the Wiener-Hopf equations if the adaptation step $\mu$ respects the inequality

$$0 < \mu < 2/\lambda_{\max}, \tag{54}$$

where $\lambda_{\max}$ is the maximum eigenvalue of the correlation matrix $R_{xx}$. This is the constraint that quantitatively defines how "slowly" the coefficient vector should move. Using the gradient definition and replacing $e(n)$ by its value given by Eqs. (25) and (26), Eq. (53) can be written

$$\mathbf{h}_{j+1} = \mathbf{h}_j + \mu E\{e(n)\,\mathbf{x}_n\}, \tag{55}$$

where $\mathbf{x}_n = (x(n-1), x(n-2), \ldots, x(n-M))^T$ is an $M$-dimensional vector having the last $M$ input samples as components, ordered to match the coefficients $h_1, \ldots, h_M$ of Eq. (25). There are two difficulties in applying this adaptation procedure in speech coding. First, the evaluation of the expectation is not practical. This is a well-known problem in adaptive filtering, which can be solved by using the stochastic gradient (SG) algorithm. The basic idea of the SG algorithm is to use the random vector $e(n)\mathbf{x}_n$ rather than its expectation. This is equivalent to replacing the objective function $\sigma_e^2$ by the random objective function $e^2(n)$. Now, the adjustment of the coefficient vector may be done at each time instant, $n$, leading to the adaptation equation

$$\mathbf{h}_{n+1} = \mathbf{h}_n + \mu\, e(n)\,\mathbf{x}_n. \tag{56}$$
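The stochastic gradient update of Eq. (56) can be sketched on a synthetic stationary signal whose optimal predictor is known. The second-order autoregressive test signal, its coefficients, the step size, and the helper names are assumptions made for this illustration; the data vector holds the $M$ most recent past samples, matching the predictor of Eq. (25).

```python
import random


def ar_signal(coeffs, length, seed=0):
    """x(n) = w(n) + sum_k coeffs[k-1] x(n-k), with unit-variance white w(n)."""
    rng = random.Random(seed)
    x = [0.0] * len(coeffs)                      # zero initial conditions
    for _ in range(length):
        past = x[-1:-len(coeffs) - 1:-1]         # x(n-1), x(n-2), ...
        x.append(rng.gauss(0.0, 1.0) + sum(c * p for c, p in zip(coeffs, past)))
    return x[len(coeffs):]


def lms_predictor(x, order, mu):
    """Per-sample update h <- h + mu * e(n) * x_n, starting from h = 0."""
    h = [0.0] * order
    for n in range(order, len(x)):
        past = [x[n - k] for k in range(1, order + 1)]
        e = x[n] - sum(hk * xk for hk, xk in zip(h, past))   # prediction error
        h = [hk + mu * e * xk for hk, xk in zip(h, past)]
    return h


TRUE_H = [1.2, -0.5]                             # hypothetical AR coefficients
x = ar_signal(TRUE_H, 30000)
h = lms_predictor(x, 2, mu=0.001)
```

With a small step size the coefficients settle near the optimal predictor and then fluctuate around it; a larger step size speeds up adaptation at the cost of larger fluctuations.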

Second, the signals $x(n)$ and $e(n)$ are not available at the receiver. To circumvent this difficulty requires estimating the gradient in terms of the reconstructed signal, $y(n)$, and the quantized prediction error, $u(n)$. By approximating the vector $e(n)\mathbf{x}_n$ by its quantized form, $u(n)\mathbf{y}_n$, the adaptation equation becomes

$$\mathbf{h}_{n+1} = \mathbf{h}_n + \mu\, u(n)\,\mathbf{y}_n, \tag{57}$$

where $\mathbf{y}_n = (y(n-1), y(n-2), \ldots, y(n-M))^T$. The last relation can be written in scalar form

$$h_k^{(n+1)} = h_k^{(n)} + \mu\, u(n)\,y(n-k), \tag{58}$$

where $h_k^{(n)}$, $k = 1, 2, \ldots, M$, are the components of the vector $h_n$. Different forms of Eq. (58) are used in speech coding algorithms based on recursive adaptation.

Up to this point, only predictors based on the previous samples of the same signal have been considered. In speech coding, these predictors are referred to as short-term predictors, in contrast to long-term predictors based on the pitch periodicity. If the short-term predictor of Eq. (49) is used at the transmitter, the reconstruction of the speech waveform at the receiver will be based on the inverse filter $1/A(z)$ with the system function

$$\frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{M} h_k z^{-k}}.$$

This is an M-pole system function; consequently, the corresponding short-term predictor is referred to as an all-pole predictor. An all-zero short-term predictor can be obtained if the prediction is based on the $Z$ previous samples of the prediction error, weighted by the all-zero predictor coefficients $g_k$; here $Z$ is the number of zeros. All-zero predictors have been found to be more robust than all-pole predictors if the information is transmitted via a noisy channel (Nishitani et al., 1982).

In low-rate speech coding, the short-term predictor based on the previous signal samples is often combined with the long-term predictor based on distant-sample pitch prediction. Moreover, the short-term predictor may contain poles as well as zeros. For these reasons, consider the configuration shown in Fig. 16, which includes a three-tap long-term predictor and a short-term predictor having $P$ poles and $Z$ zeros. Note that to preserve consistency in the notation, the order of the all-pole predictor is now denoted by $P$. The output of the long-term predictor, $w(n)$, in Fig. 16 is computed by

$$w(n) = u(n) + \sum_{i=-1}^{1} a_i\,w(n - k_p - i), \tag{59}$$

where $\{a_i\}$ are the coefficients of the pitch predictor and $k_p$ is the pitch period. Taking into account that the input to the short-term predictor is now $w(n)$, and hence that the all-zero prediction should be based on $w(n)$, the short-term predictor of Eq. (60) combines an all-pole part, driven by the $P$ previous reconstructed samples $y(n-i)$ with coefficients $h_i$, and an all-zero part, driven by the $Z$ previous samples $w(n-i)$ with coefficients $g_i$.

For deriving the recursive adaptation algorithm, consider again the squared prediction error as an objective function in a stochastic gradient approach.

FIG. 16. Predictor configuration with recursive adaptation (blocks: long-term predictor, short-term predictor, adaptation).

Then, using (59) and (60), it can be shown by an approach similar to the one used in deriving Eq. (56) that the required gradients with respect to the three sets of coefficients $\{a_k\}$, $\{h_k\}$, and $\{g_k\}$ are, respectively,

$$\nabla_a |u(n)|^2 = -2u(n)\,[w(n-k_p-1),\ w(n-k_p),\ w(n-k_p+1)]^T, \tag{61}$$

$$\nabla_h |u(n)|^2 = -2u(n)\,[y(n-1),\ y(n-2),\ \ldots,\ y(n-P)]^T, \tag{62}$$

$$\nabla_g |u(n)|^2 = -2u(n)\,[w(n-1),\ w(n-2),\ \ldots,\ w(n-Z)]^T, \tag{63}$$

and the corresponding adaptation equations are

$$a_i^{(n+1)} = a_i^{(n)} + \mu_a\,u(n)\,w(n-k_p-i) \quad \text{for } i = -1, 0, 1, \tag{64}$$

$$h_i^{(n+1)} = h_i^{(n)} + \mu_h\,u(n)\,y(n-i) \quad \text{for } i = 1, 2, \ldots, P, \tag{65}$$

$$g_i^{(n+1)} = g_i^{(n)} + \mu_g\,u(n)\,w(n-i) \quad \text{for } i = 1, 2, \ldots, Z, \tag{66}$$

where $\mu_a$, $\mu_h$, and $\mu_g$ are the adaptation step sizes for the coefficients $a_i$, $h_i$, and $g_i$, respectively.
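The scalar update of Eq. (58) uses only $u(n)$ and $y(n)$, so a decoder can repeat it without any side information. The Python sketch below illustrates this for an all-pole predictor. The uniform quantizer, the step sizes, the test signal, and all function names are assumptions made for the example and are not part of the chapter.

```python
import math

def backward_step(h, y_hist, index, mu, delta):
    """One sample of backward adaptation, shared by encoder and decoder."""
    u = index * delta                                     # quantized prediction error u(n)
    pred = sum(hk * yk for hk, yk in zip(h, y_hist))
    y = pred + u                                          # reconstructed sample y(n)
    h = [hk + mu * u * yk for hk, yk in zip(h, y_hist)]   # scalar update of Eq. (58)
    return h, [y] + y_hist[:-1], y

def encode(x, order, mu, delta):
    h, y_hist, indices, recon = [0.0] * order, [0.0] * order, [], []
    for sample in x:
        pred = sum(hk * yk for hk, yk in zip(h, y_hist))
        index = round((sample - pred) / delta)            # uniform quantizer (assumed)
        h, y_hist, y = backward_step(h, y_hist, index, mu, delta)
        indices.append(index)
        recon.append(y)
    return indices, recon

def decode(indices, order, mu, delta):
    h, y_hist, recon = [0.0] * order, [0.0] * order, []
    for index in indices:
        h, y_hist, y = backward_step(h, y_hist, index, mu, delta)
        recon.append(y)
    return recon

# Hypothetical test input: two sinusoids.
x = [math.sin(0.05 * n) + 0.3 * math.sin(0.31 * n) for n in range(2000)]
indices, local = encode(x, order=4, mu=0.05, delta=0.02)
remote = decode(indices, order=4, mu=0.05, delta=0.02)
print("decoder reproduces the encoder's reconstruction:", remote == local)
```

Only the quantizer indices cross the channel; the coefficient trajectories at the two ends coincide because both are driven by the same $u(n)$ and $y(n)$. A transmission error would break this agreement, which is one motivation for the leakage and stability measures discussed in this section.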

The adaptation procedures given by Eqs. (57) and (58) and by Eqs. (64) and (65) do not guarantee the stability of the corresponding all-pole filters. Simple analytical stability checks exist for coefficients $h_i$ only for $P \le 3$ or, respectively, for $M \le 3$. For predictors of order larger than three, the simplest solution is to use the lattice filter stability check, which is discussed in the next section.

6. Lattice Filters and Recursive Backward Adaptation

Lattice filters have significant advantages in the implementation of linear predictors. There is extensive literature covering this subject, and no attempt is made here to present all aspects of this topic. The presentation will be restricted to introducing the lattice structure and discussing one of the most-used adaptation algorithms. For details on this subject see, for example, Markel and Gray (1976), Reininger and Gibson (1985), and Honig and Messerschmitt (1984).

FIG. 18. Lattice filter configuration.

The direct realization of the predictor filter $A(z)$ is shown in Fig. 17. In adaptive filtering, such a configuration is called a transversal filter. The transversal filter implementation has the advantages of generality and simplicity. The transfer function, $A(z)$, is a linear function of the coefficients $h_i$, a fact that facilitates the derivation of adaptation algorithms. Any transfer function that can be implemented in a transversal structure can also be represented in a lattice structure such as that shown in Fig. 18. The coefficients of the equivalent lattice structure, $k_i$, $i = 1, 2, \ldots, M$, can be found from the predictor coefficients, $h_i$, by the following iterative procedure: First, set

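$$\gamma_j^{(M)} = h_j, \qquad j = 1, 2, \ldots, M.$$

Then, for $i = M, M-1, \ldots, 2, 1$, set

$$k_i = \gamma_i^{(i)},$$

and compute the coefficients $\gamma_j^{(i-1)}$, $j = 1, 2, \ldots, i-1$, of the predictor of order $i-1$ from $\gamma_j^{(i)}$ and $k_i$.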
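A minimal Python sketch of this conversion, and of the stability test it enables, is given below. It assumes the step-down recursion in the sign convention $A(z) = 1 - \sum_k h_k z^{-k}$ with $k_i = \gamma_i^{(i)}$, namely $\gamma_j^{(i-1)} = (\gamma_j^{(i)} + k_i \gamma_{i-j}^{(i)})/(1 - k_i^2)$; other references use the opposite sign for $k_i$. The function names and the numerical examples are illustrative only.

```python
def predictor_to_reflection(h):
    """Step-down recursion: predictor coefficients h_1..h_M -> lattice coefficients k_1..k_M.

    Entries stay None below the first stage where |k_i| >= 1, because the recursion
    divides by (1 - k_i**2) and the inverse filter is then not stable anyway.
    """
    gamma = list(h)                       # gamma_j^(M) = h_j
    k = [None] * len(h)
    for i in range(len(h), 0, -1):        # i = M, M-1, ..., 1
        ki = gamma[i - 1]                 # k_i = gamma_i^(i)
        k[i - 1] = ki
        if abs(ki) >= 1.0:
            break
        gamma = [(gamma[j] + ki * gamma[i - 2 - j]) / (1.0 - ki * ki)
                 for j in range(i - 1)]   # coefficients of the order-(i-1) predictor
    return k

def reflection_to_predictor(k):
    """Step-up recursion, the inverse of predictor_to_reflection (same sign convention)."""
    h = []
    for i, ki in enumerate(k, start=1):
        h = [h[j] - ki * h[i - 2 - j] for j in range(i - 1)] + [ki]
    return h

def inverse_filter_is_stable(h):
    """1/A(z) is stable if every reflection coefficient has magnitude below one."""
    return all(ki is not None and abs(ki) < 1.0 for ki in predictor_to_reflection(h))

stable_h = [1.4, -0.45]      # A(z) = (1 - 0.9 z^-1)(1 - 0.5 z^-1): poles inside the unit circle
unstable_h = [1.6, -0.55]    # A(z) = (1 - 1.1 z^-1)(1 - 0.5 z^-1): one pole outside
print(predictor_to_reflection(stable_h), inverse_filter_is_stable(stable_h))
print(inverse_filter_is_stable(unstable_h))
```

The magnitude test does not depend on the sign convention chosen for $k_i$, so the stability verdict is the same under either convention.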

A similar iterative procedure can be used to find the transversal filter coefficients from the lattice filter coefficients (Markel and Gray, 1976; Rabiner and Schafer, 1978; Furui, 1989).

FIG. 19. All-pole lattice filter configuration.

Lattice filters can be used with both block and recursive adaptation algorithms. In block adaptation algorithms, the lattice coefficients are obtained as a by-product of the Levinson-Durbin algorithm for solving the Wiener-Hopf equation. Hence, lattice implementation can be easily used for any predictor that uses block adaptation, and no change in the standard adaptation algorithm (based on the Wiener-Hopf equations) is required. On the other hand, recursive adaptation algorithms for lattice filters are generally more complex than the corresponding algorithms for transversal filters. A variety of adaptation techniques are presented in Reininger and Gibson (1985). The subsequent presentation is restricted to the simple gradient technique, which has been called in the adaptive filtering literature the least mean square (LMS) algorithm. Again, backward recursive adaptation based on the reconstructed signal, $y(n)$, will be considered, and for the reasons discussed previously, the filter input will be denoted by $w(n)$. The corresponding all-pole lattice filter is shown in Fig. 19; note that this is precisely the inverse of the filter in Fig. 18. The signals $e_j(n)$ and $r_j(n)$, called the forward and the backward prediction error, respectively, are given by

$$e_j(n) = e_{j+1}(n) + k_{j+1}^{(n)}\,r_j(n-1), \qquad r_{j+1}(n) = r_j(n-1) + k_{j+1}^{(n)}\,e_j(n), \tag{67}$$

where $k_j^{(n)}$ are the lattice filter coefficients at time $n$. The update of the coefficients $k_j^{(n)}$ according to the LMS algorithm is given by the relation

$$k_j(n+1) = \frac{\alpha_j(n+1)}{\beta_j(n+1)}, \tag{68}$$

where

$$\alpha_j(n+1) = (1-\mu)\,\alpha_j(n) + \lambda\,e_j(n)\,r_j(n-1), \qquad \beta_j(n+1) = (1-\mu)\,\beta_j(n) + e_j^2(n). \tag{69}$$

Here $(1-\mu)$ represents an exponential fading memory coefficient, and $\lambda$ is a leakage factor for improving noisy channel performance.

As mentioned previously, there are a variety of different lattice adaptation algorithms. For example, the least squares (LS) algorithm is an exact solution for the best lattice coefficients that minimize a sum of exponentially weighted square prediction errors (Morf and Lee, 1979; Reininger and Gibson, 1985). For a stationary process, the LS algorithm converges faster than the LMS algorithm presented above; however, the advantage in performance for speech coding applications is relatively small and does not justify the increased complexity.

The lattice filter coefficients, $k_j$, are called reflection coefficients or PARCOR coefficients. The name reflection coefficients is a result of an analogy with propagation in an acoustic tube (Kelly and Lochbaum, 1962; Flanagan, 1972). The name PARCOR (partial correlation) is related to the equivalent definition of the coefficient $k_j$ as the cross-correlation of the forward and backward prediction error (Itakura and Saito, 1971; Markel and Gray, 1976; Furui, 1989).

Lattice filters have some significant advantages over the transversal filters that compensate for the more complex adaptation algorithm required for recursive adaptation. It can be shown that the signals $e_j(n)$ and $r_j(n)$ form an orthogonal set, resulting in a faster convergence rate for the stochastic gradient algorithm than that of the corresponding transversal filter. The optimal reflection coefficients in a lattice filter do not depend on the filter order, while for the transversal filter, all the coefficients $h_k$, $k = 1, 2, \ldots, M$, have to be recomputed if the filter order $M$ changes. Finally, the stability check for the lattice implementation is very simple: The inverse filter, $1/A(z)$, is stable if $|k_i| < 1$ for $i = 1, 2, \ldots, M$. This simple stability check facilitates the use of high-order lattice filters in backward recursive adaptation configurations. For transversal filters of order larger than three, the simplest stability check is based on converting the predictor coefficients into reflection coefficients of the equivalent lattice filter.

C. Vector Quantization

A vector quantizer (VQ) is a mapping from a vector $x$ in the k-dimensional Euclidean space $R^k$ into a finite set of output vectors $\{y_j\}_{j=1}^{N}$. The set of N k-dimensional vectors $y_j$, $j = 1, 2, \ldots, N$, is called the codebook, and a particular codebook entry, $y_j$, is called a codevector. Associated with a vector quantizer is a partition of $R^k$ into N regions (cells) $S_j$, where $S_j$ consists of all vectors $x \in R^k$ that are quantized into $y_j$. The quantized value of the vector $x$ will be denoted by $Q(x)$, where $Q$ is the VQ function.

The performance of a VQ is evaluated by a distortion measure, $d(x, Q(x))$, which assigns to each pair of input and output vectors a value indicating the "distance" or the dissimilarity between the input vector and the corresponding output vector. A VQ is optimal if, for given dimension $k$ and codebook size $N$, it achieves the lowest average distortion $E\{d(x, Q(x))\}$. There are two necessary conditions for a VQ to be optimal. First, given a codebook $\{y_j\}_{j=1}^{N}$, the encoder can do no better than select for each input vector $x$ its closest neighbor in the codebook, which actually minimizes the distortion measure:

$$d(x, Q(x)) = \min_j d(x, y_j), \tag{70}$$

where $\min_j$ indicates the minimum with respect to the index $j$. Second, given a cluster of input vectors, $x \in S_j$, a unique codevector $y_j$ exists that minimizes the average distortion for the given cluster,

$$E\{d(x, y_j) \mid x \in S_j\} = \min_{u \in R^k} E\{d(x, u) \mid x \in S_j\}. \tag{71}$$

The vector $y_j$ is called the centroid of the set $S_j$; the existence of a unique centroid has been proven for the distortion measures of interest. In the case of squared Euclidean distance, $d(x, Q(x)) = \|x - Q(x)\|^2$, the optimality conditions for a VQ become

(1) For a given partition $S_j$, $j = 1, 2, \ldots, N$, the codebook must satisfy the centroid condition

$$y_j = E\{x \mid x \in S_j\}. \tag{72}$$

(2) For a given codebook, the partition should satisfy the nearest neighbor condition

$$S_j \subseteq \{x : x \in R^k,\ \|x - y_j\| \le \|x - y_i\| \text{ for any } i\}. \tag{73}$$

This is a generalization of the optimality conditions for a scalar quantizer given by Eqs. (17) and (18).

The basic idea of vector quantization is contained in Shannon's (1948) source coding theory. However, the applications to data compression in general and speech coding in particular started to develop only in the 1980s. There are two main reasons for this situation. First, the Shannon theory did not provide constructive techniques for designing VQs. Second, the computational complexity of the quantization process using the nearest neighbor search increases exponentially with the dimension and the rate expressed in bits/sample; significant progress in digital hardware was required to make vector quantization practical. This section includes a brief summary of the vector quantization technique as needed for the presentation of the following material. More details on this subject can be found in Gersho and Cuperman (1983), Gray (1984), Gersho (1986), and Adoul (1987).

The VQ input vector may consist of k consecutive waveform samples or of any set of k parameters, for example, k autocorrelation coefficients estimated for a given speech frame. Figure 20 shows a schematic configuration in which a VQ is used for encoding speech waveforms. At the transmitter, each input vector is successively compared to all the codebook entries according to the given distortion criterion, $d(x, y_j)$, $j = 1, 2, \ldots, N$. The index of the nearest neighbor codevector is transmitted to the receiver. The receiver has a copy of the transmitter codebook and retrieves the corresponding codevector by a simple look-up procedure. A significant data compression ratio may be obtained because a single index is transmitted, rather than a k-component vector. Similarly to the scalar case, it is sometimes convenient to think of a VQ as generating two outputs, the codevector, $y_j$, and the index of the codevector, $j$. The decoding operation by which the codevector is assigned to its index is sometimes called inverse quantization and is denoted by $Q^{-1}$.

FIG. 20. Vector quantization block diagram.

The usual distortion measure for waveform coding is the square of the Euclidean distance between vectors, $d(x, y) = \|x - y\|^2$. Alternatively, the weighted Euclidean square error may be used:

$$d(x, y) = (x - y)^T W_x (x - y), \tag{74}$$

where $W_x$ is a weighting matrix, which in general may depend on the vector $x$. The centroid for this case is given by

$$y_j = \left(E\{W_x \mid x \in S_j\}\right)^{-1} E\{W_x\,x \mid x \in S_j\}.$$

On the other hand, spectral distortion measures are used for the quantization of the short-term predictor parameters (LPC coefficients). For example, for the likelihood ratio distortion measure, the input vector components are the autocorrelation coefficients, $r_{xx}(j)$, $j = 0, 1, 2, \ldots, M$, while the codevector is of the form $y = (1, h_1, h_2, \ldots, h_M)^T$, where $h_i$ are the LPC coefficients. The likelihood ratio may be computed by

$$d(x, y) = \sum_{j=-M}^{M} r_{xx}(j)\,r_{hh}(j)/\alpha_M - 1,$$

where $r_{hh}$ is the autocorrelation function of the sequence $(1, -h_1, -h_2, \ldots, -h_M)$, and $\alpha_M$ is the residual energy resulting from filtering a speech sequence with autocorrelation $r_{xx}(j)$, $j = 1, 2, \ldots, M$, by its Mth-order optimal linear predictor. The likelihood ratio may be considered a particular case of the more general Itakura-Saito distortion measure (Itakura and Saito, 1968). The detailed presentation of these distortion measures is beyond the scope of this paper. The interested reader is referred to Gray and Markel (1976), Gray et al. (1980a), and Buzo et al. (1980).
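The encoder and decoder operations described above, together with the weighted measure of Eq. (74), can be sketched in a few lines of Python. The codebook, the input vector, the weighting matrix, and the function names below are hypothetical values chosen for illustration; the weighting matrix is kept constant, although in general it may depend on the input vector.

```python
import math

def squared_error(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def weighted_squared_error(x, y, weights):
    """(x - y)^T W (x - y) for a weighting matrix W given as a list of rows, as in Eq. (74)."""
    d = [a - b for a, b in zip(x, y)]
    return sum(d[i] * weights[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))

def vq_encode(x, codebook, distortion=squared_error):
    """Exhaustive nearest-neighbor search; returns the index sent to the receiver."""
    return min(range(len(codebook)), key=lambda j: distortion(x, codebook[j]))

def vq_decode(index, codebook):
    """Inverse quantization: a table look-up in the receiver's copy of the codebook."""
    return codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [0.0, 4.0]]    # hypothetical, N = 4, k = 2
x = [0.2, 0.9]
W = [[1.0, 0.0], [0.0, 0.01]]                                  # de-emphasizes the second component
j_plain = vq_encode(x, codebook)
j_weighted = vq_encode(x, codebook, lambda a, b: weighted_squared_error(a, b, W))
bits_per_sample = math.log2(len(codebook)) / len(x)
print(j_plain, vq_decode(j_plain, codebook), j_weighted, bits_per_sample)
```

The same input is mapped to different codevectors under the two measures, which shows how the choice of distortion measure shapes the partition of the input space.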


The overall VQ performance may be quantified using averages of the distortion criteria just presented. If the input probability distribution is known, the average is obtained by taking the mathematical expectation; otherwise, the long-term sample average computed on a typical speech data base is used. The optimality conditions given by Eqs. (70) and (71) can be used for the design of a VQ codebook by minimizing the average error over a training data set. The corresponding algorithm is similar to the K-means algorithm in pattern recognition, and it is a generalization of Lloyd's algorithm discussed in Section II.A. One of the first applications of this type of algorithm to speech coding is found in Adoul et al. (1979). The first application to the encoding of the LPC parameters in speech coding was introduced by Buzo et al. (1980). A rigorous presentation for training sequences and general distortion measures is found in Linde et al. (1980), and the algorithm is sometimes referred to as the Linde, Buzo, and Gray (LBG) algorithm. The basic algorithm can be summarized as follows:

(1) Start with an initial codebook using, for example, the first N vectors of the training sequence.
(2) For the given codebook, find the optimal clustering by encoding the training sequence using the nearest neighbor rule of Eq. (70). If the average distortion is small enough, stop.
(3) For the given clustering, replace the codebook by the optimal centroids calculated by Eq. (71). Go to step (2).

For a training set of data, the centroid condition of Eq. (72) becomes

$$y_j = \frac{1}{L_j} \sum_{x_i \in S_j} x_i,$$

where $L_j$ is the number of vectors in the cluster $S_j$. This algorithm either reduces the average distortion at each step or leaves it unchanged. The resulting codebook achieves a local minimum and, generally, need not be the optimal codebook. However, empirical procedures can be used to test the resulting codebook and avoid a local optimum with performance significantly lower than the truly optimal codebook. More details on codebook design can be found in Buzo et al. (1980), Gray et al. (1980b), Gray (1984), and Cuperman and Gersho (1985).

A fundamental result of the vector quantization theory shows that for high dimensions, a VQ achieves asymptotically the rate-distortion bound. Vector quantizers significantly outperform scalar quantizers even for the case of memoryless sources. Figure 21 shows the gain of a VQ over a scalar quantizer for a source that consists of a sequence of independent random variables with uniform, gamma, Laplace, and Gaussian distributions in the high-rate region (Cuperman, 1989). For speech waveform coding, VQ gains as large as 7 dB for dimension k = 8 have been found (Gersho and Cuperman, 1983). Figure 22 compares the results obtained for encoding a speech waveform at the rate of 2 bits/sample by a scalar quantizer and a VQ of dimension four; the quantizer output points and the VQ codebook were optimized iteratively. Of course, the results obtained by scalar quantization can be significantly improved by using tree/trellis coding, entropy coding, and other delayed encoding procedures.

FIG. 21. Vector quantization gain over scalar quantization for iid random sequences (horizontal axis: vector dimension, k). Adapted from Cuperman (1989), reprinted by permission of IEEE, copyright © 1989 IEEE.

FIG. 22. Quantized speech waveforms at 16 kb/s (time scale bar: 32 ms). (a) Original signal. (b) Scalar quantization. (c) Vector quantization with dimension 4.
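The codebook design procedure summarized in steps (1) to (3) can be sketched in Python as follows, for the squared Euclidean distortion. This is a minimal illustration rather than the published algorithm: the stopping rule (a small relative decrease of the average distortion), the handling of empty cells, the toy training set, and the function names are all assumptions.

```python
def squared_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest(x, codebook):
    """Nearest neighbor rule: index of the codevector closest to x."""
    return min(range(len(codebook)), key=lambda j: squared_error(x, codebook[j]))

def centroid(cell):
    """Sample mean of the vectors in one cluster."""
    return [sum(column) / len(cell) for column in zip(*cell)]

def design_codebook(training, size, threshold=1e-6, max_iterations=100):
    codebook = [list(v) for v in training[:size]]          # step (1): first N training vectors
    history = []
    for _ in range(max_iterations):
        cells = [[] for _ in codebook]                     # step (2): cluster the training set
        total = 0.0
        for x in training:
            j = nearest(x, codebook)
            cells[j].append(x)
            total += squared_error(x, codebook[j])
        history.append(total / len(training))
        if len(history) > 1 and history[-2] - history[-1] <= threshold * history[-2]:
            break                                          # assumed stopping rule
        codebook = [centroid(cell) if cell else old        # step (3): replace by centroids
                    for cell, old in zip(cells, codebook)]
    return codebook, history

# Toy training set: two clusters centered on (0, 0) and (10, 10), interleaved.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
training = []
for dx, dy in offsets:
    training.append([dx, dy])
    training.append([10 + dx, 10 + dy])

codebook, history = design_codebook(training, size=2)
print(codebook)
print(history)
```

The recorded average distortion never increases from one pass to the next, which is the monotonic behavior noted in the text. On less favorable data the procedure can stop at a local minimum that depends on the initial codebook.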


However, tree/trellis coding, entropy coding, and other delayed encoding procedures can also improve the results obtained by vector quantization. Actually, a fair comparison should be based on systems having the same computational complexity and delay; such a comparison involves considerations that may depend on the particular implementation environment.

The training procedure described above is only one of the possible approaches for obtaining a codebook. A stochastic codebook can be obtained by using (pseudo) random numbers as codevector components. Also, algebraic and geometric lattice structures can be used for codebook generation (Gersho, 1979; Sloane, 1981; Conway and Sloane, 1983; Adoul and Lamblin, 1987). A good presentation of lattice-based vector quantization can be found in Gibson and Sayood (1988). Generally, stochastic and algebraic codebooks achieve lower performance, but may have such advantages as robustness to speaker variability and lower implementation complexity.

1. Suboptimal Vector Quantization

A significant reduction in the computational complexity of VQs may be achieved by using codebooks with structures amenable to fast search procedures. These vector quantizers achieve lower performance than a VQ of the same dimension and rate having an unconstrained codebook and are consequently called suboptimal. For example, a tree-structured codebook used in conjunction with a tree search procedure leads to a computational complexity that grows only linearly with the dimension at a fixed rate (Buzo et al., 1980; Gray and Abut, 1982). In a binary tree codebook, a choice between two possible vectors is made at the first stage; then, only vectors connected in the tree structure to the chosen branch are considered. This binary choice is repeated at each tree node so that the number of candidate codevectors is divided by two at each stage. Tree codebooks require more memory than optimal VQs in order to store the codevectors used for intermediate decisions (binary tree codebooks require roughly twice as much memory as the unconstrained codebooks).
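The binary tree search just described can be sketched as follows. The level-by-level storage layout, the toy tree, and the function names are assumptions made for this illustration; a practical tree codebook would be designed from training data.

```python
def squared_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def tree_search(x, levels):
    """Binary tree search.

    levels[d] holds the 2**(d+1) test vectors at depth d; the children of node p
    are stored at positions 2p and 2p+1 of the next level, and the last level is
    the codebook proper.
    """
    node, evaluations = 0, 0
    for vectors in levels:
        left, right = 2 * node, 2 * node + 1
        evaluations += 2                              # two distortion computations per stage
        if squared_error(x, vectors[left]) <= squared_error(x, vectors[right]):
            node = left
        else:
            node = right
    return node, evaluations

levels = [
    [[0.0, 0.0], [10.0, 10.0]],                             # intermediate decision vectors
    [[-1.0, 0.0], [1.0, 0.0], [9.0, 10.0], [11.0, 10.0]],   # codevectors, N = 4
]
index, evaluations = tree_search([10.8, 9.9], levels)
stored_vectors = sum(len(vectors) for vectors in levels)
print(index, evaluations, stored_vectors)
```

For a codebook of N = 1024 codevectors, a search of this kind needs 2 log2(N) = 20 distortion computations instead of 1024, while the tree stores 2N - 2 vectors, in line with the memory remark above.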
A suboptimal approach that reduces both the computational complexity and the memory is the multistage vector quantization (Juang and Gray, 1982). A multistage VQ uses multiple codebooks, each stage having the quantization error of the previous stage as input. The corresponding encoding operation is equivalent to approximating the input vector by a sum of codevectors, each drawn from a different codebook. The result is suboptimal because of the constrained structure of the codebook and the suboptimal character of the search. Indeed, for given codebooks, an optimal search should consider all the possible combinations of vectors drawn from different codebooks; such a search would lead to the same complexity as the exhaustive search of the unconstrained codebook.
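A two-stage example of the sequential search is sketched below in Python. The stage codebooks, the input vector, and the function names are hypothetical. Each stage quantizes the error left by the previous one, and the decoder adds the selected codevectors.

```python
def squared_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest(x, codebook):
    return min(range(len(codebook)), key=lambda j: squared_error(x, codebook[j]))

def multistage_encode(x, codebooks):
    """Sequential search: each stage quantizes the residual of the previous stage."""
    residual, indices = list(x), []
    for codebook in codebooks:
        j = nearest(residual, codebook)
        indices.append(j)
        residual = [r - c for r, c in zip(residual, codebook[j])]
    return indices

def multistage_decode(indices, codebooks):
    """The reconstruction is the sum of one codevector per stage."""
    y = [0.0] * len(codebooks[0][0])
    for j, codebook in zip(indices, codebooks):
        y = [a + c for a, c in zip(y, codebook[j])]
    return y

stages = [
    [[0.0, 0.0], [4.0, 4.0]],                              # stage 1: 1 bit
    [[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]],    # stage 2: 2 bits
]
indices = multistage_encode([4.9, 4.2], stages)
reconstruction = multistage_decode(indices, stages)
print(indices, reconstruction)
```

In this toy case the two stages store 2 + 4 = 6 vectors and require 6 distortion computations, whereas the equivalent unconstrained codebook would contain 2 x 4 = 8 codevectors; the gap widens rapidly as the number of bits per stage grows.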


Finally, the class of suboptimal VQs includes the algebraic codebooks, lattice quantization, and the sparse codebooks mentioned in the previous subsection.

2. Adaptive Vector Quantization

When vector quantization is applied to speech coding, vectors drawn from a nonstationary process must be matched with codevectors drawn from a fixed codebook. Taking into account the nonstationarity of the speech signals, adaptive techniques can be used to improve VQ performance at given rate and dimension. The basic idea is to adapt the VQ codebook in order to match the changes in the statistics of the input signal. A variety of adaptation techniques are described by Gray (1984). The presentation here will be restricted to the gain-adaptive vector quantization (Chen and Gersho, 1985, 1987b), which is actually a generalization of the adaptive scalar quantization. Backward gain-adaptive vector quantization is used in the low-delay speech coding systems to be presented in Section III. A gain-adaptive VQ predicts the norm of the current input vector, $x_n$, based on the norms of past quantized vectors. For example, denoting by $\hat{G}_n$ the prediction of $\|x_n\|$, a linear prediction formulation may be used:

$$\hat{G}_n = \sum_{i=1}^{M_g} a_i\,\|Q(x_{n-i})\|, \tag{76}$$

where $M_g$ is the predictor order, $a_i$ are the predictor coefficients, and $n$ is the time index. The prediction coefficients can be found by using one of the linear prediction standard procedures applied to the sequence of past quantized vector norms taken as an input signal. Note that the previous equation uses backward adaptation, and the predicted gain need not be sent to the receiver since it can be generated by a copy of the gain predictor, as shown in Fig. 23. Further note that Fig. 23 is a rather straightforward generalization of Fig. 11 (Section II). By replacing the norm of the vectors by the logarithm of the norm (log-norm), the multiplications and divisions in Fig. 23 become additions and subtractions, and the corresponding configuration is linearized (Watts and Cuperman, 1988).

FIG. 23. Backward gain-adaptive vector quantization block diagram.

Figure 24 shows a generalized adaptive configuration in which a fixed VQ codebook is transformed into an adaptive codebook by an adaptive filter. The filter adaptation is based on the sequence of input vectors, $x_n$, for forward adaptation, or on the sequence of quantized past input vectors, $Q(x_{n-i})$, for backward adaptation. The filter need not be linear. This configuration includes as particular cases many of the known adaptation techniques. For example, the switched adaptation presented by Cuperman and Gersho (1982, 1985) is obtained if the adaptive filter chooses a subset of the fixed codebook to be used as adaptive codebook based on an estimate of the input signal statistics. On the other hand, the finite state vector quantization presented by Gray (1984) is obtained if the adaptive filter associates each quantized vector, $Q(x_n)$, to a subset of the fixed codebook that will be used to quantize the next input vector, $x_{n+1}$. Finally, if the adaptive filter is a linear predictor, the configurations obtained are similar to the analysis-by-synthesis systems to be presented in Section III.

FIG. 24. Generalized adaptive vector quantization.

D. Quantization of the LPC Coefficients

The quantization of the short-term predictor coefficients (LPC coefficients) plays a central role in speech coding because of its importance in the low-rate forward adaptive speech coding systems. The coefficients $h_j$ have a wide dynamic range that would require a large number of bits per coefficient. Moreover, if the LPC coefficients are quantized directly, the stability of the resulting inverse filter cannot be guaranteed. Because of these unfavorable quantization properties, the LPC coefficients $h_j$ are never quantized directly.

A possible solution is to quantize the reflection (PARCOR) coefficients $k_j$. These coefficients have a relatively small dynamic range because $|k_j| < 1$. The stability of the inverse filter is ensured if the magnitude of the quantized coefficients remains smaller than one, a condition that is very simple to control. The reflection coefficients are distributed nonuniformly; for example, for voiced frames $k_1$ is close to +1, and $k_2$ is close to -1, while higher-order coefficients are distributed around zero. This situation can be exploited for designing efficient scalar quantization schemes. A typical scalar quantization procedure for the reflection coefficients uses 40-50 bits for each speech frame. A possible alternative is to transform the reflection coefficients into the so-called log-area ratio coefficients, $v_j$, where

$$v_j = \log\frac{1 - k_j}{1 + k_j}. \tag{77}$$
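The transformation of Eq. (77) and its inverse are easily sketched in Python. The natural logarithm is assumed here (the base is not stated in the text), and the uniform quantizer, its step size, and the sample reflection coefficients are hypothetical.

```python
import math

def log_area_ratio(k):
    """Eq. (77), evaluated with the natural logarithm (assumed base)."""
    return math.log((1.0 - k) / (1.0 + k))

def reflection_from_lar(v):
    """Inverse of Eq. (77): k = (1 - e^v) / (1 + e^v) = -tanh(v / 2)."""
    return -math.tanh(v / 2.0)

def quantize_uniform(v, step):
    """Uniform scalar quantizer applied in the log-area ratio domain (assumed)."""
    return step * round(v / step)

reflection = [0.95, -0.8, 0.3, -0.05]     # hypothetical reflection coefficients
step = 0.25                                # hypothetical step size
decoded = [reflection_from_lar(quantize_uniform(log_area_ratio(k), step))
           for k in reflection]
print(decoded)
# For moderate values of v, the inverse mapping always returns a magnitude below
# one; in floating point, extremely large |v| would saturate to exactly 1.
print(all(abs(k) < 1.0 for k in decoded))
```

Because the inverse mapping returns a value of magnitude below one for any moderate quantized log-area ratio, the stability condition on the reflection coefficients is preserved without an explicit check, and a uniform step in this domain gives finer resolution for coefficients close to +1 or -1.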

For example, the LPC-10 speech coding standard to be discussed in Section III uses log-area ratios to quantize the first two coefficients, and direct quantization of reflection coefficients for the rest of the coefficients. The VSELP digital cellular speech coding standard uses direct quantization of the reflection coefficients (see Section III). The performance of systems using scalar quantization of the reflection coefficients or their derivatives degrades if the number of bits per frame decreases below 35-40.

Better results at lower rates may be obtained by using the line spectrum pair (LSP) parameters (Sugamura and Itakura, 1981, 1986). The mathematics of LSP parameters sometimes obscures their simple physical interpretation. Consider the vocal tract model as a nonuniform-section acoustic tube consisting of $M$ sections of the same length. The tube is open at the lips section, numbered zero, and closed on a matched impedance at the glottis section, numbered $M + 1$. Then the reflection coefficients, $k_j$, are equal to the actual wave reflection coefficients caused by the impedance mismatch at the junction of sections $j$ and $j + 1$. If the acoustic tube is open or closed at the glottis section, it becomes a lossless tube, and the corresponding transfer function has a line spectrum structure at frequencies $f_1, g_1, f_2, g_2, \ldots, f_{M/2}, g_{M/2}$ ($M$ is assumed here to be even in order to simplify the discussion). The pair $f_j$, $g_j$ is called the line spectrum pair. Figure 25 illustrates the relation between the LPC spectrum and the LSP parameters for a typical voiced sound (Sugamura and Itakura, 1986). The notation $f_j$, $g_j$ is a result of the following mathematical considerations. If $A(z)$ is the transfer function of the acoustic tube closed on a matched impedance, then the transfer functions for the lossless cases are, respectively,

$$P(z) = A(z) - z^{-(M+1)}A(z^{-1}) \tag{78}$$

and

$$Q(z) = A(z) + z^{-(M+1)}A(z^{-1}). \tag{79}$$
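The unit-circle roots of the two polynomials can be located numerically, as in the Python sketch below. It assumes $A(z) = 1 - \sum_k h_k z^{-k}$ and the exponent $-(M+1)$ in Eqs. (78) and (79), with $M$ even. Under these assumptions, multiplying by $e^{j\omega(M+1)/2}$ on the unit circle shows that $P$ vanishes where the imaginary part of $e^{j\omega(M+1)/2}A(e^{j\omega})$ is zero and $Q$ vanishes where its real part is zero. The grid-and-bisection root finder, the example predictor, and the function names are illustrative. Which of the two root sets contains the lowest frequency depends on the sign conventions adopted for $A(z)$, so the sketch checks only that the two sets alternate.

```python
import cmath
import math

def rotated_response(h, omega):
    """exp(j*omega*(M+1)/2) * A(exp(j*omega)), with A(z) = 1 - sum_k h_k z^-k (assumed)."""
    m = len(h)
    a = 1.0 - sum(hk * cmath.exp(-1j * omega * (k + 1)) for k, hk in enumerate(h))
    return cmath.exp(1j * omega * (m + 1) / 2.0) * a

def sign_change_roots(f, grid=2048):
    """Roots of a real function on the open interval (0, pi): grid scan, then bisection."""
    roots = []
    a = math.pi / grid
    fa = f(a)
    for i in range(2, grid):
        b = math.pi * i / grid
        fb = f(b)
        if fa * fb < 0.0:
            lo, hi, flo = a, b, fa
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                fm = f(mid)
                if flo * fm <= 0.0:
                    hi = mid
                else:
                    lo, flo = mid, fm
            roots.append(0.5 * (lo + hi))
        a, fa = b, fb
    return roots

def line_spectrum_frequencies(h):
    """Angular frequencies in (0, pi) of the unit-circle roots of P(z) and Q(z)."""
    p_roots = sign_change_roots(lambda w: rotated_response(h, w).imag)   # P(z) = 0
    q_roots = sign_change_roots(lambda w: rotated_response(h, w).real)   # Q(z) = 0
    return p_roots, q_roots

def roots_alternate(p_roots, q_roots):
    """True if the two sets of frequencies interlace on the frequency axis."""
    tagged = sorted([(w, "P") for w in p_roots] + [(w, "Q") for w in q_roots])
    return all(first[1] != second[1] for first, second in zip(tagged, tagged[1:]))

h = [1.4, -0.45]    # hypothetical stable predictor: A(z) = (1 - 0.9 z^-1)(1 - 0.5 z^-1)
p_roots, q_roots = line_spectrum_frequencies(h)
print(p_roots, q_roots, roots_alternate(p_roots, q_roots))
```

The fixed roots at $z = 1$ and $z = -1$ are excluded by restricting the scan to the open interval, so for an Mth-order predictor with $M$ even each list should contain $M/2$ frequencies.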

FIG. 25. Speech spectrum envelope and LSP parameter locations (horizontal axis: frequency in kHz). Adapted from Sugamura and Itakura (1986), copyright 1986 Elsevier Science Publishers, reprinted by permission of Elsevier Science Publishers, BV (North Holland), NY.

It can be shown that the roots of $P(z)$ and $Q(z)$ are all on the unit circle in alternate positions. The frequencies corresponding to the roots of $P(z)$ are denoted by $f_j$, while the frequencies corresponding to the roots of $Q(z)$ are denoted by $g_j$. The frequencies $f_j$ and $g_j$ alternate on the frequency scale as a result of the alternating positions of the respective roots on the unit circle. The alternating roots property allows for simple stability control for the LSP coefficients; actually, the stability condition is simply

$$f_1 < g_1 < f_2 < g_2 < \cdots < f_{M/2} < g_{M/2}. \tag{80}$$

The LSP coefficients have approximately uniform spectral sensitivities as well as good quantization and interpolation properties. For example, Sugamura and Itakura (1986) showed that 41-44 bits per frame are needed to achieve spectral distortion less than 1 dB using reflection coefficients, while for LSP coefficients the same result can be obtained with only 33 bits per frame. However, the advantages of the LSP coefficients over the reflection coefficients remain controversial. Atal et al. (1989) found that the two sets of coefficients perform equally well: Spectral distortion close to 1 dB was obtained for the LSP coefficients as well as for the reflection coefficients using only 32 bits/frame. There are many differences between the two experiments reported above, including different speech data bases and different parameter estimation techniques.

All the quantization procedures discussed above use scalar quantization. A significant reduction in the number of bits needed for quantizing the LPC parameters can be obtained by using vector quantization. Actually, quantization of LPC parameters was one of the first applications of vector quantization to speech coding (Linde et al., 1980; Buzo et al., 1980; Gray et al., 1981). Vector quantization can be applied directly to the LPC coefficients using spectral distortion measures such as the Itakura-Saito or likelihood ratio, or it can be applied to the LSP coefficients. An example of potential improvements that may be obtained by using vector quantization (Juang et al., 1982) shows that a VQ with a 10-bit codebook achieves the same average spectral distortion as scalar quantization using 25 bits per frame.

Despite the potential advantages discussed above, vector quantization of the LPC coefficients has found little use in the last generation of speech coding systems. For example, the VSELP cellular standard uses scalar quantization of the reflection coefficients, while the DOD standard uses scalar quantization of the LSP parameters (both these systems will be described in Section III). There are two reasons for this situation. First, in most applications the number of bits per frame available for LPC parameters encoding is on the order of 25-40; codebooks of 25-40 bits make the optimal vector quantization impractical. The alternative is to use multistage vector quantization, which is suboptimal and reduces the gain with respect to scalar quantization. Second, it is believed that a VQ may "lock" on the spectral characteristics of the speakers in the training set and give unsatisfactory performance on some unexpected speakers outside the training set. It is the opinion of this author that the latter problem can be solved by using a large training data base and a significant amount of out-of-training tests. This procedure may be expensive; however, it must be performed only once in the design phase of the speech coding system. Given the existing tendency toward lower rates and the progress in efficient transmission error control, vector quantization of the LPC coefficients may become essential in achieving good performance.
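The first of these two reasons can be made concrete with a little arithmetic, sketched here in Python. The vector dimension of 10 and the split of bits among stages are hypothetical values chosen for illustration, and the operation count is a rough figure of one multiply-add per codevector component.

```python
def full_search_cost(bits, dimension):
    """Codebook entries and multiply-adds per input vector for an unconstrained codebook."""
    codevectors = 2 ** bits
    return codevectors, codevectors * dimension

def multistage_cost(stage_bits, dimension):
    """Same quantities for a multistage VQ with a sequential search."""
    codevectors = sum(2 ** bits for bits in stage_bits)
    return codevectors, codevectors * dimension

dimension = 10                                   # hypothetical number of LPC parameters per frame
print(full_search_cost(10, dimension))           # 10-bit codebook
print(full_search_cost(25, dimension))           # 25-bit codebook
print(multistage_cost([9, 8, 8], dimension))     # 25 bits split over three stages
```

A 10-bit codebook has 1,024 entries, while an unconstrained 25-bit codebook would need more than 33 million; splitting the same 25 bits over three stages brings the storage and search effort back to the order of a thousand codevectors, at the price of the suboptimality discussed above.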

III. SPEECH CODING SYSTEMS

Speech coding systems have traditionally been classified into waveform coders and analysis-synthesis coders (Flanagan et al., 1979; Jayant and Noll, 1984). The latter category is also called source coders (this term has a different meaning in information theory) or vocoders (a contraction of “voice coders”). The objective of a waveform coder is to produce a digital representation of the input signal that allows a precise reproduction of the amplitude-vs.-time waveform. On the other hand, the analysis-synthesis coders (vocoders) extract perceptually significant parameters from the input signal in order to synthesize a reproduced (output) signal that is acceptable to a human receiver. Waveform coders are generally signal-independent, while analysis-synthesis coders are based on a model of speech production and hence are signal-dependent. Analysis-synthesis systems achieve a higher data compression ratio than waveform coders; however, low rates are obtained in analysis-synthesis coders at the expense of a fundamental limitation on subjective speech quality. Speech reproduced by analysis-synthesis systems is characterized by a typical
distortion that led to the definition of a separate subjective quality called vocoder quality, as opposed to the toll quality and the communications quality used to characterize waveform coders.

The dichotomy of waveform vs. analysis-synthesis coders has begun to blur with the introduction of modern speech coding systems in the 1980s. Traditionally, the quality of waveform reproduction in waveform coding systems has been judged by the mean square error (MSE) or another similar criterion; note that SNR and SEGSNR are both based on the MSE. The new speech coding systems strive to reproduce the input waveform under a perceptual error criterion. As such, these systems combine the waveform coding idea of precise waveform reproduction with the analysis-synthesis idea of quality as judged by a human receiver. Although most researchers consider these systems waveform coders and continue to use MSE-based criteria, such criteria are inadequate for evaluating system performance; the corresponding systems can be classified neither as waveform coders nor as analysis-synthesis coders. Good examples of perceptual criteria-based speech coding systems are the analysis-by-synthesis systems to be described in Section III.D.

It is useful to follow the above classification of speech coders for different rates. The speech coders for rates in the range 32-64 kb/s are all waveform coders; their performance is well characterized by the SNR or the SEGSNR. The coders for rates lower than or equal to 2,400 b/s are all analysis-synthesis coders; their performance is characterized by speech-specific criteria, such as the DRT or the DAM (see Section I.B). The most important category of systems in the range 2.4-16 kb/s is analysis-by-synthesis systems, which are neither waveform nor analysis-synthesis coders (vocoders). Their performance is characterized by perceptual (subjective) criteria, such as the mean opinion score (MOS).

A. Analysis-Synthesis Speech Coding (Vocoders)

The first known analysis-synthesis (A-S) speech coding system, which happens also to be historically the first example of a speech coding system, is the channel vocoder (Dudley, 1936). In a channel vocoder, the short-term speech spectrum is estimated by measuring the average power at the outputs of a bank of bandpass filters. The result is transmitted to the receiver together with a voiced/unvoiced decision and the pitch period for voiced sounds. The receiver uses a similar bank of filters to synthesize the reconstructed speech from a noisy excitation for unvoiced sounds, or a periodic excitation for voiced sounds. Spectral representation of speech signals using linear prediction modeling led to an improved A-S system, the LPC vocoder (Markel and Gray, 1974).

In Section II.B it was shown that the inverse filter 1/A(z) may be employed for reconstructing the signal from a white noise excitation. In this process the inverse filter is used as a spectral representation of the signal, suggesting the viability of the inverse filter for modeling the vocal tract in the speech production model of Fig. 1. To reproduce precisely the original waveform x(n), the input of the inverse filter should be identical to e(n), the white noise process obtained at the output of the whitening filter. If a different white noise process is used as excitation for the inverse filter, a waveform with the same spectral characteristics, but in general with a different shape, will result at the output. In speech coding, it was found that speech quality for unvoiced sounds depends little on the white noise used for excitation. Further, reasonable-quality voiced sounds can be produced by using as excitation a periodic series of pulses with the period equal to the pitch value. These basic considerations lead to the linear prediction model for speech production shown in Fig. 26. The model of Fig. 26 is used in low-rate speech coding in the LPC vocoder shown in Fig. 27. The transmitter of the LPC vocoder computes and quantizes the optimal linear prediction coefficients, a gain factor, and the pitch value for each speech frame. Typically, a 10th-order predictor is used, and the prediction coefficients are found by applying the autocorrelation or the covariance method. The quantization of the prediction coefficients is not trivial; as discussed in Section II.D, the coefficients have unfavorable quantization properties and are transformed into reflection coefficients or log-area ratio parameters before quantization (Markel and Gray, 1976; Fallside and Woods, 1985). The receiver decodes the parameters and synthesizes the output speech using the linear prediction model of speech production.
LPC vocoders achieve rates in the range 1,200-2,400 bits/s; however, the output speech is affected by a typical distortion, and the speech quality does not improve significantly if the rate is increased. Most applications are military, and the corresponding solution is subject to the LPC-10 standard. More details about the LPC vocoder and other analysis-synthesis approaches can be found in Flanagan (1972), Rabiner and Schafer (1978), and Fallside and Woods (1985).
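
The receiver side of the linear prediction model can be sketched in a few lines. The example below is a minimal illustration, not the LPC-10 algorithm: it drives an all-pole synthesis filter with a pulse train (voiced) or Gaussian noise (unvoiced). The function name, the sign convention A(z) = 1 - sum_k a_k z^{-k}, and the coefficient values are assumptions made for the example.

```python
import random


def synthesize_frame(a, gain, pitch_period, n_samples, state=None, rng=None):
    """All-pole synthesis y(n) = gain*u(n) + sum_k a[k]*y(n-1-k).

    a            : predictor coefficients (assumed convention A(z) = 1 - sum a_k z^-k)
    pitch_period : pulse spacing in samples for voiced frames, None for unvoiced
    state        : previous outputs y(n-1), y(n-2), ... carried over between frames
    Returns (output samples, updated filter state).
    """
    rng = rng or random.Random(0)
    memory = list(state) if state is not None else [0.0] * len(a)
    out = []
    for n in range(n_samples):
        if pitch_period is None:
            u = rng.gauss(0.0, 1.0)                      # white-noise excitation
        else:
            u = 1.0 if n % pitch_period == 0 else 0.0    # periodic pulse excitation
        y = gain * u + sum(ak * yk for ak, yk in zip(a, memory))
        out.append(y)
        memory = ([y] + memory)[:len(a)]                 # newest sample first
    return out, memory


# Hypothetical first-order example: a voiced frame with a pitch period of 4 samples.
voiced, voiced_state = synthesize_frame([0.5], 1.0, 4, 8)
unvoiced, _ = synthesize_frame([0.5], 1.0, None, 8)
```

Carrying the returned state into the next call keeps the filter memory continuous across frames, which is what a frame-by-frame decoder would need.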

FIG. 26. Linear prediction speech production model.

FIG. 27. LPC vocoder: simplified block diagram. Adapted from Bishnu S. Atal, “Linear Predictive Coding of Speech,” in Computer Speech Processing (Fallside and Woods, eds.), © 1985, p. 99, reprinted by permission of Prentice Hall Inc., Englewood Cliffs, NJ.

A relatively new approach in analysis-synthesis systems is based on a sinusoidal speech model (McAulay and Quatieri, 1986; McAulay et al., 1990). In this system, called the sinusoidal transform coder (STC), the sine-wave amplitudes and frequencies are first determined by searching for peaks in a short-time DFT of the input speech. The sine-wave frequencies are coded by fitting a harmonic set of sine waves to the input short-time DFT. The amplitudes are encoded by a modified DPCM coder in which a special technique is used to avoid slope overload (McAulay et al., 1990). The phases are randomized for unvoiced speech and set to zero for voiced speech (more sophisticated phase processing is used in newer versions of the system). At 2,400 bits/s, the STC system achieved a DRT score of 90% and a DAM score of 56.8 (McAulay et al., 1990). The DAM score shows a significant improvement over the LPC-10, which in its best version scores 54. One of the significant points about STC is that at high rates it achieves high speech quality, thus suggesting the potential for future improvements.

B. Predictive Speech Coding

The basic idea of predictive coding is to transmit only information that cannot be linearly predicted from the past samples of the already reconstructed signal. For correlated signals such as speech, the variance of the prediction error is significantly smaller than the variance of the original signal. On the other hand, the quantization noise variance is proportional to the variance of the quantizer’s input signal. These observations suggest that better performance can be achieved by quantizing the prediction error rather than the original signal; this approach will result in the reconstruction error power being proportional to the prediction error power, rather than to the original signal power.

FIG. 28. Predictive speech coder.

In a predictive coder (see Fig. 28), an estimate of each input sample, $\hat{x}(n)$, is obtained from the previous reconstructed speech samples, y(n), by linear prediction. The prediction error, $e(n) = x(n) - \hat{x}(n)$, is then quantized and transmitted to the receiver. The quantized prediction error, u(n), is added to the predicted value, $\hat{x}(n)$, both at the transmitter and at the receiver to obtain the next reconstructed speech sample, y(n):

$$ y(n) = u(n) + \hat{x}(n). \tag{81} $$

Note that the encoder of a predictive coding system includes a copy of the decoder. This is necessary in order to compute the reconstructed samples, y(n), used at the encoder for predicting the next input samples. For the configuration in Fig. 28, it is easy to show that the reconstruction error is equal to the quantization error:

$$ y(n) - x(n) = u(n) - e(n) = q(n), \tag{82} $$

and consequently,

$$ \mathrm{SNR}_s = \mathrm{SNR}_q \, G_p, \tag{83} $$

where $\mathrm{SNR}_s$ is the signal-to-noise ratio (SNR) of the predictive coding system, $\mathrm{SNR}_q$ is the quantizer SNR, and $G_p$ is the prediction gain given by Eqs. (43) and (44). It is interesting to note that a predictive coding system with a stationary input and using an infinite-order optimal predictor achieves a maximum
prediction gain equal to the spectral flatness measure of the given input signal (Jayant and Noll, 1984):

$$ G_{p,\max} = \gamma. \tag{84} $$

For Gaussian signals in the small-distortion (high-rate) region, rate-distortion theory shows that the maximum gain when encoding the source with memory over the memoryless case is also equal to the spectral flatness measure (Berger, 1971). Hence, the optimal infinite-order linear predictor achieves the maximum gain promised by the rate-distortion theory for encoding sources with memory. However, this does not mean that a predictive system using scalar quantization may generally (or in this particular case) achieve the rate-distortion bound. The reason is the performance of the scalar quantizer, $\mathrm{SNR}_q$, which cannot generally achieve the rate-distortion bound for memoryless sources. The prediction gain for speech signals depends on the sampling frequency and on the prefilter used to avoid aliasing in the sampling process. The short-time prediction gain varies widely from unvoiced to voiced speech. The minimum value of the short-time prediction gain on unvoiced frames is close to zero, while the maximum value on voiced speech may exceed 20 dB; typical long-time averages on sentences are 10-16 dB. In the telephone network, the speech is bandpass-filtered, thereby removing the low frequencies in the range 0-300 Hz; this reduces the average prediction gain by about 2 dB.
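
The closed-loop structure of Fig. 28 and the role of the prediction gain can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the input is a synthetic first-order autoregressive signal (not speech), the predictor is fixed and first order, and the quantizer is uniform without overload. The flatness-based bound is computed with the convention that the maximum gain is the ratio of the arithmetic to the geometric mean of the power spectrum; some texts define the flatness measure as the reciprocal of this ratio. All function names are hypothetical.

```python
import math
import random


def ar1_signal(n, rho, seed=0):
    """Generate x(n) = rho*x(n-1) + w(n) with unit-variance Gaussian w(n)."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for _ in range(n):
        prev = rho * prev + rng.gauss(0.0, 1.0)
        x.append(prev)
    return x


def dpcm(x, rho, step):
    """First-order predictive coder; returns (reconstruction, prediction error, quantization error)."""
    y, e, q = [], [], []
    y_prev = 0.0
    for xn in x:
        x_hat = rho * y_prev             # prediction from the reconstructed past
        en = xn - x_hat                  # prediction error
        un = step * round(en / step)     # idealized uniform quantizer (no overload)
        yn = un + x_hat                  # reconstruction, as in Eq. (81)
        y.append(yn)
        e.append(en)
        q.append(un - en)
        y_prev = yn
    return y, e, q


def power(v):
    return sum(s * s for s in v) / len(v)


def flatness_gain(rho, n_points=4096):
    """Arithmetic-to-geometric mean ratio of the AR(1) power spectrum on a uniform grid."""
    psd = [1.0 / (1.0 - 2.0 * rho * math.cos(2.0 * math.pi * k / n_points) + rho * rho)
           for k in range(n_points)]
    arithmetic = sum(psd) / n_points
    geometric = math.exp(sum(math.log(s) for s in psd) / n_points)
    return arithmetic / geometric


rho = 0.9
x = ar1_signal(20000, rho)
y, e, q = dpcm(x, rho, step=0.05)
gain_db = 10.0 * math.log10(power(x) / power(e))
bound_db = 10.0 * math.log10(flatness_gain(rho))
print(f"measured prediction gain: {gain_db:.2f} dB, flatness bound: {bound_db:.2f} dB")
```

For this test signal a first-order predictor is already optimal, so the measured gain should come close to the bound 1/(1 - rho^2); for speech the gap between a finite-order predictor and the bound would generally be larger.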

1. Delta Modulation and ADPCM

Delta modulation and adaptive differential PCM (ADPCM) are covered extensively in the literature (Jayant, 1974; Steele, 1975; Flanagan et al., 1979; Gibson, 1980; Jayant and Noll, 1984, and references). The following material includes only brief system descriptions, mainly introduced as examples of predictive coding applications. Delta modulation (DM) is a predictive coding system that uses first-order fixed prediction and a one-bit adaptive quantizer (Fig. 29). To achieve reasonable speech quality, the speech signal is oversampled at rates of 16-50 kHz, resulting in digital speech rates of 16-50 kb/s. The advantage of DM is the simplicity of the implementation. DM outperforms PCM at rates below 50 kb/s. However, ADPCM outperforms DM at all useful rates (of course, at the expense of the increased implementation complexity). Many refinements of the basic DM scheme have been developed, including systems with second-order predictors (double-integrator DM) and with adaptive quantization step size. Good descriptions of these refinements can be found in Steele (1975), and Jayant and Noll (1984).
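
A delta modulator with an adaptive one-bit quantizer can be sketched as follows. This is an illustrative toy, not any of the cited designs: the step-size rule (grow when two consecutive bits agree, shrink when they alternate), the multipliers, and the step limits are assumptions chosen for the example, and the predictor coefficient is taken as 1.

```python
import math


def _next_step(step, bit, last_bit, step_min, step_max, grow, shrink):
    """One-bit-memory adaptation: grow on repeated bits, shrink on alternation."""
    if bit == last_bit:
        return min(step * grow, step_max)
    return max(step * shrink, step_min)


def dm_encode(x, step_min=0.01, step_max=1.0, grow=1.5, shrink=0.66):
    """Return (bits, local reconstruction); one bit is produced per input sample."""
    bits, recon = [], []
    y_prev, step, last_bit = 0.0, step_min, 0
    for xn in x:
        bit = 1 if xn >= y_prev else -1              # sign of the prediction error
        step = _next_step(step, bit, last_bit, step_min, step_max, grow, shrink)
        y_prev = y_prev + bit * step                 # first-order fixed prediction
        bits.append(bit)
        recon.append(y_prev)
        last_bit = bit
    return bits, recon


def dm_decode(bits, step_min=0.01, step_max=1.0, grow=1.5, shrink=0.66):
    """Rebuild the waveform from the bit stream using the same adaptation rule."""
    out = []
    y_prev, step, last_bit = 0.0, step_min, 0
    for bit in bits:
        step = _next_step(step, bit, last_bit, step_min, step_max, grow, shrink)
        y_prev = y_prev + bit * step
        out.append(y_prev)
        last_bit = bit
    return out


fs = 32000                                           # oversampled rate, so 32 kb/s at 1 bit/sample
x = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(1600)]
bits, recon = dm_encode(x)
decoded = dm_decode(bits)
rms_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x))
```

Because the step size is adapted from the transmitted bits alone, the decoder can track it without side information, which is the backward-adaptation idea in its simplest form.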

FIG. 29. Block diagram of delta modulation.

Differential PCM (DPCM) is a predictive coding system that uses a short-term fixed predictor and a fixed quantizer (Cutler, 1952; Oliver, 1952). Adaptive coding systems may be obtained from DPCM by introducing predictor adaptation, quantizer adaptation, or both. Generally, such systems use recursive backward adaptation. The advances in adaptive differential PCM (ADPCM) systems led to the development of the CCITT 32 kb/s speech coding standard. ADPCM at 32 kb/s achieves speech quality close to that of PCM at 64 kb/s; the bandwidth required for digital voice is thus reduced to half while preserving toll quality.

FIG. 30. Block diagram of CCITT ADPCM.

The CCITT ADPCM (Fig. 30) uses a short-term backward adaptive predictor based on a combination of all-pole and all-zero predictors, and an adaptive four-bit quantizer. The predictor has two poles and six zeros, which leads to the following prediction equation:

$$ \hat{x}(n) = \sum_{j=1}^{6} g_j u(n-j) + \sum_{j=1}^{2} h_j y(n-j), \tag{85} $$

where $g_j$ are the coefficients of the all-zero predictor, and $h_j$ are the coefficients of the all-pole predictor. The input speech signal is sampled at 8 kHz, which, combined with four-bit quantization, leads to the 32 kb/s rate. The predictor adaptation algorithm is based on the stochastic gradient algorithm and is similar to the recursive algorithms of Eqs. (65) and (66). The presence of the all-zero predictor and the specifics of the adaptation algorithm are motivated in part by requirements regarding performance in the presence of transmission errors, as well as the need for an adequate response to narrow-band inputs (Nishitani et al., 1982; Miller and Mermelstein, 1984). The adaptive quantizer algorithm is similar to that of Eq. (22). The CCITT ADPCM codec achieves toll-quality speech at 32 kb/s; the MOS score is about 4.0, and the communications delay is only one sampling interval. Although ADPCM was successful in meeting the requirements at 32 kb/s, for lower rates the speech quality degrades quickly. At 16 kb/s ADPCM was found to be unacceptable for applications that require toll quality. Details about the CCITT ADPCM standard can be found in CCITT (1984), Jayant and Noll (1984), and Papamichalis (1987).
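
The two-pole, six-zero prediction and the 32 kb/s rate can be sketched as below. This is a simplified illustration and not the CCITT algorithm: the coefficients and the step size are held fixed (the standard adapts both), and all names and numerical values are hypothetical.

```python
from collections import deque


def pole_zero_predict(g, h, u_hist, y_hist):
    """x_hat(n) = sum_j g[j]*u(n-j) + sum_j h[j]*y(n-j); histories are ordered newest first."""
    zero_part = sum(gj * uj for gj, uj in zip(g, u_hist))
    pole_part = sum(hj * yj for hj, yj in zip(h, y_hist))
    return zero_part + pole_part


def encode(x, g, h, step, bits=4):
    """Fixed-coefficient pole-zero predictive coder with a uniform quantizer."""
    low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # 16 levels for 4 bits
    u_hist = deque([0.0] * len(g), maxlen=len(g))         # past quantized errors u(n-j)
    y_hist = deque([0.0] * len(h), maxlen=len(h))         # past reconstructions y(n-j)
    codes, recon = [], []
    for xn in x:
        x_hat = pole_zero_predict(g, h, u_hist, y_hist)
        code = max(low, min(high, round((xn - x_hat) / step)))
        un = code * step                                  # quantized prediction error
        yn = un + x_hat                                   # reconstructed sample
        u_hist.appendleft(un)
        y_hist.appendleft(yn)
        codes.append(code)
        recon.append(yn)
    return codes, recon


SAMPLING_RATE_HZ = 8000
BITS_PER_SAMPLE = 4
bit_rate = SAMPLING_RATE_HZ * BITS_PER_SAMPLE             # 32,000 b/s

# Hypothetical coefficients: six zero-section taps set to 0, two pole-section taps.
codes, recon = encode([1.0, 1.0], g=[0.0] * 6, h=[0.5, 0.0], step=0.25)
```

Since the predictor uses only u(n) and y(n), both of which the decoder can reconstruct from the received codes, the same loop run at the receiver reproduces the encoder's local reconstruction.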

2. Adaptive Predictive Coding

Adaptive predictive coding (APC), introduced by Atal and Schroeder (1967, 1970, 1979), uses a combination of a short-term and a long-term predictor; the short-term predictor models the smoothed speech spectrum, while the long-term predictor models the fine spectral structure due to pitch periodicity. The order in which the two predictors are connected may have a significant impact upon performance; the best results are obtained by first connecting the short-term predictor, followed by the long-term predictor. The short- and long-term predictors use forward block adaptation based on algorithms similar to Eqs. (30) and (39) and Eq. (46), respectively.

An important improvement in APC was the introduction of the noise weighting filter based on perceptual criteria. The reconstruction error in predictive coding has mostly a flat spectrum. However, the theory of auditory masking suggests that better perceptual performance may be obtained by “moving” most of the noise power to the frequency regions where speech power is concentrated (Atal and Schroeder, 1979; Makhoul and Berouti, 1979). The shaping of the reconstruction error spectrum may be obtained by using a weighting filter connected in a quantization noise feedback loop, as shown in Fig. 31. Assuming that the quantization noise is additive and white, it can be shown that the reconstruction error of the system shown in Fig. 31 is shaped by the system function

$$ W(z) = B(z)/A(z), \tag{86} $$

where B(z) and A(z) are the system functions of the weighting filter and short-term predictor, respectively (Atal and Schroeder, 1979). The weighting filter system function takes into account the effect of the feedback loop; i.e., it is of the form

$$ B(z) = 1 - \sum_{j=1}^{M_w} b_j z^{-j}, \tag{87} $$

where $M_w$ is the weighting filter order and $b_j$, $j = 1, 2, \ldots, M_w$, are the weighting filter coefficients. Note that, with the notation used here and without taking into account the feedback loop, the system function of the weighting filter in Fig. 31 would be 1 - B(z). A good choice of coefficients for the weighting filter is

$$ b_j = \alpha^j h_j, \tag{88} $$

where $h_j$ are the coefficients of the short-term predictor, and $0 \le \alpha \le 1$ is a parameter that determines the degree of spectral shaping. This choice of weighting filter parameters may be used whenever $M_w \le M$, which is the case for all practical applications. For $\alpha = 1$, no spectral shaping is obtained, while for $\alpha = 0$, the shape of the error spectrum will be very close to that of the speech signal itself. Good performance was obtained with values $\alpha \approx 0.8$-$0.9$. Note that, assuming $M = M_w$ and using the coefficients given by Eq. (88), the system function of the filter that shapes the reconstruction error spectrum becomes

$$ W(z) = A(z/\alpha)/A(z). \tag{89} $$

[FIG. 31. Block diagram: Transmitter, Receiver, Short-Term Predictor A(z), Long-Term Predictor, Channel.]

Adaptive predictive coding was pursued as a main candidate technique for achieving toll quality at 16 kb/s and below. One of the most powerful systems that resulted is adaptive predictive coding with adaptive bit allocation (APC-AB), presented by Honda and Itakura (1984), Tada et al. (1986), and Irie et al. (1988). In APC-AB, the input signal is split into three sub-bands that are individually encoded by APC. The number of bits per sample in each individual adaptive predictive coder is adapted in accordance with the signal energy in the respective sub-band. Adaptive bit allocation in the frequency
domain is combined with adaptive bit allocation in the time domain by dividing each frame into a number of subintervals and allocating bits according to signal energy in each subinterval. The APC-AB system achieved speech quality close to 7-bit PCM (i.e., close to toll quality). Although promising results were also obtained at 8 kb/s, the results at low rates did not meet expectations.

Two main problems plagued APC. The first problem is related to the weakness of the scalar quantization at low rates. Indeed, taking into account the side information needed to transmit the prediction coefficients, the scalar quantizer must use fewer than 2 bits/sample at 16 kb/s. The possible choices in the absence of adaptive bit allocation are only a three-level or a two-level quantizer; in both cases a very crude quantization is obtained. The situation is still more problematic at 8 kb/s, where less than 1 bit/sample is available for excitation quantization. Second, the long- and short-term predictor coefficients, as well as the quantizer outputs, are determined by open-loop estimation without taking into account the interaction between quantization and prediction. The problem with this approach is that the sequence of quantizer outputs may not be the best choice for signal reconstruction using the short-term predictor found by open-loop estimation. The missing link is the joint optimization of quantization and prediction. As is shown later, the first problem can be solved by combining predictive coding with vector quantization, and the second problem can be alleviated by the introduction of an analysis-by-synthesis configuration.

3. Vector Predictive Coding

Vector predictive coding (VPC) is a system in which a low-dimensionality vector quantizer is used in an adaptive predictive coding scheme (Cuperman and Gersho, 1982, 1985; Chang and Gray, 1986). In VPC the input signal is “blocked” by considering a few consecutive input samples as components of a vector. In the encoding process (see Fig. 32), a locally generated prediction of the current input vector is subtracted from the input vector, and the resulting error vector is coded by a vector quantizer. A switched-adaptation technique is used for prediction and quantization in the original system. The switched adaptation is based on the classification of each speech frame (consisting of many vectors) into one of m statistical types using an estimate of the frame autocorrelation function. This classification determines which one of m fixed vector predictors and of m vector quantizers will be used for encoding the current frame. VPC showed a significant performance improvement over ADPCM at 16 kb/s. For typical speech material, the segmental SNR increased from 12-14 dB for ADPCM to 16-18 dB for VPC. However, VPC did not meet the toll-quality requirements for 16 kb/s. The main weaknesses in the VPC approach are the lack of a perceptual distortion criterion, the simplistic adaptation procedures, and again, the open-loop optimization approach.

FIG. 32. Vector predictive coding: block diagram. Adapted from Cuperman and Gersho (1985), reprinted by permission of IEEE, copyright © 1985 IEEE.

C. Frequency Domain Speech Coding

Frequency domain speech coding systems divide the input signal into a number of frequency bands that are encoded separately. Transformation to the frequency domain reduces signal redundancy and allows the use of a different number of bits to encode each frequency band. Variable bit allocation provides a significant coding gain with respect to PCM. Moreover, separate quantization of different frequency bands results in the ability to obtain arbitrary forms of quantization noise shaping. The two main approaches in frequency domain coding are sub-band coding and transform coding. The former approach divides the input signal into a relatively small number of wide bands, while the latter approach is equivalent to using a large number of narrow-frequency bands. Tutorial presentations of the frequency domain coding of speech can be found in Tribolet and Crochiere (1979) and Jayant and Noll (1984).

Sub-band coding (Crochiere et al., 1976) divides the speech band into four or five sub-bands using a bank of bandpass filters. Each sub-band is translated to base-band by a single-sideband modulation process, resampled at its Nyquist rate, and encoded by adaptive quantization or ADPCM. In the receiver, the sub-bands are decoded, modulated back to their original position in the frequency domain, and summed to give a reconstruction of the original signal. The bit allocation controls the spectral shape of the quantization noise;
typically, more bits are allocated to lower-frequency bands in which pitch and formant structure must be preserved. Sub-band coding systems may use an adaptive bit allocation based on the average signal power in each sub-band. In this case, the bit allocation is computed for each speech block at the transmitter and sent to the receiver as side information. The use of quadrature mirror filters leads to significant advantages in digital implementation of the sub-band coders (Esteban and Galand, 1977; Crochiere, 1979; Jayant and Noll, 1984). Sub-band coders achieve good speech quality at 16 kb/s with moderate delay and computational complexity. Coders based on the sub-band approach are used in voice-mail and voice-response systems.

Transform coding (Huang and Schultheiss, 1963; Zelinski and Noll, 1977) divides the input signal into blocks of length N and applies a transform to each block. The transform coefficients are adaptively quantized and transmitted to the receiver. The receiver decodes the coefficients and applies an inverse transform to reconstruct the signal. In transform coding systems, a different number of bits may be allocated to each transform coefficient; for a stationary process, an optimal bit allocation can be derived as a function of coefficient variances. For speech, the optimal bit allocation may be different for each block of data. Adaptive transform coding, where the bit allocation is computed at the transmitter for each block and sent to the receiver as side information, is used in speech coding to cope with signal nonstationarity. For a stationary signal, the optimal transform is the Karhunen-Loève transform based on the eigenvectors of the signal correlation matrix. However, the discrete cosine transform, which achieves results close to the optimal transform and has the advantage of simpler implementation, is used in practical implementations. The asymptotic coding gain for transform coding has the same theoretical value as for DPCM (Eq. (84)); this means that transform coding may achieve the same degree of signal decorrelation as linear prediction.

Adaptive transform coding (ATC) obtains better performance than sub-band coding at the expense of increased delay and complexity. Typically, blocks having at least 128 samples are used in ATC, which leads to a basic buffering delay of 16 ms; the total delay is at least 2-3 times the buffering delay. ATC is mostly used in image coding; there are few recent applications to speech coding.

Vector transform quantization (VTQ) is a coding system in which N consecutive samples of the waveform are transformed into a set of N transform coefficients that are quantized by m << N vector quantizers. VTQ was used for the quantization of the prediction residual using a weighted distortion measure by Moriya and Honda (1987, 1988). The prediction residual was obtained by using short- and long-term predictors in a configuration similar
to APC. An adaptive VTQ system in which the bit assignment for vector quantizers is adapted to the local statistics of the speech waveform was presented by Cuperman (1986, 1989). The basic idea behind VTQ is to keep complexity low by combining large-size transforms with small-size vector quantizers. Such a system achieves performance close to that of an N-dimensional vector quantizer, while using vector quantizers with dimensions much lower than N (Cuperman, 1989).

D. Analysis-by-Synthesis Speech Coding

Analysis-by-synthesis (A-by-S) is a general approach for estimating a set of parameters for a speech production model. The model is assumed to be able to generate a variety of speech signals by adjusting the parameters; the synthesized speech signals are compared to the original speech signal, and the model parameters are varied in a systematic way to obtain the best match between the original and the synthesized signal. Initially, A-by-S was used for estimating such parameters as formant frequencies, short-time spectrum, and glottal waveform (Bell et al., 1961; Rabiner and Schafer, 1978). The first applications to speech coding were introduced by Atal and Remde (1982) and Stewart and Gray (1982). The basic analysis-by-synthesis idea can be easily adapted to low-rate coding of speech. In this case, the speech production model is used at the transmitter to find the optimal set of parameters for reproducing each segment of the original speech signal under a given distortion criterion. The optimal parameters are then quantized and transmitted to the receiver, which uses an identical speech production model and the received set of parameters to synthesize the reconstructed speech waveform. Digitizing a small number of parameters, rather than the entire speech waveform sample by sample, may result in a significant data compression ratio. At the same time, the fact that parameter values are based on direct comparison of the reconstructed and original waveforms helps to preserve good speech quality at low rates. Most speech coding systems use the A-by-S technique for excitation quantization only. The corresponding procedure is actually equivalent to the generalized adaptive vector quantization shown in Fig. 24. In such a system, both encoder and decoder have identical copies of an excitation codebook. The encoder finds the index of the best excitation codevector by A-by-S using the linear prediction model for speech production. 
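
The closed-loop choice of an excitation codevector can be sketched as follows. This is a minimal illustration under simplifying assumptions: only a short-term all-pole synthesis filter is used, the error is an unweighted squared error (the perceptual weighting filter is omitted), the gain is left unquantized, and the tiny codebook and all names are hypothetical. The search relies on the linearity of the filter: the ringing from past frames is subtracted from the target once, and each codevector is then filtered from a zero state.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def synthesize(a, excitation, state):
    """All-pole filter y(n) = u(n) + sum_k a[k]*y(n-1-k); state holds y(n-1), y(n-2), ..."""
    memory = list(state)
    out = []
    for u in excitation:
        y = u + dot(a, memory)
        out.append(y)
        memory = ([y] + memory)[:len(a)]
    return out


def search_excitation(x, codebook, a, state):
    """Return (index, gain, squared error) of the best gain-scaled codevector."""
    zero_input = synthesize(a, [0.0] * len(x), state)     # filter ringing from past frames
    target = [xn - zn for xn, zn in zip(x, zero_input)]
    best = None
    for index, codevector in enumerate(codebook):
        filtered = synthesize(a, codevector, [0.0] * len(a))
        energy = dot(filtered, filtered)
        if energy == 0.0:
            continue                                      # skip an all-zero candidate
        correlation = dot(target, filtered)
        gain = correlation / energy                       # least-squares gain for this candidate
        error = dot(target, target) - gain * correlation
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Toy check: the input is generated from codevector 2 scaled by 2, so the search should recover it.
a = [0.5]
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
x = synthesize(a, [2.0, 2.0, 0.0, 0.0], [0.0])
index, gain, error = search_excitation(x, codebook, a, [0.0])
```

Only the winning index (and a quantized gain) would need to be transmitted; the decoder would look up the same codevector and run the same synthesis filter.
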
The index is transmitted to the receiver, which retrieves the corresponding waveform by codebook look-up. Hence, the entire excitation waveform (typically 20-80 samples) is “compressed” into one index. It would be desirable to use analysis-by-synthesis for quantizing all the transmitted parameters of the speech production model; however, as will be shown below, this may lead to intractable complexity.

The reader should avoid the possible confusion between the terms analysis-by-synthesis and analysis-synthesis. In A-by-S systems the encoded values are determined by an optimization procedure in which the reconstructed waveform is compared to the original waveform; no such comparison takes place in an analysis-synthesis system. In other words, in analysis-synthesis the parameters are estimated in open loop, while in analysis-by-synthesis the parameters are estimated in closed loop. Analysis-synthesis systems achieve rates as low as 600-2,400 bits/s, but are plagued by a typical distortion that cannot be removed completely even at significantly higher rates. On the other hand, A-by-S systems can potentially achieve toll quality; results close to toll quality were achieved at rates as low as 8 kb/s, and achieving toll quality at 4 kb/s is currently one of the most challenging speech coding research topics.

1. A Generalized Analysis-by-Synthesis Speech Coder

Figure 33 shows a general configuration for analysis-by-synthesis speech coding systems. Actually, this configuration uses the simple speech production model of Fig. 1, consisting of a synthesis filter and an excitation generator. The excitation generator produces a sequence u(n) by reading values from an excitation codebook. The spectral codebook contains sets of parameters for the synthesis filter. The synthesis filter has a structure similar to that used in APC, consisting of short- and long-term predictors. The input to the synthesis filter is the gain-scaled excitation sequence, gu(n). For each segment of the original waveform, all the possible indices in both the excitation and the spectral codebooks are generated by the index selection module; the corresponding synthesized waveforms, y(n), are compared to the original speech signal, x(n); the error, r(n), is weighted perceptually by the weighting filter W; and the index that achieves the best match is chosen for transmission to the receiver. The receiver uses codebooks identical to those of the transmitter to retrieve the excitation waveform and the parameters of the synthesis filter.

FIG. 33. A generalized analysis-by-synthesis configuration.

More precisely, let $\mathbf{x} = (x(n_0), x(n_0 + 1), \ldots, x(n_0 + N - 1))^T$ be an input vector consisting of N consecutive speech samples, and let $\mathbf{y}$ and $\mathbf{u}$ denote the corresponding synthesized and excitation sequences, respectively. Assume that the synthesis filter operation is given by

$$ \mathbf{y} = g H_j \mathbf{u}_i, $$

where g is the excitation gain and j, i are the indices in the spectral and excitation codebooks, respectively. Note that $H_j$ is here an operator that,
applied to an input vector consisting of N consecutive samples, gives as result the corresponding filter output vector; the output vector may depend on all the previous input vectors. The choice of the indices in the excitation and spectral codebooks is made by minimizing

$$ \| W(\mathbf{x} - \mathbf{y}) \|^2, $$

where W is the weighting filter operator. The exhaustive search of the excitation and spectral codebooks for all possible values of indices i and j requires a huge computational effort. For example, consider a 4,800 b/s speech coding system in which vectors consist of 160 consecutive samples representing 20-ms speech frames. In such a system, 96 bits are available for encoding each vector. Assuming that the available rate is split equally between the spectral and the excitation codebooks, each codebook would have $2^{48}$ vectors, and the complexity of the joint optimization of the indices i, j by full search would be on the order of $2^{96}$ multiply/adds. This is the same complexity as that of a full-search vector quantizer applied to the entire speech frame. It is easy to see that using an excitation and a spectral codebook is suboptimal with respect to a unique VQ applied to the entire speech frame; the only justification for the approach of Fig. 33 is the expected reduction in computational complexity. Actually, the excitation and the spectral codebooks form a so-called product codebook, which is suboptimal with respect to a unique codebook because of the constraint imposed on its structure.

To reduce the complexity, the existing A-by-S systems use forward or backward adaptation to find the synthesis filter parameters from the original or reconstructed speech signal as described in Section II.B (for example, using Eqs. (37) or (39) and (44)). This open-loop approach avoids the need to search a spectral codebook for the optimal synthesis filter parameters. And yet, the configuration shown in Fig. 33 covers the open-loop approach even in the case of scalar quantization; in this case, the codebook consists of all the combinations of the scalar quantized output points, and the codebook search is replaced by the much less complex operation of scalar quantization. Further complexity reduction is obtained by dividing each speech frame into a number of vectors (subframes) and using a unique codebook to encode all excitation vectors. For example, at 4,800 b/s, four excitation vectors per frame are typically encoded by a unique 10-bit codebook. Assuming a sampling frequency of 8 kHz, each frame has 160 samples and is divided into four 40-dimensional vectors. A total of 40 bits are used to encode the excitation in this example.

a. Perceptual Error Criteria

In the absence of a weighting filter, the system shown in Fig.
33 selects the parameters such that the mean-squared error between the reconstructed and the original speech is minimized. Minimizing a mean-squared error results in a flat spectrum for the quantization error. However, as mentioned in Section III.B.2, better perceptual results can be obtained by exploiting the auditory masking characteristics of human hearing. The weighting filter introduces a perceptual error criterion in the parameter selection process for analysis-by-synthesis systems; minimizing the corresponding weighted error leads to the desired noise spectral shaping. Note that the weighting filter is used only in the transmitter; hence, there is no need to quantize and transmit its parameters. Analysis-by-synthesis systems use the same type of weighting as APC systems. Assuming that the order of the weighting filter is equal to the order of the short-term predictor, in order to obtain the same spectral shaping as in APC, the system function of the weighting filter must be of the form

$$W(z) = A(z)/A(z/\alpha),$$

where $1/A(z)$ is the short-term part of the synthesis filter. Note that the transfer function of the weighting filter is exactly the inverse of the desired shaping function. An alternative form of the weighting filter (providing for bandwidth expansion of the zeros) is

$$W(z) = A(z/\beta)/A(z/\alpha),$$

where $0 \le \alpha \le \beta \le 1$ (Kroon and Atal, 1990b). The values of $\alpha$ and $\beta$ are found by listening tests.

b. Fast Codebook Search

Despite the complexity trade-offs discussed above, analysis-by-synthesis remains a computationally expensive approach. Signal processing techniques that may be used to alleviate the computational burden were introduced by Davidson and Gersho (1986), Trancoso and Atal (1986), Kleijn et al. (1990), Lee and Un (1990), and others. The following presentation covers only the most-used technique, called the ZIR-ZSR decomposition (Davidson and Gersho, 1986).

Assuming that the synthesis filter parameters are determined using linear prediction with forward or backward adaptation, the index $j$ may be dropped in the preceding equations. By applying the superposition theorem, the output of the synthesis filter can be written as a sum of the zero input response (ZIR) and the zero state response (ZSR). The ZIR is the output of the synthesis filter with zero input; this output is determined by the filter's internal memory (initial state). The ZSR is the output of the synthesis filter with the internal memory set to zero (zero initial conditions). The ZIR does not depend on the choice of the vector in the excitation codebook, while the ZSR does not depend on the previously processed vectors. Note that if the pitch period is larger than the vector dimension, the long-term predictor does not affect the ZSR.

Most analysis-by-synthesis systems use a synthesis filter composed of a long-term predictor cascaded with a short-term predictor, both in the inverse filtering configuration. Eq. (91) indicates that the weighting filter in Fig. 33 can be moved over the summation element in both the synthesis branch and the input speech branch. The combination of the synthesis and weighting filters (cascaded) will be called the weighted synthesis filter. The system function of the weighted short-term synthesis filter, $H(z)$, may be written

$$H(z) = W(z)/A(z). \tag{92}$$
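As an illustration of the weighting filters above, the following Python sketch computes the coefficients of the bandwidth-expanded polynomial $A(z/\alpha)$ and the unit-sample response of the weighted synthesis filter. It is a minimal sketch under stated assumptions: $A(z)$ is stored as the list of its coefficients in powers of $z^{-1}$ with leading coefficient 1, the numerical values are hypothetical, and the function names are introduced here for illustration only.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma), given those of A(z) in powers of z^-1.

    Replacing z by z/gamma multiplies the coefficient of z^-k by gamma**k.
    """
    return [c * gamma ** k for k, c in enumerate(a)]


def impulse_response(num, den, n):
    """First n samples of the unit-sample response of num(z)/den(z), den[0] == 1."""
    h = []
    for i in range(n):
        acc = num[i] if i < len(num) else 0.0
        for k in range(1, len(den)):
            if i - k >= 0:
                acc -= den[k] * h[i - k]
        h.append(acc)
    return h


# Hypothetical second-order A(z) = 1 - 1.2 z^-1 + 0.8 z^-2 (assumed values).
a = [1.0, -1.2, 0.8]
alpha = 0.8

a_alpha = bandwidth_expand(a, alpha)          # coefficients of A(z/alpha)

# With W(z) = A(z)/A(z/alpha), the cascade W(z)/A(z) collapses to 1/A(z/alpha),
# so the weighted synthesis filter response follows from a_alpha alone.
h_w = impulse_response([1.0], a_alpha, 5)
```

The same `bandwidth_expand` routine applied with a second factor would give the numerator $A(z/\beta)$ of the alternative weighting filter.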

Note that, for $W(z) = A(z)/A(z/\alpha)$, Eq. (92) may be simplified to $1/A(z/\alpha)$; however, this was not done, given the fact that in some systems the short-term predictor coefficients may be quantized, while the weighting filter coefficients need not be quantized. By moving the weighting filter over the summation element and computing separately the ZSR and the ZIR, the transmitter of Fig. 33 is transformed into the configuration shown in Fig. 34.

FIG. 34. Computationally efficient analysis-by-synthesis transmitter.

Denoting by $y_{wzir}$ the ZIR of the weighted synthesis filter, the reconstructed signal for a particular codebook index $i$ can be written as

$$\hat{y}_w^{(i)} = W(H(g u^{(i)})) = y_{wzir} + g H_w u^{(i)}, \tag{93}$$

where $H_w$ is a matrix defined in terms of the weighted synthesis filter unit sample response, $h_w(n)$,

$$H_w = \begin{bmatrix} h_w(0) & 0 & 0 & \cdots & 0 \\ h_w(1) & h_w(0) & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ h_w(N_s - 1) & h_w(N_s - 2) & h_w(N_s - 3) & \cdots & h_w(0) \end{bmatrix}, \tag{94}$$

and $N_s$ is the subframe length. Note that all the vectors in Eq. (93) have dimension $N_s$. The minimization of $\|e_w\|$ can be simplified by defining the target vector,

$$t = W(x) - y_{wzir}, \tag{95}$$

and the ZSR vectors

$$y_{zsr}^{(i)} = H_w u^{(i)}. \tag{96}$$
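The ZIR-ZSR decomposition can be checked numerically. The Python sketch below is illustrative only: the filter coefficients, the initial memory, the excitation vector, and the function names are assumptions introduced here, not values from the text. It filters one candidate vector through an all-pole filter with nonzero memory and compares the result with the sum of the ZIR and of the ZSR, the latter computed as a product with the lower-triangular matrix of Eq. (94).

```python
def all_pole_filter(a, x, state):
    """Filter x through 1/A(z), with a = [1, a1, ..., aM].

    state holds the last M outputs, most recent first.
    Returns the output samples and the updated state.
    """
    mem = list(state)
    y = []
    for sample in x:
        out = sample - sum(a[k] * mem[k - 1] for k in range(1, len(a)))
        y.append(out)
        mem = [out] + mem[:-1]
    return y, mem


def zsr_matrix(h, n):
    """n-by-n lower-triangular Toeplitz matrix built from the unit-sample response h."""
    return [[h[r - c] if r >= c else 0.0 for c in range(n)] for r in range(n)]


a = [1.0, -0.9, 0.2]        # hypothetical weighted synthesis filter 1/A(z)
state = [0.5, -0.3]         # memory left by previously processed vectors (assumed)
u = [1.0, 0.0, -2.0, 0.5]   # one candidate excitation vector (assumed)
n = len(u)

# Full response: actual input and actual memory.
y_full, _ = all_pole_filter(a, u, state)

# ZIR: zero input, actual memory. Computed once per subframe.
y_zir, _ = all_pole_filter(a, [0.0] * n, state)

# ZSR: actual input, zero memory, written as H_w u.
h, _ = all_pole_filter(a, [1.0] + [0.0] * (n - 1), [0.0, 0.0])
H = zsr_matrix(h, n)
y_zsr = [sum(H[r][c] * u[c] for c in range(n)) for r in range(n)]

max_gap = max(abs(y_full[i] - (y_zir[i] + y_zsr[i])) for i in range(n))
```

Because the ZIR is common to all candidates, it is subtracted once from the weighted input to form the target vector, and only the ZSR part has to be evaluated per codebook entry.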


Now, the minimization problem is reduced to finding the index $i$ that minimizes

$$\|e_w\|^2 = \|t - g\, y_{zsr}^{(i)}\|^2. \tag{97}$$

Note that $t$ is independent of the index $i$, while $y_{zsr}^{(i)}$ is independent of the initial state of the weighted synthesis filter. The minimization defined by the last equation is equivalent to vector quantization of the vector $t$ by a VQ with the codebook $y_{zsr}^{(i)}$, $i = 1, 2, \ldots, N_c$, where $N_c$ is the number of vectors in the excitation codebook. The set of vectors $y_{zsr}^{(i)}$, $i = 1, 2, \ldots, N_c$, is called in A-by-S systems the zero state response (ZSR) codebook. If the weighted synthesis filter parameters are kept constant for a number of consecutive vectors, there is no need to recompute the ZSR codebook, and a significant complexity reduction results. The minimization in Eq. (97) may be performed by first finding the optimal excitation gain, $\hat{g}$, by a variational technique,

$$\hat{g} = \frac{t^T y_{zsr}^{(i)}}{\|y_{zsr}^{(i)}\|^2}, \tag{98}$$

and then minimizing $\|e_w\|$ for $g = \hat{g}$. By replacing the excitation gain in Eq. (97) by its optimal value, the minimization reduces to

$$\min_i \|e_w\|^2 = \min_i \left[ \|t\|^2 - \frac{\left(t^T y_{zsr}^{(i)}\right)^2}{\|y_{zsr}^{(i)}\|^2} \right]. \tag{99}$$

The first term in the above equation does not depend on the index $i$; hence the optimization problem reduces to maximizing the second term. Using the previous equations, the second term (denoted by $E$) may be written

$$E = \frac{\left(t^T y_{zsr}^{(i)}\right)^2}{\|y_{zsr}^{(i)}\|^2} = \frac{\left(t^T H_w u^{(i)}\right)^2}{\|H_w u^{(i)}\|^2}. \tag{100}$$

The optimization criterion thus reduces to the maximization of the normalized cross-correlation between the target vector $t$ and the ZSR codebook entry $y_{zsr}^{(i)}$. Note that the norm $\|y_{zsr}^{(i)}\|^2$ must be computed only when the ZSR codebook is updated. The derivation of Eq. (100) assumed that the gain is unquantized. If the gain is quantized, more accurate results can be obtained by minimizing Eq. (97) with $g$ replaced by its optimal quantized value. From the complexity viewpoint, both the minimization of (97) and the maximization of (100) reduce to the computation of the inner product $t^T y_{zsr}^{(i)}$ and of the energy term $\|y_{zsr}^{(i)}\|^2$. The inner product can be efficiently computed by first evaluating $H_w^T t$ based on the equality

$$t^T y_{zsr}^{(i)} = t^T H_w u^{(i)} = (H_w^T t)^T u^{(i)}.$$
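The search over the ZSR codebook described above can be sketched in a few lines of Python. This is a minimal illustration with unquantized gain; the target vector, the three-entry ZSR codebook, and the function names are hypothetical. For each entry, the sketch evaluates the normalized cross-correlation with the target, keeps the entry with the largest value, and returns the corresponding optimal gain.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(x, y))


def search_zsr_codebook(t, zsr_codebook, energies=None):
    """Return (index, gain) maximizing (t^T y)^2 / ||y||^2 over the ZSR codebook.

    energies may hold precomputed ||y||^2 values, which only need to be
    refreshed when the ZSR codebook itself is updated.
    """
    if energies is None:
        energies = [dot(y, y) for y in zsr_codebook]
    best_index, best_score, best_gain = -1, -1.0, 0.0
    for i, (y, energy) in enumerate(zip(zsr_codebook, energies)):
        if energy == 0.0:
            continue                      # an all-zero entry cannot reduce the error
        corr = dot(t, y)
        score = corr * corr / energy      # the term to be maximized
        if score > best_score:
            best_index, best_score, best_gain = i, score, corr / energy
    return best_index, best_gain


# Hypothetical target vector and ZSR codebook.
t = [1.0, 2.0, 0.0]
zsr_codebook = [
    [1.0, 0.0, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

index, gain = search_zsr_codebook(t, zsr_codebook)
residual = [ti - gain * yi for ti, yi in zip(t, zsr_codebook[index])]
```

In this example the second entry is collinear with the target, so the residual after gain scaling vanishes.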


On the other hand, the energy term can be expressed in terms of autocorrelation sequences corresponding to codebook vectors and to the weighted synthesis filter unit-sample response, respectively (Trancoso and Atal, 1986). For fixed codebooks, the codevector autocorrelation sequences can be precomputed and stored, resulting in significant computational savings.

c. Closed-Loop Pitch Prediction

A closed-loop pitch predictor can be obtained by defining an "adaptive" codebook composed of past excitation vectors and considering $k_p$ as an index in the codebook. This approach was introduced by Singhal and Atal (1984), and it was further developed by Rose and Barnwell (1986, 1990) in their work on the self-excited vocoder (SEV). The adaptive codebook generates excitation samples of the form

$$u_p(n) = g_p u(n - k_p), \tag{101}$$

where $g_p$ is the gain and $k_p$ is the pitch-related delay. For simplicity, it is assumed that $k_p > N_s$. The parameters $g_p$ and $k_p$ may be determined by a closed-loop search in the adaptive codebook by minimizing

$$\|e_{pw}\|^2 = \|t - g_p H_w u_p\|^2, \tag{102}$$

where $u_p = (u(n_0 - k_p), u(n_0 - k_p + 1), \ldots, u(n_0 - k_p + N_s - 1))^T$. Here the target vector, $t$, is computed by subtracting the ZIR of the weighted short-term predictor from the weighted input vector, as indicated by Eq. (95). Eqs. (101) and (102) can be easily generalized for a three-tap predictor. Note that Eq. (101) actually represents a first-order all-zero pitch predictor, where $g_p$ and $k_p$ are the predictor coefficient and the estimated pitch period, respectively. Hence, the adaptive codebook implements the closed-loop estimation of the parameters for this particular long-term predictor.

It is interesting to note that in a system having an adaptive codebook and an excitation codebook, a joint optimization problem may be defined by considering the minimization of

$$\|e_w\|^2 = \|t - g_p H_w u_p - g\, y_{zsr}^{(i)}\|^2. \tag{103}$$
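The closed-loop search of Eqs. (101) and (102) can be illustrated by the following Python sketch. It is a simplified example under stated assumptions: the past excitation buffer, the truncated unit-sample response, the lag range, and the function names are all hypothetical, and every candidate lag is at least as large as the subframe length, so that each candidate vector lies entirely in the past excitation.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(x, y))


def zero_state_response(h, u):
    """ZSR of the weighted synthesis filter: H_w u with H_w built from h."""
    return [sum(h[r - c] * u[c] for c in range(r + 1)) for r in range(len(u))]


def adaptive_codebook_search(t, past_excitation, h, lags):
    """Return (k_p, g_p) minimizing ||t - g_p H_w u_p||^2 over the candidate lags.

    Assumes every lag is >= len(t), the subframe length.
    """
    n = len(t)
    best_lag, best_gain, best_score = None, 0.0, -1.0
    for k in lags:
        start = len(past_excitation) - k
        u_p = past_excitation[start:start + n]    # excitation delayed by k samples
        w_p = zero_state_response(h, u_p)
        energy = dot(w_p, w_p)
        if energy == 0.0:
            continue
        corr = dot(t, w_p)
        score = corr * corr / energy
        if score > best_score:
            best_lag, best_gain, best_score = k, corr / energy, score
    return best_lag, best_gain


# Hypothetical data: a past excitation with period 5 and a short response h.
past_excitation = [1.0, 0.0, 0.0, -1.0, 0.5] * 4
h = [1.0, 0.5, 0.25, 0.125]

# Target built so that a delay of 5 samples with gain 0.8 matches it exactly.
t = [0.8 * v for v in zero_state_response(h, [1.0, 0.0, 0.0, -1.0])]

k_p, g_p = adaptive_codebook_search(t, past_excitation, h, range(4, 13))
```

The search is the same normalized cross-correlation maximization used for the excitation codebook, with the delay playing the role of the codebook index. In this periodic example a delay of 10 ties with a delay of 5, and the first maximum found is kept.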

Determining $k_p$ and $g_p$ by a search in the adaptive codebook, and then $y_{zsr}^{(i)}$ and $g$ by a search in the excitation codebook, represents a suboptimal solution that may lead to degradation with respect to the performance given by joint optimization. A simple improvement may be obtained by re-optimizing the gains, $g$ and $g_p$, after the vectors $y_{zsr}^{(i)}$ and $u_p$ have been found by searching the adaptive and the excitation codebooks. A different approach to joint optimization is presented in the next subsection.

d. Joint Search Optimization for Multiple Codebooks

The introduction of the adaptive codebook for long-term predictor closed-loop optimization leads to a system with two codebooks. On the other hand, the performance of the analysis-by-synthesis system is expected to improve by increasing the vector (subframe) length. If the rate allocated for encoding the excitation is fixed, the excitation codebook size increases exponentially with the vector length. One possible solution for achieving reasonable complexity is to use multiple excitation codebooks. An example of a configuration with an adaptive codebook and two excitation codebooks is shown in Fig. 35. (Note that in this and the following figures, only the part of the encoder related to the codebook search is shown.)

FIG. 35. A three-codebook analysis-by-synthesis configuration (VSELP).

The use of multiple codebooks in A-by-S was introduced by Davidson and Gersho (1986) and further developed by Gerson and Jasiuk (1989, 1990), who also suggested a joint search optimization procedure based on the Gram-Schmidt orthogonalization. The presentation below is based on the Gerson and Jasiuk (1989, 1990) approach. The codebook search has as its objective the minimization of the weighted squared error:

$$\|e_w\|^2 = \|t - g_p H_w u_p - g_1 H_w u_1^{(i)} - g_2 H_w u_2^{(j)}\|^2,$$

where $i$, $j$ are the indices in the first and the second excitation codebook, and $g_1$ and $g_2$ are the respective gains. To simplify the writing and allow a general discussion of the procedure, let $H_w u_p = w_p$, $H_w u_1^{(i)} = w_1$, and $H_w u_2^{(j)} = w_2$.

The objective is to minimize

$$\|e_w\|^2 = \|t - g_p w_p - g_1 w_1 - g_2 w_2\|^2$$

as a function of $g_p$, $g_1$, $g_2$ and $w_p$, $w_1$, $w_2$ for a given $t$. First note that the gains $g_p$, $g_1$, and $g_2$ can be jointly optimized for any given $t$, $w_p$, $w_1$, and $w_2$. Indeed, by setting to zero the derivatives of $\|e_w\|^2$ with respect to the three gains, a linear system of three equations in the three unknown gains is obtained. Hence, the initial optimization problem is reduced to the optimization of the vector directions. Second, if the vectors $w_p$, $w_1$, and $w_2$ are assumed to be orthogonal, the optimal direction of the vectors can be determined independently, because in this case

$$\|e_w\|^2 = \|t\|^2 - 2 g_p t^T w_p + g_p^2 \|w_p\|^2 - 2 g_1 t^T w_1 + g_1^2 \|w_1\|^2 - 2 g_2 t^T w_2 + g_2^2 \|w_2\|^2.$$

A fast orthogonalization procedure may be obtained by using constrained excitation codebooks of the type

$$w_m^{(i)} = \sum_{k=1}^{M} \theta_{ik}\, q_{m,k},$$

where $m = 1, 2$, $q_{m,k}$ are the filtered basis vectors for the two excitation codebooks, and $\theta_{ik}$ have values 1 or $-1$ (Gerson and Jasiuk, 1989, 1990). Note that in each of the two excitation codebooks there are $2^M$ excitation vectors obtained by linear combinations of the $M$ basis vectors with coefficients $\pm 1$. This constrained codebook structure allows for orthogonalizing $M$ filtered basis vectors, rather than $2^M$ filtered excitation vectors. With these preliminaries, the optimization process may be described by the following steps:

(1) Find the optimal vector in the adaptive codebook, $w_p$, by minimizing $\|t - g_p w_p\|^2$.

(2) Orthogonalize the filtered basis vectors for the first excitation codebook by

$$q'_{1,k} = q_{1,k} - \frac{w_p^T q_{1,k}}{\|w_p\|^2}\, w_p,$$

and find the optimal vector $w_1' = \sum_k \theta_{ik}\, q'_{1,k}$ that minimizes $\|t - g_p w_p - g_1 w_1'\|^2$. Note that the difference between $w_1$ and $w_1'$ is a vector having the direction $w_p$.

(3) Repeat the above orthogonalization procedure for $w_2$, orthogonalizing first with respect to the vector chosen from the adaptive codebook, and then with respect to the vector chosen in the first excitation codebook.

(4) Finally, re-optimize the gains for the values of $w_p$, $w_1$, and $w_2$ found in the previous steps.
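The steps above can be sketched in Python for the reduced case of an adaptive codebook vector and a single constrained excitation codebook. This is an illustration under stated assumptions rather than the procedure of Gerson and Jasiuk in full: the vectors are hypothetical, the sign patterns are searched exhaustively, the gains are unquantized, and all function names are introduced here.

```python
from itertools import product


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(x, y))


def remove_component(q, w):
    """Gram-Schmidt step: subtract from q its projection on w."""
    c = dot(q, w) / dot(w, w)
    return [qi - c * wi for qi, wi in zip(q, w)]


def vector_sum(signs, basis):
    """Codevector obtained as a +/-1 combination of the basis vectors."""
    n = len(basis[0])
    return [sum(s * b[r] for s, b in zip(signs, basis)) for r in range(n)]


def search_orthogonalized(t, w_p, filtered_basis):
    """Pick the sign pattern whose orthogonalized codevector best matches t."""
    ortho = [remove_component(q, w_p) for q in filtered_basis]
    best_signs, best_score = None, -1.0
    for signs in product((1, -1), repeat=len(ortho)):
        w = vector_sum(signs, ortho)
        energy = dot(w, w)
        if energy > 0.0:
            score = dot(t, w) ** 2 / energy
            if score > best_score:
                best_signs, best_score = signs, score
    return best_signs


def joint_gains(t, w_p, w_1):
    """Solve the 2x2 normal equations for the gains of the two chosen vectors."""
    a11, a12, a22 = dot(w_p, w_p), dot(w_p, w_1), dot(w_1, w_1)
    b1, b2 = dot(t, w_p), dot(t, w_1)
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


# Hypothetical filtered vectors (already multiplied by H_w).
w_p = [1.0, 0.0, 0.0, 0.0]                              # chosen adaptive codebook vector
filtered_basis = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
t = [3.5, 1.5, -1.5, -1.5]                              # equals 2*w_p + 1.5*(q1 - q2)

signs = search_orthogonalized(t, w_p, filtered_basis)   # step (2)
w_1 = vector_sum(signs, filtered_basis)                 # unorthogonalized codevector
g_p, g_1 = joint_gains(t, w_p, w_1)                     # step (4)
```

A sign pattern and its negation yield the same score, so only half of the patterns would actually need to be examined; the sketch keeps the first pattern found.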

The procedure presented above is optimal under the assumption that the gains are unquantized. For quantized gains, this procedure is not equivalent to joint optimization.

e. Adaptive Postfiltering

The perceived speech quality may be improved in low-rate speech coders by using an adaptive postfilter connected at the output of the decoder. Essentially, a postfilter improves the perceived speech quality by attenuating the frequencies where the signal energy is low and amplifying the frequencies where the signal energy is high. Adaptive postfiltering was introduced in ADPCM by Ramamoorthy and Jayant (1984), and in analysis-by-synthesis systems by Chen and Gersho (1987a). A transfer function for adaptive postfiltering may be derived from the transfer function of the short-term predictor by moving the poles toward the origin (Ramamoorthy and Jayant, 1984). The postfilter transfer function, $P(z)$, is then given by

$$P(z) = \frac{1}{A(z/\alpha)},$$

where $0 < \alpha < 1$. This postfilter has a frequency response similar to that of the short-term predictor (which models a smoothed spectrum of the speech signal), except for the "dampening" of the spectral peaks that results from the pole movement toward the origin. The operation that transforms $1/A(z)$ into $1/A(z/\alpha)$ leads to an increase in the bandwidth of the spectral peaks (because of the pole movement) and is therefore called bandwidth expansion. Experimentally, it was found that such a postfilter indeed reduces the perceived level of the quantization noise; however, this reduction occurs at the expense of a significant muffling of the speech signal. The muffling is due to the low-pass spectral tilt that is characteristic of this type of postfilter (Chen and Gersho, 1987a). The low-pass spectral tilt can be reduced by adding zeros having the same phase angles as the poles but with smaller radii. Further reduction of the spectral tilt may be obtained by using a first-order filter with a slight high-pass spectral tilt. The resulting postfilter transfer function is given by

$$P(z) = (1 - \mu z^{-1})\, \frac{A(z/\beta)}{A(z/\alpha)}.$$

Typical values for the coefficients are $\mu = 0.5$, $\alpha = 0.8$, and $\beta = 0.5$. Despite the improvements discussed above, a postfilter introduces a slight muffling in the reconstructed speech. Moreover, in tandem configurations the distortion introduced by postfilters may accumulate to unacceptable levels.
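As an illustration, the Python sketch below builds the numerator and denominator of a pole-zero postfilter with spectral tilt compensation from the coefficients of $A(z)$, using the typical parameter values quoted above. It assumes that $A(z)$ is stored as the list of its coefficients in powers of $z^{-1}$ with leading coefficient 1; the first-order example polynomial and the function names are hypothetical. Gain control, which a practical postfilter would normally require, is not included.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma), given those of A(z) in powers of z^-1."""
    return [c * gamma ** k for k, c in enumerate(a)]


def polymul(p, q):
    """Product of two polynomials in z^-1."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def postfilter_coefficients(a, alpha=0.8, beta=0.5, mu=0.5):
    """Numerator and denominator of (1 - mu z^-1) A(z/beta) / A(z/alpha)."""
    num = polymul([1.0, -mu], bandwidth_expand(a, beta))
    den = bandwidth_expand(a, alpha)
    return num, den


def iir_filter(num, den, x):
    """Direct-form filtering of x by num(z)/den(z), with den[0] == 1 and zero initial state."""
    y = []
    for n in range(len(x)):
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n - k >= 0)
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y.append(acc)
    return y


# Hypothetical first-order A(z) = 1 - 0.9 z^-1.
a = [1.0, -0.9]
num, den = postfilter_coefficients(a)

# Response of the postfilter to a unit sample (first four values).
response = iir_filter(num, den, [1.0, 0.0, 0.0, 0.0])
```

In an adaptive postfilter the coefficients would be recomputed whenever the decoder updates the short-term predictor.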

For analysis-by-synthesis systems at rates equal to or below 4.8 kb/s and in the absence of tandeming, the postfilter remains an efficient approach to improving the perceived speech quality.

2. Multi-Pulse Excited Linear Prediction Coding (MPLPC)

Multi-pulse excited linear prediction coding (MPLPC) is an analysis-by-synthesis system in which each excitation vector consists of a combination of a given number of pulses whose positions and amplitudes are optimized in closed loop (Atal and Remde, 1982; Kroon and Deprettere, 1984, 1988). The system shown in Fig. 36 uses a short-term predictor with forward block adaptation for the synthesis filter. Either autocorrelation or covariance methods may be used for estimating the optimal predictor coefficients. No long-term predictor is used, with the assumption that the pulse-type excitation is adequate for synthesizing voiced sounds. Let $J$ be the total number of pulses per excitation vector, and $\beta_j$ and $m_j$ the pulse amplitudes and positions, respectively. The excitation vector can then be written as

$$u = \sum_{j=1}^{J} \beta_j e_{m_j},$$

where $e_{m_j} = (0, 0, \ldots, 0, 1, 0, \ldots, 0)^T$ is the basis vector with the $m_j$th component equal to 1 and all the other components equal to zero. Using Eqs. (96) and (97), the closed-loop optimization approach reduces to the minimization of

$$\|e_w\|^2 = \Big\| t - \sum_{j=1}^{J} \beta_j H_w e_{m_j} \Big\|^2 \tag{108}$$

as a function of the positions $m_j$ and amplitudes $\beta_j$, $j = 1, 2, \ldots, J$.

FIG. 36. Multi-pulse excited linear prediction (MPLPC) transmitter: simplified block diagram.

The joint optimization of pulse positions leads to intractable complexity. As a result, the optimization is performed by determining the location and amplitude of one pulse at a time. Then, the effect of the pulse is subtracted from the input waveform, determining a new target vector. Finally, after all the positions are found, the amplitudes are jointly re-optimized. For example, for the first pulse a simple variational minimization gives the optimal pulse amplitude, $\hat{\beta}_1$,

$$\hat{\beta}_1 = \frac{t^T H_w e_{m_1}}{\|H_w e_{m_1}\|^2}.$$

For this value of the pulse amplitude the weighted error becomes

$$\|e_w\|^2 = \|t\|^2 - \frac{\left(t^T H_w e_{m_1}\right)^2}{\|H_w e_{m_1}\|^2},$$

and the pulse position, $m_1$, is found by looking for the maximum of the second term. After the amplitude and the position have been found for the first pulse, a new target vector, $t'$, is defined by subtracting the effect of this pulse from the original target vector:

$$t' = t - \hat{\beta}_1 H_w e_{\hat{m}_1},$$

where $\hat{m}_1$ corresponds to the optimal position of the first pulse. Now, the same operations are repeated for the second pulse, and the procedure may be further extended for all $J$ pulses. Finally, the amplitudes can be re-optimized by considering the minimization problem of Eq. (108) with known pulse positions. The re-optimization leads to a system of $J$ equations in $J$ unknown amplitudes that may be obtained by setting to zero the derivatives of $\|e_w\|^2$ with respect to $\beta_j$, $j = 1, 2, \ldots, J$.

Multi-pulse excited linear prediction achieves toll quality at 16 kb/s. However, speech quality degrades quickly at rates below 10 kb/s; Berouti et al. (1984) report for 4.8, 9.6, and 16 kb/s MPLPC quality equivalent to PCM coded speech with four bits, five bits, and seven bits per sample, respectively. The quality of seven-bit PCM corresponds to an MOS score of about 4.0 (toll quality), while the quality of five-bit PCM corresponds to an MOS lower than 3.0, indicating unacceptable quality for wide applications (Daumer, 1982). An improved MPLPC using pitch prediction achieved an MOS score


of 3.8 at 10 kb/s (Millar et al., 1990). However, at a rate of 10 kb/s or lower, systems based on CELP (to be described) achieved better performance than MPLPC. A simplified version of the MPLPC is the regular pulse excited linear prediction coding (RPELPC) described by Kroon et al. (1986). In RPELPC, $J$ equally spaced pulses are used in each length-$L$ excitation vector. As a result, there are only $L/J$ possible positioning phases that completely determine pulse positions. The minimization problem can thus be reduced to the solving of $L/J$ linear systems, each having $J$ equations with $J$ unknowns. RPELPC is used in the European digital cellular speech coding standard (Vary et al., 1988).

3. Code Excited Linear Prediction (CELP), Vector Adaptive Predictive Coding (VAPC), and Vector Excitation Coding (VXC)

Code excited linear prediction (CELP), introduced by Atal and Schroeder (1984), Schroeder and Atal (1985), and Copperi and Sereno (1983), was the first vector quantization-based analysis-by-synthesis system. CELP uses a codebook of white Gaussian random numbers to generate the excitation sequence. The Gaussian distribution was chosen because the prediction residual has a nearly Gaussian distribution (Atal, 1982). In the first CELP versions, the long- and short-term predictors were determined in open loop. The short-term predictor was determined using the covariance method applied to frames of 160 samples with an update period of 80 samples. The long-term predictor had three coefficients and an update period of 40 samples. The length of the subframe was also 40 samples, and a codebook of size 1,024 was used to encode the excitation for each subframe. The initial computational procedure (which did not use the ZSR-ZIR decomposition) was extremely complex: It took 125 s of CRAY-1 CPU time to process 1 s of speech.

Vector adaptive predictive coding (VAPC) was developed as a natural extension of the vector predictive coding presented in Section III.B.3 (Chen and Gersho, 1986, 1987). First, a pitch predictor was added to the VPC configuration, resulting in a vector generalization of the APC. Then, the vector short-term predictor was replaced by a scalar predictor, leading to a configuration similar to CELP. On the other hand, vector excitation coding (VXC) is actually a CELP configuration that uses a vector quantization trained codebook (Davidson and Gersho, 1986). A first version of VXC used a sparse codebook similar to that of MPLPC.

a. Codebook Training

The excitation codebook can be trained by solving the following minimization problem. Let $\{t_j\}_{j=1}^{L}$ be a cluster of target vectors for which a centroid must be found. For each $t_j$, let $H_w^{(j)}$ be the corresponding weighted short-term predictor, $u_j$ the excitation vector, and $g_j$ the excitation gain. Then, using Eq. (97), the average weighted error for the given cluster is

$$D = \frac{1}{L} \sum_{j=1}^{L} \|t_j - g_j H_w^{(j)} u\|^2,$$

where $u$ is the desired centroid of the set $\{u_j\}_{j=1}^{L}$. Setting to zero the derivative with respect to $u$, the following equation in the unknown $u$ is obtained:

$$\sum_{j=1}^{L} g_j H_w^{(j)T} \left( t_j - g_j H_w^{(j)} u \right) = 0,$$

or

$$\Big[ \sum_{j=1}^{L} g_j^2 H_w^{(j)T} H_w^{(j)} \Big] u = \sum_{j=1}^{L} g_j H_w^{(j)T} t_j.$$

The notation used in the last two equations was simplified by omitting the centroid index, $i$, and the iteration index in the training process, $k$. The complete centroid relation using these two indices is

$$\Big[ \sum_{j=1}^{L_{ik}} g_j^2 H_w^{(j)T} H_w^{(j)} \Big] u^{(i,k)} = \sum_{j=1}^{L_{ik}} g_j H_w^{(j)T} t_j^{(i,k)},$$

where $L_{ik}$ is the number of vectors in the cluster $i$ at the iteration $k$, and $t_j^{(i,k)}$, $j = 1, 2, \ldots, L_{ik}$, are the target vectors in the cluster $i$ at iteration $k$.
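The centroid computation amounts to solving one linear system per cluster. The Python sketch below accumulates the normal equations obtained by setting the derivative of the cluster error to zero and solves them by Gaussian elimination. It is a minimal illustration: the cluster data are synthetic, each weighted synthesis filter is represented by its lower-triangular matrix as in Eq. (94), and all function names are assumptions made for this example.

```python
def zsr_matrix(h, n):
    """n-by-n lower-triangular Toeplitz matrix built from the unit-sample response h."""
    return [[h[r - c] if r >= c else 0.0 for c in range(n)] for r in range(n)]


def matvec(H, u):
    """Matrix-vector product H u."""
    return [sum(hrc * uc for hrc, uc in zip(row, u)) for row in H]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def centroid(targets, gains, filters):
    """Solve [sum g^2 H^T H] u = sum g H^T t for the cluster centroid u."""
    n = len(targets[0])
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for t, g, H in zip(targets, gains, filters):
        for r in range(n):
            b[r] += g * sum(H[k][r] * t[k] for k in range(n))          # g H^T t
            for c in range(n):
                A[r][c] += g * g * sum(H[k][r] * H[k][c] for k in range(n))
    return solve(A, b)


# Synthetic cluster: two targets generated from one excitation vector.
u_true = [1.0, -2.0, 0.5]
filters = [zsr_matrix([1.0, 0.5, 0.25], 3), zsr_matrix([1.0, -0.3, 0.09], 3)]
gains = [2.0, 0.5]
targets = [[g * v for v in matvec(H, u_true)] for g, H in zip(gains, filters)]

u_centroid = centroid(targets, gains, filters)
```

Since the synthetic targets were generated without error from `u_true`, the centroid recovers that vector; with real training data the solution is the least-squares compromise over the cluster.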

Given a centroid $u$, the corresponding cluster $\{t_j\}_{j=1}^{L}$ consists of all target vectors that are encoded into the excitation vector $u$ using the nearest neighbor codebook search. An iterative vector quantization training algorithm may be used to find an optimized codebook; in each iteration, the algorithm computes a new clustering based on the previous iteration centroids, and then new centroids for this clustering. Denoting by $k_{max}$ the number of iterations, the final codebook is

$$u^{(i)} = u^{(i, k_{max})}.$$

The iterative algorithm described above is not guaranteed to converge. Moreover, the average distortion oscillates during the training process. This is a result of the fact that the target vectors depend on the codebook and hence change in each iteration. This procedure, introduced first for VPC by Cuperman and Gersho (1982, 1985) and adapted for VXC by Davidson et al. (1987), is called the closed-loop codebook design. The alternative approach would be to train the codebook on a fixed data base of target vectors, the open-loop codebook design. Although the average distortion does not decrease monotonically in the closed-loop design, the final codebook is systematically better than the codebook produced by the open-loop design.

Both stochastic and trained (vector quantization) codebooks lead to systems characterized by the high computational complexity of the search

process. The complexity may be significantly reduced by using the ZSR-ZIR decomposition and other signal processing techniques (see Section III.D.1). A further decrease in the computational complexity may be obtained by using overlapped codebooks (Lin, 1987), or by imposing an algebraic structure on the codebook (Adoul and Lamblin, 1987).

b. The DOD 4.8 kb/s Speech Coding Standard (Proposed Federal Standard 1016)

The advances in low-rate speech coding based on the CELP configuration led to the development of the DOD 4.8 kb/s standard (proposed Federal Standard 1016); see Campbell et al. (1989, 1990). The DOD standard uses an adaptive codebook and a ternary-valued (+1, 0, -1) stochastic codebook. The synthesis filter is a 10th-order short-term predictor with coefficients determined by applying the autocorrelation method to each frame of 240 samples (30 ms). The short-term predictor coefficients are transformed into LSPs (see Section II.D) and scalar-quantized for transmission. Each frame is divided into four subframes (vectors) of 60 samples, and for each vector, optimal indices in the adaptive and the stochastic codebooks are determined independently. The stochastic codebook is overlapped, each vector containing all but two samples of the previous vector and two new samples. The adaptive codebook provides for the possibility of using non-integer delays. The search configuration for the DOD CELP is shown in Fig. 37. For more details on the DOD standard, see Campbell et al. (1990).

The DOD standard achieved a DRT score of 93%, which compares favorably with 90% for the LPC-10 standard at 2,400 b/s. The difference is even more significant for the DAM score, which improves from 48 for LPC-10 to 67 for the DOD standard. Campbell et al. (1989) showed that the results obtained by the DOD standard at 4.8 kb/s are comparable in terms of DRT and DAM scores to delta modulation and ADPCM at 32 kb/s (!). However,

this last statement should be considered cautiously. The DRT and DAM scores represent reasonably well the subjective quality of speech for the class of applications where the intelligibility and a degree of speaker recognizability are the main requirements. On the other hand, for applications in the general communications network, the only widely accepted criterion is the MOS. ADPCM at 32 kb/s achieves a MOS of 4.0, while CELP at 4.8 kb/s is believed by many researchers to achieve only about 3.0-3.5 (unfortunately, no formal MOS results are available for the DOD standard; Kroon and Atal, 1990b, indicate a performance of 3.5).

FIG. 37. Code excited linear prediction (CELP) transmitter: simplified block diagram.

The DOD standard is characterized by a relatively high computational complexity; for a stochastic codebook of size 512, a DSP chip rated at 25 MIPS is required. However, the standard is flexible, allowing the use of a variety of codebook sizes (64 to 512) to obtain a variety of complexity/performance trade-offs. There is also a special bit left free in the output stream that may be used as a flag to indicate a modified algorithm for future improvements.

Speech coding systems based on CELP, VAPC, or VXC may be used to achieve toll speech quality at 16 kb/s, communications speech quality at 8 kb/s, and good quality for special applications at 4.8 kb/s. At 16 kb/s, the only disadvantage of this approach is the large communications delay. The main source of the delay is the buffering needed for the forward block adaptation of the short-term predictor; the algorithmic delay is 15-20 ms, which leads to a total delay of 30-60 ms. Such a delay may be unacceptable for applications in the general communications network. At 8 kb/s, results close to toll quality were obtained by a modification of the basic CELP configuration called vector sum excited linear prediction (VSELP), which will be presented briefly in the next subsection.

4. Vector Sum Excited Linear Prediction (VSELP)

In the basic CELP approach, the search in the adaptive and stochastic codebooks and the computation of the respective gains are done independently for the two codebooks. Joint optimization would require a complete search of the stochastic codebook for each codeword in the adaptive codebook, and such a search is computationally intractable. An alternative approach to joint optimization is the orthogonalization procedure described in Section III.D.1, which was introduced for VSELP by Gerson and Jasiuk (1989, 1990).

VSELP uses a 10th-order block forward adaptive short-term predictor and three codebooks: an adaptive codebook, and two excitation codebooks with the constrained structure described in Section III.D.1. The short-term predictor parameters are transmitted by quantizing the reflection coefficients. The adaptive codebook covers the usual range of pitch values, while the excitation codebooks each have 128 vectors obtained as linear combinations of seven basis vectors. The basis vectors were optimized on a speech data base; the iterative optimization of codebooks resulted in a significant gain in performance, particularly in the subjective speech quality. The constrained codebook structure of VSELP leads to a fast search algorithm for the two excitation codebooks.

There are two versions of VSELP for the rates of 4.8 kb/s and 8 kb/s. The 8 kb/s VSELP codec was chosen by the Telecommunications Industry Association (TIA) for the North American digital cellular speech coding standard. The performance achieved by VSELP at 8 kb/s is characterized by MOS values of about 3.8-3.9; the speech quality is very close to that required for toll-quality communications.

5. Low-Delay Speech Coding

A challenging research problem in speech coding was stimulated by the CCITT when it established the requirement that the future 16 kb/s speech coding standard must have very low coding delay, while achieving essentially the same high quality as the 32 kb/s ADPCM standard. Although recently developed speech coding algorithms based on CELP, VAPC, or VXC are able to provide the required quality at 16 kb/s, these coders introduce a substantial delay because of forward adaptation, where input speech samples are buffered to compute synthesis filter parameters prior to actual coding of the samples. To meet the low-delay constraint, forward adaptation is not feasible, yet backward adaptation at low rates tends to cause degraded quality and severe propagation of transmission errors. While ADPCM is based on backward adaptation, its quality at 16 kb/s is unacceptable.

An alternative solution based on combining backward adaptation of predictors with the basic analysis-by-synthesis configuration was introduced by Watts and Cuperman (1988), Chen (1989, 1990), and Cuperman et al. (1989, 1990). The resulting configuration was initially called vector ADPCM, then low-delay CELP (LD-CELP), and low-delay VXC (LD-VXC), respectively. Most of the following discussion will be based on LD-VXC as described by Cuperman et al. (1990).

In a backward adaptive analysis-by-synthesis configuration (Fig. 38), the parameters of the synthesis filter are not derived from the original speech signal, but computed by backward adaptation extracting information only from the sequence of transmitted codebook indices. Since both the encoder and decoder have access to the past reconstructed signal, side information is no longer needed for synthesis filters, and the low-delay requirement can be met with a suitable choice of vector dimension.

FIG. 38. Backward adaptive analysis-by-synthesis configuration. (a) Transmitter. (b) Receiver.

In the encoder (see Fig. 38), a codevector is chosen from a codebook using an analysis-by-synthesis technique. Each candidate codevector is multiplied by a gain value calculated using a backward adaptive gain predictor similar to that used in adaptive vector quantization (Section II.C). The resulting gain-scaled codevector, $u$, is input into a synthesis filter, which is a cascade of a pitch predictor and a short-term predictor. The index $i$ of the excitation vector $u^{(i)}$ is omitted for clarity. The components of vector $u$ will be denoted $u(n)$, where $n$ is the time index. The output of the long-term predictor, $w(n)$, is computed by Eq. (59), while the output of the short-term predictor, $y(n)$, is computed using Eq. (60). The output of the short-term predictor is compared to the actual speech signal, and a choice is made of the best candidate codevector using a perceptually weighted minimum squared error criterion. Once the best codevector has been chosen, this codevector is reapplied to the synthesis filter to generate the proper predictor memory. Both the short-term and pitch prediction filters are adapted on a sample-by-sample basis using a backward-adaptive technique. The only information that is then transmitted to the decoder is the index, $i_0$, corresponding to the chosen codevector.

Actually, the backward configuration shown in Fig. 38 can be obtained from the ADPCM configuration. To illustrate this transformation, Fig. 39a shows a standard ADPCM configuration and Fig. 39b shows an analysis-by-synthesis configuration having an identical predictor. If the analysis-by-synthesis configuration uses a scalar codebook having as entries the output points of the ADPCM quantizer, it is easy to show that the two configurations in Fig. 39a and Fig. 39b are equivalent. For this reason, a generalization of the configuration in Fig. 39b that uses a multidimensional codebook was called vector ADPCM by Watts and Cuperman (1988).

FIG. 39. ADPCM and analysis-by-synthesis. (a) Usual ADPCM configuration. (b) Equivalent analysis-by-synthesis configuration.

Two approaches to backward adaptation may be used for the short-term predictor in an A-by-S configuration: block and recursive. In the block


algorithms, the reconstructed signal and the corresponding gain-scaled excitation vectors are divided into blocks (frames), and the optimum parameters of the adaptive filter are determined independently within each block. In the recursive algorithms, the parameters are updated incrementally after each successive pair of excitation and reconstructed vectors is generated.

In a block backward low-delay configuration, the end result is a new set of filter parameters at the end of each frame. These parameters have to be used for the duration of the next frame even though the speech statistics are changing from one frame to the next, and the parameters for one frame can often be poorly suited to the next frame. Consequently, the choice of the frame length becomes a very difficult trade-off. For a long frame, the parameters become obsolete well before the end of the next frame; for a short frame, the estimates of the autocorrelation function used in the Wiener-Hopf equations may become unreliable. A possible alternative is to use a highly overlapped frame structure. This solution, however, leads to a high computational complexity. Recursive adaptation systems are more flexible from this point of view. The adaptation of the parameters can be carried out sample by sample, while the update period can be based on complexity considerations (Watts and Cuperman, 1988).

In block adaptive systems, an all-pole short-term predictor is traditionally used (Z = 0 in Eq. (60)). In this case, the coefficients $h_i$ can be computed for a backward adaptive configuration using the autocorrelation method by solving the Wiener-Hopf equations (50) or (51). Alternatively, the stabilized covariance method can be used. The assumption here is that the coefficients that minimize the MSE on a given block will, because of the slowly changing characteristics of speech, lead to good performance on the next block, where they will actually be used.
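The block backward update described above can be sketched as follows. This is an illustrative outline and not the code of any of the cited coders: the function names are hypothetical, the Levinson-Durbin recursion is used here as one standard way of solving the normal equations of the autocorrelation method, and the Hamming window is optional. The input is the past reconstructed signal, which the decoder also has, and the returned coefficients are meant for the next frame.

```python
import math


def autocorrelation(frame, max_lag, use_hamming=True):
    """Autocorrelation r[0..max_lag] of an optionally Hamming-windowed frame."""
    n = len(frame)
    if use_hamming and n > 1:
        frame = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)))
                 for k, x in enumerate(frame)]
    return [sum(frame[k] * frame[k + lag] for k in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the all-pole coefficients h_1..h_order."""
    h, err = [0.0] * order, r[0]
    for m in range(order):
        if err <= 0.0:
            break  # degenerate frame (e.g. all zeros): leave remaining taps at 0
        # Reflection coefficient for stage m + 1.
        k = (r[m + 1] - sum(h[i] * r[m - i] for i in range(m))) / err
        prev = h[:]
        h[m] = k
        for i in range(m):
            h[i] = prev[i] - k * prev[m - 1 - i]
        err *= 1.0 - k * k  # prediction error energy after this stage
    return h


def block_backward_update(past_reconstructed, order, use_hamming=True):
    """Predictor for the NEXT frame, computed from already decoded samples."""
    r = autocorrelation(past_reconstructed, order, use_hamming)
    return levinson_durbin(r, order)
```

With this convention the predicted sample is the sum of `h[i] * y(n - 1 - i)` over the taps. Because the coefficients are held fixed for the whole following frame, the trade-off in frame length discussed above shows up directly in how stale `past_reconstructed` is when the coefficients are applied.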
For recursive adaptation, it was found that an adaptation based on Eqs. (65) and (66) for the short-term predictor is not robust in the presence of transmission errors (Cuperman et al., 1989, 1990). Two improvements were found to increase the robustness significantly at the expense of a minor performance degradation in the absence of transmission errors. First, the robustness was found to increase when w(n - i) was replaced with u(n - i) in Eq. (66). This approach is equivalent to adapting the short- and long-term predictors in parallel rather than in cascade and consequently will be called parallel adaptation. Second, similarly to the ADPCM case, it was found that using the all-zero reconstructed signal for adapting $\{h_i\}$ and using leakage factors for all adapted coefficients further improves the robustness. Taking into account the parallel adaptation, the all-zero reconstructed signal, y'(n), is given by

$$y'(n) = u(n) + \sum_{i=1}^{Z} g_i^{n}\, u(n - i). \qquad (116)$$


With these changes, (64-66) become

where the $\lambda$ terms are the corresponding leakage factors.

In order to compare the different approaches for backward adaptation, the LD-VXC codec that uses recursive adaptation and another version using block adaptation (BA) for the short-term predictor, called LD-VXC-BA, were simulated and tested on a speech file with a duration of approximately 16 s. The codebooks for the two coders were trained separately; speech data used for training the codebooks were not included in the test file. The objective quality was estimated by the signal-to-noise ratio (SNR) and the segmental signal-to-noise ratio (SEGSNR). Figures 40 and 41 illustrate the relation between the short-term predictor update rate and the SNR performance of the two coders. For LD-VXC-BA, the autocorrelation method with a Hamming window of size 160 was used to obtain eight-pole coefficients at the end of each update period. Suitable bandwidth expansion is also incorporated in LD-VXC-BA for robustness to channel errors. LD-VXC-BA, however, clearly requires higher complexity than LD-VXC. Although LD-VXC-BA does show a slightly higher SNR curve, informal listening tests suggest that the two coders produce similar speech quality.
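For reference, the two objective measures can be computed as in the sketch below. The definitions are the usual ones (overall energy ratio in dB, and the mean of per-segment ratios in dB); the segment length of 128 samples and the decision to skip segments with zero signal or zero noise energy are assumptions of this sketch, since the text does not state the settings used in the experiments.

```python
import math


def snr_db(original, decoded):
    """Overall signal-to-noise ratio in dB over the whole signal."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, decoded))
    if noise == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal / noise)


def segsnr_db(original, decoded, segment_len=128):
    """Mean of the per-segment SNRs in dB.

    Segments with zero signal or zero noise energy are skipped
    (an assumption of this sketch).
    """
    values = []
    for start in range(0, len(original) - segment_len + 1, segment_len):
        x = original[start:start + segment_len]
        y = decoded[start:start + segment_len]
        signal = sum(v * v for v in x)
        noise = sum((a - b) ** 2 for a, b in zip(x, y))
        if signal > 0.0 and noise > 0.0:
            values.append(10.0 * math.log10(signal / noise))
    return sum(values) / len(values) if values else float("nan")
```

Because SEGSNR averages in the dB domain, low-level segments count as much as loud ones, so the two measures can rank coders differently when the noise does not scale with the signal level.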

[Figure 40: plot of SNR and SNRseg versus update period in vectors (axis ticks 10 to 40), recursive adaptation.]

FIG. 40. Performance vs. short-term predictor update rate for low-delay vector excitation coding (LD-VXC).

[Figure 41: plot of SNR and SNRseg versus update period in vectors (axis ticks 10 to 40), block adaptation.]

FIG. 41. Performance vs. short-term predictor update rate for low-delay vector excitation coding with block adaptation (LD-VXC-BA).

Generally, the use of long-term prediction in low-delay speech coding is a difficult problem because of reduced prediction gain in backward adaptation combined with sensitivity to transmission errors. A robust long-term predictor for the LD-VXC configuration was introduced by Pettigrew and Cuperman (1989, 1990). In the LD-VXC long-term predictor, the coefficients are adapted sample by sample using the algorithm given by Eq. (117). The pitch period, $k_p$, is also adapted sample by sample using a pitch tracking algorithm based on running estimates of the autocorrelation function at lags $k_p - 1$, $k_p$, and $k_p + 1$. The estimate of the normalized autocorrelation function, $\rho_{ww}(k)$, can be obtained from the following recursion:

After each update of the autocorrelation function estimate, a decision is made to increment the pitch period by one if all of the following are true: $\rho_{ww}(k_p + 1) > \rho_{ww}(k_p)$; $\rho_{ww}(k_p + 1) > \rho_{ww}(k_p - 1)$; and $\rho_{ww}(k_p + 1) > \rho_{\min}$. The constant $\rho_{\min}$ is a threshold for the autocorrelation term, used to avoid tracking in regions of unvoiced speech. Using a value of $\rho_{\min} = 0.2$ results in good performance. An alternative approach that leads to lower complexity is to track the pitch period based on estimates of the time derivatives of the coefficients $a_i$.

The speech quality in the LD-VXC codec depends significantly on the order of the short-term predictor. It was widely believed that the performance of speech coding systems using linear prediction saturates for predictor orders in the range of 12-16. However, Chen (1989, 1990) showed that, in the absence of a pitch predictor, the improvement of performance as a function of predictor order does not saturate for predictor orders as large as 50. Indeed, LD-CELP uses a block backward adaptive short-term predictor of order 50 and no pitch predictor. Low-delay CELP employs a product gain-shape excitation codebook having 128 shape vectors and eight gain values; hence, each vector of dimension 5 generates a 10-bit code. LD-CELP was selected as the candidate for the CCITT 16 kb/s standard (Chen, 1989, 1990).

For recursive adaptation configurations, the use of lattice filters for higher-order predictors has significant advantages, as shown in Section II.B. A lattice LD-VXC configuration is shown in Fig. 42. Note that the weighting filter parameters are determined by a lattice filter adapting on clean speech; the use of identical adaptation configurations for the weighting filter and the short-term predictor is important for achieving good subjective quality (Peng and Cuperman, 1990). Figure 43 shows the performance of a lattice LD-VXC (LLD-VXC) codec vs. the short-term predictor order, for a system using a lattice short-term predictor with the adaptation equations given by Eqs. (68) and (69). The results indicate that, in the presence of a long-term predictor, the improvement in performance saturates for short-term predictor orders in the range 20-30. An LLD-VXC codec with a 20th-order lattice predictor achieved subjective speech quality comparable to 7-bit PCM (which is rated at an MOS of about 4.0).

[Figure 42: block diagram. Legible block labels: Codebook, Index Selection, Weighting Filter, Speech Input, Lattice.]

FIG. 42. Low-delay vector excitation coding with lattice short-term and weighting filters (LLD-VXC).
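The LD-CELP figures quoted above (128 shape vectors, eight gain values, vector dimension 5, 10-bit code) can be checked with a few lines of arithmetic. The sampling rate of 8 kHz is an assumption of this sketch, since it is not stated in this passage, and the helper names are hypothetical. The delay computed here is only the time needed to collect one input vector, not the total coding delay.

```python
import math


def gain_shape_bits(num_shapes, num_gains):
    """Bits needed to index one (shape, gain) pair of a product codebook."""
    return math.log2(num_shapes) + math.log2(num_gains)


def bit_rate(bits_per_vector, vector_dim, sample_rate_hz):
    """Bit rate in b/s when each vector of vector_dim samples costs bits_per_vector."""
    return bits_per_vector * sample_rate_hz / vector_dim


def buffering_delay_ms(vector_dim, sample_rate_hz):
    """Time needed to collect one input vector, in milliseconds."""
    return 1000.0 * vector_dim / sample_rate_hz


SAMPLE_RATE_HZ = 8000  # assumption: telephone-band sampling

bits = gain_shape_bits(128, 8)                 # 7 + 3 = 10 bits per vector
rate = bit_rate(bits, 5, SAMPLE_RATE_HZ)       # 16000 b/s, i.e. 16 kb/s
delay = buffering_delay_ms(5, SAMPLE_RATE_HZ)  # 0.625 ms per vector
```

Under the stated assumption, the 10-bit code per five-sample vector corresponds to 2 bits/sample and exactly 16 kb/s, and the short vector keeps the buffering contribution to the delay well under a millisecond.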

[Figure 43: plot of performance versus short-term predictor order (axis ticks 0 to 60).]

FIG. 43. Lattice low-delay vector excitation coding (LLD-VXC) performance vs. short-term predictor order.

E. Tree and Trellis Coding

Tree and trellis coders try to improve the performance of a given speech coding system by delaying the decisions in the encoding process. Rather than making independent single quantization decisions, these coders consider all the possible decisions at consecutive time instants and find the best sequence of decisions using a search algorithm. The general approach is similar to (and inspired by) the well-known techniques developed for modulation and coding in digital communications. A detailed discussion of tree and trellis coding is beyond the scope of this paper; the purpose of this section is to describe briefly the basic ideas and to relate the resulting systems to the speech coders described in previous sections.

1. Tree Coding

In tree coding, the sequence of possible decisions forms a tree in which branches are labeled by the possible (allowable) reconstruction values. In the simple example shown in Fig. 44, there are two allowable reconstruction values at each time instant. These reconstruction values can be provided, for example, by a 1 bit/sample ADPCM coder. Starting from an initial state at the time instant n, the coder may transmit a zero and generate the reconstruction level $y_0$, or transmit a one and generate the reconstruction level $y_1$. Correspondingly, there are two possible initial states at time instant n + 1, and in each state, two other reconstruction levels may be generated. The ADPCM coder may have an infinite number of internal states, which leads to an infinite tree being generated.

[Figure 44: binary tree drawn against a time axis marked n, n + 1, n + 2.]

FIG. 44. Tree code of depth L = 2.

A tree coder considers all the paths through the tree for L consecutive time instants, where L is called the tree depth. The distortion between L samples of the input sequence and each possible path through the tree is calculated, and the path with the smallest distortion is selected. Decisions for the current sample, or a number of consecutive samples, are then released. In the example shown in Fig. 44, the tree depth is two. Hence, two consecutive input samples, (x(n), x(n + 1)), are compared successively to the pairs of reconstructed values $(y_0, y_2)$, $(y_0, y_3)$, $(y_1, y_4)$, and $(y_1, y_5)$. Assuming the smallest distortion is obtained for the pair $(y_1, y_4)$, a “one” is released corresponding to the decision at the time instant n, the internal state of the ADPCM coder is adjusted according to the reconstruction level $y_1$, and a new tree of depth two is built for the next time instant, n + 1.

There are three basic elements in a tree coding algorithm: the code generator, the search algorithm, and the symbol release rule. Any of the previously described speech coding systems may be used as code generators for a tree coding system. Most of the tree coding systems described in the literature use PCM, DPCM, or APC code generators (Jelinek and Anderson, 1971; Anderson and Bodie, 1975; Jayant and Christensen, 1978; Wilson and Husain, 1979; Goris and Gibson, 1981; Fehn and Noll, 1982; Iyengar and Kabal, 1988; Chang and Gibson, 1988). Recently, tree coding was used in a CELP environment (Mano and Moryia, 1990). The search algorithm determines the computational complexity of the tree coding system. Consider a system in which the coder has N possible output points (reconstruction levels). If decisions at L consecutive time instants are considered before choosing the optimal sequence, a decision tree with $N^L$ paths is obtained.
The exhaustive search of such a tree leads to very high computational complexity, even if efficient search algorithms, such as the Viterbi algorithm, are used. For this reason, most tree coding systems use the (M, L) search algorithm, in which only a fixed number of paths, M, are left in contention at any stage of the search through the tree. It has been shown that for some speech coding applications there is little difference in performance between the (M, L) algorithm and the (optimal) Viterbi search (Gibson and Haschke, 1987; Iyengar and Kabal, 1988). Most tree coding systems use very simple release rules; for example, in the so-called one-symbol-release rule, only the symbol representing the decision at time n is released based on the optimal path of length L.

Tree speech coding systems using APC-based code generators with forward and backward adaptation achieved good results at 16 kb/s. The subjective performance was found to be close to 7-bit PCM (Iyengar and Kabal, 1988), or comparable to MPLPC (Chang and Gibson, 1988). At rates below 16 kb/s, the performance of these systems degrades rapidly.
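The three elements named above (code generator, search algorithm, release rule) can be put together in a short sketch. It is a simplified illustration rather than any of the cited systems: the code generator is a hypothetical 1 bit/sample DPCM with a fixed step size and a fixed first-order predictor, the distortion is plain squared error, and the pruning simply keeps the M lowest-distortion paths at each depth. With M at least $2^L$ the search is exhaustive over the depth-L tree.

```python
def generate(state, bit, step=0.5, a=0.9):
    """Hypothetical 1 bit/sample DPCM code generator: next reconstruction level."""
    return a * state + (step if bit else -step)


def ml_tree_encode(x, depth, m_paths, step=0.5, a=0.9):
    """(M, L)-style tree search with the one-symbol release rule."""
    state, bits, recon = 0.0, [], []
    for n in range(len(x)):
        # Each path is (accumulated distortion, generator state, symbols so far).
        paths = [(0.0, state, ())]
        for sample in x[n:n + depth]:
            extended = []
            for cost, s, symbols in paths:
                for bit in (0, 1):
                    y = generate(s, bit, step, a)
                    extended.append((cost + (sample - y) ** 2, y, symbols + (bit,)))
            extended.sort(key=lambda p: p[0])
            paths = extended[:m_paths]  # keep only the M best paths
        released = paths[0][2][0]       # first symbol of the best path
        state = generate(state, released, step, a)
        bits.append(released)
        recon.append(state)
    return bits, recon


def tree_decode(bits, step=0.5, a=0.9):
    """The decoder only runs the code generator on the released symbols."""
    state, recon = 0.0, []
    for bit in bits:
        state = generate(state, bit, step, a)
        recon.append(state)
    return recon
```

With `depth=1` and `m_paths=1` the routine reduces to ordinary sample-by-sample DPCM, which makes it easy to compare delayed-decision and instantaneous coding on the same input. The decoder is unchanged by the search depth, since the extra work is confined to the encoder.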

2. Trellis Coding

A trellis coder may be obtained from a tree coder that has a code generator with a finite number of internal states. Assume that in the example of Fig. 44, the reconstruction levels at each stage depend only on the current symbol and the previous symbol. As shown in Fig. 45a, the resulting tree repeats itself after an initial fan-out. The repetitive part of the tree can be represented by the trellis shown in Fig. 45b. This trellis encodes the signal at a rate of 1 bit/sample; depending on the state of the code generator, the pair of levels $y_2$, $y_3$, or the pair $y_4$, $y_5$ is available (allowable) at each time instant.

A simple code generator for trellis coding may be obtained using a shift register and a table look-up. If all possible values for the channel symbols are shifted into the register, the table look-up will produce all the allowable reconstruction levels by using the register contents as a table address. The parallel to convolutional codes in digital communications is evident. Trellis code generators may be obtained by using DPCM coders with finite impulse response (FIR) (all-zero) predictors, or by truncating the response of DPCM coders with infinite impulse response (IIR).

Trellis coded quantization (TCQ) is a reduced-complexity trellis coding approach obtained by extending the trellis coded modulation idea (known in digital communications) to source coding. In TCQ the number of reconstruction levels allowable at rate R bits/sample is $2^{R+1}$. The reconstruction levels are typically divided into four subsets. R - 1 bits/sample are used to specify which of the reconstruction levels in a given subset will be used at a given time instant. The rest of the bits specify the path that determines the subset to be chosen at each time instant. A predictive TCQ system may be obtained by predicting the signal at each trellis node as a linear combination of reconstruction levels specified by the survivor path associated with that node.

[Figure 45: (a) binary tree with branches labeled by channel symbols and reconstruction levels; (b) two-state trellis.]

FIG. 45. Trellis code with two states. (a) Resulting tree. (b) Trellis representing repetitive part of tree.

Trellis coding is an asymptotically optimal technique in the rate-distortion theory sense (Viterbi and Omura, 1974; Gray, 1977). Trellis coders have been used successfully for encoding synthetic signals and speech (Viterbi and Omura, 1974; Stewart et al., 1982; Ayanoglu and Gray, 1986; Marcellin and Fischer, 1990; Marcellin et al., 1990). A predictive TCQ system applied to speech coding at 16 kb/s achieved a SEGSNR of 18-20 dB (Marcellin et al., 1990), which is competitive with the performance of such systems as VPC, and significantly better than the performance of scalar ADPCM at 16 kb/s.

Tree and trellis coding are techniques conceptually similar to vector quantization. The basic common characteristic is the idea of block or delayed coding. In vector quantization, the symbols corresponding to an entire data block are released at the end of a search procedure that involves only that particular data block. In tree/trellis coding, only a small number of symbols representing the first samples of the block are released when the search is completed for that particular block; the blocks are highly overlapped. Block overlapping suggests that tree/trellis coding might avoid some blocking (edge) effects at the expense of a higher computational complexity. On


the other hand, vector quantization has been very successful in analysis-by-synthesis systems using perceptual error criteria, while the problem of tree/trellis search optimization under a perceptual criterion is still open. The above comments represent a very loose attempt to compare tree/trellis coding and vector quantization; unfortunately, the data available in the literature regarding this comparison are scarce and do not allow a detailed comparison. Furthermore, the above comments refer to tree or trellis systems based on scalar quantization. Relatively little work has been done in vector tree/trellis quantization, an area that may be promising for future research.
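The shift-register code generator and the trellis search described in the trellis coding subsection can be sketched for the two-state, 1 bit/sample case. The table of reconstruction levels below is purely illustrative (the numerical values are assumptions, not taken from the text), the distortion is plain squared error, and the whole input is searched before any symbol is released. An exhaustive search is included only as a reference against which the Viterbi result can be checked on short inputs.

```python
from itertools import product

# Hypothetical look-up table addressed by the 2-bit shift register
# (previous symbol, current symbol); the level values are illustrative.
LEVELS = {(0, 0): -1.0, (0, 1): 0.5, (1, 0): -0.5, (1, 1): 1.0}


def reconstruct(bits, initial=0):
    """Shift the channel symbols through the register and read the table."""
    prev, out = initial, []
    for bit in bits:
        out.append(LEVELS[(prev, bit)])
        prev = bit
    return out


def viterbi_encode(x, initial=0):
    """Minimum squared-error path through the two-state trellis."""
    inf = float("inf")
    cost = {0: inf, 1: inf}
    cost[initial] = 0.0
    path = {0: [], 1: []}
    for sample in x:
        new_cost, new_path = {0: inf, 1: inf}, {0: [], 1: []}
        for prev in (0, 1):
            if cost[prev] == inf:
                continue  # state not reachable yet
            for bit in (0, 1):  # the new state is the symbol just shifted in
                c = cost[prev] + (sample - LEVELS[(prev, bit)]) ** 2
                if c < new_cost[bit]:
                    new_cost[bit], new_path[bit] = c, path[prev] + [bit]
        cost, path = new_cost, new_path
    best = min((0, 1), key=lambda s: cost[s])
    return path[best], cost[best]


def exhaustive_encode(x, initial=0):
    """Reference search over all 2**len(x) symbol sequences."""
    best_bits, best_cost = None, float("inf")
    for bits in product((0, 1), repeat=len(x)):
        c = 0.0
        for sample, y in zip(x, reconstruct(bits, initial)):
            c += (sample - y) ** 2
        if c < best_cost:
            best_bits, best_cost = list(bits), c
    return best_bits, best_cost
```

The Viterbi search keeps one survivor per state, so its cost grows linearly with the input length, whereas the exhaustive reference doubles with every sample. This is the practical benefit of a code generator with a finite number of states.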

IV. FUTURE RESEARCH DIRECTIONS

The following brief comments are restricted to the analysis-by-synthesis systems, which are expected to remain a major source of improved speech coding algorithms for the coming decade. Two main research topics will dominate the speech coding research activity in the next few years: low-rate speech coding (below 4.8 kb/s), and low-delay speech coding (below 5 ms).

Achieving good quality at rates below 4.8 kb/s has been a major challenge in speech coding for many years. There are a number of important new applications for low-rate speech coding, including half-rate digital cellular and satellite communications. It is expected that new techniques will emerge to meet these new challenges; some new research directions can already be discerned in the literature.

One of the deficiencies of the existing low-rate speech coding systems lies in their rigid configuration, wherein the input signal is divided into frames of equal duration and a fixed coding algorithm is used for each frame. Researchers try to cope with this problem by adapting the coder to the phonetic content of each frame (Wang and Gersho, 1989, 1990), and by adapting the bit allocation to the frame statistics (Jayant and Chen, 1989; Taniguchi et al., 1989, 1990; Yong and Gersho, 1988).

Another problem in the existing low-rate systems is the weakness of the excitation model. Stochastic codebooks, multipulse excitation, as well as trained codebooks, use little knowledge about the residual structure; little is known about the connection between the residual quantization errors and the resulting speech quality. In order to study the effect of the residual quantization, a singular value decomposition of the input and output to the synthesis filter was introduced, and new models based on combined single-pulse and stochastic codebook excitation are being studied (Atal and Caspers, 1990).
The progress in digital hardware may facilitate the implementation of computationally expensive analysis-by-synthesis techniques. An example is the joint optimization of the excitation vector and synthesis filter discussed in Section III.D.1. Such techniques may lead to significant performance improvement at low rates.

Finally, the perceptual weighting used in analysis-by-synthesis is based on the assumption of a flat quantization noise spectrum. This assumption fails at low rates; moreover, the quantization noise level increases significantly at rates below 4.8 kb/s. These observations lead to the conclusion that new perceptual criteria are needed to improve performance at rates below 4.8 kb/s. Possible improvements of the perceptual criteria may result by taking into account the spectral fine structure of the speech signal, or by exploiting the temporal masking of acoustic events (Kroon and Atal, 1990b). Generally, a better understanding of human perception may bring significant progress in this area.

Achieving good speech quality at rates below 16 kb/s will represent a major challenge in low-delay speech coding. The progress in this area is related to a better understanding of the limitations of backward adaptation in analysis-by-synthesis. At low rates, the quantization noise affects the backward prediction loop, resulting in low prediction gains. Parameter estimation techniques for noisy environments may lead to better results in backward prediction. Low-delay speech coders may also take advantage of any progress in low-rate analysis-by-synthesis speech coding in the areas mentioned above.

ACKNOWLEDGMENTS

I would like to thank Professor Allen Gersho of the University of California, Dr. Peter Kroon of AT&T Bell Labs, and Professor Thomas Fischer of Washington State University for their useful comments and suggestions. Also, thanks to Erdal Paksoy of the University of California and to my students Dr. Bhaskar Bhattacharya, Peter Lupini, Peter Schuler, and Feng-Hua Liu for reading the draft and helping with the figures.

REFERENCES

Adoul, J.-P. (1987). “Speech-Coding Algorithms and Vector Quantization,” in Advanced Digital Communications (K. Feher, ed.), pp. 133-181. Prentice-Hall, Englewood Cliffs, New Jersey.

Adoul, J.-P., and Lamblin, C. (1987). “A Comparison of Some Algebraic Structures for CELP Coding of Speech,” Proc. ICASSP, April.

Adoul, J.-P., Morissette, S., and Rudko, M. (1979). “Bit-Rate-Halving Algorithm for PCM-Encoded Speech Using a New Bidimensional Data Compression Scheme,” Proc. ICASSP, April, 432-435.


Anderson, J. B., and Bodie, J. B. (1975). “Tree Encoding of Speech,” IEEE Trans. on Inf. Theory IT-21, 379-387.

Atal, B. S. (1982). “Predictive Coding of Speech at Low Bit Rates,” IEEE Trans. Comm. COM-30, 600-614.

Atal, B. S., and Caspers, B. E. (1990). “Beyond Multipulse and CELP Towards High Quality Speech at 4 kb/s,” in Advances in Speech Coding (B. S. Atal, V. Cuperman, and A. Gersho, eds.). Kluwer Academic Publ.

Atal, B. S., and Remde, J. R. (1982). “A New Model of LPC Excitation for Producing Naturally-Sounding Speech at Low Bit Rates,” Proc. ICASSP, 614-617.

Atal, B. S., and Schroeder, M. R. (1967). “Predictive Coding of the Speech Signals,” Proc. Conf. Speech Comm. and Processing, Nov., 360-361.

Atal, B. S., and Schroeder, M. R. (1970). “Adaptive Predictive Coding of the Speech Signals,” Bell Syst. Tech. J. 49, 1973-1986.

Atal, B. S., and Schroeder, M. R. (1979). “Predictive Coding of Speech Signals and Subjective Error Criteria,” IEEE Trans. ASSP ASSP-27(3), 247-254.

Atal, B. S., and Schroeder, M. R. (1984). “Stochastic Coding of Speech Signals at Very Low Bit Rates,” Proc. Int. Conf. Comm., 1610-1613.

Atal, B. S., Cox, R. V., and Kroon, P. (1989). “Spectral Quantization and Interpolation for CELP Coders,” Proc. ICASSP, 69-72.

Ayanoglu, E., and Gray, R. M. (1986). “The Design of Predictive Trellis Waveform Coders Using the Generalized Lloyd Algorithm,” IEEE Trans. Comm. COM-34, 1073-1080.

Bell, C. G., Fujisaki, H., Heinz, J. M., Stevens, K. N., and House, A. S. (1961). “Reduction of Speech Spectra by Analysis-by-Synthesis Techniques,” J. Acoust. Soc. Am. 35, 1264-1273.

Bennet, W. R. (1948). “Spectra of Quantized Signals,” Bell System Tech. J., July, 446-472.

Berger, T. (1971). “Optimum Quantizers and Permutation Codes,” IEEE Trans. on Inf. Theory, Nov., 759-765.

Berger, T. (1972). Rate Distortion Theory. Prentice-Hall, Englewood Cliffs, New Jersey.

Berouti, M., Garten, H., Kabal, P., and Mermelstein, P. (1984). “Efficient Computation and Encoding of the Multipulse Excitation for LPC,” Proc. ICASSP, 10.1.1-10.1.4.

Buzo, A., Gray, A. H., Jr., Gray, R. M., and Markel, J. D. (1980). “Speech Coding Based Upon Vector Quantization,” IEEE Trans. ASSP ASSP-28, 562-574.

Campbell, J. P., Jr., Welch, V. C., and Tremain, T. E. (1989). “An Expandable Error-Protected 4800 bps CELP Coder,” Proc. ICASSP, May, 735-738.

Campbell, J. P., Jr., Tremain, T. E., and Welch, V. C. (1990). “The DOD 4.8 kbs Standard (Proposed Federal Standard 1016),” in Advances in Speech Coding (B. S. Atal, V. Cuperman, and A. Gersho, eds.). Kluwer Academic Publ.

CCITT (1984). “32 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM),” Recommendation G.721, October.

Chang, W. W., and Gibson, J. D. (1988). “A Comparison of Adaptive Code Generators for Tree Coding of Speech,” Proc. Midw. Symp. on Circ. and Sys., Aug., 924-927.

Chang, P-C., and Gray, R. M. (1986). “Gradient Algorithms for Designing Predictive Vector Quantizers,” IEEE Trans. ASSP ASSP-34(4), 679-690.

Chen, J-H. (1989). “A Robust Low-Delay CELP Speech Coder at 16 kbit/s,” IEEE Globecom Conf., 1237-1241.

Chen, J-H. (1990). “A Robust Low-Delay CELP Speech Coder at 16 kb/s,” in Advances in Speech Coding (B. S. Atal, V. Cuperman, and A. Gersho, eds.). Kluwer Academic Publ.

Chen, J-H., and Gersho, A. (1985). “Gain-Adaptive Vector Quantization for Medium Rate Speech Coding,” Proc. ICC, 1456-1460.

Chen, J-H., and Gersho, A. (1986). “Vector Adaptive Predictive Coding of Speech at 9.6 kb/s,” Proc. ICASSP, April, 33.4.1-33.4.4.


Chen, J-H., and Gersho, A. (1987a). “Real Time Vector APC Speech Coding at 4800 bps with Adaptive Postfiltering,” Proc. ICASSP, April, 2185-2188.

Chen, J-H., and Gersho, A. (1987b). “Gain Adaptive Vector Quantization with Application to Speech Coding,” IEEE Trans. Comm. COM-35, 918-930.

Conway, J. H., and Sloane, N. J. A. (1983). “A Fast Encoding Method for Lattice Codes and Quantizers,” IEEE Trans. Inf. Theory IT-29, 820-824.

Copperi, M., and Sereno, D. (1985). “Vector Quantization and Perceptual Criteria for Low-Rate Coding of Speech,” Proc. ICASSP, 252-255.

Crochiere (1979) p. 55

Crochiere, R. E., Weber, S. A., and Flanagan, J. L. (1976). “Digital Coding of Speech in Subbands,” Bell Sys. Tech. J., Oct., 1069-1085.

Cuperman, V. (1986). “Vector Transform Quantization for Speech Coding,” Proc. IEEE Globecom Conf., Dec., 792-796.

Cuperman, V. (1989). “On Adaptive Vector Transform Quantization for Speech Coding,” IEEE Trans. Comm. 37(3), 261-267.

Cuperman, V., and Gersho, A. (1982). “Adaptive Differential Vector Coding of Speech,” Proc. IEEE Globecom Conf., 1092-1096.

Cuperman, V., and Gersho, A. (1985). “Vector Predictive Coding of Speech at 16 kbit/s,” IEEE Trans. Comm. COM-33(7), 685-696.

Cuperman, V., Gersho, A., Pettigrew, R., Shynk, J. J., and Yao, J-H. (1989). “Backward Adaptation for Low Delay Vector Excitation Coding at 16 kbit/s,” IEEE Globecom Conf., 1242-1246.

Cuperman, V., Gersho, A., Pettigrew, R., Shynk, J. J., and Yao, J-H. (1990). “Backward Adaptive Configurations for Low-Delay Vector Excitation Coding,” in Advances in Speech Coding (B. S. Atal, V. Cuperman, and A. Gersho, eds.). Kluwer Academic Publ.

Cutler, C. C. (1952). “Differential Quantization of Communications,” US Patent 2,605,361, July.

Daumer, W. R. (1982). “Subjective Evaluation of Several Efficient Speech Coders,” IEEE Trans. Comm. COM-30(4), 655-662.

Davidson, G., and Gersho, A. (1986). “Complexity Reduction Methods for Vector Excitation Coding,” Proc. ICASSP, April.

Davidson, G., and Gersho, A. (1988). “Multiple-Stage Vector Excitation Coding of Speech Waveforms,” Proc. ICASSP, New York, April, 163-166.

Davidson, G., Yong, M., and Gersho, A. (1987). “Real-Time Vector Excitation Coding of Speech at 4800 bps,” Proc. ICASSP, April.

Dudley, H. (1936). “The Vocoder,” Bell Labs. Record 17, 122-126.

Durbin, J. (1960). “The Fitting of Time Series Models,” Rev. Inst. Int. Statist., 233-243.

Esteban, D., and Galand, C. (1977). “Application of Quadrature Mirror Filters to Split Band Voice Coding Schemes,” Proc. ICASSP, May, 191-195.

Fallside, F., and Woods, W. A. (1985). Computer Speech Processing. Prentice Hall International (UK), Ltd.

Fehn, H. G., and Noll, P. (1982). “Multipath Search Coding of Stationary Signals with Applications to Speech Coding,” IEEE Trans. Comm. COM-30, 687-701.

Flanagan, J. L. (1972). Speech Analysis Synthesis and Perception. Springer-Verlag, Berlin.

Flanagan, J. L., Schroeder, M. R., Atal, B. S., Crochiere, R. E., Jayant, N. S., and Tribolet, J. N. (1979). “Speech Coding,” IEEE Trans. Comm., May, 710-737.

Furui, S. (1989). Digital Speech Processing, Synthesis, and Recognition. Marcel Dekker, New York.

Gersho, A. (1979). “Asymptotically Optimal Block Quantization,” IEEE Trans. on Inf. Theory IT-25, 373-380.

Gersho, A. (1986). “Vector Quantization: A New Direction in Source Coding,” in Digital Communications (E. Biglieri and G. Prati, eds.), pp. 267-281. Elsevier Science Publ. B. V. (North-Holland), Amsterdam.


Gersho, A., and Cuperman, V. (1983). “Vector Quantization: A Pattern Matching Technique for Speech Coding,” IEEE Comm. Mag., December, 15-21.

Gerson, I., and Jasiuk, M. (1989; 1990). “Vector Sum Excited Prediction (VSELP),” IEEE Workshop on Speech Coding in Telecommunications; and in Advances in Speech Coding (B. S. Atal, V. Cuperman, and A. Gersho, eds.). Kluwer Academic Publ.

Gibson, J. D. (1980). “Adaptive Prediction in Speech Differential Encoding Systems,” Proc. IEEE, April, 488-525.

Gibson, J. D., and Haschke, G. B. (1987). “Adaptive Code Generators for Tree Coding of Speech,” Proc. ICC, June, 1142-1146.

Gibson, J. D., and Sayood, K. (1988). “Lattice Quantization,” in Advances in Electronics and Electron Physics (P. Hawkes, ed.), Vol. 72, pp. 259-330. Academic Press, New York.

Gish, H., and Pierce, J. N. (1968). “Asymptotically Efficient Quantizing,” IEEE Trans. Inf. Theory, Sept., 676-683.

Goblick, T. J., and Holsinger, J. T. (1967). “Analog Source Digitization: A Comparison of Theory and Practice,” IEEE Trans. Inf. Theory, April, 323-326.

Goris, A. C., and Gibson, J. D. (1981). “Incremental Tree Coding of Speech,” IEEE Trans. Inf. Theory IT-27, 511-516.

Granzow, W., and Noll, P. (1983). “On Quantization of a Memoryless Gamma Source,” Technical University of Berlin, report.

Gray, A. H., Jr., and Markel, J. D. (1976). “Distance Measures for Speech Processing,” IEEE Trans. ASSP ASSP-24, 380-391.

Gray, R. M. (1977). “Time-Invariant Trellis Encoding of Ergodic Discrete-Time Sources with a Fidelity Criterion,” IEEE Trans. Inf. Theory IT-23, 71-83.

Gray, R. M. (1984). “Vector Quantization,” IEEE ASSP Mag., April, 4-29.

Gray, R. M. (1990). Source Coding Theory. Kluwer Academic Publ.

Gray, R. M., and Abut, H. (1982). “Full Search and Tree Search Vector Quantization of Speech Waveforms,” Proc. ICASSP, 593-596.

Gray, R. M., Buzo, A., Gray, A. H., Jr., and Matsuyama, Y. (1980a). “Distortion Measures for Speech Coding,” IEEE Trans. ASSP ASSP-28, 367-376.

Gray, R. M., Kieffer, J. C., and Linde, Y. (1980b). “Locally Optimal Block Quantizer Design,” Inf. and Control 45, 178-198.

Gray, R. M., Gray, A. H., Jr., Rebolledo, G., and Shore, J. E. (1981). “Rate Distortion Speech Coding with a Minimum Discrimination Information Distortion Measure,” IEEE Trans. Inf. Theory IT-27, 708-721.

Harrison, D. D., and Modestino, J. W. (1990). “Analysis and Further Results on Adaptive Entropy-Coded Quantization,” IEEE Trans. Inf. Theory 36(5), 1069-1088.

Honda, M., and Itakura, F. (1984). “Bit Allocation in Time and Frequency Domains for Predictive Coding of Speech,” IEEE Trans. ASSP ASSP-32, 465-473.

Honig, M. L., and Messerschmitt, D. G. (1984). Adaptive Filters. Kluwer Academic Publ.

Huang, Y., and Schultheiss, P. M. (1963). “Block Quantization of Correlated Gaussian Random Variables,” IEEE Trans. Comm. Sys., Sept., 289-296.

Huffman, D. (1952). “A Method for the Constructing of Minimum Redundancy Codes,” Proc. IRE, Sept., 1098-1101.

Irie, K., Tada, Y., and Honma, K. (1988). “APC-AB Modules Operating at 16 and 8 kbit/s,” IEEE Trans. on Sel. Areas in Comm. 6(2), 383-390.

Itakura, F., and Saito, S. (1968). “Analysis Synthesis of Telephone Speech Based Upon the Maximum Likelihood Method,” in Conf. Rec. 6th Int. Congr. Acoust. (Y. Yonasi, ed.). Tokyo.

Itakura, F., and Saito, S. (1971). “Digital Filter Techniques for Speech Analysis and Synthesis,” Proc. 7th Int. Congr. Acoust., Budapest, 25-C-1.

Iyengar, V., and Kabal, P. (1988). “A Low Delay 16 kbit/s Speech Coder,” Proc. ICASSP, 243-246.









ADVANCES IN ELECTRONICS AND ELECTRON PHYSICS, VOL. 82

Bandgap Narrowing and Its Effects on the Properties of Moderately and Heavily Doped Germanium and Silicon

SURESH C. JAIN,* R. P. MERTENS AND R. J. VAN OVERSTRAETEN

IMEC, Leuven, Belgium

* Former address: Solid State Physics Laboratory, Delhi, India.

I. Introduction
    A. General Remarks
    B. Historical Survey
II. Calculations of BGN in n-Type Silicon and n-Type Germanium
    A. Commonly Used Approximations
    B. The Impurity Band in Lightly Doped Semiconductors
    C. High-Density Theories
    D. Klauder's Multiple Scattering Theory for Moderately Doped Semiconductors
III. Impurity Concentration Fluctuations and Band Tails
    A. The Need for Concentration Fluctuations
    B. Earlier Theories of Band Tails
    C. Recent Quantum Mechanical Theories of Band Tail
IV. Effect of Bandgap Narrowing on Optical Properties
    A. Introduction
    B. Optical Absorption
    C. Photoluminescence Excitation (PLE) Absorption
    D. Photoluminescence
V. Summary of Important Results
Acknowledgments
Appendix
References

I. INTRODUCTION

A. General Remarks

Heavy doping effects in germanium, silicon, and other semiconductors have been studied for about 40 years. The work done on silicon is most extensive. The experimental evidence that the bandgap of silicon and germanium shrinks when doping is large started accumulating in the mid-1950s, and soon it became clear that the bandgap of silicon and germanium is reduced considerably by heavy doping. Experiments show that in addition to reducing the bandgap, heavy doping distorts the density-of-states function. The experimental results stimulated a lot of theoretical work on the calculation of the band structure of doped silicon and germanium. In this paper we are mainly concerned with the properties of moderately and heavily doped silicon and germanium; lightly doped crystals are only briefly considered. We discuss only the work (experimental as well as theoretical) that helps in understanding the physics of bandgap narrowing (BGN) and the effect of BGN on device performance.

This review paper is divided into five sections. In the remaining part of this section, we give a historical survey of the work done on BGN. The theoretical work on BGN is discussed in Section II. The effect of concentration fluctuation and the theory of band tails are discussed in Section III. Section IV is devoted to the effect of BGN on optical properties of Si and Ge. Concluding remarks and a summary of the main results are given in Section V.

B. Historical Survey

This section is further divided into three subsections: theoretical work disregarding concentration fluctuations is discussed in Section I.B.1, band tails due to concentration fluctuations are discussed in Section I.B.2, and experimental work in Section I.B.3.

1. Theoretical Work Neglecting Band Tails

a. Lightly Doped Semiconductors

A large amount of work has been done on the impurity states in lightly doped semiconductors with doping concentrations smaller than Mott's critical concentration $N_c$ ($N_c$ is defined in Eq. (6), given in Section II.A). In the extreme limit of very low doping and at low temperatures, the electrons are bound to donors and have hydrogen-like wave functions. As the donor concentration increases, the orbitals of the wave functions begin to overlap, and the formation of impurity bands occurs. Theoretical studies of the formation of impurity bands and of the density of states in the impurity band were made by Matsubara and Toyozawa (1961), Morgan (1965), Gaspard and Cyrot-Lackmann (1973, 1974), and Debney (1977). An excellent review of the properties of lightly doped semiconductors has been given by Mott (1974), and also by Abram et al. (1978). Serre and Ghazali (1983) and Lowney (1986a) calculated the band structure at low doping concentrations using the theory of multiple scattering (Klauder, 1961).


b. Heavily Doped Semiconductors

Wigner (1934) and Gell-Mann and Brueckner (1957) did the pioneering work on the properties of a dense electron gas and defined the exchange and the correlation energies of the gas. Wolff (1962) and Haas (1962) suggested that these theories could be used to study the band structure of heavily doped semiconductors. Parmenter (1955) used second-order perturbation theory to calculate the effect of the electron-impurity interaction on the conduction band structure. Wolff (1962) applied diagrammatic techniques and calculated the effect of both electron-electron and electron-impurity interactions on the conduction band of heavily doped n-type semiconductors. The main result of these calculations is that the electron-electron interaction causes a rigid downward shift of the conduction band. The electron-impurity interaction causes an additional shift and also distorts the density-of-states function. Later, Hwang (1970a, 1970b, 1970c) used Parmenter's method (Parmenter, 1955) to calculate the band structure of heavily doped GaAs. In the work of Parmenter (1955) and Wolff (1962), the effect of impurity concentration fluctuation was not included. Subsequently, Kleppinger and Lindholm (1971), Van Overstraeten et al. (1973), Mock (1973), and Slotboom (1977) used Kane's theory (Kane, 1963) of band tails and Morgan's theory (Morgan, 1965) of impurity bands to derive the density of states and BGN in heavily doped silicon. Limitations of these methods have been discussed by Wilson (1977). Reviews of this and other work on heavy doping have been published by Abram et al. (1978) and by Mertens et al. (1981).

An important paper on the theory of BGN was written by Inkson (1976). Inkson considered only the jellium model; the electron-impurity interactions were not considered. Inkson's work showed for the first time that as a result of many-body interactions in n-type semiconductors, not only does the conduction band move down, as was shown by Wolff many years earlier, but the valence band also moves up. Inkson used the Thomas-Fermi theory of screening. Abram et al. (1978) improved upon these calculations by using the Lindhard dielectric function for the calculation of the screened potential of the donor ions. Lanyon and Tuft (1979) also calculated the combined shifts of both the conduction and valence band edges. After the work of Inkson and of Abram et al., however, a more serious attempt to calculate the BGN of heavily doped silicon and germanium was made by Mahan (1980). Mahan included, for the first time, all interactions that affect the BGN. However, he considered the impurities to be distributed in a periodic lattice inside the semiconductor. In his formulation, the net effective contribution to the BGN by the carrier-impurity scattering came out to be rather small.

Following the work of Mahan, important papers on the subject were published by Selloni and Pantelides (1982), Pantelides et al. (1985), and Berggren and Sernelius (1981, 1985). Selloni and Pantelides (1982) calculated the contribution of intervalley scattering to BGN and added this to the value of BGN due to many-body effects (Mahan, 1980; Berggren and Sernelius, 1981) to obtain the total BGN. Berggren and Sernelius (1981, 1984, 1985) showed, however, that the contribution of intervalley scattering is negligible and that the effect of carrier-impurity scattering is significant even if the concentration fluctuation is neglected, provided the distribution of the impurities is taken to be random and not in a periodic lattice. Values of BGN in n-Si calculated by Berggren and Sernelius (1985) are in good agreement with experimental values. Recently, Jain and Roulston (1991) have derived a simple and accurate expression that can be used for Si, Ge, and III-V compound semiconductors and for any doping concentration in the high-density regime. The values of BGN calculated using this expression show good agreement with the theoretical values of BGN in n-type Si and Ge calculated by Berggren and Sernelius (1981). They also show surprisingly good agreement with observed values of BGN in n- and p-type Si, Ge, GaAs, and GaSb. Jain et al. (1990) have used this expression to predict BGN values of novel semiconductors.

c. Moderately Doped Semiconductors

We now consider the moderately doped semiconductors, in which the doping concentration is in the neighborhood of Mott's critical concentration $N_c$. Tunneling experiments (Abdurakhmanov et al., 1976; Sawaki et al., 1974; Mahan and Conley, 1967) show that the density-of-states function is highly distorted in the moderately doped semiconductors. The shape of the density of states obtained by these experiments cannot be modelled by the high-density theory. The high-density theories cannot be used at moderate or low concentrations, and they cannot predict the metal-to-nonmetal transition of the semiconductors. Klauder's (1961) multiple scattering theory can be used to calculate the density-of-states function as well as the spectral density of states for the whole range of dopings. The theory has been implemented only recently in a series of papers by Serre and Ghazali (1983) and Ghazali and Serre (1982, 1985). The NBS group (Bennett, 1983, 1985, 1986a, 1986b, 1987; Lowney, 1985, 1986a, 1986b; Bennett and Lowney, 1981, 1987; Lowney and Bennett, 1982, 1983; Lowney and Geist, 1984; Lowney and Thurber, 1984; Lowney et al., 1981; and Kahn and Lowney, 1982) has done considerable work on the heavy doping effects in silicon and GaAs. Lowney (1986a) has also used Klauder's theory (Klauder, 1961) of multiple scattering to calculate the density of states in moderately doped silicon.

2. Survey of Work on Band Tails

The theory of the effect of impurity concentration fluctuations on the band structure, leading to the formation of band tails, was given by Kane (1963, 1985) and by Bonch-Bruevich (1966). The quantum-mechanical theory of band tails was developed by Halperin and Lax (1966, 1967). The theory for non-Gaussian statistics was given by Efros (1974). Using a formalism based on the Feynman path integral method, Sayakanit and co-workers (Samathiyakanit, 1974; Sayakanit, 1979; Sayakanit and Glyde, 1980, 1982) have derived the density of states in band tails. Their theory reduces to Kane's result at energies close to the band edge, and to Halperin's result for energies deep in the gap. This work is important because it specifies more accurately the range of energy in which the earlier theories are valid and extends the range of validity of the quantum-mechanical theory originally derived by Halperin and Lax (1966, 1967). Serre et al. (1981) have also calculated the band tails using a quantum-mechanical theory and an improved criterion for determining the size of the elementary volume in which the concentration can be regarded as uniform. These theories of the effect of fluctuations of impurity concentration on the band structure, like the work of Parmenter (1955), Wolff (1962), and Berggren and Sernelius (1981, 1985), are also valid only in the high-density limit. Fortunately, the effect of concentration fluctuations is significant only when the impurity concentration is very high (see Section III).

3. Experimental Studies of BGN

a. Optical Measurements

Experimentally, the earliest observation suggesting that the bandgap of semiconductors is modified by heavy doping seems to be that of Rose (1951). He observed unexpectedly large dark currents in doped photoconductors. Parmenter (1955) suggested that this could be due to the distortion of the density of states and shrinkage of the bandgap of the photoconductor due to doping. The first definitive experimental evidence of bandgap narrowing came from the optical absorption measurements on indium antimonide by Tanenbaum and Briggs (1953) and by Hrostowski et al. (1954). They found that the calculated Burstein shift (Burstein, 1954) exceeded the observed threshold. Aigrain and des Cloizeaux (1955) and Stern and Talley (1955) suggested that the observation could be explained by an impurity-concentration-dependent reduction in the bandgap. Reflectivity measurements by Cardona and Sommers (1961) also suggested that the bandgap of a semiconductor is reduced by heavy doping effects. The most conclusive evidence of the bandgap narrowing was provided by the experiments of Haas (1962) and of Pankove and Aigrain (1962) on the optical absorption in heavily doped n-type germanium. These authors found that the threshold of absorption shifted to lower energies and that the absorption coefficient increased with doping concentration. Subsequently, similar experiments were performed on heavily doped n- and p-type silicon by Vol'fson and Subashiev (1967), Balkanski et al. (1969), and Schmid (1981). The values of BGN in heavily doped silicon determined by the absorption measurements were considerably smaller than those determined by theory and by device measurements (Pantelides et al., 1985). More recently, Parsons (1978, 1979), Schmid et al. (1981), Dumke (1983a, 1983b), Wagner (1984, 1985a, 1985b, 1987), Wagner et al. (1986), and Wagner and del Alamo (1988) have studied the luminescence spectra of heavily doped silicon. Wagner (1984, 1985a) also studied the excitation spectrum. The values of BGN derived from these measurements are considerably higher than those determined by the absorption measurements and are in general agreement with the values obtained theoretically and by device measurements (Wagner and del Alamo, 1988; Jain and Roulston, 1991).

b. Device Measurements

Kauffman and Bergh (1968) appear to be the first authors to determine the BGN by transistor measurements. They found that the activation energy of the base current of a bipolar transistor depends on the doping of the emitter and is considerably smaller than the bandgap of silicon. They attributed this to the reduction of the bandgap in the emitter due to heavy doping. Buhanan (1969) and Kannam (1973) obtained similar results from measurements of the temperature-dependent current gain of a bipolar transistor. Several other papers appeared on this subject in the late 1970s and early 1980s. Slotboom and De Graaff (1976), De Graaff et al. (1977), and Slotboom (1977) derived the values of BGN from the measured characteristics of a bipolar transistor. Lindholm et al. (1977) determined BGN from the observed solar cell characteristics. This was followed by the papers of Wieder (1980), Tang (1980), Mertens et al. (1980), Neugroschel et al. (1982), Possin et al. (1984), del Alamo et al. (1984, 1985a, 1985b, 1987a, 1987b), Park et al. (1986), and Swirhun et al. (1988). Recently, several new methods for determining BGN have been reported.
Open circuit voltage decay and reverse recovery are modified by BGN in the emitter of a junction diode (Jain and Murlidharan, 1981; Jain and Van Overstraeten, 1983, 1984; Tewary and Jain, 1986; Rauh et al., 1990). Using open circuit voltage decay measurements, Totterdell et al. (1990) have determined the values of the BGN in the heavily doped emitter of a diode. Jain et al. (1988) have investigated the effect of heavy doping on the C-V relation of abrupt as well as linearly graded junctions and have shown that the measured C-V characteristics can be used to determine the BGN. Van Mieghem et al. (1990a, 1990b) have determined experimentally the BGN in GaAs using this method. This method has certain advantages over the previously used methods. Jain and Roulston (1991) have given a summary of the BGN results obtained by device measurements in Si, Ge, and GaAs.


II. CALCULATIONS OF BGN IN n-TYPE SILICON AND n-TYPE GERMANIUM

A. Commonly Used Approximations

1. Effective Mass and Rigid Band Approximation

In calculating the donor and acceptor impurity states in semiconductors, several approximations have to be made. In many calculations, the band is assumed to be isotropic and parabolic, the electrons and holes are described by effective masses, and the screened Coulomb potentials are used in the Hamiltonian. The ionization energies of the isotropic Group III and Group V impurities in silicon and germanium are well described by this model. However, the wave vector dependence of the dielectric function and the complex band structure must be taken into account in many other cases. For normal states in the conduction band, the central cell correction can also be ignored, but this correction is likely to be important for the localized states in the band tails (Abram et al., 1978). States produced by deep impurities and those produced by clusters, complexes, and other defects cannot be described by this theory.

Marshak and co-workers have published several papers pointing out the weaknesses of the approximations usually made in the theory of heavily doped semiconductors and have given a more rigorous derivation of the theory, particularly of the theory of transport in heavily doped semiconductors (Marshak et al., 1980, 1981, 1984; Marshak, 1985). They have shown that in heavily and inhomogeneously doped semiconductors, the effective mass may be a function of space coordinates. The effective mass in the deep tails, where the electron is localized, may have a different value. Even without these complications, the density-of-states effective mass does not have one universally accepted value. Park et al. (1986) have given values of the BGN for two possible values of the effective masses (see references 7 and 16 of Park et al. (1986) for a recent discussion of the values of the effective masses). One must be careful to use the value of the effective mass appropriate to the property being calculated. The parabolic rigid band approximation is not always valid (Marshak and Van Vliet, 1984). Even in the limit of very high dopings, it is valid only approximately (Berggren and Sernelius, 1981). The approximation fails completely at intermediate values of the doping.

2. Screening and Dielectric Functions

Most early calculations of BGN were made on the assumption that the screening is linear. The theory based on this approximation is known as the Thomas-Fermi theory of screening. In this approximation, the total impurity potential at any given point is the superposition of the screened potentials due to all the impurity ions (Abram et al., 1978). The potential at a point $\mathbf{r}$ is given by

$$V(\mathbf{r}) = \sum_i \frac{-e^2}{\epsilon\,|\mathbf{r}-\mathbf{R}_i|}\,\exp\left(-k_s|\mathbf{r}-\mathbf{R}_i|\right), \tag{1}$$

where $\mathbf{R}_i$ is the position of the impurity ion and $k_s$ is the reciprocal screening length. Symbols are defined in Appendix 1. In the Thomas-Fermi approximation, $k_s$ is given by (see, for example, Abram et al., 1978)

where $D$ is the density of states and $f$ is the Fermi function. Since the density-of-states function $D$ inside the integral depends on $k_s$, the calculation of $k_s$ has to be done self-consistently and is quite involved. Numerical calculations show, however, that for a degenerate semiconductor, when the Fermi level is well above the band edge, the screening length is given to a good approximation by

$$k_s^2 = \frac{4\pi e^2}{\epsilon}\,\rho(E_F) = 7.45\times 10^{8}\, m_d^*\,\frac{n^{1/3}}{\epsilon}\ \mathrm{cm^{-2}}. \tag{3}$$

Similarly, for lower densities of the impurity and/or at higher temperature, when Boltzmann statistics are applicable,

$$k_s^2 = \frac{4\pi e^2 n}{\epsilon kT} = 6.99\times 10^{-5}\,\frac{300}{T}\,\frac{n}{\epsilon}\ \mathrm{cm^{-2}}. \tag{4}$$
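The screening expressions above can be checked numerically. The following Python sketch is an illustration only, not code from the original work. It evaluates the degenerate and Boltzmann forms of $k_s^2$ in Eqs. (3) and (4), and the superposed screened potential of Eq. (1). The physical constants, the function names, and the use of a single parabolic band with density-of-states mass `m_eff` (in units of the free electron mass) are assumptions introduced here; concentrations are taken in cm^-3 and lengths in cm.

```python
import math

# Assumed constants in mixed CGS/eV units (not taken from the text).
E2_EV_CM = 1.43996e-7   # e^2 in eV*cm
BOHR_CM = 0.529177e-8   # Bohr radius hbar^2 / (m0 e^2) in cm
K_B_EV = 8.617333e-5    # Boltzmann constant in eV/K


def ks2_degenerate(n_cm3, m_eff, eps):
    """k_s^2 in cm^-2 for the degenerate limit, Eq. (3).

    Uses rho(E_F) = m* (3 pi^2 n)^(1/3) / (pi^2 hbar^2) for a parabolic band,
    so that 4 pi e^2 rho(E_F) / eps = 4 m* (3 pi^2 n)^(1/3) / (pi a0 eps).
    """
    prefactor = 4.0 * (3.0 * math.pi ** 2) ** (1.0 / 3.0) / (math.pi * BOHR_CM)
    return prefactor * m_eff * n_cm3 ** (1.0 / 3.0) / eps


def ks2_boltzmann(n_cm3, temp_k, eps):
    """k_s^2 in cm^-2 for the nondegenerate limit, Eq. (4)."""
    return 4.0 * math.pi * E2_EV_CM * n_cm3 / (eps * K_B_EV * temp_k)


def screened_potential(r_cm, ion_positions_cm, ks_cm, eps):
    """Eq. (1): sum of screened Coulomb (Yukawa) potentials, in eV."""
    total = 0.0
    for ion in ion_positions_cm:
        d = math.dist(r_cm, ion)  # distance |r - R_i| in cm
        total += -E2_EV_CM / (eps * d) * math.exp(-ks_cm * d)
    return total


if __name__ == "__main__":
    # Hypothetical example: n = 1e19 cm^-3, m* = 1.1, eps = 11.7 (assumed for Si).
    ks = math.sqrt(ks2_degenerate(1e19, 1.1, 11.7))
    print("screening length (cm):", 1.0 / ks)
    print("V at 1 nm from one ion (eV):",
          screened_potential((0.0, 0.0, 0.0), [(1e-7, 0.0, 0.0)], ks, 11.7))
```

With unit concentration, unit mass, and unit dielectric constant, the two $k_s^2$ functions return the bare numerical prefactors, which come out close to the values 7.45 and 6.99 quoted in Eqs. (3) and (4), with powers of ten $10^{8}$ and $10^{-5}$, respectively.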

For a given impurity concentration, a critical temperature $T_c$ can be defined such that below this temperature the material is degenerate and Eq. (3) must be used for the screening length. Above this critical temperature, Boltzmann statistics are applicable and the screening length is given by Eq. (4). The expression for $T_c$ is (see, for example, Ghanam et al., 1988)

(5)

The values of the parameters used in Eqs. (1)-(5) by different authors (Mahan, 1980; Berggren and Sernelius, 1981; Lowney, 1986a; Jain and Roulston, 1991) are given in Tables I-IV, which appear in Sections II.C and II.D. For calculating the density of states, the commonly used value of the effective mass $m_e^*$ of an electron in silicon is 1.1 (all effective masses in this paper are normalized by $m_0$, which is the free electron mass). We will use $m_e^*$ = 1.108, where $m_{de}$ is the effective density-of-states mass of the electron, with a value 0.33 given in Table I (see Section II.C.2).


Several authors have pointed out that in many cases linear screening and use of the Yukawa potential in Eq. (1) give a large error in the calculated value of the hole self-energy (which is equal to the shift in the valence band) in n-type semiconductors (Abram et al., 1978; Berggren and Sernelius, 1981). In such cases the dielectric function and the screened potential must be calculated using the random phase approximation (RPA). The calculations then become very involved. An alternative method, known as the plasmon pole approximation and used by Mahan, gives very accurate results for the self-energy of the holes in n-type silicon and n-type germanium, provided the quartic term in the expansion of the plasmon dispersion relation is retained. In fact, Abram et al. (1984) and Saunderson (1983) have shown that the plasmon pole approximation is capable of giving highly accurate values of the dielectric function that agree well with the values calculated using the RPA. This is a very useful result, since with the use of the plasmon pole approximation, the calculations are considerably simplified.

In addition to Eqs. (1)-(5), we quote below some other useful expressions. Mott has given an approximate expression for the critical concentration $N_c$ at which the semiconductor changes from nonmetal to metal at low temperatures. The expression for $N_c$ is

$$N_c^{1/3}\, a \simeq 0.25. \tag{6}$$

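The parameter $r_s$ commonly used in the many-body theory is given by

$$r_s = \frac{r_a}{a}, \qquad r_a = \left(\frac{3}{4\pi N}\right)^{1/3}. \tag{7}$$

Here $r_a$ is the average interelectron distance and $N$ is the impurity concentration. The effective Bohr radius $a$ is given by

$$a = \frac{\epsilon \hbar^2}{m^* e^2}, \tag{8}$$

and the effective Rydberg energy $R$ is given by

$$R = \frac{m^* e^4}{2 \epsilon^2 \hbar^2}. \tag{9}$$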

Since cgs units have been used in most of the early work, we have also used cgs units here. A table useful for converting cgs units to SI units is given in the review by Abram et al. (1978).
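
The scaled quantities above are easy to evaluate numerically. The Python sketch below assumes the usual hydrogenic scaling, $a = 0.529\,\epsilon/m^*$ Å and $R = 13.6\,m^*/\epsilon^2$ eV, together with $r_s = r_a/a$ and $r_a = (3/4\pi N)^{1/3}$. The function names are hypothetical, and the Mott constant 0.25 used as a default is an assumed, commonly quoted value rather than one taken from this text.

```python
import math

BOHR_RADIUS_CM = 0.52918e-8   # hydrogen Bohr radius, cm
RYDBERG_MEV = 13605.7         # hydrogen Rydberg energy, meV


def effective_bohr_radius(eps, m_eff):
    """Effective Bohr radius a (cm) for dielectric constant eps and mass m_eff (in m0)."""
    return BOHR_RADIUS_CM * eps / m_eff


def effective_rydberg(eps, m_eff):
    """Effective Rydberg energy R (meV)."""
    return RYDBERG_MEV * m_eff / eps**2


def r_s(n, eps, m_eff):
    """Many-body parameter r_s = r_a / a for a concentration n (cm^-3)."""
    r_a = (3.0 / (4.0 * math.pi * n)) ** (1.0 / 3.0)
    return r_a / effective_bohr_radius(eps, m_eff)


def mott_density(eps, m_eff, criterion=0.25):
    """Concentration (cm^-3) at which N^(1/3) * a equals the assumed Mott constant."""
    return (criterion / effective_bohr_radius(eps, m_eff)) ** 3


if __name__ == "__main__":
    # Mahan's silicon parameters from Table I: eps = 11.4, m_de = 0.33
    print("R (meV):", effective_rydberg(11.4, 0.33))
    print("r_s at 1e18 cm^-3:", r_s(1e18, 11.4, 0.33))
    print("Mott density (cm^-3):", mott_density(11.4, 0.33))
```

With the silicon and germanium parameters of Table I, `effective_rydberg` should come out close to the tabulated values of $R$, which is a convenient consistency check on the inputs.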


B. The Impurity Band in Lightly Doped Semiconductors

Since this doping range is not of great interest for the design and modelling of devices, we give here only a brief account of the theoretical and experimental work on lightly doped silicon. Pearson and Bardeen studied the electrical conductivity of silicon as a function of doping at low temperatures (Pearson and Bardeen, 1949, 1956). They found that the activation energy of the conductivity decreases as the impurity concentration increases, and at a boron doping of about $5 \times 10^{18}$ cm$^{-3}$ the activation energy becomes zero. More recent measurements of phosphorus-doped silicon give a critical concentration $N_c$ equal to $3.74 \times 10^{18}$ cm$^{-3}$ (Rosenbaum et al., 1980, 1983). Mott (1974) has given a detailed discussion of the theory of impurity states near and below the critical concentration $N_c$. Lee and McGill (1975) attributed the observed decrease in the activation energy for conduction (Pearson and Bardeen, 1949, 1956) to impurity band formation and band tailing. The screening needed to calculate the impurity band and the band tails has been discussed by Lee and McGill (1975), by Neumark (1977a, 1977b), and more recently by Schechter (1988). Schechter found that the Lee and McGill theory is quite satisfactory in explaining the experimental results.

The impurity band has been calculated theoretically by several workers (Morgan, 1965; Gaspard and Cyrot-Lackmann, 1973, 1974; Matsubara and Toyozawa, 1961; Serre and Ghazali, 1983; Ghazali and Serre, 1982, 1985). As the value of $N$ increases, the electron orbitals start to overlap and formation of the impurity band begins. The peak of the band moves toward the conduction band edge with increasing impurity concentration. The qualitative behavior of the band is the same in all the calculations, but quantitatively the results are different. Gaspard and Cyrot-Lackmann (1974) found that as the concentration of the impurity becomes large, the peak of the band moves to the close vicinity of the conduction band edge.
Matsubara and Toyozawa (1961) found, however, that the peak enters the conduction band. In both cases, a long tail penetrating deep into the bandgap was found. Both calculations assume that the impurity states do not mix with the conduction band states. This is not a good assumption: in practice, hybrid wave functions must be formed by the mixing of the orbital and free-electron wave functions (Serre and Ghazali, 1983).

If the semiconductor is compensated, the inverse electron screening length is substantially reduced and strong band tails extending deep into the bandgap are formed. Morgan (1965) and Stern (1971) have used these tails to explain observed optical properties of the compensated semiconductors. Mott has shown that the low-temperature electrical conduction in heavily doped and compensated semiconductors is given by his theory of variable range hopping and varies as $\exp\{-E/kT^{1/4}\}$. There is some experimental evidence that this equation is obeyed in these heavily doped compensated semiconductors at low temperatures (Abram et al., 1978; Mott, 1974).

We will not discuss early work on lightly doped semiconductors any further; the interested reader should consult the reviews of Abram et al. (1978) and of Mott (1974). We will, however, return in Section II.D to recent results obtained for lightly doped silicon using Klauder's theory of multiple scattering.

C. High-Density Theories

1. Properties of the Dense Electron Gas

The foundation of the high-density theory of bandgap narrowing (BGN), which applies when the doping concentration $N$ is more than Mott's critical concentration $N_c$, was laid by the classical work of Wigner (1934), who studied the properties of the dense electron gas. The work was extended by Gell-Mann and Brueckner (1957), who gave numerical values of various terms that contribute to the ground-state energy of such a gas. The theory was developed for understanding the properties of metals. It assumes that $r_s$ is smaller than unity, i.e., that the average interelectron distance $r_a$ is much smaller than the effective Bohr radius $a$. It was found later that the theory cannot be applied to metals; surprisingly, it turned out to be useful for the study of heavily doped semiconductors (Haas, 1962; Wolff, 1962). The condition $r_s < 1$ under which the theory holds is not satisfied in metals but can be satisfied approximately in heavily doped semiconductors, because the values of the dielectric constant and the effective Bohr radius are much larger for semiconductors. For good metals, $r_s$ is typically 2 to 5 (Wolff, 1962). For InSb doped with $1 \times 10^{18}$ cm$^{-3}$ donors, $r_s$ is about 0.2 (Wolff, 1962). This is an extreme case, and in most semiconductors the value of $r_s$ does not become as small as 0.2. However, it does become smaller than 1 in heavily doped silicon and germanium and most other semiconductors. For example, in germanium doped with more than $1 \times 10^{19}$ cm$^{-3}$ donors, and in silicon doped with more than $3.9 \times 10^{19}$ cm$^{-3}$, $r_s$ becomes less than unity (Haas, 1962).
We will see in this review that the theory does describe the behavior of heavily doped silicon and germanium to a fair degree of accuracy in the high-density regime, i.e., above Mott's critical concentration $N_c$ ($N_c$ has a value 3 x 10'' cm$^{-3}$ to $5 \times 10^{18}$ cm$^{-3}$ for Si and Ge) and gives reasonable results at doping concentrations as low as

The properties of the dense electron gas discussed above are valid in the presence of a uniform background of positive charge that makes the system neutral (the so-called jellium model). In actual crystals, the uniform positive charge must be replaced by lattice points. Following Wigner (1934) and Gell-Mann and Brueckner (1957), we shall ignore this point for the present and discuss the energy of the electron gas in the jellium model.¹ The gas is then a fully degenerate Fermi-Dirac system with exchange and other interactions. The ground-state total energy $E_t$ of the gas in units of the effective Rydberg energy is a function of $r_s$ only ($r_s$ is defined as the ratio of $r_a$ and $a$; see Eq. (7)). As far as the magnitude of the various terms is concerned, the leading term is found to be the Fermi energy $E_F$, i.e., the kinetic energy of the degenerate gas, and the next term is the exchange energy $E_x$. The total energy $E_t$ of the electron gas can be written as

$$E_t = E_F + E_x + E_{cr}. \tag{10}$$

Wigner showed that the first two terms do not represent the exact total energy of the system, and the residual contributions to the total energy, also designated as "correlation energy," are denoted collectively by $E_{cr}$. Correlation energy is thus defined as the difference between the exact energy and the sum of the kinetic and exchange energies. The source of the exchange energy is the fermion nature of the electrons. The electrons do not remain in a spatially uniform distribution, but rearrange themselves so that electrons with the same spin avoid each other. In so doing, they reduce their repulsive energy. In other words, the exchange interaction is equivalent to an attractive interaction.

¹ Impurity scattering will be discussed later.

2. Mahan's Calculations of Kinetic, Exchange, and Correlation Energies in Heavily Doped Silicon and Germanium

The work of Inkson (1976) showed for the first time that when a large number of electrons is introduced into the conduction band of a semiconductor (as is the case in an n-type semiconductor), the structure of the valence band is also altered. Not only does the conduction band edge move down, as was shown earlier by Wolff (1962), but the valence band edge moves up, and the total BGN is considerably more than that implied in Wolff's model. Abram et al. (1978) have discussed Inkson's work in detail. As mentioned by Abram et al. (1978), Inkson's model goes much beyond the previous treatments of many-body effects on BGN. However, numerical values obtained by Inkson were not accurate. Abram et al. (1978) showed that the Thomas-Fermi screening used by Inkson is not valid for calculating the modification of the valence band in the n-type semiconductor. They improved on these calculations by using the complete Lindhard dielectric function. Electron-impurity scattering was not included in the work of Inkson (1976) or of Abram et al. (1978). Abram et al. (1978) and Keyes (1977) have written reviews on the subject covering the earlier work and pointing out the deficiencies that existed at the time.

Mahan assumed rigid parabolic bands and used the effective mass approximation to calculate $E_F$, $E_x$, and $E_{cr}$. Mahan took the anisotropy of the conduction bands into account by using the density-of-states effective mass. The degeneracy of the valence band was taken into account by using an effective mass $m_h^*$ of the hole equal to the density-of-states effective mass $m_{de}$ of the electron. All calculations were made at 0 K. The values of the parameters used by Mahan are given in Table I.

TABLE I
VALUES OF PARAMETERS USED BY MAHAN (1980) IN HIS THEORETICAL CALCULATIONSᵃ

Parameter                                          Silicon    Germanium
Dielectric constant                                11.4       15.4
Number of conduction band minima N_b               6          4
Effective Rydberg energy R (meV)                   34.5       12.6
Longitudinal effective electron mass m_l           0.98       1.59
Transverse effective electron mass m_t             0.19       0.081
Electron effective density-of-states mass m_de     0.33       0.22
Effective hole mass m_h*                           0.33       0.22
Correction factor Λ                                0.95       0.84

ᵃ All effective masses are normalized by the free electron mass m_0.

a. Kinetic Energy

The expression for kinetic energy in a single isotropic conduction band containing a density $n$ of electrons is well known (Mahan, 1980). This expression has to be modified for silicon and germanium to take into account the multiplicity of the conduction band. The modified expression for the kinetic or the Fermi energy is given by (Mahan, 1980)

$$E_F = \frac{\hbar^2 k_F^2}{2 m_{de}} = \frac{\hbar^2}{2 m_{de}} \left(\frac{3\pi^2 n}{N_b}\right)^{2/3}. \tag{11}$$

For silicon, this becomes

$$\text{Si:}\quad E_F = 3.34 \left(\frac{n}{10^{18}}\right)^{2/3}\ \text{meV}, \tag{12}$$

and for germanium,

$$\text{Ge:}\quad E_F = 6.63 \left(\frac{n}{10^{18}}\right)^{2/3}\ \text{meV}. \tag{13}$$
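
The numerical prefactors quoted for silicon and germanium can be checked with a few lines of Python. This is a minimal sketch, assuming the multivalley form $E_F = (\hbar^2/2m_{de})(3\pi^2 n/N_b)^{2/3}$ with $n$ in cm$^{-3}$ and the Table I values of $m_{de}$ and $N_b$; the function name and the rounded constant are hypothetical.

```python
import math

# hbar^2 / (2 m0) expressed in meV cm^2 (equivalent to 3.80998 eV Angstrom^2)
HBAR2_OVER_2M0 = 3.80998e-13


def fermi_energy_mev(n, m_de, n_b):
    """Fermi energy (meV) for n electrons per cm^3 shared among n_b equivalent valleys."""
    k_f = (3.0 * math.pi**2 * n / n_b) ** (1.0 / 3.0)   # Fermi wave vector, cm^-1
    return HBAR2_OVER_2M0 * k_f**2 / m_de


if __name__ == "__main__":
    for name, m_de, n_b in (("Si", 0.33, 6), ("Ge", 0.22, 4)):
        print(name, fermi_energy_mev(1e18, m_de, n_b), "meV at n = 1e18 cm^-3")
```

At $n = 10^{18}$ cm$^{-3}$ the results should lie close to the quoted prefactors 3.34 meV (Si) and 6.63 meV (Ge); small differences are expected from rounding of the effective masses. Setting `n_b = 1` shows directly how the valley multiplicity lowers $E_F$ at fixed $n$.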


The factor $N_b$ in Eq. (11) takes into account the multiple conduction bands in germanium and silicon. The value of $N_b$ is 4 for germanium and 6 for silicon. The multiplicity of the conduction band lowers the numerical value of $k_F$ and of $E_F$. The kinetic energy of the holes in n-type semiconductors is zero.

b. Exchange and Correlation Energies

For n-type silicon and n-type germanium, the expression for exchange energy has to be modified to take into account the multivalley ellipsoids in the conduction band. The modification has been discussed by Haas (1962), Bonch-Bruevich (1966), and Mahan (1980). The result of Haas, also used by Mahan, leads to the following expression for the self-energy of the electron, which is equal to the downward shift of the Fermi level and of the conduction band edge in the semiconductor:

$$\Delta E_{cx} = \delta\mu_x = -\frac{e^2 k_F}{\pi \epsilon}\,\Lambda = -\frac{e^2}{\epsilon}\left(\frac{3}{\pi N_b}\right)^{1/3} \Lambda\, n^{1/3}. \tag{14}$$

Substituting the values of the parameters, we obtain

$$\text{Si:}\quad \Delta E_{cx} = \delta\mu_x = -6.47 \left(\frac{n}{10^{18}}\right)^{1/3}\ \text{meV} \tag{15}$$

for silicon, and

$$\text{Ge:}\quad \Delta E_{cx} = \delta\mu_x = -4.89 \left(\frac{n}{10^{18}}\right)^{1/3}\ \text{meV} \tag{16}$$

for germanium. The values of the correction factor $\Lambda$ introduced to account for the multiple conduction bands are given in Table I. The correlation energy of the electrons in the conduction band comes out to be approximately $-0.1R$ (Mahan, 1980).

We now discuss the self-energy of the holes (which is also equal to the shift $\Delta E_{v(cr)}$ in the valence band) due to the hole-electron interaction. The self-energy due to this interaction is the hole correlation energy (since the number of holes in the valence band is zero, the exchange energy of the holes is also zero). Lanyon and Tuft (1979) calculated the self-energy of the holes due to this interaction. However, they assumed that the hole is fixed, i.e., that it does not recoil. They also used the Thomas-Fermi approximation for the screened potential. Their calculation grossly overestimates the self-energy of the holes (see the discussion of this point by Mahan, 1980). We have already mentioned that Thomas-Fermi screening does not give accurate results for the correlation energy of the holes in n-type semiconductors. Mahan used the plasmon pole approximation for calculating the dielectric function and the screened potential. Retaining the quartic term $q^4$ in the plasmon dispersion and using the Green function method, Mahan obtained for the many-body self-energy of the hole

$$\Delta E_{v(cr)} = \Sigma_{cr} = -\frac{0.95}{r_s^{3/4}}\, R. \tag{17}$$

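Numerical values for silicon are

$$\text{Si:}\quad \Delta E_{v(cr)} = \Sigma_{cr} = -13.1 \left(\frac{n}{10^{18}}\right)^{1/4}\ \text{meV}, \tag{18}$$

and for germanium,

$$\text{Ge:}\quad \Delta E_{v(cr)} = \Sigma_{cr} = -8.2 \left(\frac{n}{10^{18}}\right)^{1/4}\ \text{meV}. \tag{19}$$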
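
The $n^{1/4}$ dependence follows directly from $r_s \propto n^{-1/3}$, since $r_s^{-3/4} \propto n^{1/4}$. The Python sketch below illustrates this under the assumption that the magnitude of the hole self-energy has the form $0.95\,R/r_s^{3/4}$, the form implied by the jellium expression obtained later by doubling Eq. (17). The function name is hypothetical and the function returns the magnitude of the shift, not its sign.

```python
import math

BOHR_RADIUS_CM = 0.52918e-8   # hydrogen Bohr radius, cm
RYDBERG_MEV = 13605.7         # hydrogen Rydberg energy, meV


def hole_correlation_shift_mev(n, eps, m_de):
    """Magnitude (meV) of the hole self-energy, assumed equal to 0.95 R / r_s^(3/4)."""
    a_eff = BOHR_RADIUS_CM * eps / m_de                       # effective Bohr radius, cm
    rydberg = RYDBERG_MEV * m_de / eps**2                     # effective Rydberg, meV
    r_s = (3.0 / (4.0 * math.pi * n)) ** (1.0 / 3.0) / a_eff  # many-body parameter
    return 0.95 * rydberg / r_s**0.75


if __name__ == "__main__":
    # Table I parameters: Si (eps = 11.4, m_de = 0.33), Ge (eps = 15.4, m_de = 0.22)
    for name, eps, m_de in (("Si", 11.4, 0.33), ("Ge", 15.4, 0.22)):
        print(name, hole_correlation_shift_mev(1e18, eps, m_de), "meV at n = 1e18 cm^-3")
```

With the Table I parameters, the values at $n = 10^{18}$ cm$^{-3}$ should fall close to the quoted prefactors for silicon and germanium, and increasing $n$ by a factor of 16 should double the shift.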

Note that the self-energy of the holes in Eq. (18) depends on $n^{1/4}$ (this is the correct result), whereas Inkson had obtained an $n^{1/6}$ dependence. Sterne and Inkson (1981) have pointed out that in calculating the self-energy of the hole, a $(m_{de}/m_h^*)^{1/2}$ dependence of the integral J(so) used by Mahan (1980, p. 2642) should be taken into account. This will remove the factor $(m_h^*/m_{de})^{1/2}$ from Eq. (17). Since the effective masses of the electron and the hole are taken to be equal by Mahan, this does not introduce any numerical error in the final results. This correction will be important for GaAs. We have eliminated this factor from Eq. (17).

3. Impurity Scattering in Mahan's Model

Calculation of the interaction energy of the donors with electrons and with holes is rather complex, since it depends on how one views the spatial distribution of the electrons (Mahan, 1980). The usual method of calculating the interaction energy using Born's approximation is logically inconsistent (Mahan, 1980). The approximation assumes that the electrons are uniformly distributed. It also assumes that the electron-donor potential is a screened Coulomb potential based on Thomas-Fermi screening. The screening makes the distribution of the electrons essentially nonuniform, and therefore the model is not self-consistent. Mahan calculated the effect of impurity scattering on the BGN by two different methods. In the first method, all the conduction electron charge is located as the screening charge. The charge distribution is highly nonuniform in this method. Mahan has shown that the result of Lanyon and Tuft (1979) for hole correlation energy can be used to calculate the shift $\Delta E_{ci}$ of the conduction band edge or the change in chemical potential $\mu$ due to electron-donor interaction. The shift comes out to be

The numerical values are

$$\text{Si:}\quad \Delta E_{ci} = \delta\mu_i = -12.07 \left(\frac{n}{10^{18}}\right)^{1/6}\ \text{meV} \tag{21}$$

and

$$\text{Ge:}\quad \Delta E_{ci} = \delta\mu_i = -6.01 \left(\frac{n}{10^{18}}\right)^{1/6}\ \text{meV}. \tag{22}$$

In the second model, the donors are assumed to be uniformly distributed in a periodic lattice inside the semiconductor, and the distribution of the electrons is assumed to be uniform. In this case the shift of the conduction band edge comes out to be

$$\Delta E_{ci} = \delta\mu_i = -0.481\, \frac{e^2 n^{1/3}}{\epsilon}. \tag{23}$$

The numerical values are

$$\text{Si:}\quad \Delta E_{ci} = \delta\mu_i = -6.11 \left(\frac{n}{10^{18}}\right)^{1/3}\ \text{meV}, \tag{24}$$

and

$$\text{Ge:}\quad \Delta E_{ci} = \delta\mu_i = -4.5 \left(\frac{n}{10^{18}}\right)^{1/3}\ \text{meV}. \tag{25}$$

Mahan found that numerical values calculated by using either of the two equations come out to be nearly the same in the concentration range $1 \times 10^{18}$ cm$^{-3}$ to 1 x 10’’ cm$^{-3}$. In the final calculation of the bandgap, Mahan used the second model and calculated the effect of nonuniformity in the distribution of the electrons on the interaction energy by a variational method. The variational correction due to the nonuniformity is rather small, about $0.09R$. Mahan also calculated the self-energy of the hole due to impurity scattering in the second model and used the variational method to take into account the correction due to the nonuniform distribution of electrons. Mahan found that in the model based on the uniform distribution of electrons (and periodic distribution of the impurity), the shift of the valence band edge was in the same direction as that of the conduction band edge. The shifts of the two bands are nearly of the same magnitude. Therefore, they cancel each other. The only contribution that was left over was due to the corrections obtained by the variational method to account for the nonuniformity in the distribution of the electrons. This result does not agree with the result obtained with a random distribution of the impurity (Berggren and Sernelius, 1981, 1985), to be discussed later.


The final values of the chemical potential and conduction band edge in Mahan's model are given by the expressions

$$\mu = \frac{\hbar^2}{2 m_{de}} \left(\frac{3\pi^2 n}{N_b}\right)^{2/3} - \frac{e^2}{\epsilon} \left[\left(\frac{3}{\pi N_b}\right)^{1/3} \Lambda + 0.481\right] n^{1/3} - 0.19R \tag{26}$$

and

$$E_c = -\frac{e^2}{\epsilon} \left[\left(\frac{3}{\pi N_b}\right)^{1/3} \Lambda + 0.481\right] n^{1/3} - 0.19R. \tag{27}$$

The term $0.19R$ in these two equations is approximate and is the sum of the electron correlation energy $0.1R$ and the variational correction $0.09R$ (Mahan, 1980). We have not quoted Mahan's results for BGN because they do not contain the correct contribution of the impurity scattering to the BGN. Though Eq. (27) is based on an incorrect model of impurity scattering, we will show later that it is useful for numerical computation of BGN.

4. Work of Berggren and Sernelius

Berggren and Sernelius (1981, 1985) have written comprehensive papers on the calculation of BGN in heavily doped n-silicon and n-germanium using second-order perturbation theory. They evaluated the values of the BGN as closely as possible by the presently available many-body techniques. They used full RPA dielectric screening. Since they did not assume that the self-energies of the quasi-particles are independent of the wave vector, they were able to investigate the effect of heavy doping on the density-of-states function in the conduction band. They took into account the effect of the conduction band anisotropy and the complex structure of the valence band. The third band split off by spin-orbit interaction was ignored. The values of the parameters used by Berggren and Sernelius are given in Table II.

TABLE II
VALUES OF THE PARAMETERS USED BY BERGGREN AND SERNELIUS (1981) IN THEIR THEORETICAL CALCULATIONS

Parameter                                          Si         Ge
Dielectric constant ε                              11.40      15.36
Number of equivalent conduction bands N_b          6          4
Longitudinal effective electron mass m_l           0.9163     1.58
Transverse effective electron mass m_t             0.1905     0.082
Effective heavy hole mass m_hh                     0.523      0.347
Effective light hole mass m_lh                     0.154      0.042


FIG. 1. Various contributions (absolute magnitudes) to $\Delta E_g$ in n-Si at different donor concentrations. In (a), curve a is the total bandgap narrowing, whereas curve b gives the downward shift of the bottom of the conduction band and curve c the upward shift of the valence band edge at the centre of the zone. In (b), the shifts of the two bands are separated into contributions from electron interactions and impurity scattering. Curve a is the shift in the conduction band edge due to electron interactions, and curve b the corresponding shift in the valence band edge. Impurity scattering shifts the valence band edge according to c and the conduction band according to d. Calculations are based on the input parameters listed in Table II. [After Berggren and Sernelius (1981).]

The result of these calculations for the random distribution of impurities shows that the shift of the valence band edge due to impurity scattering is in a direction opposite to that obtained by Mahan. In Berggren's calculations, the shifts of the conduction and the valence bands do not cancel; they are in opposite directions and add to give a significantly large contribution to BGN by impurity scattering. This is the most significant new result of the calculation of Berggren et al. and is similar in nature to the result of Inkson for many-body effects. The results of the calculation of BGN by Berggren and Sernelius for a random distribution of impurity ions are shown in Figs. 1a and 1b for silicon, and Fig. 2 for germanium.

The physical argument in favour of Mahan's result is that the impurity potential for electrons is attractive, and for holes it is repulsive. The impurity scattering therefore lowers the energy of the electrons but increases that of the holes, giving rise to shifts of the band edges in the same direction. The two shifts then cancel each other, giving nearly a zero value of the BGN due to impurity scattering. The work of Berggren and Sernelius shows, however, that this argument is not correct. In the case of the filled valence band, no scattering takes place because all states are occupied. When a hole is created at the top of the valence band, the remaining valence electrons relax around the donors, lowering the energy of the system, pushing the valence band edge up, and increasing the value of BGN. Serre and Ghazali (1983, p. 28) have shown theoretically that in the high-density approximation, the sign of the impurity potential plays no role; the shifts of the band edges are in opposite directions and add, in agreement with the results of Berggren and Sernelius (1981).

Berggren and Sernelius also made calculations for the case of a periodic distribution of impurity ions. The authors found that the values of the BGN were sensitive to the assumed distribution of the impurity ions in the semiconductor. For the periodic distribution, they obtained results similar to those of Mahan. A random distribution (i.e., concentration fluctuations are not taken into account in this theory) gives quite different results, as discussed above.

FIG. 2. BGN in n-type Ge as a function of impurity concentration (horizontal axes: $N_d$ in cm$^{-3}$ and $N_d^{1/3}$ in cm$^{-1}$); (a) shows the total BGN = $\Delta E_g$ and (b) the magnitude of the downward shift of the bottom of the conduction band according to the theory of Berggren and Sernelius (1981). The dotted line (c) gives the result obtained by Mahan (1980). [After Berggren and Sernelius (1981).]


The work of Parmenter (1955) and Wolff (1962) was largely ignored in the 1970s and early 1980s, and the general view taken by several authors was that impurity scattering gives rise only to band tails due to concentration fluctuations (Abram et al., 1978; Lee and Fossum, 1983). The work of Berggren and Sernelius (1981) shows that this is not correct. We will see in Section III that the effect of tails calculated using correct theories cannot explain the experimental results. Several other features of the results of Berggren and Sernelius are noteworthy:

1. The shifts of the edges of the two bands due to many-body effects are nearly equal. The same is true for the shifts due to impurity scattering.
2. The BGN due to many-body effects varies sublinearly, and that due to impurity scattering superlinearly, with $n^{1/3}$. When added together, the curvatures in the two plots nearly cancel each other to give a final linear variation of BGN = $\Delta E_g$ with $n^{1/3}$, for both Si and Ge.
3. Fig. 1b shows that the BGN due to many-body effects is relatively greater than that due to impurity scattering. However, the actual values of the BGN due to impurity scattering are quite large and cannot be neglected. We will see in Sections III and IV that this contribution to BGN is necessary to bring agreement between theory and experiment.
4. Since Mahan treated the self-energies as being independent of the wave vector, he obtained final results in the rigid band approximation. Berggren and Sernelius found that the error introduced by the assumption of wave-vector-independent energy is 5 to 10%, depending upon the concentration.
5. The BGN does not show any tendency to saturate up to the highest concentration at which calculations have been made.

On the whole, Berggren's results are the most comprehensive, complete, and reliable high-impurity-density calculations of the BGN in silicon and germanium made so far. We will show in Section IV that these results are in good agreement with the experiments. Mahan's approach has the virtue that the final results are given by very simple closed-form expressions and are easy to compare with experiments. Recently, Jain and Roulston (1991) have given a closed-form expression for calculation of BGN in all semiconductors. We will discuss their work in Section II.C.7.

5. Intervalley Scattering

Selloni and Pantelides (1982) compared the luminescence spectra of heavily doped n-type silicon observed by Schmid et al. (1981) with the calculated spectra, taking into account Mahan's theoretical values of the BGN. They found that the many-body values of BGN were too small to give good agreement with experiment. They calculated the additional contribution to BGN due to intervalley scattering of the electrons. By adding this additional contribution to the many-body values of the BGN, they could obtain a close agreement between the calculated and the observed luminescence spectra. Berggren and Sernelius (1984, 1985) showed, however, that the effect of intervalley scattering is negligible if the impurity ions are randomly distributed. Their theory (Berggren and Sernelius, 1981) also explained successfully the observed luminescence results.

We thus have two competing theoretical models that can explain the experimental results. The first is the Mahan-Pantelides model, in which the impurity ions are distributed in a periodic lattice, the effect of impurity scattering is negligible, and intervalley scattering is important. The second is the Berggren-Sernelius model, in which the impurity ions are distributed randomly, the effect of intervalley scattering is negligible, and impurity scattering contributes significantly to the BGN. The many-body effects are nearly the same in both models. Both models give similar values of the final BGN. The difficulty can be resolved by studying GaAs, which is a single-valley semiconductor. If intervalley scattering is important, many-body effects alone should be sufficient to interpret experiments in GaAs. Sernelius (1986) has calculated the BGN of heavily doped p-type GaAs and compared it with the experimental values of BGN obtained from luminescence. Sernelius (1986) found that scattering from randomly distributed impurity ions was necessary to explain the experimental results in p-type GaAs. The conclusion that impurity scattering is important seems inescapable.

6. An Empirical Expression for the BGN

The Fermi level and the conduction band edge as calculated by Mahan and by Berggren and Sernelius are compared in Figs. 3 and 4.
The results of Mahan calculated by the two methods, i.e., by calculating impurity scattering using Eq. (20) and using Eq. (23) with the correction applied by the variational method, are shown. The agreement between the results obtained by the two methods and the results of Berggren and Sernelius is very good for both silicon and germanium; the maximum difference is only a few millielectron-volts.

FIG. 3. Concentration dependence in silicon of the chemical potential $\mu$ and the conduction band minimum $E_c$. The reference energy is the bottom of the unperturbed conduction band. Dotted curves are Berggren's results (Berggren and Sernelius, 1981). The solid and dashed curves refer to Mahan's results calculated by two different methods (Mahan, 1980).

FIG. 4. Concentration dependence in germanium of the chemical potential $\mu$ and the conduction band minimum $E_c$. The reference energy is the bottom of the unperturbed conduction band. The notation is the same as in Fig. 3 (Mahan, 1980).

The numerical values of the self-energy of the hole (or shift of the valence band) due to many-body effects calculated by Mahan (1980) and by Berggren and Sernelius (1981, 1985) also agree closely. At $n = 4 \times 10^{19}$ cm$^{-3}$, Mahan's value is 33 meV and Berggren and Sernelius obtain 35 meV. At $n = 1 \times 10^{18}$ cm$^{-3}$, the two values are 13 meV and 14 meV, respectively.

We have shown earlier that in the high-density approximation, the total value of BGN at any given donor concentration is practically equally divided between the shifts of the conduction and valence band edges. Since the shift of the conduction band edge given by Mahan's analytical expression, Eq. (27), agrees closely with the values of Berggren and Sernelius, we can obtain the value of total BGN by simply multiplying by two the shift of the conduction band edge given by Eq. (27). The expression for BGN obtained in this manner is

$$\Delta E_g = -2\, \frac{e^2}{\epsilon} \left[\left(\frac{3}{\pi N_b}\right)^{1/3} \Lambda + 0.481\right] n^{1/3} - 0.38R. \tag{28}$$

Similarly, we can obtain the BGN in the jellium model, i.e., the BGN due only to many-body effects, by multiplying Eq. (17) for the self-energy of a hole by two. The expression for BGN(jellium) becomes

$$\Delta E_g(\text{jellium}) = -2\, \frac{0.95}{r_s^{3/4}}\, R. \tag{29}$$

The values of total BGN calculated by using the simple Eq. (28) are plotted (dashed curves) in Fig. 5 for both silicon and germanium. The values of BGN of Berggren and Sernelius obtained from Figs. 1 and 2 (read as accurately as possible) are shown by solid curves. The agreement is surprisingly good. The maximum difference between the values given by Eq. (28) and those of Berggren and Sernelius is less than 3 meV for silicon and 2 meV for germanium. Similar agreement between the many-body values of BGN in n-type Si calculated using Eq. (29) and those of Berggren and Sernelius is seen in Fig. 6.
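
For readers who want numbers rather than curves, the two empirical expressions can be evaluated as in the following Python sketch. It assumes that Eq. (28) has the form $|\Delta E_g| = 2(e^2/\epsilon)[(3/\pi N_b)^{1/3}\Lambda + 0.481]\,n^{1/3} + 0.38R$ and that Eq. (29) has the form $|\Delta E_g(\text{jellium})| = 2 \times 0.95\,R/r_s^{3/4}$, both in cgs units with $n$ in cm$^{-3}$. The function names are hypothetical and both functions return magnitudes in meV.

```python
import math

E2_MEV_CM = 1.43996e-4        # e^2 in meV cm (cgs units)
BOHR_RADIUS_CM = 0.52918e-8   # hydrogen Bohr radius, cm


def bgn_total_mev(n, eps, n_b, lam, rydberg_mev):
    """Magnitude of the total BGN: twice the conduction band shift assumed for Eq. (27)."""
    bracket = (3.0 / (math.pi * n_b)) ** (1.0 / 3.0) * lam + 0.481
    return 2.0 * (E2_MEV_CM / eps) * bracket * n ** (1.0 / 3.0) + 0.38 * rydberg_mev


def bgn_jellium_mev(n, eps, m_de, rydberg_mev):
    """Magnitude of the many-body (jellium) BGN: twice the hole self-energy."""
    a_eff = BOHR_RADIUS_CM * eps / m_de
    r_s = (3.0 / (4.0 * math.pi * n)) ** (1.0 / 3.0) / a_eff
    return 2.0 * 0.95 * rydberg_mev / r_s**0.75


if __name__ == "__main__":
    # Silicon parameters from Table I: eps = 11.4, N_b = 6, Lambda = 0.95, R = 34.5 meV
    for n in (1e18, 1e19, 1e20):
        total = bgn_total_mev(n, 11.4, 6, 0.95, 34.5)
        jellium = bgn_jellium_mev(n, 11.4, 0.33, 34.5)
        print(f"n = {n:.0e} cm^-3: total {total:.1f} meV, jellium {jellium:.1f} meV")
```

The difference between the two functions gives a rough estimate of the impurity-scattering contribution implied by these expressions.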

FIG. 5. BGN $\Delta E_g$ in n-type Si and n-type Ge as a function of donor concentration $N_D$ (cm$^{-3}$); solid curves: theoretical results of Berggren and Sernelius (1981); dashed curves: plots of Eq. (28) derived in this paper.

SURESH C. JAIN, R. P. MERTENS AND R. J. VAN OVERSTRAETEN

N_D (cm^-3)
FIG. 6. BGN ΔE_g(jellium) in n-type Si neglecting impurity scattering. The solid curve is the theoretical result of Berggren and Sernelius (1981), and the dashed curve is a plot of Eq. (29) derived in this paper.

7. A Simple Expression for BGN Derived by Jain and Roulston

The calculated values of BGN are given in the form of graphs. They are not convenient for interpreting experiments or for use in computer-aided design of devices. The calculations are involved and time-consuming, and they have to be repeated for every new semiconductor or for extending the range of doping in the same semiconductor. The empirical expressions (28) and (29) are applicable only to n-type Si and n-type Ge; they cannot be used for p-type semiconductors. In the parabolic spherical band and effective mass approximations, semiconductors differ only in the values of the dielectric constant and the effective masses. Jain and Roulston (1991) showed that this dependence can be eliminated and a universal expression for BGN applicable to all semiconductors can be derived if energies are normalized by R and impurity concentrations are expressed in terms of the many-body parameter r_s defined by Eq. (7). However, real semiconductors cannot be described by these approximations; the actual band structures are complicated. Introducing two correction factors to take into account deviations from this ideal band structure, Jain and Roulston derived the following equation for BGN applicable to all n- and p-type semiconductors:

$$\frac{\Delta E_g}{R} = 1.83\,\frac{\Lambda}{N_b^{1/3}}\,\frac{1}{r_s} + \frac{0.95}{r_s^{3/4}} + \left[1 + \frac{R_{(mino)}}{R}\right]\frac{1.57}{N_b\,r_s^{3/2}}. \qquad (30)$$

TABLE III
VALUES OF THE PARAMETERS USED IN THE CALCULATIONS OF THE BGN FOR DIFFERENT SEMICONDUCTORS (JAIN AND ROULSTON, 1991)

Parameter                                 n-Si     p-Si     n-Ge     p-Ge
Dielectric constant ε                     11.4     11.4     15.4     15.4
Electron effective mass (units of m_0)    0.33     0.33     0.22     0.22
Hole effective mass (units of m_0)        0.59     0.59     0.36     0.36
N_b                                       6        2        4        2
Λ                                         1        0.75     0.84     0.75
R (meV)                                   34.5     61.7     12.6     20.6
R_(mino) (meV)                            35.4^a   34.5     11.2^a   12.6
a (Å)                                     18.3     10.2     37.1     22.7

^a The average of the heavy and light hole masses has been used in these calculations (Jain and Roulston, 1991).
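As an illustration of how Eq. (30) might be used in practice, the following Python sketch evaluates it with the parameter values of Table III. It is a minimal sketch under stated assumptions, not the authors' own program: the function and dictionary names are hypothetical, and r_s is taken to have the usual form r_s = r_a/a with (4π/3) r_a^3 = 1/N, which is assumed here to be what Eq. (7) defines.

```python
import math

# Parameter sets transcribed from Table III (a in angstrom, R and R_mino in meV).
# The dictionary layout and all names below are illustrative assumptions.
TABLE_III = {
    "n-Si": {"N_b": 6, "Lambda": 1.00, "R": 34.5, "R_mino": 35.4, "a": 18.3},
    "p-Si": {"N_b": 2, "Lambda": 0.75, "R": 61.7, "R_mino": 34.5, "a": 10.2},
    "n-Ge": {"N_b": 4, "Lambda": 0.84, "R": 12.6, "R_mino": 11.2, "a": 37.1},
    "p-Ge": {"N_b": 2, "Lambda": 0.75, "R": 20.6, "R_mino": 12.6, "a": 22.7},
}


def r_s(n_cm3, a_angstrom):
    """Assumed form of Eq. (7): mean impurity sphere radius in units of a."""
    a_cm = a_angstrom * 1e-8
    r_a_cm = (3.0 / (4.0 * math.pi * n_cm3)) ** (1.0 / 3.0)
    return r_a_cm / a_cm


def bgn_jain_roulston_mev(n_cm3, material):
    """Evaluate the three terms of Eq. (30) and return the BGN in meV."""
    p = TABLE_III[material]
    rs = r_s(n_cm3, p["a"])
    first_term = 1.83 * p["Lambda"] / (p["N_b"] ** (1.0 / 3.0) * rs)
    second_term = 0.95 / rs ** 0.75
    third_term = (1.0 + p["R_mino"] / p["R"]) * 1.57 / (p["N_b"] * rs ** 1.5)
    return p["R"] * (first_term + second_term + third_term)


if __name__ == "__main__":
    for material in TABLE_III:
        print(material, round(bgn_jain_roulston_mev(1e19, material), 1), "meV")
```

With these inputs the sketch gives roughly 55 meV for n-type Si at a donor concentration of 10^19 cm^-3. Keeping the parameters in a table separate from the formula mirrors the point made in the text that no adjustable parameters are introduced when going from one semiconductor to another.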

The correction factor Λ takes into account the effect of anisotropy of the bands in n-type semiconductors and the effect of interactions between the light and heavy hole valence bands in p-type semiconductors; N_b is the number of conduction band minima in the case of n-type Si and n-type Ge, and N_b = 2 for all p-type semiconductors. Justification for using N_b = 2 in p-type heavily doped semiconductors has been given by Jain and Roulston (1991). The effective Rydberg energy R_(mino) is for the minority band and is calculated (Jain and Roulston, 1991) in a manner suggested by Berggren and Sernelius (1981) for determining the effect of impurity scattering on the bandgap. No adjustable parameters are used in going from one semiconductor to another for calculating BGN values. Only known values of the parameters, with N_b = 2, are used for all p-type semiconductors. The values of the different parameters used by Jain and Roulston for calculation of BGN are given in Table III. Jain and Roulston (1991) have used Eq. (30) to calculate BGN in Si, Ge, and GaAs. BGN values for n-type Si and n-type Ge calculated using Jain and Roulston's Eq. (30) agree well with the values calculated by Berggren and Sernelius (1981), and we do not show them here. Their results for BGN in p-type Ge are shown in Fig. 7. No other calculated or experimental values of BGN in p-type Ge are available with which these results could be compared. The results for p-type Si calculated by Jain and Roulston (1991) using Eq. (30) will be compared with experimental values in Section IV.

8. Comparison of BGN in n-Type Si and n-Type Ge

For any given impurity concentration, the changes in chemical potential (see Figs. 3 and 4) and in the BGN (Figs. 1 and 2) are very different in n-type

Dopant Concentration (cm^-3)
FIG. 7. The values of BGN in p-Ge calculated using Eq. (30). [After Jain and Roulston (1991).]

silicon and n-type germanium. In germanium, the chemical potential becomes positive at an impurity concentration of about 4 x 10" cm^-3, whereas in silicon, positive chemical potential is not achieved until the impurity concentration reaches a value of about 6 x 10'' cm^-3. With increasing impurity concentration, the bandgap decreases much more rapidly in silicon than in germanium. The difference in the behavior of the chemical potential arises because the numerical coefficient in the kinetic energy term for germanium is twice as large as it is for silicon (see Eqs. (11) and (13)). The larger coefficient in germanium is partly due to its smaller effective density-of-states mass and partly due to a smaller number of conduction band minima. The difference in the bandgap arises mainly because of the very different values of the effective Rydberg energy in the two semiconductors. If energy is plotted in units of R and concentration in units of N_c, the BGN curves for silicon and germanium become very similar. There is a difference due to the difference in the number of the conduction band minima, but the difference is small. It is therefore possible to predict the BGN for a heavily doped alloy of silicon and germanium, provided that the band structure (e.g., the effective masses) of the intrinsic alloy is known.

D. Klauder's Multiple Scattering Theory for Moderately Doped Semiconductors

1. The Need for a New Theory

We will see later in this paper that the magnitude of BGN calculated by using high-density theory agrees with the experimental values of the pn product to a reasonable approximation even when the doping concentration N is not larger than Mott's critical concentration N_c, but close to it. This agreement does not imply that the theories are valid in this range. The experimental values of BGN derived from electrical measurements depend only on the position of the Fermi level with respect to the minority band edge. A given position of the Fermi level can be obtained for numerous possible shapes of the density-of-states function and the correspondingly different values of the bandgap. The agreement of the experimental BGN values (derived from the pn product) with the theory is therefore not a strict test of the validity of the theory. A recent paper of Dhariwal et al. (1987) illustrates this point. These authors were able to model the Fermi level and the density of the minority carriers (or pn product) "correctly" (i.e., so as to provide agreement with the experimental values) by invoking only the impurity effects; exchange and correlation effects (many-body effects) were completely ignored! There are several other properties of semiconductors for which doping in the neighborhood of and below the critical concentration N_c is important and that cannot be interpreted using high-density theories. For example, a theory is needed to calculate the value of N_c at which a semiconductor changes from metal to nonmetal. Luminescence spectra also show excessive width of the line at lower concentrations (Dumke, 1983a, 1983b). A theory is also needed to interpret the density of states obtained by tunnelling experiments, shown in Fig. 8 for silicon (Abdurakhamanov et al., 1976) and in Fig. 9 for germanium (Sawaki et al., 1974). When the doping concentration is in the neighborhood of N_c, the density of states is highly distorted. A theory to calculate the density of states and BGN of moderately doped semiconductors (i.e., when N is in the neighborhood of N_c) is needed.

2. Klauder's Theory of Multiple Scattering

Klauder proposed in 1961 a multiple-scattering* formalism that enables one to calculate the electron-impurity interaction potential in a self-consistent

* Multiple scattering means that the electron is scattered several times by the same impurity ion before it is scattered by another impurity. This possibility is not included in the high-density theories based on second-order perturbation theory. The multiple scattering becomes increasingly important as the concentration of the impurity decreases, and it ultimately leads to the bound states at very low impurity concentration.

Energies (meV)
FIG. 8. Density of states curves near (a) the conduction band edge of n-type Si, and (b) the valence band edge of p-type Si, plotted for different doping concentrations. The values of the concentrations are (a) curve 1: 7.7 × 10^18 cm^-3; curve 2: 13.5 × 10^18 cm^-3; curve 3: 17.1 × 10^18 cm^-3; curve 4: 31 × 10^18 cm^-3; (b) curve 1: 4.7 × 10^18 cm^-3; curve 2: 13 × 10^18 cm^-3; curve 3: 23 × 10^18 cm^-3. In (a), curves 3 and 4 are so close together that they cannot be seen separately. Reference energies on the x axes are the positions of the unperturbed band edges. [After Abdurakhamanov et al. (1976).]

Energies (meV)
FIG. 9. Density of states curves of (a) As-doped Ge (open circles: 5 × 10^18 cm^-3; closed circles: 1.5 × 10^18 cm^-3) and (b) Sb-doped Ge (open circles: 5 x 10" cm^-3; closed circles: 1.3 × 10^19 cm^-3). The broken line shows the parabolic unperturbed bands. [After Sawaki et al. (1974).]

manner. The method based on Klauder's theory is very powerful. The fifth-level approximation of the theory, which is the most accurate, calculates the self-energy to all orders in the perturbation and for all values of the impurity concentration. Serre and Ghazali (1983) used the one-electron Green function G(k, E), which obeys the Dyson equation, for describing properties of the interacting electron system. They solved the nonlinear integral equations given by Klauder in his best approximation (Klauder, 1961) to obtain values of the self-energy and the Green function. The spectral density A(k, E) and the density of states D(E) were then calculated. To implement the method for doped semiconductors, Serre and Ghazali used the Yukawa potential, Eq. (1), and the Thomas-Fermi screening length for a degenerate semiconductor given by Eq. (3). The exchange correlation part of the self-energy was approximated by the high-density value of the exchange energy given by Eq. (14) without the correction factor Λ. Single spherical parabolic bands were assumed, and calculations were done numerically using the reduced (or normalized) values of the energy, concentration, and distance, so that in this approximation (single spherical parabolic bands) universal curves were obtained that could be used for any semiconductor. Bennett (1986a), Lowney (1986b), and Bennett and Lowney (1987) have applied this method to calculate the density of states and BGN in n-type and p-type gallium arsenide. Lowney (1986a) also used this method to calculate the density of states of p-type silicon at room temperature. Lowney (1986a) confined his calculations to moderately doped p-type silicon with N_A = 1.5 × 10^18 cm^-3 and N_A = 6.2 × 10^18 cm^-3, and a compensated sample with N_D = 1.2 × 10^19 cm^-3 and N_A = 6.2 × 10^18 cm^-3. In this doping range, the fifth level of Klauder's approximation is required since the model must provide for bound or highly localized states. (If the calculations are performed only in the high-density regime, the third-level approximation of Klauder is sufficient (Bennett, 1986a, 1986b; Lowney, 1986a, 1986b).) The calculations were made at room temperature, and the Debye length, Eq. (4), was used as the screening length. Full ionization was assumed, which is probably a justified approximation at room temperature. Since calculations were confined to lower dopings and were made for room temperature, the semiconductor is nondegenerate, and many-body effects can be neglected. The spin-off band in the valence band was ignored. The values of the effective masses used by Lowney (1986a) are given in Table IV. The values of the Debye screening lengths are 63 a_0 (a_0 is the Bohr radius, 0.529 Å; it is not the effective Bohr radius as used by us earlier) for N_A = 1.5 × 10^18 cm^-3, and 31 a_0 both for N_A = 6.2 × 10^18 cm^-3 and for the third, compensated case (Lowney, 1986a). Results obtained by Serre and Ghazali (1983) and by Lowney (1986a) are discussed in the next section.
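The quoted screening lengths can be checked with a short calculation. The sketch below assumes that Eq. (4) has the standard nondegenerate Debye form L_D = [ε ε_0 k_B T / (e^2 N)]^(1/2), with T = 300 K and the dielectric constant ε = 11.4 listed for silicon in Table III; these inputs and the function name are assumptions made for illustration and are not taken from Lowney's paper.

```python
import math

EPS0 = 8.8541878128e-12    # vacuum permittivity, F/m
K_B = 1.380649e-23         # Boltzmann constant, J/K
Q = 1.602176634e-19        # elementary charge, C
BOHR_RADIUS_M = 0.529e-10  # Bohr radius a_0 as quoted in the text, m


def debye_length_in_a0(n_cm3, eps_r=11.4, temperature_k=300.0):
    """Debye screening length in units of a_0 for carrier density n_cm3 (cm^-3).

    Assumes the standard nondegenerate expression; eps_r and temperature_k
    are illustrative defaults, not values confirmed by the source.
    """
    n_m3 = n_cm3 * 1e6  # convert cm^-3 to m^-3
    length_m = math.sqrt(eps_r * EPS0 * K_B * temperature_k / (Q ** 2 * n_m3))
    return length_m / BOHR_RADIUS_M


if __name__ == "__main__":
    for n in (1.5e18, 6.2e18):
        print(f"N = {n:.1e} cm^-3 -> L_D = {debye_length_in_a0(n):.1f} a_0")
```

Under these assumptions the two results come out close to the values of 63 a_0 and 31 a_0 quoted above; the small residual difference depends on the dielectric constant and temperature actually used.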

TABLE IV
VALUES OF THE PARAMETERS USED BY LOWNEY (1986a) IN HIS THEORETICAL CALCULATIONS

                                                  Silicon
Effective electron density of states mass m_de    0.36
Effective heavy hole mass m_hh                    0.49
Effective light hole mass m_lh                    0.16
Effective hole mass m_h*                          0.55

3. Band Structure of Moderately Doped Semiconductors

Density of states (in units of (Ra^3)^-1) as a function of energy (in units of R) calculated by Serre and Ghazali is shown in Fig. 10 for different values of the reduced concentration N' (measured in units of (π/3)(4a)^-3) of the impurity. Plotted in this manner, the results are universal and are applicable to any semiconductor, provided the bands are single spherical parabolic bands. To obtain numerical values for a particular semiconductor, values of a and R for that semiconductor must be used to remove the normalization used in Fig. 10. The lowest concentration for curve 12 in Fig. 10 comes out

Energies / R
FIG. 10. Density of states curves as functions of energy for different reduced impurity concentrations N'. Regions I, II, and III marked by dashed lines are regions of existence of quasi-atomic, hybrid, and extended states, respectively. The units of concentration N' are (π/3)(4a)^-3. [After Serre and Ghazali (1983).]

to be 1.5 × 10^14 cm^-3, and the highest concentration for curve 1 is 6 × 10^18 cm^-3. It is interesting that even for such a low concentration as 1.5 × 10^14 cm^-3, the impurity band is formed and its peak has moved from the isolated donor levels at E = -1 R to E = -0.89 R. As the concentration increases, the band broadens and its peak moves rapidly towards the conduction band edge. The band becomes increasingly asymmetrical. Note that as the concentration increases, the conduction band edge also starts moving downwards in the bandgap. It has moved into the gap by about 0.1 R at N_D = 3 × 10^17 cm^-3. Approximately at this concentration, the impurity band and the conduction band merge with each other. This concentration at which the two bands merge and the semiconductor changes from nonmetal to metal is 10 times smaller than the correct value N_D = N_c. Lowney has suggested that the discrepancy arises because Serre and Ghazali have assumed full ionization at small concentrations and at low temperatures. Considerable deionization must occur under these conditions. The qualitative behavior of the impurity band at very low concentrations discussed above is the same as obtained by the earlier theories discussed in Section II.B. The new features obtained in the present work are the distortion of the conduction band, the merging of the two bands at a critical concentration, and the shape of the composite band at higher concentrations, as seen in Fig. 10. However, the density of states is not parabolic even at the highest concentration used by Serre and Ghazali; it is highly distorted. In agreement with the result of Berggren and Sernelius (1981, see Section II.C), Serre and Ghazali also found that the valence band edge moves up because of impurity scattering in n-type semiconductors. The density-of-states values calculated by Lowney for the three cases mentioned earlier are shown in Fig. 11. Qualitatively, the results in Figs. 11a and 11b are similar to those of Serre and Ghazali (Fig. 10) discussed earlier. At the lowest concentration (1.5 × 10^18 cm^-3) used by Lowney, the impurity band has just touched the distorted main valence (majority carrier) band. The distortion in the conduction band appears to be more pronounced and the impurity band is less asymmetrical than the corresponding results obtained by Serre and Ghazali. The concentration at which the two bands merge is about 10 times higher in the work of Lowney and is in better agreement with the experimental values. The calculated density of states in the third, compensated sample will be discussed later.

4. Localization in the Impurity Band

The band tails formed because of impurity concentration fluctuations will be discussed in the next section (Section III). Here we discuss the localization in a “new type” of tails obtained in the work of Serre and Ghazali. The values

Density of states
FIG. 11. Density of states of the conduction band and the valence band in Si at room temperature. The dashed curves in (a) refer to unperturbed bands. Note that Ryd in this figure denotes Rydberg energy and not effective Rydberg energy. Note also that the sample in (c) is compensated. (Lowney, 1986a).

Wave number k
FIG. 12. Spectral densities as functions of wave number k at typical energies (energies E given in the figure have been normalized by R) and for three typical reduced impurity concentrations N'. Units of N' are the same as in Fig. 10 (Serre and Ghazali, 1983).

of the spectral density A(k) as a function of the wave vector k calculated by Serre and Ghazali are shown in Fig. 12 at typical energies and for three different values of the reduced impurity concentration N'. At low doping concentrations, N' = 5 × 10^-3, the spectral density at E = 0 (i.e., just at the bottom of the conduction band) is sharply peaked and is Lorentzian,

characteristic of a quasi-free electron. At E = -0.5 R, i.e., inside the impurity band, the curve is very broad, suggesting highly localized wave functions. At higher doping concentrations, the localization sets in at higher energies closer to the majority carrier band edge. At the highest concentration shown in Fig. 12, the wave functions appear to be quasi-localized at the bottom of the unperturbed majority carrier band. The high-energy states very deep in the conduction band are free, the states at the intermediate energies are quasi-localized, and the states at low energies in the gap (i.e., in the impurity band) are quasi-atomic, i.e., they are localized. The localization occurs as a result of the mixing of orbital and free states to form hybrid wave functions. The approximate boundaries between the extended (region III), quasi-localized (region II), and localized (region I) states are shown in Fig. 10 by dashed lines. Lowney calculated the spectral density of states for both the majority and the minority carrier bands. For the majority carrier bands, his results are very similar to those obtained by Serre and Ghazali, which were discussed earlier. As expected, no evidence of localization of the minority carriers was found from Lowney's calculations of the spectral density of this band. These results are instructive: the low- and high-density theories are incapable of giving them.

5. BGN in Klauder's Theory

To confront this work with the experiments, calculation of the Fermi level is necessary. The Fermi level can be calculated only if the spin degeneracy of the states involved is known. The subject of degeneracy has been extensively debated, and a clear consensus does not exist. Serre and Ghazali assumed that only one electron per state exists in the highly localized states, but extended states in the conduction band are spin-degenerate.^3 Using the charge neutrality condition, Serre and Ghazali then calculated the Fermi level. The calculations show that after the impurity and conduction bands merge with each other, the Fermi level lies approximately at the boundary between regions II and III. The spectral density curve at the Fermi level is still somewhat broad, suggesting that a large fraction of the electrons are in the hybrid states, are quasi-localized, and have very small mobility. This should affect the majority carrier conductivity at low temperatures. The experimental values of the conductivity have not been compared with this theory as yet. It will be interesting to reinterpret the earlier work on low-temperature

^3 In their original work on the band tails, Halperin and Lax (1966, 1967) also assumed that only one electron can exist per state in the localized states (Abram et al., 1978). However, Kane (1985) has recently introduced a factor of two due to spin degeneracy in the Halperin and Lax expression for density of states in the tail.

conductivity of doped uncompensated and compensated silicon and its thermal variation (Lee and McGill, 1975; Neumark, 1977a,b; Schechter, 1987, 1988) with the help of this theory. Serre and Ghazali have given the following expression for the BGN:

$$\Delta E_g / R = 1.2\,(N')^{1/3} + 0.18\,(N')^{0.55}. \qquad (31)$$

Using the values of a and R given in Table I, the above equation can be written for silicon in the form

A numerical calculation shows that BGN values calculated using Eq. (32) are significantly lower than those given by Berggren's theory. Lowney also calculated the Fermi level using the charge neutrality condition. The spin degeneracy was determined in an approximate manner following a procedure similar to that of Serre and Ghazali. Lowney then calculated the pn product or the effective BGN = ΔEga. Since he did not make calculations at higher doping concentrations, his results cannot be compared with the high-density theories. The discrepancy between the values of the BGN calculated by the multiple scattering theory and by the high-density theory is probably due to a combination of several reasons. We have already mentioned the errors caused because full ionization and linear screening were used by Serre and Ghazali. The errors caused by the uncertainty in the values of the spin degeneracy, by complete neglect of correlation energy, and by using linear screening can also be significant. Both Serre and Ghazali and Lowney have found that the final results are very sensitive to small changes in the screening length. Kane (1985) suggested that the effective Bohr radius a or an infinitely large screening length is a better approximation than the use of Thomas-Fermi screening. In view of the work of Abram et al. (1984) and of Saunderson (1983), it will be most interesting to repeat Serre and Ghazali's work using the plasmon pole approximation for the dielectric function.
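For readers who wish to evaluate Eq. (31) themselves, the following Python sketch converts a concentration in cm^-3 into the reduced concentration N', using the unit (π/3)(4a)^-3 stated with Fig. 10, and then applies Eq. (31). The function names are hypothetical, and the value a = 18.3 Å used in the example is the n-Si entry of Table III, taken here only as a stand-in; the values of a and R in Table I, to which the text refers, may differ.

```python
import math


def reduced_concentration(n_cm3, a_angstrom):
    """Return N' = N / [(pi/3) (4a)^-3] for an effective Bohr radius a."""
    a_cm = a_angstrom * 1e-8
    unit_cm3 = (math.pi / 3.0) / (4.0 * a_cm) ** 3  # one unit of N', in cm^-3
    return n_cm3 / unit_cm3


def bgn_serre_ghazali_over_r(n_reduced):
    """Right-hand side of Eq. (31): BGN in units of the effective Rydberg R."""
    return 1.2 * n_reduced ** (1.0 / 3.0) + 0.18 * n_reduced ** 0.55


if __name__ == "__main__":
    a_si = 18.3  # angstrom; assumed stand-in value from Table III (n-Si)
    for n in (1e17, 1e18, 1e19):
        n_red = reduced_concentration(n, a_si)
        print(f"N = {n:.0e} cm^-3: N' = {n_red:.3g}, "
              f"BGN/R = {bgn_serre_ghazali_over_r(n_red):.3f}")
```

Working in reduced units keeps the formula independent of the material; only the conversion step carries the semiconductor-specific parameter a, and multiplying the result by R restores energy units.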

6. Band Structure of Compensated Semiconductors

The density-of-states curves for compensated silicon calculated by Lowney are shown in Fig. 11c, and those calculated by Serre and Ghazali, in Fig. 13. Serre and Ghazali have also made calculations for lower concentrations with different values of the compensation ratio K. The inset in Fig. 13 shows the position of the Fermi level E_F, the band edge (solid line), and the

Energies / R
FIG. 13. Evolution of the density of states when the compensation ratio K is increased. The inset shows the energies of the band edges (solid line), the Fermi level E_F, and the boundary E, between regions I and II (dashed line) as functions of the compensation ratio. [After Serre and Ghazali (1983).]

boundary between the regions I and II denoted by E, as a function of the compensation ratio K. Some features of the calculated density of states of the compensated semiconductors are summarized below.

1. For the same value of N, the density-of-states function penetrates deeper into the bandgap as the value of K increases (see Fig. 13).
2. For lower concentrations N, the peak of the impurity band seems to move away from the edge of the conduction band as the value of K increases.
3. The Fermi level moves down, first slowly and then rapidly, as the value of K increases. The Fermi level moves into region II or III at sufficiently large values of K. Now carriers at the Fermi level are localized, making the conductivity thermally activated at low temperatures.

These results are in qualitative agreement with the other work on compensated semiconductors (Mock, 1973; Morgan 1965). In view of the uncertainty in the theoretical results discussed earlier, we have not made any attempts to compare these results quantitatively with experiments (see, however, Green, 1987).

III. IMPURITY CONCENTRATION FLUCTUATIONS AND BAND TAILS

A. The Need for Concentration Fluctuations

The results of the high-impurity-density theories give a density-of-states function that drops to zero rather rapidly at the shifted conduction band edge. Many experiments show that even in the high-density regime, there are band tails that penetrate deep into the gap beyond the band edges and decay exponentially with some power of energy measured from the band edges (Halperin and Lax, 1966, 1967). It is necessary to invoke the fluctuations in the concentration of the dopant impurity to obtain these exponentially decaying tails theoretically. The semiclassical theory of Kane (1963, 1985) and the quantum-mechanical theory of Halperin and Lax (1966, 1967) are discussed in Section III.B. Recent quantum-mechanical theories (Serre et al., 1981; Sayakanit, 1982) are discussed in Section III.C.

B. Earlier Theories of Band Tails

1. Semiclassical Theory, Kane Band Tails

Consider a fluctuation with a local concentration N_local (per unit volume) of the donors in the fluctuation. Let N_local be larger than the average concentration N_D of the impurity. This local high concentration N_local of donors will produce a potential well that can create a state and bind an electron if N_local is large. Stronger fluctuations (i.e., with larger values of the local concentration N_local) will produce deeper states. These states will be highly localized because the volumes of local regions of high donor concentrations are small and are well separated in space. These states constitute the band tails. The probability of finding a fluctuation decreases as the concentration N_local, and hence the strength of the fluctuation, increases. The number of states in the tails therefore decreases as the energy -E (i.e., the depth below the conduction band edge) increases.

In the semiclassical theory developed by Kane (1963, 1985), the potential fluctuations are assumed to be slow on the scale of the electron wavelength at the Fermi surface. The local density of states is assumed to be equal to the unperturbed band density of states corresponding to the local band edge V(r) - V_0; V_0 is the average band edge potential and V(r) is the local band edge potential. At the point r, the number of states per unit volume per unit energy is given by

$$\rho(E, r) = \frac{2^{1/2}\, m_e^{*3/2}}{\pi^2 \hbar^3}\,[E - (V(r) - V_0)]^{1/2}. \qquad (33)$$

Using the Thomas-Fermi theory of screening, the average band edge V_0 is

Taking the average of ρ(E, r) over the probability distribution P(V - V_0) given by

(35a) where V

=

V(r)and o = - - e2 ( 4;~)’/’ &

we obtain the density of states ρ_K in Kane's theory,

with

$$Y(x) = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{x} (x - x')^{1/2} \exp(-x'^2)\, dx'. \qquad (37)$$

The function Y(x) obtained by numerical integration has been plotted by Kane (1963) and quoted by many authors (e.g., Abram et al., 1978). For large positive and negative values of E, ρ_K(E) reduces to

$$\rho_K(E) = \mathrm{const.}\,\sqrt{E} \quad \text{for } E \to \infty, \qquad (38a)$$

$$\rho_K(E) = \mathrm{const.}\,\exp\left(-\frac{E^2}{\sigma^2}\right) \quad \text{for } E \to -\infty. \qquad (38b)$$
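The function Y(x) of Eq. (37) is straightforward to evaluate numerically. The sketch below is one possible approach, not the procedure used by Kane: it assumes the normalization Y(x) = π^(-1/2) ∫ (x - x')^(1/2) exp(-x'^2) dx' taken from -∞ to x, substitutes u = (x - x')^(1/2) to remove the square-root behaviour at the upper limit, and applies composite Simpson integration. The function name and step count are illustrative choices.

```python
import math


def kane_y(x, n_steps=4000):
    """Numerically evaluate Y(x) of Eq. (37).

    With u = sqrt(x - x'), the integral becomes
        Y(x) = (2 / sqrt(pi)) * integral_0^inf u^2 exp(-(x - u^2)^2) du,
    which has a smooth integrand. n_steps must be even (Simpson's rule).
    """
    # Beyond this u the Gaussian factor is smaller than exp(-64).
    u_max = math.sqrt(max(x, 0.0) + 8.0)
    h = u_max / n_steps

    def integrand(u):
        return u * u * math.exp(-(x - u * u) ** 2)

    total = integrand(0.0) + integrand(u_max)
    for i in range(1, n_steps):
        weight = 4.0 if i % 2 else 2.0
        total += weight * integrand(i * h)
    return (2.0 / math.sqrt(math.pi)) * total * h / 3.0


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0, 10.0):
        print(f"Y({x:+.1f}) = {kane_y(x):.6g}")
```

For large positive x the computed values approach x^(1/2), the parabolic-band limit of Eq. (38a), while for negative x they fall off rapidly, which is the band tail.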

We have derived the preceding expression for the tails of the majority carrier band, e.g., the conduction band in n-type silicon. The same expression can be used for the valence band of n-type silicon, except that the hole effective mass should now be used. In a compensated semiconductor, N_D + N_A is used for the impurity concentration in the expression for σ, but N_D - N_A (the free carrier concentration) is used in calculating k,. However, if (N_D - N_A)/N_D is small, Thomas-Fermi screening does not remain valid.

Consider again a fluctuation in the concentration of the impurity with a local concentration N_local. Let the spatial volume of this fluctuation be R, and let the wave function of the bound or the trapped electron be f(r). The magnitude of the potential that the electron bound to this fluctuation "sees" is determined by N_local. On the other hand, the kinetic energy of localization is mainly determined by the spatial

extent of the wave function or of the volume R of the fluctuation. If R is small, the wave function is confined to a small volume, and the kinetic energy of localization of the electron becomes large. For any given fluctuation, the potential and kinetic energies are determined by the two parameters N_local and R, respectively, which vary independently from fluctuation to fluctuation. Only those fluctuations will contribute to the density of states in the tails for which the kinetic energy determined by R is consistent with the solution of the Schrödinger equation for the local potential determined^4 by N_local. Consider a fluctuation with a small value of R and also a small value of N_local. In this case, the kinetic energy is large, and the energy with which the electron can be bound to the fluctuation is small. The binding energy is not sufficient to overcome the kinetic energy, and therefore a bound state cannot exist for this fluctuation. In Kane's theory, this fluctuation also contributes to the density of states. Our argument shows that not only the strength of the potential fluctuation, but also its volume or shape must be considered while determining whether the fluctuation will contribute to the density of states in the tails. The number of states deep in the tails is reduced considerably on this account, as compared to the number obtained from Kane's theory. It is now clear that Kane's theory is valid only for energies close to the band edge where localization is not strong. The theory becomes invalid for low values of E, i.e., for large negative values of E (deep in the bandgap) where the kinetic energy of localization becomes large. However, if the condition *<< 16zN

(39)

is satisfied, the kinetic energy of the electron bound to the most probable fluctuation is small, and the semiclassical theory can be used approximately over the whole energy range. For very large values of -E, Gaussian statistics also fails.

2. Halperin-Lax Band Tails

Before discussing the Halperin and Lax theory, we show from general physical considerations that for a given energy E, mostly fluctuations of only one shape and size contribute to the density of states. Consider the five fluctuations at energy -E (which is determined by N_local; N_local is the same for all five fluctuations) and of different volumes R and correspondingly different

^4 This statement is not strictly true. The wave function is determined by both the potential and the kinetic energy terms in a self-consistent manner, and each energy depends on both the volume R and the concentration N_local.

FIG. 14. Local potential fluctuations due to fluctuations in the impurity concentration in a doped semiconductor below the majority band edge at an energy E measured from the edge. The symbols r_1 to r_5 show the spatial extent of the fluctuations and depend upon the volume of the region of local high impurity concentration N_local. N_local (per unit volume) is assumed to be the same in all the fluctuations shown; their volumes increase in going from r_1 to r_5. The rectangular shape is shown for convenience.

radii $r$ shown schematically in Fig. 14. The electron "sees" the same magnitude of the potential at the center of the well in all five fluctuations, determined by the value of $N_{\text{local}}$. The spatial extent of the potential and of the wave function are very different in these fluctuations. The kinetic energy of localization is also different for these five fluctuations. It is very large in fluctuation 1, which has the smallest radius $r_1$, and decreases as we go from fluctuation 1 to 5. It is very small in the largest fluctuation, 5, with a large radius $r_5$. Bound states in fluctuation 1 will not be formed because the kinetic energy is too large and the binding energy is not sufficiently large to overcome it. We need a minimum radius of the fluctuation, say $r_3$, before a state can exist in the fluctuation. In principle, all fluctuations of a size larger than fluctuation 3 will contribute to the density of states at this value of $E$. The smaller fluctuations can contribute only if they are deeper. For example, a fluctuation of size 2 will contribute only at lower $E$ if it is deeper (i.e., has a larger value of $N_{\text{local}}$), as shown by the dashed lines in Fig. 14.

It is possible to argue that, for the purpose of calculating the density of states in the tails, the larger fluctuations can also be neglected. The statistical concentration $N_r$ of the fluctuations with radius $r$ is shown schematically in Fig. 15. It can be seen that the number of fluctuations with larger radii decreases rapidly, and their contribution will be relatively small. To summarize, then, we can say that for the energy $E$ under consideration, only fluctuations with the optimum radius $r_3$ need be considered. The fluctuations with smaller radii are more numerous, but they are useless because for them the kinetic energy of localization is too large for a state to be formed. There are fewer fluctuations with larger radii because, statistically speaking, smaller fluctuations are more probable than larger ones.
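The minimum-radius argument can be made concrete with a textbook model. The following is a minimal sketch, assuming each fluctuation is idealized as a spherical square well of depth $V_0$ (the rectangular wells of Fig. 14 are drawn "for convenience", so this shape is an assumption, not the model used by Halperin and Lax). For such a well, an s-state is bound only if $V_0 r^2 \ge \pi^2\hbar^2/8m$. The function names and the `mass_ratio` parameter are hypothetical.

```python
import math

# hbar^2 / (2 m_0) for the free-electron mass, in eV * Angstrom^2
HBAR2_OVER_2M0 = 3.80998


def min_radius_for_bound_state(depth_ev, mass_ratio=1.0):
    """Smallest radius (Angstrom) of a spherical square well of depth
    `depth_ev` that binds an s-state: V0 * r^2 >= pi^2 hbar^2 / (8 m).

    `mass_ratio` is the effective mass in units of the free-electron mass
    (an illustrative parameter, not a value taken from the text).
    """
    if depth_ev <= 0 or mass_ratio <= 0:
        raise ValueError("depth and mass ratio must be positive")
    return (math.pi / 2.0) * math.sqrt(HBAR2_OVER_2M0 / (mass_ratio * depth_ev))


def supports_bound_state(radius_angstrom, depth_ev, mass_ratio=1.0):
    """True if a well of this radius and depth can hold a bound state."""
    return radius_angstrom >= min_radius_for_bound_state(depth_ev, mass_ratio)
```

In this idealization, a deeper well (larger $N_{\text{local}}$) lowers the minimum radius as $V_0^{-1/2}$, which mirrors the remark that a smaller fluctuation can contribute only if it is deeper.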

FIG. 15. The number of fluctuations $N_r$ at energy $E$ below the majority band edge is shown as a function of the radius $r$ of the fluctuation (schematic). The symbols $r_1$ to $r_5$ have the same meaning as in Fig. 14. Energy $E$ is determined by the concentration $N_{\text{local}}$. Both $N_{\text{local}}$ and $E$ have the same value in all the fluctuations.

Halperin and Lax focused their attention on the states deep in the tail, where the density is small and varies rapidly with energy. Their theory is based on the following assumptions:

1. At a given energy $E$ below the band edge, only fluctuations of a particular size and shape will form a bound state and contribute to the density of states.
2. The wave functions are highly localized. They are real and spherical in shape. In view of assumption 1, all wave functions at a given energy are identical.
3. Halperin and Lax assumed that each state can be occupied by only one electron. (Kane (1985) has recently introduced a factor of 2 in the Halperin and Lax expression for the density of states to take the spin degeneracy into account.)
4. Since electrons can also be bound to the excited states, there will be an additional contribution to the density of states due to the excited states. This contribution will be at energies closer to the band edge. Halperin and Lax neglected this contribution.
5. High density of dopants, the effective mass approximation, a Gaussian distribution of the potential fluctuations, and linear Thomas-Fermi screening are used in the calculations.

Assumptions 1, 2, and 4 restrict the validity of the calculations to low energies deep in the bandgap. Assumption 5 restricts it to heavily doped semiconductors. Assuming the trial wave function to be $f(\mathbf{r} - \mathbf{r}_0)$, as discussed in assumption 2 above, the energy associated with the bound state can be written as

and the potential energy $V'(\mathbf{r}_0)$ is

$$V'(\mathbf{r}_0) = \int f^{*}(\mathbf{r} - \mathbf{r}_0)\,V(\mathbf{r})\,f(\mathbf{r} - \mathbf{r}_0)\,d\mathbf{r}. \tag{40b}$$

$V'(\mathbf{r}_0)$ fluctuates about a zero mean value as $\mathbf{r}_0$ varies (Abram et al., 1978). It is assumed that every minimum in $V'(\mathbf{r}_0)$ gives a bound state. The density of states $\rho_{HL}$ is calculated by counting the number of minima in $E'(\mathbf{r}_0)$. The density of states $\rho_{HL}$ comes out to be proportional to $\exp\{-b(\nu,z)/2\xi'\}$, where $z$ is a free parameter and $\nu$ is given below by Eq. (42d). The parameter $z$ for the best ground-state wave function is obtained by a variational method, by minimizing the exponent $b/2\xi'$. Calculations are made numerically. The final result is

or

$$\frac{1}{2\xi'} = \frac{a^2 k_s^5}{16\pi N_D}.$$

$E$ is the energy measured from the conduction band edge. The reciprocal screening length $k_s$ is given by Eq. (3), and the standard deviation $\sigma$ by Eq. (35b). It is interesting that the quantity $1/2\xi'$ in the exponent of Eq. (41) is precisely the quantity that should be small for the semiclassical theory to be valid (see Eq. (39)) over the most probable fluctuations (see the earlier discussion and Abram et al., 1978). Halperin and Lax have given a table of the numerical values of $a(\nu)$ and $b(\nu)$. In their model, the values are independent of the material parameters. We have calculated and plotted the values in the range of energy important for silicon on a more sensitive scale in Fig. 16.

FIG. 16. Values of $a(\nu)$ and $b(\nu)$ plotted as functions of $\nu$. The range of values of $\nu$ useful for Si is chosen.

Halperin and Lax (1966, 1967; see also Hwang, 1970a, 1970b, 1970c) have shown that for linear screening to be valid,

$$\sigma < (E_F - E_c). \tag{43a}$$

Halperin and Lax have also discussed the conditions under which Gaussian statistics holds. There are two physical conditions that must be satisfied for this statistics to be valid. The first condition is that the number of impurity atoms in a sphere of radius $1/k_s$ should be sufficiently large, i.e., $(4\pi/3)\,k_s^{-3} N_D \gg 1$. Even if this condition is satisfied, the potential fluctuations deep in the gap are not Gaussian. Consider, for example, a heavily doped n-type semiconductor. There is no limit to the excess (over the average) density of donors in a fluctuation except that demanded by the solubility limit of the impurity in the host crystal lattice. For majority carriers, tail states are possible at any depth below the conduction band edge. For minority carriers, a concentration $N_{\text{local}}$ less than the average is required for the tail states to be formed. Since the minimum concentration $N_{\text{local}}$ in a sphere of radius $1/k_s$ can only be 0, minority carrier tails must be of limited depth. Gaussian statistics, which is essentially symmetrical about the mean value, always fails at such large depths from the band edges. Halperin and Lax estimated that the minimum value of $b/2\xi'$ in Eq. (41) must be about 3 for the approximation (on which Eq. (41) is based) to be valid, and that $b/2\xi'$ must be less than about 30 for Gaussian statistics to hold. Hwang (1970a, 1970b, 1970c; see also Casey and Panish, 1978) calculated the Halperin-Lax tails for GaAs, and Eymard and Duraffourg (1973) did the same for GaSb. Such calculations have not been published for silicon or germanium.
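The two numerical criteria quoted above are easy to check for a given sample. The sketch below is only an illustration: it counts the impurities in a sphere of radius $1/k_s$ and tests whether $b/2\xi'$ lies in the window of roughly 3 to 30 estimated by Halperin and Lax. The function names are hypothetical, and the inputs ($k_s$, $b$, $\xi'$) are assumed to have been obtained elsewhere.

```python
import math


def impurities_in_screening_sphere(donor_density_cm3, k_s_per_cm):
    """Mean number of donors in a sphere of radius 1/k_s:
    (4*pi/3) * k_s**-3 * N_D. This should be much larger than 1
    for Gaussian statistics to be a reasonable starting point."""
    return (4.0 * math.pi / 3.0) * donor_density_cm3 / k_s_per_cm ** 3


def halperin_lax_window(b, xi_prime, lower=3.0, upper=30.0):
    """Return (inside, exponent), where exponent = b / (2 * xi').

    `lower` and `upper` are the approximate bounds quoted in the text;
    they are estimates, not sharp limits.
    """
    exponent = b / (2.0 * xi_prime)
    return lower <= exponent <= upper, exponent
```

For example, a donor density of $10^{19}$ cm$^{-3}$ with a screening length of 10 nm ($k_s = 10^6$ cm$^{-1}$) gives about 42 donors in the screening sphere.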

C. Recent Quantum Mechanical Theories of Band Tails

1. Feynman Path Integral Method

Using the Feynman path-integral method (Feynman and Hibbs, 1965) and assuming the Gaussian distribution of the band edge potential, Sayakanit et al. (1982; see also Samathiyakanit, 1974; Sayakanit, 1979; Sayakanit and Glyde, 1980) derived the density of states deep in the tail in a heavily doped semiconductor in the following form:

As previously, $z$ is a free parameter to be determined by the variational method. $D_p(x)$ is the well-known parabolic cylinder function of $x$ (Abramowitz and Stegun, 1965), and the other symbols have the same meaning as in the previous section. The functions $a(\nu,z)$ and $b(\nu,z)$ are given by

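$$a(\nu,z) = \frac{(T+\nu)^{3/2}}{8\pi\sqrt{2}\,z^{6}\exp(z^{2}/2)\,D_{-3}^{2}(z)}, \tag{45}$$

$$b(\nu,z) = \frac{\sqrt{\pi}\,(T+\nu)^{2}}{2\sqrt{2}\,\exp(z^{2}/4)\,D_{-3}(z)}. \tag{46}$$

$T$ is the kinetic energy of localization,

$$T = \frac{3}{2z^{2}}. \tag{47}$$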
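The parabolic cylinder function $D_{-3}(z)$ that enters $a(\nu,z)$ and $b(\nu,z)$ can be evaluated without special libraries. The sketch below is one possible approach, not the numerical method of the cited authors: it starts from the standard closed forms $D_0(z) = e^{-z^2/4}$ and $D_{-1}(z) = \sqrt{\pi/2}\,e^{z^2/4}\,\mathrm{erfc}(z/\sqrt{2})$ and applies the recurrence $D_{\nu+1}(z) - zD_\nu(z) + \nu D_{\nu-1}(z) = 0$ downward. The function names are hypothetical, and the recurrence loses accuracy through cancellation for large $z$, so it is intended for moderate $z$ only.

```python
import math


def parabolic_cylinder_negative_orders(z):
    """Return (D_{-1}(z), D_{-2}(z), D_{-3}(z)) for moderate real z."""
    d_0 = math.exp(-z * z / 4.0)
    d_m1 = (math.sqrt(math.pi / 2.0) * math.exp(z * z / 4.0)
            * math.erfc(z / math.sqrt(2.0)))
    # Recurrence with nu = -1:  D_0 - z D_{-1} - D_{-2} = 0
    d_m2 = d_0 - z * d_m1
    # Recurrence with nu = -2:  D_{-1} - z D_{-2} - 2 D_{-3} = 0
    d_m3 = (d_m1 - z * d_m2) / 2.0
    return d_m1, d_m2, d_m3


def localization_energy(z):
    """Kinetic energy of localization T = 3 / (2 z^2), Eq. (47)."""
    if z <= 0:
        raise ValueError("z must be positive")
    return 3.0 / (2.0 * z * z)
```

At $z = 0$ the recurrence gives $D_{-3}(0) = \sqrt{\pi}/(2\sqrt{2}) \approx 0.6267$, which agrees with the general formula $D_\nu(0) = 2^{\nu/2}\sqrt{\pi}/\Gamma((1-\nu)/2)$.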

As in the Halperin and Lax model, Eq. (44) includes only the ground states available to the electrons; the contribution of the excited states is ignored. The exact evaluation of the correction due to excited states is difficult and has not been achieved. The variational principle has been derived rigorously by Lloyd and Best (1975), and their work shows that the optimum $\rho(E)$ should maximize the function

$$P(E) = \int_{-\infty}^{E} dE' \int_{-\infty}^{E'} dE''\,\rho(E''). \tag{48}$$
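The function $P(E)$ is a double cumulative integral of the density of states, so it can be tabulated on an energy grid in a single pass. The following is a minimal sketch using the trapezoidal rule. It assumes that $\rho(E)$ is negligible below the first grid point, so that the lower limit $-\infty$ can be replaced by the start of the grid. The function name is hypothetical.

```python
def pressure_functional(energies, rho):
    """Tabulate P(E_k) = int^{E_k} dE' int^{E'} dE'' rho(E'') on a grid.

    `energies` must be increasing; `rho` holds the density of states at
    the same points. Both integrals start at energies[0] (assumption:
    rho is negligible below the grid).
    """
    if len(energies) != len(rho):
        raise ValueError("energies and rho must have the same length")
    n = len(energies)
    cumulative = [0.0] * n  # inner integral: number of states below E
    pressure = [0.0] * n    # outer integral: P(E)
    for k in range(1, n):
        step = energies[k] - energies[k - 1]
        cumulative[k] = cumulative[k - 1] + 0.5 * step * (rho[k] + rho[k - 1])
        pressure[k] = pressure[k - 1] + 0.5 * step * (cumulative[k] + cumulative[k - 1])
    return pressure
```

In a variational calculation, one would evaluate this quantity for trial values of $z$ and keep the value that maximizes it at the energy of interest.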

The variational equation for $z$ obtained by using Eq. (44) and maximizing $P(E)$ in Eq. (48) is given by (Sayakanit et al., 1982)

[i

-

~]j~dx’x’exp[~]D,I,(n’)

+ 3 D-4(z)

[4 D - 3 ( z )

2 z

]{:

z - ~ T v

+

[-;’I

dx’x‘exp - D3/Z(x’)= 0, (494

where

$$x^{2} = \frac{b(\nu,z)}{\xi'}. \tag{49b}$$

The density of states $\rho(E)$ can now be calculated using Eqs. (44), (45), (46), and (49). Equation (44) can be simplified by using two approximations. The first is called the deep-tail approximation, and we will refer to the second as the Halperin and Lax approximation, since it yields a result identical to the Halperin and Lax equation, (41). In the deep-tail approximation, the function $D_{3/2}$ is expanded in powers of $1/x$, and only the leading term is retained in Eq. (44). Equation (44) then reduces to a form similar, but not identical, to that obtained by Halperin and Lax (see Eq. (41)),

where $\rho_S$ is the Sayakanit density of states in the tails. The equation determining the variational parameter $z$ now becomes

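Here $\Gamma(a, y)$ is the incomplete gamma function

$$\Gamma(a, y) = \int_{y}^{\infty} t^{a-1} e^{-t}\,dt \tag{52}$$

and

$$y = \frac{x^{2}}{2} = \frac{b(\nu,z)}{2\xi'}. \tag{53}$$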


Unlike the result of Halperin and Lax, $a(\nu,z)$ and $b(\nu,z)$ now become functions of material parameters and are not uniquely determined by the energy $\nu$. We now discuss the approximation referred to as the Halperin and Lax approximation, which reduces Eq. (44) to Eq. (41). We take the limiting value of $\Gamma(a, y)$ as $y \to \infty$ (or $\xi' \to 0$) and retain the leading term. Equation (51) for $z$ reduces to

The values of a(v) and b(v) come out to be the same as those numerically calculated by Halperin and Lax shown in Fig. 16, and Eq. (50) reduces to (41). The results of different approximations are compared in Fig. 17. It is seen that as compared to the full ground-state case, the deep-tail approximation causes only a small error; the Halperin and Lax limit causes a much larger error. For 5% accuracy, the Halperin and Lax result can be used only for

FIG. 17. Densities of states, in units of $k_s^3/(E_{k_s}\xi'^2)$, for the full ground-state case, for the deep-tail approximation, and for the Halperin and Lax limit, plotted as functions of $\nu$. The Kane density of states is also shown. The point of intersection of the Kane and full ground-state densities at $\nu_1$ is shown by an arrow. [After Sayakanit et al. (1982).]

$x^2/2 > 10$, a much more stringent condition than that ($x^2/2 > 3$) given originally by Halperin and Lax. For small $\nu$, both theories give wrong results: $\rho_{HL}$ is too small, and $\rho_S$ is too large. The range of validity of $\rho_S$ extends to lower values of $\nu$ as compared to the range of $\rho_{HL}$. Sayakanit et al. (1982) have also shown that for values of $\nu$ close to the unperturbed band edge, Kane's result (36) can be derived by using the Feynman path-integral method. The authors have suggested that $\rho_K$ and $\rho_S$ ($\rho_S$ is the Sayakanit $\rho$ given by Eqs. (50) and (51)) should be plotted as has been done in Fig. 17 (Sayakanit et al., 1982). The curves intersect at $\nu = \nu_1$. $\rho_K$ should be used for $\nu < \nu_1$, and $\rho_S$ for $\nu > \nu_1$. (The plot of $\rho_{HL}$ does not intersect $\rho_K$, as the error in $\rho_{HL}$ is too large at $\nu = \nu_1$.) The two curves can be smoothly joined near $\nu = \nu_1$ (Sayakanit et al., 1982).

2. Theory of Serre and Ghazali

Serre et al. (1981) have re-examined the deep-tail density-of-states theory and have calculated the tail states using a different approach. In spirit, their theory is similar to that of Halperin and Lax (1966, 1967) and of Sayakanit et al. (1982) in the sense that it is quantum mechanical, and Serre et al. also match the wave function with the potential due to the local impurity concentration. They assume that the semiconductor crystal is divided into small elementary volumes $\Omega$ and that the local impurity concentration is constant within each $\Omega$. As distinguished from the variational theory, they assume that the minimum value of the linear dimension $R_\Omega$ of $\Omega$ is not smaller than about $2\Lambda_\Omega$, where $\Lambda_\Omega$ is the mean free path of the electron, which depends on the energy $E$ and on the local concentration $N_{\text{local}}$ in the volume $\Omega$. Using Klauder's theory of multiple scattering and the Green function method, they calculate the values of $\rho_\Omega(E)$ and $\Lambda_\Omega$ as a function of the radius $R_\Omega$ of the volume $\Omega$, as shown in Fig. 18.

They then select the values of $R_\Omega$ and $\Lambda_\Omega$ self-consistently by using the points where the $R_\Omega/2$ line intersects the $\Lambda_\Omega$ curves. Values of $\rho_\Omega(E)$ for the values of $R_\Omega$ such that $R_\Omega = 2\Lambda_\Omega$ are shown by the points of intersection in the upper part of Fig. 18. The $\rho_\Omega(E)$ in the tail calculated by taking these values from the upper part of Fig. 18 is shown in Fig. 19a. The Halperin and Lax $\rho_{HL}$ is also plotted for comparison. It is seen that at low energies, $\rho_{HL}$ is slightly larger than that calculated by Serre et al. This happens because the variational method implies unrealistically small values of the volume $\Omega$ at low $E$; the volume is so small that bulk properties cannot be used. The difference between the two values of the density of states is negligible, however. Comparison at higher values of $E$ (i.e., close to the band edge) cannot be made because here the Halperin and Lax theory is not accurate.

From a practical point of view, the most important question is how to model the tail states over the whole range of energies. Hwang (1970a, 1970b,

FIG. 18. Average values (a) of the density of states $\rho_\Omega$ and (b) of the mean free path $\Lambda_\Omega$ as a function of the normalized radius $R_\Omega$ of the sampling volume $\Omega$, for different values of energy $E$. The crossing of the straight line $R_\Omega/2$ and of the $\Lambda_\Omega$ curves defines the self-consistent values of $R_\Omega$. The vertical dashed lines that join the corresponding curves of (b) and (a) define the self-consistent density of states at a given energy. Energies are given in units of the effective Rydberg $R$; lengths in units of the effective Bohr radius $a$; impurity concentration in units of $(\pi/3)(4a)^{-3}$; and density of states in units of $(Ra^3)^{-1}$. [After Serre et al. (1981).]

FIG. 19. (a) Density of states $\rho(E)$ as a function of reduced energy. Solid line, theory; dashed line, theory omitting the impurity-concentration-fluctuation effect; dotted line, parabolic band; and dash-and-dot line, from the Halperin-Lax tabulation. (b) Density of states as a function of energy for different degrees of compensation, keeping $N_D - N_A$ constant. Solid line, theory; dashed line, theory omitting the impurity-concentration-fluctuation effect; and dotted line, parabolic band. Units are the same as in Fig. 18. [After Serre et al. (1981).]

1970c) interpolated the Halperin and Lax tails $\rho_{HL}$ with the distorted parabolic band calculated by Bonch-Bruevich (1966) for this purpose. Casey and Stern (see the discussion of this work in Casey and Panish, 1978) combined the Halperin and Lax tails at low energies with Kane's results at high energies in the gap near the band edge. In the formulation of Serre and Ghazali, Kane's tails at high energies are not required, and no interpolation is necessary. We have already discussed the method suggested by Sayakanit et al. (1982): use $\rho_S$ at energies away from the band edge and $\rho_K$ near the band edge, and interpolate smoothly between them near the point where they intersect. This method seems to be the best at the present time (Kane, 1985). The results for a compensated semiconductor calculated by Serre et al. are shown in Fig. 19b. These results emphasize again that compensation considerably enhances the exponential tails due to the concentration fluctuation. The effect arises because of the reduction in the free carrier concentration and a large increase in the screening length.
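The splicing recipe attributed above to Sayakanit et al. (1982) can be sketched in a few lines: locate the energy $\nu_1$ at which the two model densities cross, then use one model on each side. The code below is only an illustration of that idea. The functions passed in are placeholders for $\rho_K$ and $\rho_S$, the names are hypothetical, and the switch at $\nu_1$ is abrupt, whereas the text recommends joining the curves smoothly near the crossing.

```python
def find_crossing(f, g, lo, hi, tol=1e-10):
    """Bisection for f(x) = g(x), assuming f - g changes sign on [lo, hi]."""
    d_lo = f(lo) - g(lo)
    d_hi = f(hi) - g(hi)
    if d_lo * d_hi > 0:
        raise ValueError("f - g does not change sign on the interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        d_mid = f(mid) - g(mid)
        if d_lo * d_mid <= 0:
            hi = mid
        else:
            lo, d_lo = mid, d_mid
    return 0.5 * (lo + hi)


def spliced_density(rho_near_edge, rho_deep_tail, nu_cross):
    """Piecewise density: near-edge model below nu_cross, deep-tail
    model above it (no smoothing is applied at the joint)."""
    def rho(nu):
        return rho_near_edge(nu) if nu < nu_cross else rho_deep_tail(nu)
    return rho
```

Because the two models agree at $\nu_1$ by construction, the spliced function is continuous there; only its slope jumps, which is what a smooth interpolation near the crossing would remove.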

IV. EFFECT OF BANDGAP NARROWING ON OPTICAL PROPERTIES

A. Introduction

The band structure and the electron transitions that occur during absorption or emission in a semiconductor are shown in Figs. 20a, b, and c. Figure 20a is for an intrinsic semiconductor. Figure 20b is for a semiconductor doped with an n-type impurity; the concentration of the impurity is just above Mott's critical concentration. Figure 20c is for a semiconductor doped with a very high concentration of n-type impurity. In Fig. 20b, the impurity band and the conduction band have joined each other, but the impurity band has not yet merged completely with the conduction band. In Fig. 20c, the impurity band has merged completely with the conduction band; it does not have a separate existence. The conduction band has shifted downwards nearly rigidly, and Halperin band tails have been formed. In the absorption process in a doped crystal, an electron makes a transition from the valence band to an empty state in the conduction band. The minimum energy needed for a photon to cause such a transition is $E_{go}$ (see Figs. 20b and 20c), known as the optical bandgap. Absorption measurements thus monitor the high-energy empty states of the conduction band. In the photoluminescence process, an electron makes a transition from the filled part of the conduction band to an empty state in the valence band. The spectrum will have a high-energy cutoff at $E_{go}$ and a low-energy cutoff at $E_{gd}$, the fundamental indirect bandgap of the doped semiconductor. These transitions

FIG. 20. Band structure of a heavily doped semiconductor. In (a), the semiconductor is intrinsic; in (b), it is doped n-type with a concentration just above Mott's critical concentration; and in (c), the doping is much higher. The fundamental bandgap $E_g$ of a pure crystal, the reduced bandgap (due to heavy doping) $E_{gd}$ of a doped crystal, the optical bandgap $E_{go}$, the Fermi level $E_F$, and the Halperin tails are shown. During luminescence, HL transitions of electrons occur from the conduction band to the valence band at high levels of exciting light, and LL transitions from the conduction band to impurity levels occur at low levels of exciting light.

are referred to as HL in the figure. Transitions to compensating impurity states (shown as LL in the figure) can occur and give rise to luminescence at lower energies. Luminescence due to HL transitions is observed preferentially when the intensity of the exciting light is high. The LL luminescence is observed at a low level of excitation and only if some compensating impurities are present in the crystal. The transitions between the valence band and the conduction band are affected by distortion of the density of states, by band tails, by the thermal spread of carriers at the Fermi level, and by the presence of the impurity band if the band has not yet merged completely with the conduction band. In addition to these factors, interpretation of absorption measurements is complicated by absorption by free carriers and by additional band-to-band transitions induced by impurity or free carrier scattering. It is interesting to point out that the $pn$ product in a heavily doped n-type semiconductor is given by

$$pn = N_D N_v \exp\left[-\frac{E_F - E_v}{kT}\right] = N_D N_v \exp\left(-\frac{E_{go}}{kT}\right), \tag{55}$$

where $N_D$ is the donor concentration, $N_v$ the effective density of states in the valence band, and $E_v$ the valence band edge; all symbols are defined in Appendix 1. The $pn$ product is determined by the optical bandgap or by the sum of the reduced bandgap $E_{gd}$ and the Fermi energy measured from the


majority band edge. In a sense, like photoluminescence, electrical measurements monitor the filled part of the conduction band and the reduced bandgap.

B. Optical Absorption

1. Phonon-Assisted Transitions

If the energy $h\nu$ of a photon is smaller than the bandgap $E_g$ of an intrinsic semiconductor or $E_{gd}$ of a doped semiconductor, and if the semiconductor is free of defects, it is transparent⁵ to the incident radiation at low temperatures. As the energy $h\nu$ of the photon increases and becomes nearly equal to the bandgap, the absorption of the incident radiation begins, and as $h\nu$ increases further, absorption increases rapidly. This rapid increase in absorption is known as the absorption edge of the semiconductor. In this optical excitation process, the total momentum must be conserved. The simultaneous interaction of the electron with the incident photon and lattice vibrations (or phonons) produces absorption at energies equal to the indirect bandgap $E_g$ (or $E_{gd}$) of intrinsic (or doped) Si and Ge (Cheeseman, 1952; Hall et al., 1954). The momentum is conserved by emission or absorption of a phonon. Macfarlane et al. (1954, and references given in McLean, 1960) made extensive measurements of the optical absorption in silicon and germanium. A review of the early work on optical absorption in these materials has been given by McLean (1960).

2. Energy of Phonons Involved in Optical Absorption

The optical absorption of pure silicon as a function of photon energy is shown in Fig. 21 (McLean, 1960). At 4.2 K, the absorption consists of two components, one beginning at 1.1735 eV and the other at 1.2130 eV. These are identified as being due to emission of momentum-conserving phonons, the TA phonon with an energy of 18 meV and the TO phonon with an energy of 59 meV. At higher temperatures, additional components at lower energies are observed due to absorption of momentum-conserving phonons. In the case of germanium, the momentum-conserving phonons are TA (8 meV), LA (28 meV), LO (30 meV), and TO (36 meV), respectively. Unlike the case of germanium, LA and LO phonons are not seen in the optical absorption in silicon.
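As a quick consistency check on the numbers quoted for silicon at 4.2 K, one can subtract the phonon energy from each absorption threshold. The sketch below assumes the simplest picture, in which a component involving phonon emission begins at $h\nu = E_g + E_p$. The function name is hypothetical, and effects not mentioned in the passage are ignored.

```python
def gap_from_emission_threshold(threshold_ev, phonon_mev):
    """Threshold for phonon emission is E_g + E_p, so E_g = threshold - E_p."""
    return threshold_ev - phonon_mev / 1000.0


# Values quoted in the text for pure Si at 4.2 K
gap_from_ta = gap_from_emission_threshold(1.1735, 18.0)  # TA phonon
gap_from_to = gap_from_emission_threshold(1.2130, 59.0)  # TO phonon
```

The two estimates, 1.1555 eV and 1.1540 eV, agree to within about 1.5 meV, which supports the assignment of the two components to the TA and TO phonons.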
Brockhouse and Iyenger (1957, 1958) measured the phonon dispersion curves of silicon and germanium by neutron scattering experiments. Their results show that these phonons have the right wave vectors required for momentum conservation in the indirect transitions of the electrons during the absorption process.

⁵ There will be free carrier absorption at low energies in the case of the doped semiconductor.

FIG. 21. The absorption edge in intrinsic Si at different temperatures, plotted against photon energy $h\nu$ (eV). [After McLean (1960).]

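3. Theory of Optical Absorption

The expressions for the absorption coefficient $\alpha_p$ in intrinsic crystals were derived by Macfarlane and co-workers and are discussed in the reviews by McLean (1960) and by Balkanski et al. (1969). For parabolic bands, the expression for the phonon-assisted absorption coefficient $\alpha_p$ in an intrinsic semiconductor crystal is (Balkanski, 1969)

$$\alpha_p = \frac{8A^2}{\pi}\,\frac{(h\nu - E_g + E_p)^2}{\exp(E_p/kT) - 1}\int_0^1 \frac{x^{1/2}(1-x)^{1/2}\,dx}{1+\exp\left[\dfrac{E_F - (h\nu - E_g + E_p)x}{kT}\right]} + \frac{8A^2}{\pi}\,\frac{(h\nu - E_g - E_p)^2}{1-\exp(-E_p/kT)}\int_0^1 \frac{x^{1/2}(1-x)^{1/2}\,dx}{1+\exp\left[\dfrac{E_F - (h\nu - E_g - E_p)x}{kT}\right]}. \tag{56a}$$

At low temperatures, Eq. (56a) reduces to

$$\alpha_p(LT) = A^2\left[\frac{(h\nu - E_g + E_p)^2}{\exp(E_p/kT) - 1} + \frac{(h\nu - E_g - E_p)^2}{1-\exp(-E_p/kT)}\right]. \tag{56b}$$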

$E_p$ is the energy of the phonon involved in the absorption process, and $A$ is a constant that depends on the effective mass and on temperature. The band-to-band absorption coefficient $\alpha_d$ of a doped crystal can be written as (Haas, 1962; Balkanski et al., 1969)

$$\alpha_d = \alpha_{pd} + \alpha_e. \tag{57}$$

Here $\alpha_{pd}$ is the absorption coefficient in the doped semiconductor involving phonons only, and $\alpha_e$ is the absorption coefficient due to extra band-to-band absorption induced by free carrier scattering or by impurity scattering. It is assumed that the value of $A$ in the expression for $\alpha_{pd}$ in the case of a doped crystal remains the same as in Eq. (56) for an intrinsic crystal; the only change that occurs is in the bandgap, which changes from $E_g$ for the intrinsic crystal to $E_{gd}$ for the doped crystal. Therefore, the expression for $\alpha_{pd}$ can be written as

$$\alpha_{pd} = \frac{8A^2}{\pi}\,\frac{(h\nu - E_{gd} + E_p)^2}{\exp(E_p/kT) - 1}\int_0^1 \frac{x^{1/2}(1-x)^{1/2}\,dx}{1+\exp\left[\dfrac{E_F - (h\nu - E_{gd} + E_p)x}{kT}\right]} + \frac{8A^2}{\pi}\,\frac{(h\nu - E_{gd} - E_p)^2}{1-\exp(-E_p/kT)}\int_0^1 \frac{x^{1/2}(1-x)^{1/2}\,dx}{1+\exp\left[\dfrac{E_F - (h\nu - E_{gd} - E_p)x}{kT}\right]}, \tag{58a}$$

and at low temperatures Eq. (58a) reduces to

$$\alpha_{pd}(LT) = A^2\left[\frac{(h\nu - E_{gd} + E_p)^2}{\exp(E_p/kT) - 1} + \frac{(h\nu - E_{gd} - E_p)^2}{1-\exp(-E_p/kT)}\right]. \tag{58b}$$
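The structure of the phonon-assisted absorption expression for a doped crystal can be explored numerically. The sketch below is only an illustration under stated assumptions: each term is a prefactor of the form $A^2(h\nu - E_{gd} \pm E_p)^2$ divided by a Bose factor, multiplied by $(8/\pi)$ times an integral over $x$ that weights final states by the probability that they are empty. The Fermi energy is assumed to be measured from the majority band edge, and each term is set to zero below its threshold; that cutoff and all function and parameter names are assumptions of this sketch, not statements from the source.

```python
import math


def fermi_weight(e_fermi, e_max, kt, steps=2000):
    """(8/pi) * int_0^1 sqrt(x(1-x)) / (1 + exp((E_F - e_max*x)/kT)) dx,
    evaluated with the midpoint rule. Tends to 1 when all final states
    are empty and to 0 when they are all filled."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        arg = (e_fermi - e_max * x) / kt
        empty = 0.0 if arg > 700.0 else 1.0 / (1.0 + math.exp(arg))
        total += math.sqrt(x * (1.0 - x)) * empty
    return (8.0 / math.pi) * total * h


def alpha_pd(hv, e_gd, e_p, a_const, e_fermi, kt):
    """Phonon-assisted band-to-band absorption in a doped crystal.
    All energies are in the same unit (e.g., eV)."""
    alpha = 0.0
    absorb = hv - e_gd + e_p  # process in which a phonon is absorbed
    emit = hv - e_gd - e_p    # process in which a phonon is emitted
    if absorb > 0:
        alpha += (a_const ** 2 * absorb ** 2 / (math.exp(e_p / kt) - 1.0)
                  * fermi_weight(e_fermi, absorb, kt))
    if emit > 0:
        alpha += (a_const ** 2 * emit ** 2 / (1.0 - math.exp(-e_p / kt))
                  * fermi_weight(e_fermi, emit, kt))
    return alpha
```

When the Fermi level lies far below the band edge, `fermi_weight` tends to 1 because $\int_0^1 x^{1/2}(1-x)^{1/2}dx = \pi/8$, and the expression reduces to the closed form without integrals.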

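The expression for $\alpha_e$ due to the extra impurity- or free-carrier-induced absorption is (Balkanski et al., 1969)

$$\alpha_e = \frac{8B^2}{\pi}\,(h\nu - E_{gd})^2\int_0^1 \frac{x^{1/2}(1-x)^{1/2}\,dx}{1+\exp\left[\dfrac{E_F - (h\nu - E_{gd})x}{kT}\right]}, \tag{59}$$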

where $B$ is another constant relevant for the doped semiconductors only. The free carriers in a doped semiconductor also give rise to additional absorption that does not involve band-to-band transitions. The classical expression for the free carrier absorption coefficient $\alpha_{free}$ is (Balkanski, 1969)

Finally, the observed absorption coefficient $\alpha_{obs}$ is given by

$$\alpha_{obs} = \alpha_{free} + \alpha_{pd} + \alpha_e.$$

To interpret experiments, the absorption coefficient due to free carriers, $\alpha_{free}$, is subtracted from the observed coefficient $\alpha_{obs}$. Values of $A$ and $E_p$ are obtained by fitting the absorption data of the intrinsic crystal with Eq. (56a) or Eq. (56b). The value of the Fermi level with respect to the band edge can be calculated because the concentration of the free carriers is known. The bandgap $E_{gd}$ and $B$ are treated as adjustable parameters, and they are determined by fitting the sum of Eqs. (58) and (59), i.e., $\alpha_d = \alpha_{pd} + \alpha_e$, with the experimental values of $\alpha_d = \alpha_{obs} - \alpha_{free}$.

4. Experimental Results and Discussion

a. Heavily Doped n-Type Germanium

Optical absorption in heavily doped n-type germanium was measured by Pankove and Aigrain (1962) at 4.2 K and 300 K, and by Haas (1962) at 80, 200, and 295 K. Typical values of the absorption coefficients $\alpha_p$ and $\alpha_{obs} = \alpha_{free} + \alpha_{pd} + \alpha_e$ for an n-type ($N_D = 1.95 \times 10^{19}$ cm$^{-3}$) heavily doped germanium sample are shown in Fig. 22. The results for the doped germanium are taken from the paper by Haas and are for the temperature 80 K. The results for intrinsic germanium (at 77 K) are taken from McLean (1960) and are approximate, since they were read from the plot of $\alpha$ vs. $h\nu$ in Fig. 5 of this reference. The absorption coefficient of the doped crystal increases rapidly at low energies because of free carrier absorption. In a considerable part of the absorption curve near the edge, $h\nu \approx 0.75$ eV, with contributions due to both free carrier absorption and band-to-band absorption, the absorptions overlap. Extracting values of the absorption coefficient for band-to-band transitions in this region is

FIG. 22. The absorption coefficient of germanium as a function of photon energy $h\nu$ (eV). Curve 1 is for an n-type Ge crystal doped with $N_D = 1.95 \times 10^{19}$ cm$^{-3}$ (Haas, 1962); curve 2 is for an intrinsic Ge crystal (McLean, 1960).

FIG. 23. Values of the square root of the absorption coefficient $\alpha_d$ of doped Ge crystals due to band-to-band transitions, shown as a function of photon energy $h\nu$ (eV) at 80 K. Curves 1 to 6 are for uncompensated n-type Ge doped with zero, $2.4 \times 10^{18}$, $4.5 \times 10^{18}$, $9.6 \times 10^{18}$, $19.5 \times 10^{18}$, and $43.0 \times 10^{18}$ cm$^{-3}$ P atoms; curve 8 is for compensated n-type Ge doped with $15 \times 10^{18}$ cm$^{-3}$ Ga atoms and $26 \times 10^{18}$ cm$^{-3}$ P atoms. [After Haas (1962).]

therefore difficult. The procedure adopted by most authors is to extrapolate the low-energy part of the observed curve (where $\alpha$ due to band-to-band transitions is negligible) to higher energies to obtain $\alpha_{free}$, and to subtract it from the $\alpha_{obs}$ curve to get the absorption coefficient $\alpha_d = \alpha_{pd} + \alpha_e$ due to band-to-band transitions in the doped crystal. Since the contributions due to both processes are comparable in this energy range, the error in the values of $\alpha_{pd} + \alpha_e$ for band-to-band transitions extracted in this manner can be quite large. The absorption at low energies is also sensitive to unintentional compensating impurities, which makes the situation even more complicated (Balkanski et al., 1969). We will see later that, on this account, the error introduced in the value of the bandgap determined from absorption measurements is very large in the case of Si.

The 80 K values of $\alpha_d$ given by Haas (1962) for pure (curve 1) and six doped samples (curves 2 to 6 and curve 8) of germanium are shown in Fig. 23. Samples 2 to 6 are n-type, and sample 8 is also n-type but compensated. The fast-rising part of the curves at higher energies is due to transitions to the higher direct band. The values of $E_p$ and $A$ are found by fitting Eq. (56) to the observed values of $\alpha_p$ (curve 1). It is found that both $A$ and $E_p$ are temperature-dependent. Since Eq. (56) has only one phonon energy, the value of $E_p$ determined in this manner is some kind of average of the four phonons that participate in the absorption process. Usually the TO phonon dominates the process of absorption, and describing the absorption by one $E_p$ is not a bad approximation.
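The extrapolate-and-subtract procedure described above can be sketched as follows. This is a minimal illustration, assuming the classical free carrier absorption varies as $\lambda^2$ and passes through the origin, so that a single coefficient $C$ fitted on the low-energy points can be extrapolated to higher energies. The function names and the `fit_below_ev` cutoff are hypothetical choices, not part of any cited author's method.

```python
def wavelength_um(photon_energy_ev):
    """Convert photon energy (eV) to wavelength (micrometres)."""
    return 1.239842 / photon_energy_ev


def fit_lambda_squared(wavelengths_um, alphas):
    """Least-squares coefficient C of alpha = C * lambda^2 (through the origin)."""
    num = sum(a * w ** 2 for w, a in zip(wavelengths_um, alphas))
    den = sum(w ** 4 for w in wavelengths_um)
    return num / den


def subtract_free_carrier(photon_energies_ev, alpha_obs, fit_below_ev):
    """Estimate the band-to-band part alpha_d = alpha_obs - alpha_free.

    Points with photon energy below `fit_below_ev` are assumed to be
    dominated by free carrier absorption and are used for the fit.
    """
    low = [(wavelength_um(e), a)
           for e, a in zip(photon_energies_ev, alpha_obs) if e < fit_below_ev]
    if not low:
        raise ValueError("no points below the fitting cutoff")
    c = fit_lambda_squared([w for w, _ in low], [a for _, a in low])
    return [a - c * wavelength_um(e) ** 2
            for e, a in zip(photon_energies_ev, alpha_obs)]
```

Because the two contributions are comparable near the edge, a small error in the fitted coefficient translates into a large relative error in the extracted band-to-band absorption, which is the difficulty noted in the text.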

FIG. 24. Comparison of calculated absorption curves and experimental data for the Ge sample with $N_D = 4.3 \times 10^{19}$ cm$^{-3}$, plotted against $h\nu$ (eV). At 295 K, the two contributions used to determine the total calculated absorption curve are shown separately: curve a is the phonon contribution, and curve b is the electron-electron contribution. [After Haas (1962).]

The best fit of the theoretical $\alpha_d$, given by the sum of Eqs. (58) and (59), with the experimental values of $\alpha_d = \alpha_{obs} - \alpha_{free}$ for a Ge sample doped with $4.3 \times 10^{19}$ cm$^{-3}$ is shown in Fig. 24 at three temperatures. For 295 K, the phonon and the free-electron-induced scattering contributions are shown separately. It is seen that the free-carrier- or impurity-induced band-to-band absorption is very large in Ge. As discussed earlier, the values of $B$ and $E_{gd}$ were determined by the best fit of the experimental data with the theory. $B$ was found to be independent of temperature; it increased with the doping concentration. In compensated samples, $B$ did not depend on the total impurity concentration, but only on the free carrier concentration. The value of $E_{gd}$ was found to be smaller than $E_g$, and values of $\Delta E_g = E_g - E_{gd}$ were calculated in this manner. The values of $\Delta E_g$ obtained by Haas are compared with the theoretical values obtained by Berggren and Sernelius (1981) in Fig. 25. The experimental results show that the bandgap narrowing values decrease as the temperature increases. The experimental values are somewhat lower than the theoretical values. The values of $\Delta E_g$ in the compensated samples are found to be considerably larger, for the same carrier concentration, than those in the corresponding uncompensated samples. For example, in sample 8 (Fig. 23), the total impurity concentration $N_D + N_A$ was $4.7 \times 10^{19}$ cm$^{-3}$, but the free carrier concentration was $1.1 \times 10^{19}$ cm$^{-3}$; the value of bandgap narrowing in this sample was 80 meV. In an uncompensated sample, the value is 47 meV for a larger free carrier concentration of $1.95 \times 10^{19}$ cm$^{-3}$. This result is

FIG. 25. Values of $\Delta E_g$ vs. donor concentration $N_D$ (cm$^{-3}$) in n-type Ge. Curve 1 is theoretical (Berggren and Sernelius, 1981); curves 2 and 3 are drawn through the experimental data of Haas (1962) at 80 K and 293 K, respectively.

consistent with the effect of compensation on optical absorption in compensated germanium discussed by Stern (1971) and also with the theoretical results of Serre et al. (1981) discussed in Section III. Pankove and Aigrain interpreted their results somewhat differently. Their values of $\Delta E_g$ are considerably higher than both the experimental and the theoretical values shown in Fig. 25. They also found that $B$ was strongly dependent on the impurity concentration and was independent of temperature.

b. Heavily Doped n-Type Silicon

Vol'fson and Subashiev (1967) studied both n-type and p-type samples doped with As, P, Sb, and B in the concentration range $1 \times 10^{18}$ cm$^{-3}$ to $1 \times 10^{20}$ cm$^{-3}$ at room temperature. Balkanski et al. (1969) studied P-doped samples in the concentration range $6 \times 10^{18}$ to $4.9 \times 10^{20}$ cm$^{-3}$ at 35, 85, and 300 K. Schmid (1981) studied both As-doped and B-doped samples at 4 and 300 K. The concentration of As was in the range $6 \times 10^{18}$ to $4 \times 10^{19}$ cm$^{-3}$, and that of B, $5 \times 10^{18}$ to $1.2 \times 10^{19}$ cm$^{-3}$. Experimental results of Balkanski et al. (1969) for P-doped samples are shown in Fig. 26. The free carrier absorption was determined by Balkanski using Eq. (60). For the two least-doped samples, $\alpha_{free}$ vs. $\lambda^2$ curves

FIG. 26. Values of the measured absorption coefficient α_obs of phosphorus-doped Si are shown as a function of wavelength λ (μm) of incident light. Concentrations of donor atoms, assumed to be equal to free electron concentrations, are shown on the curves. [After Balkanski et al. (1969).]

FIG. 27. Free carrier absorption α_free in two samples of n-type Si (solid lines, N_D = 1 × 10'" cm⁻³; dashed lines, N_D = 6 × 10¹⁸ cm⁻³) is plotted as a function of λ² (μm²) at 300 K (○), 85 K (△), and 35 K (×). [After Balkanski et al. (1969).]

are shown in Fig. 27.⁶ As predicted by Eq. (60), the plots are straight lines in the low-energy part where free carrier absorption dominates. Balkanski et al. (1969) and others (Vol'fson and Subashiev, 1967; Schmid, 1981) also determined A, E_gd, and ΔE_g using the absorption data in pure and doped Si and using the method of Haas (1962) discussed in the previous section. The values of BGN = ΔE_g = E_g − E_gd determined in this manner are

⁶ The plateau seen in the lowest two curves is attributed to the electronic absorption from the Fermi level to some higher states in the conduction band (Balkanski et al., 1969; see also Spitzer and Fan, 1957). A similar band also exists due to intervalence band transitions in B-doped p-type samples (Schmid, 1981).

FIG. 28. Comparison of ΔE_g in heavily doped Si derived from optical absorption measurements by different workers with theory (solid line, Berggren and Sernelius, 1981): (○) Balkanski, 35 K; (●) Balkanski, 300 K; (△) Schmid, 4 K; (+) Vol'fson, RT. The critical concentration N_c for Si:P is indicated.

plotted in Fig. 28, along with the theoretical values of Berggren and Sernelius (1981). This figure shows that the values of ΔE_g determined using absorption measurements in Si are considerably smaller than the theoretical values, a result different from that obtained with Ge. The bandgap narrowing is higher at lower temperatures, in agreement with the observation of Haas for Ge.
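The free-carrier subtraction that underlies these absorption analyses can be sketched numerically. The example below is only an illustration: it assumes that Eq. (60) has the form α_free = Cλ² (as suggested by the straight lines in Fig. 27), the absorption values are invented, and the function names are hypothetical rather than taken from any of the cited papers.

```python
def fit_free_carrier_coefficient(wavelengths_um, alpha_obs):
    """Least-squares slope C of alpha = C * lambda**2 through the origin."""
    x = [w ** 2 for w in wavelengths_um]
    numerator = sum(xi * ai for xi, ai in zip(x, alpha_obs))
    denominator = sum(xi * xi for xi in x)
    return numerator / denominator


def band_to_band_absorption(wavelength_um, alpha_obs, c_free):
    """alpha_d = alpha_obs - alpha_free, assuming alpha_free = C * lambda**2."""
    return alpha_obs - c_free * wavelength_um ** 2


# Invented long-wavelength data (cm^-1) where free carrier absorption dominates.
long_wavelengths = [3.0, 4.0, 5.0, 6.0]  # micrometres
long_alpha = [20.0 * w ** 2 for w in long_wavelengths]

c_free = fit_free_carrier_coefficient(long_wavelengths, long_alpha)

# Invented observed absorption near the edge; the extrapolated
# free-carrier part is subtracted to leave the band-to-band part.
alpha_d = band_to_band_absorption(1.1, 60.0, c_free)
print(c_free, alpha_d)
```

Because the extrapolated free-carrier term is subtracted from a total of comparable size near the edge, any error in the fitted coefficient C passes directly into α_d.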

5. Comparison of Optical Absorption in Ge and Si

A comparison of optical absorption in doped Ge and Si shows that the free-carrier-induced or impurity-induced band-to-band absorption is much larger in Ge than in Si (see Fig. 23 for Ge and Figs. 26 and 29 for Si). In fact, Schmid (1981) and Vol'fson and Subashiev (1967) do not find any evidence of band-to-band absorption due to impurity or free-carrier scattering. The slope of the absorption vs. hν plots depended only slightly on the free carrier concentration. To fit the curved portion at lower energies, Schmid included band-tail density of states in the integral in Eqs. (58a) and (59). He used a semiclassical theory to calculate the band tails. We have already shown in Section III that the band tails are grossly overestimated in this type of theory. The error arising from the uncertainty in determining the free carrier absorption and its effect on ΔE_g in Si has been discussed by Wagner (1985a, 1985b), Wagner and del Alamo (1988), and Pantelides et al. (1985); it is illustrated in Fig. 29. The absorption data in this figure are from the paper of Schmid

FIG. 29. The optical absorption spectra of Schmid (1981) in heavily doped Si:As at 4 K, plotted against photon energy (eV). The short vertical lines are the calculated Fermi levels taking into account BGN due to heavy doping. The inset is described in the text (Pantelides et al., 1985).

(1981). The positions of the Fermi levels derived from photoluminescence data are shown by short vertical lines (Pantelides et al., 1985). These lines are to the left of the dip in the experimental curves. The inset explains why this is so. By fitting his theory to the case of 6 × 10¹⁸ cm⁻³, Schmid obtained a zero value of bandgap narrowing. If this were true, the optical edge should be to the right of the edge of intrinsic silicon as a result of band filling. The actual optical edge is to the left. The error was caused by the uncertainty in determining the free carrier absorption,⁷ which causes error in determining the absorption edge. Pantelides et al. (1985) have shown that Schmid's data yield values of BGN consistent with theory (Berggren and Sernelius, 1981) and with luminescence experiments if free carrier absorption is "correctly" determined. The same conclusion can be derived from Wagner's photoluminescence excitation experiments, which are discussed in the next section.

To summarize, we can say that the effects of doping on optical absorption are quite different in Ge and in Si. Doping increases optical absorption considerably in Ge, but only slightly in Si. The procedure used for subtracting free carrier absorption is valid for Ge, but gives large errors in the case of Si. It is for these reasons that absorption experiments have been used successfully in determining BGN in Ge, but have yielded wrong values of BGN in Si.

⁷ Hartman (quoted by Benoit and Voos, 1967) has suggested that errors in interpreting observed absorption spectra can also arise because the matrix elements used in the theory are assumed to be energy independent.

C. Photoluminescence Excitation (PLE) Absorption

In the PLE absorption (also known as selective absorption) technique, one measures the luminescence intensity as a function of the energy of the exciting photon. Assuming that the radiative recombinations are a constant fraction of the total recombinations, it can be shown easily that the luminescence intensity is proportional to the absorption coefficient (at the energy of the exciting photon) due to band-to-band transitions. Using this technique, one can measure band-to-band absorption without the complication caused by free carrier absorption discussed in the previous section. Wagner (1984, 1985a) has made PLE absorption measurements on both n-type and p-type heavily doped silicon in the doping range 1 × 10¹⁷ cm⁻³ to 1.5 × 10¹⁹ cm⁻³. The results for silicon containing 8 × 10” cm⁻³ P atoms and 2 × 10¹⁸ cm⁻³ B atoms are shown in Fig. 30 (Wagner, 1985a). The results for an intrinsic sample shown in the bottom of the figure are in general agreement with the earlier work (McLean, 1960) and allow the identification of the band edges modified with the momentum-conserving TA and TO phonon energies as

FIG. 30. PLE spectra of Si:P and Si:B at T = 5 K, plotted against laser wavelength (nm). The lowest dashed curve shows for comparison an excitation spectrum of the free-exciton luminescence in a pure sample. IB indicates impurity-band-related absorption; E_o (pure sample) and E_od (doped sample) mark the onset of band-to-band absorption. TO, TA, and NP refer to TO-phonon-, TA-phonon-, and no-phonon-assisted transitions, respectively. [After Wagner (1985a).]

indicated in the figure. In these experiments, only LL luminescence (see discussion of Fig. 20) was monitored because the excitation level was low.⁸ When the crystals are doped, both E_go(TO) and E_go(TA) are shifted to lower energies. In the p-type sample, these features shift from 1022 and 1056 nm to 1026 and 1060 nm, respectively. Four other features are seen in the boron-doped samples at lower energies. Two peaks related to the impurity band absorption are seen; one is the no-phonon peak IB(NP), and the other is the TA phonon peak IB(TA). On the high-energy side of these peaks, the features associated with the phononless band-to-band transition and with the similar transition due to the TA phonon are seen. The features associated with the TA phonons are not resolved in the phosphorus-doped samples. At low temperatures, the energies determined by these features are the sums of the appropriate band edges and the energies of the corresponding momentum-conserving phonons emitted during the absorption of photons. Since the energies of the phonons are known, the band edges can be easily determined from these plots. Wagner has shown that the observed curves in Fig. 30 agree with the theory of optical absorption discussed in Section IV.B.3⁹ and can be used to obtain reliable values of E_go. The values of E_go determined in this manner are shown in Fig. 31, along with the data determined from the high-energy cutoff of PL peaks to be discussed in the next section. Vertical arrows show the values of N_c, Mott's critical concentration.
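The arithmetic behind this determination can be sketched as follows, using the feature positions quoted above for the pure and the p-type samples. The phonon energies (about 58 meV for the TO phonon and about 18.4 meV for the TA phonon in Si) are approximate values assumed here for illustration and are not quoted in this section; the function names are hypothetical.

```python
HC_EV_NM = 1239.841984  # h*c in eV nm

TO_PHONON_MEV = 58.0  # assumed approximate TO phonon energy in Si
TA_PHONON_MEV = 18.4  # assumed approximate TA phonon energy in Si


def photon_energy_ev(wavelength_nm):
    """Photon energy in eV for a vacuum wavelength in nm."""
    return HC_EV_NM / wavelength_nm


def band_edge_ev(feature_nm, phonon_mev):
    """The absorption onset is the band edge plus the emitted phonon energy,
    so the phonon energy is subtracted from the feature energy."""
    return photon_energy_ev(feature_nm) - phonon_mev / 1000.0


pure_to = band_edge_ev(1022.0, TO_PHONON_MEV)
pure_ta = band_edge_ev(1056.0, TA_PHONON_MEV)
doped_to = band_edge_ev(1026.0, TO_PHONON_MEV)
doped_ta = band_edge_ev(1060.0, TA_PHONON_MEV)

# Shift of each feature on doping, in meV.
shift_to_mev = (pure_to - doped_to) * 1000.0
shift_ta_mev = (pure_ta - doped_ta) * 1000.0
print(pure_to, pure_ta, shift_to_mev, shift_ta_mev)
```

With these assumed phonon energies, the TO and TA features of the pure sample give band edges that agree to within about 1 meV, and both features move down by roughly 4 to 5 meV in the p-type sample. The shift itself does not depend on the assumed phonon energies, since they cancel in the difference.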

D. Photoluminescence

1. Early Work

Significant work on photoluminescence of heavily doped silicon was done in the mid-1960s and 1970s. Williams (1968) has written a review on the optical properties of donor-acceptor pairs in silicon, germanium, III-V, and II-VI semiconductors covering the work done up to 1968. Dean et al. (1967) and Enck and Honig (1969) studied luminescence in heavily doped silicon extensively. The first important paper on luminescence suggesting that the bandgap of Si is modified by heavy doping was published by Benoit and Voos (1967). A value of bandgap narrowing of 34 meV for the doping of 7 × 10'' cm⁻³

⁸ In one case, Wagner (1984) was able to monitor both the LL and HL peaks and found that the position of the band edge associated with a particular phonon was the same in both cases. This suggests that the absorption process is the same for both the LL and the HL peaks and is due to band-to-band transitions.

⁹ Wagner used the theory given by Pankove and Aigrain (1962) for optical absorption in a doped crystal.

FIG. 31. Position of the optical bandgap E_go at T = 5 K vs. carrier concentration (cm⁻³), as determined from the onset of no-phonon and TA- and TO-phonon-assisted band-to-band absorption. Also shown (○) is the high-energy cutoff of the photoluminescence. All data have been corrected for the phonon energy involved. Critical Mott density is indicated by an arrow. [After Wagner (1985a).]

in Si is obtained from these measurements. This value is considerably larger than the values obtained by Balkanski and by Schmid and is in better agreement with the now-accepted value of BGN at this concentration. In a later paper, Benoit and Cernogora (1969) also reported results of their luminescence studies on heavily doped germanium crystals. The results of Benoit et al. and of Enck and Honig were important: they established that reliable values of BGN can be obtained from luminescence measurements.

2. Low-Level (LL) and High-Level (HL) Luminescence

During the years 1974 to 1979, Parsons and co-workers did extensive work on photoluminescence of heavily doped n-type and p-type silicon with different impurity concentrations. They used different intensities of the exciting light and worked at low temperatures up to about 160 K (Bergersen et al., 1976, 1977; Parsons, 1978, 1979; Parsons et al., 1978). One of the important results of this work is that the peak positions of the dominating luminescence spectra depend on the intensity of the exciting light (see also the discussion of Fig. 20). At low intensities of excitation, the peak positions of the stronger spectra were about 40 meV lower in energy than the peaks that dominate at a high intensity of illumination. At low temperatures, the

structure (TA and TO phonon and no-phonon peaks) was similar in the two spectra. The line shapes were different, however. The two spectra are now commonly designated as the LL and HL spectra. Parsons (1979) suggested that "the HL luminescence involves band-to-band transitions whereas the LL luminescence is due to the transitions from the majority carrier band to the residual acceptor levels." Schmid et al. (1981) have measured the photoluminescence of a heavily phosphorus-doped silicon sample before and after implanting the compensating boron impurities. They found a large increase in the LL spectra in crystals containing compensating impurities, confirming Parsons's model of the transitions responsible for these two spectra. The main features of HL spectra are the following:

1. Below a concentration of 1 × 10” cm⁻³, the position of the HL spectrum is independent of the impurity concentration, but at higher concentrations, it shifts to lower energies.
2. Above Mott's critical concentration of about 3 × 10¹⁸ cm⁻³, the lines start to broaden in both P- and B-doped samples, and it becomes difficult to separate the HL spectra from the LL spectra.
3. In both n- and p-type samples, the lines are asymmetrical.
4. The HL spectrum is relatively insensitive to the temperature at least up to 90 K.
5. The decay time of this luminescence is of the order of 0.11 μs.

The most important differences in the behavior of the LL spectrum are that its decay time is about 8.9 μs and the line shape changes during the decay process; that the line shape is strongly temperature-dependent; and that it dominates at low levels of illumination only.

3. Method of Extracting the Values of BGN from the Observed Photoluminescence

The photoluminescence spectrum component for one specific momentum-conserving phonon with energy E_p at 0 K of an ideal heavily doped semiconductor is shown schematically in Fig. 32a. The intensity of luminescence I_p is plotted as a function of energy hν + E_p, where hν is the energy of the emitted photon and E_p is the energy of the emitted momentum-conserving phonon. The heavy doping effects are not included in this figure, i.e., bandgap shrinkage and the distortion of density of states due to heavy doping are ignored, but the effect of band filling is included. In Fig. 32a, luminescence intensity I_p varies as √E = √(hν + E_p) up to E = E_F, and then drops vertically to 0 at E = E_F. The conduction band edge is at the low-energy cutoff point of the spectrum where the density of states becomes zero. The changes that

FIG. 32. Schematic representation of intensity of luminescence as a function of energy hν + E_p for a semiconductor with (a) undistorted parabolic band and (b) distorted band due to heavy doping. The effects of BGN and Halperin tails are shown in (b) (see text).

occur in the spectrum of a "real" heavily doped semiconductor at finite temperature are shown in Fig. 32b. The effect of finite temperature is seen on the high-energy cutoff of the spectrum. Instead of a vertical drop, there is a small tail due to thermal spread of the electrons at the Fermi level. The tail extends over an energy equal to a few kT. Drastic changes take place, however, on the low-energy side. Now the intensity I_p does not vary as √E near the band edge and it does not become zero at the edge. In addition to the curve being distorted, the tail extends into the bandgap. These considerations apply in the high-density regime where appreciable Halperin tails are formed (see Section III). Below Mott's critical concentration N_c, the line broadening and shape distortion occur because the impurity band has not yet merged completely with the conduction band and provides additional density of states just below the conduction band.

The photoluminescence spectrum can be used to determine the bandgaps E_go and E_gd (see Fig. 20 for the definition of E_go and E_gd) of the heavily doped semiconductor. In practice, determination of the optical gap E_go is not so difficult. Parsons (1979) and Wagner in his earlier paper (1984) used the high-energy cutoff point e (Fig. 32b), where I_p becomes practically zero, for determining E_go. Dumke (1983b) used the point d, the point where the negative gradient has maximum value, for this purpose. The point d is a more appropriate choice; however, the error caused by taking the point c or e is small at low temperatures. Determination of the edge E_gd from the low-energy

cutoff is more difficult. The distorted density-of-states function is not always known, the band edges defining E_gd are not well defined, and the exponential tails make it difficult to define the cutoff point. Wagner (1984) has taken a point where the intensity falls to 5% of the peak as the cutoff point to determine the value of E_gd. For a complete interpretation of these spectra, a line-shape analysis may be performed. This involves a quantum mechanical calculation of the transition matrix elements and has not yet been attempted. Parsons (1979), Dumke (1983a), and more recently Wagner (1987) and Wagner and del Alamo (1988) have assumed a constant energy-independent matrix element for the transitions to derive an expression for the line shape. The expression for the line shape (for each component assisted by a particular phonon with energy E_p) is given by (see, for example, Wagner and del Alamo, 1988)

$$I_p(h\nu) \propto \int f_e(E)\,D_e(E)\,f_h(h\nu - E_{gd} + E_p - E)\,D_h(h\nu - E_{gd} + E_p - E)\,dE, \qquad (62)$$

where E_gd is the bandgap reduced due to heavy doping (see Fig. 20) and E_p is the energy of the phonon that assists the transition. Parsons noted that at low temperatures, the energy range kT over which the minority holes are distributed at the top of the valence band is small compared to the width (about 15 meV) of the HL spectra and can be neglected. At 4 K, the thermal tail of the majority carriers at the Fermi level can also be neglected, and Eq. (62) reduces to

$$I_p(h\nu) = A\,D_e(h\nu - E_{gd} + E_p) \quad \text{for } h\nu < E_{go} + E_p, \qquad (63a)$$

$$I_p(h\nu) = 0 \quad \text{for } h\nu > E_{go} + E_p \text{ or } h\nu < E_{gd} + E_p. \qquad (63b)$$

It is found that Eq. (62) or (63) does not describe the observed spectra satisfactorily; the observed linewidth is considerably larger than the width given by Eq. (62) (Dumke, 1983a; Wagner, 1985a, 1985b). Dumke (1983a), and later Wagner (1985a, 1985b), suggested that the observed lines are broadened for several reasons, e.g., finite resolution of the experimental data, incomplete thermalization of the carriers, unknown defects, strains in the lattice, and band distortion. Following a suggestion of Kane (1985), this broadening can be assumed to be Gaussian and can be taken into account by using the following expression for the line shape (Wagner et al., 1988):

where the broadening energy is an adjustable parameter. By fitting Eq. (62) or Eq. (63), along with Eq. (64), to the observed spectra, values of both E_go and E_gd can be determined.
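A minimal numerical sketch of this kind of broadened line shape is given below. It assumes a parabolic band, so that the density of states in Eq. (63a) varies as the square root of energy, and it assumes that the Gaussian broadening acts as a normalized convolution with standard deviation `sigma`. The exact form of Eq. (64) and its broadening parameter are not reproduced here, and all names and numerical values are hypothetical.

```python
import math


def ideal_line(e, e_gd, e_go):
    """Unbroadened line: sqrt density of states between E_gd and E_go, zero outside.
    The argument e stands for h*nu + E_p in eV."""
    if e_gd <= e <= e_go:
        return math.sqrt(e - e_gd)
    return 0.0


def broadened_line(e, e_gd, e_go, sigma, n=400):
    """Convolve the ideal line with a normalized Gaussian (midpoint rule)."""
    step = (e_go - e_gd) / n
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(n):
        ep = e_gd + (i + 0.5) * step
        gauss = math.exp(-0.5 * ((e - ep) / sigma) ** 2) / norm
        total += ideal_line(ep, e_gd, e_go) * gauss * step
    return total


E_GD, E_GO = 1.120, 1.140  # eV, hypothetical band edges
SIGMA = 0.004              # eV, hypothetical broadening

energies = [1.09 + 0.0005 * k for k in range(161)]
spectrum = [broadened_line(e, E_GD, E_GO, SIGMA) for e in energies]

peak_energy = energies[spectrum.index(max(spectrum))]
tail_below_gap = broadened_line(E_GD - 0.002, E_GD, E_GO, SIGMA)
print(peak_energy, tail_below_gap)
```

In an actual fit, E_gd, E_go, and the broadening would be varied until the computed curve matches the measured replica. The sketch only shows that the broadened line acquires intensity below E_gd and that its peak moves below the sharp cutoff at E_go.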

4. BGN Values in n-Type Si Determined by Photoluminescence Measurements

Most work on the interpretation of photoluminescence is devoted to the HL spectra. In the discussion that follows, we always imply HL spectra unless we state otherwise. The typical spectra of both n- and p-type Si (Wagner, 1987) are shown in Fig. 33. The components due to different phonons that assist the transitions are shown in the figure. The relative heights of the components depend on the quantum mechanical transition probabilities and symmetry considerations and have been discussed by Wagner (1985a). Both TA phonon and no-phonon lines increase in intensity as the impurity concentration increases. Wagner has suggested that the TA phonon lines in silicon are partially forbidden and that the transitions therefore gain strength with increasing impurity concentration. The main features of the photoluminescence spectra observed by Parsons and co-workers (Bergersen et al., 1976, 1977; Parsons, 1978, 1979; Parsons et al., 1978), by Schmid et al. (1981), and by Wagner (1985a, 1985b) are the same as those shown in Fig. 33. However, in the earlier published data on p-type silicon, both LL and HL spectra were always seen, presumably because of the higher concentration of residual donors. Since Wagner could observe only HL spectra in B-doped samples, the starting samples he used must have

FIG. 33. Photoluminescence spectra of heavily doped n-type and p-type Si for different carrier concentrations, plotted against photon energy (eV) and wavelength (μm). Arrows indicate the high-energy cutoff of the NP line (E_go, left) and of the TA replica (E_go − E_TA, right) and the low-energy edge of the TO replica (E_gd − E_TO). [After Wagner (1987).]

had a higher degree of purity. TO replicas are most prominent in all the spectra. No-phonon (NP) lines are seen in the phosphorus-doped silicon samples. (Note that the no-phonon lines are at the highest energy in the luminescence, whereas they are at the lowest energy in the selective absorption shown in Fig. 30.) TA phonon replicas are not well resolved at higher dopings in n-type silicon. At the lowest energy, TO plus center-of-zone phonon lines are resolved in all but the most highly doped samples. The spectra shift to lower energies as the concentration increases in the lower concentration range. However, the high-energy edge becomes approximately constant at concentrations higher than about 2 × cm⁻³, and the linewidth increases rapidly. In the case of the highest B doping, 4 × 10²⁰ cm⁻³, the peak position and the high-energy cutoff point move to higher energy. The vertical arrows indicate approximate positions of the low-energy and high-energy cutoff points. These cutoff points are for the TO and NP replicas in the n-type silicon, and for TO and TA replicas for the p-type silicon; they are marked at positions where the luminescence intensity drops to 5% of the peak value. Appropriate phonon energies have to be added to the positions of the arrows to get the values of the band edges. From the high- and low-energy cutoff points of his spectrum, Parsons (1979) determined the values 1.117 eV for E_gd and 1.137 eV for E_go for a silicon sample containing 6 × 10¹⁸ cm⁻³ P atoms at 35 K. Taking E_g of a pure crystal as 1.165 eV at this temperature (Balkanski et al., 1969), the BGN comes out to be 43 meV, considerably larger than the corresponding value obtained by Balkanski et al. (1969). Dumke (1983a, 1983b) analyzed the spectra observed by Schmid et al. (1981) for both n-type and p-type heavily doped silicon and derived the values of the BGN.

Wagner (1984, 1985a, 1985b, 1987) and Wagner and del Alamo (1988) published several papers in which the values of BGN were derived from measurements of photoluminescence, also in both n-type and p-type heavily doped silicon. Both Dumke and Wagner have used the line-shape analysis as well as the high- and low-energy cutoff values of the observed spectra to determine the values of the BGN. The experimental data of Dumke (1983a) and of Wagner (Wagner et al., 1988) can be fitted well with Eqs. (64) and (62), except for the most heavily doped sample, for which the low-energy side of the observed curve is highly distorted and is not well defined. The values of the bandgap E_gd and E_go = E_gd + E_F were determined in this manner. For n-Si, the values of E_F were also determined by Dumke by simple calculation assuming a parabolic band, the cyclotron resonance effective mass, and the measured donor density. The values of E_F obtained by these two methods agreed closely except for the lowest and the highest dopings. For the lowest doping, the value of E_F obtained from the fit is 17 meV as compared to the calculated value of 11 meV, presumably because of the presence of the impurity band. For the highest doping, the value from the fit is 40 meV (from the high-energy cutoff), whereas the calculated value

FIG. 34. Optical bandgap E_go (open triangles) and reduced bandgap E_gd (open circles) vs. dopant concentration N_D (cm⁻³) for n-type silicon as deduced from the low-energy cutoff of photoluminescence spectra by Wagner (1984) and (filled circles) by the curve-fitting method (Wagner and del Alamo, 1988). Experimental data for E_go deduced by Pantelides et al. (1985) from optical absorption data of Schmid (1981) are also shown, by filled triangles. The upper shaded area AB defines calculated E_go and E_gd for a non-interacting uniform electron gas; the middle one CD, E_go and E_gd for an interacting uniform gas neglecting impurity scattering; and the lower shaded area EF, E_go and E_gd with all interactions included. [After Fig. 1 of Berggren and Sernelius (1985).]

is 73 meV. For the highest doping, the effects of concentration fluctuations, band tails, distortion of the density-of-states function, and broadening due to other causes dominate and give rise to the discrepancy. At other doping densities, the high-density theory with a rigid shift of the parabolic band due to heavy doping is a reasonable approximation. The values of bandgap narrowing obtained by Dumke from this fit for n-type silicon agree with the values obtained by Wagner (1988). For p-Si, values obtained by Dumke are considerably larger than those obtained by Wagner et al. (1988) (see Fig. 36, later in this section).

The experimental and theoretical values of bandgaps for n-type Si are summarized in Fig. 34, taken from the paper of Berggren and Sernelius (1985). The PL data shown by open circles for E_gd and by open triangles for E_go have been taken from Wagner (1984). These early values (Wagner, 1984) of the bandgaps were derived by using the low- and high-energy cutoff points. Filled triangles show the values of E_gd derived by Pantelides et al. (1985) from

the optical absorption data of Schmid (1981). Wagner et al. (1988) have also given values of E_go and E_gd derived by the line-shape analysis method. We find that the values of E_go are practically the same in the two papers. However, values of E_gd are significantly higher in the second paper. We have added these recent values of Wagner et al. (1988) as the filled circles in the figure. If we first neglect all interactions, the fundamental bandgap E_gd remains unaltered by the doping and is shown by the horizontal straight line B. The optical bandgap E_go = E_gd + E_F increases rapidly because of band filling and gives rise to the Moss-Burstein shift (the uppermost curve A). The experimental values of the bandgaps E_gd and E_go obtained from PL data shown by open and solid triangles are much smaller as compared to this curve. If, now, carrier-carrier interactions (exchange and correlation effects) are included, both bandgaps move down substantially, as shown by the middle shaded area between curves C and D. Though they move in the direction of the experimental data, the discrepancy is still large. The theoretical values of the bandgaps with all interactions (carrier-carrier and carrier-impurity interactions) included in the calculations are shown by the lowest shaded area between curves E and F. Now agreement between theory and experiment is good.

5. Effect of Temperature on BGN

The effect of temperature on BGN has been discussed by several authors (Haas, 1962; Balkanski et al., 1969; Wagner et al., 1988; Dumke, 1985b; Casey and Panish, 1978; Bennett and Lowney, 1981; Saunderson, 1983; Thuselt and Rosler, 1985a, 1985b) and is rather complex. Wagner et al. (1988) fitted Eqs. (64) and (62) to the RT spectra, taking replicas due to all the phonons (plus one NP line for n-type silicon) into account. The relative strengths of various replicas or components were assumed to be the same as at low temperatures. Phonon emission as well as absorption replicas were considered. The intensity ratios of these Stokes and anti-Stokes lines were calculated using Bose-Einstein distribution factors. The values of BGN at room temperature came out to be the same as at low temperatures, within experimental errors. We have seen earlier that the optical absorption measurements of Haas (1962) for germanium and of Balkanski et al. (1969) for silicon show that the BGN at room temperature is smaller than at low temperatures. Bennett and Lowney (1981), Saunderson (1983), and Thuselt and Rosler (1985a, 1985b) have calculated values of BGN at room temperature. Bennett and Lowney find that for 1 × 10²⁰ cm⁻³ doping in an n-type semiconductor, the BGN at RT is higher by 17 meV than at low temperatures (see also Dumke, 1983a, who has discussed this result). The theoretical results of Saunderson (1983; see also Wagner et al., 1988) are also somewhat higher at room temperature

than at low temperatures, supporting the calculation of Bennett and Lowney. The room-temperature results of Thuselt et al. are very close to the 0 K values. The results of different workers therefore conflict. However, the low-temperature and the room-temperature values of BGN do not differ by more than 10 to 20 meV. These differences are so small that they are within the experimental errors and the errors caused by the approximations used in the theory.

6. BGN in n-Type Si Determined by Different Methods

In Fig. 35, we have compared the values derived from the PL data by the two methods of Wagner, discussed earlier, and from the device measurements. The values shown by open squares in this figure are the early values derived by Wagner (1984) using the high- and low-energy cutoff method. The values obtained by the line-shape analysis (Wagner et al., 1988) are shown by open circles. The values obtained at room temperature, also by line-shape analysis, are shown by the plus signs. Curve 1 is the theoretical curve from the paper of Berggren and Sernelius (1981). Curve 2 represents the device data as recommended by del Alamo et al. (1985a, 1985b) as a result of analysis of all

FIG. 35. Values of ΔE_g for n-Si obtained theoretically and by different experimental methods. Curve 1 is that calculated by Berggren and Sernelius (1981, 1985). Curve 2 is based on the device data and is obtained by converting the apparent BGN values given by del Alamo et al. (1985b) into real BGN values. The values obtained by the line-shape analysis of PL data at low temperatures are shown by open circles (Wagner et al., 1988), and those obtained by the low-energy cutoff method are shown by open squares (Wagner, 1984). Room-temperature BGN values obtained by Wagner et al. (1988) by line-shape analysis of PL spectra are shown by the points marked by plus signs.

the available experimental data. Del Alamo et al. have in fact given the values of the apparent BGN. We have applied room-temperature Fermi-Dirac correction to these values to obtain the real BGN values so that they can be compared directly with the theory and the PL data. The values obtained by the low-energy cutoff method are higher than those obtained by line-shape analysis. The theoretical curve of Berggren and Sernelius lies in between the two values. The values obtained by the line-shape analysis shows better agreement at lower concentrations, whereas the values obtained by the first method show better agreement at higher concentrations. As discussed earlier, in the line-shape analysis method we introduce an adjustable parameter E , to account for the line broadening. This method is equivalent to reducing the observed linewidth to remove the broadening, and then determining the BGN values. The low-energy cutoff point therefore moves towards the high energy when the broadening is removed, increasing the value of E,, and reducing the value of BGN. Among other causes, distortion of the density-of-state function can also cause the broadening, as discussed earlier. For correct evaluation of the BGN, we should remove broadening due to causes other than distortion of density of states. The procedure adopted removes broadening due to distortion of the density-ofstates function also, and an error is introduced on this account. It can be easily shown that the low-energy cutoff method will give values of BGN that are too high. Because of broadening, the low-energy cutoff point moves to lower energies, reducing Egdand increasing BGN. It is therefore expected that the correct theoretical curve will lie in between the two sets of data. Curve 2, representing the BGN values derived from the device data, is significantly lower than curve 1 except at the highest concentrations. 
The discrepancy must be due to the uncertainty in the mobility and the lifetime values of the minority carriers needed to interpret the device data.

7. BGN in Heavily Doped p-Type Si

Abram et al. (1984) have calculated BGN(jellium), i.e., neglecting impurity interactions, in p-type Si. Until recently, no calculations of BGN taking all interactions into account existed in p-type Si and Ge. Jain and Roulston have calculated the BGN in heavily doped p-type Si using Eq. (30) and compared the results (see Fig. 36) with the values obtained by using luminescence measurements of Dumke (1983a) and of Wagner et al. (1988). The experimental and theoretical values agree well, except for Wagner's values at the highest doping concentrations. While discussing the results for n-type Si shown in Fig. 35, we had mentioned that BGN values derived by line-shape analysis of luminescence spectra are usually too low. The error becomes large at very heavy doping concentrations because of the large distortion in the density-of-states curve.


FIG. 36. The values of BGN in p-Si calculated using Eq. (30) (solid line) are compared with the experimental luminescence values taken from the papers of Wagner and del Alamo (1988) (open circles) and of Dumke (1983b) (closed squares). [After Jain and Roulston (1991).]
Horizontal axis: dopant concentration (cm$^{-3}$).

V. SUMMARY OF IMPORTANT RESULTS

This review shows that, as a result of the work done during the last 10 years, our understanding of the science of BGN due to heavy doping in n-Si and n-Ge has become mature and all the observed features are now explained satisfactorily. A summary of important results is given below.

1. A quantum mechanical description of BGN is necessary to interpret experimental results.
2. The effect of many-body interactions shifts the two band edges by nearly equal amounts and in opposite directions in n-type Si and n-type Ge. These effects cannot be calculated correctly using Thomas-Fermi screening. Plasmon pole or random phase approximations must be used. The many-body shifts vary approximately as $N^{1/4}$.
3. The effect of impurity scattering on BGN is quite significant and cannot be neglected. It is not to be confused with the effects caused by fluctuations in the concentration of the impurity distribution. Wrong results are obtained if the impurity distribution is assumed to be periodic; a random distribution of impurities must be used in the theory. Also because of impurity scattering, the conduction band moves down and the valence band moves up by nearly equal amounts in n-type Si. The contribution of impurity scattering to BGN varies superlinearly with $N^{1/3}$.
4. The total BGN varies as to a good approximation.
5. At $1 \times 10^{20}$ cm$^{-3}$, the total BGN in Si is about 130 meV. About 70% of the value of the BGN arises from many-body interactions, and the remaining 30% from the carrier-impurity interaction.
6. The classical theory of band tails due to concentration fluctuations overestimates the tails considerably. The effect of band tails calculated quantum mechanically on the pn product and other properties is rather small.
7. All values of BGN obtained before 1984 by device measurements at doping concentrations close to $1 \times 10^{20}$ cm$^{-3}$ were too high. Use of Kane's theory of band tails was unfortunate, since this also gave values of the pn product that were too high, resulting in a fortuitous agreement between theory and experiment.
8. Intervalley scattering does not contribute significantly to the BGN.
9. The optical absorption measurements on germanium have yielded accurate values of BGN in germanium. In silicon, the values of BGN obtained by this method are too small. The discrepancy arises because the absorption due to free carriers and due to other causes near the absorption edge in heavily doped silicon cannot be estimated accurately.
10. The excitation or selective absorption measurements have proved very valuable in determining the correct BGN values obtained by absorption measurements and have provided a clue as to why the earlier conventional absorption measurements gave wrong values of BGN in Si.
11. The photoluminescence measurements are capable of giving accurate values of BGN. The error arises mainly because of the broadening of the emission line due to defects and strains in the crystal, due to finite slit width, due to collisions, and due to other unknown causes. If the BGN is determined by the low-energy cutoff point, its value is too high by 10 to 15 meV because of line broadening. On the other hand, line-shape analysis gives values that are too low by about the same amount. The theoretical curve given by the theory of Berggren and Sernelius (1981) lies in between the two values. This implies a satisfactory agreement between Berggren's theory and experiment.
12. The values of the BGN derived from the measurements of device characteristics are somewhat smaller than the values obtained from the luminescence measurements. The values of minority-carrier lifetime and mobility are technology-dependent and vary from sample to sample. Presumably this causes errors in the values of BGN determined by measuring device characteristics.
13. Measurements on transistors yielded more accurate values of BGN than those obtained by diode measurements. Great care has to be taken to separate the emitter and base currents in the diodes. It is also difficult to allow correctly for the recombinations in the emitter and in the base when diode structures are used. The excessively large values obtained at higher concentrations were all obtained by measurements on diode structures.
14. The temperature variation of BGN is rather small, 10 to 15 meV between liquid helium and room temperatures. Experimentally, the BGN values are lower at room temperature (RT), whereas theory seems to give the opposite result.
15. Relatively less work has been done on BGN in p-type Si, and practically no work has been done on BGN in p-type Ge except for a recent theoretical calculation.

ACKNOWLEDGMENTS

We are thankful to Mr. P. Van Mieghem and Mr. J. Poortmans for many useful discussions. S. C. J. is grateful to Professor Van Overstraeten and Professor Mertens for arranging his visit to IMEC in 1988, when a large part of this work was done.

APPENDIX

Effective Bohr radius
Electron density of states
Hole density of states
Parabolic cylinder function of x
Dielectric constant
Exchange energy
Conduction band edge
Correlation energy
Valence band edge
Fermi level measured from the majority band edge
Indirect bandgap of pure or intrinsic semiconductor
Reduced indirect bandgap of a doped semiconductor
Optical bandgap or distance between Fermi level and minority band edge; see Fig. 20
Phonon energy involved in the optical transition during absorption or luminescence process
Broadening parameter
Fermi function
Fermi factor in the conduction band
Fermi factor in the valence band
Inverse screening length
Effective mass of electron or hole
Effective mass of electron
Effective mass of hole
Effective density-of-states mass of electron
Heavy hole effective mass
Light hole mass
Free electron mass
Free carrier density
Dopant concentration
Acceptor concentration
Mott's critical density defined by Eq. (6)
Donor concentration
Local concentration of impurity
Number of conduction band minima
Average distance between impurity atoms
Many-body parameter defined by Eq. (7)
Critical temperature at which Boltzmann statistics changes to Fermi-Dirac statistics
Variational parameter
Absorption coefficient in a doped crystal due to band-to-band transitions. It consists of two parts: the phonon-assisted part and the free-carrier- or impurity-induced part
Absorption coefficient due to impurity- or free-carrier-induced band-to-band transitions in a doped crystal
Absorption coefficient in a doped crystal due to free carriers. Band-to-band transitions do not occur in this process
Observed absorption coefficient in a doped crystal
Absorption coefficient in an intrinsic crystal due to phonon-assisted band-to-band transitions
Absorption coefficient in a doped crystal due to phonon-assisted band-to-band transitions
Absorption coefficient in a doped crystal due to free-carrier- or impurity-induced band-to-band transitions
Downward shift of conduction band due to exchange interaction in n-type semiconductor
Downward shift of valence band due to correlation effect in n-type semiconductor
$E_g - E_{gd}$ = BGN, bandgap narrowing due to heavy doping
Apparent or effective bandgap narrowing due to heavy doping
Dielectric constant
Energy in expressions for band tails in Section III and frequency of the absorbed or emitted photon during absorption or luminescence in Section IV
$\rho$: Density of states
$\rho_K$: Density of states in band tails according to Kane's theory
$\rho_{HL}$: Density of states in band tails according to Halperin and Lax theory
$\rho_S$: Density of states in band tails according to Sayakanit theory
$\sigma$: Standard deviation

REFERENCES

Abdurakhmanov, K. P., Mirakhmedov, Sh., and Teshabaev, A. T. (1978). Sov. Phys. Semicond. 12, 457.
Abram, R. A., Rees, G. J., and Wilson, B. L. H. (1978). Advances in Physics 27, 799.
Abram, R. A., Childs, G. N., and Saunderson, P. A. (1984). J. Phys. C: Solid State Phys. 17, 6105.
Abramowitz, M., and Stegun, I. A. (1965). Handbook of Mathematical Functions, Appl. Math. Series 55. National Bureau of Standards, Washington, D.C.
Aigrain, P., and des Cloizeaux, J. (1955). Compt. Rend. 241, 859.
Balkanski, M., Aziza, A., and Amzallag, E. (1969). Phys. Status Solidi 31, 323.
Barber, H. D. (1967). Solid-St. Electronics 10, 1039.
Bennett, H. S. (1983). IEEE Trans. Electron Dev. ED-30, 920.
Bennett, H. S. (1985). Solid-St. Electronics 28, 193.
Bennett, H. S. (1986a). J. Appl. Phys. 59, 2837.
Bennett, H. S. (1986b). J. Appl. Phys. 60, 2866.
Bennett, H. S. (1987). Solid-St. Electronics 30, 1137.
Bennett, H. S., and Lowney, J. R. (1981). J. Appl. Phys. 52, 5633.
Bennett, H. S., and Lowney, J. R. (1987). J. Appl. Phys. 62, 521.
Benoit a la Guillaume, C., and Cernogora, J. (1969). Physica Status Solidi 35, 599.
Benoit a la Guillaume, C., and Voos, M. (1967). Physica Status Solidi 23, 295.
Bergersen, B., Rostworowski, J. A., Eswaran, M., Parsons, R. R., and Jena, P. (1976). Phys. Rev. B 14, 1633. Erratum: (1977). Phys. Rev. B 15, 2432.
Berggren, K. F., and Sernelius, B. E. (1981). Phys. Rev. B 24, 1971.
Berggren, K. F., and Sernelius, B. E. (1984). Phys. Rev. B 29, 5575.
Berggren, K. F., and Sernelius, B. E. (1985). Solid-St. Electronics 28, 11.
Bonch-Bruevich, V. L. (1966). The Electronic Theory of Heavily Doped Semiconductors. English Translation, American Elsevier Publishing Co. Inc. See also (1966). Physics of III-V Compounds (R. K. Willardson and A. C. Beer, eds.), Vol. 1, p. 101. Academic Press, New York.
Brockhouse, B. N., and Iyengar, P. K. (1957). Phys. Rev. 108, 894.
Brockhouse, B. N., and Iyengar, P. K. (1958). Phys. Rev. 111, 747.
Buhanan, D. (1969). IEEE Trans. Electron Dev. 16, 117.
Burstein, E. (1954). Phys. Rev. 93, 632.
Cardona, M., and Sommers, H. S., Jr. (1961). Phys. Rev. 122, 1382.
Casey, H. C., Jr., and Panish, M. B. (1978). Heterostructure Lasers. Academic Press, New York.
Cheeseman, I. C. (1952). Proc. Phys. Soc., Lond., A 65, 25.
de Graaf, H. C., Slotboom, J. W., and Schmitz, A. (1977). Solid-St. Electronics 20, 515.
Dean, P. J., Haynes, J. R., and Flood, W. F. (1967). Phys. Rev. 161, 711.
Debney, B. T. (1977). J. Phys. C 10, 4719.
Debye, P. P., and Conwell, E. M. (1954). Phys. Rev. 93, 693.
del Alamo, J., and Swanson, R. M. (1984). IEEE Trans. Electron Dev. ED-31, 123.
del Alamo, J., and Swanson, R. M. (1987a). IEEE Trans. Electron Dev. ED-34, 1580.


del Alamo, J., and Swanson, R. M. (1987b). Jpn. J. Appl. Phys. 26, 1860.
del Alamo, J., Swirhun, S., and Swanson, R. M. (1985a). IEEE IEDM Tech. Digest, 290.
del Alamo, J., Swirhun, S., and Swanson, R. M. (1985b). Solid-St. Electronics 28, 47.
Dhariwal, S. R., Ojha, V. N., and Srivastava, G. P. (1987). IEEE Trans. Electron Dev. ED-34, 1975.
Dumke, W. P. (1983a). Appl. Phys. Letters 42, 196.
Dumke, W. P. (1983b). J. Appl. Phys. 54, 3200.
Efros, A. L. (1974). Sov. Phys. Usp. 16, 789.
Enck, R. C., and Honig, A. (1969). Phys. Rev. 177, 1182.
Eymard, R., and Duraffourg, G. (1973). J. Phys. D: Appl. Phys. (Inst. Phys. London) 6, 66.
Feynman, R. P. (1955). Phys. Rev. 97, 660.
Feynman, R. P., and Hibbs, A. R. (1965). Quantum Mechanics and Path Integral. McGraw-Hill, New York.
Gaspard, J. P., and Cyrot Lackmann, F. (1973). J. Phys. C 6, 3077. (1974). Ibid. 7, 1829.
Gell-Mann, M., and Brueckner, K. A. (1957). Phys. Rev. 106, 364.
Ghanam, M., Mertens, R. P., Jain, S. C., Nijs, J., and van Overstraeten, R. J. (1988). ESSDERC 1988, Montpellier, France.
Ghazali, A., and Serre, J. (1982). Phys. Rev. Letters 48, 886.
Ghazali, A., and Serre, J. (1985). Solid-St. Electronics 28, 145.
Green, M. A. (1987). High Efficiency Silicon Solar Cells. Trans Tech Publications.
Haas (1955) p. 10
Hass (1960) Fig. 24 legend
Haas, C. (1962). Phys. Rev. 125, 1965.
Hall, L. H., Bardeen, J., and Blatt, F. J. (1954). Phys. Rev. 95, 559.
Halperin, B. I., and Lax, M. (1966). Phys. Rev. 148, 722.
Halperin, B. I., and Lax, M. (1967). Phys. Rev. 153, 802.
Hrostowski, H. J., Wheatley, G. H., and Flood, W. F. (1954). Phys. Rev. 95, 1683.
Hwang, C. J. (1970a). J. Appl. Phys. 41, 2668.
Hwang, C. J. (1970b). Phys. Rev. B 2, 4117.
Hwang, C. J. (1970c). Phys. Rev. B 2, 4126.
Inkson, J. C. (1976). J. Phys. C: Solid State Phys. 9, 1177.
Jain, S. C., and Murlidharan, R. (1981). Solid-St. Electronics 24, 1147.
Jain, S. C., and Roulston, D. J. (1991). Solid-St. Electronics 34, 453.
Jain, S. C., and Van Overstraeten, R. J. (1983). Solid-St. Electronics 26, 473.
Jain, S. C., and Van Overstraeten, R. J. (1984). J. Appl. Phys. 55, 604.
Jain, S. C., Mertens, R. P., Van Mieghem, P., Mauk, M. G., Ghanam, M., Borghs, G., and Van Overstraeten, R. J. (1988). Proc. IEEE Bipolar Circuits and Technology Meeting, p. 195.
Jain, S. C., McGregor, J. M., and Roulston, D. J. (1990). J. Appl. Phys. 68, 3747.
Kahn, A. H., and Lowney, J. R. (1982). J. Appl. Phys. 53, 454.
Kane, E. (1963). Phys. Rev. 131, 79.
Kane, E. (1985). Solid-St. Electronics 28, 3.
Kannam, P. J. (1973). IEEE Trans. Electron Dev. 20, 845.
Kauffman, W. L., and Bergh, A. A. (1968). IEEE Trans. Electron Dev. 15, 732.
Keyes, R. W. (1977). Comm. Solid State Phys. 7, 149.
Khan et al. (1982) p. 3
Klauder, R. (1961). Ann. Phys. 14, 43.
Kleppinger, D. D., and Lindholm, F. A. (1971). Solid-St. Electronics 14, 407.
Lanyon, H. P. D., and Tuft, R. A. (1979). IEEE Trans. Electron Dev. ED-26, 1014.
Lee, D. S., and Fossum, J. G. (1983). IEEE Trans. Electron Dev. ED-30, 626.
Lee, T. F., and McGill, T. C. (1975). J. Appl. Phys. 46, 373.


Lindholm, F. A., Neugroschel, A., Sah, C. T., Godlewski, M. P., and Brandhorst, H. W. (1977). IEEE Trans. Electron Dev. ED-24, 402.
Lloyd, P., and Best, P. R. (1975). J. Phys. C 8, 3752.
Lowney, J. R. (1985). Solid-St. Electronics 28, 187.
Lowney, J. R. (1986a). J. Appl. Phys. 59, 2048.
Lowney, J. R. (1986b). J. Appl. Phys. 60, 2854.
Lowney, J. R., and Bennett, H. S. (1982). J. Appl. Phys. 53, 433.
Lowney, J. R., and Bennett, H. S. (1983). J. Appl. Phys. 54, 1369.
Lowney, J. R., and Geist, J. C. (1984). J. Appl. Phys. 55, 3624.
Lowney, J. R., and Thurber, W. R. (1984). Electronics Letters 20, 142.
Lowney, J. R., Kahn, A. H., Blue, J. L., and Wilson, C. L. (1981). J. Appl. Phys. 52, 4075.
Macfarlane et al. (1954) p. 40
Mahan, G. D. (1980). J. Appl. Phys. 51, 2634.
Mahan, G. D., and Conley, J. W. (1967). Appl. Phys. Letters 11, 29.
Marshak, A. H. (1985). IEEE Electron Dev. Letters EDL-6, 128.
Marshak, A. H., and Van Vliet, C. M. (1984). Proc. I
Marshak, A. H., and Van Vliet, K. M. (1980). Solid-St. Electronics 23, 1223.
Marshak, A. H., Shibib, M. A., Fossum, J. G., and Lindholm, F. A. (1981). IEEE Trans. Electron Dev. ED-28, 293.
Matsubara, T., and Toyozawa, Y. (1961). Progr. Theor. Phys., Osaka 26, 739.
McLean, T. P. (1960). In Progress in Semiconductors, Vol. 5 (A. F. Gibson, F. A. Kroger, and R. E. Burgess, eds.). Heywood & Co., London. pp. 53-102.
Mertens, R. P., van Meerbergen, J. L., Nijs, J. F., and van Overstraeten, R. J. (1980). IEEE Trans. Electron Dev. ED-27, 949.
Mertens, R. P., Van Overstraeten, R. J., and de Man, H. J. (1981). Advances in Electronics and Electron Physics 55, 77.
Mock, M. S. (1973). Solid-St. Electronics 16, 1251.
Morgan, T. N. (1965). Phys. Rev. 139, A343.
Mott, N. F. (1974). Metal Insulator Transition. Taylor and Francis Ltd.
Neugroschel, A., Pao, S. C., and Lindholm, F. A. (1982). IEEE Trans. Electron Dev. ED-29, 894.
Neumark, (1972) p. 27
Neumark, G. F. (1977a). Phys. Rev. B 5, 408.
Neumark, G. F. (1977b). J. Appl. Phys. 48, 3618.
Pankove, J. I., and Aigrain, P. A. (1962). Phys. Rev. 126, 956.
Pantelides, S. T., Selloni, A., and Car, R. (1985). Solid-St. Electronics 28, 17.
Park, J. S., Neugroschel, A., and Lindholm, F. A. (1986). IEEE Trans. Electron Dev. ED-33, 1077.
Parmenter, R. H. (1955). Phys. Rev. 97, 587.
Parsons, R. R. (1978). Can. J. Phys. 56, 814.
Parsons, R. R. (1979). Solid State Commun. 29, 763.
Parsons, R. R., Rostworowski, J. A., and Bergersen, B. (1978). Proc. XIV Int. Conf. Phys. Sem., Edinburgh.
Pearson, G. L., and Bardeen, J. (1949). Phys. Rev. 75, 865.
Pearson, G. L., and Bardeen, J. (1956). Phys. Rev. 103, 51.
Possin, G. E., Adler, M. S., and Baliga, B. J. (1984). IEEE Trans. Electron Devices ED-31, 3.
Rauh, H., Jain, S. C., Mertens, R. P., and Van Overstraeten, R. J. (1990). Solid State Electronics 33, 205-215.
Rose, A. (1951). R.C.A. Rev. 12, 362.
Rosenbaum, T. F., Andres, K., Thomas, G. A., and Bhatt, R. N. (1980). Phys. Rev. Letters 45, 1723.
Rosenbaum, T. F., Milligan, R. F., Paalanen, M. A., Thomas, G. A., and Bhatt, R. N. (1983). Phys. Rev. B 27, 7509.


Samathiyakanit, V. (1974). J. Phys. C 7, 2849.
Saunderson, P. A. Ph.D. Thesis, University of Durham, 1983.
Sawaki, N., Yoshida, A., and Arizumi, T. (1974). J. Phys. Soc. Japan 36, 149.
Sayakanit, V. (1979). Phys. Rev. B 19, 2266.
Sayakanit, V., and Glyde, H. R. (1980). Phys. Rev. B 22, 6222.
Sayakanit et al. (1980) p. 4, p. 35
Sayakanit, V., Sritrakool, W., and Glyde, H. R. (1982). Phys. Rev. B 25, 2776.
Schechter, D. (1987). J. Appl. Phys. 61, 591.
Schechter, D. (1988). J. Appl. Phys. 63, 1250.
Schmid, P. E. (1981). Phys. Rev. B 23, 5531.
Schmid, P. E., Thewalt, M. L. W., and Dumke, W. P. (1981). Solid State Commun. 38, 1091.
Selloni, A., and Pantelides, S. T. (1982). Phys. Rev. Letters 49, 586.
Selloni and Pantelides (1985) p. 52
Sernelius, B. E. (1986). Phys. Rev. B 34, 5610.
Serre, J., and Ghazali, A. (1983). Phys. Rev. B 28, 4704.
Serre, J., Ghazali, A., and Hugon, P. L. (1981). Phys. Rev. B 23, 1971.
Slotboom, J. W. (1977). Solid-St. Electronics 20, 279.
Slotboom, J. W., and De Graaf, H. C. (1976). Solid-St. Electronics 19, 857.
Spitzer, W. G., and Fan, H. Y. (1957). Phys. Rev. 108, 268.
Stern, F. (1971). Phys. Rev. B 3, 3559.
Stern, F., and Talley, R. M. (1955). Phys. Rev. 100, 1638.
Sterne, P. A., and Inkson, J. C. (1981). J. Appl. Phys. 52, 6432.
Swirhun, S. E., Kane, D. E., and Swanson, R. M. (1988). Tech. Dig. IEDM 88, 298.
Tanenbaum, M., and Briggs, B. H. (1953). Phys. Rev. 91, 1501.
Tang, D. D. (1980). IEEE Trans. Electron Dev. ED-27, 563.
Tewary, V. K., and Jain, S. C. (1986). Advances in Electronics and Electron Physics, Vol. 67 (P. W. Hawkes, ed.). Academic Press, New York. pp. 329-414.
Thuselt, F., and Rosler, M. (1985a). Phys. Status Solidi B 130, 661.
Thuselt, F., and Rosler, M. (1985b). Phys. Status Solidi B 130, K139.
Totterdell, D. H. J., Leake, J. W., Jain, S. C., Mertens, R. P., and van Overstraeten, R. J. (1990). Solid-St. Electronics 33, 793.
Van Mieghem, P., Mertens, R. P., Borghs, G., and Van Overstraeten, R. J. (1990a). Phys. Rev. B 41, 5952.
Van Mieghem, P., Mertens, R. P., and Van Overstraeten, R. J. (1990b). J. Appl. Phys. 67, 4203.
Van Overstraeten, R. J., De Man, H. J., and Mertens, R. P. (1973). IEEE Trans. Electron Dev. ED-20, 29.
Vol'fson, A. A., and Subashiev, V. K. (1967). Sov. Phys. Semicond. 1, 327.
Wagner, J. (1984). Phys. Rev. B 29, 2002.
Wagner, J. (1985a). Solid-St. Electronics 28, 25.
Wagner, J. (1985b). Phys. Rev. B 32, 1323.
Wagner, J. (1987). Solid-St. Electronics 30, 1117.
Wagner (1988) pp. 49, 51, 52
Wagner, J., and del Alamo, J. A. (1988). J. Appl. Phys. 63, 425.
Wagner, J., Appel, W., and Warth, M. (1986). J. Appl. Phys. 59, 1305.
Wieder, A. W. (1980). IEEE Trans. Electron Dev. ED-27, 1402.
Wigner, E. P. (1934). Phys. Rev. 46, 1002.
Williams, F. (1968). Physica Status Solidi 25, 493.
Wilson, B. L. H. (1977). Solid-St. Electronics 20, 71.
Wolff, P. A. (1962). Phys. Rev. 126, 405.


ADVANCES IN ELECTRONICS AND ELECTRON PHYSICS, VOL. 82

The Rectangular Patch Microstrip Radiator - Solution by Singularity Adapted Moment Method

E. LEVINE*
ELTA Electronics Industries Ltd.
Ashdod, Israel

H. MATZNER AND S. SHTRIKMAN**
Department of Electronics, Weizmann Institute of Science
Rehovot, Israel

I. Introduction
II. The Spectral Domain Presentation
   A. Horizontal Currents
   B. Vertical Currents
III. The Moment-Method Formulation
   A. The Moment-Method Solution
   B. Singularity Adapted Basis Functions
   C. One-Dimensional Solution
   D. Numerical Results for the One-Dimensional Case
IV. Two-Dimensional Solution
   A. Current Density Modeling
   B. Numerical Results for the Two-Dimensional Case
V. Conclusion
Appendix A. Far Fields in Two Polarizations
Appendix B. Evaluation of Typical Matrix Elements
References

I. INTRODUCTION

Rectangular microstrip patches are the most widely used radiators in printed antenna arrays. In recent years several contributions have been made in the analysis of the basic microstrip radiator, either the rectangular patch, by Rana and Alexopoulos (1981), Newman and Tulyathan (1981), Bailey and Deshpande (1982), Deshpande and Bailey (1982), Mosig and Gardiol (1982), Pozar (1982), Lier and Jakobsen (1983), Pozar and Shaubert (1984), Pues and Van de Capelle (1984) and others, or the disk, by Chew and Kong (1981), Yano and Ishimaru (1981), Bailey and Deshpande (1985), Davidovitz and Lo (1986) and others, and quite a lot of useful data has been gathered for the benefit of the antenna designer. The numerical solutions based on moment-method techniques (see, for example, Bailey and Deshpande (1982), Mosig and Gardiol (1982), Pozar (1982), Chew and Kong (1981), Yano and Ishimaru (1981), Bailey and Deshpande (1985)) provide reliable results for these radiators, but they require a vast amount of computation. The main reason is that the current distribution on the patch has singularities at the feed point and at the edges, and thus a large number of basis functions are needed in the moment expansion of the current. A discussion of this problem and techniques for improving the convergence are presented by Pozar (1983).

* Also with the Weizmann Institute.
** Also with the Department of Physics, University of California, San Diego, La Jolla, California.

In several works (Bailey and Deshpande (1982), Richmond (1980), Liu et al. (1985), Postoyalko (1986), Kuester (1987), Blischke et al. (1988), Liu et al. (1988)), it has been shown that the inclusion of specific singular functions among the basis function set can greatly improve the convergence of the moment-method calculations. In the present work a detailed solution of the rectangular microstrip patch that incorporates such singularity functions is given.

The solution includes several stages. First, it describes the Fourier domain presentation of the currents and the fields in the microstrip radiator, taken from Perlmutter et al. (1985). The vertical current of the feed is then treated by an equivalence principle, which transforms vertical currents into equivalent horizontal surface currents, as shown in Pinhas and Shtrikman (1987). The third stage is a moment-method solution of a simplified rectangular radiator whose substrate is air ($\varepsilon_r = 1$) and whose feed is a vertical thin sheet of current mounted across the patch. The problem is solved first for the one-dimensional dependence of the current (Matzner et al. (1989)), in which the currents are only y-directed without any dependence on x, and then for the complete two-dimensional case (Levine et al. (1989)), where the current distribution is a combination of longitudinal and transversal components. The moment-method solution is based, as said before, on the inclusion of specific basis functions that accelerate the convergence rate of the computations. The choice of the basis functions follows a comprehensive treatment of the simpler case of a center-fed disk given by Pinhas et al. (1989). In the rectangular radiator discussed here, four types of basis functions are used in the expansion of the current on the radiator: (a) an attachment mode, which describes the singularity at the feed point; (b) an edge mode, which describes the singularity of the longitudinal current derivative at the patch edges; (c) a sum of harmonic functions; and (d) a vertical feed current, which is also included by the equivalence principle mentioned earlier. This solution has a fast convergence rate, and it enables an efficient computation of the rectangular patch currents and radiated fields, including the cross-polarized components. The efficiency of the computations is illustrated and emphasized by a large number of numerical results, which are presented graphically.

The purpose of this work is therefore to give the reader a full-scale treatment of the rectangular microstrip radiator, which is rigorous and accurate on one hand and contains practical aspects for the use of the antenna designer on the other hand. The antenna that is treated here is not exactly the commonly used microstrip patch built on a dielectric substrate and fed by a coaxial line or by a microstrip line. Nevertheless, the principal approach of using singularity adapted basis functions and a major part of the numerical results can be used either in theory or in practice. A generalization for any $\varepsilon_r$ and the improvement of the feed modeling are subjects for further studies in the future.

II. THE SPECTRAL DOMAIN PRESENTATION

A. Horizontal Currents

The approach for calculating the radiation from a microstrip configuration is based on the current that flows on the printed conductor. Once this current is known, the radiation from it can be calculated using standard potential methods. The presence of the dielectric substrate is taken into account in the Green's function of the microstrip structure. This function gives the electric field due to a unit current element on the surface of the dielectric material, and it contains the properties of the dielectric substrate. The Green's function of single or multilayer microstrip structures can be expressed analytically in the Fourier domain; thus, it is natural and easier to make the complete analysis in the Fourier domain. In Section II we give a detailed description of the spectral domain presentation of currents and fields in the microstrip radiator. Section II.A, which is based on the work of Perlmutter et al. (1985), deals with horizontal currents. Section II.B, which is based on the work of Pinhas and Shtrikman (1987), presents an equivalence principle for the replacement of vertical currents by an appropriate surface current. This replacement is used later on to model the vertical feed of the microstrip radiator.

Consider first the general case of a microstrip radiator shown in Fig. 1. The infinite ground plane is at the plane z = 0, and the metal patch is at the plane z = H. A horizontal current density $\mathbf{J}_s$, which exists on the patch, is excited by a vertical feed current density. The use of current densities rather than currents is more convenient and natural throughout the mathematical formulation. Sometimes, in general discussions, we use the term current instead of current density for convenience. The dielectric substrate has a dielectric constant of $\varepsilon_r$, and no dielectric or ohmic losses are taken into account here.

FIG. 1. Geometry of the general microstrip patch radiator.

In order to find the fields due to the horizontal current density we solve the wave equation for the electric field $\mathbf{E}$:

$$\nabla^2 \mathbf{E} + k^2 \mathbf{E} = 0, \tag{1}$$

where

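$k_0$ is the free space wave vector given by

$$k_0 = \omega\sqrt{\mu_0 \varepsilon_0} = 2\pi/\lambda_0, \tag{3}$$

$\omega$ is the angular frequency, and $\lambda_0$ is the free space wavelength. The $e^{j\omega t}$ time dependence is assumed throughout this work and $j = \sqrt{-1}$. The boundary conditions of the problem are defined by Eqs. (4a)-(4c):

$$\hat{z} \times \mathbf{E}_2 = 0 \quad \text{on } z = 0 \tag{4a}$$

$$\hat{z} \times (\mathbf{H}_1 - \mathbf{H}_2) = \mathbf{J}_s \quad \text{on } z = H \tag{4b}$$

$$\hat{z} \times (\mathbf{E}_1 - \mathbf{E}_2) = 0 \quad \text{on } z = H. \tag{4c}$$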


The subscript 1 denotes the region above the patch, while the subscript 2 denotes the region under the patch. The ground plane is infinite, so there are no fields under the ground plane. In addition,

$$\nabla \cdot \mathbf{E}_1 = 0 \quad \text{(region 1)} \tag{5a}$$

$$\nabla \cdot \mathbf{E}_2 = 0 \quad \text{(region 2).} \tag{5b}$$
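As a small numerical illustration of the wave numbers entering Eqs. (1) and (3), the following sketch evaluates $k_0$ and $\lambda_0$ for a given frequency. The 10 GHz frequency and the function names are hypothetical, and the relation $k = \sqrt{\varepsilon_r}\,k_0$ used for the lossless substrate is an assumption made for the example rather than a formula quoted from the text.

```python
import math

MU0 = 4.0e-7 * math.pi       # permeability of free space, H/m
EPS0 = 8.8541878128e-12      # permittivity of free space, F/m

def free_space_wavenumber(freq_hz):
    """k0 = omega * sqrt(mu0 * eps0), as in Eq. (3)."""
    return 2.0 * math.pi * freq_hz * math.sqrt(MU0 * EPS0)

def substrate_wavenumber(freq_hz, eps_r):
    """Assumed wave number inside a lossless substrate: sqrt(eps_r) * k0."""
    return math.sqrt(eps_r) * free_space_wavenumber(freq_hz)

freq = 10.0e9                          # hypothetical operating frequency, Hz
k0 = free_space_wavenumber(freq)
wavelength = 2.0 * math.pi / k0        # lambda_0 from Eq. (3)
print(f"k0 = {k0:.2f} rad/m, lambda0 = {wavelength * 1e3:.2f} mm")
```

For the air substrate treated in this work ($\varepsilon_r = 1$) the two wave numbers coincide.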

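The current density can be expressed in real space or in the Fourier domain; the connection between the two is defined by the Fourier transform pair of Eq. (6), and the same is done for the electric field $\mathbf{E}$ and the magnetic field $\mathbf{H}$. Inserting these expressions into Eqs. (1) and (4) makes it possible to solve for the relevant Fourier components of the electric field as a function of the corresponding Fourier component of the current. The basic connection between the electric field and the current density on the patch is given by Eq. (7):

$$\begin{pmatrix} \tilde{E}_x(k_x, k_y) \\ \tilde{E}_y(k_x, k_y) \end{pmatrix} = \begin{pmatrix} \tilde{G}_{xx}(k_x, k_y) & \tilde{G}_{xy}(k_x, k_y) \\ \tilde{G}_{yx}(k_x, k_y) & \tilde{G}_{yy}(k_x, k_y) \end{pmatrix} \begin{pmatrix} \tilde{J}_x(k_x, k_y) \\ \tilde{J}_y(k_x, k_y) \end{pmatrix}. \tag{7}$$

The tilde over the letter denotes the Fourier transform of the variable. The elements of the dyadic Green's function are given by:

$$\tilde{G}_{xx} = A\left[-(k_0^2 - k_x^2)\,\gamma_2 \tan(\gamma_2 H) + j\gamma_1(\varepsilon_r k_0^2 - k_x^2)\right] \tag{8a}$$

$$\tilde{G}_{xy} = A\left[-k_x k_y\left(j\gamma_1 - \gamma_2 \tan(\gamma_2 H)\right)\right] \tag{8b}$$

$$\tilde{G}_{yx} = A\left[-k_y k_x\left(j\gamma_1 - \gamma_2 \tan(\gamma_2 H)\right)\right] \tag{8c}$$

$$\tilde{G}_{yy} = A\left[-(k_0^2 - k_y^2)\,\gamma_2 \tan(\gamma_2 H) + j\gamma_1(\varepsilon_r k_0^2 - k_y^2)\right], \tag{8d}$$

where

$$\gamma_1 = \sqrt{k_0^2 - k_x^2 - k_y^2} \tag{9a}$$

$$\gamma_2 = \sqrt{\varepsilon_r k_0^2 - k_x^2 - k_y^2} \tag{9b}$$

and

$$A = \frac{\eta_0}{k_0\left(j\gamma_2 \cot(\gamma_2 H) - \gamma_1\right)\left(j\varepsilon_r \gamma_1 - \gamma_2 \tan(\gamma_2 H)\right)}. \tag{10}$$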

$\eta_0$ is the free space impedance ($120\pi$ Ohm). The electric field as a function of the spatial coordinates can be calculated by taking the inverse Fourier transform of $\tilde{\mathbf{E}}(k_x, k_y)$. However, this cannot be done analytically, so further results will be written in the Fourier domain. A compact and convenient form in which to write Eq. (7) is Eq. (11), where we define "orthogonal" and "parallel" wave numbers with respect to the current density by Eq. (12)

and

k, =

Jm.

Equation (1 1)is equivalent to Eq. (7) but it shows the nature of the electric field as composed of two terms. The first term in Eq. (1 1) is radiation by waves that hit the air-dielectric interface with an electric vector perpendicular to the plane of incidence. The second term is radiation by waves with a parallel electric vector. The existence of two field components can explain the differences between the E-plane and the H-plane radiation patterns of the microstrip radiator. The complex input power at the antenna terminals is

or in the Fourier domain,

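$$P_{in} = -\frac{1}{8\pi^2}\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}\tilde{\mathbf{E}}(k_x,k_y)\cdot\tilde{\mathbf{J}}^{*}(k_x,k_y)\,dk_x\,dk_y. \tag{15}$$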

This expression has real and imaginary parts. The contribution to the real part comes from the radiation into free space and from surface wave excitation. The imaginary part is due to stored energy around the antenna. Our main interest is in the radiation effects, but for the sake of completeness we also give the expressions for the first and dominant surface wave mode. In the case of air as a dielectric substrate ($\varepsilon_r = 1$) surface waves do not exist, but the expression for the surface waves can be used in further generalization of this solution. The contribution to free space radiation comes from integrating Eq. (15) over the range

$$k_x^2 + k_y^2 < k_0^2, \tag{16}$$

which is also called the visible range. The physical meaning of this range is that the two-dimensional current density generates plane waves with $k_x$ and $k_y$ components. These waves carry power only if the condition in Eq. (16) is satisfied, because in this case the wave has a real $k_z$ component. In other cases, the $k_z$ component is imaginary and the waves are evanescent. Transforming into spherical coordinates by

$$k_x = k_0\sin\theta\cos\phi, \tag{17a}$$

$$k_y = k_0\sin\theta\sin\phi, \tag{17b}$$

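The radiated power is written in the form of

$$P_R = \int_0^{2\pi}\!\!\int_0^{\pi/2} f(\theta,\phi)\,\sin\theta\,d\theta\,d\phi, \tag{18}$$

where the integrand $f(\theta,\phi)$, which is the power radiation pattern of the antenna, is given by

$$f(\theta,\phi) = \frac{15k_0^2}{\pi}\left[\frac{|\tilde{J}_x\sin\phi - \tilde{J}_y\cos\phi|^2\cos^2\theta}{\cos^2\theta + (\varepsilon_r - \sin^2\theta)\cot^2\!\left(Hk_0\sqrt{\varepsilon_r - \sin^2\theta}\right)} + \frac{|\tilde{J}_x\cos\phi + \tilde{J}_y\sin\phi|^2\cos^2\theta\,(\varepsilon_r - \sin^2\theta)}{(\varepsilon_r - \sin^2\theta) + \varepsilon_r^2\cos^2\theta\cot^2\!\left(Hk_0\sqrt{\varepsilon_r - \sin^2\theta}\right)}\right]. \tag{19}$$

$\tilde{J}_x$ and $\tilde{J}_y$ denote the Fourier transforms of the x and y components of the vector $\mathbf{J}_s$. It may be desired in some cases to have the amplitude radiation pattern rather than the power pattern. In such cases one may use a far-zone expansion for the electric field in Eq. (11) and get an expression for the radiated fields in $(x, y, z)$ or in $(r, \theta, \phi)$ coordinates, as shown in Appendix A. In any case, the principal cuts in which we are interested throughout this work are defined by the angle $\phi$: the E-plane is when $\phi = \pi/2$ and the H-plane is when $\phi = 0$. Surface waves are the result of further integration in the region

$$k_x^2 + k_y^2 > k_0^2, \tag{20}$$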

where the integrand has singularities, i.e., the denominators in Eq. (15) or in Eq. (19) are zero. The power that goes into surface waves is the contribution of these poles to the integral in Eq. (15). The dominant surface wave mode that has no cutoff frequency is the first TM mode. We assume that the thickness H is small enough such that we are below the cutoff frequencies of other modes. The wave numbers of the poles are found by solving the equation

$$j\varepsilon_r\gamma_1\cot(\gamma_2 H) - \gamma_2 = 0. \tag{21}$$

The solutions of Eq. (21) are denoted by $k_{tp}$, where

Also denote

$$x_p = k_{tp}/k_0, \tag{23}$$

so that Eq. (21) becomes

$$\varepsilon_r\sqrt{x_p^2 - 1} - \sqrt{\varepsilon_r - x_p^2}\,\tan\!\left(k_0 H\sqrt{\varepsilon_r - x_p^2}\right) = 0, \tag{24}$$

and $x_p$ is a function of $\varepsilon_r$ and $k_0H$ and is found numerically. Transforming from integration variables $(k_x, k_y)$ to $(k_t, \phi)$ and calculating the contribution of the first pole to the integral, the power that goes into surface waves becomes

where and

7J.i

-

1

x;

-

E,

1

B. Vertical Currents

The currents treated so far were horizontal, but vertical currents should be considered as well. To do so, we use an equivalence principle suggested by Pinhas and Shtrikman (1987), in which vertical currents are replaced by horizontal currents. Other theoretical studies of vertical currents inside a dielectric layer can be found in Chew and Kong (1981), Yano and Ishimaru (1981), Chi and Alexopoulos (1984), Aberle and Pozar (1988) and Vanderbosch and Van de Capelle (1989). Inside the dielectric layer shown in Fig. 1 there is a vertical current density in the z direction that is independent of z but has a general dependence on x and y:

$$\mathbf{J}_v = \hat{z}\,J_v(x, y). \tag{28}$$

We represent the current density $\mathbf{J}_v$ as well as the electric and magnetic fields by the Fourier decompositions

$$\mathbf{H}(x,y,z) = \frac{1}{4\pi^2}\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}\tilde{\mathbf{H}}(k_x,k_y,z)\,e^{-j(k_x x + k_y y)}\,dk_x\,dk_y. \tag{31}$$

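The equations for the electric field in regions 1 and 2 (above and under the patch, respectively) become

$$\left(\frac{\partial^2}{\partial z^2} + k_0^2 - k_t^2\right)\tilde{\mathbf{E}}_1(k_x,k_y,z) = 0 \quad \text{(region 1)} \tag{32}$$

$$\left(\frac{\partial^2}{\partial z^2} + \varepsilon_r k_0^2 - k_t^2\right)\tilde{\mathbf{E}}_2(k_x,k_y,z) = j\omega\mu\,\tilde{\mathbf{J}}_v(k_x,k_y) \quad \text{(region 2).} \tag{33}$$

The general solution of Eqs. (32-33) is

$$\tilde{\mathbf{E}}_1(k_x,k_y,z) = \tilde{\mathbf{A}}_1(k_x,k_y)\,e^{-j\gamma_1 z} \quad \text{(region 1)}$$

$$\tilde{\mathbf{E}}_2(k_x,k_y,z) = \tilde{\mathbf{A}}_2(k_x,k_y)\,e^{-j\gamma_2 z} + \tilde{\mathbf{B}}_2(k_x,k_y)\,e^{j\gamma_2 z} \tag{34}$$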

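Applying the boundary conditions

$$\hat{z}\times(\mathbf{E}_1 - \mathbf{E}_2) = 0 \quad \text{on } z = H \tag{36a}$$

$$\hat{z}\times(\mathbf{H}_1 - \mathbf{H}_2) = 0 \quad \text{on } z = H \tag{36b}$$

$$\hat{z}\times\mathbf{E}_2 = 0 \quad \text{on } z = 0 \tag{36c}$$

together with

$$\nabla\cdot\mathbf{E}_1 = 0 \quad \text{in region 1} \tag{36d}$$

$$\nabla\cdot\mathbf{E}_2 = 0 \quad \text{in region 2} \tag{36e}$$

gives the solution of $\tilde{\mathbf{A}}_1$, $\tilde{\mathbf{A}}_2$ and $\tilde{\mathbf{B}}_2$ in terms of the current distribution: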

$$\tilde{\mathbf{B}}_{2t} = -\tilde{\mathbf{A}}_{2t}, \qquad \tilde{B}_{2z} = \tilde{A}_{2z}.$$

The subscript t denotes the component of the vectors in the x-y plane. It can be seen now that if we replace the vertical current density in the Fourier domain by a surface current according to

the coefficients $\tilde{\mathbf{A}}_1$, $\tilde{\mathbf{A}}_2$ and $\tilde{\mathbf{B}}_2$ remain the same. In other words, one can replace the vertical current density

by a surface current density on the dielectric layer’s upper surface,

which satisfies

The only change in Eqs. (36) is that Eq. (36b) changes to

$$\hat{z}\times(\mathbf{H}_1 - \mathbf{H}_2) = \mathbf{J}_s. \tag{42}$$

In conclusion, we have shown in Section II how horizontal and vertical currents in the microstrip configuration are represented in the Fourier domain. Once the currents are known, the radiation patterns and the directivity are easily derived. The objective of Section III is to show how the currents are calculated by the method of moments.
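As a numerical aside on the surface-wave pole discussed in this section, the following Python sketch solves the normalized pole equation, $\varepsilon_r\sqrt{x^2-1} - \sqrt{\varepsilon_r - x^2}\tan(k_0H\sqrt{\varepsilon_r - x^2}) = 0$, by bisection, and includes a helper for the visible-range condition $k_x^2 + k_y^2 < k_0^2$. It is only an illustration and not the authors' program: the function names, the tolerance and the sample values $\varepsilon_r = 2.2$ and $k_0H = 0.3$ are assumptions, and the bracket $1 < x < \sqrt{\varepsilon_r}$ is used only while $k_0H\sqrt{\varepsilon_r - 1} < \pi/2$, so that the tangent has no pole inside it.

```python
import math


def tm0_residual(x, eps_r, k0H):
    """Left-hand side of the pole equation for x = k_t / k0 (assumed form)."""
    s = math.sqrt(max(eps_r - x * x, 0.0))
    return eps_r * math.sqrt(max(x * x - 1.0, 0.0)) - s * math.tan(k0H * s)


def solve_tm0_pole(eps_r, k0H, tol=1e-12):
    """Bisection for the first TM pole on 1 < x < sqrt(eps_r).

    Hypothetical helper: valid only for thin layers, where the residual
    is negative at x = 1 and positive at x = sqrt(eps_r).
    """
    if eps_r <= 1.0:
        raise ValueError("no surface-wave pole for eps_r <= 1 (air substrate)")
    if k0H * math.sqrt(eps_r - 1.0) >= math.pi / 2:
        raise ValueError("layer too thick for this simple bracket")
    lo, hi = 1.0, math.sqrt(eps_r)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tm0_residual(mid, eps_r, k0H) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def in_visible_range(kx, ky, k0):
    """True when the plane wave (kx, ky) has a real z wave number."""
    return kx * kx + ky * ky < k0 * k0


# Sample values (assumptions, not taken from the text).
x_p = solve_tm0_pole(eps_r=2.2, k0H=0.3)
print("normalized pole x_p =", x_p)
```

The residual is monotonic in x on this bracket, so bisection finds the single root; for a thin layer the root lies just above 1, i.e. the surface wave travels only slightly slower than a free-space wave.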

III. THE MOMENT-METHOD FORMULATION

A. The Moment-Method Solution

The discussion presented so far treated the general structure of a microstrip radiator built on any dielectric substrate. At this point we restrict the analysis to the simpler case of the rectangular radiator with air as a dielectric layer, as shown in Fig. 2. The length of the microstrip radiator is L, the width is W, the height above the ground plane is H and the vertical feed is a thin wall at distance F from the center. This wall carries a constant current in the z-direction (1 Ampere). The ground plane is assumed to be infinite. This structure has been chosen because it allows us to demonstrate the moment-method solution with a moderate complexity of the integrals involved. However, a generalization to any $\varepsilon_r$ and to a narrower vertical wall can be made without major changes in the solution procedure.

FIG. 2. Rectangular radiator in air, fed by a thin vertical wall at distance F from the center.

The moment-method solution of the currents on the patch begins with the requirement that the tangential electric field on the patch be zero:

$$\mathbf{E}(x, y) = 0. \tag{43}$$

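However, this requirement can be replaced by the more convenient condition of

$$\int_{-L/2}^{L/2}\!\!\int_{-W/2}^{W/2}\mathbf{J}_i(x,y)\cdot\mathbf{E}(x,y)\,dx\,dy = 0, \qquad i = 1, 2, \ldots, M, \tag{44}$$

where the current density $\mathbf{J}(x, y)$ is expanded into M components, each of which is denoted by $\mathbf{J}_i(x, y)$. The integration is made here in real space over the patch area. By transforming the electric field and then the current density into the Fourier domain one gets

$$\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}\!\int_{-L/2}^{L/2}\!\!\int_{-W/2}^{W/2}\tilde{\mathbf{E}}(k_x,k_y)\,e^{-j(k_x x + k_y y)}\cdot\mathbf{J}_i(x,y)\,dx\,dy\,dk_x\,dk_y = 0, \tag{45}$$

and then

$$\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}\tilde{\mathbf{J}}_i(-k_x,-k_y)\cdot\tilde{\mathbf{E}}(k_x,k_y)\,dk_x\,dk_y = 0. \tag{46}$$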

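Now the electric field is also expanded into M + 1 terms, where each term is $\tilde{\mathbf{E}}_j(k_x, k_y)$. The additional (M + 1)th term comes from the feed contribution:

$$\sum_{j=1}^{M+1}\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}\tilde{\mathbf{J}}_i(-k_x,-k_y)\cdot\tilde{\mathbf{E}}_j(k_x,k_y)\,dk_x\,dk_y = 0. \tag{47}$$

We denote the electric field component $\tilde{\mathbf{E}}_j(k_x, k_y)$, which is related to each current density $\tilde{\mathbf{J}}_j$, by $\tilde{\mathbf{E}}(\tilde{\mathbf{J}}_j)$, and write

$$\sum_{j=1}^{M+1}\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}\tilde{\mathbf{J}}_i(-k_x,-k_y)\cdot\tilde{\mathbf{E}}(\tilde{\mathbf{J}}_j)\,dk_x\,dk_y = 0. \tag{48}$$

This is a matrix equation that can be written in the form of

$$A_{ij}\,b_j = B_i, \tag{49}$$

where

and

$$B_i = -\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}\tilde{\mathbf{E}}(\tilde{\mathbf{J}}_{v+f})\cdot\tilde{\mathbf{J}}_i^{*}(-k_x,-k_y)\,dk_x\,dk_y. \tag{51}$$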

The * denotes the complex conjugate, and $\tilde{\mathbf{E}}(\tilde{\mathbf{J}}_{v+f})$ means the electric field caused by the vertical feed and the horizontal attachment mode, as will be explained soon. The coefficients to be solved for in the matrix equation are $b_j$. Each one of these coefficients requires a two-dimensional integration over an infinite range, so it is very important to choose the basis functions carefully in order to reduce the amount of numerical computation. In conclusion, the moment condition for the i = 1, ..., M current density components is

Once the coefficient of each component is derived from Eq. (52), the electric field associated with this component is given by

$$\tilde{\mathbf{E}}_i(k_x, k_y) = \tilde{\mathbf{G}}(k_x, k_y)\cdot\tilde{\mathbf{J}}_i(k_x, k_y). \tag{53}$$

We select, therefore, test functions that are identical to the expansion functions. This choice (known as Galerkin's method) is not only conceptually simple but also gives convenient mathematical expressions.
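The idea of testing with the expansion functions themselves can be illustrated on a much simpler problem than the microstrip integrals. The sketch below is a toy example under stated assumptions, not the authors' formulation: it solves the one-dimensional integral equation $u(x) + \int_0^1 x\,t\,u(t)\,dt = 4x/3$ on $[0, 1]$ (whose exact solution is $u = x$) with the polynomial basis $1, x, x^2$. The operator, the basis, the Simpson quadrature and all function names are chosen here purely for illustration; only the structure (matrix $A_{ij}$ from test function i and the image of basis function j, right-hand side $B_i$ from the excitation) mirrors Eq. (49).

```python
def simpson(f, a, b, n=200):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0


def solve_linear(A, B):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(B)
    M = [row[:] + [rhs] for row, rhs in zip(A, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


# Expansion functions; in Galerkin's method they are also the test functions.
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]


def apply_operator(phi):
    """Image of phi under the toy operator u -> u(x) + x * int_0^1 t u(t) dt."""
    moment = simpson(lambda t: t * phi(t), 0.0, 1.0)
    return lambda x: phi(x) + x * moment


def excitation(x):
    """Right-hand side chosen so that the exact solution is u(x) = x."""
    return 4.0 * x / 3.0


images = [apply_operator(phi) for phi in basis]
A = [[simpson(lambda x: test(x) * img(x), 0.0, 1.0) for img in images]
     for test in basis]
B = [simpson(lambda x: test(x) * excitation(x), 0.0, 1.0) for test in basis]
b = solve_linear(A, B)
print("expansion coefficients:", b)
```

Because the exact solution lies in the span of the chosen basis, the solved coefficients should be close to (0, 1, 0); in the microstrip problem the same structure applies, but each matrix element is a two-dimensional spectral integral.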

B. Singularity Adapted Basis Functions

It has been shown by several authors, such as Richmond (1980), Bailey and Deshpande (1982), Liu et al. (1985), Postoyalko (1986), Kuester (1987), Blischke et al. (1988), Liu et al. (1988), and Pinhas et al. (1989), that the inclusion of specific basis functions that take care of singularity points can greatly improve the convergence rate of the solution, meaning that considerably fewer coefficients are needed for an accurate description of the current. The current density on the rectangular patch is decomposed here into four types of functions:

$$\mathbf{J} = \mathbf{J}_h + \mathbf{J}_e + \mathbf{J}_f + \mathbf{J}_v. \tag{54}$$

$\mathbf{J}_h$ is a series expansion into harmonics of sine and cosine terms, which are the natural modes of an ideal rectangular resonator. $\mathbf{J}_e$ is a special contribution due to the singularity at the edges of the patch. This contribution is included because the charge density on the patch diverges near the edge. In the electrostatic limit, the form of this singularity may be concluded from canonical solutions of a rectangular capacitor (Kuester, 1987). We use here simple approximations of square roots that are found to be most efficient (Bailey and Deshpande, 1982; Richmond, 1980). Two terms with symmetric and antisymmetric nature are used, although it will be shown that in the vicinity of the first resonance only the symmetric term is required. $\mathbf{J}_f$ is a contribution due to the divergent nature of the current density at the feed area. This contribution provides for continuity of currents so that no charge will accumulate there. We use here a linear function with a jump of 1 at the feed point, in accordance with the vertical feed whose current is also 1. This contribution is sometimes referred to as the attachment mode. $\mathbf{J}_v$ is the effective current density which is derived from the vertical feed current by the equivalence principle described in Section II. The electric field can also be divided into four types:

$$\mathbf{E} = \mathbf{E}_h + \mathbf{E}_e + \mathbf{E}_f + \mathbf{E}_v. \tag{55}$$

The matrix elements to be computed are therefore divided into three types as shown in Eqs. (56a, b, c), where h, e, v, and f denote the current types.

The terms that are connected to the excitation of the patch are calculated as shown in Eqs. (56d, e), where Eq. (56d) indicates mixed terms of the attachment mode and the vertical current and Eq. (56e) indicates pure vertical feed contribution.

One can see that the number of double integrals to be computed grows rapidly with the number of harmonics. For one end term and N harmonics, one has to calculate $N^2$ integrals of type (56a) plus 2N integrals of types (56b) and (56c).

C. One-Dimensional Solution

In order to gain some insight into the basis function choice and the obtained results, we begin first with a one-dimensional current modeling. The specific basis functions will be given in this section while the numerical results will be shown in the next section. The current density in the one-dimensional model has the form of

As explained, this current density has four contributions:

where each contribution is given by

$$J_h(y) = \sum_{n=1,3,\ldots}^{N-1} a_n\cos\!\left(\frac{n\pi y}{L}\right) + \sum_{n=2,4,\ldots}^{N} a_n\sin\!\left(\frac{n\pi y}{L}\right) \tag{59}$$

$$J_f(y) = \begin{cases} -y/L - 1/2, & -L/2 < y < F\\ -y/L + 1/2, & F < y < L/2 \end{cases} \tag{60}$$

$$\mathbf{J}_v(y) = \hat{z}\,\delta(y - F). \tag{62}$$

The coefficients are $a_n$ (harmonic expansion), b (symmetrical end term) and c (antisymmetrical end term). The vertical feed and the attachment mode are known. Transformation of Eqs. (59-62) into the Fourier domain is done by the one-dimensional transformation

which gives the following expressions:

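$$\tilde{J}_f(k_x, k_y) = \left[-\frac{2j}{k_y^2 L}\sin\!\left(\frac{k_y L}{2}\right) + \frac{j}{k_y}e^{jk_yF}\right]\mathrm{sinc}\!\left(\frac{k_x W}{2}\right) \tag{66}$$

where sinc(x) = sin(x)/x and

$$\gamma(\varepsilon_r = 1) = \begin{cases}\sqrt{k_0^2 - k_t^2} & \text{if } 0 < k_t < k_0\\ -j\sqrt{k_t^2 - k_0^2} & \text{if } k_t > k_0.\end{cases} \tag{68}$$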
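A transform of this kind is easy to check numerically. The sketch below is a hedged illustration rather than part of the original analysis: it assumes the attachment mode is the piecewise-linear function with a unit jump at y = F described in Section III.B ($-y/L - 1/2$ below the feed and $-y/L + 1/2$ above it), and it assumes the transform convention $\int J(y)\,e^{+jk_yy}\,dy$. Under those assumptions the y-dependent factor of the transform is $-\frac{2j}{k_y^2L}\sin(k_yL/2) + \frac{j}{k_y}e^{jk_yF}$, which the code compares with a midpoint-rule quadrature. Function names and the sample dimensions are hypothetical.

```python
import cmath
import math


def attachment_mode(y, L, F):
    """Piecewise-linear attachment mode with a unit jump at the feed y = F."""
    if y < -L / 2 or y > L / 2:
        return 0.0
    return -y / L - 0.5 if y < F else -y / L + 0.5


def transform_numeric(ky, L, F, n=20000):
    """Midpoint rule for the integral of J_f(y) * exp(+j*ky*y) over the patch.

    The interval is split at the feed so the jump never falls inside a cell.
    """
    total = 0j
    for a, b in ((-L / 2, F), (F, L / 2)):
        h = (b - a) / n
        for i in range(n):
            y = a + (i + 0.5) * h
            total += attachment_mode(y, L, F) * cmath.exp(1j * ky * y) * h
    return total


def transform_closed_form(ky, L, F):
    """Closed form under the assumed exp(+j*ky*y) convention."""
    return (-2j * math.sin(ky * L / 2) / (ky ** 2 * L)
            + 1j * cmath.exp(1j * ky * F) / ky)


# Sample values (assumptions): 50 mm patch, feed 20 mm from the center.
L_patch, F_feed, ky_sample = 0.05, 0.02, 40.0
numeric = transform_numeric(ky_sample, L_patch, F_feed)
closed = transform_closed_form(ky_sample, L_patch, F_feed)
print(abs(numeric - closed))
```

With the opposite sign convention in the exponent the closed form would be the complex conjugate, so the comparison also serves as a check on which convention is in use.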

Again, we see that the unknown coefficients to be determined are $a_n$, b and c. The computation of the two-dimensional integrals shown in Eqs. (56) requires enormous CPU time. Some analytical techniques can be used to handle the singularity points and accelerate the convergence. Two examples of such procedures are given in Appendix B. The electric field is given in Eq. (11). For the case of $\varepsilon_r = 1$, Eq. (11) reduces to

E Lko [

k : ( j k:)k: Y(jcOt(yH) - I )

y ( j - kj')k/ + jcot(yH) -I

(70)

and the radiated power in Eqs. (18-19) reduces to

$$P_R = \frac{15k_0^2}{\pi}\int_0^{2\pi}\!\!\int_0^{\pi/2}|\tilde{J}|^2\sin^2(Hk_0\cos\theta)\,[\cos^2\phi + \sin^2\phi\cos^2\theta]\,\sin\theta\,d\theta\,d\phi. \tag{71}$$
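The angular dependence of the air-substrate pattern can be explored with a few lines of Python. The sketch below assumes that the integrand of the radiated power is $|\tilde{J}|^2\sin^2(Hk_0\cos\theta)\,[\cos^2\phi + \sin^2\phi\cos^2\theta]$ for a y-directed current, and it deliberately leaves out the current spectrum $|\tilde{J}|^2$, so it shows only the factor contributed by the ground plane and the polarization. The function names and the sample value $k_0H = 2\pi \times 0.15$ (the thick radiator) are assumptions made for this illustration.

```python
import math


def angular_factor(theta, phi, k0H):
    """Ground-plane and polarization factor of the assumed integrand.

    Excludes the current spectrum |J|^2, which depends on the basis functions.
    """
    ground = math.sin(k0H * math.cos(theta)) ** 2
    polarization = math.cos(phi) ** 2 + (math.sin(phi) * math.cos(theta)) ** 2
    return ground * polarization


def cut_in_db(phi, k0H, thetas_deg):
    """Pattern cut normalized to broadside (theta = 0), in dB."""
    reference = angular_factor(0.0, phi, k0H)
    levels = []
    for t in thetas_deg:
        value = angular_factor(math.radians(t), phi, k0H)
        levels.append(10.0 * math.log10(value / reference) if value > 0.0
                      else float("-inf"))
    return levels


k0H_thick = 2.0 * math.pi * 0.15            # H / lambda_r = 0.15 (assumed sample)
angles = [0, 15, 30, 45, 60, 75]
e_plane = cut_in_db(math.pi / 2, k0H_thick, angles)   # phi = pi/2
h_plane = cut_in_db(0.0, k0H_thick, angles)           # phi = 0
for t, e, h in zip(angles, e_plane, h_plane):
    print(t, round(e, 2), round(h, 2))
```

In this simplified factor the E-plane cut differs from the H-plane cut only by $\cos^2\theta$; the actual patterns also depend on the current spectrum and on the vertical feed.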


Other properties of interest are derived as follows. The input power is related to the impedance by

$$P_{in} = I_0^2\,Z_{in}, \tag{72}$$

and since we took

$$I_0 = 1, \tag{73}$$

we get

$$Z_{in} = P_{in}. \tag{74}$$

At resonance, the imaginary part of $Z_{in}$ vanishes, and the real part represents the radiation resistance of the antenna

$$R_r = \mathrm{Re}(Z_{in}). \tag{75}$$

The bandwidth of the antenna can be found in two equivalent ways. First, the quality factor Q is given by the amplitude of the current density ($|\tilde{J}|$) according to

The “natural” bandwidth is given by 1/Q, and the bandwidth for any specific VSWR value is given by (Levine et al., 1988):

$$BW_{VSWR} = \frac{VSWR - 1}{\sqrt{VSWR}}\,\frac{1}{Q}. \tag{77}$$

For example, for VSWR = 2 and $I_0 = 1$, the bandwidth is

Secondly, the bandwidth can be calculated directly from the $Z_{in}(\omega)$ curve. We assume that the radiator is matched at resonance to an impedance of $R_r$ and find the reflection coefficient at any other frequency. The bandwidth is then found between the desired reflection coefficients, for instance, $|\Gamma| = 1/3$ for VSWR = 2.
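The two routes to the bandwidth can be made concrete with a short calculation. The Python sketch below evaluates the relation between bandwidth, Q and VSWR, $BW = (VSWR - 1)/(Q\sqrt{VSWR})$, and the standard relation $|\Gamma| = (VSWR - 1)/(VSWR + 1)$ between the VSWR and the reflection coefficient magnitude. The function names are chosen here for illustration, and the sample Q values of 3 and 10 are the ones quoted for the thick and thin radiators in the next section.

```python
import math


def bandwidth(q_factor, vswr):
    """Fractional bandwidth for a given quality factor and VSWR limit."""
    return (vswr - 1.0) / (q_factor * math.sqrt(vswr))


def reflection_magnitude(vswr):
    """Magnitude of the reflection coefficient that corresponds to a VSWR."""
    return (vswr - 1.0) / (vswr + 1.0)


for q_factor in (3.0, 10.0):
    percent = 100.0 * bandwidth(q_factor, vswr=2.0)
    print(f"Q = {q_factor:4.1f}: bandwidth for VSWR 2 = {percent:.1f} %")

print("reflection magnitude for VSWR 2 =", reflection_magnitude(2.0))
```

For VSWR = 2 the relation reduces to about 0.71/Q, which gives roughly 24% for Q = 3 and 7% for Q = 10.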

D. Numerical Results for the One-Dimensional Case

The inclusion of the current singularities results in a quickly convergent solution; thus it requires much less computing time than moment methods that ignore the feed and the end current singularities. Our calculations were done on a VAX 11/780 mini-computer with an FPS-5000 array processor. For one frequency point the analysis with six current terms (feed, two end functions and three harmonics) takes about 150 seconds CPU, or about 3000 seconds without the array processor. The same analysis with 30 current terms (harmonics without singularities) takes about four hours CPU, or about 140 hours without the array processor. A typical graph of $Z_{in}(\omega)$ needs about 20 frequency points; therefore a moment-method calculation of $Z_{in}(\omega)$ with singularities takes about one hour, but without singularity functions it may take 80 hours.

1. Current Density Profiles

Two current density profiles for typical end-fed radiators are presented in Figs. 3-4. In the first case (Fig. 3) the patch dimensions (see Fig. 2) are L = 50 mm, W = 72 mm and H = 22 mm. The resonant wavelength is calculated to be $\lambda_r$ = 144 mm; thus at resonance, $W/\lambda_r$ = 0.50 and $H/\lambda_r$ = 0.15. In the second case (Fig. 4), the patch dimensions are L = 50 mm, W = 31 mm and H = 8 mm. The resonant wavelength is $\lambda_r$ = 124 mm, and thus $W/\lambda_r$ = 0.25 and $H/\lambda_r$ = 0.06. These two radiators will be referred to as typical thick and thin radiators respectively. The quality factors of the two radiators, according to Eq. (76), are 3 and 10, which means that their bandwidths for VSWR = 2 are (Eqs. (77-78)) 25% and 7% respectively. One can see that the dominant contributions are the symmetric end function (whose coefficient is denoted in Eqs. (61,65) as b), the feed attachment mode (Eqs. (60,66)) and the first cos harmonic (n = 1 in Eqs. (59,64)). The other harmonics and the antisymmetric end function are not important. It is also interesting to observe that in the thick patch (Fig. 3) the relative end contribution is larger than in the case of the thin patch (Fig. 4).

FIG. 3. One-dimensional solution of the surface current density on an end-fed thick radiator ($H/\lambda_r$ = 0.15, $W/\lambda_r$ = 0.50), where $\lambda_r$ is the wavelength at the first resonance. The dominant current contributions are the symmetric end, the feed and two harmonics.

FIG. 4. One-dimensional solution of the surface current density on an end-fed thin radiator ($H/\lambda_r$ = 0.06, $W/\lambda_r$ = 0.25), where $\lambda_r$ is the wavelength at the first resonance.


2. Input Impedance

The input impedances for the two cases are shown in Figs. 5-8 in the following order. Figure 5 shows the input impedance of the thick radiator in a broad range of frequencies while Fig. 6 shows it in the vicinity of the first resonance. Figure 7 shows the input impedance of the thin radiator in a broad range of frequencies while Fig. 8 shows it in the vicinity of the first resonance.

FIG. 6. Results of Fig. 5 near the first resonance.

Note that the frequency range is presented by $k_0L$ or equivalently by $(L/c)\,\omega$, where c is the speed of light. At low frequencies the input impedance takes the nature of a capacitor in the form of $Z_{in} = (j\omega C)^{-1}$, where C is the electrostatic capacitance, as expected. Also, the bandwidths obtained from Figs. 6 and 8 agree well with those estimated from the Q factors.

3. Convergence Rate

The excellent convergence rate of the solution is demonstrated in Figs. 9-10. Figure 9 shows the real part of the input impedance for a typical patch as a function of 1/N, where N is the number of harmonics used. The patch dimensions are L = 50 mm, W = 50 mm and H = 5 mm and the calculation is done near resonance. Figure 10 shows the imaginary part of the impedance for the same patch. Three cases are examined in these graphs: (a) a calculation without the attachment mode and without the end current; (b) a calculation with the attachment mode but without the end mode; and (c) a calculation with the two modes. It can be seen that the inclusion of the two singularity-based functions enables us to use only one harmonic (1/N = 1), while without these singularities even 10 harmonics are far from being enough.

FIG. 10. Imaginary part of the input impedance, calculated as in Fig. 9.

4. Radiation Resistance, Bandwidth and Resonant Length

The effectiveness of this moment-method analysis in covering a large number of parameters is demonstrated also by a graphical design procedure of the main antenna properties. The design procedure starts with the desired bandwidth, which is shown in Fig. 11. The bandwidth for any other value of VSWR can be found from Eq. (77). For a desired bandwidth and at a given resonant wavelength ($\lambda_r$), one chooses from Fig. 11 the normalized width ($W/\lambda_r$) and height ($H/\lambda_r$) of the patch. Then, the radiation resistance of the patch, at resonance, is found from Fig. 12. A comparison of the results in Figs. 11-12 with those reported by Perlmutter et al. (1985), where only the first harmonic had been used, reveals a good agreement. This shows that the fine details of the surface current do not much influence these properties.

FIG. 11. Bandwidth (VSWR 2:1) of a rectangular radiator. $\lambda_r$ is the wavelength at resonance.

FIG. 12. Radiation resistance of an end-fed rectangular radiator. $\lambda_r$ is the wavelength at resonance.

FIG. 13. Radiation resistance of a square radiator as a function of the feed position. F/L = 0 is the center of the radiator.

Figure 13 presents the effect of inserting the feed point into the patch, for a square patch (W/L = 1) with three different heights. In case a lower input impedance is needed, these results could be used to change the feed location along the y-axis. The last step is to find the length of the patch using Fig. 14. Here the normalized length $L/\lambda_r$ is presented as a function of the chosen $H/\lambda_r$ and $W/\lambda_r$. Figure 14 has been prepared in the following way: for given values of H, W and L, the input impedance was calculated and the wavelength $\lambda_r$ at which $\mathrm{Im}(Z_{in}) = 0$ was found. With a sufficient accumulation of calculated points, the lines of constant width are

FIG. 14. Resonant length of an end-fed rectangular radiator (one-dimensional solution).

TABLE I

CALCULATED RESULTS FOR THE RESONANT LENGTH OF A RECTANGULAR MICROSTRIP RADIATOR AS A FUNCTION OF THE THICKNESS OF THE SUBSTRATE ($H/\lambda_r$) AND THE PATCH WIDTH ($W/\lambda_r$). A COMPARISON IS MADE BETWEEN THE RESULTS ACCORDING TO SENGUPTA (1983), GARG AND LONG (1987), MARTIN (1988) AND THIS WORK.

0.04 0.04 0.04 0.04 0.04 0.08 0.08 0.08 0.08 0.08 0.16 0.16 0.16 0.16 0.16

0.2 0.3 0.4 0.5 0.8 0.2 0.3 0.4 0.5 0.8 0.2 0.3 0.4 0.5 0.8

Sengupta

Garg

Martin

Ours

0.439 0.433 0.429 0.426 0.421 0.422 0.409 0.401 0.395 0.383 0.412 0.393 0.378 0.367 0.344

0.455 0.451 0.448 0.446 0.442 0.419 0.414 0.409 0.405 0.397 0.353 0.345 0.338 0.332 0.318

0.433 0.431 0.428 0.425 0.418 0.407 0.406 0.405 0.404

0.437 0.432 0.428 0.423 0.420 0.399 0.388 0.383 0.378 0.374 0.365 0.350 0.339 0.331 0.320

0.400 0.380 0.380 0.379 0.379 0.377

drawn. In the limit of a very thin patch, the length approaches half a wavelength as expected. Unlike the bandwidth and the radiation resistance, the resonant length of the patch is influenced by the details of the surface current, including the vertical feed. A set of calculated results for the resonant length, compared with results for the same cases according to Sengupta (1983), Garg and Long (1987) and Martin (1988), is listed in Table I. It should be noted that the results of these references are best fitted and compared with experiments for dielectric constants between 2 and 3, although the formulae are given for any dielectric constant. As can be seen, our results agree well with those calculated according to Sengupta and Garg and Long for thin substrates, while there is a close agreement with Martin for thick substrates. The differences in the results are probably due to the simplification of the currents by the one-dimensional model and the inclusion of the vertical feed.

5. Radiation Patterns

Additional numerical results given in Figs. 15-16 show the radiation patterns of the two patches (dimensions are given in Figs. 3-4). It is interesting to note that the end-fire radiation, which is the outcome of the vertical feed alone, depends strongly on the patch dimensions. In Fig. 15 the ratio between the end-fire radiation of the two patches is 13.5 dB (-9 dB and -22.5 dB). This ratio agrees well with the estimated value of $(H_1/H_2)^2\,(W_1/W_2)^2$ = 14 dB. In Fig. 16 the ratio between the two patches is 9 dB (-16 dB and -25 dB) while the estimated value is $(H_1/H_2)^2$ = 8 dB.

FIG. 16. H-plane radiation pattern of an end-fed rectangular radiator (one-dimensional solution). Two cases of thin and thick radiators are shown.
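The estimated end-fire ratios quoted above follow from simple arithmetic on the normalized dimensions of the two radiators. The Python sketch below reproduces them; it assumes that the subscripts 1 and 2 refer to the thick and thin radiators and that the rounded values $H/\lambda_r$ = 0.15, $W/\lambda_r$ = 0.50 and $H/\lambda_r$ = 0.06, $W/\lambda_r$ = 0.25 are the ones inserted into the estimates. The variable and function names are illustrative only.

```python
import math


def to_db(power_ratio):
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(power_ratio)


# Normalized dimensions of the thick (1) and thin (2) radiators.
h_thick, w_thick = 0.15, 0.50
h_thin, w_thin = 0.06, 0.25

# Estimate for Fig. 15: (H1/H2)^2 * (W1/W2)^2.
e_plane_estimate = to_db((h_thick / h_thin) ** 2 * (w_thick / w_thin) ** 2)

# Estimate for Fig. 16: (H1/H2)^2 only.
h_plane_estimate = to_db((h_thick / h_thin) ** 2)

print(round(e_plane_estimate, 2), round(h_plane_estimate, 2))
```

The two estimates come out close to 14 dB and 8 dB, which can be compared with the 13.5 dB and 9 dB read from the computed patterns.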

IV. TWO-DIMENSIONAL SOLUTION

A. Current Density Modeling

We now focus our attention on a more general case of a two-dimensional current density on the patch with the form of

$$\mathbf{J}_s = J_x(x,y)\,\hat{x} + J_y(x,y)\,\hat{y}. \tag{79}$$

Again, as in the one-dimensional case, physical considerations and a variety of numerical simulations show that an accurate description of the current density on the rectangular patch can be achieved by very few functions, given in Eqs. (80-87) as follows: y-directed currents

L

I

L

x-directed currents Jh(x,y) = c, sin

(w)27tx

x

z-directed currents F)rect

($)

12x1‘ -


The Fourier transforms of these functions are not given here since there is not much difference in comparison with the one-dimensional case. Note that the transversal dependence of the y-directed current is transformed into a Bessel function (Butler, 1982) and that the x-directed currents are antisymmetric. The vertical current density is transformed into two horizontal current densities:

jeJkyF

I

J,(k,, kJx =

-

J,(k,,k,),

=

Y

(")

k, sinc -

jejkyF ("W) k,sinc - . Y

(87)

As can be seen, only the first term in the harmonic expansion is used, and just six coefficients ($a_1$, $a_2$, $b_1$, $b_2$, $c_1$, $c_2$) are to be determined. In case of an off-resonance analysis, an additional antisymmetric term should be included as well. The electric fields are given in Eqs. (88-89) in terms of the x- and y-directed current densities:

y-directed Electric Fields

x-directed Electric Fields

The main advantage of the two-dimensional model is the ability to calculate the far fields in the two polarizations, as will be demonstrated in the next section.

B. Numerical Results for the Two-Dimensional Case

We show in this section a set of calculated results obtained from the two-dimensional model. These calculated results are given in three groups. First, we show in Figs. 17-24 some examples of current density profiles for different patches. Second, we give in Figs. 25-30 a generalized graphical description of the six main current coefficients as they appear in Eqs. (80-84). The third group, shown in Figs. 31-40, is a set of radiation patterns for several cases.

1. Current Density Profiles

The current density profiles are shown for three components: the longitudinal component ($J_y(y)$), its transversal cut ($J_y(x)$) and the transversal current density ($J_x(x)$). Figs. 17-18 show the longitudinal current density for the cases of the thick and the thin patches (dimensions are given in Figs. 3-4) respectively. The five dominant contributions are cos·1 and cos·J(x) taken from Eq. (80), end·1 and end·J(x) taken from Eq. (82) and the feed taken from Eq. (81). It is interesting to compare these figures to the one-dimensional solutions that are given in Figs. 3-4 respectively. One sees that although the general distribution remains much the same, there are some differences in the relative amplitudes of the current density functions. Figures 19-20 show the transversal behaviour of the main longitudinal contributions for the same two cases of the thick and the thin radiator. The singular nature of the current densities at x = W/2 and at x = -W/2 is well observed in these two figures. This nature of $J_y(x)$ is known in the literature; for example, see Butler (1982) or Shih et al. (1988). The transversal behaviour of the total longitudinal current density $J_y(x)$ for a square radiator (W/L = 1) is given in Fig. 21. Three values of the thickness are shown: a thin patch with H/L = 0.08, a thick patch with H/L = 0.2 and a very thick patch with H/L = 0.4. Note that the relative effect of the edges is larger for the thicker substrates.

FIG. 17. Two-dimensional solution of the longitudinal current density $J_y(y)$ on an end-fed thick radiator (H/L = 0.44 and W/L = 1.44 or $H/\lambda_r$ = 0.15, $W/\lambda_r$ = 0.50 and $L/\lambda_r$ = 0.347). The main current contributions and the total current are shown.

FIG. 18. Two-dimensional solution of the longitudinal current density $J_y(y)$ on an end-fed thin radiator (H/L = 0.16 and W/L = 0.62 or $H/\lambda_r$ = 0.06, $W/\lambda_r$ = 0.25 and $L/\lambda_r$ = 0.403). The main current contributions and the total current are shown.

FIG. 19. A transversal profile of the longitudinal current contribution $J_y(x)$ on the thick radiator.

FIG. 20. A transversal profile of the longitudinal current contribution $J_y(x)$ on the thin radiator.

Now we look into the transversal current density contribution $J_x(x)$, which plays a major role in the cross-polarized radiation. Three square (W/L = 1) patches are considered: thin (H/L = 0.04), thick (H/L = 0.2) and very thick (H/L = 0.4). The transversal current densities for the three cases are shown in Figs. 22-24. The main contributions are the sin functions as described in Eq. (83) and the end function as described in Eq. (84). Other contributions are much lower in amplitude and thus are not shown here. One interesting observation is that the sin and the end contributions have opposite signs; thus the total current density is lower than the end contribution. Another observation is the fact that the transversal currents increase with the thickness of the substrate.

FIG. 22. The transversal current density contribution $J_x(x)$ on a thin square radiator (W/L = 1 and H/L = 0.04 or $H/\lambda_r$ = 0.018, $W/\lambda_r$ = 0.46 and $L/\lambda_r$ = 0.46).

FIG. 23. The transversal current density contribution $J_x(x)$ on a thick square radiator (W/L = 1 and H/L = 0.2 or $H/\lambda_r$ = 0.077, $W/\lambda_r$ = 0.386 and $L/\lambda_r$ = 0.386).

FIG. 24. The transversal current density contribution $J_x(x)$ on a very thick square radiator (W/L = 1 and H/L = 0.4 or $H/\lambda_r$ = 0.14, $W/\lambda_r$ = 0.351 and $L/\lambda_r$ = 0.351).

FIG. 25. The $a_1$ coefficient (Eq. 80) of the current density contribution as a function of the geometry.

THE RECTANGULAR PATCH MICROSTRIP RADIATOR-SOLUTION

309

2. Current Coeficients General results describing the six main current coefficients that appear in Eqs. (80-84) are given in Figs. 25-30. The results are plotted as a function of the width and the height, normalized to the resonant wavelength. They can serve as a synthesis tool for the rectangular patch. For a given patch one finds


FIG. 28. The $b_2$ coefficient (Eq. 82) of the current density contribution as a function of the geometry. Horizontal axis: $H/\lambda_r$.

FIG. 29. The $c_1$ coefficient (Eq. 83) of the current density contribution as a function of the geometry. Horizontal axis: $H/\lambda_r$.

from them the relevant current coefficients and uses Eqs. (80-84) to get a complete description of the currents on the patch. The coefficients $a_2$, $c_1$ and $c_2$ are small compared with $a_1$, $b_1$ and $b_2$ and thus can sometimes be omitted. One can see that the basic harmonic expansion $a_1$ is the dominant term for thin patches while the basic end singularity $b_1$ is the dominant term for thick


FIG. 30. The $c_2$ coefficient (Eq. 84) of the current density contribution as a function of the geometry. Horizontal axis: $H/\lambda_r$.

patches. Also, narrow patches (with high quality factor) have higher current densities as expected.

3. Radiation Patterns

Integral properties like the radiation resistance, the bandwidth, or the gain are not sensitive to the details of the current distribution. Hence, the one-dimensional model seems to be sufficient for the prediction of these properties. However, the radiation patterns depend to a large extent on the details of the current on the patch. One of the important outcomes of the two-dimensional analysis is, therefore, the ability to calculate radiation patterns in two polarizations. Figures 31-40 show ten examples of radiation patterns for the typical end-fed radiators that have been considered throughout this work. Definitions and explanations of the two polarizations ($E_\theta$ and $E_\phi$) in the two main cuts (E-plane and H-plane) are given in Appendix A. Figures 31-32 show the radiation patterns of the thick rectangular radiator with H/L = 0.44, W/L = 1.44 (see Figs. 17, 19) in the E-plane and in the H-plane, respectively. Figures 33-34 show the radiation patterns of the thin rectangular radiator with H/L = 0.06, W/L = 0.62 (see Figs. 18, 20) in the E-plane and in the H-plane, respectively. Several observations can be made at this point. First, we notice that there is no cross-polarization in the E-plane cut. Second, the E-plane cuts are quite similar to those obtained from the one-dimensional results shown in Fig. 15. Nevertheless, the H-plane patterns differ from the one-dimensional solutions (Fig. 16) in several aspects. Of special


FIG. 32. H-plane radiation pattern of the thick rectangular radiator. The two polarizations are shown.

FIG. 34. H-plane radiation pattern of the thin rectangular radiator. The two polarizations are shown.


FIG. 35. E-plane radiation pattern of a thin square radiator (H/L = 0.04, W/L = 1) calculated by the two-dimensional model. $E_\phi$ in the pattern is zero.

FIG. 36. H-plane radiation pattern of the thin square radiator. The two polarizations are shown.

interest is the behaviour of the patterns near end-fire, where the cross-polarized radiation is higher than the co-polarized radiation. The third observation is the slight asymmetry seen in Figs. 31 and 33 due to the vertical feed current on the edge of the radiator. Additional radiation patterns of interest for the square patches are given in Figs. 35-40. These results may be practical for the design of dual-polarized arrays where the cross-polarization between the two orthogonal


FIG. 37. E-plane radiation pattern of a thick square radiator (H/L = 0.2, W/L = 1) calculated by the two-dimensional model. $E_\phi$ in the pattern is zero.

FIG. 38. H-plane radiation pattern of the thick square radiator. The two polarizations are shown.

channels should be reduced as much as possible. The E-plane and the H-plane cuts of a thin square whose $J_x(x)$ current density profile is shown in Fig. 22 are given in Figs. 35-36. The cuts for the thick square radiator (as in Fig. 23) and for the very thick radiator (as in Fig. 24) are shown in Figs. 37-38 and Figs. 39-40, respectively. The end-fire radiation in both cuts is very sensitive to the thickness of the substrates. For example, in the H-plane it changes from -35 dB for the thin


patch to -21 dB for the thick patch and up to -14 dB for the very thick patch. The relative level of the end-fire radiation is proportional to the square of H. The set of radiation patterns concludes the numerical results obtained by the two-dimensional analysis. It shows the computational versatility as well as the practical aspect of this singularity-adapted moment-method solution.
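The quoted $H^2$ scaling can be checked roughly against the three H-plane end-fire levels given above. The sketch below is only an illustration: it assumes that the thin patch (-35 dB) serves as the reference, that a power level proportional to $H^2$ changes by $20\log_{10}$ of the thickness ratio, and that the thickness values are those listed in the captions of Figs. 22-24. It ignores every other geometry-dependent factor, and the function name is hypothetical.

```python
import math

# Reported H-plane end-fire levels (dB) for the thin, thick and very thick squares.
reported_db = [-35.0, -21.0, -14.0]

# Two possible normalizations of the thickness (values taken from the text and captions).
h_over_l = [0.04, 0.2, 0.4]
h_over_lambda = [0.018, 0.077, 0.14]

def predicted_levels(thickness, reference_db):
    """Levels implied by power ~ H^2, anchored at the first (thin) case."""
    h0 = thickness[0]
    return [reference_db + 20.0 * math.log10(h / h0) for h in thickness]

pred_l = predicted_levels(h_over_l, reported_db[0])
pred_lambda = predicted_levels(h_over_lambda, reported_db[0])

for rep, p1, p2 in zip(reported_db, pred_l, pred_lambda):
    print(f"reported {rep:6.1f} dB | H/L scaling {p1:6.1f} dB | H/lambda scaling {p2:6.1f} dB")
```

Under these assumptions both normalizations follow the reported trend to within a few dB, which is about as much as a pure $H^2$ rule can be expected to give.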

FIG. 39. E-plane radiation pattern of a very thick square radiator (H/L = 0.4, W/L = 1) calculated by the two-dimensional model. $E_\phi$ in the pattern is zero.

FIG. 40. H-plane radiation pattern of the very thick square radiator. The two polarizations are shown.


V. CONCLUSION

An efficient moment-method analysis of the microstrip radiator, based on the inclusion of singularity-adapted basis functions, is presented in detail. The singularities introduced here are a feed attachment mode and an edge singularity. The inclusion of such functions enables one to obtain a quickly convergent solution in terms of the number of required basis functions. It is shown that only seven current terms are needed for a complete characterization of the radiator, so that considerable CPU time is saved in comparison with standard moment-method solutions. The current coefficients are computed by a modest mini-computer using analytical techniques that accelerate the required numerical integrations.

Numerical results are given first for the one-dimensional case in which only y-directed currents with a constant profile in the transversal cut are considered. The results include current density profiles, input impedances and a set of graphical design curves for bandwidth, radiation resistance and resonant length. Then, in Section IV, the two-dimensional case of both x- and y-directed currents is discussed. The significant difference between the one-dimensional and the two-dimensional models is the ability to predict the cross-polarized radiation patterns as illustrated in Section IV.

Further steps towards a more general treatment will include (a) generalization to any dielectric constant and (b) modeling of a narrow vertical or horizontal feed and a rigorous calculation of the feed current distribution. However, it is expected that the practical results shown here for radiation resistance, bandwidth, resonant length and cross-polarization will not be changed much for substrates with dielectric constant close to one (like foams and honeycomb structures) even with such generalized analysis.

APPENDIX A. FAR FIELDS IN TWO POLARIZATIONS

The electric field in the far zone is written as

where the potential components are


+ k i p + kiQ)] . -k,* 1R - P’

+ J,k,(k;Q

(93)

and where

The k components are

$$k_x = k_0 \sin\theta \cos\phi \tag{95a}$$

$$k_y = k_0 \sin\theta \sin\phi \tag{95b}$$

$$k_z = k_0 \cos\theta. \tag{95c}$$

The three components of the electric field are expressed in terms of the surface current components in the following notation:

+ Jy - k,k,] cos 0 - kg) + J, - k,k,] cos 6 + Jik,k,] cos26

Ex = [sin2(yH) + jcos(yH) sin(yH)] [J, (k;

+ E, = [sin2(yH) + jcos(yH)sin(yH)] - [J;k,k,

Ey = [sin2(yH) j cos(yH)sin(yH)] [J;(ky’

- k;)

(96) (97) (98)

Now, transforming into spherical coordinates gives

$$E_\theta = E_x \cos\theta \cos\phi + E_y \cos\theta \sin\phi - E_z \sin\theta \tag{99}$$

$$E_\phi = -E_x \sin\phi + E_y \cos\phi \tag{100}$$

$$E_r = 0. \tag{101}$$

The two principal cuts and the two orthogonal contributions in each cut are defined as

E plane ($\phi = \pi/2$):

$$\text{co-polar } E_\theta = E_y \cos\theta - E_z \sin\theta, \qquad \text{cross-polar } E_\phi = -E_x \tag{102}$$

H plane ($\phi = 0$):

$$\text{co-polar } E_\theta = E_x \cos\theta - E_z \sin\theta, \qquad \text{cross-polar } E_\phi = E_y \tag{103}$$

Approximations for the end-fire radiation ($\theta = \pi/2$) are derived as follows. The radiation pattern (compare to Eq. (19)) in the case of $\varepsilon_r = 1$ is given by

$$f(\theta,\phi) = \left|-\tilde{J}_x \sin\phi + \tilde{J}_y \cos\phi\right|^2 \sin^2(H k_0 \cos\theta) + \left|\tilde{J}_x \cos\phi + \tilde{J}_y \sin\phi\right|^2 \cos^2\theta \, \sin^2(H k_0 \cos\theta), \tag{104}$$

where

-

jeikyF

JY =-

Thus,

E plane (4 =

Y2

k, sinc

(7).

5)

H plane ($\phi = 0$):

$$f(\theta,\phi) \propto \mathrm{sinc}^2(k_x W/2)\, H^2$$

APPENDIX B. EVALUATION OF TYPICAL MATRIX ELEMENTS

The matrix elements, which are to be calculated in order to find the current coefficients, converge very slowly, especially those that include edge functions. In this appendix we give two examples of analytical procedures that improve the convergence rate of such integrals. The main idea of this technique is the following: given a two-dimensional integral with an infinite range of integration, we first transform the variables into two other variables, one of which has a finite range. For example,

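$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(k_x, k_y)\, dk_x\, dk_y = \int_0^{2\pi}\int_0^{\infty} f(\theta, k_r)\, k_r\, dk_r\, d\theta \tag{109}$$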

Secondly, the problematic integration of $f(k_r)$ can be avoided by analytical solution. In some cases the integral may be divided into two terms. One term is an asymptotic expression $f_{\mathrm{asym}}(k_r)$, which is evaluated analytically. The second term is $[f(k_r) - f_{\mathrm{asym}}(k_r)]$, which has a fast convergence rate and can be computed numerically.

$$\int_0^{\infty} f(k_r)\, dk_r = \int_0^{\infty} \left[f(k_r) - f_{\mathrm{asym}}(k_r)\right] dk_r + \int_0^{\infty} f_{\mathrm{asym}}(k_r)\, dk_r \tag{110}$$
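The splitting in Eq. (110) can be illustrated on a toy integrand. The sketch below is not one of the matrix elements of this appendix: the integrand tanh(k)/(1 + k²), its asymptote 1/(1 + k²) (whose integral over the half line is π/2), the truncation points and all function names are assumptions chosen only to show why subtracting the asymptote lets the numerical part be truncated early.

```python
import math

def simpson(g, a, b, n):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def f(k):
    """Illustrative integrand with a slowly decaying 1/k^2 tail."""
    return math.tanh(k) / (1.0 + k * k)

def f_asym(k):
    """Large-k asymptote of f, integrable in closed form."""
    return 1.0 / (1.0 + k * k)

ASYM_INTEGRAL = math.pi / 2.0  # integral of f_asym over [0, infinity)

def direct(K, n=4000):
    """Brute-force truncation of the original integral at k = K."""
    return simpson(f, 0.0, K, n)

def split(K, n=4000):
    """Eq. (110): numerical integral of the fast-decaying difference plus the analytic part."""
    return simpson(lambda k: f(k) - f_asym(k), 0.0, K, n) + ASYM_INTEGRAL

reference = split(40.0, 16000)
print("truncation error, direct at K=10:", reference - direct(10.0))
print("truncation error, split  at K=10:", reference - split(10.0))
```

In this toy case the difference term decays exponentially, so the split form is already converged at a truncation point where the direct integral is still missing most of its 1/k² tail.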

The first example is the cosine harmonic functions described in Eq. (64) by the odd values of n. These integrals are part of the expression in Eq. (56a), which


can be denoted as E(h)J*(h). After transforming the integral into (k,,B) coordinates, one has to calculate the following integral: sinc2(k,W/2)

dk, do,

where n and m are the odd mode numbers. The integration over $\theta$ can be done easily; however, the integration over $k_r$ is difficult. We have found analytical values for this integral as follows: we divide the integration into the two cases $\theta \neq \pi/2$ and $\theta = \pi/2$. In the case of $\theta \neq \pi/2$ the integral (111) has the form of

$$\int_0^{\infty} \frac{\sin^2(\alpha k_r)\cos^2(\beta k_r)}{(\gamma^2 - \delta^2 k_r^2)(\varepsilon^2 - \delta^2 k_r^2)}\, dk_r, \tag{112}$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ and $\varepsilon$ are parameters. Using the identity

$$\sin^2(\alpha k_r)\cos^2(\beta k_r) = \tfrac{1}{4}\left\{\sin^2[(\alpha+\beta)k_r] + \sin^2[(\alpha-\beta)k_r] + \cos(2\beta k_r) - \cos(2\alpha k_r)\right\}, \tag{113}$$

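the following results (Gradshteyn and Ryzhik, 1980, 3.728.5)

$$\int_0^{\infty} \frac{\cos(ax)\, dx}{(b^2 - x^2)(c^2 - x^2)} = \begin{cases} \dfrac{\pi\,[b \sin(ac) - c \sin(ab)]}{2bc\,(b^2 - c^2)} & \text{if } b \neq c \\[2ex] \dfrac{\pi}{4c^3}\,[\sin(ac) - ac\cos(ac)] & \text{if } b = c \end{cases} \tag{114}$$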

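and (Gradshteyn and Ryzhik, 3.825.3)

$$\int_0^{\infty} \frac{\sin^2(ax)\, dx}{(b^2 - x^2)(c^2 - x^2)} = \begin{cases} \dfrac{\pi\,[c \sin(2ab) - b \sin(2ac)]}{4bc\,(b^2 - c^2)} & \text{if } b \neq c \\[2ex] \dfrac{\pi}{8c^3}\,[2ac\cos(2ac) - \sin(2ac)] & \text{if } b = c \end{cases} \tag{115}$$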

the whole integral at 0 # n/2 is found. For the case of 0 = 7112 the integral (1 1 1) has the form of k; cos2(krL/2)dk, -1 “ k; cos(k,L) dk, (a2n2- k;L2)(n2m2 - k;L2) (n2n2- k;L2)(n2m2- k;L2)’

jOm

jo

(116) Again, n, m are odd numbers. Now we use the result (Gradshteyn and Ryzhik, 3.728.7) n[csin@) - b sin(ab)] 2(b2 - c’) (b2 - x2)(c2- x’) -n -[abcos(ab) - sin(ab)] 4b

fm x’cos(ax)dx . , Jo

ifb#c (117) .

if b = c

I


and finally

J:

x 2 cos(ax)dx ( b 2 - x2)(c2- x 2 )

ifb#c

0

if b = c.

The remaining part of the integral is thus found. The second example is the symmetric end function described by the first term of Eq. (65). The integral is a part of Eq. (56a), which can be denoted as E(e)J*(e). This integral is written as

I

=

Jy:1

J : ( k , L / 2 ) sinc2(k,W/2) dk, d0,

(1 19)

where $k_x = k_r \cos\theta$, $k_y = k_r \sin\theta$. In the limit of $\theta = \pi/2$, the integral diverges. Thus, a numerical integration is most difficult, and the technique described in Eq. (110) is applied through the following stages. We use the expression (Gradshteyn and Ryzhik, 6.673.2)

[Ji(ax)]’sin(Px)dx

=

5 I:

--

Jo(ax)Jb(ax)sin(Px)dx

Now,

+ =-

;1;

Jo(ax)J;(ax)cos(~x)dx

f J;J:(ax)cos(Px)dx


denote

~ o m J o ( a x ) . J 2 ( a x ) c o s ( ~ x= ) dI,. x

(123)

We get the following relations:

B

a

= --I2

I , = -1, 2a

P

+ - aI ,

a 2P

--13

2P

In addition we find that

1 4

=-I,

1 + -I4 + -21I , 4

(!

= 4 - E2a2 )Io

where

1 + 41,

- 21,,

lom low

1, =

J;(ax)cos(px)dx.

Now, the required integral I can be written in the form of I =

J?(kj’L/2)

sin ( k W/2) (k,G/2)2 dk?

or 1

I=

1 I = 2W2 cos2 0

-

sin2(k,W cos 0/2) dkr k,2

m

1

1, co

( 129)

J:(k,L sin 912) k,2 dk, m

1 L sin 812)cos(k, W cos 0) -dk,. k,Z

(1 30)


Using Gradshteyn and Ryzhik (6.574), the first term in Eq. (130)is

ar(2)r(+) - 4a 4r($)r($)r($) 3n '

(131)

Thus.

jI

'---p.( - 1)"

J~(a.x)cos(/?x)dx =

a

(01)2

,,2[j 1 B

-

11

if 0 < P c2 a

1

0

if

\

1

P- 1,2(x)= 71 K

B > 2,

(133)

-

a

(/y) K(E)]

P 1 / 2 ( xn) = 2 [ 2 E ( E ) -

".E(E) + :1 J:(y)

= 3n

At the limit of

f) -, 4

-k ( 4 x

l)K(/e-).

(136)

2 we get

J:(k,L sin 8/2)sin2(k,W cos 012) dk, + (k, W cos 19/2)~

cos( k r

= l~(;,Wcosep).

wys')

dk,

(137)

The function $K(m)$ is singular for $m \to 1$. By Abramowitz and Stegun (17.3.26)


we get m+ 1

Then, 1

( ;) o-+-

=--

8

2 + -log(128) EL EL nL

lOg(c0S 8). (139)

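At the end of this process we come to the following result: the integral I is written in the general form of

$$I = \int_0^{\pi/2} f(\theta)\, d\theta. \tag{140}$$

This integral is divided into two parts:

$$I = \int_0^{\pi/2} \left[f(\theta) - f_{\mathrm{asym}}(\theta)\right] d\theta + \int_0^{\pi/2} f_{\mathrm{asym}}(\theta)\, d\theta. \tag{141}$$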

Unlike the integration of Eq. (140), which is most difficult, the integration of the first part in Eq. (141) is easy, and the second part is given analytically by Eq. (139).

REFERENCES

Aberle, J. T., and Pozar, D. M. (1988). IEEE AP-S Intl. Symposium Digest, II, 438-441.
Abramowitz, M., and Stegun, I. A. (1970). "Handbook of Mathematical Functions," Dover, New York.
Bailey, M. C., and Deshpande, M. D. (1982). IEEE Trans. Antennas Propagat., AP-30, 651-656.
Bailey, M. C., and Deshpande, M. D. (1985). IEEE Trans. Antennas Propagat., AP-33, 954-959.
Blischke, M. A., Rothwell, E. J., Chen, K. M., and Lin, J. L. (1988). J. of Electromagnetic Waves and Applications, 2, 353-378.
Butler, C. M. (1982). IEEE Trans. Antennas Propagat., AP-30, 755-758.
Chew, W. C., and Kong, J. A. (1981). IEEE Trans. Antennas Propagat., AP-29, 68-76.
Chi, C. L., and Alexopoulos, N. G. (1984). Nat. Radio Sci. Meeting, Boulder, CO, Jan. 11-13, 1984 (URSI Digest), 89-90.
Davidovitz, M., and Lo, Y. T. (1986). IEEE Trans. Antennas Propagat., AP-34, 905-911.
Deshpande, M. D., and Bailey, M. C. (1982). IEEE Trans. Antennas Propagat., AP-30, 645-650.
Garg, R., and Long, S. A. (1987). Electron. Lett., 23, 1149-1151.
Gradshteyn, I. S., and Ryzhik, I. M. (1980). "Tables of Integrals, Series and Products," Academic Press.
Kuester, E. F. (1987). J. of Electromagnetic Waves and Applications, 2, 103-135.
Levine, E., Shtrikman, S., and Treves, D. (1988). IEE Proc. 135H, 54-59.
Levine, E., Matzner, H., and Shtrikman, S. (1989). Electromagnetics J., 9, 451-471.
Lier, E., and Jakobsen, K. R. (1983). IEEE Trans. Antennas Propagat., AP-31, 978-984.
Liu, C. C., Shmoys, J., Hessel, A., Hanfling, J. D., and Usoff, J. M. (1985). IEEE Trans. Antennas Propagat., AP-33, 426-435.
Liu, C. C., Hessel, A., and Shmoys, J. (1988). IEEE Trans. Antennas Propagat., AP-36, 1501-1509.
Martin, N. M. (1988). Electron. Lett., 24, 680-681.
Matzner, H., Levine, E., and Shtrikman, S. (1989). URSI Intl. Symposium on EM Theory, Sweden, August 1989, 73-75.
Mosig, R., and Gardiol, F. E. (1982). Advances in Electronics and Electron Physics, 59, Academic Press, 139-237.
Newman, E. H., and Tulyathan, P. (1981). IEEE Trans. Antennas Propagat., AP-29, 47-53.
Perlmutter, P., Shtrikman, S., and Treves, D. (1985). IEEE Trans. Antennas Propagat., AP-33, 301-311.
Pinhas, S., and Shtrikman, S. (1987). IEEE Trans. Antennas Propagat., AP-35, 1285-1289.
Pinhas, S., Shtrikman, S., and Treves, D. (1989). IEEE Trans. Antennas Propagat., AP-37, 1516-1522.
Postoyalko, V. (1986). IEEE Trans. Microwave Theory Tech., MTT-34, 1092-1095.
Pozar, D. M. (1982). IEEE Trans. Antennas Propagat., AP-30, 1191-1196.
Pozar, D. M. (1983). Electromagnetics, 3, 299-309.
Pozar, D. M., and Shaubert, D. H. (1984). IEEE Trans. Antennas Propagat., AP-32, 1101-1107.
Pues, H., and Van de Capelle, A. (1984). IEE Proc. 131H, 334-340.
Rana, I. E., and Alexopoulos, N. G. (1981). IEEE Trans. Antennas Propagat., AP-29, 99-105.
Richmond, J. H. (1980). IEEE Trans. Antennas Propagat., AP-28, 883-887.
Sengupta, D. L. (1983). Electron. Lett., 19, 834-835.
Shih, C., Wu, R. B., Jeng, S. K., and Chen, C. H. (1988). IEEE Trans. Antennas Propagat., AP-36, 576-581.
Vanderbosch, G., and Van de Capelle, A. (1989). URSI Intl. Symposium on EM Theory, Sweden, August 1989, 61-63.
Yano, S., and Ishimaru, A. (1981). IEEE Trans. Antennas Propagat., AP-29, 77-83.


ADVANCES IN ELECTRONICS AND ELECTRON PHYSICS, VOL. 82

Some Recent Advances in Multigrid Methods*

JAN MANDEL
Computational Mathematics Group
University of Colorado at Denver
Denver, Colorado

I. Introduction
II. The Fundamental Multigrid Algorithm
    A. Formulation of the Algorithm
    B. Finite Differences and Fourier Analysis
    C. Multigrid on General Domains
    D. Variational Approach
    E. Numerical Quadrature and Nonconforming Elements
    F. Regularity-Free Convergence
    G. Nonlinear Problems
    H. Grid Refinement
    I. Unigrid
    J. Anisotropic Problems and Semirefinement
III. Preconditioning by Multigrid
IV. Methods Based on Space Decomposition
    A. Hierarchical Bases and Multilevel Preconditioners
    B. Methods with Several Coarse Grids
    C. Using Symmetry: Domain Reduction Methods
V. Multigrid in Elasticity
VI. Multigrid for Mixed Problems
    A. Stokes Equations
    B. Mixed Formulations of Elasticity
VII. Multigrid for High Order and Spectral Methods
    A. Spectral and Spectral Element Methods
    B. Methods for p-Version Finite Elements
VIII. Eigenvalue Problems
IX. Multigrid and Parallel Computing
X. Some Other Multigrid Developments
XI. Multigrid Software
    Appendix A. PLTMG
    Appendix B. MADPACK
    Appendix C. MUDPACK
    Appendix D. MGD1
    Appendix E. MG00
    Appendix F. AMG
    Appendix G. BOXMG
References

* The work of the author reported here was partially supported by the National Science Foundation under grant DMS-8704169.

I. INTRODUCTION

Multigrid methods are fast iterative methods for the solution of large systems of algebraic equations. Most applications of multigrid and related methods involve solving sparse systems arising from discretizations of partial differential equations. The unknowns and equations in such systems typically correspond to points in a physical space, and only unknowns that correspond to neighboring points are directly coupled (i.e., they occur together in at least one equation). Usual iterative methods then adjust in one step the value of an unknown depending on the current values of other unknowns in its neighborhood. It is obvious that if there is a large number of unknowns, propagating the error through the system will take many steps. Multigrid methods overcome this bottleneck by using several related systems with a smaller number of variables that interact with the original system to propagate the information. In applications to differential and integral equations, these systems are naturally obtained by discretizing the same differential or integral equation with different levels of resolution.

The reader is assumed to be familiar with basic concepts of multigrid methods; for background information, we refer to the monograph edited by McCormick (1987), which contains a tutorial chapter as well as chapters on multigrid methods for elliptic and some hyperbolic problems. However, we introduce some of the concepts in order to provide a common notation, to introduce new developments in their light, or to avoid misunderstanding. For more comprehensive tutorials, see Stüben and Trottenberg (1982) and Briggs (1987). For a comprehensive treatment of the field up to 1984, see Hackbusch (1985b).

The multigrid literature has experienced an explosive growth in recent years. A complete bibliography up to about 1986 has been compiled by Brand et al. (1987). While there was only one multigrid paper published in 1975, there were 31 in 1980 and 149 in 1985. A search in the MathSci database in May 1990 found a total of 655 publications with the multigrid keyword, which covers only a fraction of 1989 publications (45 at that time) and does not even include some articles with engineering and physical character. This article presents a review of selected recent developments with emphasis on new fundamental ideas that might influence the field in the future. Rather than attempting a complete coverage of literature, the author has


concentrated on topics in his own area of interest and the work he is familiar with. Therefore, the bibliography is certainly not complete, and the author would like to apologize to all whose work is not referenced. Some important areas of applications are not included, most notably multigrid in fluid mechanics. For an introduction to this area, see Hemker and Johnson (1987) and Brandt (1984). Many other areas are treated only quite superficially. The application of multigrid principles to fluids as well as other multigrid developments are well represented in the proceedings of the various multigrid conferences (McCormick, 1983, 1986, 1988; Mandel et al., 1989; Hackbusch and Trottenberg, 1982,1986; Braess et al., 1985).

II. THE FUNDAMENTAL MULTIGRID ALGORITHM

We first state for reference the basic form of the multigrid algorithm.

A. Formulation of the Algorithm

We shall be concerned with a family of linear problems

$$A_k u_k = f_k, \qquad k = 0, \ldots, m, \tag{1}$$

where $A_k$ is an invertible linear operator on a finite dimensional space $V_k$, $\dim V_k < \dim V_{k+1}$, $k = 0, \ldots, m-1$. The right-hand side $f_m$ is given, and we wish to solve Eq. (1) for $k = m$. The operators $A_k$, $k < m$, are auxiliary, and they are introduced for use in the multigrid algorithm. With each space $V_k$, there is associated a characteristic parameter $h_k$, which is the mesh spacing or a characteristic spacing of a nonuniform grid. Typically, $h_{k-1} = 2h_k$; that is, Eqs. (1) are a family of discretizations of a differential equation with a refinement factor of 2. When the mesh spacing is different for different spatial directions, we have a vector of mesh sizes $h_k$. We occasionally omit the subscript $m$ for the highest level, thus writing $V = V_m$, $A = A_m$, $f = f_m$, $u = u_m$, $h = h_m$. For each Eq. (1), we are given a basic iterative method

$$u_k \leftarrow u_k + C_k (f_k - A_k u_k), \tag{3}$$

where $\leftarrow$ denotes replacement. Denoting by $u_k^*$ the solution of Eq. (1) and by

$$e_k = u_k - u_k^* \tag{4}$$

the error of $u_k$, we see that the iteration (3) transforms the error according to

$$e_k \leftarrow G_k e_k, \qquad G_k = I - C_k A_k. \tag{5}$$

The multigrid algorithm is essentially an acceleration of the basic iterative method (3) for $k = m$. The correspondence between the spaces $V_k$ is given by prolongations $I_{k-1}^{k} : V_{k-1} \to V_k$ and restrictions $I_k^{k-1} : V_k \to V_{k-1}$, which are assumed to be full rank linear mappings. The multigrid algorithm is then defined as follows:

Algorithm 1. The Fundamental Multigrid Algorithm

1. Presmoothing: do $\nu_1$ times
   $u_k \leftarrow u_k + C_k (f_k - A_k u_k)$.
2. Coarse Grid Correction:
   a. Create the restricted residual
      $f_{k-1} = I_k^{k-1} (f_k - A_k u_k)$.
   b. Solve the problem
      $A_{k-1} u_{k-1} = f_{k-1}$    (6)
      directly or by some other method if $k - 1 = 0$, or by a recursive application of this algorithm with the initial value $u_{k-1} = 0$ if $k - 1 > 0$. The exact way this is done defines various multigrid cycles (shown later).
   c. Correct the solution in $V_k$ by
      $u_k \leftarrow u_k + I_{k-1}^{k} u_{k-1}$.
3. Post-smoothing: do $\nu_2$ times
   $u_k \leftarrow u_k + C_k (f_k - A_k u_k)$.

One of the numbers of smoothing steps $\nu_1$ and $\nu_2$ may be zero, giving an algorithm with only presmoothing or only post-smoothing. Both $\nu_1$ and $\nu_2$ may depend on $k$ (Bramble and Pasciak, 1987) or they may be determined adaptively (Brandt, 1977). The special version of the algorithm when the coarse grid problem (6) is solved exactly is called the two-grid algorithm. The error in the two-grid algorithm is transformed according to

$$e_k \leftarrow M_k e_k, \tag{7}$$

where $M_k$ is the two-grid operator,

$$M_k = G_k^{\nu_2} K_k G_k^{\nu_1}, \tag{8}$$

with $K_k$ being the coarse grid correction operator given by

$$K_k = I - I_{k-1}^{k} A_{k-1}^{-1} I_k^{k-1} A_k. \tag{9}$$

We obtain the basic multigrid cycles when the coarse grid problem (6) is solved recursively by $\mu$ applications of the multigrid algorithm itself. The case $\mu = 1$ is called the V-cycle and the case $\mu = 2$ is called the W-cycle (Stüben and Trottenberg, 1982). The intermediate case between the V-cycle and the W-cycle is the F-cycle (Stüben and Trottenberg, 1982; Mandel and Parter, 1990), where one iteration of the multigrid algorithm and one of the V-cycle are used in the solution of Eq. (6). In Fig. 1, the smoothing stages on the various levels are represented by dots. Once an estimate of the two-grid convergence factor $\|M_k\|$ has been obtained, one can get an estimate of the convergence factor for the multigrid W-cycle by a perturbation argument and recursion. The essential requirement here is that the restrictions and prolongations are bounded in suitable norms.

FIG. 1. Multigrid cycles (panels: V-cycle, W-cycle and F-cycle, drawn against the level number).
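Algorithm 1 and the cycle parameter $\mu$ can be sketched in a few lines for a model problem. The code below is only an illustration under assumptions not made in the text: the one-dimensional Poisson problem with zero boundary values on $2^L - 1$ interior points, weighted Jacobi with $\omega = 2/3$ in the role of $C_k$, full weighting as restriction, linear interpolation as prolongation, and coarse operators obtained by rediscretization. All function names are hypothetical.

```python
import numpy as np

def apply_A(u, h):
    """Apply the stencil [-1, 2, -1] / h^2 with zero Dirichlet boundary values."""
    Au = 2.0 * u
    Au[1:] -= u[:-1]
    Au[:-1] -= u[1:]
    return Au / h**2

def smooth(u, f, h, steps, omega=2.0 / 3.0):
    """Weighted Jacobi: u <- u + C (f - A u) with C = omega * h^2 / 2."""
    for _ in range(steps):
        u = u + omega * (h**2 / 2.0) * (f - apply_A(u, h))
    return u

def restrict(r):
    """Full weighting [1/4, 1/2, 1/4] onto the coarse grid points."""
    return 0.25 * r[0:-2:2] + 0.5 * r[1:-1:2] + 0.25 * r[2::2]

def prolong(e, n_fine):
    """Linear interpolation from the coarse grid to n_fine fine grid points."""
    u = np.zeros(n_fine)
    u[1:-1:2] = e
    u[0:-2:2] += 0.5 * e
    u[2::2] += 0.5 * e
    return u

def multigrid(u, f, h, nu1=1, nu2=1, mu=1):
    """One cycle of Algorithm 1; mu = 1 gives the V-cycle, mu = 2 the W-cycle."""
    n = u.size
    if n == 1:
        return f * h**2 / 2.0                      # exact solve on the coarsest level
    u = smooth(u, f, h, nu1)                       # 1. presmoothing
    f_coarse = restrict(f - apply_A(u, h))         # 2a. restricted residual
    u_coarse = np.zeros(f_coarse.size)             # 2b. recursive solve, zero initial value
    for _ in range(mu):
        u_coarse = multigrid(u_coarse, f_coarse, 2.0 * h, nu1, nu2, mu)
    u = u + prolong(u_coarse, n)                   # 2c. correction
    return smooth(u, f, h, nu2)                    # 3. post-smoothing

if __name__ == "__main__":
    n = 127
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)
    f = np.sin(3.0 * np.pi * x) + x
    u = np.zeros(n)
    for cycle in range(8):
        u = multigrid(u, f, h, mu=1)
        print(cycle + 1, np.linalg.norm(f - apply_A(u, h)))
```

Replacing `mu=1` by `mu=2` in the driver switches from the V-cycle to the W-cycle without touching the rest of the code, which mirrors how the cycles differ only in step 2b of Algorithm 1.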


Let $e_k^0$ be the error before and $e_k^1$ the error after one step of the multigrid iteration. The convergence factor is then defined as

$$\varepsilon_k = \sup_{e_k^0 \neq 0} \frac{\|e_k^1\|}{\|e_k^0\|}.$$

An important property of multigrid methods is that for many hierarchies of discretizations on the spaces $V_0, V_1, \ldots$, their convergence factor is bounded away from one independently of $h$ (or independently of the number of levels $m$), that is,

$$\varepsilon_k \le \text{const.} < 1. \tag{10}$$

This, in general, requires that the two-level convergence factor can be made small enough by using sufficiently many smoothing steps: assuming that the restrictions and prolongations are bounded in a suitable way, one has the recursion for the convergence bound of the W-cycle,

$$\varepsilon_k \le \varepsilon_{k,\text{two-level}} + \text{const} \cdot \varepsilon_{k-1}^2, \tag{11}$$

where $\varepsilon_{k,\text{two-level}} = \|M_k\|$ is the two-level convergence factor. If $\varepsilon_{k,\text{two-level}}$ is bounded uniformly in $k$ by a sufficiently small constant, then one has Eq. (10). Multigrid recursions in an abstract setting were studied by Maitre and Musy (1984) and recently by Douglas and Douglas (1990). Assuming that the cost of smoothing on $V_k$ is proportional to $n_k$, where

$$n_k = \dim V_k \approx h_k^{-d},$$

with $d$ as the dimension of the physical domain, the cost of the V-cycle can be shown to be also proportional to $n_k$, and the costs of the W-cycle and the F-cycle are proportional to $n_k$ if $d \ge 2$. The multigrid cycle thus has optimal asymptotic computational complexity: the number of operations to perform one multigrid cycle is only a constant times that of accessing the data that express the discrete problem $A_m u_m = f_m$.

The technique called nested iteration or full multigrid uses the approximate solution of the problem on level $k$ as the initial approximation for iterations on level $k + 1$. It can be proved that only a fixed number of multigrid cycles is needed to obtain a solution with error of the same order as the discretization error, giving a method with optimal asymptotic computational complexity for the solution of the discrete equations. For more details, see, for example, Brandt (1977), Hackbusch (1981b, 1985b), Bank and Dupont (1981) or Mandel et al. (1987).

B. Finite Differences and Fourier Analysis

From the start, the multigrid algorithm was motivated by modal analysis for constant coefficient problems on a uniform grid (Fedorenko, 1961; Brandt,


1977; Stüben and Trottenberg, 1982). In modal analysis, the problem is idealized so that one considers an infinite uniform grid $\Omega_k$ in $\mathbb{R}^n$ with spacing $h_k$ and the space $V_k$ is the space of grid functions on $\Omega_k$. The grid $\Omega_k$ consists of mesh points

$$l h_k = (l_1 h_{k1}, \ldots, l_n h_{kn}),$$

where $l = (l_1, \ldots, l_n)$ is an integer vector. The error (4), which is a function on $\Omega_k$, is expanded in terms of the modes $u_\theta$, $\theta = (\theta_1, \ldots, \theta_n)$, defined by

$$u_\theta(x) = e^{i (x \cdot \theta / h_k)}, \tag{12}$$

where

$$x \cdot \theta / h_k = \frac{x_1 \theta_1}{h_{k1}} + \cdots + \frac{x_n \theta_n}{h_{kn}}.$$

Note that modes whose angular frequencies $\theta_l$ differ by an integer multiple of $2\pi$ are not distinguishable on $\Omega_k$. We can thus restrict ourselves to modes with

$$-\pi < \theta_l \le \pi, \qquad l = 1, \ldots, n.$$

A stencil operator $A_k$ is given by a stencil of the form $[a_j]$, and it acts on a grid function $u$ according to

$$A_k u(x) = \sum_j a_j u(x + h_k j),$$

where the multiplication $h_k j$ is to be understood by components and the summation is over integer vectors $j$. Only a finite number of the stencil coefficients is nonzero, of course. Then it is easy to see that any mode $u_\theta$ is invariant under such mapping $A_k$,

$$A_k u_\theta = \hat{A}_k(\theta)\, u_\theta, \qquad \hat{A}_k(\theta) = \sum_j a_j e^{i\, j \cdot \theta},$$

where $\cdot$ denotes inner product. The space $V_{k-1}$ is the space of functions defined on the grid $\Omega_{k-1}$ in an analogous manner. The restriction $I_k^{k-1} : V_k \to V_{k-1}$ is given by a similar stencil, which is evaluated only at the points of $\Omega_{k-1}$. For example, the nine-point averaging in two dimensions is given by the stencil

$$\frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}. \tag{13}$$

The prolongations are perhaps best understood as adjoints of restrictions with respect to the inner product (14) for grid functions on $\Omega_k$ and an analogous inner product on $\Omega_{k-1}$, which somewhat facilitates analysis (Mandel et al., 1987; Mandel and Ombe, 1988). For example, the adjoint to the nine-point averaging (13) is the stencil of bilinear interpolation. It turns out that smoothing operators corresponding to many known relaxation methods, such as Jacobi or Gauss-Seidel, also keep the modes (12) invariant (Brandt, 1977). One can expect that the coarse grid correction eliminates (or at least significantly attenuates) the modes with wavelength at least $2h_{k-1}$, so the convergence factor could be estimated by the smoothing factor introduced by Brandt (1977),

$$\mu = \max_{\pi/2 \le |\theta| \le \pi} |\hat{G}_k(\theta)|,$$

where $\hat{G}_k$ is the symbol of the smoothing operator $G_k$ from Eq. (5), given by

$$G_k u_\theta = \hat{G}_k(\theta)\, u_\theta,$$

and $|\theta| = \max\{|\theta_1|, \ldots, |\theta_n|\}$.
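As a concrete illustration, the smoothing factor can be estimated numerically by sampling the high-frequency range $\pi/2 \le |\theta| \le \pi$. The sketch below assumes the five-point Laplacian in two dimensions with damped Jacobi relaxation, whose symbol is $1 - \omega\,(1 - (\cos\theta_1 + \cos\theta_2)/2)$. The function names, the damping parameter `omega`, and the sampling resolution are illustrative choices and are not taken from the text.

```python
import math


def jacobi_symbol(theta1, theta2, omega):
    """Symbol of damped Jacobi for the five-point Laplacian (assumed model problem)."""
    return 1.0 - omega * (1.0 - 0.5 * (math.cos(theta1) + math.cos(theta2)))


def smoothing_factor(symbol, samples=64):
    """Sampled maximum of |symbol| over pi/2 <= max(|theta1|, |theta2|) <= pi."""
    best = 0.0
    for i in range(-samples, samples + 1):
        for j in range(-samples, samples + 1):
            t1 = math.pi * i / samples
            t2 = math.pi * j / samples
            # keep only the high-frequency modes (small tolerance for rounding)
            if max(abs(t1), abs(t2)) >= math.pi / 2 - 1e-12:
                best = max(best, abs(symbol(t1, t2)))
    return best


mu_damped = smoothing_factor(lambda a, b: jacobi_symbol(a, b, 0.8))
mu_undamped = smoothing_factor(lambda a, b: jacobi_symbol(a, b, 1.0))
print(mu_damped, mu_undamped)
```

Under these assumptions the damped variant gives a smoothing factor of about 0.6, while undamped Jacobi gives 1, that is, it does not reduce the highest-frequency mode at all.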

A more sophisticated analysis is obtained by studying the transformation of the modes $u_\theta$ by the two-grid algorithm. Because the restriction operator is evaluated only at the points of $\Omega_{k-1}$, and modes with angular frequencies $\theta_l$ different by integer multiples of $\pi$ have the same values on $\Omega_{k-1}$, the two-grid algorithm couples $2^n$ modes with each mode $u_\theta$, $-\pi/2 < \theta_l \le \pi/2$. These $2^n$ modes span an invariant subspace of the two-grid operator, and thus the transformation of the error by the two-grid method can be characterized by the corresponding $2^n \times 2^n$ symbol matrix $\hat{M}(\theta)$. The spectral radius and the spectral norm of the two-grid operator $M$ from Eq. (8) can be calculated as

$$\rho(M) = \sup_{|\theta| \le \pi/2} \rho(\hat{M}(\theta)), \qquad \|M\| = \sup_{|\theta| \le \pi/2} \|\hat{M}(\theta)\|. \tag{15}$$
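The two-grid symbol computation can be sketched for the simplest case. The example below is an illustration under stated assumptions, not the procedure of any cited paper: one space dimension (so the symbol matrix is $2 \times 2$), the stencil $[-1, 2, -1]$ with $h = 1$, damped Jacobi presmoothing, full-weighting restriction, and linear interpolation. All function names are hypothetical.

```python
import cmath
import math


def two_grid_symbol(theta, omega=2.0 / 3.0, nu=1):
    """2 x 2 two-grid symbol coupling the modes theta and theta + pi (1D model problem)."""
    c = math.cos(theta)
    a = [2.0 - 2.0 * c, 2.0 + 2.0 * c]                  # fine operator at theta, theta + pi
    s = [(1.0 - omega * ai / 2.0) ** nu for ai in a]    # nu damped-Jacobi steps
    r = [(1.0 + c) / 2.0, (1.0 - c) / 2.0]              # full weighting (row vector)
    p = r                                                # linear interpolation (column vector)
    a_coarse = (2.0 - 2.0 * math.cos(2.0 * theta)) / 4.0  # same stencil with spacing 2h
    # M = (I - p * a_coarse^{-1} * r * A) * S
    return [[((1.0 if i == j else 0.0) - p[i] * r[j] * a[j] / a_coarse) * s[j]
             for j in range(2)] for i in range(2)]


def spectral_radius_2x2(m):
    """Largest eigenvalue modulus of a 2 x 2 matrix from its trace and determinant."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return max(abs((tr + root) / 2.0), abs((tr - root) / 2.0))


def two_grid_factor(samples=200, omega=2.0 / 3.0, nu=1):
    """Sampled supremum over 0 < theta <= pi/2; the symbol is even in theta."""
    best = 0.0
    for j in range(1, samples + 1):  # theta = 0 is excluded (constant mode)
        theta = 0.5 * math.pi * j / samples
        best = max(best, spectral_radius_2x2(two_grid_symbol(theta, omega, nu)))
    return best


rho = two_grid_factor()
print(rho)
```

For this model problem the nonzero eigenvalue of the symbol is $1 - \omega(1 + \cos^2\theta)$, so with $\omega = 2/3$ and one smoothing step the sampled spectral radius is about $1/3$.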

This simplified presentation of modal analysis is the basis of the modal analysis technique developed in Mandel and Ombe (1988), which can be easily automated and applied to complicated problems and systems of equations. In previous analyses, special cases arise for modes with some $\theta_l = \pi/2$, because then some of the $2^n$ modes are mapped to zero by the restriction operator $I_k^{k-1}$. Those modes had to be treated separately. The five-point restriction operator corresponds to linear interpolation,

Mandel and Ombe (1988) have applied the preceding variant of modal analysis to the linear elasticity problem in three dimensions, resulting in


$24 \times 24$ matrices for the symbol of the two-grid operator. Niestegge and Witsch (1990) presented modal analysis (on an infinite domain) of several variants of a multigrid method for the Stokes equation in two dimensions, representing the two-grid operator by $12 \times 12$ matrices. A well-known effective smoother for two-dimensional problems is the red-black Gauss-Seidel relaxation, in which the points are processed in a checkerboard fashion (Stüben and Trottenberg, 1982). Then the subspaces of the same four modes as above are invariant under the red-black relaxation. A closely related approach was adopted by Kuo and Levy (1989). Blaheta (1988) used Fourier analysis for a multigrid-like process, in which the restrictions are defined by aggregation. Fourier analysis of multigrid cycles with $m$ levels leads to significant complications, since the $m$-level cycle couples $4^m$ modes in two dimensions (Decker, 1988), and direct computation of convergence factors similar to Eq. (15) is quite impractical. Instead, Decker derived bounds on the elements of the symbol matrices of the V-cycle to analyze properties of the V-cycle as a preconditioner for the Helmholtz equation.

C. Multigrid on General Domains

Mode analysis does not apply on general domains. One common solution is to disregard the effect of boundary conditions and to use the results of the mode analysis in the infinite domain. This technique is called local mode analysis, and it is based essentially on the hope that the results will approximate the actual behavior of the method well enough. This is not always the case because the singularities of the generic solution on the boundary may corrupt the convergence factor of the multigrid algorithm. If the multigrid algorithm on a uniform grid is changed by adding a sufficient amount of extra smoothing (that is, relaxation steps) near the boundary, then the speed of convergence predicted by local mode analysis can be recovered in practice (Brandt, 1984), and there is an ongoing effort to develop a rigorous analysis for this (Brandt, 1989a, 1989b). Instead, multigrid analysis for general domains relies on results from approximation theory. We start by presenting the main principles of the theory due to Hackbusch (1981b, 1982, 1985b). For simplicity, assume that we are solving a self-adjoint second order linear elliptic partial differential equation

$$-\nabla \cdot (a \nabla u) + cu = f, \qquad c \ge 0, \quad a > \text{const.} > 0 \quad \text{in } \Omega, \tag{17}$$

on a family of quasi-uniform meshes with spacings $h_k$. The spaces $V_k$ are spaces of functions defined on a finite mesh $\Omega_k$ in a bounded domain $\Omega$, and define the


norm $\|\cdot\|_k$ induced by the inner product (14). Assume that the discretization matrix $A_k$ is symmetric and positive definite. Suppose that only presmoothing is present, that is, $\nu_2 = 0$, and that simple Richardson iteration (18) is used for smoothing.

Then the two-grid operator is

However, for any $f_k$, $A_k^{-1} f_k$ and $I_{k-1}^k A_{k-1}^{-1} I_k^{k-1} f_k$ are just two discrete solutions for some common right-hand side $f$. If the $L^2$ optimal error estimate holds and the restrictions and prolongations have norms bounded uniformly in $h_k$ (with respect to the norms $\|\cdot\|_k$) and some other technical conditions are satisfied, then for any $f_k$,

$$\|A_k^{-1} f_k - I_{k-1}^k A_{k-1}^{-1} I_k^{k-1} f_k\|_k \le \text{const.}\, h_k^2\, \|f_k\|_k, \tag{19}$$

so

$$\|A_k^{-1} - I_{k-1}^k A_{k-1}^{-1} I_k^{k-1}\| \le \text{const.}\, h_k^2, \tag{20}$$

where $\|\cdot\|$ is the spectral norm. The property (20) is called the approximation property. Using the fact that $\rho(A_k) \approx \text{const.}\, h_k^{-2}$, we have the smoothing property (21).

Combining Eqs. (20) and (21), one finally obtains a bound on the norm of the two-grid operator which shows that the two-grid convergence factor (in the Euclidean norm) can be made arbitrarily small, independently of $h_k$, by choosing the number $\nu_1$ of smoothing steps sufficiently large. Because the optimal error estimate (19) holds only in rather special cases, Hackbusch developed this argument in a general setting using fractional Sobolev norms instead of the $L^2$ norm. The theory was also extended to the case of non-self-adjoint problems by a perturbation argument for the smoothing property.
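The structure of this argument can be sketched in formulas. What follows is the form in which the argument is commonly written in the literature and is an assumption here, not a transcription of the original displays; in particular the damping $1/\rho(A_k)$, the symbol $M_k$ for the two-grid operator, and the function $\eta(\nu)$ are assumed notation.

```latex
% Richardson smoothing and the factored two-grid operator (presmoothing only)
G_k = I - \frac{1}{\rho(A_k)}\, A_k, \qquad
M_k = \bigl(A_k^{-1} - I_{k-1}^k A_{k-1}^{-1} I_k^{k-1}\bigr)\,\bigl(A_k G_k^{\nu_1}\bigr).

% Smoothing property for symmetric positive definite A_k:
% maximize x (1 - x)^{\nu} over 0 \le x \le 1 with x = \lambda / \rho(A_k)
\|A_k G_k^{\nu_1}\| \le \rho(A_k)\, \eta(\nu_1), \qquad
\eta(\nu) = \frac{\nu^{\nu}}{(\nu + 1)^{\nu + 1}} \le \frac{1}{\nu + 1}.

% Product of the approximation property (20) and the smoothing property
\|M_k\| \le \text{const.}\, h_k^{2} \cdot \text{const.}\, h_k^{-2}\, \eta(\nu_1)
        = \text{const.}\, \eta(\nu_1) \to 0 \quad (\nu_1 \to \infty).
```

The powers of $h_k$ cancel, which is why the resulting bound depends only on the number of smoothing steps.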


The multigrid method can also be applied to nonelliptic problems, and one could formally apply the above argument, except that the required approximation and smoothing properties do not hold any more. One can still use a formally identical algorithm for solving systems arising from a discretization of hyperbolic problems (for a review, see Hemker and Johnson (1987)), but the properties of such methods are much less favorable than for elliptic equations (Mulder, 1989a). Various multigrid approaches to difficult problems are being proposed and studied in Brandt (1989b).

D. Variational Approach

The approach just discussed is well suited for the case when $A_k$ is obtained by finite differences. When the differential equation is discretized by finite elements, however, one can often take advantage of the imbedding

$$V_{k-1} \subset V_k, \tag{22}$$

where the spaces $V_k$ are now understood as spaces of piecewise polynomial functions on a physical domain $\Omega$ as well as spaces of vectors of their values at nodes, as before. It is then natural to define the prolongation by the imbedding (22),

$$I_{k-1}^k : u \in V_{k-1} \mapsto u \in V_k. \tag{23}$$

The restrictions are then defined as transposes with respect to the inner products defined using nodal values analogously to Eq. (14),

$$(I_k^{k-1} v_k, v_{k-1})_{k-1} = (I_{k-1}^k v_{k-1}, v_k)_k,$$

for all $v_{k-1} \in V_{k-1}$, $v_k \in V_k$. The stiffness matrices in the conforming finite element method then satisfy the variational condition

$$A_{k-1} = I_k^{k-1} A_k I_{k-1}^k. \tag{24}$$
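The variational condition can be checked directly on a small example. The sketch below assumes linear elements for $-u''$ on a uniform one-dimensional mesh, prolongation by linear interpolation, and restriction taken as the plain matrix transpose (that is, unscaled Euclidean inner products on nodal values, which is an assumption about the scaling). All names are illustrative.

```python
def interpolation_matrix(n_coarse):
    """Linear interpolation from n_coarse to 2 * n_coarse + 1 interior nodes."""
    n_fine = 2 * n_coarse + 1
    p = [[0.0] * n_coarse for _ in range(n_fine)]
    for j in range(n_coarse):
        i = 2 * j + 1  # fine-grid index of coarse node j
        p[i - 1][j] = 0.5
        p[i][j] = 1.0
        p[i + 1][j] = 0.5
    return p


def stiffness_matrix(n, h):
    """Linear-element stiffness matrix of -u'': (1/h) * tridiag(-1, 2, -1)."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 2.0 / h
        if i > 0:
            a[i][i - 1] = -1.0 / h
        if i < n - 1:
            a[i][i + 1] = -1.0 / h
    return a


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def transpose(x):
    return [list(row) for row in zip(*x)]


n_coarse, h_fine = 3, 0.125
P = interpolation_matrix(n_coarse)
A_fine = stiffness_matrix(2 * n_coarse + 1, h_fine)
A_galerkin = matmul(transpose(P), matmul(A_fine, P))   # restriction * A_k * prolongation
A_coarse = stiffness_matrix(n_coarse, 2 * h_fine)      # assembled directly on the coarse mesh
print(A_galerkin)
print(A_coarse)
```

In this setting the product of restriction, fine stiffness matrix, and prolongation reproduces the stiffness matrix assembled on the coarse mesh.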

In fact, one can always consider $V_{k-1}$ a subspace of $V_k$ with the imbedding (22), but this is not useful unless Eq. (24) holds. Again, for simplicity, let $A_m$ be symmetric and positive definite. Then Eq. (24) implies that the energy norm

$$|||u_k||| = (A_k u_k, u_k)_k^{1/2} \tag{25}$$

coincides in all spaces $V_k$. Note that

$$|||I_{k-1}^k||| = 1. \tag{26}$$


Bank and Dupont (1981) studied multigrid convergence in this framework, obtaining mesh-independent convergence factors for the two-grid algorithm and the W-cycle with sufficiently many smoothing steps. Because all usual iterative methods decrease the energy norm of the error and so does the coarse grid correction step, the multigrid method is trivially convergent, although this does not say anything about convergence factors. It was observed that the V-cycle even with one smoothing step converges independently of mesh size for many problems (Stüben and Trottenberg, 1982). The proof of this fact for symmetric positive definite problems that satisfy the approximation property (20) was given by Braess and Hackbusch (1983) for smoothing by Richardson iteration (18) and by McCormick (1985) and Bank and Douglas (1985) for general smoothers, including Gauss-Seidel and conjugate gradients. Mesh-independent convergence of the V-cycle is closely related to regularity properties of the partial differential equation. Let $T_k$ be the coarse grid correction operator from Eq. (9). The approximation property (20) can be written as the inequality (27), which holds

for all $e_k \in V_k$ independently of $h_k$, with $\alpha = 1$. This property can be interpreted as saying that the error in the energy norm after the coarse grid correction is small if the error before the correction is "smooth." (Note that $(A_k T_k e_k, e_k)_k = |||T_k e_k|||^2$ and that $\|A_k^{\alpha} e_k\|_k$ is just the $\ell^2$ norm of the residual if $\alpha = 1$.) For many problems, Eq. (27) holds only with some $0 < \alpha < 1$. Hackbusch (1981b, 1982) and Bank and Dupont (1981) have shown for second order problems that if the solution satisfies the regularity condition

$$\|u\|_{H^{1+\alpha'}} \le \text{const.}\, \|f\|_{H^{\alpha'-1}}, \tag{28}$$

then Eq. (27) holds with the same $\alpha = \alpha'$. In particular, for second order problems, we have $\alpha' = 1$ when the domain $\Omega$ is convex, and always $\alpha' > 0$ under very weak assumptions, but $\alpha' < 1$ if the domain $\Omega$ has a re-entrant corner (see, for example, Grisvard (1985)). Decker et al. (1988) have shown that $\alpha = \alpha'$ in Eq. (27) is the best possible, that is, Eq. (27) cannot hold with $\alpha > \alpha'$. The value of $\alpha$ has a profound importance for the theory of the multigrid V-cycle, because all known theoretical results require $\alpha = 1$ for the convergence factor of the V-cycle to be bounded away from one independently of $h_k$ (Mandel et al., 1987). The problem of the convergence of the W-cycle for $\alpha < 1$ with any nonzero number of smoothing steps was solved in Mandel (1988) for symmetric, positive definite problems and extended to nonsymmetric and indefinite problems under the assumption that $h_0$ is sufficiently small and the smoother is


simple Richardson iteration in Mandel (1986)¹ using a sharpening of the perturbation argument in Bank (1981). The assumption that $h_0$ is sufficiently small is required in any case because otherwise the problem in $V_0$ may not be solvable, but the required value of $h_0$ depends on the number of smoothing steps. For a detailed systematic treatment of this theory, see Mandel et al. (1987, 1988). A unified treatment of a number of earlier proofs was given by Parter (1987). Following Mandel et al. (1987), the main assumption of the analysis for the symmetric positive definite case and one presmoothing step is the inequality (29) with a constant $\beta > 0$,

for any $e_k \in V_k$ such that $G_k e_k \ne 0$, where $e_k$ is the error before the multigrid cycle and $G_k$ is the smoothing operator as in Eq. (5). The inequality (29) can be obtained from the approximation property (27) and purely algebraic properties of the smoother $G_k$. It expresses the fact that for a given error $e_k$, either smoothing is efficient in reducing the error, and then $|||e_k|||/|||G_k e_k|||$ is large, or the coarse grid correction is efficient and $|||T_k G_k e_k|||/|||G_k e_k|||$ is small. When the coarse grid problem (6) in $V_{k-1}$ is solved by any method with the convergence factor in the energy norm at most $c$, then the convergence factor of the multilevel algorithm in $V_k$ is bounded by

$$\sup_{0 \le t \le 1} \frac{t + c(1 - t)}{1 + \beta t^{1/\alpha}}.$$

Depending on the cycling scheme, one gets a recursion for bounds on the convergence factor, which gives for $\alpha = 1$ the bound

$$\frac{1}{1 + \beta}$$

on the convergence factor of the V-cycle and the W-cycle, and a bound independent of $h_k$ for the W-cycle and $0 < \alpha < 1$. For $0 < \alpha < 1$, this recursion gives a bound of the form

$$1 - C_1(\alpha, \beta)\, k^{-(1-\alpha)/\alpha}$$

on the convergence factor of the V-cycle (Decker et al., 1988), and

$$1 - C_2(\alpha, \beta)\, k^{-(1-\alpha)^2/\alpha}$$

for the F-cycle (Mandel and Parter, 1990), where $C_1$ and $C_2$ do not depend on $k$.
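The behavior of such a recursion can be explored numerically. The sketch below assumes that the level-to-level bound has the form $\sup_{0 \le t \le 1} (t + c(1-t))/(1 + \beta t^{1/\alpha})$, where $c$ is the bound for the coarse-level solver; this form is a reading of the text, not a verified transcription. It further assumes an exact solve on the coarsest level, $c$ unchanged for the V-cycle, and $c$ replaced by $c^2$ for the W-cycle (two coarse-level cycles). The supremum is approximated by sampling, and all names are hypothetical.

```python
def two_level_bound(c, alpha, beta, samples=2000):
    """Sampled sup over 0 <= t <= 1 of (t + c (1 - t)) / (1 + beta * t**(1/alpha))."""
    best = 0.0
    for i in range(samples + 1):
        t = i / samples
        best = max(best, (t + c * (1.0 - t)) / (1.0 + beta * t ** (1.0 / alpha)))
    return best


def cycle_bounds(levels, alpha, beta, coarse_cycles):
    """Bounds on levels 1..levels; level 0 is assumed to be solved exactly (factor 0)."""
    bounds, c = [], 0.0
    for _ in range(levels):
        c = two_level_bound(c ** coarse_cycles, alpha, beta)
        bounds.append(c)
    return bounds


full_regularity = cycle_bounds(10, alpha=1.0, beta=1.0, coarse_cycles=1)
v_bounds = cycle_bounds(20, alpha=0.5, beta=1.0, coarse_cycles=1)
w_bounds = cycle_bounds(20, alpha=0.5, beta=1.0, coarse_cycles=2)
print(full_regularity[-1], v_bounds[-1], w_bounds[-1])
```

Under these assumptions, $\alpha = 1$ reproduces the level-independent value $1/(1+\beta)$, whereas for $\alpha = 1/2$ the V-cycle bound creeps toward one as the number of levels grows and the W-cycle bound settles at a fixed value.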

¹ These two papers appeared in reverse order because of editorial delays.


General smoothers such as Gauss-Seidel relaxation can be treated with ease in this framework; see Mandel et al. (1988) for the symmetric positive definite case and Cao (1988) for the nonsymmetric and indefinite case. Bramble and Pasciak (1987) independently obtained estimates for the V-cycle and the W-cycle in the symmetric, positive definite case, which are asymptotically equivalent to the above. In addition, they proved a bound independent of $h_k$ on the convergence factor of the variable V-cycle, which employs $\text{const.}\, 2^{m-k}$ smoothing steps in each space $V_k$ and thus has the total number of smoothing steps identical to that of the W-cycle but fewer prolongations and restrictions and is easier to program. Their analysis, however, is limited to the class of smoothers of the form of Eq. (3) with $C_k$ symmetric and such that $A_k - A_k C_k A_k$ is positive definite. This positive definiteness condition is required only because of the proof technique employed, which was motivated by considering multigrid as a preconditioner. The estimates were extended to the nonsymmetric and indefinite case by Bramble et al. (1988a). Unlike in other analyses (Mandel, 1986; Cao, 1988; Bank, 1981), Bramble et al. get convergence for all sufficiently small $h_0$, independently of the number of smoothing steps, but the bound on the convergence factor of the V-cycle is weaker and approaches one as $1 - \text{const.}\, k^{-1}$. A quite different multigrid analysis was given by Mandel (1987), which was inspired by projection-iterative methods (Lučka, 1980) and iterative aggregation methods (Mandel and Sekerka, 1983). Unlike the other methods of analysis mentioned, which treat the nonsymmetric case as a perturbation of the symmetric one, the technique in Mandel (1987) is based on algebraic identities that do not rely on any kind of symmetry, showing that the two-level algorithm in a variational setting with one smoothing step by Eq. (3) with $C_k$ equal to a diagonal matrix is guaranteed to converge.
The approximation property (27) is then used to show that the convergence factor in a suitable norm can be bounded independently of h. The result is applied to the multigroup neutron diffusion equation, which is strongly nonsymmetric. The technique, however, does not extend to more general smoothers, more smoothing steps, or to the analysis of multigrid cycles.

E. Numerical Quadrature and Nonconforming Elements

Even if the imbedding condition (22) holds, the variational condition (24) may be violated in many common finite element discretizations because the stiffness matrix $A_k$ is not evaluated exactly but rather by numerical quadrature. Moreover, refinement of the triangulation near a curved boundary leads to $V_{k-1} \not\subset V_k$. The violation of Eqs. (22) or (24) is referred to in the


multigrid literature as the nonvariational case. The theory for the W-cycle outlined in Section II.C does not require Eqs. (22) or (24), but rather the approximation property (19) and technical assumptions on the prolongations and restrictions, which are satisfied for common discretizations by finite differences and can be expected to hold for finite elements (Hackbusch, 1981b, 1982, 1985b). Mansfield (1981) proved W-cycle convergence for isoparametric elements essentially by verifying the approximation condition. The method of Bank and Dupont (1981), also with W-cycles only, was extended to the nonvariational case by Douglas (1984) in an abstract way stipulating a number of assumptions that are to be verified in concrete cases. A comprehensive study of multigrid methods for nonconforming elements was undertaken by Brenner (1989a, 1989b, 1989c). For nonconforming elements, the spaces $V_k$ are not nested. The stiffness matrices $A_k$ are determined by integration over the elements, disregarding the jumps at the element boundaries (Ciarlet, 1972). Here are some examples of nonconforming finite element spaces for a given triangulation $\mathcal{T}$ of a domain $\Omega$ and the corresponding prolongations:

1. (Brenner, 1989b) The space of functions that are linear on all triangles $T \in \mathcal{T}$ and determined uniquely by the values at the midpoints of the sides of the triangles. Thus, $u_k$ is not uniquely defined as a function on the domain $\Omega$ because the values on the sides may be different on the two neighboring triangles. The prolongation needs to map such a collection of linear functions on the triangles into a similar collection for a refined triangulation. This is done by prescribing the values at the midpoints of the sides of the refined triangulation. For the midpoints that fall on the sides of the coarse triangulation but do not coincide with the midpoints, the value is determined by taking the average of the two values from the two neighboring coarse triangles.

2. (Brenner, 1989c) The Morley finite element space consists of functions that are quadratic on each triangle $T \in \mathcal{T}$ and determined by their values at the vertices of the triangles and the normal derivatives at the midpoints of the sides. The definition of prolongation is similar to the previous one, using the values and normal derivatives or their averages from the two neighboring elements.

3. (Brenner, 1989a, 1990) The space of vector functions that are linear, divergence-free, and given by their values at the midpoints of the sides of the triangles of $\mathcal{T}$. The prolongation is also defined by averaging, with special care taken to preserve the divergence-free condition.

The nonconforming energy norm is defined by the stiffness matrix, as in Eq. (25), but it no longer coincides on all spaces $V_k$, and we have a different energy norm $|||\cdot|||_k$ on each space $V_k$. The trivial bound (26) on the norm of the


prolongation operator no longer holds, and much of the theory is devoted to deriving the bound

$$|||I_{k-1}^k u_{k-1}|||_k \le \text{const.}\, |||u_{k-1}|||_{k-1}, \tag{30}$$

and other related bounds on the prolongation operator. Once these are established, Brenner proves the approximation property (20), and the multigrid theory for the W-cycle shows that the multigrid algorithms are of optimal computational complexity. Zhang (1989) has studied a $C^1$ element, given by the function values and first order derivatives at the vertices and the barycenter of a triangle, and the normal derivatives at the midpoint. Because the first order derivatives are continuous across the element boundaries, the prolongations can use the values of the function to define $u_k = I_{k-1}^k u_{k-1}$, but the functions $u_k$ and $u_{k-1}$ in general do not coincide on the fine grid triangles. The finite element spaces are not nested, but Eq. (30) still holds. It is proved that the W-cycle converges independently of the mesh size and that the multigrid algorithm has optimal complexity. Because the nonvariational case can be understood as a perturbation of the variational case, it should be possible to obtain the same results as for the variational case, namely V-cycle convergence and convergence with only one smoothing step, if $h_0$ is small enough. Bramble et al. (1991) have shown that this is indeed the case under essentially the approximation assumption (27). Note that the verification of Eq. (27) is now more complicated because the prolongation and restriction enter into the definition of the two-grid operator in Eq. (8). Their other significant assumption is that a weaker form of Eq. (24) holds,

$$(A_k I_{k-1}^k u_{k-1}, I_{k-1}^k u_{k-1})_k \le (A_{k-1} u_{k-1}, u_{k-1})_{k-1}, \tag{31}$$

for all $u_{k-1} \in V_{k-1}$, or that the multigrid algorithm is a symmetric V-cycle, that is, with the same number of presmoothing and postsmoothing steps, and that the operators for presmoothing should be adjoint in energy to those for postsmoothing. This guarantees that the multilevel operator is self-adjoint in energy. Note that Eq. (31) can be satisfied by just scaling $I_{k-1}^k$, but this would conflict with preserving Eq. (20). Bramble et al. (1991) show that their assumptions are indeed satisfied with some $\alpha > 0$ in several cases involving the symmetric V-cycle, including the second order equation (17) discretized by linear elements on a collection of meshes that do not create a nested sequence of finite element spaces, with the prolongation defined by natural interpolation,

$$(I_{k-1}^k u_{k-1})(x_i) = u_{k-1}(x_i)$$

for the nodes $x_i$ of $\Omega_k$. Bramble et al. also considered discretization by the Raviart-Thomas element (Raviart and Thomas, 1977). Goldstein (1989b) has further extended this theory and shown that it applies to the case of finite elements with the stiffness matrices calculated by a numerical quadrature that satisfies the "patch test." This test is a requirement on the quadrature rule, which guarantees optimal error estimates in the energy norm under sufficient regularity assumptions (Ciarlet, 1972). Braess and Verfürth (1988) extended the technique of Verfürth (1988) to the nonconforming Crouzeix-Raviart element (Crouzeix and Raviart, 1973). Braess and Verfürth define the prolongation $I_{k-1}^k$ by taking an $L^2$-like projection of $u_{k-1}$ onto $V_k$, which is easily computable because of their choice of basis. The boundedness property (30) is then analogous to the standard $H^1$ stability of $L^2$ projections onto finite element spaces (Crouzeix and Thomée, 1987).

F. Regularity-Free Convergence

The perennial question in multigrid theory is whether the approximation property (27) is really needed for multigrid convergence independent of $h_k$ or if it is only an analytical tool. The problem is that the constant in the approximation property (27) enters directly into multigrid convergence bounds, and it is not easy to evaluate. In fact, the essential ingredient in Eq. (27) is the regularity property (28) of the differential equation. In the theory of partial differential equations, the inequality (28) is usually derived from the fact that $u \in H^{1+\alpha'}$ and from the closed graph theorem (Nečas, 1964); thus, the value of the constant in Eq. (28) is not known in general. The constant in Eq. (27) can be calculated only in special cases by Fourier analysis (Mandel et al., 1987). Moreover, the predictive value of the rigorous convergence bounds is small even when the best constant is known, because the bounds are still pessimistic (Mandel et al., 1987). For this reason, many researchers have been looking for bounds that can be calculated with little effort from the data of the problem and which are sharp enough to be used to guide algorithm development. The results are known only for symmetric, positive definite problems, so the variational framework of Eqs. (22) to (25) is assumed. One possible modification of the multigrid method is to do smoothing by solving for variables that are not in the coarse grid. Then the two-level algorithm becomes in fact a two-block Gauss-Seidel iterative method with the blocks given by the spaces $V_{k-1}$ and $V_{k-1}^c$, where $V_{k-1}^c$ consists of functions from $V_k$ that are zero on coarse grid nodes. Bank and Dupont (1980) and Braess (1981) have shown that the convergence of such a method can be calculated using the angle $\gamma$ in the energy inner product $a(u_k, v_k) = u_k^T A_k v_k$ between the spaces $V_{k-1}$ and $V_{k-1}^c$, which is usually written as the strengthened Cauchy inequality

$$|a(u, v)| \le c\, a(u, u)^{1/2}\, a(v, v)^{1/2}, \qquad u \in V_{k-1}, \quad v \in V_{k-1}^c, \tag{32}$$


where $c = \cos\gamma$. Moreover, the problem restricted onto $V_{k-1}^c$ is well conditioned, and thus it can be solved approximately by a standard iteration like Jacobi or Gauss-Seidel. An alternative approach to regularity-free convergence is to use, instead of Eq. (32), the approximation condition that for any $u_k \in V_k$ there exists $u_{k-1} \in V_{k-1}$ satisfying the inequality (33),

which is a weaker condition than Eq. (27). Using the condition (33), Brandt (1986) and Mandel (1984a) analyzed two-grid convergence and proved bounds independent of $h_k$ for a general class of smoothers including Gauss-Seidel. The important property of both Eqs. (32) and (33) is that the constants can be calculated directly from the data of the problem and that they depend only on the local properties of the discrete system. Thus the same numerical bound will hold on domains of arbitrary size and even for problems with discontinuous coefficients. This locality of the bound is quite easy to see for the condition (33), where one can take for $u_{k-1}$ the function given by interpolation of nodal values of $u_k$. The constant can then be calculated considering one coarse grid element at a time. Mandel (1984a) applied this analysis to a multigrid method for a free-boundary problem, where the approximation property (27) does not hold for the discrete equations because of poor approximation properties near the a priori unknown free boundary. Kočvara and Mandel (1987) have incorporated smoothing by Gauss-Seidel iteration into the weak approximation property (33), resulting in sharper estimates. The sharpened bounds still can be evaluated locally. All bounds considered above have the common property that, unlike the bounds obtained from regularity properties in Section II.C, they do not guarantee that the convergence factor can be made arbitrarily small using sufficiently many smoothing steps. Brandt (1986) gave a counterexample based on the Laplace equation in one dimension and prolongation by piecewise constant functions, which shows that the convergence factor cannot be made arbitrarily small by more smoothing steps independently of $h$ if only the weak approximation property (33) holds. Thus the argument (11) giving $h$-independent convergence factors may fail if the two-level convergence factor does not happen to be small enough. This is the case even if, in the variational case, the constant in (11) equals one. In this case, the convergence factors of both the W-cycle and the V-cycle approach one for a large number of levels, and straightforward bounds based on recursions of the type (11) give convergence


factors with the asymptotic behavior $1 - \text{const.}\, q^m$, $0 < q < 1$ (Maitre and Musy, 1984). Recently, Bramble et al. (1989b, 1989c) showed that for the multigrid V-cycle with smoothing using only the variables not in the coarse grid, one has convergence with the asymptotic bound $1 - \text{const.}\, m^{-1}$. Bramble et al. need the assumption (34), which is stronger than Eq. (33). For a more detailed discussion of the inequality (34), see Xu (1989) and Section IV.A. The convergence of other methods based on space splitting can also be shown not to depend on the approximation assumption (27); see Section IV.

G. Nonlinear Problems

There are several approaches to multigrid for nonlinear problems: multigrid iteration can be used as a linear solver for the linearized problem, so that a few multigrid iterations are made in each step of the outer linearization method (Bank and Rose, 1982), or multigrid can be applied directly to the nonlinear problem, so that the problems on all levels are nonlinear (Brandt, 1977, 1982; Hackbusch, 1985b). For a common framework for multigrid methods for nonlinear problems, see Mandel (1985, 1984c). Recently, Reusken (1988) and Hackbusch and Reusken (1989) have studied conditions under which the nonlinear multigrid iterations converge. The software package PLTMG (Bank, 1990), reviewed in Section XI.A, incorporates multigrid solution of elliptic nonlinear problems by a damped Newton method. For the nonlinear multigrid method for integral equations of the second kind, see Hackbusch (1981a, 1985b) and also Mandel (1985).

H. Grid Refinement

In practice, one needs to solve partial differential equations with different resolution in different parts of the domain, because the grid or triangulation needs to be refined in areas where the solution changes quickly, in particular in the neighborhood of singularities. Grid refinement was considered very early in multigrid by Brandt (1977) and Bank and Dupont (1981). Many recent multigrid results have been proved for grids or triangulations with refinement; see, in particular, Bramble et al. (1990), Yserentant (1986a, 1986b, 1989) and


Bank et al. (1988). Grid refinement is incorporated in the package PLTMG reviewed in Section XI.A below. A survey of adaptive refinement methods and many new results can be found in a recent monograph by McCormick (1989).

I. Unigrid

McCormick and Ruge (1983) used a representation of multigrid algorithms on the finest grid for quick testing of multigrid ideas. Instead of keeping a hierarchy of approximations $u_k$, they keep only the finest approximation $u_m$. If any part of the multigrid process changes, say,

$$u_k \leftarrow u_k + d_k,$$

then unigrid reflects this change by changing

$$u_m \leftarrow u_m + I_k^m d_k,$$

where $I_k^m$ is the composed prolongation, $I_k^m = I_{m-1}^m \cdots I_k^{k+1}$.
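A minimal sketch of this idea follows, under assumptions that are not in the text: the one-dimensional Poisson matrix tridiag(-1, 2, -1), prolongation by linear interpolation (so that a prolonged coarse coordinate vector is a hat function on the finest grid), and an exact line minimization of the quadratic functional along each prolonged direction. Only the finest-grid vector is ever stored; all function names are hypothetical.

```python
def apply_poisson(u):
    """Apply tridiag(-1, 2, -1) to interior values (zero boundary values assumed)."""
    n = len(u)
    return [2.0 * u[i]
            - (u[i - 1] if i > 0 else 0.0)
            - (u[i + 1] if i < n - 1 else 0.0) for i in range(n)]


def prolonged_coordinate_vector(n, position, width):
    """Composed linear interpolation of a coarse coordinate vector: a hat function
    centered at the 1-based grid position `position` with half-width `width`."""
    return [max(0.0, 1.0 - abs((i + 1) - position) / width) for i in range(n)]


def directional_correction(u, f, d):
    """Return u + t d with t minimizing (u + t d)^T A (u + t d) - 2 (u + t d)^T f."""
    residual = [fi - ai for fi, ai in zip(f, apply_poisson(u))]
    t = (sum(di * ri for di, ri in zip(d, residual))
         / sum(di * ai for di, ai in zip(d, apply_poisson(d))))
    return [ui + t * di for ui, di in zip(u, d)]


def unigrid_cycle(u, f, num_levels):
    """Sweep from the finest level to the coarsest and back, updating only the fine vector."""
    n = len(u)
    widths = [2 ** j for j in range(num_levels)]
    for width in widths + widths[-2::-1]:
        for position in range(width, n + 1, width):
            d = prolonged_coordinate_vector(n, position, width)
            u = directional_correction(u, f, d)
    return u


def residual_norm(u, f):
    return max(abs(fi - ai) for fi, ai in zip(f, apply_poisson(u)))


n, num_levels = 15, 4
f = [1.0] * n
u = [0.0] * n
initial = residual_norm(u, f)
for _ in range(20):
    u = unigrid_cycle(u, f, num_levels)
final = residual_norm(u, f)
print(initial, final)
```

Each coarse-level correction here touches every fine-grid entry in the support of the hat function, which illustrates why coarse-level sweeps cost more in unigrid than in a conventional multigrid code.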

The disadvantage of unigrid is that iterations on coarse levels take more computational work than on the finest level. For example, the Gauss-Seidel iteration in $V_k$ proceeds by minimization and replacement,

$$(u_m + t I_k^m e_i^k)^T A_m (u_m + t I_k^m e_i^k) - 2 (u_m + t I_k^m e_i^k)^T f \to \min, \qquad u_m \leftarrow u_m + t I_k^m e_i^k,$$

where $e_i^k$ runs over all coordinate vectors of $V_k$ and the minimization is over a scalar $t$. Thus the unigrid method has to access all the fine grid data during a Gauss-Seidel sweep on a coarse grid. Unigrid is much simpler to program than multigrid and uses less storage. Hunt (1988) has shown that on a uniform two-dimensional grid, it is possible to implement unigrid for the V-cycle with only post-smoothing using only arrays of the same size as $u_m$ and preserve the computational complexity of the standard V-cycle. Hunt's method stores the coarse grid residual and the solution in the elements of the arrays representing the solution and the residual on the finest grid, and recomputes the overwritten fine grid quantities when needed. In exchange, the method never requires the more complicated multigrid data structures.

J. Anisotropic Problems and Semirefinement

The diffusion equation with coefficients different by orders of magnitude in different directions causes degradation of multigrid convergence rates and


calls for the use of special techniques. The ways of dealing with this problem have been block (line or plane) relaxation and semicoarsening (keeping the grid in some space direction and coarsening in another). Behie and Forsyth (1983) use conjugate gradients preconditioned by incomplete Choleski decomposition to perform plane relaxation for a problem with anisotropies and strongly discontinuous coefficients. Thole and Trottenberg (1985) employ a two-dimensional multigrid method for plane relaxation in the constant coefficient case. Dendy (1987) treats strongly anisotropic problems with strongly discontinuous coefficients by plane relaxation by two-dimensional multigrid and a special prolongation operator, which is implemented by successively coarsening in each of the three independent variables and then discarding the intermediate semicoarsened grids. Decker and Van Rosendale (1989) use semicoarsening, zebra relaxation (line relaxation in two sweeps) and restriction stencils optimized for the operator at hand to obtain a method that is very fast for the Poisson equation and insensitive to the mesh aspect ratio. Hackbusch (1989b) suggests several coarse grids used concurrently to eliminate different parts of the error for strongly anisotropic problems. This method is reviewed in more detail in Section IV.B. In the method of Mulder (1989a), one uniform grid gives rise to two coarse grids by the semicoarsening process (decreasing the number of points only in one direction). On the next coarser level, one coarse grid combines the information from both semicoarsened grids. In this way, Mulder overcomes the "weak coupling" across the characteristic direction for a model hyperbolic equation. Mulder gives Fourier analysis of the method for the linearized problem and analyzes several smoothers including damped red-black Gauss-Seidel. In a numerical experiment, the method is applied to Euler equations in two dimensions.
Mulder's method has lower complexity than a related method of Hackbusch (1989a) (cf. Section IV.B), but it does not handle well the case when the grid is aligned with the characteristic direction at the angle 45° (Mulder, 1989a).

III. PRECONDITIONING BY MULTIGRID

First note that every linear stationary method can be used as a preconditioner and every preconditioner can be used to derive an iterative method. Consider an iterative method

$$r \leftarrow f - Au, \qquad u \leftarrow u + B^{-1} r, \tag{35}$$

for the solution of the linear system $Au = f$. We are omitting here the subscript $m$ as in Eq. (2). Because the error is transformed according to $e \leftarrow (I - B^{-1}A)e$, the iterations (35) will converge fast if $B$ is close to $A$. Suppose that both $A$ and $B$ are symmetric positive definite and define the numbers $m_1$ and $m_2$ by

$$m_1 B \le A \le m_2 B, \tag{36}$$

where inequality of matrices means that the difference is positive semidefinite, and $m_1$ and $m_2$ are the largest and the smallest possible, respectively. Let

$$\kappa = m_2 / m_1 \tag{37}$$

be the generalized condition number of $A$ and $B$. Then the energy norm of the error, $|||e||| = \sqrt{e^T A e}$, will be reduced in $n$ iterations at least by the factor (which is attained for some errors) $(\max\{m_2 - 1,\ 1 - m_1\})^n$. By scaling $B$ by a scalar, this factor can be optimized to be

$$\left(\frac{\kappa - 1}{\kappa + 1}\right)^{n}. \tag{38}$$

The preconditioned conjugate gradient method (Concus et al., 1976) needs in each step one evaluation of each of the matrix-vector products $Au$ and $B^{-1}r$, and $n$ steps will reduce the error by at least the factor

$$2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{n}. \tag{39}$$
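The two bounds can be checked numerically on a small example. The sketch below rests on assumptions chosen only for illustration: $A$ is the one-dimensional discrete Laplacian, the preconditioner $B = A + 0.05\,I$ is hypothetical, and the helper names are not from the text. It computes $m_1$, $m_2$, and $\kappa$, then compares the energy-norm error reduction of the optimally scaled iteration (35) and of preconditioned conjugate gradients against the respective bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # SPD model matrix
B = A + 0.05 * np.eye(n)                                  # hypothetical SPD preconditioner

# m1 and m2 are the extreme eigenvalues of the pencil A x = m B x.
L = np.linalg.cholesky(B)
Linv = np.linalg.inv(L)
eigs = np.linalg.eigvalsh(Linv @ A @ Linv.T)
m1, m2 = eigs[0], eigs[-1]
kappa = m2 / m1

def energy_norm(e):
    return float(np.sqrt(e @ A @ e))

f = rng.standard_normal(n)
u_exact = np.linalg.solve(A, f)
steps = 8

# Stationary iteration with B scaled by 1/omega, omega = 2/(m1 + m2).
omega = 2.0 / (m1 + m2)
u = np.zeros(n)
for _ in range(steps):
    r = f - A @ u
    u = u + omega * np.linalg.solve(B, r)
ratio_stationary = energy_norm(u - u_exact) / energy_norm(u_exact)
bound_stationary = ((kappa - 1.0) / (kappa + 1.0)) ** steps

# Preconditioned conjugate gradients with the same (unscaled) B.
u = np.zeros(n)
r = f.copy()
z = np.linalg.solve(B, r)
p = z.copy()
rz = r @ z
for _ in range(steps):
    Ap = A @ p
    alpha = rz / (p @ Ap)
    u = u + alpha * p
    r = r - alpha * Ap
    z = np.linalg.solve(B, r)
    rz_new = r @ z
    p = z + (rz_new / rz) * p
    rz = rz_new
ratio_pcg = energy_norm(u - u_exact) / energy_norm(u_exact)
bound_pcg = 2.0 * ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** steps

print(f"kappa = {kappa:.3f}")
print(f"stationary: {ratio_stationary:.3e} <= {bound_stationary:.3e}")
print(f"PCG:        {ratio_pcg:.3e} <= {bound_pcg:.3e}")
```

Dense solves with `B` stand in for the application of $B^{-1}$; in practice that step would be the approximate solver itself.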

Thus, the conjugate gradient method effectively replaces the condition number by its square root. The matrix $B$ is called a preconditioner. If the procedures for evaluating $Au$ and $B^{-1}r$ are given, they can be used to implement the method (35). On the other hand, when starting from the iteration (35), the procedure for evaluating $B^{-1}r$ is not given explicitly, but it can be obtained by applying one iteration (35) with starting value $u = 0$ and $r$ in place of $f$. Note that the iterations (35) fail to converge to the solution if $m_2 \ge 2$, while the conjugate gradient method always converges for symmetric and positive definite $A$ and $B$. This shows that one algorithm can be considered alternatively an iterative method or a preconditioner, which may be useful even if the associated iterative method does not converge. (Of course, one can always scale $B$ so that convergence occurs, but the scaling factor is not known in advance.) A preconditioner is in fact an approximate solver $B^{-1} : r \mapsto B^{-1}r$ for the problem $Au = r$. One or more multigrid iterations can be used as a preconditioner for a symmetric, positive definite problem (Braess, 1987; Jung et al., 1989). For a comparison of the performance of multigrid and conjugate gradients, see, for example, Gary et al. (1983).

Goldstein (1989a) has studied the solution of the discrete equations arising from the Helmholtz equation on a bounded domain in two dimensions,

$$(-\Delta - K^2)u = f \quad \text{in } \Omega \subset \mathbb{R}^2, \tag{40}$$

by the conjugate gradient algorithm applied to the normal equations after preconditioning by the V-cycle applied to the system with $K = 0$. This type of preconditioning was shown to give condition numbers bounded independently of the number of levels and of the parameter $K$ when $h_1 = h_1(K)$ is sufficiently large and the problem in $V_1$ is only iterated upon rather than solved exactly. The fundamental multigrid algorithm applied directly to the Helmholtz equation was studied by Hackbusch (1985b). Fourier analysis of Goldstein’s method has been done by Decker (1988). Goldstein (1989a) also considered the more general singular perturbation equation $-\nabla(a(x)\nabla(u))$

+ K 2 ’ h ( x ) * V u+ K 2 d ( x ) u = f ,

with $K \to +\infty$ and a parameter $s > 0$. Goldstein (1989c) has studied the multigrid method for the two- and three-dimensional exterior Helmholtz problem (40) on the complement of a domain $\Omega$, with homogeneous Neumann boundary conditions on the boundary of $\Omega$ and the growth condition $\partial u / \partial r$

-

+ (r-l

-

iK)u = O(r-’)

as r

= 1x1 + 00,

where $K > 0$. This problem is discretized by finite elements on the intersection $\Omega_R$ of the complement of $\Omega$ with the ball of radius $R$. The size of the elements grows linearly with the distance $r$ from the origin. Goldstein uses the smoother (3) with the matrix $C_k$ given as the stiffness matrix of the weighted inner product

IV. METHODS BASED ON SPACE DECOMPOSITION

The methods reviewed in this section are based on the concept of decomposition of the space into a number of subspaces,

$$V = U_1 \oplus \cdots \oplus U_m, \tag{41}$$

where each subspace $U_k$ is the range of a prolongation mapping $J_k$ of full rank from a lower dimensional space $W_k$,

$$J_k : W_k \to U_k, \qquad U_k = \operatorname{range} J_k.$$

Any function $u \in V$ is then split uniquely into

$$u = u_1 + \cdots + u_m, \qquad u_k = J_k w_k \in U_k, \quad w_k \in W_k, \tag{42}$$

and one can construct an iterative method or a preconditioner based on solving for the components $u_k$ separately. Define $B$ by

$$v^T B u = \sum_{k=1}^{m} v_k^T B_k u_k, \tag{43}$$

where $u_k$ and $v_k$ are defined by the decomposition (42) and $B_k$ are suitable approximations of $A$ on $U_k$, symmetric and positive definite, possibly $B_k = A$. Now to solve for $u = B^{-1}r$, one needs to solve the variational problem

$$\sum_{k=1}^{m} v_k^T B_k u_k = \sum_{k=1}^{m} v_k^T r \qquad \text{for all } v \in V.$$

Choosing the test function $v$ so that in turn only one $v_k$ is nonzero, we get the following algorithm:

Algorithm 2 Space Decomposition Preconditioner

Let $r$ be given. Compute

$$u = \sum_{k=1}^{m} J_k \left(J_k^T B_k J_k\right)^{-1} J_k^T r. \tag{44}$$

It is clear that the terms in the sum (44) can be computed independently and therefore in parallel, which was indeed the motivation underlying some of the methods reviewed in this section. Using Algorithm 2 in iterations, we get the following algorithm:

Algorithm 3 Space Decomposition Iteration

1. For all $k = 1, \ldots, m$,
a. Calculate the residual $r$ projected onto the space $W_k$,
$$r_k = J_k^T (f - Au).$$
b. Solve the auxiliary problem on space $W_k$,
$$w_k = (J_k^T B_k J_k)^{-1} r_k. \tag{45}$$
c. Find the prolongation of the solution, $u_k = J_k w_k$.
2. Replace the iterate,
$$u \leftarrow u + \sum_{k=1}^{m} u_k.$$
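Algorithms 2 and 3 can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the setting of any of the methods reviewed here: the prolongations $J_k$ are taken to be injections of coordinate blocks (so the method reduces to block Jacobi), $B_k = A$, $A$ is the one-dimensional discrete Laplacian, and all function names are hypothetical.

```python
import numpy as np

def space_decomposition_preconditioner(r, prolongations, B_blocks):
    """Algorithm 2: u = sum_k J_k (J_k^T B_k J_k)^{-1} J_k^T r."""
    u = np.zeros_like(r)
    for J, Bk in zip(prolongations, B_blocks):
        rk = J.T @ r                              # residual restricted to W_k
        wk = np.linalg.solve(J.T @ Bk @ J, rk)    # auxiliary problem on W_k
        u += J @ wk                               # prolongation back to U_k
    return u

def space_decomposition_iteration(A, f, prolongations, B_blocks, sweeps):
    """Algorithm 3: repeat u <- u + sum_k u_k with the current residual f - A u."""
    u = np.zeros_like(f)
    for _ in range(sweeps):
        u = u + space_decomposition_preconditioner(f - A @ u, prolongations, B_blocks)
    return u

n, m, block = 12, 3, 4
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
identity = np.eye(n)
# Assumed decomposition: V is split into m coordinate blocks of equal size.
prolongations = [identity[:, k * block:(k + 1) * block] for k in range(m)]
B_blocks = [A] * m                                # the choice B_k = A

f = np.ones(n)
u = space_decomposition_iteration(A, f, prolongations, B_blocks, sweeps=300)
rel_res = np.linalg.norm(f - A @ u) / np.linalg.norm(f)
print(f"relative residual after 300 sweeps: {rel_res:.2e}")
```

The loop over `k` inside the preconditioner has no data dependence between subspaces, which is the source of the parallelism mentioned above. For a general decomposition the plain iteration need not converge without scaling, so the same routine would more safely be used as the preconditioner in conjugate gradients.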

For the effectiveness of these methods, it is essential to keep the condition number $\kappa$ from Eq. (37) as small as possible. To estimate $m_1$ and $m_2$, one usually proceeds in two steps. In the first one, the effect of the decomposition (42) is bounded by

and

Let the effect of the replacement of $J_k^T A J_k$ by $B_k$ be bounded by

$$\frac{1}{C_3}\, J_k^T A J_k \le B_k \le C_4\, J_k^T A J_k. \tag{48}$$

Then it follows from Eqs. (46) to (48) that

$$\kappa \le C_1 C_2 C_3 C_4. \tag{49}$$

A. Hierarchical Bases and Multilevel Preconditioners

The hierarchical bases method of Yserentant (1986a, 1989) starts with a sequence of nested finite element spaces $V_k$ as in the fundamental multigrid algorithm, based on linear finite elements in two dimensions. The splitting (42) is defined by

$$u = I_0 u + \sum_{k=1}^{m} (I_k - I_{k-1}) u, \tag{50}$$

where $I_k : V_m \to V_k$ are interpolation mappings, that is, if $\bar u_k = I_k u_m$, then $\bar u_k \in V_k$ and $\bar u_k(x_j) = u_m(x_j)$ for all nodes $x_j$ of the triangulation associated with the finite element space $V_k$. Note that $I_k$ are projections and that $I_k I_l = I_l I_k = I_l$ if $l < k$, so $I_k - I_{k-1}$ are also projections. Thus Eq. (42) is defined correctly and

$$u_0 = I_0 u, \qquad u_k = (I_k - I_{k-1}) u, \quad k = 1, \ldots, m.$$

The space

$$U_k = (I_k - I_{k-1}) V$$

is the space of finite element functions on level $k$ whose values at the nodes of coarser triangulations are zero. It has been known in the multigrid literature that the stiffness matrix $A_k$ is well-conditioned on such spaces $U_k$ (Axelsson and Gustafsson, 1983; Braess, 1981), see also Section II.F, so the approximations $B_k$ to $A$ on $U_k$ are defined simply as the identity on $U_k$. Another important ingredient is the discrete Sobolev inequality (Yserentant, 1986a; Wendland, 1979)

$$\|u\|_{L^\infty(T)} \le \text{const} \cdot |\log h|^{1/2}\, \|u\|_{H^1(T)} \tag{51}$$

for all functions $u$ such that $u$ is continuous and piecewise linear on a subdivision of a fixed triangle $T$ into triangles of size at most $h$ and interior angles bounded away from zero. The inequality (51) gives the upper bound

$$|||u_k|||^2 \le \text{const} \cdot (m - k + 1)\, |||u|||^2,$$

giving (46) with $C_1 = \text{const} \cdot \log^2 h_k$. Yserentant obtained the lower bound (47) with $C_2 = \text{const}$ by estimating the angles between the spaces $U_k$ in the energy inner products, obtaining for the condition number $\kappa = \text{const} \cdot \log^2 h_k$. Yserentant (1986a) has shown that the method can be implemented with an optimal amount of computational work because solving for $B^{-1}r$ involves only the application of the interpolation operators. The name “hierarchical basis” has been motivated by the interpretation of the method as expressing the stiffness matrix in terms of new basis functions such that the spaces $U_k$ are spanned by subsets of the new basis functions. Then the method is just a block-diagonal preconditioning, and the operators $J_k$ are used to pass from the original standard nodal basis to the new hierarchical basis and back. Yserentant (1986a) extended the theory to indefinite problems by splitting the linear system in two parts, a low dimensional one on the space $V_0$, which is independent of the number of levels, and a high dimensional system on the orthogonal complement (in the energy inner product) of $U_0$, which is positive definite. Bank et al. (1988) have extended the hierarchical basis approach from a block Jacobi method to a method of the symmetric block Gauss-Seidel type, called “the hierarchical multigrid V-cycle.” It can be formulated as follows, using the general framework of this section:

Algorithm 4 Space Decomposition V-Cycle

Let the iterate $u$ be expressed in the form $u = \sum_{k=0}^{m} u_k$, $u_k \in U_k$, and let $D_k$ be an approximation to $J_k^T A J_k$. For all $k = m, \ldots, 0, \ldots, m$,
a. Calculate the residual $r$ projected onto the space $W_k$,
$$r_k = J_k^T (f - Au).$$
b. Solve the auxiliary problem on space $W_k$,
$$D_k w_k = r_k.$$
c. Correct $u_k$ by the prolongation of the solution,
$$u_k \leftarrow u_k + J_k w_k.$$

Bank et al. (1988) use at the coarsest level $D_0 = J_0^T A J_0$; that is, the problem is solved exactly there. On other levels, they implement $D_k^{-1}$ by one or more symmetric Gauss-Seidel iterations for the system

$$(J_k^T A J_k) w_k = r_k.$$
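The structure of Algorithm 4 can be sketched on a one-dimensional model. The example below is an illustration under assumptions that go beyond the text: piecewise linear elements on a uniform grid of the unit interval, the hierarchical hat functions as the columns of $J_k$, and exact local solves $D_k = J_k^T A J_k$ on every level instead of Gauss-Seidel sweeps. All names are hypothetical. In one dimension the hierarchical basis functions are mutually orthogonal in the energy inner product of $-u''$, so a single cycle already returns the exact solution; this is a special property of the model and does not carry over to two dimensions.

```python
import numpy as np

def hierarchical_prolongations(levels):
    """Columns of J_k: nodal values of the level-k hierarchical hat functions."""
    n = 2 ** levels - 1
    x = np.arange(1, n + 1) / (n + 1)                  # interior fine-grid nodes
    prolongations = []
    for k in range(levels):
        h = 0.5 ** (k + 1)                             # half-width of a level-k hat
        centers = (2 * np.arange(2 ** k) + 1) * h      # nodes that are new on level k
        J = np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers[None, :]) / h)
        prolongations.append(J)
    return prolongations

def space_decomposition_vcycle(A, f, u, prolongations, local_solvers):
    """One sweep over k = m, ..., 0, ..., m with immediate updates of the iterate."""
    m = len(prolongations) - 1
    order = list(range(m, -1, -1)) + list(range(1, m + 1))
    for k in order:
        J = prolongations[k]
        rk = J.T @ (f - A @ u)          # step a: projected residual
        wk = local_solvers[k](rk)       # step b: D_k w_k = r_k
        u = u + J @ wk                  # step c: correct the component in U_k
    return u

def exact_solver(D):
    return lambda rk: np.linalg.solve(D, rk)

levels = 3
Js = hierarchical_prolongations(levels)
n = 2 ** levels - 1
A = (n + 1) * (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))   # stiffness matrix
local_solvers = [exact_solver(J.T @ A @ J) for J in Js]

f = np.sin(np.arange(1, n + 1))
u = space_decomposition_vcycle(A, f, np.zeros(n), Js, local_solvers)
print("error after one cycle:", np.linalg.norm(u - np.linalg.solve(A, f)))
```

Replacing `exact_solver` on the levels $k > 0$ by a few symmetric Gauss-Seidel sweeps would give the variant attributed above to Bank et al. (1988).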

The theory for this type of method is obtained from estimates similar to those in Eqs. (46) to (49) (Bank et al., 1988), and it is an extension of the standard theory for Gauss-Seidel and block Gauss-Seidel methods (Young, 1971). A related method called multilevel preconditioners was developed by Bramble et al. (1990). It is based on the splitting

where $Q_k : V \to V_k$ are the $L^2$-like projections onto $V_k$, given by

$$(Q_k u, v) = (u, v) \quad \text{for all } u \in V,\ v \in V_k, \tag{53}$$

and $(\cdot,\cdot)$ is a weighted $L^2$ inner product. Bramble et al. proved that the condition number is bounded by $\text{const} \cdot m^2$ for two-dimensional problems under the assumption that

with C ( k )a constant. This property does not depend on elliptic regularity, but it cannot be verified locally as the two-level assumptions (33) and (34) (Xu, 1989). Note that the bound (54) with C ( k )= c0nst.m follows from Eq. (51). Under an assumption equivalent to the approximation property (27), Bramble et al. show that the bound on the condition number can be improved to const. m’”. For a discussion of the weaker condition (54), which can hold even in the case of strongly discontinuous coefficients, see Xu (1989). Following Yserentant (1989),the upper bound (46) can be also derived from the stability of L2 projections onto finite element spaces (Crouzeix and Thomee, 1987). The common form above for both the hierarchical multigrid method and multilevel preconditioners was noted by Yserentant (1989), who also gave alternative bounds on the condition numbers of both methods and has shown that if suitable weights are introduced in the inner products then the hierarchical basis method does not deteriorate for strongly discontinuous


coefficients. Yserentant (1989) obtained estimates for the multilevel preconditioners that show that the condition number can be bounded in terms of a local variation of the coefficients. The main disadvantage of the hierarchical basis method is that it does not extend to more than two dimensions, because then the discrete Sobolev inequality (51) does not hold any more with only a logarithmic factor, and the upper bound in Eq. (46) deteriorates (Yserentant, 1986a). The multilevel preconditioners of Bramble et al., however, can be extended to more dimensions. The two-level method of Axelsson and Gustafsson (1983) also does not rely on regularity and uses a strengthened Cauchy inequality. This method uses a type of approximate factorization for preconditioning, and it was recently further developed by Axelsson and Vassilevski (1989) and Vassilevski (1989).

B. Methods with Several Coarse Grids

In the fundamental multigrid algorithm, there is only one coarse grid problem for each level. The role of this coarse grid problem is to attenuate the component of the error that falls in the range of the prolongation $I_{k-1}^{k}$. The smoothing should then remove most of the remaining error, which results in fast convergence. For some problems, however, the situation is not so simple, and one would like the coarse grid correction to be more efficient, so that the method is more robust and handles a large class of problems well. This is the motivation of the frequency decomposition method due to Hackbusch (1988). Hackbusch defines several coarse grid equations with the goal of eliminating various frequency ranges of the error. For simplicity, consider the simple five-point discretization of the Poisson equation given by the stencil

In two dimensions, Hackbusch defines four coarse grid problems with the corresponding prolongations given by the stencils

='I

Joo

JOl

=

1 2 1 4 21, 4 1 2 1

J,,

=:[

:i

I:] 2

; -1

-2

-1

1 -2

-1


The coarse grid operators are given by the variational formula (24). Thus the coarse grid operator for the prolongation $J_{00}$ will have the same form as the original operator (55), and the corresponding coarse grid system can be solved (approximately) by the recursive application of the same method. The other three coarse grid operators are not consistent with the differential equation, and they are well conditioned. As a result, those three problems can be solved with sufficient precision by any of a number of standard iterative methods, and the total computational complexity of the multigrid method is preserved. Hackbusch (1989b) gave Fourier analysis of the behavior of the method for a model problem. The method performs well even for some strongly anisotropic and hyperbolic problems (Hackbusch, 1989a). However, for the strongly anisotropic problem

with n >> b > 0, one of the coarsegrid problems is ill-conditioned and has itself to be solved by a multilevel scheme, giving rise to more complicated recursion. The fact that the coarse grid has a smaller number of points is essential for the optimal complexity of the multigrid algorithm in the classical sense. On a parallel computer with a large number of processors, there may be on coarse grids more processors available than grid points. Frederickson and McBryan (1988) use multiple identical grids to keep all processors busy on all levels for a beneficial purpose. On a uniform grid and with a constant coefficient operator A , given by a stencil, they use the smoothing (3) with the operator c, and prolongations 1 ; - given by stencils of the form

Unlike in the Fourier analysis in Section II.B, there is no restriction operator evaluated only on the coarse grid. The “coarse grid equation” then is defined on the same fine grid, but it decouples into four equations defined on four subgrids with mesh size twice as big. The restriction operator is simply the identity. The “coarse grid operator” is then defined by the same stencil as the original operator but with stepsize twice as big; thus, if A , is given by the fivepoint scheme (16), the coarse grid operator is (when expressed on the same grid) 0 0 1 0 0 0 0 0 0 0 1 1 0 4 0 1 . A k - , =4h:- I 0 0 0 0 0 0 0 1 0 0

:

1


Because all operators involved are given by stencils on the same fine grid, the Fourier analysis of the method is particularly simple, and the symbols of the operators are just scalar functions of the (vector) frequency rather than matrix functions of frequency as in Section II.B. Frederickson and McBryan then calculate exactly the two-grid convergence factors and optimize the parameters in the stencils (56), obtaining lower convergence factors than for the fundamental multigrid algorithm. Chan and Tuminaro (1989) developed further analysis of this algorithm.

C. Using Symmetry: Domain Reduction Methods

The Domain Reduction Method by Douglas is of the general form of Algorithm 3, where the subspaces $U_k$ in Eq. (41) are defined by symmetries of the domain. Douglas and Miranker (1988) have stated the condition for Algorithm 3 to converge in one iteration as $A P_k = P_k A$, where $P_k$ are the projections corresponding to the decomposition (41). Douglas and Smith (1989) have shown how to apply the method to a range of problems and various boundary conditions on a square. In Brezzi et al. (1989), it is shown how to get further symmetries, which allow one to obtain a nontrivial decomposition of the problem on a cube. Results of parallel implementations, further theory for the case of nonexact solvers for the auxiliary problems (45), and a decomposition for three-dimensional problems can be found in Douglas and Mandel (1989). We explain the principle of the method on a simple example. Consider the square $S = (-1, 1) \times (-1, 1)$ and a space $V$ of grid functions on a uniform grid on $S$ (or finite element functions defined using a triangulation obtained from the same grid on $S$), which have zero values on the boundary $\partial S$. Let

+

+

The functions in one of the spaces $U_{ij}$ are determined by their values on a smaller square, say, $(0, 1) \times (0, 1)$, and therefore by a smaller number of degrees of freedom. If $L$ is a homogeneous linear differential operator with constant coefficients, then $L U_{ij} \subset U_{ij}$, and the algorithm will converge in one iteration. In the general case of variable coefficients, we obtain an iterative method. The problems in the spaces $U_{ij}$ cannot always be decomposed further in the same way, because the definitions of the subproblems impose different boundary conditions on the sides of the smaller square $(0, 1) \times (0, 1)$. The decomposition introduced in Brezzi et al. (1989) uses more general symmetries than just flipping the sign to decompose the problem on a square into eight subproblems of the same size but a variety of shapes. In Douglas and Mandel (1989), the problem on a cube is decomposed into 64 subproblems. Because symmetries of a domain form a group, it is natural to interpret the method from the point of view of group theory. Allgower et al. (1990) found independently a decomposition on a square into six subproblems (not all of them of the same size). Douglas and Mandel (1990) gave an interpretation of the domain reduction method in terms of the representation of a group in its algebra and gave conditions for the existence of the decomposition of the problem into subproblems on domains of the same size and shape.
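The principle can be reproduced on a one-dimensional analogue. The sketch below is an assumption-laden illustration rather than the two-dimensional construction described above: the domain is the interval $(-1, 1)$ with a uniform grid symmetric about the origin, the only symmetry used is the reflection $x \mapsto -x$, and the operator is the constant coefficient discrete Laplacian. The variable names are hypothetical. It checks the commutation condition $A P_k = P_k A$ and that one pass of the space decomposition solve over the even and odd subspaces gives the exact solution.

```python
import numpy as np

n = 9                                              # interior nodes, symmetric about 0
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
identity = np.eye(n)
R = identity[::-1]                                 # reflection x -> -x on grid functions

# Projections onto the even and odd grid functions.
P_even = 0.5 * (identity + R)
P_odd = 0.5 * (identity - R)
commutes = bool(np.allclose(A @ P_even, P_even @ A) and np.allclose(A @ P_odd, P_odd @ A))

# Prolongations from "half-interval" unknowns to the even and odd subspaces.
half = n // 2
J_even = np.column_stack(
    [identity[:, i] + identity[:, n - 1 - i] for i in range(half)] + [identity[:, half]]
)
J_odd = np.column_stack([identity[:, i] - identity[:, n - 1 - i] for i in range(half)])

f = np.arange(1.0, n + 1)
u = np.zeros(n)
for J in (J_even, J_odd):                          # the two subproblems are independent
    u += J @ np.linalg.solve(J.T @ A @ J, J.T @ f)

exact = np.linalg.solve(A, f)
print("A commutes with the projections:", commutes)
print("error of the one-pass solve:", np.linalg.norm(u - exact))
```

The two subproblems have 5 and 4 unknowns instead of 9, and they can be solved in parallel. With variable coefficients the commutation fails in general and the same loop becomes one step of an iteration.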

V. MULTIGRID IN ELASTICITY

The linear elasticity problem for an isotropic material is the problem of finding a displacement field $u$ in the space $V \subset (H^1(\Omega))^3$ so that the variational formulation of Hooke's law is satisfied,

$$a(u, v) = f(v) \quad \text{for all } v \in V, \tag{57}$$

where $a(\cdot,\cdot)$ is the bilinear form defined by

$$a(u, v) = \int_\Omega \Big( \lambda\, \operatorname{div} u\, \operatorname{div} v + 2\mu \sum_{i,j} e_{ij}(u)\, e_{ij}(v) \Big)\, dx \tag{58}$$

and $f$ is a linear functional that incorporates the boundary constraints and the load (Nečas and Hlaváček, 1981). The parameters $\lambda$ and $\mu$ are the Lamé coefficients of the material. The Poisson ratio

$$\sigma = \frac{\lambda}{2(\lambda + \mu)}$$

approaches $1/2$ for nearly incompressible materials. Note that in this case the first term in Eq. (58) is essentially a penalty term for the incompressibility constraint $\operatorname{div} u = 0$. If $\sigma$ is close to $1/2$, discretizations of the elasticity problem suffer from the locking effect, and the discretization error is large even for small $h$. A straightforward application of the fundamental multigrid algorithm to a finite element discretization of the elasticity problem then suffers from a degradation of convergence caused by ill-conditioning essentially due to the penalty term. A similar locking phenomenon occurs for linear elasticity of a beam or a plate (Arnold, 1981), where the locking is associated with the beam or plate thickness approaching zero. Some multigrid methods for elasticity based on a mixed formulation that are currently under development and


promise to exhibit convergence independent of elasticity parameters are reviewed in Section VI.B. Parsons and Hall (1990a, 1990b) have implemented the fundamental multigrid algorithm for solid three-dimensional multigrid models with linear elements using the variational approach with natural imbedding of the finite element spaces, and also studied vectorization on the Alliant FX/8 (Parsons, 1989). Jung (1987) has considered a version of the multigrid algorithm for linear elasticity discretized by linear elements in two dimensions and calculated (somewhat pessimistically) two-level bounds from the strengthened Cauchy inequality. Jung obtained multilevel bounds by recursion, which deteriorate fast for an increasing number of levels. For the Poisson ratio $\sigma \to 1/2$, the two-level bounds deteriorate, as could be expected. Kočvara and Mandel (1987) have developed a two-level method for linear elasticity in three dimensions, with the “fine grid” given by quadratic elements and the “coarse grid” by linear elements. Smoothing was by block Gauss-Seidel iteration, with a 3 x 3 block corresponding to the components of the displacement at each node. Algebraic estimates of the two-level convergence rate were computed, but they proved to be much too pessimistic. However, the two-level method has outperformed a frontal solver for a complicated problem with several hundred elements. Farhat (1989) and Farhat and Sobh (1989) proposed a two-level preconditioning for linear elasticity in three dimensions. The stiffness matrix $A$ for a nodal basis on a fine mesh is partitioned into blocks

where the subscript c means coarse grid nodes and f means the remaining fine grid nodes. Farhat and Sobh propose the approximate factorization

where $A_{2h}$ is the stiffness matrix for the coarse mesh. Thus, in each iteration, one has to solve a system with the matrix $B$, which involves as one of the steps solving a subsystem with the coarse grid matrix $A_{2h}$. The whole process can then be interpreted as a two-grid iteration with operator-defined interpolation (Farhat and Sobh, 1989). The method was tested on several large three-dimensional problems with both solid and shell elements. Braess (1987) presented an application of multigrid to elasticity in two dimensions and noted that using multigrid as a preconditioner for conjugate gradients may help to overcome the effect of locking for a thin plate


(degradation of convergence observed for the thickness approaching zero). A multigrid-type preconditioner for a plate was considered by Peisker et al. (1990), but truly multigrid independence of h was not achieved. Their condition number deteriorates like O(h-’), which is a significant improvement over the O(h-4) behavior exhibited by the discrete equations themselves. For multigrid-like methods for three-dimensional elasticity discretized by high order elements, see Section VI1.B. VI. MULTIGRID FOR MIXED PROBLEMS Mixed problems like the Stokes equation -div(AVu)

+ Vp = f ,

divu = 0

give rise to indefinite linear systems with the matrix of the form

Hackbusch (1980) analyzed multigrid methods for a mixed problem in this form using Richardson iteration on the normal equations as a smoother to guarantee convergence of the smoother. The approximation condition now has to be formulated using a pair of spaces for both components of the solution.

A. Stokes Equations

Verfürth (1988) has given a theory for a standard finite element discretization of the Stokes equation and smoothing by Jacobi for the normal equations. The analysis was extended to the Mini element, which has low storage requirements and is easier to program (Verfürth, 1988). Maitre et al. (1985) have experimented with the Uzawa algorithm as a smoother instead of iterating on the normal equations.

B. Mixed Formulations of Elasticity

Braess and Blömer (1989) gave a multigrid method for the Timoshenko beam problem, which is (after elimination of the physical constants) as follows: For given linear functionals $f_1$, $f_2$, find $(\phi, w) \in V = (H^1(I))^2$, $I = (0, 1)$, such that

$$(\phi', \psi') + t^{-2}(\phi - w', \psi - v') = f_1(\psi) + f_2(v) \quad \text{for all } (\psi, v) \in V, \tag{60}$$

where $(\cdot,\cdot)$ is the usual $L^2$ inner product on $I$ (Arnold, 1981). Here $t$ is the (relative) thickness of the beam, and for small $t$, the problem has essentially the constraint $\phi = w'$ and thus it reduces to the biharmonic equation on $I$. Braess and Blömer used the mixed formulation of Eq. (60) obtained by introducing the auxiliary variable $\eta = t^{-2}(\phi - w')$,

$$(\phi', \psi') + (\psi - v', \eta) = f_1(\psi) + f_2(v),$$
$$(\phi - w', \xi) - t^2(\eta, \xi) = 0$$

for all $\psi$, $v$, and $\xi$. Thus, the singular perturbation problem (60) has been transformed into a well-behaved mixed problem, which can be handled by the fundamental multigrid algorithm with smoothing by Jacobi for the normal equations, and Braess and Blömer proved that its convergence factor is bounded away from one as $t \to 0$. Ziping (1988) used a related approach for a multigrid method for the Reissner-Mindlin plate discretized by the Mini element (Brezzi and Fortin, 1986) and proved that the convergence does not deteriorate for the thickness of the plate approaching zero. A multigrid method for the solid elasticity problem (57) that would not deteriorate for nearly incompressible materials is not known to the author at present, but it can be expected to be developed in the near future for a suitable mixed formulation of Eq. (57).

VII. MULTIGRID FOR HIGH ORDER AND SPECTRAL METHODS

Unlike standard finite element or finite difference methods, high order and spectral methods increase precision by increasing the order of the method rather than decreasing the mesh spacing or the element size. The natural approach to the application of multilevel ideas is then to construct the “coarse grid” problems using lower order polynomials with a smaller number of degrees of freedom.

A. Spectral and Spectral Element Methods

Spectral methods (Gottlieb and Orszag, 1977) use high order polynomials and collocation. Although the discretization matrices $A_k$ arising in spectral methods are dense, the matrix-vector product $A_k u_k$ can be implemented very efficiently using the Fast Fourier Transform, which accounts for much of the popularity of spectral methods (Gottlieb and Orszag, 1977). Zang et al. (1982a, 1982b) used the fundamental multigrid algorithm for spectral methods based on Chebyshev approximation and collocation. An efficient smoother is preconditioning by incomplete decomposition of the associated finite difference system on the Chebyshev points (Phillips, 1987). The fact that the Chebyshev points are not uniformly distributed leads to high mesh aspect ratios near the boundary, and the problem appears to be numerically locally anisotropic. Line relaxation thus increases the efficiency of smoothing (Heinrichs, 1988a, 1988b). The spectral element method (Patera, 1984; Rønquist and Patera, 1987) uses the standard variational finite element approach rather than collocation. The spectral element method uses nodal basis functions associated with Gauss-Lobatto quadrature nodes. The use of the Gauss-Lobatto nodes both for numerical quadrature and as the nodes of the finite element method allows one to factor the evaluation of the matrix-vector product $A_k u_k$, so that the stiffness matrix $A_k$ is actually never calculated explicitly. Rønquist and Patera (1987) use the imbedding definition (23) of the prolongation and simple Jacobi iteration as smoothing. Because the basis functions are associated with nodes, the $\ell^2$ norm of the coefficient vectors is equivalent to the $L^2$ norm, and much of the justification of multigrid in Section II.D still applies, at least in one dimension (Maday and Muñoz, 1988).

B. Methods for p-Version Finite Elements

The p-version finite element method (Babuška et al., 1981; Babuška and Suri, 1990) does not rely on the special location of quadrature nodes that allows for fast matrix-vector multiplication, but it is more flexible. The fast matrix multiplication in spectral and spectral element methods is based on a tensor product decomposition, and thus it allows only topologically rectangular elements. The p-version saves some computational effort by using serendipity type elements, which have a smaller number of internal degrees of freedom (Zienkiewicz, 1977) but about as good approximation properties as elements containing all the tensor product basis functions. The basis functions in the p-version finite element method are often hierarchical (Babuška et al., 1981), which suggests the use of space decomposition iterations, such as for the hierarchical basis multigrid method (Yserentant, 1986a; Bank et al., 1988). However, the condition numbers for a method completely analogous to the hierarchical basis multigrid method are not as advantageous (Mandel, 1990d), and the resulting method, although very well parallelizable, requires many iterations (Foresti et al., 1988; Brussino et al., 1989). The method developed in Mandel (1990a) is a special case of conjugate gradients preconditioned by the space decomposition (44) with $B_k = A$. There is one subspace for each edge, face, and interior in the finite element structure, and one “global” subspace $U_0$ that consists of all linear and quadratic functions. The space $U_0$ plays the role of the coarse grid, and it is instrumental in the global exchange of information in the system. The spaces $U_k$, $k > 0$, are chosen so that the constant $C_1$ in Eq. (46) is small. The space of functions associated with a face is chosen to be orthogonal in energy to the functions in the adjacent interiors, and the space of functions for an edge is chosen to be energy orthogonal to all functions from the adjacent face and interior spaces. The condition number can be estimated by calculating numerically the constants $C_1$ and $C_2$ in Eqs. (46) and (47) using the data of one element at a time. This makes it possible to tune the method to achieve fast convergence and to use a simplified version of the estimator to obtain fast though less precise estimates of convergence at run time. In practice, the elements are given a priori. Because elements with high aspect ratio corrupt the condition number, the method copes with this effect by heuristically decreasing the number of the subspaces $U_k$ by merging them into larger subspaces until the estimated condition number is acceptable. A theoretical justification for an analogous merging process in two dimensions was given by Mandel and Lett (1990). The method was applied to linear elasticity problems on general three-dimensional domains and proved to be faster and to have lower storage requirements than industry standard direct sparse solvers, but its speed of convergence deteriorates for nearly incompressible materials. An analogous method in two dimensions with the space $U_0$ consisting of linear functions was studied by Babuška et al. (1991), who proved that the condition number in two dimensions grows asymptotically only like $\log^2 p$, where $p$ is the degree of the elements. The two-dimensional method and its analysis are related to the domain decomposition method of Bramble et al. (1986). For a parallel implementation of the two-dimensional method, see Babuška and Elman (1989).
A different method was proposed and studied in Mandel (1989, 1990b, 1990c). This method constructs a global “coarse grid” problem with the number of variables per element equal to the dimension of the nullspace of the local stiffness matrix in that element. For scalar problems, this means one coarse grid variable per element, and for linear elasticity in three dimensions, six variables per element. This method is related to the domain decomposition method by Bramble et al. (1989a). A two-dimensional version of this method was analyzed by Babuška et al. (1991), who proved that the condition number again grows asymptotically as $\log^2 p$. As in the preceding method, the condition numbers can be estimated a priori using the data of one element at a time. All methods give condition numbers that deteriorate for nearly incompressible materials, $\sigma \to 1/2$, cf. Section V. The fact that the p-version finite element method avoids locking for $\sigma \to 1/2$ (Vogelius, 1983) motivates ongoing research in iterative methods that would perform uniformly well for $\sigma$ close to $1/2$.

VIII. EIGENVALUE PROBLEMS

Multigrid can be applied in several ways to an eigenvalue problem arising by discretization of the differential eigenproblem

$$Lu = \lambda u, \tag{61}$$

where $L$ is an elliptic operator. Discretizing Eq. (61) by finite differences or finite elements yields the eigenvalue problem in $V_m$,

$$A_m u_m = \lambda B_m u_m. \tag{62}$$
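As a small illustration of how Eq. (61) turns into Eq. (62), the following sketch discretizes a model problem by finite differences. The choice of $L = -d^2/dx^2$ on $(0, 1)$ with zero boundary values is an assumption made only for this example, as are the helper names `laplacian_1d`, `matvec`, and `discrete_eigenpair`. For finite differences on a uniform grid the matrix $B_m$ is taken to be the identity, and the eigenpairs of this particular matrix are known in closed form, so the sketch only checks them.

```python
import math

def laplacian_1d(n):
    """Finite difference matrix of -d^2/dx^2 on (0, 1) with zero Dirichlet
    boundary values, n interior points, stored as a dense list of rows."""
    h = 1.0 / (n + 1)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0 / h**2
        if i > 0:
            A[i][i - 1] = -1.0 / h**2
        if i < n - 1:
            A[i][i + 1] = -1.0 / h**2
    return A

def matvec(A, u):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, u)) for row in A]

def discrete_eigenpair(n, k=1):
    """Closed-form k-th eigenpair of the matrix returned by laplacian_1d(n)."""
    h = 1.0 / (n + 1)
    lam = 4.0 / h**2 * math.sin(k * math.pi * h / 2.0) ** 2
    u = [math.sin(k * math.pi * (j + 1) * h) for j in range(n)]
    return lam, u

n = 50
A = laplacian_1d(n)
lam, u = discrete_eigenpair(n)
Au = matvec(A, u)
# Residual of A u = lam * B u with B equal to the identity.
residual = max(abs(Au[j] - lam * u[j]) for j in range(n))
# The smallest eigenvalue of the continuous problem is pi^2.
gap = abs(lam - math.pi**2)
```

The discrete eigenvalue differs from the continuous one by an amount of order $h^2$, which is why `gap` is small but not zero.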

Bank (1982) used the multigrid method to find an approximate solution to the problem in each step of the inverse iteration method

$$(A_m - \tau B_m)\,u_m^{k+1} = B_m u_m^{k},$$

controlling the shift $\tau$ so that it does not approach the eigenvalue too closely, and proved that multigrid preserves its efficiency in this case; that is, the convergence factor is bounded away from one. The method of Brandt et al. (1983) consists of reformulating the eigenvalue problem as a system of nonlinear equations and using Ritz projections to calculate several eigenvalues and eigenvectors simultaneously. The method of Hackbusch (1979) can be formulated as the fundamental multigrid algorithm applied to Eq. (62) understood as a homogeneous problem in the unknown $u_m$, with $\lambda$ replaced by different $\lambda_k$ at different levels and with the coarse grid corrections projected so that they are perpendicular to the already calculated approximate eigenvector. Hackbusch (1985a) applies the multigrid method to find several eigenvalues and the corresponding eigenvectors simultaneously by solving the nonlinear system for unknown $U_m$ and $T_m$,

$$A_m U_m - U_m T_m = 0, \qquad U_m^T U_m = I,$$

where $U_m$ is a matrix whose columns span an invariant subspace of $A_m$, and $T_m$ is an upper triangular matrix. The method studied by Mandel and McCormick (1989) finds the smallest eigenvalue of Eq. (62) in the case when $A_m$ and $B_m$ are symmetric by minimizing


the Rayleigh quotient

$$RQ(u_m) = \frac{u_m^T A_m u_m}{u_m^T B_m u_m}.$$

For a given approximation $u_m$, the coarse grid problem on level $m - 1$ is

$$\text{minimize } RQ(u_m + I_{m-1}^{m} u_{m-1}). \tag{63}$$

The problems on levels $m - 2$ to 0 are defined analogously. The key observation is that problem (63) can be formulated using only level $m - 1$ quantities and restrictions of level $m$ quantities. Thus, problems lower in the hierarchy can be operated on with the usual efficiency. The smoother is chosen to be coordinate descent. A linearization of this method close to the solution is simply iterating on Eq. (62) as a homogeneous system with a constant $\lambda$, and the method is thus related to that of Hackbusch (1985a). The method in Mandel and McCormick (1989) is guaranteed to converge to an eigenvalue, and in practice it always converges to the minimal eigenvalue. It was applied to problems with local refinements for differential eigenproblems of the form $Lu = \lambda f u$, where $f$ is a function that is zero or very small in a part of the domain. Such problems occur in the study of neutron diffusion in a nuclear reactor.
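The coordinate descent smoother mentioned above can be sketched on a single level as follows. This is not the multilevel method of Mandel and McCormick (1989), only a minimal one-grid illustration under stated assumptions: dense symmetric matrices, $B$ positive definite, and a 1D model matrix chosen for the example. All function names are hypothetical. For each coordinate $i$, the stationary points of $RQ(u + t e_i)$ are the roots of a quadratic in $t$, and the step is taken only if it lowers the quotient.

```python
import math

def matvec(A, u):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, u)) for row in A]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def rayleigh_quotient(A, B, u):
    return dot(u, matvec(A, u)) / dot(u, matvec(B, u))

def coordinate_descent_sweep(A, B, u):
    """One sweep over all coordinates; A, B symmetric, B positive definite."""
    n = len(u)
    u = list(u)
    Au, Bu = matvec(A, u), matvec(B, u)
    a, b = dot(u, Au), dot(u, Bu)            # RQ(u) = a / b
    for i in range(n):
        g, hb = Au[i], Bu[i]
        alpha, beta = A[i][i], B[i][i]
        # d/dt RQ(u + t*e_i) = 0  <=>  c2*t^2 + c1*t + c0 = 0
        c2 = alpha * hb - g * beta
        c1 = alpha * b - a * beta
        c0 = g * b - a * hb
        disc = max(c1 * c1 - 4.0 * c2 * c0, 0.0)
        q = -0.5 * (c1 + math.copysign(math.sqrt(disc), c1))
        roots = []
        if q != 0.0:
            roots.append(c0 / q)
        if c2 != 0.0:
            roots.append(q / c2)
        best_t, best_rq = 0.0, a / b
        for t in roots:
            num = a + 2.0 * g * t + alpha * t * t
            den = b + 2.0 * hb * t + beta * t * t
            if den > 0.0 and num / den < best_rq:
                best_t, best_rq = t, num / den
        t = best_t
        if t != 0.0:
            a += 2.0 * g * t + alpha * t * t
            b += 2.0 * hb * t + beta * t * t
            u[i] += t
            for j in range(n):               # symmetric: column i equals row i
                Au[j] += t * A[i][j]
                Bu[j] += t * B[i][j]
    scale = math.sqrt(dot(u, matvec(B, u)))  # RQ does not depend on scaling
    return [x / scale for x in u]

# Model problem (an assumption for this example): 1D Laplacian, B = identity.
n = 10
h = 1.0 / (n + 1)
A = [[(2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0) / h**2
      for j in range(n)] for i in range(n)]
B = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

u = [1.0] * n
rq_start = rayleigh_quotient(A, B, u)
for _ in range(200):
    u = coordinate_descent_sweep(A, B, u)
rq_end = rayleigh_quotient(A, B, u)
lam_min = 4.0 / h**2 * math.sin(math.pi * h / 2.0) ** 2  # known smallest eigenvalue
```

Because every accepted step lowers the quotient, `rq_end` can never exceed `rq_start`. On one grid the number of sweeps needed grows as the mesh is refined, which is the cost that the coarse grid problems are meant to remove.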

IX. MULTIGRID AND PARALLEL COMPUTING

Use of multigrid techniques on modern supercomputers requires parallel implementation to take advantage of advanced architectures. Because the fundamental multigrid algorithm has the asymptotically optimal parallel complexity of $\log n$ parallel steps per iteration, most research has concentrated on effective implementation of the multigrid algorithm (Thole and Trottenberg, 1987). For recent implementations, see, for example, Parsons (1989) and Adams (1989b) for vectorization of multigrid, and Briggs et al. (1988) and Hempel and Lemke (1989) for message passing parallel architectures. Some special multigrid related methods suitable for parallel implementation are reviewed in Section IV. Douglas and Mandel (1989) give results for a parallel implementation of one of those methods. For a comprehensive survey of parallel approaches in multigrid, see Chan and Tuminaro (1987). Douglas et al. (1990) and Mandel and Miranker (1990) proposed and simulated a hybrid two-level method, where analog circuitry plays the role of the coarse grid, and digital circuits perform a smoothing step by simple Jacobi iteration completely in parallel.
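The reason a Jacobi step can be carried out completely in parallel is visible in a few lines of code: every new component is computed from the old iterate only. The sketch below is a serial illustration under assumptions of our own (a dense matrix, a small 1D model problem, the name `jacobi_sweep`); it does not model the hybrid analog/digital method itself. The optional weight `omega` is an addition for the example, with `omega = 1` giving plain Jacobi.

```python
import math

def jacobi_sweep(A, f, u, omega=1.0):
    """Return the next Jacobi iterate for A u = f.

    Each new component depends only on the old vector u, so the n updates
    are independent and could be computed simultaneously.
    """
    n = len(u)
    residual = [f[i] - sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
    return [u[i] + omega * residual[i] / A[i][i] for i in range(n)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Model problem (assumed for illustration): tridiagonal matrix (-1, 2, -1).
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
f = [0.0] * n                       # the exact solution is the zero vector
u0 = [(-1.0) ** i * (i + 1) for i in range(n)]

u1 = jacobi_sweep(A, f, u0)         # one step
u = u0
for _ in range(200):                # many steps
    u = jacobi_sweep(A, f, u)
```

Since the right-hand side is zero, the iterate itself is the error, and its norm shrinks with every step for this matrix.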


X. SOME OTHER MULTIGRID DEVELOPMENTS

For multigrid for integral equations and fast multilevel evaluation of integral operators, see Hackbusch and Nowak (1989) and Brandt and Lubrecht (1990). Multigrid for optimization problems was studied in Gelman and Mandel (1990), providing a general framework for the earlier work in Mandel (1984a, 1984b). Multilevel methods in quantum physics systems and Monte Carlo simulations are studied by Kandel et al. (1989a, 1989b). For an expository treatment, see Brandt (1988).

XI. MULTIGRID SOFTWARE

In this section, we describe several software packages that implement multigrid methods. We have restricted the more detailed description to a few software packages that are of general interest and known to the author to be easily available and well documented. Less comprehensive information is also given about some other selected software. Some of the software is available from NETLIB. NETLIB is a depository of public domain software, which sends source code by electronic mail (Dongarra and Grosse, 1987). To obtain NETLIB software, send a message containing the line

    send index

on Internet to [email protected] in the U.S. or to netlib@nac.no in Europe. From EARN/BITNET, use the address netlib%nac.no@norunix.bitnet. NETLIB will reply with a message containing further instructions. These NETLIB sites are in operation at the time of writing of this article, though they may change or cease operation in the future.

A. PLTMG

PLTMG (Bank, 1990) is a multigrid package written by Randy Bank and a number of collaborators at the University of California at San Diego. It solves second-order elliptic equations dependent on a parameter $\lambda$,

$$-\nabla \cdot a(x, y, u, \nabla u, \lambda) + f(x, y, u, \nabla u, \lambda) = 0 \quad \text{in } \Omega, \tag{64}$$

for a general domain $\Omega$ in $\mathbb{R}^2$ and with a general form of boundary conditions. The subroutines in the package provide initial coarse triangulation of the


domain, adaptive refinement with error control, continuation, finding limit and singular points, switching between branches of the solution, and graphical display and plotting of the results. The systems of nonlinear equations arising from the discretization of Eq. (64) by piecewise linear elements are solved by damped Newton iterations (Bank and Rose, 1982) and the hierarchical basis iteration (Bank et al., 1988). A V-cycle is used as a preconditioner for a global linear iterative process, which can be either conjugate gradients for symmetric problems or biconjugate gradients for nonsymmetric ones. Sparse Gaussian elimination is used to solve the equations on the coarsest mesh. The current distribution contains (among others) drivers for X-Windows, SunView, Tektronix, and PostScript output. Additional output routines can be easily added. PLTMG can currently be obtained from NETLIB or on Internet by anonymous ftp (userid anonymous, password guest) at research.att.com in the directory /dist/pltmg. Because PLTMG is quite large, the latter method is preferred.

B. MADPACK

The package MADPACK (Douglas, 1990) was written by Craig Douglas (Duke University, currently IBM and Yale University), and it provides sparse Gaussian elimination, symmetric Gauss-Seidel, conjugate gradients or Orthomin(1) (Young, 1971), and a variety of multigrid cycling schemes. The user specifies the matrices $A_m$ and the restrictions and prolongations as sparse matrices. Several sparse formats are supported, suitable for symmetric, nonsymmetric, and almost symmetric matrices (that is, matrices that are the sum of a symmetric matrix and a very sparse nonsymmetric perturbation). MADPACK thus supports any linear partial differential equation on a general domain in any number of dimensions for which the user can construct the sparse matrices. MADPACK is a collection of portable routines written in a variant of Ratfor known as Chez77. It is available in the original form, in Fortran 77 (with the original comments), and in C. MADPACK is available from NETLIB.

C. MUDPACK

The MUDPACK package by John Adams (1989a, 1989b) at the National Center for Atmospheric Research, Boulder, Colorado (NCAR) is a collection of FORTRAN routines for the solution of elliptic partial differential


equations. It provides “black box” solvers that discretize and solve real and complex second-order elliptic equations in two dimensions on rectangles and in three dimensions on boxes. Boundary conditions can be any combination of periodic, Dirichlet, and mixed derivative. The equation and the boundary conditions are automatically discretized using second-order differences on a uniform global grid. The discretization automatically adjusts the second-order terms at coarse levels if the first-order terms dominate. The resulting system of linear equations is solved by multigrid iterations, using the V-cycle and optionally full multigrid (nested iteration). Weighted averaging is used for restrictions. Available smoothers are point, line, and plane relaxation using red-black ordering to aid vectorization and achieve faster convergence. MUDPACK is available on NCAR Cray computers and at several other supercomputing sites in the United States. It was written in portable Fortran 77 and has been tested on a variety of machines and operating systems. It vectorizes on Cray computers. Version 2.0 of MUDPACK is distributed free in relocatable binary form. A Fortran source for individual solvers will sometimes be distributed, but it may not be modified and/or distributed further, and the users will be expected to provide feedback on their performance. Those wishing to use the codes on other machines should contact

    Dr. John C. Adams
    N.C.A.R.
    P.O. Box 3000
    Boulder, CO 80307-3000
    Internet electronic mail address: johnad@ncar.ucar.edu
    phone: 303-497-1213

The latest version of MUDPACK (Adams, 1991) is incompatible with the earlier versions (Adams, 1989a, 1989b), and it contains the following improvements:

(a) the inclusion of multigrid options,
(b) the addition of fourth-order solvers,
(c) the addition of “hybrid” multigrid/direct method solvers,
(d) the allowance of more general grid sizes, and
(e) more use of equivalencing to save storage.

All the MUDPACK software is copyrighted.
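The remark that red-black ordering aids vectorization can be illustrated by a point relaxation sweep on a small grid. The sketch below is not MUDPACK code (MUDPACK is written in Fortran); it is a Python illustration under assumptions of our own: the 5-point discretization of $-\Delta u = f$ on a uniform square grid, Dirichlet data stored on the border of the array, and the hypothetical name `red_black_sweep`. The four neighbours of a point always have the other colour, so all updates within one colour are independent of each other.

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for the 5-point discretization of
    -Laplace(u) = f on a square grid; the border of u holds Dirichlet data.

    Points with (i + j) even are relaxed first, then points with (i + j) odd.
    """
    n = len(u)
    for colour in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == colour:
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1]
                                      + h * h * f[i][j])
    return u

n = 9                                  # 9 x 9 grid, 7 x 7 interior points
h = 1.0 / (n - 1)
f = [[0.0] * n for _ in range(n)]      # zero right-hand side
u = [[0.0] * n for _ in range(n)]      # zero boundary values
for i in range(1, n - 1):
    for j in range(1, n - 1):
        u[i][j] = 1.0                  # initial error in the interior

for _ in range(100):
    red_black_sweep(u, f, h)
max_error = max(abs(x) for row in u for x in row)
```

With a zero right-hand side and zero boundary values the exact solution is zero, so `max_error` measures how far relaxation alone has reduced the initial error. In a multigrid solver such a sweep would be used only for a few steps per level as a smoother.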

D. MGD1

The NAG FORTRAN Mark 13 Library (The Numerical Algorithms Group Ltd., 1988) contains Wesseling’s routine MGD1 (Wesseling, 1982a,


1982b) under the name D03EDF. This routine solves seven-diagonal systems of linear equations that arise from the discretization of an elliptic partial differential equation with variable coefficients on a uniform mesh in a rectangular region. The number of levels is determined from the highest power of two that divides the number of mesh points in each direction. The smoothing is by incomplete LU decomposition, and the cycling scheme and other technical aspects are hidden from the user. The routine works in conjunction with other NAG routines that generate the discretization, and it can handle both symmetric and nonsymmetric problems. It is robust and can handle a wide range of problems.

E. MG00

MG00 (Stüben and Trottenberg, 1982; Stüben et al., 1983) solves general elliptic PDEs in two dimensions. General distribution. Contact: Gesellschaft für Mathematik und Datenverarbeitung, St. Augustin, Germany.

F. AMG

AMG (Algebraic Multigrid) is a multigrid method that builds the coarse levels automatically, based on information contained in the system to be solved. It was successfully applied to finite element systems as well as to problems with no differential equation background at all (Ruge and Stüben, 1987). Contact: John Ruge, University of Colorado at Denver, e-mail [email protected].

G. BOXMG

BOXMG is a “black box” multigrid method, which was developed for solving finite difference systems for general diffusion/convection equations in two and three dimensions with highly discontinuous coefficients (Dendy, 1985, 1987). Contact: Joel E. Dendy, Jr., Los Alamos National Laboratory.

REFERENCES

Adams, J. C. (1989a). FMG Results with the Multigrid Software Package MUDPACK, in “Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods” (J. Mandel, S. F. McCormick, J. E. Dendy, Jr., C. Farhat, G. Lonsdale, S. V. Parter, J. W. Ruge, and K. Stüben, eds.), Philadelphia, SIAM, pp. 1-12.


Adams, J. C. (1989b). “MUDPACK: Multigrid Portable FORTRAN Software for the Efficient Solution of Linear Elliptic Partial Differential Equations,” Appl. Math. Comput., 34, pp. 113-146.

Adams, J. C. (1991). “Recent Enhancements in MUDPACK, A Multigrid Software Package for Elliptic Partial Differential Equations,” Appl. Math. Comput., 43, pp. 79-94. Allgower, E., Böhmer, K., and Zhen, M. (1990). A Generalized Equibranching Lemma with Applications in $D_4 \times Z_2$ Symmetric Elliptic Problems: Part 1, Tech. Report 9, Philipps-Universität Marburg, Marburg, Germany. Arnold, D. N. (1981). “Discretization by Finite Elements of a Model Parameter Dependent Problem,” Numer. Math., 37, pp. 405-421. Axelsson, O., and Gustafsson, I. (1983). “Preconditioning and Two-Level Multigrid Methods of Arbitrary Degree of Approximation,” Math. Comp., 40, pp. 219-242. Axelsson, O., and Vassilevski, P. S. (1989). “Algebraic Multilevel Preconditioning Methods, I,” Numer. Math., 56, pp. 157-177. Babuška, I., and Elman, H. C. (1989). “Some Aspects of Parallel Implementation of the Finite Element Method on Message Passing Architecture,” J. Comput. Appl. Math., 27, pp. 157-187.

Babuška, I., and Suri, M. (1990). “The p- and h-p Versions of the Finite Element Method.” International Conference on Spectral and High Order Methods for Partial Differential Equations, Como, Italy, June 1989; Comput. Methods Appl. Mech. Engrg., 80, pp. 5-26. Babuška, I., Szabo, B. A., and Katz, I. N. (1981). “The p-Version of the Finite Element Method,” SIAM J. Numer. Anal., 18, pp. 515-545. Babuška, I., Craig, A. W., Mandel, J., and Pitkäranta, J. (1991). “Efficient Preconditioning for the p-Version Finite Element Method in Two Dimensions,” SIAM J. Numer. Anal., to appear. Bank, R. E. (1981). “A Comparison of Two Multi-level Iterative Methods for Nonsymmetric and Indefinite Elliptic Finite Element Equations,” SIAM J. Numer. Anal., 18, pp. 724-743. Bank, R. E. (1982). “Analysis of a Multilevel Inverse Iteration Procedure for Eigenvalue Problems,” SIAM J. Numer. Anal., 19, pp. 886-898. Bank, R. E. (1990). PLTMG: A Software Package for Solving Elliptic Partial Differential Equations, User’s Guide 6.0, SIAM, Philadelphia. Bank, R. E., and Douglas, C. C. (1985). “Sharp Estimates for Multigrid Rates of Convergence with General Smoothing and Acceleration,” SIAM J. Numer. Anal., 22, pp. 617-633. Bank, R. E., and Dupont, T. (1980). Analysis of a Two-Level Scheme for Solving Finite Element Equations, Tech. Report CNA-159, Center for Numerical Analysis, University of Texas at Austin. Bank, R. E., and Dupont, T. (1981). “An Optimal Order Process for Solving Elliptic Finite Element Equations,” Math. Comp., 36, pp. 35-51. Bank, R. E., and Rose, D. J. (1982). “Analysis of a Multilevel Iterative Method for Nonlinear Finite Element Equations,” Math. Comp., 39, pp. 453-465. Bank, R. E., Dupont, T., and Yserentant, H. (1988). “The Hierarchical Basis Multigrid Method,” Numer. Math., 52, pp. 427-458. Behie, A., and Forsyth, P. A. (1983). “Multi-grid Solution of Three-Dimensional Problems with Discontinuous Coefficients,” Appl. Math. Comput., 13, pp. 229-240.
Blaheta, R. (1988). “A Multilevel Method with Overcorrection by Aggregation for Solving Discrete Elliptic Problems,” J. Comput. Appl. Math., 24, pp. 227-239. Braess, D. (1981). “The Contraction Number of a Multigrid Method for Solving the Poisson Equation,” Numer. Math., 37, pp. 387-404. Braess, D. (1987). On the Combination of Multigrid and Conjugate Gradients, in “Multigrid Methods II” (W. Hackbusch and U. Trottenberg, eds.), Vol. 1228 of Lect. Notes Math., Springer-Verlag, pp. 52-64. Procs. 2nd European Conference on Multigrid Methods, Köln, October 1985.


Braess, D. and Blomer, C. (1989).A Multigrid Method for a Parameter Dependent Problem in Solid Mechanics, Tech. Report 127/1989,Fakultat fur Mathematik der Ruhr-Universitat Bochum. Braess, D., and Hackbusch, W. (1983). “A New Convergence Proof for the Multigrid Method Including the V Cycle,” SlAM J. Numer. Anal., 20, pp. 967-915. Braess, R., and Verfiirth, R. (1988). Multi-grid Methods for Non-Conforming Finite Element Methods. Preprint Nr. 453, Universitat Heidelberg. Braess, D.,Hackbusch, W., and Trottenberg, U., eds. (1985). Advances in Multi-Grid Methods, Vol. I 1 of Notes on Numer. Fluid Mech., Friedr. Vieweg & Sohn. Bramble, 3. H.,and Pasciak, J. E. (1987).“New Convergence Estimates for Multigrid Algorithms,” Math. Comp., 49, pp. 31 1-329. Bramble, J. H., Pasciak, J. E., and Schatz, A. H. (1986).“The Construction of Preconditioners for Elliptic Problems by Substructuring, I,” Math. Comp., 47, pp. 103-134. Bramble, J. H., Pasciak, J. E., and Xu, J. (1988a). “The Analysis of Multigrid Algorithms for Nonsymmetric and Indefinite Elliptic Problems,” Math. Comp., 51, pp. 389-414. Bramble, J. H., Pasciak, J. E., and Schatz, A. H. (1989a).“The Construction of Preconditioners for Elliptic Problems by Substructuring, IV,” Math. Comp., 53, pp. 1-24. Bramble, J. H., Pasciak, J. E., Wang, J., and Xu,J. (1989b). Conuergence Estimates for Multigrid Algorithms Without Regularity Assumptions. Cornell University, Preprint. Bramble, J. H., Pasciak, J. E., Wang, J., and Xu, J. (1989~).Convergence Estimates for Product Iterative Methods with Application to Domain Decomposition and Multigrid. Cornell University, Preprint. Bramble, J. H., Pasciak, J. E., and Xu,J. (1990), Parallel Multilevel Preconditioners. Math. Comp. 55, 1-22. Bramble, J. H., Pasciak, J. E., and Xu, J. (1991). The Analysis of Multigrid Algorithms with Nonnested Spaces or Non-inherited Quadratic Forms, Math. Comput. 56, 1-34. Brand, K., Lemke, M., and Linden, J. (1987). 
Multigrid Bibliography, in “Multigrid Methods” (S. F. McCormick, ed.), Vol. 5 of Frontiers in Applied Mathematics, SIAM, Philadelphia. Brandt, A. (1977). “Multi-level Adaptive Solution to Boundary-Value Problems,” Math. Comp., 31, pp. 333-390. Brandt, A. (1982).Guide to Multigrid Development, in “Multigrid Methods”(W. Hackbusch and U. Trottenberg, eds.), Springer-Verlag, New York, pp. 220-312. Brandt, A. (1984).“Multigrid Techniques: 1984 Guide,” Vol. 85 of GMD Studien, Gesellshaft fur Mathematik und Datenverarbeitung, Postfach 1240, D-5205 St. Augustin, W. Germany. Brandt, A. (1986). “Algebraic Multigrid Theory: The Symmetric Case,” Appl. Math. Comput, 19, pp. 23- 56. Brandt, A. (1988). Multilevel Computations: Review and Recent Deuelopments, in “Multigrid Methods: Theory, Applications, and Supercomputing” (S. F. McCormick, ed.), Marcel Dekker, New York. Proceedings of the Third Copper Mountain Conference on Multigrid Methods. Brandt, A. (1989a). Rigorous Local Mode Analysis, in “Preliminary Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods,” University of Colorado. Brandt, A. (1989b). The Weizmann Institute Research in Multilevel Computation: 1988 Report, in “Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods” (J. Mandel, S. F. McCormick, J. E. Dendy, Jr., C. Farhat, G. Lonsdale, S. V. Parter, J. W.Ruge, and K. Stuben, eds.), SIAM, pp. 13-53. Brandt, A., and Lubrecht, A. A. (1990). “Multilevel Matrix Multiplication and Fast Solution of Integral Equations. J. Comp. Phys. 90,348-370. Brandt, A., McCormick, S., and Ruge, J. (1983). “Multigrid Methods for Differential Eigenproblems,” SIAM J. Sci. Stat. Comput., 4, pp. 244-260. Brenner, S. C. (1989a). Multigrid Methods for Nonconforming Finite Elements, in “Proceedings of


the Fourth Copper Mountain Conference on Multigrid Methods” (J. Mandel, S. F. McCormick, J. E. Dendy, Jr., C. Farhat, G. Lonsdale. S. V. Parter, J. W.Ruge, and K. Stiiben, eds.), Philadelphia, SIAM, pp. 45-65. Brenner, S . C. (1989b). “An Optimal Order Multigrid Method for PI Nonconforming Finite Elements,” Math. Comp., 52, pp. 1-15, Brenner, S . C. (1989~).“An Optimal Order Nonconforming Method for the Biharmonic Equation,” SIAM J. Numer. Anal., 26, pp. 1124-1 138. Brenner, S. C. (1990).“A Nonconforming Multigrid Method for the Stationary Stokes Equation.” Math. Comp. 55.41 1-438. Brezzi, F., and Fortin, M. (1986). “Numerical Approximation of Mindlin- Reissner Plates,” Math. Comp., 47, pp. 151-158. Brezzi, F., Douglas, C. C., and Marini, L. D. (1989). “A Parallel Domain Reduction Method,” Numer. Meth. .for PDE, 5, pp. 195-202. Briggs, B., Hart, L., McCormick, S., and Quinlan, D. (1988). Multigrid Methods on a Hypercube, in “Multigrid Methods: Theory, Applications, and Supercomputing,” Marcel Dekker, New York, pp. 63-83. Proceedings of the Third Copper Mountain Conference on Multigrid Methods. Briggs, W. (1987). Multigrid Tutorial, SIAM, Philadelphia. Brussino, G., Herbin, R.,Christidis, Z., and Sonnad, V. (1989), Parallel Multilevel Finite Element Method with Hierarchical Basis Functions. Manuscript, IBM Kingston, NY. Cao, Z. H.(1988). “Convergence of Multigrid Methods for Nonsymmetric, Indefinite Problems,” Appl. Math. Comput., 28, pp. 269-288. Chan, T.F., and Tuminaro, R. (1987).A Survey qf Parallel Multigrid Algorithms, in “Proceedings of the ASM Symposium on Parallel Computations and Their Impact on Mechanics” (A. K. Noor, ed.), AMD-Vol. 86, New York, The American Society of Mechanical Engineers. Chan. T. F., and Tuminaro, R. (1989).Analysis of a Parallel Multigrid Algorithm, in “Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods” (J. Mandel, S . F. McCormick, J. E. Dendy, Jr., C. Farhat, G. Lonsdale, S. V. Parter, J. W.Ruge, and K. 
Stiiben, eds.), Philadelphia, S A M , pp. 149-160. Ciarlet, P. G. (1972). “Mathematical Analysis of the Finite Element Method, Academic Press, New York. Concus, P., Golub, G. H., and O’Leary, D. P. (1976). A Generalized Conjugate Gradient Method j b r the Numerical Solution of Elliptic Partial Differential Equations, in “Sparse Matrix Computations” (J. R. Bunch and D. J. Rose, eds.) Academic Press, New York, pp. 309-332. Crouzeix, M., and Raviart, P. A. (1973). “Conforming and Non-conforming Finite Element Methods for Solving the Stationary Stokes Equation,” RAIRO Anal. Numir., 7 , pp. 33-76. Crouzeix, M., and Thomeee, V. (1987). “The stability in Lpand W ’ x pof the L2 Projection onto Finite-Element Function Spaces,” Math. Comp., 48, pp. 521-532. Decker, N. H. (1988). The Fourier Analysis of a Multigrid Preconditioner, in “Multigrid Methods: Theory, Applications, and Supercomputing” (S. F. McCormick, ed.),Marcel Dekker, New York, pp. 117-141. Proceedings of the Third Copper Mountain Conference on Multigrid Methods. Decker, N. H., and Van Rosendale, J. (1989). Operator Induced Multigrid Algorithms Using Semirefinement, in “Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods”(J. Mandel, S . F. McCormick, J. E. Dendy, Jr., C. Farhat, G. Lonsdale, S . V. Parter, J. W. Ruge, and K. Stiiben, eds.), Philadelphia, SIAM, pp. 87-105. Decker, N., Mandel, J., and Parter, S. V. (1988). On the Role of Regularity in Multigrid Methods, in “Multigrid Methods: Theory, Applications, and Supercomputing” (S. F. McCormick, ed.), Marcel Dekker, New York, pp. 143-156. Proceedings of the Third Copper Mountain Conference on Multigrid Methods.


Dendy, Jr., J. E. (1985). Black Box Multigrid for Systems, Appl. Math. Cornput. 19, 57-74. Dendy, Jr., J. E. (1987). “Two Multigrid Methods for Three-Dimensional Problems with Discontinuous and Anisotropic Coefficients,” SIAM J. Sci. Stat. Comp, 8, pp. 673-685. Dongarra, J. J., and Grosse, E. (1987). “Distribution of Mathematical Software via Electronic Mail,” Comm. ACM, 30, pp. 403-407. Douglas, C. C. (1984). “Multi-grid Algorithms with Applications to Elliptic Boundary-Value Problems,” SIAM J , Numer, Anal., 21, pp. 236-254. Douglas, C. C. (1990). “MADPACK (version 2) Users’ Guide.” Available from NETLIB. Douglas, C. C., and Douglas, Jr., J. (1990). Abstract Multilevel Convergence Theory Requires Almost no Assumptions, Tech. Report RC 15853, IBM Research, Yorktown Heights, NY. Douglas, C. C., and Mandel, J. (1989). The Domain Reduction Method: High Way Reduction in Three Dimensions and Convergence with Inexact Solvers, in “Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods” (J. Mandel, S. F. McCormick, J. E. Dendy, Jr., C. Farhat, G. Lonsdale, S. V. Parter, J. W.Ruge, and K. Stuben, eds.), Philadelphia, 1989, SIAM, pp. 149-160. Douglas, C. C., and Mandel, J. (1990). A Group Theoretic Approach to the Domain Reduction Method: The Commutative Case. In preparation. Dougas, C. C., and Miranker, W. L. (1988). Some Nontelescoping Parallel Algorithms Based on Serial Multigrid/Aggregation/DisaggregationTechniques, in “Multigrid Methods: Theory, Applications, and Supercomputing” (S. F. McCormick, ed.), Marcel Dekker, New York, pp. 167-176. Douglas, C. C., and Smith, B. F. (1989). “Using Symmetries and Antisymmetries to Analyze a Parallel Multigrid Algorithm,” SIAM J . Numer. Anal., 26, pp. 1439-1461. Douglas, C. C., Mandel, J., and Miranker, W. L. (1990).Fast Hybrid Solution of Algebraic Sysrems, SIAM J . Sci. Stat. Comput. 11, 1073-1086. Farhat, C. (1989). 
A Multigrid-Like Semi-fteratiue Algorithm for the Massively Parallel Solution of Large Scale Finite Element Systems, in “Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods,” (J. Mandel, S. F. McCormick, J. E. Dendy, Jr., C. Farhat, G. Lonsdale, S. V. Parter, J. W. Ruge, and K. Stiiben,eds.), Philadelphia, SIAM, pp. 171-180. Farhat, C., and Sobh, N. (1989).“A Coarse/Fine Preconditioner for Very Ill-Conditioned Finite Element Problems,” fnt. J . Numer. Meth. Engin., 28, pp. 1715-1723. Fedorenko, R.P. (1961).“A Relaxation Method for Solving Elliptic Difference Equations,” USSR Cornput. Math. and Math. Phys., 4, pp. 1092-1096. Foresti, S., Brussino, G., and Sonnad. V. (1988). Multilevel Solution Methods for the pVersion of Finite Elements, Parallel Zmplementations and Comparison with other Solution Methods, Tech. Report KGN-137, IBM Kingston, NY. Frederickson, P., and McBryan, 0. (1988). Parallel Superconvergent Multigrid, in “Multigrid Methods: Theory, Applications, and Supercomputing” (S. F. McCormick, ed.), Marcel Dekker, New York, pp. 195-210. Gary, J., McCormick, S. F., and Sweet, R. (1983). “Successive Overrelaxation, Multigrid, and Preconditioned Conjugate Gradients Algorithms for Solving a Diffusion Problem on a Vector Computer,” Appl. Math. Comput, 13, pp. 285-310. Gelman, E., and Mandel, J. (1990).“Multilevel Algorithms for Optimization Problems,” Math. Progr. Ser. B, 48, pp. 1-18. Goldstein, C. I. (1989a). “Analysis and Application of Multigrid Preconditioners for Singularity Perturbed Boundary Value Problems, SIAM J. Numer. Math., 26, pp. 1090-1 123. Goldstein, C. I. (1989b). “Multigrid Analysis of Finite Element Methods with Numerical Integration.” Math. Comp., submitted. Goldstein, C. I. (1989~).The Numerical Solution of Exterior Hemholz Probkms, in “Numerical and Applied Mathematics” (W. Ames, ed.), J. C. Baltzer, AG, 1989, pp. 359-364.


Gottlieb, D. and Orszag, S. A. (1977). “Numerical Analysis of Spectral Methods,” SIAM, Philadelphia. Grisvard, P. (1985). “Elliptic Problems in Nonsmooth Domains,” Pitman, Boston. Hackbusch, W. (1979).“On the Computation of Approximate Eigenvalues and Eigenfunctions of Elliptic Operators by Means of the Multigrid Method,” SIAM J. Numer. Anal., 16, pp. 201 215. Hackbusch, W. (1980). Analysis and Multi-Grid Solutions oJ Mixed Finite Element and Mived Finite Diflerenre Equations. Manuscript, Ruhr-Universitat Bochum, Germany. Hackbusch, W. (1981a). “Error Analysis of the Nonlinear Multigrid Method of the Second Kind.” Appl. Mar., 26, pp. 18-29. Hackbusch, W. (l981b). ”On the Convergence of Multi-Grid Iterations,” Beilriiye zur Numer. Muth., 9,pp. 213-239. Hackbusch, W. (1982). Multi-grid Conaergence Theory, in ”Multigrid Methods” (W. Hackbusch and U. Trottenberg, eds.), Springer-Verlag. New York, pp. 177-219. Hack busch, W. (198Sa). Multigrid Eiyenualue Computations, in “Advances in Multi-Grid Methods,”(D. Braess, W. Hackbusch. and U. Trottenberg, eds.), Vol. I I of Notes on Numer. Fluid Mech., Friedr. Vieweg & Sohn, Braunschweig, pp, 24-32. Hackbusch, W. (1985b). “Multigrid Methods and Applications.” Springer-Verlag. Berlin. Hackbusch, W. (1988).A N e w Approach fo Robust Multi-Grid Solvers. in “ICIAM’87: Proceedings of the First International Conference on Industrial and Applied Mathematics,” Society for Industrial and Applied Mathematics, Philadelphia, pp. I 14-126. Hackbusch, W . (1989a). The Frequency Decomposition Multi-grid Method Jor Hyperbolic Problems. in “Nonlinear Hyperbolic Equations-Theory, Computation Methods, and Applications”(J. Ballmann and R. Jelsch, eds.) Notes on Numerical Fluid Mechanics, Volume 24, Friedr. Vieweg & Sohn, pp. 209-21 7. Hackbusch, W. (1989b).“The Frequency Decomposition Multi-Grid Method, Part I: Application to Anisotropic Equations,” Numer. Math., 56, pp. 229-245. Hackbusch, W. and Nowak, Z. P. (1989). 
“On the Fast Matrix Multiplication in the Boundary Element Method by Panel Clustering, Numm. Mafh.,54, pp. 463-491. Hackbusch, W., and Reusken, A. (1989). Analysis of a Damped Nonlinear Multigrid Method,” Numer. Math., 55, pp. 225-246. Hackbusch, W., and Trottenberg, U., eds. (1982).“Multigrid Methods,” Vol. 860 of Lect. Notes in Math.. Berlin, Springer-Verlag. Proceedings, Cologne. Hackbusch, W.,andTrottenberg,U.,eds.(1986).“Multigrid Methods 11,”Vol. 1228of Lect. Notes in Math., Berlin, Springer-Verlag. Proceedings. Cologne. Heinrichs, W. (1988a). “Line Relaxation for Spectral Multigrid Methods,” J . Comput. Physics, 77, pp. 166-182. Heinrichs, W. (1988b). “Multigrid Methods for Combined Finite Difference and Fourier Problems,” J. Comput. Phys., 78, pp. 424-436. Hemker, P. W., and Johnson, G. M. (1087). Mulfigrid Approaches fo Euler Equations, in “Multigrid Methods” (S. F. McCorrnick. ed.), Vol. 5 of Frontiers in Applied Mathematics, SIAM, Philadelphia, ch. 3. Hempel, R.. and Lemke, M. (1989). Parallel Black Box Multiyrid, in “Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods” (J. Mandel, S. F. McCormick. J. E. Dendy, Jr., C. Farhat, G . Lonsdale, S. V. Parter, J. W. Ruge, and K. Stiiben, eds.), Philadelphia, SIAM, pp. 255-272. Hunt. R. (1988). “Single-Level Multigrid.” J . Comput. .”.ppl. Mech., 23, pp. 133-139. Jung, M. (1987). “Konvergenzfaktoren von Mehrgitterverfahren fur Probleme der ebenen. lienaren Elastizitatstheorie,” 2. ungew. Math. Mech., 67, pp. 165-173. Jung, M., Langer, U., Meyer, A., Queck, W., and Schneider, M. (1989). Multigrid Preconditioiirrs


and Their Applications, in “Third Multigrid Seminar” (G. Telschow, ed.), no. 89-03 in Report MATH, Berlin, Akademie der Wissenschaften der DDR, Karl-Weierstrass-lnstitut fur Mathematik, pp. 11-52. Papers from the seminar held in Biesenthal, May 2-6, 1988. Kandel, D., Domany, E., Ron, D., and Brandt, A. (1989a).“Simulations without Critical Slowing Down: king and Three-State Potts Models,” Phys. Rev. B, 40,pp. 330-344. Kandel, D., Domany, E., Ron, D., Brandt, A., and Loh, Jr., E. (1989b). “Simulations without Critical Slowing Down,” Phys. Rev. Letters, 60, pp. 1591-1594. KoEvara, M., and Mandel, J. (1987).“A Multigrid Method for Three-Dimensional Elasticity and Algebraic Convergence Estimates,” Appl. Math. Comput., 23, pp. 121-135. Kuo, C.-C. J., and Levy, B. C. (1989).“Two-Color Fourier Analysis of the Multigrid Method with Red-Black Gauss-Seidel Smoothing,” Appl. Math. Comput., 29, pp. 69- 87. LuEka, A. Y. (1980). “Projection-Iterative Methods for Solving Differential and Integral Equations, Naukova Dumka, Kiev. In Russian. Maday, Y.,and Mufioz, R. (1988). “Spectral Element Multigrid. 11. Theoretical Justification,” J . Sci. Camp., 3, pp. 323-354. Maitre, J. F., and Musy, F. (1984).“Multigrid Methods: Convergence Theory in a Variational Framework,” SIAM J . Numer. Anal., 21, pp. 657-671. Maitre, J. F., Musy, F., and Nigon, P. (1 985).A Fast Solver f o r the Stokes Equation Using Multigrid with a Uzawa Smoother, in “Advances in Multi-Grid Methods” (D. Braess, W. Hackbusch, and U. Trottenberg, eds.) Vol. 11 of Notes on Numer. Fluid Mech., Friedr. Vieweg & Sohn, Braunschweig, pp. 77-83. Mandel, J. (1984a). “Etude Algebrique d‘une Methode Multigrille pour Quelques Problemes de Frontiere Libre,” Comptes Rendus Acad. Sci. Paris, S i r . 1,298, pp. 469-412. Mandel, J., (1984b). “A Multi-level Iterative Method for Symmetric, Positive Definite Linear Complementarity Problems,” Appl. Math. Optim., 11, pp. 77-95. 
Mandel, J. (1984c). On Some Two-Level Iterative Methods, in “Defect Correction Methods” (K. Böhmer and H. J. Stetter, eds.), Vol. 5 of Computing Supplementum, Springer Verlag, Wien, pp. 75-88.
Mandel, J. (1985). “On Multilevel Iterative Methods for Integral Equations of the Second Kind and Related Problems,” Numer. Math., 46, pp. 147-157.
Mandel, J. (1986). “Multigrid Convergence for Nonsymmetric, Indefinite Variational Problems and one Smoothing Step,” Appl. Math. Comput., 19, pp. 201-216.
Mandel, J. (1987). On Multigrid and Iterative Aggregation Methods for Nonsymmetric Problems, in “Multigrid Methods II” (W. Hackbusch and U. Trottenberg, eds.), Vol. 1228 of Lect. Notes Math., Springer-Verlag, pp. 219-231. Procs. 2nd European Conference on Multigrid Methods, Köln, October 1985.
Mandel, J. (1988). “Algebraic Study of Multigrid Methods for Symmetric, Definite Problems,” Appl. Math. Comput., 25, pp. 39-56.
Mandel, J. (1989). A Domain Decomposition Method for p-Version Finite Elements in Three Dimensions, in “Proceedings of the 7th International Conference on Finite Element Methods in Flow Problems,” April 3-7, 1989, Huntsville, Alabama, University of Alabama at Huntsville.
Mandel, J. (1990a). Hierarchical Preconditioning and Partial Orthogonalization for the p-Version Finite Element Method, in “Third International Symposium on Domain Decomposition Methods for Partial Differential Equations” (T. F. Chan, R. Glowinski, J. Periaux, and O. B. Widlund, eds.), Philadelphia, SIAM, pp. 141-156.
Mandel, J. (1990b). “Two-Level Domain Decomposition Preconditioning for the p-Version Finite Element Method in Three Dimensions,” Int. J. Numer. Methods Engrg., 29, pp. 1095-1108.
Mandel, J. (1990c). Iterative Solvers by Substructuring for the p-Version Finite Element Method. International Conference on Spectral and High Order Methods for Partial Differential Equations, Como, Italy, June 1989; Comput. Methods Appl. Mech. Engrg., to appear.
Mandel, J. (1990d). “On Block Diagonal and Schur Complement Preconditioning,” Numer. Math., to appear.
Mandel, J., and Lett, G. S. (1990). “Domain Decomposition Preconditioning for p-Version Finite Elements with High Aspect Ratios,” Applied Numer. Anal., to appear.
Mandel, J., and McCormick, S. F. (1989). “A Multilevel Variational Method for Au = λBu on Composite Grids,” J. Comput. Phys., 80, pp. 442-452.
Mandel, J., and Miranker, W. L. (1990). “New Techniques for Fast Hybrid Solution of Systems of Equations,” Int. J. Num. Meth. Engin., 27, pp. 455-468.
Mandel, J., and Ombe, H. (1988). Fourier Analysis of a Multigrid Method for 3D Elasticity, in “Multigrid Methods: Theory, Applications, and Supercomputing” (S. F. McCormick, ed.), New York, Marcel Dekker, pp. 389-412.
Mandel, J., and Parter, S. V. (1990). “On the Multigrid F-Cycle,” Applied Math. Computation, 37, pp. 19-36.
Mandel, J., and Sekerka, B. (1983). “A Local Convergence Proof for the Iterative Aggregation Method,” Linear Algebra Appl., 51, pp. 163-172.
Mandel, J., McCormick, S., and Bank, R. (1987). Variational Multigrid Theory, in “Multigrid Methods” (S. F. McCormick, ed.), SIAM, Philadelphia, ch. 5, pp. 131-177.
Mandel, J., McCormick, S. F., and Ruge, J. (1988). “An Algebraic Theory for Multigrid Methods for Variational Problems,” SIAM J. Numer. Anal., 25, pp. 91-110.
Mandel, J., McCormick, S. F., Dendy, Jr., J. E., Farhat, C., Lonsdale, G., Parter, S. V., Ruge, J. W., and Stüben, K., eds. (1989). “Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods,” Philadelphia, SIAM.
Mansfield, L. (1981). “On Multigrid Solution of Finite Element Equations with Isoparametric Elements,” Numer. Math., 37, pp. 423-432.
McCormick, S. F., ed. (1983). “Multigrid Methods,” North-Holland, 1983. Selected papers from the International Conference on Multigrid Methods, Dillon, Colorado, special issue of Appl. Math. Comput., 13, Pages 213-474.
McCormick, S. F. (1985). “Multigrid Methods for Variational Problems: General Theory for the V-Cycle,” SIAM J. Numer. Anal., 22, pp. 634-643.
McCormick, S. F., ed. (1986). “Second Copper Mountain Conference on Multigrid Methods, 1985,” North-Holland. Special issue of Appl. Math. Comput., 19, Pages 1-372.
McCormick, S. F., ed. (1987). “Multigrid Methods,” Vol. 3 of Frontiers in Applied Mathematics, SIAM, Philadelphia.
McCormick, S. F., ed. (1988). “Multigrid Methods: Theory, Applications, and Supercomputing,” Marcel Dekker, New York. Proceedings of the Third Copper Mountain Conference on Multigrid Methods.
McCormick, S. F. (1989). “Multilevel Adaptive Methods for Partial Differential Equations,” Vol. 5 of Frontiers in Applied Mathematics, SIAM, Philadelphia, 1989.
McCormick, S. F., and Ruge, J. (1983). “Unigrid for Multigrid Simulation,” Math. Comp., 19, pp. 924-929.
Mulder, W. A. (1989a). Multigrid, Alignment, and Euler's Equations, in “Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods” (J. Mandel, S. F. McCormick, J. E. Dendy, Jr., C. Farhat, G. Lonsdale, S. V. Parter, J. W. Ruge, and K. Stüben, eds.), Philadelphia, SIAM, pp. 348-364.
Mulder, W. A. (1989b). “A New Multigrid Approach to Convection Problems,” J. Comput. Phys., 83, pp. 303-323.
Nečas, J. (1964). “Sur la Coercivité des Formes Sesqui-Linéaires Elliptiques,” Rev. Roumaine Math. Pures Appl., 9, pp. 41-69.

Nečas, J., and Hlaváček, I. (1981). “Mathematical Theory of Elastic and Elasto-Plastic Bodies,” North-Holland, Amsterdam.
Niestegge, A., and Witsch, K. (1990). “Analysis of a Multigrid Stokes Solver,” Appl. Math. Comput., 35, pp. 291-303.
The Numerical Algorithms Group Ltd. (1988). NAG FORTRAN Mark 13 Library Manual, Oxford.
Parsons, I. D. (1989). “The Implementation of an Element Level Multigrid Algorithm on the Alliant FX/8,” Comp. Phys. Comm., 53, pp. 337-348.
Parsons, I. D., and Hall, J. F. (1990a). “Multigrid Methods in Solid Mechanics: Part I - Algorithm Description and Behaviour,” Int. J. Numer. Meth. Engin., 29, pp. 719-737.
Parsons, I. D., and Hall, J. F. (1990b). “Multigrid Methods in Solid Mechanics: Part II - Practical Applications,” Int. J. Numer. Meth. Engin., 29, pp. 739-753.
Parter, S. V. (1987). “Remarks on Multigrid Convergence Theorems,” SIAM J. Numer. Anal., 23, pp. 103-120.
Patera, A. T. (1984). “A Spectral Element Method for Fluid Dynamics: Laminar Flow in a Channel Expansion,” J. Comput. Phys., 54, pp. 468-488.
Peisker, P., Rust, W., and Stein, E. (1990). “Iterative Solution Methods for Plate Bending Problem,” Report, Universität Hannover; SIAM J. Numer. Anal., 27, pp. 1450-1466.
Phillips, T. N. (1987). “Relaxation Schemes for Spectral Multigrid Methods,” J. Comput. Appl. Math., 18, pp. 149-162.
Raviart, P. A., and Thomas, J. M. (1977). A Mixed Finite Element Method for 2-nd Order Elliptic Problems, in “Mathematical Aspects of Finite Element Method” (I. Galligani and E. Magenes, eds.), Vol. 606 of Lect. Notes in Math., Springer-Verlag, Berlin, pp. 292-315.
Reusken, A. (1988). “Convergence of the Multigrid Full Approximation Scheme for a Class of Elliptic Mildly Nonlinear Boundary Value Problems,” Numer. Math., 52, pp. 251-277.
Ronquist, E. M., and Patera, A. T. (1987). “Spectral Element Multigrid. I. Formulation and Numerical Results,” J. Sci. Comput., 2, pp. 389-406.
Ruge, J. W., and Stüben, K. (1987). Algebraic Multigrid, in “Multigrid Methods” (S. F. McCormick, ed.), Vol. 5 of Frontiers in Applied Mathematics, SIAM, Philadelphia, ch. 4, pp. 73-130.
Stüben, K., and Trottenberg, U. (1982). Multigrid Methods: Fundamental Algorithms, Model Problem Analysis and Applications, in “Multigrid Methods,” Proc. Köln 1981 (W. Hackbusch and U. Trottenberg, eds.), Springer Verlag, pp. 1-176.
Stüben, K., Trottenberg, U., and Witsch, K. (1983). Software Development Based on Multigrid Techniques, in “Proceedings of IFIP Conference on PDE Software, Modules, Interfaces, and Systems, Söderköping, Sweden, 1983” (B. Engquist and T. Smedsaas, eds.), Amsterdam, North-Holland.
Thole, C.-A., and Trottenberg, U. (1985). Basic Smoothing Procedures for the Multigrid Treatment of Elliptic 3d-Operators, in “Advances in Multi-Grid Methods” (D. Braess, W. Hackbusch, and U. Trottenberg, eds.), Vol. 11 of Notes on Numer. Fluid Mech., Friedr. Vieweg & Sohn, pp. 102-111.
Thole, C.-A., and Trottenberg, U. (1987). A Short Note on Standard Parallel Multigrid Algorithms for 3d-Problems, in “Supercomputing” (A. Lichnewski and C. Sagues, eds.), Elsevier, North-Holland.
Vassilevski, P. S. (1989). Nearly Optimal Iterative Methods for Solving Finite Element Elliptic Equations Based on the Multilevel Splitting of the Matrix, Tech. Report 1989-09, Enhanced Oil Recovery Institute, University of Wyoming, Laramie, Wyoming.
Verfürth, R. (1988). “Multilevel Algorithms for Mixed Problems: II: Treatment of the Mini-Element,” SIAM J. Numer. Anal., 25, pp. 285-293.

Vogelius, M. (1983). “An Analysis of the p-Version of the Finite Element Method for Nearly Incompressible Materials,” Numer. Math., 41, pp. 39-53.
Wendland, W. L. (1979). “Elliptic Systems in the Plane,” Pitman, London, San Francisco, Melbourne.
Wesseling, P. (1982a). MGD1 - A Robust and Efficient Multigrid Method, in “Multigrid Methods” (W. Hackbusch and U. Trottenberg, eds.), Vol. 860 of Lect. Notes in Math., Berlin, Springer-Verlag. Proceedings, Cologne 1981.
Wesseling, P. (1982b). “Theoretical Aspects of a Multigrid Method,” SIAM J. Sci. Stat. Comp., 3, pp. 387-407.
Xu, J. (1989). Theory of Multilevel Methods, Tech. Report AM 48, Pennsylvania State University, State College, Pennsylvania.
Young, D. M. (1971). “Iterative Solution of Large Linear Systems,” Academic Press, New York.
Yserentant, H. (1986a). “On the Multi-Level Splitting of Finite Element Spaces,” Numer. Math., 49, pp. 379-412.
Yserentant, H. (1986b). “On the Multi-Level Splitting of Finite Element Spaces for Indefinite Elliptic Boundary Value Problems,” SIAM J. Numer. Anal., 23, pp. 581-595.
Yserentant, H. (1989). “Two Preconditioners Based on the Multilevel Splitting of Finite Element Spaces,” Universität Dortmund, Germany, Preprint, 1989.
Zang, T. A., Wong, Y. S., and Hussaini, M. Y. (1982a). “Spectral Multigrid Methods for Elliptic Equations,” J. Comput. Phys., 48, pp. 485-501.
Zang, T. A., Wong, Y. S., and Hussaini, M. Y. (1982b). “Spectral Multigrid Methods for Elliptic Equations II,” J. Comput. Phys., 54, pp. 489-507.
Zhang, S. (1989). “An Optimal Order Multigrid Method for Biharmonic, C¹ Finite Element Equations,” Numer. Math., 56, pp. 613-624.
Zienkiewicz, O. C. (1977). “The Finite Element Method,” McGraw Hill, London, third ed.
Ziping, H. (1988). A Multi-Grid Algorithm for Mixed Problems with Penalty, PhD thesis, Ruhr-Universität, Bochum, Germany.
















