DEDICATED DIGITAL PROCESSORS
Methods in Hardware/Software System Design

F. Mayer-Lindenberg
Technical University of Hamburg-Harburg, Germany
Copyright © 2004 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England (www.wileyeurope.com, www.wiley.com). All rights reserved.
A catalogue record for this book is available from the British Library. ISBN 0-470-84444-2
Contents
Preface
1 Digital Computer Basics
  1.1 Data Encoding
    1.1.1 Encoding Numbers
    1.1.2 Code Conversions and More Codes
  1.2 Algorithms and Algorithmic Notations
    1.2.1 Functional Composition and the Data Flow
    1.2.2 Composition by Cases and the Control Flow
    1.2.3 Alternative Algorithms
  1.3 Boolean Functions
    1.3.1 Sets of Elementary Boolean Operations
    1.3.2 Gate Complexity and Simplification of Boolean Algorithms
    1.3.3 Combined and Universal Functions
  1.4 Timing, Synchronization and Memory
    1.4.1 Processing Time and Throughput of Composite Circuits
    1.4.2 Serial and Parallel Processing
    1.4.3 Synchronization
  1.5 Aspects of System Design
    1.5.1 Architectures for Digital Systems
    1.5.2 Application Modeling
    1.5.3 Design Metrics
  1.6 Summary
  Exercises
2 Hardware Elements
  2.1 Transistors, Gates and Flip-Flops
    2.1.1 Implementing Gates with Switches
    2.1.2 Registers and Synchronization Signals
    2.1.3 Power Consumption and Related Design Rules
    2.1.4 Pulse Generation and Interfacing
  2.2 Chip Technology
    2.2.1 Memory Bus Interface
    2.2.2 Semiconductor Memory Devices
    2.2.3 Processors and Single-Chip Systems
    2.2.4 Configurable Logic, FPGA
  2.3 Chip Level and Circuit Board-Level Design
    2.3.1 Chip Versus Board-Level Design
    2.3.2 IP-Based Design
    2.3.3 Configurable Boards and Interconnections
    2.3.4 Testing
  2.4 Summary
  Exercises
3 Hardware Design Using VHDL
  3.1 Hardware Design Languages
  3.2 Entities and Signals
  3.3 Functional Behavior of Building Blocks
  3.4 Structural Architecture Definitions
  3.5 Timing Behavior and Simulation
  3.6 Test Benches
  3.7 Synthesis Aspects
  3.8 Summary
  Exercises
4 Operations on Numbers
  4.1 Single Bit Binary Adders and Multipliers
  4.2 Fixed Point Add, Subtract, and Compare
  4.3 Add and Subtract for Redundant Codes
  4.4 Binary Multiplication
  4.5 Sequential Adders, Multipliers and Multiply-Add Structures
  4.6 Distributed Arithmetic
  4.7 Division and Square Root
  4.8 Floating Point Operations and Functions
  4.9 Polynomial Arithmetic
  4.10 Summary
  Exercises
5 Sequential Control Circuits
  5.1 Mealy and Moore Automata
  5.2 Scheduling, Operand Selection and the Storage Automaton
  5.3 Designing the Control Automaton
  5.4 Sequencing with Counter and Shift Register Circuits
  5.5 Implementing the Control Flow
  5.6 Synchronization
  5.7 Summary
  Exercises
6 Sequential Processors
  6.1 Designing for ALU Efficiency
    6.1.1 Multifunction ALU Circuits
    6.1.2 Pipelining
  6.2 The Memory Subsystem
    6.2.1 Pipelined Memory Accesses, Registers, and the Von Neumann Architecture
    6.2.2 Instruction Set Architectures and Memory Requirements
    6.2.3 Caches and Virtual Memory, Soft Caching
  6.3 Simple Programmable Processor Designs
    6.3.1 CPU1 – The Basic Control Function
    6.3.2 CPU2 – An Efficient Processor for FPGA-based Systems
  6.4 Interrupt Processing and Context Switching
  6.5 Interfacing Techniques
    6.5.1 Pipelining Input and Output
    6.5.2 Parallel and Serial Interfaces, Counters and Timers
    6.5.3 Input/Output Buses
    6.5.4 Interfaces and Memory Expansion for the CPU2
  6.6 Standard Processor Architectures
    6.6.1 Evaluation of Processor Architectures
    6.6.2 Micro Controllers
    6.6.3 A High-Performance Processor Core for ASIC Designs
    6.6.4 Super-Scalar and VLIW Processors
  6.7 Summary
  Exercises
7 System-Level Design
  7.1 Scalable System Architectures
    7.1.1 Architecture-Based Hardware Selection
    7.1.2 Interfacing Component Processors
    7.1.3 Architectures with Networking Building Blocks
  7.2 Regular Processor Network Structures
  7.3 Integrated Processor Networks
  7.4 Static Application Mapping and Dynamic Resource Allocation
  7.5 Resource Allocation on Crossbar Networks and FPGA Chips
  7.6 Communicating Data and Control Information
  7.7 The π-Nets Language for Heterogeneous Programmable Systems
    7.7.1 Defining the Target System
    7.7.2 Algorithms and Elementary Data Types
    7.7.3 Application Processes and Communications
    7.7.4 Configuration and Reconfiguration
    7.7.5 Hardware Targets
    7.7.6 Software Targets
    7.7.7 Architectural Support for HLL Programming
  7.8 Summary
  Exercises
8 Digital Signal Processors
  8.1 Digital Signal Processing
    8.1.1 Analog-to-Digital Conversion
    8.1.2 Signal Sampling
    8.1.3 DSP System Structure
  8.2 DSP Algorithms
    8.2.1 FIR Filters
    8.2.2 Fast Fourier Transform
    8.2.3 Fast Convolution and Correlation
    8.2.4 Building Blocks for DSP Algorithms
  8.3 Integrated DSP Chips
  8.4 Integer DSP Chips – Integrated Processors for FIR Filtering
    8.4.1 The ADSP21xx Family
    8.4.2 The TMS320C54x Family
    8.4.3 Dual MAC Architectures
  8.5 Floating Point Processors
    8.5.1 The Sharc Family
    8.5.2 The TMS320C67xx Family
  8.6 DSP on FPGA
  8.7 Applications to Underwater Sound
    8.7.1 Echo Sounder Design
    8.7.2 Beam Forming
    8.7.3 Passive Sonar
  8.8 Summary
  Exercises
References

Index
Preface
This book is intended as an introduction to the design of digital processors that are dedicated to performing a particular task. It presents a number of general methods and also covers general purpose architectures such as programmable processors and configurable logic. In fact, the dedicated digital system might be based on a standard microprocessor with dedicated software, or on an application-specific hardware circuit. It turns out that there is no clear distinction between hardware and software, and a number of techniques like algorithmic constructions using high-level languages, and automated design using compilation apply to both. For some time, dynamic allocation methods for storage and other resources have been common for software while hardware used to be configured statically. Even this distinction vanishes by using static allocation techniques to optimize software functions and by dynamically reconfiguring hardware substructures. The emphasis in this book is on the common, system-level aspects of hardware and software structures. Among these are the timing of computations and handshaking that need to be considered in software but play a more prominent role in hardware design. The same applies to questions of power consumption. System design is presented as the optimization task to provide certain functions under given constraints at the lowest possible cost (a task considered as one of the basic characteristics of engineering). Detailed sample applications are taken from the domain of digital signal processing. The text also includes some detail on recent FPGA (field programmable gate arrays), memory, and processor, in particular DSP (digital signal processor) chips. The selected chips serve to demonstrate the state of the art and various design aspects; there remain interesting others that could not be covered just for reasons of space. The statements made in the text regarding these chips are all conclusions by the author that may be erroneous due to incomplete or wrong data. Viable corrections mailed to the author will be posted to a page dedicated to this book at the web site: www.tu-harburg.de/ti6/ddp along with other supplementary information. A non-standard topic of special interest covered in this book will be the design of simple yet efficient processors that can be implemented on FPGA chips, and, more generally, the balance between serial and parallel processing in application-specific processors. A processor design of this kind is presented in detail (the ‘CPU2’ in Chapter 6), and also a system-level design tool supporting this processor and others. The VHDL source code for a version of this
processor can also be downloaded from [55] along with the software tools for it for free use in FPGA designs and for further experimentation. Licensing and checking for patent protection are only required for commercial usage. The book is the outcome of lectures on digital systems design, DSP, and processor networks given at the Technical University of Hamburg-Harburg, and is intended as an introductory textbook on digital design for students of electrical engineering and computer science. It presents a particular selection of topics and proposes guidelines to designing digital systems but does not attempt to be comprehensive; to study a broad subject such as digital processing, further reading is needed. As an unusual feature for an introductory text, almost every chapter discusses some subject that is non-standard and shows design options that may be unexpected to the reader, with the aim of stimulating further exploration and study. These extras can also serve as hooks to attach additional materials to lectures based on this book. The book assumes some basic knowledge on how to encode numbers, on Boolean functions, algorithms and data structures, and programming, i.e. the topics usually covered in introductory lectures and textbooks on computer science such as [13, 20]. Some particular DSP algorithms and algorithms for constructing arithmetic operations from Boolean operations are treated. The system designer will, however, need additional knowledge on application specific, e.g. DSP algorithms [14] and more general algorithms [15]. Also, semiconductor physics and technology are only briefly discussed to have some understanding of the electronic gate circuits and their power consumption, mostly concentrating on CMOS technology [10]. For the main subject of this book, the design of digital systems, further reading is recommended, too. In books such as [2, 49] the reader will find more detail on standard topics such as combinatorial circuit and automata design. They are treated rather briefly in this book and are focused on particular applications only in order to cover more levels of the design hierarchy. The text concentrates on the hierarchical construction of efficient digital systems starting from gate level building blocks and given algorithms and timing requirements. Even for these topics, further reading is encouraged. Through the additional literature the reader will gain an understanding of how to design both hardware and software of digital systems for specific applications. The references concentrate on easily accessible books and only occasionally cite original papers. Chapter 1 starts with some general principles on how to construct digital systems from building blocks, in particular the notion of algorithms, which applies to both hardware and software. It discusses complexity issues including minimization, and, in particular, the timing and synchronization of computations. The presentation proceeds at a fairly abstract level to aspects of system-level specifications and introduces some important metrics for digital systems to be used in the sequel, e.g. the percentage of time in which a circuit such as an ALU (arithmetic and logic unit) of a processor performs computational steps. Chapter 2 enters into the technological basics of digital computers, including transistor circuits and the properties of current integrated chips. 
It provides the most elementary hardware building blocks of digital systems, including auxiliary circuits such as clock generators, and circuits for input and output. Configurable logic and FPGA are introduced. Board and chip level design are considered, as well as the design of application-specific systems from IP (intellectual property) modules. Chapter 3 then introduces the method of describing and designing hardware using a hardware description language. VHDL is briefly introduced as a standard language. All VHDL examples and exercises can be simulated and synthesized with the free design tools provided by FPGA companies such as Xilinx and Altera.
Chapter 4 proceeds to the realization of arithmetical functions as special Boolean functions on encoded numbers, including the multiply-add needed for DSP. Serial versions of these functions are also presented, and some special topics such as the distributed arithmetic realized with FPGA cells. Chapter 5 further elaborates on the aspects of sequential control, starting with scheduling and operand storage. It includes a discussion of those automata structures suitable for generating control sequences, and, in particular, a memory-based realization of the controller automaton. In Chapter 6 the concept of a programmable processor is discussed, including the handling of input and output, interrupt processing and DMA. The presentation of sequential processors does not attempt to trace the historical development but continues a logical path started in Chapter 5 towards what is needed for efficient control. This path does not always duplicate contemporary solutions. Two simple processor designs are presented to demonstrate various techniques to enhance the ALU efficiency mentioned above. Some standard microprocessors are discussed as well, and techniques used to boost performance in modern high-speed processors. Chapter 7 proceeds to the system level where processors and FPGA chips are just components of a scalable architecture (as defined in Chapter 1), and the systems based on such an architecture are networks of sequential processors or heterogeneous networks including both FPGA-based logic circuits and programmable processors. The components need to be equipped with interfaces supporting their use in networks. The chapter also sketches a systemlevel design tool taking up several of the ideas and concepts presented before. It demonstrates a convenient setting for a compiler support surpassing the individual target processor or programmable logic circuit. The chapter also explains some automatic allocation techniques used by compilers and FPGA design tools. Chapter 8 discusses the application domain of digital signal processing starting from the basics of signal sampling and proceeding to application-specific processors. Some recent commercial signal processors are discussed in detail, and the use of FPGA chips for DSP is considered. The final section discusses some specific examples of embedded digital systems performing high-speed real-time DSP of sonar (underwater sound) signals. Throughout this book, the notion of a ‘system’ encompassing components and subsystems plays a crucial role. Processors will be viewed as complex system components, and processorbased systems as sub-systems of a digital system. In general, a digital system will contain several processor-based sub-systems depending on the performance and cost requirements. Dedicated digital systems are usually embedded sub-systems of some hybrid supersystem, and the operations performed by the sub-system need to be consistent with the operation of the entire system. It may not be enough to specify the interfaces with the supersystem, but necessary to analyze the dependency on other sub-systems of the total system that may be variable to some degree or be invariable givens. The reader is encouraged to proceed with this analysis to further levels, in particular to the dependencies within the social environment of engineering work, even if their analysis becomes more and more complex. It is a shame to see the beautiful technology of digital systems being applied to violate and destroy goods and lives. 
The judgement will, however, be different if the same techniques are used to create countermeasures. Fortunately, there are many applications in which the benefits of an achievement are not as doubtful, and the engineer may choose to concentrate on these.
1 Digital Computer Basics
1.1 DATA ENCODING

A digital system is an artificial physical system that receives input at a number of sites and times by applying input 'signals' to it and responds to these with output that can later be measured by some output signals. A signal is a physical entity measurable at some sites and depending on time. The input signals usually encode some other, more abstract entities, e.g. numbers, and so do the outputs. In a simple setting, the numbers encoded in the output may be described as a function of the input numbers, and the artificial system is specifically designed to realize this function. More generally, the output may also depend on internal variables of the system and the sites and times at which it occurs may be data dependent. The main topic of this book is how to systematically construct a system with some wanted processing behavior, e.g. one with a prescribed transfer function. The application of such a system with a particular transfer function first involves the encoding of the input information into physical values that are applied at the input sites for some time by properly preparing its input signals, then some processing time elapses until the output signals become valid and encode the desired output values, and finally these encoded values are extracted from the measured physical values. For the systems considered, the input and output signals will be electrical voltages measured between pairs of reference sites and restricted to range within some allowed intervals. In contrast to analogue circuits, an input signal to a digital system at the time at which it is valid is restricted to ranging within a finite set of disjoint intervals. These intervals are used to encode or simply are the elements of a finite set K. Any two voltages in the same interval represent the same element of K (Figure 1.1). Moreover, the circuits are designed so that for whatever particular values in the allowed intervals present at the inputs, the output will also range in allowed intervals and hence encode elements of K. If two sets of input values are 'equivalent', i.e. represent the same elements of K, then so are the corresponding outputs. Thus, the digital system computes a function mapping tuples of elements of K (encoded at the
Figure 1.1 Range of an n-level digital signal
different input sites and times) to tuples of elements of K encoded by the outputs, i.e. a function K^n → K^m. The continuum of possible voltages of a digital signal is only used to represent the finite set K. This is compensated by the fact that the assignment of output values does not suffer from the unavoidable variations of the signals within the intervals due to loading, temperature, or tolerances of the electronic components. The correspondence of signal levels in the allowed intervals to elements of K is referred to as the physical encoding. The most common choice for K is the two-element set B = {0, 1}. This restricts the valid input and output values to just two corresponding intervals L and H ('low' and 'high'), e.g. the intervals L = [−0.5, 2] V and H = [3, 5.5] V of voltages between two reference sites. Most often, one of the reference sites is chosen to be a 'ground' reference that is common to all input and output signals. If there are n input sites and times to the system as well as the ground, the voltages at these encode n-tuples in the set B^n, and the outputs at m sites and times define an element of B^m. Then the system computes a 'Boolean' function:

f: B^n → B^m
To let the system compute f(b) for some specific input tuple b, one connects the input sites to specific voltages in the L and H intervals w. r. t. the ground reference, e.g. 0V or 5V, perhaps by means of switches, and the output tuple is determined from voltage measurements at the output sites. The fact that the same type of signal occurs both at the input and at the output sites is intentional as this permits digital circuits to be cascaded more easily by using the output of one machine as the input of another to construct more complex processing functions. This method will be used to construct machines computing arbitrary functions f as above from simple ones. If the output sites and times of the first machine are not identical to the input sites and times of the second, some effort is needed to produce a copy of the output of the first as the input of the second. In order to communicate an output voltage of a circuit site w. r. t. the ground reference to a nearby input site of another circuit at nearly the same time, it suffices to connect the sites by a metal wire that lets them assume the same potential. If the sites are apart and do not share a common ground reference, more effort is involved, and if the copy of the output value is needed later when the signal at the output has been allowed to change, the value must be communicated through some storage device. Copying the same output value to several different input sites of other circuits involves still more effort. This can be done by first applying the ‘fan-out’ function mapping an input x to the tuple (x, . . . ,x) and then connecting the individual output components each to one of the inputs. To build digital systems that process more general information than just binary tuples, a second level of ‘logic’ encoding is used as well as the physical one. The input information, e.g. a number, is first encoded as a binary n-tuple (a bit field), which in turn is represented to the machine as a voltage, as explained above. Similarly, the output m-tuple represented by the output voltages needs to be further decoded into a number. Obviously, only finite sets can be
encoded by assigning different n-bit codes to their elements. If N and M are finite sets, binary encodings of N and decodings into M are mappings:

e: N → B^n        d: B^m → M
As in the case of the physical encoding, a decoding need not be injective and defined on all of Bm , i.e. different binary m-tuples may encode the same element of M, and not all tuples need to be used as codes. By composing it with e and d, the Boolean function f computed by a digital system translates into the abstract function: f ◦:
N → M    defined by    f°(n) = d(f(e(n)))    for n ∈ N
The function f is also said to be computed by the system although e and d need to be applied before and after the operation of the machine. For the data exchange between subsystems of a digital system the codes can be chosen arbitrarily, but for the external input and output of a system intended to compute a given function f ◦ , e and d are chosen so that their application is straightforward and useful to further represent the data, using e.g. the binary digits of a number as its code both for the input and the output. Otherwise one could simply use the e(n) for some encoding e as the codes f(n) of the results f ◦ (n). This would satisfy the above requirements on codes, but make the operation of the machine insignificant and put all computational effort into the interpretation of the output codes. Every digital system will necessarily have limited numbers of input and output signal sites. These numbers, however, do not limit the sizes of the input and output codes that can be operated by the system. By applying sequences of data one by one to the same n input sites or collecting sequences of outputs from the same n output sites at k different, distinguished times (serial input and output), the input and output codes actually range in Bn* k . Even a single input or output signal can pass tuples of arbitrary size. Moreover, digital systems are often used repetitively and then transform virtually unlimited sequences of input tuples into unlimited sequences of output tuples.
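As a concrete (if tiny) illustration of the composition f° = d(f(e(n))) — the example, the function names and the use of C below are ours, not the book's — the following sketch encodes a number from N = {0,...,15} as a 4-bit tuple, applies a purely Boolean function f built from XOR and AND operations (a ripple-carry incrementer), and decodes the result, so that the abstract function computed is f°(n) = (n + 1) mod 16:

    #include <stdio.h>

    /* Illustrative sketch of d(f(e(n))) for N = M = {0,...,15}: e encodes a
       number as a 4-bit tuple, f is a purely Boolean function (an incrementer
       built from XOR and AND), d decodes the output tuple. */
    typedef struct { int b0, b1, b2, b3; } bits4;

    static bits4 e(int n) {                          /* encoding N -> B^4 */
        bits4 t = { n & 1, (n >> 1) & 1, (n >> 2) & 1, (n >> 3) & 1 };
        return t;
    }

    static int d(bits4 t) {                          /* decoding B^4 -> M  */
        return t.b0 + 2 * t.b1 + 4 * t.b2 + 8 * t.b3;
    }

    static bits4 f(bits4 x) {                        /* Boolean function B^4 -> B^4 */
        bits4 y;
        int c0 = 1;                                  /* constant carry-in */
        y.b0 = x.b0 ^ c0;  int c1 = x.b0 & c0;
        y.b1 = x.b1 ^ c1;  int c2 = x.b1 & c1;
        y.b2 = x.b2 ^ c2;  int c3 = x.b2 & c2;
        y.b3 = x.b3 ^ c3;
        return y;
    }

    int main(void) {
        for (int n = 0; n < 16; n++)
            printf("fo(%d) = %d\n", n, d(f(e(n))));  /* prints (n + 1) mod 16 */
        return 0;
    }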
1.1.1 Encoding Numbers In this section we very briefly recall the most common choices for encoding numbers, and hint at some less common ones. Once bit fields encode numbers, the arithmetic operations translate into Boolean functions, and digital systems can be applied to perform numeric computations. Of particular interest are encodings of numbers by bit fields of a fixed size. Fixed size fields can be stored efficiently, and the arithmetical operations on them which are still among the most elementary computational steps can be given fast implementations. However, it is only finite sets of numbers that can be encoded by fields of a fixed size, and no non-trivial finite set of numbers is closed under the add and multiply operations. The maximum size of the encoded numbers will be exceeded (overflow), and results of the add and multiply operation within the size range may first have to be rounded to the nearest element of the encoded set. These difficulties can be overcome by tracking rounding errors and overflows and switching to encodings for a larger set of numbers by wider bit fields if required. The most common binary encoding scheme for numbers within a digital system is the base-2 polyadic encoding on the finite set of integers from 0 to 2n −1 which assigns to a number
m the unique tuple b = (b_0, ..., b_{n−1}) (in string notation the word 'b_{n−1} ... b_0') of its binary digits defined by the property:

m = b_0 + 2b_1 + 4b_2 + ··· = Σ_{i=0}^{n−1} b_i 2^i        (1)
In particular, for n = 1, the numbers 0,1 are encoded in the obvious way by the elements 0,1 ∈ B, and B can be considered as a subset of the integers. The terms 'unsigned binary number' or simply 'binary number' are often used to refer to this standard base-2 polyadic encoding. Every positive integer can be represented as a binary number by choosing n high enough. The (n+k)-bit binary code of a number m < 2^n differs from its n-bit code by leading zeroes only. For inputting or outputting it is more common to use a similar base-10 format in which a number is represented by a tuple d = (d_0, ..., d_{k−1}) of decimal digits (in string notation 'd_{k−1} ... d_0') so that:

m = Σ_{i=0}^{k−1} d_i 10^i        (2)
A corresponding binary encoding of m can be obtained from such a representation by concatenating the 4-bit binary codes of the digits d_i. This is called BCD (binary coded decimal) encoding. A number range containing negative and positive numbers can be encoded by separately encoding the sign or by shifting it to a range of non-negative numbers and taking their binary codes. Most common, however, is the n-bit twos complement encoding of integers in the range −2^{n−1} ≤ m < 2^{n−1} that is obtained by using for 0 ≤ m < 2^{n−1} the above polyadic encoding, and for −2^{n−1} ≤ m < 0 the polyadic code of m + 2^n. A number m is related to its twos complement code by:

m = −2^n b_{n−1} + Σ_{i=0}^{n−1} b_i 2^i = −2^{n−1} b_{n−1} + Σ_{i=0}^{n−2} b_i 2^i        (3)
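The following C sketch (ours, for illustration only; the width n = 8 and the function names are arbitrary) evaluates the defining sums (1) and (3) directly, showing how one and the same 8-bit code is read either as an unsigned binary number or as a twos complement number:

    #include <stdio.h>

    /* Unsigned binary code (1) and twos complement code (3) for n = 8,
       written directly from the defining sums (illustrative sketch only). */
    enum { N = 8 };

    static void encode(long m, int b[N]) {          /* digits b0..b7 of m mod 2^8 */
        unsigned long u = ((unsigned long)m) & 0xFFu;
        for (int i = 0; i < N; i++)
            b[i] = (int)((u >> i) & 1u);
    }

    static long decode_unsigned(const int b[N]) {   /* equation (1) */
        long m = 0;
        for (int i = 0; i < N; i++)
            m += (long)b[i] << i;                   /* b_i * 2^i */
        return m;
    }

    static long decode_twos_complement(const int b[N]) {  /* equation (3) */
        long m = -((long)b[N - 1] << (N - 1));      /* -2^(n-1) * b_(n-1) */
        for (int i = 0; i < N - 1; i++)
            m += (long)b[i] << i;
        return m;
    }

    int main(void) {
        int b[N];
        encode(-42, b);
        printf("unsigned value of code: %ld\n", decode_unsigned(b));        /* 214 */
        printf("twos complement value : %ld\n", decode_twos_complement(b)); /* -42 */
        return 0;
    }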
A number encoded by b = (b_0, ..., b_{n−1}) is negative if and only if b_{n−1} = 1. Therefore, b_{n−1} is called the sign bit of b. The term 'signed binary number' refers to this encoding. Sometimes one restricts m to the symmetric range −2^{n−1} < m < 2^{n−1} and uses the code '1000...000' of −2^{n−1} differently. Every integer can be represented as a signed binary number by choosing n high enough. The (n+k)-bit signed binary code of a number m with |m| < 2^{n−1} differs from its n-bit code by leading sign bits only. A non-unique encoding used for fast adders (see Chapter 4) where different codes represent the same number is the 2n-bit code (b_0, ..., b_{n−1}, c_0, ..., c_{n−1}) to represent the number:

m = Σ_{i=0}^{n−1} b_i 2^i + Σ_{i=0}^{n−1} c_i 2^i = Σ_{i=0}^{n−1} (b_i + c_i) 2^i        (4)

The number m = Σ_i b_i 2^i can e.g. be encoded by (b_0, ..., b_{n−1}, 0, ..., 0) or by (0, ..., 0, b_0, ..., b_{n−1}). In equation (4) the couples (b_i, c_i) can be thought of as encoding digits with the values 0, 1, 2. Both (1,0) and (0,1) encode the digit 1. The code of the number 0 is unique. Other redundant encodings also yielding fast add operations are the signed-digit codes due to Avizienis [6]. Here the sets of binary or decimal digits are enlarged to also include negative digits. The representation of a number is still defined by (1) or (2) but is no longer unique
(e.g., 1·2^k = 1·2^{k+1} − 1·2^k) and also covers negative numbers. The code of the number 0 is still unique, and the sign of a number is the sign of the highest non-zero digit. Some encodings use particular ways to describe a number in terms of others and then concatenate codes for these others, or map the numbers into another mathematical structure for which an encoding is already defined. An integer k is e.g. uniquely characterized by its remainders after the integer divisions by different, mutually prime 'bases' m_1, m_2, ... as long as 0 ≤ k < Π_i m_i. The choice of m_1 = 2^n − 1, m_2 = 2^n e.g. gives a unique 2n-bit encoding for integers in the range 0 ≤ k < m_1·m_2. The code set B^n is in a one-to-one correspondence to the set P_n of 'binary' polynomials of degree < n, a code (b_0, ..., b_{n−1}) corresponding to the polynomial

b(X) = b_0 + b_1 X + ··· + b_{n−1} X^{n−1}        (5)
Figure 1.2 Floating point code fields (64-bit format: sign bit s in bit position 63, exponent field ex in bits 62–52, mantissa field man in bits 51–0)
Operations on such finitely encoded numbers are combined with rounding steps that force a result within the encoded set. In some applications even overflows are handled by 'approximating' the true results by the most positive or the most negative representable value (saturation). It must be checked whether the result of a computation is within the required error bounds for an application. This is mostly done statically by analyzing the algorithm to be executed and selecting an appropriate number of places. It is also possible to dynamically track the errors of a computation and to adaptively increase the number of places if the error becomes too high. The input data to a digital system may themselves be imprecise, e.g. derived from measurements of some continuous signal. Then the number of places is chosen so that the extra 'quantization' error due to the limited set of representable numbers is sufficiently small in comparison to the measurement error. An n-bit fixed point encoding is for rational numbers q of the form q = m/2^r with m being an integer in the range −2^{n−1} ≤ m < 2^{n−1} and r being fixed. It is obtained by simply using the n-bit twos complement code of m defined by equation (3) as a code for q and corresponds to a scaling of the two's complement integer range (the redundant signed bit code could also be used). Usually, r = n − 1 so that −1 ≤ q < 1 and

q = −b_{n−1} + Σ_{i=0}^{n−2} b_i 2^{i−n+1}        (6)
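The following C sketch (ours; the word length n = 16, r = n − 1 and the function names are our choices) illustrates the fixed point code (6): encoding rounds a real value in [−1, 1) to the nearest representable q = m/2^15 and saturates on overflow, and decoding simply rescales m:

    #include <stdio.h>
    #include <math.h>

    /* n-bit fixed point code with r = n-1 as in (6): q = m / 2^(n-1),
       -1 <= q < 1 (illustrative sketch only). */
    enum { NBITS = 16 };
    #define SCALE (1 << (NBITS - 1))     /* 2^(n-1) = 32768 */

    static int fix_encode(double q) {    /* returns the integer m of (6) */
        long m = lround(q * SCALE);
        if (m >  SCALE - 1) m =  SCALE - 1;   /* saturation on overflow */
        if (m < -SCALE)     m = -SCALE;
        return (int)m;
    }

    static double fix_decode(int m) {
        return (double)m / SCALE;
    }

    int main(void) {
        double q = 0.3333333;
        int m = fix_encode(q);
        printf("m = %d, decoded q = %.7f, quantization error = %.2e\n",
               m, fix_decode(m), fix_decode(m) - q);
        return 0;
    }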
Floating point encoding is for rational numbers h of the form h = m·2^r with 1 ≤ m < 2 and a variable integer r in the range −2^{p−1} ≤ r < 2^{p−1}. It is obtained by concatenating a q-bit fixed-point code of m − 1 (the q-bit binary code of the integer (m − 1)·2^q) with the p-bit binary code for the positive integer r + off, with off = 2^{p−1} − 1. An extra bit is added for the sign, and numbers m·2^r with r = −2^{p−1} + 1 and 0 ≤ m < 1 (called denormalized) use the q-bit fixed point code for m. The total code size is thus p + q + 1 (Figure 1.2). Thus, if 'man', 'ex' are the non-negative integers encoded by the mantissa and exponent fields, then for the normalized case of ex ≠ 0:

h = ±(2^{−q} man + 1) · 2^{ex−off}        (7)
The common 32-bit IEEE standard format is q = 23, p = 8 ('single precision') and covers numbers in the range of ±3.37·10^38 that are defined to 6–7 decimal places. For the 64-bit 'double precision' format q = 52, p = 11, the range is ±1.67·10^308 with numbers being defined to 15–16 decimal places. There are standard 40- and 80-bit floating point formats as well. A simple non-standard floating point encoding for numbers m·2^r is obtained by concatenating a fixed point code for m and a twos complement integer code for r yet dropping the normalization requirement of 1 ≤ m < 2. Then, different codes may represent the same number using different numbers of places for the mantissa. This can be used to track the precision of a result. Floating point arithmetics including unnormalized floating point number representations are discussed in more detail in [54]. Another non-standard m-bit encoding similar to a floating point format with a zero bit mantissa field is the logarithmic encoding for numbers of the form ±q^n for some fixed real number q close to 1 and −2^{m−2} < n < 2^{m−2}. It is obtained by concatenating the (m−1)-bit
binary code of n + 2^{m−2} with a sign bit. The all zeroes code is used to represent the number 0, and the code (0,0,...,0,1) is not used. The logarithmic encoding covers a large dynamic range of numbers and eases the implementation of multiplication (which is performed by adding exponents) [82].
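The 32-bit IEEE format can be decoded directly with formula (7). The C sketch below is ours; it assumes that the platform's float type is IEEE 754 single precision and that unsigned is 32 bits wide (true on common machines), extracts the man, ex and sign fields from the raw code and applies (7) for ex ≠ 0:

    #include <stdio.h>
    #include <string.h>
    #include <math.h>

    /* Decoding a 32-bit single precision pattern with formula (7):
       q = 23, p = 8, off = 2^(p-1) - 1 = 127. Denormalized numbers
       (ex = 0) use the fixed point interpretation of the mantissa. */
    int main(void) {
        float x = -118.625f;
        unsigned code;
        memcpy(&code, &x, sizeof code);             /* raw 32-bit code */

        unsigned man  = code & 0x7FFFFF;            /* bits 22..0  */
        unsigned ex   = (code >> 23) & 0xFF;        /* bits 30..23 */
        unsigned sign = code >> 31;                 /* bit 31      */

        double h;
        if (ex != 0)                                /* normalized: formula (7) */
            h = ldexp((double)man / (1 << 23) + 1.0, (int)ex - 127);
        else                                        /* denormalized            */
            h = ldexp((double)man / (1 << 23), -126);
        if (sign) h = -h;

        printf("code = 0x%08X  decoded h = %g\n", code, h);   /* -118.625 */
        return 0;
    }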
1.1.2 Code Conversions and More Codes

If c: N → B^n and c′: N′ → B^m are encodings on two sets N, N′ of numbers, the numbers in the set Q = N ∩ N′ are encoded both by c and c′. The code conversion function is defined on c(Q) ⊂ B^n and maps a code c(q) to c′(q). Often, code conversions are implemented as processing functions in the digital system and used to switch to the encodings that are most convenient for the desired processing steps (e.g. compact codes that can be communicated in a shorter time, or ones for which the implementation of the arithmetical operations is particularly simple). The simplest conversions are those that transform an n-bit binary or twos-complement code into an m-bit one by appending or stripping zero or sign bits. Other common conversions are between integer and floating point formats, or floating point formats of different lengths. If real numbers are first approximated by numbers in a set N on which an encoding is defined (as in the case of fixed and floating point encodings), the notion of code conversion becomes relaxed. The conversion from an m-bit to an n-bit fixed point code (6) is by appending zero bits if n > m or by performing a rounding operation otherwise, i.e. using the closest approximation by an n-bit fixed point number. A single precision (32-bit) floating point code can be exactly converted into a double precision (64-bit) code, but the conversion from double to single involves first performing a rounding operation to the closest number that can be represented in the shorter format. The conversion is defined on all double precision codes of numbers p satisfying −r ≤ p ≤ r where r is the maximum single precision number. If a number is to be converted that is absolutely greater than the maximum representable one in a fixed or floating point target format, then sometimes saturation to the maximum representable number of the right sign is performed. Conversions are also needed for input and output. For example, numeric input and output are most convenient in the multiple decimal digits format whereas the arithmetic operations are implemented more efficiently for the twos-complement codes. Or, the result of a computation performed with floating point numbers may be desired in a rational representation p/q. This conversion is achieved by means of Euclid's algorithm to expand it into a continued fraction [12]. Another example is the inputting of an n-bit number code in parallel from n digital inputs. 'In parallel' means simultaneously from n nearby input sites. As changes at the input sites or their reading could occur with slight time delays, there is a chance of misreading the input. If the numeric input is known to only change by increments of ±1, it is useful to encode it in such a way that the codes of two numbers i and i + 1 only differ in one bit position, i.e. have a Hamming distance of 1. The Hamming distance of two codes b = (b_0, ..., b_{n−1}) and c = (c_0, ..., c_{n−1}) is defined by:

d(b, c) = Σ_{i=0}^{n−1} |b_i − c_i|

and simply counts the number of bit positions in which the two codes differ; the n-bit codes can be viewed as the nodes of the n-dimensional hypercube.
Then a code with the desired property defines a Hamiltonian path, i.e. a path along the edges of the cube that visits every node just once. This requirement on the codes is fulfilled by the Gray code. The n-bit Gray code g_n(k) for integers k in the range 0 ≤ k < 2^n is constructed recursively from the g_{n−1} codes by appending an nth bit as follows:

    g_n(k) = app(g_{n−1}(k), 0)               for k < 2^{n−1}
    g_n(k) = app(g_{n−1}(2^n − 1 − k), 1)     for k ≥ 2^{n−1}
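The recursion can be transcribed almost literally into code. In the C sketch below (ours; the appended nth bit is placed at the most significant position of the returned word) g_3 reproduces the familiar reflected Gray sequence 000, 001, 011, 010, 110, 111, 101, 100:

    #include <stdio.h>

    /* n-bit Gray code computed directly from the recursive definition above;
       the appended nth bit is taken as the most significant bit of the code
       (sketch only -- the closed form k ^ (k >> 1) yields the same codes). */
    static unsigned gray(unsigned n, unsigned k) {
        if (n == 0)
            return 0;                                        /* empty code */
        unsigned half = 1u << (n - 1);
        if (k < half)
            return gray(n - 1, k);                           /* append 0   */
        return gray(n - 1, (1u << n) - 1u - k) | half;       /* append 1   */
    }

    int main(void) {
        for (unsigned k = 0; k < 8; k++) {
            unsigned g = gray(3, k);
            printf("g3(%u) = %u%u%u\n", k, (g >> 2) & 1, (g >> 1) & 1, g & 1);
        }
        return 0;
    }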
If an n-bit code needs to be communicated to a different site by means of electrical signals or through storage media, there may be some chance that some of the bits get 'flipped' in this process. It is important to be able to detect such errors. To distinguish faulty and correct code transmissions, the n-bit code is mapped into a longer one, e.g. an (n+1)-bit code constructed by appending a 'parity' bit chosen so that the total number of ones becomes even. A single-bit error in the (n+1)-bit code then results in a parity error, i.e. an odd number of ones, and can easily be detected. For an error-free code the conversion to the original n-bit code is done by stripping the last bit. More generally, the n-bit code is subdivided into k-bit words that are interpreted as binary numbers and summed up modulo 2^k. This 'check sum' is appended to the code to form an (n+k)-bit code before it is communicated. Then many multi-bit errors can be detected (but not all). Another common method is to append a CRC (cyclic redundancy check) code computed from the original bit sequence. A k-bit CRC code for the bit sequence (b_0, ..., b_{n−1}) is obtained using two fixed polynomials p and q, q having the degree k. It is the remainder of the binary polynomial division of the polynomial with the coefficient vector (0, ..., 0, b_{n−1}, ..., b_0) (k zeroes) plus the polynomial p·X^n by q. The fixed size CRC does not uniquely encode the bit sequence, which is usually much longer, but it may be used as a fingerprint (a hash code) for it. Certain codes not only permit the detection of a limited number of bit errors but also their correction [16, 68]. In a code capable of correcting single bit errors any two distinct error-free codes need to have a Hamming distance of > 2. Then for a code with a single bit error, there is a unique error-free code at the distance of 1 that is used as the corrected one. Due to the allowed tolerances for the values of the physical signals representing bits and for the times when they are read off, bit errors tend to be rare. If single bit errors are corrected, the probability of remaining bit errors drops considerably. A code allowing detection and correction of single-bit errors is obtained starting from a primitive polynomial p(X) of the degree n. Let N = 2^n − 1. Any (N−n)-tuple/polynomial b = (b_0, ..., b_{N−n−1}) is converted to the N-tuple m(X) = b(X)·p(X) before being transmitted. If instead of m(X) the sequence m′(X) = m(X) + X^s with a single error at the bit position s is received, then s can be uniquely identified from the remainder of m′(X) after a division by p(X) due to the assumed property of p(X), and be corrected. b(X) is the result of the polynomial division of the corrected code by p(X). If a double bit fault has occurred, m′(X) = m(X) + X^r + X^s, then there is a unique code m″(X) = b″(X)·p(X) so that m″(X) = m′(X) + X^t for some t, as the balls of Hamming radius 1 around the correct codes exhaust all of B^N. Then m″ has the opposite parity to m. Double bit faults can hence be detected by restricting the encoding to tuples b of even parity (b and b·p have the same parity). While for the error-handling capabilities some extra bits are deliberately invested, large, composite codes representing multi-component objects, e.g. high-dimensional vectors or text files composed of many ASCII character codes, need to be converted ('compressed') into
smaller codes for the purposes of communications or storage and to be reconverted (‘decompressed’) afterwards. Common methods for data compression are (a) the use of different code sizes for the elements occurring in the object so that the most frequent ones have the shortest codes; (b) to substitute repetitions of the same component code by a single one plus a repeat count (run length coding); or (c) by encoding the changes between subsequent groups of components if they can be described by smaller codes. If the large code is the result of a computation, it can be advantageous to simply encode and communicate the parameters of this computation, or a definition of this computation along with the parameters. Finally, for the purpose of encryption, a standard code may be converted into another one that cannot be reconverted without knowing some secret parameter. Such code conversions related to secure communications have become important applications for digital systems in their own right.
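As a small illustration of the two simplest error-detecting schemes described above — the even parity bit and the check sum over k-bit words — the following C sketch (ours, with k = 8 and arbitrary message bytes) computes both for a short bit sequence:

    #include <stdio.h>
    #include <stddef.h>

    /* Even parity bit and a modulo-2^k check sum over k-bit words (k = 8),
       two of the error-detecting codes described above (illustrative only). */
    static unsigned parity_bit(const unsigned char *buf, size_t n) {
        unsigned ones = 0;
        for (size_t i = 0; i < n; i++)
            for (int b = 0; b < 8; b++)
                ones += (buf[i] >> b) & 1u;
        return ones & 1u;            /* append this bit to make the total even */
    }

    static unsigned char checksum8(const unsigned char *buf, size_t n) {
        unsigned char sum = 0;       /* sum of the 8-bit words modulo 2^8 */
        for (size_t i = 0; i < n; i++)
            sum = (unsigned char)(sum + buf[i]);
        return sum;
    }

    int main(void) {
        unsigned char msg[] = { 0x12, 0x34, 0x56, 0x78 };
        printf("parity bit = %u, check sum = 0x%02X\n",
               parity_bit(msg, sizeof msg), checksum8(msg, sizeof msg));
        return 0;
    }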
1.2 ALGORITHMS AND ALGORITHMIC NOTATIONS Digital systems are constructed from building blocks of a few types that perform some simple transfer functions (called elementary). If the input and output signals of these are compatible, the output signals of a building block or copies of them can be used as input signals of another. For electronic building blocks using voltage signals between pairs of reference sites this is particularly simple. As already pointed out, the output signal sites are directly connected to the input sites by means of wires that force the potentials at the connected input and output reference sites to become the same after a short time. If an output value is required as an input later, it must be passed through an electronic storage device that conserves or delays it until that time. For two building blocks with the (abstract or encoded) transfer functions f and g, respectively, their connection in series computes the composition ‘g ◦ f’, i. e. the function defined by: (g ◦ f )(x) = g( f (x)) The procedure to compute some desired result from given input values is usually given by prescribing a number of computing steps, each performing a particular one of a small choice of basic functions or operations on the inputs or intermediate values. Such a computational procedure is called an algorithm for the desired total transfer function. If the elementary operations are the transfer functions of the hardware building blocks, then the algorithm can be considered to be a set of instructions on how to build a machine with the desired transfer function from the available building blocks, simply by providing a hardware building block of the right type for every operation in the algorithm and connecting outputs to inputs whenever the algorithm says that the output is an intermediate value that is used as an operand for the other operation. The same building block can be used at different times for different steps of the algorithms if the intermediate results required as their inputs are passed through storage devices to be available at the time the building block is used for them. The notion of an algorithm is thus used for a function f:
M → N
being represented as a composition of simpler functions or operations. The simple operations are called elementary, as they are not further reduced to still simpler operations. Besides the
Figure 1.3 Data flow graph and related circuit (boxes represent machines performing the operations)
set of elementary operations, methods such as '◦' of composing simple or composite functions must be defined (and eventually be associated with methods of connecting hardware building blocks such as connecting through wires or through storage elements).
1.2.1 Functional Composition and the Data Flow

The most basic composition is functional composition allowing multi-argument functions to be applied to the results of multiple others, the above '◦' operator being a special case. Functional composition translates into feeding outputs from building blocks into selected inputs of multiple others. Algorithms for functions are usually described in a formal mathematical notation or an equivalent programming language. If the elementary operations are the arithmetic operations +, ∗ etc. on numbers, one can use the ordinary mathematical notation for these and denote the result of every operation by a unique symbol in order to be able to reference it as an input of another operation (which needs some 'naming' notation; we choose the notation '-> name'). Then an algorithm to compute a result d from inputs a, b, c using functional composition only might read:

    a + b -> r
    c + 1 -> s
    r * s -> d

The same algorithm could also be given by the single composite expression '(a+b) ∗ (c+1)'. The algorithms of this kind (only using functional composition) always compute expressions formed from the inputs, constants and elementary operations. The references to the intermediate results can be represented as a directed graph with the individual operations as nodes (Figure 1.3). This graph showing the dependency of the operational steps is called the data flow graph (DFG) of the algorithm. It must not contain cyclic paths in order to define a computational procedure. The graph directly translates into a diagram of connected hardware building blocks. Obviously, a formal notation as sketched above or taken from a programming language can be used to describe the building blocks and interconnections for a digital computer, at least one designed for computing expressions. If a standard language such as C is used for this purpose, one has to keep in mind that an assignment 'r = a + b;' similar to the above one only indicates a naming yet not a store operation to a variable; names and assignments must be unique. Also, the order of assignments does not prescribe an order of execution, as
the operations are not executed serially. For languages like VHDL dedicated to describing hardware structures (see Chapter 3), this is the standard semantics.
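For illustration (our sketch, not from the book), here is the algorithm of Figure 1.3 written in C in exactly this single-assignment style; every name denotes one edge of the data flow graph, and the two additions are independent of each other:

    #include <stdio.h>

    /* The algorithm of section 1.2.1 in single-assignment form: each variable
       names exactly one intermediate result of the data flow graph. */
    static int dfg(int a, int b, int c) {
        int r = a + b;        /* a + b -> r */
        int s = c + 1;        /* c + 1 -> s */
        int d = r * s;        /* r * s -> d */
        return d;
    }

    int main(void) {
        printf("%d\n", dfg(2, 3, 4));   /* (2+3)*(4+1) = 25 */
        return 0;
    }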
1.2.2 Composition by Cases and the Control Flow

Another common way to compose functional building blocks in an algorithm besides the functional composition is the composition by cases corresponding to a condition being true or false. A mathematical shorthand notation for this would be:

    f(x) = g(x)   if c(x)
           h(x)   otherwise

while in a programming language this is usually indicated by an if/else construction:

    if condition
        set of operations
        pass r1
    else
        other set of operations
        pass r2

For each of the branches the 'pass' statement indicates what will be the result if this branch is taken (in C, one would assign the alternative results to common local variables). A condition can be considered as a function outputting a Boolean result b ('true' or 'false'), and the branches compute the alternative results r1 and r2. A similar behavior would result from applying a select function 'sel' to r1, r2 and b that outputs its first argument, r1, if b is true and its second, r2, otherwise, i.e. from a special functional composition. In many cases an algorithm using branches can be transformed in this way into one using functional composition only. An important difference, however, is that 'sel', being a function, can only be applied after both r1 and r2 have been computed, i.e. the operations in both branches are executed, whereas in the if/else construction the operations of the unselected branch are not computed at all. Its result does not even need to be defined (e.g. due to a division by zero). If both r1 and r2 can be computed, the 'sel' version gives the same result as the if/else version, yet performs more operations. The composition with a select function directly translates into a hardware structure if a building block performing the selection is provided. This can be used to implement the if/else composition. The operations for computing r1 and r2 must both be implemented on appropriate building blocks although only one of them will be used for any particular computation. To implement the if/else in its strict sense, one might look for some control device switching between alternative wirings of elementary building blocks depending on the branching condition. Then, the result of the unselected branch is definitely not computed (on a conventional computer this is implemented by 'jumping' over the instructions of the unselected branch). Building blocks may then be shared between both branches. If sharing of building blocks is not possible, then at least one does not have to wait for the result of the unselected branch. The if/else branches in an algorithm impose a structure on the set of all operations specified in it that is known as its control flow, as they control which steps are actually performed. For an algorithm using branches, the number of operations actually performed becomes dependent
on the input data. If the if/else construction is translated into a controlled hardware structure, the time and energy needed for the computation become data dependent. If in a complex algorithm a pattern of dependent operations shows up several times, then one can arrive at a more concise description by giving the pattern a name and substituting its occurrences by references to this name, or by using an (implicit) indexing scheme distinguishing the individual instances. The latter is done using loops or, more generally, recursion. Here, a substructure (a set of dependent operations) is repeated a finite but, maybe, unlimited number of times depending on the data. If the number of times is data dependent, conditional branches and thereby the control flow are involved. In a formal language, the repeated substructure is identified by enclosing it between begin/end brackets and by naming it for the purpose of the recursive reference. As an example, the recursion for calculating the greatest common divisor (GCD) of two numbers might read:

    function gcd(n, m) {
        if n=m         pass n
        else if n>m    pass gcd(m, n-m)
        else           pass gcd(n, m-n)
    }

The individual operations cannot be performed by different hardware building blocks, as the total number of building blocks is necessarily limited while the number of recursive steps is not. If, however, a limit value is specified for the loop count or the depth of the recursion, the straightforward translation of the individual operations into hardware building blocks remains possible. With such a limitation the result of the recursions is undefined for inputs demanding a recursion depth beyond the limit (a special output value might be used to encode an invalid output). The expansion of the GCD recursion into elementary operations up to the depth of two starts by the expression shown in Listing 1.1 that could be used to build a GCD computer:

    if n=m    pass n
    else if n>m
        n - m -> n1
        if m=n1    pass m
        else if m>n1
            m - n1 -> m1
            if n1=m1    pass n1
            else        pass invalid
        else
            n1 - m -> n2
            if m=n2    pass m
            else       pass invalid
    else
        m - n -> m1
        if n=m1    pass n
        else if n>m1
            ... etc. etc. ...

    Listing 1.1  Expanded GCD recursion
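For comparison, the same subtraction-based recursion transcribed into C (our sketch; n and m are assumed to be positive):

    #include <stdio.h>

    /* The subtraction-based GCD recursion of section 1.2.2 in C. */
    static unsigned gcd(unsigned n, unsigned m) {
        if (n == m) return n;
        if (n > m)  return gcd(m, n - m);
        return gcd(n, m - n);
    }

    int main(void) {
        printf("gcd(36, 84) = %u\n", gcd(36, 84));   /* 12 */
        return 0;
    }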
1.2.3 Alternative Algorithms
Once an algorithm for a function is known that is based on elementary operations for which corresponding hardware building blocks and interconnection facilities are available, it may serve as a blueprint to construct a special purpose computer to execute it. The design of a digital system will start by giving algorithms for the functions to be performed. After that, the operations need to be assigned to hardware building blocks. This assignment does not need to be one-to-one as some building blocks can be used for more than one operation. Our further discussion mostly concentrates on providing the building blocks and on the assignment of operations to building blocks, but the finding of the algorithms is of equal importance. An important property of an algorithm is its complexity. It is defined as the number of operations used as elementary building blocks applied therein. If the algorithm contains branches, the number of operations actually performed may depend on the input data. Then the worst-case complexity and the mean complexity may differ. The complexity depends on the selection of building blocks. Numeric algorithms, for example, use arithmetic operations on encoded numbers as building blocks, and their complexity would be measured in terms of these. If the operations of the algorithm directly correspond to hardware building blocks, then its complexity measures the total hardware effort. If the operations execute one-by-one on the same block, the complexity translates into execution time. A given function may have several algorithms based on the same set of elementary operations that differ in their total numbers of elementary operations (i.e. their complexity), and in their data and control flows. Often functions are defined by giving algorithms for them, but other algorithms may be used to execute them. It turns out that there can be a dependency of the optimum algorithm w. r. t. some performance metric, say, the speed of execution, on the target architecture, i.e. the available elementary building blocks and interconnection methods. Algorithms and architectures must fit. In some cases, algorithms for a given function can be transformed into other ones with slightly different characteristics using algebraic rules, and the specification of the system design through the algorithm is understood to allow for such transformations as a method of optimization. If an operation is associative and commutative (such as ‘+’), then for a set S of operands a,b,c,. . . , the result of the composite operation
ΣS = (...((a + b) + c) + ...)
does not depend on the particular selection and order of individual '+' operations and operands but only on S. The 2^n − 1 add operations to add up 2^n numbers can, for example, be arranged linearly or in a binary tree (Figure 1.4). Both versions can be used to construct processors from subunits performing the individual add operations. The linear version suffers from each adder stage having to wait for the result of the previous one (which takes some processing time) while in the tree version adders can operate simultaneously. If there is just one adder that has to be used sequentially, the tree version cannot exploit this but suffers from needing more memory to store intermediate results. When just defining the output, the arrangement of the '+' operations used to construct the system may be left unspecified.
Figure 1.4 Equivalent adder arrangements
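In software the two arrangements of Figure 1.4 can be mimicked as follows (our sketch with 2^n = 8 operands; the function names are arbitrary). Both functions return the same sum, but in the tree version the additions of each level are independent of each other and could be performed simultaneously by separate adders:

    #include <stdio.h>

    /* Linear chain versus binary tree of '+' operations over 2^n operands. */
    enum { N = 8 };   /* 2^n operands, n = 3 */

    static int sum_linear(const int x[N]) {
        int s = x[0];
        for (int i = 1; i < N; i++)
            s = s + x[i];                           /* ..((x0 + x1) + x2) + ... */
        return s;
    }

    static int sum_tree(const int x[N]) {
        int t[N];
        for (int i = 0; i < N; i++) t[i] = x[i];
        for (int width = N; width > 1; width /= 2)  /* one tree level per pass   */
            for (int i = 0; i < width / 2; i++)
                t[i] = t[2 * i] + t[2 * i + 1];     /* independent additions     */
        return t[0];
    }

    int main(void) {
        int x[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        printf("linear = %d, tree = %d\n", sum_linear(x), sum_tree(x));  /* 36 36 */
        return 0;
    }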
1.3 BOOLEAN FUNCTIONS

From a given, even a small set of elementary operations, many functions may be constructed by means of algorithms, even if only functional compositions are allowed and branches and recursion are not used. As the operations performed by a digital system are Boolean functions, it is of interest to consider algorithms for Boolean functions based on some set of elementary operations. Any algorithm based on special Boolean operations that e.g. implement arithmetic operations on encoded numbers can be expanded into one based on the elementary operations once the arithmetic functions themselves have algorithms based on these.
1.3.1 Sets of Elementary Boolean Operations

Some common Boolean operations that are used as building blocks in Boolean algorithms are the unary NOT operation defined by NOT(0) = 1, NOT(1) = 0, the dual-input AND, OR, NAND, NOR, XOR (exclusive OR) operations defined by:

x  y | AND(x,y)  OR(x,y)  NAND(x,y)  NOR(x,y)  XOR(x,y)
0  0 |    0         0         1          1         0
1  0 |    0         1         1          0         1
0  1 |    0         1         1          0         1
1  1 |    1         1         0          0         0
and the 3-argument SEL operation defined as in section 1.2.2 by: SEL(x, y, 0) = x, SEL(x, y, 1) = y
for all x, y ∈ B
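These operations can be modelled directly in a few lines of Python (an illustration, not part of the original text); the loop reproduces the table above.

# Sketch: the elementary Boolean operations on B = {0, 1}, including SEL.
def NOT(x):        return 1 - x
def AND(x, y):     return x & y
def OR(x, y):      return x | y
def NAND(x, y):    return NOT(AND(x, y))
def NOR(x, y):     return NOT(OR(x, y))
def XOR(x, y):     return x ^ y
def SEL(x, y, z):  return x if z == 0 else y   # SEL(x, y, 0) = x, SEL(x, y, 1) = y

for y in (0, 1):
    for x in (0, 1):
        print(x, y, AND(x, y), OR(x, y), NAND(x, y), NOR(x, y), XOR(x, y))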
The operations AND, OR, and XOR are commutative and associative so that they may be applied to sets of operands without having to specify an order of evaluation.

Theorem: Every totally defined function f: Bn → B can be obtained as a purely functional composition (a composite expression) of the constants 0, 1 and operations taken exclusively from any one of the following sets of operations:

(1) AND, OR and NOT
(2) NAND
(3) SEL
(4) AND, XOR
Figure 1.5 Selector tree implementation of a Boolean function
In other words, every function has at least one algorithm over each of these sets of elementary Boolean operations. Although the theorem states the existence of such algorithms without explicitly indicating how to obtain them, its proof is by actually constructing them, starting from a table listing the values of the function. For the single-operation set consisting of the SEL operation only, the algorithm realizing a given function f is the selector tree shown in Figure 1.5 as a composition of functional building blocks. This algorithm uses 2^n − 1 SEL building blocks. The same SEL tree structure can be used for every function f by composing it with the appropriate input constants.

For the AND, OR and NOT set, a particular algorithm that can be immediately read off from the function table of f is the so-called disjunctive normal form (DNF) for f. If one writes ‘xy’ for ‘AND(x, y)’, ‘x + y’ for ‘OR(x, y)’, ‘x^0’ for ‘NOT(x)’ and ‘x^1’ for ‘x’, this algorithm is:

f(x1, .., xn) = Σ x1^b1 · · · xn^bn

where the sum (a multiple OR operation) extends over all n-tuples (b1, .., bn) for which f(b1, .., bn) = 1. That this really is an algorithm for f is easily verified using the fact that the term x1^b1 · · · xn^bn takes the value of 1 exactly on the tuple (b1, .., bn).

To prove that a particular set of building blocks generates all Boolean functions, it is otherwise enough to verify that the AND, OR and NOT functions can be obtained from it. For example, AND, OR and NOT are partial functions of SEL obtained by keeping some of the SEL inputs constant (composing with the constants 0, 1):

(1) NOT(z) = SEL(1, 0, z)
(2) AND(y, z) = SEL(0, y, z)
(3) OR(x, z) = SEL(x, 1, z)

Vice versa, as explained above, the SEL, NAND and XOR operations are obtained as combinations of AND, OR and NOT using their DNF algorithms. The XOR operation can also be expressed as:

(4) XOR(x, z) = SEL(x, NOT(x), z)

Each of the sets of operations in the theorem can hence be used as a basis to construct Boolean functions and digital systems once they are implemented as hardware building blocks. The existence of these finite and even single-element sets of basic operations generating all Boolean functions implies that general digital systems can be constructed from very small selections of building blocks. SEL was introduced in section 1.2.2 as an operation implementing control. It
actually performs no operation resulting in new data values but only passes the selected argument. A machine capable of moving data and performing conditional branches can therefore compute every Boolean function by performing a suitable sequence of these.

In the recent discussion on quantum computers and reversible computations [8], bijective (both injective and surjective) Boolean functions from Bn onto itself are considered. Every Boolean function f: Bn → B can be obtained by composing a bijective Boolean function with extra constant inputs and only using some of its outputs. The mapping

(b0, b1, .., bn) → (b0, .., bn−1, XOR(f(b0, .., bn−1), bn))

is, in fact, bijective, and with bn set to 0, the last component of the result becomes f(b0, .., bn−1). Arbitrary bijective mappings can be constructed from simple ones like the exchange function XCH(x, y) = (y, x) or the Fredkin controlled exchange function on B3 defined by

F(x, y, 0) = (x, y, 0),    F(x, y, 1) = (y, x, 1)
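The constructive character of the theorem can be illustrated with a minimal Python sketch (not the book's notation; the 3-input majority function is only an assumed test case): the DNF minterms are read off from the value table and the resulting disjunctive form is evaluated.

# Sketch: build and evaluate the DNF of a Boolean function f: B^n -> B.
from itertools import product

def dnf_minterms(f, n):
    """Return the n-tuples b with f(b) = 1, i.e. the AND terms of the DNF."""
    return [bits for bits in product((0, 1), repeat=n) if f(*bits)]

def eval_dnf(minterms, x):
    """OR over all minterms; each minterm is 1 exactly on its own tuple."""
    return int(any(all(xi == bi for xi, bi in zip(x, b)) for b in minterms))

maj = lambda a, b, c: int(a + b + c >= 2)       # assumed example function
terms = dnf_minterms(maj, 3)                    # [(0,1,1), (1,0,1), (1,1,0), (1,1,1)]
assert all(eval_dnf(terms, x) == maj(*x) for x in product((0, 1), repeat=3))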
1.3.2 Gate Complexity and Simplification of Boolean Algorithms

The complexity of an algorithm describing a composite circuit of AND, OR, NOT ‘gates’ or similar building blocks is also known as its gate count. A given function may have many different algorithms based on a given set of elementary operations. This may be exploited by searching for one of minimum complexity (there could be other criteria as well), starting from an algorithm such as the selector tree or the DNF that can be read off from the function table and then simplifying it using appropriate simplification steps.

For the selector tree implementation of a function, simplification steps are the application of the rule SEL(x, x, z) = x, which eliminates a SEL building block if the inputs to select from are the same values, and the rule SEL(0, 1, z) = z. Also, the formulas (1) to (4) in section 1.3.1 can be used to replace SEL building blocks by simpler ones. The leftmost column of no less than 2^(n−1) selectors in Figure 1.5 can be substituted this way by a single inverter (if anything at all), as the only possible outputs to the next column are the values SEL(0, 0, x) = 0, SEL(1, 0, x) = NOT(x), SEL(0, 1, x) = x and SEL(1, 1, x) = 1.

For the AND, OR and NOT building blocks, the well-known rules of Boolean algebra [12] can be used to simplify algorithms, in particular the rules

ab + ac = a(b + c)
(a + b)(a + c) = a + bc
a + a = a, and aa = a
a(a + b) = a, and a + ab = a
0a = 0, 1 + a = 1, a + 0 = a, and a1 = a
a + a◦ = 1, aa◦ = 0, and (a◦)◦ = a
u◦v◦ = (u + v)◦, and u◦ + v◦ = (uv)◦   (de Morgan’s laws)

the scope of which can be further extended by applying the commutative and associative laws for the AND and OR operations. All of them reduce the number of operations to be performed. For example, the DNF for a Boolean function f is more complex, the more ones there are in the function table. By applying de Morgan’s laws to the DNF of the negated function f◦, one obtains the CNF (the conjunctive normal form):

f(x1, .., xn) = Π ( (x1^b1)◦ + · · · + (xn^bn)◦ )
where the product (a multiple AND operation) extends over all n-tuples (b1, .., bn) with f(b1, .., bn) = 0. It is less complex than the DNF if the function mostly takes the value of 1.

The DNF algorithms for Boolean functions belong to the special class of algorithms called disjunctive forms. A disjunctive form is an algorithm for a function of n input variables x1, .., xn that is a disjunction (a multiple OR) of multiple AND terms in the variables and their complements (the simplest forms being the variables themselves). In the AND terms, each variable occurs at most once (an AND term in which a variable occurs several times can be simplified using the rule xx = x, or simplifies to 0 if a variable and its complement occur in it), but not all of the variables need to occur. The most complex AND terms are those in which all variables occur (only these are used in the DNF). These take the value of 1 for a single input pattern only.

Two Boolean functions g, g′ are defined to be in the relation g ≤ g′ if g(x) = 1 implies g′(x) = 1. If g, g′ are AND terms, the condition g ≤ g′ is equivalent to g being an extension of g′ by more factors. The terms containing all variables are the minimum terms w.r.t. the ‘≤’ relation. In a disjunctive form for f all terms g satisfy g ≤ f. Every function f has its DNF algorithm that is a disjunctive form, and is hence the sum of terms g so that g ≤ f. A disjunctive form is called minimal if no term can be left out or changed into a shorter one by leaving out some of its AND factors. For a minimal disjunctive form, the terms g occurring in it are maximum terms w.r.t. ‘≤’ so that g ≤ f.

A short form notation for AND terms is the length-n string of the exponents of the variables, using the ‘∗’ character (‘don’t care’) if a variable does not occur. For example, the string ∗∗1∗0∗ denotes the term x3^1 x5^0 for n = 6. If all variables occur there are no ‘∗’ characters. The rules

x^1 = x,    x^0 + x^1 = 1,    xy + xz = x(y + z)

and therefore

g z^0 h + g z^1 h = gh
imply that, starting from the DNF, disjunctive forms can be simplified step by step by combining similar AND terms that differ only in a single variable, occurring in one of them in its direct and in the other in its inverted form. For the short form notation this simplification step translates into substituting two strings that differ in one position only, with a ‘0’ in one and a ‘1’ in the other, by a single string having a ‘∗’ at this position and the same entries in the others:

∗∗110∗0110∗,  ∗∗100∗0110∗  →  ∗∗1∗0∗0110∗
Starting from the DNF of a Boolean function, this simplification step can be repeated until a minimal disjunctive form for f is reached; this is the well-known simplification procedure due to and named after Quine and McCluskey. It can easily be automated by letting a computer perform the following steps, which eventually yield all minimal disjunctive forms for f (in general, there are several solutions); a sketch of the merging step is given after the list. For other simplification methods we refer to [1].
• Set up the DNF for f to obtain a list L0 of all strings ≤ f having no ‘∗’ characters.
• As long as for k ≥ 0 the list Lk of all strings ≤ f having k ‘∗’ characters is given and non-empty, set up Lk+1 by applying the simplification step to all pairs of strings in Lk that satisfy its precondition. Let Nk be the list of those strings in Lk that could not be simplified.
• The strings in N0, N1, N2, . . . constitute the set N of maximum strings ≤ f, and f is their sum. It may already be the sum of a subset of them. To find out the minimum subsets of N, the
sum of which is f, set up a table showing for every string g the strings h in L0 with h ≤ g. To find a minimum set of strings incrementally, extract strings g from N and eliminate the elements h of L0 with h ≤ g starting with strings that are the only ones to be ≥ some element of L0 .
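The following rough Python sketch (illustration only; the example minterm strings are assumed, and a complete implementation would add the covering table of the last step) shows the merging step that produces Lk+1 and Nk from Lk.

# Sketch: one Quine-McCluskey merging pass on a set of '0'/'1'/'*' strings.
def merge(s, t):
    """Combine two strings differing in exactly one non-'*' position, else None."""
    diff = [i for i, (a, b) in enumerate(zip(s, t)) if a != b]
    if len(diff) == 1 and '*' not in (s[diff[0]], t[diff[0]]):
        i = diff[0]
        return s[:i] + '*' + s[i + 1:]
    return None

def simplify(level):
    """Return (L_{k+1}, N_k) for a given list L_k of strings."""
    merged, used = set(), set()
    for s in level:
        for t in level:
            m = merge(s, t)
            if m is not None:
                merged.add(m)
                used.update((s, t))
    return merged, level - used      # N_k: strings that could not be simplified

L0 = {'110', '100', '111', '011'}    # assumed minterm strings of some f on B^3
L1, N0 = simplify(L0)
print(L1, N0)                        # {'1*0', '11*', '*11'} set()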
The Quine–McCluskey method generalizes to Boolean functions f that are specified on a subset of Bn only, e.g. arithmetic functions for an encoding that is not surjective. First the definition of f is completed by letting f(x) = 1 for the undefined positions. Then the first two steps are carried out as before. The last step is carried out with the modification that L0 is restricted to the specified places only (excluding the added 1 places).

A Boolean function may admit a simple algorithm but have complex minimal disjunctive forms, an example being the parity function that assigns to a binary n-tuple the parity of the number of ones in it (0 if this number is even, 1 otherwise). Here the DNF has 2^(n−1) AND terms. It cannot be simplified by applying the above method at all and is already a minimal disjunctive form. A much simpler algorithm for the function is obtained by cascading n − 1 XOR gates.

Due to the excessive number 2^(2^n) of functions Bn → B, for the majority the complexities of optimum algorithms are above c·2^n/n for some constant c [11]. The number of algorithms with n variable inputs and using up to b NAND operations is bounded by ((b + n + 1)!/(n + 1)!)^2. This estimate results from arranging the NAND gates in a sequence (a ‘schedule’) so that every gate only inputs from gates with lower indices or from an input or a constant. The factorial expression can be approximated using Stirling’s formula and compared to the above number of functions. In other words, for the majority of Boolean functions, there are no simple algorithms.

Fortunately, for the most important arithmetical operations on binary number codes as described in section 1.1.1, the complexities are moderate. The complexity of an arithmetical operation heavily depends on the chosen number encoding (if different encodings are used for the various inputs and outputs, the complexity can be expected to decrease). If a nonstandard encoding is used to simplify the implementation of the arithmetical operation, the effort to convert into standard codes also has to be considered.
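The parity example can be checked with a few lines of Python (a sketch, not from the book): the DNF has 2^(n−1) AND terms, while a cascade of n − 1 two-input XOR operations computes the same function.

# Sketch: DNF size of the parity function vs. the XOR cascade.
from itertools import product
from functools import reduce
from operator import xor

n = 4
minterms = [bits for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1]
print(len(minterms), 2 ** (n - 1))          # 8 8: the DNF grows exponentially

def parity(bits):
    return reduce(xor, bits)                # n - 1 two-input XOR operations

assert all(parity(b) == sum(b) % 2 for b in product((0, 1), repeat=n))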
1.3.3 Combined and Universal Functions

A Boolean function on Bn+k can be considered as performing a set of 2^k alternative functions on Bn by using the upper k bits as a control code and only the lower n bits for data (the distinction between data and control inputs is somewhat arbitrary). Also, any given set of 2^k functions on Bn can be combined into a single function on Bn+k by selecting from the results of the individual functions depending on the code in the upper k bits. This technique was proposed in section 1.2.2 as an implementation for algorithms with branches; the selection code is then obtained from the evaluation of a condition on the input data.

As the number of functions Bn → B is finite (namely 2^k with k = 2^n), it is even possible to define a ‘universal’ function on Bn+k that combines all functions on Bn. It is realized by the k:1 selector (or ‘multiplexer’) function constructed from k − 1 SEL building blocks (Figure 1.5), with the input constants (the entries of the function table) now considered as variable control inputs. The k-bit control code selecting a particular function on Bn is thus the sequence of entries of its function table, and the n-bit input simply selects the right entry of this lookup table. Despite its complexity, the universal function is an important building block. It is incorporated into memories where the k control bits are buried storage elements, e.g.,
fixed-position switches to the 0 or 1 levels for a read-only memory, and the n ‘data’ inputs constitute an address code for the buried input to be selected. As many Boolean functions are very complex anyhow, they are realized by means of memories without attempting to derive an optimized circuit. Memory structures are very regular and can be densely integrated, and, being applicable to all kinds of functions, can be used and produced in high volume (low cost).

As pointed out in section 1.2.2, it is more attractive not to realize the combination of two functions f1 and f2 on Bn using the select operation:

f(x, c) = SEL(f1(x), f2(x), c)                    (8)
but to only compute the result that is actually selected. This amounts to looking for a less complex algorithm for the function defined by equation (8), maybe using a common sub-circuit in the algorithms for f1 and f2 in both selections or by performing a minimization of the DNF for f. If c is a function of x, then one can look for a simple algorithm for the combined function:

g(x) = SEL(f1(x), f2(x), c(x))                    (9)
As an example, consider the add and subtract operations on n-bit two’s complement numbers. These will be considered in detail in section 4.2 and have a similar complexity. Selecting between the outputs of separate add and subtract circuits costs twice the complexity of each of them (plus the SEL gates). Using the equation

a − b = a + (−b − 1) + 1                    (10)
one can obtain a less complex algorithm for the combined operation by using the same add circuit, equipped with an additional carry input, for both operations, selecting its second input between b and −b − 1 (Figure 1.6) and applying the constant 1 to the carry input in the case of subtraction. The function computing the code of −b − 1 from the code of b is particularly simple (applying NOT bit by bit), and the output selector is eliminated as the output is taken from the adder for both selections. By means of the input selector the circuit is configured differently for the two functions yet shares a major component. The NOT operation and the input selection together are equivalent to applying a bank of XOR gates.

Figure 1.6 Combined adder/subtractor
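A small Python sketch (illustration only, not the book's circuit description) models the shared adder/subtractor of Figure 1.6 for n-bit two's complement operands: the per-bit XOR with the operation code plays the role of the conditional NOT, and the same code feeds the carry input.

# Sketch: op = 0 adds, op = 1 subtracts, using a single adder with carry-in = op.
def add_sub(a, b, op, n=8):
    mask = (1 << n) - 1
    b_sel = (b ^ (mask if op else 0)) & mask     # b, or bitwise NOT(b) = -b - 1
    return (a + b_sel + op) & mask               # one adder for both operations

# usage: results are n-bit two's complement patterns
print(add_sub(5, 3, 0))   # 8       (5 + 3)
print(add_sub(5, 3, 1))   # 2       (5 - 3)
print(add_sub(3, 5, 1))   # 254     two's complement code of -2 for n = 8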
1.4 TIMING, SYNCHRONIZATION AND MEMORY

As well as the algorithms underlying the construction of a system, the timing of a computation will be an important topic in what follows. In many applications the time available to the
computation is limited. The basic definitions and analysis do not depend on a particular technology but apply to all kinds of compute systems constructed from building blocks.

A system receiving an n-bit input code (b0, . . . , bn−1) does so by receiving every bit bi at a site si and during the time interval [fi, ti]. If two of them are input at the same site si = sj, then the input intervals must not overlap; these bits are input serially. In general, the pattern of the si, fi, ti extends in space and time so that there is no well-defined time reference for the entire input. The same applies to the outputting of multi-bit codes from a system. In order to simplify the subsequent discussion, the application of input data will be considered as an event that occurs at a specific time, assuming an n-bit input code to be applied simultaneously at n different sites.

Once a data processing machine has been put in place, it will not be used only once but many times for varying inputs. The machine responds to every input event with a corresponding output event (Figure 1.7). The time delay from the input event to the corresponding output event is called the processing time. For most machines, the time delay of the output event does not depend on the time of the input event (a property called time invariance), but it can depend on the input data. The maximum (worst case) processing time is an important characteristic of the machine. The times for the input and output events may be restricted to a certain discrete subset of the time axis at which the input signals are sampled (e.g., the integer multiples of a basic period T), and there may be times at which the machine is occupied with processing the previous input and not ready to process a new input. In general, it is not excluded that a new input event may occur before the output event corresponding to the previous input.

Figure 1.7 Processing time

The throughput of the digital system is defined as the maximum possible frequency of input events with arbitrary input data. It may be higher than the reciprocal of the processing time. Processing time and throughput are independent measures. Some applications only require a high throughput whereas others need a short processing time. The processing time cannot be reduced arbitrarily for a given algorithm and technology, whereas the throughput can, e.g., be raised by duplicating the hardware and using the parts in an interleaved fashion.
1.4.1 Processing Time and Throughput of Composite Circuits

The building blocks of a complex system are themselves machines to which the definitions of worst case processing time and throughput apply. The processing time of the system actually results from the timing in which the building blocks perform the individual operations. If a building block B performs an operation on the result of another block A, then its input event occurs at the same time or later than the output event of A. If it occurs later, then the intermediate data to be passed must be stored in some way until the later time.
Figure 1.8 Timing for a serial composition
We first consider the serial composition of two machines m1 and m2 computing functions f1 and f2 with worst case processing times t1 and t2 (Figure 1.8), so that the output events of m1 are input events for m2. Then no time is needed to communicate a copy of the output data of m1 to the input of m2, and the composition computes the function f = f2 ◦ f1 with a worst case processing time t of at most t1 + t2. Otherwise the process used to communicate the data can be considered as a third machine put in series with the others, with a processing time t3 that has to be added to the sum of the others. If we take into account that t1 and t2 are maximum execution times and that the actual execution times for any given input could be smaller at least for one of the circuits, then we can only conclude that:

max(t1, t2) ≤ t ≤ t1 + t2

The serial composition does not use the components efficiently for an individual computation. Only after the processing time t1 from applying a new input does the machine m2 get new valid input data. At this time, m1 has already completed its processing of the input data and is no longer used for the computation of the final result. If s1 and s2 are the maximum throughputs of m1 and m2, their composition has a maximum throughput s satisfying:

s ≥ min(s1, s2)

i.e., the composite machine can process input events at least at the rate of its slowest building block. Again, this is a safe estimate; the input sequences to m2 are restricted and could permit a faster rate. If the throughput of the serial composition is exploited, m1 will accept new input while m2 continues processing its output, and both operate simultaneously. Although the processing time did not change, the same hardware now performs more operations by avoiding the idle times mentioned before. This kind of simultaneous operation of two machines m1 and m2 connected in series is called pipelining (Figure 1.9).

Figure 1.9 Pipelined execution to enhance the throughput

Next, we consider the computation of two intermediate results f1(x) and f2(x) in parallel by two machines m1 and m2 with the processing times t1 and t2 and throughputs s1 and s2 in response to a common input event (Figure 1.10). If the output obtained first is kept available (stored) until the second one arrives, this later one actually defines the processing time of the circuit composed of m1 and m2 at which all components of the result are ready in parallel. Its worst case processing time is hence:

t = max(t1, t2)

During the time in which the faster machine has already finished while the other is still working, the faster one remains unused. Actually, m1 and m2 may not be required to start their operations simultaneously and to deliver their outputs at the same time; the time from starting the first
to finishing the last of them will not be below t. For the throughput of the combined circuit one again obtains s = min(s1, s2), assuming that the data pass the circuits at some common rate.

Figure 1.10 Timing of a parallel execution

If the if/else construction is implemented as a composition of machines m1 and m2 executing the operations in the branches in parallel, a third machine m3 calculating the condition, and a select function, the processing time becomes t = max(t1, t2, t3) (ignoring the time for the selection) and the throughput s = min(s1, s2, s3), as discussed before. If control is implemented so that only the data from the selected branch are computed and awaited, the condition needs to be computed first and the worst case processing time is estimated by:

t ≤ t3 + max(t1, t2)

It can be strictly less because a branch might be deselected for exactly those data on which it would require the longest execution times. Otherwise, implementing the control flow leads to a higher execution time than for the select implementation as it creates a serial dependency on the condition. A gain in processing time similar to the one obtained through the select implementation of the control flow can be obtained for serial compositions of the form f(x, c(x)) by computing in parallel c(x) and all f(x, r) with r in the range of c, and then selecting the one with r = c(x) (this is used for the carry-select adder in section 4.2 where c(x) is the ‘carry’ signal).
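The composition rules derived above can be written down compactly; the following Python sketch (illustration only, with arbitrary toy numbers) just restates the worst-case bounds for serial, parallel and if/else compositions.

# Sketch: worst-case processing time t and throughput s of simple compositions.
def serial(t1, s1, t2, s2):
    """Safe bounds for a serial composition: t <= t1 + t2, s >= min(s1, s2)."""
    return t1 + t2, min(s1, s2)

def parallel(t1, s1, t2, s2):
    """Parallel computation of two intermediate results, selected later."""
    return max(t1, t2), min(s1, s2)

def if_else_control(t1, t2, t3):
    """Condition first, then only the selected branch: t <= t3 + max(t1, t2)."""
    return t3 + max(t1, t2)

print(serial(10, 0.1, 15, 0.2))       # (25, 0.1)   times in ns, rates in 1/ns
print(parallel(10, 0.1, 15, 0.2))     # (15, 0.1)
print(if_else_control(10, 15, 5))     # 20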
1.4.2 Serial and Parallel Processing

If an algorithm includes two computational steps performing the functions f1 and f2 and if building blocks are available capable of performing both f1 and f2, then a single building block of this kind may be used for both steps by executing them at different times. If f1 and f2 are different functions, the building block would be a combined one as discussed in section 1.3.3.
Figure 1.11 Auxiliary control needed for serial processing
This is called serial execution. It has the potential of simplifying the hardware at the expense of a higher processing time as it excludes both the f1 and f2 steps to be performed in parallel at the same time. As well as choosing good algorithms, the application of serial processing is the major method to raise the efficiency of a design by performing the same processing in the available time with less hardware resources. It depends on the performance requirements of an application, which balance between providing independent sub-circuits and serially reusing them should be used. There are algorithms in which serial processing does not significantly increase the processing time. A higher throughput may result from serial processing if a similar amount of hardware is used as in a ‘parallel’ circuit. Serial processing needs some auxiliary control. The input data to the two steps must be switched to the inputs of the building block at the appropriate times using a select circuit, then for different f1 and f2 the block has to be controlled to compute f1 in the first step and f2 in the second one (or vice versa), and the input data for the later step may have to be stored in some kind of memory device until the execution of the second step can start (Figure 1.11). This is the case in particular if the serial execution is extended to performing several steps on the same circuit and if the input to some of them is the result of a previous one. The order of execution of steps on the same building block must be so that a step producing input for another one (maybe, indirectly) is executed before the latter. Serial processing can be extended up to the point of using a single multi-function building block capable of executing each of the elementary operations used in the algorithm (e.g. the arithmetic unit of a microprocessor) and controlling it to execute the steps of the algorithm in a long time sequence, using storage devices for all the intermediate data to have them available when they are needed as an input for another step. The sequence of function select codes may be long and irregular. It can be derived from an easy-to-generate regular sequence of binary codes by applying a complex Boolean function to them that is realized as a memory (the universal function shown in Figure 1.5). This is the basic concept underlying the programmable processors to be discussed in detail in Chapter 6. Then software programming (setting up the memory data) shows up as part of the design process. If the composition f2 ◦ f1 is executed on the same building block with the processing times t1 and t2 for f1 and f2 respectively, first executing f1 and then f2 , then the total processing time for f2 ◦ f1 can be as short as t = t1 + t2 if the block accepts new input at the time of outputting the result of f1 and if the time for selecting the input signal and control can be ignored. This processing time is the same as the one obtained using separate building blocks for f1 and f2 . Even if the combined building block would allow new input before the output is ready, due to internal pipelining, the output of the first operation must be waited for before the second operation can start. In principle, the throughput of the building block may still be exploited
Figure 1.12 Layered network allowing full speed serial execution
by filling the pipeline with subsequent inputs until the first f1 outputs arrive, which are then used to feed the corresponding f2 steps, etc. If two steps calculating f1 (x) and f2 (x) can be performed in parallel, it is still possible to use the same hardware building block to execute both of them. Here the computation can be performed in both orders. Let L be the ‘latency’ time that needs to elapse after an input before a new input can be accepted (L is the inverse of the throughput s of the circuit and may be shorter than the processing time). The second operation need not wait for the result of the first and can be started at this time. If f2 is computed as the second operation and t2 ≤ t1 , then the processing time becomes: t = max(t1 , t2 + L) If both results are inputs to the same subsequent operation, at least the one obtained first must be stored until the start of that operation which is similar to the case of parallel execution. Thus, the serial execution of operations that are composed in parallel is slowed down while in a serial composition this does not happen. A network of circuits composed of h layers of up to d similar operations that can be executed in parallel while the layers are composed serially can be processed serially using d circuits without increasing the processing time obtained from using a separate circuit for all operations provided that h is minimum (Figure 1.12). The throughput of a composite computation that uses a particular building block having a maximum throughput of s to execute n operations in series necessarily falls below s/n. Hence serial processing always implies a throughput penalty. If the higher throughput of a serial composition of separate circuits cannot be exploited, the serial usage of a single circuit is more efficient. If a single circuit is used to perform the computations of the results f1 (x) and f2 (x) of the branches of an if/else composition, then the control can be so that only the selected branch is executed. Then the execution time for the other one is skipped and the total execution time remains at max(t1 , t2 ) or less and does not suffer at all from serial execution. In this implementation of the control flow no hardware resources are spent for the computation of unnecessary intermediate results. The penalty of having to wait for the evaluation of the condition can be reduced by at least accepting some operations of one of the branches (the one with the longer processing time or the one most probably taken) or both to be performed in advance in parallel to computing the condition using circuits that would otherwise be idle. Then at least partial usage is made of them.
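The ordering constraint for serial execution stated above (a step producing input for another step must be executed first) amounts to choosing a topological order of the data flow. The following Python sketch (illustration only; the step names and dependency sets are assumed) derives such an order for a small layered example like that of Figure 1.12.

# Sketch: serial schedule of steps on a single building block.
def serial_schedule(deps):
    """deps maps each step to the set of steps whose results it consumes."""
    done, order = set(), []
    while len(order) < len(deps):
        ready = [s for s in deps if s not in done and deps[s] <= done]
        if not ready:
            raise ValueError("cyclic data flow - no serial schedule exists")
        step = ready[0]            # any ready step may be chosen
        order.append(step)
        done.add(step)
    return order

# assumed example: two layers of two operations each
deps = {'a1': set(), 'a2': set(), 'b1': {'a1', 'a2'}, 'b2': {'a1', 'a2'}}
print(serial_schedule(deps))       # ['a1', 'a2', 'b1', 'b2']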
Figure 1.13 Non-constant digital signal
Figure 1.14 Timing behavior of a reactive AND gate with inputs A, B and output Q
The idea of serial processing is to make efficient use of the available resources, and to consider the compute circuits as such resources. It extends to other resources of a digital system, to memory devices that are used to store unrelated data or to input and output devices and interconnection media that transfer different data at different time steps. In all cases some auxiliary control circuits will be involved to perform this multiplexing.
1.4.3 Synchronization

The generation of input and output events must be implemented in some way for the building blocks of a digital system. A simple approach would be to consider every change of an input signal from one of the allowed intervals to the other as an input event. A non-constant digital signal needs to pass through values outside the allowed intervals for every change and becomes invalid and valid again at certain discrete times (Figure 1.13). The transitions between the different levels are not sharply defined but take short time intervals. For the sake of simplifying the subsequent discussion they will be idealized to occur at specific times.

A building block that directly responds to changes at its data inputs and that holds its output unchanged after the processing time from the last change if the input data remain unchanged will be called reactive, since it continuously reacts to changes at its inputs (Figure 1.14). The electronic feed-forward circuits realizing Boolean gates to be discussed in section 2.1 will be of this kind. Compositions of reactive building blocks are reactive.

Unfortunately, for reactive building blocks it cannot be expected that the outputs make unique changes from their previous values to the new ones that clearly define the output events. First of all, the output might not change at all. Then, if two intermediate values f1(x) and f2(x) are computed in parallel with different execution times, invalid combinations of them can appear for a short time and may feed into another reactive building block that consequently outputs invalid data for some time. In general, the outputs may go through several changes during the processing time, and only the final change represents the output event.

It is therefore useful to define the input and output events through changes of extra control signals. The output event can be generated with some delay greater than the processing
Figure 1.15 Timing behavior of a register with input D, output Q and control signal C
time from the input event if the output data are still unchanged at that time. The availability of control signals defining the output events is also essential if the building blocks are to operate in a pipeline or be used serially and be switched to different data inputs. Without them, compositions of reactive components are usually operated without using pipelining and holding the input throughout the entire worst case processing time. Related to the generation of input events is the need to store the output f(x) of a building block until a later event. For a reactive building block this can be implemented by not changing the input x and thereby the output f(x) which obviously excludes the usage of the building block for another computation during the waiting time. In order to avoid input and output hold time requirements on the computational building blocks and to free them up for more computations, the storage of intermediate results over extended times is implemented by special, auxiliary building blocks (‘registers’) with a trivial (identity) transfer function but the particular timing behavior that they output a stable n-bit pattern briefly presented at the previous input event to its inputs. The changes of the output signals occur synchronously in response to a new input event, the time of which is defined by a change of a single, extra control input, say, from L to H (Figure 1.15). The input data are supposed not to change at that time. A register is not reactive w.r.t. changes on the data inputs. It can be thought of as sampling the data input at the time of the control event and holding this value at the output until the next control event. The ability to maintain a stable state over time can be implemented with feedback circuits (see section 2.1.2). If the two values f1 (x) and f2 (x) are synchronized by sampling them simultaneously through a register, invalid combinations of output signals no longer show up at the register output. Storage elements (even large banks of such) are essential elements of digital systems. They also have to be used to hold input data to an algorithm that come in serial increments, or serve to store function tables. One can distinguish two major, independent roles for memory. One is to supply the input or store the output of algorithms. The other is to hold intermediate results of algorithms as needed for pipelining or to reuse sub-circuits. The latter is an auxiliary function and depends on the particular implementation while the former is independent of the implementation and dictated by the application. Usually both are implemented with the same types of storage circuits. There are two common methods for generating input and output events with the aid of control signals, namely synchronization with a time reference (a ‘clock’) and handshaking. It is only after equipping the building blocks of a digital system with these that compositions of them of arbitrary size become feasible with an orderly exchange of the intermediate results and a clear definition at the end of a computation. A time reference is implemented by a signal (a ‘clock’) performing periodic transitions between the L and H intervals. The control events may be defined to be the L-to-H transitions only (this is assumed in what follows), the H-to-L transitions only, or both. The clock produces
Figure 1.16 Registered building block (input register may be the output register of another block)
Figure 1.17 Pipelining arrangement for reactive circuits
a series of control events that are separated by integer multiples of the clock period T. If the elementary building blocks are all known to have worst case processing times of at most T and to have a reactive behavior, a valid output event can be generated by sampling the output signals at the time of a clock transition, provided that the input was applied at the previous clock transition and remained unchanged since then (using an input register to hold it, if necessary). The combination of the input register and the reactive compute block and the unit delay provided by the time reference actually results in an enhanced building block with well-defined input and output events (Figure 1.16). Thus the generation of input and output events to and from the building blocks is derived from the clock with the aid of registers, relying on worst case estimates of their execution times and without being able to exploit their possibly higher throughput. If all operations are handled this way, the total execution time of the algorithm will be an integer multiple of T that still depends on the control flow. If two registered building blocks are put in series, they operate in a pipeline. After every clock event the first receives new input while the second operates on the previous result of the first that it stored in the output register. This use of a register is, in fact, mandatory for pipelining two reactive circuits, as the second requires stable input throughout its processing time (Figure 1.17). If two intermediate outputs are needed simultaneously but are generated at different multiples of T, the earlier one must be delayed with the aid of extra registers that transform the composite circuit into a layered one as in Figure 1.12. The alternative concept of handshaking can be applied both to the data inputs and outputs of the entire system and to the inputs and outputs of building blocks. It also relies on extra control signals, the transitions of which (L-to-H, H-to-L, or both) define the times of the input and output events but this time each building block uses individual signals instead of the common clock, and additional control signals to define acknowledge events used to hold off further input and output events while the previous data are still needed (input events must respect the maximum throughput and the rate at which the corresponding output events are accepted). Handshaking allows exploiting the data dependency of the execution time and the throughput of a building block, but demands the implementation of the extra control signals. The signal indicating the output event must be generated so that its transition occurs after the processing time. A reactive building block can be augmented with handshaking signals using an individual time delay generator to get an indication of when the processing time is
Figure 1.18 Handshaking building block
Figure 1.19 Timing protocol of the handshake signals
over (Figure 1.18), the idea being again to substitute the purely functional reactive building blocks by ones with a timing behavior that is defined by appropriate control signals. Complex handshaking circuits built from such augmented handshaking building blocks may exhibit non-constant delays. There are various ways to define handshake signals and the protocols on when and how to perform their transitions. In contrast to the clock signal generator output, handshaking is defined to be bidirectional even if it is implemented by a single signal per input and output. A common definition is as follows. Besides the data input signals there is an extra control input IR (input request) that defines an input event through a transition from L to H with the data being applied to the data input. After this transition IR remains H and the input data do not change until the circuit responds by an L-to-H transition on its IA (input acknowledge) output. This corresponds to holding the input data as long as needed, maybe with the aid of a storage element. Thereafter IR returns to L, immediately followed by IA. A new input event can be applied as soon as IA is L again. The building block also generates a control output OR (output request) that defines the output event through a specific transition (in response to an input event after the processing time has passed). Finally, the input OA (output acknowledge) is assumed to respond to OR in the same way as IA to IR by a transition indicating that the output data may change again (Figure 1.19). The handshaking add-on to a reactive circuit would be implemented so that IA responds after OA in order to keep the output constant until the output handshake is completed. The handshaking signals are defined in such a way that for a composite building block (or the entire system) the handshake signals are easily derived from those of the components (individual periodic clocks for the individual components would not do this job). If two building blocks are connected in series, the OR and OA signals of the first connect to the IR and IA signals of the second, and the IR and IA of the first and the OR and OA of the second are the handshake signals of their composition. If both are augmented reactive building blocks, the first becomes inactive during the processing of its result in the second, and the combined circuit is equivalent to an augmented reactive circuit using the sum delay. Pipelining their processing and introducing a register at the output of the first as discussed before can avoid this. This time the register is equipped with the same kind of handshake signals as well. In
Figure 1.20 Conjunction of handshake signals
contrast to the handshake for the reactive circuit the register handshake would immediately respond to the IR transition with an IA response. Only the next IA response would be delayed as long as the register output data of the register are still needed (i.e. until OA). A register with this kind of handshaking connected to the output of an augmented reactive circuit lets the reactive circuit receive its OA immediately after the OR and immediately respond with its IA after the processing time has passed and enables its use in a pipeline again. If a building block has several data inputs from others that operate asynchronously, a composite IR signal is generated as the conjunction of the individual IRs as the execution can only start when all inputs are available. If the building block outputs to several others, its OA must be generated as the conjunction of the individual OA signals from them in order to keep the output data as long as needed (Figure 1.20). For an if/else composition, the handshake signals of the two branches are selected according to the branch to be executed. The generation of handshake signals for the building blocks of a digital system requires some effort, and current designs mostly opt for using a single clock signal to define the input and output events of the more complex reactive sub-circuits (a ‘global’ clock). For the external input and output, however, there is no such option. Except for control parameters that are input on demand without synchronization, both individual unidirectional clock type signals and bi-directional handshake signals are applied to define the input and output events (the digital system has to be designed so that all input events can be processed). On a microprocessor, interrupts and software loops would be used respectively to synchronize the processing with the external input and output events (see section 6.4). If a system uses several microprocessors, handshaking is implemented for the data exchange between them. If a processor receives several asynchronous inputs and outputs to several others, its software would typically perform the conjunction of the handshake signals.
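The handshake protocol can be illustrated with a deliberately simplified Python sketch (not a signal-level model, and not from the book): the producer applies data with IR going H, the augmented building block processes them and answers with IA going H, and both signals return to L before the next input event; the OR/OA pair is omitted here.

# Sketch: one four-phase input handshake per data item (strongly simplified).
class HandshakeBlock:
    def __init__(self, func):
        self.func = func               # processing function behind the interface
        self.IR = 0
        self.IA = 0

    def request(self, data):           # IR goes H with the data applied
        self.IR = 1
        result = self.func(data)       # data are held constant while IR is H
        self.IA = 1                    # IA goes H: the input has been taken over
        return result

    def release(self):                 # IR returns to L, IA follows;
        self.IR = 0                    # a new input event may then be applied
        self.IA = 0

block = HandshakeBlock(lambda x: x * x)     # assumed processing function
for d in (3, 7, 11):                        # one complete handshake per item
    print(d, "->", block.request(d))        # 3 -> 9, 7 -> 49, 11 -> 121
    block.release()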
1.5 ASPECTS OF SYSTEM DESIGN

1.5.1 Architectures for Digital Systems

The term ‘architecture’ will be used to summarize a set of elementary hardware building blocks that provide certain basic data processing operations and are further characterized by their cost and performance, a set of storage elements, and methods and devices to interconnect these with each other. The interconnection devices and media contribute to the timing of composite systems and must provide an orderly flow of data through the system, e.g. by using handshaking for all components. The interconnection may also take the form of applying the same physical circuit at different times to differently selected data with the aid of auxiliary
circuits. An architecture typically provides a limited, even a small number of component types, but several components of any given type. The use of an architecture deliberately restricts a design to using the provided components. Given an architecture, algorithms can be realized on the basis of the operations of its building blocks to construct complex digital systems using as many physical components as needed to meet the performance requirements (maybe, a single building block applied serially). An architecture is said to be universal if it allows arbitrary processing functions to be realized, and scalable if it also covers applications at a broad performance range by supporting an unlimited number of components to be connected. An architecture may provide a selection of building blocks for the same elementary processing function having different costs and performances (e.g., different types of processors). We note that a processing function may be realized with different algorithms using different functional building blocks; therefore the choice of building blocks and cost/performance tradeoffs provided by an architecture usually implies the need to choose between different algorithms. Moreover, the basic building blocks of an architecture may be used to construct a set of more complex building blocks and define a derived architecture using just these for system design. If the basic operations execute on a sequential processor, the complex building blocks correspond to a software library, e.g. for vector operations on floating point numbers. A design using complex building blocks can be expected to be easier and less error-prone. The choice of components of an architecture may be related to the packaging of functions into chips, boards and cabinets, or substructures to be mapped onto application-specific chips (ASICs), and consequently a system design would proceed at the chip level, the board level or at the level of connecting standard circuit boards. A typical component for board level designs is a multifunction circuit providing the most common binary arithmetic operations along with a control circuit and a memory interface supporting its sequential use (i.e. a microprocessor). A practical, more flexible architecture would result from a small choice of processors with different properties. The components of an architecture are usually based on a particular semiconductor technology that defines the interfacing characteristics of the gate level building blocks (including the L and H intervals). A selection of hardware circuits like the medium-scale integrated logic circuits available since the 1970s that are directly wired into networks constitutes a universal, scalable architecture. The circuits are integrated components with multiple input and output signal ‘pins’ and a power supply input; some pack multiple sub-functions, e.g. four independent NAND functions. There are storage elements permitting the usage of the same circuits at different times, which allow cost for performance tradeoffs. The architecture of networks of certain microprocessors connected using a choice of interfacing components is universal and scalable, too. The next chapters further develop the architectural choices, starting with building blocks for the most elementary Boolean operations. Integrated components with a large number of elementary components can be much cheaper than a network of smaller-scale integrated components but tend to restrict the interconnection of components. 
The manufacturing costs of the integrated digital components selected for an architecture not only depend on their complexity and the needed amount of materials but to a large degree on the volume in which they are used. The volume increases if the component in question can be used for several applications. This has led to configurable architectures using medium to large-scale integrated circuits. Even if not all features of a configurable chip are used for an application, the higher volume compensates for the
wasted hardware complexity as it may do for the extra sub-circuits needed to implement the configurability. It will be shown that unused sub-circuits and storage elements holding constant configuration data also do not contribute to the power consumption (the operating costs). The use of configurable components may reduce the development cost and time. Therefore architectures providing a carefully chosen selection of configurable components are a viable choice for a large range of applications. They are superior to ASICs (specific integrated circuits) except for high volume applications, superior to relying, for example, on a single type of programmable processor, and superior to a very large component inventory. There are several classes of configurable components. The PLD, for example, is a configurable logic device that can implement a number of different disjunctive forms and includes optional storage elements. The programmable processor is another case. By putting different instruction lists into the instruction memory chip the same microprocessor hardware can be used for many different applications. Recent FPGA chips offer thousands of logic cells that can be connected in various ways through a network of on-chip electronic switches (see section 2.2.4). Configurable interconnections have also been proposed for networks of programmable processors both at the chip and at the board levels. The FPGA and processor components of a system also offer the potential of being reconfigured for different data processing steps within an application, in addition to the multiple serial usages of FPGA sub-circuits in a given configuration. The common characteristic of the configurable architectures is that the components are not hardwired into an application-specific network, but are controlled by extra configuration data that actually define the implemented algorithm.
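The PLD mentioned above can be thought of as a fixed AND-OR plane whose configuration selects which disjunctive form it implements. The following Python sketch (illustration only; the configuration strings reuse the ‘0’/‘1’/‘∗’ notation of section 1.3.2, and the parity example is assumed) shows the idea: the evaluation circuit stays the same, only the configuration data change.

# Sketch: evaluating a configurable sum-of-products (PLD-style) structure.
def pld(config, inputs):
    """Evaluate the disjunctive form described by the configuration strings."""
    def term_hit(term):
        return all(c == '*' or int(c) == x for c, x in zip(term, inputs))
    return int(any(term_hit(term) for term in config))

xor3 = ['100', '010', '001', '111']    # assumed configuration: 3-input parity
print(pld(xor3, (1, 0, 0)))            # 1
print(pld(xor3, (1, 1, 0)))            # 0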
1.5.2 Application Modeling

The basic notions of data encoding, algorithms and synchronization for input and output apply to all systems that perform certain Boolean functions on their input data or on sequences of inputs. Digital systems may be built to perform a more general processing than just applying a single function to a single source of inputs as considered in section 1.1. By combining and interconnecting such simple systems into networks (so that a sub-system generates the input for another one), several functions can be applied to different input sources and feed several outputs. Whether the simple sub-systems correspond to separate hardware structures or timeshare the same circuits is a matter of implementation only. Also, the auxiliary functions of storage and sequential control have more applications than just supporting sequential computations, and so do the pre- and post-processing of inputs and outputs (e.g. code and signal conversions) and data communications. A digital system can thus perform the following tasks:
• Boolean algorithms to compute functions on encoded data
• data storage and retrieval
• generation of patterns of events in response to others
• communication of data between remote sites
• access of input and generation of output signals in various formats.

The timing behavior of a system can be described and specified through the input events it is exposed to and the output events by which it responds to them. They correspond to the input and output of data (an event without a transfer can be thought of as an ‘empty’ data transfer). The functional behavior of the system details the dependency of the output data on the inputs. The functional and timing specifications go together but are also used separately in
Figure 1.21 Multiple inputs and outputs to an embedded system
the further analysis and design steps. Usually, the operation of a digital system is not one-shot but is performed on sequences of input events and associated data and results in sequences of outputs. These will be referred to as input and output streams. The timing specification also defines the relative timing of subsequent entries in the streams. Figure 1.21 shows the multiple inputs and outputs of a digital system embedded into some physical environment (as used for control and measurement, or e.g. as a pacemaker). A still more complete application model would also specify the effect of the output events on subsequent inputs if there is feedback through the environment, and provide a system model of part of this environment. There are various theoretical frameworks and system models in the literature to describe aspects like the timing of operations and their mutual synchronization like Petri Nets and automata models [17, 33] that are applied for the purposes of simulation and analysis, and formal languages and tools dedicated to specifying the behavior of digital systems like [34, 35]. We will at this point informally discuss the main ingredients of a system specification, but later be interested in some unique, formal scheme to completely specify the operation of a digital system including the algorithms to be performed in order to eventually derive an implementation on a set of components supplied by a selected architecture and including software. The experimental tool described in section 7.7 attempts this. The top-level specification of a digital system (which usually results from a still more abstract, informal one of the processing task) will contain several subsystems with multiple inputs and outputs, and the specification of the external interfaces. It needs to describe the input data including their encoding and their timing, the desired processing including the time available for it, and the outputs and their relative timing. Some of the inputs and outputs may have to be stored in a memory. Then the i/o specification also includes memory requirements. The behavioral definition of the processing functions may be by means of algorithms but without implying that the same algorithms have to be used to construct the system. The specification may include the conditional generation of output events with a specific timing behavior, which corresponds to an original control flow, and prescribe a sequential order on some input and output events. It then defines a course, combined data and control flow in which the basic functional blocks receive and generate internal or external input and output and perform abstract processing functions with a prescribed timing for the external interfaces. The data from multiple inputs can either be applied in parallel to a processing function or by merging the input streams from several interfaces (or in a combination of both). The parallel data input to a processing function does not need to be strictly simultaneous but may be synchronized using appropriate storage devices. Then parallel inputs to the same processing function must at least occur at the same mean rate. A variant of parallel input is unsynchronized input where one of the input interfaces is always ready to deliver data (the last value received
from the corresponding stream and stored in some memory device). Unsynchronized input must be properly initialized or waited for until it becomes valid, and be sampled so that an extended computation uses consistent input data. Data output to multiple interfaces may also be parallel or alternative (or a combination of both). Alternative output corresponds to splitting an output stream to several destinations (depending on the input data) and represents another view of the control flow.

The input and output data transferred via the interfaces of a system are multi-bit fields (tuples) encoding the abstract application data. Conceptually, they are entities presented in parallel to or by the processing functions, although the actual transfers may split them into multiple packets that are transferred serially. An interface providing a data stream is an object giving access to the subsequent entities in a sequence with a specific timing. It can be modeled as a sequential file structure and also be simulated as such (for a timing simulation, time stamps would be added to the data entries).

The resulting structure of a general application system is easily formalized as a graph with a node set V (the functional blocks) and an edge set defined by a relation d on V describing the data flow edges and by a subset c of d describing the control flow. (V, d) is supposed to be an acyclic graph (without cyclic paths), and for every v ∈ V there is at most one v′ ∈ V such that (v′, v) ∈ c. Couples (v′, v) in d but not in c specify a data exchange via a buffer, and the sequencing of outputs from them if v synchronizes with another event. Control flows in the algorithms for the functions may be added to arrive at a hierarchical combined data and control flow graph [35].

An important structure defined in terms of the top-level system model is the partitioning of the set V of processing functions into 'processes'. A process groups a set of input and output events that occur in some partial order due to the control and data flow. On a single, sequential processor, this order would be implemented through a thread of instructions implementing the involved functions in a compatible order. Uncorrelated input streams would feed into different processes, and a function executing after another due to the data flow would be placed into the same process. A functional block inputting data from another one but operating at a different pace is in a different process and receives the input data through a buffer. The different processes proceed independently and simultaneously, and repetitively at the rates of the associated data streams. If a functional block happens to receive alternative input from different processes, it starts an extra process.

This notion of processes is not related to the usage of processors and thereby differs from other common process models found in the literature. A single process may use several microprocessors, which would communicate intermediate data to each other, to speed up the computation of individual processing functions or to operate in a pipeline in order to meet the timing requirements. Different processes exchange data using memory buffers which are usually accessed by a single processor only, without having to communicate data via interfaces in this case. In the common CSP model (communicating sequential processes, [32]), the processes are like individual processors that only communicate with others.
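To make this graph formalization concrete, the following Python sketch (purely illustrative; the node names and helper functions are invented and are not the formal scheme of section 7.7) represents a specification as the triple (V, d, c) and checks the two structural conditions just stated: (V, d) must be acyclic, and every node may have at most one control-flow predecessor.

V = {"A", "B", "C", "D", "E"}                      # functional blocks
d = {("A", "B"), ("A", "C"), ("B", "D"),
     ("C", "D"), ("C", "E")}                       # data flow relation on V
c = {("A", "B"), ("A", "C")}                       # control flow, a subset of d

def is_acyclic(nodes, edges):
    # Kahn-style check: (V, d) must contain no cyclic path.
    preds = {v: {u for (u, w) in edges if w == v} for v in nodes}
    ready = [v for v in nodes if not preds[v]]
    seen = 0
    while ready:
        u = ready.pop()
        seen += 1
        for v in nodes:
            if u in preds[v]:
                preds[v].discard(u)
                if not preds[v]:
                    ready.append(v)
    return seen == len(nodes)

def control_flow_ok(nodes, dflow, cflow):
    # c must be contained in d, and every node has at most one control predecessor.
    if not cflow <= dflow:
        return False
    return all(sum(1 for (u, v) in cflow if v == w) <= 1 for w in nodes)

assert is_acyclic(V, d)
assert control_flow_ok(V, d, c)
# Edges in d but not in c (e.g. ("C", "E")) carry data without imposing control flow.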
The process model used here is application-oriented instead of being related to an architecture of communicating processors. In a system of several microprocessors it is quite common to input the data to be processed and to output the results of the processing on different processors. In the example shown in Figure 1.22, inputs 1 and 2 feed separate processing functions with sequences of input events (input streams). All parallel inputs to F1 and G1 are summarized as inputs 1 and 2. As inputs to different blocks, they are considered as being unrelated and
Figure 1.22 Top-level data (➝) and control (→) flow specification
are processed independently. Alternative branches originating from a functional block (e.g. F1) are executed alternatively, depending on a condition that results from the computation in that block, and input events from alternative incoming branches to a block trigger the same subsequent processing. The control flow splits the intermediate data streams to feed alternative processing functions. For blocks that are connected in series, as in the case of the output from G3 to G4, the inputs and outputs of the second occur after those of the first. The block G2, receiving parallel input from G1 and F3, uses an intermediate storage structure ('buffers') to decouple its timing from F3 and from the rates of the sending blocks, using serialization as required (vice versa, the use of intermediate buffers can serve to specify the independent timing of the processing in different blocks). The partial control flows triggered by the input streams, F1–4 and G1–4, are the processes executed within the application.

For the transition from this level of specification to an implementation, algorithms for the processing functions based on the building blocks supported by the architecture must be provided, or even a choice of algorithms for every function. This usually occurs in two steps. First, algorithms based on arithmetic operations are provided (more generally, on operations related to the abstract data types to be handled), and then these operations are given algorithms based on the elementary functions performed by the architectural building blocks. The algorithms for the functions may involve the conditional execution of operations, and a choice of them for the functional blocks leads to a refined data and control flow. Due to the choice of algorithms and some freedom to move operations between blocks and to convert between data and control flow, the refined data and control flow graph is not unique. It is usually expanded by a compiler from a textual description in some formal language using hierarchical function references.

It is important that the resulting specification of a digital system be complete in describing the algorithms and the needed timing characteristics but remain at a level of abstraction not yet reflecting a particular hardware structure to implement it. The application is described through the data and control flow graph, ignoring all auxiliary functions, including the support needed for sequencing operations on processors (e.g. those memory requirements and control structures related only to sequential processing) and the communications between components as defined by the architecture; the decision whether to use sequential processing is postponed until after the analysis of the requirements during the implementation. Modern compilers permit an automatic transition from the specification to an instruction list for a particular processor capable of implementing the application system, or to configuration data for a PLD (programmable logic device), and some of them attempt to automatically generate the auxiliary control and communications for more general targets [36].
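As a loose illustration of this process notion (not of the book's formal scheme), the following Python sketch couples two processes through a buffer; the branching condition, the functions and all names are invented, and the sequential execution here only mimics the decoupling that real processes would exhibit.

from collections import deque

def producer(inputs, buffer):
    # One process: splits its input stream into two alternative branches;
    # results of one branch are handed to the other process through the buffer.
    for x in inputs:
        if x % 2 == 0:
            y = x + 1                 # branch kept inside this process
        else:
            y = 2 * x                 # branch whose result is buffered
            buffer.append(y)
        yield y                       # this process's own output stream

def consumer(inputs, buffer):
    # A second process running at its own pace, merging its input stream
    # with whatever buffered data are available.
    for x in inputs:
        extra = buffer.popleft() if buffer else 0
        yield x + extra

buf = deque()
out1 = list(producer(range(6), buf))        # first output stream
out2 = list(consumer(range(10, 16), buf))   # second output stream
print(out1, out2)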
1.5.3 Design Metrics
A given design task usually has many solutions, and it is useful to be able to compare different ones on the basis of well-defined metrics to select one that is particularly advantageous, or to perform a manual or automatic optimization. The target of this optimization varies with the application. Some properties of a desired system are mandatory while for others a strict optimization is not necessary. A non-optimal system design may be acceptable if the development time can be cut. The metrics to be discussed concentrate on a few technical aspects. If a digital (sub-)system is designed to serve several applications (e.g. a standard processor), the metrics for the comparison of alternative designs must be evaluated on a set of typical applications (benchmarks).

A digital system implements algorithms on the basis of a selected scalable architecture. A basic property of a system is its complexity. First, there is the complexity of the algorithms underlying its construction, i.e. the number k of elementary operations to be executed to obtain an output. Depending on the chosen balance of serial and parallel processing we have to distinguish from this the hardware complexity, i.e. the number of elementary building blocks used to perform the operations plus the required interconnection devices and the extra components needed to implement the serial execution of several operations on the same building blocks, including the memories to hold control programs and intermediate results. The hardware complexity and the architecture determine the manufacturing costs of the system.

Both the processing time and the throughput introduced in section 1.4 are important performance measures of a digital system. For many applications one or both of them are specified. This is particularly the case if the system is to be embedded in a physical environment that delivers input data with a pre-defined frequency. Then the digital system must not only calculate the right results but must be guaranteed to be able to process new data after a prescribed time. If the system accepts new input with the throughput f, then the frequency of individual operations is k·f. If there are m hardware building blocks, each able to perform operations at a maximum rate of g, then k·f ≤ m·g; thus, the throughput f is limited by m·(g/k), with k and g being given by the chosen algorithm and hardware technology. A given throughput requirement can therefore only be fulfilled by providing a sufficiently high number m of hardware building blocks. Usually, g ≥ 1/T holds, where T denotes the (mean) processing time of an individual operation. The processing time t of the algorithm satisfies t ≥ h·T where h is the length of a critical (i.e. longest) path through the network of building blocks prescribed by it. If the number of compute building blocks drops, the processing time rises to a limit ≥ k/g and the throughput decreases to a limit ≤ g/k, which correspond to the serial execution of all operations on a single building block (m = 1). To achieve the minimum execution time of h·T, the throughput f may be as small as 1/(h·T), and hence m ≥ k/(h·T·g) building blocks plus the required auxiliary circuits need to be employed. If g = T⁻¹ this reduces to the estimate m ≥ k/h.

Throughput and the hardware complexity of the system are related through the notion of efficiency. An efficient design will fulfill a given throughput or processing time requirement with the least possible hardware effort. The hardware effort includes the auxiliary control
• 36
DIGITAL COMPUTER BASICS
hardware needed for serial execution. If all building blocks have the same speed and hardware cost, the efficiency of a design achieving the throughput f is

e = (f·k/g)/M = q·m/M

Here M denotes the total number m + m′ of hardware building blocks (the overall hardware complexity or costs), where m′ is the equivalent control effort in units of compute building blocks. The constant m0 = f·k/g is the minimum number of compute blocks needed to achieve the number k of operations in the time 1/f. q = m0/m is the mean fraction of time in which the compute building blocks are actually used, and m/M is the fraction of the hardware that actually performs operations in the implemented algorithm. It is important to note that this definition of efficiency includes the control overhead; often, instead of maximizing e, designs merely attempt to maximize q. This amounts to the same if the auxiliary circuits are fixed, e.g. by adopting a particular processor architecture with a single multifunction compute circuit (the ALU). In that case m = 1 and q is the fraction of the pipeline cycles in which the ALU completes an operation of the algorithm (other cycles occur waiting for ALU operations to complete, for memory accesses, for the control flow, and for input/output). q is then called the ALU efficiency. It also depends on the schedule chosen for the operations of the algorithm. If several subsystems Si are used, each composed of mi blocks and operating at the fraction qi of the time, then

q = (m1/m)·q1 + · · · + (mr/m)·qr

To maximize q, one has to ensure qi is high whenever mi/m is high; the most complex subcircuits must be used to a high degree. If e.g. the subsystems operate in a pipeline and ti is the execution time of Si, then qi = Qi·ti/max(t1, . . . , tr) where Qi is the average usage of the mi components of Si during its execution time ti.

For a fully serial implementation (m = 1) the throughput requirement may not be met at all, or the control overhead m′ could be too large. A fully parallel (m = k, m′ = 0) and pipelined implementation achieves a 100% efficiency, assuming the output events of the building blocks are usable as input events of subsequent operations without having to add delays to synchronize parallel inputs. Then the throughput becomes g, which may be much higher than needed. If delays are needed or if the building blocks are reactive and need registers to be added to achieve their usage to 100%, this additional hardware effort decreases the efficiency. For a parallel circuit of depth h of reactive building blocks processing at the maximum rate of g = 1/T, not using pipelining, the throughput is the reciprocal of the processing time and the efficiency becomes 1/h, as every building block is used for a single operation taking the time T during the time t = h·T. The efficiency can be raised by increasing the average usage q of the compute building blocks (thereby decreasing m) or by decreasing the ratio M/m, i.e. the hardware effort for control relative to the effort for the compute circuits. The minimization of the hardware costs for the given algorithm implies certain design rules, namely:
- to use pipelining to fully exploit the capabilities of the individual building blocks, or to avoid non-pipelined serial compositions of reactive building blocks;
- to employ the serial usage of compute circuits and other resources, but to consider the overhead for the serial control and keep it in proportion, e.g. by controlling more complex circuits or by sharing control devices between several compute circuits, in particular the devices implementing the control flow between compute devices used for a process;
- to exploit the control flow so that operations that are unnecessary for the given input are not performed, or at least do not consume extra resources.

It is important to note that the notion of efficiency depends on the selection of an algorithm. It is possible that an algorithm will run less efficiently on a given architecture than on another one but still lead to a lower cost solution due to a significantly lower complexity. The notion of efficiency has been used in the literature mainly for the parallel execution of programs on networks of microprocessors [26] without considering its effect on the individual processor.

For a processor plus memory system the hardware, and thereby m′/m, is fixed, and only q can be increased by reducing the number of overhead cycles not used for a compute operation (a task for a compiler). This notion of efficiency is quite restrictive by not counting load, store and jump cycles that do not compute. Actually, a computer can perform computations just by performing conditional jumps and moves; the 'sel' function has been shown both to generate all Boolean functions and to be replaceable by an if/else jump construction. On an FPGA the overhead necessary for the capability of configuration is significant but cannot be changed either. However, auxiliary circuits needed for sequential processing are implemented with the same resources as the processing functions, and efficiency becomes essential (an efficient design would perform the required processing with a cheaper FPGA). For a processor design on an FPGA or as a chip, the conclusion from the above design rules is that most of the hardware resources should be devoted to the pipelined compute circuits (maybe more than one) while the sequencing of these should be implemented as simply as possible (current CPU designs do not always follow these considerations). Regarding the design of an efficient FPGA chip, similar considerations hold.

As another metric that may be crucial in some applications we mention the energy required for a computation, or the power consumption of a system that performs repetitive computations. The reduction of the hardware costs or the increase of efficiency by following the above design rules does not imply an increase in power efficiency. It turns out that for this some extra hardware effort has to be accepted, a possible approach being the use of handshaking building blocks.
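As a numerical illustration of the relations derived in this section (the bound k·f ≤ m·g and the efficiency e = (f·k/g)/M = q·m/M), here is a small Python sketch; all parameter values are invented.

import math

k = 120          # operations per result required by the chosen algorithm
g = 100e6        # maximum operation rate of one compute building block (1/s)
h = 10           # length of the critical path (operations)
f_req = 2e6      # required throughput (results/s)

m_min = math.ceil(k * f_req / g)     # smallest m satisfying k*f <= m*g
t_min = h / g                        # minimum processing time h*T, assuming g = 1/T

def efficiency(f, m, m_ctrl):
    # e = (f*k/g)/M = q*m/M with M = m + m_ctrl, q = m0/m and m0 = f*k/g;
    # m_ctrl plays the role of the control effort m' in compute-block units.
    M = m + m_ctrl
    m0 = f * k / g
    return m0 / M, m0 / m

e1, q1 = efficiency(f_req, m_min, m_ctrl=2)   # few blocks plus some control overhead
e2, q2 = efficiency(g, k, m_ctrl=0)           # fully parallel and pipelined: f = g, e = 1
print(m_min, t_min, e1, q1, e2)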
1.6 SUMMARY

This first chapter started with some well-known computer basics such as the concept of binary encoding and the realization of abstract functions through Boolean functions. Then it proceeded to the construction of machines from components (the hardware interpretation of algorithms), in particular by realizing the elementary Boolean operations, and showing a common method to minimize the complexity of such a construction. Before looking at any technology providing such building blocks, the timing of the operation of composite machines was considered, including the behavior of storage elements and the use of clock and handshake signals to define input and output events. Some important concepts introduced
along these lines were the pipelining of operations in order to increase the throughput, the serial execution of several compute steps on the same hardware building blocks, and the idea of a programmable sequential machine. It was pointed out that the data and control flow not only show up in algorithmic constructions but are already part of the top-level specification of a digital system and give rise to a structure of processes within an application that will be taken up in later chapters. We finally introduced the notion of efficiency of a digital system that can be applied to compare various design options for specific applications, and to compare general purpose (universal) architectures such as microprocessors.
EXERCISES
1. Let (b0, . . . , bn−1) and (g0, . . . , gn−1) be the binary and the Gray codes of an integer k in the range 0 ≤ k < 2^n. Verify that bi = gi ⊕ gi+1 ⊕ . . . ⊕ gn−1, '⊕' being the XOR operation, and that gi = bi ⊕ bi+1.
2. Prove that the binary polynomial p(x) = x^3 + x + 1 is primitive and list the corresponding error-correcting code. Describe how to encode tuples in B3 by tuples in B7 so that single bit errors can be corrected and double bit errors be detected.
3. Apply the simplification procedure of Quine and McCluskey to the function f = 0000 + 001* + 1*00 + *0*1 and compare the complexity of the minimized form to the original form, using a single NOT operation for every variable and reducing the multiple AND operations to 2-input AND operations.
4. Prove that the Quine–McCluskey method yields all minimum disjunctive forms for a function f.
5. Implement the method as a program using the list L0 as its input for an arbitrary n ≤ 16 and estimate the complexity of the method (the number of program steps) as a function of n.
6. Carry out the Quine–McCluskey procedure for an 8-input, 3-output encoder function that is only specified to map the tuples (b0, . . . , bn−1) having a single 1 bit to the 3-bit binary code of the bit index.
7. Perform simplification on the select tree implementation of the parity function by substituting select nodes with constants, inverters and XOR gates and by eliminating nodes delivering the same output.
8. Let a digital system receive the complex inputs u, v and compute from these the outputs u + v·w, u − v·w for a fixed, complex number w. The system is realized with real multiplier and real add-subtract components, each having an execution time T and the throughput 1/T. Determine the execution times, throughputs and efficiencies for implementations with 1–4 multipliers and 1–6 add-subtract circuits.
9. Design an algorithm that maps the operations of a data flow graph with n inputs and m outputs to k similar processors that execute every operation in the same processing time, so that every operation is scheduled after finishing the operations that deliver its inputs, taking advantage of being able to execute several operations in parallel.
10. Formalize the process model in section 1.5.2 and assign to every input a rate parameter for the input events, to every branching of a stream a parameter defining the percentage by which it is taken, and derive conditions on these rates. Refine the model by assigning computational complexities to the nodes and derive throughput requirements. In particular, determine the output data rates.
2 Hardware Elements
2.1 TRANSISTORS, GATES AND FLIP-FLOPS

2.1.1 Implementing Gates with Switches

Elementary building blocks implementing the Boolean operations AND, OR, NOT or SEL, from which all Boolean functions can in turn be constructed (see section 1.3.1), are realized quite easily by electronic means. A simple solution that was actually applied in the early days of computing is the use of electrically controlled, mechanical switches. The single basic component is the controlled switch with a control input G, a coil connected between G and a ground reference C, and the poles S, B, M of the mechanical switch (Figure 2.1). If a sufficiently high voltage is applied to G w.r.t. C, the magnetic force of the coil breaks the connection from S to B and makes the one from S to M, and hence performs a selection between the voltage levels applied to B and M, depending on the control input. An interval of voltage levels that cause the switch to be actuated is used to represent the Boolean 1 while the voltages near zero represent the 0, all voltages being referenced to C. This SEL building block fulfills the requirement that the input and output signals be compatible. It can thus be composed with others.

The high voltage level can be selected from a power supply (denoted '+' in Figure 2.1). To output a zero voltage to another coil, the corresponding select switch input can be left open, as the unconnected coil will assume the zero level by itself. Thus, the switch can be simplified to a break or a make switch actuated by the field of the coil. A break switch connected to '+' realizes the NOT operation. The parallel and serial compositions of make switches shown in Figure 2.1 implement the OR and AND functions. In the serial composition the switches controlled by X and Y must both close to output the '+' level. The parallel and serial compositions generalize to networks of switches with two dedicated nodes. The network is in the state f(a1, . . . , an) depending on the states ai of the switches it is composed of, the possible states being 1 ('closed') and 0 ('open'). If a second network of switches is in the state g(b1, . . . , bm), then their serial and
Figure 2.1 Switch-based SEL building block, and AND and OR switch circuits
Figure 2.2 N-channel transistor switch and equivalent circuit
parallel compositions are switch networks with two dedicated nodes with the state functions AND(f(a1, . . . , an), g(b1, . . . , bm)) and OR(f(a1, . . . , an), g(b1, . . . , bm)).

Unfortunately, electromechanical switches are slow, consume much space and power and suffer from a limited lifetime. Modern electronic computers use networks of transistors instead, which behave like electronic switches and are used in a similar fashion to the electromechanical switches, but are cheap and fast solid-state devices with a low power consumption and almost unlimited life that, moreover, have microscopic dimensions and can be integrated in their thousands into silicon chips. For an overview of the various classes of transistors and circuits implementing the gate functions, we refer to [2] and concentrate on the NMOS technology and on the most important and elegant one, the CMOS technology invented as long ago as 1963.

The common CMOS (complementary metal oxide semiconductor) technology uses two kinds of insulated-gate field effect transistors, the n-channel and the p-channel devices. The transistor symbols in the figures are denoted accordingly. These transistors have three terminals, the source and drain terminals (S, D) and the gate (G) which is the control input (for the sake of simplicity, the influence of the potential of the silicon substrate beneath the gate is ignored). For the n-channel transistor (Figure 2.2) with the source S near the ground reference (the negative supply), the gate input G at the H level causes a low-resistance connection from S to the drain D whereas an L level disconnects S and D. The device is voltage controlled; no current flows into the gate once the tiny input capacitance Cin has been charged, as shown in the simplified equivalent circuit in Figure 2.2. The transistor switch is modeled as an ideal switch put in series with a resistor, which is valid for small output voltages only. For the complementary p-channel transistor, the S terminal is at the level of the positive supply (0.6...18 V depending on the technology; the most common levels for external interfacing are 5 V and 3.3 V). For an H input to G the S–D switch becomes disconnected while for an L input it becomes low-resistance. The n-channel transistor is a make switch to L, and the p-channel transistor is a break switch to H.

This behavior of the transistors results from the VGS–ISD and VSD–ISD characteristics shown in Figure 2.3. For voltages VSD well below VGS, ISD grows linearly with VGS and the transistor behaves like a resistance that is inversely proportional to VGS − VT, VT being a constant of a few 100 mV that depends on the dimensions of the device and slightly decreases
Figure 2.3 Characteristics of the n-channel transistor (resistive behavior with Ron = 2 kΩ, then saturation; curves for VSD = VGS and for VGS = 5 V)
Figure 2.4 Simple inverter circuit and its transfer characteristic
with temperature by about 3 mV/°C. For a manufacturing process with 0.8 µm feature sizes (e.g. the gate length) VT is about 0.8 V [3] and the supply voltage is 5 V; for a finer process of 0.1 µm VT is below 0.2 V, and the supply voltage is reduced to about 1 V [4]. A simple approximation to the current ISD valid for VSD < VGS − VT is:

ISD = β·(VGS − VT − 1/2·VSD)·VSD    (1)

For output voltages VSD beyond VGS − VT the current through the transistor becomes saturated to

Isat = 1/2·β·(VGS − VT)²    (2)

and from (1) one concludes that:

Ron = β⁻¹·(VGS − VT)⁻¹ = 1/2·(VGS − VT)/Isat    (3)
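As a quick numerical check of equations (1)–(3), the following Python sketch evaluates them; the value of β is an assumption chosen so that Ron comes out near the 2 kΩ indicated in Figure 2.3, not a figure taken from [3].

# Evaluate the simple n-channel transistor model of equations (1)-(3).
# beta is assumed (chosen to give Ron of about 2 kOhm at VGS = 5 V, VT = 0.8 V).
beta = 1.2e-4       # transconductance parameter in A/V^2 (assumed)
VT   = 0.8          # threshold voltage in V (0.8 um process)
VGS  = 5.0          # gate-source voltage in V

def I_SD(VSD):
    # equation (1) in the resistive region, equation (2) in saturation
    if VSD < VGS - VT:
        return beta * (VGS - VT - 0.5 * VSD) * VSD
    return 0.5 * beta * (VGS - VT) ** 2

I_sat = 0.5 * beta * (VGS - VT) ** 2            # equation (2), about 1 mA
R_on  = 1.0 / (beta * (VGS - VT))               # equation (3), about 2 kOhm
print(I_SD(0.2), I_sat, R_on)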
A more accurate description reveals that ISD will still grow slowly with VSD for VSD > VGS − VT and will not vanish but decay exponentially as a function of VGS for VGS < VT [4, 5]. The transistor is actually a symmetric device; source and drain can be interchanged and used as the poles of an electrically controlled, bi-directional switch (the role of the source is played by the more negative terminal).

The simplest way to implement the Boolean NOT function with transistor switches is by connecting a properly chosen 'pull-up' resistor between the drain terminal of an n-channel transistor and the positive supply. Figure 2.4 shows the circuit and its VG − VD characteristic. The L interval is mapped into the H interval, and H into L as required. A second transistor switch connected in parallel to the other leads to an implementation of the NOR function while a serial composition of the switches implements the NAND, similarly to the circuits shown in
Figure 2.5 CMOS inverter, equivalent circuit and characteristic
Figure 2.6 Inverter output current characteristics for different VG (VT = 0.8 V)
Figure 2.1. These circuits were the basis of the NMOS integrated circuits used before CMOS became dominant. Their disadvantage is the power consumption through the resistor if the output is L, and the slow L-to-H transition after the switch opens, which is due to having to load the Cout capacitance and other load capacitors connected to the output through the resistor. The H-to-L transition is faster as the transistor discharges the capacitor with a much higher current. These disadvantages are avoided in CMOS technology by replacing the resistors by the complementary p-channel transistors.

The n- and p-channel transistors combine to form the CMOS inverter shown in Figure 2.5, together with a corresponding equivalent circuit and the typical VG − VD characteristic over the whole supply range. The CMOS inverter also implements the Boolean NOT operation. The equivalent circuit assumes that both transistors charge the output capacitor as fast as the same resistor R would do, which is the case if the transistors are sized appropriately. Typical values for the capacitors reported in [3] for a 0.8 µm process are Cin = 8 fF and Cout = 10 fF (1 fF = 10⁻¹⁵ F = 10⁻³ pF). The characteristic is similar to the curve in Figure 2.4 but much steeper, as the p-channel transistor becomes high-impedance when the n-channel one becomes low-impedance and vice versa. The inverter circuit can actually be used as a high-gain amplifier if it operates near the midpoint of the characteristic where small changes of VG cause large changes of VD. The dotted curve in Figure 2.5 plots the current through the transistors as a function of VD, which is seen to be near zero for output voltages in L or H.

When the input level to the CMOS inverter is switched between L and H the output capacitance C is charged by the limited currents through the output transistors. Therefore, the digital signals must be expected to take a non-zero amount of time for their L-to-H and H-to-L transitions, called the rise and fall times respectively. The characteristic in Figure 2.6 shows
Figure 2.7 Timing of the inverter signals
Figure 2.8 CMOS circuits for the NAND and NOR functions (NOR(X,Y) = SEL(X°,0,Y), NAND(X,Y) = SEL(1,X°,Y))
that for input voltages in the middle third of the input range (0 . . . 4.8 V) the currents supplied to charge the load capacitance are reduced by more than a factor of 2, and an input signal making a slow transition will have the effect of a slower output transition. There is hardly any effect on the output before the input reaches the midpoint (2.4 V), and at the midpoint, where the VG − VD characteristic is the steepest, the output becomes high impedance and does not deliver current at all at the medium output voltages. The worst-case processing time t of the inverter computing the NOT function may be defined as the time to load the output capacitance from the low end of the L interval (the negative supply) to the lower end of H for an input at the upper end of L (which is supposed to be the same as the time needed for the opposite transition). It is proportional to the capacitance,

t = R·C    (4)
where R depends on the definition of the H and L intervals and is a small multiple of the 'on' resistance of the transistors. Moreover, the output rise time, which may be defined as the time from leaving L to entering H, is also proportional to C (Figure 2.7) and is part of the processing time. In the data sheets of semiconductor products one mostly finds the related propagation delay, which is the time from the midpoint of an input transition to the midpoint of the output transition for specific input rise and fall times and a given load capacitance.

Figure 2.8 shows how more transistor switches combine to realize the NAND and NOR operations. A single transistor is no longer enough to substitute the pull-up resistor in the corresponding unipolar NMOS circuit. CMOS gates turn out to be more complex than their NMOS counterparts. Inputs and outputs are compatible and hence allow arbitrary compositions, starting with AND and OR composed from NAND and NOR and a NOT circuit. Putting
Figure 2.9 Inverter tree to drive high loads
Figure 2.10 Structure of a complex CMOS gate (p-channel switch network computing the H condition f+, n-channel network computing the L condition f−)
switches in series or in parallel as in the NAND and NOR gates can be extended to three levels and even more (yet not many). The degradation from also having their on resistances in series can be compensated for by adjusting the dimensions of the transistors. Another potential problem is that the output resistance in a particular state (L or H) may now depend on the input data, which for some patterns close a single switch and for others several switches in parallel. This can only be handled by adding more switches to a given arrangement so that in a parallel composition the branches can no longer be on simultaneously.

The timing of CMOS gates with multiple switches is similar to that of the inverter, i.e. it depends essentially on the load capacitances, the 'on' resistances and the transition times of the inputs. For a gate with several input signals that transition simultaneously, some switches may partially conduct during the transition time. For short rise and fall times it can be expected that the gate output makes just a single transition to the new output value within its processing time. During the signal rise and fall times invalid data are presented to the inputs, and the gates cannot be used to compute. The transition times hence limit the possible throughput. Therefore, short transition times (fast signal edges) are desirable, and large capacitive loads must be avoided.

The load capacitance Co at the output of a CMOS gate is the sum of the local output capacitance, the input capacitances of the k gate inputs driven by the output, and the capacitance of the wiring. The processing time and the output rise and fall times are proportional to Co and hence to k (the 'fan-out'). Figure 2.9 shows how a binary tree of h levels of inverters can be used to drive up to k = 2^(h+1) gate inputs with the fan-out limited to at most 2 at every inverter output. The tree has a processing time proportional to h ≈ ld(k), which is superior to the linear time for the direct output. For an even h, the transfer function from the input of the tree to any output is the identity mapping. All outputs transition synchronously.

The general form of a complex CMOS gate is shown in Figure 2.10. If the n-channel switch network driving the output to L has the state function f− and the p-channel network driving to
H has the state function f+, then the Boolean function f computed by the gate is:

f(x) = 1 if f+(x) = 1,
f(x) = 0 if f−(x) = 1,
f(x) undefined otherwise (the gate output goes high impedance).
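This three-valued output rule is easy to model in software. The following Python sketch (with invented helper names) describes a complex gate by the pair of state functions (f+, f−), returns 1, 0 or 'Z' for the high-impedance case, and checks the non-overlap condition f+·f− = 0 by enumeration; NAND serves as the example.

from itertools import product

def gate_output(f_plus, f_minus, x):
    # H if the p-channel network conducts, L if the n-channel network conducts,
    # high impedance ('Z') if neither does.
    if f_plus(*x):
        return 1
    if f_minus(*x):
        return 0
    return "Z"

def non_overlapping(f_plus, f_minus, n):
    # The two networks must never conduct simultaneously: f+ * f- = 0.
    return all(not (f_plus(*x) and f_minus(*x))
               for x in product((0, 1), repeat=n))

f_plus  = lambda x, y: not (x and y)     # p-network state function of a NAND gate
f_minus = lambda x, y: x and y           # n-network state function of a NAND gate

assert non_overlapping(f_plus, f_minus, 2)
print([gate_output(f_plus, f_minus, x) for x in product((0, 1), repeat=2)])
# [1, 1, 1, 0]: the NAND truth table; 'Z' never occurs here since f+ = (f-)°.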
Usually, f+ and f− are complementary and f = f+. The switch networks must not conduct simultaneously for any input, i.e. f+, f− satisfy the equation f+·f− = 0. For an NMOS gate there is only the n-channel network with the state function f− = f°; the p-channel network is replaced by a resistor.

CMOS or NMOS circuits can be designed systematically for a given Boolean expression. One method is to construct the switch networks from sub-networks put in series or in parallel. Not every network can be obtained this way, the most general network of switches being an undirected graph (the edges representing the switches) with two special nodes 'i' and 'o' so that every edge is on a simple path from 'i' to 'o' (other edges are no good for making a connection from 'i' to 'o'). Such a general network uses fewer switches than any network constructed by means of serial and parallel compositions that is controlled to perform the same function. Another method is to set up a selector switch tree, only maintain those branches on which the network is to conduct, and eliminate unnecessary switches. This also does not always yield a network with the minimum number of switches.

To derive a construction of the n-channel switch network using serial and parallel compositions to implement a given function f, the state function f− = f° of the network needs to be given as an AND/OR expression in the variables and their complements, yet without further NOT operations (every expression in the AND, OR and NOT operations can be transformed this way using the de Morgan laws). For every negated variable an inverted version of the corresponding input signal must be provided by means of an inverter circuit to drive the switch. To every variable site in the expression an n-channel transistor switch is assigned that is controlled by the corresponding signal. AND and OR of sub-expressions are translated into the serial and parallel compositions of the corresponding switch networks, respectively. For the NMOS design, a single pull-up resistor is used to produce the H output when the switch arrangement is open. A CMOS circuit for the desired function requires a p-channel network with the state function f+ = f that is obtained in a similar fashion, e.g. by transforming the negated expression into the above kind of expression.

The required number of transistor switches for the NMOS circuit is the number c of variable sites in the expression (the leaves in the expression tree) plus the number of transistors in the inverters required for the variables (the AND and OR operations that usually account for the complexity of a Boolean expression do not cost anything but add up to c−1). The CMOS circuit uses twice this number of transistors if the complementary switch arrangement is chosen to drive to the H level. Forming circuits by this method leads to less complex and faster circuits than those obtained by composing the elementary NAND, NOR and NOT CMOS circuits. The XOR function would e.g. be computed as:

XOR(X,Y) = (X°Y° + X¹Y¹)°          for the n-channel network
         = ((X° + Y¹)(X¹ + Y°))°   for the p-channel network

by means of two inverters and 8 transistor switches, whereas otherwise one would use two inverters and 12 transistors (and more time). In the special case of expressions representing
Figure 2.11 4-transistor and-or-invert gate
Figure 2.12 CMOS gate using complementary n-channel networks
f− without negated variables, no inverters are required at all. The expression XY + UV for f− yields the so-called and-or-invert gate with just 8 transistors for the CMOS circuit or 4 transistors for the NMOS circuit (Figure 2.11). Another example of this kind is the three-input operation O(X, Y, Z) = XY + YZ + ZX = X(Y + Z) + YZ which requires 10 transistors for the CMOS gate and an inverter for the output. Due to the availability of such complex gate functions, the construction of compute circuits can be based on more complex building blocks than just the AND, OR and NOT operations.

The p-networks in CMOS gates require a similar number of transistors as the n-networks but more space. The circuit structure shown in Figure 2.12 uses two complementary n-channel networks instead, and two p-channel transistors to drive the outputs of the n-channel networks to the H level. This structure also delivers the inverted output. If the inputs are taken from gates of this type, too, then all inverters can be eliminated. For simple gates like AND and OR this technique involves some overhead, while for complex gates the transistor count can even be reduced as the n-channel networks may be designed to share transistor switches. The XOR gate built this way also requires just 8 transistors plus two input inverters (which may not be required) and also provides the inverted output.

The n- and p-channel transistors can be used not only to switch on low-resistance paths to the supply rails but also as 'pass' transistors to connect to other sources outputting intermediate voltages. The n-channel pass transistor, however, cannot make a low-impedance connection to a source outputting an H level close to the supply voltage U (above U − UT), and the p-channel pass transistor cannot do so to a source outputting a voltage below UT. If an n-channel and a p-channel transistor switch are connected in parallel and have their gates at opposite levels through an inverter, one obtains a bi-directional switch (the 'transmission gate') that passes signals with a low resistance over the full supply range in its on state. The controlled switch is also useful for switching non-digital signals ranging continuously from
Figure 2.13 SEL based on transmission gates
the ground reference to the positive supply. Transmission gates can be combined in the same fashion as the n-channel and p-channel single-transistor switches are in the networks driving L and H to perform Boolean functions, but are no longer restricted to operate near the L or H level (if they do, they can be replaced by a single transistor). The output of a transmission gate will be within a logic level L or H if the input is. The transmission gate does not amplify; the output load needs to be driven by the input through the on resistance of the switch. Figure 2.13 shows an implementation of SEL with bi-directional transistor switches which requires fewer transistors than its implementation as a complex gate, namely just 6 instead of 12. If an inverter is added at the output to decouple the load from the inputs, two more transistors are needed. The multiplexer/selector can be further simplified by using n-channel pass transistors only. Then for H level inputs the output suffers from the voltage drop by UT. The full H level can be restored by an output inverter circuit.

Besides driving an output to H or L there is the option not to drive it at all for some input patterns (it is not recommended to drive an output to H and L simultaneously). Every output connects to some wire used to route it to the input of other circuits or out of the system; this wiring constitutes the interconnection medium used by the architecture of directly wired CMOS gates and is a hardware resource. The idea of sequentially using the same system component for different purposes also applies to the interconnection medium. Therefore it can be useful not to drive a wire continuously from the same output but to be able to disconnect it and use the same wire for another data transfer. Then the wire becomes a 'bus' to which several outputs can connect. An output that assumes a high-impedance state in response to some input signal patterns is called 'tri-state', the third state besides the ordinary H and L output states being the high-impedance 'off' state (sometimes denoted as 'Z' in function tables).

A simple method to switch a CMOS gate output to a high-impedance state in response to an extra control signal is to connect a transmission gate to the output of the gate. If several outputs extended this way are connected to a bus line, one obtains a distributed select circuit similar to the circuit in Figure 2.13, yet not requiring all selector inputs to be routed to the site of a localized circuit. Another implementation of an additional high-impedance state for some gate output is to connect an inverting or non-inverting buffer circuit (one with an identity transfer function) to it, with extra transistor switches actuated by the control signal to disconnect the output (Figure 2.14). The switches controlled by the disconnect signal can also be put in series with the n- and p-channel networks of a CMOS gate (see Figure 2.10), or the 'off' state can be integrated into the definitions of the n- and p-channel networks by defining the input patterns yielding the 'closed' states for them not to be complementary (just disjoint). Banks of several tri-state buffers are a common component in digital systems and are available as integrated components to select a set of input signals to drive a number of bus
Figure 2.14 Tri-state output circuit
lines. The circuit in Figure 2.14 can be considered as part of an 8 + 2 transistor inverting selector circuit that uses another chain of 4 transistors for the second data input, to which the disconnect signal is applied in the complementary sense.

A simplified version of the tri-state output circuit connected to a bus line is the 'open-drain' output that results from replacing the p-channel transistors driving the output to the H level by a single pull-up resistor for the bus line. Several open-drain outputs may be connected to the same bus line, and several of them may be on and drive the bus to the L level simultaneously. The level on the bus is the result of an AND applied to the individual outputs, as in Figure 2.14 within a gate. The AND performed by wiring to a common pull-up resistor is called the 'wired AND'. An open-drain output can be simulated by a tri-state output buffer that uses the same input for data and to disconnect.

The CMOS building blocks explained so far are reactive in the sense of section 1.4.3: after the processing time they keep their output if the inputs do not change. Complex circuits composed from the basic CMOS gates are also reactive. They are usually applied so that their input remains unchanged within the processing time, i.e. without attempting to exploit their throughput, which may be higher. Circuits suitable for raising the throughput via pipelining must be given a layered structure (see Figure 1.12) by adding buffers if necessary. Then they also have the advantage that they do not go through multiple intermediate output changes (hazards) that otherwise can arise from operands to a gate having different delays w.r.t. the input.
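The bus behavior described above, a tri-state line with at most one active driver and the wired AND of open-drain outputs, can be summarized in a few lines of Python; the function names and the 'X' marker for contention are invented.

def tristate_bus(drivers):
    # Each driver is 0, 1 or 'Z' (disconnected); at most one should be active.
    active = [v for v in drivers if v != "Z"]
    if not active:
        return "Z"          # nobody drives; a bus keeper would hold the last level
    if len(active) == 1:
        return active[0]
    return "X"              # contention: several outputs drive the line at once

def open_drain_bus(drivers):
    # Open-drain outputs only pull to L; the pull-up resistor supplies H.
    # The bus level is the AND of the individual outputs (the 'wired AND').
    return int(all(drivers))

print(tristate_bus(["Z", 1, "Z"]))      # 1
print(open_drain_bus([1, 1, 0]))        # 0: a single output pulling low wins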
2.1.2 Registers and Synchronization Signals

Besides the computational elements which can now be constructed from the CMOS gates according to appropriate algorithms (further discussed in Chapter 4), storage elements (registers) have been identified in section 1.4 as an essential prerequisite to building efficient digital systems using pipelining and serial processing. A simple circuit to store a result value for a few ms from a starting event is the tri-state output (Figure 2.14) or the output of a transmission gate driving a load capacitance (attached gate inputs). Once the output is switched to the high-impedance state, the load capacitor keeps its voltage due to the high impedance of the gate inputs and the output transistors in their 'off' state. Due to small residual currents, the output voltage slowly changes and needs to be refreshed by driving the output again at a minimum rate of a few 100 Hz if a longer storage time is required. This kind of storage element is called 'dynamic'. If the inverter inside the
Figure 2.15 Pipelining with tri-state buffers or pass gates used as dynamic D latches
Figure 2.16 Dynamic master–slave D flip-flop
tri-state output circuit can be shared between several storage elements (e.g., in a pipeline), only two transistors are required for this function. Figure 2.15 shows how a pipeline can be set up using dynamic storage and a periodic clock as in Figure 1.16. The required tri-state outputs can be incorporated into the compute circuits or realized as separate circuits (called dynamic 'D latches'). The input to the compute circuits is stable during the 'off' phase of the clock signal when the transmission gates are high-impedance. In the 'on' phase the inputs change, and the compute circuit must not follow these changes before the next 'off' time but must output the result of the previous input just before the next 'off' time at the latest (the clock period must be larger than the processing time). This would hold if the compute circuit has a layered structure operated in a non-pipelined fashion.

If the output follows the input changes too fast, one can resort to using two non-overlapping clocks, one to change the input and one to sample the output. Then the input data are still unchanged when the output gets stored. The output latch is the input latch of the next stage of the pipeline. A simple scheme is to connect two complementary clocks to every second set of output latches which, however, implies that the input 'on' phase cannot be used for processing (the dotted clock in Figure 2.15 is the complementary one). Alternatively, the input to the next pipeline stage can be stored in a second storage element during the time the output of the previous one changes, which is the master–slave storage element shown in Figure 2.16 that provides input and output events at the L-to-H clock edges only, as discussed in section 1.4.3. The clock signal for the second (slave) storage element is the inverse of the first and is generated by the inverter needed for the first. While the first (master) storage element can be a transmission gate or the tri-state function within the data source, the second cannot be realized as a transmission gate, as this would discharge the storage capacitor, but needs an additional inverter or buffer stage (see Figure 2.14). Then a total of 8 transistors are needed to provide the (inverting) register function. Due to the inverter, the dynamic D flip-flop also has a non-zero propagation delay or processing time from the data input immediately before the L-to-H clock edge to the data appearing at the output.

With the master–slave elements the input data are stable during the full clock period, and the compute circuit can use all of the period for its processing, except for the processing time of the flip-flop, without special requirements on its structure. If the circuit in the pipeline is a single gate, the flip-flop delay would still inhibit its efficient usage, as in the case of the two-phase sampling scheme.
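The sampling behavior of such edge-triggered registers in a pipeline can be mimicked by a short behavioral model. The following Python sketch is purely illustrative (no transistor-level detail); it shows two register stages sampling on the L-to-H clock edge and the resulting latency of two clock periods.

class DFlipFlop:
    # Behavioral model of a positive-edge-triggered (master-slave) D flip-flop.
    def __init__(self, init=0):
        self.q = init            # output, stable over the whole clock period
        self._clk = 0

    def tick(self, d, clk):
        # The input is passed to the output only on an L-to-H clock edge.
        if clk == 1 and self._clk == 0:
            self.q = d
        self._clk = clk
        return self.q

stage1, stage2 = DFlipFlop(), DFlipFlop()
results = []
for x in [3, 1, 4, 1, 5]:                # input stream
    y1 = stage1.q + 1                    # first compute circuit (example function)
    y2 = stage2.q * 2                    # second compute circuit (example function)
    results.append(y2)
    for clk in (0, 1):                   # one clock period ending in an L-to-H edge
        stage1.tick(x, clk)
        stage2.tick(y1, clk)
print(results)                           # each result appears two clock periods late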
Figure 2.17 Pipeline stage using dynamic logic (data input stable for C = H, data output stable for C = L)
The tricky variant shown in Figure 2.17 (called 'dynamic logic') implements a single gate plus D flip-flop pipeline stage with just 5 transistors for the flip-flop function and further reduces the hardware complexity of the gate by eliminating the pull-up network. When the clock is L, the inner capacitor is charged high but the output capacitor holds its previous value. In the H phase the data input to the gates of the switch network must be stable, to discharge the inner capacitor again to L if the network conducts, and the output capacitor is set to the resulting value. The processing by the gate and the loading of the output occur during the H phase of the clock, while the n-channel switch network is idle during the L phase. The next stage of the pipeline can use an inverted clock to continue to operate during the L phase when stable data are output. Alternatively, the next stage can use the same clock but a complementary structure using a p-channel network. The simplest case for the n- or p-channel network is a single transistor. Then the stage can be used to add delay elements into a complex circuit (called shimming delays) to give it a layered structure to enable pipelining. One can hardly do better with such little hardware. If the n-channel network were to be operated at the double rate, the input would have to change very fast. The idle phase is actually used to let the inputs transition (charge the input capacitance to the new value). Dynamic logic is sometimes used in conjunction with static CMOS circuits to realize complex functions with a lower transistor count.

Static storage elements that hold their output without having to refresh it are realized by CMOS circuits using feedback and require some more effort. Boolean algorithms correspond to feed-forward networks of gates that do not contain cycles (feedback of an output). If feedback is used for the CMOS inverter by wiring its output to the input, it cannot output an H or L level as none of them is a solution to the equation

x = NOT(x)

In this case the output goes between the L and H levels and the inverter input (and output) is forced to the steep region of the VG − VD characteristic (see Figure 2.5) where it behaves like an analogue amplifier. For the non-inverting driver built from two inverters put in series, the feedback equation

x = NOT(NOT(x))

has the two solutions L and H. The circuit composed of two CMOS inverters remains indefinitely in either of these output states, as flipping to the other state would require energy to charge the output capacitance until it leaves the initial interval, overcoming the discharge current
Figure 2.18 Simple static storage element
Figure 2.19 The RS flip-flop (NOR version) and the MRS gate
of the active output transistor that does not switch off before the double inverter delay. This 4-transistor feedback circuit is thus a storage element keeping its output value through time. If the energy is applied by briefly connecting a low-impedance source to one of the outputs (e.g., from a tri-state output), the feedback circuit can be set into any desired state that remains stored afterwards. Actually, the needed amount of energy can be made very small by applying the feedback from the second inverter through a high resistor or, equivalently, by using transistors with a high on resistance for it (Figure 2.18), which is sufficient to keep the input capacitance to the first inverter continuously charged to H or L (there is no resistive load otherwise). An immediate application is to keep the last value driven onto a bus line to avoid the line being slowly discharged to levels outside L and H where gates inputting from the bus might start to draw current (see Figure 2.5).

The combination of Figures 2.14 and 2.18 (the dynamic D latch and the bus keeper circuit) is the so-called D latch (usually, an inverter is added at the output). To set the output value Q from the first inverter in Figure 2.18 to the value presented at the data input D, one needs to apply the L level to the disconnect input C for a short time. Thereafter Q does not change. During the time when the disconnect input is L, the D latch is 'transparent': the data input value is propagated to the output and the output follows all changes at the input. This may be tolerable if the input data do not change during this time, which may be kept very short and may even be required in some applications.

There are many other ways to implement storage elements with feedback circuits. The feedback circuit in Figure 2.19 built from 2 NOR gates (8 transistors) allows the setting of the output to H or L by applying H to the S or R input. It is called the RS flip-flop. It performs its output changes in response to the L-to-H transitions on R or S. A similar behavior results if NAND gates are used instead of the NOR gates (L and H become interchanged). A similar circuit having three stable output states and three inputs to set it into each of these can be built by cross-connecting three 3-input NOR gates instead of the two 2-input gates. An RS flip-flop can be set and reset by separate signals but requires them not to become active simultaneously. A similar function often used to generate handshaking signals is the
so-called Muller C gate with one input inverted (called MRS below), which differs from the RS flip-flop by also allowing the H-H input and not changing the output in that case. It can be derived from the RS flip-flop by using two extra NOR gates and two inverters to suppress the H-H input combination.

The register storing input data at the positive edges of a control signal (see Figure 1.15), without any assumptions about their frequency, and holding the output data for an unlimited time, can be derived from the static D latch. Passing and holding the input data present at the positive clock edge without changing the output before the edge is done by cascading two D latches into the master–slave D flip-flop and using a complementary clock for the second, as already shown in Figure 2.16 for the dynamic circuit. While the first stage opens to let the input data pass, the second stage still holds the previous output. At the positive edge the first stage keeps its output, which is passed by the second. The inverted clock signal is already generated in the first D latch. Thus 18 transistors do the job, or 14 if pass gates are used instead of the tri-state circuits. The (static) D flip-flop is the standard circuit implementing the sampling of digital signals at discrete times (the clock events). Banks of D flip-flops are offered as integrated components to sample and store several signals in parallel, also in combination with tri-state outputs. The timing of a static D flip-flop is similar to that of the dynamic flip-flop, i.e. a small processing time is required to pass the input data to the output after the clock edge. For commercial components the timing is referenced to the positive clock edge (for which a maximum rise time is specified) so that input data must be stable between the set-up time before the edge and the hold time after the edge. The new output appears after a propagation delay from the clock edge.

Apart from these basic storage circuits, feedback is not used within Boolean circuits. Feedback is, however, possible and is actually resolved into operations performed at subsequent time steps if a register is within the feedback path (Figure 2.20). If the consecutive clock edges are indexed, xi is the input to the Boolean circuit from the register output between the edges i and i + 1, and ei is the remaining input during this time. The output f(xi, ei) of the Boolean circuit applied to the register input is not constrained to equal xi but becomes the register output only after the next clock edge, i.e.:

xi+1 = f(xi, ei)    (5)
Circuits of this kind (also called automata) have many applications and will be further discussed in Chapter 5. If, e.g., the xi are number codes and xi+1 = xi + 1, then the register outputs consecutive numbers, i.e. it functions as a clock edge counter. The simplest special case is a single-bit number stored in a single D flip-flop, using an inverter to generate xi+1 = xi + 1 = (xi)◦, the complement of xi (Figure 2.21). After every L-to-H clock edge the output changes from H to L or from L to H; it thus toggles at half the clock frequency.
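As an illustration only (not part of the original circuit discussion), the behavior of such a registered feedback circuit can be mimicked in software by applying the feedback function once per clock edge. The following Python sketch models the single-bit toggle stage and a 4-bit clock edge counter; the function names and the bit width are chosen freely for the example.

# Illustrative model of a registered feedback circuit ("automaton"):
# the state x is updated once per clock edge by x = f(x, e).

def toggle(x, e=None):
    # single D flip-flop fed by an inverter: next x is the complement of x
    return 1 - x

def count4(x, e=None):
    # 4-bit clock edge counter: next x is (x + 1) mod 16
    return (x + 1) % 16

def run(f, x0, edges):
    x, trace = x0, [x0]
    for _ in range(edges):
        x = f(x)              # one register update per L-to-H clock edge
        trace.append(x)
    return trace

print(run(toggle, 0, 8))      # 0,1,0,1,... toggles at half the clock rate
print(run(count4, 0, 20))     # counts up and wraps from 15 back to 0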
Figure 2.20 Feedback via a register
Figure 2.21 Single bit counter
Figure 2.22 Shift register
Figure 2.23 D flip-flop clocked at both edges
Another example of an automaton of this kind is the n-bit shift register built from n D flip-flops put in series (Figure 2.22). At the clock edge the data values move forward by one position so that, if ei is the input to the first flip-flop after the ith edge, the n-tuple output by the shift register is (ei−1, ei−2, .., ei−n+1). The shift register is a versatile storage structure for multiple, subsequent input values that does not need extra circuits to direct the input data to different flip-flops or to select from their outputs. If the shift register is clocked continuously, it can be built using dynamic D flip-flops of 8 transistors each (6 if dynamic logic is employed). If both transitions of the clock are used instead of just the L-to-H transitions of a ‘unipolar’ clock, then the clock signal does not need to return to L before the next event, and this ‘bipolar’ clock can run at a lower frequency. Also, two sequences of events signaled by the bipolar clock sources c, c′ can be merged by forming the combined bipolar clock XOR(c, c′) (nearly simultaneous transitions would then be suppressed, however). An L-to-H only unipolar clock signal is converted into an equivalent bipolar one using both transitions with the 1-bit counter (Figure 2.21), and conversely by forming the XOR of the bipolar clock and a delayed version of it. A general method for building circuits that respond to the L-to-H edges of several unipolar clock signals is to first transform the clocks into bipolar ones signaling at both transitions and then to merge them into a single bipolar clock. Figure 2.23 shows a variant of the D flip-flop that samples the input data on both clock edges. The D latches are not put in series as in the master–slave configuration, but in parallel so that they receive the same input. The inverter is shared by the latches and the select gate. The auxiliary circuits needed to provide handshaking signals (see Figure 1.18) to a compute building block can be synthesized in various ways from the components discussed
Figure 2.24 Handshake generation
so far [7, 39]. In order not to delay the input handshake until the output is no longer needed, and to implement pipelining or the ability to use the circuit several times during an algorithm, a register also taking part in the handshaking is common for the input or output. If a certain minimum rate can be guaranteed for the application of the building block, dynamic storage can be used. A building block that can be used at arbitrary rates requires static storage elements. The handshaking signals can be generated by a circuit that runs through their protocol in several sequential steps synchronized to some clock, but at the level of hardware building blocks simpler solutions exist. Due to the effort needed to generate the handshake signals, handshaking is not applied to individual gates but to more complex functions. Handshaking begins with the event of new input data, which is signaled by letting IR perform its L-to-H transition. After this event the IR signal remains active, using some storage element to keep its value. It is reset to the inactive state in response to another event, namely the IA signal transition, and hence requires a storage circuit that responds to two clock inputs. If IR and IA were defined to signal new data by switching to the opposite level (i.e., using both transitions), they would not have to be reset at all and could be generated by separately clocked flip-flops. This definition of the handshaking signals is suitable for pipelining but causes difficulties when handshaking signals need to be combined or selected from different sources. The generic circuit in Figure 2.24 uses two MRS flip-flops to generate IA and OR. It is combined with an arbitrary compute function and a storage element for its data output (a latch freezing the output data as long as OR is H, maybe just by tri-stating the output of the compute circuit). The OR signal also indicates valid data being stored in the data register. The rising edge of the IR signal is delayed by a delay generator circuit corresponding to the worst-case processing delay of the compute circuit, while the falling edge is supposed to be passed immediately. A handshaking cycle begins with IA and IR being L. IR goes H, and valid data are presented at the input at the same time. After the processing delay the rising edge of IR is passed to the input of the upper MRS gate. It responds by setting IA to the H level as soon as the OR signal output by the lower MRS gate has been reset by an OA pulse. The setting of IA causes OR to be set again once OA is L, and thereby latches the output data of the compute circuit that have become valid at that time. IA is reset to L when the falling edge of IR is passed to the upper MRS gate. Alternatively, the compute and delay circuits may be placed to the right of the MRS gates and the data register, which then becomes an input register. To generate the delay for a compute circuit that is a network of elementary gates, one can employ a chain of inverters or AND gates (then the delay will automatically adjust to changes of the temperature or the supply voltage). If the circuit is realized by means of dynamic logic or otherwise synchronized to a periodic clock signal, the delay can be generated by a shift
register or by counting up to the number of clock cycles needed to perform the computation (an unrelated fast clock could also serve as a time base). Some algorithms may allow the delayed request to be derived from signals within the circuit.
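The next-state behavior of the RS flip-flop and of the MRS gate used for the handshake generation above can be summarized by small state-update functions. The following Python sketch is a behavioral illustration only (0 and 1 stand for L and H; the function names are not taken from the text):

# Behavioral sketch of the storage elements discussed above (0 = L, 1 = H).

def rs_next(q, s, r):
    # NOR-based RS flip-flop: set on S=H, reset on R=H; S=R=H is not allowed
    assert not (s and r), "S and R must not be active simultaneously"
    if s: return 1
    if r: return 0
    return q                      # hold

def mrs_next(q, s, r):
    # Muller C gate with the R input inverted: like RS, but S=R=H simply holds
    if s and not r: return 1      # set
    if r and not s: return 0      # reset
    return q                      # hold (covers S=R=L and S=R=H)

q = 0
for s, r in [(1, 0), (1, 1), (0, 0), (0, 1)]:
    q = mrs_next(q, s, r)
    print(s, r, "->", q)          # prints 1, 1, 1, 0 as the new output values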
2.1.3 Power Consumption and Related Design Rules

A CMOS circuit does not consume power once the output capacitance has been loaded, all digital signals have attained a steady state close to the ground level or the power supply level, and the transistor switches in the ‘open’ state really do not conduct. Actually a small quiescent current remains, but at the current supply voltage levels typically less than 1% of the power consumption of a system based on CMOS technology is due to it. Another part of the total power consumption, typically about 10%, is due to the fact that for gate inputs in the intermediate region between L and H both the n-channel and p-channel transistors conduct to some degree (Figure 2.5). Inputs from a high impedance source (e.g., a bus line) may be kept from discharging into the intermediate region by using hold circuits (Figure 2.18), but every transition from L to H or vice versa needs to pass this intermediate region. The transition times of the digital signals determine how fast this intermediate region is passed and how much power is dissipated during the transitions. According to equation (4) in section 2.1.1, they are proportional to the capacitance driven by the signal source. If f is the frequency of L-H transitions at the inverter input, t the time to pass from L to H and j the mean ‘cross-current’ in the intermediate region, then the mean current drawn from the supply is:

I = 2 ∗ j ∗ t ∗ f    (6)
To keep this current low, load capacitances must be kept low, and high fan-outs must be avoided. If N inverter inputs need to be driven by a signal, the load capacitance is proportional to N and the cross-current through the N inverters becomes proportional to N². If a driver tree is implemented (Figure 2.9), about 2N inverter inputs need to be driven, but the rise time is constant and the cross-current is just proportional to N. The major part of the power consumption is dissipated during the changes of the signals between the logic levels to charge or discharge the input and output capacitances of the gates. To charge a capacitor with the capacitance C from zero to the supply voltage U, the applied charge and energy are:

Q = U ∗ C,    E = U² ∗ C    (7)
Half of this energy remains stored in the capacitor while the other half is dissipated as heat when the capacitor is charged via a transistor (or a resistor) from the supply voltage U. If the capacitor is charged and discharged with a mean frequency f, the resulting current and power dissipation are:

I = Q ∗ f = U ∗ C ∗ f,    P = E ∗ f = U ∗ I    (8)
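As a rough numeric illustration of equations (7) and (8), the following Python sketch estimates the dynamic power of a chip from an assumed node capacitance, toggle frequency and node count; the numbers are illustrative assumptions, not data from the text.

# Rough dynamic power estimate from equations (7) and (8).
# All numbers below are assumed for illustration only.
U = 1.2          # supply voltage in volts
C = 10e-15       # 10 fF switched capacitance per gate node
f = 500e6        # mean toggle frequency in Hz
nodes = 1e6      # number of switching nodes on the chip

E_per_node = U**2 * C            # equation (7): energy drawn per charge to H
P = nodes * E_per_node * f       # equation (8) applied per node and summed
print(f"{P:.2f} W")              # about 7.2 W for these assumed values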
This power dissipation may set a limit to the operating frequency of an integrated circuit; if all gates were used at the highest possible frequency, the chip might heat up too much even if extensive cooling is applied. Semiconductor junctions must stay below 150 °C. The junction temperature exceeds the temperature of the chip package surface by the dissipated power times the thermal resistance of the package.
Equations (7) and (8) also apply if the capacitor is not discharged or charged to the supply voltage but charged by an amount U with respect to an arbitrary initial voltage and then discharged again to this initial voltage through resistors or transistors connected to the final voltage levels that supply the charge or discharge currents. U cannot be reduced arbitrarily for the sake of a reduced power consumption, as some noise margin is needed between the H and L levels. The voltage swing can be lowered to a few 100 mV if two-line differential encoding is used for the bits (i.e. a pair of signals driven to complementary levels), exploiting the common mode noise immunity of a differential signal. If the inputs to a Boolean circuit implementing an algorithm for some function on the basis of gate operations are changed to a new bit pattern, then after the processing time of the circuit all gate outputs will have attained steady values. If k gate inputs and outputs have changed from L to H, the energy for the computation of the new output is at least

E = k ∗ U² ∗ C    (9)
if the capacitances at all gate inputs and outputs are assumed to be equal to C and the actual levels within the L and H intervals are zero and U. It becomes higher if intermediate changes to invalid levels occur due to gate delays. These may be determined through an analysis or a simulation of the circuit and are only avoided in a layered circuit design with identical, data-independent gate delays. If the computation is repeated with a frequency f, and k is the mean number of bit changes for the applied input data, then the power dissipation is P = E ∗ f. The power dissipation depends both on the choice of the algorithm and on the applied data. Different algorithms for the same function may require different amounts of energy. The number k of level changes does not depend on whether the computation is performed by a parallel circuit or serially. As a partially serial computation needs auxiliary control and storage circuits, it will consume more energy than a parallel one. Equation (8) depends on the fact that during the charging of the capacitor a large voltage (up to U) develops across the resistor. If during the loading process the voltage across the resistor is limited to a small value by loading from a ramp or sine waveform instead of the fixed level U, the energy dissipated in the resistor or transistor can be arbitrarily low. The capacitor can be charged by the constant current I to the level of U in a time of T = UC/I. During this time the power dissipated in the resistor is R ∗ I², and the energy dissipated during T becomes:

E = R ∗ C ∗ U ∗ I    (10)
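A small Python sketch may illustrate equation (10) by comparing the energy dissipated when a node capacitance is charged conventionally with the energy dissipated by constant-current charging spread over a time T; all component values are assumed for the example only.

# Energy lost when charging a node capacitance C to U:
# conventionally, half of U**2 * C is dissipated in the switch;
# with a constant current over a time T, equation (10) applies.
U, C, R = 1.2, 1e-12, 1e3     # volts, farads, ohms (assumed values)
T = 10e-9                     # allowed charging time in seconds (here 10*R*C)

E_conventional = 0.5 * U**2 * C
I = U * C / T                 # constant charging current reaching U after T
E_adiabatic = R * C * U * I   # equation (10), equal to R * C**2 * U**2 / T
print(E_conventional, E_adiabatic, E_conventional / E_adiabatic)
# the dissipated energy shrinks further as T is made longer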
If before and after a computation the same number of signal nodes with capacitances C w.r.t. the ground level are at the H level, then theoretically the new state could be reached without extra energy, as the charges in the capacitors are just redistributed at the same level of potential energy. This would always be the case if the input and output codes are extended by their complements and the Boolean circuit is duplicated in negative logic or implemented from building blocks as shown in Figure 2.12 (then the NOT operations, which otherwise introduce data-dependent processing delays, can be eliminated, too). ‘Adiabatic’ computation through state changes at a constant energy level also plays a role in the recent development of quantum computing [8]. Figure 2.25 shows a hypothetical ‘machine’ exchanging the charges of two capacitors (hence performing the NOT function if one holds the bit and the other its complement) without consuming energy. Both capacitors are assumed to have the capacitance C, the capacitors and
Figure 2.25 Zero-energy reversible NOT operation
Figure 2.26 Adiabatic CMOS gate
the inductance are ideal, and the switch is ideal and can be operated without consuming energy. At the start of the operation C1 is supposed to be charged to the voltage U while C2 is discharged. To perform the computation, the switch is closed exactly for the time T = 2^−1/2 π (LC)^1/2. At the end C2 is charged to U and C1 is discharged. After another time T the NOT computation would be undone. In practical CMOS circuits, the energy stored in the individual load capacitors cannot be recovered this way (unless a big bus capacitance were to be driven), but a slightly different approach can be taken to significantly reduce the power consumption. The option to move charges between large capacitors without a loss of energy can be exploited by using the sine waveform developing during the charge transfer to smoothly charge and discharge sets of input and output capacitors with a small voltage drop across the charging resistors or transistors, as explained above. Thus, the DC power supply is substituted by a signal varying between zero and a maximum value U (a ‘power clock’). Circuits relying on smoothly charging or discharging from or to a power clock are called adiabatic. Various adiabatic circuit schemes have been implemented [37, 38]. A simplified, possible structure of an adiabatic CMOS gate with two complementary n-channel switch networks and complementary outputs is shown in Figure 2.26. During the charging of the output capacitors the logic levels at the transistor gates are assumed to be constant. This can be achieved in a pipelined arrangement where one stage outputs constant output data using a constant supply voltage while the next one gets charged by smoothly driving up its supply. Once charged, the gate keeps its state even while its input gets discharged, due to the feedback within the gate circuit. Using equation (10), the energy dissipated by an adiabatic computation can be expected to be inversely proportional to the execution time T (∼ T^−1), and the current consumption to decrease as T^−2 instead of just T^−1 as for standard CMOS circuits clocked at a reduced rate. Practically, only a fraction of these savings can be realized, but enough to make it an interesting design option. The charge trapped in the intermediate nodes of the switch networks cannot be recycled unless all inputs are maintained during the discharging, and the discharging through the p-channel transistors
Figure 2.27 Ripple-carry counter
only works until the threshold voltage is reached. Low capacitance registers can be added at the outputs as in Figure 1.16 to avoid the extensive input hold times. Storage elements are built from CMOS gates and also dissipate power for the output transitions of each of these gates. A latch uses a smaller number of gates and hence consumes less power than a flip-flop. In a master–slave flip-flop the clock is inverted so that every clock edge leads to charging some internal capacitance C even if the data input and output do not change. Thus just the clocking of an n-bit data register at a frequency f continuously dissipates the power of

Pc = 2n ∗ U² ∗ C ∗ f    (11)
Registered circuits implemented with dynamic logic (see Figure 2.17) consume less power than conventional CMOS gates combined with latches or master–slave registers. If the clock is held at the L level, then there are no cross-currents even if the inputs discharge to intermediate levels. In order to estimate the continuous power consumption of a subsystem operating in a repetitive fashion, one needs to take into account that the transition frequencies at the different gate inputs and outputs are not the same. The circuit shown in Figure 2.27 is a cascade of single bit counters as shown in Figure 2.21, obtained by using the output of every stage as the clock input of the next. It is called the ripple counter and serves to derive a clock signal with the frequency f/2^n from the input clock with the frequency f. Each stage divides the frequency by 2. If I0 is the current consumed by the first stage clocked with f, then the second stage runs at half this frequency and hence consumes I0/2, the third I0/4 etc. The total current consumption of the n-stage counter becomes:

I = I0 (1 + 1/2 + 1/4 + · · ·) < 2 I0

The technique of using a reactive Boolean circuit with input and output registers clocked at a period not smaller than the processing time of the circuit (see section 1.4.3) in order to arrive at a well-defined timing behavior thus leads to a continuous power consumption proportional to the clock rate. Some techniques can be used to reduce this power consumption:
• Avoid early, invalid signal transitions and the secondary transitions that may result from them by using layered circuits.
• Use data latches instead of master–slave registers, maybe using an asymmetric clock with a short low time.
• Suppress the clock by means of a gate if a register is not to change, e.g. for extended storage or if the input is known to be unchanged.
• Use low level differential signals for data transfers suffering from capacitive loads.

The gating of a clock is achieved by passing it through an OR (or an AND) gate. If the second input is H (L for the AND gate), H (L) is selected for the gate output. The control signal applied to the second input must not change when the clock signal is L (H).
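The geometric series given above for the ripple counter is easily checked numerically; the following Python sketch sums the per-stage currents for a few counter lengths (I0 is an arbitrary unit value chosen for the example):

# Current estimate for an n-stage ripple counter: stage k runs at f / 2**k,
# so it draws about I0 / 2**k, and the sum stays below 2 * I0.
I0 = 1.0                      # current of the first stage, assumed unit value
for n in (4, 8, 16):
    total = sum(I0 / 2**k for k in range(n))
    print(n, total)           # 1.875, 1.9921875, ~1.99997 -> always < 2 * I0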
If the power consumption is to be reduced, the frequency of applying the components (the clock frequency for the registers) must be reduced, and thereby the processing speed, the throughput and the efficiency (the fraction of time in which the compute circuits are actually active). The energy needed for an individual computation does not change and is proportional to the square of the supply voltage U. The energy can only be reduced, and the efficiency maintained, by also lowering U. Then the transistor switches get a higher ‘on’ resistance and the processing time of the gate components increases. The ‘on’ resistance is, in fact, inversely proportional to U − UT where U denotes the supply voltage and UT is the threshold voltage (see section 2.1.1). Then the power consumption for a repeated computation becomes roughly proportional to the square of the clock frequency. If the required rate of operations of a subsystem varies significantly with time, this can be used to dynamically adjust its clock rate and supply voltage so that its efficiency is maintained. The signals at the interface of the subsystem would still use some standard voltage levels. This technique is common for battery-powered computers, but can be systematically used whenever a part of a system cannot be used efficiently otherwise. A special case is the powering down of subsystems that are not used at all for some time. The use of handshaking between the building blocks of a system can also serve to reduce the power consumption. Instead of a global clock, individual clocks are used (the handshake signals) that are only activated at the data rate really needed for them. A handshaking building block may use a local clock but can gate it off as long as there are no valid data. This is similar to automatically reducing the power consumption of unused parts of the system (without trying to use them efficiently). If the processing delay for a building block is generated by a chain of inverters, the generated delay adapts to voltage and temperature in the same way as the actual processing time. It then suffices to vary the voltage to adjust the power dissipation, and the handshake signals (the individual clocks) adjust automatically. A control flow is easily exploited by suppressing the input handshake to unused sub-circuits. Similar power-saving effects (without the automatic generation and adjustment of delays) can, however, also be obtained with clocked logic by using clock gating.
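A first-order Python sketch of the combined voltage and frequency scaling described above is given below; the threshold voltage and the nominal operating point are assumed values, and the computation is taken to be repeated at the scaled clock rate:

# First-order scaling model: the reachable clock frequency is taken to be
# proportional to (U - UT), and the relative power of computations repeated
# at that clock rate to (U/U_nom)**2 * f. All numbers are assumed.
UT = 0.4                       # threshold voltage in volts (assumed)
U_nom, f_nom = 1.2, 1.0        # nominal supply and normalized clock

def scaled(U):
    f = f_nom * (U - UT) / (U_nom - UT)   # relative clock frequency
    P = (U / U_nom)**2 * f                # relative power at that clock rate
    return f, P

for U in (1.2, 1.0, 0.8, 0.6):
    f, P = scaled(U)
    print(f"U={U:.1f}V  f={f:.2f}  P={P:.2f}")
# the power falls much faster than the clock rate when U is lowered with it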
2.1.4 Pulse Generation and Interfacing

Besides the computational building blocks and their control, a digital system needs some auxiliary signals like a power-on reset signal and a clock source that must be generated by appropriate circuits, and needs to be interfaced to the outside world, reading switches and driving loads. In this section, some basic circuits are presented that provide these functions. Interfaces to input and output analogue signals will follow in Chapter 8. For more details on circuit design we refer to [19]. The most basic signal needed to run a digital system (and most other electronic circuits) is a stable DC power supply delivering the required current, typical supply voltages being 5.0V, 3.3V for the gates driving the signals external to the chips and additionally lower voltages like 2.5V, 1.8V, 1.5V, and 1.2V for memory interfaces and circuits within the more recent chips. In many applications, several of these voltages need to be supplied for the different functions. To achieve a low impedance at high frequencies the power supply signals need to be connected to grounded capacitors close to the load sites all over the system. A typical power supply design is to first provide an unregulated DC voltage from a battery or one derived from an AC outlet and pass it through a linear or a switching regulator circuit.
Figure 2.28 Switching down and up regulator configurations
Figure 2.29 Reset signal generation using a Schmitt trigger circuit
Regulators outputting e.g. a smooth and precise 5V DC from an input ranging between 7 and 20V with an overlaid AC ripple are available as standard integrated 3-terminal circuits. The current supplied at the output is passed to it from the input through a power transistor within the regulator. For an input voltage above 10V, more power is dissipated by this transistor than by the digital circuits fed by it. A switching regulator uses an inductance that is switched at a high frequency (ranging from 100 kHz to several MHz) to first store energy from the input and then to deliver it at the desired voltage level to the output. It achieves a higher efficiency (about 90%, i.e. it consumes only a small fraction of the total power by itself) and a large input range. Switching regulators can also be used to convert from a low battery voltage to a higher one (Figure 2.28). The switches are implemented with n-channel and p-channel power MOS transistors having very low resistances (some 0.1 Ω). The transistor switches are controlled by digital signals. Single and multiple regulators are available as integrated circuits including the power transistors. A high efficiency voltage converter deriving the voltage U/2 from a supply voltage U can be built using just a switched capacitor that is connected between the input and the output terminals to get charged by the output current, or alternatively between the ground reference and the output terminal to get discharged by the load current. The two connections are made by low resistance transistor switches and alternate at a high frequency so that only a small voltage change develops and the power dissipation is low due to equations (7) and (8) in the previous section. The input delivers the load current only half of the time. After power-up, some of the storage elements in a digital system must usually be set to specific initial values; this is performed in response to a specific input signal called a reset signal. It is defined to stay at a specific level, say L, for a few ms after applying the power and then to change to H. An easy way to generate a signal of this kind is by means of a capacitor that is slowly charged to H via a resistor. In order to derive from it a digital signal that makes a fast transition from L to H, the voltage across the capacitor can be passed through a CMOS inverter that is used here as a high gain amplifier. If feedback is implemented as in Figure 2.29, a single transition results even if the input signal or the power supply is overlaid with some electrical noise. The reset circuit outputs the L level after power-up; this level holds for some time after the power has reached its full level, depending on the values of C and the resistors (usually its
Figure 2.30 Crystal oscillator circuit
duration does not need to be precise). The switch shown as an option permits a manual reset by discharging the capacitor. The buffer circuit with a small amount of feedback to the input is a standard circuit known as the Schmitt trigger that is used to transform a slow, monotonic waveform into a digital signal. Its Vin − Vout characteristic displays a hysteresis: the L-H transition occurs at a higher input level than the H-L transition. The actual implementation would realize the feedback resistor from the output by simply using transistors with a high on resistance. The other one can be substituted by a two-transistor non-inverting input stage (similar to Figure 2.5 but with the n- and p-channel transistors interchanged). A periodic clock signal as needed for clocking the registers and as the timing reference within a digital system is easily generated by using the CMOS inverter circuit as a high gain amplifier again and using a resonator for a selective feedback at the desired frequency. The circuit in Figure 2.30 uses a piezoelectric crystal for this purpose and generates a full swing periodic signal at its mechanical resonance frequency, which is very stable (it exhibits relative frequency deviations of less than 10^−7 only) and may be selected in the range of 30 kHz to 60 MHz through the mechanical parameters of the crystal. The resistor serves to let the amplifier operate at the midpoint of its characteristic (Figure 2.5), and the capacitors serve as a voltage divider to provide the phase shift needed for feedback. The second inverter simply amplifies the oscillator output to a square waveform with fast transitions between L and H. Crystals are offered commercially at any required frequency, and complete clock generator circuits including the inverters are offered as integrated components as well. The frequency of a crystal oscillator cannot be changed, but other clock signals can be derived from it by means of frequency divider circuits. A frequency divider by two is provided by the circuit shown in Figure 2.21 using a D flip-flop and feeding back its inverted output to its data input. Then the output becomes inverted after every clock edge (plus the processing delay of the flip-flop), and the resulting signal is a square wave at half of the clock frequency h with a 50% duty cycle, i.e. with the property that the L and H times are identical (this is not guaranteed for the crystal oscillator output). If several frequency dividers of this kind are cascaded so that the output of a divider becomes the clock input for the next stage, one obtains a frequency divider by 2^n, the ripple-carry counter already shown in Figure 2.27. As each of the flip-flops has its own clock, their clock edges do not occur simultaneously. To divide the input frequency h by some integer k in the range 2^(n−1) < k ≤ 2^n, a modified edge counter circuit can be used, i.e. an n-bit register with a feedback function f that performs the n-bit binary increment operation f(x) = x + 1 as proposed in section 2.1.2 (also called a synchronous counter as all flip-flops of the register here use the same clock signal), but
Figure 2.31 Fractional frequency divider
Figure 2.32 PLL clock generator
only for x < k − 1, whereas f(k − 1) = 0. Then the register cycles through the sequence of binary codes of 0, 1, 2, . . . , k−1 and the highest code bit is a periodic signal with the frequency h/k. Another variant is the fractional counter that generates the multiple h ∗ k/2^n for a non-negative integer k < 2^(n−1) (Figure 2.31). This time the feedback function is f(x) = x + k (algorithms for the binary add operation follow in section 4.2). The output from the highest code bit is not strictly periodic at the prescribed frequency (for odd k, the true repetition rate is h/2^n). The transitions between L and H remain synchronized with the input clock and occur with a delay of at most one input period. The frequency dividers discussed so far generate frequencies below h/2 only. It is also useful to be able to generate a periodic clock at a precise integer multiple k of the given reference h. The crystal oscillators do not cover clock frequencies of the several 100 MHz needed for high speed processors, but their frequencies can be multiplied to the desired range. It is quite easy to build high frequency voltage-controlled oscillators (VCO), the frequencies of which can be varied over some range by means of control voltages moving continuously over a corresponding range. The idea is to control the frequency q of a VCO so that q/k = h (a signal with the frequency q/k is obtained from a frequency divider). The deviation is detected by a so-called phase comparator circuit and used to generate the control voltage, setting up a phase-locked loop (PLL, Figure 2.32). If the VCO output is divided by m, then the resulting output frequency becomes (k/m) ∗ h. The phase comparator (PC in Figure 2.32) can be implemented as a digital circuit that stores two bits encoding the numbers 0, 1, 2, 3 and responds to the L-to-H transitions at two separate clock inputs. The one denoted ‘+’ counts up to 3, and the one denoted ‘−’ counts down to 0. The phase comparator outputs the upper code bit, i.e. zero for 0, 1 and the supply voltage for 2, 3. If the frequency of the VCO is higher than k ∗ h, there are more edges counting down, and PC is in one of the states 0, 1 and outputs the zero level, which drives the VCO frequency down. If the reference clock is higher, the VCO frequency is driven up. If both frequencies have become equal, the state alternates between 1 and 2, and the mean value of the output voltage depends on their relative phase, which becomes locked at some specific value. The R-R′-C
integrator circuit needs to be carefully designed in order to achieve a fast and stable control loop [40]. The VCO output can then be passed through a divide-by-m counter to obtain the rational multiple k/m of the reference clock frequency. Input data to a digital system must be converted to the H and L levels required by the CMOS circuits. The easiest way to input a bit is by means of a mechanical switch shorting an H level generated via a resistor to ground. Mechanical make switches generate unwanted pulses before closing due to the bouncing of the contact, which are recognized as separate edges if the input is used as a clock. Then some pre-processing is necessary to ‘debounce’ the input. The circuit in Figure 2.29 can be used, or a feedback circuit like the RS flip-flop or the hold circuit in Figure 2.18 that keeps the changed input value from the first pulse (but needs a separate switch or a select switch to be reset). Data input from other machines is usually by means of electrical signals. If long cabling distances are involved, the L and H levels used within the digital circuits do not provide enough noise margin and are converted to higher voltage levels (e.g. [3, 12]V to represent 0 and [−12, −3]V to represent 1) or to differential signals by means of input and output amplifiers that are available as integrated standard components. For differential signals the H and L levels can be reduced to a few 100 mV. At the same time the bit rates can be raised. The LVDS (‘low voltage differential signaling’) standard, e.g., achieves bit rates of 655 Mbit/s and, due to its low differential voltages of ±350 mV, operates from low power supply voltages [21]. LVDS uses current drivers to develop these voltage levels across 100 Ω termination resistors. Variants of LVDS support buses and achieve bit rates beyond 1 Gbit/s. An LVDS bus line is terminated at both ends and therefore needs twice the drive current. If systems operating at different ground levels need to be interfaced, the signals are transferred optically by converting a source signal by means of a light emitting diode that is mounted close to a photo transistor converting back to an electrical signal. Such optoelectronic couplers are offered as integrated standard components as well (alternatively, the converters are linked by a glass fiber replacing the cable). The switches, converters, cables, wires and even the input pins to the integrated circuits needed to enter data into a system are costly and consume space. The idea of reusing them in a time-serial fashion for several data transfers is applied in the same way as it was to the compute circuits. Again, this involves auxiliary circuits to select, distribute and store data. A common structure performing some of these auxiliary functions for the transfer of an n-bit code using a single-bit interface in n time steps is the shift register (Figure 2.22). After n time steps the code stands in the flip-flops of the register and can be applied in parallel as an input to the compute circuits. Figure 2.33 shows the resulting interface structure. The clock
Figure 2.33 Serial interface structure (G: bit and word clock generator, C: signal converter)
signal defines the input events for the individual bits and must be input along with the data (or generated from the transitions of the data input). If both clock edges are used, the interface is said to be a double data rate (DDR) interface. No further handshaking is needed for the individual bits, but it is needed to define the start positions of multi-bit code words and must be input or generated as well (at least, the clock edges must be counted to determine when the receiving shift register has been filled with new bits). The serial interface is reused as a whole to transfer multiple code words in sequence. The register, the generation of the clock and the handshake signals add up to a complex digital circuit that does not directly contribute to the data processing but can be much cheaper than the interface hardware needed for the parallel code transfer. The output from a digital system (or subsystem) to another one is by means of electrical signals converted to appropriate levels, as explained before. A serial output interface requires a slightly more complex register including input selectors to its flip-flops so that it can also be loaded in parallel in response to a word clock (Figure 2.33). If the data rate achieved with the bit-serial transfer is not high enough, two or four data lines and shift registers can be operated in parallel. Another option is to convert the interface signals into differential ones using LVDS buffers. Then much higher data rates can be achieved that compensate for the serialization of the transfer. To further reduce the cables and wires, the same lines can be used to transfer data words in both directions between the systems (yet at different times, using some extra control). Finally, the clock lines can be eliminated. For an asynchronous serial interface each word transmission starts with a specific signal transition (e.g. L -> H) and the data bits follow this event with a prescribed timing that must be applied by the receiver to sample the data line. Another common method is to share a single line operating at double the bit rate for both the clock and the data by transmitting every ‘0’ bit as a 0-1 code and every ‘1’ as a 1-0 code (Manchester encoding), starting each transmission with an appropriate synchronization sequence. Then for every bit pattern the transmitted sequence makes many 0-1 transitions, which can be used to regenerate the clock using a PLL circuit at the receive site. The effort to do this is compensated by the simplified wiring. The CMOS outputs can directly drive light emitting diodes (LEDs) through a resistor; these give a visible output at as little as 2 mA of current (Figure 2.34). To interface to the coil of an electromechanical switch or a motor, one would use a power transistor to provide the required current and voltage levels. When the transistor switches off, the clamp diode limits the induced voltage to slightly above the coil power supply voltage and thereby protects the transistor from excessive voltages. The same circuit can be used to apply any voltage between the coil power supply and zero by applying a high frequency, periodic, pulse width modulated (PWM) digital input signal to the gate of the transistor. To output a bipolar signal, H-bridge arrangements of power transistors are used. Integrated LED arrays or power bridges to drive loads in both polarities are common output devices.
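The Manchester encoding mentioned above is easily written down in software; the following Python sketch (an illustration, not a transmitter implementation) encodes every ‘0’ bit as the half-bit pair 0-1 and every ‘1’ as 1-0 and decodes an aligned sequence again:

# Manchester encoding as described above: '0' -> 0,1 and '1' -> 1,0,
# so the line toggles at least once per bit and carries its own clock.
def manchester_encode(bits):
    out = []
    for b in bits:
        out += [0, 1] if b == 0 else [1, 0]
    return out

def manchester_decode(halves):
    # assumes the receiver is already aligned to the bit boundaries
    return [0 if pair == (0, 1) else 1
            for pair in zip(halves[0::2], halves[1::2])]

data = [1, 0, 1, 1, 0]
line = manchester_encode(data)
print(line)                           # [1,0, 0,1, 1,0, 1,0, 0,1]
assert manchester_decode(line) == data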
Figure 2.34 Interfacing to LED lamps and coils
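The pulse width modulated drive of the coil mentioned above can also be illustrated by a small Python sketch; the supply voltage and the duty cycles are assumed values, and the mean coil voltage is simply the duty cycle times the coil supply:

# PWM drive of a coil: the transistor is switched with a periodic pattern,
# and the mean voltage is the duty cycle times the coil supply voltage.
U_coil = 12.0                          # coil power supply in volts (assumed)

def pwm_pattern(duty, steps=20):
    on = round(duty * steps)
    return [1] * on + [0] * (steps - on)   # one PWM period as a bit pattern

for duty in (0.25, 0.5, 0.75):
    p = pwm_pattern(duty)
    mean_u = U_coil * sum(p) / len(p)
    print(duty, mean_u)                # 3.0 V, 6.0 V, 9.0 V mean coil voltage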
2.2 CHIP TECHNOLOGY
Since the late 1960s, composite circuits with several interconnected transistors have been integrated onto a silicon ‘chip’ and packed into appropriate carriers supplying leads to the inputs and outputs of the circuit (and to the power supply). Since then the transistor count per chip has risen almost exponentially. At the same time, the dimensions of the individual transistors were reduced by more than two orders of magnitude. The gate lengths decreased from 10 µm in 1971 to 0.1 µm in 2001. The first families of bipolar and CMOS integrated logic functions used supply voltages of 5V and above. A 100 mm² processor chip filled with a mix of random networks of gates and registers and some memory can hold up to 5 ∗ 10^7 transistors in 0.1 µm CMOS technology. For dedicated memory chips the densities are much higher (see section 2.2.2). The technology used for a chip, characterized by the above feature size parameter s, determines the performance level of a chip to a high degree. If a single-chip digital system or a component such as a processor is reimplemented in a smaller feature size technology, it becomes cheaper and faster, consumes less power, and may outperform a more efficient design still manufactured using the previous technology. Roughly, the thickness of the gate insulators is proportional to s. The supply voltage and the logic levels need to be scaled proportionally to s in order to maintain the same levels for the electrical fields. For a given chip area, the total capacitance is proportional to s^−1, the power dissipation P = U² ∗ C ∗ f (formula (8) in section 2.1.3) for an operating frequency f hence proportional to s, and f can be raised proportionally to s^−1 for a fixed power level. At the same time, the gate density grows with s^−2. A problem encountered with highly integrated chips is the limitation of the number of i/o leads to a chip package. Whereas early small-scale integrated circuits had pin counts of 8–16, pin counts now range up to about 1000, but at considerable costs for the packages and the circuit boards. For chips with up to 240 leads, surface-mount quadratic flat packages (QFP) are common, with leads extending from the borders at a spacing as low as 1/2 mm. To reduce the package sizes and to also support higher i/o counts, ball grid array (BGA) packages have become common where the leads (tiny solder balls) are arranged in a quadratic grid at the bottom side of the package and thus can fill out the entire area of the package. While a 240-pin QFP has a size of 32 × 32 mm, a BGA package with the same lead count only requires about 16 × 16 mm. For the sake of reduced package and circuit board costs, chips with moderate pin counts are desirable. Chips are complex hardware modules within a digital system. Generally, the module interfaces within a system should be as simple as possible. The data codes exchanged between the chips may anyhow be much wider than the number of signal lines between them, by using serial data transfers in multiple time steps. For large chips, testing is an issue and must be supported by their logic design. Although the manufacturing techniques have improved, isolated faulty transistors or gates can render a chip unusable unless the logic design provides some capabilities to replace them by spare operational ones (this is common for chips which contain arrays of similar substructures). Otherwise the ‘yield’ for large chips becomes low and raises the cost of the operational ones.
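The first-order scaling rules just stated can be collected in a small Python sketch; the reference feature size, supply voltage and normalizations are assumed values chosen only to show the trends:

# Constant-field scaling as described above: for a fixed chip area the total
# capacitance scales as 1/s, the gate density as 1/s**2, the supply voltage
# as s, so at unchanged power the clock can be raised as 1/s.
def scale(s, s0=0.25, U0=2.5, C0=1.0, f0=1.0):   # reference values assumed
    r = s / s0
    U = U0 * r                  # supply scaled with the feature size
    C = C0 / r                  # relative total capacitance for the same area
    density = 1 / r**2          # relative gate density
    f_const_power = f0 / r      # clock at unchanged power P = U**2 * C * f
    return U, C, density, f_const_power

for s in (0.25, 0.18, 0.13, 0.10):
    print(s, [round(v, 2) for v in scale(s)])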
Chips are produced side by side on large silicon wafers (with diameters of 20 cm and above) from which they are cut to be packaged individually. The level of integration has been raised further in special applications by connecting the operational chips on a wafer without cutting
it (wafer-scale integration). The array of interconnected chips on a wafer must support the existence of faulty elements. The achievable complexity of integrated circuits is high enough to allow a large range of applications to be implemented on single chip digital processors, at least in principle. The high design and manufacturing costs of large-scale integrated circuits, however, prohibit single chip ASIC implementations except for very high volume products. Otherwise the digital system would be built from several standard or application specific chip components mounted and connected on one or several circuit boards. The standard chips and ASIC devices are the building blocks for the board level design, and the implementation of multi-chip systems on circuit boards provides the scalability required to cover both high performance or low volume applications. Chips always have a fixed, invariable structure. They can, however, be designed to offer some configurability to support more than one application or some design optimizations without having to redesign the hardware (by implementing combined functions in the sense discussed in Section 1.3.3). The component chips can only be cost effective if they are produced in large volumes themselves which is the case if their respective functions are required in several applications, or if they can be programmed or configured for different applications. At the board level, reusable ‘standard’ subsystems are attractive, too, and the cost for board level system integration must be considered to compare different design options. Chips to be used as components on circuit boards benefit from integrating as many functions as possible and from having a small number of easy-to-use interface signals with respect to their timing and handshaking. In general, the interfacing of chips on a board requires pin drivers for higher signal levels than those inside the chips involving extra delays and power consumption related to their higher capacitive loads. If there is a choice of using a chip integrating the functions of two other ones, it will provide more performance and lower power consumption yet less modularity for the board level design. For the internal and external interfaces of digital systems small-to-medium-scale standard or application-specific integrated circuits are used to provide the generation of the required signal levels and to perform digital functions to accommodate them to the interfacing standards of the digital processor. It is e.g. common to realize driver functions that adapt the internal digital signals to the voltages and currents required at the external interfaces in separate integrated circuits, both because they are the most likely places where damage can occur to a digital system (then only the drivers need to be exchanged) and because they use transistors with different parameters which are not easily integrated with the processing gates. Generally it is hard to integrate circuit structures with different, special characteristics, e.g. special memory technologies, random gate networks and analogue interfaces. Otherwise highly integrated components are predominant, starting from configurable standard interface functions. In the subsequent sections some common highly integrated building blocks of digital systems will be presented that are usually packaged as chips or constitute large modules within still larger chips. 
Fixed-function small-scale and medium-scale integrated circuits have lost much of their former importance and are often replaced by configurable components, but some still play a role. If a few gates are needed, one can choose from small and cheap packages like those containing six CMOS inverters or four 2-input gates, and for interfacing to buses there are banks of tri-state drivers with or without keeper circuits as well as multi-bit latches and registers.
Figure 2.35 16-bit SRAM and Flash memory interface signals
Figure 2.36 Read and write cycle timing
2.2.1 Memory Bus Interface

Among the most prominent components of digital systems are the various kinds of memory chips. They are fixed configuration building blocks used in large volume. Memory is used for storing intermediate results, for holding the input and output data of computations, and for providing random access to data that came in serially. The flip-flops and registers introduced in section 2.1.2 can be extended by select and decode circuits to implement storage for multiple data words that can be selected via address signals. In many applications the storage requirements are for quite large numbers of data bits. Integrated memory chips offer a large number of individual, highly optimized multi-bit storage cells and the selection circuits. The static random access memory (SRAM) and the ‘flash’ erasable and programmable read only memory (EPROM) chips or modules explained in the next section have the generic interface shown in Figure 2.35. The n address inputs A0, .., An−1 are used to select from 2^n storage locations (common values for these chips are n = 16, 18), the control signals /OE (output enable), /WE (write enable) and /CE (chip enable) transfer read and write commands, and the k data lines D0, .., Dk−1 transfer k-bit data words during read or write operations (k = 8, 16). 16-bit devices usually have extra control signals /BLE and /BHE to activate the lower and upper 8-bit half (‘byte’) of the data word separately. Read and write operations are performed sequentially. During a read operation the data lines of the memory device are outputs. Otherwise the outputs are tri-stated. During a write operation the data lines input the data to be stored. Read and write operations can be controlled most simply with /CE and /WE alone if the other control signals are held at the L level. Figure 2.36 shows the timing of the read operation from an SRAM or an EPROM and of the SRAM write operation. The address inputs and /WE are signaled valid by the falling edge of /CE and do not change during the time /CE is low. Alternatively, /WE or /OE are pulsed low for the write and read operations while /CE is low. In the read cycle, the output data become available before the rising edge of /CE (/OE in the other scheme), some time after applying the address needed for the selection of the data (their
Figure 2.37 Multiple memory chips connected to a bus
arrival after the invalid data XX is not indicated by an extra signal). This time is referenced to the falling edge of /CE and specified as the access time of the particular memory device. The data can be stored in a register clocked with /CE (/OE) but disappear from the bus a short ‘hold’ time after /CE (/OE) is high again. In the write cycle the write data must be applied no later than a specific set-up time before the rising edge of /CE (/WE). After the rising edge of /CE the address lines may change for the next memory cycle. Several memory chips of different kinds and sizes can be connected to the same sets of data and address lines (‘buses’) provided that their /CE signals do not become active and read operations are not carried out simultaneously on several devices (Figure 2.37). The data and address words are transferred to all memory chips using the same signal lines. The individual /CE signals are generated by means of a decoder circuit (a few CMOS gates) in response to additional address signals. An important parameter of the memory interface is the number of data lines which determines how many bits can be transferred simultaneously (performance), and how many wires and signal drivers are needed (cost). A 16-bit code required as a parallel input to a circuit can be loaded from a memory via an 8-bit data bus but this takes two bus cycles and requires the first byte to be stored in a register until the second is ready, too (if there was just one data line, one would arrive at a bit-serial interface to the memory and have to use a shift register as in Figure 2.33). Thus transfer speed can be traded off for a simpler interface. A 16-bit memory device can be connected to an 8-bit data bus, too. If /BLE and /BHE are never activated simultaneously, the lower and upper bytes can be tied together and connected to the data bus. Also, several memory modules with a small number of data lines can be connected in parallel to the same address and control signals but to different data bus lines to yield a wider memory structure. The bus with the multiple memory devices connected to it and the inputs to the address decoder behaves like a single memory device with the generic interface. Another way to trade off performance against lower cost for the interfacing is to use the same signal lines to transfer the addresses and the data. Then an extra time step is needed for the address transfer, and the address must be latched for the subsequent read or write operation using an extra address latch enable control signal (ALE, Figure 2.38). A bus with common address and data lines is called a multiplexed bus. On it, every memory operation needs two transfers via the bus, and for the attached memory devices the address latches must be provided. If they are integrated into the memory chips, the pin count is reduced significantly. There are many cases in which the addresses of subsequent memory accesses follow some standard pattern, e.g. obtained by performing a binary increment (add 1) operation. This
Figure 2.38 Interfacing to a multiplexed bus
can be exploited by augmenting the address latch to a register that increments its contents at the end of every read or write operation. Then for accesses at ‘sequential’ addresses no further overhead is incurred through the multiplexing, apart from the initial loading of the address register, and the circuit generating the bus addresses may be simplified, as addresses do not need to be computed and output for every memory access. If the address lines saved by the multiplexing are invested into further data lines, then the multiplexed bus even delivers a higher performance than the non-multiplexed one. The set of bus signals and the definitions of the read and write cycles (the bus ‘protocol’) define an interfacing standard (called the asynchronous memory bus) that also applies to circuits other than memory devices. A decoded /CE type signal can e.g. be used to clock a register attached by its inputs to the data bus, or to activate the tri-state outputs of some circuits to place their data onto the data bus. To perform data transfers via the bus, some digital circuit must drive the address and control lines, which are just inputs to the other devices, and the data lines during a write cycle. The /CE signals of the individual devices must be activated according to their access times. The bus is thus an interconnection structure for a number of modules with compatible interfaces, supporting n-bit parallel word transfers between them that are performed serially in time. Of course, if only one memory chip is used, the time-sharing is only for the read and write accesses to it. There are various extensions to the basic structure of a multiplexed or non-multiplexed bus, adding e.g. clock or handshaking signals. The use of buses as standard module interfaces is further discussed in section 6.5.3. Logically, one has to distinguish between the wiring resources for a bus, the signal parameters to be used and the assignment of the signals, and the protocols on how to access the bus and perform data transfers on it to a desired destination.
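The behavior of a memory behind a multiplexed bus with an auto-incrementing address register, as described above, can be sketched as follows in Python (a behavioral illustration with an assumed word count, not a description of a particular device):

# Multiplexed bus with auto-increment: the address is latched once (ALE),
# then each read or write uses the registered address and increments it.
class MuxBusMemory:
    def __init__(self, words=16):
        self.mem = [0] * words
        self.addr = 0

    def latch_address(self, a):        # ALE cycle: address sent on the AD lines
        self.addr = a % len(self.mem)

    def write(self, data):             # write cycle at the registered address
        self.mem[self.addr] = data
        self.addr = (self.addr + 1) % len(self.mem)

    def read(self):                    # read cycle at the registered address
        data = self.mem[self.addr]
        self.addr = (self.addr + 1) % len(self.mem)
        return data

m = MuxBusMemory()
m.latch_address(4)
for d in (11, 22, 33):                 # one address transfer, three data transfers
    m.write(d)
m.latch_address(4)
print([m.read() for _ in range(3)])    # [11, 22, 33]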
2.2.2 Semiconductor Memory Devices

The storage cells are packed onto the memory chips in large, regular, two-dimensional arrays. Due to the tight packing of cells the silicon area per transistor is small, and due to this and the high volume production of memory chips the per-transistor cost of a memory array is much less than for other types of digital circuits (as a rule of thumb, by a factor of 100). This is one of the keys to the success of the current microprocessor architectures that rely on large data and program memories. Semiconductor memory devices can be classified as volatile memories (that need to be initialized with valid data after being supplied with power) and non-volatile ones (that hold their data even without being supplied), and further by their access to the individual words and bits of data, which may be random (using select codes called ‘addresses’) or serial. We include some typical memory parameters which hold for the year
2001 but have changed year by year to ever more impressive ones. For a long time, memory chip capacities have doubled every 2–3 years. The random access memories (RAM) are volatile. They offer a large selection of addressable word locations that data may both be written to or be read from. There are two common RAM implementations, the SRAM (static RAM) and the DRAM (dynamic RAM). SRAM provides easier-to-use storage whereas DRAM achieves a higher storage capacity in relation to the transistor count. A typical SRAM chip would run from a 3.3V supply and consume about 20mA of current, provide a million of bit locations (an ‘M bit’) and perform read and write operations in as little as 10 ns. There are low power versions with slower access times of up to 120 ns but a current consumption of a few µA only, and higher density devices with capacities of up to 16 M bit. DRAM chips provide storage capacities of up to 256 M bits and beyond. Non-volatile memory chips are needed to hold the program and configuration code for programmable subsystems that must be available after applying power to a system. The Flash EPROM chips provide non-volatile storage with capacities similar to SRAM and slightly longer read access times, and can be erased and reprogrammed for a limited number of times only. They are also used as non-volatile data memories (silicon discs). The SRAM memory cell is the feedback circuit built from two CMOS inverters using four transistors (Figure 2.18). All memory cells in a column of the two-dimensional memory array are connected via n-channel pass transistor switches to two bus lines (one from each inverter output) which results in a total of six transistors per storage bit (Figure 2.39). A decoder circuit generates the control signals to the gates of the switches from part of the address inputs so that only one cell in the column is switched to the bus line. This structure is applied for all columns in parallel yet sharing the decoder circuit which therefore selects an entire row of the array, and the output from a column to the bus line is by a wired OR. For a read operation from a particular location all bits in its row are read out in parallel to the bus lines. A select circuit selects the desired column using the remaining address inputs. For a write operation, an L level is forced on the unique bus line of the column of the cell to be written to and the inverter side to be set low, similarly to writing a D latch. Due to the sharing of the decoder circuit for all columns and the wired OR of all outputs from a column, the selection of the memory cells only requires a small fraction of the transistors (but determines the time required to access the selected location). A 16 M bit SRAM thus contains about 100 million transistors. There are a number of issues on memory design beyond these basics [10].
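The row/column selection described above (and shown in Figure 2.39) amounts to splitting the address into a row part and a column part; the following Python sketch illustrates this for an assumed, very small array:

# Row/column selection in a 2D memory array: the address is split into a row
# part (decoded to select one word line) and a column part (selecting one bit
# of the row that is read out in parallel). Array dimensions are assumed.
ROWS, COLS = 8, 4                       # 32 one-bit cells, 5 address bits
cells = [[0] * COLS for _ in range(ROWS)]

def split(addr):
    return addr // COLS, addr % COLS    # row decoder input, column selector input

def write_bit(addr, value):
    r, c = split(addr)
    cells[r][c] = value

def read_bit(addr):
    r, c = split(addr)
    row = cells[r]                      # the whole row is read out in parallel
    return row[c]                       # the column selector picks one bit

write_bit(13, 1)
print(read_bit(13), read_bit(14))       # 1 0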
Figure 2.39 Selection of memory cell outputs (M) in a 2D array (row select lines from the row decoder, column outputs to the column selector)
Figure 2.40 Dual-port RAM interfacing to two non-multiplexed buses (bus I: data D0-D15, address A0-A11, /CE1, /OE1, /WE1; bus II: data E0-E15, address B0-B11, /CE2, /OE2, /WE2)
Figure 2.41 FIFO interface (write port: D0-D7, /WR, /CE1, handshake BA; read port: E0-E7, /RD, /CE2, handshake DA)
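The FIFO handshaking discussed below behaves like a ring buffer whose status flags play the roles of the BA (buffer available) and DA (data available) signals of Figure 2.41. A hedged software model with an arbitrary buffer depth:

#include <stdint.h>
#include <stdbool.h>

/* Software model of a FIFO with BA/DA style handshaking:
   BA corresponds to "space available", DA to "data available". */
#define FIFO_DEPTH 16u                 /* arbitrary depth for this sketch */

typedef struct {
    uint8_t  buf[FIFO_DEPTH];
    unsigned wr, rd, count;            /* address counters and fill level */
} fifo_t;

bool fifo_ba(const fifo_t *f) { return f->count < FIFO_DEPTH; }  /* buffer available */
bool fifo_da(const fifo_t *f) { return f->count > 0; }           /* data available   */

/* Write port: data is accepted only while BA is asserted. */
bool fifo_write(fifo_t *f, uint8_t data)
{
    if (!fifo_ba(f)) return false;
    f->buf[f->wr] = data;
    f->wr = (f->wr + 1u) % FIFO_DEPTH;
    f->count++;
    return true;
}

/* Read port: words come back in the order in which they were written. */
bool fifo_read(fifo_t *f, uint8_t *data)
{
    if (!fifo_da(f)) return false;
    *data = f->buf[f->rd];
    f->rd = (f->rd + 1u) % FIFO_DEPTH;
    f->count--;
    return true;
}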
If an SRAM is to be operated at high speed, the transfer of new addresses and control signals via the bus can be pipelined with the memory access to the previous one. The resulting structure using input registers for addresses and control signals is the synchronous SRAM. A memory bus equipped with an extra control signal (the clock) to signal the input events for addresses and read/write commands is called a synchronous bus. The synchronous burst SRAM (SBSRAM) chip is a common variant of the SRAM that integrates these registers and an increment function for the registered address as proposed above for the multiplexed bus (the use of SBSRAM on a synchronous multiplexed bus is non-standard). Some SBSRAM designs add additional registers for the write data and the read data. There are a number of specialized memory architectures based on SRAM cells. If a second set of pass transistors and bus lines is added to an array of SRAM cells, one arrives at a structure providing two independent access ports to the memory that permit asynchronous accesses to the same storage cells via two separate buses. This structure is called a dual-port RAM (Figure 2.40). It is useful for implementing parallel read and write operations or for interfacing to subsystems without restricting the timing of their memory accesses. If a single port memory were used, they would have to compete for the right to access the memory data and address buses and would have to perform their accesses one-by-one. The dual-port RAM doubles the possible rate of read and write cycles (the ‘memory bandwidth’) and e.g. allows the pipelined inputting of new data into the memory without restricting the read accesses to previous data still stored in it. Dual port RAM modules packaged as chips suffer from the large number of interface signals to the two buses. The use of multiplexed buses helps this. Another common memory structure that is also used to interface two subsystems and provides independent read and write ports is the first-in-first-out buffer (FIFO). The FIFO is a serial memory. A sequence of words can be input that are stored at subsequent locations, the addresses of which are generated automatically by integrated counters. The read operations retrieve the words one-by-one in the order in which they were input. A FIFO is usually equipped with extra logic to support synchronization by outputting handshaking signals BA and DA indicating the buffer space or read data being available (Figure 2.41). These interface
definitions for the read and write ports are generic for handshaking input and output via the bus and can be adopted for many interfaces transmitting or receiving data streams. Other data structures like the last-in-first-out buffer (LIFO or 'stack') with a single bus port yet without address lines can be implemented by combining the SRAM with appropriate address generator circuits (the addresses could otherwise also be computed by a sequential processor). The associative, content addressable memory (CAM) can also be based on the SRAM cell. Its read operation performs a search for a word location holding a particular pattern that is input to the memory and outputs the address at which it is stored or other data associated with the input pattern. The CAM can be thought of as encoding the multi-bit search pattern by another (shorter) code. The write operation places a search key and an associated data pattern into a new location. CAM structures are used in cache memories (see section 6.2) where a portion of a large yet slow memory is mapped to a small, fast one, encoding the long addresses of the first by the short ones of the second. They also provide an efficient way of storing a large yet sparse, indexed data set (where most of the components are zero). Only the non-zero values are stored along with the indices. The CAM implements a computational function (the comparison) along with its storage cells. While the SRAM storage cell is similar to a static D latch, the DRAM cell is like a dynamic D latch. The storage element in it is a tiny capacitor (a fraction of a pF) that holds its voltage over time as long as it is not charged differently. A single pass transistor switch is used to connect the capacitors in a column to a common bus line, again using the structure shown in Figure 2.39 (where 'M' is now the capacitor). Thus, a single transistor per bit is required, which explains the higher bit count of DRAM devices. A 256 M bit device hence contains about 256 million transistors and capacitors. Two problems arise. First, when the storage capacitor is switched to the extended bus line that has a much higher capacitance, the stored charge is distributed to both capacitors and the voltage collapses. The voltage on the bus line must consequently be amplified, and the full voltage must be restored to the cell capacitor (through the pass transistor). Secondly, for a non-selected cell the charge cannot be guaranteed to remain in the storage capacitor within the limits of the H and L levels for more than about 0.1s. Hence all rows of the memory must be periodically read out and rewritten, independently of the access patterns of the application. This is called 'refreshing' the memory. The row access to a DRAM takes some time to amplify and restore the data while the selection of a column position within the row is fast. This is exploited by applying the row and the column addresses sequentially on the same address inputs (thereby reducing the pin count) and by allowing fast 'page mode' accesses. One might expect a further multiplexing with the data, but this is not common. The access time from applying the row address may be about 40 ns, while subsequent column accesses are 2–4 times faster.
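To make the address multiplexing concrete, the following C sketch models a (very small) DRAM in software: a row access fills a row buffer playing the role of the sense amplifiers, and the subsequent column accesses read from it. The array dimensions and the 8/8 bit address split are illustrative only.

#include <stdint.h>
#include <stddef.h>

/* Software model of a (very small) DRAM with multiplexed row/column
   addressing.  The dimensions are illustrative only. */
#define DRAM_ROWS 256u
#define DRAM_COLS 256u

static uint16_t dram[DRAM_ROWS][DRAM_COLS]; /* the cell array               */
static uint16_t row_buffer[DRAM_COLS];      /* sense amplifiers (open page) */

/* Row access (/RAS): slow, reads out and restores a complete row. */
static void dram_open_row(unsigned row)
{
    for (unsigned c = 0; c < DRAM_COLS; c++)
        row_buffer[c] = dram[row][c];
}

/* Column access (/CAS): fast, picks one word out of the open row. */
static uint16_t dram_read_column(unsigned col)
{
    return row_buffer[col];
}

/* Page-mode burst read: one row address followed by several column
   addresses (Figure 2.42).  The 16-bit address is split into an 8-bit
   row half and an 8-bit column half applied in sequence. */
void dram_read_burst(uint16_t start_addr, uint16_t *dst, size_t n)
{
    dram_open_row((start_addr >> 8) & 0xFFu);
    for (size_t i = 0; i < n; i++)
        dst[i] = dram_read_column((start_addr + i) & 0xFFu);
}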
Figure 2.42 DRAM read cycle using multiple page accesses: a row address followed by several column addresses signaled by /RAS and /CAS (data follow the /CAS edges)
Figure 2.42 shows the timing of a page mode read cycle (for the write cycle it is similar). Several column addresses are applied in sequence and signaled by the /CAS transitions. The /RAS and /CAS control signals are typical of the DRAM. They substitute for the /CE signal of the SRAM and identify the input events for the row and column addresses. As in the case of SRAM, higher performance DRAM chips interface to a synchronous bus and include registers for the address and control inputs including /RAS and /CAS, for the data, and a counter function for the column address register to support accesses to subsequent locations without having to issue extra address latch commands. With these extensions the DRAM becomes the synchronous DRAM (SDRAM). Clock rates are in the range of 100–200 MHz, and sequential accesses can be performed at that rate. Still faster accesses are obtained by transferring data on every clock edge. The double data rate (DDR) SDRAM chips achieve this by using several banks of memory that are accessed in an interleaved fashion so that each individual bank transfers data at a reduced rate. A typical DDR chip stores 16 million 16-bit words and transfers them at a rate of up to 333 MHz (still much slower than the clock rate of some recent processors). A quad data rate SDRAM has been proposed using two interleaved DDR banks, the clocks of which are phase shifted by 90 degrees. The RAMBUS DRAM is a variant of the DRAM that pipelines the transfer of serialized commands and addresses with the data transfer, using only a small number of signal lines. It achieves word transfer rates of up to 800 MHz on sequential transfers using both edges of a 400 MHz clock. Applying the address (usually generated as a single n-bit word) to the DRAM in two halves, generating the /RAS and /CAS signals and generating the refresh cycles require additional support circuits. The selection of the row to be refreshed is supported by an extra counter circuit integrated onto the DRAM chip, but the refresh cycles are not related to the application processing and must be interleaved with the read and write operations. The use of DRAM (in particular, SDRAM) is common for recent processor chips and some integrate the DRAM support circuits. If these are integrated onto the memory chip the interface signals may be made compatible with the generic bus interface of an SRAM. Chips of this kind are called pseudo-static. They combine the easy interface of an SRAM with the density of a DRAM. The non-volatile Flash EPROM uses a single transistor cell with an extra fully isolated gate, the charge of which determines whether the transistor will conduct once it is selected by means of the main gate. Once charged, the isolated gate holds its charge indefinitely, even with the power off. The gates can be discharged electrically in large blocks and be charged selectively using the tunnel effect. In the erased EPROM all storage cells output the H level, and programming a cell can only change an H to an L level. The erasure can only be applied to large blocks within the cell array (on the related EEPROM the cells can be erased individually). Erasing and programming require higher voltages and are fairly slow. Current Flash memories include charge pumps to generate these voltages automatically.
The writing process is initiated by performing a series of special write operations with a timing similar to SRAM write operations that store the write data and address into registers, start the pump and trigger an internal circuit to control the subsequent charging of the isolated gates of the selected cell. The high voltage stresses the silicon structure, and the number of erase cycles is limited. Current chips support up to a million erasures, offer up to 64 M bit cells and guarantee a data retention time of 20 years. The read cycles are fairly fast (about 100 ns) and unlimited in their number.
The write cycles needed to initiate the programming of a cell mainly serve to protect it against accidental writes due to software flaws or hardware-related faults. For a common 8-bit wide EPROM chip the sequence is as follows (the data and address words are given in hexadecimal notation):
- Write $AA to address $5555.
- Write $55 to address $2AAA.
- Write $A0 to address $5555.
- Write the data to the desired address.

Afterwards, read operations from the location just programmed reveal whether the programming has completed (this may take about 10 ms); the sequence is sketched in C at the end of this section. Several locations in the same row of the cell array may be programmed in parallel to reduce the total time. A similar sequence of write cycles is needed to perform the erase operation.

Often, the contents of an EPROM are copied into a faster memory during the startup phase of a system. For this block transfer, subsequent locations need to be selected by stepping through the address patterns. The address generation can be integrated into the EPROM chip to further simplify its interfacing. Actually, the address patterns within the EPROM do not need to be related to the addresses appearing on the address bus of a processor reading its contents as long as the sequence of data words to be output is pre-determined. There are serial EPROM chips of this kind that output bit or byte sequences of up to 8 M bits and are housed in small packages with 8 to 28 pins. Their interface does not show any address signals but only a reset input for their internal address counter. For their programming, a serial protocol is used to enter the address and the write command.

As Flash EPROM chips are erased and programmed by means of electrical signals generated from the standard supply voltage, they can be attached (soldered) to a circuit board if the programming signals can be generated on it or routed to it via some interface. This is in contrast to former EPROM chips that required exposure to ultraviolet light through a glass window in their package for their erasure, the application of high programming voltages and special signal patterns. They were usually mounted in sockets and erased and programmed using special equipment.

Non-volatile storage at capacities of many G bytes, yet with slower access times and strictly serial access schemes, is provided by the well-known rotating magnetic and optical storage devices (hard discs, DVD) which are interfaced to digital systems whenever long-term mass storage is required. Magnetic storage devices have been used since the early days of electronic computation. A new generation of semiconductor memory chips is being developed (FRAM and MRAM) that relies on two competing cell technologies based on the ferro-electric and magnetoresistive effects. They promise non-volatile low-power storage combined with the high densities and the fast read and write operations found in current DRAM chips [22]. In 2001, the first commercial FRAM products appeared, including a 32k byte memory chip with the SRAM bus interface (see Figure 2.35) and operating at 3.3V, and by the end of 2002 a 64 M bit chip was reported, and a 1 M bit MRAM, too.
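The byte-programming command sequence listed earlier in this section translates into a short series of memory-mapped writes. The C sketch below assumes a hypothetical 8-bit Flash device mapped at FLASH_BASE and polls for completion by re-reading the programmed location, as described in the text; the base address and the timeout are illustrative and not taken from a specific part.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical memory-mapped 8-bit Flash EPROM; the base address is
   board-specific and chosen here only for illustration. */
#define FLASH_BASE 0x00400000u
#define FLASH8(off) (*(volatile uint8_t *)(FLASH_BASE + (off)))

/* Program one byte using the unlock sequence given in the text:
   $AA -> $5555, $55 -> $2AAA, $A0 -> $5555, then the data to its address. */
bool flash_program_byte(uint32_t offset, uint8_t data)
{
    FLASH8(0x5555) = 0xAA;
    FLASH8(0x2AAA) = 0x55;
    FLASH8(0x5555) = 0xA0;
    FLASH8(offset) = data;

    /* Poll by reading back until the device returns the written value
       (programming may take several ms); give up after a crude timeout. */
    for (uint32_t i = 0; i < 10000000u; i++) {
        if (FLASH8(offset) == data)
            return true;
    }
    return false;
}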
2.2.3 Processors and Single-Chip Systems
The elementary Boolean gates with only a few transistors but individual inputs and outputs are not good candidates for a highly integrated standard chip without also integrating
Figure 2.43 Generic processor module interface (clock, reset, address bus A0-A15, data bus D0-D15, /CE, /OE, /WE)
Figure 2.44 Single processor-based digital system: a processor with clock and reset drives the address and data buses shared by an EPROM (CE1), an SRAM (CE2), an input port (CE3) and an output port (CE4) (CE decoder not shown)
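As a hedged C view of the software side of such a system: the decoded chip enables of Figure 2.44 give the memories and the input and output ports disjoint address ranges, so the ports are accessed like memory locations. All addresses below are invented for illustration.

#include <stdint.h>

/* Hypothetical address map implemented by the CE decoder of Figure 2.44;
   the ranges are invented for illustration only. */
#define EPROM_BASE 0x0000u                          /* CE1: program memory */
#define SRAM_BASE  0x8000u                          /* CE2: data memory    */
#define IN_PORT    (*(volatile uint16_t *)0xC000u)  /* CE3: input port     */
#define OUT_PORT   (*(volatile uint16_t *)0xC002u)  /* CE4: output port    */

/* To the processor the ports look like ordinary memory locations:
   copy the current input value to the output port, inverted. */
void echo_inverted(void)
{
    uint16_t value = IN_PORT;
    OUT_PORT = (uint16_t)~value;
}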
interconnection facilities (see section 2.2.4). If, however, a particular, complex Boolean function can be used in many applications (or in a system that is needed in very high volume), its integration makes sense. This is the case for the Boolean functions that implement the arithmetic operations on signed and unsigned binary numbers and floating point codes that are the building blocks in all numeric algorithms. If a complex function can be applied several times, one will try to reuse the same circuit with the aid of auxiliary select and control circuits. This gives rise to another important class of standard components or modules, the programmable processors. A processor chip integrates a multifunction circuit providing a number of complex Boolean functions (e.g., the arithmetic operations on 16-bit binary numbers) and a control circuit for the sequencing and the operand selection. In order to support many applications (each with its own sequence of operations and operand selections), it interfaces to a memory holding a list of operation codes (instructions) for the operations to be carried out. The same memory can also be used to store the operands. The interface to a generic processor chip or module is shown in Figure 2.43. It is complementary to the standard memory interface in Figure 2.35. The processor drives the address and control lines of the memory bus (to which a memory module is attached) to sequentially read instructions and operands and to write results. The bus can also be used to access input and output ports that are connected to it like memory chips using decoded chip enable signals. If the sequential execution of operations performed by the processor meets the performance requirements of an application, then the system can be as simple as shown in Figure 2.44. The structure and the design of processors will be studied in much more detail in Chapters 4, 5, 6 and 8. The most important attributes of a processor are the set of Boolean functions provided by it (in particular, the word size of the arithmetic operations) and the speed at which they can be executed. Commercial processor chips range from processors integrating a few thousand transistors and providing some 8-bit binary arithmetic and some other Boolean operations on 8-bit codes at a rate of a few million operations per second (MOPS) to processors
Figure 2.45 Configurable bit port with data (D) and control (C) flip-flops connecting a data bus line to a package pin
executing arithmetic operations on 64-bit floating point codes at rates beyond one giga operation per second (1 GOPS = 1000 MOPS) and employing more than 10^7 transistors. The programmable processor and the memories to be interfaced to it are modules that, instead of being realized as separate standard chips, can also be integrated onto a single chip. Entire small systems of the type shown in Figure 2.44 are offered commercially as standard systems-on-a-chip (SOC), even including the inverter and PLL circuits for the clock oscillator and the Schmitt trigger circuit for the reset input. They are single-chip micro computers integrating e.g. a 16-bit processor, Flash EPROM, some SRAM, and a selection of standard interfaces including parallel and serial ports (except for the signal drivers) and counters (see section 6.6). On some recent chips the selection is quite abundant and for every specific application only a subset of the interfaces can actually be used. The unused ones, however, do not draw current and enable the chip to be used in more applications. The term SOC is also applied to systems realized on a single application-specific integrated circuit (ASIC) or systems realized on a single FPGA (see below) and just reflects the fact that all of the design hierarchy becomes mapped to a single chip. The interfaces implemented in a standard SOC product can usually be configured in order to serve as many applications as possible. The pins of the chip package may e.g. be programmed to be input or output signals or to serve special purposes such as extending the on-chip memory bus. The control signals required to select the different hardware functions are generated by means of control registers that are connected to the on-chip bus and can be written to under software control. Figure 2.45 shows a single-bit port that can be configured as an input or as an output by means of a control flip-flop. A number of such single-bit ports can be connected in parallel to the data lines of the bus to provide the parallel input or output of binary words. Configuration registers are used for various other purposes such as to set the bit rates and the data formats of asynchronous serial interfaces, or to define the address range at which a chip select signal provided for some external memory or port device becomes active.
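The configuration registers just described are written like ordinary memory locations. The following C sketch, with invented register addresses and bit positions (no particular SOC product is implied), configures bit 3 of such a port as an output by setting its control flip-flop and then drives the pin:

#include <stdint.h>

/* Invented addresses of a port's control (direction) and data registers;
   a real SOC product defines its own register map. */
#define PORT_DIR  (*(volatile uint8_t *)0xFF10u)  /* control flip-flops: 1 = output */
#define PORT_DATA (*(volatile uint8_t *)0xFF11u)  /* data flip-flops / pin levels   */

#define PIN3 (1u << 3)

void configure_pin3_as_output(void)
{
    PORT_DIR |= PIN3;            /* set the control flip-flop: drive the pad */
}

void set_pin3(int level)
{
    if (level)
        PORT_DATA |= PIN3;       /* data flip-flop high */
    else
        PORT_DATA &= (uint8_t)~PIN3;
}

int read_pin4(void)
{
    /* a bit left configured as an input is simply read back */
    return (PORT_DATA >> 4) & 1;
}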
2.2.4 Configurable Logic, FPGA
The individual CMOS gates and registers that are needed as building blocks of application-specific computational circuits are not suitable to be packed as integrated chips, as this approach cannot exploit the current level of integration. Instead, it is large inventories of such building blocks that are offered as chips, with a provision to connect them in an application-specific way within the chip. Even if most of the functions within a digital system are within highly integrated chips, there may remain some auxiliary 'glue logic' to interface them with each other, to decode selection signals for chips connected to a bus, or to provide some extra control and interface signals.
A common choice is to implement such functions (except for bus and interface drivers) in one or a few PLD devices (programmable logic devices). PLD devices arrived in the 1980s as a replacement for the large variety of gate and register functions previously provided as small-scale integrated circuits. They are multifunction circuits in which the selection signals of the actual function are generated on-chip. In the first PLD generations, the selection was defined once and for all by burning fuses within the silicon structure. Now these are generated by EEPROM cells that can be reprogrammed several times. The configuration memory and the transistor switches of a PLD add to its complexity, and any particular application will only use a part of its gates and flip-flops. As the PLD functions are usually just a small fraction of the overall digital system, these overheads are outweighed by the advantages of the higher integration of application-specific functions and the possibility of changing the circuit functions to some degree without changing the board design. PLD devices are programmed with bit streams that are compiled by design tools from a set of Boolean equations defining the desired behavior. More specifically, PLD devices contain identical slices, each generating a signal defined by OR'ing a few (e.g., 8) AND terms computed from the input and output signals of the device and their complements, i.e. a low-complexity disjunctive form, and optionally outputting its complement or a registered version of it by means of an integrated flip-flop (Figure 2.46). Output pins may be tri-stated and also serve as inputs. The AND terms are realized as wired AND functions and selected by means of a matrix of transistor switches. These switches and the output selectors are controlled by an on-chip non-volatile, electrically erasable memory. Thus the same PLD device can be configured for various functions including registers, feedback circuits using registers, decoders and selectors. PLD chips start from small packages with just 18 input and output signals. More complex ones include hundreds of flip-flops and provide many interface signals to accommodate application-specific interfaces or system functions such as DRAM control. They are usually composed of several simple PLD sub-modules, each of which selects a limited number of inputs from an interconnection structure spanning all of the chip. Some PLD circuits can be fixed to the circuit board and provide an interface that allows them to be programmed 'in circuit'. A complex PLD can be used for computational functions, too, but more flexibility and a still higher degree of integration of application-specific functions on configurable standard chips are provided by the field programmable gate arrays (FPGA). These allow for single chip implementations of complete digital systems and constitute a universal architecture for application-specific design. An FPGA provides a large array of identical configurable cells.
Figure 2.46 PLD slice feeding an output pin (simplified): a configurable disjunctive form of the inputs and feedback signals, a D flip-flop, output and polarity selectors, and an output enable
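A PLD slice of the kind shown in Figure 2.46 can be modeled as a configurable sum of products. In the sketch below, two configuration bits per (term, input) pair state whether the true input and/or its complement takes part in the AND term; this encoding is purely illustrative and does not follow any vendor's bit-stream format, and a term with no inputs selected evaluates to 1 in this model.

#include <stdint.h>
#include <stdbool.h>

/* Model of one PLD slice: 8 AND terms over up to 10 inputs, OR'd together,
   with an optional output inversion.  The numbers are illustrative. */
#define PLD_INPUTS 10
#define PLD_TERMS   8

typedef struct {
    uint16_t use_true[PLD_TERMS];  /* bit i set: input i feeds term t       */
    uint16_t use_comp[PLD_TERMS];  /* bit i set: complement of input i does */
    bool     invert_output;        /* output polarity selector              */
} pld_slice_cfg;

bool pld_slice_eval(const pld_slice_cfg *cfg, uint16_t inputs)
{
    bool or_result = false;
    for (int t = 0; t < PLD_TERMS; t++) {
        bool and_term = true;                     /* wired AND of the term  */
        for (int i = 0; i < PLD_INPUTS; i++) {
            bool in = (inputs >> i) & 1u;
            if ((cfg->use_true[t] >> i) & 1u) and_term = and_term && in;
            if ((cfg->use_comp[t] >> i) & 1u) and_term = and_term && !in;
        }
        or_result = or_result || and_term;        /* OR of the AND terms    */
    }
    return cfg->invert_output ? !or_result : or_result;
}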
Figure 2.47 Generic FPGA cell: input selectors fed from the neighbors and wire segments, a LUT and a D flip-flop, and an output selector driving the neighbors and wire segments
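The generic cell of Figure 2.47 essentially reduces to a 16-entry lookup table plus a flip-flop and an output selector. A minimal C model (assuming nothing beyond what the figure shows), with two example LUT contents:

#include <stdint.h>
#include <stdbool.h>

/* Generic FPGA cell: a 4-input LUT (16 configuration bits), a D flip-flop
   and a selector that forwards either the registered or the direct value. */
typedef struct {
    uint16_t lut;           /* truth table of the 4-input Boolean function */
    bool     use_register;  /* output selector                             */
    bool     dff;           /* state of the cell's flip-flop               */
} fpga_cell;

/* Combinational output for the four selected input signals a..d. */
bool cell_comb(const fpga_cell *c, bool a, bool b, bool x, bool d)
{
    unsigned index = (unsigned)a | ((unsigned)b << 1) |
                     ((unsigned)x << 2) | ((unsigned)d << 3);
    return (c->lut >> index) & 1u;
}

/* Clock event: capture the combinational value in the flip-flop. */
void cell_clock(fpga_cell *c, bool a, bool b, bool x, bool d)
{
    c->dff = cell_comb(c, a, b, x, d);
}

bool cell_output(const fpga_cell *c, bool a, bool b, bool x, bool d)
{
    return c->use_register ? c->dff : cell_comb(c, a, b, x, d);
}

/* Example configurations: lut = 0x8000 implements the AND of all four
   inputs, lut = 0x6996 the 4-input XOR. */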
The configurable functions of these are the elementary building blocks of the FPGA architecture. A typical FPGA cell computes a 4-input Boolean function and also provides a flip-flop (Figure 2.47). Only the border cells are special and connect to the external interface signals of the FPGA package. The inputs to a cell are selected from the outputs of others according to the configuration data. They cannot be selected, however, from arbitrary outputs of the thousands of others but only from the direct neighbors of the cell and from a limited number of wiring segments that can be linked into longer-distance connections if needed. The cells are arranged in a regular pattern and fill the chip area. The regular arrangement of the cells and their fixed interconnection facilities permit the FPGA architecture to be scaled, i.e. to build larger arrays from the same kind of cells and to offer families of FPGA chips with cell arrays of different sizes. Current high-density FPGA devices offer more than 10000 cells and the equivalent of a million gates (not counting the configuration memory and the switches). The number of border cells grows with the total size of the FPGA. Generally, FPGA packages have higher pin counts than memory chips, exceeding 1000 for the largest FPGA packages. Most current FPGA devices use SRAM configuration memories. The configuration RAM can be automatically loaded with a sequence of bits or bytes from a serial Flash EPROM. Alternatively, the FPGA can be attached to a processor bus using an integrated control port and receive the sequence of configuration data words from there. The control port is attached to the data bus of the processor, and the processor reads the configuration words from its own EPROM, which is hence shared by the processor program and the FPGA data. The same control port can be used as an interface from the processor to the FPGA circuits after configuration. The use of SRAM for the configuration memory implies the additional capability of reconfiguration for different steps of a computation, which can raise the overall efficiency in some applications. The control port may include address lines to give the attached processor random access to the configuration memory. Then, the overheads involved in a serial configuration protocol are avoided, but at the expense of having to dedicate many interface signals of the FPGA to the purpose of configuration. Some FPGA chips also offer the capability of being partially reconfigured. Then a part of the application circuit is changed while the remaining circuits keep running. To exploit this, one has to set apart a subarray of the FPGA to which the changes are confined and to use fixed interfaces to the rest of the FPGA. Due to such restrictions, and without the support of high-level tools, partial reconfiguration is only rarely used. The large amount of configuration memory and the bit- or byte-serial access to it result in fairly long (re-)configuration times. Current FPGA chips do not provide an extra configuration memory that could be loaded in a pipelined fashion without interrupting the current configuration. Thus the reconfiguration time for an FPGA cannot be used for computations.
Figure 2.48 FPGA reconfigure (R) and compute process over time, with configuration 3 as an alternative branch to configurations 1, 2 and 4 (repetitive)
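The process of Figure 2.48 amounts to alternating between loading a configuration and computing with it, with the outcome of one compute phase selecting the configuration for the next. The sketch below uses placeholder functions that only print a trace; no real FPGA configuration API is implied.

#include <stdio.h>
#include <stdbool.h>

/* Placeholders for a hypothetical configuration controller; in a real
   system they would feed bit streams to the FPGA's configuration port. */
static void load_configuration(int conf)
{
    printf("reconfigure: #%d\n", conf);            /* the 'R' phase */
}

static bool compute(int conf)
{
    printf("compute:     #%d\n", conf);
    return conf == 2;       /* pretend configuration 2 takes the branch */
}

/* Repetitive reconfigure-and-compute process: the result obtained with
   configuration 2 decides whether configuration 3 or 4 is loaded next,
   i.e. the sequence of reconfigurations implements a control flow. */
int main(void)
{
    for (int pass = 0; pass < 2; pass++) {
        load_configuration(1);
        compute(1);
        load_configuration(2);
        int branch = compute(2) ? 3 : 4;
        load_configuration(branch);
        compute(branch);
    }
    return 0;
}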
The full or partial reconfiguration of an FPGA can exploit a control flow to use different configurations for the alternative branches of an algorithm (Figure 2.48). The loading of new blocks of configuration data is not directly supported by the automatic load circuits of the FPGA but requires an additional control circuit (that could be part of the serial EPROM device) or the loading by an attached processor. It is quite similar to loading new blocks of instructions into the internal instruction cache memory (see section 6.2.3) of a programmable processor, which is essential for running complex applications on it, too. Without an extra load circuit, an FPGA intended for SOC applications would have to support the reconfiguration control flow through some persistent application circuit and would therefore need the capability of partial reconfiguration. FPGA chips suffer from large configuration overheads. For example, to configure an arbitrary Boolean function of four inputs, a 16-bit configuration memory used as a lookup table (LUT) is required. More is required for the input and output switches and for switches between wiring segments. Current FPGA devices consume 128–320 bits of configuration data per cell and accept a more than 10-fold overhead in chip area for their configurability (100-fold compared to an optimized integration of the application circuit without building on multi-function cells). Moreover, due to the limited interconnection resources not all of the available cells can be used in a given application, and some of the cells are only used inefficiently. An efficient design reconfiguring the FPGA resources can use a smaller cell array and proportionally reduce the configuration overheads. The performance of an FPGA implementation is lower than that of an equivalent fixed-configuration ASIC due to the larger size of the FPGA chip and the delays through the electronic switches. Similarly to ASIC designs, the timing of the application circuit is not the result of its structure (the 'algorithm') but depends heavily on the routing of the interconnections by the design tools. The resulting high cost-to-performance ratio of FPGA circuits is partially compensated for by savings at the board level due to the higher integration and the fact that the FPGA is a standard part that can be produced in volume to serve many applications. Also, as long as the configuration data don't change, the configuration circuits inside the FPGA do not consume power. A common way to counter the low overall efficiency of FPGA devices is to integrate standard building blocks such as memory arrays, fixed multi-bit functions and even programmable processors into the devices. Processors integrated into the FPGA chip are useful for implementing the sequential control of FPGA circuits that is needed for the efficient usage of the cells (see section 1.5.3). Simple control circuits and even processors can also be built from the memory arrays and cells of the FPGA (see Chapters 5 and 6). Apart from the processing speed of the cells and the level of integration resulting from the underlying chip technology, the available FPGA architectures differ in such basic features as the capabilities of the cells, the definition of the neighborhood of a cell and the provided pattern of wiring segments, and the choice and integration of predefined standard structures, and in such system-related features as their input and output capabilities and features related to their configuration or the handling of clock signals.
While the memory chips of the different
categories (SRAM, DRAM, etc.) have similar structures and are easily compared by their parameters, the design of an FPGA architecture leaves many choices. All are concerned with the efficient usage of cells for arithmetic operations and make sure that basic arithmetic circuit elements like the binary full adder with a product operand (see section 4.1) can be realized in a single cell, and provide memory blocks for banks of registers and sequential control that cannot be realized as efficiently with the cell flip-flops. Some play tricks to make certain configuration circuits available for the application processing. To compare different FPGA architectures, one has to determine the total cost and the performance obtained in particular, relevant applications (the results of such analysis also depend on the quality of the tools generating the configuration code). Although the FPGA is provided as a basis of application specific design, it is interesting to consider the task of designing an FPGA architecture as well which includes a proper choice of building blocks and a versatile interconnection structure. Following the above remarks and those made in section 1.5.3 the configuration overheads can be reduced by doing the following:
- Keeping the set of configurable cell functions small;
- Using fairly complex functions;
- Sharing configuration circuitry between several cells;
- Providing predefined structures for sequential control;
- Supporting pipelined partial reconfiguration loads.

FPGA structures with complex cells have been considered in the research literature [23]. A simple approach to sharing control is to perform identical functions on sets of two or four cells or to use a more complex cell like the one proposed in section 4.4, and to switch segments of multiple wires, which slightly increases the overall costs if just single-bit operations can be used but significantly reduces the configuration overhead otherwise. The dependencies of the timing and the correct function of application circuits on the routing could be dealt with by a two-level scheme distinguishing local (fast) connections between cells and long-distance connections routed through switches and wire segments, and using handshaking for the latter. An FPGA architecture suitable for asynchronous circuits was reported in [24]. Finally, one would consider integrating a control circuit to perform multiple reconfigurations (or even multiple threads of partial reconfigurations). In the commercial FPGA products only the integration of some complex predefined functions has been realized. Some integrate simple processors that can also be used to perform reconfigurations of the FPGA. The following examples show the different feature mixes in some of the current products.

The At40k family from Atmel provides a range of low-to-medium density FPGA devices that operate from a single 3.3V supply. These devices may not cover entire systems but are convenient for application-specific interfaces and special functions. The largest one, the At40k40, contains an array of 48 × 48 cells (i.e., 2304), each providing a Boolean function of four inputs (or two functions of three inputs) and a flip-flop. The '40' suffix refers to a claimed equivalent of about 40000 gates (about 18 per cell). The other FPGA manufacturers' families make similar claims with even higher ratios of gates per cell. Although a look-up table of 16 entries does require 15 select gates, these numbers are misleading. It is more useful to compare the number of 4-input look-up tables. Real applications implement special Boolean functions and never exploit the complexity of the cells as universal circuits, and hardly pack more than a full adder plus an AND gate (the multiplier building block) into an average cell, which is the equivalent of 6 gates.
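To make the yardstick used above concrete: the multiplier building block just mentioned adds a partial-product bit (the AND of a multiplicand bit and a multiplier bit) to an incoming sum and carry. A plain C rendering of this 4-input, 2-output function:

/* Multiplier building block: a full adder whose one operand is the AND
   (partial product) of a multiplicand bit x and a multiplier bit y.
   Inputs x, y, s (sum in) and c (carry in) are single bits. */
typedef struct { unsigned sum, carry; } fa_out;

fa_out fa_with_product(unsigned x, unsigned y, unsigned s, unsigned c)
{
    unsigned p = x & y;                       /* the AND gate           */
    fa_out o;
    o.sum   = p ^ s ^ c;                      /* full adder sum         */
    o.carry = (p & s) | (p & c) | (s & c);    /* full adder carry       */
    return o;
}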
The At40k cells receive input from their 8 neighbors and interface to 5 vertical and 5 horizontal bus lines that span 4 cells each and can be connected to adjacent wire segments through extra switches. For every group of 4 × 4 cells there is an extra 32 × 4 bit dual port RAM block that can e.g. be used to implement banks of registers or simple automata. The RAM blocks can be combined into larger RAM structures. The border cells can be configured for different load currents. A typical package is the 20 × 20 mm² 144-pin TQFP. There are about 128 configuration bits per cell (including those used by the switching network). The At40k FPGA chips receive their configuration data from a serial EPROM or via an 8- or 16-bit data port controlled by an external processor and can be partially reconfigured. An important extension of the At40k family is provided by the At94k devices, which significantly enhance the FPGA resources by also integrating an SRAM of 16k 8-bit words (bytes) and an 8-bit processor with another 10k 16-bit words of SRAM to hold its instructions. The data SRAM is e.g. useful to implement data buffers for interfaces implemented in the FPGA that would otherwise need access to an external memory. The processor can be used in particular for the fast reconfiguration of parts of the FPGA circuits (even individual cells) through a fast, internal interface to the FPGA configuration memory, with the option to implement a control flow for a compute and reconfigure process of the FPGA (Figure 2.48). On the At40k an equivalent interface is available to an attached processor at the price of more than 32 dedicated FPGA signals, the corresponding amount of circuit board area, and the generation of wide addresses by the attached processor. On the At94k, no external FPGA signals are needed for this purpose, and the time-consuming reconfiguration does not have to be handled by some attached processor. The integrated processor can also be used for the sequential control of FPGA functions. Besides these FPGA enhancements, it can also be used for conventional software functions such as input and output via serial interfaces and real-time control using the integrated timers. The processor bus does not leave the chip; only some interfaces from the processor section are connected to package pins. The At94k devices are pin compatible with the At40k devices and loaded with configuration and program code from the same kind of serial memory via a three-wire interface. The combined hardware and software capabilities allow for numerous applications of the simple FPGA plus EPROM set-up. The Virtex II family from Xilinx provides medium-to-high density devices with predefined arithmetic building blocks. The devices use a 1.5V supply for the cell array but support 3.3V for the pin drivers. As an example, the XC2V1000 device provides as many as 10240 cells with a 4-input LUT and a flip-flop each in a 17 × 17 mm² 256-pin BGA package or in a range of larger ones (less complex Virtex II chips starting from 512 cells are available in the 256-pin package, too). The cells are grouped by 8 into 1280 configurable logic blocks (CLB). Inside a CLB the look-up tables can be combined, and some extra gates speed up the binary add. Each CLB is connected to a switch matrix (SM) that implements the switched connections to the adjacent and more distant ones (Figure 2.49).
Each matrix has double connections to each of the eight neighboring matrices, multiple connections to the horizontal and vertical neighbors at distances 2, 4, 3 and 6, to horizontal and vertical long lines spanning the entire chip and to four horizontal bus lines, and can pass an input signal to an output without switching it to the attached CLB. Bus lines are supported, too. The cells within the CLB are coupled more tightly so that two levels of interconnections may be distinguished (in contrast to the Atmel architecture). There are about 300 configuration bits per cell. The XC2V1000 is a multi-million transistor chip.
Figure 2.49 Interconnection structure of the Virtex-II FPGA: switch matrices (SM) attached to the CLB cell clusters
As a special feature of the Virtex architecture, the 16-bit LUT defining the Boolean function performed by a single cell can be changed into a 16-bit dual port RAM or a 16-bit shift register by reclaiming configuration resources for the application circuits. The XC2V1000 also provides 40 dual port RAM blocks of 18k bit each (a total of 90k bytes). Moreover, there are 40 predefined arithmetic building blocks performing the multiplication of 18-bit signed binary numbers with a 36-bit result. They are interfaced to the RAM blocks and are most suitable in signal processing applications where many multiplications can be carried out in parallel. The implementation of an 18-bit parallel multiplier by means of cells would cost at least 324 cells (see section 4.3), the 40 multipliers hence 12960 cells. The multiplier building blocks are much smaller and faster than the equivalent cell networks and have no configuration overheads. The Virtex-II FPGA also provides a testing interface through which configuration data can be loaded and read back. It also supports partial reconfiguration and provides access to the contents of the flip-flops and memory locations. Moreover, it can also be used as a serial interface from within the application. Finally, there are sophisticated resources for the generation and synchronization of clock signals, and various options for the signal levels at the pins including LVDS. Serial interfaces built with these achieve bit rates of up to 840 Mbit/s. The Virtex-II family has been extended to the Virtex-II Pro family that also includes programmable processor modules (up to 4) based on the PowerPC architecture (see section 6.6.4) and still faster serial interfaces with special shift registers that encode clock and data on the same lines. The processor modules are fast 32-bit processors executing up to 400 million instructions per second (MIPS) at a power consumption of less than 0.4W and each include 32k bytes of cache memory. They are optionally interfaced to the FPGA memory blocks. More memory can be added by interfacing memory chips to some pins of the FPGA chip and by routing the processor bus to these. It may be useful to have several sequential control circuits available, but the PowerPC modules appear oversized for just controlling FPGA circuits and would typically take over a substantial part of the application processing. They do not have access to the configuration memory. The APEX family from Altera also extends to high densities. The smallest packages are 484-pin 22 × 22 mm² BGA packages. The EP20k1000 e.g. provides as many as 38400 cells with a 4-input lookup table and a flip-flop each, in a 672-pin package. The cells come in logic array blocks of 10 (LAB) which are the counterparts of the Virtex CLBs. 24 blocks
line up into a ‘mega lab’ which is a subsystem supplying local connections to neighboring LABs and a set of extra connections between all of them. 160 mega labs are attached to a grid of horizontal and vertical ‘lanes’ across the chip. Each mega lab also contains an embedded system block (ESB) that can be configured to provide PLD style logic functions, i.e. disjunctive forms with more input variables than supported by the LUT cells, or serve as a 2k bit dual port RAM. As a special feature of the APEX architecture the ESB can be configured to operate as a 1k bit CAM memory. The APEX device uses about 224 configuration bits per cell (including the bits for the routing). Altera also offers the APEX chip with an integrated 32-bit ARM processor running at 200 MIPS (section 6.6.3), and also additional SRAM for it. The ARM processor can be used to (totally) reconfigure the FPGA cell array and to implement a control flow for the reconfigurations. The smallest version integrates 32k bytes of processor memory and an interface to external SDRAM memory and yields a complete FPGA plus processor system if a single 8-bit wide Flash EPROM chip is added. A memory bus to external DRAM chips is useful in many FPGA applications. Although the processor plus memory part occupies much less chip area than the FPGA part, the chip does not seem to be conceived as an FPGA enhancement but more as a single chip system also integrating the common sequential processor component to implement software functions. A more recent architecture from Altera is the Stratix family. It provides larger memories (yet no longer the CAM option) and arithmetic building blocks. The chip EP1S25 with 25660 cells includes a total of about 256k bytes including two large dual port RAM blocks of 72k bytes each. It also contains 40 18-bit multipliers and associated adders that are grouped into 10 digital signal processing (DSP) blocks. The multipliers in a DSP block can also be configured to perform eight 9-bit or a single 36-bit binary multiplication. The Stratix devices also implement LVDS and include fast shift registers. 8-bit multipliers are also provided on a recent FPGA from Quicklogic. While the Virtex II and Stratix families are expensive high-end products that are intended as platforms to implement entire high-performance application-specific systems including the required computational functions, there are lower-cost derivatives from them in cheaper packages such as a 144-pin TQFP with or without a smaller number of special arithmetic functions and with smaller amounts of integrated RAM that still extend to considerable densities. They support LVDS interface signals and can be used to implement fast serial interfaces using the FPGA cells and are typically used to implement interface functions to other processors although they also permit the implementation of simple processors and systems. These new, lower-cost FPGA families let the FPGA become attractive in a broader range of low-to-medium volume applications. These include simple processor and software-based applications as the available cell counts suffice for the implementation of simple processors. The low-cost version derived from the Virtex II architecture is the 1.2V Spartan-III family. This is the first FPGA to be fabricated in a 0.09 µ technology. It shares many features of the VirtexII but is slightly slower and provides a lower density for the multipliers and the embedded RAM. 
The XC3S400 device is the densest one offered in the 144-pin TQFP (which we fix as a parameter related to the board level costs). It packs about 8000 cells, 32k bytes of RAM and 16 multipliers, and maintains the special Xilinx feature of being able to use some cells as 16-bit registers. The low-cost variant of Stratix is the 1.5V Cyclone FPGA. The EP1C6 packs nearly 6000 cells and 10k bytes of RAM into the same package. Both families
Table 2.1 Evaluation of some FPGA chips

FPGA family    Technology   MIPS design                CPU2 design
                            clock rate   cells used    clock rate   cells used
Atmel At40k    0.35 µ        5.6 MHz      5371           9.5 MHz      1973
ProASIC+       0.22 µ       54.4 MHz     11567          73.2 MHz      4393
Cyclone        0.13 µ       54.5 MHz      4052          42.3 MHz      1785
Spartan III    0.09 µ       77 MHz        1052          60.2 MHz      1240

Source: Compiled by W. Brandt, TU Hamburg-Harburg
achieve high clock rates and fast arithmetic functions through dedicated carry signals. An FPGA with a similar density but using an integrated Flash memory instead of an SRAM to hold the configuration data and which consequently needs no extra storage device to become operational is in the ProASIC+ family from Actel. The At94k40 FPGA is comparable to these more recent FPGA chips in terms of cell count and RAM integration. These latter achieve better cost to complexity ratios, higher clock rates and new interfacing options yet do not provide the special reconfiguration and processor support of the At94k. Table 2.1 lists the estimated clock rates and cell counts achieved for these chips in two reference designs using a common vendor independent tool (except for Spartan III), using a global clock for all registers. The first is a 32-bit processor core with the MIPS I architecture [48], and the second uses a behavioral definition of an early version of the CPU2 (see section 6.3.2) without any structural optimizations. Note that an overview of this kind just represents a snapshot of some current chip offerings at a particular time. The results of a particular reference design do not necessarily carry over to other designs and also depend on the optimization capabilities of the design tools and the suitability of the design to the particular FPGA architecture. The specific predefined functional blocks (e.g. memory blocks) of an FPGA architecture require some changes in a processor design. A general rule is that the cost to performance ratio offered by an FPGA family is mostly related to the feature size of the technology used for it (this also holds for processor chips). The most recent offerings simply outperform the older ones but they may lack particular features such as partial reconfiguration. The high number of cells needed for the Actel FPGA reflects its simpler cell structure. Note that there is no Atmel device with 5000+ cells. The actual clock rates for the designs depend on the results of the placement and routing steps (cf. section 7.5). For a Virtex II FPGA, estimated clock rates are about 30% slower. After placement and routing they get reduced by about another 30% for the MIPS design, but remain close to the estimate for the other. For both, the cell counts decrease. The supplied data are thus fairly incomplete and only clearly indicate the difficulties of benchmarking.
2.3 CHIP LEVEL AND CIRCUIT BOARD-LEVEL DESIGN
Circuit boards are used to mount the individual hardware building blocks (chips and others), to interconnect them, and to provide access to the interface signals via connectors. They also distribute the power and participate in removing the heat generated by the components. Larger digital systems are realized by several circuit boards mounted on 'motherboards' or in racks and cabinets. The circuit board simply constitutes the next hierarchical level of the
hardware. As for the pin count of chips, it is desirable to have low signal counts at the external interfaces of a circuit board (the connectors). The design of the circuit boards can be done so that boards can easily be plugged together by means of connectors, or by mapping several design modules to a single board. The main differences between the chip, board and cabinet design levels are in the degree of miniaturization and reliability and in the cost involved in designing application-specific systems and building them in a given volume. In contrast to the integrated circuits, boards are more expensive but easier to design, rely on a simpler manufacturing technology and can be used to implement low volume applications at reasonable costs. Still simpler yet more costly technology is required if an application-specific system can be plugged together from existing boards. If, however, the volume goes higher, it is more cost effective to map several boards to a single one, and multiple components to a single ASIC. The design choices for a circuit board are similar to the ones for chip design. As in the case of the chips, a circuit board is a fixed, invariable hardware module. A circuit board design can be made so that different subsets of components can be supported and different components match with a compatible interface. The fixed hardware structure of a circuit board can be compensated for by designing it to be configurable. Then it may allow more than one application or changes within an application without having to redesign the board. A standard circuit board serving many applications can be produced in higher volume (at lower cost), and its development costs become shared. Also the module interfaces at the chip and board levels are similar. It is e.g. common to link several boards via bus lines. The interconnection of the chips is by fine leads etched out of a copper coating on the surface of an epoxy carrier plane. Circuit boards use between 1 and 12 interconnection planes fixed on top of each other. Leads at different planes are connected via connections ('vias') chemically deposited in small holes through the layers. It is common to use dedicated non-etched layers for the ground and power supply voltages. The integrated circuits are mounted to the top and bottom surfaces of the board. The placement of the components and the routing of the interconnections through the layers are done with the aid of CAD tools (as in the case of chip design; for very simple boards, they can be done manually). For the electrical design issues of circuit boards we refer to [2] and [9] except for a few remarks. The power supply noise caused by synchronous clock transitions at many sites overlays all digital signals and must be taken care of by connecting the power planes through capacitors with low series resistances. On a circuit board the digital signals may change at rates of up to several 100 MHz (on-chip rates even extend to several GHz). Traces on the circuit board beyond a few cm must be considered as wave guides at the ends of which reflections may occur that overlay the digital signal and in some cases cause a faulty operation of the high-speed digital circuits interfaced to them. Therefore such signal lines need to be driven through series resistors matched to the impedance of the transmission line (typically 24–100 Ω) and, at longer distances or for bus lines, to be terminated at the ends by a matched resistive load.
The signal delays at longer distances need to be considered, and synchronous signals such as the clock and the data lines of a synchronous serial bus should be routed close to each other (this is even true for long-distance connections within a chip). A signal travels 15-20 cm in a time of 1ns. There may be cross-talk between signal lines, and high frequency signals are radiated so that the digital system may have to be shielded in order to keep this radiation within required limits. Cross-talk and radiation are low for differential signals such as those according to the LVDS norm. In mixed analog/digital systems even a slight cross-talk from the fairly large
digital signals (in particular, clock signals and power supply signals) to the analog signals involved can be a severe problem. Usually, the analog circuits such as operational amplifiers are set apart from the digital ones and get an extra ground reference that is linked to the digital ground at a single point only, and an extra, decoupled power supply line.
2.3.1 Chip Versus Board-Level Design
It is possible to trade off chip-level integration for board-level integration to minimize the total cost. A circuit board can be collapsed into a single chip, but sometimes it can be more cost effective not to use a highly integrated chip but to distribute its functions to several chips (higher volume ones, or chips from different vendors). The manufacturing costs of the circuit boards grow with the board area, the number of layers, and the number of vias between the layers. For a simple board design, the integrated circuits should have as simple interfaces as possible (in terms of the number of i/o signals). This requirement is, of course, common to all modular structures (including software modules). For a single chip system the board design becomes almost trivial as, apart from simple support functions, only the external interface signals must be routed from the chip to appropriate connectors. System designs of this kind arise when a single-chip micro controller with integrated memory suffices to fulfill the processing requirements, or an FPGA configured from a serial EPROM. This simplicity is not always achieved. If e.g. a processor chip needs a large external memory, a large number of interfacing signals is required. Current highly integrated chips have hundreds of interface signals. High pin counts require BGA packages that can no longer be supported by single or double layer circuit boards at all and hence raise the cost of the circuit board design even if just a subset of them is used for an application. An exception to the board costs increasing with the pin counts of the chips occurs when the layout of the pins of two chips to be connected to each other is such that they can be arranged side by side and the connections become short and stay within a single board layer. Wide memory buses and multiple signal lines not only consume board area but also cause capacitive loads and signal delays. In contrast, multiple, wide memory structures can be easily supported within a medium-sized chip. For large chips, propagation delays, capacitive loads and the wiring area are an issue, too, although less than at the board level. Extended buses with multiple (capacitive) loads, in particular buses extending to several boards, require the buffering of the local signals with auxiliary driver circuits, and possibly a partitioning into several segments. Buffering and partitioning cause additional signal delays. There is thus an advantage in not extending a bus outside a board, and in attaching only a small number of circuits mounted at a small distance. The highest performance results if the bus signal does not need to leave the chip at all. Some techniques to simplify the chip interfaces and thereby the circuit boards have already been discussed. One is to use multiplexed buses if off-chip buses are required at all. The other is to use serial interfaces instead of parallel ones if the required data rate allows for this. A serial bus designed to connect a number of peripheral circuits to a processor is the I2C bus introduced by Philips. It uses only two open-collector bus lines pulled up by resistors (one for the data and one for a clock signal) and supports data rates of up to 400k bits/sec. As a bus it is shared for the data exchanges with all the different attached peripherals, and the data rate is shared as well. Every device connected to the I2C bus has an address that is transferred via the bus, too, to direct subsequent data to it.
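The address-then-data framing can be sketched at the byte level as follows; the bus primitives are stand-ins that only print a trace (real designs bit-bang two port pins or use an on-chip I2C controller), so no vendor API is implied.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Byte-level model of an I2C write transaction; the primitives below only
   print a trace and always report an acknowledge. */
static void i2c_start(void) { printf("START\n"); }
static void i2c_stop(void)  { printf("STOP\n"); }
static bool i2c_send_byte(uint8_t b)
{
    printf("byte 0x%02X\n", b);
    return true;               /* pretend the addressed device ACKs */
}

/* Every transfer begins with the 7-bit device address plus a R/W bit, so
   the peripherals sharing the two bus lines can decode whether the
   following data bytes are meant for them. */
bool i2c_write(uint8_t dev_addr7, const uint8_t *data, unsigned len)
{
    i2c_start();
    bool ok = i2c_send_byte((uint8_t)(dev_addr7 << 1));  /* R/W bit = 0: write */
    for (unsigned i = 0; ok && i < len; i++)
        ok = i2c_send_byte(data[i]);
    i2c_stop();
    return ok;
}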
Figure 2.50 Serial interface bus (clock and data lines) from a micro computer to attached interface modules IF1, IF2 and IF3
Figure 2.51 Processor board design (case study)
The I2C bus significantly simplifies the board design, yet at the cost that the processor and every peripheral chip need to implement the serial interface and the address decoding required for selecting between them (Figure 2.50). Simple I2C peripherals have addresses that are predefined except for one or two bits, which are set by wiring some device pins to H or L. An I2C bus is a multi-master bus that may be shared between several processors used in a system. It provides a mechanism to detect collisions due to simultaneous attempts to output to the bus. Serial interfaces used to simplify the wiring of chips within circuit boards can also be based on LVDS interfaces that operate at much higher data rates. The same techniques apply to arrive at simple board-level interfaces. If no bus extends beyond the board, no bus drivers and connectors are required and the high frequency signals that may be used by the on-board circuit do not appear outside. Serial interfaces between the boards reduce the wiring effort. If only external i/o and network interfaces remain, the boards become cheaper and simpler. As a case study we present a general purpose circuit board design from standard components announced or available in 2003 (Figure 2.51). As a general purpose board it should provide useful computational resources as characterized in terms of kinds and numbers of available operations per second, configurable input and output interfaces, and some networking interfaces for multi-board applications in order to be applicable as an architectural component. Moreover, the board design might allow for different selections of components in order to efficiently support different classes of applications. Intended applications for the board were its use within a large distributed control and measurement system, and its use to control an autonomous robot vehicle. Requirements from these applications were the availability of an Ethernet interface (section 6.5.3.3) and support for the TCP/IP protocols, a CAN bus interface (section 6.5.3.1), an asynchronous serial interface, several counter and timer functions, and a particularly small size. The manufacturing costs of the board should be kept low by not using BGA packages that would require more than four circuit layers.
BGA packages that would require more than four circuit layers. The non-volatile memories should be programmable without removing them from the board. Some standard chip offerings integrate several of the required system components in a convenient way. One could have used an integrated processor including the Ethernet support and the CAN bus like the Motorola MCF5282, or a floating point processor like the SH7750 (see section 6.6.4). There is a choice of fairly powerful micro controllers for automotive applications packing a CAN controller, counter and timer functions, analog input, and even the required program memory, e.g. the XC161 from Infineon, the DSP56F8356 from Motorola, or the TMS320F2812 from Texas Instruments (section 6.6.2). Instead, it was decided to place the costly Ethernet interface, the CAN bus controller and the number-crunching support onto separate chips that would only be mounted if required. The networking of several boards and general-purpose, configurable interfaces including the required counter and timer interfaces would be supported by an FPGA. A fast micro controller with DSP capabilities (the Blackfin processor, see section 8.4.3) is provided to supply computational resources for the application programs and for the IP protocol support, using a separate serial Flash memory chip as its program store and to hold the code for the other devices. Separate Flash memory chips are offered with higher capacities and endurance than the memories integrated into processors. If analog input is needed (cf. section 8.1.1), a separate ADC chip can deliver a better performance as well. The Blackfin chip integrates some SRAM and controls an optional SDRAM chip. Its integrated PLL clock generator allows the operating speed of the board and its power consumption to be adjusted to the needs of the applications. The LAN controller chip interfacing the board to the Ethernet is connected to the micro controller via the FPGA chip. It also integrates extra buffer space for the Ethernet frames and was selected to support the 100 Mbit/s rate as many boards may have to share the Ethernet bus. The FPGA is a Spartan III chip that provides fast serial interfaces for connecting several boards using LVDS signals and can also be used as an additional compute circuit. The micro controller can be used to reconfigure the FPGA. The optional coprocessor provided for applications that need fast floating point processing is an integrated DSP chip (see section 8.5) operating in parallel to the micro controller, using an on-chip instruction and data memory of 256 kbytes of its own. It also adds some extra interfaces. The processors provide up to about 10^9 16-bit integer or 32-bit floating point operations per second at a power consumption of about 1 W. If just the Blackfin and the Flash memory chips are used, the power dissipation becomes as low as 0.2 W. The size of this fairly powerful processor board is only about 10 × 10 cm (there are just 5 larger chips including the LAN controller), and it uses just four layers (2 signal and 2 power supply layers), which is the result of keeping the interfaces between the chips to a strict minimum. The only chip requiring a large number of interface signals to a processor bus is the optional SDRAM chip. The segmentation of the bus through the FPGA simplifies the wiring and allows both segments to be used independently. The processing functions could have been packed even more densely by using BGA packages, yet at the expense of higher board manufacturing costs.
The chips all interface via 3.3 V signals and only need an additional 1.2 V core voltage supply, a consequence of the consistent choice of chip technologies. The board holds the switching regulator needed to generate the core voltage and runs from a single supply voltage. Various interface signals leave the board, but no bus leaves it for a further memory expansion. The Flash memory can be programmed by connecting a USB adapter to the board.
2.3.2 IP-Based Design
In a board-level design one selects components such as processor, memory, and interface chips and connects them on the board according to their bus interfaces. Similarly, in an ASIC or an FPGA-based SOC design one also tries to compose the desired function from proven modules of similar kinds. A common module such as a processor is a standard structure that requires additional software tools for its programming and some specific know-how on its capabilities, its interfacing and the related design tools. It is therefore desirable to use such a component in many designs. As well as reusing components developed in previous designs, it has become common to license standard components called IP modules (intellectual property modules) and the related tools, in very much the same way as one previously did for chips at the board level, from vendors specialized in providing well-tested component designs with a guaranteed performance and the tool support. IP modules may be offered for specific FPGA families and without access to the component design apart from its interface, or as portable sources in some hardware design language that allow the component to be synthesized on several platforms. In contrast to hardware components (chips), IP modules may be parameterized, allowing e.g. the word size of a generic processor architecture and the number of registers actually implemented to be adjusted to the cost and performance requirements of an application. In some cases, however, the licensing conditions for IP modules may exclude their use in small-volume applications. Xilinx and Altera both offer processor cores for the FPGA families discussed in section 2.2.4. The 16/32-bit NIOS processor core from Altera e.g. can be used for the APEX, Stratix and Cyclone devices and is supported by a C compiler. A processor realized with the costly resources of an FPGA must achieve a sufficient performance using as few FPGA cells as possible, and interface through some simple software and hardware interface to FPGA memory blocks, companion IP modules, and application-specific FPGA circuits. NIOS uses about 1000 cells of an APEX device and achieves a performance of 50 MIPS. A system-on-a-chip design based on a set of IP modules can be supported by software tools composing a component and interconnection table from a menu-driven user interface. The SOC builder tool from Altera lets one specify the processor and memory parameters, and the numbers and kinds of certain serial and parallel interfaces that then become synthesized on the FPGA. Several NIOS processors can be implemented and interfaced to each other on an FPGA. The bus of a processor implemented on an FPGA or an ASIC constitutes an important module interface for IP modules from other vendors or for the integration of application-specific interfaces to the processor subsystem. On-chip bidirectional data buses are not always available. Then the bus specifies separate read and write signal sets and selector circuits to connect the data output of some device to the read data lines of the processor. Examples are the Avalon bus used for the NIOS processor or the high-performance bus specified in the advanced microcontroller bus architecture (AMBA) for ARM processor based systems [45].
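As a rough illustration of such parameterization (using the VHDL notation introduced in Chapter 3), the following sketch instantiates a hypothetical vendor-supplied FIFO module twice with different word sizes via a ‘generic map’; the component name, its generics and its ports are invented for the example and do not correspond to a particular vendor's IP interface.

  -- Hypothetical sketch only: a parameterized FIFO 'IP module' instantiated
  -- twice with different word sizes (VHDL notation, see Chapter 3).
  entity SOC_TOP is
    port (clk, rst : in bit;
          din      : in  bit_vector(15 downto 0);
          dout     : out bit_vector(7 downto 0));
  end SOC_TOP;

  architecture structural of SOC_TOP is
    component GENERIC_FIFO                      -- assumed vendor-supplied module
      generic (WIDTH : integer; DEPTH : integer);
      port (clk, rst, wr, rd : in bit;
            d : in  bit_vector(WIDTH-1 downto 0);
            q : out bit_vector(WIDTH-1 downto 0));
    end component;
    -- control signals; the logic driving them is omitted in this sketch
    signal wr16, rd16, wr8, rd8 : bit;
    signal q16 : bit_vector(15 downto 0);
    signal q8  : bit_vector(7 downto 0);
  begin
    -- the generic map adjusts word size and depth per instance
    F16: GENERIC_FIFO generic map (WIDTH => 16, DEPTH => 512)
                      port map (clk, rst, wr16, rd16, din, q16);
    F8:  GENERIC_FIFO generic map (WIDTH => 8, DEPTH => 64)
                      port map (clk, rst, wr8, rd8, q16(7 downto 0), q8);
    dout <= q8;
  end structural;

The point of the sketch is only that the same proven module description serves several differently sized instances without any change to the module source delivered by its vendor.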
2.3.3 Configurable Boards and Interconnections

The ability to configure a board for different applications results from the use of configurable components on it, e.g. an EPROM chip to hold different programs, and interfaces controlled by configuration registers that are written to by a processor with data from the EPROM. Other
configuration inputs may simply be set to the L or H level by means of miniature switches or jumpers plugged onto configuration connectors. The connections within a board, including those to the external interface connectors, may be routed via such switches, too, to support alternative configurations. The mechanical components can be avoided through the use of electronic switches controlled by configuration registers (similar to those within FPGA devices). Then the board design becomes simpler, and the dynamic reconfiguration of the hardware resources becomes possible. An interface signal of the circuit board may e.g. be routed to several destinations, or be generated from a choice of circuits. The electronic switching can be supported by special integrated circuits (‘crossbars’) providing switches between a number of i/o signals that are controlled by memory cells. The crossbar function can also be implemented on an FPGA, or simply by routing the connections to be configured through an FPGA device. Configurable interconnections can be used to implement the through-routing of an input signal to an output signal without it being processed on the board. A faulty board can e.g. be bypassed or substituted by a spare one. Then multiple boards can be mounted and wired in a fixed arrangement (e.g. using a motherboard) and still use application-specific interconnections. The basic FPGA structure of an array of configurable components connected via a network of electronic switches is thus generalized and ported to the board level to yield system architectures with interesting properties. As a case study we consider the ER2 parallel computer architecture [25]. The ER2 is a scalable, board-level configurable system that is similar to an FPGA structure in several respects. Like an FPGA it is a general-purpose system that can be used for a variety of applications. The ER2 builds on just three board-level components: an electronically controlled crossbar switch, a compute module that can be attached to the crossbar, and a motherboard for a number of crossbar switches that are interconnected on it in a grid structure, each switch being connected to four neighboring ones or to connectors at the border of the motherboard. The motherboards can be plugged together via edge connectors to form larger networks. The crossbars are connected to their neighbors via multiple wire segments (24 wires in the east and west, 18 in the north and south directions, see Figure 2.52). The switch boards are small
Figure 2.52 Configurable interconnection network with attached compute modules
circuit boards (X) containing the crossbar circuit, a control processor with RAM and EPROM memories, and an FPGA chip implementing a set of auxiliary pre-configured interfaces to the neighbors. The configuration of the switches for an application is performed by the control processors that receive the application-specific control data through the auxiliary interfaces. The program code for the processors is also distributed that way. The control processors can also be used for the application processing. The main compute resource, however, is the compute module (C) that interfaces to a switch module via six fast serial interfaces, each using six wires, and via an interface port to the control processor of the switch. It contains a cluster of four tightly coupled processors of the Sharc family (see section 8.5.1). The board components of the ER2 allow digital systems of arbitrary size to be composed without involving additional electronic design. The prototype system shown and described in [55] includes 256 crossbars and 64 processor clusters. The crossbar network would support other types and mixes of compute modules as well. The crossbar sites not connected to a compute module contribute to the routing resources and can be used for input and output interfaces, or to provide extra connections between different crossbars. The grid interconnection on the motherboards is just the basis of an application-specific wiring of the compute modules which can even be changed during a computation or be used to implement rerouting capabilities to spare boards. The scalability of the architecture results from the strictly local interconnection structure. Each switch added to the system to provide a new compute site also adds to the interconnection resources. In contrast, conventional standard processor boards connecting to a shared memory bus as offered by many manufacturers only allow for a small number of processor boards in a system. The overall architecture of the ER2 is similar to an FPGA architecture using complex, programmable processors operating asynchronously as the cells, and multi-bit serial transfers via the connections using handshaking. The algorithms that place the application functions onto the processors and route the interconnections are similar to FPGA place-and-route algorithms, too. A processor cluster is comparable to the CLB in Figure 2.49. In contrast to the FPGA that only provides input and output at the border cells, the board-level architecture has the important advantage of allowing input and output at every compute module. The different clusters operate asynchronously and the interfaces connecting them through the switches perform handshaking to at least synchronize for the data exchange. The crossbar circuit used in the ER2 design is the IQ160 from the I-Cube family of switches that has meanwhile disappeared from the market. It provides 160 i/o signals which can be configured as inputs, buffered outputs and as bi-directional signals. In the bi-directional mode an i/o signal also runs through a buffer circuit but without requiring a direction control signal to select between input and output. It is held at the high level by a pull-up resistor and becomes input if a high-to-low transition is initiated externally and output if it is initiated internally. Due to the buffering a signal passing the crossbar undergoes a small delay (about 10 ns). The inputs and outputs can optionally be latched in on-chip input and output registers.
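To illustrate the principle of switching controlled by configuration registers (and how a small crossbar function might be placed on an FPGA, as mentioned above), the following VHDL sketch describes a minimal 4-to-4 crossbar whose outputs are selected by registers written through a simple control port. It is an invented miniature for illustration only, not the IQ160 or the ER2 switch board.

  -- Illustrative sketch: a 4-to-4 crossbar whose outputs are selected by
  -- configuration registers written through a simple control port.
  entity XBAR4 is
    port (cfg_clk  : in bit;                       -- configuration write clock
          cfg_we   : in bit;                       -- write enable from the control processor
          cfg_addr : in bit_vector(1 downto 0);    -- which output to configure
          cfg_data : in bit_vector(1 downto 0);    -- which input to route to it
          din      : in  bit_vector(3 downto 0);   -- the four input signals
          dout     : out bit_vector(3 downto 0));  -- the four output signals
  end XBAR4;

  architecture behavior of XBAR4 is
    signal sel0, sel1, sel2, sel3 : bit_vector(1 downto 0);  -- one selector register per output
  begin
    -- configuration registers, written one at a time by the control processor
    process (cfg_clk)
    begin
      if cfg_clk'event and cfg_clk = '1' then
        if cfg_we = '1' then
          case cfg_addr is
            when "00" => sel0 <= cfg_data;
            when "01" => sel1 <= cfg_data;
            when "10" => sel2 <= cfg_data;
            when "11" => sel3 <= cfg_data;
          end case;
        end if;
      end if;
    end process;

    -- each output is a selector (multiplexer) controlled by its register
    with sel0 select dout(0) <= din(0) when "00", din(1) when "01",
                                din(2) when "10", din(3) when "11";
    with sel1 select dout(1) <= din(0) when "00", din(1) when "01",
                                din(2) when "10", din(3) when "11";
    with sel2 select dout(2) <= din(0) when "00", din(1) when "01",
                                din(2) when "10", din(3) when "11";
    with sel3 select dout(3) <= din(0) when "00", din(1) when "01",
                                din(2) when "10", din(3) when "11";
  end behavior;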
The local control of the crossbars makes the reconfiguration of the network a distributed task of the set of control processors. It is, however, simple to change the connection of an interface of the attached compute module from one signal connected to the crossbar on the motherboard to another, and thereby to use the same interface at different times for different interconnection paths. Also, it is possible to select between several local interfaces to communicate along a switched path through the network. The compute module also provides
Figure 2.53 Versatile module interface using a crossbar
some lower speed serial interfaces that require just two signals. These can be switched through the crossbars, too. The crossbars can be thought of as a configurable interface of the various local interfaces to the wiring media on the motherboards. The aspect of reserving an application-specific amount of the communications bandwidth provided by the media is also used in other interfacing schemes (e.g. the USB, see section 6.5.3.2). The idea of using a scalable network of crossbars as the interconnection media of scalable digital systems can be applied in many variants using different processor building blocks, implementing crossbars on FPGA chips, and using other crossbar interconnection schemes and balances between the compute and the crossbar modules. The ER2 architecture can e.g. be moved to the chip level to provide single-chip configurable processor networks [26, 27], which will be further discussed in Chapter 7. As the cells are complex processors, the control overheads of the fine-grained FPGA architectures are avoided. By integrating a crossbar function with some handshaking support, almost every single-chip system or circuit board with a single or a few processors and interfaces can be equipped with a module interface that supports the multiplexing and through-routing capabilities needed to use it as a component in a scalable architecture (Figure 2.53).
2.3.4 Testing

In order to be able to detect manufacturing (and design) flaws, it is necessary to provide testing facilities for every digital system. This is done by feeding the system with test input data and by verifying that the outputs obtained with these are correct. If the system, as usual, is constructed from modules, the testing is applied at the level of the modules to find out whether they operate correctly before it is applied to the structure composed of them. As the interfaces to the modules are not required for the application processing, giving access to them requires an extra design effort. To test a module independently one needs to be able to switch a source of test data to its inputs and to select its output signals. Actually, the testing of an individual module does not depend on its wiring within the system. For a chip, the access to its sub-circuits is in conflict with the requirement of a simple module interface (pin count). The test signals are applied to chip sites not connected to regular package pins, or shared with package pins, or to serial interfaces provided to support testing with as few extra signals as possible. At the board level one needs to verify that the chips mounted on the circuit board and already tested after manufacturing are connected correctly. The wires connecting them are not easily accessible. They are packed quite densely, connect to the chips beneath the package and can even be routed in the middle layers of the board only. The solution to this testing problem is to integrate switches onto the chips that disconnect the internal signals from the pins, connect test inputs and outputs instead, and give access to these by means of a special serial interface (Figure 2.54).
Figure 2.54 Boundary scan circuit for a package pin
Figure 2.55 JTAG chain
There is an industry standard for the design of this interface called JTAG (joint test action group) [28]. It defines an interface of five test signal inputs and outputs, namely:

TRST - reset signal input, to be set L except during test
TDI - serial data input
TDO - serial data output
TMS - additional control input (‘test mode select’)
TCK - serial interface clock input, typical rate is below 1 MHz
and a set of commands that are serially input using these signals into an instruction register of at least two bits if a particular pattern is chosen for TMS. Otherwise the data shift register is selected (with a register bit for every signal to be tested), or a bypass flip-flop. The TRST signal is not always implemented as the reset state can also be obtained by inputting a special bit pattern on TMS. The most important commands are:

00 : EXTEST - apply test output
01 : SAMPLE - read input signals
11 : BYPASS - connect TDI to TDO via flip-flop

The command set can be expanded by commands specific to the chip, e.g. to access special registers of a processor used for software debugging in a single-step mode, or to use the JTAG pins as a general-purpose serial interface for application data (Xilinx). As there are several chips within a system, the JTAG signals are chained (Figure 2.55) to give serial access to all of them via a single test interface. Data bits are input to TDI with the rising edge of TCK, and the TDO output changes with the falling edge. The instruction and data registers of the chips in the chain are put in series. A typical sequence of commands is to shift the test data pattern and then the EXTEST command into all chips of the chain, then to issue the SAMPLE command for the inputs and to output them via TDO to the test equipment. This is repeated with varying test patterns until all connections have been verified. BYPASS is used to selectively input to or output from a particular chip in the chain only.
According to Figure 2.54 the JTAG interface needs a shift register bit and selectors for every signal to be tested, plus the command and bypass registers and the TMS control. Usually it is only implemented for complex integrated circuits where it represents just a small fraction of the overall hardware. For the inputs and outputs of a board or of sub-circuits within it, there are multi-bit driver circuits equipped with a JTAG interface. On a circuit board with a processor equipped with the JTAG interface and an SRAM or EPROM memory attached to its bus, the test mode can be used to drive the bus lines and thereby to test the memory by verifying its functions (even if it does not have a JTAG port of its own). This is also a common way to perform the in-circuit programming of a Flash EPROM. JTAG accesses are slow due to the serial interfacing and cannot be used to track the signal changes of a running system (in ‘real time’). Some chips also use the JTAG interface to scan internal signals at the interfaces between chip modules that are not accessible at all otherwise, or use the interface to provide extra debugging functions through special commands.
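A single boundary-scan cell for an output pin along the lines of Figure 2.54 might be sketched in VHDL as follows; this is a simplified illustration (capture and shift only, without the separate update register) and not the complete cell prescribed by the JTAG standard.

  -- Simplified sketch of a boundary-scan cell for an output pin (cf. Figure 2.54).
  entity BSCELL is
    port (internal_out : in bit;     -- signal coming from the chip core
          ser_in       : in bit;     -- serial input from the previous cell
          shift        : in bit;     -- '1': shift the chain, '0': capture the core signal
          clock        : in bit;     -- test clock for the scan flip-flop
          extest       : in bit;     -- '1': drive the pin from the scan cell
          ser_out      : out bit;    -- serial output to the next cell
          pin          : out bit);   -- signal actually driven onto the package pin
  end BSCELL;

  architecture behavior of BSCELL is
    signal scan_ff : bit;            -- the scan/capture flip-flop
  begin
    process (clock)
    begin
      if clock'event and clock = '1' then
        if shift = '1' then
          scan_ff <= ser_in;         -- shift mode: pass the chain along
        else
          scan_ff <= internal_out;   -- capture mode: sample the core signal
        end if;
      end if;
    end process;

    ser_out <= scan_ff;
    -- in EXTEST the pin is driven from the scan flip-flop, otherwise from the core
    pin <= scan_ff when extest = '1' else internal_out;
  end behavior;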
2.4 SUMMARY

In this chapter we explained the CMOS circuit and chip technology that is mostly used to build digital systems. The CMOS technology provides the basic Boolean gate functions and storage elements that can easily be composed to almost any degree of complexity. The power dissipation of CMOS circuits has been discussed, including methods such as clock gating, asynchronous and adiabatic logic to reduce it. Digital design proceeds from the CMOS circuit level in a hierarchical fashion to more complex building blocks, to highly integrated chips and to circuit boards, with emphasis on reusing proven modules as standard components, and on extending their scope by making them configurable. The design of chips and circuit boards turns out to be quite analogous. Configurability and programmability at the board level and for interconnecting boards are as useful as they are for chips. Coarse-grained, FPGA-like architectures with (re-)configurable interconnections can cover many applications with a small inventory of components. Chip designs proceed similarly to board-level designs, namely by combining proven, complex modules (IP modules) such as processors, memories and interfaces that might show up as chips in an equivalent board-level design but now are components of a system-on-a-chip. Both at the chip and at the board level, more distant system components need to provide simpler interfaces. Ideally, on a large chip or board, distant subsystems are interfaced serially using a few signal lines only and operate asynchronously.
EXERCISES

1. Boolean functions implementing binary arithmetic often build on the full adder function that calculates from three inputs X, Y, Z the outputs:
Q(X, Y, Z) = X ⊕ Y ⊕ Z
O(X, Y, Z) = XY + YZ + ZX
‘⊕’ denotes the XOR operation. Determine the gate counts for implementations based on NAND, NOR and NOT gates, as a complex CMOS gate, and as a dual n-channel network gate with complementary outputs.
2. Design a circuit realizing the MRS flip-flop from gates and from CMOS switches.
3. Design a circuit generating bipolar handshake signals (with every transition defining an event, not just L-H) and the control signal for a data latch.
4. Design a circuit having two clock inputs R and S and an output that is set to H by the L-to-H transitions on S and reset to L by the L-to-H transitions on R.
5. Design a digital phase comparator for a PLL clock generator.
6. Use an SRAM to design a LIFO stack that is accessed like a single register. Use registers to pipeline the SRAM access to the read or write operations.
7. A ring counter is an automaton using an n-bit state register that cycles through the states 1000 . . . 00, 0100 . . . 00, 0010 . . . 00, 0001 . . . 00, . . . , 0000 . . . 10, 0000 . . . 01. Design the feedback circuit for it from elementary gates so that any initial state eventually transitions into the state 1000 . . . 00.
8. Design a k-bit automaton with a single-bit input that calculates the k-bit CRC code of a sequence presented at the input, defined by a given mod(2) polynomial of degree k (see section 1.1.2).
9. Design an address decoder circuit computing three chip select outputs from a 16-bit address input. CE1 shall be L for addresses in the range 0 . . . 0x7fff, CE2 in the range 0x8000 . . . 0x80ff and CE3 for the remaining addresses.
10. Show that for every n-bit input k to the fractional n-bit frequency divider the most significant output bit changes with a mean rate of f · |k| / 2^n, where f is the input frequency and |k| is the absolute value of k as a signed binary number. Derive an expression for the current consumption due to charging and discharging the adder inputs and the output.
3 Hardware Design Using VHDL
3.1 HARDWARE DESIGN LANGUAGES

As already noted in section 1.2, circuits composed of building blocks can be understood as a special way to realize algorithms in hardware and can be specified by means of the algorithmic notation found in programming languages. For the design of digital systems (and for real-time programming) the timing behavior of the execution of operations is important (see section 1.4), in particular the timing of the events of changing signal levels. For the purpose of defining digital hardware structures, including their timing, hardware design languages (HDLs) have emerged that, as well as defining the operations to be performed, also specify their timing. A common HDL is VHDL; others are Verilog, ELLA [69], and extended versions of standard languages like C, e.g. System C [70], and Handel-C which goes back to [29]. For a long time, hardware designers drew circuit diagrams (‘schematics’) showing the interconnection of components (gates, registers, processors, etc.), using special shapes or annotations to distinguish the different building blocks. This is adequate for showing the structure of a design, and can be supported by using a graphics editor for schematic entry. It does not cover the specification of the timing behavior as a basis of timing simulations and verification. Algorithms are read more easily from a textual representation, and sometimes a hardware description is only given up to the point of specifying its behavior (the timing and the transfer functions of the building blocks) but not the details of the algorithms to be used. A building block might e.g. be described to perform an add operation on 16-bit codes of numbers within a certain processing time without specifying which Boolean algorithm should actually be selected for it. These aspects are taken care of by the hardware design languages. By now, apart from the level of connecting large standard building blocks on a circuit board, hardware design is mostly done using some HDL. From an HDL description the interconnection network of basic components can automatically be extracted and mapped to the resources of a chip or an FPGA. If the description only specifies the behavior but not the implementation of some of the building blocks, the synthesis of the network of basic
components is still possible by using default implementations. The tool chains offered by FPGA vendors include HDL design entry tools for the most common HDLs. An HDL takes a point of view that differs in some respects from conventional software languages. The statements in a software language are implicitly understood to be executed one-by-one. In contrast, the statements in an HDL describing the operations to be performed and the signals to be output are essentially executed in parallel. They may exhibit data dependencies but they are not bound to serial execution in the textual order. Second, software programmers use local variables to store intermediate results of a computation. Although storage elements play an important role in digital hardware, intermediate results may be passed directly from the output of a building block to the input of another one. The assignment to a local variable must be understood as naming the output signal of an operation just for the purpose of being able to reference it. Whereas sequential programs may use the same variable to store different values at different times, this makes no sense for naming an output signal. Thus the assignment of a value to a name must be unique. Storage will only be involved if the signal data at the time of a particular event (defined by some change of a control signal) have to be used at a later time. Finally, software programmers use functions and procedures with the understanding of defining sub-routines which may be jumped to. In a higher-level language they mainly serve the purpose of gaining more abstraction by referencing a composite action by a short name, and a hierarchical structure due to allowing nested function calls. In a hardware circuit there is no way to jump into or to return from a sub-structure (although a sub-circuit may be applied sequentially several times). A call must be interpreted in such a way that an individual sub-circuit of the structure defined in the algorithm for the function has to be inserted at the call site. These slight restrictions and changes to the interpretation of software languages provide the abstraction from the working of a sequential machine. Then the same programming language can be used for algorithms that are alternatively realized in software or in hardware (excluding the definitions of their timing behaviors which are mostly needed for a timing simulation to verify the proper operation and existing timing constraints). This would be attractive for digital systems realizing some processing by means of composite hardware circuits and other functions on programmable processors. The most common HDLs, however, have evolved into dedicated ones and include structures that have no meaning for processor software, whereas processor software is implemented with dedicated software languages. The algorithmic notation provided in VHDL is used for behavioral descriptions only while the composition of circuits uses an extra notation not applicable to sequential software. We'll come back to the system design in a common language for both hardware and software functions in Chapter 7. In the sequel, VHDL will be introduced informally as one of the most common HDLs for the purposes of subsequently describing the most common sub-structures of digital systems beyond the level of gates and registers, and to provide the basics to enable practical design exercises on FPGAs using a standard tool chain. A more comprehensive presentation of VHDL can e.g. be found in [30].
The discussion of VHDL will give us the opportunity to formally describe the timing of the basic circuit elements introduced in Chapter 2 and the timing relationships of events (Chapter 1). We will still use diagrams if only the interconnection structure is considered. VHDL is a special purpose language for describing and designing digital hardware, and for simulating its operation. The only run time environment for VHDL ‘programs’ is the VHDL circuit simulator. VHDL also serves to synthesize circuit structures, but it cannot be used to generate executable code for processors or other control parameters for sequential sub-systems.
3.2 ENTITIES AND SIGNALS
VHDL describes the functional behavior of circuits, their timing, i.e. the events of changing input and output data according to their execution times, and their structure. The language is dedicated to describing idealized digital systems in which the signals are binary and make their transitions at well-defined times with zero duration. The transfer of a signal from the output of a sub-circuit to the input of another one also takes zero time. The functional behavior is defined through Boolean functions that may be given as tables, as Boolean expressions or by other algorithms. VHDL is intended to define and simulate digital hardware before it is actually built, and also serves as a formal language from which the information needed to produce the hardware (e.g. the configuration code of an FPGA) is synthesized automatically by a compiler. A VHDL design defines some hardware system as well as the application-specific building blocks for it. All of these become design ‘units’ in some default VHDL library (usually named WORK). Typically, there is also a design unit containing the main system and supplying it with input signals for testing purposes. Other libraries are used to supply additional design units. The types of circuit building blocks described in VHDL are called entities. Once an entity has been defined, other definitions may use instances of it which are separate building blocks of the same type. The circuits described in VHDL have input and output signals and internal signals which are set to specific values at specific times. The definition of an entity is divided into a definition of its interface signals and definitions of its behavior or its structure as a composition of other building blocks, or as a mixture of both. An entity may be given both structural and behavioral descriptions. The structural definition of an entity implies a behavior resulting from the behavior of the components used therein. A behavioral description is not automatically checked to be compatible with the behavior of a structural definition. If several competing definitions are given, extra control statements are used to define which one should be used for the purpose of simulation or the extraction of a network of basic components. For the most elementary circuit types that are not composed of sub-circuits, only a behavioral definition can be given. These definitions are usually taken from a standard library. The signals used in an entity need to be declared as interface or internal signals of some type defining the possible values. Standard signal types are ‘bit’, which is an enumerated data type containing the symbols (not the numbers) ‘0’ and ‘1’, or the type ‘std_logic’ defined in the IEEE library STD_LOGIC_1164 which also allows the symbol ‘Z’ as a value indicating a high impedance state of the signal, and six more values. ‘U’ indicates an uninitialized state, ‘X’ a bus conflict, ‘H’ and ‘L’ are logic levels generated by pull-up or pull-down resistors, ‘W’ an intermediate value, and ‘-’ is a ‘don't care’ value. Signals of the generic types ‘bit_vector’ or ‘std_logic_vector’ take n-tuples as values. The index range used for the tuples needs to be specified as a parameter. It is an expression such as ‘n−1 downto 0’ or ‘0 to n−1’ or a similar one denoting an integer interval containing n indices. Bit vector literals are written as strings that may contain the ‘0’ and ‘1’ characters and underline characters for the sake of readability (octal and hexadecimal literals are also supported).
"01_00" is a bit vector of size four and can be assigned to signals of this size. VHDL also provides numeric data types, namely the types ‘integer’ and ‘real’ covering the ranges of 32-bit signed binary and single-precision floating point numbers. Bit strings and the values of signals are distinguished from the numbers they might represent. The arithmetic operations are not defined for the ‘bit_vector’ and ‘std_logic_vector’ types, but for numeric types. Other bit field data types can, however, be defined, and it is possible to define (overload) the arithmetic operators for such. The types ‘signed’ and ‘unsigned’ are defined in the IEEE libraries NUMERIC_BIT and
NUMERIC_STD as vector types that represent signed binary or unsigned binary numbers, and the arithmetic operations are defined for these. For the specification and simulation of times VHDL provides the ‘physical’ type ‘time’, the values of which are numeric multiples of one of the units ‘s’, ‘ms’, ‘us’, ‘ns’, ‘ps’ or ‘fs’. The interface part of an entity starts with the keyword ‘entity’ and the name of the circuit type to be defined. The interface signals are specified in a ‘port’ definition and given directional attributes ‘in’, ‘out’, or ‘buffer’ and a signal type. ‘buffer’ is similar to ‘out’ but allows the signal values not just to be assigned but also to be read and used like input signals. The attribute ‘inout’ is used for bi-directional bus signals of a type providing a high-impedance state ‘Z’. Such signals may be driven from several sources. If ‘Z’ is output, but ‘0’ or ‘1’ is output from another (external) source, the signal will assume these latter values (signal resolution). The structural or behavioral descriptions for the entity are introduced with the keyword ‘architecture’, a name for this architecture, and the reference to the circuit type to be described. This is followed by declarations of the additional internal signals used by this particular definition. They have no directional attributes but just a signal type and can be used like ‘in’ signals but need to be assigned values, too. The actual definition is enclosed between the ‘begin’ and ‘end’ keywords (see Listing 3.1). The definition of an entity is preceded by a specification of the required additional libraries and the particular packages and definitions to be used therein, e.g.:

library IEEE;
use IEEE.NUMERIC_BIT.all;
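As a small, made-up illustration of these declarations, an entity interface using the different port modes and the std_logic types, together with a minimal architecture showing the high-impedance value ‘Z’, might look as follows (the entity and its signals are invented for the example):

  library IEEE;
  use IEEE.STD_LOGIC_1164.all;

  -- made-up example illustrating port modes and the std_logic types
  entity IOBLOCK is
    port (clk     : in     std_logic;                      -- plain input
          count   : buffer std_logic_vector(3 downto 0);   -- output that may also be read back
          ready   : out    std_logic;                      -- plain output
          databus : inout  std_logic_vector(7 downto 0));  -- bidirectional bus signal
  end IOBLOCK;

  architecture example of IOBLOCK is
    signal drive : std_logic := '0';   -- internal signal with an initial value
  begin
    -- the bus is driven only when 'drive' is set, otherwise released to 'Z'
    databus <= "00000000" when drive = '1' else (others => 'Z');
    -- the buffer port 'count' can be read like an input
    ready   <= '1' when count = "1111" else '0';
    -- (the logic driving 'count' and 'drive' is omitted in this interface sketch)
  end example;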
3.3 FUNCTIONAL BEHAVIOR OF BUILDING BLOCKS

Listing 3.1 shows a definition of the functional behavior of the elementary AND gate (keywords are printed in bold face, and comments are introduced by ‘--’ and extend to the end of the line).

entity AND2 is
  port (r, s: in bit; t: out bit);
end AND2;

architecture functional of AND2 is
begin
  t <= r and s;
end functional;

Listing 3.1 Functional behavior of the AND gate

The ‘in’ signals in the port definition of the entity can be used in expressions in the architecture definition that can in turn be assigned to the ‘out’ signals. The ‘<=’ symbol stands for assigning a value to a signal. The ‘and’ operator in Listing 3.1 is the abstract Boolean operation defined for the data type ‘bit’ and used here to define the value of the output signal ‘t’ of AND2. The Boolean operations on the ‘bit’ type include ‘and’, ‘or’, ‘not’, ‘xor’, ‘nand’, ‘nor’, and ‘xnor’. In general, a signal assignment of this kind would contain a more complex Boolean expression or assign the value computed by a function defined in the current design or taken from a library:

t <= x and (y or z);
t <= f(x, y, z);
The definition of a function uses an algorithmic notation similar to the ones in other, familiar languages, using local variables, branches and loops (see Listing 3.2). The assignment of the result of a function f to a signal does not imply that the hardware to be described should be constructed according to the algorithm in the definition of f. The function is merely used to describe the functional behavior. This is different from some other HDLs, e.g. ELLA. As a more complex example for the definition of a functional behavior we consider the binary add operation computing the (n+1)-bit binary number code that shows the sum of the numbers represented by two binary n-bit input codes. To determine the coded input numbers, these first have to be decoded. Then the result assigned to the output signal is the encoded sum of these. For the decoding and the encoding two functions are used, ‘decode’ and ‘encode’ (Listing 3.2), that are designed for index ranges of the form ‘n−1 downto 0’. E.g., decode converts the bit string literal B"1011001" (or simply "1011001") into the integer with the binary representation 1011001, i.e. 89. Listing 3.2 shows a generic definition that applies to all word sizes n. Note that the index range attribute of a bit_vector argument ‘b’ is accessible as b'range.

function decode(b: bit_vector) return integer is
  variable r, h: integer;
begin
  r := 0;
  for k in b'range loop
    if b(k) = '0' then h := 0; else h := 1; end if;   -- convert bit to number
    r := 2*r + h;
  end loop;
  return r;
end decode;

function encode(i, s: integer) return bit_vector is
  -- s+1 is the size of the result vector
  variable v: integer := i;
  variable r: bit_vector(s downto 0);
begin
  for k in 0 to s loop
    if v mod 2 = 1 then r(k) := '1'; else r(k) := '0'; end if;
    v := v/2;
  end loop;
  return r;
end encode;

entity ADD is
  generic (n: integer);
  port (a, b: in bit_vector(n-1 downto 0); q: out bit_vector(n downto 0));
end ADD;

architecture functional of ADD is
begin
  q <= encode(decode(a) + decode(b), n);
end functional;

Listing 3.2 Functional behavior of an n-bit binary adder circuit
The behavior of the n-bit adder can also be defined as in Listing 3.3, using the binary add operation on bit strings representing unsigned integers supplied in the above-mentioned standard library NUMERIC_BIT and converting between the unspecific input and output bit fields and ones representing unsigned numbers as needed. The ‘&’ operator concatenates bit strings and is used to extend all arguments to n + 1 bits as the library functions expect all of them to have the same size:

library ieee;
use ieee.numeric_bit.all;

architecture functional of ADD is
begin
  q <= bit_vector( unsigned("0" & a) + unsigned("0" & b) );
end functional;

Listing 3.3 Adder definition using the add operation for the type ‘unsigned’
Signal assignments can be specified to only occur when certain conditions hold. If these conditions are derived from time-varying signals, they hold at certain times and not at others. An important condition of this kind is the condition ‘s’event’ for a signal s. It identifies the time at which s changes. The conditional assignment uses the keyword ‘when’. Several ‘when’ cases can be combined in a signal assignment. A signal assignment by cases equivalent to assigning ‘r and s’ is:

t <= '1' when r = '1' and s = '1' else '0';

Here, ‘and’ is used as a logical conjunction. Another equivalent assignment is:

with r select t <= '0' when '0', s when '1';

This form is still a functional description and does not imply that the implementation uses a select gate. The behavior of flip-flops and registers can be described by means of assignments that occur at the time of the event when some clock signal makes its L-H transition. Listing 3.4 shows the definition of an entity DFF using such an assignment. The condition clk = ‘1’ identifies the positive clock edge. If it is dropped, the behavior of a flip-flop storing the input data at both clock edges is described (Figure 2.23). Every ‘when’ case must always be followed by an ‘else’ case. Here it assigns the value read back from the output and thereby keeps it:

entity DFF is
  port (d, clk: in bit; q: buffer bit);
end DFF;

architecture functional of DFF is
begin
  q <= d when clk'event and clk = '1' else q;
end functional;

Listing 3.4 Functional behavior of a D flip-flop

Instead of using individual signal assignments sensitive to the involved inputs, a set of changes to be performed by the simulator for a new time step can be cast into a complex sequence of instructions called a process. A process definition within the definition of an architecture
is introduced by the keyword ‘process’ preceded by some label and followed by a list of signals. Changes (events) of any of the signals in this so-called sensitivity list cause the process to be activated and to make its assignments. Processes without a sensitivity list also exist. They are continuously active until they become suspended at a ‘wait until’ statement until the stated condition becomes true. As in the case of functions, the instructions in the process body may include function calls, branches, case statements and loops. Variables used within the process can be used to store state information from previous passes through the process. The signal assignments may thus be described by complex algorithms which are, however, not necessarily the ones corresponding to the structure of the circuit to be implemented. Signal assignments executed in a process take effect only after the complete run through it or after leaving it due to a ‘wait’ statement. The changed value cannot be retrieved from a signal during the current pass through a process; only the previous value can. If an assigned value is needed again in the process, it must be stored in a variable. Read operations on signals in the sensitivity list of a process return the values immediately after the argument change. For a process which is sensitive to several signals, the condition ‘s’event’ for a particular signal can be used as a branch condition. The processes within an architecture definition and the signal assignment statements outside the process definitions are executed concurrently whenever they become active. Every individual assignment statement can be transformed into a process using all of its input signals in the sensitivity list. Processes can be used to define flip-flops and registers, too, by using the clock signal in the sensitivity list. Typically, they are used to define several registers at a time and the store operations into them. Listing 3.5 just defines a single, generic parallel n-bit register entity and its behavior and is similar to using a conditional assignment as in Listing 3.4. The condition clk = ‘1’ again identifies the positive edge of the clock signal. A ‘0’ at the ‘res’ input resets the register to the all-zeroes value. More complex process definitions will follow in Chapters 5 and 6.

entity REG is
  generic (n: integer := 1);
  port (d: in bit_vector(n-1 downto 0); res, clk: in bit;
        q: out bit_vector(n-1 downto 0));
end REG;

architecture functional of REG is
begin
  m: process (clk, res)
  begin
    if res = '0' then
      q <= (others => '0');          -- assign '0' at all index positions
    elsif clk'event and clk = '1' then
      q <= d;
    end if;
  end process;
end functional;

Listing 3.5 Behavioral definition of an n-bit register
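A process without a sensitivity list, as mentioned above, could describe essentially the same register by suspending at a ‘wait until’ statement; the following sketch is an alternative architecture for the REG entity in which, unlike Listing 3.5, the reset acts synchronously with the clock edge:

  -- sketch of a register process without a sensitivity list; the process
  -- suspends at the 'wait until' statement until the next rising clock edge
  architecture waiting of REG is
  begin
    m: process
    begin
      wait until clk'event and clk = '1';
      if res = '0' then
        q <= (others => '0');   -- synchronous reset
      else
        q <= d;
      end if;
    end process;
  end waiting;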
3.4 STRUCTURAL ARCHITECTURE DEFINITIONS

Complex architecture definitions typically use internal signals and reference other components. Preceding the ‘begin’ keyword of the definition, the component types, which must match other user-defined or predefined entities, and the auxiliary signals must be declared. A structural definition ‘calls’ sub-circuits of the declared component types with a unique label to create the needed instances and specifies their interconnection with a ‘port map’ statement that matches the interface signals in the component type with the actual signals connected to an instance of this type. A structural definition can be used to specify a particular algorithm for a Boolean function in terms of gate building blocks. The connected components are not bound to implement Boolean functions of their inputs only, but are arbitrary sub-circuits with internal registers. Their interconnection does not exclude feedback or driving signals from several sources. Consequently, the syntax used for connecting components into more complex circuits does not look like an algorithmic notation but is chosen to simply define a wiring list. There is no attempt to support a control flow or recursion. The structural references do implement hierarchy in VHDL designs by constructing them from sub-systems. For the binary full adder computing the 2-bit binary number that is the arithmetic sum of three inputs (see Chapter 2, Exercise 1), the interface definition and a structural definition as a composition of gates (i.e., an algorithm) are shown in Listing 3.6:

entity ADD1 is
  port (a, b, c: in bit; q, o: out bit);
end ADD1;

architecture structural of ADD1 is
  component XOR2 port (r, s: in bit; t: out bit); end component;
  component AND2 port (r, s: in bit; t: out bit); end component;
  component OR2  port (r, s: in bit; t: out bit); end component;
  signal i, j, k: bit;
begin
  g0: XOR2 port map (a, b, i);
  g1: AND2 port map (a, b, j);
  g3: XOR2 port map (c, i, q);
  g4: AND2 port map (c, i, k);
  g5: OR2  port map (j, k, o);
end structural;

Listing 3.6 Structural definition of the binary full adder

As usual (and attractive) in higher-level languages, VHDL provides array and loop structures that allow an indexed set of components to be interfaced to arrays of signals. Listing 3.7 shows a structural definition of the n-bit adder circuit that is built up from a cascade of full adder components, each outputting its overflow signal to the carry input of the next one. That it actually conforms to the behavioral definition will be shown in section 4.2. It is not obvious from the VHDL definitions:
architecture structural of ADD is
  component ADD1 port (a, b, c: in bit; q, o: out bit); end component;
  signal c: bit_vector(n downto 0);
begin
  c(0) <= '0';
  I: for k in 0 to n-1 generate
    G: ADD1 port map (a(k), b(k), c(k), q(k), c(k+1));
  end generate;
  q(n) <= c(n);
end structural;

Listing 3.7 Structural definition of an n-bit binary adder circuit

The ‘for..generate’ structure with the preceding label is the one generating the full adder instances, which are cascaded through the port map statement. The carry outputs are mapped to the signal vector c(n downto 1) while c(0) is fixed to ‘0’.
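To use the generic adder in a larger design, an instance is created with a ‘generic map’ that fixes the word size. A possible, made-up usage fragment for an 8-bit addition is:

  -- made-up fragment instantiating the generic ADD entity for a word size of 8 bits
  entity SUM8 is
    port (x, y : in bit_vector(7 downto 0);
          s    : out bit_vector(8 downto 0));
  end SUM8;

  architecture structural of SUM8 is
    component ADD
      generic (n : integer);
      port (a, b : in bit_vector(n-1 downto 0);
            q    : out bit_vector(n downto 0));
    end component;
  begin
    A0: ADD generic map (n => 8) port map (a => x, b => y, q => s);
  end structural;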
3.5 TIMING BEHAVIOR AND SIMULATION

The signal assignment explained in section 3.3 may be extended by a time delay specification describing the time when it shall occur. This timing information is evaluated by a VHDL circuit simulator performing not just a functional but also a timing simulation. The simulator repetitively evaluates all signal assignments in a VHDL program for ascending values of a simulated time variable. Conceptually, all signal assignments are evaluated for every new time step based on the previous values of the signals. An assignment is ‘sensitive’ to its input signals only and needs to be re-evaluated by the simulator only when one of these makes a change. The condition ‘s’event’ indicating that a change just occurred to a signal can be thought of as the argument of an implicit ‘when’. The timing information is used for the simulation only, and to derive processing times for the described circuit functions. The individual steps needed to determine the new value of a signal during the simulation (i.e., the evaluation of conditions, Boolean expressions, functions and processes) do not consume simulated time as they are activities of the simulator and not of the described hardware. If s is the signal value to be assigned to a signal t, then the assignment is extended by an ‘after’ or a ‘transport after’ specification with an associated time delay parameter. The statements

(i) t <= s after 10 ns;
(ii) t <= transport s after 10 ns;

both define t to assume the values of s, which usually vary with time, with a delay of 10 ns. If a change of s occurs at the value T of the simulated time (from the start of the simulation), t will be set to the new value of s at the simulated time of T + 10 ns. Instead of a single signal change at that time, a list of changes at ascending time delays can be specified, and the timed
Figure 3.1 Assignment using an inertial 10 ns delay (‘X’: discarded change)
assignment can be conditional using ‘when’. There may already be a set of future, scheduled signal changes when the assignment is executed and adds more of them. The new assignment deletes all scheduled changes that would occur after the one caused by the current assignment. For (i), all future changes before this one that would result in a different signal value, and any changes preceding such, are also discarded; i.e., only earlier changes to the same signal value that are not followed by changes to another value remain (Figure 3.1). This kind of delay is called ‘inertial’. It suppresses short signal spikes. If e.g. s makes a transition back to its previous state before T + 10 ns, say, at T + 2 ns, then t does not change at all as the change scheduled for T + 12 ns deletes the one at T + 10 ns but keeps the previous value. For (ii), such intermediate changes do show up with the specified delay. Listing 3.8 describes both the functional and the timing behavior of AND2 so that L-H transitions occur with a delay of 3 ns, H-L transitions with a delay of 2 ns, and no transitions occur in response to short input signal ‘spikes’ of less than 2 ns duration. This is the typical behavior of a CMOS AND gate. An alternative behavioral architecture for AND2 could e.g. use different timing data. Timing parameters can also be defined as ‘generic’ parameters for the entity. Similarly, the processing time of the adder or the delay of the output of a flip-flop or a register from the clock edge can be specified (Listings 3.3, 3.4, 3.5).

architecture fast of AND2 is
begin
  t <= '1' after 3 ns when (r and s) = '1' else
       '0' after 2 ns;
end fast;

Listing 3.8 Functional and timing behavior of the AND gate
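For contrast with the inertial behavior of Listing 3.8, a pure wire or line delay that passes even the shortest pulses unchanged would use the ‘transport’ form of case (ii); a minimal sketch (the entity name and the 10 ns figure are chosen arbitrarily):

  -- minimal sketch of a pure (non-inertial) delay: every change of the input,
  -- however short, reappears at the output 10 ns later
  entity WIRE_DELAY is
    port (a: in bit; y: out bit);
  end WIRE_DELAY;

  architecture timing of WIRE_DELAY is
  begin
    y <= transport a after 10 ns;
  end timing;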
The correct operation of a D flip-flop or register requires that the data input does not transition at the time of the clock edge. It must not change between the set-up time before the edge and the hold time after the edge. If the data transitions during this time, the behavior of the flip-flop is undefined (and may be non-deterministic for the circuits given in section 2.1.2). The VHDL definitions, however, define the behaviors of entities positively only. Listing 3.9 shows how an error message can be generated during the simulation if the set-up and hold times are violated. It uses the attribute s'last_event to obtain the time delay since the last change of a signal s.

process (clk, d)
begin
  if d'event then
    assert clk = '0' or clk'last_event >= 1 ns
      report "hold time violated" severity warning;
  end if;
  if clk'event and clk = '1' then
    assert d'last_event >= 1 ns
      report "set-up time violated" severity warning;
    q <= d after 3 ns;
    nq <= not d after 3 ns;
  end if;
end process;

Listing 3.9 Flip-flop process checking for set-up and hold time violations
The definition in Listing 3.10 formally specifies the behavior of another basic circuit, the handshaking circuit for a reactive circuit using an output pipelining latch to store the data as long as OR = ‘1’ (see Figure 2.24; the output request signal is called ORQ in the listing, as ‘OR’ is a reserved word in VHDL). The generation of the delayed IR signal might be moved into a separate entity.

entity HS is
  generic (exec_time: time);
  port (IR, OA: in bit; IA, ORQ: buffer bit);
end HS;

architecture timing of HS is
  signal IR_del: bit;
begin
  IR_del <= '1' after exec_time when IR = '1' else '0' after 3 ns;
  IA  <= '1' after 3 ns when IR_del = '1' and ORQ = '0' else
         '0' after 3 ns when IR_del = '0' else IA;
  ORQ <= '1' after 3 ns when IR_del = '1' and OA = '0' else
         '0' after 3 ns when OA = '1' else ORQ;
end timing;

Listing 3.10 Handshaking circuit behavior
3.6 TEST BENCHES

The behavior of a circuit defined in VHDL resulting from the time delays and conditions can be visualized with the aid of a VHDL simulator by connecting signal sources to the inputs of the circuit as stimulus signals. The signal sources and their connection to the circuit are defined within another VHDL program. Such a combined program is called a test bench for the circuit in question as it is analogous to connecting real signal sources to the circuit on a breadboard and measuring the output waveforms by means of a logic analyzer (Figure 3.2). Test benches have no external signals. Test signals are easily generated by means of multiple timed assignments to them. A periodic clock can be generated through an assignment of the kind

clk <= not clk after 25 ns;

or by means of the process shown in Listing 3.11.
Figure 3.2 Test bench for a component
pn: process
  variable x: bit;
begin
  x := not x;
  a <= x;
  wait for 25 ns;
end process;

Listing 3.11 Clock process

The definition in Listing 3.12 describes a test bench for the AND2 component defined in Listing 3.1. The circuit to be tested is included as a component into the test system. Its definition is supposed to be found in the standard library ‘work’.

entity test is
end test;                                      -- there are no ports to the test system

architecture fast of test is
  component AND2 port (r, s: in bit; t: out bit); end component;
  signal a, b, c: bit;
begin
  u1: AND2 port map (a, b, c);                 -- this line defines an instance of the type AND2
  a <= '0', '1' after 10 ns, '0' after 30 ns;  -- test signal definitions
  b <= '0', '1' after 20 ns, '0' after 40 ns;
end fast;                                      -- configure simulator to display a, b, and c

Listing 3.12 Test bench for AND2

The assignments to a and b are all executed at the beginning of the simulation and specify particular time patterns for them. ‘a’ e.g. changes at T = 0 ns, 10 ns, and 30 ns and stays at ‘0’ afterwards. A simulator display showing a, b, and c for T = 0 .. 60 ns would be similar to Figure 3.3. For complex input and output patterns, test data can be stored in files or be read from such. If memories need to be attached to a circuit (e.g. a programmable processor) to obtain an operating environment for it, they can be simulated in a test bench by means of arrays. These can be filled with initial data from some file at the start of the simulation, or by calling some initialization program. For testing application-specific processors that read sequences of function codes (instructions) from a memory, it is necessary to initialize the memory with test sequences (programs). User-defined enumeration types can be used to support the symbolic
Figure 3.3  Simulator display for the AND2 signals
After a definition such as

   type opcode is (op_and, op_or, op_xor, op_not, add, sub, mul, div);

(note that the literals for the logical operations must not collide with the reserved words and, or, xor, not), opcode symbols can be converted into function codes (using e.g. a case statement), and it is easy to implement an assembly function with a symbolic opcode argument that translates the opcodes and additional parameters into processor instructions and places them into subsequent locations of the memory array in the test bench.
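A minimal sketch of such a setup is given below. The package name, the memory layout (16-bit words with a 4-bit function code field) and the codes chosen in the case statement are illustrative assumptions, not taken from the text.

   package asm_pkg is
      type opcode is (op_and, op_or, op_xor, op_not, add, sub, mul, div);
      type mem_t is array (0 to 255) of bit_vector(15 downto 0);
      function encode(op: opcode) return bit_vector;
      procedure assemble(signal mem: inout mem_t; variable pc: inout natural;
                         op: opcode; operand: bit_vector(11 downto 0));
   end asm_pkg;

   package body asm_pkg is
      function encode(op: opcode) return bit_vector is
      begin
         case op is                          -- 4-bit function codes (assumed)
            when op_and => return "0000";
            when op_or  => return "0001";
            when op_xor => return "0010";
            when op_not => return "0011";
            when add    => return "0100";
            when sub    => return "0101";
            when mul    => return "0110";
            when div    => return "0111";
         end case;
      end encode;

      procedure assemble(signal mem: inout mem_t; variable pc: inout natural;
                         op: opcode; operand: bit_vector(11 downto 0)) is
      begin
         mem(pc) <= encode(op) & operand;    -- assumed 16-bit instruction format
         pc := pc + 1;                       -- next memory location
      end assemble;
   end asm_pkg;

A test bench process can then build a program with successive calls such as assemble(mem, pc, add, "000000000101");.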
3.7 SYNTHESIS ASPECTS

Besides specifying the structure and the behavior of a digital system up to the point of being able to simulate its operation, the information within a VHDL design is used to synthesize the hardware from certain basic components such as gates and flip-flops. VHDL synthesis results in a formal network description of such basic components, submitted as a text file that lists the component instances used and their wiring through nets. A common format for this description is the EDIF net list format [42]. The structure and the contents of the EDIF net list are quite similar to a purely structural architecture definition in VHDL. It starts by listing the components and their interfaces, then proceeds to the interface definition of the entity, introduces the component instances to be used, and interconnects them through 'nets'. For an FPGA design, this net list is used as the input to FPGA-specific software tools that map the gate and flip-flop functions to cell functions, map the required cells to the physical cells on a particular FPGA chip (placement), route the interconnections required by the design using the available wiring resources (routing), and finally encode the results in a bit stream for the FPGA that can be downloaded into its configuration memory. For the selected placement and routing, the signal delays and the processing delays can be determined from the timing parameters of the FPGA chip and checked to meet the performance requirements. The routed interconnections result in significant signal delays that do not depend on the circuit structure but on the particular usage of the routing resources. It is these timing data that must be used in the behavioral description of the system to be built in order to obtain a correct simulation of the actual circuit. The net list is derived from the specified functional behavior and the structural information but does not depend on the delay data for the timing behavior (or on other constructs only needed for the simulation). The net list constitutes a purely structural description of the hardware. It is derived from the supplied structural information in the architectures of the VHDL entities, but it is also extracted from the functional specifications. The
algorithms in the functions and processes are transformed into Boolean expressions for which a minimization is performed. The Boolean operators are transformed into gate instances. This extraction from the functional specifications is no longer under the control of the designer. If there is e.g. an add operation for bit fields of the type 'unsigned' in the behavioral description as in Listing 3.3, it is not defined which adder circuit is actually realized in the net list (there are several choices, see section 4.2), and a user-defined add operation as in Listing 3.2 might not be recognized as such and hence not be realized efficiently. If this control over the results of synthesis is desired, a structural specification of the entity must be given by the designer. Also, certain entities cannot be further resolved into networks, and their instances show up in the final net list. This is the case for memory arrays or the delay elements used for handshaking. If such entities adhere to specific patterns corresponding to FPGA structures such as memory blocks, the FPGA-specific tools recognize their instances and map them to the corresponding FPGA resources. Their behavioral descriptions are intended for the simulation only. To suppress synthesis from these, the entity and architecture definitions are enclosed between special VHDL comment lines that act as compiler switches, '-- RTL SYNTHESIS OFF' and '-- RTL SYNTHESIS ON', or equivalent ones recognized by the synthesis tool. Synthesis from behavioral architectures also generates latch and flip-flop components for signals written to in processes and for process variables, yet no flip-flops clocked at both edges or by dual clock signals. There are some basic rules, derived from the proposed standard [43], that determine whether signals and variables correspond to latches or to edge-sensitive flip-flops:
• a process variable that is always written to in a process before being read contributes at most a network of gates (a 'combinatorial' function);
• a variable that is always written to but sometimes read before being written corresponds to a register (unless it is used for the simulation only);
• signals that are not read in the process and are not written to under all conditions become latches;
• signals written to conditionally after 'if clk'event and clk='1' then ...' become registers; other branches not depending on an event may asynchronously set or reset the output;
• without clk'event, latches are generated, even if 'clk' is the only signal in the sensitivity list.

The latch generation in the latter case is due to the fact that the synthesis tools automatically add the involved signals to the sensitivity list, including the signals from which the assigned value is computed, although the simulation would show the behavior of an edge-sensitive flip-flop. It is therefore common practice to always include such signals in the sensitivity lists. The sketch below illustrates the register and latch cases. Selective signal assignments according to the values of control signals generally translate into select circuits (multiplexers), although for some target architectures distributed selection through tri-state drivers attached to bus lines might be more advantageous. Bus lines need to be specified explicitly even in a behavioral definition. Other optimizations to a design such as sharing components like adders for different purposes also need to be specified through structural elements.
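The following small sketch (entity and signal names are placeholders, not from the text) contrasts the two inference results: q_reg is assigned only after a clock event and becomes a flip-flop with an asynchronous reset branch, while q_lat is assigned without any event condition and not under all conditions, and therefore becomes a transparent latch.

   entity infer_demo is
      port (clk, en, rst, d: in bit; q_reg, q_lat: out bit);
   end infer_demo;

   architecture rtl of infer_demo is
   begin
      reg: process(clk, rst)
      begin
         if rst = '1' then                   -- asynchronous branch without an event condition
            q_reg <= '0';
         elsif clk'event and clk = '1' then
            q_reg <= d;                      -- conditional assignment after the edge -> flip-flop
         end if;
      end process;

      lat: process(en, d)                    -- no clk'event anywhere in the process
      begin
         if en = '1' then
            q_lat <= d;                      -- not assigned under all conditions -> latch
         end if;
      end process;
   end rtl;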
3.8 SUMMARY

VHDL is a common language to formally describe the structure and the behavior of digital systems and their components. It serves to simulate complex circuits, and to carry out hardware design using synthesis tools that automatically derive the needed components and
their interconnections. In contrast to value assignments in software languages, assignments to signals are temporal processes and can occur at certain times only, e.g. after clock and handshaking events. If a signal is not assigned a value during some time, it holds the last value and behaves like a storage element. Our discussion reviewed the basic components such as gates and flip-flops. For gates, the processing delays in general depend on the output level, and registers and flip-flops need to respect set-up and hold times for the data inputs. Some VHDL constructs such as text output or file functions do not describe circuit structures but serve to implement test benches and to output results of the simulation.
EXERCISES

1. Define the behavior of a flip-flop with a gated clock signal.
2. Implement a test bench for a system of two cascaded instances of the HS entity in Listing 3.10 and verify their correct operation with suitable test signals, using a VHDL simulator.
3. Define a variant of HS for which the pipelining latch is moved to the input of the reactive circuit.
4. Define the behavior of an elementary gate or a flip-flop with std_logic inputs and output so that transitions have a non-zero rise or fall time modeled by an intermediate 'W' state of the signals, and enforce a maximum rise time for the clock signal of the flip-flop. We remark that this kind of behavioral description still does not capture the actual circuit behavior of the involved signals as would be modeled by an analogue circuit simulator program (e.g., Spice), although it may be a better approximation than the zero-time change events.
5. Define an entity that provides the interface signals of the 8-bit EPP peripheral bus found on PC workstations and distributes them to four separate input ports and an output port, using the address write cycle to select one of the input ports for subsequent data cycles. Such an interface is typical for accessing several circuits within an FPGA from the bus of an attached processor. For a peripheral circuit, the EPP signals are (see the specification in [71]):
   • 8 bidirectional data lines
   • data strobe input (chip enable signal as in Figure 2.36)
   • address strobe input (alternative chip enable signal)
   • read/write select input
   • reset input
   • interrupt output
   • ready handshake output.
6. Implement a simulation primitive that computes an ordered list of positive integers describing the simulated times of events on some signal from a sequence of lists of integers in which the first entries are the activation times and the subsequent ones are relative times for subsequent events, using transport delays only.
4 Operations on Numbers
In order to perform numerical computations with a digital system, numbers need to be encoded as bit strings as explained in section 1.1.1. Then arithmetic operations can be implemented as special Boolean functions on the number codes and in turn be used as building blocks in numeric algorithms. The general scheme to arrive at the Boolean function realizing an arithmetic operation for the given encoding is as shown in Listing 3.2 for the add operation. For binary and signed binary numbers, the circuit design from gates or CMOS switches is actually carried out for single bit numbers only. For larger numbers, the arithmetic operations have fairly simple algorithms reducing them to single bit operations. There is no need to resort to ROM tables for their realization as one might consider for more irregular Boolean functions, an exception being the so-called distributed arithmetic where a complex arithmetic operation is realized with the aid of small tables. To study the materials in this chapter, the exercises section plays a special role. Several of the arithmetic circuits and algorithms are taken up there, and VHDL implementations are given that also help the reader to become more familiar with that language.
4.1 SINGLE BIT BINARY ADDERS AND MULTIPLIERS

If the Boolean elements 0 and 1 are identified with the numbers 0 and 1, then the result of the arithmetic add operation applied to two numbers r, s ∈ B ranges from 0 to 2. Its binary encoding is by two bits (q, o) so that

   r + s = q + 2o

The Boolean function mapping (r, s) to (q, o) is called the 'half adder' function. According to its function table,

   q = XOR(r, s),   o = AND(r, s)
The half adder is thus realized by two standard gate functions. The sum of three bits r, s, t can still be represented as

   r + s + t = q + 2o        (1)

and the function mapping (r, s, t) to (q, o) is called the full adder function. It can be realized by cascading two half adder circuits and using an OR operation to sum up their mutually exclusive 'o' outputs (see Listing 3.6 for the VHDL definition), i.e.

   q = XOR(q_0, t),   o = OR(AND(r, s), AND(q_0, t)),   with q_0 = XOR(r, s)        (2)

Reciprocally, the half adder function is obtained from the full adder by setting t = 0. More generally, a (2^k − 1)-input adder can be defined that outputs a k-bit binary number. Its least significant bit is the result of the XOR operation (i.e., the parity function) applied to all inputs. E.g., the 7-input add circuit with a 3-bit result can be obtained by composing 4 full adder circuits, as sketched below. The product of two numbers r, s ∈ B takes its values in B and is hence encoded by a single bit. It is computed by the AND gate.
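The following dataflow description composes the 7-input adder from four full adders written out as the equations (2); the entity and signal names are our own and only serve as an illustration.

   entity add7to3 is
      port (x: in bit_vector(6 downto 0); q: out bit_vector(2 downto 0));
   end add7to3;

   architecture dataflow of add7to3 is
      signal s1, c1, s2, c2, s3, c3: bit;
   begin
      s1 <= x(0) xor x(1) xor x(2);                        -- full adder for x(2..0)
      c1 <= (x(0) and x(1)) or ((x(0) xor x(1)) and x(2));
      s2 <= x(3) xor x(4) xor x(5);                        -- full adder for x(5..3)
      c2 <= (x(3) and x(4)) or ((x(3) xor x(4)) and x(5));
      s3 <= s1 xor s2 xor x(6);                            -- add the two sum bits and x(6)
      c3 <= (s1 and s2) or ((s1 xor s2) and x(6));
      q(0) <= s3;                                          -- weight 1 (the parity of the inputs)
      q(1) <= c1 xor c2 xor c3;                            -- fourth full adder sums the carries
      q(2) <= (c1 and c2) or ((c1 xor c2) and c3);         -- weight 4
   end dataflow;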
4.2 FIXED POINT ADD, SUBTRACT, AND COMPARE

In this section, we consider the n-bit unsigned and signed add and subtract operations, their restrictions to inputs yielding n-bit results, and the mod(2^n) add and subtract operations. They turn out to be quite similar. These n-bit add and subtract operations and the related increment and decrement operations have algorithms based on the single bit add operations in section 4.1. To start with the binary add operation, let two numbers a, b have the codes (a_0, ..., a_{n-1}), (b_0, ..., b_{n-1}) and let (s_0, ..., s_n) be the (n + 1)-bit code of their sum. We have

   a = \sum_{i=0}^{n-1} a_i 2^i,   b = \sum_{i=0}^{n-1} b_i 2^i

and

   a + b = \sum_{i=0}^{n-1} (a_i + b_i) 2^i = \sum_{i=0}^{n-1} q(a_i, b_i, c_i) 2^i + c_n 2^n

The function q is the full adder operation (1) in section 4.1, and the bits c_0, ..., c_n are defined recursively by c_0 = 0 and c_{i+1} = o(a_i, b_i, c_i) for i < n. Thus, s_i = q(a_i, b_i, c_i) for i < n, and s_n = c_n. The algorithm to obtain the s_i by computing the q(a_i, b_i, c_i) and the c_{i+1} = o(a_i, b_i, c_i) is the common bit-by-bit computation of the sum, the c_i being the carry bits, and corresponds to the composite circuit ADD defined in section 3.4, using full adder circuits as building blocks. Note that it is the structural type of definition that corresponds to a particular Boolean algorithm. The circuit is known as the ripple-carry adder. It suffers from a long execution time T = n·T_f due to the connection of the full adder building blocks in series, T_f being the processing time of the full adder. The circuit complexity is in O(n). The first full adder stage actually performs as a half adder and could be substituted by the simpler half adder
component, but maintaining the carry input to the first adder is useful for cascading several multi-bit adders. If one desires the result of the add operation to be encoded similarly to the operands, i.e. with n bits, too, which is usual in order to avoid the growth of the codes for every add operation, (s_0, ..., s_{n-1}) is the result and the carry output c_n = 1 is the overflow error condition of the result exceeding the range of the n-bit binary codes. The add operation with the n-bit result (s_0, ..., s_{n-1}) irrespective of the carry is the add operation mod(2^n) that computes the binary code of the remainder of the sum a + b divided by 2^n. The n-bit adder with the carry input and output generalizes the full adder function by changing the inputs a, b and the output q to binary coded base-2^n digits. Every n-bit add circuit with an (n + 1)-bit result performs the mod(2^n) add operation and generates the overflow bit. The carry input to the n-bit adder can also be thought of as a control input that selects between the alternative computations of a + b and a + b + 1. In fact, for n = k + m the n-bit ripple-carry adder can be regarded as the composition of a k-bit and an m-bit adder, so that the carry output of the k-bit adder selects between the two alternative computations of the m-bit adder, which is similar to executing them as the alternative branches of an 'if' control structure (section 1.2.2). As the add operation may occur many times within an algorithm, a short execution time is desirable for it. There are various schemes to reduce the processing time by providing a faster carry propagation and by lowering the total depth of the adder circuit at the expense of an increased gate count. One is the carry select scheme that converts the control dependency of the m-bit add on the k-bit add into a data dependency by executing both branches in parallel and selecting using the carry bit of the k-bit adder (see section 1.4.2). Listing 4.1 lists the structural VHDL definition for a 16-bit adder architecture with k = m = 8. If n is a power of 2, the carry select scheme can be applied recursively to half size adders, which raises the gate count to O(n^{1.6}) but decreases the processing time to O(log n) (three half-size adders are needed, hence for n = 2^r a total of 3^r = n^{1.6} 1-bit adders are needed). Thus, the component ADD8 would be implemented similarly with three ADD4 components etc. For other circuit schemes to achieve a faster binary add operation, see [44].

entity ADD16 is
   port (a, b: in bit_vector(15 downto 0); ci: in bit;
         q: out bit_vector(15 downto 0); co: out bit);
end ADD16;
architecture structural of ADD16 is
   component ADD8
      port (a, b: in bit_vector(7 downto 0); ci: in bit;
            q: out bit_vector(7 downto 0); co: out bit);
   end component;
   component SEL port (a, b, s: in bit; q: out bit); end component;
   signal s0, s1: bit_vector(8 downto 0);
   signal cl: bit;
begin
   low:   ADD8 port map (a(7 downto 0), b(7 downto 0), ci, q(7 downto 0), cl);
   high0: ADD8 port map (a(15 downto 8), b(15 downto 8), '0', s0(7 downto 0), s0(8));
   high1: ADD8 port map (a(15 downto 8), b(15 downto 8), '1', s1(7 downto 0), s1(8));
   I: for n in 0 to 7 generate
      G: SEL port map (s0(n), s1(n), cl, q(n+8));
   end generate;
   GC: SEL port map (s0(8), s1(8), cl, co);
end structural;

Listing 4.1  The carry-select adder
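Listing 4.1 presupposes an ADD8 component that is not shown in the text. A minimal ripple-carry sketch for it, built with a generate loop over the full adder equations (the structure and names are our own, not the book's ADD8), could look as follows; a logarithmic-time ADD8 would instead be composed from two ADD4 instances and select gates in the same way as Listing 4.1.

   entity ADD8 is
      port (a, b: in bit_vector(7 downto 0); ci: in bit;
            q: out bit_vector(7 downto 0); co: out bit);
   end ADD8;

   architecture ripple of ADD8 is
      signal c: bit_vector(8 downto 0);            -- internal carry chain
   begin
      c(0) <= ci;
      stage: for i in 0 to 7 generate
         q(i)   <= a(i) xor b(i) xor c(i);                          -- full adder sum
         c(i+1) <= (a(i) and b(i)) or ((a(i) xor b(i)) and c(i));   -- full adder carry
      end generate;
      co <= c(8);
   end ripple;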
The operation of incrementing (i.e., the unary operation i → i + 1) is the special case of an add operation in which the second operand is the constant number 1, which is encoded as (1, 0, ..., 0). Like the n-bit adder, the n-bit incrementer produces an (n + 1)-bit result. The lower n output bits encode the result of the mod(2^n) increment operation, and the upper bit indicates the overflow condition of the result not being representable with n bits (this only occurs for the all-ones input encoding 2^n − 1 with the result of 2^n with zeroes in the lower n bits; the overflow bit 'tests' for the all-ones code). If the ripple-carry add algorithm is used for the increment operation, the full adders all have a zero operand and can be simplified to half adders (the first of them even to an inverter). If the first half adder is maintained and the '1' input to it is changed to '0', the trivial operation i → i + 0 = i is performed instead. Thus the increment operation can be controlled with very little extra hardware effort. The incrementer is faster than the ripple-carry adder but still needs the 'linear' time n·T_h, T_h being the execution time of the half adder. Now the carry to the ith stage only occurs if all previous outputs are ones, which can be determined using an (i−1)-input AND function. These can be computed in 'logarithmic' time ld(i−1)·T_a by connecting the elementary 2-input AND gates as trees, and the increment function using such fast carry inputs to all the half adders achieves an execution time in O(log n). We next consider the signed binary add operation. The add operation essentially remains the same if twos-complement encoding is used instead of the straight binary encoding. Using equation (3) in section 1.1.1, the numbers a, b are related to their n-bit twos-complement codes (a_0, ..., a_{n-1}) and (b_0, ..., b_{n-1}) through the equations

   a = -a_{n-1} 2^n + \sum_{i=0}^{n-1} a_i 2^i,   b = -b_{n-1} 2^n + \sum_{i=0}^{n-1} b_i 2^i

hence

   a + b = -a_{n-1} 2^n - b_{n-1} 2^n + \sum_{i=0}^{n-1} (a_i + b_i) 2^i = h + (c_n - a_{n-1} - b_{n-1}) 2^n        (3)
where h denotes the result of the modulo 2^n add operation applied to \sum a_i 2^i and \sum b_i 2^i, and c_n is its overflow bit. The only difference is in the overflow condition if operands and the result use the same word size. For the binary encoding, the final carry output c_n signals the overflow. The sum of two signed n-bit numbers is a signed (n + 1)-bit number s, the sign bit s_n of which is the sign of the result. The overflow in the twos-complement case is indicated by the sign s_n of the result being unequal to s_{n-1}. By (3) the computation of s_n requires an extra XOR operation:

   s_n = XOR(XOR(a_{n-1}, b_{n-1}), c_n)        (4)

From

   s_{n-1} = XOR(XOR(a_{n-1}, b_{n-1}), c_{n-1})

it follows that

   s_n ≠ s_{n-1}  <=>  XOR(s_n, s_{n-1}) = 1  <=>  XOR(c_n, c_{n-1}) = 1        (5)
The signed binary increment operation with an n-bit result is the same as the unsigned or the mod(2^n) one. It overflows for the maximum positive input code of (1, 1, ..., 1, 0) and then outputs the most negative code of (0, 0, ..., 0, 1). Concerning the subtract operations for unsigned, signed and modulo 2^n binary numbers, we first remark that for a mod(2^n) binary number x, /x + 1 is the mod(2^n) additive inverse −x of x (/x denotes the complement of x). The negation overflows for the all-zeroes code only and hence can be used to test for it. The binary subtract operation mod(2^n) is obtained from the equation

   a − b = a + (−b) = a + /b + 1        (6)
Adding the '1' does not require an extra increment circuit but is achieved by placing a '1' at the carry input of the first adder stage of the mod(2^n) adder. By selecting between b and /b the same n-bit adder can be used both for the add and subtract operations, which is useful to build multi-function compute circuits, in particular arithmetic units for programmable processors (see Figure 1.6). For unsigned binary numbers a, b the difference a − b is defined for b ≤ a only and is then equal to the result of the binary mod(2^n) subtract operation. This condition is signaled by the mod(2^n) adder used to compute (6) through a '1' at the carry output. The signed binary subtract operation also coincides with the mod(2^n) subtract operation for the lower n bits of the result, as the signed binary code of a number coincides with the unsigned code of its remainder mod(2^n), and taking remainders is compatible with the add and subtract operations. The full result can be represented as an (n + 1)-bit signed binary number that is obtained by applying equation (6) to a and b sign-extended to n + 1 bits. This time the overflow condition is s_n ≠ s_{n-1} again. A special case of the subtract operations is the decrement operation, i.e. subtracting the constant 1. As for the increment operation, the subtract circuit can be simplified and sped up to logarithmic time. If x is an n-bit signed binary number and x ≠ −2^{n-1}, then so is −x, and −x = /x + 1. The negation overflows only for x = −2^{n-1} (= 100..00), the most negative representable number, as −x = 2^{n-1} cannot be represented as a signed n-bit number but needs n + 1 bits. Again, the overflow condition is s_n ≠ s_{n-1} and can be used to check for this special code. The add, subtract, negate and compare operations for signed or unsigned binary numbers x are identical to those for the fixed point operations on the fixed-point numbers x/2^r, for which hence the same algorithms can be used. Related to the unsigned and signed fixed point subtract operations are the compare operations 'a ≤ b' and 'a = b' viewed as Boolean functions. 'a ≤ b' for unsigned n-bit operands is equal to the carry output from the subtract operation b − a and hence computed by any subtract or combined add and subtract circuit if the value of 1 is interpreted as 'true'. The comparison of signed numbers requires computing the sign of b − a instead of the carry. 'a = b' can be evaluated by comparing bit per bit and computing the conjunction of all bit comparisons with
an AND gate tree, as the codes are equal if and only if the represented numbers are equal, or it can be reduced to the signed or unsigned '≤' operations or to a test for zero using the equivalences

   a = b  <=>  (a ≤ b and b ≤ a)  <=>  a − b = 0
Testing for the special values 0, −1, −2^{n-1} and 2^{n-1} − 1 is equivalent to testing for the overflow conditions of the increment and negate operations.
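The shared add/subtract circuit of equation (6) together with the carry-based comparison can be sketched as follows. For sub = '1' the second operand is complemented and the '+1' is injected through the carry input; the carry output then signals b ≤ a for unsigned operands. The generic width and all names are our own assumptions.

   library ieee;
   use ieee.std_logic_1164.all;
   use ieee.numeric_std.all;

   entity addsub is
      generic (n: positive := 16);
      port (a, b: in std_logic_vector(n-1 downto 0);
            sub:  in std_logic;                      -- '0': a + b, '1': a - b
            q:    out std_logic_vector(n-1 downto 0);
            c:    out std_logic);                    -- carry; for sub = '1' it signals b <= a
   end addsub;

   architecture rtl of addsub is
   begin
      process(a, b, sub)
         variable bb:  unsigned(n-1 downto 0);
         variable sum: unsigned(n downto 0);
      begin
         if sub = '1' then bb := not unsigned(b); else bb := unsigned(b); end if;
         sum := ('0' & unsigned(a)) + ('0' & bb);    -- n-bit add with carry output
         if sub = '1' then sum := sum + 1; end if;   -- the '+1' of equation (6)
         q <= std_logic_vector(sum(n-1 downto 0));
         c <= sum(n);
      end process;
   end rtl;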
4.3 ADD AND SUBTRACT FOR REDUNDANT CODES

A still more effective way to speed up the add operation is obtained by using a redundant encoding for the numbers. We first consider the code using the 2n-tuple (u_0, ..., u_{n-1}, v_0, ..., v_{n-1}) for the number

   m = \sum_{i=0}^{n-1} u_i 2^i + \sum_{i=0}^{n-1} v_i 2^i        (7)
Every binary number (m_0, ..., m_{n-1}) can be converted into a 'double' code by appending (0, ..., 0), and every double code can be reconverted to a binary one by performing the binary add operation according to equation (7). For two redundant codes a, b encoding numbers a′, b′, the conditions a = b and a′ = b′ are no longer equivalent. If u and v are the binary numbers (u_0, ..., u_{n-1}) and (v_0, ..., v_{n-1}), then the redundant code (u_0, ..., u_{n-1}, v_0, ..., v_{n-1}) encoding their sum is obtained without any computational effort. An encoded add operation using a redundant code for the first argument and the result and a binary code for the second argument is any function B^{2n} × B^n → B^{2n+2} mapping a double code (u_0, ..., u_{n-1}, v_0, ..., v_{n-1}) of a number a and the binary code (b_0, ..., b_{n-1}) of a number b to a double code (s_0, ..., s_n, t_0, ..., t_n) of the number a + b. Encoded add operations are not uniquely defined, as the double codes for the result are not unique. A possible choice is

   s_i = q(u_i, v_i, b_i) for i < n,   s_n = 0

and

   t_{i+1} = o(u_i, v_i, b_i) for i < n,   t_0 = 0
where q, o are the full adder outputs. This follows from equation (1) in section 4.1; the multiplication by 2 is realized by the t word being shifted by one bit position. The computation of the result code can thus be performed by n full adder circuits in parallel in the time T_f irrespective of n. From this add operation one derives an encoded add operation B^{2n} × B^{2n} → B^{2n+4} accepting double codes for both operands that also operates in the constant time 2·T_f, namely by adding the halves of the second operand one by one. These add operations can now be composed. If k redundant codes have to be added up to a single redundant result code, one can arrange them as a tree to obtain the time of 2·T_f·ld(k) for the total computation. The only disadvantage of the redundant add operation is the redundant encoding of the result which usually needs to be converted (added up) to a binary code. If, however, a set of k binary numbers have to be added up, only the final result would have to be converted.
The redundant encoding defined by equation (7) and the fast add operation generalize to codes (u_0, ..., u_{n-1}, v_0, ..., v_{n-1}) where the u_i are base-k digits and only the v_i are binary digits. Listing 4.2 shows a 16-bit adder using this redundant kind of code for the first operand and the result and the non-redundant binary code for the second operand. Four 4-bit sections serve as base-16 adders.

entity XADD16 is
   port (a, b: in bit_vector(15 downto 0); xa: in bit_vector(3 downto 0);
         q: out bit_vector(15 downto 0); xq: out bit_vector(4 downto 1));
end XADD16;
architecture struct of XADD16 is
   component ADD4
      port (a, b: in bit_vector(3 downto 0); ci: in bit;
            q: out bit_vector(3 downto 0); co: out bit);
   end component;
begin
   g0: ADD4 port map (a(3 downto 0),   b(3 downto 0),   xa(0), q(3 downto 0),   xq(1));
   g1: ADD4 port map (a(7 downto 4),   b(7 downto 4),   xa(1), q(7 downto 4),   xq(2));
   g2: ADD4 port map (a(11 downto 8),  b(11 downto 8),  xa(2), q(11 downto 8),  xq(3));
   g3: ADD4 port map (a(15 downto 12), b(15 downto 12), xa(3), q(15 downto 12), xq(4));
end struct;

Listing 4.2  Redundant adder built from 4-bit sections
The signed digit encoding of a number m by the tuple (m_0, ..., m_{n-1}) with

   m = \sum_{i=0}^{n-1} m_i 2^i        (8)
where the digits m_i take the values −1, 0, 1 (and hence also need at least two bits each for their encoding) also yields a fast add operation by using the redundancy in such a way that the carry does not propagate by more than one place. The negate operation on the signed digit code is by negating the digits individually in parallel. It is hence easy to convert a signed binary number into this representation, and the reconversion to a signed binary code is by splitting m into a difference of two positive binary numbers and performing the binary subtract operation. A fast algorithm to compute a sum code (s_0, ..., s_n) of two signed digit codes (u_0, ..., u_{n-1}) and (v_0, ..., v_{n-1}) is as follows. For every i ≤ n − 1 one computes the conditions

   p_i: (u_i = 0 and v_i = 1) or (u_i = 1 and v_i = 0) or (u_i = v_i = 1),
   n_i: (u_i = 0 and v_i = −1) or (u_i = −1 and v_i = 0) or (u_i = v_i = −1),

the carry digit

   c_i = 1    if (u_i = 1 and v_i = 1) or ((i > 0) and p_i and p_{i-1}),
   c_i = −1   if (u_i = −1 and v_i = −1) or ((i > 0) and n_i and n_{i-1}),
   c_i = 0    otherwise,
and (setting c_{-1} = 0)

   s_i = u_i + v_i − 2c_i + c_{i-1}

Finally, s_n = c_{n-1}. Thus s_i only depends on u_i, v_i, u_{i-1}, v_{i-1}, u_{i-2}, v_{i-2}, and all can be computed in parallel (this also holds for the redundant add operation presented first). The fact that s_i does not depend on the lower bit positions is called the 'on-line property' in the literature [6]. It implies that they do not even need to be known at the time s_i is computed and that their computation may be pipelined with the computation of s_i.
4.4 BINARY MULTIPLICATION

The product of two numbers

   a = \sum_{i=0}^{n-1} a_i 2^i,   b = \sum_{i=0}^{m-1} b_i 2^i

is the (n + m)-bit number

   a·b = \sum_{i,j} a_i b_j 2^{i+j} = \sum_{k=0}^{n+m-1} h_k 2^k

Thus the n·m single bit products a_i b_j have to be computed, and the results for a given bit position i + j have to be added up by means of single bit adders passing carries as needed. This can be done in various ways, using as building blocks full adders extended by an AND gate (the single-bit multiplier) to form the product a_i b_j. They are depicted in Figure 4.1 for the case of n = m = 5 by circles that output the higher (carry) bit to the left or straight down and the lower (sum) bit down on the diagonals to the right, and receive up to three inputs, a product a_i b_j, and the others from other building blocks. The jth row receives the inputs a_{n-1} b_j, ..., a_0 b_j, and the output bits h_0 ... h_{2n-1} appear at the right and lower sides of the diagrams. A circle that only receives the product input just passes it, and if there is only one additional input the circle acts as a half adder. The version in Figure 4.1(a) corresponds to adding shifted versions of a with n ripple-carry adders operating in parallel and has the disadvantage of a
Figure 4.1  Multiplier networks with different carry propagation schemes
Figure 4.2  Extended k-bit adder component
fairly long processing time of (3n − 2)·T, T being the processing time of the full adder (the products are computed by the AND gates in parallel and only add a single AND gate delay, which is neglected). The version shown in Figure 4.1(b) uses the same number of building blocks but a different carry routing for the nodes above the diagonal and multiplies in a time of (2n − 2)·T (the 1st row just passes the product inputs to the 2nd). The computation of the less significant half word of the result only takes (n − 1)·T. For a multiplier modulo 2^n that only computes the less significant word, only the adders on and above the diagonal are needed. The third propagation scheme in Figure 4.1(c) ends up with a redundant code for the upper half word of the result, i.e. two n-bit binary numbers that still need to be added up to the most significant word of the result, using e.g. a carry select adder. It takes the time of (n − 1)·T plus the carry select adder time and can be thought of as cascading n − 1 adders, each adding a redundant code and a row of products. A still faster multiplier operating in logarithmic time can be obtained by using a tree of redundant adders instead of cascading them, the so-called Wallace tree [44], and adding up the upper word parts with a carry select adder or some other adder also operating in logarithmic time. For all of these schemes, the input data must be held unchanged throughout the processing time as they are used in all adder rows. The multiplier networks in Figure 4.1 are regular structures constructed from a single type of component, the extended full adder including the AND multiplier. It adds the multiplier result and two more bits and outputs a 2-bit result. Current FPGA chips provide this function in a single cell. The extended adder component generalizes to the k-bit extended adder that is defined as a combined k-bit multiply and add circuit that forms the product of two k-bit numbers and adds it to two more k-bit inputs to arrive at a 2k-bit result (Figure 4.2). For n = m·k, the same network structures as in Figure 4.1 implement n-bit multipliers with m·m extended k-bit adder components. The extended k-bit adder (or add-subtract circuit) can also be used as a component to implement n-bit add (add/subtract) functions and could be used as a cell in a coarse-grained FPGA architecture. The product of a twos-complement signed number a by an unsigned or signed number is formed according to the representations

   a = -a_{n-1} 2^n + \sum_{i=0}^{n-1} a_i 2^i = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i

and differs from the unsigned product in the most significant word only. If '*_s' denotes the Boolean function computing the product of signed binary numbers, and '*_u' the unsigned version explained before, then

   a *_s b = a *_u b − 2^n a_{n-1} b − 2^n b_{n-1} a        (9)
where the products by a_{n-1} and b_{n-1} just make the subtract operations of b or a conditional on them, and the product by 2^n means that the subtract operations are from the upper half word. The same multiplier structures can be used and extended to conditionally subtract from the most significant word, or to perform a subtract in the last row containing the products a_i b_{n-1} instead of adding, using the signed versions of the add and shift operations (the leftmost carry is replaced by the sign). If one of the operands, say b, is a constant, the multiplier circuit can be simplified significantly. In fact, all a_i b_j can be evaluated to a_i or 0 depending on b_j. Thereby all AND gates are eliminated, and one has to add up selected shifted versions of a, namely the versions 2^j·a for those j satisfying b_j = 1. In the special case of just one of the b_j being 1 no add operation needs to be performed at all, and the output is the shifted word 2^j·a. Obviously, the structure of the circuit now depends on the constant operand (see the sketch at the end of this section). All of the above binary multiplier circuits for two variable n-bit operands use n^2 full adder building blocks to add up the products and have a gate count in O(n^2). A reduction of this complexity can be achieved by applying the formula

   (a + 2^r b)(c + 2^r d) = ac + 2^r((a + b)(c + d) − ac − bd) + 2^{2r} bd        (10)
that reduces a (2r)-bit product to three r-bit products. The recursive application reduces the complexity to within O(n^{1.6}) if ripple carry adders are used, and to O(n^{1.6} log(n)) with carry select adders. With the fast adders the processing time adds up to O(log(n)^2), using the fact that the three products in the formula can be computed in parallel. If log(n) multipliers can be used in parallel, the execution time per multiply becomes O(log(n)) as for the Wallace tree, but the gate count rises to O(n^{1.6} log(n)^2) only. The Wallace tree uses its gates more efficiently but needs too many steps to perform the multiplication. The multiplication based on equation (10) pays off and is used for large bit sizes (>64) only. The complexity of multiplication algorithms can be as low as O(n log(n) log(log(n))) [54], by interpreting the multiplication as a convolution for which a fast algorithm exists (cf. section 8.2.3). For the computation of the product an on-line property can be achieved if redundant codes are used for the adder implementation (at least for the upper half of the result word). It is obtained by applying still another scheme to connect the adders, given by the recursion

   A_j B_j = 4 A_{j-1} B_{j-1} + a_{n-j} B_j + 2 b_{n-j} A_{j-1}

where A_j, B_j denote the binary numbers represented by (a_{n-j}, ..., a_{n-1}) and (b_{n-j}, ..., b_{n-1}), so that a = A_n, b = B_n. Adding B_j and 2A_{j-1} may still cause a change in the bit position j of the 2j-bit result, but can be arranged not to do so in the higher ones, i.e. the upper j − 1 bits can be computed from the upper j bits of the operands [6]. This on-line property is useful in fixed point computations with a fixed number of places where instead of the full product a rounded version of the most significant part is needed.
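As a small illustration of the constant-operand case, the following sketch multiplies an 8-bit number by the arbitrarily chosen constant b = 10 = "1010", so that only the shifted versions 2a and 8a remain to be added; the entity name and types are our own.

   library ieee;
   use ieee.std_logic_1164.all;
   use ieee.numeric_std.all;

   entity mul_by_10 is
      port (a: in unsigned(7 downto 0);
            p: out unsigned(11 downto 0));        -- 8-bit operand times the 4-bit constant 10
   end mul_by_10;

   architecture rtl of mul_by_10 is
   begin
      -- only the versions 2^j * a with b_j = 1 are summed, here 2^1*a and 2^3*a
      p <= shift_left(resize(a, 12), 1) + shift_left(resize(a, 12), 3);
   end rtl;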
4.5 SEQUENTIAL ADDERS, MULTIPLIERS AND MULTIPLY-ADD STRUCTURES

In this section we derive implementations of some of the add and multiply algorithms discussed before that serially reuse components for the sake of efficiency and provide cost-effective register arrangements to store the intermediate results and select them for later operations.
Figure 4.3  Serial adder
If h is the depth of the add or the multiply circuit, its processing time is h·T. As pointed out in section 1.5, if the multiplier is used at its maximum rate corresponding to its processing time, then the adder circuits performing the computation are used with an efficiency of 1/h only. Pipelining can also be used to raise the efficiency. The layered structure of the multiplier in Figure 4.1(c) can be used to pipeline its operation by inserting registers between the layers both for the intermediate results and for the operands. Then the multiplication still takes the same time (even a little more due to the registers) but subsequent multiplications can be started at the rate given by T that is independent of h, and the efficiency becomes close to 100% (with a proportional increase of the power consumption). The storage and power requirements become lower if the layers are grouped into sets of h′ layers and the pipelining is implemented for these only. Then the pipelined multiplications can be started at a rate of h′·T and the efficiency rises to close to 1/h′. The n-bit binary ripple-carry adder applies n identical full adder circuits at all bit positions. The full adders are connected in series via the carries. The full adder operations can be executed serially on a single full adder circuit, starting with bit 0, by using as the full adder inputs for the ith application the bits a_i, b_i from the operands and the carry signal c_i that has been computed as the overflow o_{i-1} in the previous application. o_{i-1} must be stored in a flip-flop in order to be able to use it in the subsequent step, but it is no longer used thereafter and the same flip-flop can be used to store all the carries in sequence (Figure 4.3). It must be cleared to zero at the start of the serial computation. This also eliminates the need to select the carry input from different sources during the sequence of steps. The simplicity of the serial adder circuit depends on the input bit pattern being applied serially to the same sites and on also producing the output pattern this way. Conversely, the inputs and outputs to the previous 'parallel' adder circuits were supposed to be applied in parallel. Auxiliary circuits are needed if the inputs to the adders are applied differently. A selection circuit for the input bits for the serial adder is needed if these are applied in parallel on n signal lines, using n − 1 select gates connected as a tree as in Figure 1.3 for each of the two input words. The processing time of these select trees is proportional to their depth ld(n) and drives up the processing time of the serial adder, and also requires the generation of a select address. The extra processing time and the need for a select address are eliminated by receiving the input bits from two n-bit shift registers using the same clock signal as for the carry flip-flop (increasing the power consumption). The maximum clock rate is then given by the processing time of the half adder plus the set-up and hold time for the flip-flop. As the shift registers need to be loaded in parallel from the input lines, they also include n − 1 select gates each, but this time the select operation is pipelined with the full adder operation and does not contribute to the processing time. Even not counting the input register flip-flops, which probably would also be used for the inputs to a parallel adder, one concludes that a need to select from the parallel input bits drives up the hardware costs to similar costs as for the parallel adder.
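A behavioral sketch of the serial adder of Figure 4.3 is given below. The clr input used to clear the carry flip-flop at the start of an n-bit operation is an assumed control signal, and the names are placeholders.

   entity serial_add is
      port (clk, clr, ai, bi: in bit; qi: out bit);
   end serial_add;

   architecture rtl of serial_add is
      signal carry: bit;
   begin
      qi <= ai xor bi xor carry;          -- the full adder sum output q(i)
      process(clk)
      begin
         if clk'event and clk = '1' then
            if clr = '1' then
               carry <= '0';              -- cleared at the start of a new word
            else
               carry <= (ai and bi) or ((ai xor bi) and carry);  -- o(i) stored for the next bit
            end if;
         end if;
      end process;
   end rtl;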
If the output bits from the serial adder are needed in parallel, another shift register can be used, again eliminating the need to select storage locations for the individual bits. One of the input shift registers can be used for this purpose (Figure 4.4), too.
Figure 4.4  Serial adder with parallel input and output
Figure 4.5  Serial multiplier using the Horner scheme
This figure does not show the signal selecting parallel input and a handshaking signal indicating n clock edges later that the result is ready. Their generation still requires an extra effort not needed for the parallel adder. As the n by n bit binary multiplier is also built up from n^2 identical full adder plus AND gate components, all their operations could be executed in sequence, this time requiring n bits of intermediate storage. More generally, the n-bit multiply operation can be broken up into m^2 identical k-bit extended add operations that can be serialized. The n by n bit multiplier in Figure 4.1(a) can also be represented as a cascade of n rows that are parallel adders and can be realized by using the same parallel adder in n sequential steps. It needs a single n-bit register to pass a partial sum to the next step (Figure 4.5), which avoids the overhead for selecting input bits. The first operand a is input in parallel to the adder while the bits of the second operand need to be input serially (e.g. using a shift register). The output bits come out serially at the least significant output bit of the adder according to the serial execution of the Horner scheme (where multiplying by 1/2 is equivalent to shifting down):

   \sum_{i=0}^{n-1} b_i 2^i · a = 2^n · 1/2(b_{n-1}·a + 1/2( ... + 1/2(b_1·a + 1/2(b_0·a)) ... ))
and may be moved into a shift register. After n steps, the register holds the most significant
half of the result. Again, there is some extra effort for serializing b and for generating signals indicating the start and the end of the multiply operation, but this time the overall hardware costs are in O(n) and thus significantly less than for the 'parallel' multiplier circuits. The signed multiply is implemented by changing the cycle adding b_{n-1}·a to a subtract cycle and by using a signed adder circuit. The circuit in Figure 4.5 generalizes to serially adding up k × n bit products using extended k-bit adders. Instead of multiplying via the AND gates, the multiplication step can also be realized by selecting between the adder output and directly passing the shifted input. This amounts to skipping the add operation and can be exploited to speed up the sequential multiplication in the case of the second operand containing many zeroes. Moreover, if the second operand contains a sequence of ones from bit positions k to l, then the sum of the corresponding powers 2^r·a is equal to 2^{l+1}·a − 2^k·a and can hence be computed with two add steps and some shifts (Booth's algorithm). The processing time for the multiplication becomes data dependent, which is considered a disadvantage in some applications. The row of two-output extended full adders from Figure 4.1(c) can also be used in n sequential steps. It requires two n-bit registers, and their contents have to be summed up in a final step. The second register is for the carry words and gives this multiplier structure the name of a carry-save multiplier. The sequential steps can be performed at nearly the maximum rate of the individual extended full adder component. The carry-save multiplier performs in linear time and with a linear hardware effort. The adder components are used at close to 100% efficiency as in the case of the fully pipelined parallel multiplier. The carry-save multiplier is much faster than the structure in Figure 4.5 even if a fast adder is used there and if the final add to convert from the redundant code is performed serially. In comparison to the parallel circuit using n^2 adders, it cannot exploit the adder tree structure or be pipelined. Adding b_i·a to the right-shifted output word in the serial multiplier of Figure 4.5 is easier to implement but equivalent to adding a left-shifted version of it to the unshifted 2n-bit output register. If it is implemented this way, then at the start of the multiplication the register needs to be reset to zero. If it is not, the product is actually added to its previous contents. Thus at no extra hardware costs, this serial multiplier performs a multiply-and-add operation. The shifted output register version can be modified to perform it, too, and a variant of it is shown in Figure 4.6 using the carry-save multiplier scheme. The upper part containing the AND multipliers, the full adders and the three registers C, L, H is the carry-save multiplier circuit outlined above, the C register holding the carry word. After the last cycle of a multiply operation the redundant code of the high word of the result would stand in the registers C and H while L would hold the low word of the result. The remaining registers serve to implement the multiply-add function. After the last cycle the next multiply operands a, b are applied, the redundant code of the high word is moved into the registers c, h instead, the low word is loaded into H, and C is cleared.
The new multiplication then continues to add to H while at the same time the high word of the previous result is added up bit by bit in the serial adder consisting of the full adder and the D flip-flop and applied to the carry-save adder through the upper bit of the H register. The individual product then takes n cycles, and the add operation to convert from the redundant code is pipelined with the next multiplication. It is at the end of k multiplications only that a single redundant code remains to be added up to a binary one. A serial counter-circuit can be used to include a third result word accumulating the overflows occurring in a series of multiply-add operations.
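Returning to the simpler scheme of Figure 4.5, the following behavioral sketch models such a serial multiplier for assumed 8-bit unsigned operands; the start/done handshaking and all names are illustrative and do not reproduce the control circuit omitted in the figures.

   library ieee;
   use ieee.std_logic_1164.all;
   use ieee.numeric_std.all;

   entity serial_mul8 is
      port (clk, start: in std_logic;
            a, b: in unsigned(7 downto 0);
            p: out unsigned(15 downto 0);
            done: out std_logic);
   end serial_mul8;

   architecture behavior of serial_mul8 is
   begin
      process(clk)
         variable ar, br, h, l: unsigned(7 downto 0);
         variable sum: unsigned(8 downto 0);
         variable cnt: integer range 0 to 8 := 8;
      begin
         if rising_edge(clk) then
            if start = '1' then
               ar := a; br := b; h := (others => '0'); cnt := 0;
            elsif cnt < 8 then
               if br(0) = '1' then               -- add b_i * a to the high word
                  sum := ('0' & h) + ('0' & ar);
               else
                  sum := '0' & h;
               end if;
               l   := sum(0) & l(7 downto 1);    -- shifted-out bit enters the low word
               h   := sum(8 downto 1);           -- divide the partial sum by 2
               br  := '0' & br(7 downto 1);      -- advance to the next bit of b
               cnt := cnt + 1;
            end if;
            if cnt = 8 then done <= '1'; else done <= '0'; end if;
            p <= h & l;
         end if;
      end process;
   end behavior;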
Figure 4.6  Serial carry-save multiplier-accumulator (MAC)
4.6 DISTRIBUTED ARITHMETIC

Sums of products

   g = \sum_{k=0}^{r-1} A_k x_k

with constant coefficients A_k, i.e. considered as functions of the x_k only, can be evaluated for small r quite efficiently using look-up tables, yet not using the entire multi-bit binary or fixed point codes of the x_k to address the table, but the individual bits x_{k,i} for the same, arbitrary bit position i. Then fairly small look-up tables addressed by r bits suffice. For r = 3..6 such tables are offered by the cells of current FPGA chips. Therefore this approach yields an efficient implementation for multiple multiply-and-add operations on such chips. For all k,

   x_k = \sum_{i=0}^{n-1} x_{k,i} 2^i

hence

   g = \sum_{i=0}^{n-1} s_i 2^i   with   s_i = \sum_{k=0}^{r-1} x_{k,i}·A_k = F(x_{0,i}, ..., x_{r-1,i})

F is the function of r Boolean inputs defined by F(b_0, ..., b_{r-1}) = \sum_k b_k·A_k. It outputs n′-bit words with n′ > n due to the multiple add function and is realized using a table.
Figure 4.7  Serial computation of a dot product using distributed arithmetic
Then, g is summed up serially using the Horner scheme with the sequential structure in Figure 4.7, which is similar to Figure 4.5:

   g = 2^n · 1/2(F(x_{0,n-1}, ..., x_{r-1,n-1}) + 1/2( ... + 1/2(F(x_{0,1}, ..., x_{r-1,1}) + 1/2·F(x_{0,0}, ..., x_{r-1,0})) ... ))

The operand bits x_{k,i} need to be input serially, using e.g. shift registers clocked synchronously with the add and shift steps. For every bit of the result of F, a look-up table from a separate FPGA cell is used. Thus the required memory is taken from the FPGA configuration resources. This also allows an easy change of coefficients without having to redesign the circuit structure and its mapping. The number r of products is limited by the size of the available look-up tables. Longer input sequences can, however, be handled by breaking the g sum into small parts and summing up the outputs of the corresponding tables, or by combining several cells to obtain larger look-up tables. It is not possible, however, to reuse the same length-r hardware circuit for the different sub-sequences, as the coefficients in the LUTs cannot be changed easily; this amounts to reconfiguration, which takes some time if it is supported at all (only some Xilinx FPGA families can serially load a LUT from an application circuit). If, however, the same sum-of-products operation has to be applied to a block of vectors, the swapping of the LUTs can be performed once per block only and may become affordable. The hardware effort in using the circuit in Figure 4.7 is considerably less than using r serial multipliers in parallel and adding up their results, and the speed is higher. A trick to cut the required memory by a factor of two (cf. [3]) is to represent the signed x_k inputs in the form

   x_k = 1/2(x_k − (−x_k)) = 1/2 \sum_{i=0}^{n-1} (x_{k,i} − /x_{k,i}) 2^i − 1/2
and to replace F by

   F(b_0, ..., b_{r-1}) = 1/2 \sum_{k=0}^{r-1} (b_k − /b_k)·A_k
Then

   g = \sum_{i=0}^{n-1} F(x_{0,i}, ..., x_{r-1,i}) 2^i − 1/2 \sum_{k=0}^{r-1} A_k
F satisfies F(/x_{0,i}, ..., /x_{r-1,i}) = −F(x_{0,i}, ..., x_{r-1,i}). The table for F therefore only needs to provide the entries for x_0 = 0; x_0 is only used to complement the other bits and to change the add to a subtract operation, using an add/subtract circuit instead of the adder.
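The following simulation sketch models the table-based evaluation for r = 4 assumed coefficients and unsigned 8-bit operands; the package and function names, the coefficient values and the MSB-first software loop are our own choices, and the signed variant with the halved table is not included. It reproduces the arithmetic of Figure 4.7 rather than its serial hardware structure.

   library ieee;
   use ieee.std_logic_1164.all;
   use ieee.numeric_std.all;

   package da_pkg is
      function da_eval(x0, x1, x2, x3: unsigned(7 downto 0)) return integer;
   end da_pkg;

   package body da_pkg is
      type int4  is array (0 to 3) of integer;
      type lut_t is array (0 to 15) of integer;
      constant A: int4 := (3, -5, 7, 11);          -- assumed constant coefficients A0..A3

      function make_lut return lut_t is            -- F(b3 b2 b1 b0) = sum of the Ak with bk = 1
         variable t: lut_t;
      begin
         for k in 0 to 15 loop
            t(k) := 0;
            for j in 0 to 3 loop
               if ((k / 2**j) mod 2) = 1 then t(k) := t(k) + A(j); end if;
            end loop;
         end loop;
         return t;
      end make_lut;

      constant F: lut_t := make_lut;

      function da_eval(x0, x1, x2, x3: unsigned(7 downto 0)) return integer is
         variable acc, addr: integer := 0;
      begin
         for i in 7 downto 0 loop                  -- one bit plane per step, MSB first
            addr := 0;
            if x0(i) = '1' then addr := addr + 1; end if;
            if x1(i) = '1' then addr := addr + 2; end if;
            if x2(i) = '1' then addr := addr + 4; end if;
            if x3(i) = '1' then addr := addr + 8; end if;
            acc := 2 * acc + F(addr);              -- shift the previous sum, add the table entry
         end loop;
         return acc;                               -- equals A0*x0 + A1*x1 + A2*x2 + A3*x3
      end da_eval;
   end da_pkg;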
4.7 DIVISION AND SQUARE ROOT

The division of an unsigned 2n-bit number a by an n-bit number b such that a < 2^n b is computed in a sequence of operations conditionally subtracting products 2^k b from a as long as the result is positive, for k decreasing from n − 1 to 0. The sum over those powers 2^k for which the subtraction was carried out is the result of the division. The condition of an intermediate difference a′ satisfying a′ ≥ 2^k b corresponds to the carry bit being set after performing the binary subtract operation. Thus the divide step can be realized by using a binary subtract circuit and a selector between a′ and a′ − 2^k b controlled by the carry bit, and outputting the carry bit to the kth bit position of the result. Due to the serial composition of the operations, a serial execution on the same conditional subtract circuit can be nearly as fast. The carry outputs during the time steps can be shifted into a result register. After performing the algorithm the remainder of the division is obtained as the result of the subtract operations. Equivalently to subsequently subtracting 2^k b from an intermediate difference a′ one can subtract 2^n b from 2^{n-k} a′. This corresponds to shifting a′ to the left instead of shifting b to the right. The conditional subtract operation can also be realized with an add/subtract circuit controlled by a function select input again. If the subtract operation a′ − 2^k b sets the carry, then in the next step a′ − 2^k b will be changed by subtracting 2^{k-1} b, otherwise by adding it, as a′ − 2^{k-1} b = (a′ − 2^k b) + 2^{k-1} b. A common circuit for the sequential computation is shown in Figure 4.8, again shifting a′ to the left instead of b to the right. a is first loaded into the shift register L, the carry flip-flop C is set, and H is cleared. Then a first subtract is performed, inputting from the lower n − 1 bits of H and the highest bit of L and storing the result in H, and shifting L to the left using the carry as input into the lowest bit of L (L is used for the result bits as well). After a total of n add/subtract and shift operations the result of the division stands in L. A final conditional add operation is required to obtain the remainder of the division in H. As in the circuits discussed before, the circuit starting and generating the clock sequence is not shown, only the data registers and the add/subtract circuit (the so-called data path). The same principle of performing a binary search for a maximum solution x of an equation f(x) ≤ a (f being a monotonic function, f(x) = b·x for the division) can also be applied to calculating the square root of a number a (here, f(x) = x·x). One adds decreasing powers 2^k starting with k = n/2 − 1 as long as the square of the sum is less than a, keeping track of the difference d of a and the square of the sum s. In every add step d decreases by 2^{k+1} s + 2^{2k} (or, 2^{-1-k} d decreases by s + 2^{k-1}), which can be realized with conditional subtract operations again.
Figure 4.8  Binary n-bit serial divide circuit
For the square root of fixed point numbers a with 0.25 ≤ a < 1, another common procedure is the approximation using the iteration b_{i+1} = 1/2(b_i + a/b_i), starting from b_0 = 1, which converges quadratically but requires division.
4.8 FLOATING POINT OPERATIONS AND FUNCTIONS

Floating point operations are a lot more complex than fixed point ones, especially for the high precision types. They are, however, algorithmically derived from fixed point operations, and thus covered by the preceding sections. Special design tricks can be found in [44]. A common optimization in digital computations is to substitute floating point computations by fixed point ones after an analysis of the required precision. If the required frequency of floating point operations is low enough, they become a candidate for serial execution on a single or on a few processors and even for software implementations, i.e. for synthesizing them from fixed point operations executed on simple programmable processors. The product of two normalized floating point numbers r1 = m1·2^s and r2 = m2·2^t with 1 ≤ m1, m2 < 2 is r1·r2 = m1·m2·2^{s+t}. The following operations have to be performed:
• Extract exponents and form e = s + t.
• Extract mantissas and form the fixed point product m = m1·m2.
• If m ≥ 2 then normalize m and increment e.
• Round m.
• Extract signs and derive the sign of the result.
• Pack the bit fields.

Of these, the multiply operation is the most demanding one. It is a fixed point operation only requiring the most significant part of the product to be computed. If s ≤ t, the sum is derived from the formula r1 + r2 = (m1·2^{s-t} + m2)·2^t and requires the following operations:
• Extract exponents, order according to s ≤ t.
• Denormalize m1 (multiply by 2^{s-t}).
• Perform the fixed point add/subtract operation according to the sign bits to get m = m1·2^{s-t} ± m2.
• Normalize and round m and correct the exponent accordingly.
• Pack the bit fields.
m may be small due to bit extinctions, and the normalization step requires a multi-bit shift depending on the number of sign bits. Extinction introduces an error that cannot be read off from the code. While binary and signed binary adder and multiplier building blocks can be composed to yield add and multiply operations on words of arbitrary sizes, the floating point operations are not easily extended to higher precision operations. According to [54], the true sum of the rational numbers exactly represented by two floating point codes a, b with |a| ≥ |b| can be represented as

   a + b = (a ⊕ b) + ((a ⊖ (a ⊕ b)) ⊕ b) = (a ⊕ b) + (b ⊖ ((a ⊕ b) ⊖ a))

where ⊕, ⊖ denote the (rounded) floating point add and subtract operations. The rounding error of the floating point add operation corresponding to the places of b lost during the denormalization step can thus be computed in terms of floating point operations only (although not efficiently). Such a double word representation of a number as a sum of two floating point numbers can also be obtained for the true product of two numbers, using an operation that divides a floating point code into a sum of two half mantissa size codes for which the floating point product is exact (or one clearing the lower half of the mantissa). The operations to compute the rounding errors as floating point codes are useful floating point building blocks that can be generated as by-products of the implementations of the rounded floating point operations quite easily, but are not available on standard processors. Using them, full double precision operations can be derived from single precision ones, and the accumulation of rounding errors can be tracked to some degree. For the non-normalized floating point format proposed in section 1.1.1 a simple heuristic to keep track of the precision is to form products to the minimum number of places of the operands. For the logarithmic encoding, fairly efficient add and multiply operations have been derived [82]. The square root and reciprocal functions are evaluated using the algorithms in section 4.6 or iterative procedures starting from low precision approximations taken from a ROM table. The standard transcendental functions are all evaluated from polynomial or rational approximations on certain intervals and extended from there by means of functional equations [47]. The sine function e.g. is approximated on [0, π/2] and extended using its symmetry and periodicity. Higher precision floating point formats require higher order polynomials, the allowed approximation error being of the order of magnitude of the least significant bits of the mantissa. For the sine and exponential functions there are special CORDIC algorithms exploiting their special properties that compute their results by means of the fixed point add, subtract, compare and shift operations only, and that are most suitable for FPGA implementation [46, 82] (cf. exercise 8). For the evaluation of polynomials p(x) the Horner scheme is attractive as it eliminates the multiplications needed to form the powers of x and their storage:

   p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_{n-1} x^{n-1} = a_0 + x(a_1 + ... + x(a_{n-3} + x(a_{n-2} + x·a_{n-1})) ... )        (11)
It combines the multiply and add operations in a different way from the MAC structure in section 4.5, namely by multiplying the intermediate result by x and then adding the next coefficient instead of adding a product to the intermediate result, and must therefore be supported by a different circuit structure. The coefficients are constant and may be supposed to be scaled so that for arguments x ∈ [−1, 1] the
result may be represented in a given fixed point format on [−1, 1]. After every multiplication rounding is supposed to be applied so that all intermediate results can be represented in the same format. A circuit realizing this composite operation would be a useful building block for realizing functions through polynomial approximations. For rational approximations, other primitives, e.g. the expressions (a + 1/x) or (ax + b)/(cx + d) can be considered. In some applications, it is sufficient to provide ROM function tables for selected arguments of certain functions only. Starting from a table, the values at intermediate arguments may be derived by some low-order polynomial interpolation to a desired precision.
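The following behavioral VHDL fragment is a minimal sketch of such a building block, a sequential Horner evaluator for equation (11). It assumes a 16-bit fixed point format on [−1, 1) for x, the coefficients and the intermediate results, a coefficient stream applied with a(n−1) first, and rounding by truncation; the interface and the scaling are assumptions made for the illustration, not a circuit given in the text.

library ieee;
use ieee.numeric_bit.all;

entity horner is
  port ( clk, start : in bit;
         x, coeff   : in signed(15 downto 0);   -- coefficient stream, a(n-1) first
         p          : out signed(15 downto 0) );
end horner;

architecture behav of horner is
  signal acc : signed(15 downto 0) := (others => '0');
begin
  process(clk)
    variable prod : signed(31 downto 0);
  begin
    if clk'event and clk = '1' then
      if start = '1' then
        acc <= (others => '0');                 -- clear before the first coefficient
      else
        prod := acc * x;                        -- multiply the intermediate result by x
        acc  <= prod(30 downto 15) + coeff;     -- truncate to the format and add a(i)
      end if;
    end if;
  end process;
  p <= acc;
end behav;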
4.9 POLYNOMIAL ARITHMETIC

For a primitive binary polynomial p(X) = Σ pi X^i of degree n the multiplication of polynomials mod(p) is the composite operation of taking the remainder of the product polynomial after its division by p. The powers X^k mod(p) of the polynomial X exhaust the m = 2^n − 1 non-zero polynomials of degree < n and can be used to encode the numbers k with 0 ≤ k < m by their coefficients, as proposed in section 1.1.1. For larger k, the remainders of the X^k repeat mod(m) (X^m = 1 mod(p)). For this encoding, the increment operation is particularly simple. If

X^k = g0 + g1 X + · · · + g(n−1) X^(n−1)   mod(p)

then

X^(k+1) = g0 X + g1 X^2 + · · · + g(n−2) X^(n−1) + g(n−1) (p0 + p1 X + · · · + p(n−1) X^(n−1))   mod(p)
The tuple representing the number k is thus shifted and XORed with the shifted-out bit g(n−1) at the positions of the non-zero coefficients of p. This is the operation of a shift register with XOR feedback from the output shown in Figure 4.9. The remainder of X^(m−1) increments to the polynomial 1, and the increment operation repeats after m steps. If the number of non-zero coefficients of p is small, the feedback shift register is less complex than e.g. a synchronous binary counter. Apart from the simplicity of these feedback circuits, shift register sequences generated in this way also have interesting statistical properties and can be used to generate pseudo-random numbers [83]. The general add operation is more complex, however. It is the product X^r * X^s = X^(r+s) that corresponds to the add operation modulo m of the exponents. Let X^r and X^s be the polynomials

g(X) = g0 + g1 X + · · · + g(n−1) X^(n−1),   h(X) = h0 + h1 X + · · · + h(n−1) X^(n−1)

Figure 4.9 mod(7) counter automaton based on the polynomial 1 + X + X^3 (three D flip-flops holding g0, g1, g2 with XOR feedback)
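As a concrete illustration, the following behavioral VHDL sketch describes the mod(7) counter of Figure 4.9 for p(X) = 1 + X + X^3; the reset behavior, the initial state and the port names are assumptions made for the example.

entity lfsr_counter is
  port ( clk, reset : in bit;
         q : out bit_vector(2 downto 0) );      -- coefficients g0, g1, g2
end lfsr_counter;

architecture behav of lfsr_counter is
  signal g : bit_vector(2 downto 0) := "001";   -- g(0) = g0: initial state X^0 = 1
begin
  process(clk)
  begin
    if clk'event and clk = '1' then
      if reset = '1' then
        g <= "001";
      else
        -- multiply by X: shift up one position and feed the shifted-out bit g2
        -- back at the non-zero coefficients of p (positions 0 and 1)
        g(0) <= g(2);
        g(1) <= g(0) xor g(2);
        g(2) <= g(1);
      end if;
    end if;
  end process;
  q <= g;
end behav;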
Figure 4.10 Polynomial multiplication (*: AND, +: XOR)

Figure 4.11 Computing the remainder of the polynomial q divided by p(X) = 1 + X + X^3
Thus (g0, . . . , g(n−1)) and (h0, . . . , h(n−1)) are the codes of r, s. The product polynomial is given by

g(X)h(X) = Σ(i=0..n−1) Σ(j=0..n−1) gi hj X^(i+j) = Σ(k=0..2n−2) ( Σi gi h(k−i) ) X^k
The add operation in the field B is the XOR operation. The X^k with k ≥ n must be substituted by their remainders modulo p, which depend on p. If the polynomial p(X) = 1 + X + X^n is used (for some n it is primitive), then X^k = X^(k−n) + X^(k−n+1) mod(p). The total number of XOR operations to compute the code of r + s from the codes of r and s is then n^2 − 1; they can be arranged so that the processing time is logarithmic. An application would be to compute relative addresses in the case that subsequent address codes to read out a memory are generated by a shift register implementing the above increment operation (see section 5.1.3).

Multiplications of binary polynomials and remainder computations can be carried out serially using shift registers. The coefficients gk are applied to an (n − 1)-bit shift register in reverse order (starting with g(n−1) at the input and all flip-flops reset to 0). Every shift operation moves the coefficients to the left so that g(k+i) appears at the output of the ith stage. After n − 1 clocks, g0 is applied to the input, and g1 . . . g(n−1) stand in the register. Thereafter, zeroes are applied to the input. The shift register taps are multiplied by the coefficients h(n−1), h(n−2), . . . to generate the coefficients q(k+n−1) = Σi h(n−1−i) g(k+i) of the product g*h in reverse order, starting from q(2n−2) (Figure 4.10). These are fed into a second shift register with feedback from the output according to the primitive polynomial p to perform the substitution of X^n appearing at the register output by its remainder (Figure 4.11). After 2n − 2 shift clocks, the result code stands at the inputs of the flip-flops of the second register.
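A behavioral sketch of the second of these shift registers, the remainder circuit of Figure 4.11 for p(X) = 1 + X + X^3, is given below; the product coefficients are applied serially at q_in, highest degree first, and the clear input and port names are assumptions made for the example. After the last coefficient has been clocked in, the remainder stands in the register.

entity poly_remainder is
  port ( clk, clear, q_in : in bit;
         r : out bit_vector(2 downto 0) );
end poly_remainder;

architecture behav of poly_remainder is
  signal reg : bit_vector(2 downto 0) := "000";
begin
  process(clk)
  begin
    if clk'event and clk = '1' then
      if clear = '1' then
        reg <= "000";
      else
        -- (reg * X + q_in) mod p: the shifted-out bit reg(2) stands for X^3
        -- and is substituted by its remainder 1 + X
        reg(0) <= q_in  xor reg(2);
        reg(1) <= reg(0) xor reg(2);
        reg(2) <= reg(1);
      end if;
    end if;
  end process;
  r <= reg;
end behav;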
4.10 SUMMARY The Boolean functions realizing arithmetic operations on number codes have been found to have moderately complex algorithms. Binary arithmetic could be mostly constructed from
full adder components and their generalizations, the mod(2^n) adders. With some hardware effort, faster add circuits could be obtained. While parallel multiply circuits do not use the gates efficiently, serial multipliers reusing a single mod(2^n) adder were found to use it quite efficiently. Other serial circuits were demonstrated for computing sums of products and for division.
EXERCISES

1. Formally verify that the algorithms given for the n-bit unsigned and signed binary add and subtract operations with an (n + 1)-bit result are correct.
2. Write out the behavioral and structural VHDL definitions of an n-bit adder with a carry input and an overflow output.
3. Give a structural VHDL definition for a fast 16-bit incrementer using multiple AND gates to compute the carry bits.
4. Verify the signed digit add algorithm in section 4.3, give a structural VHDL definition for a generic n-digit adder, and determine its gate count as a function of n.
5. Work out a 16-bit serial MAC as in Figure 4.6 accumulating 1024 products with an additional 10-bit overflow word into a VHDL entity. The 16-bit operands will be read from external memories. Extend to signed multiply-add operations.
6. Design a combined multiply-and-add circuit for the evaluation of polynomials similar to the structure in Figure 4.6. After the n carry-save cycles the result of the multiplication by x is added bit by bit to the coefficient and immediately applied as the b operand to the next multiply.
7. Implement a 4-bit extended adder component with handshaking signals as a VHDL entity and use it as a component in an n-bit multiplier circuit.
8. For a number p with 2^(n−1) < p ≤ 2^n the remainders mod(p) can be represented as binary n-bit numbers. Design a serial mod(p) multiplier that computes the product of two remainders in n cycles. Show how this multiply operation can be derived from the binary n-bit multiply operation with a 2n-bit result in the special case of p = 2^n − 1.
9. A CORDIC algorithm to compute the trigonometric functions of some n-bit fixed-point input x ∈ [0, 1] starts by initializing a 2 × 2 matrix A to the unit matrix and subsequently multiplying the matrices

   ( 1        −2^(−i) )
   ( 2^(−i)    1      )

   to it (slightly scaled rotations by αi = atan(2^(−i))) if xi ≥ αi, where the sequence (xi) is defined by x0 = x, x(i+1) = xi − αi if xi ≥ αi, x(i+1) = xi otherwise. After n steps, A is close to a scaled rotation by x, and sin(x), cos(x) and tan(x) are easily derived from it. The matrix multiplications only involve shift and add operations. Derive a serial data path performing the described operations.
5 Sequential Control Circuits
After discussing the data paths of several circuits designed to be used sequentially during an algorithm we now discuss the auxiliary circuits needed to orderly run through the sequential steps, to supply the necessary function codes, and to implement the control flow and synchronization, starting from a schedule for the operations to be executed on the circuit. The circuits in sections 4.5–7 and 4.9 require the generation of trains of clock pulses to go through the steps of the computation after some starting event, and of control signals to redirect the data during some clock cycles. The times at which the control signals change can be defined by the L-H transitions of some clock signal, e.g. a free-running periodic waveform. Elementary circuits that change their outputs in response to a clock event are the register or, more generally, the feedback circuit built around a register as already shown in Figure 2.20 and called an automaton. Circuits executing a moderate number of steps or repeating their operation periodically can be controlled by simple automata that use an application-specific transition function implemented by a minimized algorithm. For more complex circuits that need long, irregular sequences of control and operand select codes, methods to implement complex automata are required. Complex automata can be composed from simpler ones. A general method to generate arbitrary sequences is by reading them from a memory table. This approaches the structure of a processor reading instructions from a program memory that was already hinted at in the previous chapters.
5.1 MEALY AND MOORE AUTOMATA

An automaton is fully characterized by the Boolean function f that at the end of the ith clock cycle outputs the bit pattern for the next step. The setting xi of the register is called the state of the automaton. The automaton starts in an initial state x0. The further output sequence is determined by this state and the further input sequence, the next state after reaching the state xi being determined from xi and the input ei by

xi+1 = f(xi, ei)     (1)

Figure 5.1 Mealy automaton: state register (clock input clk) with a combined next-state/output function s′ = f(s, e), a = g(s, e)
The feedback circuit in Figure 2.20 is a special case of the more general Mealy automaton (or ‘finite’ automaton) shown in Figure 5.1 where the output is not simply taken from the register but computed as a Boolean function of the state s and the input e. The automaton in Figure 2.20 is special because the output does not depend on the input but only on the state. An automaton with this property is called a Moore automaton. It is still more special as the output is taken directly from the state flip-flops. However, by adding an output pipelining register to a general Mealy automaton one obtains an automaton as in Figure 2.20 that outputs the same sequence of data (up to a shift by 1). If E is the set of all possible binary input patterns, O the set of output patterns and S the set of states of a Mealy automaton (E, O and S can also be thought of as abstract sets encoded by bit patterns), the automaton is thus characterized by the functions:

f: E × S → S   (computing the next state)
g: E × S → O   (computing the actual output)
which may also be thought of as a single, combined function taking values in O × S. f is called the transition function of the automaton. The behavior of an automaton is the mapping of finite input sequences (ei) to the corresponding output sequences (ai) defined by

ai = g(ei, xi)     (2)
where the state sequence (xi ) is defined by equation (1), starting from the assumed, initial state x0 . It is the behavior of an automaton which is specified for an application such as controlling a sequential circuit. The states and the transition function are not explicit in the behavior. Two automata are called equivalent if they exhibit the same behavior. The finite automata are not restricted to auxiliary functions such as the control of sequential computations but can be considered as computational structures by themselves. The state transition function is arbitrary and can be arithmetic. The data paths in Chapter 4 without the required control circuit are automata, and the full structure in Figure 1.11, composed of a control automaton, a compute circuit, the input and output selectors and registers all using the same clock, is also a finite automaton. The task of designing a control automaton for a sequential circuit can be thought of as composing the sequential circuit which is an automaton by itself with another one to arrive at an automaton with a more convenient behavior. Any Boolean function g can be considered as an automaton (one without storage), and the D flip-flop is one with a special, trivial transition function and the property that the output does
Figure 5.2 Decomposing the Mealy automaton
not depend on the actual input (the property defining the Moore automaton). The behavior of a function is such that the actual output does not depend on previous inputs but only on the actual one. The behavior of the flip-flop is to shift the input bit sequence (b0, . . . , bn−1) to (a0, b0, . . . , bn−2) for some fixed a0. If two finite automata are given, serial and parallel composition can be applied to arrive at a composite, more complex automaton. In the case of Boolean functions (considered as automata) this amounts to applying functional composition (cf. section 1.2.1). Then, starting from the elementary Boolean functions and the D flip-flop, complex automata can be constructed. The n-bit parallel register is the parallel composition of n D flip-flop automata, and the n-bit shift register is their serial composition. In a network of automata resulting from serial and parallel compositions, feedback may be applied from an output to an input as long as this output does not depend on this input (the next state, however, may depend on it). The general automaton in Figure 5.1 is a composition of three automata, namely the two Boolean functions f, g computing the next state and the output and the register automaton X which is itself composed of still simpler ones (Figure 5.2). Generally, every feedback cycle must pass through a Moore automaton.

For a network of automata with the sub-automata characterized by the functions (fi, gi): Ei × Si → Oi × Si the total state space is the Cartesian product of the individual state spaces Si, and the total transition function is derived from the fi, gi according to the interconnection of the component automata. Reciprocally, introducing a network structure roughly corresponds to representing the input and the state spaces as product spaces so that the components of the transition function and the output function only depend on selected components of their arguments. As in the case of functions, we define the complexity of an automaton constructed from selected types of component automata as the total number of component automata used for it. For the design of automata the problem arises of constructing an automaton with the minimum complexity and a given, desired behavior. The optimization may involve steps expanding the state memory but simplifying the Boolean functions.

Simple storage structures such as the parallel register, the shift register, the FIFO and LIFO buffers, the addressable memory and the CAM (cf. section 2.2.2), which serve to store data and to retrieve the same data later, in some particular order and in response to some control inputs, will be referred to as storage automata. The general automaton in Figures 5.1 and 5.2 can be given some additional structure by allowing the use of other types of storage automata in the place of the register X. An automaton including a LIFO buffer in place of a register is called a stack automaton in the literature [72]. In order to design a simple Mealy automaton with a small state space it is common to describe the states and transitions by means of a state diagram or a state transition table. The states are then encoded as binary numbers, and the transition function is read off the list or the state diagram and realized as a circuit to implement the automaton. More complex automata must be constructed from simpler ones, including multi-bit storage automata that may still
have a simple data interface and universal Boolean functions (memory tables). This will be dealt with in the next sections.

A complex automaton such as a digital computer cannot be adequately described and understood by defining its next state and output as a function of the input and the contents of all its memory cells, due to the very large state space. A finite automaton will eventually respond to every periodic input sequence with a periodic output sequence, maybe by assuming a final state and output that remain constant. For an automaton as simple as a 64-bit binary counter, the state space has 2^64 elements. If the counter increments at 1 GHz, its period will last about 500 years. It should be evident that the operation of an automaton does not depend on a periodic clock signal to define the transitions but can be based on any sequence of input events. A handshaking signal can be used as well to define the input events of the automaton and to trigger its transitions. With such control signals an automaton can output at a sub-sequence of the input events only, or expand its response to an input to several output cycles.
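To make the structure of Figures 5.1 and 5.2 concrete, the following behavioral VHDL fragment sketches a Mealy automaton with a one-bit state register and combinational functions f and g; the particular transition and output functions are a made-up example, not taken from the text.

entity mealy is
  port ( clk, e : in bit;
         a : out bit );
end mealy;

architecture behav of mealy is
  signal s : bit := '0';             -- state register, initial state x0 = '0'
begin
  process(clk)
  begin
    if clk'event and clk = '1' then
      s <= s xor e;                  -- next state s' = f(s, e)
    end if;
  end process;
  a <= s and e;                      -- output a = g(s, e) depends on state and input
end behav;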
5.2 SCHEDULING, OPERAND SELECTION AND THE STORAGE AUTOMATON

Once a number of operations have been selected to be executed in sequence on the same circuit, a schedule needs to be set up for these, satisfying the requirement that an operation depending directly or indirectly on the result of another one must be executed later. For every operation, the operands must be passed to the input of the circuit, and the result must be passed to some target circuit or storage element. In general, there are many schedules with this property. A simple procedure to find one of them is to subdivide the set of operations L into disjoint subsets L0, L1, L2, . . . so that L0 are those operations not depending on any operation in L, and the operations in Li+1 only depend on operations in L0 ∪ . . . ∪ Li. Then any schedule starting with the operations in L0, then going through the operations in L1, etc., has the desired property. More generally, several circuits may be used to carry out operations that are pre-assigned to them or can even be executed by any of them. If all of them input and output synchronously (using the same clock signal) the above procedure to find a schedule can be applied to each of them, delaying operations that depend on results from another circuit appropriately (Figure 5.3). The operands for the operations come from the external inputs of the digital system, from other circuits (maybe via storage elements, i.e. registers), or are the results of previous operations on the same circuit that have been stored in the registers of a storage automaton. A register may supply the same value to several operations, and be used to store another value if the
Figure 5.3 Simultaneous scheduling of operations for two compute circuits C1 and C2
previous one is no longer needed. The needed number of input registers both read and written to by the circuit can be easily derived from the schedule by assigning to every value the index of the operation after which it is available, and the one for which it was used the last time. Then for every index the number of values computed before and still needed by the present operation or a later one can be determined, and the maximum of these numbers is the number of required registers. It depends on the chosen schedule; this schedule may be selected to use as few registers as possible. For values input from another circuit that is also used for several operations the number of registers may be minimized and determined in a similar way; we will simply assume it to be given, and that every register used for intermediate storage is written to by a unique circuit (in order to avoid input multiplexers for registers). The most common structure for the storage automaton supplying the operand inputs and receiving the intermediate results is the addressable bank of registers. Then the input selection to the circuit is made by means of selector circuits using select addresses generated by some control circuit for the individual operations (cf. Figure 1.11). Disadvantages are that for a long, irregular sequence of operations the generation of select addresses may be complex, and that the input selection from several sources costs extra processing time for the select circuit. In order to avoid these, do the following:
• Use as few input registers as possible to select from or outputs to write to.
• Preload these in a pipelined fashion from other inputs using secondary selection circuits (this is possible if the value to be used was not calculated immediately before).
• Combine several registers into storage sub-automata for storing and retrieving values in particular sequences without requiring inputs to select between them.

To clarify the latter, we consider some special cases. First, a single register allows multiple reads of the same value after a write. Then, in the case that two values a, b are written in sequence and that subsequent reads of a and b are performed in the same order, a FIFO interconnection of two registers can be used in which the read operation transfers the contents of the front register to the other one. The LIFO buffer (stack) interconnection of the registers would output the values in reverse order. If these particular orders apply, the operand selection from the registers is carried out automatically by the multi-value storage structure. Other multi-value storage structures are deeper FIFO or LIFO buffers, the shift register interconnection of several registers, or special automata that would e.g. output the ‘a’ value twice before repeatedly outputting b. The secondary selection from a larger number of storage elements corresponds to introducing a data memory structure as well as the primary registers directly outputting operands. If the load operations from this cannot be fully pipelined, the selection from the memory will slow down the sequential computation. It is not practical to supply the wide select addresses required for a large memory in parallel from the control circuit for the operations (this is called ‘absolute’ addressing). Instead, the address generation for a small set of addresses is placed inside the storage automaton by using extra address registers (‘indirect’ addressing). Then, only a select code for the address registers is required from the control circuit. Moreover, for memory accesses to sequential addresses (as defined by some address code increment function), the address registers can be implemented as counters (more generally, as simple sub-automata) that automatically generate the required sequences without further input from the control circuit (Figure 5.4).
Figure 5.4 Composite storage automaton
5.3 DESIGNING THE CONTROL AUTOMATON The control circuit for a sub-circuit that is to be used several times to perform several steps in an algorithm has the following tasks:
• For every step it outputs control codes to the storage automaton to select the inputs from external signals, from other sub-circuits, or from storage elements within the storage automaton, and to select the output register or sub-automaton to store data needed at a later time. To select from a bank of registers, the control circuit needs to generate a k-bit input address and an l-bit output address for every step (k, l ≥ 0).
• For every step it also generates an m-bit function code if the circuit is a multifunction circuit. It thus generates a sequence of prescribed outputs that is as long as the number of steps.
• It implements the control flow by skipping the steps in the unselected branches. The condition for a branch must be encoded by a binary signal that is valid at the time of the next step and must be input to the control circuit to cause the skipping.
• It properly delays output addresses if the circuit to be controlled is pipelined.

The control circuit will be an automaton inputting the branch condition bits that cause the skipping of computational steps, and directly outputting the k + l + m control bits as a function of its state bits, or even directly outputting them from its state bits. An automaton of this kind will be referred to as a control automaton. There may be auxiliary state bits not used to encode the address and function code outputs, but needed to distinguish states using the same output pattern but incompatible follow-on output sequences. The state of the control automaton is then composed of k + l + m output bits and n auxiliary bits. To design a control automaton generating the desired sequence of control codes (which is uniquely defined after setting up the schedule for the operations and assigning select codes to the i/o registers), one first needs to determine the auxiliary control bits needed to distinguish states with the same control and select outputs but different output sub-sequences starting from them.

The simplest case is for a circuit implementing a single operation only and inputting and outputting data from the same sources and to the same destination all the time. Then there are no control or select outputs. The control automaton still needs some auxiliary bits to distinguish (to count) the individual operations in order to be able to stop the computation. As an example, we consider the serial multiplier in Figure 4.5 where a single-function add circuit with fixed input and output registers is applied n times in series for every multiply. The input is supposed to be synchronized with the clock signal and to be signaled by a handshake signal. To start a multiplication, the input request signal is supposed to become H for one clock cycle. The multiplier moves the b input into an operand shift register, and the output register
for the high word must be cleared. Then the circuit performs n add/shift cycles. At the end, it becomes ready for the next input. The states of the control automaton are:
• idle/initial state, waiting for input;
• compute states 0 . . . n − 1, output the input acknowledge signal.

They are repeated cyclically. As there are no branches, this list form is a convenient way to describe the states and transitions. In the present case, a state bit can be used to indicate the idle state, and a binary counter may be used to cycle through the compute states. During the idle state, the clock is gated off for the counter and for the output registers. If there are control and select outputs, they may be shared to distinguish the computational steps. For a multifunction circuit offering the operations ‘+’ and ‘−’, the non-branched sequence of operations

+  +  −  −  +  +  −  −  . . . etc.
periodically selecting one of the operands from four different inputs needs one auxiliary bit in addition to the function code. As the sequence repeats after the fourth operation, two state bits suffice to distinguish these steps; they can in turn be used to select the input, and one of them can be used to encode the operation. In general, a sequence of p ≤ 2^n operations or a period of such will require up to n auxiliary bits. The choice of the auxiliary bits and their use to distinguish the steps in the prescribed code output sequence are not unique. This can be optimized by using as few auxiliary bits as possible, and then setting up the table for the transition function and deriving a minimized algorithm for it as in section 1.3.2, or simply by selecting the number n of auxiliary bits so that p ≤ 2^n and associating a unique bit pattern with every step to be performed. In this latter case the bit patterns encode the index of the operations in a list of all of them. The bits output by the automaton for a particular operation can then be computed from the index of the operation, and the control inputs operate by changing the sequence of indices (Figure 5.5). By leaving off the output register the control automaton becomes a Mealy automaton using the auxiliary bits as state bits. The output as a function of the index bits can be realized by setting up a table for it and deriving a minimized algorithm, or with a read-only memory (the universal function) holding the table and using the index for the address input. If no branching occurs, the index steps through a sequence of different n-bit patterns. The auxiliary register outputting the index with the feedback generating the next index in sequence is referred to as the program counter of the control circuit (the ‘program’ is the content of the read-only memory addressed by it). The sequencer automaton is thus composed of the counter automaton and the program function or table automaton.
Figure 5.5 Control code sequence generator with optional output register
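A minimal behavioral VHDL sketch of such a sequencer follows; it assumes a 4-bit table index and an 8-bit control code, and the table contents shown are placeholders rather than codes taken from the text. The registered code output corresponds to the optional output register in Figure 5.5.

entity sequencer is
  port ( clk : in bit;
         code : out bit_vector(7 downto 0) );
end sequencer;

architecture behav of sequencer is
  type code_rom is array (0 to 15) of bit_vector(7 downto 0);
  constant table : code_rom := ( 0 => "00010001",   -- step 0: function/select code
                                 1 => "00100010",   -- step 1
                                 2 => "01000100",   -- step 2
                                 others => "00000000" );
  signal index : integer range 0 to 15 := 0;        -- program counter
begin
  process(clk)
  begin
    if clk'event and clk = '1' then
      index <= (index + 1) mod 16;                  -- increment function
      code  <= table(index);                        -- registered code output
    end if;
  end process;
end behav;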
Figure 5.6 Control automaton using code decompression
The generation of the control circuit starting from the schedule of operations using any of the approaches indicated above is a well-defined procedure that can be automated. For short sequences of operations and just a few output bits, one will use optimized Boolean circuits to implement the next state function while for long sequences and more control outputs a program counter and a program memory is suitable, which apart from size parameters is a standard structure that can generate arbitrary control sequences. For a fixed design procedure the costs of the control circuit can be computed for the given schedule of operations and be put in relation to the cost reduction through the multiple use of the compute circuit. If the application allows the same control circuit to be used to control several identical compute circuits in parallel, the relation becomes more favorable (this is called SIMD parallel processing, single instruction multiple data). The hardware effort for the control circuit and the needed data registers and select circuits depends on the choice of the schedule. There may be different, equivalent schedules (for the same operations) that require a different number of registers to store intermediate results. Even the choice of the select codes for the function and the registers influences the complexity and can be changed to optimize the design. Sequential control can also be applied to a circuit that performs its operation in several time steps (e.g., a serial multiplier). The control may be hierarchic, starting the internal sequence of steps generated by an extra control automaton for every operation to be performed and waiting for its completion, or from a combined control circuit stepping through the sequence of partial steps for every operation to be performed. The partial steps may be put into a linear sequence or be invoked by the equivalent of a sub-routine call to a sequence of ‘micro code’ steps. The micro code words may be chosen wider than the original function and select codes (the ‘instructions’) in order to eliminate the decoding of control signals. The decoding of control signals from more compact instructions and the expansion of instructions into micro code sequences can be thought of as applying decompression to a compressed sequence of control codes. Generally, the techniques mentioned in section 1.1.2 to compress data such as variable size and run-length encoding can be applied. Then the control automaton becomes composed of a memory read automaton, a decompress automaton, and a code expansion automaton (Figure 5.6). The compression of long control code sequences through the use of sub-routines and loops is common programming practice but can also be automated.
5.4 SEQUENCING WITH COUNTER AND SHIFT REGISTER CIRCUITS

The sequence of n-bit binary numbers is output by a control automaton if in the state r the register input is r + 1 (‘+’ denoting the binary add operation on codes). Such a circuit (cf. section 2.1.2) has been called a synchronous counter as the register outputs change simultaneously, in contrast to the ‘ripple’ counter (Figure 2.27) where every flip-flop uses a different clock.
Table 5.1 Output of the pipelined binary counter

c2  o2  c1  o1  c0  o0
0   1   0   1   1   0
0   1   1   0   0   1
1   0   0   0   1   0
0   0   0   1   0   1
0   0   0   1   1   0
0   0   1   0   0   1
0   1   0   0   1   0
0   1   0   1   0   1
0   1   0   1   1   0
0   1   1   0   0   1
1   0   0   0   1   0
Figure 5.7 Pipelined binary counter
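The following behavioral VHDL fragment is a sketch of the pipelined counter of Figure 5.7 (its operation is discussed in the paragraph below); the generic width and the port names are assumptions, and the outputs of the higher bits appear skewed in time as listed in Table 5.1.

entity pipe_counter is
  generic ( N : integer := 3 );
  port ( clk : in bit;
         o : out bit_vector(N-1 downto 0) );
end pipe_counter;

architecture behav of pipe_counter is
  signal q, c : bit_vector(N-1 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if clk'event and clk = '1' then
      -- stage 0: half adder of q(0) and the constant 1
      q(0) <= not q(0);
      c(0) <= q(0);
      -- stages 1 .. N-1: half adder of q(k) and the registered carry c(k-1)
      for k in 1 to N-1 loop
        q(k) <= q(k) xor c(k-1);
        c(k) <= q(k) and c(k-1);
      end loop;
    end if;
  end process;
  o <= q;
end behav;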
The maximum clock rate of the synchronous counter is given by the processing time of the increment circuit (a little less if the timing of the register is taken into account). The n-bit ripple-carry or carry look-ahead increment algorithms require linear or logarithmic time (see section 4.2). If the circuits to be controlled are fast, this time for the increment operation might be too large. A fast binary counter with a maximum clock rate independent of n is obtained by pipelining the carry propagation using a total of 2n flip-flops (Figure 5.7). This circuit can also be interpreted as using a carry-save adder structure for the increment operation. Due to the pipelining, the output sequence for bit k is delayed by k clocks w.r.t. the sequence for bit 0 (see Table 5.1), which can be compensated by delaying output k by an (n−1−k)-bit shift register for every k. For the application of uniquely indexing operations by bit patterns the unshifted outputs can, however, be used as well. The pipelined counter has the useful additional feature of outputting on the final carry output a decoded signal which is H for a single clock period out of 2^n and L otherwise. In the pipelined counter clocked at the maximum rate the AND and XOR gates are used efficiently if every clock edge is thought of as starting a new computation with them. Another method to generate long sequences of n-bit indices at a high speed and with even less hardware effort is by using the polynomial increment operation implemented with a shift register as shown in Figure 4.9.
Table 5.2 Gray code sequence

step  b3 b2 b1 b0      step  b3 b2 b1 b0
0     0  0  0  0       8     1  1  0  0
1     0  0  0  1       9     1  1  0  1
2     0  0  1  1       10    1  1  1  1
3     0  0  1  0       11    1  1  1  0
4     0  1  1  0       12    1  0  1  0
5     0  1  1  1       13    1  0  1  1
6     0  1  0  1       14    1  0  0  1
7     0  1  0  0       15    1  0  0  0
For some n (e.g. n = 7, 15, . . . ), the special polynomial p(X) = 1 + X + X^n is primitive. Then the shift register sequence obtained with a single XOR gate in the feedback path steps through all 2^n − 1 non-zero n-bit patterns before it repeats. In some applications speed is less important than low power consumption. Power can e.g. be reduced by applying the technique of clock gating (see section 2.1.3). The carry signals in the pipelined counter can e.g. be used to gate the clocks of the next counter stage and their own, to be immediately reset. Lower power consumption can also be expected if the chosen sequence makes a smaller number of signal transitions during a count cycle. A binary mod(2^n) counter makes 2 * 2^n output bit transitions in every cycle of 2^n counts. In contrast, the sequence of n-bit Gray codes (cf. section 1.1.2) achieves the minimum of 2^n transitions only. The individual output bits of the Gray code sequence are actually delayed versions of the bit sequences for the binary code, yet starting from binary bit position one only (Table 5.2). The sequence at bit n−2 is a phase-shifted version of the sequence at bit n−1 that coincides with the binary case. The bit i < n − 1 that is going to change at the end of the current clock cycle is characterized by the conditions b(n−1) ⊕ . . . ⊕ b(i) = 0 and b(i−1) = 1 and b(j) = 0 for j < i − 1, which can be used to construct a Gray counter with gated clocks.
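The Gray code sequence of Table 5.2 can also be obtained from a binary counter through the standard conversion g(i) = b(i) xor b(i+1); the following combinational VHDL sketch shows this conversion (it is not the gated-clock Gray counter asked for in exercise 3, and the generic width is an assumption).

entity bin2gray is
  generic ( N : integer := 4 );
  port ( b : in  bit_vector(N-1 downto 0);
         g : out bit_vector(N-1 downto 0) );
end bin2gray;

architecture behav of bin2gray is
begin
  g(N-1) <= b(N-1);                  -- the top bit coincides with the binary code
  low: for i in 0 to N-2 generate
    g(i) <= b(i) xor b(i+1);         -- each lower Gray bit is the XOR of adjacent binary bits
  end generate;
end behav;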
5.5 IMPLEMENTING THE CONTROL FLOW

A non-trivial control flow causes some operations of the total sequence to be skipped in response to the control input. This only happens after particular operations where a branch occurs, while mostly the operations are executed in sequence. The control circuit in Figure 5.5 based on one of the sequence generators in section 5.4 as its ‘program counter’ can be extended to implement the control flow without giving up the simple feedback circuit generating subsequent table indices. It can do so by decoding a signal that is active for the operations causing branches and by using this signal combined with the control input (‘e’) to control a multiplexer in the feedback path that selects between the next index in the sequence and a non-sequential one that must be computed from the index and from e by an extra circuit (Figure 5.8). The output of this extra circuit is only used for the indices of operations preceding branches but is ignored for the other index patterns, which simplifies its implementation by a Boolean algorithm. The selector is controlled so that the non-sequential address is only selected when a jump is enabled and the branch condition is met.
Figure 5.8 Implementation of non-sequential jumps
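The next-index selection of Figure 5.8 can be sketched in behavioral VHDL as follows; the 4-bit index width and the form of the jump function (a target index supplied as an input) are assumptions made for the example, with jump_en standing for the jump enable decoded from the table and e for the branch condition.

entity next_index is
  port ( clk, e, jump_en : in bit;
         jump_target : in integer range 0 to 15;
         index : buffer integer range 0 to 15 );
end next_index;

architecture behav of next_index is
begin
  process(clk)
  begin
    if clk'event and clk = '1' then
      if jump_en = '1' and e = '1' then
        index <= jump_target;            -- non-sequential index
      else
        index <= (index + 1) mod 16;     -- sequential increment
      end if;
    end if;
  end process;
end behav;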
Figure 5.9 Separate control flow table
If the control outputs are generated with a ROM table that may generate arbitrary control sequences with a standard control circuit, then one wants to have this flexibility for the nonsequential addresses, too, and also read them from a memory. As the output of this memory is required for the branching indices only it cannot be used efficiently if the address input is the same as the index used for the operation (this amounts to using a wider memory). Any memory structure could, in fact, be considered that is fast enough to deliver both instruction codes for the compute circuit and jump addresses at the required rate. A common solution to this is to use the same memory for the control output and the non-sequential addresses and use extra clock cycles when the control output is left unchanged and the memory supplies the non-sequential address. In other words, dedicated jump instructions are inserted to realize the control flow, and the corresponding cycles are not used to supply the compute circuit with new instructions. The compute circuit is not used as efficiently then, but the memory structure is less costly. There may be wait conditions when the compute circuit cannot be used, anyhow. If a jump address is loaded during such a time the efficiency would not decrease. Another option for a memory-efficient realization of the control flow without losing computational cycles consists in using an extra memory that is dedicated to outputting nonsequential addresses with an extra input address generator that is only enabled for branching operations and actually performs an extra indexing for these (Figure 5.9). After performing a branch the jump table index is supposed to be updated, too, from the jump table. The extra memory and address generator constitute an extra control circuit dedicated to implementing the control flow. It may be used also to generate auxiliary instructions such as loading a constant from a memory table into an input register. A single jump enable control bit from the
function and control code table suffices to synchronize the two control circuits. Alternatively, a single memory with a higher bandwidth can be used to pre-fetch the jump and auxiliary instructions that are subsequently synchronized and handled in parallel to the compute cycles. Then an extra jump table index is not needed but the jump and the auxiliary instructions must be distinguished from the computational ones again, and buffer registers must be provided to store them and to adjust the memory read rate to the execution rate of the controlled circuit.
5.6 SYNCHRONIZATION

Once a particular operation is scheduled, the input data selected for that operation must first be waited for, if necessary. This can be done by inhibiting further clock edges until a handshake signal signals valid input, or by deriving the clock from such handshaking signals. If the compute circuit has a set of handshaking signals of its own, these must be connected to those of the selected input, too. Handshaking and waiting do not need to be implemented if it is known in advance that, at the time the operation is scheduled, the input data are ready. The control circuit can be designed to delay an operation for a fixed number of clock cycles, if necessary. This is particularly the case if the selected input is from a register storing the result of a previous operation of the same compute circuit. If an input is from a register written to by another circuit that is also controlled to perform a sequence of operations, no handshaking is needed if the register is written but not overwritten before it is selected as an input for the operation. This would be the case if both circuits use compatible schedules and the same clock. Otherwise, handshaking must be implemented, e.g. by using an extra flip-flop associated with a register that is set by the write operation and reset by the read. The waiting for the handshake can also be implemented by the control circuit performing conditional jumps in an idle loop. No result of the compute circuit is saved during this time, and no operands need to be selected. The synchronization can e.g. be realized by means of a control input to the incrementer that disables the increment function as long as the handshake has not occurred.
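A minimal VHDL sketch of such a flag flip-flop combined with its register is given below; the synchronous single-clock interface and the port names are assumptions, and the reading circuit would test valid before selecting the register as an operand.

entity hs_register is
  port ( clk, wr, rd : in bit;
         d : in  bit_vector(15 downto 0);
         q : out bit_vector(15 downto 0);
         valid : out bit );
end hs_register;

architecture behav of hs_register is
  signal reg : bit_vector(15 downto 0);
  signal v   : bit := '0';
begin
  process(clk)
  begin
    if clk'event and clk = '1' then
      if wr = '1' then
        reg <= d; v <= '1';        -- a write stores the value and sets the flag
      elsif rd = '1' then
        v <= '0';                  -- a read clears the flag
      end if;
    end if;
  end process;
  q <= reg; valid <= v;
end behav;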
5.7 SUMMARY With the circuits described in this chapter, dedicated serial computers and computer system components can be designed. The design of simple control automata is straightforward once the transition function has been set up. The control circuit architecture using a memory table to read off the control and select code sequences is universal, i.e. can be applied to all kinds of serial circuits. With such a standard architecture, the design procedure for the control circuit starting from a schedule for the operations to be performed is also straightforward. Selecting the operations to be executed on a single compute circuit and optimizing their scheduling so that an efficient system design results are difficult issues that have not been dealt with. The circuit examples and exercises show that a considerable hardware effort must be made for the serial control of compute circuits. There are cases in which a serial computation hardly saves any hardware costs due to the needed control circuit. Handshaking needs to be dealt with in serial computations and can be handled by the control circuits.
EXERCISES
1. The ‘game-of-life’ automaton is a Moore automaton having two states, ‘active’ and ‘inactive’. It has eight inputs, each inputting one of the values ‘active’ and ‘inactive’, and directly outputs its state. The inactive state changes to the active one if three of the inputs are active. The active state changes to the inactive one if 0, 1 or more than 3 inputs are active. Design a circuit realizing this behavior. In the game of life, automata of this type are arranged in a grid. Each receives its input from its immediate neighbors, and all automata perform their transitions synchronously. Then interesting activation patterns develop. Define the automata network using ‘for . . . generate’ statements.
2. Give a behavioral description of a binary serial 16-bit multiplier including the control automaton. The interface signals are:
   • the clock input;
   • an input request signal and a busy output;
   • 16-bit data inputs a, b and the 32-bit data output q.
3. Work out the circuit for a Gray counter using gated clocks. The multiple XOR and OR gates needed to compute the clock enable condition may be realized as chains of two-input gates shared for the different bit positions, and the clock gating with 4-input OR gates. Determine the energy for a full count cycle assuming a standard capacitance C on every input and every output. Compare to a synchronous binary counter.
4. Design a circuit (using VHDL) that cyclically adds up three numbers received from a port p1 in a parallel adder and outputs the result to a port q, and then three numbers from a port p2, also outputting to q. The ports all have bi-directional handshake signals.
5. Design a serial MAC circuit as shown in Figure 4.6 equipped with a control circuit for the computation of the dot product of two real vectors a, c both having the dimension of 256 (cf. section 1.6). The vector components are binary 16-bit numbers. The vectors can be read by the MAC circuit via two SRAM buses at the clock rate and using the clock signal as a chip enable signal, and the computation starts after the activation of an input request signal. After the 256 * 16 clock cycles for the carry-save and shift operations the redundant code is added up and output as a binary number via a 40-bit port.
6. Design a storage automaton that, after being written to with three values a, b, c, outputs the sequence a, a, b, b, c, c, c, c . . . in response to subsequent read commands.
6 Sequential Processors
This chapter deals with the design of programmable processors and builds on the techniques developed in the previous chapters. If all basic operations executed by a digital system are executed in sequence using a single multifunction ALU (arithmetic and logic unit) circuit controlled by a memory table-based automaton as in Figure 5.5, one arrives at the concept of a sequential processor controlled by an instruction memory. The CPU (central processing unit) of a processor system includes a single control automaton that is applied to control one or a few multifunction compute circuits constituting the ALU. An ALU capable of performing several operations in parallel will be able to execute the blocks of operations of a given data and control flow faster, provided that they include parallel compositions. According to the discussion in the previous chapter, the ALU circuit inputs from and outputs to a storage automaton, the storage capacity of which depends on the application, e.g. a single port RAM in conjunction with some registers loaded in sequence from it to provide the required number of operands in parallel. The ALU and the attached registers and associated select circuits constitute the data path within the CPU.

The instruction memory is a configurable, universal structure. It can hold arbitrary sequences of function and select codes and jump instructions; the resulting processor plus memory structure shown in Figure 6.1 can hence implement many algorithms. The system based on a single programmable processor may be slow but the hardware costs per operation are low. As long as the address space supported by the address generation circuits is not exhausted, the incremental hardware cost of a CPU plus memory system for an extra operation to be performed is one additional instruction memory location only. In contrast, for a network of compute circuits without sequential control, an extra compute circuit would have to be added and wired into the system for the same effect. The hardware cost advantage of using a sequential processor with an instruction memory is roughly described by the quotient of the average cost of a special function compute circuit by the cost of the sequential processor system (comprised of the multifunction ALU, the control circuit and the memory) divided by the number of instructions actually used.
Figure 6.1 Basic processor structure (simplified)
The memory of a processor-based system consumes most of the transistors and of the silicon area in it, which is partially compensated for by the fact that memory structures are very regular and can be integrated more tightly. Consequently, the cost of the fairly large memories in typical single CPU plus memory systems is similar to that of the CPU, but also some portion of the CPU costs are paid for the capability to support a large memory. A sequential processor is faster and more efficient, the more complex its ALU is. An algorithm for a particular function that is based on more complex operations will need a smaller number of steps. The compute circuits in recent CPU designs provide fairly complex arithmetic operations on 32- or 64-bit data words. Many applications, however, also need 8- or 16-bit operations on which a 64-bit ALU would be wasted. The use of a very simple compute circuit (performing e.g. the basic AND, OR, and NOT functions or even conditional jumps only) lets the memory-based sequential control become inefficient. After properly choosing the compute circuit for a CPU design, the emphasis must be on reducing the memory size requirements as much as possible, and making sure that the control circuit is as simple as possible while letting the compute circuit work at close to its maximum rate. The operation of the ALU depends on the function codes obtained from the control automaton as well as on the input data selection. The technique at hand to raise the ALU rate is to employ pipelining of the ALU operation and of the memory accesses for data and instructions and to take care that the latter perform close to or faster than the ALU operation. With such enhancements, however, the control circuit excluding the memories tends to become more complex than the compute circuit itself, although the overall efficiency, including the memory costs, does become higher this way. Instead of investing in the control circuit and using a complex processor with a large memory, one may opt to use several compute circuits and simple control circuits only, in particular ones that use small memories. The individual processor is then a ‘simple’ CPU, and the total system is a network of such. Such networks which can be realized both at the chip and at the board level will be considered in Chapter 7. Another motivation to consider simple CPU structures is their integration into FPGA designs as standard controllers for sequential FPGA functions and to support software functions. There the memory block sizes and the hardware resources are very restricted, anyhow. Most micro controllers and integer DSP chips found in embedded digital systems are fairly simple processors, too. Our discussion of CPU structures will therefore concentrate on those that do not involve complex auxiliary circuits. Simple processor designs are presented in many books on digital design or HDL (e.g., in [49]), yet mainly as didactical vehicles to explain the basic structures.
Current mainstream CPU designs are not simple at all but support quite large memories (e.g. of up to 2^32 bytes), and have to further enhance the control circuit to cope with the large addresses and the increased access times to a large physical memory that tend to be much longer than the processing time of the compute circuit. They offer the programmer the benefit of an almost unlimited space for instructions and data. More information on the design of complex processors can be found e.g. in [48].
6.1 DESIGNING FOR ALU EFFICIENCY

The ALU should provide the elementary operations for the intended domain of applications and be supported by the data path structure and by the control automaton so that it can execute operations at a high rate, ideally close to its possible throughput. ALU efficiency depends on the selection of elementary operations and the design of the ALU, and on the implementation of a pipeline supplying instructions and operands to the ALU. The ALU efficiency is paid for with an increased hardware effort for the auxiliary circuits, which must be put in relation to the performance gain.
6.1.1 Multifunction ALU Circuits

A multifunction circuit is particularly attractive if all operations executed by it efficiently use the gates, or if a dedicated circuit executing a particular operation O would be almost as complex as the multifunction circuit. Let hO ≤ 1 be the quotient of the gate counts of the dedicated and the multifunction circuits. In order to take the partial usage of its hardware resources into account, the efficiency e of the usage of the multifunction circuit (i.e. the actual frequency f of its applications divided by the maximum one, fm) needs to be multiplied by the sum ΣO hO fO / f taken over all operations of the circuit, where fO denotes the frequency of every particular operation O (and f = ΣO fO), to obtain the ‘true’ efficiency

et = ΣO hO fO / fm

which is less than e. Obviously, the most frequent operations should have a high hO level. On the other hand, an infrequent operation with a low hO does not damage the true efficiency but may be advantageous for the total efficiency of the system (including the hardware effort for the control circuits) as the operation shares a single controller with the others and may support faster algorithms. If there are several frequent operations not sharing resources, they should be considered for parallel execution. The efficiency thus depends on the mix of operations which therefore needs to be determined for the intended range of applications.

As pointed out before, for the sake of efficiency the provided arithmetic operations should be complex and apply to wide data words. If such are needed for a particular range of applications, they do not have to be synthesized as sequences of smaller word operations. If, on the other hand, wide operations are not needed, efficiency can be maintained by packing several small words into a wide one and executing the small word operations in parallel on them. Thus the same instructions control parallel, identical operations (i.e., SIMD operations). If no wide operations are needed, SIMD processing can still be applied to raise the efficiency by
providing k identical ALU circuits operating on n-bit words and executing the same instruction in parallel on combined k*n-bit data words. ALU circuits found in general purpose microprocessors usually provide a set of operations on operands of a particular word size (e.g. 16 or 32 bit) including the following:
• binary arithmetic operations such as add, subtract, negate; some processors also provide multiply and divide operations, and even an extra set of floating point operations;
• bit field SIMD operations that apply the Boolean AND, OR, NOT and XOR operations in parallel to the corresponding bit components of two data words;
• shift operations on data words (shift left/right, shift right signed, rotate left/right).

Obviously, these operations differ in their complexity and execution time. A complex operation such as multiplication might use pipelining or rely on a sub-circuit used serially under the control of an extra, embedded automaton. A general purpose processor would support binary numbers of arbitrary sizes by providing fixed size operations from which wider operations can be composed, e.g. add operations with carry inputs and outputs and multiply operations with a double size result. The ability to implement arbitrary Boolean functions can also be supported by single bit Boolean operations that receive their operands from selected bit positions of the multi-bit data words used otherwise. These are implemented on some micro controllers in addition to the bit field SIMD operations; the bit field operations are actually set operations (union, intersection, etc.) on the set of bit positions of the data word. We note that the bit field XORf and ORf operations can be realized as compositions of the ANDf and of binary add and subtract operations using the formulas

   a XORf b = (a + b) − 2*(a ANDf b)
   a ORf b = (a XORf b) + (a ANDf b) = (a + b) − (a ANDf b).

As an example of a fairly simple multifunction circuit that can be used efficiently for several different arithmetic operations and also supports some Boolean ones, we consider the parallel mod(2^16) add/subtract circuit obtained from an adder and a bank of XOR gates complementing the 'b' input (Listing 6.1) in response to a control signal 'op'. As well as these arithmetic operations it also supports the ANDf operation selected with a signal 'andf'. It uses a 16-bit adder component like the one in Listing 4.1. The chosen adder implementation mostly determines the processing time and the efficiency of this ALU circuit. If the operation rate of a CPU built around this ALU is not limited by the control and memory circuits, the use of a faster adder will increase the overall efficiency of the processor system.

entity ALU16 is
  port ( a, b: in bit_vector(15 downto 0);
         c_in, op, andf: in bit;
         s: out bit_vector(15 downto 0);
         c_out: out bit );
end ALU16;

architecture behav of ALU16 is
  component ADD16
    port ( a, b: in bit_vector(15 downto 0);
           c_in: in bit;
           s: out bit_vector(15 downto 0);
           c_out: out bit );
  end component;
  signal bi, si: bit_vector(15 downto 0);
  signal cy: bit;
begin
  adder: ADD16 port map (a, bi, c_in, si, cy);
  bi <= b when op = '0' else not b;
  c_out <= cy;
  s <= si when andf = '0' else a and bi;
end behav;

Listing 6.1  Simple ALU circuit
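As a complement to Listing 6.1, the following sketch illustrates the sub-word SIMD idea discussed above: a 16-bit adder whose carry chain can be broken into two independent 8-bit lanes by a control signal. It is only a minimal illustration written with numeric_std arithmetic rather than an ADD16-style component; the entity name, ports and lane partitioning are illustrative assumptions, not part of the book's design.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity SIMD_ADD16 is
  port ( a, b  : in  std_logic_vector(15 downto 0);
         split : in  std_logic;   -- '1': two independent 8-bit additions
         s     : out std_logic_vector(15 downto 0);
         c_out : out std_logic );
end SIMD_ADD16;

architecture behav of SIMD_ADD16 is
begin
  process(a, b, split)
    variable lo, hi : unsigned(8 downto 0);   -- 8-bit lane sums with carry
    variable cl     : unsigned(0 downto 0);   -- carry passed between the lanes
  begin
    lo := resize(unsigned(a(7 downto 0)), 9) + resize(unsigned(b(7 downto 0)), 9);
    if split = '1' then
      cl := "0";                              -- break the carry chain between the lanes
    else
      cl := lo(8 downto 8);
    end if;
    hi := resize(unsigned(a(15 downto 8)), 9) + resize(unsigned(b(15 downto 8)), 9) + cl;
    s     <= std_logic_vector(hi(7 downto 0)) & std_logic_vector(lo(7 downto 0));
    c_out <= hi(8);
  end process;
end behav;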
It was shown in Chapter 4 that the binary arithmetic operations including multiply and divide can all be realized with an add/subtract circuit. Hence, depending on whether independent inputs are applied, the same data to both, or a zero operand to one of them, and from which outputs the result is read off, this ALU circuit can be used for the operations:

(a) Add/subtract two 16-bit binary numbers, optionally using the carry.
(b) Add/subtract steps in the serial multiplication or division.
(c) Compare two 16-bit numbers (≤), test an operand to be zero.
(d) Multiply a 16-bit number by two (shift left), optionally using the carry.
(e) Increment or pass a 16-bit number (rounding), decrement, negate.
(f) NOTf, ANDf, ORf and XORf (in multiple steps), ANDf combined with NOTf.
Only the operations (d), (e), (f) have low hO values. The ANDf operation is quite different in its nature and its applications from the add operation and cannot use most of the gates of the add circuit even if a synthesis tool uses AND gates from within the adder. If the bit field operations were required with a higher frequency than the arithmetic operations, an ALU just providing these and deriving the arithmetic operations from them could be used more efficiently. A simple extension to this ALU circuit is to subdivide the 16-bit adder into two 8-bit adders with separate carry outputs. The circuit then supports dual 8-bit add and subtract operations as well. The dual add can in turn be used for a dual 8-bit multiply operation that would execute efficiently in eight cycles.
The binary n-bit add and subtract operations are only partially defined as their result may overflow to an (n + 1)-bit number (cf. section 4.2). ALU circuits are usually designed to signal the event of an overflow by storing the overflow condition in some state flip-flop so that a conditional branch to an instruction sequence handling the overflow error can be performed if desired. Instead of computing the signed and unsigned overflow conditions, it is as simple to store the additional, (n + 1)st result bit F as part of an intermediate result. A subsequent add/subtract operation can then be applied to an extended, (n + 1)-bit operand E including F as its (n + 1)st bit, and may happen to have an n-bit result again. With this approach, overflow checking is only needed after every second operation (with another intermediate result bit, the checking rate can be reduced to every fourth operation). Table 6.1 lists the various overflow and result bits, denoting by C the carry output of the adder, and by O the signed overflow condition (5) in section 4.2. S denotes an unextended, n-bit operand.
Table 6.1  Overflow conditions with simple and extended add/subtract operands

Unsigned operations
Operation       Overflow       Result bit n + 1
S + S -> S      C
E + S -> S      F or C
S + S -> E                     C
E + S -> E      F and C        F xor C
S - S -> S      /C
E - S -> S      /F xor C
S - S -> E                     /C
E - S -> E                     /F xor C
S < S           /C
E < S           /F and /C

Signed operations
Operation       Overflow       Result bit n + 1
S + S -> S      O
E + S -> S      O1 xor O
S + S -> E                     sign SX
E + S -> E      OX             sign SX
S < S
E < S
Another example of a highly versatile multifunction circuit is the parallel MAC (multiplier-accumulator) consisting of a parallel multiplier for signed and unsigned binary numbers with a double-size result connected to the input of a parallel adder that adds the product to the contents of an accumulator register. The same operation is performed by the serial MAC (section 4.5). The word size of the adder is extended to provide additional precision in order to avoid overflow, e.g. to the triple word size. The serial MAC is built on a single parallel adder and can be designed as an extension of the above ALU structure. The MAC circuit performs the following operations (a sketch follows the list):

(a) signed, unsigned or mixed multiply;
(b) signed, unsigned or mixed multiply and add/subtract;
(c) shift an input word (multiply by a power of 2);
(d) shift and add/subtract;
(e) add/subtract (an input multiplied by 1).
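The following is a minimal sketch of such a parallel MAC, reduced to the signed multiply-and-accumulate case (b); the 48-bit (triple word size) accumulator provides the guard bits mentioned above. The entity, its ports and the signed-only handling are illustrative assumptions, not the book's circuit.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity MAC16 is
  port ( clk, clr, en : in  std_logic;
         a, b         : in  signed(15 downto 0);
         acc          : out signed(47 downto 0) );
end MAC16;

architecture behav of MAC16 is
  signal acc_r : signed(47 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if clr = '1' then
        acc_r <= (others => '0');           -- clear the accumulator
      elsif en = '1' then
        -- the 32-bit product is added to the triple-size accumulator
        acc_r <= acc_r + resize(a * b, 48);
      end if;
    end if;
  end process;
  acc <= acc_r;
end behav;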
(c), (d) and (e) have small hO values in a parallel implementation (in a serial implementation, add and subtract are as efficient). As sub-functions of (a) and (b) they do not affect the efficiencies of the latter, however. The signed, unsigned and mixed variants of the multiply operations use the same resources. If multiply operations are needed at a low frequency only, their serial execution using the adder function of an ALU would be more efficient. The operations can be used in many applications. (b) is used extensively for the computation of dot products of vectors and to process digitized signals. (a) and (b) are steps in multiply operations on binary codes spanning multiple words, and (d) can be used as a step to compute the sum of floating point numbers with a multi-word mantissa field. The multiple words of the result are computed one-by-one starting from the least significant one. This can be supported by a data path providing a result register (an accumulator) with a shift words function from the most significant word to the middle word and from the middle word to the least significant
one. Floating point operations implemented this way use the multiplier hardware efficiently. Faster implementations of floating point operations using parallel and pipelined circuits are common on high performance processors. Such processors can only be used efficiently if the floating point operations are executed with a high relative frequency. The 16-bit MAC circuit also supports 16-bit integer data efficiently. Other complex operations besides the multiply-add are the polynomial evaluation step in the Horner scheme or a similar step in evaluating a rational function. Other more complex operations involving several adds and multiplies tend to be more application-specific. Parallel implementations can be considered for them in application-specific programmable processors if the adds and multiplies mostly feed each other, and data input and output interfaces of the data path do not inflate too much (cf. section 8.2.4). Efficient use of the ALU gates for different operations can also be achieved by rearranging sub-circuits via electronic switches, i.e. by making the ALU configurable. Subdividing an adder or multiplier to perform an SIMD operation on smaller words corresponds to selecting a different circuit configuration for the ALU. A complex ALU would be composed of sub-units that perform useful operations by themselves and might be combined in various fashions. A configurable ALU may provide a number of predefined configurations (e.g. using a multiply– add circuit to support both the dot product and polynomial evaluation) or a general configurable FPGA-like resource of arithmetic building blocks like the extended adders in section 4.4. The sizes and kinds of sub-circuits and the switch network to implement different configurations must be properly chosen. The switches are control circuits that tend to degrade the overall efficiency. The control signals for them would need to be loaded from the instruction memory (maybe as a sequence of instructions for the individual subunits) or an extra one and stored in one or several alternative configuration registers (Figure 6.2). The configurable structure would only be attractive if most of the sub-circuits can be used efficiently for the intended algorithms. A related approach is found in the VLIW (very large instruction word) processors. These provide multiple arithmetic circuits and a wide instruction word with operand select fields for each of them that control the data exchange between the circuits individually for every instruction. In contrast, the configurable ALU would preload a configuration to be applied several times without having to supply the configuration data again and again from a wide memory. ALU circuits composed of a few multiplier and adder components can be derived automatically for a repeated sequence of individual add and multiply instructions, or alternatively determined at compile time as in [84]. In contrast to an SIMD structure, the sub-circuits of a configurable ALU or a VLIW processor may execute different operations at a time and directly output data to each other.
Figure 6.2 Configurable ALU with multiple compute elements (C), selectors (S), registers (R)
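As an illustration of Figure 6.2, the following sketch shows a single compute element with its input selectors and result register; the select fields would be preloaded into a configuration register as described above. The choice of operations, the operand sources and all names are illustrative assumptions and not the book's design.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity CFG_CELL is
  port ( clk          : in  std_logic;
         in0, in1     : in  signed(15 downto 0);          -- from the input registers
         fb0, fb1     : in  signed(15 downto 0);          -- from other cells' result registers
         sel_a, sel_b : in  std_logic_vector(1 downto 0); -- from the configuration register
         op           : in  std_logic;                    -- '0' add, '1' multiply
         r            : out signed(15 downto 0) );
end CFG_CELL;

architecture behav of CFG_CELL is
  signal a, b : signed(15 downto 0);
begin
  -- input selectors (S) controlled by the preloaded configuration
  with sel_a select a <= in0 when "00", in1 when "01", fb0 when "10", fb1 when others;
  with sel_b select b <= in0 when "00", in1 when "01", fb0 when "10", fb1 when others;
  process(clk)
  begin
    if rising_edge(clk) then                              -- result register (R)
      if op = '0' then
        r <= a + b;
      else
        r <= resize(a * b, 16);                           -- product reduced to the word size
      end if;
    end if;
  end process;
end behav;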
6.1.2 Pipelining

ALU efficiency results from the most frequent usage of the ALU and from the ALU circuit itself operating efficiently for the selected operations. For the ALU circuit to be efficient, pipelining needs to be applied unless its circuit depth is low. Usually, pipelining is not applied to fixed point adder/subtract circuits and bit field operations, but it sometimes is to fixed point parallel multipliers, and it usually is in floating point implementations. To simplify the discussion, we consider the first case only (no pipelining within the ALU). To operate at the maximum throughput (which for the non-pipelined ALU corresponds to its execution time), new operands and function codes must be delivered immediately after finishing an operation. This only works if these data are loaded in parallel to the current operation, i.e. pipelined to it. If we assume that the operands are loaded from a bank of registers, then the following sequential steps are involved in every instruction and are hence potential pipeline stages:
• Apply address to instruction memory and load instruction from memory.
• Unpack (decode) control bits from instruction.
• Access selected register operands.
• Perform ALU operation.
• Store result to selected register.

There may be additional steps to access a larger and slower data memory device. In simple cases the decode operation may be much shorter than the ALU operation, and for a small number of registers the same applies to the register accesses. Then the above steps collapse to as little as two. Most current processors have fairly complex instruction and register sets and use 4 or 5 level pipelines. The highest speed processors running at clock rates of several GHz apply pipelining more extensively and use up to 20 pipeline stages with as few as 5 gate delays per stage [48], including multiple steps for the ALU operations. According to the discussion in sections 1.4.1 and 1.5.3, the pipelining only yields close to maximum efficiency if the individual pipeline stages need about the same processing time. It is thus necessary to select the pipelining stages so that all take about the same processing time as the ALU. If the instruction memory turns out to be much slower, then its width can be increased to deliver several instructions at a time that are first stored in registers and then executed in sequence at a higher speed.
Pipelining is sensitive to data and control dependencies. If the result of an instruction is an operand for the next, the next one must wait until the result has been stored so that it can be selected from the register. Then the ALU must wait for one or more pipeline clocks before it can continue, and the ALU rate becomes significantly reduced. Some designs provide an extra path to the ALU input to avoid the time step to access the operand registers. If a branch is performed by a processor using a k-stage pipeline, then at the time of the non-sequential instruction fetch k − 1 instructions of the wrong branch are already in the pipeline and must be cancelled. After this 'break' of the pipeline, it takes k cycles until the next ALU operation finishes. Several measures can be taken to reduce the resulting slow-down. One is branch prediction, i.e. selecting the most probable branch to be taken and fetching the subsequent instructions from it. The information on which branch is more probable can be encoded into the instruction (then it becomes available after the decode stage), be based on some heuristics (the backward branch from the end of a loop to its beginning is more probable), or be obtained by
dynamically collecting it (storing the history of the most recent branches). High-performance processors starting several instructions in parallel on several compute circuits without strictly respecting their order in the instruction sequence add speculative execution for the instructions logically following the branch [48]. Their effect can only be validated when the branch has been decided on. Another method called simultaneous multithreading (SMT) is to feed the ALU with instructions from alternative instruction sequences (threads) to keep it active. If a branch inhibits one of them from keeping the pipeline filled, the instructions from the other can fill the gap. Both methods involve a considerable hardware effort. A much simpler method consists in placing the jump instructions in front of others that are still preceding the branching and letting them take effect on the further fetching of instructions only after the others have been performed (delayed branches). By providing the memory bandwidth to load the branch instructions in parallel with the ALU instructions (or to pre-fetch them into some instruction buffer) and handling them in parallel in the branch processing part of the control circuit, no ALU cycles get lost. Finally, branching overheads and the related pipelining overheads are also avoided by conditional ALU operations that only change a register if a control input to the ALU representing some condition is active. If the condition does not hold, only a single cycle is wasted.
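The 'extra path to the ALU input' mentioned above is a result-forwarding (bypass) multiplexer. The following is a minimal sketch of such a bypass in front of one ALU operand; the register addresses, the buffered previous result and all names are illustrative assumptions.

library ieee;
use ieee.std_logic_1164.all;

entity FWD_MUX is
  port ( rf_out         : in  std_logic_vector(15 downto 0); -- operand read from the register file
         alu_result_r   : in  std_logic_vector(15 downto 0); -- buffered result of the previous instruction
         src, dest_prev : in  std_logic_vector(3 downto 0);  -- source and previous destination register addresses
         wr_prev        : in  std_logic;                     -- previous instruction writes a register
         operand        : out std_logic_vector(15 downto 0) );
end FWD_MUX;

architecture behav of FWD_MUX is
begin
  -- if the operand is the result just computed, bypass the register file
  operand <= alu_result_r when (src = dest_prev and wr_prev = '1') else rf_out;
end behav;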
6.2 THE MEMORY SUBSYSTEM

The programmable processor systems' instruction and data memories contribute to the cost of the system. We need to consider the size and speed requirements for them and their interfacing to enable the ALU to be used efficiently. For a particular application of a processor the sizes of the memories may be reduced to the actual needs, but for a multi-purpose processor design a standard memory size must be supported by the control circuit. The hardware complexity of the CPU grows with the number of address signals to be generated. Also, the memory access times increase with the number of address bits due to the decoding and the larger distances within the memory array. Only small, on-chip memories can provide very fast access times similar to the processing time of the ALU. The access time to data in memory can be reduced by loading many bits in parallel, i.e. using a larger memory word size at the expense of having more signals to route between the CPU and memory (on-chip high-speed memories in current processors load up to 256 data bits in parallel). Moreover, a wider memory requires fewer address lines to achieve the same capacity and can be accessed faster due to the reduced decoding. If the memory word size is smaller than the instruction or data word size, instructions and data must be assembled from multiple read operations, which translates into a longer access time. Ideally, the required data words and instructions can all be loaded in parallel to the continuous sequence of ALU operations.
One of the questions to be answered is how much memory must be supported on a general purpose CPU design. Modern standard CPU designs opt for very large address spaces to support arbitrary programs and data structures and pay for this with a considerable effort to reduce the resulting memory access times. A multi-level memory hierarchy is introduced, starting with the register set (usually considered as a part of the data path), continuing with the small on-chip memories (caches) required to allow accesses at the ALU rate, with larger off-chip memories, and eventually with slow, serial mass memories, all of which are mapped to a single and uniform memory space.
6.2.1 Pipelined Memory Accesses, Registers, and the Von Neumann Architecture

For pipelining the instruction fetches, the loaded instruction is stored into a pipelining register as already shown in Figure 5.5 to allow the next instruction or other information to be read while the current one is being executed. The operands fetched from the data memory are stored in pipelining registers, too, to allow further data memory accesses for the next instructions to occur in parallel. If there are several operands, they may have to be fetched in sequence depending on the organization of the data memory. From the registers provided within the data path, operands are accessed in parallel, and so is the write access for the result. As remarked in section 6.1.2, the clocking rate of the pipeline should be close to the ALU processing time, and the memory system should deliver instructions and the required data at this rate. The ALU hence obtains its inputs from pipelining registers and also deposits its results in registers to allow the next operation to be executed in parallel to storing the result of the previous one to memory, if required. If the input registers already contain the right operands, the load from memory operation can be skipped, and if the result of an operation is an input for one of the next, the store operation may be skipped assuming that the output register can be selected as an ALU input, too. If there is a choice of input and output registers (more generally, storage automata) within the data path, the instruction code must encode which ones to select.
The higher the number of registers, the more the number of data memory accesses is reduced (simple applications may not need an extra data memory at all), but the operand selection from the registers takes more time and may require an additional pipeline stage and corresponding extra registers in order not to slow down the ALU operation. Then the register set becomes second in the memory hierarchy. The registers serve to adjust the access time for the (usually) two operands to the ALU execution time. If the data selection from them requires an extra pipeline stage, then no further execution time penalty results from using as many registers as can be selected from during a pipeline clock period. There are, however, extra hardware costs for the registers and for their addressing (see below) that have to be put in relation to the resulting gain in performance. The number of registers is selected so that the required data memory bandwidth is sufficiently reduced. It has become common to use 16 or more general purpose (data and address) registers and a pipeline stage for their access.
The simplest structure for an ALU with up to two input operands and a single result word is to provide an output register A (an 'accumulator') also providing one of the ALU inputs and another pipeline register E for the input from a memory (Figure 6.3). Figure 6.4 shows the extension of this structure by a register file. In general, there is a data memory that is connected to the registers of the data path. Instead of using separate data and instruction memories accessed in parallel via separate buses, one can use a single memory bus for both and time multiplex the data and instruction accesses.
Figure 6.3 2-input ALU with pipeline registers E and A
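A minimal sketch of the structure in Figure 6.3 follows: the pipeline register E captures the next memory operand while the accumulator A executes the current operation. The ALU is reduced to add/subtract here, and all names and control signals are illustrative assumptions.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity EA_PATH is
  port ( clk      : in  std_logic;
         load_e   : in  std_logic;              -- capture the next memory operand into E
         exec     : in  std_logic;              -- perform the current ALU operation
         sub      : in  std_logic;              -- '1' = subtract, '0' = add
         from_mem : in  signed(15 downto 0);
         to_mem   : out signed(15 downto 0) );
end EA_PATH;

architecture behav of EA_PATH is
  signal e_r, a_r : signed(15 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if load_e = '1' then
        e_r <= from_mem;                        -- operand fetch overlaps with execution
      end if;
      if exec = '1' then
        if sub = '1' then
          a_r <= a_r - e_r;                     -- the accumulator supplies the second ALU input
        else
          a_r <= a_r + e_r;
        end if;
      end if;
    end if;
  end process;
  to_mem <= a_r;                                -- the accumulator can be stored back to memory
end behav;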
Figure 6.4 Multi-port register file with pipeline registers
Figure 6.5 von Neumann architecture
This is the von Neumann architecture for programmable sequential processors (Figure 6.5), in contrast to the Harvard architecture that uses separate memories. Then ALU instructions, control flow instructions and data read and write accesses alternate, and the input pipeline register can even be shared by them. To compensate for the resulting lower performance, the common memory for instructions and data has the cost advantage that just a single memory bus needs to be implemented (and the strange by-product that self-modifying programs are supported that overwrite some of their instructions with computed data and then execute them). A von Neumann computer with a high bandwidth memory will be as fast as a Harvard machine, and a Harvard machine interfaced to a single dual-port RAM can overwrite its instructions; the difference in the organization of memory was, however, considered important during the development of computer technology. Most processors now use separate on-chip memories and a unified off-chip memory. Some enhancements can then be made to the control circuits to reduce the penalty for sharing a single memory (see section 6.3.2); such extra effort is much cheaper than using a second memory. With a simple, single port memory interface performing a single word access in each processor cycle, the performance of a von Neumann system may be lower whereas the efficiency may be higher due to the reduced costs. The performance does not drop very much, however. An instruction requiring a memory operand must wait for this memory access until it can execute, and then there is still time enough to fetch the next instruction. Also, data memory accesses may be less frequent than instruction accesses if there are several registers.
The code word sizes for instructions and data will in general be different. The data word size is usually related to the sizes of the operand and result words of the ALU. For a memory adapted to the data word size, wide instruction codes may have to be assembled from several short word read operations in sequence. The resulting penalty in access time can be reduced by using variable size codes with the most frequent ones being encoded within a single memory word. A Harvard processor has a somewhat simpler memory interface as the data and instruction addresses do not have to be multiplexed, and the memories may have different
widths. The extra bandwidth provided by separate memories (or a dual port memory) allows parallel data fetches as well as parallel, pipelined data load and store and instruction fetch operations and extra memory accesses for inserting or extracting input and output data (see section 6.4).
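The dual-port memory mentioned above can be sketched as follows: one port serves data reads and writes while a second read port delivers instructions in the same cycle, which is also how FPGA block RAMs are typically inferred. The sizes and names used here are illustrative assumptions.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity DP_RAM is
  port ( clk                 : in  std_logic;
         we                  : in  std_logic;
         data_adr, instr_adr : in  unsigned(11 downto 0);
         din                 : in  std_logic_vector(15 downto 0);
         dout, instr         : out std_logic_vector(15 downto 0) );
end DP_RAM;

architecture behav of DP_RAM is
  type mem_t is array (0 to 4095) of std_logic_vector(15 downto 0);
  signal mem : mem_t;
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(data_adr)) <= din;      -- data write port
      end if;
      dout  <= mem(to_integer(data_adr));      -- registered data read
      instr <= mem(to_integer(instr_adr));     -- parallel instruction fetch
    end if;
  end process;
end behav;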
6.2.2 Instruction Set Architectures and Memory Requirements

Typical ALU operations have two operands and one result, all of the same word size. If there is a choice of 2^k data registers for the operands and the result, an instruction needs 3k bits to independently specify the register operands; more bits are needed for the operation and to distinguish other classes of instructions (e.g., jump instructions). A common way to reduce the instruction size and to slightly simplify the hardware at the expense of some flexibility in the use of the registers is to always use the first operand register for the result. Then only two register addresses are specified using 2k bits, and the instruction set is said to have a two-address architecture (in contrast to three-address). If the operands of an operation are still needed for a subsequent operation, the value in the destination register must then be copied to another one or later be reloaded from memory.
If an operand comes from memory, the instruction needs to specify its memory address and a register address. If a full address code takes n bits, n + k bits are needed for the operands and the result. Instructions with more than one absolute memory address are hardly used at all. Shorter instructions result if the address to be used is in a register and does not need to be loaded from the instruction (register indirect addressing), or if it is specified as the sum of the value in a register and of a small constant. Then, a memory operand is specified via an address register, and several memory operands can be afforded. The instruction size is affected by the number of addressable registers which should be high in order to avoid memory accesses, the extreme being that no memory accesses remain for intermediate results and that the registers represent all of the data memory. This is impractical for a general purpose CPU. The other extreme is to provide only one accumulator register (i.e., no choice) and thereby avoid the register address encoding, but have to perform load and store operations for most operations.
Register addresses and memory operand addresses are avoided without having to perform frequent load and store operations by using a stack automaton to hold the operands and results (Figure 6.6). This allows for the evaluation of complex expressions without needing extra instructions to access operands or to store intermediate results in memory. By definition, the 'pop' operation of a LIFO stack only gives read access to the element written (pushed) to it most recently. The stack processor ALU always pops its arguments from the stack and pushes its results back there. The stack may be implemented in a standard memory using a stack pointer register to provide the memory address for the load and store operations (see exercise 6 in Chapter 2). Then the data load and store cycles remain, and pipeline registers for the top-of-stack and next-on-stack elements TOS and NOS (the ALU operands and the result) have to be used. The stack can also be implemented within a register file. Again, the register select address is generated using an extra stack pointer register, this time used to perform an indirect addressing within the registers. The stack can be thought of as a structure that dynamically allocates storage locations for intermediate results.
The stack structure does not provide access to intermediate results other than the most recent ones until the more recent ones have been popped off, and requires extra access operations or storage for them if they are needed out of this order.
Figure 6.6 Stack processor data path
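To make the stack data path of Figure 6.6 concrete, here is a minimal sketch with the TOS and NOS registers in front of an add/subtract ALU and a small RAM acting as the stack storage automaton. Depth, word size and all names are illustrative assumptions; stack underflow and overflow are not checked.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity STACK_ALU is
  port ( clk     : in  std_logic;
         push    : in  std_logic;               -- push an operand from the memory bus
         exec    : in  std_logic;               -- pop both operands, push the result
         sub     : in  std_logic;               -- '0' add, '1' subtract
         din     : in  signed(15 downto 0);
         tos_out : out signed(15 downto 0) );
end STACK_ALU;

architecture behav of STACK_ALU is
  type stk_t is array (0 to 15) of signed(15 downto 0);
  signal stk      : stk_t := (others => (others => '0'));
  signal sp       : unsigned(3 downto 0) := (others => '0');  -- next free deep-stack slot
  signal tos, nos : signed(15 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if push = '1' then
        stk(to_integer(sp)) <= nos;             -- spill NOS into the deep stack
        sp  <= sp + 1;
        nos <= tos;
        tos <= din;
      elsif exec = '1' then
        if sub = '1' then
          tos <= nos - tos;                     -- the ALU result becomes the new TOS
        else
          tos <= nos + tos;
        end if;
        nos <= stk(to_integer(sp - 1));         -- refill NOS from the deep stack
        sp  <= sp - 1;
      end if;
    end if;
  end process;
  tos_out <= tos;
end behav;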
Combinations of these techniques can be considered, e.g. providing small stack buffers for a small number of registers, thus combining short register addresses with larger storage capabilities. Register addressing can further be avoided by using unique registers for some instructions and thereby registers with different properties. Another option is to use a small number of indirect register addresses selected previously by some special instruction within a larger set of registers. An ALU instruction then does not specify a register directly, but via a register pointer. By redefining the pointers one gets an effect similar to performing register move operations. Indirect register addressing involves some extra processing delay.
A common method to arrive at a simple instruction set (and a smaller instruction size) is to only provide ALU instructions with register (or stack) operands and to use separate load and store operations for registers from and to memory with absolute and indirect addresses. The resulting load-store architecture is a basic ingredient of the RISC processor architectures (reduced instruction set computer). In contrast, CISC (complex instruction set computer) processors combine operand load and store memory accesses with ALU operations. The extra load/store instructions in RISC machines also require extra instruction memory locations and instruction fetches. The decoding of instructions, however, is simplified, and the data memory accesses are still pipelined with the subsequent instructions. With some hardware effort at the memory interface, instruction codes may still be loaded faster than at the ALU rate and the load/store instructions can be decoded and executed in parallel to the ALU operations.
An instruction type found on many processors is the loading of an m-bit constant (m being the ALU word size) into a register or as an ALU operand which obviously needs an instruction code of more than m + k bits (with a k-bit register address), forcing a larger instruction size than the data word size. This can be avoided by only providing instructions that load and pack constants encoded with a smaller number of bits (say, half of the data word size), or by substituting the load constant instructions by load from memory instructions and placing the constants into the data memory instead of into the instructions. Address code constants to be loaded into address registers may be handled in a similar way. The sizes of control flow instructions such as conditional jumps can be kept small by using relative jumps by short address distances (at the expense of needing an extra adder circuit and the delay caused by it) and by only using a few kinds of jump conditions, or by applying register indirect addressing to the jump target address (indirect jump instruction).
Besides the size of the instruction words, also their required number contributes to the memory costs. Techniques for shorter programs are to specify several ALU operations within a single instruction, or a control or load/store operation combined with an arithmetic one (which may again require wider instruction codes). Other techniques are the use of automatically inserted operations, and state dependent instructions where the same instruction code is used
for several different operations depending on register bits loaded before (see section 6.3.2). Entire operation codes can be pre-loaded into a register and then referenced by a single bit. The technique presented in section 5.5 to provide a single instruction bit to synchronize just with the second automaton taking care of some or all auxiliary operations goes even further. It can be generalized to using a single instruction or an instruction field available in parallel to others to input and output from a sub-automaton that autonomously goes through a number of states. In contrast to micro code, the sub-automaton operates in parallel to the further instructions and lets control become distributed. It might e.g. respond by accepting a command or delivering a result if it is ready for this but otherwise create a branch and a later retry.
The instruction memory requirements are also reduced by supporting code compression through sub-routines and loops that permit an instruction subsequence to be run through several times during the execution of a program. Sub-routine calls and loops are usually realized with special jump instructions and are 'decompressed' (see Figure 5.6) by performing these jumps in the same way and with the same overheads as for the control flow. Their implementation thus needs to minimize the runtime overheads in order not to lose ALU efficiency. The unconditional jump instruction to some non-sequential location in the instruction memory already permits sharing the instruction sequence starting at that location among several initial sequences as a common continuation. It is applied to implement the continuation of a program after executing a sub-function with several branches. A call instruction performs the combination of an unconditional jump with storing the next sequential instruction address to some register or memory location. An indirect jump to this address (the 'return' instruction) is placed at the end of the called instruction sequence (the 'subroutine'). A call instruction can be used to realize an application-specific instruction implemented by a multi-step computation, e.g. a multiply operation implemented by calling a sub-routine composed of a series of conditional add operations (and the return instruction). It also yields the standard implementation of a recursive algorithm as a sub-routine that includes a call to itself.
Common methods to speed up the sub-routine call and return instructions are to automatically store the return address in a stack structure in the data memory or in a dedicated 'link' register, and to pack the return instruction code as a single bit into an ALU instruction. The link register method is easier to implement but requires store and load operations to free the register for nested calls and to restore it for the return. They can be inserted by a compiler. If a sub-routine does not contain further call instructions, they are not needed at all, so that the execution time for them can be saved.
A loop is decompressed by setting up a counter to control the re-reading of an instruction sequence until the count expires. The counter can be implemented in software by decrementing a memory location using the ALU and comparing it to zero. This can be avoided by providing a dedicated loop counter register and a special conditional branch to the start of a loop that decrements the counter and tests it for expiry.
The overhead involved in the repeated reading and execution of this conditional branch instruction can be eliminated by pre-fetching the branch and executing it in parallel to another operation, or by implementing the control circuit so that the conditional jump occurs automatically until the loop has been fully expanded (e.g., by providing special registers to hold the start and end addresses of the loop). The second
method is common on digital signal processors (see Chapter 8). Like sub-routines, loops are often nested. This requires the loop counter value to be saved in a stack structure before executing an inner loop and retrieving it afterwards.
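The dedicated loop counter just described can be sketched as follows; 'load' starts a loop with the iteration count, and the loop-end branch asserts 'dec' to decrement the counter and decide whether to jump back. The signal names and the 12-bit width are illustrative assumptions.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity LOOP_CTR is
  port ( clk         : in  std_logic;
         load        : in  std_logic;            -- start of the loop: load the count
         dec         : in  std_logic;            -- the loop-end branch instruction executes
         count_in    : in  unsigned(11 downto 0);
         take_branch : out std_logic );          -- '1': jump back to the loop start
end LOOP_CTR;

architecture behav of LOOP_CTR is
  signal ctr : unsigned(11 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if load = '1' then
        ctr <= count_in;
      elsif dec = '1' and ctr /= 0 then
        ctr <= ctr - 1;                          -- count down once per loop pass
      end if;
    end if;
  end process;
  -- branch back as long as more than one pass remains
  take_branch <= '1' when dec = '1' and ctr > 1 else '0';
end behav;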
6.2.3 Caches and Virtual Memory, Soft Caching

The memory requirements have to be considered for both the instructions and the data (in a von Neumann architecture, the unique memory must be sized for the union of data and instructions). The data memory is used for two different functions. First, it is used to store intermediate results that do not fit into the available registers. Second, it is used to store large, multi-word data structures like vector operands or data bases from which input data are read and that are constructed or modified incrementally. The size of these data structures depends on the application. If there is no storage from which the components of a large vector can be read as often as needed, a programmable digital processor is simply not suitable for operating on it. As, however, all accesses are by components, it is enough to provide fast access to the components actually needed by storing them in the memory attached to the CPU, if there is a way to load the ones needed later into this memory. For the instructions, the same applies. Most time is spent in program loops that rarely span more than a few hundred instructions. Thus a CPU with access to just a few thousand memory locations will be able to operate efficiently if new instructions and data can be swept into this memory in time.
Modern general-purpose CPU designs implement large address spaces of 2^32 bytes and above and provide wide address registers and instructions to support them. A large physical memory involves long access times due to the selection circuits that are much slower than the CPU cycle time. The CPUs work by providing fairly small intermediate on-chip memories (called cache memories) allowing accesses at the CPU clock rate that provide copies of the portion of the main memory most recently accessed. Cache memories are defined not to be mapped to specific address ranges using decoder circuits (this would be another way to implement fast memory accesses) but to be mapped to the almost arbitrary addresses used by some program by means of auxiliary circuits. More specifically, a hardware cache implementation needs extra circuits to store the address ranges mirrored in the cache memory, and to detect whether a memory access falls into the range of data for which a copy exists (Figure 6.7). This is implemented by storing the starting addresses of small blocks of memory mirrored in the cache in a content addressable memory (CAM, see section 2.2.2). The blocks are further subdivided into cache lines that are the smallest units loaded from main memory and include status bits indicating whether they hold valid data.
Figure 6.7 Cache memory structure (simplified)
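The address-matching function of Figure 6.7 can be illustrated by the hit detection of a direct-mapped cache, a simplification of the CAM-based scheme described above: the address is split into tag, index and offset fields, and a stored tag plus a valid bit decide whether the access can be served from the cache RAM. The field widths and names are illustrative assumptions (16 lines of 16 words here).

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity CACHE_TAGS is
  port ( clk  : in  std_logic;
         adr  : in  unsigned(23 downto 0);   -- tag (23:8), line index (7:4), offset (3:0)
         fill : in  std_logic;               -- a completed line load updates tag and valid bit
         hit  : out std_logic );
end CACHE_TAGS;

architecture behav of CACHE_TAGS is
  type tag_t is array (0 to 15) of unsigned(15 downto 0);
  signal tags  : tag_t := (others => (others => '0'));
  signal valid : std_logic_vector(15 downto 0) := (others => '0');
  signal index : integer range 0 to 15;
begin
  index <= to_integer(adr(7 downto 4));
  -- the access is a hit if the selected line is valid and mirrors the requested block
  hit <= '1' when valid(index) = '1' and tags(index) = adr(23 downto 8) else '0';
  process(clk)
  begin
    if rising_edge(clk) then
      if fill = '1' then                     -- after the missing line has been loaded
        tags(index)  <= adr(23 downto 8);
        valid(index) <= '1';
      end if;
    end if;
  end process;
end behav;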
The cache management also requires some algorithm to determine which part of the cache should be overwritten if all of the cache is in use but a new address range is used by the application program. Before that, locations of the cache that have meanwhile been overwritten by the CPU have to be stored into the main memory at the addresses mirrored by the cache if this has not already been done as part of the write operations (write-through policy). Before this write operation to memory the contents of the cache may differ from the portion of the main memory mirrored by the cache. This must be taken care of if another CPU can also access the main memory and then requires additional hardware to let the other CPU detect cache writes. Another aspect of using caches is that separate caches may be implemented for instructions and data so that the benefits of the Harvard architecture are combined with a single main memory used for both, and that the caches may have a wider data interface to the data path and the instruction decoders than the external memory. Loading and storing of cache lines use fast sequential transfer modes (DRAM page mode accesses, see section 2.2.2) while the caches allow random accesses by the CPU. If the cache lines need to be loaded from memory frequently, the CPU operation slows down. Obviously, the hardware cache management that gives the programmer the use of a memory that is large and fast at the same time is paid for with a substantial hardware effort to implement the CAM, to switch between cache accesses and external ones, to provide the generation of cache line load and store operations, and the address replacement algorithm. This effort is higher the more independently mapped memory blocks there are. Another drawback concerns applications requiring a precise control of the timing of a computation. The automatic cache operation cannot be predicted for a program if some other thread can invalidate the cache entries for the current one.
Hardware support for large address spaces goes even further on current standard processors. Usually only a fraction of the address space is populated with memory cells (e.g. 256 M bytes within a 4 G byte space) but a hard disk in the system may provide many G bytes of storage. Then a virtual memory management circuit that uses a configurable address decoder maps blocks of the physical memory space to any desired address requested by a program, after moving the previous contents of this block to the hard disk using some auxiliary software. The similarity to caching is obvious. The desired large memory space available to the program is realized by a region of the hard disk, and the slow disk memory becomes cached by the main semiconductor memory which supports much faster random read and write accesses. Here too, complex control hardware similar to the one supporting the cache is employed to implement the mapping from the processor address space to the main memory addresses, to detect accesses to unmapped addresses and to realize an algorithm selecting the portion of physical memory to be swapped to the disk and be remapped. The concept of a virtual memory is also applied to trigger communications with a remote subsystem (to perform a slow, serial access to the remote memory) instead of reading from a disk.
The various types of storage devices, the registers directly connected to the ALU, up to three levels of caches, the main memory and the mass storage constitute the usual hierarchy in the memory system of a high performance processor, which the cache and virtual memory management hardware attempt to hide. In simple embedded processors caches and memory management are not implemented. On these, the memory structure is 'flat' but still exhibits significant differences in the access times of the read-only and the on-chip and off-chip read-write portions of memory that are statically mapped to fixed regions of the address space in this case.
Figure 6.8 Distributed storage architecture
In dedicated systems, an effect similar to the use of a cache memory can also be obtained without any particular hardware effort, namely by moving the data and instructions from a semiconductor storage device into a fast yet small CPU memory or vice versa, under software control (soft caching) using the input and output facilities of the processor system. This lets different portions of the program and the data appear at different times at the same CPU addresses which can, however, be taken care of by a compiler. Consequently, a high performance CPU can be implemented that only uses an on-chip memory of the size of a cache and supports data exchanges via a fast input/output interface (using DMA, see section 6.4). Then it performs the data exchanges between the on-chip memory and a larger (and slower) external memory on command from the application program which at the same time gains a deterministic timing. Figure 6.8 shows the resulting memory system architecture. The external memory needs to be controlled by a control circuit generating the required memory interface signals and sending or receiving data blocks to or from the CPU on demand. The memory controller e.g. generates the bus cycles for an array of DRAM chips (including the refresh cycles) and a flash memory chip. It can even implement higher-level control functions such as the automatic memory allocation for application data. The CPU does not have to include a memory bus controller. The modularization in Figure 6.8 can also be used in conjunction with a hardware cache management by implementing the cache line load and stores as fast block transfers via a high-speed interface to a memory controller. With soft caching, the block transfers can be pipelined with the data processing by starting them in advance and letting the CPU continue to operate while the data transfer proceeds. Also, the CPU design is further simplified and becomes still more efficient as the address generation circuits no longer need to support a large address space but only the small on-chip memory. In a similar way, the virtual memory management can be substituted by input and output transfers to a mass storage device that are generated by a compiler and do not rely on any special hardware support within the CPU. In this case, there remains no difference between caching and virtual memory management apart from the selection of the memory device. Soft caching and memory management are useful techniques for even the simplest processors. If a micro controller has a fast yet small on-chip memory, part of a long program can be swept into it to speed up its execution. Once interfaced to a serial mass memory, it can execute programs of arbitrary size irrespective of its addressing capabilities. Soft caching and memory management rely on the knowledge of the available memory resources during the execution of the application program in the digital system. In dedicated digital processors this knowledge may be assumed, while for programs distributed to various execution environments
that are even shared between several users and applications it cannot be assumed, and only the dynamic resource management remains.
The data registers of a processor are used to hold inputs and intermediate results to allow for their fast access without having to perform memory cycles. They are essentially used to cache data that would otherwise reside in a larger memory. For the registers, soft caching is the common method. The compiler is supposed to insert memory load and store instructions in order to manage it. Here, too, automatic caching to registers would be possible. This is not done, in order to save hardware costs and to keep the operand fields in the instructions small, i.e. for the same reasons as mentioned before.
6.3 SIMPLE PROGRAMMABLE PROCESSOR DESIGNS

In this section we present and evaluate a number of easy-to-implement design tricks to avoid program control overheads and to reduce the memory requirements of a CPU that do not rely on complex caching circuits for instructions and data or on branch prediction. They are explained for two von Neumann type 16-bit CPU designs, both based on an add/subtract multifunction circuit and a simple memory interface operated synchronously with the CPU from which instructions are read and executed in a pipeline. The first design concentrates on the most basic structure while the second shows how simple twists can significantly increase efficiency. The VHDL code for both can be synthesized and experimented with on medium-size FPGA chips equipped with embedded RAM blocks or with access to an external memory; a higher density FPGA can even hold multiple copies of the processors. Both CPU designs support only small memory spaces that can be implemented with the RAM blocks found on current FPGA chips. The resulting size restrictions can be overcome by using a distributed memory architecture and soft caching as explained above.
6.3.1 CPU1 – The Basic Control Function

The CPU1 is a very simple processor. It demonstrates how little effort may be involved in designing a simple controller. The full VHDL source for it is given in Listing 6.2. This simple design has several drawbacks, however. Its ALU cannot be expected to be used efficiently, and the instruction sequences for simple functions tend to be long. This makes it unattractive as a general purpose CPU. A program control circuit similar to that of the CPU1 may, however, be useful to control the sequential execution of FPGA circuits. The ingredients of the CPU are as follows:
• ALU. The multifunction circuit to be used for the ALU is the 16-bit add/subtract circuit in Listing 6.1 that can be used to add, subtract, compare, negate, and shift (also supporting multiword arithmetic operations and shifts) and to perform the bit field ANDf. The data path built around the ALU uses a single output register (the 'accumulator'), and an input pipelining register to hold a memory operand as shown in Figure 6.3.
• Memory and pipeline. The CPU interfaces to a memory of 4096 16-bit words and performs a memory access in every cycle. A 4-bit instruction register is provided in addition to the 16-bit operand register and loaded from the memory in parallel to decoding and executing the previous instruction. There is no extra pipeline stage for decoding the instructions. For the instructions requiring a memory operand, the data memory accesses are inserted before their execution.
Table 6.2  Instructions of the CPU1

Instruction   Operation                  Comment
0000.h        a = mem(h)                 load A from memory, keep cy
0001.h        mem(h) = a                 store A to memory, keep A and cy
0010.h        a = a and mem(h)           bit field AND by components, keep cy
0011.h        a = a and /mem(h)          same with negated argument
0100.h        a = a + mem(h)
0101.h        a = a + mem(h) + cy
0110.h        a = a − mem(h)
0111.h        a = a − mem(h) − 1 + cy
1000.h        (spare)
1001.h        a = a + cy
1010.h        jc h                       jump conditional, negate A, set cy if A = 0
1011.h        jp h                       jump unconditional
1100.h        a = a + a                  shift left accumulator, h not used
1101.h        a = a + a + cy             shift left with carry, h not used
1110.h        a = a − h                  set cy if h ≤ a before, h not used
1111.h        a = a − a − 1 + cy         complement cy and load, h not used

Note: a = accumulator, cy = carry, h = 12-bit data field of instruction
• Control circuit. The control circuit uses a program counter register that is incremented after every instruction fetch. The CPU performs absolute unconditional and conditional jumps depending on the value of the carry bit. The memory address is selected from the program counter register and, for data fetches and jumps, from the operand register. Neither a call instruction nor any special loop support is provided.
• Instruction set. The CPU performs the instructions shown in Table 6.2. The control signals for the ALU are directly reflected in the instruction codes. For some instructions the operand field is wasted. Indirect jumps and memory accesses must be synthesized by overwriting the address fields of the load, store and jump instructions. A constant to be loaded into the accumulator must be placed at and read from some memory address. The conditional jump is defined also to negate the accumulator and test it for being zero. There is a spare instruction for extensions (see the exercises).

The VHDL code in Listing 6.2 is not purely behavioral as it references two components, the ALU circuit and a binary increment circuit for the 12-bit program address. Some of the signals used in the architecture definition correspond to the above-mentioned registers as indicated by the comments. They are synthesized as such according to the synthesis rules in section 3.7. The CPU has a 12-bit address bus and a 16-bit data bus. The external clock signal is assumed to enable the memory output or the write operation. If the reset signal is '0', the memory bus signals that are defined to be of the type 'std_logic' are set to the 'Z' state in order to allow some external hardware to define these signals and initialize the memory. After this resetting, the CPU starts to execute instructions from address 0. If the memory is an embedded FPGA memory block, the initial data may be loaded as part of the FPGA configuration. The simple CPU design allows high clock rates even in an FPGA implementation.
library ieee;
use ieee.std_logic_1164.all;                     -- needed for the 'std_logic' signal type

entity CPU1 is
  port ( reset, mclk: in bit;                         -- reset and clock signals
         r_w:  out std_logic;                         -- memory control bus
         data: inout std_logic_vector(15 downto 0);   -- memory data bus
         adr:  out std_logic_vector(11 downto 0) );   -- memory address bus
end CPU1;

architecture behav of CPU1 is
  -- component types:
  component ALU16
    port ( a, b: in bit_vector(15 downto 0);
           c_in, op, andf: in bit;
           s: out bit_vector(15 downto 0);
           c_out: out bit );
  end component;

  component INC12
    port ( a: in bit_vector(11 downto 0);
           q: out bit_vector(11 downto 0) );
  end component;

  -- signals representing registers:
  signal accumulator, operand: bit_vector(15 downto 0);
  signal pc:                   bit_vector(11 downto 0);
  signal instr:                bit_vector(3 downto 0);
  signal cy, cki:              bit;

  -- auxiliary signals:
  signal din, an, bn, sum:          bit_vector(15 downto 0);
  signal iadr, nadr:                bit_vector(11 downto 0);
  signal cin, cout, rw, op, and_op: bit;

begin
  alu:   ALU16 port map (an, bn, cin, op, and_op, sum, cout);
  pcinc: INC12 port map (iadr, nadr);

  main: process(mclk, din, sum, cout, nadr, reset, cki, instr)
                                  -- mclk is essential, others for reference
  begin
    if reset = '0' then           -- reset operation asynchronous to clock
      cki <= '0'; pc <= "000000000000"; instr <= "0000";
    elsif mclk'event and mclk = '1' then            -- positive clock edge
      operand(11 downto 0) <= din(11 downto 0);
      if cki = '0' then                             -- end of data load/store cycle
        cki <= '1';
        operand(15 downto 12) <= din(15 downto 12);
      else                                          -- end of instruction execute/fetch cycle
        if instr /= "0001" and instr /= "1011" then
          accumulator <= sum;
          if instr(3 downto 2) /= "00" then cy <= cout; end if;
        end if;
        pc <= nadr;
        instr <= din(15 downto 12);
        operand(12) <= din(11); operand(13) <= din(11);   -- extend sign bit of
        operand(14) <= din(11); operand(15) <= din(11);   -- din(11 downto 0)
        if din(15) = '0' then cki <= '0'; end if;         -- insert data cycle
      end if;
    end if;
  end process;

  -- ALU input selection:
  an  <= "0000000000000000" when instr(2) = '0' else accumulator;
  bn  <= accumulator when instr(3) = '1' and instr /= "1110" else operand;
  cin <= cy when instr(0) = '1' else instr(1);

  -- instruction decoding:
  op     <= instr(0) when instr(3 downto 1) = "001" else instr(1);
  and_op <= '1' when instr(3 downto 1) = "001" else '0';

  -- address multiplexer:
  iadr <= pc when cki = '1' and (instr(3 downto 1) /= "101"
                                 or (cy = '0' and instr(0) = '0'))
          else operand(11 downto 0);

  -- interfacing to the external signals:
  din <= to_bitvector(data);          -- conversion to the internal 'bit' signal type
  rw  <= cki or not(instr(0));
  adr <= "ZZZZZZZZZZZZ" when reset = '0' else to_stdlogicvector(iadr);
  r_w <= 'Z' when reset = '0' else '1' when rw = '1' else '0';
  data <= to_stdlogicvector(accumulator) when rw = '0' and reset = '1'
          else "ZZZZZZZZZZZZZZZZ";
end behav;

Listing 6.2  VHDL definition of CPU1
The recursive algorithm for the GCD function from section 1.2.2 translates into the instruction list shown in Listing 6.3. The operands are assumed to stand at the memory locations 32 and 33. At the end, the processor enters an idle loop, and the result is in location 32. The jump in line 5 is executed after taking the one in line 2 but checks for a different condition due to the negate operation (usually, jump instructions are implemented so that they do not change state). In this program, only the execution cycles of the instructions marked by an asterisk (subtract and negate) perform computational steps, while the load, store and unconditional jump cycles do not. Only 2–3 cycles out of 8 are computational depending on the branches taken. Thus the ALU efficiency running this algorithm is low.
As a second program example, we consider the multiplication of two 16-bit operands standing in locations 32 and 33 with the 32-bit result placed into locations 34 and 33. The multiplication step by a bit of the second operand is shown in Listing 6.4. It is realized by performing the add operation of the word to be multiplied conditionally. The shift operations used to adjust the result word for the next add operation and to select the bits of the second operand are considered non-computational, but the conditional jump and the add operation
Listing 6.3  gcd algorithm

Instruction address   Instruction        Comment
0                     a = mem(32)        load 1st operand
1*                    a = a − mem(33)    compare to 2nd operand
2*                    jc 5               jump if 2nd op ≤ 1st op, negate
3                     mem(33) = a        overwrite 2nd operand
4                     jp 0
5*                    jc 8               test for a = 0, negate again
6                     mem(32) = a        overwrite 1st operand
7                     jp 1
8                     jc 8               (end)
Listing 6.4  16-bit multiply step

Instruction address   Instruction        Comment
n + 0                 a = mem(34)
n + 1                 a = a + a          shift left result
n + 2                 mem(34) = a
n + 3                 a = mem(33)
n + 4                 a = a + a + cy     shift second operand
n + 5                 mem(33) = a
n + 6*                jc n + 13          conditionally add
n + 7                 a = mem(32)
n + 8*                a = a + mem(34)
n + 9                 mem(34) = a
n + 10                a = mem(33)
n + 11*               a = a + cy         propagate carry
n + 12                mem(33) = a
realize the arithmetic operations of the algorithm. Then only one in 8 cycles is computational in the mean. The total multiply function takes 16 steps with 208 instructions and needs 256 cycles in the mean. If the instruction pattern in Listing 6.4 is not duplicated but executed in a program loop using another memory location as the loop counter, the program would have only 19 instructions but would be still less efficient, taking about 350 cycles, as the loop management instructions do not contribute to the arithmetic operations needed for the multiply algorithm. The low efficiency is due to the need to reload operands and to store every result in order to free the accumulator for the next one. This long execution time may be attributed to the simple CPU design and be put in relation to the low hardware costs. Actually the cost is not that low if the total system including the memory is considered. The required, fairly long instruction sequences drive up the memory costs. A faster and more elaborate CPU design might deliver more performance at the same costs for the total system.
6.3.2 CPU2 – An Efficient Processor for FPGA-based Systems

The flaws of the CPU1 are avoided in the CPU2 design. It uses a more powerful data path (yet still the add/subtract/ANDf ALU) and a richer instruction set supporting indirect jumps
and memory accesses, efficiently implements calls and loops, and provides special support for soft caching. The increased efficiency is at the expense of spending a much higher amount of hardware for registers and control functions (more than for the ALU itself). If the reduced program sizes are taken into account and exploited to reduce the system cost, these expenses are compensated, and the overall efficiency (the performance/cost ratio) of the CPU2 plus memory system can be expected to be much better than for a CPU1-based system. The simple pipeline structure has been maintained. The estimated cycle time of the CPU is roughly comparable to the access time of the required, fast SRAM. The low complexity and the inclusion of a coprocessor interface make the CPU2 attractive for real FPGA-based applications (cf. Table 2.1). Its current implementation exploits the extra bandwidth provided by FPGA memory blocks when they are configured as dual-port memories. This extra bandwidth is useful for DMA accesses from the interfaces, too. A version of the CPU2 has been synthesized for a current 0.18 µm standard-cell manufacturing process and been found to run at about 280 MIPS.
The VHDL source for the CPU2 is much more complex than the one for the CPU1. It is not included here, but is available from [55]. This present section concentrates on the basic CPU processing and control functions. Thread management, input/output and the soft caching implementation for the CPU2 follow in sections 6.4 and 6.5.3. The CPU2 is extended by fast input and output interfaces that allow several CPU components to be used in a system, as will be further discussed in section 7.3. The CPU2 design is related to some software techniques further explained in section 7.7.
6.3.2.1 Data path and instruction set architecture

The performance penalty due to frequent loads and stores is overcome by using 4 registers for operands and intermediate results instead of just 1 (enough to perform a multiply operation without having to perform memory accesses). This number of registers is still small enough to avoid an extra pipeline stage for the register accesses. Some special purpose registers that do not require register address fields within the instruction are supplied to support data addressing and program control. Current micro controllers often use larger register sets, but also use addressable registers for the special register functions. Large register sets are convenient for executing complex arithmetic operations. It has been found in a study comparing various register sets that the small number of addressable registers is sufficient as long as basic arithmetic operations such as a multiplication are multicycle operations. A larger number of registers does not significantly raise the speed in typical applications but tends to increase the instruction size and thereby the memory costs (or decrease the available size of other useful instruction fields). Some additional registers are supported within an optional coprocessor extension of the CPU. Table 6.3 summarizes the registers of the CPU2.

Table 6.3  Registers of the CPU2

visible registers
  R0, R1, R2, R3    –  general purpose data registers
  DP, CTR           –  special purpose registers (data pointer, loop counter)
  F, CY             –  jump condition and carry registers (1 bit)

invisible registers
  LNK, SP           –  return register, stack pointer
  I, PC             –  instruction register and program counter
  DMAA, DMAC        –  DMA address and DMA counter

The instruction set architecture of a CPU and its architectural features are closely related. The instruction set enhancements in the CPU2 reflect the more sophisticated data path design that exploits the adder capabilities more fully, and some simple extensions of the ALU circuit:

• It generates a sign bit as needed for twos complement comparisons, and an overflow bit.
• The carry and sign bits can be used to support intermediate results requiring 17 bits. Three instruction bits are used to select the bit stored in F as an extra result bit (Table 6.1).
• The sign bit of an operand is stored to support sign extension to multiple words.
• The upper 16 bits of the 17-bit ALU output (including the carry/sign bit) can be selected to perform the equivalent of a right shift and the add-and-divide-by-two operation.
• On register R0 a shift right function is implemented, and an add operation conditional on bit 0 of R0. They combine to yield a 16-cycle multiply operation with a 32-bit result.
• There is also a conditional add/subtract operation to support division.
There is some more support for single-bit Boolean operations. A register bit can be selected from R1–R3 and used as an operand for a single-bit operation. The result goes to the jump condition flip-flop F from where it can be moved back to any bit position in R1–R3. XORf and ORf still need to be synthesized from other operations. Some Boolean operations are available to combine the results of comparisons.

Other standard instruction set enhancements concern indirect addressing, i.e. memory references using the contents of some register (R0–R3, DP) as the address. The address registers can be automatically incremented after the memory operation to point to the next address in a data array. Indirect jumps are used, too. There is a call instruction that saves the return address to a dedicated link register (LNK). The corresponding return instruction is encoded by a single bit and packed into an ALU instruction. The cost of calling a sub-routine is thereby reduced to the single cycle needed for the call instruction (memory is not saved for the return instructions this way, as all ALU instructions now include the return bit). The CPU2 also implements a dedicated loop counter register (CTR) and a conditional branch to the start of a loop that decrements the counter and tests it for expiry. A special instruction is used to load the loop counter and to start the loop.

To save the extra instructions otherwise needed to store intermediate results to memory when the available registers cannot hold all of them, an indirect store operation to consecutive memory locations pointed to by DP is provided that auto-increments DP. As an unusual feature, it too is encoded within the ALU instructions by a single, extra bit. The auto-incrementing store operation is similar to a stack push operation and dynamically allocates a memory location. The DP register is also involved in passing parameters to a subroutine. If more parameters are needed than can be passed via the registers, they can be pushed into the memory addressed with DP, which also allows for recursion.

The instruction set for the CPU2 is summarized in Table 6.4 (the details can be obtained from [55]). A 13-bit address space for instructions is supported. A single type of conditional jump is available, using an absolute address and the only jump condition of the F flip-flop being set to 1. The ALU instructions independently specify source and destination registers and implement a three-address architecture.

Table 6.4  Instruction set overview for the CPU2

1. ALU instructions           add, subtract, andf, compare, conditional add&shift,
                              conditional add/subtract, single-bit operations
2. Coprocessor instructions   application specific
3. Load/store instructions    load constant, load/store absolute, load/store indirect,
                              move indirect, load loop index, store indirect using DP
                              (embedded into ALU instruction)
4. Input/output instructions  input, output
5. Control flow instructions  jump conditional, loop, call, return (embedded into ALU,
                              i/o and coprocessor instructions)

The encoding of the instructions translates into fairly simple Boolean circuits to select operands and to enable register clocks. Besides the store bit and the return bit, another three bits of the ALU instruction codes are used to handle the condition bits and exceptions. Their inclusion compensates for the single jump condition. Finally, there is a special i/o instruction (see section 6.5) and a set of coprocessor instructions. A typical ALU instruction not using the return option would be as follows (C denoting the carry). According to Table 6.1 it computes the extended sum of two unsigned operands in R1 and R2:

    store R3 = R1 + R2 + C    F = C
6.3.2.2 How to implement the return stack and zero-overhead loops

The CPU2 design implements a useful method to enhance efficiency that concerns the management of nested sub-routines and loops. Nested sub-routines and loops require a stack, which would normally force a compiler to insert extra load and store instructions for the link and counter registers. This is avoided by implementing a stack pointer register (SP) and push and pop operations to memory, which many processors provide for return addresses instead of using a link register. Here, they are provided in conjunction with the link register. The push and pop operations are generated automatically using extra state flip-flops indicating whether the link and counter registers are in use. If the link register is in use and a call is executed, then the link register is automatically pushed. If the return instruction is executed and the link register is empty, an automatic load (pop) operation from the stack is performed. Thus the speed advantage of using a link register is combined with the nesting support through a stack. The same procedure applies when a loop is started and the counter is occupied, or the loop return is executed (the conditional branch to the start of the loop) with the counter being empty. The link and the counter registers can be thought of as caching stack memory locations. The state bits indicate that these cache entries are valid. The SP register is initialized during reset and remains invisible to the programmer.

Another useful method improves ALU efficiency by implementing a zero-overhead loop with very little extra effort. The loop return instruction (the conditional branch to the start of the loop) is encoded by a single bit and packed into the ALU instruction, too. For this purpose, the branch needs to be indirect (using an address register). As the number of instruction bits is limited, the same instruction bit is used as for the return instruction, using a loop mode state bit that is set when the loop is entered to distinguish between the sub-routine and the loop return. The link register is also used to hold the start address of the loop, and loaded with it by the special instruction starting the loop. Once a counter loop is entered, both the link register and the counter are pushed onto the stack (unless they are empty), and if they are empty at the time of the loop return, they are popped from the stack before executing it. The state bit distinguishing subroutine and loop return needs to be saved along with the link register contents if a nested call or loop is entered.

There are, in fact, two such state bits distinguishing the operating modes of the parallel control operation indicated by the return bit in order to support a second loop structure (a ‘do-until’ type loop not using the counter) and a mode provided to handle exceptions. In this latter mode, LNK holds the address of the error handling routine, and the return bit becomes a conditional jump instruction, the condition being an overflow selected according to Table 6.1. In this case, the benefit of the packed control flow operation is that overflow checking does not consume extra cycles in the no-error case. The multifunction return bit is further exploited for switching contexts (see section 6.4), in particular on returning from an interrupt. Its various uses are summarized in Table 6.5.

Table 6.5  State dependent control functions of the embedded return

State                    Control function of the return bit
Sub-routine              unconditional return – jump back
Counter loop             conditional return – jump back or break/terminate
Do-until loop            conditional return – jump back or break
Error mode               conditional error return – error return branch
Empty stack/interrupt    context switch / conditional context switch

The multifunction return bit is similar to the jump enable bit provided to synchronize the execution of the ALU instructions with a parallel branch processing unit as proposed in Figure 5.9. The simple mechanism of preloading the LNK register works without having to implement a more sophisticated memory interface pre-fetching auxiliary operations, yet covers the most important cases. Due to its shared usage the space consumed by it in the instructions is no longer wasted for an infrequent operation. The loop implementation uses still another state bit that is valid after loading the first instruction of the loop and inhibits the reloading of the instruction if there is only one instruction in the loop, freeing up some memory cycles for other devices directly accessing the memory (see section 6.5.1). In a refined implementation, it could also be used to switch the processor to a higher speed execution mode. The serial multiply operation would e.g. be implemented by two instructions, one setting up a loop and the other performing the multiply step. The latter would be repeated but be fetched only once. For a nested counter loop, the count is pushed after the link register. If a sub-routine is called within a counter loop, the count can remain in the counter register in order to avoid memory cycles. The multiple use of the return bit allows further optimizations within the control circuit. An instruction sequence ending with a return instruction can e.g. be executed both as a sub-routine and as a loop body.
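The caching of return addresses in LNK can be modelled in software as follows. This is a hedged, illustrative C sketch of the behaviour described above; the memory model, the stack base and all names are invented, and it is not the CPU2 control logic:

#include <stdint.h>

#define STACK_BASE 0x7F00u       /* assumed location of the return stack */

static uint16_t mem[0x8000];     /* simplified data memory */
static uint16_t sp = STACK_BASE; /* invisible stack pointer, set at reset */
static uint16_t lnk;             /* link register */
static int lnk_in_use = 0;       /* state flip-flop: LNK holds a valid entry */

/* call: if LNK is already in use it is pushed automatically */
static void cpu_call(uint16_t *pc, uint16_t target)
{
    if (lnk_in_use)
        mem[sp++] = lnk;         /* automatic push of the occupied link register */
    lnk = (uint16_t)(*pc + 1);   /* save the return address */
    lnk_in_use = 1;
    *pc = target;
}

/* return: jump via LNK; refill it from the stack if an outer call is pending */
static void cpu_return(uint16_t *pc)
{
    *pc = lnk;
    if (sp > STACK_BASE)
        lnk = mem[--sp];         /* automatic pop refills the cache entry */
    else
        lnk_in_use = 0;          /* stack empty: LNK becomes free again */
}

The counter register CTR would be treated in the same way when loops are nested.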
6.3.2.3 Input/output and coprocessor modules

The hardware interface of the CPU2 is defined in Listing 6.5. The CPU2 provides an additional interface to handshaking input and output devices (Listing 6.6) with a direct path to the memory, and a dedicated i/o instruction to support them (see section 6.5).
entity CPU2 is port (
    reset, bs, irq: in std_logic;               -- for 'irq', 'bs' see sections 6.4, 6.5.5
    mclk: in std_logic;                         -- clock
    r_w, m: out std_logic;                      -- memory control
    data: inout std_logic_vector(15 downto 0);  -- memory data
    adr: out std_logic_vector(15 downto 0)      -- memory address
);
end CPU2;

Listing 6.5  Module interface of the CPU2
entity IF2 is port (
    data: inout std_logic_vector(15 downto 0);  -- memory data bus
    ctrl: in std_logic_vector(7 downto 0);
    rd, wr, d, n, creq: in std_logic;
    err, cen: out std_logic
);
end IF2;

Listing 6.6  Module interface for i/o circuits attached to the CPU2
The CPU2 also provides a control output to a coprocessor circuit that is connected to the memory bus, and a set of coprocessor instructions that can be used to control non-standard operations realized with FPGA circuits. The interface definition of a generic coprocessor is shown in Listing 6.7. Figure 6.9 summarizes the modules within a CPU2-based system, using the parallel interface defined for the CPU2 to attach to a memory controller to support soft caching.

entity COP2 is port (
    data: inout std_logic_vector(15 downto 0);
    adr: in std_logic_vector(15 downto 0);      -- only used for fct=1
    opcode: in std_logic_vector(3 downto 0);
    a0, a1, rd, wr, op, fct, c: in std_logic;
    opack, err: out std_logic                   -- wait until opack=1
);
end COP2;

Listing 6.7  Module interface of coprocessors to the CPU2
The CPU controls the sequencing of the coprocessor operations and the generation of the memory addresses and cycles for the data transfers with it. For coprocessor operations, the registers R1, R2, R3 are used as address registers that are updated in parallel to the memory access cycles. The addressing options are chosen with a particular coprocessor in mind, a parallel 16 × 16 bit multiplier-accumulator (MAC) circuit with a 48-bit result register. In conjunction with the zero-overhead loop the MAC is controlled to perform the basic sum-of-products function typical of DSP applications (see Chapter 8) at the maximum MAC efficiency. The address registers are automatically updated to support the operand addressing in subsequent partial products ai ∗ bn−i as needed in a multi-word integer product (see section 4.4), and the use of the MAC to implement multi-precision floating point operations. The modular extension of the CPU by a coprocessor increases the overall efficiency as the coprocessor is a computational circuit that shares the sequential control provided for the CPU.

The coprocessor is assumed to have an extra storage automaton with input sub-automata (registers) X, Y and an output sub-automaton M. For the MAC, M is a three-word shift register, the shift function being performed on single-word load and store operations. The signals ‘op’ (start operation) and ‘opack’ (allow next operation) implement a handshake between the CPU and the coprocessor and allow the coprocessor to take more clock cycles for a complex operation. Every instruction starting a new coprocessor operation or accessing a coprocessor result first waits for the ‘opack’ signaling the previous instruction to be finished. The ‘err’ signal is copied to the F bit and permits the program to branch on an error condition.

Figure 6.9  CPU2-based system (block diagram: the CPU core with its local data/instruction RAM, instruction and local data buses, the coprocessor (MAC), the parallel interface and an i/o bus, and a memory controller attached to flash memory and DRAM)

Listing 6.8  GCD program

Instruction address    Instruction                  Comment
0                      loop R3 = R1 − R2  F = /C    start do-until loop, compare 1st to 2nd operand
1∗                     jc 50                        jump if R2 > R1
2                      0 − R3  F = /C               set F if R3 > 0
3∗                     R1 = R3  return              conditional loop return if F, else break
4                      ...                          end: result in R1 and in R2
50∗                    R2 = − R3  return            negate, overwrite 2nd operand, return to 1
6.3.2.4 Sample programs

The GCD program for the CPU2 starts with both operands in the registers R1, R2 (Listing 6.8). It eliminates some jump cycles by using the loop option and performs no memory accesses for intermediate results. The embedded return instructions perform the jump to line 1 without requiring an extra cycle. An arithmetic operation is performed about every 2 cycles. The program in Listing 6.9 performs the binary division of the unsigned 32-bit number in R1 and R3 by the one in R2 using the conditional add/subtract operation, placing the result in R1 and the remainder in R3 (cf. section 4.7). F is updated to a sign bit extending R1. It takes 51 cycles and performs a compute step in every 3rd cycle (not counting shift operations). For the more frequent multiplication, the required shift is part of the conditional add instruction.
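For reference, the subtractive GCD computed by Listing 6.8 corresponds to the following plain C model. It is an illustrative sketch assuming both operands are non-zero 16-bit values, not CPU2 code:

#include <stdint.h>

/* GCD by repeated subtraction, as in Listing 6.8: the larger operand is
   repeatedly replaced by the difference until both operands are equal. */
uint16_t gcd16(uint16_t r1, uint16_t r2)
{
    while (r1 != r2) {
        if (r1 > r2)
            r1 -= r2;   /* R1 = R3 = R1 - R2 */
        else
            r2 -= r1;   /* R2 = -R3, i.e. R2 - R1 (instruction at address 50) */
    }
    return r1;          /* result in R1 (and R2) */
}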
Listing 6.9  Division algorithm

Instruction address    Instruction               Comment
0                      R3 = 2∗R3  F = 0
1                      loop #16                  start counter loop
2                      R1 = 2∗R1 + C
3∗                     R1 = R1 ± R2              add if F = 1 else subtract
4                      R3 = 2∗R3 + C  return     to instruction 2
5                      if F = 1 R3 = R3 + R2
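The principle behind Listing 6.9 can be seen from the following C sketch. It uses the simpler restoring form of the shift/subtract division, whereas the listing relies on the non-restoring conditional add/subtract instruction; the function and its interface are illustrative only and assume that the high dividend word is smaller than the divisor, so that the quotient fits into 16 bits:

#include <stdint.h>

/* Restoring division of a 32-bit dividend (high:low word pair) by a 16-bit
   divisor in 16 shift/conditional-subtract steps. */
uint16_t div32by16(uint16_t high, uint16_t low, uint16_t divisor,
                   uint16_t *remainder)
{
    uint32_t rem  = high;
    uint16_t quot = 0;
    for (int i = 15; i >= 0; i--) {
        rem = (rem << 1) | ((low >> i) & 1);  /* shift in the next dividend bit */
        quot <<= 1;
        if (rem >= divisor) {                 /* subtract if it does not underflow */
            rem -= divisor;
            quot |= 1;                        /* and record a quotient bit */
        }
    }
    *remainder = (uint16_t)rem;
    return quot;
}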
Listing 6.10  48-bit multiply subroutine using a MAC coprocessor

Instruction address    Instruction                         Comment (L/M/H: low/middle/high)
0                      X = @R1+   Y = @R2+   MPYuu         LL
1                      X = @R1−   @R3+ = M   MACuu         ML
2                      X = @R1+   Y = @R2+   MACuu         LM
3                      Y = @R2−   @R3+ = M   MACus         LH
4                      X = @R1+   Y = @R2−   MACuu         MM
5                      X = @R1−   Y = @R2+   MACsu         HL
6                      Y = @R2+   @R3+ = M   MACsu         HM
7                      X = @R1+   Y = @R2−   MACus         MH
8                      X = @R1−   @R3+ = M   MACss         HH
9                      @R3+ = M   @R3+ = M   return

uu: unsigned by unsigned    us: unsigned by signed    su: signed by unsigned    ss: signed by signed
(memory accesses occur before starting the operation; reading M also shifts M)
The final example in Listing 6.10 uses the 16-bit MAC coprocessor to perform a signed 48 × 48 bit multiply that might be part of a floating point multiply. R1 points to the least significant word of the first operand, R2 to the second, and R3 to a memory buffer for the 6-word result. The memory operand addressed by a register Rx is denoted @Rx+ or @Rx−, the sign indicating the modification of the address register by ±1. The provided addressing options are designed after an analysis of selected applications of this kind. The MAC can be used nearly continuously in this example.
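The partial-product scheme of Listing 6.10 corresponds to the following C model. It is a hedged sketch only: the names and the use of a 64-bit software accumulator are choices made for the example and do not reflect the MAC hardware. The operands are signed 48-bit numbers stored as three 16-bit words each, with only the high word treated as signed, and the accumulator is shifted by one word whenever a result word has been emitted, just as reading M shifts the MAC result register:

#include <stdint.h>

/* 48 x 48 bit signed multiply from 16 x 16 bit partial products.
   a[0], b[0] = low words, a[1], b[1] = middle words, a[2], b[2] = high
   (signed) words; the 96-bit product is written to r[0..5], LSW first. */
void mul48x48(const uint16_t a[3], const uint16_t b[3], uint16_t r[6])
{
    int64_t acc = 0;                  /* plays the role of the MAC register M */
    for (int k = 0; k <= 4; k++) {    /* k = word position of the result column */
        for (int i = 0; i <= 2; i++) {
            int j = k - i;
            if (j < 0 || j > 2)
                continue;
            /* only the high words carry the sign (uu/us/su/ss variants) */
            int64_t ai = (i == 2) ? (int16_t)a[i] : a[i];
            int64_t bj = (j == 2) ? (int16_t)b[j] : b[j];
            acc += ai * bj;           /* MACxy step */
        }
        r[k] = (uint16_t)acc;         /* emit one result word (@R3+ = M) ... */
        acc >>= 16;                   /* ... and shift (arithmetic shift assumed) */
    }
    r[5] = (uint16_t)acc;             /* most significant result word */
}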
6.4 INTERRUPT PROCESSING AND CONTEXT SWITCHING

Sometimes, the control flow of a program needs to provide branches which are known to be taken infrequently but nevertheless require the testing of a condition and a conditional branch instruction and thereby cause some continuous control overhead. This is in particular the case for operations that are undefined or deliver an invalid result for some input data, e.g. a divide operation by 0 or a binary add if the operands happen to be large. In most processor designs intermediate overflows are not supported, and an overflow check may be necessary after every add operation. The overflow handling described in section 6.3.2.2 is one possible method of avoiding the involved overheads. If the CPU supports sub-routine calls, the exception handling can also be implemented by automatically performing a sub-routine call to a predefined address in the error case but continuing immediately otherwise. This of course implies that all overflows are handled by the same error routine. An automatic call to an error handling routine is called a ‘trap’.

A versatile feature of most general purpose CPU designs is the capability to react quickly to an external event (an ‘interrupt’ signal), usually related to the handshaking at external interfaces, without having to wait for it in an idle loop. The technique of inserting a call (a trap) to a predefined address is also used for this purpose. The insertion lets the CPU jump to the predefined address, execute the instructions found there to perform some service related to the event, and then return to the interrupted program. If there are several events that trigger the interrupt processing but need to be handled differently, they may be associated with individual, predefined addresses used for the inserted call. This technique lets the processor immediately respond to each event in the appropriate way without having to perform a branch. A typical application is the reading of data from some input device in response to an associated signal that indicates the presence of input data, performing some pre-processing and storing it in memory for later use by the main program. The following are aspects of the processing of interrupts:
• The interrupt programs constitute a 2nd thread of instructions that is independent of the main program. An interrupt routine is repeated from scratch for every new event causing the interrupt. This is similar to performing a program loop which starts by waiting for the occurrence of the event. By branching on the contents of a state variable, the interrupt program can follow a control flow that spans several wait periods for the interrupt event. The execution of a second thread can, however, also be realized without interrupts by cutting both instruction sequences into parts and jumping back and forth to execute both.
• The schedule in which the processor is used to perform operations of the main program and of the interrupt program is determined dynamically through the timing of the interrupt events. Once running, the interrupt program always continues until the next wait/return from interrupt instruction. Using a wait situation in one task to perform operations in a second one also does not depend on interrupt processing. It can be implemented by conditionally jumping to the other sequence when the wait situation occurs.
• A characteristic property of interrupt processing is that the interrupt program is executed with a higher priority than the main program. Once the waited-for event occurs, the former is scheduled immediately without the main program having reached a wait condition. Some processors implement more than two priority levels and allow an interrupt program to be itself interrupted by a higher priority one. Otherwise, during the execution of the interrupt routine other interrupt events are not serviced. Even with a single priority level a fast response to all events can be guaranteed if all service routines are very short.
• The interrupt provides a strict time synchronization of the event with the instructions in the interrupt program. This is achieved without losing processing time in an idle loop waiting for the event. Due to the dependency of execution times on the control flow (the conditional branches), a software loop waiting for the event would always lose the processing time saved previously in taking a faster branch of the program. The time from the interrupt event to the start of the first instruction of the interrupt program suffers from some jitter due to the clocking (the discrete time base), and the need to first finish the current instruction. If the interrupt signal can be disabled (maybe due to another interrupt program already triggered), the event may even be recognized only much later.
• The digital system to be based on a single processor may have inputs requiring a long processing that does not need to be performed frequently whereas other inputs require a short processing time once their data become available. The interrupt feature helps to fulfill both timing requirements with a single processor, i.e. without having to dedicate an extra processor to an interface requiring the fast response. The most common case is to service interfaces having a fast, unidirectional event signaling by immediately moving the interface data into some memory structure or vice versa in order to prevent a loss of data. It can, however, also be handled by other hardware to buffer the interface data.
The interrupt processing must be such that the processor returns to the interrupted program in the same state (register setting) in which it was left. Therefore the register contents changed by the interrupt routine must be saved and restored on return. On changing between the interrupt program and the interrupted one, the processor thus needs to perform not just a jump but a context switch in which the setting of its state is saved and restored as needed. As a control-related feature the interrupt processing must cause as little control overhead as possible. A common technique is to use other registers in the interrupt routines than in the main program, e.g. by switching to a second bank of registers. Then no time is lost saving the registers in memory, yet at the expense of providing more registers and having to select between them.

A common source of exceptions and interrupts to a CPU is memory and interface accesses. If the interface between a CPU and its memory uses handshaking, the memory could simply not respond at all, return an invalid data code or signal an error condition via a separate signal. Then the CPU might break off the memory access and perform a trap. A common example of this is virtual memory, where an access to a location that is not mapped to physical memory causes an interrupt that is serviced by swapping some physical memory to the disk and then mapping it to the required address. After returning from the interrupt service routine the memory access must be retried as it was not completed. Reading input data from an interface is similar to a memory access and can be designed to generate an exception if input data are not present, by delaying the memory handshake until input becomes available, by returning an invalid data code, or by forcing a context switch and retrying the read operation at a later time.

As pointed out before, applications define several processes (section 1.5.2), and correspondingly a CPU is time-shared to execute several threads. This is, in fact, the key to using a CPU efficiently even if new data from some interface have to be waited for. During the execution of a thread, the processor registers are used for the intermediate results, and the program counter is used as the read pointer for the instructions. To switch to another thread, the register contents must all be saved in memory and be changed to the register settings in the other thread (reading the new values from memory locations where they were saved for this other thread before). Thus the context switch corresponds to a fairly large number of non-computational CPU cycles. The interrupt servicing is a special case, and the technique of using a separate register set for it generalizes to providing one for each of the different threads. Some processors support this by implementing the register sets within a larger memory structure and switch to another bank of registers by changing a single base address register. Otherwise, the context switch is implemented by calling an operating system sub-routine.

The CPU2 implements the most basic interrupt support only, in order to cover the aspect of time synchronization, as it provides other support to automatically input interface data into memory buffers. Interrupt processing occurs in response to the signal ‘irq’ in Listing 6.5 changing to the L level. It starts by automatically storing the F bit and the carry bit in extra flip-flops before jumping to the interrupt program at the predefined address. A single priority
level is supported, and interrupts cannot be nested (during the interrupt service, changes of the ‘irq’ signal are not recognized). Single instruction loops can be interrupted and fetch the repeated instruction once again upon return (some commercial processors implement a single instruction repeat feature by blocking instruction fetches that disallows interrupts during that time). Like a call instruction the interrupt causes the PC to be stored to LNK which does not cost an extra cycle if LNK is empty but involves two memory cycles otherwise. If e.g. R0 and R1 need to be saved and restored by the interrupt program, an overhead of nine to eleven non-computational cycles results from every interrupt. Moreover, the CPU2 supports context switches between several threads in a simple way so that they do not require extra instructions, relying on a static process management (see section 7.7.7). They are bound to occur outside sub-routines (i.e. with an empty stack) only in order to avoid the swapping of the stack pointers for the threads. They occur in response to the embedded return instruction that is redefined in this case to jump to the continuation address of the next thread which is found in a memory table. The saving of registers in memory is performed through store instructions inserted by the compiler before the return instruction.
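A context switch of this static kind amounts to looking up the continuation address of the next thread in a memory table. The following C fragment is only a rough illustration of that idea; a thread is represented by a function pointer here, which ignores the saving of registers performed by the compiler-inserted store instructions, and all names are invented:

#define NTHREADS 4

typedef void (*continuation_t)(void);

/* one continuation (restart) address per thread, filled in at system start */
static continuation_t thread_table[NTHREADS];
static int current = 0;

/* role of the redefined embedded return: continue the next thread */
void context_switch(void)
{
    current = (current + 1) % NTHREADS;
    thread_table[current]();
}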
6.5 INTERFACING TECHNIQUES

Every CPU plus memory system must be interfaced in order to be able to receive input data and to produce output signals. As discussed in section 1.4.3, handshaking is required to synchronize input and output with the data processing. External interfaces may involve other signal levels than those used inside the digital system and even other power supply voltages. Interfaces may be attached to the CPU as special registers accessed by dedicated instructions, or be accessed like memory locations via the memory bus. The former method may provide faster access and avoids the decoding of interfaces by their memory addresses. It also allows for efficient implementations of synchronization and block transfers. The memory bus supports the use of many interfaces accessed one at a time with the same instructions also used for memory accesses. Input and output operations performed by a CPU are non-computational and must be realized with as few CPU cycles as possible in order to maintain ALU efficiency.

The basic function of any interface is to input or output blocks of data words encoding some complex information. Some interfaces need a complex handshaking protocol, or transfer control information and error checking codes along with the data. An interrupt routine may be needed to handle the handshaking and unpack the data (or pack the send data) and store them in memory, or an extra input or output processor can be used for this (networking interfaces are often packed with such protocol pre-processors onto integrated circuits). Then input and output can be given a simple, unified behavior (software interface), reading data from memory (or registers) or placing data there. If they provide input and output repeatedly (streams), they behave like files or FIFO buffers (cf. section 1.5.2).
6.5.1 Pipelining Input and Output

A multi-word input is conveniently accessible by the CPU if it is stored in a set of memory locations. Usually, the words arrive serially via an interface with a particular handshaking time pattern and must be moved to the different memory locations one-by-one before the CPU operations on it can begin. This can be thought of as a serial-to-parallel conversion process and can e.g. be handled by an interrupt routine in order to be able to use the remaining time during the input data block transfer for another, computational process. Data input, processing, and similarly the output of a data block become pipelined then (Figure 6.10). If input and output block transfers occur on the same interface in a half-duplex fashion (e.g. via a serial bus), then the send and receive blocks do not overlap.

Figure 6.10  Pipelined input and output (half-duplex timing). (Timing diagram; RR…: receive new input, XX…: process input, SS…: send out results.)

If the only reaction to an event is the storing of input data in memory for later use, then instead of using CPU instructions to read and store the data at subsequent addresses, an auxiliary circuit with an extra address register can be implemented to just insert a data move to the address therein. This technique is called direct memory access (DMA). It significantly reduces the control overheads if the frequency of the trigger events is high. The address register and the generation of the memory transfer can be realized by a coprocessor stopping the CPU from accessing the memory for the time of the transfer, or be integrated into the CPU structure. The DMA function can be further enhanced by providing a dedicated counter register that counts the individual transfers to consecutive memory locations and upon reaching a final count generates an interrupt to trigger some processing of the received data block. This is useful if the input data are transferred word by word in a time serial fashion and processing can only start after the complete input is present. DMA hardware can also be used to receive a sequence of instruction codes via an interface and store them in the instruction memory before starting the execution (bootstrapping). If several interfaces need DMA support, separate address and counter registers must be used if the transfers occur during the same period of time. The same DMA hardware may, on the other hand, serve different interfaces at different times if the interface to be supported can be configured under software control.
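In software, the pipelining of Figure 6.10 is typically realized with two buffers that are filled and processed alternately. The following C sketch assumes an interrupt per received word; the register address and the function names are invented, and the fragment only illustrates the scheme rather than a particular processor:

#include <stdint.h>

#define BLOCK 64
#define INPUT_PORT (*(volatile uint16_t *)0x4010)  /* assumed input register */

void process_block(const uint16_t *data, int n);   /* the computational process */

static uint16_t buf[2][BLOCK];      /* two buffers used alternately */
static volatile int fill = 0;       /* buffer currently being filled */
static volatile int ready = -1;     /* completed buffer, -1 = none */
static int pos = 0;

/* interrupt routine: move one input word to memory (serial-to-parallel) */
void input_irq_handler(void)
{
    buf[fill][pos++] = INPUT_PORT;
    if (pos == BLOCK) {             /* block complete: hand it to the main program */
        ready = fill;
        fill ^= 1;
        pos  = 0;
    }
}

/* main program: process a block while the next one is being received */
void main_loop(void)
{
    for (;;) {
        if (ready >= 0) {
            int b = ready;
            ready = -1;
            process_block(buf[b], BLOCK);
        }
    }
}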
6.5.2 Parallel and Serial Interfaces, Counters and Timers

A parallel input interface is one that reads a full data word into a processor register or onto the data bus via tri-state drivers or pass gates enabled by chip select signals decoded at a particular address. A parallel output interface is obtained by connecting to the outputs of a CPU register or by writing a register from the data bus at a particular address. If both interface functions are combined, the data written to an output register can be read back (even from the same address), and the output interface behaves like a memory location. Parallel interfaces have the advantage of providing almost instant input and output at a rate as high as for memory accesses. Parallel outputs from a register can e.g. be amplified and used to control external machinery through a computer (see section 2.2.4), and a parallel input port can be used to connect to the switches of a keyboard.

A parallel interface usually transfers blocks of data as sequences of single word transfers. The processor can only write by words, and the wide, fully parallel output of application-specific data structures would be very costly. To exchange data blocks between sequential processors, a dual-port memory is a convenient, parallel interface that lets both the sender and
the receiver site access all the data words of a block. A simpler hardware interface can be used if the words are written and read sequentially. Then FIFO buffers can be applied, allowing e.g. output blocks to be written without delay. Output registers are often used to implement configuration registers to generate internal signals of a digital system to enable or select particular hardware functions or parameter settings under software control (section 2.2.3). The writing to configuration registers usually does not involve handshaking and is performed just once before entering into the main program function. A configuration register can e.g. be used to define the transfer direction of a bidirectional, half-duplex parallel port (similar to a bus interface) for which the two directions cannot be used simultaneously.

Whereas a parallel interface directly transfers words of 8 or 16 bits to and from the data bus of a CPU, a serial interface breaks the word transfer into a sequence of smaller size transfers using a reduced number of signals, maybe just one (cf. section 2.1.4). This can be achieved via a program extracting the bits from a word and outputting them from a particular bit of a parallel interface, but usually one prefers not to waste processing time for this and to use hardware shift registers to assemble words from bits or conversely. Then the processor actually inputs and outputs full data words, and the only difference is in the extended transfer time that may be needed for a serial transfer occurring at a low bit rate. As in the case of the memory bus, the main characteristic of a serial or parallel interface is its data transfer rate. Due to the reduced cabling, serial interfaces in particular are used with signal drivers for longer distance interconnections.

The simplest type of serial interface is the synchronous serial interface already shown in Figure 2.32. Here the shift register clock is transmitted on a separate signal line in parallel to the data. The receive site only needs to attach a shift register and to delay the input request to the receiving processor by means of a counter until all the bits have been shifted in. Synchronous serial interfaces can be clocked at very high speeds (e.g. at the CPU clock rate) and are easily implemented, e.g. on an FPGA. There, it pays off not to use a single data line but two of them because this doubles the data rate but increases the wiring costs by 50% only, and to adjust the transferred word size to the type of information actually transferred. The synchronous interfaces found on commercial chips mostly use one data line only and a constant, yet configurable word size.

An asynchronous serial interface transfers the clock for the receive shift register and the data on the same line (it uses a single signal line only). A common method is to output the bits of a word transfer each for a fixed time and to output an extra bit to identify the start of the transfer at the level opposite to the quiescent signal level (the ‘start bit’). After the transfer the signal returns to its quiescent level for at least one bit time (the ‘stop bit’). The shift register clock can be derived from the signal at the receive site by generating a train of equally spaced clock edges after detecting the start bit. This clock generation makes the asynchronous interface hardware more complex than the synchronous interface, and the transfer is usually slower.
A common interface used to acquire external events or to generate events with a precise timing is the binary counter that increments on the edges of an external signal, and that can be read by the processor via a parallel port attached to the data bus. It may be thought of as a pre-processor for the external signal. Common extensions to this basic function are to implement a write function to the counter to set its state, to generate an interrupt when a final count is reached, and to make it configurable via a register, e.g. to enable the generation of the interrupt or to select the signal input from different sources. An important special case
is the use of a periodic input signal, e.g. the processor clock. Then the counter becomes a timer, the individual counts representing discrete time intervals. A timer can be used for time measurements (within the resolution given by the clock signal) and to generate single or periodic interrupts after a prescribed time delay. A periodic interrupt can e.g. be used to allocate a portion of the ALU operations provided by the processor to some secondary task (defined by the interrupt routine).

There is a technical problem in reading a counter at some instant of time that is not related to the clock input. The counter bits might change in the moment of reading and the read result might be a mixture of previous and updated counter bits. This problem also occurs for a ‘long’ counter that needs to be read in a sequence of several read operations. The latter case can be treated by latching the full counter output at the time of the first read operation of the sequence. Misreading is then avoided by synchronizing the counter clock and the processor clock so that counter updates do not occur during the latching of the read data by the processor. Read errors can also be avoided by reading several times and comparing the results, or by using a Gray counter where a misreading can only affect one bit.

Timer functions play an important role in detecting errors. A processor continuously executing a program might stop working correctly at some time (an electrical spike might e.g. cause an erroneous instruction read and let it jump out of the program). Then a periodic timer interrupt can be used to check whether the program still works properly. If the processor does not respond to this interrupt, a simple circuit can be added to the counter that is armed when the final count causing the interrupt is reached and will force a hardware reset if the final count is reached a second time (after the reset the processor would check for errors and resume its operation). The reset circuit must be disarmed by the interrupt routine in order to avoid this reset as long as the program works properly. A timer extended by such a reset circuit is called a ‘watchdog’ timer. Another use of a timer to handle an error situation is the generation of a timeout interrupt if the expected handshake does not occur. Before starting to wait for the handshake(s) in question the timer has to be armed to generate its interrupt after some time (defined by presetting the counter). If the handshake occurs within this time, the timer is disabled via a configuration register.
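The read-several-times technique for a long counter mentioned above can be sketched as follows; the register addresses are invented for the example and stand for the two halves of a 32-bit timer mapped into the address space:

#include <stdint.h>

#define TIMER_LO (*(volatile uint16_t *)0x4020)  /* assumed low half  */
#define TIMER_HI (*(volatile uint16_t *)0x4022)  /* assumed high half */

/* Read a 32-bit timer as two 16-bit words; repeat if the high half changed
   between the reads, so that no inconsistent mixture is returned. */
uint32_t read_timer32(void)
{
    uint16_t hi, lo, hi2;
    do {
        hi  = TIMER_HI;
        lo  = TIMER_LO;
        hi2 = TIMER_HI;
    } while (hi != hi2);
    return ((uint32_t)hi << 16) | lo;
}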
6.5.3 Input/Output Buses

The concept of a bus has already been explained for the memory bus (section 2.2.1) and the I2C bus (section 2.3.2). Several circuits drive signals to the same media one at a time to perform a data transfer to another circuit connected to the bus. The simplest configuration of this kind is the half-duplex interface using bi-directional transfers on the same signal lines between two subsystems. The bus corresponds to the idea of sequentially using a resource for different purposes, this time applied to the interconnecting media. As for the computational building blocks, this requires scheduling and some extra control effort. The savings through reusing the same hardware must outweigh the costs for the additional control.

For a serial bus like the I2C bus, the transfer rates suffer from the lower speed interfaces but may be fast enough for the intended application. By integrating a serial bus interface with a parallel port, one obtains a remote interface with several input and output lines but a simple wiring to the processor system. A parallel bus used just to interface processors to each other (but not to perform memory accesses) combines the higher speed of parallel interfaces with the sharing of the wiring resources and interfaces.
Using bus interfaces involves two control issues. The first is arbitration; only one interface site at a time can enable its output drivers. This can be handled by having a single master site that by default has the right to drive the bus and which lets other sites respond to it by sending them a command to do this. The memory bus of a processor is managed this way, the CPU being the bus master. Only on a write command from the CPU is a memory or interfacing circuit attached to the bus allowed to drive the data lines. On a multi-master bus, several or all sites get the right to drive the bus lines in turn, using some handshaking protocol or a mechanism to detect and resolve collisions when two sites try to drive the bus lines simultaneously. The bus arbitration may be distributed, using e.g. collision detection or a control message to hand over the bus driving to another site, or handled by a dedicated arbiter that receives bus driving requests from all sites and grants the mastership using priorities or a round robin method. The technique of passing the bus command is also used for the memory bus and then allows several CPUs to share the same memory and interfacing devices. Only the CPU that is granted command enables its address and control outputs to access the bus.

The second control issue is the need to identify the destination sites for a message (which may be a single site, or a group of sites). On the memory bus this is the function of the address output. An input/output bus (i/o bus) without separate address lines must pass the destination code along the data lines using some protocol. The interfacing sites must detect the control message and find out whether they are among the destination sites for the information placed on the bus subsequently. As well as by the electrical interfacing, an i/o bus is defined by the protocols used to sequence control and data transfers and hand over command. These may be realized in software by the attached processors, or by extending the interfacing hardware.

A bus may be realized with custom interfaces and protocols. There are a number of bus standards allowing components from different sources to be used by simply connecting them to a bus conforming to the standard. Some originate from the world of personal computers where it is important to be able to connect equipment from various manufacturers, and many different system configurations must be supported. For these (the PCI bus, USB and others), the actual configuration is determined dynamically, devices get their bus addresses from the operating system, and may be connected to the bus or disconnected even while the system is running (hot plugging on the USB). This implies both extra hardware support (address and configuration registers) and a software protocol to detect and configure the devices on the bus. Among the standard buses are also parallel data and address buses for memory and i/o including interrupt signals. The PCI bus is e.g. used for add-on interface modules for personal computers including video i/o and devices supplying bootstrap code, and for interfacing such components to an embedded processor. Other buses are defined to connect processor, memory and i/o modules on an FPGA or ASIC that come from various manufacturers in the form of VHDL sources or data files that need to be synthesized (IP modules, section 2.3.2).
Standardization and providing an interface to which modules of different origins can be connected are most important and are paid for with a considerable hardware effort. The buses discussed in the next sections are all standard buses that can easily be used in application-specific systems by using available protocol chips.
6.5.3.1 The CAN bus

The controller area network (CAN) bus is also a serial multi-master bus, this time using a single, differential signal to transfer clock and data. Special transceiver circuits (e.g., the
82C250 from Philips) are used to translate between the H/L levels of an attached processor and the differential bus signal, which uses a dominant and a recessive level similar to an open collector bus line. Only if all processors drive the bus to the recessive level will this level show up on the bus. Bit rates for the CAN bus are up to 1 Mbit/s and must be the same for all interfaces. The CAN bus interface is used in the automotive industry and has been integrated into several micro controller chips to serve this large market.

The CAN bus protocol is designed to transfer short messages of up to 8 bytes between the processors attached to the bus. It is handled transparently by the interface circuits that perform the handshaking and error checking and use DMA to read and write the messages from and to data structures in memory, the so-called CAN objects. It is these objects and not the processors that are addressed by messages. Every CAN object thus has a unique identifier and is present on all processors using it for data transfers. Typically, it is written to by a unique processor but read out by several ones. The CAN interface can be told to send out the data in the object after changing them at the write site. Then they are copied to all read sites. Similarly it is possible to request the updating of an object at some read site. Some CAN interfaces manage several objects (up to 32) this way. The objects can be used like independent interfaces to send data to selected threads on the processors without having to care for the multiplexing of the interface involved.
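In software, a CAN object can be pictured roughly as the following structure. This is a hedged sketch only; the layout and the driver function are invented for the example and do not correspond to a particular CAN controller:

#include <stdint.h>

/* a CAN object: an identifier plus up to 8 data bytes kept in memory and
   transferred by the CAN interface circuit */
typedef struct {
    uint16_t id;         /* unique object identifier used on the bus */
    uint8_t  length;     /* number of valid data bytes (0..8) */
    uint8_t  data[8];    /* message payload */
} can_object_t;

void can_send(const can_object_t *obj);   /* assumed driver call: copy to all read sites */

/* example: update an object at the write site and request its transmission */
void publish_speed(can_object_t *speed_obj, uint16_t rpm)
{
    speed_obj->data[0] = (uint8_t)(rpm & 0xff);
    speed_obj->data[1] = (uint8_t)(rpm >> 8);
    speed_obj->length  = 2;
    can_send(speed_obj);
}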
6.5.3.2 The universal serial bus

The universal serial bus (USB) was designed as a serial bus interface to which up to 127 PC peripherals from different vendors can be attached. It has become a widespread interface to all kinds of digital systems attached to and controlled by a PC. The USB defines the signals, the protocols and even the connectors to be used. It uses a single, bi-directional differential signal that is transferred by twisted pairs of wires through a tree connecting the master at the root to the slaves at the leaves. The intermediate nodes are the USB hubs. The USB signal encodes both the data and the bit clock and provides a data rate of 12 Mbit/s for the first USB generation. The second one extends this rate to 480 Mbit/s. The USB cabling also provides a ground reference and a 5 V power line that allows low-power peripherals to be powered via the interface cable. The connectors support the ‘hot plugging’ of interfaces. The master explores the USB tree and also detects the plugging of additional interfaces during the operation of the system.

The USB is a single master bus, the master being the PC or some embedded processor. The series of bus accesses carried out by the master under the serial control of its program performs the bus arbitration. In order to obtain data from an interface, the master first sends a command to it. There is no provision for an interface to directly send out data to another one. It is only by sending out periodic requests that the USB master can detect and react to events at an interface, the periods being multiples of 1 ms. The network addresses are assigned to the interfaces by means of control messages from the master.

Every interface implements a number of ‘endpoints’ to which messages from the host can be directed. There is at least one endpoint receiving control messages from the master and sending out information on the interface using special message formats defined in the USB specification. In the PC operating system, this information is used to automatically load software drivers for the interface. The other endpoints (up to 31) are either receiving or transmitting endpoints to which the master can send data packets or receive from them. There are different timing schemes that can be used for these. One is the bulk transfer mode in which a stream of bytes is transferred using all of the available data transfer bandwidth. Every transfer starts with a special start message including the address information and is followed by a single or several data packets with individual check sums. Acknowledge messages are transferred to implement the handshaking and error correction. By automatically adding the needed control transfers and packing data as required using some extra hardware, USB interfaces can be designed so that the endpoints are attached to a processor bus as individual FIFO buffers. The different endpoints can be given different characteristics and be used to transfer data to and from different threads. A fixed amount of data can be transferred every ms without error checking, or an arbitrary byte stream can be transferred with error checking but with a lower priority than the fixed data rate transfers, including those occurring in response to the periodic requests.

Several standard products integrate self-contained USB interfaces with all the required circuits to serialize and buffer the USB data, and to respond to the initial information request and the request to set the bus address. A simple product of this kind for the 12 Mbit/s rate is the FT8U245BM from FTDI that interfaces to an attached processor like an 8-bit bi-directional port buffered by a receive and a transmit FIFO (Figure 6.11). The address configuration occurs automatically. The FX2 chip from Cypress supports the 480 Mbit/s rate and includes an 8-bit micro controller equipped with program and data RAM. It is bootstrapped via the USB interface and used to generate various control and interface signals from the USB data. Several other micro controllers offer USB interfaces, too.

Figure 6.11  Simple USB slave interface providing buffered send and receive channels (block diagram: USB master or hub, differential USB signal, FT8U245, 8-bit data bus and handshake signals to the attached processor)
6.5.3.3 Embedded Ethernet

The Ethernet is the most common network interface found on computer workstations. It provides connectivity in local area networks (LAN) but also interfaces the workstations to wide area networks (WAN) and, in particular, the internet using the TCP/IP protocols. The Ethernet with the IP protocols has also become a common interface to dedicated, embedded computers as it allows an easy remote access from workstations using standard communication software (e.g., web browsers and standard libraries available to the workstation programming tools). Due to the availability of easy-to-use interfacing components it is also used as a convenient high-speed bus to communicate within distributed embedded systems such as microcontroller networks. The IP packets can be delivered to an Ethernet interface via a wireless LAN (WLAN) to obtain a convenient wireless interface to an embedded system (e.g., one controlling a robot) from standard computers.

The Ethernet is a serial multi-master bus that uses separate differential receive and transmit signals from every interface to connect it to a single other one or to a hub or switch node with several interfaces. The cabling thus looks like connecting point-to-point but the interconnection structure is operated as a bus. A signal input through an interface of a hub appears as an output signal on all other hub interfaces. In a switch, interfaces are dynamically connected point-to-point in order to exploit the wiring structure for parallel transfers between
separate pairs of interfaces in a way transparent to the endpoints. Then, it is even possible to use the receive and transmit lines simultaneously (‘full duplex’). The standard bit rates of the Ethernet mostly found in embedded systems are 10 and 100 Mbit/s using Manchester encoding to transfer clock and data on the same lines (section 2.1.4). There are also 1 and 10 Gbit/s versions of the Ethernet. During the time in which there are no transmissions there is a residual activity on the bus lines (pulses transmitted every 16 ms) that allows a receiver to detect the presence of some transmitter connected to it. The interfaces include transformers that isolate the attached subsystems from the Ethernet cables.

The Ethernet relies on detecting collisions due to simultaneous transmit operations from several interfaces. Then, all transmitters retire from the bus and try to retransmit after some random delay. Only if collisions can be excluded through some policy on how to grant access to the bus does the timing of the Ethernet become deterministic. To detect collisions it is necessary that the data transmitted by an interface are simultaneously received by it in order to be able to compare the data on the bus to those sent out, which excludes full duplex receive and transmit operations of independent data. If a switch is used, collisions are confined to occur within it. The switch needs to provide the buffer space needed to handle them.

The Ethernet bus definition also specifies a basic protocol. By it, the Ethernet is restricted to transfer bit fields of up to 1508 bytes packed with destination addresses and a CRC check sum into frames of up to 1518 bytes. Bit fields longer than 1508 bytes need to be split into several frames. The Ethernet standard actually defines a minimum frame size of 64 bytes. For the 10 Mbit/s rate, every frame is preceded by 8 bytes used to synchronize the bit clock and to indicate the start of a frame and thus takes a minimum duration of about 60 µs. The minimum size makes sure that in an extended network collisions can still be detected. For small networks, shorter frames can also be used. An Ethernet frame contains the following sub-fields:
r 6 byte destination address, first bit indicates a multicast, the all-ones code a broadcast; r 6 byte source address; r 2 byte length field or protocol code; r application data, padded to obtain the minimum frame size; r 4 byte CRC field. The address resolution protocol (ARP, code 0x806) is used to connect an IP address with the Ethernet MAC (media access control) address. The internet protocol (IP, code 0×800) messages consist of application data packed with extra IP headers. An IP header includes the source and destination addresses, a type byte indicating the kind of data contained in the application field, a length field, a sequence number, and a check sum. The IP layer takes care of routing messages to recipients not connected to the local Ethernet via gateways through the Internet. The user datagram protocol (UDP) data field consists of another 8 bytes including the source and destination ‘port’ codes that can be thought of as addressing the sending and receiving threads, and application specific raw data. UDP messages are not acknowledged; a message sent out is not guaranteed to be received. The transmission control protocol (TCP) verifies the delivery of data and also takes care of breaking up long bit fields into smaller ones to make them fit into the Ethernet frames and to reassemble them at the destination. TCP is e.g. used to transfer textual data in the HTML format using the HTTP protocol. These can be displayed on a web browser or be sent by the browser to transfer parameters input by a user to control an embedded system. The protocols used to handle the different network
aspects are often referred to as a protocol stack (Table 6.6). For a complete definition we refer to [41].

Table 6.6 Protocol layers passed by a message

Application layer: specify destination and raw data to be transferred
Format HTTP page
Transfer page via TCP protocol, adding TCP header
Add IP header information
MAC layer: add Ethernet header
Handle low-level media access (physical layer, performed by controller)

The standard protocols obviously involve a lot of overhead for formatting and analyzing the various headers. A UDP message requires a total of 40 bytes that have to be transferred and analyzed for it to be part of a valid UDP message, including computations of the IP check sums. For TCP the overhead is even larger, and the technique of realizing a user interface by sending textual data back and forth involves further overheads for scanning the texts for parameters and converting them from characters into the corresponding binary codes. The full implementation of the IP protocols on an embedded processor requires a considerable amount of memory and processing time that does not contribute to the application processing. If just ARP and UDP are implemented, the effort is much smaller. In a dedicated system using a local Ethernet bus it is, moreover, possible to use non-standard protocols that do not require extensive headers, e.g. simply the Ethernet protocol layer, which already includes error checking and delivers frames to their selected destinations only, or a mix of standard and low-overhead non-standard protocols. The minimum amount of header and error-checking information (the 8 synchronization bytes, the 6 byte Ethernet destination address, the 2 byte length field and the 4 byte check sum) sums up to 20 bytes, or 16 µs per message at the transfer rate of 10 Mbit/s (21 µs if the source address is included). If the net data rate is a concern, messages should on average be significantly longer than 20 bytes. The integration of an Ethernet interface into a digital system is fairly easy, despite the complexity involved in handling the protocol layers. There are standard components integrating the serial interface and the collision detection circuits, the address comparators needed to select the right messages from the bus, a buffer memory for several Ethernet frames, and error checking logic. An attached processor may then read and write the Ethernet packets and handle the IP protocol layers if desired. It only has to handle packets with the right destination address and finds them in a memory where they have been put using DMA. The CS8900A from Cirrus Logic performs these functions for the 10 Mbit/s rate and supports DMA to move received messages out of the limited internal buffer space. The LAN91C111 from SMSC is an example of a product that can be used for the 10 and 100 Mbit/s rates. It does not support DMA to external memory but provides more internal buffer space and a wider data bus interface. Both include some support for handling multicast messages. There are also some highly integrated processor chips that include an Ethernet interface of this kind (e.g., the AT91RM9200 from Atmel and similar products, and some processors from IBM and Motorola). By adding some extra hardware or software (e.g., a simple processor) the ARP messages and the IP and UDP layers can also be taken care of automatically in order to arrive at a simple application layer interface. The recent high-performance PowerPC 440GX from IBM integrates some hardware support for the IP protocols on its two integrated 1 Gbit/s Ethernet controllers. Whether it
is more efficient to use dedicated hardware to support the IP protocols or to share an existing processor for this task depends on the mean rate of received messages.
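To put these framing and protocol overheads into numbers, the following C sketch lays out the Ethernet, IP and UDP headers as simple structures and computes the time a message occupies on a 10 Mbit/s bus for several payload sizes. The field names and the program are illustrative only and are not taken from any particular protocol stack; structure packing and the padding to the minimum frame size are glossed over.

#include <stdint.h>
#include <stdio.h>

/* Illustrative header layouts; the field names are our own, not a real stack's. */
typedef struct {
    uint8_t  dst[6];         /* destination MAC, first bit marks a multicast    */
    uint8_t  src[6];         /* source MAC                                      */
    uint16_t type_or_len;    /* 0x0800 = IP, 0x0806 = ARP, or a length          */
} eth_header;                /* 14 bytes, followed by payload and 4-byte CRC    */

typedef struct {
    uint8_t  version_ihl, tos;
    uint16_t total_len, id, frag;
    uint8_t  ttl, protocol;  /* 17 = UDP, 6 = TCP                               */
    uint16_t checksum;
    uint32_t src_ip, dst_ip;
} ip_header;                 /* 20 bytes without options                        */

typedef struct {
    uint16_t src_port, dst_port, length, checksum;
} udp_header;                /* 8 bytes */

int main(void)
{
    printf("header sizes: %zu + %zu + %zu bytes\n",
           sizeof(eth_header), sizeof(ip_header), sizeof(udp_header));

    /* 8 preamble/start bytes + Ethernet header + CRC + IP + UDP headers. */
    const int overhead = 8 + 14 + 4 + 20 + 8;
    for (int payload = 20; payload <= 1280; payload *= 4) {
        int bits = 8 * (overhead + payload);
        printf("payload %4d bytes: %5d bits on the wire, %.1f us at 10 Mbit/s\n",
               payload, bits, bits / 10.0);   /* 0.1 us per bit at 10 Mbit/s */
    }
    return 0;
}

For a 20-byte payload almost three quarters of the transferred bits are protocol information, which illustrates why messages should be significantly longer than the headers on average.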
6.5.3.4 Parallel interface buses

Parallel interface buses differ from serial ones by providing several data lines in parallel (8 to 64) plus additional control and address lines, so that higher transfer speeds result (which could, however, also be obtained by a serial bus using very high bit rates) and the interface to a processor outputting parallel addresses and data can be simpler. The parallel data lines share the additional handshake and control signals. A multi-master bus would include handshake signals to request control of the bus. A single address line is sufficient to distinguish data and control transfers and to mark the beginning of a serial multi-word data transfer. If a bus provides many address signals, it can also be used to attach memory devices. High-speed long-distance buses can use differential signals although these double the number of signal lines. Most parallel buses are used within a circuit board or a local backplane only. A number of bus interface definitions used in dedicated, embedded processor systems stem from the PC environment, starting from the bi-directional printer port standard and the IDE interface or the SCSI bus used for mass storage devices. By adhering to such an interface and its protocol it is easy to interface a standard hard disk to an arbitrary processor system, or a removable one operated with a standard directory structure if high-volume data exchange with a standard computer is required. The PCMCIA bus used to interface small plug-in cards containing non-volatile memory and all kinds of PC interfaces can be used in the same way. Another bus originally used to connect PC components and now widespread in embedded systems, too, is the PCI (peripheral component interconnect) bus. The PCI bus is a synchronous, multiplexed bus using 32 data/address lines and a 33 MHz bus clock (faster and wider versions also exist), DMA and IRQ signals, and non-differential signals with levels generated from a 3.3 V power supply with specific current drive and rise time requirements to meet the transfer speed requirements [73]. The most important signals are:
AD0...AD31: data/address lines
CLK, R/W, ME, ALE: PCI clock, memory control signals
IRQ, DMA: interrupt and DMA request signals

Peripheral components attached to the PCI bus are supposed to locally generate the required select signals from the current address. The address space of a device and the interrupt parameters used by it are not hard-wired but are supposed to be configurable through special configuration cycles issued by the bus master. Every PCI device also contains a short table from which the master can read off its memory and interrupt handling requirements. The bus transactions on the PCI bus are bursts that start with an address transfer and are followed by several data word transfers. PCI is a multi-master bus that can also be used to communicate data between processor subsystems via some memory interfaced to the PCI bus. The configurable chip select logic, the generation of address signals and the adaptation of the data transfers of some device using standard handshaking signals to the PCI burst transfers require additional logic. There are standard chips that do this job and include FIFO memories to implement the bursts. Several standard and application domain-specific
processors include integrated PCI interfaces. The I/O cells of current FPGA chips meet the PCI signal requirements. An FPGA-based system can thus be interfaced to the PCI bus by implementing the special PCI functions into the FPGA.
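As an illustration of these configuration cycles, the sketch below reads a device's identification, its first base address register and its interrupt line from the configuration header using the legacy x86 ‘configuration mechanism #1’ (I/O ports 0xCF8/0xCFC). This is a hedged sketch, not a description of a particular chip set: the port-access primitives are left as stubs, and a host bridge or an FPGA-based bus master would issue the equivalent configuration cycles in its own way.

#include <stdint.h>
#include <stdio.h>

#define PCI_CONFIG_ADDRESS 0xCF8u   /* legacy x86 configuration mechanism #1 */
#define PCI_CONFIG_DATA    0xCFCu

/* Stub port-I/O primitives; on a PC-class host they would map to the platform's
   port input/output instructions (the stubs make the sketch compile as is). */
static void port_write32(uint16_t port, uint32_t value) { (void)port; (void)value; }
static uint32_t port_read32(uint16_t port) { (void)port; return 0xFFFFFFFFu; }

/* Issue a configuration read for the given bus/device/function/register offset. */
static uint32_t pci_config_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t offset)
{
    uint32_t addr = 0x80000000u               /* enable bit                     */
                  | ((uint32_t)bus << 16)
                  | ((uint32_t)dev << 11)
                  | ((uint32_t)fn  << 8)
                  | (offset & 0xFCu);         /* dword-aligned register address */
    port_write32(PCI_CONFIG_ADDRESS, addr);
    return port_read32(PCI_CONFIG_DATA);
}

int main(void)
{
    /* The 'short table' of a device: IDs, first base address register, IRQ line. */
    uint32_t id   = pci_config_read32(0, 0, 0, 0x00);  /* device ID | vendor ID   */
    uint32_t bar0 = pci_config_read32(0, 0, 0, 0x10);  /* memory/i/o requirements */
    uint32_t irq  = pci_config_read32(0, 0, 0, 0x3C) & 0xFFu;
    printf("id 0x%08x, BAR0 0x%08x, interrupt line %u\n", id, bar0, (unsigned)irq);
    return 0;
}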
6.5.4 Interfaces and Memory Expansion for the CPU2

We conclude the description of the CPU2 design with its interface to a 16-bit synchronous peripheral bus that in particular connects it to a memory controller, supplying it with instructions and data and used to implement soft caching, and to other processors in an FPGA design. A special i/o instruction supports the peripheral bus port and a second interface. The interfaces are controlled by extra automata that are just started by the i/o instruction and autonomously perform entire block transfers in parallel to the operation of the CPU (cf. section 6.2.2). The i/o instruction does not wait for the data transfer to complete but reports its state in the F bit and must be retried until it is over. The instruction supports two bidirectional interface modules according to Listing 6.4. They are attached to the memory data bus in order to allow for single cycle DMA transfers. The control code output ports are intended to support their multiplexing. The second interface module designed for the CPU2 is a bi-directional synchronous serial interface with a combined handshake and direction signal. The control port is intended to issue connect and disconnect commands to an optional cross-bar switch (cf. section 7.3). The block transfers rely on a simple DMA function using the address and counter registers DMAA and DMAC. This DMA function is integrated into the CPU so that there is no arbitration overhead for switching between memory accesses by the CPU and DMA accesses. DMA transfers are set up automatically in response to the i/o instruction (in conventional DMA implementations the CPU needs to configure the DMA hardware by a series of write operations for every transfer). The i/o bus accessed via the parallel interface module operates synchronously to the processor clock. The control port of the parallel interface is used to output handshake signals to each of the other processors and to other i/o devices with a compatible bus interface such as the memory controller. Once both processors have activated their mutual handshake outputs, the sending site (which is defined through software) issues a write request to a bus arbiter that responds by enabling the output of the sending processor (Figure 6.12). The transfer terminates with the sending processor resetting its handshake output synchronously with the bus arbiter resetting the output enable signal. This scheme combines individual handshake lines with a shared bus resource and eliminates the need to implement a serial addressing and handshaking protocol. The handshaking signals do not depend on the direction of the transfer and could be combined into a single signal using a wired AND function (the signal HS in Figure 6.12 is the conjunction of both). If there are simultaneous bus requests, it is the bus arbiter that selects the transfer to be carried out first and even controls the bus cycle. If there is just one CPU and the memory controller, then the bus arbiter is not needed at all. If only one data line and a single combined handshake line are used, the i/o bus can be granted continuously and degenerates to a bi-directional synchronous serial interface using the handshake signal as its clock. The memory controller shown in Figure 6.12, used both for bootstrapping the CPU2 from an EPROM and for subsequently loading more of the program (and for caching data, too), is easily implemented with some binary counters and a few state flip-flops.
As pointed out in section 6.2.3, this approach significantly simplifies the CPU and allows memory devices of
Figure 6.12 I/O bus of the CPU2 (CPU2 data port, memory controller with DRAM and EPROM, further CPUs or i/o devices, and the bus arbiter, tied together by the clock, the handshake lines and HS, and the bus request and enable signals)
all kinds, sizes and speeds to be interfaced including DRAM and hard disks without having to change the CPU design. Moreover, a single memory controller can be used for several CPUs connected to the i/o bus.
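From the software side, such an autonomous block interface reduces to the familiar start-and-poll pattern. The fragment below is a C-flavoured sketch of that pattern for a hypothetical device model with a DMA address register, a word counter and a status flag playing the role of the F bit; it is not CPU2 code, which starts the transfer and samples the flag with the i/o instruction itself.

#include <stdint.h>

/* Hypothetical register model of a DMA-backed block interface: software writes
   the start address and the word count and then polls a 'finished' flag while
   the transfer proceeds autonomously, in parallel to normal CPU operation. */
typedef struct {
    volatile uint16_t *dmaa;    /* DMA address register               */
    volatile uint16_t *dmac;    /* DMA word counter                   */
    volatile uint16_t *status;  /* bit 0 plays the role of the F bit  */
} block_port;

static void send_block(block_port *p, uint16_t start_addr, uint16_t words)
{
    *p->dmaa = start_addr;      /* where the block lives in local memory        */
    *p->dmac = words;           /* how many words to move                       */
    while ((*p->status & 1u) == 0)
        ;                       /* retry until the interface reports completion */
}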
6.6 STANDARD PROCESSOR ARCHITECTURES

In this section we briefly review some standard CPU designs that are available as chips and used in many applications, starting with some criteria for comparing the different architectures, and also take the opportunity to look at the different instruction set architectures. Most of the processors are RISC chips using a load-store architecture and 16 to 32 general purpose registers for the integer operations and for indirect addressing. A difficulty with this approach is that addresses do not necessarily have the same size as the data; 16-bit processors with a larger address space then resort to using extension registers and to restricting indirect addresses to 16-bit pages at restricted starting addresses. Adding an integer offset to an address or adding two integers are usually the same Boolean operation. Call and return mostly use a link register, and loops and arithmetic exceptions are implemented with conditional jumps.
6.6.1 Evaluation of Processor Architectures

There are many competing CPU architectures that are offered as standard processors or ASSP chips (application-specific standard processors) or as IP modules to be integrated into FPGA or ASIC designs. We list some basic characteristics and criteria that can be applied to compare different chip offerings and ASIC cores. The net performance of a processor may be measured in terms of ALU operations per second for a given, application specific mix of operations. Two processors can be compared by running the same application algorithm on both using the fastest memory cycles supported by the processors and a hand-crafted memory and register allocation to achieve the best performance on each of them. The comparison obviously depends on the chosen algorithm
Table 6.7 Selected EEMBC benchmarks for embedded processors

Basic floating point, Matrix arithmetic, Bit manipulation, Table lookup, CAN remote data request, Autocorrelation, Convolutional encoder, Viterbi decoder, Fixed-point complex FFT, Inverse FFT, Inverse discrete cosine transform, FIR filter, IIR filter, JPEG compression, JPEG decompression, RGB to YIQ conversion, Image rotation, Dithering, Text processing, Packet flow
and the quality of the implementations. If particular kinds of application-specific operations or functions are known to be frequent, e.g. Boolean operations on port bits, or dot products using integer or floating point multiply and add operations, an easier, rough comparison is possible by just implementing these. There are standard sets of applications written in C and used as benchmarks for embedded processors, e.g. the set published by the EEMBC consortium [76], part of which is shown in Table 6.7 (a few of them correspond to algorithms explained in this book, and some mostly apply to the DSP processors in Chapter 8). High-level language (HLL) benchmarks suffer from the fact that their results depend on the optimization capabilities of the C compilers used for them. As application programming is mostly done using an HLL, such benchmarking is useful, but it clearly needs to identify the compiler used to obtain the results. Basically, the performance of a processor results from its instruction rate, the kinds and numbers of operations performed by individual ALU instructions, the ALU efficiency in various memory configurations (which depends on the number of data registers, the addressing facilities, and the caching), and the typical overheads of branches and loops and their effect on the pipelining. Besides the net performance, the operating cost (the power consumption) of a digital system depends to a large extent on the processor (but also on other system components). The rate of operations divided by the power consumption can be used to compare processors for their power efficiency. It is independent of the chosen clock rate and of the number of processors of the same type used in a system. In contrast to application-specific FPGA and ASIC designs, the overall efficiency of a processor chip seems to be less important than its net performance or its power efficiency. If, e.g., the low efficiency results from integrating resources which cannot be fully used in an application, the costs can still be low as the chip can be manufactured in higher volume. The integration of memory and of the most important interfaces and system functions into a processor chip facilitates the system design and tends to reduce its costs. Finally, the development costs for a system design building on a particular processor are more related to the tools than to the processor architecture.
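A rough comparison of the kind described above can be based on kernels as small as the following two; timing them on each candidate processor (compiled with the vendor tools and a hand-tuned data layout) and dividing the 2n operations per call by the execution time gives the net operation rate, and dividing that rate by the measured power gives the power efficiency figure discussed here. These are generic sketches, not EEMBC benchmark code.

#include <stddef.h>
#include <stdint.h>

/* Integer dot product: one multiply-add per element, accumulated in a wider type. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Single-precision floating point dot product for the floating point mix. */
float dot_f32(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}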
6.6.2 Micro Controllers

Micro controllers are application-specific standard processors (ASSP) for the domain of embedded control applications. ‘Embedded’ means that the processors are programmed for a
dedicated application in which they interact with an environment. They receive real-time input from it and output control signals. There are many applications for such processors that range from simple automata to complex measurement and control applications in robots and cars, and the products further specialize in more specific domains. The chip offerings are actually still more focused on narrower application areas through the features and speeds of the CPU cores and through the integration of memory and peripherals. Many micro controller chips are almost complete systems-on-a-chip that integrate interfaces and some memory and can be operated stand-alone or with the addition of a flash memory chip but can also be extended by other chips according to the needs of an application. Another important class of ASSP targeted at digital signal processing (DSP) will be considered in detail in Chapter 8. Most micro controllers are based on fairly simple CPU designs using 8-bit and 16-bit ALU circuits and supporting 16- to 24-bit memory addresses, but they extend to 32-bit RISC architectures and DSP derivatives that differ from related high performance standard processors and DSP chips only through a particular mix of control oriented on-chip peripherals. The use of a 32-bit ALU is beneficial if the application really uses a substantial amount of 32-bit integer or floating point operations; otherwise one only exploits the larger addressing space, which actually does not depend on implementing operations on 32-bit data but typically requires 32-bit address computations. A 32-bit data type does have the advantage that, starting from smaller input values, more complex expressions can be evaluated without having to check for overflows. The extra bits then serve as the equivalent of multiple overflow bits. An operation that moves a 16-bit half word to a 32-bit register with sign or zero extension can be used to determine through a comparison whether the final result is in the smaller range again. Also, a 32-bit ALU might implement 16-bit SIMD operations. In the broad range of low-to-medium performance micro controllers, the chips traditionally run from a 5 volt supply and operate at moderate instruction rates from 2 to 40 MIPS (million instructions per second) yet offer enhanced interrupt and bit manipulation capabilities. Technologies below 0.35 µ yield faster and cheaper chips but run from 3.3 V and lower supply voltages. For automotive applications and motor control there are medium-to-high performance offerings often supporting analog input by integrating a medium resolution analog-to-digital converter (ADC). They operate at 40 to 200 MHz and use finer technologies and lower core voltages. Processor chips with integrated analog input and output circuits are known as mixed signal devices. In the high-end chips offered for automotive applications, communications, and portable computing and gaming applications, the technology is fully exploited. Ironically, a simple 8-bit CPU realized on a low-cost FPGA of the latest technology will achieve a higher performance than a standard 8-bit processor manufactured in a conservative technology. A micro controller implemented in a fine technology does not have to run as fast as possible but can be selected in order to achieve a higher level of integration and lower power consumption only. 0.18 µ chips, for example, typically come with 32 k bytes of RAM or more and can be used in many applications without adding further RAM.
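The overflow check mentioned above, evaluating an expression in 32-bit arithmetic and comparing the result with its own sign-extended low half only at the end, reads as follows in C; the expression itself is an arbitrary example.

#include <stdbool.h>
#include <stdint.h>

/* True if x can be represented as a signed 16-bit value: sign-extend the low
   half word and compare, the register-level trick described in the text. */
static bool fits_int16(int32_t x)
{
    return x == (int32_t)(int16_t)x;
}

int16_t weighted_sum(int16_t a, int16_t b, int16_t c, bool *overflow)
{
    int32_t r = (int32_t)a * 3 + (int32_t)b * 5 - c;  /* no intermediate checks   */
    *overflow = !fits_int16(r);
    return (int16_t)r;   /* truncated low half; only valid if *overflow is false */
}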
If an application requires 5 V signal levels but otherwise benefits from the use of a 3.3 V processor, then the interfacing may require some extra effort. Some 3.3 V chips tolerate input voltages of up to 5 V. Otherwise, there are 3.3 V powered buffer chips with 5 V tolerant inputs that can be used to protect 3.3 V input circuits. Outputs from 3.3 V chips can be connected to the standard digital inputs of 5 V chips in any case. The integration of different technologies onto a system-on-a-chip is not always desirable. A system requiring analog input (cf. section 8.1) usually achieves a better performance by using a separate converter chip.
Table 6.8 List of features for the AT90S1200

CPU resources: 8-bit ALU, single bit operations; 32 registers; 3-word sub-routine and interrupt return address stack
Operation rate: CPU executes 12 MIPS
Integration: 512 instruction words of flash memory and 64 bytes of data EEPROM; clock generator, timer and watchdog timer; synchronous serial port, parallel port, analogue comparator
Power consumption: below 50 mW using a 5 V supply
The simplest micro controllers come without an external memory interface and just offer a few external interface signals. Widely used 8-bit controllers are the derivatives of the i8051 processor originally designed by Intel and the 6805 family from Motorola; there are many others. As an example of a more recent 8-bit micro controller design we consider the AVR family of micro controllers from Atmel. The AVR uses a load-store RISC approach and a large set of 8-bit registers. The instructions are 16-bit and use two register addresses for the ALU operations. Instruction fetches and their execution occur in a two stage pipeline (similar to the CPU1 and CPU2 designs). Some members of the family integrate a flash memory; others have an external bus interface and an extra 8-bit multiply circuit. The AT94K CPU plus FPGA devices discussed in section 2.2.4 even integrate an SRAM for program and data for the micro controller subsystem, connect to a serial EPROM to load the internal memory instead of providing an external bus, and achieve 25 MIPS for it with their 0.35 µ, 3.3 V chip technology with 5 V tolerant inputs. The low-end device AT90S1200 is a 20-pin component that only provides the registers for intermediate storage and a 512 word flash memory that can be written to via a serial interface using a simple protocol. Table 6.8 lists the main features. 16-bit controllers come from several manufacturers, e.g. Motorola, Philips, Texas Instruments and Hitachi. A recent single-chip ‘simple’ 16-bit RISC design with a 16-bit byte address range is the MSP430 from Texas Instruments. It includes some RAM and a flash memory but no external memory interface, optionally integrates a 16-bit multiplier, a serial interface, and a 12-bit A/D converter, and executes 8 MIPS, consuming just a few mA from a 3 V supply. A 40 MIPS 16-bit controller with a core voltage of 2.5 V and 3.3 V external signals tolerating 5 V levels is the DSP56F826 from Motorola. It integrates 64 k bytes of flash memory with an endurance of 10000 writes, 9 k bytes of SRAM, some peripherals and a simple bus interface, using a Harvard architecture with two separate 65 k word address spaces for program and data. The CPU provides a parallel MAC circuit, which was seen to be a versatile compute circuit (section 6.1.1), and a zero-overhead loop structure. The power consumption is about 150 mW. The chip provides a JTAG interface that also serves to insert a bootstrap loader program into the flash memory. In the same family one finds the more recent DSP56F8356 with more memory and peripherals, including a CAN controller and a 12-bit ADC (the typical mix of interfaces for chips targeting automotive and motor control applications), and executing 60 MIPS. It generates its 2.5 V supply from the 3.3 V supply through an integrated linear regulator. The high-end TMS320F2812 from Texas Instruments is a 0.18 µ chip with a dual 1.8/3.3 V power supply that runs at up to 150 MIPS and integrates 36 k bytes of RAM, flash memory, a CAN controller and an ADC. At 40 MIPS and with the ADC switched off, it consumes about 150 mW. The ADC takes 150 mW. The F2812 data path provides 32-bit fixed-point operations. It uses an accumulator and a number of extra registers that hold addresses of
Table 6.9 List of features for the SAB80XC161

CPU resources: 16-bit ALU also implementing 8-bit operations; single-cycle multiply instructions; instructions for performing single bit operations and 8-bit ALU instructions; 16 16-bit registers for addresses or data (multiple banks in internal RAM); 16 priority levels for interrupts; DMA function
Operation rate: CPU executes 40 MIPS from internal memory
Integration: 128 k bytes of flash EPROM, 8 k bytes of on-chip SRAM; PLL clock generator, chip select signals and wait states, debug interface; configurable 8-bit or 16-bit memory interface addressing up to 2^24 bytes; parallel/serial ports, timers, watchdog timer, CAN, I2C, 10-bit ADC
Power consumption: 0.3 W
intermediate results and also supports 16-bit operands and multiply and MAC operations. Instructions are 16 or 32 bits wide. Another 16-bit micro controller also running at 40 MIPS and providing a MAC circuit is the basis of the SAB80XC16x family from Siemens, implemented in a 0.25 µ, 2.5 V technology, which is a significant update to the former 80C16x family established in the late 1980s. The devices are offered with a large selection of interfaces including I2C and CAN bus interfaces (unused interfaces may be powered down) and a large on-chip flash memory. In contrast to the Motorola processor, the flash memory has an endurance (number of allowed erasures) of 100 only, which is, however, enough to allow some software updates in the field. A related 20 MIPS chip with a high endurance flash memory comes from SGS/Thomson. Instruction fetches from the internal flash memory or the on-chip program SRAM take a single cycle only, even for 32-bit instructions (in contrast to external memory). The on-chip asynchronous serial interface has a bootstrap option that allows the user to download code into the internal RAM. This is used for programming the on-chip flash memory and for running simple stand-alone applications of the chip. The XC16x CPU offers an overwhelming amount of configurable options and consequently invests a considerable amount of hardware into auxiliary functions (it suffers a bit from this complexity and from the forced compatibility with its forerunner). The feature list in Table 6.9 is for the XC161, which is a 144-pin device using a 5 V external interface and a separate supply for this voltage. We take a brief look at the instruction set architecture of the XC16x CPU. ALU instructions have two register operands and a width of 16 bits and execute in a single cycle. The ALU operations can also have a memory operand. Due to the pipelining they still execute in a single cycle. The MAC circuit is a coprocessor using a dedicated 40-bit sum register; the MAC instructions support dual operand loads in parallel to the arithmetic operation (others don’t), and an extra loop counter that can be tested in a conditional branch. The MAC coprocessor also performs left and right shifts of the sum register by up to 16 bit positions. Instructions with embedded constants or absolute addresses take 32 bits. Data types supported by the ALU are either 8-bit or 16-bit words. Relative branch instructions take a 16-bit word and select from 32 conditions. They must be used to handle arithmetic exceptions. The call and return instructions take 3 and 2 cycles. Multiple banks of registers can be used to implement fast context switches. The CPU uses as many as 7 pipeline stages to fetch, decode and execute an instruction, to write the result and to perform memory accesses. It also implements a simple
Listing 6.11 Boolean operation on port bits

Instruction         Words   Cycles   Function
BMOV PSW.1, P2.3    2       2        move bit 3 from port P2 to bit 1 of status register PSW
BAND PSW.1, P2.4    2       2        perform the AND with bit 4 from P2
BMOV P2.0, PSW.1    2       2        output to P2, bit 0
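The operation carried out by Listing 6.11 corresponds to the following C fragment, written against a pointer to a hypothetical memory-mapped port register; the dedicated bit instructions of the XC16x avoid the read-modify-write of the whole port that the C version implies.

#include <stdint.h>

/* Read bits 3 and 4 of a port and write their AND to bit 0 of the same port. */
void and_of_port_bits(volatile uint16_t *p2)
{
    uint16_t bit3 = (*p2 >> 3) & 1u;
    uint16_t bit4 = (*p2 >> 4) & 1u;
    *p2 = (uint16_t)((*p2 & ~1u) | (bit3 & bit4));
}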
Listing 6.12 32-bit multiply operation

Instruction         Words   Cycles
CoMULu  R6, R4      2       1
MOV     R8, MAL     2       1
CoASHR  #16         2       1
CoMACu  R7, R4      2       1
CoMACu  R6, R5      2       1
MOV     R9, MAL     2       1
CoASHR  #16         2       1
CoMACu  R7, R5      2       1
branch prediction scheme. A branch to the start of a loop is executed in zero cycles except for the first pass. If a MAC instruction is involved, a zero overhead loop results. The XC16x CPU performs fast context switches by switching register banks. The general purpose, 16-bit registers of the CPU also serve as address registers. An address space of 2^24 bytes is supported for program and data, restricting indirect addresses to pages of 65 k bytes. Listing 6.11 shows the instructions to read two signals from a port and to output their Boolean AND to another port signal. Due to its shift operation, the MAC can also be used quite efficiently to perform multi-word arithmetic operations. The instruction sequence in Listing 6.12 computes the product of two unsigned 32-bit numbers in r4/r5 and r6/r7 and places the lower half of the result into the registers r8/r9 and the higher half into the sum register words MAH, MAL. The multiplier is applied in 4 out of the 8 cycles that take a total of 0.2 µs.
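The decomposition behind Listing 6.12 is the schoolbook splitting of each 32-bit factor into 16-bit halves, with the MAC register accumulating the partial products and its shift delivering the carries into the next half word. In portable C (a sketch of the arithmetic only, not of the generated code) the same computation reads:

#include <stdint.h>

/* 32 x 32 -> 64-bit unsigned multiply from four 16 x 16 partial products,
   with a = ah*2^16 + al and b = bh*2^16 + bl. */
void mul32x32(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t p0 = al * bl;   /* weight 2^0  */
    uint32_t p1 = ah * bl;   /* weight 2^16 */
    uint32_t p2 = al * bh;   /* weight 2^16 */
    uint32_t p3 = ah * bh;   /* weight 2^32 */

    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);
    *lo = (p0 & 0xFFFFu) | (mid << 16);               /* lower result half */
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16); /* upper result half */
}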
6.6.3 A High-Performance Processor Core for ASIC Designs

More powerful controllers providing higher clock rates and binary arithmetic operations on 32-bit words come from Motorola, Hitachi, Intel and others. An architecture found in many embedded applications, in particular in ASICs, is the ARM architecture. There are several generations of the ARM architecture that have been licensed to a large number of manufacturers. The ARM is used in particular in low-power SOC designs, and there is even an implementation of the ARM using asynchronous logic [45]. The ARM-7 CPU implemented in a 3.3 V, 0.35 µ technology and executing 66 MIPS is offered within some standard micro controller chips. It is a fairly simple design using a single memory port and a three-level pipeline. Reading the operands from registers, performing the ALU operation and writing back the result is a single pipeline stage. Some ARM-7-based micro controllers do not use caches or a memory
management unit. The more recent ARM-9 implemented in a 2.5 V, 0.25 µ technology executes 200 MIPS. It uses separate caches for instructions and data (a ‘modified Harvard architecture’) and extends the pipeline to 5 steps. The ARM-9 core also appears in standard chips, including the combined FPGA plus processor chips from Altera (see section 2.2.4). The still more recent and faster ARM-11 core uses 0.13 µ technology at 1 V and executes 400 MIPS. A variant of the ARM architecture is also found in the Xscale processors from Intel. The ARM uses a load-store instruction set architecture. The CPU uses 14 general purpose registers R0...R13 as well as a LINK register that is used for return addresses, and the PC register; the latter two can be accessed as registers R14 and R15. All instructions have a width of 32 bits. The basic ALU instructions use three register addresses; there is also a set of multiply and multiply-add operations with up to four register fields (specifying two destination registers for the double size product). The large width of the instructions is exploited to use a 4-bit condition select field and to make all instructions conditional. The state change of the condition bit can be suppressed for ALU and multiply instructions. Then, sequences of conditional operations can often substitute for short branches. The general purpose registers including the PC can be used as address registers. The available indirect addressing modes include addressing with a constant offset specified in the load/store instruction and the modification of the address register by adding a constant before or after applying the address, which is useful for implementing stack structures in memory (e.g., a return stack). There are dedicated conditional jump and call instructions. Jumps can also be realized by using R15 (the PC) as the destination register of a move, load, or arithmetic operation. Return is implemented by moving R14 to R15. There is no special loop support; any of the general purpose registers can be used as a loop counter, and the loop is realized with a conditional branch. The ARM may switch to a set of compressed instructions that take only 16 bits and save memory costs. Interrupts switch to a bank of seven additional registers in order to speed up the context switch. The CPU also provides a coprocessor interface and coprocessor instructions. The processor provides a distinction between a ‘user’ and a ‘supervisor’ mode. In the user mode not all of the memory can be overwritten, and a system can recover from a crash of the ‘user’ program through the supervisor mode. The memory-mapped i/o interfaces are typically accessed by calling supervisor mode functions. This feature is most useful in standard computers running changing applications (as opposed to dedicated systems) and is found in all processors intended to be used in such computers. As a standard chip offering we mention the AT91RM9200 from Atmel, which is a fairly fast 1.8 V device housed in a small 208-pin package (Table 6.10). Due to the integrated memory, a single serial flash EPROM chip or bootstrapping via the USB suffices to obtain an operational system. A processor with a similar feature set based on the MIPS architecture [48] is the Au1100 processor from AMD. The MIPS architecture is a competing RISC architecture found in many CPU designs including IP cores. Besides a standard set of 32-bit, 3-address instructions it also provides compressed 16-bit instructions.
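Conditional execution pays off for exactly the kind of short, data-dependent assignments shown below: a compiler for a fully conditional instruction set can translate each of them into a compare followed by one or two predicated operations instead of a branch. Whether it actually does so depends on the compiler, so the C fragments merely illustrate the opportunity.

#include <stdint.h>

int32_t clamp_to_zero(int32_t x)
{
    if (x < 0)      /* candidate for a compare plus one conditional move */
        x = 0;
    return x;
}

int32_t abs32(int32_t x)
{
    return (x < 0) ? -x : x;   /* compare plus conditional negate, no branch needed */
}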
6.6.4 Super-Scalar and VLIW Processors

The performance of a processor in an application cannot be raised beyond the limit of a 100% ALU efficiency which can be achieved if the operands can be supplied by the hardware at the ALU throughput rate and the data dependencies between subsequent operations do not cause
Table 6.10 List of features for the AT91RM9200

CPU resources: 32-bit integer CPU, 8×32 bit multiplier; 14 registers, more for interrupts; load/store/move for bytes and words, compressed instruction mode; coprocessor interface
Operation rate: 200 MIPS
Integration: 16 k + 16 k bytes data and instruction caches, 16 k bytes of SRAM; memory management unit; SDRAM controller; multiple serial interfaces with DMA support, I2C interface; USB host and slave interfaces, Ethernet media access controller
Power consumption: 0.6 W (estimated)
Figure 6.13 Processor with multiple compute units (several ALUs sharing the control automaton with its instruction memory, the storage automaton with the registers, and the memory and i/o bus)
extra wait cycles. The limitation clearly results from using a single, particular ALU circuit all the time, and can be overcome by using several compute circuits instead, or a more complex one. This can be done by using separate control circuits for them (letting each be the ALU of a separate processor) and synchronizing their operation so that inputs from another processor are waited for as needed (MIMD processing, multiple instruction multiple data). If the mutual synchronizing data exchange occurs rarely, the processors are used most efficiently this way as they can exploit an individual control flow. If, on the other hand, the compute circuits exchange data on a per operation basis and share the same control flow, it is possible to serve them with the same control circuit (Figure 6.13). The simplest approach is to duplicate the same data path and let the multiple compute circuits execute the same instructions on their individual data (SIMD processing, cf. section 5.3). This can be used in applications where separate data streams need to be processed simultaneously in the same way. An SIMD processor cannot be used efficiently to process a single data stream, however, but can only apply one of its compute circuits in that case. Despite executing the same instructions the individual compute circuits can adapt to their local data. This is achieved by expanding the control flow into a data flow (section 1.2.2). Instead of a selection, one simply lets all register writes be executed conditionally so that only those in the selected branch take effect. A more flexible way to control multiple, even different, compute circuits is to use individual instruction words containing independent control and operand select fields for every compute circuit. The processor actually executes these instructions in parallel and implements
what is called instruction level parallelism (ILP). A VLIW instruction set architecture (cf. section 6.1.1) simply packs the individual instructions to be executed in parallel into a single, wide word. Current VLIW implementations use a compressed or variable-size encoding of the wide words and only expand them for the decoding, or before putting them into an instruction cache with a wide interface to the decode stage. An example of a processor of this kind will be discussed in section 8.5.2. Finally, the instructions for the individual compute circuits can be selected automatically from an applied instruction sequence by providing a fast memory interface loading multiple instructions at a time and some automaton that assigns the instructions to the available compute circuits. This method requires a considerable hardware effort to dynamically determine what was otherwise grouped explicitly into the VLIW instructions, but it maintains the single-operation instruction structure and simplifies the compilation. It may happen that instructions finish in a changed order, and data dependencies must be checked dynamically. If, on the other hand, the compiler is aware of the dynamic allocation algorithm implemented in the hardware, the instruction sequence can be arranged so that the resulting parallel schedule is an optimized one determined by the compiler. Processors that dynamically schedule and execute several instructions in parallel on different compute circuits are called super-scalar. They are discussed in much more detail in [48]. The highest performance standard processors are found in the PC class computers and workstations. They are super-scalar processors with multiple integer and floating point ALU circuits and achieve sustained instruction rates of 3 GHz and more by applying deep pipelines and large caches, and by issuing several instructions in parallel, including load/store operations and branches using branch prediction units. These general purpose processors support a variety of data types including 8-bit characters, 16- and 32-bit integer codes and 32-, 64-, and 80-bit floating point numbers. Some of the workstation class processors consume much power and need extensive cooling, which excludes their use in many embedded applications with high performance requirements. Portable, battery-powered computers also require processors with a low power consumption. Variants of workstation class processors and processors for portable computers are used in dedicated, digital systems, too, e.g. in signal processing applications, or if standard operating systems like Linux and Windows, and programming tools otherwise used for applications on workstations, are to be applied. A processor architecture first used in workstations and then applied to embedded systems as well is the PowerPC family from IBM/Motorola. The PowerPC architecture is a RISC load-store 32-bit architecture using 32 general purpose 32-bit integer and address registers and as many floating point registers. The PowerPC versions found in computers include floating point units supporting the IEEE 32-bit and 64-bit floating point types (leaving, however, the approximation of transcendental functions to software). Recent PowerPC processors from Motorola also include an additional compute circuit specialized in performing multi-media related graphical and signal-processing SIMD operations on an extra set of 128-bit wide registers [81].
This extension called ‘Altivec’ performs up to four 32-bit floating point MAC operations in parallel to the floating point unit in the CPU. The latter is also capable of performing single-precision MAC operations. Simple embedded processors based on the PowerPC architecture do not use the floating point unit but add serial and parallel interfaces and an easy-to-use memory interface. The PowerPC405 core found in the VirtexPro FPGA devices is of this kind. The 405 provides a single integer ALU only, separate data and instruction caches of 16 k bytes each and a
Table 6.11 List of features for the MPC7447

CPU resources: 32-bit integer ALU (2 instances); 32/64-bit floating point ALU; 32 + 32 32-bit registers; Altivec SIMD unit with extra registers
Instruction rate: 3000 MIPS (using a 1 GHz clock)
Integration: 32 k + 32 k bytes first-level caches for instructions and data; 512 k bytes of second-level cache; 64-bit interface to main memory
Power consumption: 10 W
virtual memory management unit. It performs 400 MIPS in its 0.13 µ, 1.5 V implementation. In comparison to the ARM, it provides a larger register set of 31 general purpose registers (plus one that is always zero). There are additional PC and LINK registers, and a counter register that serves a decrement and jump instruction. The instruction width is fixed to 32 bits, and the ALU instructions use three register addresses and are unconditional. There are multiply instructions and 16-bit integer MAC instructions. Separate multiply instructions are needed to generate the upper and lower halves of the result. The PowerPC processors supporting the floating point types achieve a much higher performance through their MAC instructions and the parallel execution of several integer operations, load and store operations, and branches. As an example, the low-power version of the recent, super-scalar MPC7447 processor that dissipates about 10 W from a 1 V supply achieves a floating point peak performance of 10 GFLOPS. This rate of operations can be used in special applications only. Double-precision add or multiply operations are not supported by the Altivec unit; they are executed at a rate of 1 GFLOPS. The MPC7447 is implemented in 0.13 µ technology and housed in a 360-pin ball grid package (Table 6.11). The cited performance data are nominal and achieved for special algorithms only. They are hard to achieve in other, given applications, in particular if data have to be moved into the caches at a high rate. In dedicated systems that can exploit the special SIMD operations to some degree (e.g. in DSP algorithms, see [51]), the workstation class processor may be an attractive component. The MPC7447, however, does not integrate interfaces and system support functions such as DRAM control. To use a processor of this kind in a dedicated system one has to add extra circuits to connect a Flash memory chip, interfaces or SDRAM. These functions, which are usually implemented in the chip sets on workstation motherboards, can e.g. be implemented on a single FPGA. An embedded application might not need more memory than the integrated cache memory of the processor. Then it is not necessary to implement a DRAM interface at all. The instructions in the EPROM and the intermediate data automatically get cached within the CPU chip so that there are no further memory interface transactions except for i/o. Then the digital processor hardware gets reduced to the processor chip, the FPGA and a serial Flash EPROM (Figure 6.14). A highly integrated floating-point RISC chip providing such support functions is the SH7750R processor from Hitachi. It contains the SH-4 RISC processor core equipped with a floating point engine, a memory interface supporting SDRAM and Flash memories, and various interfaces. The chip has a super-scalar architecture and executes 240 MIPS and more (up to two instructions in parallel) and 1.7 GFLOPS using a special single-precision floating point
Figure 6.14 Embedding a high performance processor (PowerPC connected via its memory bus to an FPGA that attaches the serial Flash and the interfaces)
instruction computing the inner product of four component vectors that is intended for multimedia applications. Otherwise it performs single-precision and integer MAC instructions at 240 MIPS, or double precision adds and multiplies at a lower rate. The SH-4 uses 16-bit wide instructions. It employs a load-store architecture using 16 integer registers and 32 floating point registers. A single condition bit is selectively used as a carry or overflow bit and serves as the only jump condition. Obviously, the performance is less than for the PowerPC. The SH7750R, however, dissipates below a single watt from a 1.5 V supply. Another super-scalar RISC processor with an integrated DRAM controller is the VR4131 chip from NEC. It is a 200 MHz device based on a 64-bit MIPS integer CPU and consumes about 1/4 watt only.
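The scalar form of the four-component inner product that such special instructions and the SIMD MAC units accelerate is shown below, here as one row of the 4 x 4 matrix-vector transforms common in graphics and multimedia code; this is plain reference C, not processor-specific code.

/* One four-component inner product per output element, the operation that the
   vector floating point instructions mentioned above compute in hardware. */
float dot4(const float a[4], const float b[4])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

void mat4_vec4(const float m[4][4], const float v[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = dot4(m[i], v);
}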
6.7 SUMMARY

Programmable processors combine one or several multi-function compute circuits with a storage automaton optionally extended by a data memory and a control automaton based on a memory table. By filling this table with different instruction lists, many applications can be supported by the same hardware. If the same memory is used both for the data and for the instructions, one arrives at the von Neumann processor architecture. Various methods have been discussed that keep the hardware effort needed for the storage and control automata low, one being the application of soft caching. The standard CPU architectures are offered at a variety of performance and cost levels. Most recent architectures opt for a RISC structure and a load-store instruction set architecture, using 16 to 32 registers for data and addresses. Standard chip products often integrate memory and a selection of interfaces along with the CPU.
EXERCISES

1. Extend the ALU definition in Listing 6.1 to include a dual 8-bit SIMD add operation that can be used to implement a dual 8-bit serial multiply operation.

2. Change the VHDL architecture for the CPU1 so that the spare instruction 1000 becomes an i/o instruction shifting a bit into the ‘a’ register and outputting the carry bit. The shift clock should come from an external signal, and a signal indicating the execution of the i/o operation should be output. Introduce a 4-bit counter to automatically input and output full words. Finally, implement bootstrapping via this serial interface by forcing the i/o instruction via a boot mode state bit, using the program counter to generate the write addresses during the bootstrap write cycles.

3. Design a memory controller for reading data blocks from a 512 k byte flash EPROM chip with 19 address and 8 data lines and an access time of 100 ns, and determine its complexity.
The controller receives commands and sends out data via a 100 MHz 16-bit synchronous bus as described in section 6.5.4. The read command contains the upper 12 bits of the starting address of the block in the flash memory. The lower bits are supposed to be zero, and the first bytes stored in the block contain its length in 16-bit words as a 10-bit number.

4. Explain the changes needed to the operation of the control circuit of the CPU2 in order to execute a sub-routine as a loop body, and the advantages and disadvantages of implementing them.

5. Design a synchronous serial interface that receives a 16-bit word from a data bus and outputs it in 8 clock cycles, 2 bits at a time. The clock is output via a separate line, and a handshake signal is input and presented to the bus master using a separate select signal. At the receive site, a handshake bit can be read that indicates new data being available, and which is reset after a read operation.

6. An application requires the generation of 4 PWM signals with a period of 1 kHz and a resolution of 10 bits, 4 event counter inputs, output via a USB interface, 16-bit integer operations at a rate of 20 MHz, and single precision floating point adds and multiplies at a rate of 100 kHz. The algorithms to be performed are given by a C program showing about 500 operations, and need 12 k 16-bit words for variables and arrays. Define a low-cost system fulfilling these requirements. Verify the assumed performance data from the available data for the selected components.
7 System-Level Design
7.1 SCALABLE SYSTEM ARCHITECTURES

7.1.1 Architecture-Based Hardware Selection

The physical components of a digital system (the ‘hardware’) are chips and boards that each provide certain resources to support application-specific functions. Whether the system extends over several circuit boards, a single one with multiple chips or a single chip depends on the computational requirements and the available components; there may be several options to choose from. A more abstract description of the system structure in terms of functional system components such as processor subsystems including IP modules or software objects eventually translates into a network of underlying hardware components, the FPGA chips and processors to be used. Theoretically, the engineer may select from all the components offered by the industry and add some of his/her own to construct a digital system. Practically, the choice is more limited to keep the inventory of components small, to reuse both the hardware building blocks and the experience gained in previous applications, in particular, the involved software tools and modules. A well-chosen, universal and scalable architecture (see section 1.5.1) can actually cover many applications. Also, the limited set of component types of an architecture can be supported by just a few design tools (even a single, integrated one). Assuming some moderately complex board level components to choose from (e.g., a selection of processor boards), any particular application requiring more hardware resources than available in a single component would be served by a network using the available network interfaces (Figure 7.1). The target hardware can then be described as a network, the nodes of which are the processing elements (PE). The ingoing and outgoing edges to a node correspond to the interfaces available on the PE to connect it to others via some media (Figure 7.2). As pointed out in section 2.3.1, chips and boards used as components should have simple interfaces (e.g., serial ones) and avoid parallel buses (this is a design philosophy; many designs found in industry follow a different, more conventional track). A design using several simple processor boards connected via network interfaces may be more cost-effective than a complex
Figure 7.1 System constructed from self-contained PEs

Figure 7.2 Processor network node with various types of interfaces (network interfaces, i/o buses and external interfaces)
one based on a single, more powerful processor with equivalent interfacing and storage capabilities, if such a processor exists at all. The component boards can thus serve more applications during a longer lifetime. The resulting composite processor structures may be harder to program and to debug, but switching to a more powerful processor and a new set of software tools for it is also a problem. The network composed of the boards can be modularly extended by additional boards, each of which adds to the processing capabilities, to the memory capacity and bandwidth, and to the i/o capabilities of the overall system. An architecture can support more cost-effective system designs tailored to the applications if it provides components of different capabilities, e.g. specialized compute circuits. The resulting heterogeneous structures are still harder to program without the help of appropriate system level tools. Current commercial compilers only support the individual component processors. Processor or FPGA-based components are versatile, programmable or configurable PEs that can serve several applications (are ‘universal’). If the interconnection network is configurable, too, or if data can be passed through intermediate nodes of a network, networks of PEs can be integrated onto complex boards or into chips. These are then more powerful and still universal system components. Applications would build on a smaller number of the more complex standard components and use application-specific configurations for their internal sub-networks.
7.1.2 Interfacing Component Processors

Interfaces between processor subsystems can be distinguished by the data rates achieved for block transfers, their system costs, the degree to which they consume auxiliary processor operations to service the hardware, to copy data and to support their sharing between several processes, and by the restrictions for their wiring and the access to the interfacing media. Processors that need to communicate do not need to be interfaced to each other directly but data to be transferred (‘messages’) may be passed through intermediate processors with some extra effort. Generally, interfaces connect to the local memory buses of the processors like memories but bring their signals to the outside. Two connected interfaces on different processors
SCALABLE SYSTEM ARCHITECTURES
• 207
constitute a bridge between the local buses. The data that need to be passed between the processors are multi-bit codes that generally require sequential multi-word block transfers to be performed. Apart from shared memory structures like the dual-port RAM (Figure 2.40), standard interfaces do not support random access storage of multi-word data and require data to be moved from the interface to a RAM by the receiving processor or a DMA circuit as discussed in section 6.5.1. The need to pass data between chips, to synchronize these transfers and to copy data to and from interface registers slows down data transfers through interfaces. Interfaces are distinguished by certain basic properties. The simplest interfaces are directional ones (input or output). Bidirectional interfaces combine two directional interfaces but restrict them to interface to a bi-directional interface on a single other processor (the directional sub-interfaces usually share some resources). Finally, interfaces may connect to a bus dedicated to communications, or a large memory may be accessed by several processors via a common memory bus instead of using a multi-port memory. A memory bus with some attached memory chips shared by several processors is a common way to build small networks from all kinds of processors. The memory accessed via the bus allows random access by all processors. A processor transfers data to another one by depositing them in the common memory. The handshaking is implemented with variables holding the handshake signal levels. Due to the bus capacitance and the sharing of the data transfer rate of the bus by the attached processors, this method does not extend to arbitrary numbers of processors and is most suitable within a single circuit board. Some processors offer two external buses or a bus and an additional ‘host’ port that can be accessed from the bus of another processor. Then networks with multiple buses connecting many processors can be built using a single chip per node. If a bus is subdivided (by means of 3-state drivers) to reduce the capacitive load, the access to the bus segments can be arbitrated separately to allow for parallel accesses from different processors to different segments. For dedicated parallel computers, large, hierarchical memory systems have been built, providing multiple bus interfaces to attached processors and parallel memory accesses. The common memory space serves to let the processors exchange or share information and yields a somewhat simpler programming without having to consider communications and the large delays caused by them but only mutual exclusion for the access of certain memory objects. The implementation of a common memory subsystem that can be accessed in parallel from several processors, however, involves a considerable hardware effort. A system relying on local memories only could compensate for the slower access to remote data by using more processor plus memory modules, and in many applications the data transfers through interfaces can be pipelined with the operation on other data. With some overheads involved, virtual memory management can be applied in a network without a global memory subsystem to encapsulate communications at the operating system level and to mimic the behavior of a common shared memory. High speed computers used for scientific and engineering tasks often use an architecture known as clusters of workstations. 
Here the ‘components’ are standard PC-type computers, and their interconnection is provided by one or several serial buses (local area networks) or dedicated high-speed interfaces like the Myrinet products [50]. The Myrinet network uses dual 2 Gbit/s glass fiber connections from the individual computers to a crossbar switch that establishes direct links between the computers that need to communicate. It can thus be compared to a switched Ethernet where all connections are routed through a switch and the endpoints use their cables in a duplex fashion without handling collisions. In a dedicated network, the
Figure 7.3 CPU module interface using serial interface channels (Opteron CPUs, Hypertransport channels, chip set with memory controller, SDRAM/EPROM and i/o)
In a dedicated network, the protocols can be streamlined. The individual workstation may use several processors accessing a common memory space. Some recent CPU chip architectures for workstations integrate several CPU modules. Moreover, some CPU chips are equipped with several fast communication channels instead of a single memory bus, on which they transfer data packets in parallel between their integrated caches and the i/o or memory subsystems. Such a chip is similar to an integrated CPU plus memory system including network interfaces. The Opteron processor from AMD, e.g., provides several so-called Hypertransport channels to interface to external memories and i/o circuits that are adapted to the serial channel by means of a support chip set (Figure 7.3). The cache fill operations automatically translate into messages along the Hypertransport links. The links provide 4 to 16 high-speed differential data lines and some separate control lines and simplify the external interface of the CPU. Thus, networks move into the chips, onto the motherboards, and link workstations within clusters.
7.1.3 Architectures with Networking Building Blocks

Components intended to be used within networks need to provide the interfaces supporting the network structure. The design of a processor module with such network interfaces, which consequently become the networking standard within the architecture, can support them by means of special DMA hardware and special instructions so that the interfaces can be used with minimum software overhead. There are just a few commercial processors that provide networking support for parallel high-performance systems within their architecture. For physically distributed applications where it is essential to link remote units to exchange data and control information, micro controllers with serial bus interfaces and protocol engines are offered, providing a simple software interface for remote data accesses. Examples are the LonWorks chips manufactured by Motorola, or standard micro controllers with integrated CAN bus or Ethernet controllers. The LonWorks chips actually implement a distributed operating system; every node provides a simple 8-bit application processor and devotes most of its hardware to the network services.

Serial buses can support parallel systems in some applications but usually involve significant software overheads; in general, they are too slow and involve too many hardware and software overheads to be used for efficient systems with a scalable performance. Every processor cycle needed to service the interfaces and protocols beyond the basic read or write commands and the synchronization is wasted for application processing and decreases the overall ALU efficiency. Employing complex hardware to support communications, however, also reduces the overall efficiency.

The building blocks offering the networking support needed for a scalable architecture can be designed at the chip level or at the board level. Some recent FPGA families from Altera and Xilinx provide predefined fast serial interfaces with LVDS pin drivers and dedicated high-speed shift registers.
Also, FPGA-based component designs are free to define specific interfaces with the desired support for block transfers and synchronization from scratch and to implement them on the FPGA.

The board design in Figure 2.51 can be used as a common packaging unit to provide a number of architectural components. The DSP and the FPGA chips both provide computational resources. System design based on the board is supported by design tools for the micro controller, DSP and FPGA components. The board includes several network interfaces that can be used for distributed applications and for scaling the performance. For networking there are the CAN bus and the Ethernet interface, fast serial interfaces on the DSP chip, and the FPGA, which can implement fast serial point-to-point interfaces using LVDS signals. An individual board already supports many applications. If several boards are needed, they may be merged into a single one in a higher-volume application, or be mounted close to each other and connected using some of the network interfaces.

The crossbar and processor modules in the ER-II design (section 2.3.3) also provide a universal, scalable architecture for embedded systems, in particular for signal processing applications. The crossbar module contains a small 16-bit processor system, special network interfaces to neighboring modules and an interface to an add-on board, in particular to the Sharc module that adds fast floating-point processing and more interfaces. The unique crossbar interface supports the networking of several modules using the serial interfaces of both kinds of DSP processors and standard motherboards. The multiple interface signals to the crossbar translate into a fairly wide module interface. This does not matter if a strictly local interconnection scheme is used for the modules, as in the case of the ER-II design.

A remarkable chip-level architecture that gave rise to a commercial, scalable, universal board-level architecture has been the Transputer family of processors. The Transputers were on the market from the mid-1980s until the early 1990s. The family included a 16-bit integer CPU and 32-bit processors with and without floating-point units, operating at about 10-20 integer MIPS and 2-3 MFLOPS. Besides its memory bus interface, a Transputer chip included 4 k bytes of SRAM and four fast, asynchronous serial interfaces (the Transputer 'links') with data rates of 20 Mbit/sec, an efficient handshaking protocol and DMA support that could be used to bootstrap the processor (load initial program code into its internal memory). The Transputer included no other interfaces, but there were peripheral circuits that could be connected to a Transputer link (Figure 7.4). There was a range of tiny circuit boards (called TRAM modules) equipped with a Transputer chip and external static or dynamic RAM and just 16 interface pins to a motherboard giving access to the link interfaces. The Transputer family was supported by a special programming language called OCCAM that also covered networks of Transputers.
Figure 7.4 Simple 16-bit Transputer system (two T222 TRAM modules with SRAM, a boot link, and a parallel link adapter for input/output)
The Transputer family has been one of the most elegant architectures so far to hit the market. Unfortunately, a second-generation Transputer failed to come out, and the family was phased out.

As well as providing the link interface hardware, the Transputer also supported the use of the links through special instructions and an automatic handling of block transfers. First of all, as communications involve waiting for handshakes, the Transputer provided an automatic thread management maintaining a queue of process workspaces in memory and switching to other contexts without any software overhead in about a microsecond. Time-critical processing was supported by a second queue of processes of a higher priority. The input and output instructions specified a link interface and a memory area to be read from or written to. This automatically set up a DMA transfer and performed a context switch. The thread containing the i/o instruction only continued (was reinserted into the process queue) when the specified transfer was completed. This scheme implemented the point-to-point interfacing between threads running on different processors (the same instructions would also transfer data between threads on the same processor). As on every Transputer there were several threads but only a few interfaces to link it to others, it was desirable to extend the link support and the protocol to allow for the multiplexing of a given hardware link and to let it transfer data to an addressable destination thread. This was planned for the second Transputer generation.

Now the only processor chips integrating multiple high-speed point-to-point network interfaces with DMA support are the TMS320C40 DSP chips from Texas Instruments (TI) and the more recent Sharc family from Analog Devices (see section 8.5.1). The idea of using board-level network modules was taken up for the TI processor by some companies offering processor plus RAM modules compliant to a standard connector interface to motherboards. The Sharc processor includes more on-chip memory and can even be used as a stand-alone network node. A recent 100 MIPS Sharc processor (the ADSP21161) integrates two 'link' interfaces with 10 interface signals each, supporting data rates of up to 100 Mbytes/sec, 128k bytes of internal SRAM and serial interfaces into a 225-pin BGA package. Other members of the Sharc family have up to six link interfaces and up to 512k bytes of SRAM. These DSP chips do not offer the same degree of CPU support for these interfaces as the Transputer did for its links but use the conventional low-level programming of the DMA control registers and rely on interrupts to synchronize with the data transfer.

Besides these network interfaces, the Sharc processors also offer a unique support for multiple processors (up to six) sharing and communicating via a common memory bus. The bus arbitration relies on extra handshake signals provided by the Sharc chips to let just one processor at a time drive the address and control lines, and requires no additional circuits. Moreover, a Sharc can directly access the internal memories of the others or special register ports via the bus. Therefore, if one only wants to communicate, it is not necessary to also attach memory chips to the bus. The two methods to link Sharc processors can be freely combined. A cluster of three ADSP21161 processors provides six link interfaces that can be used to interface them to the processors of other, similar clusters (Figure 7.5).
This Sharc processor hence defines a scalable architecture by itself; a derived board-level architecture could build on small clusters of Sharcs of this type with some external memory. Due to the ability to bootstrap via a link, a single EPROM device suffices to bootstrap an entire network of Sharc processors. The Sharc architecture is specialized to complex signal processing applications and, e.g., lacks the double precision data type supported by general purpose processors and some standard interfaces, and is not cost-effective if 8-bit and 16-bit processing is needed.
Figure 7.5 Sharc DSP network
Figure 7.6 Host port interfacing
These types would typically be supplied by other architectural components.

Another integrated interface that can be used to set up small networks within a circuit board is the host interface found on many DSP chips (especially on integer DSP chips) in addition to their memory bus interface. This interface is a parallel 8- or 16-bit data port that can be connected to the memory bus of another processor (e.g., a micro controller) and used to bootstrap the DSP. Several DSP chips can connect via their host ports to the same memory bus, and a DSP can in turn control others connected to its memory bus (Figure 7.6). The Sharc also provides a host port, but it shares its lines with the memory bus.
7.2 REGULAR PROCESSOR NETWORK STRUCTURES

While application-specific networks of component PEs require application-specific interconnection patterns, fixed or configurable sub-networks within an architectural building block (a board or a chip) have to use standard interconnection patterns that can be scaled to larger network sizes. For networks within a chip (e.g. the interconnection networks of an FPGA), regular, planar structures with local interconnections are most important, while at the board level the third dimension can be used to stack the boards or to arrange them so that the cabling distances remain small.

To simplify the modeling we assume that a single type of network node (PE) with a number d of similar point-to-point input interfaces and the same number of output interfaces is used. Then the network structure can be described as a directed graph. A more general network model might distinguish different kinds of PE with different computational and interfacing resources, point-to-point interfaces of different data rates, input, output and bidirectional interfaces, and input, output and bidirectional bus master or slave interfaces. The definitions given for the simplified model mostly generalize to such a model. Recommended for further reading on graph theory, in particular interconnection graphs and the relations of interconnection graphs to algorithms, are the treatments in [78, 20, 77].

A processor network is modeled as a directed graph with a finite node set V and an edge set E that is a subset of the set V × V of ordered pairs of nodes.
Figure 7.7 Line and circle graphs
The elements of V correspond to the processing elements, and an edge (v, v′) corresponds to a directional connection from v to v′ using an output interface on v and an input interface on v′. We thus assume that there is at most one such connection from v to v′, and we also exclude edges of the form (v, v) connecting a processor to itself. We only consider connected graphs, i.e. for all v, v′ ∈ V there exist v0, v1, …, vn ∈ V so that v0 = v, vn = v′, and for 0 ≤ i < n either (vi, vi+1) ∈ E or (vi+1, vi) ∈ E, as otherwise the system modeled by the graph would break up into separate subsystems. For a given node v ∈ V, the sets

    Iv = {v′ | (v′, v) ∈ E},    Ov = {v′ | (v, v′) ∈ E}

correspond to the input and output interfaces of v connected within the network. To be compatible with the node hardware, the sizes |Iv| and |Ov| (called the in and out degrees at v) must be bounded by the constant d for all v ∈ V. Consequently, |E| ≤ d|V|. If these sizes are all equal to d, the graph uses up all the available interfaces and is said to have the degree d. The graph will be called undirected if for all v ∈ V the sets Iv and Ov are equal. This property can be used to model the interconnection by bidirectional interfaces.

A bus connecting several PEs attached to it allows any sending node to communicate with every receiving node. The same property holds for a fully connected directed subgraph of a graph (V, E). In the simplified directed graph model, a bus will be modeled as a non-empty subset Q of E with the property that whenever (s, r), (s′, r′) ∈ Q then also (s, r′), (s′, r) ∈ Q unless s = r′ or s′ = r. The nodes s with (s, r) ∈ Q for some r drive the bus, and the nodes r with (s, r) ∈ Q for some s receive from the bus. For an undirected graph (V, E), E is a bus subset by itself if and only if the graph is completely connected, and the set of nodes of a fully connected subgraph of a graph (V, E) is a bus. For every directed graph, E can be partitioned into bus subsets, yet not always in a unique fashion. The notion of bus subsets does not capture the idea that a bus interface on a PE is actually just a single interface that can connect the PE to a variable range of others. A refined model would therefore distinguish bus interfaces from others or describe buses as an extra type of node providing another maximum number d′ of interfaces.

The simplest scalable network structures require nodes with a single input and a single output interface (d = 1). They are the directed line and circle graphs shown in Figure 7.7. With two input and output interfaces per node (d = 2), the undirected line and circle networks can be formed. If a processor needs to transfer data to another one that is not directly connected to it, then the data can often be passed along a directed path through some intermediate nodes, using some extra transmit time and bandwidth on every edge of the path. For individual connections between PEs, the sharing of the bandwidth of an edge corresponds to time-sharing it for the various messages traveling on paths through it. Moving a message along a path can be implemented by receiving and buffering the message at the intermediate nodes and sending it out along the next edge again ('store and forward'), or by sending it out while its tail is still being received ('wormhole routing', supported by dynamically controlled switches). If the network is a crossbar network using multiple connections between neighboring crossbar nodes (cf. section 2.3.3), the bandwidth of an edge can be measured and distributed in units of wiring segments.
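As a small illustration of the model just introduced, the following Python sketch (with illustrative names only, not taken from the book) represents a network as a node set and a set of directed edges, derives the interface sets Iv and Ov, and checks the degree bound d required by the node hardware.

```python
# Sketch of the directed-graph network model described above (illustrative only).
from collections import defaultdict

def interface_sets(nodes, edges):
    """Return (I, O): for every node v the set Iv of senders and Ov of receivers."""
    I, O = defaultdict(set), defaultdict(set)
    for (u, v) in edges:
        assert u != v and u in nodes and v in nodes   # no self-connections
        O[u].add(v)
        I[v].add(u)
    return I, O

def respects_degree(nodes, edges, d):
    """Check that no node needs more than d input or d output interfaces."""
    I, O = interface_sets(nodes, edges)
    return all(len(I[v]) <= d and len(O[v]) <= d for v in nodes)

# Example: a directed circle of six processors only needs d = 1.
nodes = set(range(6))
edges = {(v, (v + 1) % 6) for v in nodes}
assert respects_degree(nodes, edges, d=1)
```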
Figure 7.8 Directed, 2D grid graph
Figure 7.9 Node neighborhoods for the At40k and Virtex FPGA cell grids
For a crossbar network linking complex processors, both methods can be combined. The time or cost to communicate is roughly proportional to the length of the chosen path. The distance d(x, y) between two processor nodes x, y is defined as the length of a shortest directed path from x to y. It can be used for worst-case estimates of the costs of communications between them. For an undirected graph, d(x, y) = d(y, x) for all nodes x, y. The diameter of a network (V, E) is defined as the maximum finite distance between nodes. While the undirected line graph with n nodes has a diameter of n−1, the undirected circle with its single extra connection cuts the diameter to just the integer part of n/2. If, in the case of d = 2, the input and output interfaces are used independently, one finds graphs that connect a larger number of processors. The graph in Figure 7.8 connecting n = m² processors has a diameter of only 2(m − 1) = 2(√n − 1). The same holds if the rightmost processors in all rows but the last one are connected to the leftmost ones in the next row, which corresponds to augmenting the directed line graph by a second set of interfaces that take wider steps along the line. If the four directional interfaces at every node are substituted by bidirectional ones, the 2D undirected grid results, which is the most common scalable graph structure filling out a rectangle in the plane (e.g., the area of an FPGA chip). Other local interconnection schemes are obtained by defining an arbitrary neighborhood of a processor in the grid and connecting all processors in this fashion by 'translating' the neighborhood to the different nodes.
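These definitions are easy to evaluate mechanically. The sketch below (hypothetical helper names; it assumes the grid edges of Figure 7.8 are directed to the right and downward) computes d(x, y) by breadth-first search and the diameter as the maximum finite distance, reproducing the value 2(m − 1) for the directed m × m grid.

```python
# Distances and diameter of a directed graph by breadth-first search (sketch).
from collections import deque

def distances_from(x, succ):
    """succ[v] is the set of nodes reachable from v over a single edge."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        for w in succ[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist                      # only finite distances appear

def diameter(nodes, succ):
    """Maximum finite distance between any two nodes."""
    return max(max(distances_from(x, succ).values()) for x in nodes)

# Directed m x m grid, assuming edges point to the right and downward.
m = 4
nodes = [(i, j) for i in range(m) for j in range(m)]
succ = {(i, j): {(i + di, j + dj) for (di, dj) in ((0, 1), (1, 0))
                 if i + di < m and j + dj < m}
        for (i, j) in nodes}
assert diameter(nodes, succ) == 2 * (m - 1)
```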
The neighborhood in Figure 7.9 used by the Atmel At40k FPGA (see section 2.2.4) adds diagonal connections to the previous grid structure and thereby cuts the diameter of the network by a factor of 2. The neighborhood used by the Virtex FPGA uses additional connections that perform larger horizontal and vertical steps, and cuts the diameter by a factor of 6. On a chip, the wires linking a node to its neighbors are usually routed along the edges of a grid. If the node is linked to its neighbors by k paths with the lengths dj in the grid, using nj signal lines for each (0 ≤ j ≤ k − 1), then the number of wire segments along grid edges is u = Σj dj nj. If these are unique, bidirectional lines equally distributed between the north–south and east–west directions, then from every node there are u/8 lines in every direction.

An important construction deriving a new graph from two given ones is the Cartesian product graph. The Cartesian product of the graphs (V, E) and (V′, E′) is the graph with the node set V × V′ and the edge set {((u, u′), (v, v′)) | (u, v) ∈ E and u′ = v′, or (u′, v′) ∈ E′ and u = v}. If both graphs are undirected and have the degrees d and d′, then the product graph is undirected, too, and has the degree d + d′. This implies that the product graph requires processor elements with more interfaces than required for the factors. If the diameters of the two graphs are D and D′, then the product graph has the diameter D + D′, as paths in the product graph break up into steps ((v, v′), (u, v′)) and ((v, v′), (v, u′)) in the V and V′ dimensions only.

The two-dimensional grid graph is the Cartesian product of two one-dimensional line graphs. The Cartesian product of two (undirected) circle graphs yields the 2D torus graph. The product of a 2D torus with another circle yields the 3D torus graph, etc. An n-dimensional torus that is the product of n circles of size k has N = kⁿ elements, the degree 2n and the diameter n·D1, where D1 is the integer part of k/2. In the special case of k = 2, line and circle graphs coincide and have the degree 1 and the diameter D1 = 1. The n-fold Cartesian product of this graph with itself is the n-dimensional hypercube graph with N = 2ⁿ elements, the degree d = n and the diameter D = n. The nodes of the size-2 line graph can be identified with the elements 0, 1 ∈ B. Then the nodes of the hypercube are the binary n-tuples, which are in turn the coordinates of the corner points of the cube as a subset of the n-dimensional real vector space Rⁿ.

The torus graphs and the hypercubes yield families of processor networks of arbitrary size but require processor elements with many interfaces (depending on the dimension). This disadvantage is avoided in the cube-connected cycles (CCC) graph that is obtained by substituting every node of the n-dimensional hypercube by a size-n circle graph, with every processor on the circle providing just a single one of the n interfaces previously needed from the node (Figure 7.10).
Figure 7.10 The hypercube and the CCC graphs
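The product construction described above translates directly into code. The following sketch (illustrative only) forms the Cartesian product of two undirected graphs given as adjacency dictionaries and builds the n-dimensional hypercube as the n-fold product of the two-node line graph; the torus graphs can be obtained in the same way from circle graphs.

```python
# Cartesian product of undirected graphs given as adjacency dictionaries (sketch).
def cartesian_product(adj_a, adj_b):
    """Nodes are pairs (a, b); every edge steps in exactly one of the two factors."""
    return {(a, b): {(a2, b) for a2 in adj_a[a]} | {(a, b2) for b2 in adj_b[b]}
            for a in adj_a for b in adj_b}

K2 = {0: {1}, 1: {0}}                 # the two-node line graph

def hypercube(n):
    """n-fold Cartesian product of K2: 2**n nodes, degree n, diameter n."""
    cube = K2
    for _ in range(n - 1):
        cube = cartesian_product(cube, K2)
    return cube                        # node labels are nested pairs of 0/1

cube3 = hypercube(3)
assert len(cube3) == 8 and all(len(nb) == 3 for nb in cube3.values())
```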
Figure 7.11 Network built from nodes with two bus interfaces (primary cluster buses and secondary buses)
Figure 7.12 Constant degree sub-networks replacing higher degree nodes
The CCC nodes can be labeled by the tuples (i, b0, …, bn−1) where 0 ≤ i ≤ n−1 and all bj ∈ B. Every node (i, b0, …, bn−1) is connected to the nodes/tuples (i − 1, b0, …, bn−1), (i + 1, b0, …, bn−1) and to (i, b0, …, bi−1, /bi, bi+1, …, bn−1), where i + 1 and i − 1 are calculated mod n and /bi is the complement of bi. Vice versa, the circles can be viewed as clusters of processors with a total of n interfaces each that are used to implement a hypercube network. The CCC graph has N = n·2ⁿ nodes, a degree of d = 3, and a diameter of less than 2.5n. The degree does not increase with the size of the graph, and the same type of PE can be used for networks of arbitrary sizes.

Using more interfaces, the CCC graph can be modified to connect more processors at slightly lower distances. First, the cycles can be substituted by fully connected graphs (buses), and, second, PEs can be connected to more than just one PE on another cycle/bus, namely to m−1 others connected via a secondary bus. This yields the dual bus graph Cn,m with the nodes (i, d0, …, dn−1) where 0 ≤ i ≤ n−1 and 0 ≤ dj ≤ m−1 for all j. The node (i, d0, …, dn−1) is connected to all the nodes/tuples (i′, d0, …, dn−1) with i′ ≠ i and (i, d0, …, di−1, di′, di+1, …, dn−1) with di′ ≠ di. Cn,m has n·mⁿ nodes, a degree of n + m − 2, and a diameter of 2n. Instead of using separate point-to-point interfaces, every node/PE can be equipped with two bus interfaces, one connecting it to n−1 and the other connecting it to m−1 other processors. The PE is then independent of the size of the network if one disregards restrictions on the number of PEs connected to a bus and the degradation through sharing the bus bandwidth. For example, for n = m = 4 one links 1024 processor elements at a diameter of 8 using 6 point-to-point or 2 bus interfaces on every PE (Figure 7.11).

The CCC construction is an example of a hierarchical network, namely a hypercube of cycles. In a hierarchical network the nodes of a higher level network are expanded into sub-networks with unconnected edges that extend into the higher level network. An arbitrary undirected graph with any number of edges from a node can be expanded into a hierarchical graph in which all nodes have the same number of interfaces by substituting the original nodes by cycles with an unconnected edge extending from every node within a cycle, or, alternatively, by a bus connecting a cluster of processors each providing an extra link, or by any kind of network with a fixed degree (Figure 7.12).

An important aspect of the torus and CCC networks is their regularity. They not only use the same kind of nodes but also connect the nodes in a similar way all over the network.
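A sketch of the CCC neighborhood rule just given (Python, illustrative): every node (i, b0, …, bn−1) has two neighbors on its cycle and one hypercube neighbor obtained by complementing bit i.

```python
# Neighbors of a node of the cube-connected cycles (CCC) graph (sketch).
from itertools import product

def ccc_neighbors(i, bits):
    """bits is an n-tuple over {0, 1}; i is the position on the local cycle."""
    n = len(bits)
    flipped = bits[:i] + (1 - bits[i],) + bits[i + 1:]   # complement bit i
    return [((i - 1) % n, bits),       # previous node on the cycle
            ((i + 1) % n, bits),       # next node on the cycle
            (i, flipped)]              # hypercube edge in dimension i

# The CCC for n = 3 has n * 2**n = 24 nodes, each of degree 3.
nodes = [(i, bits) for i in range(3) for bits in product((0, 1), repeat=3)]
assert len(nodes) == 24
assert all(len(ccc_neighbors(i, b)) == 3 for (i, b) in nodes)
```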
This similarity of the nodes is captured by the following definition. A graph (V, E) is called homogeneous if for any two elements x, y ∈ V there is a one-to-one transformation ϕ of the node set V onto itself mapping x to y and conserving the interconnections:

    (u, v) ∈ E  ⇒  (ϕ(u), ϕ(v)) ∈ E

A permutation of V with this property is called a symmetry of (V, E). A homogeneous graph is thus defined to have many symmetries. One easily derives that the Cartesian product of two homogeneous graphs is homogeneous, too. The circle graph is homogeneous, as any node x can be mapped to any y by a rotation of the graph that conserves the edges. Hence the torus graphs are also homogeneous in the strict sense of the preceding definition. Homogeneous graphs have no special nodes where there might be above-average traffic due to the interconnection structure or above-average costs due to longer mean distances to other nodes. These properties usually remain to some degree if a homogeneous graph is altered by cutting out a few nodes. A still stronger property than being homogeneous is edge transitivity. Then for all (u, v), (u′, v′) ∈ E there exists a symmetry ϕ of (V, E) so that ϕ(u) = u′ and ϕ(v) = v′. In addition to being homogeneous, such a graph has symmetries that map any given edge (u, v) emerging from some node u to any other edge (u, v′) from the same node. The hypercube has this stronger property, but the 2D n by m torus for n ≠ m does not.

Many useful homogeneous graphs including the torus, the CCC and the Cn,m graphs are obtained through a simple construction from discrete mathematics. The construction starts from a (finite) group, i.e. a set G equipped with an operation ∗: G × G → G that is associative, admits a neutral element e ∈ G so that e∗g = g∗e = g for every g ∈ G, and is such that for every g ∈ G there exists an 'inverse' element g⁻¹ ∈ G with the property that g∗g⁻¹ = g⁻¹∗g = e [12]. The set Zn = {0, 1, …, n − 1} with the operation +n of adding mod n is an example of a group and is called the cyclic group of n elements. Another example of a finite group is the large, n!-element set Sn of all permutations of Zn with the composition of mappings taken as the group operation. This is called the symmetric group and, in contrast to the cyclic group, is not commutative for n ≥ 3, i.e. for permutations ϕ, ψ the compositions ϕ ◦ ψ and ψ ◦ ϕ are different in general. Many other groups (virtually all) arise as subgroups of an appropriate symmetric group Sn, i.e. subsets of Sn closed under forming compositions. An example of this kind is the set of all affine permutations of the set Zn, i.e. all mappings ϕa,b of the form ϕa,b(x) = a ∗n x +n b with a, b ∈ Zn and a relatively prime to n (∗n denotes multiplication modulo n). This group, also, is not commutative for n ≥ 3. The set of symmetries of a given graph is another example of a subgroup of a group of permutations.

A subset S of a group G is called a symmetric generator set if the neutral element e ∉ S, if for every s ∈ S its inverse s⁻¹ is in S, too, and if every group element can be obtained as a product of elements of S. A group may have many generator sets. For every group the set of all g ≠ e is a generator set. Large groups may have small generator sets. For example, S = {1, n−1} is a symmetric generator set of Zn. Given a group G and a generator set S, the Cayley graph of (G, S) is the graph having G as its node set and the edge set E = {(g, gs) | g ∈ G and s ∈ S}.
(G, E) is a symmetric, connected graph in which every node has the degree d = |S| (the number of elements of S). It is easy to see that this graph is always homogeneous: for given elements x, y ∈ G a bijection of G onto itself mapping x to y and preserving the interconnections is the multiplication from the left by the special element yx⁻¹ ∈ G. For the group Zn and the generator set {1, n−1} one obtains the circle graph. The generator set {1, 2, …, n−1} of Zn yields the totally connected graph with n elements.
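The Cayley construction is equally simple to write down. The sketch below (illustrative names) builds the Cayley graph of a finite group from its elements, its operation and a generator set; with Zn and S = {1, n−1} it reproduces the circle graph, and with S = {1, 2, …, n−1} the totally connected graph.

```python
# Cayley graph of a finite group with a symmetric generator set (sketch).
def cayley_graph(elements, op, generators):
    """Edges lead from g to g*s for every generator s in S."""
    return {g: {op(g, s) for s in generators} for g in elements}

n = 8
add_mod_n = lambda a, b: (a + b) % n            # the group operation of Zn

circle = cayley_graph(range(n), add_mod_n, {1, n - 1})
complete = cayley_graph(range(n), add_mod_n, set(range(1, n)))

assert all(len(nb) == 2 for nb in circle.values())        # degree |S| = 2
assert all(len(nb) == n - 1 for nb in complete.values())  # totally connected
```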
In a non-commutative group, for g ∈ G and s, s′ ∈ S the elements gss′ and gs′s at distance 2 from g are different in general. For the sake of having a small diameter (to reach many processors within a small distance from every point), it is therefore attractive to look for Cayley graphs of non-commutative groups like Sn. The diameter of a Cayley graph is equal to the maximum number of generator set elements needed to obtain an arbitrary group element as a product of these. CCC and Cn,m are Cayley graphs of non-commutative groups. For a given diameter, these networks can link a much higher number of nodes than the 2D torus networks.

A generator set for Sn is the set {ψ0, …, ψn−2} where the permutation ψi exchanges i and i + 1 and leaves all other elements of Zn unchanged. The corresponding Cayley graph has the degree d = n−1 and the diameter D = n(n−1)/2, which is an attractive value in view of the large number N = n! of nodes (it is the bubble-sort algorithm that shows that every permutation is a product of at most D exchanges). In order to reduce the number of interfaces (the degree), one can substitute the nodes again by circles, but there are also 3-element generator sets for every Sn, e.g. the set S = {ψ0, σ, σ⁻¹} where σ is the cyclic shift mapping i to i + 1 for i < n − 1 and n − 1 to 0. The permutation ψi can be obtained by first rotating the couple i, i + 1 to the position of 0, 1, then performing ψ0 and then rotating further until 0, 1 is mapped back to i, i + 1. The resulting Cayley graph has the degree d = 3 (the same as for the CCC) and a diameter of at most n·n·(n − 1)/2. Another family of Cayley graphs with a fixed degree of 4 is based on the affine groups of permutations of Zn for different n and special generator sets. The generators are ϕ1,1, ϕ1,n−1, ϕa,0 and ϕa′,0 where a and a′ satisfy a ∗n a′ = 1 and a is such that its powers exhaust all q ∈ Zn relatively prime to n.

While the components of dedicated systems can be arranged in space and wired up almost arbitrarily within the allowable cabling distances, the wiring on a chip (or on a motherboard) is bound to simple grid networks that exhibit fairly large diameters. It is, however, possible to apply a crossbar grid with multiple wiring segments connecting neighboring nodes and to realize a lower diameter network with switched paths through the crossbars. Then the worst-case delays and processing costs for passing messages through intermediate nodes are proportional to the lower diameter, ignoring the costs for communication through the switched paths. The tighter interconnection of the nodes would be paid for by the multiple segments provided along the grid edges. If the diameters of the grid and of the embedded graph are D0 and D, and every node connects to four neighbors via switched paths, then these paths can be expected to have a mean length of l = D0/D, as the longest path in the grid must be shorter than a composition of at most D switched paths. Then l is an estimate for the number of required segments along the edges of the grid. Although this is not a formal argument, it gives a feeling for which low diameter interconnection schemes can be implemented in a crossbar network.
Formally, an embedding of a graph (V, E) into another graph (V′, E′) shall be defined as a mapping ϕ: V → V′ along with a mapping ψ: E → E′* assigning to every edge (u, v) ∈ E a sequence ψ(u, v) ∈ E′* (i.e., a sequence of elements of E′) which has the property of being a path from ϕ(u) to ϕ(v) if both are different, and the zero-length sequence otherwise. This definition does not require the mapping ϕ to be injective. The inverse images ϕ⁻¹{q} ⊂ V of nodes q ∈ V′ are called clusters. Every graph embedding can be composed of an embedding just defining the clusters and mapping edges to edges or length-zero paths only, and an injective embedding expanding edges into paths of non-zero length. Graph embeddings are needed to place software modules onto the processors of a given physical network. The non-injectivity then corresponds to time-sharing a node to execute several software modules (section 7.7).
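A minimal consistency check of such an embedding (a sketch with hypothetical argument names): for every edge (u, v) of the embedded graph, ψ(u, v) must be a chain of edges of the target graph leading from ϕ(u) to ϕ(v), or the empty sequence when both nodes fall into the same cluster.

```python
# Check that (phi, psi) is a graph embedding in the sense defined above (sketch).
def is_embedding(edges_w, edges_v, phi, psi):
    for (u, v) in edges_w:
        path = psi[(u, v)]                  # a sequence of edges of the target graph
        if phi[u] == phi[v]:
            if path:                        # same cluster: the zero-length path is required
                return False
            continue
        if not path or any(e not in edges_v for e in path):
            return False
        if path[0][0] != phi[u] or path[-1][1] != phi[v]:
            return False                    # path must start at phi(u) and end at phi(v)
        if any(path[k][1] != path[k + 1][0] for k in range(len(path) - 1)):
            return False                    # consecutive edges must chain together
    return True
```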
7.3 INTEGRATED PROCESSOR NETWORKS

As chip technology moves to ever smaller feature sizes, the achievable level of integration increases. High-performance CPU chips with large integrated caches extend into the 20–50 million transistor range, and DRAM chips with 500 million transistors are standard. Single-chip systems integrating large amounts of memory yield cost-effective system designs for mass volume applications, but do not extend the performance to what could be achieved by integrating more computational circuits. A number of researchers have published concepts to integrate single-chip networks of processors, but industry designs of this kind have mostly come from start-up companies. Several of these (and a recent announcement from Motorola) propose configurable arrays of MAC/ALU circuits with a common control engine. Application-specific processor networks can be designed as ASICs or on FPGA using appropriate IP cores, in particular on FPGA chips integrating several predefined CPU cores or predefined computational building blocks. The only large-scale configurable chip-level network that has become a wide-spread standard component is the FPGA itself, with its large configuration overheads in terms of silicon area.

Integrated networks of complex processors with local memories offer the prospect of yielding scalable families of efficient, standard (configurable and programmable) building blocks, large memory bandwidths, simplified VLSI design and the asynchronous operation of distant regions of a large chip. One advantage of regular processor networks at the board level is that the resulting systems are built by replicating simple, proven components (processor modules). This carries over to the chip level. An integrated network with some degree of configurability can be a standard platform for implementing digital systems or be used as a component of such. It can exploit the chip technology to arbitrary levels of integration yet use a simple, low-risk design approach by building on a scalable architecture based on functional components. The large FPGA chips also profit from these advantages; they may be considered special cases of the network integration approach with fine-grained nodes. There are many design options, but every integrated processor network design must provide the following ingredients:
• a useful processor core with a single or multiple compute circuits, or a mix of such;
• integrated memory for every processor with appropriate expansion capabilities;
• standard network interfaces with DMA support and synchronization;
• network media supporting the static or dynamic routing of messages;
• a low overhead software interface to communications, and protocol support;
• i/o, bootstrapping and expansion capabilities to tile several chips on a board.

Starting from a chip architecture with integrated networking facilities like the Sharc or the Opteron with its Hypertransport links, the move to a single-chip multi-processor integrating a few CPU modules is straightforward. The first generation Sharc family includes a package of this kind integrating four Sharc DSPs connected via a common bus. The dedicated chip-level network designs differ in their granularity. As an example, the RAW architecture described in detail in [51] uses a 32-bit processor node based on the MIPS architecture with a 3-address load-store instruction set and a set of 32 32-bit registers. The processor is equipped with 16k words of RAM (a total of 64k bytes) and includes a single precision floating point unit.
The architecture defines direct links to the neighboring processors in the grid via register read and write operations, using parallel 32-bit buses. Second, there are 32-bit long-distance connections that support the transfer of messages to arbitrary destination nodes. The communication paths are switched on the fly by extra communication processors associated with each node and executing their own programs. The read port is mapped into the register set and can be addressed as a register operand. An extra instruction bit specifies that an ALU result is to be moved to the output (in parallel to the destination register). Outputs to a neighboring processor are buffered in a FIFO, and inputting implies synchronization (yet without switching contexts). Dynamically routed messages including a header word are written into a register port by a software loop. The nodes operate synchronously and move every message towards its destination by one step per cycle. The message routing scheme takes care of possible deadlocks. The architecture also routes data blocks from a memory interface to the node memories. For loading the instruction memories, soft caching is proposed [52]. A 16-processor RAW prototype chip has been reported on.

A problem to be dealt with in all single-chip processor network designs is the integration of memory and the interfacing. Whereas a board-level design like the ER2 system can offer memory expansion buses and interfaces at every processor site, current chips provide interface signal pads at their borders only. While the number N of processor nodes of a scalable network integrated onto a quadratic chip would grow quadratically with its width, the number of pads only grows linearly, and the number of pads per processor node decreases like 1/√N, which can only be compensated for by sharing interface signals between several processors through high-speed multiplexing. This could also be a reason to consider linear arrangements of processors on rectangular chips, or chips composed of a few linear stripes with wiring channels in between to connect many processor nodes to the pads (e.g., using buses). For a network structure connecting the processors only locally, access to the inner processors of a single-chip sub-network through package pins is not needed, and access to the border processors is enough to extend the network off-chip.

The interfacing problem is not particular to packing processors onto chips. If more functions (in particular, simple ones) are placed into any kind of module, the number of required data inputs grows. For very complex modules, the interfaces become simpler again and approach those for the external input and output streams. Following the philosophy given in sections 2.2/2.3, even the full usage of the possible i/o pads on a large chip is not desirable. Ideally, an integrated processor network chip would just provide some very fast serial interfaces (even a constant number, independent of the number of integrated processors) to support a board-level extension of the integrated network, and some standard interfaces to receive external data streams. Memory resources accessible to the individual processors must be realized on-chip and are not expandable, except for some border processors or by using serial data exchanges with shared, external memory modules. For some designs that provide distributed DRAM or SRAM memory modules along with processing units, we refer to [65, 66].
The CPU2 with its 32k bytes of local RAM and a MAC coprocessor, or a coprocessor configurable for floating point or some common DSP primitives, can be used as the basis of a chip-level network architecture similar to the one sketched in [27]. The MAC is used efficiently in software floating-point operations. The network builds on the standard parallel bus and serial point-to-point interfaces explained in section 6.5.3. The bus can be used to form clusters of processors that provide an additional point-to-point interface extending from every processor. These extra serial interfaces are connected in a fixed or configurable network of clusters.
Figure 7.13 CPU2 cluster with crossbar and attached DRAM
In [27], a crossbar design was presented, tailored to the particular 4-wire serial interface. The crossbar is configured via an 8-bit control bus by any of the processors of the attached cluster (Figure 7.13). This port allows the user to dynamically connect the serial interfaces to the switched paths without having to reroute the network. Moreover, as long as the path is not connected to both the sending and the receiving site, no handshaking occurs and the sending site automatically waits. The crossbars may be implemented to assume a standard state after starting the system that allows the entire network to be bootstrapped from a single memory controller using the serial interfaces. The memory controllers attached to the clusters support the integration of DRAM, SRAM or non-volatile memory blocks into the cluster array. They can also implement shared memory blocks or be used to interface the clusters to buses extending to the chip borders (otherwise, i/o would only be available from the border clusters). The crossbar links brought out of the chip can serve to implement external interfaces (e.g., via an attached FPGA), or to connect to the networks in neighboring chips of the same kind.

The main feature of a CPU2 network is the simplicity of the network structure and of the processor core, which results not only in a high ALU efficiency but also in a fairly high overall efficiency once the non-computational circuits are taken into account (section 1.5.3), except for the memories. The 16-bit word size does not restrict the applications, as the MAC efficiently implements more complex arithmetic types. Disadvantages in some applications may be the lack of message passing support and the fairly slow, distributed reconfiguration of the crossbars; additional hardware could be used for this.

Compared to the RAW architecture, the CPU2-based network is finer-grained and simpler. The processors are just 16-bit integer processors performing floating-point operations in software only, yet using their compute circuits efficiently. The routing is static, and the network structure is quite similar to an FPGA structure. The hierarchical network structure can be exploited by placing software modules with a high data exchange frequency onto processors in the same cluster. Communications are handled with DMA support and automatic context switches, and different clusters operate asynchronously (the crossbars are not registered). Single word transfers involve more overhead than in the RAW design. While the RAW architecture supports parallelization on a per-instruction basis, the CPU2 network is intended for a looser coupling of the processors, controlling the data exchanges through the main CPUs instead of using extra controller hardware. Compared to an FPGA, the configurable processor network achieves a higher total efficiency and eliminates all clock distribution issues. Like some recent FPGA architectures it provides multiple, distributed RAM blocks and multipliers, but adds sequential control circuits and communication facilities between these. The clusters can be reconfigured in a pipelined fashion (soft caching using DMA).
The bit processing facilities of an FPGA can be recovered by replacing some or all of the MAC coprocessors by configurable logic, using the processors to perform a fast, parallel (re-)configuration.
7.4 STATIC APPLICATION MAPPING AND DYNAMIC RESOURCE ALLOCATION

In this section we come back to the task of designing a digital system starting from a specification of the application processes to be performed by it, as outlined in section 1.5.2, using the methods developed in the preceding sections. To some degree, the transition from the specification to a functional system can be automated. Automated system synthesis has been a research topic for a long time and can be further studied in [35, 36].

The first step is to select algorithms for the processing functions. If these were defined through algorithms, the algorithms to be used may be adapted from them. For Boolean functions, algorithms may also be derived from the function tables. After this, an application gives rise to a combined data and control flow graph. The nodes of this graph are operations that must be assigned to the available compute circuits. The selection of the compute circuits within an architecture (or evaluated for competing architectures) is usually based on some knowledge of the speed and efficiency achieved for the algorithm in question using a particular circuit. The algorithm selected for some block in the process graph of the application may even have been selected with a particular processor in mind, as there is a dependency between algorithms and the architectural components used to realize them. A particular algorithm usually runs on a particular type of processor but may be distributed to several processors of this type to achieve the given timing requirements. There are cases where the efficiency of performing a particular sub-task plays a minor role and is overridden by system cost considerations. If, e.g., a digital system receives user input via a small keyboard, then a micro controller would be suitable to perform the involved bit processing and user interaction. Due to the low rate of user inputs, any other programmable processor that is needed within the system can also perform this job, although less efficiently, and thereby eliminate the need for an extra processor. Thus there will be a selection of the major system components based on the algorithms that dominate the computational requirements, an estimate of the required number of components from the throughput requirements, and an assignment of the main tasks to be performed to the selected processor types. A high-level design tool intended to derive dedicated digital system designs and implementations from formal specifications of the application-specific processing might capture the engineering knowledge on architectural components and algorithms in a database and automatically perform the selection of algorithms and components, estimate the required number of components, or simply start from given algorithms and data types and perform an optimization, mapping the specific operations in the algorithms (e.g. floating point operations, or inner products) to different combinations of architectural components.

Assuming that the operations of a particular algorithm given by a sub-graph of the combined data and control flow graph will be performed on k processor modules (or FPGA chips) of a selected type, the sub-graph has to be clustered onto a graph consisting of k nodes (the clusters) that are connected according to the original data and control flows. The clustering must be such that the resulting number of edges is small, as the clusters correspond to the sets of operations executed on the processor modules.
Moreover, the edges must be realized with the limited number of interfaces between the modules, and the connected clusters must exchange as little data as possible to minimize the bandwidth requirements for these interfaces. Operations executed in alternative branches will be put into the same cluster. If there is a limitation on the execution time, the clustering must take into account that an operation can only be performed after all operands are available, including those that are sent from another cluster. On sequential processors, the operations in every cluster need to be scheduled so that wait times for operands are avoided. If there is just a throughput requirement but no requirement on the processing time, a simple approach to obtain k clusters is to schedule the n operations of the algorithm, assign the first n/k operations of the sequence to the first cluster, the second set of n/k operations to the second, etc., and connect the processors in series to operate as a pipeline (see the sketch below). This method, however, will not always respect the bandwidth limitations of the interfaces.

The processor modules will have a limited number d of interfaces each. Ideally the processor modules would be wired according to the edges between the clusters. This only works if the number of edges from every node (the degree of the node) is at most d. Otherwise the processor modules may be connected so that most required edges in the cluster graph are realized by edges of the processor network, and the remaining ones are mapped to paths. Alternatively, some standard interconnection pattern is chosen for the processors, and the cluster graph is embedded into it injectively. In both cases, the embedding must be such that the shared interfaces deliver the bandwidth to maintain the required throughput. The clustering and the graph embedding may be performed automatically using, e.g., stochastic optimization heuristics such as simulated annealing or genetic algorithms. Even restricted to a particular algorithm and the processors executing it, the clustering and the subsequent graph embedding are very complex optimization tasks, in particular if the number of operations in the algorithm is high. To avoid this, the algorithm to be mapped may be defined in terms of more complex operations, or be partitioned into such in a preliminary, manual step.

The distribution of complex building blocks of an algorithm to the available processors can also be performed dynamically (at 'run time') by emulating a totally connected processor network through a bus or by means of passing messages along the paths of a point-to-point connected network. Then some management algorithm can be used to dynamically decide on the placement of a building block onto one of the processors. Such a decision would only be based on the actual state information without knowing about the future situation (otherwise the assignment would be static again), e.g. on the availability of some idle processor, and would decide whether to execute the building block there or to schedule it on a processor disposing of the required inputs. An optimization procedure based on the data and control flow of the entire algorithm can be expected to yield better results and also avoids the overhead of the decisions at run time. There are, however, cases in which the execution times of the building blocks depend on the actual data and are not known at compile time. Then only a dynamic algorithm can achieve an efficient use of the processors.
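The naive pipeline clustering mentioned above can be written in a few lines. The sketch below (illustrative, assuming the operations have already been scheduled into a sequence and that the data flow is given as index pairs) cuts the sequence into k nearly equal stages and counts the data-flow edges crossing each stage boundary, i.e. the traffic that the corresponding interface would have to carry.

```python
# Naive pipeline clustering of a scheduled operation list into k stages (sketch).
def pipeline_clusters(num_ops, k):
    """Assign operation index i to stage i // ceil(num_ops / k)."""
    size = -(-num_ops // k)                 # ceiling division
    return [i // size for i in range(num_ops)]

def cut_traffic(clusters, dataflow_edges, k):
    """Count the data-flow edges crossing each of the k-1 stage boundaries."""
    traffic = [0] * (k - 1)
    for (src, dst) in dataflow_edges:
        lo, hi = sorted((clusters[src], clusters[dst]))
        for boundary in range(lo, hi):      # every boundary the edge crosses
            traffic[boundary] += 1
    return traffic

# Example: 12 operations forming a chain of dependencies, 3 pipeline stages.
clusters = pipeline_clusters(12, 3)
print(cut_traffic(clusters, [(i, i + 1) for i in range(11)], 3))   # -> [1, 1]
```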
A somewhat artificial example of this kind would be the computation of the pixel colors for a display of the Mandelbrot set [85]. These are functions of the run times of another recursive algorithm taking the pixel coordinates as its inputs. The total computation can be distributed by assigning the individual pixel computations to the available processors in any order. A simple dynamic allocation algorithm of this kind was described in [86] for a processor network using point-to-point interfaces and message passing between remote processors.
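The data dependence of the run time is apparent from the escape-time iteration behind such a display (a generic sketch, not the implementation of [85] or [86]): the number of iterations, and hence the work per pixel, varies strongly with the pixel coordinates and cannot be predicted at compile time.

```python
# Escape-time iteration for one pixel of a Mandelbrot display (generic sketch).
def pixel_value(cx, cy, max_iter=256):
    """Return the iteration count at which |z| exceeds 2, or max_iter."""
    zx = zy = 0.0
    for k in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return k                                  # work done depends on (cx, cy)
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter

# The per-pixel work differs widely across the image:
print(pixel_value(2.0, 2.0), pixel_value(-0.75, 0.2), pixel_value(0.0, 0.0))
```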
Figure 7.14 Distance values to an idle (black) processor
The building blocks to be assigned to the processors are functions that can be called remotely from any processor on every other processor by sending it a 'call' message containing the arguments of the call. Every processor has identical, resident instruction lists for all these functions. The results of a remote function call are sent back to the calling processor; they do not depend on the choice of the executing processor and are the same as for a local call of the same function on the calling processor.

The allocation algorithm is a distributed algorithm that is executed in parallel on all processors. It is invoked for all calls to the said functional building blocks by deciding to execute each call either locally or remotely (the result is the same). The decision is based on a set of state variables maintained on every processor for each of the networking interfaces. Such a variable contains the distance to the nearest idle processor in the network reached via that interface (Figure 7.14). If the minimum of these values exceeds the diameter of the network, there is no idle processor, and the call is executed locally. Otherwise, a remote call is sent out via an interface with the smallest distance value. The call is passed on by the receiving processors along interfaces with the smallest distance values until it reaches the idle processor. The call message contains the identity of the calling processor so that the result can eventually be sent back to it.

The distance variables are updated by reporting every change of the minimum distance to the nearest idle processor, taken over all interfaces of a node, to all its neighbors. An idle processor is defined to have a minimum distance of zero. Once engaged, its minimum changes to the minimum taken over the interfaces, which is reported to the neighbors. Thus, the variables adapt to the new values through a diffusion of the information on the change. It may happen that a message travels towards an idle processor which, however, becomes occupied before the message reaches it. The message then travels on until a maximum number of steps is reached and it is reported back to its origin (which then executes the function call by itself).

The parallelization procedure can be refined by providing a buffer for a remote call message on every processor module. After sending back the results of a remote call, a processor can immediately start to work on the buffered task, so that the communication of remote calls and their execution become pipelined. The dynamic routing of messages and special functions such as updating the distance variables involve some protocol overheads. In the cited implementation, a single 32-bit word of header information is used to distinguish the different kinds of messages. The messages for updating the distance variables are single-word (header only) messages.
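A sketch of the per-node state and decision rule (illustrative; the message and interrupt mechanics of the implementation in [86] are not modeled): each node keeps one distance value per interface, forwards a call along an interface with the smallest value or executes it locally when every value exceeds the network diameter, and reports its own minimum distance to its neighbors whenever it changes.

```python
# Per-node state of the dynamic allocation scheme described above (sketch only).
class AllocatorNode:
    def __init__(self, num_interfaces, diameter):
        # distance to the nearest idle processor reachable via each interface;
        # 'diameter + 1' is used here to mean 'no idle processor known'.
        self.dist = [diameter + 1] * num_interfaces
        self.diameter = diameter
        self.idle = True

    def route_call(self):
        """Interface to forward a remote call on, or None to execute it locally."""
        best = min(range(len(self.dist)), key=lambda i: self.dist[i])
        return None if self.dist[best] > self.diameter else best

    def my_distance(self):
        """Value reported to the neighbors: 0 if idle, else the minimum over all interfaces."""
        return 0 if self.idle else min(self.dist)

    def update_from_neighbor(self, interface, reported):
        """Diffuse a neighbor's report; returns True if the own report changes."""
        old = self.my_distance()
        self.dist[interface] = reported + 1
        return self.my_distance() != old
```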
The updating function can easily be implemented in hardware to reduce the interrupt frequency for the processor, e.g. by using counters for the variables and separate interface signals to increment or decrement them. Dynamic message passing can be further supported, up to the point of automatically routing the messages. If message passing is used in a point-to-point network, one has to exclude the possibility of communication deadlocks. Such problems arise if a cycle of processors forms in the network such that every processor waits to send to the next one on the cycle but none of them accepts new messages, as the receive buffer cannot be passed on. Deadlocks can be detected through the use of appropriate protocols and resolved by providing extra buffer space reserved to handle the fault situation [87]. In the present case, header words starting a message transfer are first acknowledged by single-word messages before the data part of the message is transferred as a block. A negative acknowledge message includes the information required to detect a deadlock. More on distributed algorithms and message routing can be found in [88].
7.5 RESOURCE ALLOCATION ON CROSSBAR NETWORKS AND FPGA CHIPS

While the allocation of instruction lists or variables in the memory space of a processor is a simple matter and easily automated, the allocation of FPGA cells for some special function and the selection from the available wiring resources are more difficult. Many problems of this kind concerning automatic design have been shown to be NP complete, and are only tackled using heuristic methods.

A crossbar network can be described as an undirected, connected graph (V, E) and a bandwidth function b: V × V → N0 that assigns to every edge (x, y) ∈ E the number of available wiring segments along it. The bandwidth function satisfies b(x, y) = b(y, x) for all x, y, and b(x, y) = 0 if and only if (x, y) ∉ E. For x, y ∈ V we again denote by d(x, y) ∈ N0 their distance in the network. To injectively embed a given graph (W, F) into (V, E), a 'placement' function ϕ: W → V and a 'routing' function ψ: F → E* have to be defined. The edge set F ⊂ W × W can also be characterized by a bandwidth function c: W × W → N0, namely the characteristic function of F. For (x, y) ∈ F the length of the path ψ(x, y) from ϕ(x) to ϕ(y) is greater than or equal to d(ϕ(x), ϕ(y)). Therefore the total number S of wiring segments needed for the paths is greater than or equal to the number

    T = Σ_{(x,y)∈F} d(ϕ(x), ϕ(y)) = Σ_{(x,y)∈W×W} c(x, y) d(ϕ(x), ϕ(y))

that only depends on F. For every edge e ∈ E, the bandwidth b(e) limits the number of paths ψ(x, y) containing e. In particular, S cannot exceed the total number Q = Σ_{e∈E} b(e) of directed wiring segments of the crossbar network. In terms of c, the bandwidth condition reads

    Σ_{(x,y)∈W×W} c(x, y) |ψ(x, y)|(e) ≤ b(e)   for all e ∈ E,    (1)
where the function |ψ(x, y)| of E takes the value of 1 for the edges contained in ψ (x, y), and is zero otherwise. Note that in this estimate c might be refined to a real valued function to
describe the data exchange rates along every edge. Then S generalizes to the expression

S = Σ_{(x,y)∈W×W} c(x, y) · length(ψ(x, y))   (≤ Q)
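As a small illustration (not from the original text), the quantities T and S and the check of condition (1) can be written down directly; the dictionary-based representation of c, d, the placement and the routing is an assumption made for the example.

from collections import defaultdict

def lower_bound_T(c, d, placement):
    # T = sum of c(x,y) * d(phi(x), phi(y)); independent of the chosen routing
    return sum(cxy * d[placement[x]][placement[y]] for (x, y), cxy in c.items())

def segments_S(c, routing):
    # S = sum of c(x,y) * length of the chosen path psi(x,y)
    return sum(cxy * len(routing[(x, y)]) for (x, y), cxy in c.items())

def bandwidth_ok(c, routing, b):
    # condition (1): the demand accumulated on every edge must not exceed b(e)
    demand = defaultdict(int)
    for (x, y), cxy in c.items():
        for e in routing[(x, y)]:
            demand[e] += cxy
    return all(demand[e] <= b.get(e, 0) for e in demand)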
The complexity of the task of finding both ϕ and ψ without exceeding the available wiring resources can be lowered by dividing it into first performing the placement and then the routing for the chosen placement. Actually, both sub-tasks are interrelated. It may happen that for a given placement no routing can be found at all, or that none of the acceptable routings minimizes S. The placement step is performed so that the number T is minimized. If the minimum T exceeds Q, then there is no routing respecting equation (1). For an implementation, W and V can be identified with the index sets [1, . . . , m] and [1, . . . , n] with n ≥ m. Then ϕ becomes an injective index mapping (a permutation in the case of n = m), and the problem of minimizing T is a so-called quadratic assignment problem for which standard solutions and tools exist [57]. This translates into the integer programming problem of minimizing the sum

T = Σ_{i=1}^{m} Σ_{j=1}^{m} Σ_{k=1}^{n} Σ_{l=1}^{n} c_{i,j} p_{j,l} d_{k,l} p_{i,k},   with all p_{i,k} ≥ 0,   Σ_{i=1}^{m} p_{i,k} = 1,   Σ_{k=1}^{n} p_{i,k} = 1

for the given distances d_{k,l} and bandwidths c_{i,j} as a function of the p_{i,k}. An integer solution will be such that for every i there is a unique k = ϕ(i) such that p_{i,k} = 1. Finding a routing that minimizes S also translates into a large ILP problem that can be solved with LP solver tools [58]. If the edge set E is identified with the index set [1, . . . , r], then paths can be represented as vectors (p_k) having r components equal to 0 or 1, p_k = 1 indicating that the edge k is on the path. Then this translation reads

S = Σ_{i,j=1}^{m} Σ_{k=1}^{r} c_{i,j} q_{i,j,k},   with 0 ≤ q_{i,j,k} ≤ 1,   Σ_{i,j=1}^{m} q_{i,j,k} = 1
Additional linear constraints are imposed on the q_{i,j,k} to take care of getting a path from ϕ(i) to ϕ(j) and to obey the bandwidth limit in equation (1). To implement a network of flip-flop and gate components on an FPGA, the first step is to cluster the components into more complex sub-circuits that can be configured as individual cells and use these more efficiently. The resulting cluster graph is then injectively embedded into the FPGA structure, performing combined or separate place and route steps. Beforehand, the FPGA resources need to be modeled as a network of interconnected processing nodes. For these one has to distinguish between the input/output nodes, the logic cells, and special functions like RAM blocks or multipliers. The wiring resources are heterogeneous, too. There is a mixture of point-to-point connections and bus lines. Moreover, wire segments are not always connected to cells but also to other segments via switches. It is therefore common to model the FPGA so that the wiring segments are nodes, too, and the edges represent switches. The cluster graph is embedded so that clusters are mapped to functional units only. Edges are always expanded to paths through intermediate wiring nodes. The timing of the FPGA circuits is a concern. Therefore the arrangement of cells in a sub-circuit is considered, and the routing is performed in two steps, first assigning connections to wiring channels (‘global’ routing) and then performing the detailed ‘local’ routing. Various discrete optimization heuristics are applied, in particular simulated annealing for the placement of cells. As a starting point to the literature, see [59, 60].
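To make the placement step concrete, the following Python fragment sketches a simulated-annealing search for a placement minimizing T. It is an illustration only, assumes n = m (a bijective placement), and all parameter values and names are invented rather than taken from an actual tool.

import math
import random

def anneal_placement(c, d, hosts, guests, steps=20000, t0=5.0, cooling=0.9995):
    # start from a random bijective placement guests -> hosts
    place = dict(zip(guests, random.sample(list(hosts), len(hosts))))

    def cost(p):
        return sum(cxy * d[p[x]][p[y]] for (x, y), cxy in c.items())

    current = cost(place)
    best, best_cost, temp = dict(place), current, t0
    for _ in range(steps):
        a, b = random.sample(list(guests), 2)            # propose swapping two guest nodes
        place[a], place[b] = place[b], place[a]
        new = cost(place)
        if new <= current or random.random() < math.exp((current - new) / temp):
            current = new                                # accept the move
            if new < best_cost:
                best, best_cost = dict(place), new
        else:
            place[a], place[b] = place[b], place[a]      # reject: undo the swap
        temp *= cooling
    return best, best_cost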
7.6 COMMUNICATING DATA AND CONTROL INFORMATION
After the more hardware-oriented discussion of communications in sections 7.1 and 7.2 we finally consider some application requirements and implementation techniques, including the necessity to also send out control and protocol information. The need for protocols has already been remarked on for the dynamic routing of messages using shared media. We continue to follow the application model introduced in section 1.5 in which communications occur between processors executing threads that are part of the same distributed, cyclic process. If two processors A and B are linked by an interface used to transfer input data or intermediate results computed on A to B, these data are not related to the data word sizes handled on the processors but are bit fields of application-specific lengths. During the continuous operation of the system, streams of such bit fields are transferred. Consequently, it is required that the interface transfer blocks of data words. In order to achieve high ALU efficiencies of the processors, these block transfers ideally do not involve the processors but are handled by DMA such that the receiving processor is informed of the arrival of new data only after the entire block transfer has completed. The receiving processor may switch to other threads until this handshaking occurs. DMA can be substituted by a series of word receive operations that also free the processor for another thread as long as a word still has to be waited for (then the overhead involved in switching to another context occurs for every word). As the interfaces involve hardware costs, they are not dedicated to communicating within a single distributed process only, but are often time-shared or multiplexed between several ones. This is different from just merging the individual data streams because at the receive site the merged stream branches into the individual streams again. Typically, all processes using both processors will share the same interface from A to B. Every bit field transferred from A to B must be directed to the right destination thread on B. This could be done by outputting a process identifier along with the message, or, more efficiently, by using the shared interface to first transfer the identifier before transferring the data bit field. The bit field thus needs to be expanded by a code for the receiving thread, which amounts to implementing a simple protocol for the data transfer. At the receive site, the address information is used to signal the arrival of a new block of data to the destination thread only. The multiplexing of interfaces is also used for external interfaces to a system, in particular if they connect to other computers in a supersystem. Examples are the outputting of visual information to a screen where different windows are used to distinguish the outputs of the individual processes, or the outputting of sequences of control codes to a multi-voice musical instrument using the MIDI interface. Finally, if the output of an interface goes to a bus to which several other processors are connected, the message must also be expanded by the network address of the destination processor. At the receive site, these additional protocol bits are stripped off again. The combination of the process address and the processor address uniquely identifies a destination thread running on one of the processors.
On an Ethernet using the UDP protocol, the processor bus address is in the Ethernet header and automatically used by the Ethernet controller to select from the messages on the bus. The socket address in the UDP header can be used (but does not have to be used) to distinguish the destination threads. For a USB interface, different endpoints can be used for different destination threads in order to let the interface controller perform the selection instead of the receiving CPU.
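As an illustration of this kind of demultiplexing (not taken from the text), the following Python fragment binds one UDP socket per destination thread so that the port number performs the selection; the port numbers and the handler are arbitrary choices for the example.

import socket
import threading

def receiver_thread(port, handler):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))                 # one UDP port per destination thread
    while True:
        data, sender = sock.recvfrom(2048)       # one application bit field per datagram
        handler(data, sender)

# two destination threads, distinguished only by their port numbers
for port, name in [(5001, "thread 1"), (5002, "thread 2")]:
    threading.Thread(target=receiver_thread,
                     args=(port, lambda d, s, n=name: print(n, "received", len(d), "bytes")),
                     daemon=True).start()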
Figure 7.15 Data transfers (SD/RD) and added control code transfers for compiled communications

Figure 7.16 Cyclic server process for remote function calls
A thread on processor A wishing to send out a bit field to a particular thread on processor B that is part of the same distributed process needs to call an output operation for the block, adding the required protocol bits, and the receiving thread calls an input operation for the data that strips off the protocol data and also causes the receiving thread to wait for the arrival of the data. No further protocol is involved if all information on the type of the transferred data and the action to be performed on them is built into the receiving thread. This method is called ‘compiled communications’. It requires the receiving thread to follow the control flow of the sending thread. If the transferred data or the action to be performed on them depend on which branch is taken in the thread on A, the branching condition needs to be transferred to the thread on B, too (Figure 7.15). If no communications occur after a branch executed on A, there is no control code transfer to B. A control code and a subsequent data transfer can be merged into a single block transfer. A simple implementation that numbers the receive operations, precedes every data transfer by a transfer of the number of the corresponding receive operation and uses this number to select the appropriate action on the data would do this job, but would transfer more control information than necessary. This is similar to performing calls to remote functions selected by a control code without implementing a control flow on the processor executing the calls. The remote function call would be executed by a server thread that cyclically receives control codes, performs the equivalent of a multiple case branching structure and performs further receive and send operations of parameters and results in the branches (Figure 7.16). A single server thread can respond to calls from multiple processors within several processes. Through its cyclic operation it dynamically serializes the incoming calls. If the remote calls are issued from the same, distributed process, its timing may suffer from the random sequencing of the calls. The deterministic sequencing of the server actions through a local thread not only reduces the number of control messages but can also avoid wait times for time-consuming calls. The same remark applies to the management of buses as a shared resource used by several processors involved in the same process. A deterministic access schedule not only avoids the overheads for a bus access protocol but also enforces an optimized timing. Control transfers are also needed if a thread on A waits for some synchronizing condition that is also meant to synchronize the actions in a thread on B (that is part of the same process) if there is no data transfer from A to B in between. They also serve to maintain the order of
outputs issued from different threads in the same process and merged into the same output stream or communicated to a thread on a processor C. If e.g. an output on A is to occur before the output from B, A has to send a control message to B after its output operation, and B waits for this message before it performs its own output. If outputs from different processes are merged, there is no control transfer attempting to maintain a particular order. The control transfers used for the messages passed within a single process via a bus make sure that these are serialized so that there will be no bus congestion. In general, several processes compete for bus access. To avoid bus congestion (which is automatically resolved by some buses, but at the expense of extra delays), a token message can be circulated between these processes using broadcasts. Only the process holding the token would be allowed to send via the bus if needed, and it would keep the token during this time.
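The cyclic server process of Figure 7.16 can be sketched as follows. This is only a Python illustration, with queues standing in for the handshaking interfaces; the control codes and the function table are invented for the example.

import queue
import threading

def server(rx, tx, functions):
    # cyclically receive a control code, branch on it, then receive the
    # parameters and send back the results of the selected function
    while True:
        code = rx.get()
        if code == "stop":
            break
        params = rx.get()
        tx.put(functions[code](*params))

rx, tx = queue.Queue(), queue.Queue()
threading.Thread(target=server, args=(rx, tx, {"add": lambda a, b: a + b}), daemon=True).start()
rx.put("add"); rx.put((2, 3))            # remote call selected by the control code
print(tx.get())                          # prints 5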
7.7 THE π-NETS LANGUAGE FOR HETEROGENEOUS PROGRAMMABLE SYSTEMS
This section is on a particular research project aimed at a simple tool to design embedded digital systems and is not prerequisite reading for any of the later sections and chapters. It is included here as it takes up many of the ideas discussed so far and may serve for comparisons with other tools or to provide ideas for new ones. The tool can be downloaded from [55] for experimentation and noncommercial applications of some of the components introduced in the previous chapters. More complete documentation on it can also be obtained from [55]. In particular, it takes up the following ideas:
• distinguishing abstract data (numbers) and their binary encoding;
• using algorithms to describe circuits (even recursive ones);
• distributed, cyclic processes;
• allocation to components supplied by a scalable architecture;
• processors supplying sets of special, elementary functions;
• handshaking interfaces and their modeling by files.
This tool is an experimental programming language called π-Nets that attempts to describe digital systems based on various kinds of architectural building blocks, including both sequential processors and FPGAs, in a coherent fashion and in enough detail to permit automatic code generation for the involved processors. It covers the following aspects that have to be dealt with by every set of tools for designing digital systems (how they are handled in standard tools will be commented on):
• the representation of a scalable architecture supplying the components applications can be based on;
• the definition of the set of building blocks for a given application and their interconnection on the basis of a scalable architecture;
• the definition of algorithms based on sets of elementary functions provided by the component processors;
• the definition of interfaces to the target system that are accessible by particular processors and have a specific timing behavior and handshaking;
• the definition of application processes that access the interfaces and memory structures and exchange data via these with the proper synchronization;
• distributing the processes and memory structures to the available processors;
• handling reconfiguration and sequential control of shared resources;
• performing code generation for the processors and configurable logic components;
• synthesizing code to bootstrap the system and to distribute the configuration code;
• supporting in-system debugging capabilities and simulation facilities.
In contrast, a conventional C compiler environment, the most common software tool, performs code generation and debugging for a single processor only. The π-Nets environment does not attempt to cover higher-level design tasks as future tools might do, and for some of the above it relies on explicit information given in the ‘programs’ in order to capture the knowledge of the system and software designer and to reduce the complexity of the allocation tasks. It does not do the following:
• select the types and required numbers of component processors for an application. This is specified by the programmer.
• attempt to find or select algorithms for the functions to be performed. These are specified and also supposed to be verified by the programmer.
• select the number encodings within the algorithms.
• distribute the processes, data storage structures and interfaces to the available processors. This is explicitly specified in the programs. Only within a class of similar processors are data structures and processes distributed automatically, based on a specified pre-clustering.
• select a set of operations assigned to an FPGA for serial execution on a single compute circuit. The programmer does this, yet without having to describe a control circuit.
• partition a process for the reconfiguration of the FPGA or for soft caching. This is based on a program operator. The actual reconfiguration procedure is added automatically.
The π-Nets compiler is part of an interactive programming environment that provides some common services such as editing program texts and a simple project management. The target system can be linked to the host to perform transparent code downloads and interactive testing, and may even include the host computer as a programmable processor. The π-Nets compiler supports the XC161 processor (section 6.6.2), the CPU2 (section 6.3.2) and several other components and can easily be extended to new processors. The code generator modules required for additional processor components are implemented as special π-Nets programs. The compiler performs a register allocation and outputs register operations that are automatically matched with the instruction patterns defined in the code generator to call the processor-specific encoding functions. The discussion of π-Nets in the following sections will concentrate on the approaches to handle the above issues. Besides these, the language has some non-standard features that are somewhat arbitrary and mostly chosen to make it easy to use and to read. For example, names of objects and functions may be composed of several symbols divided by spaces. Local variables are written to just once, and the if-else structure defines ‘open’ branches without a common continuation. Numbers can also be input and output in a rational format (3.1/7), the use of the multiply operator ‘*’ in expressions is optional, and comparisons can be chained (1 < ab < 10). Functions may have several result words, and constants and variables hold multiple words as well. Loops accessing the components of a multi-component data structure can be written in a short-form notation that is automatically expanded by the compiler. Available library functions are stored in an intermediate code and selectively compiled and optimized for the target processors depending on their use in an application. All objects used in a program, all
processes and the associated output windows are defined statically. Disk files are treated as special ports receiving data streams and are defined statically, too.
7.7.1 Defining the Target System
A scalable architecture on which target systems can be based defines several component types and associated network interfaces and media. The target system will be a suitable network of processor components of these available types. For conventional compilers for embedded systems, the target system is bound to be a single hardware processor or a single FPGA of some specific type. In contrast, for distributed operating systems on networks of workstations, programming tools such as the parallel virtual machine (PVM) and MPI are available to support parallel and distributed programming, as well as libraries supporting parallel applications and special programming languages supporting parallel processes [26, 31]. The Transputer family was supported by the OCCAM programming language based on the CSP model [32] that also allowed networks of several different Transputers as targets. Parallel OCCAM processes could be placed onto processors at given network addresses, and the software interfaces between them (‘channels’) were placed onto the hardware links of the Transputers. π-Nets supports a small number of representative embedded standard processors that can serve as architectural building blocks in a broad range of applications. The π-Nets environment is based on a compiler front end for the π-Nets language and a runtime environment to execute programs on the host. It is complemented by a number of code generator back ends (CG modules) to support different processor types. It does not link to just one of them but can use several of them as required if the target uses processors of different types. The CG modules are loaded on demand in response to the definition of the target network within the π-Nets program text. The scalable architecture is represented by the set of available CG modules. The code generator modules also define the available interconnection methods and interfaces. The target definition is in the header section of a π-Nets program and takes the form of a net list definition. The network nodes are declared as ‘processors’ using the syntax

proc (type) name

where ‘type’ denotes a special component type (actually, a CG module) and ‘name’ is the name of the processor of this type that is used subsequently to perform application functions. A simple application might require just a single processor, but a typical application would define several distinct programmable components of the target. If several processors of the same type are used, an index range can be specified in the ‘proc’ declaration instead of several separate declarations. The line

proc (XC161) 8 MC

e.g. specifies a set of eight hardware processors MC[0] . . . MC[7], all of the same type XC161. The special name ‘host’ is predefined and reserved for the workstation processor running the compiler. The processors of the target system operate on bit fields of various sizes that encode the application data. Every π-Nets processor performs a generic set of elementary Boolean functions on bit fields of a specific word size (e.g. 1, 8, 16, 32) that is an attribute of the
processor type. Complex functions derived from the elementary ones operate on bit fields, the sizes of which are multiples of the word size of the processor. The ‘processor’ constitutes a space of multifunction building blocks on which the elementary operations can be selected. The CG module defines how they can be composed with each other according to the algorithms given in the application program (in particular, whether a single multifunction circuit is used in sequence). If several similar processors are defined, they constitute independent hardware resources. A CG module may also provide predefined special functions that can be used by all processors of this type. An FPGA e.g. provides building blocks performing a set of single-bit operations, namely the AND, OR, XOR, NOT, SEL operations, the full adder operation and a few more. A micro controller such as the XC161 operates on 16-bit data words and supports the bit field operations AND, OR, XOR, NOT performed in parallel on the components of the operand words and 16-bit binary arithmetic operations. Usually, a CG module also defines other sets of elementary operations to implement special number types (e.g., 16-bit signed binary numbers). Besides a space for operations, a π-Nets processor also provides a space for storage cells and a space for ports for inputting and outputting data words. A port to the target system needs to be declared and located on one of its processors. The statement ‘ON pname’ precedes the declarations of the ports and variables to be allocated on a specific processor ‘pname’. The allocation of variables can be coupled for several processors to implement a shared memory space. Ports are digital hardware devices that are used to input and output bit fields, and which take part in the data exchange between processor sub-systems. A port transports bit fields in units of the processor word size (or a smaller one) or numbers encoded by single words. There is a distinction between ‘raw’ ports that deliver valid data codes at any time without an indication of when a new input word is delivered, and handshaking ports that may fail to deliver data. For a handshaking port, unidirectional or bi-directional handshake signals are specified besides its i/o address and word size. Variables allocated in the memory space of a processor hold bit fields spanning any number n of data words or number codes. They are global storage structures uniquely allocated in the memory space of some specific processor. The accesses to a multi-word variable are indexed as the processor only supplies word operations. In i/o and copy operations and some others the indexing can be implicit (serially outputting a multi-component variable x to a port p is denoted ‘x >> p’). Then the variable is treated as holding a single, large bit field. Besides the processors, the target definition requires a specification of how the processors are linked to each other as the application processes eventually mapped to the processors will require data exchanges between them. The link between two processors is directional and specified by identifying an output port on the sending processor and an input port on the receiving processor of the same width. The use of handshaking ports guarantees an orderly data transfer, but there is no further protocol specification for how to access an interface bus or how to direct the transferred data blocks to the right destination processes.
The link definition uses the syntax

link pn1, pn2  port1, port2

where ‘pn1’ and ‘pn2’ stand for the names of the processors and ‘port1’ and ‘port2’ are the send and receive ports that must have been previously declared in the π-Nets program or in a CG module. If raw ports or i/o routines without handshaking are specified, their
implementation is supposed to provide buffering as needed and to block accesses until the transfers succeed. The ‘link’ statements impose the structure of a directed graph on the target, the nodes of which are the individual processors. Links are used to transport application-specific bit fields of arbitrary sizes that are broken into sub-fields corresponding to the port sizes and transferred serially. If the linked processors use different word sizes, no loss of information occurs but the bit field may span a different number of words on the receiving processor.
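How a bit field is broken into port-sized words and reassembled on a processor with a different word size can be illustrated as follows. This Python fragment is an illustration only; the little-endian word order and the out-of-band handling of the field length are assumptions of the example.

def to_words(bitfield, nbits, word_size):
    # split an nbits-wide field into words of word_size bits (least significant first)
    return [(bitfield >> i) & ((1 << word_size) - 1) for i in range(0, nbits, word_size)]

def from_words(words, word_size):
    # reassemble the field; it may span a different number of words at the receiver
    value = 0
    for i, w in enumerate(words):
        value |= w << (i * word_size)
    return value

field = 0xABCDE                                          # a 20-bit application bit field
assert to_words(field, 20, 16) == [0xBCDE, 0x000A]       # two 16-bit words on the sender
assert to_words(field, 20, 8) == [0xDE, 0xBC, 0x0A]      # three 8-bit words on an 8-bit receiver
assert from_words(to_words(field, 20, 16), 16) == field  # no loss of information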
7.7.2 Algorithms and Elementary Data Types
The algorithmic notation is used to specify the composition of operations within complex functions. The sets of elementary operations correspond to the data types handled by the processor. In C, the most common software language for embedded processors, common data types are ‘float’, ‘double’, ‘int’, ‘char’, ‘bool’. The numeric types refer to specific encodings of the numbers. ‘int’ (32-bit twos complement numbers) also supports bit field operations and is the type used for indices. Most languages (including VHDL) adhere to this scheme, and also support application-specific derived data types. Some support abstract data types hiding the implementation of the operations and the encoding of the data which, however, need to be based on the predefined encoding types. π-Nets implements a different type concept that involves a neat distinction between abstract entities (numbers) and their encoding by (still abstract) bit fields (Figure 7.17). There is a predefined abstract data type called NUM that represents the computable real numbers. NUM includes the standard arithmetic operations and comparisons as well as some transcendental functions (2∧, ld, sin, etc.), yet no Boolean operations. NUM algorithms operate on tuples of numbers and use the NUM operations and functions as building blocks. At this level, some useful data types can be defined as classes of special NUM functions given through NUM algorithms, e.g. the types ‘complex’, ‘quaternion’, or ‘vector’. Besides NUM, there is the type BIN of finite bit fields, i.e. the type of data used in the input and output streams of digital systems. For both NUM and BIN literals distinguished input formats are used and supported for textual output, too. On subsets of BIN, binary data types are defined that are classes of Boolean operations/functions on fixed size bit fields outputting fixed size bit fields, too, the sizes being specified as multiples of a word size attribute of the type. These types are defined either abstractly by just defining their behavior or
Figure 7.17 π-Nets type hierarchy
algorithmically based on another binary type. Among the binary types are basic sets of Boolean operations such as the type B16 that includes the standard SIMD operations (AND, OR, XOR, NOT) and binary mod(2^16) arithmetic operations. Then there are distinguished binary types that are fixed size encoding types for numbers, e.g. the types I16 and I32 of signed binary 16- and 32-bit numbers and the types F32 and F64 of single- and double-precision floating point numbers. These inherit the NUM operations but replace them with the corresponding Boolean functions on the codes. A particular floating point type, SF48, is defined algorithmically from the operations in B16. For all encoded NUM types the same input formats for numeric literals and textual output formats apply (e.g., ‘1’ and ‘1.0’ denote the same number). For every encoding type, the algorithmically defined classes in NUM carry over. The complex type can e.g. be used with I32 codes. Then complex numbers become encoded by 64-bit fields each concatenated from two integer codes. The π-Nets processors each provide a particular binary type with a particular word size and, optionally, additional encoding types. A processor providing a particular type, e.g. B16, also disposes of all types algorithmically derived from it, i.e. the types SF48 and ‘complex’ on top of SF48, but not of I32 and F64 which are not defined through B16 algorithms, unless they are provided as additional types. The ‘on’ statement with a processor selects the corresponding bit field type while the names of the encoded NUM types select these for the subsequent definitions (this extends to user-defined, application-specific classes of Boolean functions). For every binary type, algorithms can be defined on the basis of the supplied operations. Execution of these algorithms, however, requires the building blocks to be allocated ‘on’ a processor (the host can be used to ‘simulate’ the execution). The algorithmic notation serves to specify compositions of elementary operations of a selected set to form complex functions, yet not to define an order of execution on them (which would only make sense on a sequential processor). The syntax is fairly standard, using the usual infix notation for the arithmetic and Boolean operations, and prefix notation for the remaining functions. Compositions are specified as composite expressions. In a complex composition, the results of sub-functions can be given names using the notation ‘expression -> name’, as in

a (b + c) -> d

and be subsequently referenced by these. Names of intermediate results cannot be reassigned and do not correspond to storing the results in some memory device but only serve to reference them. There is an indexing structure that serves to define sets of similar operations. The line

idx 10 i\ a[i] + b[i] -> c[i]\

defines the values c[0] . . . c[9], yet does not specify storage into an array or a sequential loop structure. A composition of operations can be encapsulated in the definition of a function that can subsequently be referred to by the assigned name. Functions are used as templates for composite operations and do not necessarily translate into sub-routines on the target processor. Functions are called by the usual syntax (‘f(x, y, z)’) and can be used within expressions, or take expressions as arguments. π-Nets functions are either NUM functions or bit field functions constructed from the operations within a type or a selected class of special Boolean functions, in particular a class of encoded NUM operations.
A NUM function can be executed on every processor implementing some encoded NUM type, and a function defined within a class of encoded
NUM operations can be executed on every processor implementing this class. Functions may also be defined to be applicable on a particular processor only in order to allow different algorithms for different processors. A function defined ‘on’ a processor may also access the ports and variables allocated on it. All arguments and results of a function are of the same type (NUM, an encoded NUM type, or another bit field type), and the type of a function is given by the selected type and the numbers of arguments and results. A function of some encoded NUM type may call NUM functions and functions of other encoded NUM types. In the latter case, code conversions are implied. Encoded NUM functions are special functions on bit fields and may also be called from bit field functions of the type associated to the processor. In this case, there are no conversions apart from mapping a wide number code to several words of a smaller size. The basic syntax to define a function uses the keyword ‘fct’:

fct n m function name {      .. n, m are the numbers of parameter and result values
  -> x, y, z                 .. parameter names
  expression -> u            .. definitions of intermediate results
  ...
  r, s, t }                  .. list of result expressions or references
The functional expression within the {. . . } may include branches into alternative exits delivering different results, and the definition of a function with several branches may include self-references (recursion). Branches take the form

{ . . . if condition
    expressions, results
  else
    alternative expressions, results }
The condition for taking the ‘if’ branch is computed by a comparison operator (=, <>, <= etc.). The logical results of comparisons are an auxiliary, implicit type. Logical results can be output by a function with no results (m = 0) to control branches but are not input to operations and functions. The special logical operators ‘and’, ‘or’, and ‘not’ are provided to combine logical results and are equivalent to branch constructions (e.g., ‘if c1 and c2’ is equivalent to ‘if c1 if c2’ etc.). The elementary operations in the encoded NUM classes do not need to be defined for all input values (e.g. due to overflow), and the same holds for the functions defined in a program. A recursive function becomes undefined if the recursion depth exceeds a specified limit for a given input. A function can explicitly be made undefined for certain inputs by including a conditional expression that is not used as an ‘if’ condition. The function becomes undefined when the expression evaluates to false. This is used in the GCD algorithm in Listing 7.1 defined for an integer data type:

fct 2 1 gcd { -> a, b
  a > 0 and b > 0                   .. precondition defining the domain of definition
  if a = b, a
  else if a < b, gcd(b % a, a)      .. ‘%’ is the modulo operation
  else gcd(a % b, b) }

Listing 7.1 GCD algorithm
If a parameter value is applied for which a function or operation is not defined, an exception results. π-Nets provides a special variant of the if/else structure to handle such exceptions:

{ . . . if valid
    expressions, results
  else
    alternative expressions, results }
first attempts the computation of the ‘if’ branch. If an exception results, the ‘else’ branch is executed and delivers the results of the functional expression. If it fails too, the exception remains to be processed. Classes of functions operating on fixed size tuples of numbers or on fixed size bit strings representing an application-specific data type can be defined using the keyword ‘class’ followed by the size parameter in units of words and the class name. The names of class functions are composite and start with the class name. Thereafter the class name selects the class functions as the elementary operations of subsequent algorithms.
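The behavior of the ‘if valid’ variant corresponds to exception handling in conventional languages; a rough analogue in Python (illustration only, with an invented function) is:

def checked_divide(a, b):
    try:
        return a / b              # the 'if valid' branch is attempted first
    except ZeroDivisionError:
        return 0                  # the 'else' branch delivers the alternative result;
                                  # if it failed too, the exception would propagate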
7.7.3 Application Processes and Communications
A programming language intended for targets with more than one sequential processor needs to provide some notion of parallelism. Even for a single, embedded processor, interrupt processing has to be used to be able to synchronize to external events. This corresponds to executing several threads of instructions of different priorities on a single processor and generalizes to handling several asynchronous sub-tasks on it. Using several threads within an application program is supported by several languages, although not by C (where processes must be spawned by calls to an operating system from a main function). These include Modula, ADA, the real-time language Pearl, and OCCAM. ‘Processes’ are used as software abstractions of sequential processors executing instruction sequences in parallel (i.e., threads each executed on some hardware processor), and communications of data occur between different processes; an exception is OCCAM, where a process can be composed of parallel sub-processes executing on different processors. In contrast, π-Nets takes up the application model of section 1.5.2 where a single, even sequential process may span several processors which then communicate. π-Nets processes are defined statically and operate repetitively as long as the application system is running. In the languages mentioned before, processes are created dynamically starting from an initial process. A π-Nets application includes several processes that access the external input and output ports and memory structures for additional state input and output, call processing functions on the data, and also specify the control flow for the actions to be performed. It is the π-Nets processes that define the application processing on the target system, whereas the defined functions merely represent optional complex building blocks that may or may not be used. The processes synchronize with the handshaking interfaces they access and with each other through their mutual data transfers. The functions executed within a process are performed by a single one or several of the processors of the system. Each process using a sequential processor defines an instruction thread on it; a process using several processors defines a thread on each of them, including the exchange of control information to maintain a common control flow. The thread management on a particular processor does not rely on an operating system but is constructed by the compiler and tailored to the application.
In contrast to ‘pure’ functions, processes access ports and variables that have been allocated on some processors of the target system. The definition of a cyclic process is introduced by the keyword ‘cpc’ and takes the form

cpc process name { . . . port/variable accesses and function calls to be repeated .. }

The program within the curly brackets may branch into several exits and execute loops but is non-recursive as processes cannot be called like functions (they are all active from the start of the application). A process may perform host screen input and output and is assigned a window of its own if it does. A process stops if it executes the ‘stop’ statement in one of its branches and resumes after a ‘call’ from another process. The application processing ends once all processes have stopped. ‘cpc’ may be followed by parameters specifying a maximum cycle time and the maximum pipelining. The port accesses and write operations to variables in a process are performed in the order of the program text. Multiple, indexed port accesses start with the operation indexed by 0 and continue in ascending order. If a handshaking port is not ready, the port access fails and causes the process to be suspended until the port is ready. Synchronizing port accesses are preceded by the symbol ‘$$’ at the position from where the process is retried if the handshake failed. The ‘$$’ mark can also be given with an explicit condition for the continuation of the process, which is convenient for accesses to raw ports and data structures. ‘$$’ refers to the timing of the subsequent i/o and memory operations and divides them up into those occurring before or after the condition is met. Operations only depending on values read before the ‘$$’ barrier allow the compiler not to respect their position relative to it. ‘$$’ may be followed by a parameter indicating the maximum wait time. The following example for the XC161 processor shows a process that repetitively starts a conversion on the on-chip analogue-to-digital converter circuit (see section 8.1) by setting a particular port bit (bit 7 of the control port named ‘adcon’), suspends until the conversion is complete (as indicated by bit 8 in the same register), then reads the result from a port and outputs it by calling a display function (‘value >> x’ is the write operation to a port or a variable x):

cpc adc { 1 >> adcon.7
  $$ adcon.8 = 0
  display(addat) }
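The synchronization expressed by ‘$$’ can be pictured as a cooperative thread that gives up the processor while the condition is not yet met. The following Python generator is only an illustration, with a dictionary standing in for the ADC registers of the example.

def adc_process(regs):
    while True:
        regs["adcon"] |= 1 << 7               # start a conversion (set bit 7 of adcon)
        while regs["adcon"] & (1 << 8):       # '$$ adcon.8 = 0': wait until bit 8 is cleared
            yield                             # suspend; other threads may run meanwhile
        print("conversion result:", regs["addat"])
        yield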
A process starts running on the processor previously selected with an ‘On’ statement or inferred from the access to a port or a data structure assigned to it. It may switch to another processor and transfer data via the interfaces specified in the ‘link’ statements. This is specified by a special communication operator ‘#’ that is followed by the name of the processor on which the process continues. ‘#’ may be used with an assignment of local variables (‘x,y,z # -> u,v,w’). The compiler takes care of the different word sizes handled by the processors. The usage of ‘#’ without naming another processor for the continuation allows the compiler to automatically select and use a processor of the same class for this purpose (automatic allocation). The ‘#’ operator may be associated with a function defined on a specific processor to allow for implicit remote calls. Data structures may also be specified to allow remote accesses. The output-to-screen operator ‘?’ performs a remote call to the host to output a data word to the window associated with the calling process. Outputs from different processors executing
the same process occur in the order in which they are listed in the process definition, and synchronization within a section of a process running on some processor also affects the continuation on other processors even if there is no data transfer between them. A ‘stop’ statement executed on one of the processors stops the associated threads on other processors, too. In the process segment

. . . #HOST
$$ key -> c
#MC
1 >> adcon.7
$$ adcon.8 = 0
addat ?cr . . .
the section on the host waits for a key to be pressed at the process window. Only then does MC perform the conversion and output the result to the host screen. Processes on a sequential processor can only be synchronized to the event sequences signaled by an interrupt line. They usually require some specific response time if none of the events are to be lost. In this case, they are handled by a sequence of interrupt routine executions. As before, they are defined statically and become active from the start of an application. Often interrupt programs only serve to implement the buffering of input or output port data (other typical applications are the compression of input data before storing them in a buffer or the realization of specific output time patterns). Interrupt programs just for buffering data can be substituted by specifying a buffer size in the port declaration. Then interrupt processing or the application of DMA is synthesized by the compiler as defined in the CG module. The following definition shows a user-defined interrupt process inputting a block of data from an unbuffered port p to a buffer x and stopping until it is restarted by another process (the one reading the data), as it would otherwise be synthesized by the compiler:

cpc $$irq0 inp { $$ p >> x stop }        ..irq0 specifies the interrupt signal
The ‘if valid’ structure for handling fault conditions is also used to handle the case of a handshaking interface not being ready for a data transfer. It serves to merge several sequences of events and to implement timeouts. The following structure shows the alternative input from two synchronizing ports p1 and p2 into a variable x:

. . . $$
if valid, p1 -> x . . .
else p2 -> x . . .

If the valid access to p1 is substituted by the condition of a delay timer expiring that was started before ‘$$’, a timeout processing can be realized in the first branch. An important application of the cyclic π-Nets processes is within the π-Nets compiler itself. The host computer can be programmed to execute application processes too, and the compiler operates in one of the cyclic threads on the host processor (with an output window of its own). This thread cyclically waits for new text input from the user (edited in a dedicated input window), compiles it and downloads the compiled code, and allows individual processors to directly execute user commands given in the π-Nets syntax and to output results to the host screen, provided that the processors are linked to the host via interfaces specified in the link statements in the program header. These commands are executed on the respective processor within the compiler process, which hence is also distributed over more than one processor. The ability to issue commands to all processors within the target system permits the interactive viewing of the variables and control ports in these processors and the verification of the operation of the defined functions on them even while the application processes are running. This proves to be a powerful debugging facility. The downloading and starting of code on a remote processor require a
protocol to be executed on it in one of its threads. The ‘cfg’ statement to be discussed next allows specification of a server process doing this job in parallel to the application processes.
7.7.4 Configuration and Reconfiguration
A target system based on programmable processors and configurable logic (FPGA) requires configuration code (programs and FPGA configuration data) to be resident or to be loaded into the program and configuration memories of these components. In a network of processors and FPGA chips not every component needs to have a non-volatile memory chip of its own to deliver the configuration code when the system is powered up. It is enough to have some hardware facility to load the code for every component from some unique memory device within the system into a program or configuration RAM (this ‘bootstrapping’ of a processor program allows the use of a fairly slow non-volatile memory as long as the destination program RAM is fast enough for the required performance). A processor chip with an integrated program RAM or an FPGA can e.g. receive its configuration code from a serial EPROM, or via a configuration port from another processor. By letting the processors in a network bootstrap each other starting from a single one that reads the code from some non-volatile memory, the software for the entire network can reside there apart from the standard loader routines implementing the bootstrap protocols of the various components. Moving configuration code to program and configuration RAM implies that other programs and other configuration codes can be supplied as required to cover other applications of the same hardware or different functions to be performed in sequence for a single application. This is called reconfiguration. Reconfiguration may be partial, leaving portions of the system unchanged. Configuration is also needed for the interfaces of multi-purpose integrated processors and is usually carried out by the processor executing an initial program to write particular bit patterns into the interface control registers. Conventional languages do not distinguish configuration from other functions within an application. A C program for a single-chip micro controller typically starts with a long sequence of control register writes to configure the interfaces and calls to fairly complex user-defined functions to implement the bootstrapping before it enters a main loop in which the application functions are executed (apart from the functions executed in interrupt routines). In contrast, π-Nets has special provisions to support both aspects of configuration. A π-Nets program includes a collection of special configuration statements and functions and the application processes. From these the compiler automatically constructs an initialization sequence including the mutual bootstrapping of processors and a main loop executing all the processes in turn. The π-Nets configuration functions to be executed by a processor to configure the hardware as required are introduced by the keyword ‘cfg’ followed by a name and a sequence of instructions:

cfg name { ..control register writes and initializations of variables.. }

The ‘cfg’ functions are never called by their names but are all invoked once and for all at the start of the application. Their names merely serve to indicate the initialization performed by them. Every class definition implementing a particular type of automata optionally includes a ‘cfg’ function that describes how every object of the class needs to be initialized (otherwise, memory is uninitialized). ‘cfg’ functions are always defined ‘on’ a processor. The ‘cfg’ function with the
name of the processor is executed first and typically configures the memory interface, initializes processor registers, and moves some or all program code for the application processes to an internal or external RAM if required. Another form of the ‘cfg’ statement, which is more similar to the ‘link’ statement, specifies that a processor P configures another processor Q (i.e. downloads configuration code to it). It reads

cfg P, Q  xprot, rprot

where ‘xprot’ is a function on P that sends out the code for Q (allocated in the memory of P by the compiler) to Q via some interface and some protocol, and starts the execution on Q when it is invoked with a special call code. The configuration interface from P to Q does not need to be the same as the one specified in a ‘link P,Q’ statement. The compiler generates code for P that includes the code of Q and performs the download of the Q code to Q. In contrast to the link graph, the ‘cfg’ graph does not contain loops (circular paths). There may be processors that are not configured by others at all. Then the compiler simply places the compiled code for them into disk files. Before executing the application it must be inserted into the target memory using some extra tool. The optional ‘rprot’ in the ‘cfg’ statement specifies a process running on Q that handles the configuration messages received from P. It is usually defined within the CG module for Q and is compiled into a thread on Q that is executed in parallel to the application threads. Then configuration commands are still handled while the application is running. An important special case is the configuration of a target processor from the host (the workstation running the compiler). After the ‘cfg host,Q’ statement the compiled code is automatically downloaded to Q and can be started by a button on the host screen. Moreover, the host can issue commands for immediate execution. If Q uses the parallel protocol thread, the interactive commands are handled while the application is running. Otherwise commands are only executed before the start or after the end of the application processes. The initial downloading of the application threads and the protocol server requires that a preliminary server process is already running. The target processor may have resident firmware for this purpose or implement the protocol in hardware, or it may be bootstrapped by the configuring processor P using another protocol (some simple download protocol defined by the manufacturer of the processor). The automatic configuration by the host is the prerequisite to the interactive testing of individual functions within the target environment and to accessing the state of the component processors while the application is running. The setup shown in Figure 7.18 would e.g. be used to implement an application running on two micro controllers communicating via some
Figure 7.18 Interactive testing configuration for two micro controllers
interface like the CAN bus. The send and receive drivers for the CAN bus are available via references to pre-defined drivers in the CG module. The necessary initializations of the CAN hardware are added automatically. The Ethernet support based on a standard controller chip and using a low overhead non-standard protocol is similar. Alternatively UDP can be specified. The π-Nets program header for the CAN bus would read as in Listing 7.2.

proc (XC161) MC1, MC2
cfg host, mc1  tcps, tcpr      .. tcps is a predefined protocol function using the ‘sio’ interface
cfg host, mc2  tcps2, tcpr     .. same using the second serial interface of the host, ‘sio2’
link mc1, host  bsnd, sio
link mc2, host  bsnd, sio2
link mc1, mc2  cans, canr      .. ‘cans’ and ‘canr’ are predefined drivers to send and receive
link mc2, mc1  cans, canr      .. via the CAN bus

Listing 7.2 Definition of a CAN-based processor network
Reconfiguration of an FPGA or a processor component during the operation of the digital system is supported by an additional structure that can be applied within cyclic processes. The point from which such a process is repeated is not bound to be at the start but may be redefined by the operator ‘R#’. Then the process repeats until it reaches a branch containing the ‘R#’ operator and changes to a process that repeats from behind this operator, as shown in the following template:

. . .
if cond . . . .        .. repeat from previous R#
else R#                .. reconfigure
if cond . . . .        .. repeat from new R#
else R#                .. reconfigure for next R# section
. . .

The previous part of the process is no longer executed, and the executing processor gets reconfigured for the new part by loading new instructions or configuration data if the available hardware resources (program memory or FPGA cells) are not sufficient to support both parts. If other processes also using the processor remain unchanged, the reconfiguration becomes partial. The reconfiguration is subject to the control flow within the process and is actually part of the process. It involves the configuring processor, which may be the processor to be reconfigured itself if it can access the storage device holding the new code. An important special case of reconfiguration is for a processor like the CPU2 that only executes from a small memory. Before entering the new process section the processor would reload this small memory with new instructions, i.e. perform a software caching operation.
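The control flow implied by the ‘R#’ operator can be pictured as follows. This Python fragment is only an illustration of the idea, not of the actual compiler output; ‘sections’ and ‘load_section’ are hypothetical stand-ins for the process sections and for the (re)configuration or soft-caching step performed by the configuring processor.

def run_sectioned_process(sections, load_section):
    # sections: list of callables; each returns True when its 'R#' branch is taken
    current = 0
    load_section(current)                  # initial configuration of the processor/FPGA
    while current < len(sections):
        if sections[current]():            # repeat the current section until R# is reached
            current += 1
            if current < len(sections):
                load_section(current)      # (partial) reconfiguration or soft caching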
7.7.5 Hardware Targets
For an FPGA target, the corresponding CG module is implemented to construct an EDIF net list and to call an external program to perform the placement and routing. The first implementation has been undertaken for the AT40K FPGA. As already mentioned in section 3.1, there are other attempts to use a language otherwise applied to sequential processors to define
hardware structures as well. π-Nets is intended to provide a truly integrated programming environment for mixed FPGA and processor systems. The π-Nets CG module takes part in mapping composite operations to FPGA cells (similarly to packing processor instructions). Some higher-level functions are the generation of registered building blocks (cf. 1.4.3), the generation of automata from process definitions, and the synthesis of some auxiliary support functions. An FPGA to be used as a π-Nets processor implements the B1 data type and provides a space of elementary circuits that can be composed to form complex ones, and spaces of memory cells and ports. Algorithms based on the Boolean gate functions are defined in π-Nets after selecting the size-one bit field data type B1 that provides the elementary operations ‘&’ (AND), ‘|’ (OR), ‘||’ (XOR), ‘∼’ (NOT) and a few more. The algorithms serve as a structural definition of complex hardware circuits on the FPGA. In contrast to VHDL, there is no separate behavioral description for their functional simulation. The structural description can be used for a functional simulation on every processor implementing the B1 data type (in particular, on the host). To speed up such a simulation it is, however, possible to use a different algorithm for the simulating processor. The CG module for the FPGA can be implemented to also derive timing estimates for the circuits. In the code generation process some language elements need to be handled differently from an implementation on a sequential processor. The assignment of an intermediate result to a local variable is not related to storing it in a register or in a memory location but is only used to set up a net list including the instances of the operations and their wiring according to the algorithm. Using a named intermediate result several times means connecting several inputs to the instance outputting it. Calls to other functions are treated as macros and expanded into sub-networks using separate instances for the elementary operations (there is no code sharing through sub-routines). Branches within an algorithm are transformed into equivalent select constructions (section 1.2.2). Indexed operations are expanded into the equivalent, fully enumerated sets of individual operations, whereas on a sequential processor they could be implemented as loops. Definitions of functions may be recursive but require the specification of the recursion depth. The recursion is expanded up to this depth as explained in section 1.2.2. Besides the elementary operations of the B1 type implemented by the FPGA cells, there is a space for storage elements, too, that provide the temporary storage needed for variables. They are implemented by the flip-flops in the FPGA cells. Indexed operations that read and write multi-bit registers are fully expanded; in the FPGA hardware there is no provision to perform address calculations and indirect accesses to flip-flops. Exceptions to this are the integrated memory blocks of an FPGA. The circuits realized on the FPGA are defined by the cyclic application processes executing on the FPGA (and, maybe, on other processors as well). Processes access input and output ports and variables. A raw FPGA port corresponds to an external pin of the FPGA chip that is specified in the port definition. A process just accessing raw ports with no synchronization to particular events describes a reactive circuit that continuously follows the changes at the inputs. The process

cpc and gate { input1 & input2 >> output }
defines an FPGA circuit composed of a cell realizing the AND operation, wired to the input pins 'input1' and 'input2' and to the output pin 'output'. To support synchronous FPGA circuits, processes may synchronize with a unique clock signal; write operations to a variable must occur from a single synchronous process only. Synchronization is indicated by the '$$' symbol.

on FPGA var 16 x, 16 k port $133 opin      .. '$133' specifies the package pin
fct 32 16 add16 { -> 16 x, 16 y
  0 -> c[0] idx i\eadd(c[i],x[i],y[i]) -> q[i],c[i+1] / q }
cpc $+ clk
rtl { add16(x\, k\) -> h $$ h >> x
  h[15] >> opin }

Listing 7.3  π-Nets process implementing a fractional divider
The program segment in Listing 7.3 shows how to implement register transfer logic, i.e. a synchronous process repetitively computing new register values from the previous ones. It defines a fractional frequency divider (Figure 2.31). The function 'add16' called by it performs a 16-bit binary add operation using the ripple-carry algorithm ('eadd' is the full adder function). x is the 16-bit counter register, and k holds the increment value. Without a further timing specification, the store operations to the components of x and to the output 'opin' occur as early as possible after the clock event, and the function is computed before this. As usual, the process may contain branches and change the register values conditionally. The '$$' sites in a process mark the different states in a multi-cycle operation.

An FPGA is typically configured by another processor defined in a 'cfg p,q' statement. The selected protocol on the configuring processor accesses the configuration port of the FPGA. If the reconfiguration operator 'R#' is used in a process running on the FPGA and the FPGA resources are not sufficient to implement the new process section in parallel with the former one, the configuring processor (partially) reconfigures the FPGA circuit and thereby takes part in the FPGA process. Interactive configuration through the compile process immediately downloads configuration data to the FPGA. A configuration server process would continue to perform this configuration after the application processes have started. If these application processes involve the FPGA, configuration through the compile process is partial. The interactive configuration of an FPGA in the π-Nets environment is a unique quality which is, however, somewhat restricted by the complexity of the involved place and route algorithms.

In a mixed FPGA plus programmable processor application, a link specified between the processor and the FPGA permits a data exchange between them and allows the processor to invoke the FPGA function in a distributed process. Such a process then defines both a calling thread on the processor and a circuit on the FPGA and supports the use of the FPGA as a coprocessor or an i/o processor. The link statement defines a handshaking port for the processor that is usually realized on the FPGA, a common choice being to use the FPGA pins that are otherwise involved in the FPGA configuration. Registers on the FPGA hold the data from the processor and deliver the results of an FPGA operation.

Once simple sequential processors are implemented on an FPGA with the aid of an integrated programming environment supporting both FPGA and programmable processor components, the challenge arises to also synthesize the software running on the processor inside the FPGA. The approach chosen for the π-Nets environment is to use a control core that is predefined within the CG module and supported by predefined code generation functions. For an application that needs to sequentially execute operations on the same sub-circuit, instead of a processor with a control core and a data path, a class of special Boolean functions is defined that inherits from the predefined sequential controller class SER. The class declaration includes a word size parameter:

class (SER) 16 asp      .. defines application specific processor type 'asp'
The class functions then become the set of functions executed sequentially by an application-specific coprocessor attached to the predefined core. Application-specific, composite functions can then be executed on an object (a processor) of this class. From these givens the processor structure and the application program can be synthesized automatically. Although this approach does not implement a fully automatic, high-level synthesis of sequential processing structures (the programmer explicitly defines the functions to be executed serially on a multi-function circuit), it raises FPGA programming to a level where a processor core becomes an auxiliary structure that is generated automatically using the available FPGA resources.
7.7.6 Software Targets

On a sequential processor the operations to be performed to execute an algorithm are put into a sequential schedule (not necessarily the one corresponding to the textual order). The operations are carried out such that their results are stored in registers to allow subsequent operations to access them as operands. Only if the number of registers is not large enough to hold all the intermediate results to be used as operands are register-to-memory store and load operations inserted so that a register can be reused for another value. For current scheduling and register allocation algorithms, see [18, 53].

The functions in a program first of all serve as an abstraction that allows referencing of a composition of several operations by a single name, and do not necessarily compile into sub-routines. The decision to implement a function by a sub-routine depends on whether the function is called more than once from the application processes and on whether it is complex enough to make the code sharing attractive. If sub-routines are implemented for functions, there must be a calling convention specifying how to pass parameters to the sub-routine and where to find the results. The convention used by the π-Nets compiler is to pass parameters and results in registers in a fixed order. The local variables used by a function are allocated statically or dynamically in a stack-like structure. Loop structures and the control flow are compiled in a straightforward way into the available jump instructions of the processor.

The threads extracted from the process definitions are compiled into instruction sequences that are executed cyclically. If there is only one thread, a jump to the start of its instruction sequence is placed at its end. If there are several threads, the instruction lists for them are arranged so that they form the sections of a combined sequence that is repeated as a whole. The program performed by the processor is synthesized from the configuration statements, from the functions implemented as sub-routines, and from the process loop. The code for these threads and the sub-routines called by them is allocated at subsequent memory addresses in a memory range reserved for instructions and constant data. Local variables used in a thread are allocated statically (not in a dynamic data structure such as a stack). The static allocation
of these and the global variables used in a program is to subsequent memory addresses in a suitable memory range.

For every thread that needs to be suspended if a certain condition does not hold, the compiler introduces a state variable that holds the address from which the thread must be resumed to test the condition again. The thread performs a conditional branch to the next one in the overall cycle if the continuation condition is not satisfied. These conditional jumps are the context switches. They do not occur within sub-routines and hence do not have to switch between individual stacks for the threads. The interrupt threads are not part of this process loop but inherit their cyclic execution from the nature of the interrupt processing that is repeated from scratch for every new interrupt. The activation of the interrupt signal is the only continuation condition they wait for. The usual saving and restoring of the CPU context is performed using the return stack. The CPU to be supported only needs to provide the usual interrupt support to enable the low and high priority π-Nets processes.

A thread communicating with a thread on another processor via a port specified in a link statement does so by splitting the words to be transferred according to the width of the port. A 32-bit processor sending to a 16-bit processor via an 8-bit interface splits every 32-bit word to be transferred into 4 bytes and transfers these in series. At the receiving site, the bytes fill up two 16-bit words. If a processor receives data that do not fill up a processor word, these data are zero-padded until they do.

All definitions of functions and processes are transformed into a data structure (the 'intermediate code') that can be executed by the host. This serves a triple purpose. First, it permits the execution of π-Nets processes on the host. Second, processes defined for other processors (including FPGA) can be 'simulated' by the host. On the host, input and output ports are realized as disk files. For a simulation these would be prepared to deliver or to receive streams of input and output data. Finally, functions and operations called with constant arguments can be evaluated by the compiler (running on the host) instead of letting the target system do such evaluation.
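The compiled process loop and its context switches can be pictured with a small C sketch (all names here are hypothetical; the compiler emits the equivalent directly as machine code): each thread keeps a continuation state and a ready flag, suspending a thread simply stores its continuation, and a context switch is nothing more than falling through to the next thread of the combined cycle.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sketch of the cyclic process loop produced by the compiler:
   one continuation state and one ready flag per thread; suspending a
   thread stores its continuation and control passes to the next thread. */
typedef struct {
    int  cont;    /* continuation point to resume from            */
    bool ready;   /* set by the port/interface when it is ready   */
} thread_t;

static thread_t th[2];

/* Thread 0: waits for its handshaking port, then completes one cycle.  */
static void thread0(thread_t *t)
{
    switch (t->cont) {
    case 0:
        printf("thread0: work before the port access\n");
        t->cont = 1;              /* remember where to resume          */
        /* fall through */
    case 1:
        if (!t->ready)            /* port not ready: context switch    */
            return;
        printf("thread0: port access done\n");
        t->cont = 0;              /* cyclic execution: start over      */
        return;
    }
}

/* Thread 1: unconditional work in every round of the combined cycle.   */
static void thread1(thread_t *t)
{
    (void)t;
    printf("thread1: one cycle of work\n");
}

int main(void)
{
    for (int round = 0; round < 3; round++) {   /* the combined sequence */
        if (round == 2)
            th[0].ready = true;   /* e.g. the interface signals readiness */
        thread0(&th[0]);
        thread1(&th[1]);
    }
    return 0;
}
```

In the real compiler output the roles of `cont` and `ready` are played by the continuation pointers and ready bits described above, and the "return" is the conditional branch to the next thread.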
7.7.7 Architectural Support for HLL Programming

In the advertisements for commercial processors one sometimes finds the remark that the processor in question efficiently supports the use of higher level languages. HLL programming without caring about architecture-related details of the executing processor is a definite requirement in most projects. Therefore the compilation of HLL programs is a standard step in the implementation of a digital system that influences the net performance achieved with the selection of a particular processor. Compilers e.g. benefit from the availability of multiple, general purpose registers (i.e., achieve an acceptable performance with standard allocation procedures). The main reason to implement them, however, is that they provide fast operand accesses in spite of a limited memory bandwidth. The interrelated scheduling and register assignment steps for the operations to be executed are fairly complex tasks for the compiler.

A stack processor architecture not providing any addressable registers at all results in a simplified, straightforward compilation and thereby supports the execution of HLL programs. However, an efficient stack processor implementation needs extra hardware support to cache the stack and to access intermediate results buried in the stack or stored separately, so that no particular advantage remains in comparison to a register-based architecture. As a matter of fact, a processor is not made to simplify the task of building a compiler but to supply
elementary operations at a high rate, and the compiler is crafted to simplify the application of the processor by automatically adjusting algorithms to the register structure of the processor, setting up loops and applying special index registers. Thus, a processor might have non-standard architectural features that provide a means to raise its ALU efficiency but can only be exploited by a compiler with some extra effort. Basic requirements on a processor that hold for all levels of programming are that it has to provide fast operand accesses to be able to keep the ALU busy, and sub-routine and loop structures to allow for the multiple use of instruction sequences in order to limit the memory requirements. More related to HLL programming are the fast access to any number of calling parameters to a sub-routine and to intermediate results and, more generally, the abstraction from resource limitations of the hardware. To support the simultaneous use of subroutines in several contexts and recursive calls, the addressing of parameters and intermediate results must be indirect and allow offsets to access several data items using a single or a few address registers. Related to the indirect addressing of parameters and the implementation of sub-routines are their calling conventions. HLL function calls may have more parameters or results than there are registers to hold them. Then some of these parameters or results must be allocated in memory. Large, application-specific bit fields require indexed addressing capabilities such as using an address that is the sum of a base address and a multiple of an index register. HLL programming also involves standard data types that must be implemented efficiently on a general purpose processor from the available operations including the required exception handling. To summarize, basic HLL support includes:
• sufficiently large address spaces (using a hardware-supported memory management);
• fast operand accesses and efficient execution of standard operations;
• low-overhead branches to exception handlers;
• efficient implementation of nested sub-routines and loops;
• indirect addressing of parameters and intermediate results with constant offsets;
• parameter allocation in memory to subsequent indirect addresses (auto-incrementing);
• indexed addressing modes for data arrays of arbitrary size.

As well as these common requirements, a language can be defined to support some particular software methods which can in turn be supported by the special instructions and the structures of a processor. A processor optimized for the execution of Java byte code would e.g. provide hardware support for an evaluation stack and special address registers for accessing local variables, and a LISP processor would include special primitives for allocating list records and connecting them via address pointers. For π-Nets, special support is possible (yet not mandatory) for the following:
• the passing of the multiple results of a function, e.g. by providing a dedicated data stack;
• the special error handling structure;
• context switching between the multiple threads, and synchronization;
• communications.

The CPU2 in section 6.3.2 was defined following this analysis of the requirements on a processor to efficiently execute HLL and, in particular, π-Nets programs. The memory management is supported by the memory controller hardware and by special i/o instructions for block exchanges with the memory attached to it, which, however, have to be inserted by the compiler. Nested sub-routines and loops are uniquely supported by the CPU2 through the stack
implementation, the link register and the return executed in parallel with an ALU operation. The indirect addressing of calling parameters and intermediate results in memory can be satisfied by the single address register DP with its indirect-plus-offset addressing and auto-incrementing feature for store operations that allocates the stored values in a stack-like structure. The inclusion of the store instruction in the ALU operation generating the result further speeds up the store operations. Intermediate results obtained at the level of the π-Nets threads are allocated statically at absolute addresses. DP is only used within sub-routines from which no context switches occur (except for interrupts). Therefore the single DP register serves all threads and does not have to be stored and restored during a context switch. Sub-routines implementing π-Nets functions may also have multiple result words that may have to be passed via memory locations, too. For these the DP 'stack' is also used. The de-allocation of intermediate results at the end of a sub-routine may involve moving data down in the DP stack instead of just subtracting from DP. The return stack is not used for data at all and remains implicit.

The 'if .. else' control structure of π-Nets and other languages requires a single conditional branch instruction. The status bit F representing the branch condition holds the result of previous comparison operations. The unconditional jump required for the else branch is implemented by setting the F bit in parallel to an ALU operation and then performing the conditional jump.

Concerning the special support of π-Nets, the error handling state of the CPU2 implements the 'if valid . . . else' construction with a minimum of overhead. The error checking for the individual ALU operations is through an embedded conditional return instruction that does not cost an extra cycle if the error does not occur and otherwise jumps to the error handler branch and restores DP. The error handling state is entered (and DP saved in memory) by an instruction explicitly clearing the F bit and immediately followed by a conditional jump to the error handler branch (which, consequently, is not taken), assuming that all previous intermediate results have been stored to memory. A subsequent unconditional return (embedded into an ALU instruction) or the jump to the error handler terminates the error handling state. Sub-routines are treated as application-specific instructions and given similar capabilities w.r.t. the exception handling. If the processor is not in the error handling state, the error return becomes a sub-routine return that falls through to another error return until the processor falls back into the error handling state again and handles the exception by jumping to the error branch. This works because the return state is saved and restored with the call and return instructions.

A context switch between two π-Nets threads involves an indirect jump to the continuation address in the target process. The processes and their continuation pointers are called cyclically. The context switching can hence be supported by providing hardware for the management of the cyclic pointer list. The CPU2 reserves the 8 lowest memory addresses for the pointers and uses a 3-bit register identifying the current process, and a 1-bit register for each of the threads indicating whether it is ready to execute. The context switch occurs if a return is executed outside of any sub-routine or loop, i.e.
when the stack and the link register are empty. The context switch does not save and restore registers. The compiler is supposed to insert the instructions needed for this before the return causing it. A context switch is typically needed if a handshaking port is not ready for a data transfer. Then the continuation address of the process needs to be set so that the port access is retried when the process is called again. Otherwise the process might continue immediately. For the
sake of efficiency the port access should, in that situation, execute at a speed similar to a raw port access. The i/o instructions of the CPU2 force the context switching by optionally executing a conditional return if the handshaking is not ready. By letting the interface automaton control the ready bit of the calling thread, the i/o instruction is not retried before it can proceed. As explained in section 6.5.4, the i/o instructions are still more powerful and also support DMA block transfers, delaying the handshaking until the entire block has been received or transmitted. This feature is also used for the implementation of soft caching. A special case of the i/o instruction that includes the command code for the memory controller and the block receive operation is available to specifically support soft caching.
7.8 SUMMARY

At the system design level, instead of the individual processors or FPGA chips, networks of such components adjusted to the specific application requirements are considered. Only a few commercial designs move beyond the component level and include the networking concepts for scalable architectures; one is the Sharc family of digital signal processors. With the available processor and FPGA components scalable architectures can also be defined. We demonstrated a number of regular network structures that can be considered for large, configurable processor networks, and single-chip network architectures. Automated system design was briefly discussed, with some detail on allocation techniques in networks. At the system design level, the design tools need to support all kinds of components, and their use in networks. The semi-automatic π-Nets environment has been mainly used to explain various issues to be addressed by such tools, and a possible approach to them, but it can also serve for real applications.
EXERCISES

1. Consider the increment and decrement operations mod(7), and the multiply operations mod(7) by the numbers 3 and 5. All are permutations of the set {0, . . . , 6}. The set of arbitrary compositions formed by them is a sub-group G of the permutation group S7 that is generated by the set S of the above four permutations. Determine the size, the degree, and the diameter of the resulting Cayley graph and compare to a 2D torus of about the same size.

2. Show that if G1 and G2 are groups with generator sets S1 and S2, then the Cartesian product set G = G1 × G2 with multiplication by components is a group, too, and the set S of all the elements (s1, e2) and (e1, s2), with e1 and e2 being the neutral elements of G1 and G2 and s1 ∈ S1 and s2 ∈ S2, is a symmetric set of generators for G. The Cayley graph of (G, S) coincides with the Cartesian product graph of the Cayley graphs of (G1, S1) and (G2, S2). Hence all torus graphs including the hypercube are special Cayley graphs of commutative groups.

3. A construction that derives a large non-commutative group from two other given (even commutative) ones is the so-called wreath product. Let G, H be groups and F be the space
of all H-valued functions on G. The wreath product of G and H is the set G × F equipped with the product

(g, f) ∗ (g′, f′) = (g ∗ g′, f ∗ f′_g)

where the function f′_g on G is defined by f′_g(h) = f′(g⁻¹ ∗ h), and the product f ∗ f′_g is the function defined by (f ∗ f′_g)(h) = f(h) ∗ f′_g(h). Show that this really defines a group, and that for generator sets S and T for G and H a generator set of size |S| + |T| for the wreath product of size |G| ∗ |H|^|G| is obtained by taking the couples (s, e), where s ∈ S and e is the constant function on G taking the neutral element of H as its value, and the couples (eG, δt), where eG is the neutral element of G and δt is the function on G taking the value t ∈ T at eG and the neutral element of H otherwise. Show that in the case of G = Zn, S = {1, n−1} and H = Z2, T = {1} the wreath product yields the CCC graph.

4. The computational capabilities of regular networks of identical automata receiving input from their neighbors have been a research topic since the time of von Neumann [20, 79, 80]. Automata networks can also be considered that are based on Cayley graphs. Let a single storage bit automaton be attached to every group element g and hold the value b_g. All automata are updated synchronously in every time step to the new values

b′_g = Σ_{s ∈ S} b_{gs}

where S is the generator set of the group G and the add operation is mod(2). Investigate the behavior of this network for simple examples like G = Z7, S = {1, 6} for limit cycles and fixed points.

5. Implement a placement and routing algorithm for a given net list in a crossbar network based on a line graph.

6. Compose a π-Nets application from the examples in section 7.7 running on two micro controllers connected via a CAN bus. One acquires an analog input, the other outputs the analog value to a display.
8 Digital Signal Processors
8.1 DIGITAL SIGNAL PROCESSING

To illustrate how to design and implement application-specific processing systems we consider a special application domain, the processing of continuous ('analog') signals. The importance of this area is obvious from the fact that the signals occurring in physical systems are primarily of this kind. Classically, such signals have been processed through electronic components such as networks of resistors, capacitors, operational amplifiers and non-linear circuits, implementing filters, modulators, and computational operations. Complex processing functions realized with such means suffer from a lack of reproducibility and precision and from a dependency on temperature, noise, etc., problems which are not present in digital processors (at the expense of much higher circuit complexities).
8.1.1 Analog-to-Digital Conversion

The first step to be taken to process a continuous input, e.g. a time-varying voltage e(t) ranging in some interval [U1, U2], by means of a digital processor is to perform a conversion of the input value at a given time t0 into an encoded number (the A/D conversion). e(t0) is called a sample of the time function e taken at t0. Reciprocally, the digital processor outputs encoded numbers that have to be converted into a continuous signal with a proportional magnitude (the D/A conversion).

The most basic step towards performing such a conversion is the comparator that compares the input voltage e to a reference voltage r. The output is H if e > r + d and L if e < r − d for some small value d, and unspecified otherwise. The comparator is realized as a high gain difference amplifier that is driven to saturation once the absolute input difference is above d. To avoid a constant input near r producing an oscillating output due to the presence of noise, the comparator may be given a hysteresis. As usual, such a circuit needs a processing time from applying the input to getting the valid output (10 ns..10 µs, depending on the circuit and on d).
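As a small behavioural model of the comparator just described (thresholds, data types and names chosen arbitrarily for the sketch), the hysteresis can be expressed by letting the effective switching threshold depend on the present output:

```c
#include <stdbool.h>

/* Sketch of a comparator with hysteresis: the effective threshold depends
   on the present output, so noise around the reference r cannot make the
   output oscillate.  r and d are in volts; all values are hypothetical.  */
typedef struct { double r, d; bool out; } comparator_t;

static bool compare(comparator_t *c, double e)
{
    if (!c->out && e > c->r + c->d)
        c->out = true;               /* input clearly above the reference */
    else if (c->out && e < c->r - c->d)
        c->out = false;              /* input clearly below the reference */
    return c->out;                   /* unchanged inside the +/-d band    */
}
```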
In order to perform the comparison at the specific time t0, it may be necessary to 'freeze' the input to the comparator during this processing time in a sample and hold circuit and then freeze the output in a storage circuit until it is actually read by the digital processor. The input hold circuit may be realized by a capacitor at the input to the comparator that is disconnected from the input source at the time t0.

In order to convert an input voltage e ranging in [0, U] w.r.t. some reference potential into an n-bit binary code representing the integer m < 2^n so that 0 V ≤ e − mU/2^n < U/2^n, a number of comparison operations are carried out in parallel or serially (keeping the analogue input frozen as long as necessary). A parallel conversion would use 2^n − 1 comparators with the references connected to the values mU/2^n. The 2^n − 1 output bits of the comparators would be digitally encoded in the desired n-bit word. The reference voltages can be derived from a single reference r = U by means of a ladder of 2^n resistors connected in series. Such circuits are the fastest A/D converters and are called flash converters. They are available as integrated circuits for n ≤ 10 with conversion rates of up to several 100 MHz.

For a serial conversion, a single comparator can be used by applying a series of reference voltages. Not all reference values of the parallel conversion are needed. Instead, the first comparison is with the U/2 reference; then it is known in which of the intervals [0, U/2] or [U/2, U] the input is located, and the search for the right code is continued there. One actually performs a binary search for the best fitting code, using n comparisons only. This method is called successive approximation. Converters based on it are also available as integrated circuits for n ≤ 16 and conversion rates of up to several MHz. There are also intermediate converters performing some comparisons in parallel, and high-speed converters suitable for digitizing video signals that perform the subsequent comparisons of the successive approximation in a pipeline using several comparators. They operate at the throughput rates of flash converters but need a higher processing time, and only n comparators. The highest resolution converters, used e.g. for digital volt meters, apply all reference values in a linear fashion (actually a continuous linear ramp) and perform the equivalent of a time measurement.

The conversion locates the input voltage in one of 2^n intervals of size U/2^n. W.r.t. the centers of these intervals the actual input differs by a maximum of U/2^(n+1), the so-called quantization error. There are additional errors due to the generation of the reference voltages and also to the gains and offsets of the comparators.

The basic element of the D/A conversion is the electronic switch controlled by a digital output, simply an n-channel transistor in saturation that behaves like a resistor with a very low resistance if H is applied to its gate, and a very high resistance if L is applied. If the switch is used to connect a resistor R driven by a reference voltage U to a summing node, a current of I = U/R or 0 depending on the digital input enters the node. The logic output is not used to directly drive the resistor as the H and L intervals are intentionally large and would not allow a precise output. The current can be converted to a voltage by means of an operational amplifier if needed [19].
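The binary search of the successive approximation can be sketched as follows; 'analog_compare' stands in for the comparator and the held input is assumed to stay frozen for all n comparisons (these parts are of course analog hardware, not software):

```c
/* Sketch of an n-bit successive-approximation conversion: each step tests
   one bit by comparing the held input against the trial reference level
   trial*U/2^n and keeps the bit if the input is above it.
   analog_compare(e, ref) models the comparator (nonzero if e >= ref).    */
static unsigned sar_convert(double e_held, double U, int n,
                            int (*analog_compare)(double e, double ref))
{
    unsigned code = 0;
    for (int bit = n - 1; bit >= 0; bit--) {
        unsigned trial = code | (1u << bit);              /* test this bit */
        double ref = (double)trial * U / (double)(1u << n);
        if (analog_compare(e_held, ref))
            code = trial;                                 /* keep the bit  */
    }
    return code;      /* code*U/2^n <= e_held < (code+1)*U/2^n (ideally)   */
}
```

Each iteration halves the remaining interval, so n comparisons suffice for an n-bit code, exactly as stated above.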
An n-bit binary number can be converted into a proportional current by letting each bit switch a current to the summing node and letting the ith bit control the current Ii = 2^i·I0 = U/(R/2^i). The set of n weighted currents Ii is easily obtained from a so-called R–2R network of resistors of the same size, which is easier to manufacture (Figure 8.1). Such resistor networks along with the switches are available as integrated circuits for n ≤ 16. This type of D/A converter has the additional property that the analogue output is strictly proportional to the reference U. It may e.g. be used to implement a digital gain control in an amplifier circuit.
Figure 8.1  D/A converter using an R–2R ladder (3-bit example: the bits b2, b1, b0 switch binary weighted currents derived from the reference U; the summed output current is I = (4·b2 + 2·b1 + b0)/16 ∗ U/R)
Another way to convert a multi-bit word is to generate a pulse width modulated signal (PWM) by connecting a digital n-bit comparator to the output of a free running n-bit counter. If the PWM output controls the switch, the average current is proportional to the pulse width and hence to the input of the comparator. The period of the signal is given by the counter clock period times 2^n; typically for this method, n ≤ 12. The period must be adjusted to the averaging performed on the pulses by means of some analog filter. More generally, a signal may be represented as a continuous sequence of single-bit numbers clocked out at a high rate and passed through a converter that changes the bit values into voltages or currents. These are integrated by an analog filter to yield a smooth analog signal output. This method is e.g. used for audio converters and 'digital' power amplifiers.
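A behavioural sketch of the PWM scheme (one evaluation per counter clock; the analog averaging filter is outside the model, and the names are hypothetical):

```c
#include <stdint.h>

/* Sketch of PWM D/A conversion: a free-running n-bit counter is compared
   with the input word; the switch is on while the counter is below the
   input, so the duty cycle is input/2^n and the averaged (filtered)
   current is proportional to the input word.                            */
typedef struct { uint32_t counter, period; } pwm_t;   /* period = 2^n     */

static int pwm_step(pwm_t *p, uint32_t input)
{
    int out = (p->counter < input);             /* digital comparator     */
    p->counter = (p->counter + 1) % p->period;  /* free-running counter   */
    return out;                                 /* drives the switch      */
}
```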
8.1.2 Signal Sampling

If a digital processor is used to process a time varying signal, the signal is converted into a series of samples taken periodically with a sampling frequency fs (taking equispaced samples is the usual way). Thus the continuous time signal e(t) is converted into the series of numbers e_n = e(t0 + n/fs). The well-known sampling theorem states that the continuous time signal can be recovered from the series e_n of equispaced samples if fs is chosen to be larger than twice the bandwidth b of the signal (i.e. in the representation of the real signal as a superposition of phase-shifted sin(ωt) signals obtained through the Fourier transform, no components with a frequency higher than b occur). Thus no information about the signal is lost by just considering the chosen samples, and all processing of the original signal can – in principle – be substituted by a processing of the samples. If the processing results in another signal g(t) of at most the same bandwidth, then it suffices to calculate samples g_n of it at a rate given by the same frequency fs (if its bandwidth is lower, a lower rate fs′ suffices). The signal g(t) can then be reconstructed from the g_n according to the formula

g(t) = Σ_{n=−∞}^{∞} g_n · sin(2π fs (t − t0 − n/fs)) / (2π fs (t − t0 − n/fs))
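Truncated to the available samples, the reconstruction formula can be evaluated numerically as in the following sketch (in a real system the interpolation is performed by analog or polyphase filters rather than by this direct sum):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Sketch: evaluate the reconstruction sum for time t from N samples g[n]
   taken at rate fs starting at t0, using the sin(x)/x kernel truncated
   to the available samples.                                             */
static double reconstruct(const double *g, int N,
                          double fs, double t0, double t)
{
    double sum = 0.0;
    for (int n = 0; n < N; n++) {
        double x = 2.0 * M_PI * fs * (t - t0 - n / fs);
        sum += g[n] * (fabs(x) < 1e-12 ? 1.0 : sin(x) / x);
    }
    return sum;
}
```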
The A/D conversion can e.g. be simplified by sampling the signal at a much higher rate than required by the sampling theorem using a single-bit converter (a comparator) only and some analog preprocessing. The resulting bit sequence (bn ) is passed through a digital filtering processor that computes high resolution samples, the rate of which is reduced according to
Figure 8.2  Σ-Δ conversion (analog adder and integrator followed by a comparator producing the single-bit stream b_n at rate N∗fs; a digital filter delivers the samples e_n at rate fs)
Figure 8.3  Complex sampling and baseband sampling (sample pairs #1, #2, #3 of e(t) taken about a quarter carrier period apart)
the bandwidth. The Σ-Δ converter structure shown in Figure 8.2 is e.g. used for audio signals [14]. It requires continuous operation (in contrast to taking individual samples) and some processing delay for the digital filtering. Such converters are offered as integrated standard components by several manufacturers.

The sampling theorem and the reconstruction formula hold for 'baseband' signals having Fourier components in the interval [−b, b] only. With a simple modification, it carries over to the case of a narrow-band signal centered around some carrier frequency f0. A signal of this kind can be represented as

e(t) = er(t) cos ωt + ei(t) sin ωt,   with ω = 2π f0.

The pair of signals (er(t), ei(t)) is the complex amplitude modulating the carrier. It only has frequency components in [−b/2, b/2] and needs to be sampled with a frequency of at least b (in total, real samples are needed at a minimum rate of 2b). An easy way to obtain samples of (er(t), ei(t)) without having to generate and premultiply the cos(ωt) and sin(ωt) signals (with a small superimposed error that increases with the bandwidth) is to choose the sampling frequency to be fs = 4f0. If g, h are subsequent samples of e(t), then 2^(−1/2)·(h + g, h − g) can be used as samples of (er(t), ei(t)) at the midpoint between the sampling times for g and h, taken at a reduced rate due to the assumed low bandwidth. This procedure, called complex (or 'quadrature') sampling, amounts to taking pairs of samples separated by about a quarter of the carrier period at a reduced rate (Figure 8.3). It corresponds to continuously sampling the signal at the rate fs, multiplying by the samples of the complex carrier, filtering by summing up adjacent samples to suppress signal components near fs/2, and then sub-sampling according to the bandwidth. The operation of forming (h + g, h − g) can be combined with the further processing of the samples. The approximation error for the complex samples can be further reduced by using more than two subsequent samples for each, e.g. by replacing the sum 1/2(h + g) by its mean value with an extra sample taken at the mid-point.

Instead of processing a single time signal e(t), a DSP system may be used to process an array of signals so that the samples become complex, indexed data structures. As well as depending on the time variable the signal may depend on other, continuous input variables as well. An example of this kind is the intensity e(x, y, t) of the light emitted from an image. All
Figure 8.4  Analog video signal (one horizontal line with black level and line sync pulse)
coordinates would be sampled so that the signal is converted into an array of ‘pixels’ ei, j,n , with 0 ≤ i < Nx , 0 ≤ j < N y and n being the time index. For a colored video signal, there would be additional signals defining the color, or separate intensities e R , eG , e B would be used for the red, green and blue components. For video signals the time parameter is usually sampled at image rates of up to a few 100 Hz only, but for typical spatial resolutions of Nx , Ny = 300..1600, every image sample may contain a million pixels that need to be processed, input and output at the image rates. One to two subsequent images are usually stored in 2D storage arrays providing indexed access to the pixels. Video signals are transmitted using a single analog signal to sequentially transfer images line by line within the image sample interval synchronously to the electron beam scanning the screen of a display tube, and suitable to directly modulate its intensity. For an analog video signal, the x coordinate is continuous, but the y and t coordinates are replaced by the discrete coordinates j, n. Synchronizing events identifying the start of a new image sample and within an image the start of every new line are transferred separately or integrated into the video signal as well forming a ‘composite’ video signal with a precisely defined timing (Figure 8.4). There are special A/D converter chips such as the SAA7113 from Philips that convert an analog video signal as it would e.g. be generated by a CCD image sensor chip into a serial data stream of digital words transmitted along with separate clock and frame synchronization signals. CCD imaging chips are an integrated array of photoelectric sensors. The charge coupled device (CCD) technology is used to shift out the analog signals (charges) generated at the individual sensor pixels and to output them in a time serial fashion through a synthetic video signal. Charge can be transferred from a capacitor to a neighboring one in a chain by raising its foot-point potential and discharging it to the other one down to a threshold through a transistor switch (with a loss of energy, other than in Figure 2.25). For the generation of video signals, fast D/A converters convert triple input words into the analog signals required for the red, green, and blue (RGB) pixel intensities of a cathode ray display tube (CRT) or a TFT monitor with a compatible interface.
8.1.3 DSP System Structure

The set-up of a digital signal processing system transforming a band limited signal e(t) into another signal g(t) is as shown in Figure 8.5. The signal is converted into a sequence of samples e_n taken at the sampling rate fs. If the signal contains components outside the band to be processed, a low pass filter must be used to definitely limit the band to less than fs/2. Otherwise higher frequency components would show up (alias) in the samples in the same way as baseband frequencies. For typical audio signal processors, the e_n are signed 16-bit numbers using the 2's complement integer encoding. The 16-bit samples g_m would be computed by the processor at some rate fs′ as required for the bandwidth of the signal g(t) and output to a D/A converter. The resulting staircase waveform is then smoothed by another low pass filter to perform the interpolation.
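The per-sample time budget implied by this structure can be made explicit in a sketch of the processing loop; read_adc and write_dac are hypothetical stand-ins for the converter interfaces (here only stubs), and process_sample must complete within one sample period (about 21 µs at fs = 48 kHz):

```c
#include <stdint.h>

/* Sketch of the processing loop implied by Figure 8.5.  The converter
   interfaces are stubbed; in a real system they block until the sample
   clock, which enforces the 1/fs budget on process_sample().            */
static int16_t read_adc(void)             { return 0; }     /* stub: e_n  */
static void    write_dac(int16_t sample)  { (void)sample; } /* stub: g_m  */
static int16_t process_sample(int16_t in) { return in; }    /* placeholder */

void dsp_loop(int nsamples)
{
    for (int i = 0; i < nsamples; i++) {   /* one iteration per 1/fs      */
        int16_t en = read_adc();           /* sample arrives at rate fs   */
        int16_t gm = process_sample(en);   /* must finish within 1/fs     */
        write_dac(gm);                     /* output at rate fs'          */
    }
}
```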
Figure 8.5  DSP system structure with analogue input and output: e(t) → anti-alias filter (analogue) → A/D at rate fs → samples e_n → digital processor → samples g_m → D/A at rate fs′ → interpolation filter (analogue) → g(t)
A large number of filtering operations can be moved to the digital domain in this way. Any frequency response can in fact be arbitrarily approximated by a digital processor. An important consequence of this set-up is that the digital processor operates under the constraint that the time available for the processing of an individual output sample is limited by the output sampling period 1/fs′. For audio processing e.g., a standard sampling rate both at the input and at the output is 48 kHz, which translates into an allowed time per sample output of 21 µs. In order to simplify the analog filters the input sampling rate can be increased. Then the digital processor can perform a low pass filter computation at the smaller sampling rate before further processing the samples. Similarly, the interpolation can be performed partly by the digital processor, so that interpolated samples are output at a higher rate.

Applications other than filtering may input some signals but not generate a continuous output signal, performing some other processing instead, e.g. to detect patterns in the signal for such purposes as speech or object recognition and localization, or to perform some signal analysis and display the results. Also, there are many applications in which a DSP does not input signal samples at all but synthesizes and outputs complex signals (e.g., audio or video signals). A periodic audio signal can e.g. be synthesized by storing a set of samples representing a single period (a 'wave table'). A set of 256 samples would e.g. cover more than 100 overtones. The wave table can be read out periodically at the audio sampling rate and be output to the D/A converters. If a different frequency is required, it is read out by stepping through it at a different pace, interpolating between adjacent samples if required.

For the special case of video signals, a processor may be used to perform pattern recognition in images captured from a video sensor, or to synthesize and output complex image sequences. A low resolution monochrome CCD sensor as e.g. used in simple robotics applications would deliver 384 × 288 × 25 pixels per second. For video output, a rate considered low-end would be 640 × 480 × 50, each pixel being encoded by separate 8-bit numbers for the R, G and B components (in total 24 bits per pixel). Even if there is no change to the image data, the output to a display would require the continuous generation of an output data stream of 15M bytes/s from an image 'frame' buffer memory of nearly 1M bytes. The synthesis and the processing of video signals require very high processing rates supported by dedicated hardware, fast interfaces and large memory buffers.

Compression and decompression of the video data are essential for communicating and storing image data [75], but also for the synthesis of visual output. The 24 bits per pixel may e.g. be reduced to 8 bits by using a table of 256 entries to associate 24-bit RGB values to 8-bit codes. The combination of such a table with the D/A converters for the RGB outputs is a common component of video signal generators called a color palette. The outputting is still at a rate of 5M bytes/s from a frame buffer of 300 k bytes. The simplest form of hardware-supported decompression is the use of character generators for textual output to a video display. Characters are displayed as sub-images of, say, 8 × 16 single-bit (black or white) pixels that are selected by the display circuit from a ROM table using the 8-bit ASCII character code
Figure 8.6  Triangle painting (fill line by line along the display lines; the fill-line endpoints are interpolated along the triangle edges)
as a table index. Then only a page of 2400 character codes remains to be output from a memory buffer to the video processing circuit at the rate of 50 Hz (120 k bytes/s) to generate a 640 × 480 monochrome video signal using some DMA circuit.

Graphical displays do not use this kind of on-the-fly decompression but output pixel-by-pixel from a large frame buffer at a high rate. Synthetic images are decompressed from a much smaller data structure and written into the frame buffer in a number of steps (the so-called rendering pipeline [74]) that can be supported through hardware circuits to generate the pixel data at the required rate. The image is usually represented as a set of triangle coordinates, some of which have to be displayed and filled with various patterns using special, fast algorithms. The endpoints of the fill lines can be interpolated from the triangle coordinates by performing incremental add and compare operations only, and the same applies to interpolating the pixel values or to copying them from some texture template (Figure 8.6). A triangle is filled line by line to exploit the fast page mode access to a video RAM addressed so that lines are contained within pages. The overwriting of pixels in the foreground can be inhibited by storing a z-coordinate for each pixel (in a 'z-buffer') or by ordering and selecting the triangles to be displayed. Changing views and the motion of objects in a 3D scene are realized by performing affine coordinate transformations on the triangles. Recent dedicated graphics chips perform this synthesis of video data into a fast video frame memory by employing a considerable amount of hardware, exploiting the possibility to compute several pixels in parallel. The multi-media operations found in the workstation class processors or an SH-4 (6.6.4) also support the synthesis of video data.

Synthetic video (and audio) signals are thus expanded from fairly small parameter sets to implement the man–machine interface (and thereby become enriched by secondary effects at the 'man' side). Temporal filtering of the image sequence is used for data compression (by encoding changes between subsequent images). Filtering in the x and y directions only is used for various image analysis and enhancement steps [67]. Some common image processing steps are non-linear and implemented by special Boolean operations on the pixel values. In the subsequent discussion of DSP algorithms and processors, we concentrate on some basic linear algorithms that apply to time signals e(t) and multi-dimensional signals (e.g., e(x, y, t)) as well.
8.2 DSP ALGORITHMS

In this section, some common DSP algorithms applied to the sequences of samples taken from some signal e(t) are considered. The main reference on this is [14]. [62] is an introduction also
covering implementations on a DSP chip, and [3] explains the relationship of algorithms and architectures. An analysis of their complexity and the available time reveals that even for the moderate audio sample rates the computational requirements are significant.
8.2.1 FIR Filters

A standard algorithm for digital filtering is the FIR filter. Here:

g_n = c_0·e_n + c_1·e_{n−1} + · · · + c_{N−1}·e_{n−N+1}        (1)
The sequence (g_n) is the convolution product of the input sequence with the finite sequence of coefficients (c_j). The formula defines g_n as the dot product of two vectors, which is a general linear operation that comes up in many other contexts as well, e.g. as the basic operation in artificial neural networks. The length N of the filter is determined by the resolution with which a desired frequency response may be prescribed. The coefficients can be computed to produce any independently prescribed gains and phase shifts to the sampled sine input sequences with frequencies that are integer multiples k of fs/N, for −N/2 < k < N/2. As a rule of thumb, such a filter can implement a narrow band pass of width 2fs/N. If an audio equalizer has to be built that allows the gain setting for sub bands of width 100 Hz, then for fs = 48 kHz, N is about 1000. Typically, the filter output is computed at the input sampling rate. N determines the amount of processing per sample and the number of input samples required and to be stored for the computation. For N = 1000, the time available to process a sample translates into a rate of 96 million arithmetical operations per second.

The e_n and c_i in equation (1) are real numbers that have to be encoded on a DSP. Samples usually come up in a fixed point format, and the coefficients ranging in the interval (−1, 1) may be encoded in a similar way. The FIR algorithm with a 16-bit integer input and 16-bit fixed point coefficient data performs signed 16 × 16-bit multiplications which have a 31-bit result, of which the upper 16 bits represent a voltage in the input range. If 16 bits are needed for the final result and every product were rounded to 16 places, the rounding error could be as high as 2^(−16) and accumulate through the summation to N∗2^(−16) in the worst case (in practice, this would not happen). In order to avoid this for arbitrary N, the summation can be implemented to take care of the full product words. If a 16-bit output encoding is needed, a rounding operation from the 14th place is required (the 0th is the least significant bit, LSB). In the summation, overflow may occur beyond the 31-bit word size. This can be handled by providing a wider word size for the sum. Implementations use 8 extra bits, allowing for the improbable event of 256 full size products of the same sign to accumulate (Figure 8.7). If at the end some of the extra bits are unequal to the sign bit of the lower 16 bits of the rounded result word (overflow condition), the result cannot be represented with 16 bits. In this case the 16-bit result is substituted by the most negative or the most positive 16-bit value in order to obtain an overflow behavior similar to that of an amplifier circuit (saturation).

An important special case of the FIR filter is the symmetric FIR filter with the additional property that c_i = c_{N−1−i}. These are common, as symmetry corresponds to the property of linear phase or constant group delay [14]. For the symmetric FIR filter, the distributive law can be applied to arrive at a simpler algorithm to compute the FIR output g_n:

g_n = c_0·(e_n + e_{n−N+1}) + c_1·(e_{n−1} + e_{n−N+2}) + · · · + c_{N/2−1}·(e_{n−N/2+1} + e_{n−N/2})        (2)
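A sketch of the fixed-point FIR computation of equation (1) with the word sizes discussed above (16-bit samples and fractional Q15 coefficients, a wide accumulator providing the extra guard bits, rounding from the 14th place and saturation as in Figure 8.7); the buffer layout is a simplifying assumption:

```c
#include <stdint.h>

/* Sketch of the fixed-point FIR filter of equation (1): x[] holds the
   N most recent samples (newest first, x[i] = e_{n-i}) and c[] the Q15
   coefficients.  Full 31-bit products are accumulated in a wide word
   (guard bits), then rounded and saturated to a 16-bit result.          */
static int16_t fir(const int16_t *x, const int16_t *c, int N)
{
    int64_t acc = 0;                            /* wide accumulator       */
    for (int i = 0; i < N; i++)
        acc += (int32_t)c[i] * (int32_t)x[i];   /* signed 16x16, 31 bits  */

    acc += 1 << 14;                 /* round from the 14th place          */
    acc >>= 15;                     /* keep the upper 16-bit result
                                       (arithmetic shift assumed)         */

    if (acc >  32767) acc =  32767; /* saturate like an amplifier         */
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}
```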
Figure 8.7  Conversion from the result of the fixed point FIR computation (the overflow bits and the 16-bit middle word determine saturation; the 15-bit low word is used for rounding the result)
If samples and coefficients are encoded as 16-bit numbers again, the sum of two samples is a 17-bit number, and the multiplication is a signed 16-bit by 17-bit one.

The least mean squares (LMS) algorithm is an extension to the FIR algorithms that adapts the coefficients of the FIR filter in order to let the output become similar to a reference sequence (s_n) in the sense that the difference sequence (s_n − g_n) should have a low energy [61]. The adaptation rule for the coefficients c_i after computing g_n using (1) is:

c_i′ = c_i + ε ∗ (s_n − g_n) ∗ e_{n−i}        (3)

where ε is a small constant. If the adaptation step (3) is executed at the sampling rate, another N products and sums need to be computed. The LMS algorithm can also be applied to symmetric filters. Then the adaptation rule for the c_i becomes:

c_i′ = c_i + ε ∗ (s_n − g_n) ∗ (e_{n−i} + e_{n−N+i+1}),   for 0 ≤ i < N/2        (3′)
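In floating point, the adaptation step (3) is a single loop over the coefficient vector, as in this sketch (x[i] is assumed to hold e_{n−i}, newest sample first):

```c
/* Sketch of the LMS update (3): after the filter output g has been
   computed for the current input window, each coefficient is nudged in
   the direction that reduces the error against the reference sample s.  */
static void lms_update(double *c, const double *x, int N,
                       double s, double g, double eps)
{
    double err = s - g;
    for (int i = 0; i < N; i++)
        c[i] += eps * err * x[i];   /* c_i' = c_i + eps*(s_n-g_n)*e_{n-i} */
}
```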
8.2.2 Fast Fourier Transform

The discrete Fourier transform (DFT) serves to transform a finite real or complex vector of samples (x_r), r = 0, .., n−1 into the frequency domain, i.e. to represent it as a linear superposition of complex sine vectors. The complex sine vector with the normalized frequency k that circles k times around the unit circle as r runs from 0 to n−1 is the sequence (z^{kr}), with z = e^{2πi/n}. This can in turn be used to implement filters that multiply this sine sequence, considered as a periodic sequence of signal samples, by a constant complex phase factor h_k. For 0 ≤ k < n the kth Fourier coefficient of the sequence x is the dot product

y_k = Σ_{r=0}^{n−1} x_r z^{−kr}        (4)

With these coefficients, x can be expressed as

x_r = (1/n) Σ_{k=0}^{n−1} y_k z^{kr}        (5)

and the filter applied to the (cyclically repeated) sequence (x_r) yields the (periodic) output sequence

g_r = (1/n) Σ_{k=0}^{n−1} y_k h_k z^{kr}        (6)
The DFT transforming the sequence (x_r) into the sequence (y_k) consumes O(n²) operations but can be simplified to the fast Fourier transform (FFT). If n is even, using the identity z^{n/2} = −1 we obtain for 0 ≤ k < n/2:

y_k = Σ_{r=0}^{n/2−1} x_{2r} (z²)^{−kr} + z^{−k} Σ_{r=0}^{n/2−1} x_{2r+1} (z²)^{−kr}        (7a)

y_{k+n/2} = Σ_{r=0}^{n/2−1} x_{2r} (z²)^{−kr} − z^{−k} Σ_{r=0}^{n/2−1} x_{2r+1} (z²)^{−kr}        (7b)
Thus the DFT breaks up into two half size DFTs applied to the sequences (x_{2r}) and (x_{2r+1}) followed by n/2 operations of the type

(a, b) → (a + wb, a − wb)        (8)
the so-called FFT butterflies (cf. Chapter 1, exercise 7). If n is a power of 2, n = 2^m, the same procedure can be continued with the half size DFTs, obtaining 4 quarter size DFTs etc., and the DFT is performed in m 'passes' of n/2 butterflies. The final pass uses the factors w = z^{−k}, k < n/2, the one before twice the factors w = (z²)^{−k}, k < n/4; the first pass does not use multiplication at all, and the second uses w = −i only, which is a formal operation not needing a multiplier. Thus the first two passes do not really contribute multiply operations. Each butterfly comprises 2 complex adds and a complex multiply (6 real adds and 4 real multiplies), and has 4 real inputs and 4 real outputs. This is the radix-2 decimation in time FFT algorithm which hence contains (n/2)·log₂n butterflies and reduces the complexity from O(n²) to O(n log n). The FFT is the result of the 'divide and conquer' principle applied to the DFT.

In a similar fashion, the Fourier coefficients y_k, y_{k+n/4}, y_{k+n/2}, y_{k+3n/4} can be rewritten for 0 ≤ k < n/4 using the identity z^{n/4} = i, to perform the DFT for n = 4^m in (n/4)·m operations of the type

(a, b, c, d) → (a + wb + w²c + w³d,  a − iwb − w²c + iw³d,  a − wb + w²c − w³d,  a + iwb − w²c − iw³d)        (9)
called radix-4 butterflies and comprising 8 complex adds (4 of the 12 adds in equation (9) can be shared) and 3 complex multiplies (22 real adds and 8 real multiplies), as the multiplication by i is just formal, and 8 real inputs and 8 real outputs. The first pass actually does not contribute real multiplies at all. Compared to the radix-2 transform, the number of multiplies is further reduced by 25%.

If the input x to the DFT is real, the Fourier coefficients y_k fulfill the symmetry relation

y_k° = y_{n−1−k}        (10)
with '°' denoting complex conjugation. Thus, both y_k and y_{n−1−k} do not need to be computed. Beyond just simplifying the final butterfly pass, the computation of the FFT can actually be reduced to a half size FFT. Let x′ and x″ denote the vectors (x_{2r}) and (x_{2r+1}), 0 ≤ r < n/2, and x^c = x′ + i x″. If y′, y″ and y^c are the corresponding Fourier transforms, then y^c = y′ + i y″. According to equation (10), y′ and y″ can be obtained from y^c as

y_k′ = 1/2 (y_k^c + y_{n−1−k}^{c°}),   y_k″ = −1/2 i (y_k^c − y_{n−1−k}^{c°})        (11)

From equations (7a) and (7b) we conclude:

y_k = y_k′ + z^{−k} y_k″,   y_{k+n/2} = y_k′ − z^{−k} y_k″        (12)
Thus for n = 2^m, the effort to compute y via the FFT applied to x^c becomes n/2 + (n/4)·log₂(n/2) butterflies.

The inverse DFT computing the original vector from its Fourier coefficients is shown in equation (5). It differs from the DFT only by the sign in the exponent and the factor 1/n and can therefore be computed as the DFT of (y_{n−1−k}) using the FFT algorithm. If the symmetry of equation (10) holds, the inverse transform gives a real result and can be simplified by inverting the steps taken for the direct transform with real input explained before. First, half size sequences y′, y″ are formed as

y_k′ = 1/2 (y_k + y_{k+n/2}),   y_k″ = 1/2 z^k (y_k − y_{k+n/2})        (13)
Then the inverse transform is applied to the sequence y^c = y′ + i y″. Finally, the result vector x^c is reinterpreted as a real vector of twice the length. The complexity becomes similar to the one of the direct transform applied to real input.

Before implementing the FFT on a processor the data encoding must be decided on. Inputs from a converter will come up in a fixed point format, and it appears natural to use a fixed point format throughout the transformation, in particular as the complex factors to be applied are on the unit circle. After s butterfly passes, size 2^s DFTs have been calculated, and in equation (4) the fixed point encoding will need s more bits to represent the results. In contrast to the direct FIR computation, the multiplication is applied to the enlarged word size in the next pass. The fixed point format must be larger than the one needed to represent the samples from the A/D converter. If instead a division by 2 (shift and round) is applied before every pass, the rounding errors will accumulate. To avoid large rounding errors, some fractional places should be kept from the result of the multiplication and be used in the subsequent pass. For 16-bit input data and transform sizes of 2048, a 32-bit fixed point format would support 5 binary fractional places.

As a compromise to avoid overflow in fixed point arithmetic but without providing extra places for a more precise multiplication result and just using a small word size multiplier (16 or 24 bit), the output vectors can be conditionally scaled down if the maximum component size could give rise to an overflow in the next pass. The actual scaling count is similar to a common exponent for the entire vector. Therefore fixed point with dynamic scaling is known as block floating point. With it, excessive rounding errors are avoided, and a higher dynamic range is maintained for the result.

Overflow considerations can be avoided even for much larger transforms if a floating point encoding is used, e.g. the single precision 32-bit format with a 24-bit mantissa and an 8-bit exponent; a format with a 32-bit mantissa is also common. The large mantissa size helps to keep rounding errors low. Floating point calculations can also suffer from cancellation (extinction) errors. The absolute errors in an FFT calculation through the use of floating point operations, however, remain small as there are no multiplications by large numbers, and the errors are dominated by the quantization error for the input data. The floating point output vector claims a multi-digit accuracy for the result that is fake, especially for small Fourier coefficients.

Important applications of the DFT are spectral analysis and filtering (section 8.2.3). For spectral analysis, one would apply a bank of narrow band pass filters to a signal and determine its power in these bands. The kth DFT coefficient (4) can be interpreted as the output of an FIR filter that lets the complex sine (z^{kn}) of frequency k pass but blocks (z^{k′n}) for all other k′. The signal would be premultiplied by a window function [14] before the application of the DFT coefficient filter in order to obtain a better band pass characteristic.
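For reference, a compact floating-point sketch of the radix-2 decimation-in-time FFT of equations (7a)/(7b), written recursively for clarity; production implementations use the iterative, in-place passes described above:

```c
#include <complex.h>
#include <math.h>

/* Sketch of the radix-2 decimation-in-time FFT of equations (7a)/(7b).
   n must be a power of two; x is overwritten with the coefficients y_k. */
static void fft(double complex *x, int n)
{
    if (n < 2)
        return;

    double complex even[n / 2], odd[n / 2];     /* C99 VLAs, sketch only   */
    for (int r = 0; r < n / 2; r++) {
        even[r] = x[2 * r];                     /* x' : even-index samples */
        odd[r]  = x[2 * r + 1];                 /* x'': odd-index samples  */
    }
    fft(even, n / 2);                           /* half size DFTs          */
    fft(odd,  n / 2);

    const double pi = acos(-1.0);
    for (int k = 0; k < n / 2; k++) {           /* n/2 butterflies         */
        double complex w = cexp(-2.0 * pi * I * k / n);   /* w = z^(-k)    */
        x[k]         = even[k] + w * odd[k];              /* (7a)          */
        x[k + n / 2] = even[k] - w * odd[k];              /* (7b)          */
    }
}
```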
The DFT is used on 2D or 3D data arrays (e.g. to the pixels of a gray scale image), too, to extract features (e.g., from fingerprints) or to perform filter operations. The 2D DFT of an array (x_{r,s}), r, s = 0 . . . n - 1 is defined by

y_{k,l} = \sum_{r=0}^{n-1} \sum_{s=0}^{n-1} x_{r,s} z^{-kr} z^{-ls}
It can be calculated using the FFT separately in both dimensions. The required number of butterflies is 2n (n/2 ld(n)) = (n^2/2) ld(n^2), or k n^{k-1} (n/2 ld(n)) = (n^k/2) ld(n^k) for a k-dimensional transform.
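The separable computation can be sketched as follows in C; fft1d() is a hypothetical 1D complex FFT routine, and the use of separate real/imaginary arrays with a stride argument is an assumption made for brevity.

#include <stddef.h>

/* assumed available: n-point complex FFT over re[0], re[stride], ... and im[...] */
void fft1d(double *re, double *im, size_t n, size_t stride);

/* 2D FFT of an n x n array stored row-major: transform the rows, then the columns. */
void fft2d(double *re, double *im, size_t n)
{
    for (size_t r = 0; r < n; r++)        /* 1D transforms of the rows */
        fft1d(re + r * n, im + r * n, n, 1);
    for (size_t c = 0; c < n; c++)        /* 1D transforms of the columns */
        fft1d(re + c, im + c, n, n);      /* stride n steps down a column */
}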
8.2.3 Fast Convolution and Correlation

If we define the cyclic convolution of arbitrary complex vectors x and c as the vector a with the components:

a_r = \sum_{k=0}^{n-1} c_k x_{r-k}        (14)
for r = 0, .., n - 1, with the index r - k for x formed mod(n), and denote the discrete Fourier transforms of the vectors x, c, a by y, h, b, then b can be computed from h and y in a much simpler way. For all k,

b_k = h_k y_k        (15)
Now let n = 2N, and x be the vector of samples (e_{m-2N+1}, . . . , e_m), and c the vector (c_0, . . . , c_{N-1}, 0, .., 0) of the FIR coefficients in equation (1) padded with N zeros to a vector of dimension 2N. Then the N latest FIR outputs g_{m-2N+1+r}, r = N .. 2N-1, coincide with the a_r in equation (14). Thus the vector of the amplitude and phase coefficients in equation (6) is the discrete Fourier transform of c, and the FIR outputs can be obtained by applying the inverse FFT to the b_k. Hence, to compute the N latest FIR outputs from the 2N latest signal samples, one can apply the 2N input real FFT to the signal vector, perform the filter operation in equation (15) in the frequency domain and transform back to the time domain with the 2N input inverse FFT in the special case of a real output vector. This is the so-called fast (real) convolution. The total number of operations needed to calculate a block of N output samples becomes:

N/2 * log2(N) + N    butterflies for the 2N input real FFT
N                    complex multiplies (2nd half obtained by symmetry)
N/2 * log2(N) + N    butterflies for the inverse FFT
which is much less than the original 2N^2 real operations for large N. For example, for N = 1024 we obtain about 126000 real operations (adds or multiplies) using the fast convolution, compared to about 2000000 using the direct calculation. The FIR filter is applied to a continuous stream of samples. Therefore the computation for the block of the N latest output samples must be repeated every N sample periods, using the most recent 2N samples as its input, i.e., the Fourier coefficients of the input signal are computed at the reduced rate of f_s/N, and the inputs to subsequent FFT computations overlap by 50% (Figure 8.8). This reduced rate can be related to the reduced bandwidth
Figure 8.8 Timing of a continuous fast convolution
of each coefficient filter, which, as noted above, is a special FIR filter given by equation (4) applied to the input. If the time to compute the output block is T_p and the output samples are again output at the sampling frequency, the delay from receiving the input sample e_n to outputting the corresponding filter output g_n given by equation (1) is larger than T_p + (N-1)/f_s. It may be as large as (2N-1)/f_s, whereas for a direct FIR implementation it can be below 1/f_s. The processing time per output has, however, been reduced: for N = 1024 and a sample rate of 48 kHz, the resulting rate of operations drops from 96M op/s to 6.2M op/s. The complexity of the filtering algorithm can be further reduced by using larger block and transform sizes and thereby a reduced overlap of the input sequences, yet at the expense of a larger block processing delay. If in the N = 1024 example blocks of 3072 samples are computed using a 4096 input FFT, the number of operations becomes 272000, compared to the 378000 operations with the smaller block size. This 28% reduction is at the expense of a three times larger block processing delay and more memory. An important application of the FIR computation arises when the short-term correlation of two signals is needed. Then the coefficients in equation (1) are themselves samples of a reference signal in time-reversed order. Applying the fast FIR algorithm then amounts to fast correlation. It may be used to estimate the time delay between two similar signals. The fast convolution can be applied to the LMS adaptive filter too [61]. Then the updating of the coefficients is performed on a per block basis.
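One step of the block-wise fast convolution described above (50% input overlap) might be organized as in the following C sketch; rfft2N(), irfft2N(), the spectrum arrays and the block size N are hypothetical placeholders rather than a specific library interface.

#include <string.h>

#define N 1024                     /* output block size (assumption) */

extern void rfft2N(const float *in, float *re, float *im);       /* 2N-point real FFT */
extern void irfft2N(const float *re, const float *im, float *out);
extern float Hre[N + 1], Him[N + 1];   /* filter spectrum, bins 0..N (rest by symmetry) */

void fast_fir_block(const float x[2 * N], float y[N])
{
    float Yre[N + 1], Yim[N + 1];
    rfft2N(x, Yre, Yim);                       /* transform the 2N latest samples */
    for (int k = 0; k <= N; k++) {             /* pointwise filtering, equation (15) */
        float re = Yre[k] * Hre[k] - Yim[k] * Him[k];
        float im = Yre[k] * Him[k] + Yim[k] * Hre[k];
        Yre[k] = re;
        Yim[k] = im;
    }
    float g[2 * N];
    irfft2N(Yre, Yim, g);                      /* back to the time domain */
    memcpy(y, g + N, N * sizeof(float));       /* keep the N valid (latest) outputs */
}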
8.2.4 Building Blocks for DSP Algorithms

The FIR and FFT algorithms are each composed of a number of similar, application-specific building blocks that can be considered for the compute circuits of sequential processors dedicated to executing them. The most basic building blocks in these algorithms are the multiply and add operations (at the gate level both are, of course, composite), but composite expressions occurring in the algorithms can be considered as building blocks too. If a single or a few types of building block suffice, or a multifunction one with a few functions only, the program becomes simpler (can even be generated by a simple automaton), and if the building block is complex, the program becomes shorter. The building block receives multiple inputs in every cycle, typically from memories holding the input vectors. In order to keep the memory costs and the costs for generating the operand addresses low, the building block must be chosen so that as few parallel inputs need to be supplied as possible, and be supported by registers in the data path to avoid extra memory cycles to store intermediate results. The memory interface can also be simplified by interleaving subsequent computations that share some operands so that the operands are fetched only once.
Figure 8.9 FIR building blocks
For the FIR computation, the composite building block in Figure 8.9(a) is the MAC structure performing the operation a*b + s, s being an intermediate partial sum in the accumulator register. This building block connected to the accumulator register requires two parallel input words for every application. The more complex building block in Figure 8.9(b) performs the composite operation of adding two products to a partial sum, i.e. the composition a*c + b*d + s. It needs four parallel inputs (a, b, c, d) in every cycle and thereby drives up the address generation costs. With the same amount of add and multiply circuits one can also compute the pair of results a*c + s1, b*d + s2, using a second accumulator register. Then the combined operation is slightly faster, and the total sum can be obtained by adding the accumulators at the end. The symmetric FIR filter can use the building block (a + b)*c + s shown in Figure 8.9(c) that only requires three parallel inputs. As the FIR computation is repeated on a stream of input data, there is the option to combine two or even several subsequent computations into one that computes several output samples. Then the computations can be interleaved so that the coefficient input is the same for the products summed to the partial FIR sums, using e.g. the building block in Figure 8.9(d) to compute the pair of results a*c + s1, b*c + s2, with c = c_i, a = e_{n-i} and b = e_{n+1-i} in the ith step of the serial computation. This building block would be connected to two sum registers to form a dual MAC and need only three inputs per cycle. Now the sample e_{n-i} = e_{(n+1)-(i+1)} used for a in the ith cycle is the sample used for b in the (i+1)st cycle. Therefore instead of reloading it from memory, it can be passed via an additional input register. With this register, only 2 inputs per cycle are needed for the dual MAC structure. The same trick can be used for the single MAC structure. Two input registers are used, and a single input is applied per cycle alternating between samples and coefficients. There need to be two accumulator registers that are switched cycle by cycle. The required data memory bandwidth has been halved through the use of these registers.
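As a behavioural illustration of the dual MAC building block of Figure 8.9(d), the following C model computes two consecutive outputs of the same FIR filter in one pass, sharing each coefficient fetch and passing one sample on through an input register; the array conventions are assumptions of the sketch, not a particular chip's register set.

/* Two consecutive FIR outputs g_n and g_{n+1} accumulated in one pass over the
   coefficients: two memory reads per step, the third operand comes from 'held'. */
void fir_dual(const short *c,     /* c[0..N-1]: filter coefficients              */
              const short *e,     /* e[0] = e_n, e[-i] = e_{n-i}, e[1] = e_{n+1} */
              int N, long *g_n, long *g_n1)
{
    long s0 = 0, s1 = 0;
    short held = e[1];                 /* input register, starts with e_{n+1}    */
    for (int i = 0; i < N; i++) {
        short coeff  = c[i];           /* memory read 1: coefficient c_i         */
        short sample = e[-i];          /* memory read 2: sample e_{n-i}          */
        s0 += (long)coeff * sample;    /* contributes c_i * e_{n-i}   to g_n     */
        s1 += (long)coeff * held;      /* contributes c_i * e_{n+1-i} to g_{n+1} */
        held = sample;                 /* e_{n-i} is e_{(n+1)-(i+1)} next round  */
    }
    *g_n  = s0;
    *g_n1 = s1;
}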
To summarize, the building block selection for a DSP algorithm is made so that:

• a block of several computations is carried out in parallel sharing input parameters;
• the results of the building block get separate registers for the computations in the block;
• input registers are introduced in order to further reduce the number of memory accesses;
• the computations of the block are carried out in parallel or in several cycles, also distributing the memory accesses to these cycles.

The LMS adaptation step (3) for c_i is similar to a single MAC operation but requires three memory accesses to load c_i and e_{n-i} and to store the updated c_i, assuming the common
multiplier stands in a register. It can be interleaved with the multiplication of c_i by e_{n+1-i} in the computation of the next output without having to load the operands again. The combined operation can be executed using two MAC circuits, or on a single one using two result registers and performing the three accesses in two cycles. For the FFT algorithm and the filtering in the frequency domain, the multiply and add operations are on complex numbers, and a DSP might implement them directly. The complex multiply corresponds to four real multiply and two real add/subtract operations. A less complex implementation results from the identity

(a + ib)(c + id) = ac - bd + i((a + b)(c + d) - ac - bd)
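This identity trades one real multiply for extra additions; a plain C rendering, with a locally defined complex type used only for illustration, reads:

typedef struct { float re, im; } cplx;

/* (a+ib)(c+id) with three real multiplies instead of four */
static cplx cmul3(cplx x, cplx y)
{
    float ac = x.re * y.re;
    float bd = x.im * y.im;
    float s  = (x.re + x.im) * (y.re + y.im);  /* (a+b)(c+d) = ac + ad + bc + bd */
    cplx r   = { ac - bd, s - ac - bd };       /* real part ac-bd, imaginary part ad+bc */
    return r;
}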
The memory interface must supply double width operands (compared to real inputs). The butterfly is composed of a multiply of an input by a complex factor, which may be shared between several butterflies and stored in a register, and the 2-input DFT operation computing the pair (u + v, u - v) with two complex inputs and outputs. A still more complex building block is the entire butterfly with two complex inputs and outputs per cycle as well (equivalently, 4 real inputs and outputs). The fairly large memory bandwidth requirement is due to the clustering of several operations into a building block that is executed on several different input vector components during an FFT pass without sharing data and without mutual data dependencies. For the radix-4 butterfly, 8 real inputs and 8 real outputs are needed. If, however, the same resources (multipliers and adders) used for the radix-2 butterfly are applied to compute a radix-4 butterfly in four steps, the memory bandwidth requirement is halved in comparison to the computation by passes. For a DSP supporting both the FIR and the FFT algorithms the use of a single multiplier and parallel add capabilities is the common baseline. If a real 2-input DFT operation computing (a + b, a - b) is supported in parallel to a multiply, then the radix-2 butterfly can be computed in 4 cycles using 2 memory accesses per cycle. This is achieved by some current DSP chips. The FFT passes can be subdivided into independent half passes and executed by an SIMD structure obtained by duplicating the data path and providing 4 memory accesses per cycle (or 2 double width accesses). Following the above remark, radix-4 butterflies can actually be computed using half the number of memory accesses. Thus both the FIR and the FFT computations can efficiently use 2 real multipliers with just 2 real memory accesses per cycle, and 4 multipliers would need 4 parallel memory accesses per cycle.
8.3 INTEGRATED DSP CHIPS

Starting with the FIR computation, any digital processor used for this has to perform multiply and add operations. For fairly short filters, the fact that the coefficients are constant can be exploited by using distributed arithmetic (see section 4.5). Longer FIR filters might be decomposed into short ones. The required rate of 48M multiply-add operations/sec in a typical filtering application suggests that a single, parallel MAC circuit (Figure 8.9(a)) that operates this fast would be a natural choice for the compute circuit of a DSP to be used for long FIR computations. Both multiplier operands must be loaded in parallel from memories holding the samples and the coefficients using subsequent addresses for the consecutive operand accesses. Integrated 'integer' DSP chips supporting FIR computation on 16-bit or 24-bit signed binary codes offer these features. The sequence of multiply and MAC operations is controlled
by reading instructions from an instruction memory. During every sampling period, an initial multiply followed by N-1 MAC operations is required. If only FIR filtering had to be supported, then a control circuit reading instructions from a memory could be substituted by a much simpler automaton cyclically producing a single-bit operation select output of this kind. Standard DSP chips are designed to be closer to general purpose processors and to provide additional operations and program control. Then their resources can also be shared between several FIR computations in a multi-channel application and be used for other algorithms as well (e.g., IIR filters [14]). Standard integer DSP chips address mass markets requiring a high level of integration. They integrate enough memory to support standard filtering applications without extra memory chips (except for some program ROM) and also integrate special serial interfaces to A/D and D/A converters. In addition to the efficient multiply-add structure, the more complex (and costly) 'floating point' DSP chips are 32-bit processors also providing the single-precision floating point data type that is most useful for FFT-based algorithms. The premium paid for this allows FIR computations to use the fast convolution algorithm. The system cost using this algorithm may be lower than that of a design based on multiple integer processors. Some recent high-performance integer and floating-point architectures use a single control circuit and a single instruction sequence to control two independent multiply-add circuits, or ones that perform identical operations on different data. This latter, SIMD-type operation is useful as many DSP algorithms, in particular FIR filtering, are performed in parallel on several data channels, e.g. on a left and right audio channel. A single DFT was also shown to decompose into similar half size ones. DSP chips differ from the general purpose processors and micro controllers discussed in Chapter 6 by some characteristic features:
• specialization to a particular or a few fixed-size number encoding types (their memory is usually addressed by words of this size, not by bytes);
• the availability of a parallel MAC operation for this type (or related DSP primitives);
• dual operand accesses in parallel to arithmetic operations and instruction fetches;
• additional measures to support an almost 100% efficiency for the costly multiplier.

In an FIR computation, a standard micro controller without a MAC circuit would sequentially load operands for the multiplication, perform the required wide add operation in several cycles and use up additional cycles to manage a loop. Standard micro controllers are targeted to other application areas and are slower by a factor of 10 to 100 compared to dedicated DSP chips. If this lower speed is sufficient for a particular application, they can, of course, be used as well. Nevertheless, there is some convergence between dedicated DSP chips and high-end microcontrollers. The XC161 (cf. section 6.6.2) e.g. covers simple DSP applications through its MAC unit, and the SH-4 supports efficient FIR and IIR implementations, in particular through its multimedia instructions. On the other hand, DSP chips like the ADSP2199 and Blackfin will be seen to integrate standard and control-oriented interfaces. The TMS320F2812 and the DSP56F8356 presented as fast micro controllers both are descended from DSP families and efficiently implement FIR filters too. The FIR algorithm may be implemented as a long sequence of multiply and add instructions. Alternatively, a program loop controlled by a loop counter can be used. Most DSP control circuits support this by including a dedicated counter register and registers that are loaded with the start and end addresses of the instruction range to be repeated. In order to
Figure 8.10 Circular buffer holding the latest N samples
execute the loop, the processor is placed into a special loop mode. Then the position of the instruction being executed is compared to the end address of the loop by a hardware circuit to trigger the conditional continuation from the start address. Thus the conditional jump instruction at the end of the loop, whose execution time would otherwise deteriorate the ALU efficiency, is avoided. This method is called zero-overhead looping. Another, simpler repeat structure for a single instruction, also found on most DSP chips, consists in blocking the fetches of new instructions and executing the same instruction until the loop count has expired. On most DSP chips this mechanism does not permit interrupts. Note that the method explained in Chapter 6 for the CPU2 achieves the zero-overhead loop by simpler means, yet investing in an extra instruction bit for the conditional return to the start of the loop.

The samples and coefficients are thus read from memories in parallel to the computation. The arrival of a new sample may be used to start a new, direct FIR computation. It may overwrite the oldest sample stored in memory, the one which is no longer used for the new output. Instead of using a shift register-type memory, one uses a circular buffer in which the latest sample is at an arbitrary position which moves backwards by one position with every new sample until it reaches the lowest address (Figure 8.10). Then the writing position jumps to the highest address. Reading the stored samples in the forward direction starts from the position of the latest sample, jumping to the start address after reading from the top address. The control circuits of the DSP chips support this scheme by means of circular address pointer registers that are automatically updated after every read operation to the next circular address. The FFT-based FIR computation has higher memory requirements.

Integrated DSP chips are offered by several manufacturers, including Texas Instruments, Analog Devices, and Motorola. The following sections present a small selection of them. All manufacturers offer well-established families of chips that have been updated to higher speeds and a higher amount of integration through architectural evolution and by re-implementing them in more recent semiconductor processes. As already remarked for the FPGA offerings and standard CPU chips, the applied semiconductor process with its related density and speed dominates the characteristics of a DSP chip and may outweigh the advantages of another architecture through a higher instruction rate. The various DSP chips are general purpose processors that differ in many details although each one could be used for almost every DSP application. As in the case of the general purpose processors (section 6.6.1), the system's designer needs to have some criteria of comparison that are not offered by the manufacturers in a uniform way. Performance, e.g., is application specific, and manufacturers tend to publish performance data for applications that are particularly well served by their architecture. Sensitive data such as the power consumption of a chip during a particular algorithm, or comparative DSP data are hard to get; some are published by [63]. Common DSP benchmarks are the GSM suite and MPEG encoding and decoding [75]. Processor independent benchmarks suffer from dependency on the quality of encoding by a compiler.
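A minimal C model of the circular buffer of Figure 8.10 makes the wrap-around explicit that the circular pointer registers of a DSP perform in hardware; the buffer length of 512 is an illustrative assumption.

#define NTAPS 512                      /* buffer length (illustrative)           */

static short buf[NTAPS];               /* holds e_n, e_{n-1}, ..., e_{n-NTAPS+1} */
static int   wr = NTAPS - 1;           /* next write position; moves backwards   */

void put_sample(short e_new)           /* called once per sampling period        */
{
    buf[wr] = e_new;                           /* overwrite the oldest sample    */
    wr = (wr == 0) ? NTAPS - 1 : wr - 1;       /* step backwards, wrap at 0      */
}

long fir(const short *coeff)           /* direct FIR over the circular buffer    */
{
    long acc = 0;
    int  rd  = (wr + 1) % NTAPS;               /* position of the latest sample  */
    for (int i = 0; i < NTAPS; i++) {
        acc += (long)coeff[i] * buf[rd];       /* c_i * e_{n-i}, reading forward */
        rd = (rd + 1) % NTAPS;                 /* wrap to the start after the top */
    }
    return acc;
}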
Many DSP applications are related to communications and, besides the processing of the signals, involve the encoding and decoding of information and error correction.
Table 8.1 DSP features useful for comparisons
Compute resources offered:
– available compute circuits (supported operations and data types)
– number of parallel memory operand accesses
Performance evaluated for optimally coded key algorithms (FIR filter and variants, FFT, IIR filter, LMS, fast wavelet transform, software floating point for integer DSP chips, Viterbi algorithm):
– net frequencies of the needed type of arithmetical operations resulting from
  – the instruction rate
  – the density of arithmetical operations in the actual code (counting parallel operations)
– features to reduce overhead for multiplexing threads (IRQ support, DMA, coprocessors)
Cost measures for typical DSP applications:
– cost for manufacturing the hardware of the entire DSP system resulting from
  – power requirements (and related hardware costs)
  – integration of memory
  – integration of i/o and system functions
  – support for multi-DSP systems
  – packaging (board level costs)
– power dissipation (operating costs)
Therefore some standard DSP chips extend their application-specific support, e.g. to the Viterbi decoding algorithm [16] or to operations on polynomials with binary coefficients. Ideally, the comparison of different DSP chips should be based on evaluating the performance for an intended application, working out some crucial part of the application processing for the competing architectures. On a standard DSP, some extra, unused capabilities may be welcome for possible software extensions. The criteria and features in Table 8.1 are related to cost and performance in some key applications and include some of the features used in section 6.6 to characterize other processors as well. Some other criteria will play a role in the actual selection of a processor. These concern the reliability and the quality level of the manufacturer and the availability of second sources, the cost and quality of the design tools, and the experience already gained with some architecture which may not be the optimum choice for the application but still be acceptable.
8.4 INTEGER DSP CHIPS – INTEGRATED PROCESSORS FOR FIR FILTERING

For DSP chips implementing a 16-bit integer or fixed-point arithmetic, the FIR algorithm and its variants are the most important benchmark. To efficiently compute the FIR algorithm one would use one or more multipliers and adders connected as a MAC or arranged in a pipeline with a register placed between the multiplier output and the adder input. In 0.18 µ technology parallel, non-pipelined 16-bit MAC circuits with 40-bit adders are offered within DSP chips at rates of 80M op/s and above. Integer DSP chips provide the parallel 16-bit integer multiplier and adder along with the control circuit for its efficient, sequential usage and instruction and data memories, the latter for the storage of samples and coefficients. If several multipliers have to be used to achieve the necessary processing speed, several integer DSP chips are used as multiplier plus control circuit combinations. Some recent integer DSP chips even include
two multiply-add circuits. The instruction set architectures of the integer DSP chips usually do not support efficient software floating point implementations.
8.4.1 The ADSP21xx Family

The 21xx DSP architecture of 16-bit integer DSP chips from Analog Devices came out in the late 1980s. The more recent 218x series was one of the first to integrate enough memory for most applications. The 218x family offers a selection of chips of speeds varying between 40 MIPS and 80 MIPS, technologies, and amounts of integrated memory. The latest upgrade to the 21xx family is the 219x series that conveniently extends the addressing capabilities, adds some extra context switching support, provides extra DMA address registers, and, most importantly, raises the clock rate by a factor of 2 through architectural changes (a 6-stage pipeline and an instruction cache). A special version, the 21992, includes a fast 14-bit ADC, PWM outputs and incremental encoder inputs, and a CAN bus interface to support motor control applications, similarly to the competing TMS320F2812. Another chip, the 2192, integrates two DSP cores. The choice of a particular member depends on the application. The 218x chips are less expensive, and some have a very low power consumption (the -N types based on a 0.18 µ technology), while the 219x chips are faster and offer more interfaces, yet consume more power.

The 21xx processors have all the ingredients mentioned before, namely a parallel (single cycle) MAC with a 40-bit result register, rounding and saturation operations, addressing logic supporting circular addressing, and the zero overhead loop feature. The 21xx also integrates an ALU circuit performing some standard operations like the binary add and subtract, AND, OR and XOR on bit fields, and a shifter circuit that is useful for implementing floating point arithmetic (floating point add and multiply take about 10 cycles each using a 16-bit mantissa word and a 16-bit exponent). Many instructions can be executed conditionally. As a special feature, a status bit is available that tracks a close-to-overflow condition for the components of a result vector, and another bit dynamically scales during a load; these bits implement the dynamic scaling needed to implement block floating point operations on vectors. Unfortunately, the ALU cannot be used in parallel to the MAC to e.g. support symmetric FIR filters. The efficient use of the compute circuits mostly concerns the MAC circuit which is the most complex one and also the most used one in typical DSP algorithms. There are separate on-chip memories called DM and PM which are accessed in parallel; the second one holding instructions and coefficients even supports two read accesses per cycle on the 218x so that the input registers to the MAC can be loaded from memory with new data in every cycle in parallel to fetching the next instruction. These on-chip memories exhaust the full address range of 16 k data and 16 k program memory words supported by the instruction set. On the 219x the cache frees the PM bus for data transfers. Some features of the 2191 and 2185 processors are listed in Table 8.2. There is an interface to a slow, 8-bit ROM that has a bootstrapping option and automatically copies an initialization routine from the ROM to the internal 24-bit program RAM after resetting the chip. The assembly of the bytes from the ROM into 24-bit instructions is performed by a DMA circuit that can also be called under software control to transfer blocks of data from the ROM into the internal memory. This can be used to implement soft caching and to thereby compensate for the limited addressable memory.
Table 8.2 List of features for some ADSP21xx processors

Compute resources:
– 16/40-bit integer MAC operation, 16-bit ALU, 16/32-bit shifter
– 16 data registers
– 2 parallel 16-bit operand accesses from memory
Performance data:
– instruction rate of 160 MHz for the ADSP2191M (80 MHz for the ADSP2185N)
– densities of operations for
  FIR: 1 multiply/cycle, 1 add/cycle
  symmetric FIR: 0.5 multiply/cycle, 1 add/cycle
  complex FFT butterflies: 0.5 multiply/cycle, 0.75 add/cycle
  floating point MAC (16-bit mantissa): 1/20 multiply/cycle
– measures to reduce control and addressing overheads:
  dedicated stack for nested zero overhead loops
  dedicated stack for sub-routine return addresses
  extra bank of data and address registers for fast context switches
  several circular buffer length registers
  DMA support for integrated peripherals
Cost-related features:
– integrated memory of 160 Kbytes (32 K words DM, 32 K in PM; 80 k bytes for the 2185)
– integrated interfaces to converters, parallel or serial boot ROM or host processor
– fast 16-bit host port is convenient for interfacing several DSPs with each other
– power dissipation is 0.45 W for the 2191M from a 2.5 V supply, separate 3.3 V i/o supply;
  the 80 MHz ADSP2185N dissipates 0.05 W from a 1.8 V supply
– 144-pin QFP package
Figure 8.11 3-chip FIR processor with the ADSP2185N
Alternatively, the DSP may be connected to the bus of another processor via the 16-bit parallel host port which can also be used to transfer instruction codes and data to the internal RAM of the DSP before starting program execution on the DSP. Figure 8.11 shows the 218x based DSP system implementing an FIR filter and indicates how to connect a second DSP chip if more throughput is required. Note that a dual processor system based on the 2185N achieves the same processing rate and memory integration as a single 2191M processor with its simpler board level design, but consumes significantly less power due to the 0.18 µ technology and the simpler DSP architecture. The 21xx has input pipelining registers to the MAC (MX0, MX1 and MY0, MY1), the ALU (AX0, AX1 and AY0, AY1) and the shifter (SI) that are explicitly addressed in the parallel load operation. Each unit also has separate output registers, MR (the 40-bit accumulator, further subdivided into 16-bit registers MR0, MR1 and MR2), AR and SR (SR0 and SR1), and extra 16-bit output registers AF and MF. The 219x substitutes MF with an alternative 40-bit accumulator. The output registers can also be used as input registers in chained calculations.
Memory is addressed with 8 pointer registers I0-I7 that are incremented by the contents of update registers M0-M7 after every access (which is the only indirect addressing mode available), using modulo arithmetic to implement circular addressing. The circular buffer lengths are stored in extra registers L0-L7, and the 219x uses extra base registers for the start addresses. The instruction set supports load and store operations with their address updates in parallel to arithmetic. For the FIR filter implementation, a single instruction is repeated in a loop performing the MAC operation and in parallel loading the operands for the next. Before entering this loop, the address registers chosen for the coefficients and the samples, I4 and I0, must be set to the starting addresses, and the address modifier registers must be set to 1. In assembly language, the loop then reads as shown in Listing 8.1.

(1)  CLR MR, MX0 = DM(I0, M0), MY0 = PM(I4, M4);
(2)  CTR = 511;
(3)  DO L UNTIL CE
(4)  L: MR = MR + MX0 * MY0(SS), MX0 = DM(I0, M0), MY0 = PM(I4, M4);
(5)  MR = MR + MX0 * MY0(RND);
(6)  IF MV THEN SAT MR;

Listing 8.1 FIR loop for the ADSP21xx
As the loading of the multiplier operands into the registers MX0 and MY0 is pipelined, the first pair of operands must be loaded before performing the first multiplication. This is done in stage (1) in parallel to clearing the accumulator register MR to zero. 'DM(I0,M0)' specifies a load operation from the data memory using the address register I0 and incrementing I0 afterwards by the contents of M0. The zero-overhead loop is started by setting the dedicated loop counter register CTR and by performing the DO instruction that takes the address L of the final instruction of the loop body as its argument. After performing the MAC instructions in stage (4), in which '(SS)' specifies a signed multiplication, the final multiply in stage (5) does not load more operands and uses the rounding option to convert to a 16-bit result. Stage (6) performs saturation in the case of an overflow. The rounded and saturated result of the length 512 FIR filter then stands in the 'middle' part MR1 of MR. The circular addressing of the samples is implicit. After setting the buffer length register L0 to 511 during the program initialization, all further addressing with I0 is cyclic mod(512).

The FIR computation needs to be performed for every output sample and has to be synchronized with the input. Input comes from the A/D converter via a synchronous serial interface. The attached converter outputs the samples at the sampling rate to the DSP. The interface issues an interrupt for every received sample which can be used for the synchronization. A simple, single channel implementation could start by initializing the processor registers including L0, M0, I0 and M4 and the configuration registers of the serial interface to receive input words from the converter. The DSP then simply enables its interrupt and enters an idle loop. The interrupts cause this loop to be left for the interrupt processing and to be resumed afterwards. In the interrupt routine, the receive register is read, and the new sample is put into the cyclic buffer addressed by I0. Then the FIR loop is performed, and the result is written out to the transmit register of the interface. As the FIR computation has a constant execution time, the results are output at the sampling rate with a constant delay to the inputs.
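In C-like pseudocode, the interrupt-driven operation just described has roughly the following shape; rx_register(), tx_write() and the output scaling are hypothetical placeholders for the serial-interface registers and output format of a concrete system, not actual ADSP21xx library calls.

extern short rx_register(void);        /* assumed: read the received sample       */
extern void  tx_write(short y);        /* assumed: write one sample to the D/A    */
extern void  put_sample(short e_new);  /* insert into the cyclic sample buffer    */
extern long  fir(const short *coeff);  /* constant-time FIR over that buffer      */
extern const short coeff[512];

void rx_interrupt(void)                /* entered once per received sample        */
{
    put_sample(rx_register());         /* store the new sample (replaces oldest)  */
    long acc = fir(coeff);             /* compute the new output                  */
    tx_write((short)(acc >> 15));      /* hypothetical fixed point output scaling */
}

int main(void)
{
    /* initialize L0, M0, I0, M4 and the serial interface, enable the interrupt */
    for (;;) { }                       /* idle loop; all work is done in the ISR  */
}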
Table 8.3 List of features for the TMS320C54x

Compute resources:
– 17/40-bit MAC, 40-bit ALU, shifter
– 2 parallel memory operand accesses (3 for the FIRS instruction)
Performance data:
– instruction rates range from below 80 MHz to 160 MHz
– densities of operations for
  FIR: 1 multiply/cycle, 1 add/cycle
  symmetric FIR: 1 multiply/cycle, 2 add/cycle
– DMA support for the serial interface
Cost-related features:
– integrated RAM ranges from below 32 k bytes (16 k words) to more than 256 k bytes
– integrated interfaces to converters, 8- or 16-bit host or ROM port supporting bootstrap
– power dissipation is 0.15 W for the 160 MIPS 1.6 V chips, 0.1 W for 120 MIPS, 1.5 V
– dual power supply, 144 pin QFP
8.4.2 The TMS320C54x Family

The TMS320C54x family from Texas Instruments has similar features (Table 8.3) but uses an implicit pipeline instead of input registers so that multiply instructions may directly specify a memory operand. Therefore only a few data registers are visible, namely two 40-bit accumulator registers A, B and an auxiliary input register T for the multiplier. Moreover, a second adder (a 40-bit ALU) can be used in parallel to the MAC operation in a special instruction called FIRS that takes a single cycle when it is repeated in a loop, and the memory organization allows two sample accesses and a coefficient read from an auto incrementing address to support the symmetric FIR filter. The sum of two 16-bit samples occupies up to 17 bits in the accumulator, and the multiplier actually accepts a 17-bit signed input. A shifter is available, too, yet no logic to support block floating point. The TMS320C54x also offers a special instruction to speed up the Viterbi algorithm (cf. exercise 8). There is a broad choice of instruction rates ranging from 40 to 160 MHz and amounts of integrated memory, and there are versions integrating two DSP cores or a DSP core and an ARM processor. Low cost processors of this series like the 100 MIPS TMS320C5402 have an 8-bit host port only, while those with a 16-bit host port require the application of a DMA address.

Comparing the 21xx to the 54x, the support for the symmetric FIR filter is a definite advantage for the TI chip, while for block floating point operations, e.g. in the FFT, the 218x has an advantage. For LMS filtering and the distance of vectors the 54x can employ its dual adders in parallel again. The 218x supports context switching and the control flow more efficiently, while the 54x has an edge in raw arithmetic performance. There is no clear winner; the choice to be made is application specific.

Memory is addressed with 8 registers AR0-AR7 that can also serve for loop counting. AR0 is also used as an index register. There is a single stack implemented with a stack pointer register SP and residing in the data memory (in contrast to the separate, dedicated stack memory of the AD chips). Circular addressing is supported using a single block size register. There are a number of indirect addressing modes, some of which are shown in Table 8.4. For the symmetric FIR computation, a third operand is needed and requires address generation. This is implemented in a special instruction and can only be used within a
Table 8.4 Indirect addressing modes of the TMS320C54x (examples)

*AR1              operand at address in ar1
*AR1+, *AR1-      same with address being post incremented or decremented
*AR1+%, *AR1-%    same using circular addressing
*AR1+0            incrementing by ar0
*+AR1             preincremented address
*AR1(index)       index is a 16-bit signed constant added to ar1 before applying the address
*+AR1(index)      same with preincremented ar1
*+AR1(index)%     same using circular addressing
single-instruction repeat loop which cannot be interrupted. The instruction includes the absolute address of the coefficient array: REPEAT #count FIRS(*AR3+%, *AR4-%, #coeff).
8.4.3 Dual MAC Architectures

A recent derivative of the TMS320C54x architecture is the TMS320C55xx family. It achieves a still higher level of performance by increasing the instruction frequency up to 300 MHz and by providing more registers and two MAC circuits. There are four accumulator registers and four auxiliary registers. The 55xx series is a low power design. The lowest cost 5502 operates from a 1.26 V supply and integrates 64 k bytes of RAM. It is estimated to dissipate about 0.1 W at a 200 MHz clock rate. A version integrating an ARM core is used in a portable multimedia computer. The 55xx uses an instruction cache and decodes instructions of a variable size. The two MAC circuits may execute in parallel, e.g. to implement a two channel FIR filter (only 3 parallel data accesses are supported), or to calculate two subsequent outputs of the same filter. A two channel symmetric FIR filter is not supported as this would require an extra adder. Instruction set compatibility to the 54x family is maintained. On the 55x there is an extra address generator for the coefficient address, and single-instruction repeat loops are interruptible. The host port is 16 bit and can hence be used to build DSP networks. An important feature of the 55xx chips is the integration of an external memory interface that also supports synchronous DRAM (Table 8.5). For applications of the TMS320VC55xx DSP family combined with an exposition of DSP algorithms we refer to [62].

A new, dual MAC integer architecture was recently developed by Analog Devices and Intel. The 'Blackfin' processor profits from a fast semiconductor process. It operates at 1.2 V and below and is offered at clock rates of 400 and 600 MHz. In the FIR algorithm, it hence achieves 800M or 1200M 16/40-bit MAC operations/s. As a remarkable feature, the Blackfin integrates a software-controlled switching regulator that delivers a programmable voltage between 0.7 V and 1.2 V to the core. The clock of the core is generated by a PLL circuit and is under software control too, so that the power consumption can be adapted to changing requirements. The Blackfin shares some features otherwise found on micro controllers (e.g. special timer functions) and on general-purpose processors that enable it to use some of its resources for general-purpose computing and to run an operating system. In particular, it offers a 'supervisor' mode and a memory management unit. Instruction lengths are 16- and 32-bit,
Table 8.5 List of features for the TMS320C55xx processors

Compute resources:
– two 17/40-bit MAC circuits, 40-bit ALU, shifter
– 3 parallel operand accesses from memory
Performance:
– instruction rates of 200 and 300 MHz
– densities of operations for
  FIR: 2 multiply/cycle, 2 add/cycle
  symmetric FIR: 1 multiply/cycle, 2 add/cycle
– DMA support for the serial interface
Cost-related features:
– integrated RAM ranges from 64 k bytes to 320 k bytes
– integrated interfaces for converters
– ROM port supporting bootstrap, external bus with SDRAM support
– 16-bit host or networking port
– power is 0.2 W for the 200 MIPS 1.6 V processors, 0.1 W for the 1.26 V version
and an instruction cache delivers them fast enough to support their parallel execution. There is a data cache too, or a configurable mix of on-chip RAM and caches (a total of 148 k bytes on the ADSP-BF533). The processor uses 16 16-bit data registers and eight address registers and related modifier registers like the 219x family, and some extra address registers used for non-DSP HLL programs. Besides the usual synchronous serial ports it also includes a port for video data and provides special instructions for such data. The external memory interface of the Blackfin uses a 16-bit data bus and also supports SDRAM. The Blackfin comes in a tiny, 160-ball BGA package.

Besides the dual MAC chips there are some that even offer 4 MAC operations per cycle. A fast, integer DSP from Lucent and Motorola is the Star Core DSP. The basic core integrates 4 MAC circuits and runs at 300 MHz executing several 16-bit instructions in parallel. The MSC8101 also packs 512 k bytes of RAM and a number of peripherals and dissipates 2/3 W. Another version, the MSC8102, even integrates 4 cores and thus performs 16 MAC operations per cycle. Another very fast integer DSP is the TMS320C64xx from TI. It also performs 4 16-bit multiply operations per cycle and runs at still higher rates of 720 MHz and beyond. This architecture is related to the TMS320C67xx architecture that will be discussed in section 8.5.2.

We finally mention the MC56321 DSP from Motorola, the latest offspring of a DSP family introduced in the late 1980s. It differs from the integer DSP chips discussed so far by implementing a 24-bit integer type, using a 24 × 24 bit multiplier and a 56-bit accumulator. The 24-bit data type is convenient for audio processing applications that often require more dynamic range than provided by the 16-bit format. This processor core performs 240 MIPS consuming about 0.4 W from a 1.6 V supply. The chip is housed in a 196-pin BGA and integrates nearly 600 k bytes of SRAM. The processor supports block floating point operations and executes an FFT butterfly in six cycles. The MC56321 also includes a second MAC circuit but uses a different approach to support it. The second 24/56 bit MAC is part of a preprogrammed filtering coprocessor that can be used for a selection of standard filtering algorithms including the FIR computation. Together, the processor and its coprocessor execute 480 million 24-bit MAC operations per second during FIR computations.
8.5 FLOATING POINT PROCESSORS
Floating point DSP chips use some of the standard 32-, 40- or even 64-bit formats and implement the basic arithmetic operations +, −, ∗ for these with parallel or pipelined, high-speed circuits, typically executing them with a single cycle throughput. They are more complex and costly and are typically used for complex algorithms where the accumulation of rounding errors and overflow would be hard to handle in fixed point. They also offer an easy way to port DSP algorithms developed on a workstation with floating point arithmetic to a real-time environment without having to think about fixed point effects which has become a common design practice. An important benchmark for floating point DSP is the FFT. As the butterfly is the basic building block of the FFT, an approach to building a fast FFT processor would be to provide a parallel circuit for the butterfly, attach multiple memories and address generators to it for parallel input and output, and to use a sequencer to control the repeated usage of the circuit. Chips of this kind have been on the market, but this structure is highly specialized and expensive due to the multiple memories. For a standard chip serving many applications, a processor architecture with a single floating point multiplier is more flexible. As the butterfly contains 4 real multiplies, the multiplier would be applied 4 times per butterfly computation. The FFT would run effectively if the 6 add and subtract operations and the 8 i/o load and stores could be performed in parallel to the multiplications. This amounts to two load and store operations in parallel to the arithmetic operation which is more moderate and similar to what is needed for the FIR computation. In fact, one cannot do better. The performance of the parallel butterfly circuit performing 4 multiplies per application is also obtained by using 4 single multiplier processors performing a butterfly in 4 multiply cycles each. These perform 4 butterflies in 4 cycles and make 8 parallel memory accesses per cycle similarly to the dedicated structure. If only the FFT were concerned, they might share the control circuit.
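Written out in C, a single radix-2 butterfly makes the operation count used in this estimate explicit: 4 real multiplies, 6 real add/subtract operations, and 4 complex (8 real) operand loads and stores.

typedef struct { float re, im; } cfloat;

/* one radix-2 butterfly: (u, v) -> (u + w*v, u - w*v), w = twiddle factor */
static void butterfly(cfloat *u, cfloat *v, cfloat w)
{
    /* complex product t = w * (*v): 4 multiplies, 2 add/subtracts */
    float tre = w.re * v->re - w.im * v->im;
    float tim = w.re * v->im + w.im * v->re;
    /* 2-input DFT on (u, t): 4 add/subtracts */
    cfloat a = { u->re + tre, u->im + tim };
    cfloat b = { u->re - tre, u->im - tim };
    *u = a;                                /* write back both complex results */
    *v = b;
}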
8.5.1 The Sharc Family

The Sharc family is a family of floating point DSP chips from Analog Devices. The Sharc architecture is similar to the ADSP218x architecture in several respects yet doubles most of the resources found there. It provides separate multiplier and ALU circuits but, instead of the dedicated input and output registers found in the ADSP218x, uses a uniform set of 16 data registers that can be used with all arithmetic operations, and additional addressing modes. These features were later implemented on the 219x processors, too. The Sharc CPU executes 40-bit floating point operations and stores floating-point data in 40- or 32-bit words. It also supports a 32-bit fixed point data type and provides a MAC structure with an extra adder and two 80-bit accumulator registers for this. As a special feature related to the FFT, the ALU performs a parallel add and subtract operation on the same inputs that can be used in parallel to a multiply as needed for a four cycle butterfly. In contrast to the 218x the Sharc performs one memory access per cycle from its program memory bus only. This is compensated for by providing a tiny instruction cache. A strange side effect of this is that instructions with two parallel data accesses execute twice as fast from a loop as from a linear, unrolled program. The Sharc instructions have a fairly large width of 48 bits and support various parallel operations of the computational circuits, e.g. a MAC operation and a simultaneous ALU
operation with independent addresses for five involved registers in parallel to single or dual data moves for another two registers. The only flaw is that the second accumulator registers cannot be addressed in combined MAC and ALU instructions, which would allow an efficient implementation of the symmetric FIR filter. The multifunction instructions used in the FFT butterfly e.g. perform a floating point multiply operation in parallel to a single add operation or to the 2-input DFT operation (the parallel add and subtract) and two memory accesses. A single instruction of this kind reads:

f12 = f0*f7, f13 = f8 + f12, f10 = f8 - f12, f6 = DM(i0, m0), PM(i10, m10) = f3;

Individual MAC or ALU operations with parallel data moves may be conditional. Arithmetic overflow and floating point exceptions may generate interrupts (cf. section 6.4). As already pointed out in section 7.1.3, the Sharc architecture has the unique feature of defining and implementing special networking interfaces, the Sharc links. In the second Sharc generation, the link ports have been extended to transmit 8 bits per clock instead of 4 in the first generation. They can, however, still be configured to use just 4 and to interface to first generation processors. Unfortunately, compatible link ports are not available on other (16-bit) processors to allow for heterogeneous networks.

The data in Table 8.6 hold for the ADSP21161 already mentioned there, a member of the second Sharc generation with a 32-bit external bus interface offering all support functions (refresh, strobe generation, address multiplexing) necessary to directly connect to SDRAM chips for the low cost and high density external expansion of its memory. If the link ports are not used, 16 additional data bus lines become available. In the first generation, only the low-cost 66 MHz ADSP21065 without link interfaces and just 64 k bytes of internal RAM offers SDRAM support. The ADSP21065 consumes 1 W from a 3.3 V supply. Another important feature in the first and second Sharc generations is that the internal RAM is dual ported and can be independently accessed at full speed from both the CPU and the DMA circuits.

The second generation processors use two identical sets of compute circuits that operate in a SIMD fashion, i.e. execute the same instructions which may, however, be conditional. This can be exploited in the common case of the same DSP algorithm being applied to several independent data channels, e.g. processing left and right audio channels, or the half size DFT applications in the DFT factoring (otherwise, one of the compute sections is not used at all). The SIMD processing requires independent data to be delivered to both compute sections, i.e. 4 parallel operand accesses per cycle. This is implemented by providing internal memory widths of 64 bits and storing the independent data in the upper and lower halves of the same 64-bit words. No extra address generators are needed this way, and the instruction set remains the same. The ADSP21161 fabricated in 0.18 µ technology achieves three times the performance of the 21065 but only slightly increases the power consumption. Recently, the 2nd generation Sharc architecture has been reimplemented in a 0.12 µ process. The ADSP21262 doubles the speed and the on-chip memory of the ADSP21161 yet consumes half the power. This chip, however, lacks the SDRAM support and the link and host interfaces (and comes in a 144-pin QFP package).
There is a follow-on to the Sharc, the so-called Tiger Sharc architecture. The arithmetical circuits are further extended. There remain two identical compute sections with a 32-bit multiplier in each one. The Tiger Sharc is more universally applicable by supporting 16- and 32-bit fixed point data at the same level of efficiency. It does so by rearranging the ALU and MAC circuits to allow several 16-bit operands in parallel. Each of the MAC circuits can perform
Table 8.6 List of features of the ADSP21161
Compute resources:
– two sections each with a multiplier/MAC, an ALU and a shifter, and 16 data registers
  (second set only used for SIMD operations)
– multiplier supporting 32-bit integer and 32- and 40-bit floating point data;
  80-bit adder and accumulator register for fixed point data
– ALU for 32-bit integers and bit fields, also supporting floating point adds and
  parallel add and subtract for fixed and floating point data
– shifter for bit fields
– 2 parallel operand accesses from memory for each of the compute circuits
Performance:
– instruction rate of 100 MIPS
– densities of operations for
  FIR: 2 multiply/cycle, 2 add/cycle
  complex FFT butterflies: 2 multiply/cycle, 3 add/cycle
– measures to reduce control and addressing overheads:
  dedicated stack for nested zero overhead loops
  dedicated stack for sub-routine return addresses
  extra bank of data and address registers for fast context switches
  several circular buffer start and length registers
  DMA support for integrated peripherals
Cost-related features:
– separate 1.8 V and 3.3 V supplies, power dissipation is 1.3 W
– integrated memory of 128 Kbytes, internal memory is dual-ported
– external 64-bit expansion bus with SDRAM support
– dual data line serial interfaces to converters, serial/parallel boot ROM, host port
– integrated bus arbitration for six processors to share a common bus and to communicate
  by accessing the internal memories via the bus
– two 'link' interfaces with individual DMA support
– 225-pin BGA package
32-bit multiplies with 80-bit accumulation or in parallel 4 16-bit MAC operations with 40-bit accumulation, or complex 16-bit multiplies. In contrast, the previous Sharc processors would support 16-bit operations at the same rate as 32-bit data, leaving 75% of the multiplier circuit unused. Floating point operations do not benefit from this feature; the performance remains at 2.5 operations per cycle for the FFT. However, on the Tiger Sharc SIMD processing now becomes an option; the parallel usage of the compute sections is also possible with independent instructions, and more algorithms can exploit both compute circuits. Also, a larger set of registers, namely 32 32-bit registers per compute section, is provided. The dedicated address generators are replaced by integer ALU circuits that are used in parallel to the compute sections. Up to four instructions, one for each of the compute sections and integer ALUs can be executed in parallel, similarly to executing a VLIW instruction. The internal buses are 128 bits wide and allow 8 operand fetches per cycle. The restriction remains that only two addresses are generated in parallel. Multiple operands must be in consecutive locations. The internal memory is no longer dual-ported but provides a third bank that can be accessed by the DMA circuits while the CPU is accessing the other two. The Tiger Sharc processors run at faster clock speeds and use faster link ports that are no longer compatible with those of the former generations. Table 8.7 lists some data that hold for the TS101. The TS101 improves
Table 8.7 List of features of the TS101

Compute resources:
– two sections each with a 32-bit multiplier/MAC, a 64-bit ALU and a shifter;
  32 data registers per section;
  ALU and MAC also perform parallel fixed point operations on partial words;
  16-bit complex data type supported
– 2 auxiliary 32-bit integer ALUs for address generation, each with 32 registers
– up to eight parallel 32-bit operand accesses (in two 128-bit accesses)
Performance:
– instruction clock rate of 300 MHz
– density of operations for
  16/40-bit FIR: 8 multiply/cycle, 8 add/cycle
  floating point FFT: 2 multiply/cycle, 3 add/cycle
Cost-related features:
– separate 3.3 V and 1.25 V supplies, dissipation is below 2 W, 484-pin BGA package
– integrated memory of 768 k bytes, organized into three banks
– external 64-bit expansion bus with SDRAM support
– integrated interfaces with DMA and boot ROM support
– bus arbitration for 8 processors
– 4 integrated link interfaces with individual DMA support
the power efficiency by a factor of two over the second generation. The Sharc family offers the option to achieve the same floating point performance with three interconnected second generation processors too. The TS101 processor achieves about the same performance on an FFT (about 30 µs for a 1024 point complex FFT) as the 1 GHz MPC7447 with its vector processing unit. The latter, however, dissipates five times the power and needs an additional ASIC to interface it to an EPROM and to converters for stand-alone operation. The most recent TS20x chips operate from 1 V and raise the clock rate further to 500 MHz, replace on-chip SRAM by on-chip DRAM, and use LVDS link signals. The lowest cost TS203 provides two links only, and 512 k bytes of RAM while the TS201 integrates as much as 3 M bytes of RAM. They are housed in 576-pin BGA packages.
8.5.2 The TMS320C67xx Family

The TMS320C67xx family from Texas Instruments represents a VLIW architecture (see section 6.6.4) with eight functional units executing individual instructions in parallel. The eight units break up into two identical sets each containing a 32-bit multiplier M, an ALU L performing adds, subtracts, comparisons, conversions and bit field operations, a shifter unit S also performing add, subtract, floating point divide and compare operations and branch instructions, and an address calculation unit D performing load and store operations and add and subtract operations on address codes. Each set connects to a bank of 16 32-bit registers of its own, the A and B register files. The L, M and S units can access a single operand from the other register file per cycle as well. For a 16-bit MAC operation, one would use the M and L units in a pipeline and specify the 40-bit add operation on L. While the L, S, and D units can be used for fixed point add and subtract operations, floating point add and subtract operations only execute on the L units. The 67xx has the unusual feature of also efficiently supporting the 64-bit, double precision floating point type at 1/4 of the single precision rate,
Table 8.8 Assignment and scheduling of operations

Index   Units used              Comment
0       load (D1), load (D2)    samples e1, e2 to registers
1       load (D1)               coefficient c to register
2–4     —                       ... load delay for e1, e2 ...
5       add (L unit)            e1 + e2 from both sides
6       shr (S unit)            divide by two
7       mpy (M unit)            c * (e1 + e2)/2
8       —                       ... multiplier delay ...
9       add (L unit)            add result to accumulator
and addresses memory by bytes. It uses a pipeline with 7 to 16 stages including 1-9 execute stages depending on the operation. Usually, the compiler selects the operations that are executed in parallel on the different compute circuits. Assembly level programming is more difficult than for scalar and superscalar architectures as the pipeline of the processor and the parallel scheduling must be taken care of. As an example, the operations of loading two 16-bit samples from memory, multiplying their scaled sum by a coefficient also loaded from memory, and adding the multiply result to a register used as an accumulator could be arranged as shown in Table 8.8 using minimum delays. The delays after the load and multiply instructions are needed due to their extra processing time. The free time slots and unused units may, of course, be filled up with interleaving and parallel operations. Care must be taken not to use a register as a data source during the delay following an instruction that changes it. An intermediate interrupt might cause the change before this use. The up to eight instructions in a row are marked to be executed in parallel and constitute execute packets of up to 256 bits that are loaded in parallel from the instruction cache. There is no instruction for a single-cycle 17 × 16 bit multiply or for a single cycle combined add and divide by two. Therefore the ‘shr’ instructions cannot be avoided, and the symmetric FIR filter cannot be computed efficiently using both multipliers, one of the compute units being needed twice in every step. Only TMS320C64xx provides a fast 16 × 32 bit multiply operation. Branches are delayed, too, and become effective after 5 cycles only. During this time, interrupts are automatically disabled. All instructions including the branch instructions are conditional. If e.g. some composite operation is carried out on all components of a vector, one would first fill up execute packets of parallel operations on subsequent vector components (unrolling a preliminary description of the vector operation as a loop), and only then identify a periodic section of the assembly program. This is used to set up a loop, using one of the S units to maintain a counter and carry out the branch instructions in parallel to the other units. At least 6 cycles must be executed in the loop body (the branch execution time). The branch conditions do not refer to a status register but test some general register to be zero (or non-zero). Comparisons place their Boolean result (0 or 1) into a register. There is no overflow indication for integer add operations. Calls to sub-routines must be implemented as unconditional jumps, placing a load program counter instruction into the last execute packet
The encoding of the parallel operations of the circuits by individual 32-bit instructions evidently requires much memory. An efficient FIR routine performing two loads, a multiply and an add operation requires 128 bits of instruction code for these, and an FIR loop contains at least 24 instruction words. The computational resources of the 67xx are not too far from the multiplier, ALU, shifter and address generator circuits found in other DSP chips as well. The eight units are, in fact, similar to the parallel circuits in the second Sharc generation. The independent programmability of the two sets and the usability of the address generator units for general purpose computations are only available on the Tiger Sharc. Compared to the Sharc processors, the 67xx lacks their high i/o bandwidth resulting from parallel DMA accesses, their FFT efficiency (by a factor of 2), their support for accumulating integer products to 80 bits, their code density, and the efficiency on 16-bit data of the Tiger Sharc, but it does provide the double precision floating point and character data types that may be useful in some applications. Comparing the low-cost TMS320C6713 to the second generation of Sharc processors, in particular to the ADSP21161, its more advanced manufacturing technology (0.13 µm) leads to higher clock rates, to more performance in non-FFT applications, and to a better power efficiency. While the second generation Sharc processors can use their second compute section only for SIMD operations, the VLIW structure has no such restriction. The TMS320C6713 integrates more memory than the ADSP21161 and provides an easier host interface. Its main features are shown in Table 8.9 for the low cost QFP package version. The ADSP21262 is an obvious competitor; the SH-4 (see section 6.6.4) might also compete with the 6713 in some applications. The TMS320C67xx family also lacks a networking concept for using the processor as the compute node of a scalable architecture. This has to be added at the board level, using e.g. an FPGA to implement fast LVDS interfaces, or some dedicated extra chip, the disadvantages being the higher costs and the absence of an architectural standard.
Table 8.9 List of features of the TMS320C6713

Compute resources:
– two compute sections with 32-bit fixed and floating point multipliers, ALU, shifter
– efficient (quarter speed) double precision floating point support
– two 32-bit integer ALU circuits, also used for address generation
– four parallel 32-bit operand loads from memory (two 64-bit accesses)
Performance:
– instruction rate 200 MIPS
– densities of operations for FIR: 2 multiply/cycle, 2 add/cycle; complex FFT butterflies: 1 multiply/cycle, 1.5 add/cycle
– measures to reduce control and addressing overheads: circular addressing, parallel loop management, DMA support for integrated peripherals
Cost-related features:
– separate 1.2 and 3.3 V supplies, power dissipation 0.7 W
– integrated memory of 256 Kbytes
– bus interface with SDRAM and boot ROM support
– integrated interfaces to converters, separate 16-bit host port
– 208-pin QFP package
8.6 DSP ON FPGA
The FPGA discussed in section 2.2.4 as a configurable hardware platform allowing the realization of application specific processors and even networks of such may also be used to implement FIR filtering and other DSP algorithms. The fast serial MAC structure explained in Chapter 4 is easily implemented in several instances on an FPGA along with the necessary control circuit. The RAM blocks found in recent FPGA chips may be used to store coefficients and samples; otherwise an external memory chip would need to be interfaced. For look-up table-based FPGA architectures the constancy of the coefficients can be used to implement FIR segments of 4-6 multiply-add operations using distributed arithmetic (section 4.5) which significantly decreases the complexity and increases the processing speed in comparison to an implementation of individual parallel multipliers. Longer FIR filters can be subdivided into short blocks implemented this way. FIR processing can also be implemented by subdividing the filter into short ones and performing the processing with these by blocks. After processing a block of samples with the short filter, the look-up tables can be overwritten for the next filter section. This is equivalent to partially reconfiguring the FPGA for different steps, buffering the intermediate data in a memory. This technique can actually be applied to all kinds of DSP algorithm having only a throughput requirement. Finally, the Virtex-II and Spartan-III families from Xilinx include 18-bit parallel multiplier blocks and RAM to build various DSP-related data paths and to implement even very high speed filters. Altera’s Stratix devices go even further by including the MAC function and some configurability. The required sequential control is easily implemented on the FPGA so that the equivalent of multiple specialized DSP processors can be mapped onto a high density FPGA. Then all kinds of algorithm-specific building blocks (cf. section 8.2.4), including the full butterfly structure, can be realized on the FPGA and be used sequentially as applicationspecific compute circuits receiving multiple input data per cycle from separate memory blocks. It is evident that with this inventory of methods the FPGA equipped with additional memory becomes a strong alternative to dedicated processor chips, in particular if appropriate tool chains are available to simplify the programming. Due to their configuration overheads FPGA-chips are more expensive than processor chips produced in mass volume, but further architectural advances in FPGA will increase the share of FPGA-based DSP designs, which in turn will contribute to driving down the FPGA prices.
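To make the distributed-arithmetic option mentioned above concrete, the following C model evaluates a 4-tap FIR segment bit-serially from a 16-entry table of coefficient sums, which is what a look-up table cell of the FPGA would hold for constant coefficients. Word lengths and function names are illustrative only.

    #include <stdint.h>

    static int32_t lut[16];                    /* lut[m] = sum of c[k] over all bits k set in m */

    void da_init(const int16_t c[4])
    {
        for (int m = 0; m < 16; m++) {
            int32_t s = 0;
            for (int k = 0; k < 4; k++)
                if (m & (1 << k))
                    s += c[k];
            lut[m] = s;
        }
    }

    /* y = c0*x0 + c1*x1 + c2*x2 + c3*x3 for 16-bit two's complement samples */
    int64_t da_fir4(const int16_t x[4])
    {
        int64_t y = 0;
        for (int b = 0; b < 16; b++) {
            int addr = 0;                      /* bit b of the four samples forms the table address */
            for (int k = 0; k < 4; k++)
                addr |= ((x[k] >> b) & 1) << k;
            int64_t term = (int64_t)lut[addr] << b;
            y += (b == 15) ? -term : term;     /* the two's complement sign bit has negative weight */
        }
        return y;
    }

In the FPGA the shift-and-add over the bit index is performed serially or by a small adder tree, so that the four individual multipliers disappear entirely.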
8.7 APPLICATIONS TO UNDERWATER SOUND
In this final section, we analyze some sample applications arising in the field of underwater sound for their computational requirements and outline appropriate system architectures for them on the basis of the techniques seen before. Besides the signal processing algorithms involved, the handling of the data rates and the memory requirements needs to be considered. This is quite typical for high speed DSP applications, which may require special interface and buffer memory designs attached to the processors, in particular if many signal channels and 2D storage are involved.
In the sea, vision and object detection, communications, and distance measurements are performed by means of acoustic signals; electromagnetic waves are attenuated too strongly to serve these purposes. Sound travels through water at a velocity c of 1495 m/s at a temperature of 20 °C. It is sent out unintentionally by moving vessels, and intentionally by means of piezoelectric transducers at frequencies up to a few hundred kHz. Sound can be received by underwater microphones ('hydrophones'); the transducers used for the generation of sound can also be used as hydrophones. The attenuation of a sound wave traveling through the water grows with its frequency. At low frequencies, directed sound beams can be registered over tens of miles. Narrow sound beams are generated by wide, flat transducers, while small transducers act as omni-directional point sources. If the same transducers are used as hydrophones, they have similar directional characteristics. A classical application of underwater sound is the echo sounder used to measure the water depth d beneath a vessel. A sound pulse is transmitted towards the sea floor, and the echo from the sea floor is registered afterwards. The time t between sending the pulse and receiving the echo is given by the formula t = 2d/c, as the sound has to travel down and up again. Fish schools can also be detected by their echoes. The transmitted signal can be synthesized by a DSP, and the received signal can be digitized and evaluated by means of a DSP as well. The same applies to communications through the water by means of modulated sound signals, and to passively listening for approaching vessels. Sonar signal processing can be further studied in [64].
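As a small worked example of the relation t = 2d/c, the helper below (an illustration only, not part of any particular design) converts a measured round-trip time into a depth using the nominal velocity quoted above; a 40 ms round trip, for instance, corresponds to about 29.9 m.

    /* depth from the round-trip echo time t = 2d/c */
    double depth_from_echo_time(double t_seconds)
    {
        const double c = 1495.0;               /* sound velocity in water, m/s */
        return c * t_seconds / 2.0;            /* the pulse travels down and up again */
    }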
8.7.1 Echo Sounder Design
A modern 'vertical' echo sounder based on a digital processor is required to perform some or all of the following tasks. The basic structure of the measurement system into which the digital processor is embedded is shown in Figure 8.12. The measurement system is further embedded into its physical environment, including the water and the sea floor reflecting the sound. An echo sounder can do the following:
– periodically perform a measurement by outputting a trigger pulse to a transmit circuit, or even a synthesized waveform to be amplified and sent out, and control the amplifiers;
– acquire the echo signal from a hydrophone amplifier;
– display the signal on an LCD screen;
– detect the echo from the sea floor and calculate the water depth;
– scan the user interface (a keyboard) and input measurement parameters;
– interface to a printer, a GPS system, to a remote computer, and to depth displays;
– perform data logging to some storage device.
Figure 8.12 Echo sounder system (pulse generator and transducer, echo amplifier and ADC, digital processor with control, user input, display and external equipment, embedded in the physical environment)
Figure 8.13 Echo amplitude after the pulse and echo display (echo amplitude e over time t, and the scrolling display of successive measurements)
In order to determine the performance required from the digital processor, the amount of memory and its interfaces, the measurement parameters and the algorithms to be implemented have to be specified. Typical measurement parameters for a shallow water echo sounder would be:
– depth ranges of 25 m, 50 m, 100 m, 200 m;
– measurement cycle times of 50 ms, 100 ms, 200 ms, 400 ms accordingly;
– bandwidth and dynamic range of the echo amplitude signal of 5 kHz and 40 dB.
A simple algorithm to determine the water depth, assuming a trapezoidal shape of the bottom echo, is to determine the maximum echo amplitude over the full range and to search for the time at which some fraction of this level was first reached (Figure 8.13). Typically, a design project would not specify such a procedure in detail but rather some performance and cost levels and only some details concerning the user interface and the external interfaces, and leave it to the engineer to set up an adequate model of the measurement process. For the specific example of a low-cost, portable echo sounder the screen might e.g. be required to display 240 × 180 single bit pixels, and a single serial interface outputting the water depth as a string of characters after every measurement might be asked for. From the above givens one concludes that an 8-bit converter can be used at a sampling rate of 15 kHz (somewhat larger than 2 ∗ 5 kHz); in principle, there is the more costly option to digitize the AC echo signal and to perform the rectification and the smoothing digitally. The range of 25 m nicely translates into a set of 500 samples taken after the transmit pulse. As the resolution of the measurement and the display does not depend on the range, the digital processor may compress the samples to a set of 500 in all ranges. The video memory then only requires a size of 5.8 k bytes. A typical signal display results from mapping the amplitude values from subsequent measurements to adjacent lines of 240 pixels, scrolling the display down so that new lines are always added at the top. Then the shape of the sea floor can be read off the display (Figure 8.13). The input signal is thus displayed as a 2D signal that is acquired serially line by line. We assume that the video signal will also be output serially line by line, as a bit sequence along with clock and sync signals, similarly to the pattern in which the samples were originally input (but much faster). The measurement needs to be repeated as the ship moves on. It constitutes a sampling of the bottom shape and must be repeated at subsequent positions of the ship in relation to the 'bandwidth' of this shape along the track. The measurement rate must be in proportion to the ship's speed. The maximum rate is usually limited so that the echo acquisition times for subsequent measurements do not overlap; in the 25 m range, the fastest rate would e.g. correspond to a cycle time of 50 ms. This is hence the maximum time available to evaluate a measurement. For larger ranges, this time is longer (and might be used for a more elaborate evaluation). Usually, the measurement rate is not adapted to the speed and to the bottom shape at all, so that the bottom shape is greatly over-sampled.
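A minimal C sketch of the threshold-based bottom detection described above, operating on the compressed buffer of 500 8-bit samples; the detection fraction of one half of the peak and the blanking of the first samples after the transmit pulse are illustrative assumptions, not values prescribed by the text.

    #include <stdint.h>

    #define N_SAMPLES 500
    #define BLANK      20                          /* samples still covered by the transmit pulse */

    /* returns the sample index of the detected bottom echo, or -1 if none is found */
    int detect_bottom(const uint8_t e[N_SAMPLES])
    {
        uint8_t peak = 0;
        for (int n = BLANK; n < N_SAMPLES; n++)    /* maximum echo amplitude over the full range */
            if (e[n] > peak) peak = e[n];
        uint8_t threshold = peak / 2;
        if (threshold == 0) return -1;             /* no usable echo */
        for (int n = BLANK; n < N_SAMPLES; n++)    /* first time the fraction of the peak is reached */
            if (e[n] >= threshold) return n;
        return -1;
    }

The returned index scales directly to a depth, e.g. 0.05 m per sample in the 25 m range.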
Figure 8.14 Processes in the digital system (acquire ADC response into the sample buffer, generate stimulus, match to model, video RAM with refresh and video output, serial output, switches for user input)
The subsequent measurements are useful to provide more data to compensate for flaws in the individual measurements due to the changing acoustic conditions, which may lead to echo losses, changes of the echo amplitude, and erroneous echoes from objects other than the sea floor. In order to define the computational task of the depth evaluation, it is useful to look at it as an inverse scattering problem. The environment (the water space and the sea floor) is probed with the echo pulse and responds with an echo from which structural parameters of the environment are derived. For this purpose, a simple physical model of the environment and its effect on the echo is set up. The echo evaluation adjusts the parameters of the model so that it yields the best approximation of the actual echo. The quality of the depth evaluation depends on the quality of the physical model. If the sea floor is modeled as an ideal reflecting plane parallel to the water surface at the distance d, then the parameter d can be derived as above. A refined model would take care of the inclination of the sea floor and its structure and be adjusted through a series of measurements. It is only after these considerations that the basic processes of the digital processor can be defined (Figure 8.14). The steps involved are a repetitive process reading the user settings and starting the measurement by sending out the stimulus to the environment at the required rate, a process inputting and compressing the echo samples (the response to the stimulus), an evaluation process, and a video process to refresh the display. The response acquisition process takes the maximum of groups of subsequent samples and otherwise just writes the preprocessed samples to a memory buffer. The matching process serves to evaluate the individual measurements: it further compresses, packs and passes the samples to the video RAM, and proceeds to the bottom detection by matching the model parameters to the received echo shape. For the simple, behavioral echo model this requires two comparisons per sample. Even with some more preprocessing of the samples and some smoothing of the results, the effort remains small. Once the processes are defined, the computational requirements can be estimated. Every 8-bit micro controller can perform all of the operations required for the echo acquisition and evaluation and even allows for a more sophisticated evaluation using a refined model. The sample input at the 66 µs rate can be realized as an interrupt process running in parallel to the evaluation of the previous data. The video output process with its scrolling operation is the only process that requires a higher rate of operations: at a refresh period as low as 50 ms, the pixel output rate is about 1.5 MHz. Here, some hardware that cyclically sends out the memory contents (or a standard display controller chip) can be used. A line of 240 pixels can be stored in 32 adjacent bytes of an SRAM to facilitate the 2D addressing by lines and columns. The scrolling can be realized by changing the start position for the sending to some multiple of 32 bytes
and by using a cyclic addressing scheme for the readout and the inputting. Thus, this process can be realized using a RAM controlled by an FPGA or PLD circuit interfaced to the micro controller. An At94k chip, e.g., has all these ingredients including the memory and the serial interface and only needs a serial Flash memory for the application code and an ADC, but it does not yield the cheapest solution. If the video output is replaced by a numeric display, the application would be conveniently served by any micro controller with an integrated ADC, a timer to generate the sample rate and a serial interface, e.g. an AVR chip or the MSP430, which also integrate the program ROM. If more interfaces and functions were required, a more complex micro controller would be used. Only for operations on the AC echo signal would a DSP chip be required, used stand-alone or in conjunction with a micro controller. The basic echo sounder function can be refined in various ways in order to enhance the sensitivity and the quality of the measurements. One is to send out a long, modulated pulse (synthesized by a DSP) with a sharply located autocorrelation. The received AC signal is sampled and correlated with the sent-out waveform in order to undo the echo lengthening due to the long pulse. At a sampling rate of e.g. 100 kHz and a pulse length of 5 ms, the correlation is equivalent to performing an FIR computation of length 500, which can be handled by any single-MAC integer DSP at 50 MIPS.
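The pulse compression just mentioned amounts to one 500-tap MAC loop per output sample, as in the sketch below; the 64-bit accumulator stands in for the extended accumulator of an integer DSP, and all names are illustrative.

    #include <stdint.h>

    /* correlate the received signal with the stored transmit waveform (500 taps) */
    int64_t pulse_compress(const int16_t *rx, const int16_t tx[500], int n)
    {
        int64_t acc = 0;
        for (int i = 0; i < 500; i++)
            acc += (int32_t)rx[n + i] * tx[i];     /* one MAC per tap */
        return acc;
    }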
8.7.2 Beam Forming
The directionality of an extended, flat transducer used as a hydrophone results from the fact that the voltages at the different surface elements only sum up constructively if they are excited in phase; conversely, a directional beam forms in front of a transmitting transducer because the spherical wave fronts issued by the surface elements are superimposed constructively far from the transducer in the forward direction only. Electronic beam forming is based on subdividing the transducer used as a hydrophone into multiple elements, the signals of which are processed separately and only summed up electronically. The signals are sampled, buffered, and summed up with individual delays so that the constructive summation only occurs for input beams arriving at an angle of ϕ w.r.t. the forward direction (Figure 8.15). If d is the distance between two elements, then the delay h to be applied is related to ϕ by the equation:

h = d ∗ sin(ϕ)/c
Figure 8.15 Time delay of the signals from two hydrophones spaced by d (path difference d ∗ sin(ϕ) for a beam arriving at the angle ϕ)
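A minimal C sketch of this delay-and-sum operation for a linear array, using the delay rule h = d ∗ sin(ϕ)/c given above with the delays rounded to whole sampling periods; the windowing coefficients and the interpolation of the general formulation in equation (16) below are omitted, and all names and sizes are illustrative.

    #include <math.h>

    #define N_ELEM  32
    #define BUF_LEN 4096

    /* q[i][m]: mth sample of the ith hydrophone; fs: sampling rate in Hz;
       d: element spacing in m; phi: steering angle in radians             */
    float beam_sample(const float q[N_ELEM][BUF_LEN], int n, float d, float phi, float fs)
    {
        const float c = 1495.0f;                   /* sound velocity in water, m/s */
        float s = 0.0f;
        for (int i = 0; i < N_ELEM; i++) {
            int h = (int)lroundf(i * d * sinf(phi) / c * fs);   /* delay of element i in samples */
            s += q[i][n - h];                      /* caller keeps n - h inside the buffer */
        }
        return s;
    }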
For a sine wave with the frequency f the delay h translates into the phase shift Ψ so that:

Ψ = 2π f ∗ h

To form a directional beam from an array of N hydrophone elements, one has to calculate the sum

sn = Σi=0..N−1 wi ∗ qi(n − hi)     (16)
where qi(m) denotes the mth sample from the ith hydrophone, hi is the delay to be applied to it in units of the sampling period, and wi is a windowing coefficient used to improve the directional characteristic and to compensate for the directional characteristic of the individual hydrophone, if necessary. A 2D storage structure is needed to store the qi(m). If qi(n − hi) is linearly interpolated from two adjacent samples (more are needed at low sampling rates), then 4N operations are needed to compute equation (16). The hydrophones do not need to be arranged linearly; an arbitrary arrangement, e.g. a circular array, can be used by properly adjusting the time delays hi. Also, the sampling of the hydrophones does not need to be simultaneous. Apart from the interpolation errors, equation (16) does not depend on the frequency of the signal. For low frequencies one obtains a broader directional characteristic. For a narrow band signal centered on a frequency f, the time-domain beam forming equation (16) can be substituted by frequency-domain beam forming, i.e. by multiplying the complex signal sample pairs qi(n) = (rqi,n, iqi,n) taken as in Figure 8.2 by phase factors pi = exp(−j ∗ Ψi), j being the imaginary unit and Ψi = 2π f ∗ hi. The computation thus becomes

un = Σi=0..N−1 pi ∗ qi(n)     (17)
and is easier to implement as it no longer involves the addressing of samples at different time positions. The operation of forming the qi (n) from the sample pairs can be combined with equation (17) by adjusting the coefficients and thus costs no extra effort. To send out directional beams, the individual elements are excited with the negated time delays or phase shifts. Electronic beam forming can be used to stabilize the receive beam of a vertical deep water echo sounder against the roll motion of the ship. Sharp beams need to be used to increase the sensitivity for the weak echoes, and during the time the echo travels through the water the roll angle may have changed by several degrees. If a linear arrangement of 32 transducer stripes is used that are spaced by a quarter of the wavelength and if the sampling frequency is 80 kHz, then equation (16) requires 64 operations to be performed within 12.5 µs, i.e. 5 per µs, well within the reach all the current signal processors. 32 × 32 memory locations are needed to generate the time delays to sweep the receive beam by ±90◦ . Vertical echo sounders use narrow band pulses and can apply phase beam forming, as well. Still more interesting is the observation that from the same stored samples not just one but several different directional beams can be formed. This is exploited in the multi-beam echo sounders that are used to perform depth measurements not just beneath a surface vessel but also at some distance to its path, using directional beams sent out at up to 60◦ from the vertical direction. There is still enough echo energy coming back out of these directions to permit a
distance evaluation. Multi-beam echo sounders are used to scan the sea floor in harbors and rivers quickly and to derive charts of the bottom from these. As a second example, we estimate the computational requirements for a 20 kHz multi-beam echo sounder using two arrays of 32 transducer stripes each mounted on both sides of a ship (at angles of ±30◦ to the vertical) and simultaneously sending out and receiving fans of four beams each. The depth data are sent out via an interface to a standard computer to store them along with the positional data for later processing and to immediately display and optionally print them as a chart of the region beneath the ship covered by the measurements. Additional interfaces are needed for the roll angle and other sensors. In this case a narrow band signal is sent out so that phase beam forming can be used. Using quadrature sampling the sample rate can be reduced and will be assumed to be 20 kHz. The process structure of the single beam echo sounder shown in Figure 8.14 remains valid except for the video output process. The acquisition process needs to be extended by a digital phase shift beam former function. It receives more input samples at a higher rate, namely a stream of 2∗ 32 complex input sample pairs (gi , hi ) taken in parallel at a rate of 20 kHz and transforms it into a stream of sets of 8 complex samples at the same rate. These need to be rectified (taking their absolute values) and compressed to fill the input buffers for the depth evaluation. The depth evaluation process evaluates all of them one by one. The stimulus process generates the parameters for the 8 stimuli used within a fan of simultaneous beams. Each transducer array uses a multi-channel pulse generator that generates the delayed signals for the transducer elements from a pre-computed wave table according to the givens from the stimulus process for each of the four directions in sequence. The beam former needs to perform 128 real MAC operations every 50 µs for each of 8 beams, i.e. perform about 20 million MAC operations per second. The result rate is still high at 640 k samples per second, i.e. before the demodulation and compression are applied. The computation can be performed by several of the DSP processors previously described. In order to reduce the data rates before communications to another processor occur, it is convenient to also perform demodulation and compression on the same DSP, i.e. the entire acquisition process, and to provide the buffer memory required for the 8 channels to it. A DSP chip capable of operating even faster than the required total rate and equipped with enough on-chip memory is the ADSP2191. It can take over the depth evaluation as well. The ADSP2191 can receive the input samples from the amplifiers in the two transducer arrays via its two synchronous serial interfaces. This requires fast converters to be attached to the amplifiers that present the samples from the 32 channels as a single serial data stream (that may be transferred from the amplifier units to the DSP board using fast drivers, e.g. LVDS). In order to avoid an extensive context switching on the DSP, the stimulus process can be executed by an extra micro controller like the XC161 which also provides interfaces for the sensors and to the standard computer, and integrates the program storage for the total system. 
Both processors could be mounted on a single small circuit board, and the XC161 would communicate with the 2191 via the host port of the latter and download the DSP program to it when the system is started. Alternatively, a single highly integrated processor with DSP and controller features like the Blackfin could be used. A similar system designed in the late 1980s required six processors and dedicated hardware for the beam former on several complex circuit boards.
8.7.3 Passive Sonar
As a final application involving an FFT-based 2D signal analysis we consider the passive sonar that does not send out stimuli to the environment but simply analyzes the sound received via a large array of hydrophones. These are found on submarines where they are used to detect the approach of vessels, and to supply the data to distinguish between harmless and dangerous objects. As a typical sonar hydrophone array we consider a circular array of 96 elements that are continuously sampled at 40 kHz to cover a signal bandwidth of 15 kHz. These data are evaluated through the following steps:
– Subdivide the signal input using a bank of narrow band pass filters (about 100 Hz width).
– Form multiple beams by applying a phase beam former to the individual bands.
– 2D display of the signals by directions and frequencies.
– Detection and tracking of targets, and classification.
We concentrate on the directional and frequency analysis steps, which also require attention to the memory design involved. The further processing to track and classify some small number of targets proceeds at lower rates. The filter bank is realized by applying a 1024-input windowed FFT to every channel. The individual FFT outputs are spaced by 40 Hz. The FFT is computed with a 50% overlap (this is just enough for a bandwidth of 80 Hz and sufficient for the further processing), i.e. every 12.5 ms. In total, this amounts to computing a real 1024-input FFT every 125 µs. A single 100 MHz ADSP21161 can achieve this in half of this time. If every second FFT coefficient is used as a filter, the filters are centered on the multiples of 80 Hz. The total input sample rate is 4 samples per µs. The samples are received serially in blocks of 96, each block containing a sample from each hydrophone channel. They need to be stored by columns in a 2D memory of 96 × 1024 words and moved out by rows of 1024 to perform the FFT processing on the individual hydrophone channels, the rate being 8 samples per µs as every sample is used in two FFT calculations. The results of these FFT evaluations repeated every 12.5 ms are 96 × 1024 complex numbers, of which just 96 × 192 have to be moved into another memory buffer as the desired filter outputs in the range of up to 15 kHz. The total data rate for this is 4 words per µs. The samples can be entered via one of the link interfaces of the Sharc processor, which also facilitates the synchronization of the FFT processing with the data input. During the 125 µs available for every FFT, 480 new samples come in and can be buffered on-chip first (alternatively, one might e.g. choose to process blocks of 16 FFT computations every 2 ms, receiving 7680 samples during this time). The memory buffer can be realized by attaching an SRAM of 256 k 16-bit words to the Sharc bus that can be accessed at a peak rate of 50 MHz and is large enough to hold 1024 FFT input samples for every channel and the new samples arriving during the current 12.5 ms of FFT processing. Moving in the 1024 samples for an FFT takes 20 µs, and moving out the new samples received via the link takes 10 µs as well. The filter outputs are 384 real words that are moved out in another 8 µs into some double SRAM buffer of 128 k 16-bit words. The data are moved via DMA to the dual-ported internal memory in order not to lose the roughly 40 µs required for these transfers from the processing time. For the directional analysis, a set of 96 narrow beams is formed around the circle. For each of 192 frequencies, the complex FFT outputs hi,f of 32 adjacent hydrophone channels are weighted with amplitude and phase factors pi,f and summed up to form the signal energy
according to the formula:

Ef = | Σi pi,f ∗ hi,f |²     (18)
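A direct C sketch of equation (18) for one beam at one frequency, using C99 complex arithmetic; each of the 32 complex terms costs four real multiply-add operations, which is the workload figure discussed in the next paragraph. Function and parameter names are illustrative.

    #include <complex.h>

    /* energy of one beam at one frequency: |sum of p[i]*h[i]| squared */
    float beam_energy(const float complex h[32], const float complex p[32])
    {
        float complex s = 0;
        for (int i = 0; i < 32; i++)
            s += p[i] * h[i];                      /* complex MAC: 4 real multiplies, 4 adds */
        return crealf(s) * crealf(s) + cimagf(s) * cimagf(s);
    }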
The beam former implementation becomes particularly simple in the special case of a circular array on which the hydrophones are regularly and precisely adjusted at angles of k∗ 360◦ /96. Then the coefficients in equation (18) are actually the same for all beams at the fixed frequency f, and the sums in equation (18) are obtained from a circular convolution of the 96 element hi,f sequence, or of this sequence extended to 128 elements and a fixed coefficient sequence of length 32. It can be computed by applying a 128-input complex FFT to the hi,f sequence, multiplying by the transformed coefficient sequence and then applying the inverse FFT. The total computation must be performed for 192 frequencies in 12.5 ms, i.e. in 62.5 µs per frequency. The ADSP21161 performs the 128-input FFT in less than 10 µs and hence the beam forming in less than half the required time. Thus a single ADSP21161 can be used to both perform the 192 band pass filters and the beam forming. The beam forming becomes much more complex if a non-circular and even nonsymmetric array is used, and equation (18) must be computed with individual coefficients for every beam and every frequency. The direct computation of equation (18) amounts to 192∗ 128 real MAC operations that need to be performed every 12.5 ms and for 96 directions. In total, these are nearly 200 million MAC operations per second, and 100 million coefficients need to be provided for these. In view of the available networking interfaces we continue to consider the ADSP21161 as an architectural component. Without the required initializations and the squaring operations in equation (18), the MAC operations nearly exhaust the processing capabilities of another ADSP21161 chip and are therefore distributed to two such chips. The coefficients cannot easily be computed at the MAC rate from more compressed geometrical data and need to be obtained from a memory table (the table data may initially be computed from geometrical data). Thus 1.2 M words of coefficient memory are required where real coefficients can be loaded at a rate of 100 MHz. If the beam processing is carried out by columns containing the 96 filter outputs of all beams at a particular frequency f, that are transferred to the internal memory of a processor before performing the beam forming on them, about 4 words per µs need to be read out again from the memory. The results of this processing are 192 energy levels for each of 96 beams that are obtained every 12.5 ms. They fill up another 20 k words of memory and need 4 accesses per µs if they have to be stored and loaded again before they are moved to the display. This output can occur via another Sharc link. The total required memory is thus less than 2M 16-bit words. The total access rate is 124 accesses per µs. A structure of this kind is obtained by connecting the ADSP21161 chips via a common bus and attaching SDRAM at a width of 32 bits to it (Figure 8.16). These can be accessed at a rate of 100 MHz using page mode accesses and DMA to the internal Sharc memories. The use of SDRAM requires the mentioned block processing alternative for the FFT in order to be able to perform page mode accesses to the 2D memory all the time. With some extra effort for the interfacing and by carefully checking the memory bandwidths, the two Sharc chips used to the beam forming could be substituted by a single, high-speed dual-MAC integer DSP as equation (18) only involves integer MAC operations. 
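For the regular circular array described at the beginning of this paragraph, all 96 beams at a fixed frequency share one 32-element coefficient sequence, so equation (18) reduces to a circular convolution around the array. The direct form below only illustrates that equivalence; the text computes it more efficiently via a 128-input FFT, a pointwise multiplication and an inverse FFT.

    #include <complex.h>

    /* all 96 beam energies at one frequency from the 96 hydrophone spectra h[]
       and the fixed 32-element coefficient sequence p[] (circular convolution) */
    void circular_beams(const float complex h[96], const float complex p[32], float E[96])
    {
        for (int beam = 0; beam < 96; beam++) {
            float complex s = 0;
            for (int i = 0; i < 32; i++)
                s += p[i] * h[(beam + i) % 96];    /* circular indexing around the array */
            E[beam] = crealf(s) * crealf(s) + cimagf(s) * cimagf(s);
        }
    }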
Finally, the three ADSP21161 chips could be substituted by a single Tiger Sharc that would
Figure 8.16 Sonar FFT processor and phase beam former (three ADSP21161 processors on a common bus with a 1M × 32 SDRAM; sample input and display data/boot output via link interfaces)
be even more efficient for the integer MAC operations. Both alternatives will hardly result in lower costs for the total system, however. The passive sonar signal processing can also be enhanced in various ways. The beam former can be implemented as an adaptive filter to suppress the noise issued from some strong source that would otherwise mask weak nearby targets [64]. The FFT filter bank as described above has the disadvantage of providing a constant frequency resolution all over the range, while it would be more desirable to have a larger resolution at the lower frequencies both for the beam forming and for the further signal analysis. Instead of resorting to larger FFT sizes with a proportional increase of processing time, higher memory requirements and a lower time resolution, one can switch to a multi-rate processing, namely, first perform an FFT to get N filter outputs from the upper octave (maybe, less than for the original filter bank, say, N = 48), and an inverse FFT to get a low pass filter output cutting off all frequencies in the upper half band. This filter output is computed at half the sampling rate only and undergoes the same processing at half the rate to get N filter outputs from its upper half in turn, then the lower half of it is processed to yield still narrower band filter outputs, doubling the frequency resolution, halving the sampling and processing rates, and adding a constant amount of required buffer space with every step. The frequencies generated by the resulting filter bank are distributed similarly to floating point numbers. The process of expanding the resolution in the lower octave can be iterated arbitrarily without significantly increasing the required amount of operations per second due to the reduction of the processing rate. A similar result can also be obtained using a fast wavelet transform instead (see [69]) where a direct computation of a bank of filters is used and subsampling is applied in the same way to apply the same processing to the lower half of the band. The FFT processing is essentially equivalent but further speeds up the filter evaluation and the generation of multiple sub-bands in the upper half band. The direct computation, on the other hand, avoids the block processing delay of n times the sampling period of the size n FFT which increases at the low frequencies due to the sub-sampling.
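A structural C sketch of the multi-rate scheme just described: the upper octave of the current band is analyzed, the signal is low pass filtered and decimated by two, and the procedure recurses on the lower half at half the rate, so that the total operation count stays nearly constant. The two-tap average stands in for a proper half-band filter, and analyze_octave() is only a placeholder for the FFT-based sub-band analysis.

    #include <stddef.h>

    static void analyze_octave(const float *x, size_t n, int level)
    {
        (void)x; (void)n; (void)level;             /* placeholder, e.g. a windowed FFT of the band */
    }

    void octave_filter_bank(float *x, size_t n, int levels)
    {
        for (int level = 0; level < levels && n >= 2; level++) {
            analyze_octave(x, n, level);           /* sub-bands of the current upper octave */
            for (size_t i = 0; i < n / 2; i++)     /* crude half-band low pass, decimate by two */
                x[i] = 0.5f * (x[2 * i] + x[2 * i + 1]);
            n /= 2;                                /* sampling and processing rate halve per level */
        }
    }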
8.8 SUMMARY
This final chapter explained some special algorithmic requirements of digital processors intended to process signals, which otherwise compete on their system costs and power requirements. The requirements for the FIR and FFT algorithms were the efficient execution of the multiply-add and of the FFT butterfly, which in turn can be served by a single multiplier structure executing all additional operations in parallel. These requirements were found to be satisfied by commercially available DSP chips. None of the available processors could be considered an
'ideal' system component. The integer processors lacked the efficient support of other data types and did not all support the symmetric FIR computation efficiently, while the floating point chips lacked the efficient support of lower precision types. Only the Sharc processors could be used as components of scalable systems, and the fairly efficient Tiger Sharc currently addresses high-end systems only and requires costly boards to support its interface signals. The advent of FPGA chip families providing DSP support functions adds a further viable choice for DSP system design. We finally considered some special DSP applications and found that the memory design and the communication of multi-channel data require some extra thought even when the processing requirements can be satisfied by standard processors.
EXERCISES
1. Use the tip given for the CCD image sensor and work out the structure of an analog 'bucket-brigade' shift register.
2. Work out a circuit that outputs a sequence of row and column addresses and data values that fills a triangle. The corner values and coordinates stand in input registers. Substitute the multiplications in the linear interpolation formula by incremental changes so that only add operations are needed to compute subsequent data (cf. [74]).
3. Explain how cyclic addressing can be emulated by using a double size buffer and double write operations, or by subdividing the FIR loop.
4. Work out a data path for a single multiplier structure that efficiently computes a symmetric FIR filter using a maximum of 2 memory accesses per cycle only.
5. Work out an assembly program for the ADSP 21xx or the TMS320C54x that performs wave table synthesis, using a wave table of 256 samples for a full wave.
6. The signals from four microphones mounted in the corners of a 5 × 5 m² square are digitized at a rate of 48 kHz and fed into a digital processor in order to localize sources of noise moving within the square. Pairs of signals with the sample sequences gn, hn are correlated by evaluating
Σi=0..511 gn−m+i ∗ hn+i
for all m = −1000 . . . +1000, fifty times per second, to determine the time delay of the signals. Analyze the computational requirements for these correlations and select a suitable DSP structure for them, using the FFT to speed up the computation.
7. An approach to speech recognition is to convert an audio signal into a sequence of feature vectors taken at a fairly low rate (say, every 20 ms), the components of which (say, 20) represent logarithmic signal energies in sub-bands of the speech band of 100 . . . 4000 Hz. These vectors are used to detect the start of speech and are presented to a pattern recognition system, e.g. an artificial neural network. Determine the computational requirements to calculate short-term logarithmic energies at this rate, assuming an audio sample rate of 10 kHz and 40 ms of signal input for each.
8. In wireless communications, a bit sequence (bn) encoding speech or some other information needs to be encoded and safely communicated using several noisy channels. A common method is to send out bit sequences (rn) and (sn) derived from (bn) by convolving it mod(2)
with two different m-component coefficient vectors (cj) and (dj) satisfying c0 = d0 = cm−1 = dm−1 = 1, so that the binary polynomials with the coefficients (cj) and (dj) are relatively prime. Thus:

rn = Σi=0..m−1 ci ∗ bn−i ,     sn = Σi=0..m−1 di ∗ bn−i
At the receive site, faulty sequences (rn) and (sn) are received, and the sequence (bn) is to be decoded from these as the most probable sequence encoded and sent out. This is done by the Viterbi algorithm [16, 68]. It tracks the received code sequence and derives, with a delay of about N = 5m steps, a sequence the encoding of which has a minimum Hamming distance to it. The Viterbi algorithm builds and incrementally maintains a table of 2^m possible input sequences (one for each of the unknown 'states' d = (bn−m+1, . . . , bn)) and an associated distance measure of the output sequences generated according to these paths to the received sequence. For each received pair (rn, sn) the algorithm extends every input sequence by an assumed input and outputs the input recorded N steps before in some sequence, which tends to be unique. Each of the two states d0 = (h0, . . . , hm−2, 0), d1 = (h0, . . . , hm−2, 1) is reached from two possible others through an assumed input of 0 or 1, namely from s0 = (0, h0, . . . , hm−2) and s1 = (1, h0, . . . , hm−2), for which two paths c0 and c1 stand in the table, along with distances m0 and m1. For d = d0 and for d = d1, c0 and c1 are extended, and the one whose output sequence has the smaller distance to the received input is selected and associated with d. The distance changes from m0 and m1 (i.e., the distances between (rn, sn) and the outputs obtained from the assumed inputs) are each symmetric w.r.t. a mean distance change and complementary to each other, and only the deviation Δ from the mean is accumulated for the update. For d = d0 the minimum of m0 + Δ and m1 − Δ becomes the updated distance, and for d = d1 it is the minimum of m0 − Δ and m1 + Δ. The operation of forming the mx ± Δ, and comparing and selecting according to the minima, is the so-called Viterbi butterfly. For every received pair 2^(m−1) butterflies have to be computed. Even for medium bit rates of a few 10 kHz this constitutes a considerable effort for a processor using individual add/subtract operations, comparisons and conditional branches only. Define compute circuits supporting the convolutional encoding and the Viterbi decoding algorithm, and investigate how these algorithms are supported on the TMS320C54x and on the TigerSharc.
References
[1] R.K. Brayton, G.D. Hachtel, C.T. McMullen and A.L. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis, Kluwer Academic Publishers, 1984. [2] J.F. Wakerly, Digital Design, Prentice Hall, 2001. [3] L. Wanhammer, DSP Integrated Circuits, Academic Press, 1999. [4] S. Sze, Physics of Semiconductor Devices, John Wiley and Sons Inc., 1982. [5] C. Mead, Analog VLSI and Neural Systems, Addison-Wesley, 1989. [6] R.I. Hartley and Keshab K. Parhi, Digit-Serial Computation, Kluwer Academic Publishers, 1995. [7] K. van Berkel, Handshake Circuits, Cambridge University Press, 1993. [8] M.A. Nielsen and Isaac L. Chuang, Quantum Computation and Quantum Information, Cambridge University Press, 2000. [9] H.W. Johnson and M. Graham, High-Speed Digital Design: A Handbook of Black Magic, Prentice-Hall, 1993. [10] J.P. Uyemura, CMOS Logic Circuit Design, Kluwer Academic Publishers, 1999. [11] I. Wegener, The Complexity of Boolean Functions, Teubner-Verlag, 1987. [12] H.F. Mattson, Discrete Mathematics, John Wiley & Sons Ltd, 1993. [13] A.V. Aho and J.D. Ullman, Foundations of Computer Science, Computer Science Press, 1992. [14] J.G. Proakis and D.G. Manolakis, Digital Signal Processing, Prentice Hall, 1996. [15] J. Nievergelt and K. Hinrichs, Algorithms and Data Structures, Prentice Hall, 1993. [16] R.H. Morelos-Zaragoza, The Art of Error Correcting Coding, Wiley, 2002. [17] A.M.K. Cheng, Real-Time Systems, Wiley, 2002. [18] A.W. Appel, Modern Compiler Implementation in Java, Cambridge University Press, 1998. [19] B.H. Vassos and G.W. Ewing, Analog and Computer Electronics for Scientists, Wiley, 1993. [20] J. Gruska, Foundations of Computing, Thomson Computer Press, 1997. [21] ANSI/TIA/EIA-644 (LVDS) and IEEE 1596.3 standards. [22] A. Sheikholeslami and P.G. Gulak, A survey of circuit innovations in ferroelectric random-access memories, Proc. IEEE, 88 (5), May 2000. [23] A. Marshall et al., A Reconfigurable Arithmetic Array for Multimedia Applications, HP Laboratories Bristol,
[email protected] [24] S. Hauck, S. Burns, G. Borriello and C. Ebeling, An FPGA for implementing asynchronous circuits, IEEE Design and Test of Computers, 11 (3), 1994. [25] F. Mayer-Lindenberg, A heterogeneous parallel system employing a configurable interconnection network, PDCS’97, Washington, 1997.
[26] A. Grama, A. Gupta, G. Karypis and V. Kumar, Introduction to Parallel Computing, AddisonWesley, 1994. [27] F. Mayer-Lindenberg, Crossbar design for a super FPGA architecture, PACT’1998, Paris, available via [55]. [28] IEEE std 1149.1 (JTAG) testability primer, available from Texas Instruments, www.ti.com [29] I. Page, Constructing hardware/software systems from a single description, Journal of VLSI Signal Processing, 12 (1), 1996. [30] P.J. Ashenden, The Designers Guide to VHDL, Morgan Kaufmann Publishers Inc., 1996. [31] R.H. Perrott, Parallel Programming, Addison-Wesley, 1987. [32] C.A.R. Hoare, Communicating Sequential Processes, Prentice Hall, 1985. [33] T. Murata, Petri nets: properties, analysis, applications, proceedings of the IEEE 77 (4), 1989. [34] G. Berry and G. Gonthier, The ESTEREL Programming Language: design, semantics, implementation, Science of Computer Programming, 19 (2), 87–152, Nov. 1992. [35] G. de Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994. [36] P. Michel, U. Lauther and P. Duzy, The Synthesis Approach to Digital System Design, Kluwer Academic Publishers, 1992. [37] L. Svensson, Adiabatic switching, in A.P. Chandrakasan and R.W. Brodensen (eds), Low Power Digital CMOS Design, Kluwer Academic Publishers, 1995. [38] S. Kim and M.C. Papaefthymiou, True single-phase adiabatic circuitry, IEEE Transactions on VLSI Systems 9 (1), 2001. [39] C.J. Myers, Asynchronous Circuit Design, Wiley & Sons, 2002. [40] F. Kroupa, Phase Lock Loops and Frequency Synthesis, Wiley & Sons, 2003. [41] E.A. Hall, Internet Core Protocols, O’Reilly Inc., 2000. [42] Electronic Design Interchange Format, Version 2.0.0, Electronic Indistries Association, Washington, DC, 1987. [43] Standard 1076.6 for VHDL Register Transfer Level Synthesis, IEEE Standards Departement, New York, 1999. [44] M.J. Flynn and S.F. Oberman, Advanced Computer Arithmetic Design, Wiley & Sons, 2001. [45] S. Furber, ARM System-on-a-Chip Architecture, Addison-Wesley, 2000. [46] O. Mencer, L. Semeria, M. Morf and J.M. Delosme, Application of reconfigurable CORDIC architectures, Journal of VLSI Signal Processing, Kluwer, March 2000. [47] J.F. Hart, Computer Approximations, Wiley & Sons, 1968. [48] J.L. Hennessy and D.A. Petterson, Computer Architecture, 3rd edition, Morgan Kaufmann, 2003. [49] M.M. Mano and C.R. Kime, Logic and Computer Design Fundamentals, Prentice Hall, 2000. [50] ANSI/VITA26-1998 standard, available from www.myri.com [51] E. Waingold et al., Baring t all to software: RAW machines, IEEE Computer, 1997. [52] C.A. Moritz, Hot pages: Software Caching for RAW Microprocessors, MIT/LCS Tech. Memo. LCS-TM-599, August 1999. [53] S.S. Muchnik, Advanced Compiler Design and Implementation, Morgan Kaufmann, 1997. [54] D.E. Knuth, The Art of Computer Programming, Vol. 2, Seminumerical Algorithms, 2nd edition, Addison-Wesley, 1981. [55] www.ti6.tuhh.de [56] J. Teich, Digital Hardware/Software-Systeme, Springer-Verlag, 1997. [57] D.H. West, Approximate solution of the quadratic assignment problem, ACM Transactions on Mathematical Software, 9 (4), 461–466, 1983. [58] J. Castro and N. Nabona, An Implementation of Linear and Nonlinear Multi Commodity Network Flows, European Journal of Operations Research, 92, 37–53, 1996. [59] C. Ebeling, L. McMurchie, S.A. Hauck and S. Burns, Placement and Routing Tools for the Triptych FPGAs, IEEE Transactions on VLSI, Dec., 473–482, 1995.
[60] S.K. Nag and R.A. Rutenbar, Performance-Driven Simultaneous Placement and Routing for FPGAs, IEEE Trans. on CAD of Integrated Circuits and Systems, 16 (5), May 1997. [61] S. Haykin, Adaptive Filter Theory, Prentice Hall, 1996. [62] S.M. Kuo and B.H. Lee, Real-Time Digital Signal Processing, Wiley, 2001. [63] www.bdti.com [64] R.O. Nielsen, Sonar Signal Processing, Artech House, Norwood, 1991. [65] K. Mai, T. Paaske, N. Jayasena, R. Ho, W.J. Dally and M. Horowitz, Smart Memories: A Modular Reconfigurable Architecture, ISCA, 2000. [66] C. Kozyrakis et al., Scalable processors in the billion-transistor era: IRAM, IEEE Computer, Sept. 1997, 75–78, 1997. [67] N. Efford, Digital Image Processing Using Java, Addison-Wesley, 2000. [68] W.W. Peterson and E.J. Weldon, Error Correcting Codes, MIT Press, 1971. [69] J.D. Morrison and A.S. Clarke, ELLA2000, McGraw-Hill, 1994. [70] www.SystemC.org [71] IEEE1284-1994 standard. [72] J.E. Hopcroft and J.D. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979. [73] PCI Local Bus Specifications, Revision 2.1, PCI Special Interest Group, 1995. [74] J.D. Foley, A. van Dam, S.K. Feiner and J.F. Hughes, Computer Graphics, Addison-Wesley, 1997. [75] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards, Kluwer Academic Press, 1997. [76] www.eembc.com [77] F.T. Leighton, Introduction to Parallel Architectures and Algorithms, Morgan Kaufmann, 1992. [78] N. Biggs, Algebraic Graph Theory, Cambridge University Press, 1974. [79] M. Garzon, Analysis of Models of Massive Parallelism, Springer-Verlag, 1995. [80] T. Toffoli and N. Margolus, Cellular Automata Machines, MIT Press, 1987. [81] K. Dieffendorff, P.D. Dubey, R. Hochsprung and H. Scales, AltiVec extension to PowerPC accelerates media processing, IEEE Micro, March/April 2000. [82] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993. [83] S.W. Golomb, Shift Register Sequences, Aegean Park Press, 1982. [84] P.M. Athanas and H.F. Silverman, Processor reconfiguration through instruction-set metamorphosis, IEEE Computer, March 1993. [85] H.-O. Peitgen and P.H. Richter, The Beauty of Fractals, Springer-Verlag, 1986. [86] F. Mayer-Lindenberg, A parallel computer based on simple DSP modules, Microprocessing and Microprogramming 41, 1995. [87] A. Goscinski, Distributed Operating Systems, Addison-Wesley, 1991. [88] G. Tel, Introduction to Distributed Algorithms, Cambridge University Press, 2000.
Index
absolute addressing, 141 accumulator, 160 adaptive beam former, 288 A/D conversion, 249 add/subtract circuit, 154 address bus, 70 address resolution protocol (ARP), 189 add-subtract circuit, 119 adiabatic circuits, 59 adiabatic computation, 58 ADSP21161, 210 ADSP21xx family, 267 ADSP-BF533, 272 advanced microcontroller bus architecture, 91 affine group graphs, 217 affine permutation, 216 after, 107 algorithm, 9 allowed interval, 1 Altivec, 201 ALU circuit efficiency, 158 ALU efficiency, 36 analog video signal, 253 analog-to-digital converter (ADC), 195 and-or-invert gate, 48 anti-aliasing filter, 253 APEX, 84 application specific standard processors (ASSP), 194 approximate operations, 5 architecture, 29 arithmetic and logic unit (ALU), 151 ARM architecture, 198 artificial neural network, 256, 289 ASCII code, 254 ASIC, 68
assembly function, 111 asynchronous serial interface, 184 At40k, 82 AT90S1200, 196 At94k, 83 At94k family, 196 Athlon processor, 208 Au1100, 199 autoincrementing, 141 automata, 54 automata design, 139 automatic stack management, 175 automaton, 137 auxiliary bits, 143 auxiliary control, 23 AVR family, 196 ball grid array, 67 band pass, 256 bandwidth function, 224 banks of registers, 181 baseband signal, 252 BCD encoding, 4 beam forming, 283 behavior of an automaton, 138 benchmark, 194 bidirectional interface, 207 binary coded digit, 4 binary digits, 4 binary division, 130 binary encoding, 3 binary multiplier, 123 binary number, 4 binary polynomials, 5 bipolar clock, 55
bit field, 2 bit type, 101 bit vector, 101 Blackfin processor, 271 block floating point, 259 block processing delay, 261 Boolean function, 2 Boolean operations, 14 Booth’s algorithm, 127 bootstrap, 183 branch prediction, 158 branch processing unit, 159 bubble-sort algorithm, 217 buffer, 102 building blocks, 9 bulk transfer mode, 187 bus, 49 bus arbitration, 186 bus control, 186 bus interface, 207 bus protocol, 71, 186 bus subset, 212 cache, 165 cache line, 165 call instruction, 164 calling conventions, 245 CAN objects, 187 carrier frequency, 252 carry input, 117 carry select adder, 117 carry-save multiplier, 127 CAS, 75 cathode ray tube (CRT), 253 Cayley graph, 216 CCD image sensor, 253 central processing unit (CPU), 151 check sum, 8 chip, 67 chip-level networks, 218 circle graph, 212 circuit boards, 86 circular array, 284 circular buffer, 265 clock gating, 60 clock signal, 26 clustering, 217, 221 CMOS inverter, 44 CMOS technology, 42 code decompression, 144 code expansion, 144 code generator (CG) modules, 229 collision, 186 color palette, 254 combined function, 19
INDEX common memory, 207 communicating sequential processes, 33 communication deadlocks, 224 comparator, 249 compare operation, 119 compiled communications, 227 complementary clocks, 51 complementary networks, 48 complex gate, 46 complex instruction set computer (CISC), 163 complex sampling, 252 complexity, 13 complexity of an automaton, 139 complexity of multiplication, 124 component type, 106 composite video, 253 composition by cases, 11 composition of automata, 139 compression of input, 281 computable number, 5 computable real numbers, 232 concurrent execution, 105 conditional add/subtract operation, 130 conditional ALU operations, 159 conditional branches, 11 conditional signal assignment, 104 configurable components, 31 configurable logic block (CLB), 83 configuration, 238 configuration RAM, 80 configuration register, 184 conjunctive normal form (CNF), 16 content addressable memory (CAM), 74 context switch, 181 continued fraction, 7 control automaton, 142 control automaton for the serial multiplier, 143 control flow, 11 control register, 78 controlled switch, 41 controller area network (CAN), 186 convolutional encoding, 289 coprocessor instructions, 177 coprocessor interface, 177 Cordic algorithm, 132 correlation, 283 CPU2-based network, 220 CPU2-based system, 177 CRC code, 8 crossbar, 92 crossbar network, 212 cross-talk, 87 crystal oszillator circuit, 63 CS8900A, 190
CSP model, 33 cube connected cycles (CCC), 214 cyclic redundancy check, 8 Cyclone, 85 D flip-flop, 54 D latch, 53 D/A conversion, 249 data bus, 70 data compression, 9 data flow graph, 10 data logging, 280 deadlock, 219 debounce, 65 decrement operation, 119 degree of a graph, 212 degree of a polynomial, 5 delay generator, 27 delayed branches, 159 de Morgan’s laws, 16 depth evaluation, 282 design philosophy, 205 destination address of a message, 186 deterministic bus access schedule, 227 deterministic sequencing, 227 diameter, 213 differential signal, 58 digital gain control, 250 digital power amplifiers, 251 digital system, 1 direct memory access (DMA), 183 directed graph, 211 directional attributes, 102 directional interface, 207 disabled interrupt signal, 180 discrete Fourier transform (DFT), 257 disjunctive form, 17 disjunctive normal form (DNF), 15 distance, 213 distributed allocation algorithm, 223 distributed arithmetic, 128 divide and conquer, 258 double data rate interface (DDR), 66 double data rate SDRAM, 75 double precision, 6 drain terminal, 42 DRAM, 72 DSP benchmarks, 265 DSP56F8356, 90 DSP56F826, 196 dual bus graph, 215 dual half-word multiply operation, 155 dual-port RAM, 73 dynamic allocation, 222 dynamic D flip-flop, 51
INDEX dynamic D latch, 51 dynamic logic, 52 edge transitivity, 216 EDIF netlist format, 111 EEPROM, 75 efficiency, 35 elementary operations, 9 ELLA, 99 embedded system block (ESB), 85 encryption, 9 end-of-DMA interrupt, 183 endurance, 197 entity, 101 EPROM, 69 equivalence of automata, 138 ER2 parallel computer, 92 error correcting codes, 8 Ethernet, 188 Ethernet frame, 189 Ethernet protocol, 189 event counter, 184 exception handling, 179 expansion of a recursion, 12 exponent, 6 expression, 10 extended add operation, 155 extended k-bit adder, 123 fall time, 44 fan-out, 2, 46 fast convolution, 260 fast Fourier transform (FFT), 257 fast multiply algorithm, 124 fast wavelet transform, 288 feature size, 67 feature vectors, 289 feedback, 139 feedback shift register, 133 FFT filter bank, 288 FFT butterfly, 258 field, 5 field programmable gate array (FPGA), 79 filtering coprocessor, 272 finite automaton, 138 FIR filter, 256 FIRS instruction, 270 first-in-first-out buffer (FIFO), 73 fixed point encoding, 6 flash converter, 250 Flash EPROM, 72 flip-flop (VHDL definition), 108 floating point encoding, 6 floating point operations, 131 floating point rounding errors, 132
for..generate, 107 Fourier coefficient, 257 fractional counter, 64 FRAM, 76 Fredkin gate, 16 frequency divider, 63 frequency domain, 257 frequency-domain beam forming, 284 full adder, 116 functional behavior, 101 functional composition, 10 FX2, 188 game of life, 149 gate count, 16 gate terminal, 42 generator set, 216 generic definition, 103 global routing, 225 GPS, 280 graph embedding, 217 Gray code, 8 Gray counter, 146 greatest common divisor, 12 grid, 213 ground reference, 2 group, 216 H bridge, 66 half adder, 115 half-duplex, 184 Hamiltonian path, 8 Hamming distance, 7 handshaking, 27, 148 hardware, 205 hardware design language (HDL), 99 Harvard architecture, 161 hash code, 8 heterogeneous networks, 206 hierarchical network, 215 high (H), 2 hold time, 54, 108 homogeneous graph, 216 Horner scheme, 132 host interface, 211 host port, 207 hot plugging, 186 hydrophones, 280 hypercube, 214 hypertransport channels, 208 I2 C bus, 88 i8051 micro controller, 196 I-Cube, 93 IDE interface, 191 IEEE standard format, 6
IF/ELSE construction, 11
incremental cost of an operation, 151
incrementing, 118
indirect addressing, 141
indirect jump instruction, 163
inertial delay, 108
initial state, 137
injective, 3
input acknowledge, 28
input request, 28
input selection, 141
instance, 101
instruction level parallelism (ILP), 201
instruction set architecture, 162
instruction size, 162
instructions, 144
interface definition, 102
intermediate results, 10
Internet protocol (IP), 189
interpolation filter, 253
interrupt, 180
interrupt priority levels, 180
interrupt program, 180
inverse scattering problem, 282
inverter tree, 46
invisible stack, 175
IO bus, 186
IP module, 91
joint test action group (JTAG), 95
JTAG chain, 95
jump instruction, 147
keeper circuit, 53
LAN91C111, 190
last-in-first-out buffer, 74
layered network, 24
least mean squares (LMS) algorithm, 257
library, 101
library NUMERIC_BIT, 101
library NUMERIC_STD, 102
library STD_LOGIC_1164, 101
light emitting diode (LED), 66
line graph, 212
linear phase, 256
linear regulator, 62
link and counter state bits, 175
link register, 164
load capacitance, 45
load constant instructions, 163
local area networks (LAN), 188
local routing, 225
logarithmic encoding, 6
logic encoding, 2
LonWorks, 208
look-up table (LUT), 81
low (L), 2
low voltage differential signaling (LVDS), 65
Manchester encoding, 66
Mandelbrot set, 222
mantissa, 6
master–slave D flip-flop, 54
MC56321 processor, 272
MCF5282 processor, 90
Mealy automaton, 138
mean complexity, 13
media access control (MAC) address, 189
memory access time, 70
memory bandwidth, 73
memory controller, 167
memory handshake, 181
memory hierarchy, 159
memory management, 166
memory management unit, 271
memory refresh, 74
message passing, 206
micro code, 144
micro controller, 195
MIDI interface, 226
MIMD, 200
MIPS architecture, 199
mixed signal chips, 195
mod(2^n) add operation, 117
mod(2^n) increment operation, 118
Moore automaton, 138
motherboards, 86
MPC7447, 202
MPI, 230
MRAM, 76
MSP430, 196
Muller C gate, 54
multi-beam echo sounder, 284
multi-master bus, 89, 186
multi-rate signal processing, 288
multiple timed assignments, 109
multiplexed bus, 70
multiplexer, 18
multiplier network, 123
multiplier-accumulator (MAC), 156
Myrinet, 207
n-channel network, 46
n-channel transistor, 42
negation, 119
netlist, 111
network address, 226
networks of switches, 41
NIOS, 91
NMOS technology, 42
node neighborhood, 213
normalized, 6
NP completeness, 224
number of registers, 160
number of required registers, 141
OCCAM, 209
on-line property, 122
open-drain output, 50
operating system, 181
operational amplifier, 250
opto couplers, 65
output acknowledge, 28
output request, 28
overflow, 5, 117
overflow checking, 156
package, 102
page mode, 74
parallel input, 183
parallel output, 183
parallel partial word operations, 153
parallel store instruction, 174
parity bit, 8
parity function, 18
partial reconfiguration, 80, 240
pass transistor, 48
passive sonar, 286
pattern recognition, 289
p-channel network, 46
p-channel transistor, 42
PCI burst transfer, 191
PCMCIA bus, 191
periodic interrupt, 185
peripheral component interconnect (PCI), 191
permutation graph, 217
Petri Nets, 32
phase comparator, 64
phase-locked loop, 64
physical encoding, 2
physical type, 102
pipelined cache loads, 167
pipelined converters, 250
pipelined counter, 145
pipelined input and output, 183
pipelined multiplier, 125
pipelining, 21
pixel, 253
placement, 224
PLD, 79
π-Nets, 228
π-Nets type hierarchy, 232
polyadic encoding, 3
polynomial division, 5
port map, 106
power efficiency, 194
power supply design, 61
power-on reset, 61
PowerPC family, 201
PowerPC440GX, 190
primitive polynomial, 5
ProASIC family, 86
process, 33, 34
process (VHDL), 104
process identifier, 226
processing element (PE), 205
processing time, 20
processor chip, 77
processor pipeline, 158
product graph, 214
program memory, 143
programmable logic device, 79
programmable processor, 23, 151
propagation delay, 45
pseudo-static RAM, 75
pull-up resistor, 43
pulse width modulation (PWM), 66, 251
PVM, 230
quad data rate SDRAM, 75
quadratic flat packages, 67
quadrature sampling, 252, 285
quantization error, 6, 250
quantum computer, 16
Quine and McCluskey algorithm, 17
R-2R network, 250
radix-4 butterfly, 258
RAMBUS, 75
random access memories, 72
RAS, 75
RAW architecture, 218
reactive building block, 25
reconfigurable ALU, 157
reconfiguration, 80, 240
reconstruction, 251
recursion, 12
reduced instruction set computer (RISC), 163
redundant add operation, 120
reflections, 87
register, 26
register bank, 141
relative frequencies of operations, 153
remote function calls, 227
rendering pipeline, 255
reset circuit, 62
retried memory and interface accesses, 181
return instruction, 164
return modes, 176
reversible computation, 16
reversible NOT, 59
ripple-carry adder, 116
ripple counter, 60
rise time, 44
roll angle, 284
rounding error, 5
routing, 224
RS flip-flop, 53
run length coding, 9
SAA7113, 253
sample, 249
sample and hold circuit, 250
sampling frequency, 251
sampling theorem, 251
saturation, 256
SBSRAM, 73
schedule, 140
schematic entry, 99
Schmitt trigger, 63
screen windows, 226
SDRAM, 75
segmented bus, 207
select function, 11
sensitivity list, 105
sequencer, 143
serial adder, 125
serial EPROM, 76
serial execution, 23
serial interface, 66
serial multiplier, 126
serial multiply-and-add function (MAC), 127
server thread, 227
set of elementary operations, 15
set operations, 154
set-up time, 54, 108
SH-4 processor, 202
SH7750R, 202
Sharc family, 210, 273
shift register, 55
shift register sequence, 146
shimming delays, 52
sigma-delta converter, 252
signal, 1
signal assignment, 102
signal delay, 87
signal resolution, 102
signed binary number, 4
signed digit add operation, 121
signed multiply, 123
signed-digit code, 4
SIMD, 144
SIMD operations, 153
simple CPU, 152
simplification step, 16
simulator display, 110
simultaneous multithreading (SMT), 159
simultaneous scheduling, 140
single precision, 6
single-master bus, 186
soft caching, 167
sound velocity, 280
source localization, 289
source terminal, 42
Spartan-III, 85
spectral analysis, 259
speculative execution, 159
speech recognition, 289
square root algorithm, 130
SRAM, 69
stack automaton, 139
stack implementation, 162
stack pointer register, 162
stack processor, 162
Star Core DSP, 272
start bit, 184
state, 137
std_logic, 101
stop bit, 184
storage automata, 139
storage elements, 26
store and forward, 212
Stratix family, 85
streams, 32
structural definition, 101
sub-routine, 164
subtract operation, 119
successive approximation, 250
super-scalar processors, 201
supervisor mode, 199, 271
switch network, 47
switched capacitor regulator, 62
switched Ethernet, 207
switching regulator, 62
symmetric FIR filter, 256
symmetric group, 216
synchronous burst SRAM, 73
synchronous counter, 63, 144
synthesis, 112
synthesis rules, 112
System C, 99
system specification, 32
system-on-a-chip (SOC), 78
test bench, 109
TFT monitor, 253
thread, 159, 180
thread management, 210
three-address architecture, 162
throughput, 20
timeout, 185
timer, 185
timing simulation, 107
TMS320C40, 210
TMS320C54x family, 270
TMS320C55xx family, 271
TMS320C64xx, 272
TMS320C67xx family, 276
TMS320F2812, 90, 196
token passing, 228
torus graph, 214
TRAM modules, 209
transducers, 280
transfer function, 1
transition function, 138
transmission control protocol (TCP), 189
transmission gate, 48
transport after, 107
Transputer, 209
Transputer link, 209
trap, 180
triangle fill operation, 255
tri-state, 49
true ALU efficiency, 153
tuple, 1
two-address architecture, 162
twos complement encoding, 4
underwater sound, 279
universal function, 18
universal serial bus (USB), 187
unnormalized floating point, 6
USB endpoints, 187
user datagram protocol (UDP), 189
variable, 105
Verilog, 99
very large instruction word (VLIW), 157
VHDL, 99
video memory, 281
video signals, 253
Virtex II, 83
Virtex-II Pro family, 84
virtual memory, 166
virtual shared memory, 207
Viterbi algorithm, 290
voltage swing, 58
voltage-controlled oscillator (VCO), 64
von Neumann architecture, 161
VR4131, 203
wafer-scale integration, 67
wait until, 105
Wallace tree, 123
watchdog timer, 185
wave table synthesis, 254
wide memory, 159
wired AND, 50
wireless LAN (WLAN), 188
wormhole routing, 212
worst-case complexity, 13
wreath product, 247
write-through, 166
XC16x family, 197
Xscale processor, 199
yield, 67
z-buffer, 255
zero-overhead loop, 164, 175, 265